CN116304120A - Multimedia retrieval method, device, computing equipment and storage medium - Google Patents


Info

Publication number
CN116304120A
CN116304120A (application CN202211434753.4A)
Authority
CN
China
Prior art keywords: sub-attribute, vector, text, multimedia
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211434753.4A
Other languages
Chinese (zh)
Inventor
杨希
潘喆
闫伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority: CN202211434753.4A
Publication: CN116304120A
Legal status: Pending

Classifications

    • G06F16/41 Information retrieval of multimedia data: indexing; data structures therefor; storage structures
    • G06F16/43 Information retrieval of multimedia data: querying
    • G06F16/53 Information retrieval of still image data: querying
    • G06F16/63 Information retrieval of audio data: querying
    • G06F16/73 Information retrieval of video data: querying
    • G06F40/253 Natural language analysis: grammatical analysis; style critique
    • G06F40/279 Natural language analysis: recognition of textual entities
    • G06F40/284 Natural language analysis: lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 Natural language data: semantic analysis
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a multimedia retrieval method, a device, a computing device and a storage medium. The method comprises the following steps: receiving a search text and performing semantic analysis on it to obtain at least one sub-attribute and the correlations between sub-attributes; inputting each sub-attribute and the correlations into a pre-trained semantic matching model to obtain a label for each sub-attribute; and determining similarity values between the labels of the sub-attributes and the labels of pre-obtained multimedia sub-classes, then determining the multimedia sub-classes that match the search text according to the magnitudes of those similarity values. By realizing a multistage matching mechanism between text semantic tags and multimedia tags through hierarchical semantic matching, the invention mines the relationship between them more effectively and achieves higher recognition and accuracy.

Description

Multimedia retrieval method, device, computing equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a multimedia retrieval method, a device, a computing device and a storage medium.
Background
With the popularization of 5G and the falling cost of network bandwidth, video is becoming one of the mainstream data types in social, security and other fields. Owing to the growing number of videos, the diversity of multimedia content and the complexity of its data structures, quickly and efficiently retrieving the videos a user wants from a video library has become a challenge.
Current retrieval schemes fall into two types: one searches on user keywords, the other on preset optional features. However, when several different entity objects appear in the search text (i.e. the query terms or question the user enters), these schemes cannot distinguish which attribute belongs to which entity object, which may confuse the search semantics. In addition, since keyword information carries no association or order relationships, problems such as incorrect recognition and missed recognition may arise.
Disclosure of Invention
The present invention has been made in view of the above problems, and aims to provide a multimedia retrieval method, apparatus, computing device and storage medium that overcome, or at least partially solve, them.
According to an aspect of the present invention, there is provided a multimedia retrieval method, the method comprising:
Receiving a search text, and carrying out semantic analysis on the search text to obtain at least one sub-attribute and a correlation between the sub-attributes;
inputting each sub-attribute and the correlation into a pre-trained semantic matching model to obtain labels of each sub-attribute;
and determining the similarity value between the label of each sub-attribute and the label of the pre-obtained multimedia sub-class, and determining the multimedia sub-class matched with the search text according to the size of each similarity value.
Optionally, the receiving the search text, performing semantic analysis on the search text to obtain at least one sub-attribute and a correlation between the sub-attributes includes:
analyzing the search text to obtain the word segmentation and the part of speech of the search text;
performing dependency syntactic analysis on the search text by combining the segmentation words and the parts of speech thereof to obtain dependency relations among the segmentation words;
and determining the sub-attribute of the search text and the correlation relationship among the sub-attributes according to the segmentation word, the part of speech and the dependency relationship.
Optionally, the determining the sub-attribute of the search text and the correlation between the sub-attributes according to the word segmentation, the part of speech and the dependency relationship thereof includes:
Coding the word segmentation and the part of speech thereof to obtain word segmentation digital codes, and inputting the word segmentation digital codes into an encoder for conversion to obtain word segmentation vectors; carrying out syntactic analysis on the search text to obtain a syntactic feature vector;
splicing and fusing the word segmentation vector and the syntactic feature vector to obtain a first hidden vector;
processing the first hidden vector by a multi-layer perceptron to obtain a second hidden vector;
and constructing a dependency relation matrix according to the retrieval text and the dependency relation, determining a product value of each position in the dependency relation matrix according to the dependency relation matrix, the dependency relation initializing vector and the second hidden vector, and determining the correlation relation between the sub-attributes according to the magnitude of the product value and a first loss function.
Optionally, the first loss function includes a first cross entropy loss function and a KL divergence loss function;
the first cross entropy loss function takes a normalized exponential function and an actual relation value as arguments, and the KL divergence loss function takes the average of the first hidden vector and the average of the target sub-attribute hidden vector as arguments.
Optionally, the training step of the semantic matching model includes:
Extracting sub-attributes from the retrieval text sample according to semantic analysis results, and generating a positive sample vector, a retrieval text vector and a negative sample vector according to the sub-attribute conversion;
the positive sample vector, the search text vector and the negative sample vector are each segmented and feature-encoded by an encoder, then input into their respective bidirectional long short-term memory (BiLSTM) network models to obtain a positive contrastive encoding vector, a text contrastive encoding vector and a negative contrastive encoding vector;
processing the positive, text and negative contrastive encoding vectors through a normalized exponential function and a second loss function to obtain the labels of the sub-attributes in the search text sample;
and comparing the label of the sub-attribute with the manual label to obtain the corrected semantic matching model.
Optionally, the second loss function includes a second cross entropy loss function and a binary classification loss function;
the second cross entropy loss function takes a normalized exponential function and an actual tag value as arguments, and the binary classification loss function takes the similarities among the positive, text and negative contrastive encoding vectors as arguments.
Optionally, the step of obtaining the multimedia sub-class label includes:
extracting key frames in the multimedia, carrying out target detection on the key frames, and analyzing according to preset hierarchical categories to obtain labels of the corresponding hierarchical categories.
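As a toy illustration of composing such a hierarchical-category label from a key-frame detection result, the sketch below joins a three-level "major class-minor class-entity" tag as used elsewhere in this document; the function name and separator are illustrative, not from the patent:

```python
def tag_from_detection(major, minor, entity, sep="-"):
    """Compose a hierarchical multimedia tag from a key-frame detection.

    The three-level "attribute major class-minor class-entity" layout is
    taken from this document's video-analysis description; the function
    and argument names themselves are illustrative.
    """
    return sep.join((major, minor, entity))

# e.g. a detector on a key frame reports a pedestrian carrying an umbrella
label = tag_from_detection("pedestrian structured", "whether umbrella", "yes")
```

Tags built this way can be stored with the multimedia and later compared against the semantic-analysis labels of a search text.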
According to another aspect of the present invention, there is provided a multimedia retrieval apparatus, the apparatus comprising:
the semantic analysis module is suitable for receiving the search text, and carrying out semantic analysis on the search text to obtain at least one sub-attribute and a correlation between the sub-attributes;
the label acquisition module is suitable for inputting each sub-attribute and the correlation into a pre-trained semantic matching model to obtain labels of each sub-attribute;
and the retrieval matching module is suitable for determining similarity values between the labels of the sub-attributes and the labels of the pre-obtained multimedia sub-classes, and determining the multimedia sub-classes matched with the retrieval text according to the size of each similarity value.
According to yet another aspect of the present invention, there is provided a computing device comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
The memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the multimedia retrieval method.
According to still another aspect of the present invention, there is provided a computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the above-described multimedia retrieval method.
According to the multimedia retrieval scheme described above, fine-grained hierarchical semantic attribute extraction from the search text, together with a mechanism that matches search-text semantic tags against multimedia tags, solves the semantic confusion that existing keyword extraction may cause; the relationship between search-text semantics and multimedia tags is thus mined more effectively, and multiple targets can be retrieved from a single search text, giving higher recognition and accuracy.
The foregoing is only an overview of the technical solution of the present invention. In order that the technical means of the invention may be understood more clearly and implemented in accordance with the description, and that the above and other objects, features and advantages of the invention may become more readily apparent, detailed embodiments of the invention are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flowchart of a multimedia retrieval method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a multimedia search scenario according to an embodiment of the present invention;
FIG. 3 illustrates a flow chart of a search text semantic analysis provided by an embodiment of the present invention;
FIG. 4 is a flow chart illustrating the extraction of semantic attributes at a fine granularity level of a search text provided by an embodiment of the present invention;
FIG. 5 is a schematic flow chart of unified identification of multiple main body, multiple sub-attribute and coverage attribute according to an embodiment of the present invention;
FIG. 6 illustrates a flow chart for retrieving text semantic matches provided by an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a multimedia retrieval apparatus according to an embodiment of the present invention;
FIG. 8 is a flow chart of video retrieval using a device according to an embodiment of the present invention;
FIG. 9 illustrates a schematic diagram of a computing device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1 shows a flowchart of an embodiment of a multimedia retrieval method of the present invention, which is applied in a computing device. The computing device comprises various computers, intelligent terminals, tablet computers and the like. As shown in fig. 1, the method comprises the steps of:
step 110: and receiving a search text, and carrying out semantic analysis on the search text to obtain at least one sub-attribute and a correlation between the sub-attributes.
In conjunction with the schematic view of the multimedia retrieval scenario shown in fig. 2, the embodiment of the present invention may be used to retrieve corresponding multimedia resources by receiving terms such as a retrieval question input by a user, where the multimedia includes video, sound, image collection, animation, and the like.
According to the multimedia search scenario shown in fig. 2, the user enters the terms or question to be searched in a search box to obtain the corresponding search text, and sub-attributes are obtained through semantic analysis. A sub-attribute is not a single word but a fragment sequence, so it recognizes content that single segmented words miss, including words that carry state (e.g. the negation in "without glasses"). In addition, step 110 can identify the correlations between sub-attributes, such as parallel and subordinate relationships, and can identify the key parent attribute among the sub-attributes, so that the different retrieval requirements of multiple targets can be recognized from a single sentence of search text.
Step 120: inputting each sub-attribute and the correlation relationship into a pre-trained semantic matching model to obtain labels of each sub-attribute.
In order to retrieve the corresponding multimedia resource according to the retrieval text, the sub-attribute information obtained by the retrieval text needs to be matched with the label of the multimedia resource. According to the method, the sub-attribute label with the highest matching degree is obtained through a pre-trained semantic matching model, so that matching accuracy is improved.
Step 130: and determining the similarity value between the label of each sub-attribute and the label of the pre-obtained multimedia sub-class, and determining the multimedia sub-class matched with the search text according to the size of each similarity value.
With continued reference to fig. 2, multimedia data such as video is analyzed, the corresponding tags are extracted, and the data is stored in a database. The multimedia sub-classes represent the hierarchical structure of the multimedia and may denote any one of its classes. During retrieval, the tags derived from the search text are matched against the tags of the corresponding multimedia classes, and the matching multimedia is determined or located from the result.
It should be noted that step 120 and step 130 may be performed in combination; for example, the semantic matching model may take part in the similarity-value analysis to obtain the multimedia resources that match the user's search. Preferably, the similarity values are sorted, and those exceeding a preset threshold identify the multimedia that meets the user's requirements.
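The sort-and-threshold selection of step 130 can be sketched as follows; the function and parameter names are illustrative, not from the patent:

```python
def match_subclasses(similarity_by_label, threshold=0.8):
    """Rank multimedia sub-class labels by their similarity to a
    sub-attribute label and keep those above a preset threshold,
    mirroring the preferred embodiment of step 130 (threshold value
    is an arbitrary placeholder)."""
    ranked = sorted(similarity_by_label.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, score in ranked if score >= threshold]
```

Given per-label similarity values, the result is the ordered list of sub-classes deemed to match the search text.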
In summary, in the embodiment shown in fig. 1, fine-grained hierarchical semantic attribute extraction from the search text and a matching mechanism between search-text semantic tags and multimedia tags solve the semantic confusion that existing keyword extraction may cause; the relationship between search-text semantics and multimedia tags is mined more effectively, multiple targets can be retrieved from a single search text, and recognition and accuracy are higher.
In one or some embodiments, the receiving the search text, performing semantic analysis on the search text to obtain at least one sub-attribute and a correlation between the sub-attributes includes:
analyzing the search text to obtain the word segmentation and the part of speech of the search text;
performing dependency syntactic analysis on the search text by combining the segmentation words and the parts of speech thereof to obtain dependency relations among the segmentation words;
and determining the sub-attribute of the search text and the correlation relationship among the sub-attributes according to the segmentation word, the part of speech and the dependency relationship.
Wherein, in order to make the expression of the search text more accurate, the method further comprises the step of preprocessing the search text, comprising the following steps: case conversion, complex conversion, error correction, and the like.
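A minimal sketch of such preprocessing follows. The one-entry traditional-to-simplified table is a stand-in (a real system would use a conversion library such as OpenCC), and error correction is omitted:

```python
def preprocess(text):
    """Toy search-text preprocessing: case conversion plus a tiny
    traditional-to-simplified character mapping. Illustrative only;
    the mapping table is a placeholder, not a real conversion table."""
    t2s = {"眼鏡": "眼镜"}  # stand-in for a full traditional->simplified table
    text = text.lower()
    for trad, simp in t2s.items():
        text = text.replace(trad, simp)
    return text
```

The normalized text is then passed on to lexical and syntactic analysis.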
In one or some embodiments, the determining the sub-attribute of the search text and the correlation between the sub-attributes according to the word segmentation and the part of speech and the dependency relationship thereof includes:
coding the word segmentation and the part of speech thereof to obtain word segmentation digital codes, and inputting the word segmentation digital codes into an encoder for conversion to obtain word segmentation vectors; carrying out syntactic analysis on the search text to obtain a syntactic feature vector;
Splicing and fusing the word segmentation vector and the syntactic feature vector to obtain a first hidden vector;
processing the first hidden vector by a multi-layer perceptron to obtain a second hidden vector;
and constructing a dependency relation matrix according to the retrieval text and the dependency relation, determining a product value of each position in the dependency relation matrix according to the dependency relation matrix, the dependency relation initializing vector and the second hidden vector, and determining the correlation relation between the sub-attributes according to the magnitude of the product value and a first loss function.
In a specific example, the above-described semantic analysis process for the search text is further described with reference to the flowchart of the semantic analysis for the search text shown in fig. 3. The overall flow of the dependency relationship-based core sub-attribute extraction is as follows:
(1) Text preprocessing: the search text is mainly preprocessed, and the operations which can be performed include: case conversion, complex conversion, error correction, and the like.
(2) Text lexical analysis: the search text is segmented and part-of-speech tagged. For example, "a man wearing glasses" (一个戴眼镜的男人) may be segmented into "一个/m 戴/v 眼镜/n 的/u 男人/n"; this example adopts the part-of-speech classification criteria of ansj, CTB and PKU, though other segmentation and part-of-speech tagging criteria may be used.
(3) Text syntactic analysis: dependency syntactic analysis is performed on the search text by combining the segmentation results and parts of speech, and the association attributes between words are analyzed through the dependency parse. Dependency relations used in this example include, but are not limited to: ATT (attribute), DE (的-particle relation), SBV (subject-verb), VOB (verb-object), COO (general coordination), COS (shared coordination), ADV (adverbial), HED (head/core). For example, "a man wearing glasses" can be parsed, via word segmentation, part-of-speech analysis and dependency syntactic analysis, into the following attribute relationship table 1.
Table 1 Attribute splitting and correlations

| No. | Word | Part of speech | Head index | Dependency relation |
|-----|------|----------------|------------|---------------------|
| 1 | 一个 (a) | m | 5 | NUM |
| 2 | 戴 (wearing) | v | 4 | DE |
| 3 | 眼镜 (glasses) | n | 2 | VOB |
| 4 | 的 (particle) | u | 5 | ATT |
| 5 | 男人 (man) | n | 0 | HED |
Preferably, the dependencies in this example may employ existing standards such as PKU Multi-view Chinese Treebank.
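Working from the parse in Table 1, extracting the head entity and the words that modify it can be sketched as follows. The token tuples encode word, POS, head index and relation exactly as in the table; the traversal helpers are an illustration of the idea, not the patent's algorithm:

```python
# (word, pos, head index, relation) rows as in Table 1; head index 0 = root
TOKENS = [
    ("一个", "m", 5, "NUM"),
    ("戴",   "v", 4, "DE"),
    ("眼镜", "n", 2, "VOB"),
    ("的",   "u", 5, "ATT"),
    ("男人", "n", 0, "HED"),
]

def head_entity(tokens):
    """Return the HED (core) word of the dependency parse."""
    for word, pos, head, rel in tokens:
        if rel == "HED":
            return word
    return None

def modifiers_of(tokens, head_index):
    """Words whose dependency head is the token at head_index (1-based)."""
    return [w for w, _, h, _ in tokens if h == head_index]
```

Here the core entity is 男人 (man), and its direct dependents 一个 and 的 anchor the attributive fragment 戴眼镜的.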
(4) Fine-grained hierarchical semantic attribute extraction: the extraction of the attribute with fine granularity can be realized, and the identification of the dependency relationship between the attribute and the core entity can be realized, and the specific process is shown in the flow of the extraction of the semantic attribute with fine granularity of the retrieval text in fig. 4.
Compared with existing dependency-relation-extraction models for attribute extraction, the method adopted in this example has higher generalization capability and accuracy. The whole scheme comprises the following steps:
Step 1: the results of the lexical analysis are converted, with a dictionary, into digital codes X = (x1, x2, x3, ..., xn) and input to an encoder.
Step 2: the encoder converts the digital codes into vector features H(X) = (h1, h2, h3, ..., hn) = Encoder(X); the encoder may be a common one such as word2vec, BERT or RoBERTa.
Step 3: the syntactic features D(x) = (d1, d2, d3, ..., dn) are spliced with the H obtained from the encoder to obtain the new first hidden vector features H'(x) = ([h1; d1], [h2; d2], ..., [hn; dn]), where [·; ·] denotes concatenation.
Step 4: the first hidden vector features are passed through a multi-layer perceptron (MLP) layer to obtain a second hidden vector G(H'(x)) of a new dimension, where G is the mapping function of the MLP layer.
Step 5: in combination with the flowchart of unified recognition of multiple subjects, multiple sub-attributes and covering attributes shown in fig. 5, a dependency relation matrix is constructed for the search text. Six relations may hold between different positions: H (head), E (end/tail), L (link), C (subordinate), O (parallel) and N (no relation). For each relation, W_q, d_q, W_k and d_k are initialized to score the relation between positions i and j in the matrix, where q_i = W_q G(H'(x_i)) + d_q, k_j = W_k G(H'(x_j)) + d_k, and the final score is S(i, j) = q_i^T k_j. Scoring of the upper triangular matrix is achieved in this way, and the relation with the highest final score is finally determined by the Softmax function.
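The pairwise scoring of step 5 can be sketched with NumPy. Only the shape of the computation (q_i = W_q G(H'(x_i)) + d_q, k_j = W_k G(H'(x_j)) + d_k, S(i, j) = q_i^T k_j, per relation type) is taken from the text; all dimensions and the random initialization are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n, hidden, proj, num_rel = 5, 16, 8, 6   # tokens, MLP dim, score dim, relations H/E/L/C/O/N

G = rng.normal(size=(n, hidden))          # G(H'(x_i)) for each token, from the MLP
W_q = rng.normal(size=(num_rel, proj, hidden))
d_q = rng.normal(size=(num_rel, proj))
W_k = rng.normal(size=(num_rel, proj, hidden))
d_k = rng.normal(size=(num_rel, proj))

# q_i = W_q G(H'(x_i)) + d_q and k_j = W_k G(H'(x_j)) + d_k, per relation type
Q = np.einsum("rph,nh->rnp", W_q, G) + d_q[:, None, :]
K = np.einsum("rph,nh->rnp", W_k, G) + d_k[:, None, :]
S = np.einsum("rip,rjp->rij", Q, K)       # S_r(i, j) = q_i^T k_j

# score only the upper triangle and pick the best relation per (i, j) cell
mask = np.triu(np.ones((n, n), dtype=bool))
best_relation = S.argmax(axis=0)          # index into the six relation types
```

In training, a softmax over the relation axis replaces the hard argmax so the scores can feed the loss described below.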
Referring to FIG. 5, in conjunction with sequence labeling and dependency syntax analysis, fine-grained attribute identification may be implemented and correlations between attributes may be identified. For example, the identification of different corresponding attributes of a plurality of entities (men and children) can be realized in one search question, and then the accurate search of a plurality of targets can be realized according to one search text.
The decoding procedure for the search text is as follows: first an H-labelled cell is found, giving the first character of a fragment; then an L-labelled cell is found in the same row; the search then moves to the row of the column where that L cell lies and looks for the next L or E cell, and the E cell marks the end of the fragment. In addition, subordinate (C) and parallel (O) relations may hold between E-labelled cells (each representing an attribute fragment).
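A simplified version of this decoding can be sketched as follows. The convention assumed here (an H cell on the diagonal opens a fragment; the current row is scanned rightwards for L to continue or E to close) is one plausible reading of the scheme, not the patent's exact implementation:

```python
def decode_fragments(M):
    """Decode attribute fragments from an upper-triangular label matrix M,
    where M[i][j] in {'H','E','L','C','O','N'} labels the (i, j) pair.
    Assumed convention: M[i][i] == 'H' opens a fragment at token i."""
    n = len(M)
    fragments = []
    for i in range(n):
        if M[i][i] != "H":                     # an H cell opens a fragment
            continue
        frag, row = [i], i
        while True:
            nxt = next((c for c in range(row + 1, n)
                        if M[row][c] in ("L", "E")), None)
            if nxt is None:
                break
            frag.append(nxt)
            if M[row][nxt] == "E":             # E closes the fragment
                break
            row = nxt                          # L: continue from the linked column's row
        fragments.append(frag)
    return fragments
```

Fragments decoded this way are token-index chains; C and O labels between their closing cells would then relate whole fragments to each other.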
In one or some embodiments, the first loss function includes a first cross entropy loss function and a KL divergence loss function;
the first cross entropy loss function takes a normalized exponential function and an actual relation value as independent variables, and the KL divergence loss function takes an average value of the first hidden vector and an average value of the target sub-attribute hidden vector as independent variables.
In order for the model to converge, a softmax function can be applied to the output of the neural network to express it in probability form; a loss function then measures the difference between the output and the true classification, and the model parameters are optimized iteratively.
Specifically, Loss = CrossEntropyLoss(Softmax(QK), Y) may be set to obtain the correlations of the sub-attributes. In addition, since covering sub-attributes must not change the original semantics, the recognition of sub-attributes needs to be constrained: Loss_constraint = KL(Avg(H'(X)), Avg(H'(X_sub-attribute-1, X_sub-attribute-2, ..., X_sub-attribute-N))), where KL denotes the KL distance and Avg the vector average. Since the sub-attribute inputs are initially unknown, the sequence of decoded, predicted sub-attributes is used.
The final model loss may be expressed as follows, where α is a weight factor:
Loss = α · CrossEntropyLoss(Softmax(QK), Y) + (1 − α) · KL(Avg(H'(X)), Avg(H'(X_sub-attribute-1, X_sub-attribute-2, ..., X_sub-attribute-N)))
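A NumPy sketch of this α-weighted loss follows. One assumption is flagged in the code: the KL term as written operates on average hidden vectors, which are not themselves probability distributions, so a softmax is applied to them here before the KL computation; all shapes and names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def model_loss(scores, y, h_full, h_sub, alpha=0.7):
    """alpha * CrossEntropyLoss(Softmax(QK), Y) + (1 - alpha) * KL(...)

    scores : (pairs, num_relations) raw QK scores
    y      : (pairs,) gold relation indices
    h_full : (n, d) first hidden vectors H'(X) of the full text
    h_sub  : (m, d) hidden vectors of the decoded sub-attribute sequence
    """
    p = softmax(scores)
    ce = -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))
    q = softmax(h_full.mean(axis=0))   # distribution from the averaged vectors
    r = softmax(h_sub.mean(axis=0))    # (softmax added here as an assumption)
    kl = np.sum(q * np.log((q + 1e-12) / (r + 1e-12)))
    return alpha * ce + (1.0 - alpha) * kl
```

When the sub-attribute average matches the full-text average, the KL term vanishes and only the relation cross entropy remains, which is exactly the constraint's intent.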
In short, through the semantic analysis module, a sub-attribute list of the user question and a correlation between sub-attributes can be obtained.
In one or some embodiments, the training step of the semantic matching model includes:
Extracting sub-attributes from the retrieval text sample according to semantic analysis results, and generating a positive sample vector, a retrieval text vector and a negative sample vector according to the sub-attribute conversion;
the positive sample vector, the search text vector and the negative sample vector are each segmented and feature-encoded by an encoder, then input into their respective bidirectional long short-term memory (BiLSTM) network models to obtain a positive contrastive encoding vector, a text contrastive encoding vector and a negative contrastive encoding vector;
processing the positive, text and negative contrastive encoding vectors through a normalized exponential function and a second loss function to obtain the labels of the sub-attributes in the search text sample;
and comparing the label of the sub-attribute with the manual label to obtain the corrected semantic matching model.
And, in one or some alternative embodiments, the second loss function includes a second cross entropy loss function and a binary classification loss function;
the second cross entropy loss function takes a normalized exponential function and an actual tag value as arguments, and the binary classification loss function takes the similarities among the positive, text and negative contrastive encoding vectors as arguments.
Specifically, the method is combined with a search text semantic matching flow chart shown in fig. 6, and is used for matching and searching a semantic analysis result and a video analysis result, and the process mainly comprises two parts of model training and decoding and searching. The training process is mainly used for training an adaptive decoder, and can realize the alignment of video analysis result decoding and semantic analysis result decoding. In the decoding and searching process, only the semantic analysis label (decoding characteristic) and the label (decoding characteristic) of the pre-stored video analysis result are required to be matched and searched.
The video analysis features mainly have 3 levels: attribute major class-attribute minor class-attribute entity. If decoding is performed directly in the conventional way and matching is then done by similarity, two problems arise. First, there are too many matching categories (typically tens of level-2 classes and thousands of level-3 classes), so the screening information cannot be fully utilized. Second, the text similarity between attribute entities is too high and their surface semantics too close (e.g. "pedestrian structuring-whether umbrella-yes" versus "pedestrian structuring-whether umbrella-no"), which leads to serious matching errors. Therefore, this embodiment combines classification with contrastive learning to optimize and strengthen the similarity within fine-grained classes, achieving a better hierarchical matching effect.
Referring to fig. 6, the semantic matching process in this embodiment is as follows:
First, in the training process, a number of sub-attributes are extracted from the Query (i.e. the user question, the search text) by the semantic analysis module. For example, from "man with umbrella and glasses", 3 sub-attributes are extracted: "man with umbrella", "man with glasses" and "man" (if a sub-attribute has a parent structure, the parent node must be incorporated as one piece). The positive and negative video-structuring samples corresponding to "man" are "human structured-sex-male" and "human structured-sex-female". Those corresponding to "man with umbrella" are "pedestrian structuring-whether umbrella-yes" and "pedestrian structuring-whether umbrella-no". There may be one or more negative samples, and they may also be sampled. The positive sample of "man wearing a long-sleeved top" is "pedestrian structured-upper garment-long sleeves"; its negative samples may be "pedestrian structured-upper garment-short sleeves", "pedestrian structured-upper garment-vest", etc.
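As a minimal illustrative sketch, the pairing of sub-attributes with positive and negative samples might look like the following. The mapping table is hypothetical, hand-written for the examples above; a real system would derive it from the video-structuring taxonomy:

```python
# Hypothetical mapping from a sub-attribute phrase to its structured
# video label (positive sample) and sibling labels under the same
# level-2 subclass (negative samples). Entries are illustrative only,
# copied from the examples in the text.
ATTRIBUTE_TABLE = {
    "man": ("human structured-sex-male",
            ["human structured-sex-female"]),
    "man with umbrella": ("pedestrian structuring-whether umbrella-yes",
                          ["pedestrian structuring-whether umbrella-no"]),
    "man in long sleeves": ("pedestrian structured-upper garment-long sleeves",
                            ["pedestrian structured-upper garment-short sleeves",
                             "pedestrian structured-upper garment-vest"]),
}

def build_training_triples(sub_attributes):
    """Pair each extracted sub-attribute with one positive and one or
    more (possibly sampled) negative structured labels."""
    triples = []
    for attr in sub_attributes:
        if attr not in ATTRIBUTE_TABLE:
            continue  # unseen attributes would need sampling/fallback logic
        positive, negatives = ATTRIBUTE_TABLE[attr]
        for negative in negatives:
            triples.append((attr, positive, negative))
    return triples

triples = build_training_triples(["man", "man with umbrella"])
```

In training, each such triple would then be tokenized into the (X, X_pos, X_neg) inputs for the encoder.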
The positive sample, the negative sample and the text's attribute fragment are tokenized and encoded as inputs X, X_pos and X_neg. The digitized encodings are then fed into a neural network encoder (e.g. BERT, RoBERTa), yielding Encoder(X), Encoder(X_pos) and Encoder(X_neg) respectively.
Each encoding is passed through its own BiLSTM network for further information fusion, yielding the semantic representation BiLSTM(Encoder(X)) (and likewise for X_pos and X_neg).
Here the sub-attribute of the question and its positive and negative samples should all belong to the same level-2 subclass label. For simplicity of representation, define F(X) = Softmax(BiLSTM(Encoder(X))). The first loss is then defined as follows:
Loss_1 = CrossEntropyLoss(F(X), Y) + CrossEntropyLoss(F(X_pos), Y) + CrossEntropyLoss(F(X_neg), Y)
In addition, the sub-attribute label of the question should distinguish the positive sample from the negative sample by similarity. Since the positive and negative samples belong to the same level-2 class, the corresponding degree of distinction can be strengthened, and the loss can be defined as follows:
Loss_2 = α·BinaryLoss(S(BiLSTM(Encoder(X)), BiLSTM(Encoder(X_pos))), 1) + (1-α)·BinaryLoss(S(BiLSTM(Encoder(X)), BiLSTM(Encoder(X_neg))), 0)
Here S may be any similarity function over text representations, for example one based on cosine distance; the closer its value is to 0 the less similar the representations, and the closer to 1 the more similar.
The final training loss can then be expressed as: Loss = β·Loss_1 + (1-β)·Loss_2.
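The training objective described above can be sketched as follows. This is a simplified NumPy illustration, not the patented implementation: the `h_*` arrays stand in for the BiLSTM(Encoder(·)) outputs, the `logits_*` arrays for the classification head before Softmax, the rescaled-cosine form of S and the values α = β = 0.5 are assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs, y):
    # negative log-likelihood of the true level-2 subclass label y
    return -np.log(probs[y] + 1e-12)

def cosine_sim01(u, v):
    # S in the text: cosine similarity rescaled into [0, 1]
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return (cos + 1.0) / 2.0

def binary_loss(s, target):
    # binary cross-entropy on the similarity score
    s = np.clip(s, 1e-7, 1 - 1e-7)
    return -(target * np.log(s) + (1 - target) * np.log(1 - s))

def total_loss(h_x, h_pos, h_neg, logits_x, logits_pos, logits_neg,
               y, alpha=0.5, beta=0.5):
    # Loss_1: all three inputs should classify to the same subclass y
    loss1 = (cross_entropy(softmax(logits_x), y)
             + cross_entropy(softmax(logits_pos), y)
             + cross_entropy(softmax(logits_neg), y))
    # Loss_2: pull the positive sample close, push the negative away
    loss2 = (alpha * binary_loss(cosine_sim01(h_x, h_pos), 1)
             + (1 - alpha) * binary_loss(cosine_sim01(h_x, h_neg), 0))
    return beta * loss1 + (1 - beta) * loss2

h_x = np.array([1.0, 0.0])
loss = total_loss(h_x, h_x, np.array([-1.0, 0.0]),
                  np.array([2.0, 0.0]), np.array([2.0, 0.0]),
                  np.array([2.0, 0.0]), y=0)
```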
In one or some embodiments, the step of obtaining the multimedia sub-class label includes:
extracting key frames from the multimedia, performing target detection on the key frames, and analyzing them according to preset hierarchical categories to obtain the labels of the corresponding hierarchical categories.
Specifically, taking video as an example, content analysis generally uses means such as extracting key frames from the video, performing target detection on the key frames, and general image recognition, so as to label the key content in each key frame.
In the embodiment of the invention, the multimedia can be analyzed according to labels of fixed hierarchical categories. Examples of such fixed hierarchical categories include: pedestrian structuring-age group-old people, pedestrian structuring-jacket style-short sleeve, vehicle structuring-general vehicle type-passenger car, and so on. The same key frame may contain multiple identified targets, and different targets may each carry multiple pieces of structured information.
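A sketch of how such fixed 3-level labels for one key frame might be represented, using the illustrative label strings from the text (the helper function name is hypothetical):

```python
# Illustrative 3-level label tuples: (major class, minor class, entity).
# A single key frame may carry several such labels, one or more per
# detected target.
frame_labels = [
    ("pedestrian structuring", "age group", "old people"),
    ("pedestrian structuring", "jacket style", "short sleeve"),
    ("vehicle structuring", "general vehicle type", "passenger car"),
]

def labels_for_major(labels, major):
    """Collect all (minor class, entity) pairs under one major class."""
    return [(minor, entity) for m, minor, entity in labels if m == major]

pedestrian = labels_for_major(frame_labels, "pedestrian structuring")
```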
In summary, the multimedia retrieval execution process of the embodiment of the invention is roughly divided into the following stages:
(1) In the inference and retrieval execution process, the semantic analysis module runs in real time to parse the user's search text into the corresponding semantic sub-attributes.
(2) The parsed sub-attributes are encoded by the pre-trained encoder model to obtain the contrast encoding, i.e. BiLSTM(Encoder(X)).
(3) The contrast encodings are input into the semantic matching model, which predicts the labels corresponding to the level-2 subclasses of the search text (i.e. the classification results obtained through Softmax).
(4) Similarity matching is performed against all pre-obtained encodings of the video subclasses, i.e. BiLSTM(Encoder(X_subclass_1)), BiLSTM(Encoder(X_subclass_2)), ..., BiLSTM(Encoder(X_subclass_n)); the matching still uses the S function. The most similar class whose score exceeds the threshold of 0.5 is preferably selected as the matching class to locate the video resource.
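Step (4) can be sketched as follows, assuming cosine similarity rescaled into [0, 1] as the S function and the 0.5 threshold mentioned above; the subclass vectors here are toy stand-ins for the pre-computed subclass encodings:

```python
import numpy as np

def cosine_sim01(u, v):
    # S function: cosine similarity rescaled into [0, 1]
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return (cos + 1.0) / 2.0

def match_subclasses(query_vec, subclass_vecs, threshold=0.5):
    """Return (name, score) of the best-matching video subclass whose
    S score exceeds the threshold, or None when nothing passes."""
    scores = {name: cosine_sim01(query_vec, v)
              for name, v in subclass_vecs.items()}
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] > threshold else None

# Toy pre-computed subclass encodings (stand-ins for the BiLSTM outputs).
subclass_vecs = {"subclass_1": np.array([1.0, 0.0]),
                 "subclass_2": np.array([0.0, 1.0])}
match = match_subclasses(np.array([0.9, 0.1]), subclass_vecs)
```

In practice the dictionary values would be the stored encodings of all video subclasses, computed once offline.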
A user's search question (text) may contain multiple parent attributes at the same level, and a parent attribute may have multiple child attributes. Through the above steps, one search question can be matched to multiple video subclasses at the same level.
Fig. 7 shows a schematic structural diagram of an embodiment of the multimedia retrieval apparatus of the present invention. As shown in fig. 7, the apparatus 700 includes:
the semantic analysis module 710 is adapted to receive a search text, and perform semantic analysis on the search text to obtain at least one sub-attribute and a correlation between the sub-attributes;
the tag obtaining module 720 is adapted to input each sub-attribute and the correlation into a pre-trained semantic matching model to obtain a tag of each sub-attribute;
The search matching module 730 is adapted to determine a similarity value between the label of each sub-attribute and the label of the pre-obtained multimedia sub-category, and determine the multimedia sub-category matching the search text according to the size of each similarity value.
Specifically, referring to the connection structure schematic of the video retrieval modules shown in fig. 8: in operation, the multimedia (taking video as an example) is first analyzed to obtain its tag classification information; the search text is then analyzed to obtain the corresponding sub-attributes and correlations; finally, the search text is matched against the multimedia resources through the semantic matching model to obtain the best matching result, so that the multimedia resource the user wants is located from the search text.
In one or some embodiments, the semantic analysis module 710 is further adapted to:
analyzing the search text to obtain the word segmentation and the part of speech of the search text;
performing dependency syntactic analysis on the search text by combining the segmentation words and the parts of speech thereof to obtain dependency relations among the segmentation words;
and determining the sub-attribute of the search text and the correlation relationship among the sub-attributes according to the segmentation word, the part of speech and the dependency relationship.
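A toy sketch of how sub-attributes might be read off a dependency parse. The parse and relation names below are illustrative, not the output of any specific parser, and the extraction rule is a simplification of the method described above:

```python
# Toy dependency parse for "man with umbrella and glasses":
# each token is (index, word, part of speech, head index, relation).
parse = [
    (1, "man",      "NOUN",  0, "root"),
    (2, "with",     "ADP",   3, "case"),
    (3, "umbrella", "NOUN",  1, "nmod"),
    (4, "and",      "CCONJ", 5, "cc"),
    (5, "glasses",  "NOUN",  3, "conj"),
]

def extract_sub_attributes(parse):
    """Treat the root noun as the parent attribute and each noun that
    modifies it as a child attribute incorporating the parent node."""
    root = next(t for t in parse if t[4] == "root")
    subs = [root[1]]  # the parent attribute itself is one sub-attribute
    for idx, word, pos, head, rel in parse:
        if pos == "NOUN" and rel in ("nmod", "conj"):
            subs.append(f"{root[1]} with {word}")
    return subs

attrs = extract_sub_attributes(parse)
```

This reproduces the three sub-attributes of the earlier example: the parent "man" plus "man with umbrella" and "man with glasses".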
In one or some embodiments, the semantic analysis module 710 is further adapted to:
encoding the word segmentation and its part of speech to obtain word segmentation digital codes, and inputting these into an encoder for conversion to obtain word segmentation vectors; performing syntactic analysis on the search text to obtain a syntactic feature vector;
concatenating and fusing the word segmentation vectors and the syntactic feature vector to obtain a first hidden vector;
processing the first hidden vector with a multi-layer perceptron to obtain a second hidden vector;
and constructing a dependency relation matrix from the search text and the dependency relations, determining the product value at each position of the dependency relation matrix from the matrix, a dependency-relation initialization vector and the second hidden vector, and determining the correlations among the sub-attributes according to the magnitudes of the product values and a first loss function.
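The product-value computation over the dependency relation matrix can be sketched with toy dimensions as follows. The bilinear form and the threshold rule are assumptions for illustration, since the patent learns this decision through the first loss function:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                      # 4 tokens, hidden size 8 (toy sizes)
H2 = rng.normal(size=(n, d))     # second hidden vectors, one per token
R = rng.normal(size=(d, d))      # dependency-relation initialization weights

# Dependency relation matrix: 1 where a syntactic arc links two tokens.
mask = np.array([[0, 1, 0, 0],
                 [0, 0, 1, 0],
                 [0, 0, 0, 0],
                 [1, 0, 0, 0]])

# Product value at each position: a bilinear score between the two
# tokens' hidden vectors, kept only where a dependency arc exists.
scores = (H2 @ R @ H2.T) * mask

def related_pairs(scores, threshold=0.0):
    """Token pairs whose product value exceeds the threshold are taken
    as correlated sub-attributes (illustrative decision rule)."""
    return [tuple(int(i) for i in p) for p in np.argwhere(scores > threshold)]
```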
In one or some embodiments, the first loss function includes a first cross entropy loss function and a KL divergence loss function;
the first cross entropy loss function takes a normalized exponential function and an actual relation value as independent variables, and the KL divergence loss function takes an average value of the first hidden vector and an average value of the target sub-attribute hidden vector as independent variables.
In one or some embodiments, the training step of the semantic matching model in the tag acquisition module 720 includes:
extracting sub-attributes from the search text sample according to the semantic analysis result, and generating a positive sample vector, a search text vector and a negative sample vector from the sub-attributes;
tokenizing and feature-encoding the positive sample vector, the search text vector and the negative sample vector with an encoder, and inputting each into its own bidirectional long short-term memory network model to obtain a positive contrast encoding vector, a text contrast encoding vector and a negative contrast encoding vector;
processing the positive contrast encoding vector, the text contrast encoding vector and the negative contrast encoding vector through a normalized exponential function and a second loss function to obtain the labels of the sub-attributes in the search text sample;
and comparing the labels of the sub-attributes with manual annotations to obtain the corrected semantic matching model.
In one or some embodiments, the second loss function includes a second cross entropy loss function and a binary classification loss function;
the second cross entropy loss function takes the normalized exponential function and the actual label value as arguments, and the binary classification loss function takes the similarities among the positive contrast encoding vector, the text contrast encoding vector and the negative contrast encoding vector as arguments.
In one or some embodiments, the step of retrieving the multimedia sub-class label in the matching module 730 includes:
extracting key frames from the multimedia, performing target detection on the key frames, and analyzing them according to preset hierarchical categories to obtain the labels of the corresponding hierarchical categories.
The key points and beneficial effects of the multimedia retrieval scheme disclosed by the invention comprise:
Firstly, the embodiment of the invention provides a fine-grained hierarchical semantic attribute extraction method. Compared with traditional semantic analysis, it can extract sub-attributes that are not merely single words but fragment sequences, giving better recognition of new words and of stateful expressions (e.g. the negative form "not wearing glasses"). It can also identify association relationships among the sub-attributes, such as parallel and subordinate relationships. A key parent attribute can be identified from the child attributes, so that the different retrieval requirements a search text expresses for multiple targets can be distinguished.
Secondly, the embodiment of the invention provides a multi-level contrast-matching semantic encoding method, which achieves category recognition under the level-2 classification. Through training with positive and negative sample contrast, it better completes the semantic encoding of multi-level classification labels, and can therefore match semantic sub-attributes to video labels more accurately.
The embodiment of the invention solves the confusion among search terms that keyword extraction can cause in existing schemes (e.g. when searching for "a man wearing a hat and a woman drinking", existing schemes easily return "a woman wearing a hat and a man drinking"), and supports retrieving multiple targets within one search question, thereby overcoming the defect of existing schemes and improving matching accuracy.
An embodiment of the present invention provides a non-volatile computer storage medium storing at least one executable instruction that can perform the multimedia retrieval method in any of the above-described method embodiments.
FIG. 9 illustrates a schematic diagram of an embodiment of a computing device of the present invention, and the embodiments of the present invention are not limited to a particular implementation of the computing device.
As shown in fig. 9, the computing device may include: a processor 902, a communication interface (Communications Interface) 904, a memory 906, and a communication bus 908.
Wherein: processor 902, communication interface 904, and memory 906 communicate with each other via a communication bus 908. A communication interface 904 for communicating with network elements of other devices, such as clients or other servers. The processor 902 is configured to execute the program 910, and may specifically perform relevant steps in the foregoing embodiments of the multimedia retrieval method for a computing device.
In particular, the program 910 may include program code including computer-operating instructions.
The processor 902 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the computing device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 906 is configured to store the program 910. The memory 906 may comprise high-speed RAM memory and may also include non-volatile memory, such as at least one disk memory.
The program 910 may be specifically configured to cause the processor 902 to perform operations corresponding to the above-described embodiments of the multimedia retrieval method.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The required structure for a construction of such a system is apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functionality of some or all of the components according to embodiments of the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present invention can also be implemented as an apparatus or device program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims (10)

1. A multimedia retrieval method, the method comprising:
receiving a search text, and carrying out semantic analysis on the search text to obtain at least one sub-attribute and a correlation between the sub-attributes;
inputting each sub-attribute and the correlation into a pre-trained semantic matching model to obtain labels of each sub-attribute;
and determining the similarity value between the label of each sub-attribute and the label of the pre-obtained multimedia sub-class, and determining the multimedia sub-class matched with the search text according to the size of each similarity value.
2. The method of claim 1, wherein receiving the search text, performing semantic analysis on the search text, and obtaining at least one sub-attribute and a correlation between the sub-attributes comprises:
analyzing the search text to obtain the word segmentation and the part of speech of the search text;
performing dependency syntactic analysis on the search text by combining the segmentation words and the parts of speech thereof to obtain dependency relations among the segmentation words;
and determining the sub-attribute of the search text and the correlation relationship among the sub-attributes according to the segmentation word, the part of speech and the dependency relationship.
3. The method according to claim 2, wherein determining the sub-attribute of the search text and the correlation between the sub-attributes according to the word and the part of speech and the dependency relationship thereof comprises:
coding the word segmentation and the part of speech thereof to obtain word segmentation digital codes, and inputting the word segmentation digital codes into an encoder for conversion to obtain word segmentation vectors; carrying out syntactic analysis on the search text to obtain a syntactic feature vector;
splicing and fusing the word segmentation vector and the syntactic feature vector to obtain a first hidden vector;
processing the first hidden vector by a multi-layer perceptron to obtain a second hidden vector;
and constructing a dependency relation matrix according to the retrieval text and the dependency relation, determining a product value of each position in the dependency relation matrix according to the dependency relation matrix, the dependency relation initializing vector and the second hidden vector, and determining the correlation relation between the sub-attributes according to the magnitude of the product value and a first loss function.
4. The method of claim 3, wherein the first loss function comprises a first cross entropy loss function and a KL divergence loss function;
The first cross entropy loss function takes a normalized exponential function and an actual relation value as independent variables, and the KL divergence loss function takes an average value of the first hidden vector and an average value of the target sub-attribute hidden vector as independent variables.
5. The method according to any one of claims 1-4, wherein the training step of the semantic matching model comprises:
extracting sub-attributes from the search text sample according to the semantic analysis result, and generating a positive sample vector, a search text vector and a negative sample vector from the sub-attributes;
tokenizing and feature-encoding the positive sample vector, the search text vector and the negative sample vector with an encoder, and inputting each into its own bidirectional long short-term memory network model to obtain a positive contrast encoding vector, a text contrast encoding vector and a negative contrast encoding vector;
processing the positive contrast encoding vector, the text contrast encoding vector and the negative contrast encoding vector through a normalized exponential function and a second loss function to obtain the labels of the sub-attributes in the search text sample;
and comparing the labels of the sub-attributes with manual annotations to obtain the corrected semantic matching model.
6. The method of claim 5, wherein the second loss function comprises a second cross entropy loss function and a binary classification loss function;
the second cross entropy loss function takes the normalized exponential function and the actual label value as arguments, and the binary classification loss function takes the similarities among the positive contrast encoding vector, the text contrast encoding vector and the negative contrast encoding vector as arguments.
7. The method according to any one of claims 1-4, wherein the step of obtaining the multimedia sub-class label comprises:
extracting key frames in the multimedia, carrying out target detection on the key frames, and analyzing according to preset hierarchical categories to obtain labels of the corresponding hierarchical categories.
8. A multimedia retrieval apparatus, the apparatus comprising:
the semantic analysis module is suitable for receiving the search text, and carrying out semantic analysis on the search text to obtain at least one sub-attribute and a correlation between the sub-attributes;
the label acquisition module is suitable for inputting each sub-attribute and the correlation into a pre-trained semantic matching model to obtain labels of each sub-attribute;
And the retrieval matching module is suitable for determining similarity values between the labels of the sub-attributes and the labels of the pre-obtained multimedia sub-classes, and determining the multimedia sub-classes matched with the retrieval text according to the size of each similarity value.
9. A computing device, comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is configured to store at least one executable instruction, where the executable instruction causes the processor to perform operations corresponding to the multimedia retrieval method according to any one of claims 1-7.
10. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the multimedia retrieval method according to any one of claims 1-7.
CN202211434753.4A 2022-11-16 2022-11-16 Multimedia retrieval method, device, computing equipment and storage medium Pending CN116304120A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211434753.4A CN116304120A (en) 2022-11-16 2022-11-16 Multimedia retrieval method, device, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211434753.4A CN116304120A (en) 2022-11-16 2022-11-16 Multimedia retrieval method, device, computing equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116304120A true CN116304120A (en) 2023-06-23

Family

ID=86811877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211434753.4A Pending CN116304120A (en) 2022-11-16 2022-11-16 Multimedia retrieval method, device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116304120A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116578729A (en) * 2023-07-13 2023-08-11 腾讯科技(深圳)有限公司 Content search method, apparatus, electronic device, storage medium, and program product
CN116578729B (en) * 2023-07-13 2023-11-28 腾讯科技(深圳)有限公司 Content search method, apparatus, electronic device, storage medium, and program product

Similar Documents

Publication Publication Date Title
CN111897908B (en) Event extraction method and system integrating dependency information and pre-training language model
CN109062893B (en) Commodity name identification method based on full-text attention mechanism
CN107526799B (en) Knowledge graph construction method based on deep learning
CN108932342A (en) A kind of method of semantic matches, the learning method of model and server
CN110597961B (en) Text category labeling method and device, electronic equipment and storage medium
CN115034224A (en) News event detection method and system integrating representation of multiple text semantic structure diagrams
CN112711660A (en) Construction method of text classification sample and training method of text classification model
CN112101031B (en) Entity identification method, terminal equipment and storage medium
CN112667813B (en) Method for identifying sensitive identity information of referee document
CN111814477B (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN113157859A (en) Event detection method based on upper concept information
CN111125457A (en) Deep cross-modal Hash retrieval method and device
CN116150367A (en) Emotion analysis method and system based on aspects
CN116304120A (en) Multimedia retrieval method, device, computing equipment and storage medium
CN113486178B (en) Text recognition model training method, text recognition method, device and medium
CN114742016A (en) Chapter-level event extraction method and device based on multi-granularity entity differential composition
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN115115432B (en) Product information recommendation method and device based on artificial intelligence
CN116958677A (en) Internet short video classification method based on multi-mode big data
CN110941958A (en) Text category labeling method and device, electronic equipment and storage medium
CN116562291A (en) Chinese nested named entity recognition method based on boundary detection
Hong et al. Fine-grained feature generation for generalized zero-shot video classification
CN116975340A (en) Information retrieval method, apparatus, device, program product, and storage medium
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination