WO2024104438A1 - Multimedia retrieval method, apparatus, device, medium and program product - Google Patents

Multimedia retrieval method, apparatus, device, medium and program product

Info

Publication number
WO2024104438A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
vector
text
attributes
loss function
Prior art date
Application number
PCT/CN2023/132098
Other languages
English (en)
French (fr)
Inventor
杨希
潘喆
闫伟
Original Assignee
中移(苏州)软件技术有限公司
中国移动通信集团有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中移(苏州)软件技术有限公司, 中国移动通信集团有限公司 filed Critical 中移(苏州)软件技术有限公司
Publication of WO2024104438A1 publication Critical patent/WO2024104438A1/zh

Links

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 — of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/41 — Indexing; Data structures therefor; Storage structures
    • G06F 16/43 — Querying
    • G06F 16/50 — of still image data
    • G06F 16/53 — Querying
    • G06F 16/60 — of audio data
    • G06F 16/63 — Querying
    • G06F 16/70 — of video data
    • G06F 16/73 — Querying
    • G06F 40/00 — Handling natural language data
    • G06F 40/20 — Natural language analysis
    • G06F 40/253 — Grammatical analysis; Style critique
    • G06F 40/279 — Recognition of textual entities
    • G06F 40/284 — Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/30 — Semantic analysis
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of artificial intelligence technology, and specifically to a multimedia retrieval method, device, equipment, medium and program product.
  • the current search scheme is based on user keywords.
  • the above search scheme cannot distinguish the relationship between the attribute and the entity object, resulting in confusion between the search semantics; and because there is no association and order relationship between the keywords, it may cause problems such as misrecognition or missed recognition.
  • the present application provides a multimedia retrieval method, apparatus, device, medium and program product.
  • the present application embodiment first provides a multimedia retrieval method, the method comprising:
  • receiving the search text, performing semantic analysis on the search text, and obtaining at least one sub-attribute and the correlations between the sub-attributes includes:
  • Sub-attributes of the search text and correlations between the sub-attributes are determined according to the segmented words, the parts of speech of the segmented words, and the dependency relationships.
  • determining the sub-attributes of the search text and the correlation between the sub-attributes according to the segmented words, the part of speech of the segmented words, and the dependency relationship includes:
  • encoding the word segmentation and the part of speech of the word segmentation to obtain a word-segmentation digital code; inputting the digital code into the encoder for conversion to obtain a word-segmentation vector; and performing syntactic analysis on the search text to obtain a syntactic feature vector;
  • a dependency matrix is constructed according to the search text and the dependency, a product value of each position in the dependency matrix is determined according to the dependency matrix, the dependency initialization vector and the second latent vector, and a correlation between the sub-attributes is determined according to the product value and a first loss function.
  • the first loss function includes a first cross entropy loss function and a KL divergence loss function
  • the first cross entropy loss function takes the normalized exponential function and the actual relationship value as independent variables
  • the KL divergence loss function takes the average value of the first latent vector and the average value of the target sub-attribute latent vector as independent variables.
  • the method further comprises:
  • the positive sample vector, the search text vector and the negative sample vector are processed respectively by a cascaded encoder and a bidirectional long short-term memory network model to obtain a positive ratio encoding vector, a text ratio encoding vector and a negative ratio encoding vector respectively; wherein the encoder is used to perform word segmentation and feature encoding on the positive sample vector, the search text vector and the negative sample vector respectively;
  • the text ratio coding vector and the negative ratio coding vector are processed by a normalized exponential function and a second loss function, a label of a sub-attribute in the retrieved text sample is obtained;
  • the parameters of the semantic matching model are adjusted based on the labels of the sub-attributes and the manual labels to obtain the pre-trained semantic matching model.
  • the second loss function includes a second cross entropy loss function and a binary classification loss function
  • the second cross entropy loss function takes the normalized exponential function and the actual label value as independent variables;
  • the binary classification loss function takes the similarity between the positive ratio encoding vector, the text ratio encoding vector and the negative ratio encoding vector as independent variables.
  • the multimedia subcategory tag is obtained by the following steps:
  • Extract key frames from the multimedia, perform target detection on the key frames, analyze them according to preset hierarchical categories, and obtain labels of the corresponding hierarchical categories.
  • the embodiment of the present application also provides a multimedia retrieval device, the device comprising:
  • a semantic analysis module is configured to receive a search text, perform semantic analysis on the search text, and obtain at least one sub-attribute and a correlation between the sub-attributes;
  • a label acquisition module is configured to process each of the sub-attributes and the correlation through a pre-trained semantic matching model to obtain a label for each of the sub-attributes;
  • the search and matching module is configured to determine a similarity value between the label of each of the sub-attributes and the pre-obtained multimedia sub-category label, and determine a multimedia sub-category matching the search text according to each of the similarity values.
  • the semantic analysis module is configured to analyze the search text to obtain the participles of the search text and the parts of speech of the participles; perform dependency syntactic analysis on the search text in combination with the participles and the parts of speech of the participles to obtain dependency relationships between the participles; and determine the sub-attributes of the search text and the correlation between the sub-attributes based on the participles, the parts of speech of the participles and the dependency relationships.
  • the semantic analysis module is configured to encode the word segmentation and the part of speech of the word segmentation to obtain a word segmentation digital code, input the word segmentation digital code into the encoder for conversion to obtain a word segmentation vector; perform syntactic analysis on the search text to obtain a syntactic feature vector; concatenate and fuse the word segmentation vector and the syntactic feature vector to obtain a first latent vector; process the first latent vector through a multilayer perceptron to obtain a second latent vector; construct a dependency matrix based on the search text and the dependency relationship, determine the product value of each position in the dependency matrix based on the dependency matrix, the dependency initialization vector and the second latent vector, and determine the correlation between the sub-attributes based on the product value and the first loss function.
  • the first loss function includes a first cross entropy loss function and a KL divergence loss function
  • the first cross entropy loss function takes the normalized exponential function and the actual relationship value as independent variables
  • the KL divergence loss function takes the average value of the first latent vector and the average value of the target sub-attribute latent vector as independent variables.
  • the label acquisition module is configured to extract sub-attributes from the retrieval text sample according to the semantic analysis results, and generate positive sample vectors, retrieval text vectors and negative sample vectors according to the sub-attribute conversion; the positive sample vectors, the retrieval text vectors and the negative sample vectors are processed respectively by a cascaded encoder and a bidirectional long short-term memory network model to obtain positive ratio encoding vectors, text ratio encoding vectors and negative ratio encoding vectors respectively; the positive ratio encoding vectors, the text ratio encoding vectors and the negative ratio encoding vectors are processed by a normalized exponential function and a second loss function to obtain labels of sub-attributes in the retrieval text sample; the labels of the sub-attributes are compared with manual labels to obtain the pre-trained semantic matching model; wherein the encoder is used to perform word segmentation and feature encoding on the positive sample vectors, the retrieval text vectors and the negative sample vectors respectively.
  • the second loss function includes a second cross entropy loss function and a binary classification loss function
  • the second cross entropy loss function takes the normalized exponential function and the actual label value as independent variables;
  • the binary classification loss function takes the similarity between the positive ratio encoding vector, the text ratio encoding vector and the negative ratio encoding vector as independent variables.
  • the retrieval and matching module is configured to extract key frames from multimedia, perform target detection on the key frames, analyze them according to preset hierarchical categories, and obtain labels of corresponding hierarchical categories.
  • the embodiment of the present application further provides a computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with each other through the communication bus;
  • the memory is configured to store at least one executable instruction.
  • the executable instruction is executed by the processor, the multimedia retrieval method described in any of the previous embodiments can be implemented.
  • An embodiment of the present application further provides a computer storage medium, wherein the storage medium stores at least one executable instruction, and when the executable instruction is executed by a processor of an electronic device, it can implement any of the multimedia retrieval methods described above.
  • An embodiment of the present application further provides a computer program, which includes a computer-readable code.
  • the processor of the electronic device executes the computer program to implement any of the multimedia retrieval methods described above.
  • An embodiment of the present application also provides a computer program product, which includes a computer-readable code, or a non-volatile computer-readable storage medium that carries the computer-readable code.
  • In the multimedia retrieval solution provided in the embodiments of the present application, fine-grained hierarchical semantic analysis is performed on the retrieval text, and similarity values between the semantic labels of the retrieval text and the multimedia subcategory labels are determined. This solves the problem that keyword-extraction methods in the related art may confuse the search semantics, makes it possible to better explore the relationship between the retrieval-text semantics and the multimedia subcategory labels, and allows a single retrieval text to retrieve multiple targets with higher recognition and accuracy.
  • FIG. 1 shows a flowchart of a multimedia retrieval method provided in an embodiment of the present application;
  • FIG. 2 is a schematic diagram of a multimedia retrieval scenario provided by an embodiment of the present application;
  • FIG. 3 shows a flowchart of semantic analysis of search text provided in an embodiment of the present application;
  • FIG. 4 shows a flowchart of extracting fine-grained hierarchical semantic attributes of search text provided by an embodiment of the present application;
  • FIG. 5 shows a schematic diagram of a process for unified identification of multiple subjects, multiple sub-attributes, and coverage attributes provided by an embodiment of the present application;
  • FIG. 6 shows a flowchart of retrieval-text semantic matching provided by an embodiment of the present application;
  • FIG. 7 shows a schematic diagram of the structure of a multimedia retrieval device provided in an embodiment of the present application;
  • FIG. 8 shows a schematic diagram of the structure of a video retrieval module provided in an embodiment of the present application;
  • FIG. 9 shows a schematic diagram of the structure of a computing device provided in an embodiment of the present application.
  • FIG1 shows a flowchart of a multimedia retrieval method provided in an embodiment of the present application, and the method is applied to a computing device.
  • the computing device includes various computers, smart terminals, and tablet computers. As shown in FIG1 , the method includes the following steps:
  • Step 110: Receive a search text, perform semantic analysis on the search text, and obtain at least one sub-attribute and the correlations between sub-attributes.
  • the embodiment of the present application can be used to retrieve corresponding multimedia resources by receiving search questions and other terms input by the user, wherein the above-mentioned multimedia resources include videos, sounds, image collections, and animations, etc.
  • the user enters the terms, questions, etc. that he wants to search for in the search box 201 to obtain the corresponding retrieval text, and obtains the sub-attributes through the semantic analysis method.
  • The sub-attribute can be not only a single word but also a fragment sequence, which has better recognition ability than a single participle or a word with a state (such as the negative sentence "not wearing glasses"). The correlations between sub-attributes, such as parallel relationships and subordinate relationships, can also be identified, and key parent attributes can be identified from the sub-attributes, so that different retrieval requirements of multiple targets can be recognized from a single sentence of retrieval text. Illustratively, in FIG2, the retrieval text "looking for people wearing glasses and black clothes" can be entered in the search box 201.
  • Step 120 Process each sub-attribute and related relationship through a pre-trained semantic matching model to obtain a label for each sub-attribute.
  • This step obtains the sub-attribute label with the highest matching degree through the pre-trained semantic matching model, which can improve the matching accuracy.
  • Step 130 Determine the similarity value between the label of each sub-attribute and the pre-obtained multimedia sub-category label, and determine the multimedia sub-category matching the search text according to each similarity value.
  • multimedia data such as video can be analyzed and corresponding tags extracted in advance and then saved in a database 202 .
  • a multimedia subclass can represent a hierarchical structure of multimedia, or any category of multimedia.
  • a match is performed between the tags in the retrieval text and the multimedia subtags of the corresponding category of multimedia, and then the corresponding multimedia is determined or located based on the matching results.
  • step 120 and step 130 may be combined and executed.
  • multimedia resources matching the user's search may be obtained through the semantic matching model and similarity-value analysis.
  • multimedia with a similarity greater than a preset threshold can be determined as multimedia that meets the user's needs.
  • receiving a search text, performing semantic analysis on the search text, and obtaining at least one sub-attribute and the correlation between the sub-attributes may be achieved in the following manner:
  • the search text may also be preprocessed; exemplarily, the above preprocessing may include: uppercase and lowercase conversion, traditional and simplified conversion, and typo correction.
  • the search text can be finely divided, thereby improving the accuracy of the dependency relationship, and further improving the accuracy of the correlation relationship between sub-attributes.
  • determining the sub-attributes of the search text and the correlation between the sub-attributes according to the segmentation, the part of speech of the segmentation, and the dependency relationship can be achieved in the following manner:
  • Encode the word segmentation and its part of speech to obtain a word-segmentation digital code; input the digital code into the encoder for conversion to obtain a word-segmentation vector; perform syntactic analysis on the retrieved text to obtain a syntactic feature vector; concatenate and fuse the word-segmentation vector and the syntactic feature vector to obtain a first latent vector; process the first latent vector through a multilayer perceptron (MLP) to obtain a second latent vector; construct a dependency matrix according to the retrieved text and the dependency relationships, determine the product value of each position in the dependency matrix according to the dependency matrix, the dependency initialization vector and the second latent vector, and determine the correlations between the sub-attributes according to the product values and the first loss function.
  • Step 301 Text preprocessing.
  • the text preprocessing includes preprocessing the search text, and the operations that can be performed by the preprocessing include: uppercase and lowercase conversion, traditional and simplified Chinese conversion, and wrong character correction.
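As an illustrative sketch of Step 301 (not part of the claimed embodiments), the preprocessing could look like the following, where the tiny traditional-to-simplified table and typo-correction table are assumed stand-ins for real conversion resources:

```python
# Hypothetical preprocessing for the retrieval text (Step 301).
TRAD_TO_SIMP = {"眼鏡": "眼镜"}   # assumed traditional -> simplified table
TYPO_FIX = {"戴眼境": "戴眼镜"}   # assumed wrong-character correction table

def preprocess(text: str) -> str:
    text = text.lower()                      # upper/lowercase conversion
    for trad, simp in TRAD_TO_SIMP.items():  # traditional -> simplified
        text = text.replace(trad, simp)
    for typo, fix in TYPO_FIX.items():       # wrong-character correction
        text = text.replace(typo, fix)
    return text

print(preprocess("ABC戴眼境"))  # -> "abc戴眼镜"
```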
  • Step 302 lexical analysis of the text.
  • text lexical analysis may include word segmentation and part-of-speech analysis of the retrieved text.
  • For example, when performing text lexical analysis on "a man wearing glasses", it may be segmented into "a/m wearing/v glasses/n of/u man/n", where m, v and n may be the parts of speech of "a", "wearing" and "glasses"/"man" respectively, and u marks the particle.
  • part-of-speech classification standards of ansj, CTB and PKU may be used to perform text lexical analysis.
  • word segmentation and part-of-speech tagging standards may also be used to perform text lexical analysis.
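The lexical analysis of Step 302 could be sketched as a toy longest-match segmenter over a hypothetical POS lexicon; a real system would use ansj or CTB/PKU-style tagging standards as noted above:

```python
# Toy forward maximum-match segmentation with POS tags (Step 302 sketch).
# The lexicon covers only the running example "一个戴眼镜的男人".
LEXICON = {"一个": "m", "戴": "v", "眼镜": "n", "的": "u", "男人": "n"}

def segment(text):
    result, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest match first
            word = text[i:j]
            if word in LEXICON:
                result.append((word, LEXICON[word]))
                i = j
                break
        else:                                  # unknown single character
            result.append((text[i], "x"))
            i += 1
    return result

print(segment("一个戴眼镜的男人"))
```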
  • Step 303 Syntactic analysis of the text.
  • the search text may be subjected to dependency syntactic analysis in combination with the word segmentation results and the part of speech of the word segmentation, and the analysis of the associated attributes between each word may be realized through the dependency syntactic analysis.
  • the dependency attributes in the embodiments of the present application include but are not limited to: ATT (attributive), DE (of), SBV (subject), VOB (object), COO (general coordination), COS (shared coordination), ADV (adverbial) and HED (core); for example, "a man wearing glasses” can be parsed into the data list shown in Table 1 through word segmentation, part-of-speech analysis and dependency syntax analysis.
  • The dependency relationships in this example can be represented by existing standards such as the PKU Multi-view Chinese Treebank.
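The data list of Table 1 could be represented as follows; the head indices and relation labels are illustrative and drawn from the relation set listed above (ATT, VOB, DE, HED, ...):

```python
# Illustrative dependency data list for "一个戴眼镜的男人"
# ("a man wearing glasses"): (index, word, pos, head_index, relation).
parse = [
    (1, "一个", "m", 5, "ATT"),   # "a" modifies "man"
    (2, "戴",   "v", 5, "ATT"),   # "wearing ..." clause modifies "man"
    (3, "眼镜", "n", 2, "VOB"),   # "glasses" is the object of "wearing"
    (4, "的",   "u", 2, "DE"),    # particle attaching the clause
    (5, "男人", "n", 0, "HED"),   # "man" is the head of the sentence
]

# Build a head -> dependents map to walk the tree.
children = {}
for idx, word, pos, head, rel in parse:
    children.setdefault(head, []).append(word)

print(children[5])  # dependents of "男人"
```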
  • Step 304 Extract fine-grained hierarchical semantic attributes.
  • fine-grained attribute extraction can be provided to realize the identification of dependency relationships between attributes and core entities, and the specific process can be shown in FIG. 4 as the process of extracting fine-grained hierarchical semantic attributes from retrieved text.
  • Step 401 Lexical analysis output.
  • the result of the lexical analysis may be converted into a digital code X = (x1, x2, x3, ..., xn) in combination with a dictionary and input into an encoder.
  • Step 402 Convert the digital code into a feature vector through an encoder.
  • the encoder may include word2vec, BERT, RoBERTa, etc.
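Step 401 could be sketched as a dictionary lookup producing the digital code X = (x1, ..., xn); the vocabulary and the unknown-token id below are assumptions, and a real system would feed these codes to word2vec, BERT or RoBERTa as described above:

```python
# Dictionary-based conversion of lexical-analysis output to digital codes.
VOCAB = {"<unk>": 0, "一个": 1, "戴": 2, "眼镜": 3, "的": 4, "男人": 5}

def to_codes(tokens):
    # Unknown tokens fall back to the assumed <unk> id 0.
    return [VOCAB.get(t, VOCAB["<unk>"]) for t in tokens]

print(to_codes(["一个", "戴", "眼镜", "的", "男人"]))  # X = (x1 .. xn)
```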
  • Step 403 Syntactic analysis output.
  • the syntactic features output by the syntactic analysis may be D(x) = (d1, d2, d3, ..., dn), where d1, d2, d3, ..., dn may be the n syntactic features.
  • Step 404 feature fusion.
  • the syntactic features and the feature vectors may be concatenated to obtain a first latent vector;
  • the first latent vector may be written as H'(x) = (h1⊕d1, h2⊕d2, ..., hn⊕dn), where ⊕ denotes concatenation.
  • Step 405 Obtain a second latent vector.
  • the matrix corresponding to the first latent vector can be processed by MLP to obtain the second latent vector G(H’(x)), where G is the mapping function representation of the MLP layer.
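Steps 404 and 405 could be sketched as follows, with made-up dimensions and weights: each token vector h_i is concatenated with its syntactic feature d_i to form H'(x), and a small MLP (the mapping G) then produces the second latent vector:

```python
# Feature fusion (Step 404) and MLP mapping (Step 405), illustrative only.
def fuse(h_vecs, d_vecs):
    # H'(x) = (h1 ⊕ d1, ..., hn ⊕ dn): per-token concatenation
    return [h + d for h, d in zip(h_vecs, d_vecs)]

def mlp(vec, w, b):
    # one linear layer with ReLU as a stand-in for the MLP mapping G
    return [max(0.0, sum(wi * x for wi, x in zip(row, vec)) + bi)
            for row, bi in zip(w, b)]

h = [[1.0, 0.0], [0.0, 1.0]]          # token vectors from the encoder
d = [[0.5], [0.25]]                   # one syntactic feature per token
H_prime = fuse(h, d)                  # each row is h_i ⊕ d_i (length 3)
W, B = [[1.0, 1.0, 1.0], [0.0, -1.0, 2.0]], [0.0, 0.0]
second = [mlp(row, W, B) for row in H_prime]   # second latent vector G(H'(x))
print(second)
```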
  • Step 406 Decoding based on dependency.
  • the upper triangular matrix is scored in the above manner, and the relationship with the highest final score is finally determined by the Softmax function.
  • fine-grained attribute recognition can be achieved, and the correlation between attributes can be identified. For example, in a search question, different corresponding attributes of multiple entities (man, child) can be identified, and then accurate retrieval of multiple targets can be achieved based on a search text.
  • the decoding method of the search text can be: first find the H attribute character as the first character, then find the L attribute character in the same row, then find the next L or E attribute character in the column where the L attribute character is located, and end with the E attribute character as the segment.
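The scoring of a single cell of the upper-triangular dependency matrix (Step 406) could be sketched as follows; the scores are made up, whereas a real model would compute them from products of the second latent vectors:

```python
import math

# Pick the relation with the highest Softmax probability for one (i, j)
# cell of the upper-triangular dependency matrix (Step 406 sketch).
def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

RELATIONS = ["ATT", "VOB", "COO", "none"]
cell_scores = [2.0, 0.1, 0.1, -1.0]    # illustrative scores for one cell
probs = softmax(cell_scores)
best = RELATIONS[probs.index(max(probs))]
print(best)  # highest-probability relation for this cell
```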
  • the expression of word segmentation vectors and syntactic feature vectors can be simplified; and, by concatenating and fusing the word segmentation vectors and the syntactic feature vectors to obtain the first latent vector, the richness of information in the first latent vector can be improved; at the same time, by processing the first latent vector through MLP to obtain the second latent vector, the classification and regression processing of the features contained in the first latent vector can be realized; on this basis, the accuracy of the product value and the correlation can be improved.
  • the first loss function includes a first cross entropy loss function and a KL divergence loss function
  • the first cross entropy loss function takes the normalized exponential function and the actual relationship value as independent variables
  • the KL divergence loss function takes the average value of the first latent vector and the average value of the target sub-attribute latent vector as independent variables.
  • The output of the neural network can be converted by the Softmax function so that it is expressed in the form of probabilities; the first loss function is then used to calculate the gap between the output and the true classification, so that the model parameters can be optimized through iteration and other methods.
  • The loss can be CrossEntropyLoss(Softmax(QK), Y).
  • The sub-attribute constraint loss can be: KL(Avg(H'(X)), Avg(H'(X_sub-attribute-1, X_sub-attribute-2, ..., X_sub-attribute-N))); where KL represents the KL distance and Avg represents the vector average. Since the input sub-attributes are unknown at the beginning, the sequence of predicted sub-attributes after decoding is used.
  • the final model loss of the first loss function can be shown as formula (1):
  • Loss = CrossEntropyLoss(Softmax(QK), Y) + λ · KL(Avg(H'(X)), Avg(H'(X_sub-attribute-1, ..., X_sub-attribute-N)))  (1)
  • where λ is a weight factor, and its value is a positive number less than 1.
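A numeric sketch of the first loss (cross entropy plus a weighted KL-divergence sub-attribute constraint); the probability values and the weight factor below are illustrative assumptions:

```python
import math

# First loss sketch: CrossEntropyLoss(Softmax(QK), Y) + lam * KL(...).
def cross_entropy(probs, true_idx):
    return -math.log(probs[true_idx])

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

softmax_qk = [0.7, 0.2, 0.1]   # Softmax(QK) for one cell, assumed
y = 0                          # actual relationship value (true class)
avg_h = [0.6, 0.4]             # Avg(H'(X)), assumed
avg_sub = [0.5, 0.5]           # Avg over predicted sub-attribute vectors
lam = 0.3                      # weight factor, a positive number < 1

loss = cross_entropy(softmax_qk, y) + lam * kl(avg_h, avg_sub)
print(round(loss, 4))
```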
  • the multimedia retrieval method provided in the embodiments of the present application further includes:
  • the positive sample vector, the search text vector and the negative sample vector are processed respectively by a cascaded encoder and a bidirectional long short-term memory network model (Bi-directional Long Short-Term Memory, BiLSTM), and a positive ratio encoding vector, a text ratio encoding vector and a negative ratio encoding vector are obtained correspondingly;
  • the positive ratio encoding vector, the text ratio encoding vector and the negative ratio encoding vector are processed by a normalized exponential function and a second loss function to obtain labels of sub-attributes in the search text sample;
  • the parameters of the semantic matching model are adjusted based on the labels of the sub-attributes and the manual labels to obtain a pre-trained semantic matching model.
  • the encoder is used to perform word segmentation and feature encoding on the positive sample vector, the retrieved text vector and the negative sample vector respectively.
  • the second loss function includes a second cross entropy loss function and a binary classification loss function
  • the second cross entropy loss function takes the normalized exponential function and the actual label value as independent variables
  • the binary classification loss function takes the similarity between the positive ratio encoding vector, the text ratio encoding vector and the negative ratio encoding vector as independent variables.
  • model training is mainly used to train an adapted decoder, through which the decoding of video analysis results and semantic analysis results can be aligned.
  • In the decoding and retrieval process, it is only necessary to match the labels (decoded features) from the semantic analysis against the labels (decoded features) of the pre-stored video analysis results.
  • the video analysis features mainly include three levels of features, namely, attribute categories, attribute subcategories, and attribute entities.
  • If the conventional method is used to decode directly and then match by similarity, there will be too many matching categories (generally, the second-level classification includes dozens of categories and the third-level classification includes thousands), so the screening information cannot be fully utilized. It is also easy for the text similarity between attribute entities to be too high and the conventional semantics too close (for example, "pedestrian structured-whether to carry an umbrella-yes" versus "pedestrian structured-whether to carry an umbrella-no"), resulting in large matching errors. Therefore, in this embodiment, classification and contrastive learning are combined to optimize and strengthen similarity within the sub-classifications, achieving a better hierarchical matching effect.
  • the semantic matching process in this embodiment is as follows:
  • Query 601: that is, the user question (retrieval text).
  • A large number of sub-attributes are extracted by the semantic analysis module. For example, from "a man with an umbrella and wearing glasses", three sub-attribute fragments are extracted: "a man with an umbrella", "a man wearing glasses" and "a man" (if a sub-attribute has a parent structure, it needs to be combined with its parent node as one fragment).
  • the video structured third-level positive sample 602 and the third-level negative sample 603 corresponding to “man” may be: “human body structured-gender-male” and “human body structured-gender-female”, respectively.
  • the three-level positive samples and three-level negative samples of the video structure corresponding to "man with umbrella” can be: "pedestrian structured-whether to carry umbrella-yes” and “pedestrian structured-whether to carry umbrella-no”.
  • the third-level positive sample of "man wearing long clothes” is “pedestrian structured-top-long sleeves”
  • its third-level negative sample can be “pedestrian structured-top-short sleeves” or “pedestrian structured-top-vest” and so on.
  • the third-level positive sample and the third-level negative sample can be one or more, and can also be obtained through sampling.
  • The positive and negative samples and the attribute fragments of the text can be segmented and encoded to obtain encoding information, and the encoding information corresponding to the above data can be respectively recorded as X, X_positive and X_negative; exemplarily, the above encoding information is input into the encoder 604, and Encoder(X), Encoder(X_positive) and Encoder(X_negative) can be obtained respectively; wherein the encoder can include BERT and RoBERTa.
  • Encoder(X) can be input into BiLSTM 606 to obtain the semantic information BiLSTM(Encoder(X)) with further information fusion.
  • F(X) Softmax(BiLSTM(Encoder(X))) (2)
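A minimal numeric sketch of the data flow of formula (2). The real model uses a pretrained encoder (BERT/RoBERTa) and a BiLSTM; here both are replaced by assumed stand-ins (a random embedding table and mean pooling), so only the pipeline shape F(X) = Softmax(BiLSTM(Encoder(X))) is illustrated, not the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, N_CLASSES = 100, 16, 4          # hypothetical sizes

E = rng.normal(size=(VOCAB, DIM))           # stand-in for Encoder (BERT/RoBERTa)
W = rng.normal(size=(DIM, N_CLASSES))       # classification head over subclasses

def encoder(token_ids):
    return E[token_ids]                     # (seq_len, DIM) token features

def bilstm_standin(h):
    # The real BiLSTM fuses left and right context; mean pooling is only a
    # placeholder producing one fused vector per sequence.
    return h.mean(axis=0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def F(token_ids):
    # Formula (2): F(X) = Softmax(BiLSTM(Encoder(X)))
    return softmax(bilstm_standin(encoder(token_ids)) @ W)

probs = F(np.array([3, 17, 42]))
# probs is a probability distribution over the second-level subclass labels
```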
  • the sub-attribute labels of the question should be able to distinguish the positive and negative third-level samples by similarity; considering that the positive and negative samples belong to the same second-level classification, the corresponding discrimination can be strengthened, and the second loss function can be defined as shown in formula (4):
  • S can be a similarity function, such as cosine-sim; the closer its result is to 0, the less similar the calculation factors are, and the closer to 1, the more similar they are; the above calculation factors may include BiLSTM(Encoder(X)), BiLSTM(Encoder(X_positive)) and BiLSTM(Encoder(X_negative)) in formula (4).
  • exemplarily, the training module can express the overall loss function as formula (5): Loss_final = β·Loss1 + (1−β)·Loss2, where β can be a positive number less than or equal to 1.
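The published text does not reproduce formulas (4) and (5) in full, so the sketch below only illustrates the stated structure of such a loss: a cross-entropy term (Loss1) for the subclass classification, a similarity term (Loss2) that pushes S(X, X_positive) up and S(X, X_negative) down, and their β-weighted combination. The binary-classification shape of Loss2 and all vectors are assumptions standing in for BiLSTM(Encoder(·)).

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def loss1_cross_entropy(probs, true_idx):
    # Cross-entropy over the Softmax classification result (Loss1).
    return float(-np.log(probs[true_idx]))

def loss2_contrastive(x, x_pos, x_neg):
    # Push S(X, X_positive) towards 1 and S(X, X_negative) towards 0,
    # binary-classification style (one assumed shape of Loss2).
    s_pos = (cosine_sim(x, x_pos) + 1) / 2  # map cosine from [-1, 1] to [0, 1]
    s_neg = (cosine_sim(x, x_neg) + 1) / 2
    return -np.log(s_pos) - np.log(1 - s_neg)

def loss_final(probs, true_idx, x, x_pos, x_neg, beta=0.5):
    # Weighted combination: Loss_final = beta * Loss1 + (1 - beta) * Loss2.
    return beta * loss1_cross_entropy(probs, true_idx) \
        + (1 - beta) * loss2_contrastive(x, x_pos, x_neg)

x   = np.array([1.0, 0.0])      # toy stand-in for BiLSTM(Encoder(X))
x_p = np.array([0.9, 0.1])      # ... for BiLSTM(Encoder(X_positive))
x_n = np.array([-1.0, 0.2])     # ... for BiLSTM(Encoder(X_negative))
probs = np.array([0.7, 0.2, 0.1])
total = loss_final(probs, 0, x, x_p, x_n)
```

Swapping the positive and negative arguments raises Loss2, which is the discrimination the embodiment aims to strengthen.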
  • the multimedia subcategory tag may be obtained by:
  • extract key frames from the multimedia, perform target detection on the key frames, analyze them according to the preset hierarchical categories, and obtain labels of the corresponding hierarchical categories.
  • the analysis of its content generally involves extracting key frames from the video, performing target detection on the key frames, and combining general image recognition and other analysis methods to achieve labeling of key content in the key frames.
  • multimedia can be analyzed according to tags of fixed hierarchical categories that have been set, wherein the fixed hierarchical categories include: pedestrian structure-age group-elderly, pedestrian structure-top style-short sleeve, and vehicle structure-general vehicle type-passenger bus, etc.
  • multiple targets can be identified from the same key frame, and different targets can have multiple structural information.
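The labeling pipeline above can be sketched as a toy: select key frames by inter-frame difference, then attach fixed hierarchical-category labels per detected target. A real system would use a video decoder and a trained object detector; the frame array, the difference threshold, and the `detect_labels` stub here are all hypothetical.

```python
import numpy as np

def select_key_frames(frames, threshold=10.0):
    """Pick frames whose mean absolute difference from the previous
    selected key frame exceeds the threshold."""
    keys = [0]
    for i in range(1, len(frames)):
        if np.abs(frames[i] - frames[keys[-1]]).mean() > threshold:
            keys.append(i)
    return keys

def detect_labels(frame):
    # Stand-in for target detection + general image recognition; returns
    # fixed hierarchical labels ("category-attribute-entity") per target.
    return ["pedestrian structured-age group-elderly",
            "vehicle structured-general vehicle type-passenger bus"]

# Six synthetic 4x4 frames with two abrupt scene changes.
frames = np.stack([np.full((4, 4), v, dtype=float) for v in [0, 1, 2, 50, 51, 120]])
keys = select_key_frames(frames, threshold=10.0)
index = {i: detect_labels(frames[i]) for i in keys}   # labels per key frame
```

This also reflects the note that one key frame can yield multiple targets, each with its own structural information.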
  • during inference and retrieval, the semantic parsing module needs to run in real time to parse the user's search text into the corresponding semantic sub-attributes.
  • the ratio encoding is input into the semantic matching model to predict the corresponding label of the secondary subclass of the retrieved text (i.e., the classification result obtained by Softmax).
  • the matching method still uses the S function.
  • preferably, the subclass whose similarity is greater than the threshold 0.5 and is the highest is selected as the matching class, so as to locate the video resource.
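A sketch of this final matching step: among the precomputed subclass encodings, choose the most similar one subject to the 0.5 threshold, using cosine similarity as the S function. The two-dimensional vectors below are toy stand-ins for the stored BiLSTM(Encoder(X_subclass)) features.

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_subclass(query_vec, subclass_vecs, threshold=0.5):
    """Return the most similar subclass above the threshold, else None."""
    best_name, best_sim = None, threshold
    for name, vec in subclass_vecs.items():
        s = cosine_sim(query_vec, vec)
        if s > best_sim:
            best_name, best_sim = name, s
    return best_name

# Hypothetical stored encodings for two third-level video subclasses.
subclasses = {
    "pedestrian structured-whether to carry umbrella-yes": np.array([0.9, 0.1]),
    "pedestrian structured-whether to carry umbrella-no":  np.array([-0.9, 0.4]),
}
query = np.array([1.0, 0.0])     # stand-in encoding of one sub-attribute label
best = match_subclass(query, subclasses)
# best -> "pedestrian structured-whether to carry umbrella-yes"
```

Returning `None` when nothing clears the threshold models the case where a sub-attribute matches no video subclass at all.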
  • a user search question may have multiple parent attributes of the same level, and a parent attribute may have multiple child attributes.
  • a search question may be matched with multiple video subcategories of the same level.
  • FIG7 shows a schematic diagram of the structure of a multimedia search device provided in an embodiment of the present application.
  • the device 700 includes:
  • the semantic analysis module 710 is configured to receive a search text, perform semantic analysis on the search text, and obtain at least one sub-attribute and a correlation between the sub-attributes;
  • the label acquisition module 720 is configured to process each sub-attribute and related relationship input through a pre-trained semantic matching model to obtain a label for each sub-attribute;
  • the search matching module 730 is configured to determine the similarity value between the label of each sub-attribute and the pre-obtained multimedia sub-category label, and determine the multimedia sub-category matching the search text according to each similarity value.
  • in the video retrieval module shown in FIG8, in specific operation it is first necessary to analyze the multimedia (taking video as an example) through the video analysis unit 801 to obtain the label classification information of the multimedia; then analyze the search text through the semantic analysis unit 802 to obtain the corresponding sub-attributes and related relationships; finally, match the search text and multimedia resources through the semantic matching and retrieval unit 803 to obtain the optimal matching result, thereby locating the multimedia resources desired by the user according to the search text.
  • the semantic analysis module 710 is further configured to analyze the search text to obtain the word segments and the part of speech of the word segments of the search text;
  • the search text is subjected to dependency syntactic analysis based on the word segments and their parts of speech, to obtain the dependency relationships between the word segments;
  • based on the word segments, their parts of speech and the dependency relationships, the sub-attributes of the search text and the correlations between the sub-attributes are determined.
  • the semantic analysis module 710 is configured to encode the word segmentation and the part of speech of the word segmentation to obtain the word segmentation digital code, input the word segmentation digital code into the encoder for conversion to obtain the word segmentation vector; perform syntactic analysis on the retrieved text to obtain a syntactic feature vector; concatenate and fuse the word segmentation vector and the syntactic feature vector to obtain a first latent vector; process the first latent vector through MLP to obtain a second latent vector; construct a dependency matrix based on the retrieved text and the dependency relationship, determine the product value of each position in the dependency matrix based on the dependency matrix, the dependency initialization vector and the second latent vector, and determine the correlation between the sub-attributes based on the product value and the first loss function.
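The dependency-matrix scoring inside this module (q_i = Wq·G(H'(x_i)) + dq, k_j = Wk·G(H'(x_j)), S(i, j) = q_i·k_j, followed by Softmax over the six relation types H/E/L/C/O/N described in the original text) can be sketched numerically as follows. All dimensions and weight initializations are toy assumptions; the full matrix is scored here for simplicity, whereas the described method scores the upper triangle.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ, DIM = 5, 8
RELATIONS = ["H", "E", "L", "C", "O", "N"]   # head/end/link/subordinate/parallel/none

G = rng.normal(size=(SEQ, DIM))              # second latent vectors G(H'(x))

# One (Wq, dq, Wk) set per relation type (toy initialization).
Wq = rng.normal(size=(len(RELATIONS), DIM, DIM))
dq = rng.normal(size=(len(RELATIONS), DIM))
Wk = rng.normal(size=(len(RELATIONS), DIM, DIM))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# S_r(i, j) = q_i . k_j, computed for each relation type r.
scores = np.zeros((SEQ, SEQ, len(RELATIONS)))
for r in range(len(RELATIONS)):
    q = G @ Wq[r].T + dq[r]                  # (SEQ, DIM) query projections
    k = G @ Wk[r].T                          # (SEQ, DIM) key projections
    scores[:, :, r] = q @ k.T

probs = softmax(scores)                      # relation distribution per (i, j)
best = probs.argmax(axis=-1)                 # highest-scoring relation per pair
```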
  • the first loss function includes a first cross entropy loss function and a KL divergence loss function
  • the first cross entropy loss function takes the normalized exponential function and the actual relationship value as independent variables
  • the KL divergence loss function takes the average value of the first latent vector and the average value of the target sub-attribute latent vector as independent variables.
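Formula (1) itself is not reproduced in this text, so the sketch below only illustrates the stated structure of the first loss: a cross-entropy term over the relation Softmax, plus an α-weighted KL-divergence constraint between the average of the first latent vectors and the average of the predicted sub-attribute latent vectors (so that covering/stacking sub-attributes does not change the original semantics). Turning the averaged vectors into distributions via softmax, and the α value, are assumptions made purely so a KL divergence is well defined.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL divergence between two discrete distributions.
    return float(np.sum(p * np.log(p / q)))

def first_loss(relation_probs, true_relation, h_first, h_subattrs, alpha=0.3):
    # Cross-entropy over the predicted relation distribution.
    ce = float(-np.log(relation_probs[true_relation]))
    # Constraint term: the averaged sub-attribute latents should stay close
    # to the averaged sentence latents (softmax only to get distributions).
    p = softmax(h_first.mean(axis=0))
    q = softmax(h_subattrs.mean(axis=0))
    return ce + alpha * kl(p, q)

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))       # toy first latent vectors H'(X)
H_sub = H[:4]                     # latents of the predicted sub-attributes
loss = first_loss(np.array([0.6, 0.3, 0.1]), 0, H, H_sub)
```

When the sub-attribute latents coincide with the sentence latents, the KL term vanishes and only the cross-entropy remains.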
  • the label acquisition module 720 is configured to extract sub-attributes from the retrieval text sample according to the semantic analysis results, and generate positive sample vectors, retrieval text vectors and negative sample vectors according to the sub-attribute conversion; the positive sample vectors, retrieval text vectors and negative sample vectors are processed respectively by a cascaded encoder and a BiLSTM model to obtain a positive ratio encoding vector, a text ratio encoding vector and a negative ratio encoding vector; wherein the encoder is used to perform word segmentation and feature encoding on the positive sample vector, the retrieval text vector and the negative sample vector respectively; the positive ratio encoding vector, the text ratio encoding vector and the negative ratio encoding vector are processed by a normalized exponential function and a second loss function to obtain labels of sub-attributes in the retrieval text sample; the parameters of the semantic matching model are adjusted based on the labels of the sub-attributes and the manual labels to obtain a pre-trained semantic matching model.
  • the second loss function includes a second cross entropy loss function and a binary classification loss function
  • the second cross entropy loss function takes the normalized exponential function and the actual label value as independent variables
  • the binary classification loss function takes the similarity between the positive ratio encoding vector, the text ratio encoding vector and the negative ratio encoding vector as independent variables.
  • the search and matching module 730 is configured to extract key frames in multimedia, perform target detection on the key frames, analyze them according to preset hierarchical categories, and obtain labels of corresponding hierarchical categories.
  • the embodiment of the present application first proposes a fine-grained hierarchical semantic attribute extraction method. Compared with traditional semantic analysis methods, it can extract sub-attributes, where a sub-attribute is not just a single word but a fragment sequence, giving better recognition ability for new words and words carrying states (such as negative sentences: "not wearing glasses"). In addition, this method can identify the association relationships between sub-attributes, such as parallel relationships and subordinate relationships, and can identify the key parent attributes among the sub-attributes, so that the different retrieval requirements of multiple targets can be identified from one retrieval text.
  • the embodiment of the present application proposes a semantic encoding method of multi-level comparison and matching, which can realize category classification and recognition under secondary classification, and can better complete the semantic encoding of multi-level classification labels through training of positive and negative sample comparison, so as to more accurately realize the matching of semantic sub-attributes and video labels.
  • the embodiments of the present application overcome the confusion between search terms that may be caused by keyword extraction in existing solutions (for example, if one wants to search for "a man wearing a hat and a woman drinking", it is easy to search for the woman wearing a hat and the man drinking through the existing solutions).
  • the embodiments of the present application can realize the retrieval of multiple targets in one search question, thereby overcoming the defects of the existing solutions and improving the matching accuracy.
  • An embodiment of the present application provides a non-volatile computer storage medium, wherein the computer storage medium stores at least one executable instruction, and the computer executable instruction can execute the multimedia retrieval method in any of the above method embodiments.
  • FIG9 shows a schematic diagram of the structure of a computing device provided in an embodiment of the present application.
  • the specific embodiment of the present application does not limit the specific implementation of the computing device.
  • the computing device may include: a processor (Processor) 902, a communication interface (Communications Interface) 904, a memory (Memory) 906, and a communication bus 908.
  • the processor 902, the communication interface 904, and the memory 906 communicate with each other via a communication bus 908.
  • the communication interface 904 is configured to communicate with other devices such as a client or other server network elements.
  • the processor 902 is configured to implement the multimedia retrieval method provided in any of the previous embodiments when executing the program 910.
  • the program 910 may include program codes, which include computer operation instructions.
  • the processor 902 may be a central processing unit (CPU), or an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
  • the one or more processors included in the computing device may be processors of the same type, such as one or more CPUs; or may be processors of different types, such as one or more CPUs and one or more ASICs.
  • the memory 906 is configured to store a program 910 .
  • Memory 906 may include high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk storage.
  • An embodiment of the present application further provides a computer-readable storage medium, wherein the storage medium stores at least one executable instruction, and when the executable instruction is executed by a processor of an electronic device, it can implement any of the multimedia retrieval methods described above.
  • An embodiment of the present application further provides a computer program, which includes a computer-readable code.
  • the processor of the electronic device executes the computer program to implement any of the multimedia retrieval methods described above.
  • An embodiment of the present application also provides a computer program product, which includes a computer-readable code, or a non-volatile computer-readable storage medium that carries the computer-readable code.
  • modules in the devices in the embodiments may be adaptively changed and arranged in one or more devices different from the embodiments.
  • the modules or units or components in the embodiments may be combined into one module or unit or component, and in addition they may be divided into a plurality of submodules or subunits or subcomponents. Except that at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstracts and drawings) and all processes or units of any method or device disclosed in this manner may be combined in any combination. Unless otherwise expressly stated, each feature disclosed in this specification (including the accompanying claims, abstracts and drawings) may be replaced by an alternative feature providing the same, equivalent or similar purpose.
  • the various component embodiments of the present application can be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It should be understood by those skilled in the art that a microprocessor or digital signal processor (DSP) can be used in practice to implement some or all functions of some or all components according to the embodiments of the present application.
  • the application can also be implemented as a device or apparatus program (e.g., computer program and computer program product) for executing a part or all of the methods described herein.
  • Such a program implementing the present application can be stored on a computer-readable medium, or can have the form of one or more signals. Such a signal can be downloaded from an Internet website, or provided on a carrier signal, or provided in any other form.
  • the present application discloses a multimedia retrieval method, apparatus, device, medium and program product; wherein the method comprises: receiving a retrieval text, performing semantic analysis on the retrieval text, obtaining at least one sub-attribute and a correlation between sub-attributes; processing each of the sub-attributes and the correlation through a pre-trained semantic matching model, obtaining a label for each sub-attribute; determining a similarity value between the label of each sub-attribute and a pre-obtained multimedia sub-class label, and determining a multimedia sub-class matching the retrieval text according to each similarity value.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本申请公开了一种多媒体检索方法、装置、设备、介质及程序产品;其中方法包括:接收检索文本,对检索文本进行语义分析,得到至少一个子属性以及子属性之间的相关关系;通过预训练的语义匹配模型对各个所述子属性和所述相关关系进行处理,得到各个子属性的标签;确定各个子属性的标签与预得到的多媒体子类标签之间的相似度值,根据各个相似度值确定与所述检索文本匹配的多媒体子类。本申请能够确定检索文本语义标签与多媒体子类标签的相似度值,能够更好的挖掘检索文本语义与多媒体标签之间的关系,具有更高的识别度以及准确率。

Description

多媒体检索方法、装置、设备、介质及程序产品
相关申请的交叉引用
本申请要求2022年11月16日提交的中国专利申请号为202211434753.4、申请人为中移(苏州)软件技术有限公司,中国移动通信集团有限公司,申请名称为“多媒体检索方法、装置、计算设备和存储介质”的优先权,该申请的全文以引用的方式并入本申请中。
技术领域
本申请涉及人工智能技术领域,具体涉及一种多媒体检索方法、装置、设备、介质及程序产品。
背景技术
随着第五代移动通信技术(5th Generation Mobile Communication Technology,5G)的普及化以及网络带宽成本的降低,视频数据逐渐成为社交以及安防等领域的主流数据之一。随着视频数量的井喷式增长、多媒体视频内容的多样性和数据结构的复杂性的加剧,如何快速有效地从视频库中检索出用户想要的视频已经成为难题。
目前的检索方案是基于用户关键词进行检索。然而,当用户关键词中的检索文本(即用户检索时用的问句等用语)中出现多个不同实体对象时,上述的检索方案无法区分属性与实体对象之间的关系,从而导致搜索语义之间的混乱;并且,由于关键词之间不具备关联关系和顺序关系,从而可能会导致错误识别或者遗漏识别等问题。
发明内容
鉴于上述问题,本申请提供了一种多媒体检索方法、装置、设备、介质及程序产品。
本申请实施例提供的技术方案是这样的:
本申请实施例首先提供了一种多媒体检索方法,所述方法包括:
接收检索文本,对所述检索文本进行语义分析,得到至少一个子属性以及所述子属性之间的相关关系;
通过预训练的语义匹配模型对各个所述子属性和所述相关关系进行处理,得到各个所述子属性的标签;
确定各个所述子属性的标签与预得到的多媒体子类标签之间的相似度值,根据各个所述相似度值确定与所述检索文本匹配的多媒体子类。
在一些实施例中,所述接收检索文本,对所述检索文本进行语义分析,得到至少一个子属性以及所述子属性之间的相关关系,包括:
分析所述检索文本,得到所述检索文本的分词以及所述分词的词性;
结合所述分词以及所述分词的词性对所述检索文本进行依存句法分析,得到所述分词之间的依存关系;
根据所述分词、所述分词的词性以及所述依存关系,确定所述检索文本的子属性以及所述子属性之间的相关关系。
在一些实施例中,所述根据所述分词、所述分词的词性以及所述依存关系,确定所述检索文本的子属性以及所述子属性之间的相关关系,包括:
对所述分词以及所述分词的词性进行编码,得到分词数字编码,将所述分词数字编码输入到编码器中进行转换得到分词向量;将所述检索文本进行句法分析得到句法特征 向量;
将所述分词向量和所述句法特征向量进行拼接融合,得到第一隐向量;
通过多层感知器处理所述第一隐向量,得到第二隐向量;
根据所述检索文本和所述依存关系构建依存关系矩阵,根据所述依存关系矩阵、依存关系初始化向量和所述第二隐向量,确定所述依存关系矩阵中各位置的乘积值,并根据所述乘积值和第一损失函数确定所述子属性之间的相关关系。
在一些实施例中,所述第一损失函数包括第一交叉熵损失函数和KL散度损失函数;
其中,所述第一交叉熵损失函数以归一化指数函数和实际关系值为自变量,所述KL散度损失函数以所述第一隐向量的平均值和目标子属性隐向量的平均值为自变量。
在一些实施例中,所述方法还包括:
从检索文本样本中根据语义分析结果提取出子属性,根据所述子属性转换生成正样本向量、检索文本向量以及负样本向量;
通过级联设置的编码器以及双向长短记忆网络模型分别对所述正样本向量、所述检索文本向量以及所述负样本向量进行处理,对应得到正向比编码向量、文本比编码向量和负向比编码向量;其中,所述编码器用于分别对所述正样本向量、所述检索文本向量以及所述负样本向量进行分词和特征编码;
将所述正向比编码向量、所述文本比编码向量以及所述负向比编码向量经过归一化指数函数和第二损失函数处理后,得到所述检索文本样本中子属性的标签;
基于所述子属性的标签与人工标签调整语义匹配模型的参数,得到预训练的所述语义匹配模型。
在一些实施例中,所述第二损失函数包括第二交叉熵损失函数和二分类损失函数;
其中,所述第二交叉熵损失函数以所述归一化指数函数和实际标签值为自变量;所述二分类损失函数以所述正向比编码向量、所述文本比编码向量和所述负向比编码向量之间的相似度为自变量。
在一些实施例中,所述多媒体子类标签通过以下步骤得到:
抽取多媒体中的关键帧,并对所述关键帧进行目标检测,按照预设的层级类别进行分析,得到相应层级类别的标签。
本申请实施例还提供了一种多媒体检索装置,所述装置包括:
语义分析模块,被配置为接收检索文本,对所述检索文本进行语义分析,得到至少一个子属性以及所述子属性之间的相关关系;
标签获取模块,被配置为通过预训练的语义匹配模型对各个所述子属性和所述相关关系进行处理,得到各个所述子属性的标签;
检索匹配模块,被配置为确定各个所述子属性的标签与预得到的多媒体子类标签之间的相似度值,根据各个所述相似度值确定与所述检索文本匹配的多媒体子类。
在一些实施例中,所述语义分析模块,被配置为分析所述检索文本,得到所述检索文本的分词以及所述分词的词性;结合所述分词以及所述分词的词性对所述检索文本进行依存句法分析,得到所述分词之间的依存关系;根据所述分词、所述分词的词性以及所述依存关系,确定所述检索文本的子属性以及所述子属性之间的相关关系。
在一些实施例中,所述语义分析模块,被配置为对所述分词以及所述分词的词性进行编码,得到分词数字编码,将所述分词数字编码输入到编码器中进行转换得到分词向量;将所述检索文本进行句法分析得到句法特征向量;将所述分词向量和所述句法特征向量进行拼接融合,得到第一隐向量;通过多层感知器处理所述第一隐向量,得到第二隐向量;根据所述检索文本和所述依存关系构建依存关系矩阵,根据所述依存关系矩阵、依存关系初始化向量和所述第二隐向量,确定所述依存关系矩阵中各位置的乘积值,并根据所述乘积值和第一损失函数确定所述子属性之间的相关关系。
在一些实施例中,所述第一损失函数包括第一交叉熵损失函数和KL散度损失函数;
其中,所述第一交叉熵损失函数以归一化指数函数和实际关系值为自变量,所述KL散度损失函数以所述第一隐向量的平均值和目标子属性隐向量的平均值为自变量。
在一些实施例中,所述标签获取模块,被配置为从检索文本样本中根据语义分析结果提取出子属性,根据所述子属性转换生成正样本向量、检索文本向量以及负样本向量;通过级联设置的编码器以及双向长短记忆网络模型分别对所述正样本向量、所述检索文本向量以及所述负样本向量进行处理,对应得到正向比编码向量、文本比编码向量和负向比编码向量;将所述正向比编码向量、所述文本比编码向量以及所述负向比编码向量经过归一化指数函数和第二损失函数处理后,得到所述检索文本样本中子属性的标签;将所述子属性的标签与人工标签进行对比,得到预训练的所述语义匹配模型;其中,所述编码器用于分别对所述正样本向量、所述检索文本向量以及所述负样本向量进行分词和特征编码。
在一些实施例中,所述第二损失函数包括第二交叉熵损失函数和二分类损失函数;
其中,所述第二交叉熵损失函数以所述归一化指数函数和实际标签值为自变量;所述二分类损失函数以所述正向比编码向量、所述文本比编码向量和所述负向比编码向量之间的相似度为自变量。
在一些实施例中,所述检索匹配模块,被配置为抽取多媒体中的关键帧,并对所述关键帧进行目标检测,按照预设的层级类别进行分析,得到相应层级类别的标签。
本申请实施例还提供了一种计算设备,包括:处理器、存储器、通信接口以及通信总线,所述处理器、所述存储器以及所述通信接口通过所述通信总线完成相互间的通信;
所述存储器,被配置为存放至少一可执行指令,所述可执行指令被所述处理器执行时,能够实现如前任一实施例所述的多媒体检索方法。
本申请实施例还提供了一种计算机存储介质,所述存储介质中存储有至少一可执行指令,所述可执行指令被电子设备的处理器执行时,能够实现如前任一所述的多媒体检索方法。
本申请实施例还提供了一种计算机程序,所述计算机程序包括计算机可读代码,在所述计算机可读代码在电子设备中运行的情况下,所述电子设备的处理器执行时用于实现如前任一所述的多媒体检索方法。
本申请实施例还提供了一种计算机程序产品,所述计算机程序产品包括计算机可读代码,或者承载所述计算机可读代码的非易失性计算机可读存储介质,在所述计算机可读代码在电子设备的处理器中运行时,所述电子设备中的处理器执行时实现如前任一项所述的多媒体检索方法。
根据本申请实施例提供的多媒体检索方案,通过对检索文本进行细粒度层级语义分析以及检索文本语义标签与多媒体子类标签之间的相似度值,解决了相关技术中的关键词抽取方式可能导致搜索语义之间的混乱问题,从而能够更好地挖掘检索文本语义与多媒体子类标签之间的关系,利用一个检索文本实现多个目标的检索,具有更高的识别度以及准确率。
上述说明仅是本申请技术方案的概述,为了能够更清楚了解本申请的技术手段,而可依照说明书的内容予以实施,并且为了让本申请的上述和其它目的、特征和优点能够更明显易懂,以下特举本申请的具体实施方式。
附图说明
通过阅读下文优选实施方式的详细描述,各种其他的优点和益处对于本领域普通技术人员将变得清楚明了。附图仅用于示出优选实施方式的目的,而并不认为是对本申请 的限制。而且在整个附图中,用相同的参考符号表示相同的部件。在附图中:
图1示出了本申请实施例提供的多媒体检索方法流程图;
图2示出了本申请实施例提供的多媒体检索场景的示意图;
图3示出了本申请实施例提供的检索文本语义分析的流程图;
图4示出了本申请实施例提供的检索文本细粒度层级语义属性抽取的流程图;
图5示出了本申请实施例提供的多主体、多子属性以及覆盖属性统一识别的流程示意图;
图6示出了本申请实施例提供的检索文本语义匹配的流程图;
图7示出了本申请实施例提供的多媒体检索装置的结构示意图;
图8示出了本申请实施例提供的视频检索模块的结构示意图;
图9示出了本申请实施例提供的计算设备的结构示意图。
具体实施方式
下面将参照附图更详细地描述本申请的示例性实施例。虽然附图中显示了本申请的示例性实施例,然而应当理解,可以以各种形式实现本申请而不应被这里阐述的实施例所限制。相反,提供这些实施例是为了能够更透彻地理解本申请,并且能够将本申请的范围完整的传达给本领域的技术人员。
图1示出了本申请实施例提供的多媒体检索方法的流程图,该方法应用于计算设备中。所述计算设备包括各种计算机、智能终端以及平板电脑等。如图1所示,该方法包括以下步骤:
步骤110:接收检索文本,对检索文本进行语义分析,得到至少一个子属性以及子属性之间的相关关系。
结合图2所示的多媒体检索场景示意图,本申请实施例可以用于通过接收用户输入的检索问句等用语检索相应的多媒体资源,其中,上述多媒体资源包括视频、声音、图像集合以及动画等。
根据图2所示的多媒体检索场景,用户在搜索框201中输入想要搜索的用语、问句等得到相应的检索文本,通过语义分析方法,得到子属性。其中,子属性可以不仅仅是一个单独的词,而是一个片段序列,相较于单独的分词、带有状态的词(比如否定句式:“不戴眼镜”)具有更好的识别能力;还可以识别子属性之间的相关关系,比如并列关系,从属关系,还可以从子属性中识别出关键的父属性,从而可以从检索文本的一句话中识别出多个目标的不同检索需求;示例性地,在图2中,可以在搜索框201中输入“找戴眼镜穿黑色衣服的人”的检索文本。
步骤120:通过预训练的语义匹配模型对各个子属性和相关关系进行处理,得到各个子属性的标签。
示例性地,为了根据检索文本检索到相应的多媒体资源,需要对检索文本得到的子属性信息与多媒体资源的标签进行匹配。该步骤通过预训练的语义匹配模型,得到匹配度最高的子属性标签,可以提高匹配的准确度。
步骤130:确定各个子属性的标签与预得到的多媒体子类标签之间的相似度值,根据各个相似度值确定与检索文本匹配的多媒体子类。
如图2所示,视频等多媒体数据可以预先通过分析提取相应标签后保存到数据库202中。
示例性地,多媒体子类可以表示多媒体的层级结构,还可以代表多媒体的任一个类别,在检索时根据检索文本中的标签和多媒体相应类别的多媒体子标签进行匹配,然后根据匹配结果确定或定位到相应的多媒体。
需要指出的是,步骤120和步骤130可以合并执行,比如可以通过语义匹配模型与相似度值分析集合即可得到与用户检索匹配的多媒体资源。
示例性地,通过对相似度值进行排序,并将相似度至大于预设阈值的多媒体可确定为满足用户需求的多媒体。
综上所述,通过对检索文本进行细粒度层级语义分析以及检索文本语义标签与多媒体子类标签之间的相似度值,解决了相关技术中的关键词抽取方式可能导致搜索语义之间的混乱问题,从而能够更好地挖掘检索文本语义与多媒体子类标签之间的关系,利用一个检索文本实现多个目标的检索,具有更高的识别度以及准确率。
在一些实施例中,接收检索文本,对检索文本进行语义分析,得到至少一个子属性以及所述子属性之间的相关关系,可以通过以下方式实现:
分析检索文本,得到检索文本的分词以及分词的词性;结合分词以及分词的词性对检索文本进行依存句法分析,得到分词之间的依存关系;根据分词、分词的词性以及依存关系,确定检索文本的子属性以及子属性之间的相关关系。
示例性地,为了使得检索文本的表达更为准确,还可以对检索文本进行预处理;示例性地,上述预处理可以包括:大小写转换、繁简转换以及错字纠错等。
通过上述处理,能够实现对检索文本的精细划分,从而能够提高依存关系的精准度,进而提高子属性与子属性之间的相关关系的精准度。
在一些实施例中,根据分词、分词的词性以及依存关系,确定检索文本的子属性以及子属性之间的相关关系,可以通过以下方式实现:
对分词以及分词的词性进行编码,得到分词数字编码,将分词数字编码输入到编码器中进行转换得到分词向量;将检索文本进行句法分析得到句法特征向量;将分词向量和句法特征向量进行拼接融合,得到第一隐向量;通过多层感知器(Multilayer Perceptron,MLP)处理第一隐向量,得到第二隐向量;根据检索文本和依存关系构建依存关系矩阵,根据依存关系矩阵、依存关系初始化向量和所述第二隐向量,确定依存关系矩阵中各位置的乘积值,并根据乘积值和第一损失函数确定子属性之间的相关关系。
在一个具体的示例中,结合图3所示的检索文本语义分析的流程图,对上述对检索文本的语义分析过程作出进一步的说明。该基于依存关系的核心子属性抽取的整体流程包括如下步骤:
步骤301、文本预处理。
示例性地,文本预处理包括对检索文本进行预处理,上述预处理可以执行的操作包括:大小写转换、繁简转换以及错字纠错等。
步骤302、文本词法分析。
示例性地,文本词法分析可以包括对检索文本进行分词以及词性分析,比如对“一个戴眼镜的男子”执行文本词法分析,可以将其切分为“一个/m戴/v眼镜/n的/u男子/n”,其中,m、v以及n可以分别为“一个”、“眼镜”以及“男子”的词性。
示例性地,可以采用ansj、CTB以及PKU的词性分类标准执行文本词法分析。
示例性地,也可以采用其他的分词与词性标注标准执行文本词法分析。
步骤303、文本句法分析。
示例性地,可以结合分词结果以及分词的词性对检索文本进行依存句法分析,通过依存句法分析,实现各个词与词之间关联属性的分析。
示例性地,本申请实施例中的依存关系属性包括但不限于:ATT(定语)、DE(的)、SBV(主语)、VOB(宾语)、COO(一般并列)、COS(共享并列)、ADV(状语)以及HED(核心)等;比如“一个戴眼镜的男子”可以通过分词、词性分析以及依存句法分析解析成表1所示的数据列表。
示例性地,本示例中的依存关系可以采用PKU Multi-view Chinese Treebank等现有标 准。
表1
步骤304、细粒度层级语义属性抽取。
示例性地,可以提供细粒度的属性抽取,实现属性与核心实体之间的依赖关系识别,其具体的过程可以如图4所示的检索文本细粒度层级语义属性抽取的流程。
与现有的依存关系抽取模型的属性抽取方法相比,本申请实施例采用的方法具备更高的泛化能力以及准确率。整体方案步骤如下:
步骤401、词法分析输出。
示例性地,可以将词法分析的结果结合词典转换为数字编码X(x1,x2,x3,...,xn),并输入至编码器。
步骤402、通过编码器将数字编码转换为特征向量。
示例性地,编码器可以包括word2vec、BERT以及RoBERTa等。
示例性地,编码器可以通过H(x)(h1,h2,h3,...,hn)=Encoder(X)将数字编码转换为向量特征;其中,h1,h2,h3,...,hn可以为n个特征向量,n为大于3的整数。
步骤403、句法分析输出。
示例性地,句法分析输出的句法特征可以为D(x)(d1,d2,d3,...,dn);其中,d1,d2,d3,...,dn可以为n个句法特征。
步骤404、特征融合。
示例性地，可以将句法特征与特征向量进行拼接结合，得到第一隐向量；示例性地，第一隐向量可以为矩阵H'(x)=(h1⊕d1, h2⊕d2, ..., hn⊕dn)。
步骤405、得到第二隐向量。
示例性地,可以将第一隐向量对应的矩阵经过MLP处理,得到第二隐向量G(H’(x)),其中G为MLP层的映射函数表示。
步骤406、基于依赖关系的解码。
结合图5所示的多主体、多子属性以及覆盖属性统一识别的流程示意图,首先针对检索文本,构建相应的依存关系矩阵501,设置不同的位置之间,存在H(表示头部)、E(表示尾部)、L(表示连接关系)、C(表示从属关系)、O(表示并列关系)、N(没有关系)六种关系;示例性地,可以将上述各种关系的取值分别初始化为Wq、dq、Wk以及dk,用于对矩阵中位于i、j位置之间的关系进行打分,其中qi=WqG(H'(xi))+dq,kj=WkG(H'(xj)),最终得分为S(i,j)=qikj;其中,i、j、k、q分别为大于或等于1的整数。
示例性地,通过上述方式实现对上三角矩阵完成打分。最终通过Softmax函数确定最终得分最高的关系。
参见图5所示的,结合序列标注以及依存句法分析,可以实现细粒度属性识别,并且可以识别属性之间的相关关系。比如在一个搜索问句中可以实现对多个实体(男人、孩子)的不同对应属性的识别,进而可以根据一个检索文本实现对多个目标的精确检索。
示例性地,检索文本的解码方式可以为:先找到H属性字符作为第一个字符,然后在同一行找L属性字符,然后在L属性字符所在列找下一个L或者E属性字符,以E属性字符作为片段结尾。另外E字符属性(代表的属性片段)与E字符属性(代表的属性片段)之间也存在从属关系(C)以及并列关系(O)。
通过以上处理,能够简化分词向量以及句法特征向量的表达形式;并且,通过对分词向量和句法特征向量进行拼接融合得到第一隐向量,能够提高第一隐向量中信息的丰富程度;与此同时,通过MLP处理第一隐向量得到第二隐向量,能够实现对第一隐向量中包含特征的分类和回归处理;在此基础上,能够提高乘积值和相关关系的精准度。
在一些实施例中,所述第一损失函数包括第一交叉熵损失函数和KL散度损失函数;
其中,第一交叉熵损失函数以归一化指数函数和实际关系值为自变量,KL散度损失函数以第一隐向量的平均值和目标子属性隐向量的平均值为自变量。
示例性地,为了使得上述模型得到收敛的结果,可以采用softmax函数对神经网络的输出结果进行一次换算,将输出结果用概率的形式表现出来,然后利用第一损失函数计算输出结果与真实分类之间的差距,从而通过迭代等方式优化模型参数。
示例性地，可以通过loss=CrossEntropyLoss(Softmax(QK),Y)获得子属性的相关关系的结果。另外，由于子属性的覆盖叠加不能改变原始语义，因此，需要对子属性的识别效果进行约束，loss子属性约束可以为：KL(Avg(H'(X)),Avg(H'(X子属性1,X子属性2,...,X子属性N)))；其中，KL表示KL距离，Avg表示向量求平均，由于子属性的输入一开始是未知的，因此采用的是解码后的预测子属性的序列。
示例性地,最终的第一损失计算函数的模型Loss可以如式(1)所示:
其中,α为权重因子,其取值为小于1的正数。
总之,通过语义分析模块,可以得到用户问句的子属性列表,以及子属性之间的相关关系。
通过上述操作,能够提高相关关系的精准度和灵活度。
在一些实施例中,本申请实施例提供的多媒体检索方法还包括:
从检索文本样本中根据语义分析结果提取出子属性,根据子属性转换生成正样本向量、检索文本向量以及负样本向量;
通过级联设置的编码器以及双向长短记忆网络模型(Bi-directional Long Short-Term Memory,BiLSTM)分别对正样本向量、检索文本向量以及负样本向量进行处理,对应得到正向比编码向量、文本比编码向量和负向比编码向量;将正向比编码向量、文本比编码向量以及负向比编码向量经过归一化指数函数和第二损失函数处理后,得到所述检索文本样本中子属性的标签;基于子属性的标签与人工标签调整语义匹配模型的参数,得到预训练的语义匹配模型。
其中,编码器用于分别对正样本向量、检索文本向量以及负样本向量进行分词和特征编码。
通过上述训练流程,能够提高预训练的语义匹配模型确定子属性的标签的精准度和稳定性。
在一些实施例中,第二损失函数包括第二交叉熵损失函数和二分类损失函数;
其中,第二交叉熵损失函数以归一化指数函数和实际标签值为自变量,二分类损失函数以正向比编码向量、文本比编码向量和负向比编码向量之间的相似度为自变量。
具体的,结合图6所示的检索文本语义匹配流程图,该流程图用于实现语义分析结果与视频分析结果的匹配与检索,其过程主要包括模型训练与解码检索两个部分。
示例性地,模型训练主要用于训练适配的解码器,通过该解码器能够实现视频分析结果解码与语义分析结果解码的对齐。在解码检索过程中,只需要将语义分析的标签(解码特征)与已经预存好的视频分析结果的标签(解码特征)进行匹配检索即可。
示例性地,视频分析特征主要包括三级特征即属性大类、属性小类以及属性实体。
若按常规方法直接解码,然后通过相似度进行匹配,则会导致由于匹配类别太多(其中一般2级分类包括几十类,3级分类包括成千上万类),从而无法充分利用筛选信息;并且,也很容易导致属性实体之间文本相似度太高且常规语义过于接近(比如:“行人结构化-是否带伞-是”与“行人结构化-是否带伞-否”),从而导致极大的匹配误差。因此,在本实施例中通过提出分类与对比学习的结合实现细分类内部的相似度优化强化,实现更优的层级匹配效果。
参见图6所示的,该实施例中语义匹配过程如下:
首先,在训练过程中,对于Query 601(也就是用户问句(检索文本)),根据语义分析模块提取出大量子属性,比如:“带伞和戴眼镜的男子”,提取出“带伞的男子”、“戴眼镜的男子”以及“男子”,3个子属性片段(如果子属性存在父结构,那么需要结合父节点作为一个片段)。
比如,“男子”对应的视频结构化三级正样本602以及三级负样本603可以分别为:“人体结构化-性别-男”以及“人体结构化-性别-女”。
比如,“带伞的男子”对应的视频结构化的三级正样本以及三级负样本可以为:“行人结构化-是否带伞-是”以及“行人结构化-是否带伞-否”。
比如,“穿长衣服的男人”的三级正样本为“行人结构化-上衣-长袖”,其三级负样本可以是“行人结构化-上衣-短袖”或者是“行人结构化-上衣-背心”等等。
其中,三级正样本以及三级负样本可以是一个或者多个,也可以通过采样得到。
示例性地，可以将正负样例以及文本的属性片段进行分词和编码得到编码信息，上述各个数据对应的编码信息可以对应记为X、X正以及X负；示例性地，将上述编码信息输入至编码器604中，可以分别得到Encoder(X)、Encoder(X正)以及Encoder(X负)；其中，编码器可以包括BERT以及RoBERTa。
示例性地，Encoder(X)经过embedding 605处理之后可以输入至BiLSTM 606，得到信息进一步融合的语义信息BiLSTM(Encoder(X))。
其中,问句的子属性、正负三级样本应该属于同一个二级子类标签,BiLSTM(Encoder(X))可以经过SoftMax 607执行如式(2)所示的处理操作:
F(X)=Softmax(BiLSTM(Encoder(X)))       (2)
上述处理过程所对应的第二损失函数可以如式(3)所示:
示例性地，问句的子属性标签应该能够通过相似度区分正负三级样本的不同，同时考虑到，正负样本属于同一个二级分类，因此可以强化相应的区分度，可定义第二损失函数如式(4)所示：
其中，S可以为相似度函数，比如cosine-sim等，上述相似度函数的计算结果越接近0表示计算因子之间越不相似，计算结果越接近1表示计算因子之间越相似；其中，上述计算因子可以包括式(4)中的BiLSTM(Encoder(X))、BiLSTM(Encoder(X正))以及BiLSTM(Encoder(X负))。
示例性地,训练模块可以将损失函数通过式(5)表示:
Loss_final=βLoss1+(1-β)Loss2          (5)
其中，β可以为小于或等于1的正数。
通过以上流程,可以提高预训练的语义匹配模型的标签确定的精准度和稳定性。
在一些实施例中,所述多媒体子类标签可以通过以下方式得到:
抽取多媒体中的关键帧,并对所述关键帧进行目标检测,按照预设的层级类别进行分析,得到相应层级类别的标签。
示例性地,以视频为例,对于其内容的分析,一般采用从视频中抽取关键帧,并对关键帧进行目标检测,并结合通用图像识别等分析手段,实现该关键帧中关键内容的标签化。
本申请的实施例中,可以根据已经设定好固定层级类别的标签对多媒体进行分析。其中的固定层级类别包括:行人结构化-年龄段-老人、行人结构化-上衣款式-短袖、以及车辆结构化-通用车型-客车等。
示例性地,从同一个关键帧中可以识别出多个目标,不同的目标可以有多个结构化信息。
通过以上方式,能够实现对多媒体的层级类别的标签的灵活精准确定。
综上,本申请实施例对多媒体检索执行的过程大致分为如下几个阶段:
(1)在推理与检索执行过程中,语义解析模块需要实时进行,将用户的检索文本解析为对应的语义子属性。
（2）通过经预训练的编码器，将解析完成的子属性进行比编码（即得到BiLSTM(Encoder(X))）。
(3)将比编码输入到语义匹配模型,预测得到检索文本二级子类相应的标签(即通过Softmax得到的分类结果)。
（4）与视频子类的所有编码结果（即预先得到的BiLSTM(Encoder(X子类1))、BiLSTM(Encoder(X子类2))、……、BiLSTM(Encoder(X子类n))）进行相似度匹配，匹配方式仍然采用S函数。优选地，选择相似度大于阈值0.5且最相似的子类作为匹配类，从而定位到视频资源。
示例性地,一个用户检索问句(文本),可以有多个同级别的父属性,一个父属性可以具备多个子属性。通过上述步骤,可以实现一个检索问句与同级多个视频子类之间的匹配。
图7示出了本申请实施例提供的多媒体检索装置的结构示意图。如图7所示,该装置700包括:
语义分析模块710,被配置为接收检索文本,对检索文本进行语义分析,得到至少一个子属性以及子属性之间的相关关系;
标签获取模块720,被配置为通过预训练的语义匹配模型对各个子属性和相关关系输入进行处理,得到各个子属性的标签;
检索匹配模块730,被配置为确定各个子属性的标签与预得到的多媒体子类标签之间的相似度值,根据各个相似度值确定与所述检索文本匹配的多媒体子类。
具体的,参见图8所示的视频检索模块的结构示意图,在具体操作时,首先需要通过视频分析单元801对多媒体(以视频为例)进行分析,得到多媒体的标签分级信息;然后通过语义分析单元802对检索文本进行分析,得到相应的子属性以及相关关系;最后,通过语义匹配与检索单元803对检索文本和多媒体资源进行匹配,得到最优的匹配结果,从而根据检索文本定位到用户想要的多媒体资源。
在一些实施例中,语义分析模块710,还被配置为分析检索文本,得到检索文本的分词以及分词的词性;
结合分词以分词的词性对检索文本进行依存句法分析,得到分词之间的依存关系;
根据所述分词、分词的词性以及依存关系,确定检索文本的子属性以及子属性之间的相关关系。
在一些实施例中,语义分析模块710,被配置为对分词以及分词的词性进行编码,得到分词数字编码,将分词数字编码输入到编码器中进行转换得到分词向量;将检索文本进行句法分析得到句法特征向量;将所述分词向量和句法特征向量进行拼接融合,得到第一隐向量;通过MLP处理第一隐向量,得到第二隐向量;根据检索文本和依存关系构建依存关系矩阵,根据依存关系矩阵、依存关系初始化向量和第二隐向量,确定依存关系矩阵中各位置的乘积值,并根据乘积值和第一损失函数确定子属性之间的相关关系。
在一些实施例中,第一损失函数包括第一交叉熵损失函数和KL散度损失函数;
其中,第一交叉熵损失函数以归一化指数函数和实际关系值为自变量,KL散度损失函数以第一隐向量的平均值和目标子属性隐向量的平均值为自变量。
在一些实施例中,标签获取模块720,被配置为从检索文本样本中根据语义分析结果提取出子属性,根据子属性转换生成正样本向量、检索文本向量以及负样本向量;通过级联设置的编码器以及BiLSTM模型分别对正样本向量、检索文本向量以及负样本向量进行处理,得到正向比编码向量、文本比编码向量和负向比编码向量;其中,编码器用于分别对正样本向量、检索文本向量以及负样本向量进行分词和特征编码;将正向比编码向量、文本比编码向量以及负向比编码向量经过归一化指数函数和第二损失函数处理后,得到检索文本样本中子属性的标签;基于子属性的标签与人工标签调整语义匹配模型的参数,得到预训练的语义匹配模型。
在一些实施例中,第二损失函数包括第二交叉熵损失函数和二分类损失函数;
其中,第二交叉熵损失函数以归一化指数函数和实际标签值为自变量,二分类损失函数以正向比编码向量、文本比编码向量和负向比编码向量之间的相似度为自变量。
在一些实施例中,检索匹配模块730,被配置为抽取多媒体中的关键帧,并对所述关键帧进行目标检测,按照预设的层级类别进行分析,得到相应层级类别的标签。
本申请公开的多媒体检索方案的关键点和有益效果包括:
本申请实施例首先提出了一种细粒度层级语义属性抽取方式,相对于传统语义分析方法来说,能够实现子属性的抽取,其中子属性不仅仅是一个单独的词,而是一个片段序列,对于新词、带有状态的词(比如否定句式:“不戴眼镜”)具有更好的识别能力。此外,该方式还能够识别子属性之间的关联关系,比如并列关系,从属关系。可以从子 属性中识别出关键的父属性,从而通过检索文本识别出多个目标的不同检索需求。
本申请实施例提出了一种多层级对比匹配的语义编码方式,该方式能够实现二级分类下的类别分类识别,并且能够通过正负样本对比的训练,更好的完成对多级分类标签的语义编码,从而能够更准确的实现语义子属性与视频标签的匹配。
本申请实施例克服了现有方案中关键词抽取可能导致的检索用语之间的混乱问题(比如要搜索“一个戴帽子的男人和一个喝酒的女人”,通过现有方案很容易搜索出戴帽子的女人,以及喝酒的男人),本申请实施例能够在一个检索问句中实现多个目标的检索,从而克服了现有方案的缺陷,且提高了匹配的精确度。
本申请实施例提供了一种非易失性计算机存储介质,所述计算机存储介质存储有至少一可执行指令,该计算机可执行指令可执行上述任意方法实施例中的多媒体检索方法。
图9示出了本申请实施例提供的计算设备的结构示意图,本申请具体实施例并不对计算设备的具体实现做限定。
如图9所示,该计算设备可以包括:处理器(Processor)902、通信接口(Communications Interface)904、存储器(Memory)906、以及通信总线908。
其中:处理器902、通信接口904、以及存储器906通过通信总线908完成相互间的通信。通信接口904,被配置为与其它设备比如客户端或其它服务器等的网元通信。处理器902,被配置为执行程序910时,实现如前任一实施例提供的多媒体检索方法。
具体地,程序910可以包括程序代码,该程序代码包括计算机操作指令。
处理器902可能是中央处理器CPU,或者是特定集成电路ASIC(Application Specific Integrated Circuit),或者是被配置成实施本申请实施例的一个或多个集成电路。计算设备包括的一个或多个处理器,可以是同一类型的处理器,如一个或多个CPU;也可以是不同类型的处理器,如一个或多个CPU以及一个或多个ASIC。
存储器906,被配置为存放程序910。
存储器906可能包含高速RAM存储器,也可能还包括非易失性存储器(non-volatile memory),例如至少一个磁盘存储器。
本申请实施例还提供了一种计算机可读存储介质,所述存储介质中存储有至少一可执行指令,所述可执行指令被电子设备的处理器执行时,能够实现如前任一所述的多媒体检索方法。
本申请实施例还提供了一种计算机程序,所述计算机程序包括计算机可读代码,在所述计算机可读代码在电子设备中运行的情况下,所述电子设备的处理器执行时用于实现如前任一所述的多媒体检索方法。
本申请实施例还提供了一种计算机程序产品,所述计算机程序产品包括计算机可读代码,或者承载所述计算机可读代码的非易失性计算机可读存储介质,在所述计算机可读代码在电子设备的处理器中运行时,所述电子设备中的处理器执行时实现如前任一所述的多媒体检索方法。
在此提供的算法或显示不与任何特定计算机、虚拟系统或者其它设备固有相关。各种通用系统也可以与基于在此的示教一起使用。根据上面的描述,构造这类系统所要求的结构是显而易见的。此外,本申请实施例也不针对任何特定编程语言。应当明白,可以利用各种编程语言实现在此描述的本申请的内容,并且上面对特定语言所做的描述是为了披露本申请的最佳实施方式。
在此处所提供的说明书中,说明了大量具体细节。然而,能够理解,本申请的实施例可以在没有这些具体细节的情况下实践。在一些实例中,并未详细示出公知的方法、结构和技术,以便不模糊对本说明书的理解。
类似地,应当理解,为了精简本申请并帮助理解各个发明方面中的一个或多个,在上面对本申请的示例性实施例的描述中,本申请实施例的各个特征有时被一起分组到单 个实施例、图、或者对其的描述中。然而,并不应将该公开的方法解释成反映如下意图:即所要求保护的本申请要求比在每个权利要求中所明确记载的特征更多的特征。更确切地说,如下面的权利要求书所反映的那样,发明方面在于少于前面公开的单个实施例的所有特征。因此,遵循具体实施方式的权利要求书由此明确地并入该具体实施方式,其中每个权利要求本身都作为本申请的单独实施例。
本领域那些技术人员可以理解,可以对实施例中的设备中的模块进行自适应性地改变并且把它们设置在与该实施例不同的一个或多个设备中。可以把实施例中的模块或单元或组件组合成一个模块或单元或组件,以及此外可以把它们分成多个子模块或子单元或子组件。除了这样的特征和/或过程或者单元中的至少一些是相互排斥之外,可以采用任何组合对本说明书(包括伴随的权利要求、摘要和附图)中公开的所有特征以及如此公开的任何方法或者设备的所有过程或单元进行组合。除非另外明确陈述,本说明书(包括伴随的权利要求、摘要和附图)中公开的每个特征可以由提供相同、等同或相似目的的替代特征来代替。
此外,本领域的技术人员能够理解,尽管在此的一些实施例包括其它实施例中所包括的某些特征而不是其它特征,但是不同实施例的特征的组合意味着处于本申请的范围之内并且形成不同的实施例。例如,在下面的权利要求书中,所要求保护的实施例的任意之一都可以以任意的组合方式来使用。
本申请的各个部件实施例可以以硬件实现,或者以在一个或者多个处理器上运行的软件模块实现,或者以它们的组合实现。本领域的技术人员应当理解,可以在实践中使用微处理器或者数字信号处理器(DSP)来实现根据本申请实施例的一些或者全部部件的一些或者全部功能。本申请还可以实现为用于执行这里所描述的方法的一部分或者全部的设备或者装置程序(例如,计算机程序和计算机程序产品)。这样的实现本申请的程序可以存储在计算机可读介质上,或者可以具有一个或者多个信号的形式。这样的信号可以从因特网网站上下载得到,或者在载体信号上提供,或者以任何其他形式提供。
应该注意的是上述实施例对本申请进行说明而不是对本申请进行限制,并且本领域技术人员在不脱离所附权利要求的范围的情况下可设计出替换实施例。在权利要求中,不应将位于括号之间的任何参考符号构造成对权利要求的限制。单词“包含”不排除存在未列在权利要求中的元件或步骤。位于元件之前的单词“一”或“一个”不排除存在多个这样的元件。本申请可以借助于包括有若干不同元件的硬件以及借助于适当编程的计算机来实现。在列举了若干装置的单元权利要求中,这些装置中的若干个可以是通过同一个硬件项来具体体现。单词第一、第二、以及第三等的使用不表示任何顺序。可将这些单词解释为名称。上述实施例中的步骤,除有特殊说明外,不应理解为对执行顺序的限定。
工业实用性
本申请公开了一种多媒体检索方法、装置、设备、介质及程序产品;其中方法包括:接收检索文本,对检索文本进行语义分析,得到至少一个子属性以及子属性之间的相关关系;通过预训练的语义匹配模型对各个所述子属性和所述相关关系进行处理,得到各个子属性的标签;确定各个子属性的标签与预得到的多媒体子类标签之间的相似度值,根据各个相似度值确定与所述检索文本匹配的多媒体子类。

Claims (18)

  1. 一种多媒体检索方法,所述方法包括:
    接收检索文本,对所述检索文本进行语义分析,得到至少一个子属性以及所述子属性之间的相关关系;
    通过预训练的语义匹配模型对各个所述子属性和所述相关关系进行处理,得到各个所述子属性的标签;
    确定各个所述子属性的标签与预得到的多媒体子类标签之间的相似度值,根据各个所述相似度值确定与所述检索文本匹配的多媒体子类。
  2. 根据权利要求1所述的方法,其中,所述接收检索文本,对所述检索文本进行语义分析,得到至少一个子属性以及所述子属性之间的相关关系,包括:
    分析所述检索文本,得到所述检索文本的分词以及所述分词的词性;
    结合所述分词以及所述分词的词性对所述检索文本进行依存句法分析,得到所述分词之间的依存关系;
    根据所述分词、所述分词的词性以及所述依存关系,确定所述检索文本的子属性以及所述子属性之间的相关关系。
  3. 根据权利要求2所述的方法,其中,所述根据所述分词、所述分词的词性以及所述依存关系,确定所述检索文本的子属性以及所述子属性之间的相关关系,包括:
    对所述分词以及所述分词的词性进行编码,得到分词数字编码,将所述分词数字编码输入到编码器中进行转换得到分词向量;将所述检索文本进行句法分析得到句法特征向量;
    将所述分词向量和所述句法特征向量进行拼接融合,得到第一隐向量;
    通过多层感知器处理所述第一隐向量,得到第二隐向量;
    根据所述检索文本和所述依存关系构建依存关系矩阵,根据所述依存关系矩阵、依存关系初始化向量和所述第二隐向量,确定所述依存关系矩阵中各位置的乘积值,并根据所述乘积值和第一损失函数确定所述子属性之间的相关关系。
  4. The method according to claim 3, wherein the first loss function comprises a first cross-entropy loss function and a KL-divergence loss function;
    wherein the first cross-entropy loss function takes a normalized exponential (softmax) function and actual relation values as its arguments, and the KL-divergence loss function takes the mean of the first hidden vector and the mean of a target sub-attribute hidden vector as its arguments.
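A hedged reading of claim 4's first loss — a softmax cross-entropy term plus a KL-divergence term over mean hidden vectors — could be composed as below. The clipping/normalization of the mean vectors into distributions and the weighting factor `alpha` are illustrative choices, not specified by the patent:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(relation_logits, true_idx):
    """First cross-entropy term: softmax over relation scores,
    evaluated against the actual relation value."""
    return float(-np.log(softmax(relation_logits)[true_idx]))

def kl_divergence(p, q, eps=1e-9):
    """KL term between the mean of the first hidden vectors and the
    mean of the target sub-attribute hidden vectors, after clipping
    to positive values and normalizing to distributions."""
    p = np.clip(p, eps, None); p = p / p.sum()
    q = np.clip(q, eps, None); q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def first_loss(relation_logits, true_idx, h1_mean, target_mean, alpha=1.0):
    return cross_entropy(relation_logits, true_idx) \
        + alpha * kl_divergence(h1_mean, target_mean)
```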
  5. The method according to any one of claims 1-4, wherein the method further comprises:
    extracting sub-attributes from query text samples according to semantic analysis results, and generating, by conversion from the sub-attributes, a positive-sample vector, a query-text vector, and a negative-sample vector;
    processing the positive-sample vector, the query-text vector, and the negative-sample vector respectively through a cascaded encoder and a bidirectional long short-term memory network model, to correspondingly obtain a positive comparison encoding vector, a text comparison encoding vector, and a negative comparison encoding vector; wherein the encoder is configured to perform tokenization and feature encoding on the positive-sample vector, the query-text vector, and the negative-sample vector respectively;
    obtaining labels of the sub-attributes in the query text samples after processing the positive comparison encoding vector, the text comparison encoding vector, and the negative comparison encoding vector through a normalized exponential (softmax) function and a second loss function; and
    adjusting parameters of a semantic matching model based on the labels of the sub-attributes and manual labels, to obtain the pre-trained semantic matching model.
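The triplet-style training flow of claim 5 can be mocked up as below. The real model cascades an encoder with a BiLSTM; here a mean-pooled embedding lookup stands in for that stack, and the toy vocabulary, dimensions, and word choices are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = {"red": 0, "car": 1, "blue": 2, "sky": 3}  # toy vocabulary
emb = rng.normal(size=(len(vocab), 8))

def encode(words):
    """Stand-in for the cascaded encoder + BiLSTM: embedding lookup
    followed by mean pooling into one comparison encoding vector."""
    return emb[[vocab[w] for w in words]].mean(axis=0)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

text_vec = encode(["red", "car"])  # text comparison encoding vector
pos_vec = encode(["car"])          # positive comparison encoding vector
neg_vec = encode(["sky"])          # negative comparison encoding vector

# training would adjust the model so that cos(text, positive)
# exceeds cos(text, negative) by a margin
gap = cos(text_vec, pos_vec) - cos(text_vec, neg_vec)
```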
  6. The method according to claim 5, wherein the second loss function comprises a second cross-entropy loss function and a binary classification loss function;
    wherein the second cross-entropy loss function takes the normalized exponential (softmax) function and actual label values as its arguments, and the binary classification loss function takes the similarities among the positive comparison encoding vector, the text comparison encoding vector, and the negative comparison encoding vector as its arguments.
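Claim 6's second loss — a softmax cross-entropy term plus a binary classification term over similarities between the comparison encoding vectors — might be composed like this. The sigmoid-on-similarity formulation and the weight `beta` are assumptions, not taken from the patent:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_sim_loss(sim_pos, sim_neg):
    """Binary classification loss: the (text, positive) similarity is
    labelled 1 and the (text, negative) similarity is labelled 0."""
    return -np.log(sigmoid(sim_pos)) - np.log(1.0 - sigmoid(sim_neg))

def second_loss(label_logits, true_idx, sim_pos, sim_neg, beta=1.0):
    ce = -np.log(softmax(label_logits)[true_idx])  # cross-entropy term
    return float(ce + beta * binary_sim_loss(sim_pos, sim_neg))
```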
  7. The method according to any one of claims 1-4, wherein the multimedia subclass labels are obtained through the following steps:
    extracting keyframes from the multimedia, performing object detection on the keyframes, and performing analysis according to preset hierarchical categories to obtain labels of the corresponding hierarchical categories.
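Claim 7's offline labeling path (keyframe extraction, object detection, hierarchical category analysis) can be sketched with a frame-difference keyframe heuristic and a stubbed two-level hierarchy; the threshold value, the hierarchy contents, and the detector stand-in are all hypothetical:

```python
import numpy as np

def extract_keyframes(frames, threshold=10.0):
    """Keep frames whose mean absolute pixel difference from the last
    kept frame exceeds the threshold (a simple keyframe heuristic)."""
    keyframes = [frames[0]]
    for frame in frames[1:]:
        diff = np.mean(np.abs(frame.astype(float)
                              - keyframes[-1].astype(float)))
        if diff > threshold:
            keyframes.append(frame)
    return keyframes

# hypothetical two-level hierarchy: coarse category -> fine labels
HIERARCHY = {"vehicle": ["car", "truck"], "animal": ["cat", "dog"]}

def hierarchical_labels(detected):
    """Map object-detector outputs (fine labels) to (level-1, level-2)
    tags according to the preset hierarchy."""
    tags = []
    for fine in detected:
        for coarse, fines in HIERARCHY.items():
            if fine in fines:
                tags.append((coarse, fine))
    return tags
```

In practice the `detected` list would come from a real object detector run on each keyframe; only the hierarchy lookup is shown here.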
  8. A multimedia retrieval apparatus, the apparatus comprising:
    a semantic analysis module, configured to receive a query text and perform semantic analysis on the query text to obtain at least one sub-attribute and correlations between the sub-attributes;
    a label acquisition module, configured to process each of the sub-attributes and the correlations through a pre-trained semantic matching model to obtain a label for each of the sub-attributes; and
    a retrieval matching module, configured to determine similarity values between the label of each of the sub-attributes and pre-obtained multimedia subclass labels, and determine, according to the similarity values, a multimedia subclass matching the query text.
  9. The apparatus according to claim 8, wherein:
    the semantic analysis module is configured to: analyze the query text to obtain tokens of the query text and parts of speech of the tokens; perform dependency syntax analysis on the query text in combination with the tokens and their parts of speech, to obtain dependency relations between the tokens; and determine the sub-attributes of the query text and the correlations between the sub-attributes according to the tokens, the parts of speech of the tokens, and the dependency relations.
  10. The apparatus according to claim 9, wherein:
    the semantic analysis module is configured to: encode the tokens and the parts of speech of the tokens to obtain numeric token codes, and input the numeric token codes into an encoder for conversion to obtain token vectors; perform syntactic analysis on the query text to obtain a syntactic feature vector; concatenate and fuse the token vectors and the syntactic feature vector to obtain a first hidden vector; process the first hidden vector through a multilayer perceptron to obtain a second hidden vector; construct a dependency relation matrix according to the query text and the dependency relations; determine a product value at each position of the dependency relation matrix according to the dependency relation matrix, a dependency relation initialization vector, and the second hidden vector; and determine the correlations between the sub-attributes according to the product values and a first loss function.
  11. The apparatus according to claim 9, wherein the first loss function comprises a first cross-entropy loss function and a KL-divergence loss function;
    wherein the first cross-entropy loss function takes a normalized exponential (softmax) function and actual relation values as its arguments, and the KL-divergence loss function takes the mean of the first hidden vector and the mean of a target sub-attribute hidden vector as its arguments.
  12. The apparatus according to any one of claims 8 to 11, wherein the label acquisition module is configured to: extract sub-attributes from query text samples according to semantic analysis results, and generate, by conversion from the sub-attributes, a positive-sample vector, a query-text vector, and a negative-sample vector; process the positive-sample vector, the query-text vector, and the negative-sample vector respectively through a cascaded encoder and a bidirectional long short-term memory network model, to correspondingly obtain a positive comparison encoding vector, a text comparison encoding vector, and a negative comparison encoding vector; obtain labels of the sub-attributes in the query text samples after processing the positive comparison encoding vector, the text comparison encoding vector, and the negative comparison encoding vector through a normalized exponential (softmax) function and a second loss function; and adjust parameters of a semantic matching model based on the labels of the sub-attributes and manual labels, to obtain the pre-trained semantic matching model; wherein the encoder is configured to perform tokenization and feature encoding on the positive-sample vector, the query-text vector, and the negative-sample vector respectively.
  13. The apparatus according to claim 12, wherein the second loss function comprises a second cross-entropy loss function and a binary classification loss function;
    wherein the second cross-entropy loss function takes the normalized exponential (softmax) function and actual label values as its arguments, and the binary classification loss function takes the similarities among the positive comparison encoding vector, the text comparison encoding vector, and the negative comparison encoding vector as its arguments.
  14. The apparatus according to any one of claims 8 to 11, wherein the retrieval matching module is configured to extract keyframes from the multimedia, perform object detection on the keyframes, and perform analysis according to preset hierarchical categories to obtain labels of the corresponding hierarchical categories.
  15. A computing device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another via the communication bus;
    the memory is configured to store at least one executable instruction which, when executed by the processor, implements the multimedia retrieval method according to any one of claims 1-7.
  16. A computer-readable storage medium storing at least one executable instruction which, when executed by a processor of an electronic device, implements the multimedia retrieval method according to any one of claims 1-7.
  17. A computer program comprising computer-readable code which, when run in an electronic device, causes a processor of the electronic device to implement the multimedia retrieval method according to any one of claims 1 to 7.
  18. A computer program product comprising computer-readable code, or a non-volatile computer-readable storage medium carrying the computer-readable code, wherein when the computer-readable code runs in a processor of an electronic device, the processor of the electronic device implements the multimedia retrieval method according to any one of claims 1 to 7.
PCT/CN2023/132098 2022-11-16 2023-11-16 Multimedia retrieval method, apparatus, device, medium, and program product WO2024104438A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211434753.4A CN116304120A (zh) 2022-11-16 2022-11-16 Multimedia retrieval method, apparatus, computing device, and storage medium
CN202211434753.4 2022-11-16

Publications (1)

Publication Number Publication Date
WO2024104438A1 true WO2024104438A1 (zh) 2024-05-23

Family

ID=86811877

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/132098 WO2024104438A1 (zh) 2022-11-16 2023-11-16 Multimedia retrieval method, apparatus, device, medium, and program product

Country Status (2)

Country Link
CN (1) CN116304120A (zh)
WO (1) WO2024104438A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304120A (zh) * 2022-11-16 2023-06-23 中移(苏州)软件技术有限公司 Multimedia retrieval method, apparatus, computing device, and storage medium
CN116578729B (zh) * 2023-07-13 2023-11-28 腾讯科技(深圳)有限公司 Content search method, apparatus, electronic device, storage medium, and program product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815333A (zh) * 2019-01-14 2019-05-28 金蝶软件(中国)有限公司 Information acquisition method, apparatus, computer device, and storage medium
CN110472090A (zh) * 2019-08-20 2019-11-19 腾讯科技(深圳)有限公司 Semantic-label-based image retrieval method, related apparatus, and storage medium
CN111967242A (zh) * 2020-08-17 2020-11-20 支付宝(杭州)信息技术有限公司 Text information extraction method, apparatus, and device
US20220343626A1 (en) * 2019-08-15 2022-10-27 Vision Semantics Limited Text Based Image Search
CN116304120A (zh) * 2022-11-16 2023-06-23 中移(苏州)软件技术有限公司 Multimedia retrieval method, apparatus, computing device, and storage medium


Also Published As

Publication number Publication date
CN116304120A (zh) 2023-06-23


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23890867

Country of ref document: EP

Kind code of ref document: A1