CN114461890A - Hierarchical multi-modal intellectual property search engine method and system - Google Patents

Hierarchical multi-modal intellectual property search engine method and system

Info

Publication number
CN114461890A
CN114461890A (application CN202111531155.4A)
Authority
CN
China
Prior art keywords
text
retrieval
image
result
intellectual property
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111531155.4A
Other languages
Chinese (zh)
Inventor
周凡
苏志宏
林谋广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Ronggu Innovation Industrial Park Co ltd
Sun Yat Sen University
Original Assignee
Guangdong Ronggu Innovation Industrial Park Co ltd
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Ronggu Innovation Industrial Park Co ltd and Sun Yat Sen University
Priority to CN202111531155.4A
Publication of CN114461890A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a hierarchical multi-modal intellectual property search engine method comprising the following steps: preprocessing an input image; inputting the input image into a hierarchical deep image retrieval model to obtain an image retrieval result; inputting a text field in the data set into a text semantic retrieval model to obtain a text retrieval result; inputting the text retrieval result into a similar intellectual property recommendation model to obtain a similar recommendation result; and performing multi-modal result fusion of the image retrieval result, the text retrieval result, and the similar recommendation result to obtain a fused text result, which is re-ranked against the query text input by the user to obtain the final retrieval result. The invention also discloses a hierarchical multi-modal intellectual property search engine system. Through the hierarchical deep image retrieval model and the text semantic retrieval model, the invention increases retrieval speed while maintaining retrieval precision, and compared with paper-search methods the scheme better expresses users' retrieval requirements.

Description

Hierarchical multi-modal intellectual property search engine method and system
Technical Field
The invention relates to multi-modal search and deep learning, in particular to a hierarchical multi-modal intellectual property search engine method and system.
Background
In the big data era, artificial intelligence is widely applied across industries. For intellectual property retrieval, the knowledge network is large in scale and complex in its connections, and its knowledge nodes are heterogeneous. Faced with massive amounts of information, retrieval based on classified directories and keywords finds it increasingly difficult to meet users' search needs, and retrieval urgently needs to be upgraded from the word level to the semantic level. Researching and developing high-growth, high-timeliness, multi-modal intellectual property hypergraph network modeling technology makes it possible to accurately capture the real intention behind the sentence a user inputs and to retrieve accordingly, returning to the user the results that best meet their needs.
Intellectual property retrieval is a technical application in which a user inputs a text segment and the system returns the search results that best meet the user's needs. Most existing applications perform intellectual property retrieval in a text-only search mode. However, because this technique is single-modal, the user's segment input often struggles to express the retrieval requirements accurately. A multi-modal modeling technique is then required to accurately capture the user's true intention.
Multi-modal intellectual property search applies multi-modal retrieval technology to intellectual property search: the user can submit different types of input, such as searching for pictures with text or searching for text with drawings. In intellectual property search, the user can input both a search field and a picture related to the search intention, and the search engine combines the two types of input to obtain the results that best meet the user's needs. However, limited by the state of related artificial intelligence technologies, such engines find it difficult to perform fusion analysis of the two different input types, and the returned results often carry a certain bias that affects the final retrieval quality.
One existing technique generates related search results by analyzing the user's search terms: it acquires the search term input by the user, determines the user's demand type from the search term, and determines the corresponding guidance policy according to that type; it then generates related paper search results according to the guidance policy and the search term, displays them on a search result page, and provides that page to the user. The disadvantage of this approach is that it is single-modal, and the user's segment input often struggles to express the retrieval requirements accurately.
The second prior art is the design and implementation of a Chinese knowledge search system based on an encyclopedia. It realizes knowledge search over encyclopedia entities through word segmentation, part-of-speech tagging, synonym conversion, question-word conversion, core entity identification, retrieval, result re-ranking, and similar steps. Its disadvantage is that the similarity between the query text and the text to be retrieved is computed through synonym conversion and similar means rather than in a word-vector embedding feature space; it is realized through the encyclopedia's synonym vocabulary and requires the intervention of the encyclopedia knowledge base.
Disclosure of Invention
The invention aims to overcome the defects of the existing methods and provides a hierarchical multi-modal intellectual property search engine method and system. It addresses two main problems. First, the prior art determines the user's demand type from the acquired search term and chooses a guidance policy accordingly, but that technique is single-modal, and the user's segment input struggles to express the retrieval requirements accurately. Second, the existing encyclopedia-based Chinese knowledge search system computes the similarity between the query text and the text to be retrieved through synonym conversion and similar means, which must be realized through the encyclopedia's synonym vocabulary and requires the intervention of the encyclopedia knowledge base.
In order to solve the above problems, the present invention provides a hierarchical multi-modal intellectual property search engine method, including:
screening an input image and a text field from an intellectual property database, and processing the image into a uniform size;
inputting an input image from the intellectual property data set into a hierarchical deep image retrieval model to obtain an image retrieval result: for a query picture, first extract deep image features with an image deep-feature extraction network, then obtain the query picture's binary code through a hash coding network and a binarization operation, perform a coarse retrieval with the hash values, take the first K results, perform a fine retrieval based on the deep image features, and finally obtain the image retrieval result R_v of the hierarchical deep image retrieval model;
inputting text fields from the intellectual property data set into a text semantic retrieval model to obtain a text retrieval result: the query text is first classified by a text classification network, category screening effectively narrows the search range, and retrieval is then performed within the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model;
inputting the obtained text retrieval result into a similar intellectual property recommendation model to obtain a similar recommendation result;
performing multi-modal result fusion of the obtained image retrieval result, text retrieval result, and similar recommendation result to form the fused text result R_b, and re-ranking R_b against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
Preferably, the input image and the text field are screened from the intellectual property database, and the image is processed into a uniform size, specifically:
the images in the input intellectual property, such as flow charts and network structure diagrams, are processed to a uniform size by center proportional cropping and proportional scaling, and data augmentation such as a random rotation angle is applied to each input image.
Preferably, the input image from the intellectual property data set is input into the hierarchical deep image retrieval model to obtain an image retrieval result: deep image features of the query picture are extracted by the image deep-feature extraction network, the query picture's binary code is obtained through the hash coding network and binarization, a coarse retrieval is performed with the hash values, the first K results are taken for a fine retrieval based on the deep image features, and the retrieval result R_v of the hierarchical deep image retrieval model is finally obtained, specifically:
selecting ResNet-50 as a skeleton network of the model, inputting an input image into a ResNet model pre-trained on an image classification data set ImageNet, and extracting visual characteristics of the clothing image;
inputting the extracted visual features into a hash coding network for hash coding: the high-dimensional image features extracted by the feature extraction network are input into a fully connected layer that outputs n-dimensional binary-like values, which are converted into hash features by a binarization operation; during error back-propagation, the parameters of the whole network, including the image deep-feature extraction network, are updated so the weights better fit the hash coding task, with the loss function as follows:
$$L = \sum_{i=1}^{N} \left[ \frac{1}{2}\, s_i \left\| h_{i,1} - h_{i,2} \right\|_2^2 + \frac{1}{2}\,(1 - s_i)\, \max\!\left(t - \left\| h_{i,1} - h_{i,2} \right\|_2^2,\ 0\right) + \alpha \left( \left\| \left| h_{i,1} \right| - \mathbf{1} \right\|_1 + \left\| \left| h_{i,2} \right| - \mathbf{1} \right\|_1 \right) \right]$$
where N is the number of image pairs selected in a batch during training; $h_{i,1}, h_{i,2}$ denote the network outputs (binary-like feature representations) of the two images in the i-th pair; $s_i$ indicates whether the two images in the i-th pair are similar, i.e., in the data set, whether the two clothing images belong to the same category (1 if similar, 0 otherwise); t is a margin threshold parameter; and α is a regularization strength parameter.
The first two terms of the loss function push the binary-like codes generated for pictures of the same category as close together as possible and those of different categories as far apart as possible; the regularization term drives $h_{i,1}, h_{i,2}$ toward -1 or 1 during loss optimization, producing near-binary outputs. Finally, binarization with 0 as the dividing point yields a standard binary output;
a coarse retrieval is carried out in Hamming space: in the coarse retrieval stage, the query picture passes through the network to obtain an n-dimensional binary representation $b_q$; with $b_i$ the binary representation of any garment in the database, the clothing items in the database are ordered by the following Hamming distance:
$$dist_h(b_q, b_i) = \sum_{j=1}^{n} \mathbb{1}\!\left[ b_q^{(j)} \neq b_i^{(j)} \right]$$
a fine retrieval is carried out in the image feature space: in the fine retrieval stage, the top-K results from the coarse stage are taken; the ResNet output of the query picture is denoted $r_q$ and the ResNet output of any of the top-K coarse results is denoted $r_i$, and the results are sorted by the Euclidean distance between deep image features:
$$dist_f(r_q, r_i) = \left\| r_q - r_i \right\|_2$$
thereby obtaining more accurate retrieval results.
Preferably, the query text is classified in advance by a text classification network, the search range is effectively narrowed by category screening, and retrieval is then performed within the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model, yielding the text retrieval result, specifically:
a doc2vec-based text-embedding feature extraction model extracts the feature vector of the input query text; a text classification network built around an LSTM core component classifies the query text into a category; and retrieval is performed within the screened intellectual property via the Euclidean distance between document feature vectors from the text-embedding feature extraction model.
Preferably, the text retrieval result obtained above is input into a similar intellectual property recommendation model and a similar result is recommended, specifically:
in the recommendation model, the relevance is defined as the mutual information of two words x, y:
$$I(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$
the stronger the relevance of the two words, the larger the mutual information value, which usually shows that the two words often appear in the same sentence and often match. Training is carried out by using a skip-gram model in word2vec and an optimized acceleration module thereof, namely h-softmax, so as to obtain the word wiThe word w in the case of occurrencekProbability modeling of occurrence, namely:
$$p(w_k \mid w_i)$$
Preferably, the multi-modal result fusion of the obtained image retrieval result, text retrieval result, and similar recommendation result is specifically:
the top k_visual text results R_v whose images are similar to those in the search library are retrieved by the hierarchical deep image retrieval model, and the top k_text results R_t semantically similar to the text description are retrieved by the text semantic retrieval model; the first h_similar entries of R_t are then fed into the similar intellectual property recommendation model, and the first k_similar recommendations are taken as the similar result R_s. Finally, the three results are fused into the fused text result R_b, and R_b is re-ranked against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
Correspondingly, the invention also provides a hierarchical multi-modal intellectual property search engine system, which comprises:
the image preprocessing unit is used for screening the input images and the text fields from the intellectual property database and processing the images into a uniform size;
an image retrieval unit for inputting the input image in the intellectual property data set into the hierarchical depth image retrieval model to obtain an image retrieval result, extracting the deep features of the image by using an image deep feature extraction network for the query picture, then obtaining the binary code of the query picture through a Hash coding network and binarization operation, carrying out coarse retrieval by using the Hash value, taking the first K results, carrying out fine retrieval based on the deep features of the image, and finally obtaining the retrieval result R of the hierarchical depth image retrieval modelv
a text retrieval unit for inputting text fields from the intellectual property data set into the text semantic retrieval model to obtain a text retrieval result: the query text is first classified by a text classification network, category screening effectively narrows the search range, and the screened intellectual property is then retrieved via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model;
the text recommendation unit is used for inputting the obtained text retrieval into a similar intellectual property recommendation model and recommending a similar result;
a multi-modal fusion unit for fusing the obtained image retrieval result, text retrieval result, and similar recommendation result into the fused text result R_b, and re-ranking R_b against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
The implementation of the invention has the following beneficial effects:
the method applies a Chinese text semantic retrieval model, and utilizes a text classifier to classify in advance to reduce the retrieval range, so that the text is represented better, and the retrieval speed and precision are improved; the invention designs a hierarchical depth image retrieval model, extracts the deep features of the image captured by the network through the deep features of the image, generates simple Hash features by utilizing a Hash coding network, performs rough retrieval on the Hash features and then performs fine retrieval on the deep features, effectively improves the retrieval speed and simultaneously keeps the retrieval precision with equivalent effect; compared with the traditional visual characteristic characterization method, the method has wider adaptability in the pre-training process by using the pre-training and fine-tuning scheme; the similar intellectual property recommendation model designed by the invention adopts a probability-driven method to quantify the similarity, and can search a search result close to the user intention; according to the multi-mode fused intellectual property search engine system designed by the invention, a user can input the query text and the related query picture, the search engine can comprehensively consider the two types of input and return the retrieval result, and the multi-mode input mode can more accurately capture the search intention of the user.
Drawings
FIG. 1 is a general flow diagram of a hierarchical multi-modal intellectual property search engine method of an embodiment of the present invention;
FIG. 2 is a block diagram of a hierarchical multi-modal intellectual property search engine system in accordance with an embodiment of the present invention;
FIG. 3 is a flowchart of hierarchical deep image retrieval according to an embodiment of the present invention;
FIG. 4 is a text semantic retrieval flow diagram according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a general flow diagram of a hierarchical multi-modal intellectual property search engine method according to an embodiment of the present invention; as shown in FIG. 1, the method comprises:
s1, screening the input image and the text field from the intellectual property database, and processing the image into a uniform size;
S2, inputting the input image from the intellectual property data set into the hierarchical deep image retrieval model to obtain the image retrieval result: first extract deep image features of the query picture with the image deep-feature extraction network, then obtain the query picture's binary code through the hash coding network and binarization, perform a coarse retrieval with the hash values, take the first K results, perform a fine retrieval based on the deep image features, and finally obtain the retrieval result R_v of the hierarchical deep image retrieval model;
S3, inputting the text field from the intellectual property data set into a text semantic retrieval model to obtain a text retrieval result: classify the query text in advance with a text classification network, effectively narrow the search range by category screening, and then retrieve the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model;
S4, inputting the obtained text retrieval result into a similar intellectual property recommendation model and recommending a similar result;
and S5, performing multi-modal result fusion of the obtained image retrieval result, text retrieval result, and similar recommendation result.
Step S1 is specifically as follows:
S1-1, original images in the intellectual property, such as flow charts and network structure diagrams, are input and processed to a uniform size by center proportional cropping and proportional scaling, and data augmentation such as a random rotation angle is applied to each input image.
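As an illustration of this preprocessing step, the sketch below uses PyTorch/torchvision; the 224x224 target size, the 256-pixel short-side resize, and the 15-degree rotation range are assumptions for demonstration, since the patent specifies only center proportional cropping, proportional scaling, and random-rotation augmentation.

```python
# Minimal preprocessing sketch (sizes and rotation range are assumed).
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),         # proportional scaling of the shorter side
    transforms.CenterCrop(224),     # center crop to a uniform size
    transforms.RandomRotation(15),  # random rotation angle augmentation
    transforms.ToTensor(),          # convert the PIL image to a float tensor
])
```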
Step S2, as shown in fig. 3, is as follows:
The original image is input into a ResNet model pre-trained on the image classification data set ImageNet, and the visual features of the clothing image are extracted. ResNet-50 is selected as the model's skeleton network; it has a 50-layer structure, the feature dimensionality of the last layer's output is 2048, and the final fully connected layer of the original network is not included;
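A minimal PyTorch sketch of this backbone follows: ResNet-50 pre-trained on ImageNet with its classification head dropped, so the network outputs the 2048-dimensional deep feature described above.

```python
import torch
import torchvision.models as models

# ResNet-50 pre-trained on ImageNet; the original classification head is
# replaced with an identity so the output is the 2048-d feature vector.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

with torch.no_grad():
    feats = backbone(torch.randn(1, 3, 224, 224))  # feats.shape == (1, 2048)
```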
The extracted visual features are input into the hash coding network for hash coding: the high-dimensional image features extracted by the feature extraction network are fed into a fully connected layer that outputs n-dimensional binary-like values, which are finally converted into hash features by binarization. During error back-propagation, the parameters of the whole network, including the image deep-feature extraction network, are updated so the weights better fit the hash coding task. The loss function is as follows:
$$L = \sum_{i=1}^{N} \left[ \frac{1}{2}\, s_i \left\| h_{i,1} - h_{i,2} \right\|_2^2 + \frac{1}{2}\,(1 - s_i)\, \max\!\left(t - \left\| h_{i,1} - h_{i,2} \right\|_2^2,\ 0\right) + \alpha \left( \left\| \left| h_{i,1} \right| - \mathbf{1} \right\|_1 + \left\| \left| h_{i,2} \right| - \mathbf{1} \right\|_1 \right) \right]$$
where N is the number of image pairs selected in a batch during training; $h_{i,1}, h_{i,2}$ denote the network outputs (binary-like feature representations) of the two images in the i-th pair; $s_i$ indicates whether the two images in the i-th pair are similar, i.e., in the data set, whether the two clothing images belong to the same category (1 if similar, 0 otherwise); t is a margin threshold parameter; and α is a regularization strength parameter.
When the two images are of the same category, the first term of the loss function, $\tfrac{1}{2} s_i \lVert h_{i,1} - h_{i,2} \rVert_2^2$, penalizes image pairs whose binary-like outputs are dissimilar; when the two images are of different categories, the second term, $\tfrac{1}{2}(1 - s_i)\max(t - \lVert h_{i,1} - h_{i,2} \rVert_2^2,\, 0)$, penalizes image pairs whose binary-like outputs are similar, where t is the distance expected between the network outputs of two images of different categories. The first two terms thus push the binary-like codes generated for pictures of the same category as close together as possible and those of different categories as far apart as possible. The regularization term $\alpha(\lVert |h_{i,1}| - \mathbf{1} \rVert_1 + \lVert |h_{i,2}| - \mathbf{1} \rVert_1)$ drives $h_{i,1}, h_{i,2}$ toward -1 or 1 during loss optimization, so the network produces near-binary outputs. Finally, binarization with 0 as the dividing point yields a standard binary output.
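A minimal PyTorch sketch of this pairwise loss, as reconstructed above, is given below; averaging over the batch rather than summing, and the values of t and alpha, are assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_hash_loss(h1, h2, s, t=2.0, alpha=0.01):
    """Pairwise hashing loss over binary-like codes h1, h2 of shape (N, n_bits).

    s is 1.0 for same-category pairs and 0.0 otherwise; t is the margin
    threshold and alpha the regularization strength (illustrative values).
    """
    d = (h1 - h2).pow(2).sum(dim=1)       # squared L2 distance per pair
    same = 0.5 * s * d                    # pull same-category codes together
    diff = 0.5 * (1 - s) * F.relu(t - d)  # push different-category codes apart
    reg = alpha * ((h1.abs() - 1).abs().sum(dim=1)
                   + (h2.abs() - 1).abs().sum(dim=1))  # drive outputs toward +/-1
    return (same + diff + reg).mean()     # batch mean (the patent sums over i)
```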
S2-3, performing rough search in Hamming space. In the rough retrieval stage, the query picture is output through the network to obtain n-dimensional binary representation bqBinary representation of any garment in the database biThe clothing items in the database are ordered according to the following hamming distances:
$$dist_h(b_q, b_i) = \sum_{j=1}^{n} \mathbb{1}\!\left[ b_q^{(j)} \neq b_i^{(j)} \right]$$
S2-4, a fine retrieval is performed in the image feature space: in the fine retrieval stage, the top-K results from the coarse stage are taken; the ResNet output of the query picture is denoted $r_q$ and the ResNet output of any of the top-K coarse results is denoted $r_i$, and the results are sorted by the Euclidean distance between deep image features:
$$dist_f(r_q, r_i) = \left\| r_q - r_i \right\|_2$$
thereby obtaining more accurate retrieval results.
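The two-stage search can be sketched as follows in NumPy; the candidate count K and the final result count are illustrative parameters.

```python
import numpy as np

def coarse_to_fine_search(b_q, B, r_q, R, K=100, topk=10):
    """Hamming-coarse then Euclidean-fine retrieval sketch.

    b_q: (n_bits,) query binary code; B: (M, n_bits) database binary codes;
    r_q: (d,) query deep feature; R: (M, d) database deep features.
    """
    hamming = (B != b_q).sum(axis=1)              # coarse: Hamming distance to every item
    cand = np.argsort(hamming)[:K]                # keep the top-K coarse candidates
    eucl = np.linalg.norm(R[cand] - r_q, axis=1)  # fine: Euclidean distance on deep features
    return cand[np.argsort(eucl)[:topk]]          # indices of the final ranked results
```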
Step S3, as shown in fig. 4, is as follows:
S3-1: the query text is input into the text-embedding feature extraction model to extract the text's feature vector. The text-embedding feature extraction model of this method is based on doc2vec, whose training goal is to maximize the average log probability of predicting the current word:
$$\frac{1}{N} \sum_{i=k}^{N-k} \log p\!\left( w_i \mid w_{i-k}, \ldots, w_{i+k} \right)$$
where N is the document length, k is half the window size, and $w_i$ are words. The probability p is obtained via softmax; the pre-normalization score is:
$$p_u = b + K f\!\left( w_{i-k}, \ldots, w_{i+k}, \textit{para};\, W, D \right)$$
where b and K are softmax parameters, and f is an intermediate vector representation obtained by concatenating or averaging the word vectors extracted from W and the document vector extracted from D; this vector is then used to predict the next word.
doc2vec has another framework, PV-DBOW, which is the inverse of PV-DM: it predicts the words in a document window from the document vector, regardless of context word order. The text-embedding feature extraction model constructed by this method is realized on the basis of doc2vec and modified to make full use of both variants. PV-DM considers word order while capturing semantic information, and PV-DBOW stores less data (no word-vector matrix W is required); so, to exploit both advantages and obtain a more accurate and stable vector representation, the document vectors from the two models are combined: the two vectors of the same document are concatenated, yielding a higher-dimensional vector with higher document discriminability.
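A sketch of this combined embedding using the gensim library is shown below; the toy corpus, vector sizes, window, and epoch counts are illustrative assumptions, and gensim's Doc2Vec is one possible realization of PV-DM/PV-DBOW.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus standing in for the intellectual property text fields (assumed).
raw_documents = ["hash coding network for image retrieval",
                 "lstm text classification of patent abstracts"]
corpus = [TaggedDocument(words=doc.split(), tags=[i])
          for i, doc in enumerate(raw_documents)]

dm = Doc2Vec(corpus, dm=1, vector_size=100, window=5, min_count=1, epochs=20)    # PV-DM
dbow = Doc2Vec(corpus, dm=0, vector_size=100, window=5, min_count=1, epochs=20)  # PV-DBOW

def embed(tokens):
    # Concatenate the two inferred vectors into one higher-dimensional,
    # more discriminative document representation, as described above.
    return np.concatenate([dm.infer_vector(tokens), dbow.infer_vector(tokens)])

vec = embed("multi modal patent search".split())  # 200-d document vector
```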
And S3-2, inputting the query text into the text classification network to classify the text type. According to the method, an LSTM is used as a core component to construct a text classification network, and the whole network comprises an Embellding layer, a spatialdropout1d layer, an LSTM layer and an FC layer. Firstly, data cleaning is carried out on an input query text, a vocabulary is constructed, a document is labeled according to the vocabulary and constructed into a 250-dimensional label vector, and then the vector is input into an embedding layer, so that a 100-dimensional document embedded expression containing 250 timestamps is obtained. In order to properly reduce the dependency between each timeester, a Spatialropout1D layer is used for processing, then the document is embedded into the input LSTM, because a sequence does not need to be generated, and the final goal is classification, a many-to-one structure is adopted in the LSTM layer, only the output of the last timeester is taken, and the 100-dimensional output is obtained. And finally, generating an n-dimensional vector through a full connection layer, namely an FC layer, wherein n represents the number of clothing categories, which is a multi-classification task, so that the output is processed by adopting a softmax activation function, and finally the intellectual property category to which the query text belongs is obtained. .
S3-3, an in-class vector retrieval is performed on the results output by S3-1 and S3-2: the screened intellectual property is retrieved via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model, with the Euclidean distance computed as $dist_f(r_q, r_i) = \lVert r_q - r_i \rVert_2$ above;
step S4 is specifically as follows:
S4-1, in the recommendation model, relevance is first defined as the mutual information of two words x, y:
$$I(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$
the stronger the relevance of two words, the larger the mutual information value, which usually shows that two words often appear in the same sentence and often match. To solve the above formula, firstly, the word w is alignediThe word w in the case of occurrencekProbability modeling of occurrence, namely:
$$p(w_k \mid w_i)$$
to do this, the skip-gram model in word2vec and its optimized acceleration module, namely h-softmax, are the best choices.
H-softmax accelerates the training process because words are encoded with a Huffman tree: high-frequency words get short paths and low-frequency words long paths, which effectively compresses the dictionary and speeds up the probability computation. In the Huffman tree structure, each leaf node represents a word. Taking word $w_2$ as an example, let the intermediate nodes from the root to $w_2$ be $m(w_2,1)$, $m(w_2,2)$, $m(w_2,3)$; then the probability of predicting $w_2$ from input $w_i$ is the product of the probabilities along the path through these intermediate nodes:
$$p(w_2 \mid w_i) = p\!\left(m(w_2,1), \text{left}\right) \cdot p\!\left(m(w_2,2), \text{left}\right) \cdot p\!\left(m(w_2,3), \text{right}\right)$$
The probability of walking down from an intermediate node m(w, j) in direction d (left or right) takes the standard hierarchical-softmax form

$$p\!\left(m(w,j), d\right) = \sigma\!\left( \llbracket d \rrbracket \cdot \theta_{m(w,j)}^{\top} v_{w_i} \right)$$

where $\theta_{m(w,j)}$ is the parameter vector of the intermediate node, $v_{w_i}$ is the input word vector, and the sign function indicates whether the walk at node m(w, j) goes left or right:

$$\llbracket d \rrbracket = \begin{cases} +1, & d = \text{left} \\ -1, & d = \text{right} \end{cases}$$
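As a sketch, the skip-gram/h-softmax training and the mutual-information score can be realized with the gensim library and plain Python as below; the toy sentences and hyperparameters are assumptions.

```python
import math
from gensim.models import Word2Vec

# Toy corpus (assumed); in practice, the tokenized intellectual property texts.
sentences = [["image", "retrieval", "hash"], ["text", "retrieval", "semantic"]]

# Skip-gram (sg=1) with hierarchical softmax (hs=1, negative sampling off),
# i.e., the Huffman-coded output layer described above.
w2v = Word2Vec(sentences, sg=1, hs=1, negative=0,
               vector_size=100, window=5, min_count=1)

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information of a word pair, given estimated
    co-occurrence and marginal probabilities."""
    return math.log(p_xy / (p_x * p_y))
```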
step S5 is specifically as follows:
The top k_visual text results R_v whose images are similar to those in the retrieval library are retrieved by the hierarchical deep image retrieval model, and the top k_text results R_t semantically similar to the text description are retrieved by the text semantic retrieval model; the first h_similar entries of R_t are then fed into the similar intellectual property recommendation model, and the first k_similar recommendations are taken as the similar result R_s. Finally, the three results are fused into R_b, and R_b is re-ranked against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f, thereby realizing the fusion of the multi-modal intellectual property retrieval results.
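The fusion and re-ranking step can be sketched as follows; the result lists are assumed to be document ids and doc_vecs an id-to-embedding lookup, both illustrative.

```python
import numpy as np

def fuse_and_rerank(R_v, R_t, R_s, q_vec, doc_vecs, k_f=10):
    """Fuse the three result lists and re-rank by Euclidean distance in the
    text-embedding feature space. R_v, R_t, R_s: lists of document ids;
    q_vec: query text embedding; doc_vecs: id -> embedding map (assumed)."""
    R_b = list(dict.fromkeys(R_v + R_t + R_s))  # fused set, order-preserving dedup
    dists = [np.linalg.norm(doc_vecs[i] - q_vec) for i in R_b]
    order = np.argsort(dists)[:k_f]             # closest to the query text first
    return [R_b[j] for j in order]              # final result R_f
```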
Accordingly, the present invention also provides a hierarchical multi-modal intellectual property search engine system, as shown in fig. 2, comprising:
the image preprocessing unit 1 is used for screening input images and text fields from an intellectual property database and processing the images into a uniform size.
Specifically, images in the input intellectual property, such as flow charts and network structure diagrams, are processed to a uniform size by center proportional cropping and proportional scaling, and data augmentation such as a random rotation angle is applied to each input image.
An image retrieval unit 2 for inputting the input image from the intellectual property data set into the hierarchical deep image retrieval model to obtain the image retrieval result: deep image features of the query picture are extracted by the image deep-feature extraction network, the query picture's binary code is obtained through the hash coding network and binarization, a coarse retrieval is performed with the hash values, the first K results are taken for a fine retrieval based on the deep image features, and the image retrieval result R_v of the hierarchical deep image retrieval model is finally obtained.
Specifically, ResNet-50 is selected as the model's skeleton network; the input image is fed into a ResNet model pre-trained on the image classification data set ImageNet, and the visual features of the clothing image are extracted;
the extracted visual features are input into the hash coding network for hash coding: the high-dimensional image features extracted by the feature extraction network are fed into a fully connected layer that outputs n-dimensional binary-like values, which are converted into hash features by binarization; during error back-propagation, the parameters of the whole network, including the deep-feature extraction network, are updated so the weights better fit the hash coding task, with the loss function as follows:
$$L = \sum_{i=1}^{N} \left[ \frac{1}{2}\, s_i \left\| h_{i,1} - h_{i,2} \right\|_2^2 + \frac{1}{2}\,(1 - s_i)\, \max\!\left(t - \left\| h_{i,1} - h_{i,2} \right\|_2^2,\ 0\right) + \alpha \left( \left\| \left| h_{i,1} \right| - \mathbf{1} \right\|_1 + \left\| \left| h_{i,2} \right| - \mathbf{1} \right\|_1 \right) \right]$$
where N is the number of image pairs selected in a batch during training; $h_{i,1}, h_{i,2}$ denote the network outputs (binary-like feature representations) of the two images in the i-th pair; $s_i$ indicates whether the two images in the i-th pair are similar, i.e., in the data set, whether the two clothing images belong to the same category (1 if similar, 0 otherwise); t is a margin threshold parameter; and α is a regularization strength parameter.
The first two terms of the loss function push the binary-like codes generated for pictures of the same category as close together as possible and those of different categories as far apart as possible, so that during loss optimization $h_{i,1}, h_{i,2}$ approach -1 or 1, producing near-binary outputs; finally, binarization with 0 as the dividing point yields a standard binary output;
a coarse retrieval is carried out in Hamming space: in the coarse retrieval stage, the query picture passes through the network to obtain an n-dimensional binary representation $b_q$; with $b_i$ the binary representation of any garment in the database, the clothing items in the database are ordered by the following Hamming distance:
$$dist_h(b_q, b_i) = \sum_{j=1}^{n} \mathbb{1}\!\left[ b_q^{(j)} \neq b_i^{(j)} \right]$$
a fine retrieval is carried out in the image feature space: in the fine retrieval stage, the top-K results from the coarse stage are taken; the ResNet output of the query picture is denoted $r_q$ and the ResNet output of any of the top-K coarse results is denoted $r_i$, and the results are sorted by the Euclidean distance between deep image features:
$$dist_f(r_q, r_i) = \left\| r_q - r_i \right\|_2$$
thereby obtaining more accurate retrieval results.
A text retrieval unit 3 for classifying the query text in advance with a text classification network, effectively narrowing the search range by category screening, and then retrieving within the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model, yielding the text retrieval result.
Specifically, a doc2vec-based text-embedding feature extraction model extracts the feature vectors of the input query text; a text classification network built around an LSTM core component classifies the query text into a category; and retrieval is performed within the screened intellectual property via the Euclidean distance between document feature vectors from the text-embedding feature extraction model.
A text recommendation unit 4 for inputting the obtained text retrieval result into a similar intellectual property recommendation model and recommending a similar result.
Specifically, in the recommendation model, the relevance is defined as mutual information of two words x, y:
$$I(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$
the stronger the relevance of the two words, the larger the mutual information value, which usually shows that the two words often appear in the same sentence and often match. Training is carried out by using a skip-gram model in word2vec and an optimized acceleration module thereof, namely h-softmax, so as to obtain the word wiThe word w in the case of occurrencekProbability modeling of occurrence, namely:
$$p(w_k \mid w_i)$$
and a multi-modal fusion unit 5, configured to perform multi-modal result fusion on the obtained image retrieval result, the obtained text retrieval result, and the obtained similar recommendation result.
Specifically, the top k_visual text results R_v whose images are similar to those in the search library are retrieved by the hierarchical deep image retrieval model, and the top k_text results R_t semantically similar to the text description are retrieved by the text semantic retrieval model; the first h_similar entries of R_t are fed into the similar intellectual property recommendation model, and the first k_similar recommendations are taken as the similar result R_s. Finally, the three results are fused into the fused text result R_b, and R_b is re-ranked against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
Therefore, by using the hierarchical deep image retrieval model, which extracts deep image features with the deep-feature capture network, generates compact hash features with the hash coding network, and performs coarse retrieval on hash features followed by fine retrieval on deep features, the invention effectively improves retrieval speed while maintaining comparable retrieval precision. The text semantic retrieval model uses a text classifier to pre-classify and narrow the retrieval range, representing the text better. The similar intellectual property recommendation model quantifies similarity with a probability-driven method and can retrieve results close to the user's intention. Compared with traditional visual-feature characterization, the pre-training and fine-tuning scheme gives the pre-training process wider adaptability and can extract visual features that are both general and meaningful for the specific scene. In the multi-modal fused intellectual property search engine system, the user can input both a query text and a related query picture; the search engine considers the two types of input together and returns the retrieval result, and this multi-modal input mode captures the user's search intention more accurately.
The hierarchical multi-modal intellectual property search engine method and system provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is only intended to help understand the method and its core idea; meanwhile, a person skilled in the art may, according to the idea of the present invention, vary the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (12)

1. A hierarchical multi-modal intellectual property search engine method, the method comprising:
screening an input image and a text field from an intellectual property database, and processing the image into a uniform size;
inputting an input image from the intellectual property data set into a hierarchical deep image retrieval model to obtain an image retrieval result: for a query picture, first extracting deep image features with an image deep-feature extraction network, then obtaining the query picture's binary code through a hash coding network and a binarization operation, performing a coarse retrieval with the hash values, taking the first K results, performing a fine retrieval based on the deep image features, and finally obtaining the retrieval result R_v of the hierarchical deep image retrieval model;
inputting text fields from the intellectual property data set into a text semantic retrieval model to obtain a text retrieval result: classifying the query text in advance through a text classification network, effectively narrowing the search range by category screening, and then retrieving within the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model;
inputting the obtained text retrieval result into a similar intellectual property recommendation model and recommending a similar result;
performing multi-modal result fusion of the obtained image retrieval result, text retrieval result, and similar recommendation result to form the fused text result R_b, and re-ranking R_b against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
2. The method of claim 1, wherein the input images and text fields are screened from the intellectual property database and the images are processed to a uniform size, specifically:
processing input images in the intellectual property, such as flow charts and network structure diagrams, to a uniform size by center proportional cropping and proportional scaling, and applying data augmentation such as a random rotation angle to each input image.
3. The hierarchical multi-modal intellectual property search engine method according to claim 1, wherein the input image from the intellectual property data set is input into the hierarchical deep image retrieval model to obtain an image retrieval result: deep image features of the query picture are extracted by the deep-feature extraction network, the query picture's binary code is obtained through the hash coding network and binarization, a coarse retrieval is performed with the hash values, the first K results are taken for a fine retrieval based on the deep image features, and the retrieval result R_v of the hierarchical deep image retrieval model is finally obtained, specifically:
selecting ResNet-50 as a skeleton network of the model, inputting an input image into a ResNet model pre-trained on an image classification data set ImageNet, and extracting visual characteristics of the clothing image;
inputting the extracted visual features into a hash coding network for hash coding: the high-dimensional image features extracted by the feature extraction network are input into a fully connected layer that outputs n-dimensional binary-like values, which are converted into hash features by a binarization operation; during error back-propagation, the parameters of the whole network, including the deep-feature extraction network, are updated so the weights better fit the hash coding task, with the loss function as follows:
$$L = \sum_{i=1}^{N} \left[ \frac{1}{2}\, s_i \left\| h_{i,1} - h_{i,2} \right\|_2^2 + \frac{1}{2}\,(1 - s_i)\, \max\!\left(t - \left\| h_{i,1} - h_{i,2} \right\|_2^2,\ 0\right) + \alpha \left( \left\| \left| h_{i,1} \right| - \mathbf{1} \right\|_1 + \left\| \left| h_{i,2} \right| - \mathbf{1} \right\|_1 \right) \right]$$
where N is the number of image pairs selected in a batch during training; $h_{i,1}, h_{i,2}$ denote the network outputs (binary-like feature representations) of the two images in the i-th pair; $s_i$ indicates whether the two images in the i-th pair are similar, i.e., in the data set, whether the two clothing images belong to the same category (1 if similar, 0 otherwise); t is a margin threshold parameter; and α is a regularization strength parameter;
the first two terms of the loss function push the binary-like codes generated for pictures of the same category as close together as possible and those of different categories as far apart as possible, so that during loss optimization $h_{i,1}, h_{i,2}$ approach -1 or 1, producing near-binary outputs; finally, binarization with 0 as the dividing point yields a standard binary output;
carrying out a coarse retrieval in Hamming space: in the coarse retrieval stage, the query picture passes through the network to obtain an n-dimensional binary representation $b_q$; with $b_i$ the binary representation of any garment in the database, the clothing items in the database are ordered by the following Hamming distance:
$$dist_h(b_q, b_i) = \sum_{j=1}^{n} \mathbb{1}\!\left[ b_q^{(j)} \neq b_i^{(j)} \right]$$
searching in the image feature space: in the fine retrieval stage, the top-K results from the coarse stage are taken; the ResNet output of the query picture is denoted $r_q$ and the ResNet output of any of the top-K coarse results is denoted $r_i$, and the results are then sorted by the Euclidean distance between deep image features:
$$dist_f(r_q, r_i) = \left\| r_q - r_i \right\|_2$$
thereby obtaining more accurate retrieval results.
4. The method according to claim 1, wherein the query text is classified in advance by a text classification network, the search range is effectively narrowed by category screening, and retrieval is then performed within the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model, yielding the text retrieval result, specifically:
extracting the feature vector of the input query text with a doc2vec-based text-embedding feature extraction model;
inputting the query text into a text classification network built around an LSTM core component to obtain the classified text category;
and retrieving within the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model.
5. The method as claimed in claim 1, wherein the obtained text retrieval result is input into a similar intellectual property recommendation model and a similar result is recommended, specifically:
in the recommendation model, the relevance is defined as the mutual information of two words x, y:
$$I(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$
the stronger the relevance of the two words, the larger the mutual information value, which usually indicates that the two words often appear in the same sentence and often collocate; training uses the skip-gram model in word2vec and its optimized acceleration module to model the probability that word $w_k$ occurs given that word $w_i$ occurs, namely:
$$p(w_k \mid w_i).$$
6. The method according to claim 1, wherein the multi-modal result fusion of the image retrieval result, the text retrieval result, and the similar recommendation result is specifically:
the top k_visual text results R_v whose images are similar to those in the search library are retrieved by the hierarchical deep image retrieval model, and the top k_text results R_t semantically similar to the text description are retrieved by the text semantic retrieval model; the first h_similar entries of R_t are then fed into the similar intellectual property recommendation model, and the first k_similar recommendations are taken as the similar result R_s; the three results are then fused into the fused text result R_b, and R_b is re-ranked against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
7. A hierarchical multi-modal intellectual property search engine system, the system comprising:
the image preprocessing unit is used for screening the input images and the text fields from the intellectual property database and processing the images into a uniform size;
an image retrieval unit for extracting deep image features from the query picture with the image deep-feature extraction network, obtaining the query picture's binary code through the hash coding network and a binarization operation, performing a coarse retrieval with the hash values, performing a fine retrieval based on the deep image features after taking the first K results, and finally obtaining the retrieval result R_v of the hierarchical deep image retrieval model;
a text retrieval unit for classifying the query text in advance with a text classification network, effectively narrowing the search range by category screening, and then retrieving within the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model;
the text recommendation unit is used for inputting the obtained text retrieval into a similar intellectual property recommendation model and recommending a similar result;
a multi-modal fusion unit for fusing the obtained image retrieval result, text retrieval result, and similar recommendation result into the fused text result R_b, and re-ranking R_b against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
8. The system as claimed in claim 7, wherein the image preprocessing unit processes input intellectual property images such as flow charts and network structure diagrams to a uniform size by center proportional cropping and proportional scaling, and applies data augmentation such as a random rotation angle to each input image.
9. The system of claim 7, wherein the image retrieval unit is configured to input the input image into a ResNet model pre-trained on the image classification data set ImageNet, extract the visual features of the clothing image, input the extracted visual features into a hash coding network for hash coding, update the parameters of the whole network during error back-propagation, perform a coarse retrieval in Hamming space and a fine retrieval in the image feature space, and sort the retrieval results by the Euclidean distance between deep image features, thereby obtaining more accurate retrieval results.
10. The system of claim 7, wherein the text retrieval unit uses a doc2vec-based text-embedding feature extraction model to extract the feature vectors of the input query text, inputs the query text into a text classification network built around an LSTM core component to obtain the classified text category, and retrieves the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model.
11. The system of claim 7, wherein the text recommendation unit defines relevance in the recommendation model as the mutual information of two words x and y, where stronger relevance of the two words yields a larger mutual information value, usually indicating that the two words often appear in the same sentence and often collocate, and trains with the skip-gram model in word2vec and its h-softmax acceleration module to model the probability that word $w_k$ occurs given that word $w_i$ occurs.
12. The system of claim 7, wherein the multi-modal fusion unit retrieves the top k_visual text results R_v whose images are similar to those in the search library using the hierarchical deep image retrieval model, retrieves the top k_text results R_t semantically similar to the text description using the text semantic retrieval model, then feeds the first h_similar entries of R_t into the similar intellectual property recommendation model and takes the first k_similar recommendations as the similar result R_s, fuses the three results into the fused text result R_b, and re-ranks R_b against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
CN202111531155.4A 2021-12-15 2021-12-15 Hierarchical multi-modal intellectual property search engine method and system Pending CN114461890A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111531155.4A CN114461890A (en) 2021-12-15 2021-12-15 Hierarchical multi-modal intellectual property search engine method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111531155.4A CN114461890A (en) 2021-12-15 2021-12-15 Hierarchical multi-modal intellectual property search engine method and system

Publications (1)

Publication Number Publication Date
CN114461890A true CN114461890A (en) 2022-05-10

Family

ID=81405143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111531155.4A Pending CN114461890A (en) 2021-12-15 2021-12-15 Hierarchical multi-modal intellectual property search engine method and system

Country Status (1)

Country Link
CN (1) CN114461890A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115269882A (en) * 2022-09-28 2022-11-01 山东鼹鼠人才知果数据科技有限公司 Intellectual property retrieval system and method based on semantic understanding
CN115269882B (en) * 2022-09-28 2022-12-30 山东鼹鼠人才知果数据科技有限公司 Intellectual property retrieval system and method based on semantic understanding
CN116244306A (en) * 2023-01-10 2023-06-09 江苏理工学院 Academic paper quotation recommendation method and system based on knowledge organization semantic relation
CN116244306B (en) * 2023-01-10 2023-11-03 江苏理工学院 Academic paper quotation recommendation method and system based on knowledge organization semantic relation
CN116932731A (en) * 2023-09-18 2023-10-24 上海帜讯信息技术股份有限公司 Multi-mode knowledge question-answering method and system for 5G message
CN116932731B (en) * 2023-09-18 2024-01-30 上海帜讯信息技术股份有限公司 Multi-mode knowledge question-answering method and system for 5G message

Similar Documents

Publication Publication Date Title
Yang et al. Visual sentiment prediction based on automatic discovery of affective regions
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN112256939B (en) Text entity relation extraction method for chemical field
CN111324765A (en) Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
CN110008365B (en) Image processing method, device and equipment and readable storage medium
CN112417097A (en) Multi-modal data feature extraction and association method for public opinion analysis
CN112163114B (en) Image retrieval method based on feature fusion
Patel et al. Dynamic lexicon generation for natural scene images
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN113157859A (en) Event detection method based on upper concept information
CN116304066A (en) Heterogeneous information network node classification method based on prompt learning
CN113076483A (en) Case element heteromorphic graph-based public opinion news extraction type summarization method
CN114611491A (en) Intelligent government affair public opinion analysis research method based on text mining technology
Fan et al. A hierarchical Dirichlet process mixture of generalized Dirichlet distributions for feature selection
CN112256904A (en) Image retrieval method based on visual description sentences
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN115168590A (en) Text feature extraction method, model training method, device, equipment and medium
CN111581964A (en) Theme analysis method for Chinese ancient books
CN112307364B (en) Character representation-oriented news text place extraction method
Siddiqui et al. A survey on automatic image annotation and retrieval
CN117390299A (en) Interpretable false news detection method based on graph evidence
Al-Tameemi et al. Multi-model fusion framework using deep learning for visual-textual sentiment classification
Vijayaraju Image retrieval using image captioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination