CN114461890A - Hierarchical multi-modal intellectual property search engine method and system - Google Patents

Hierarchical multi-modal intellectual property search engine method and system

Info

Publication number
CN114461890A
CN114461890A (application CN202111531155.4A)
Authority
CN
China
Prior art keywords
text
retrieval
image
result
intellectual property
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111531155.4A
Other languages
Chinese (zh)
Inventor
周凡
苏志宏
林谋广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Ronggu Innovation Industrial Park Co ltd
Sun Yat Sen University
Original Assignee
Guangdong Ronggu Innovation Industrial Park Co ltd
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Ronggu Innovation Industrial Park Co ltd and Sun Yat Sen University
Priority to CN202111531155.4A
Publication of CN114461890A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a hierarchical multi-modal intellectual property search engine method comprising the following steps: preprocessing an input image; inputting the input image into a hierarchical deep image retrieval model to obtain an image retrieval result; inputting a text field in the data set into a text semantic retrieval model to obtain a text retrieval result; inputting the text retrieval result into a similar intellectual property recommendation model to obtain a similar recommendation result; and performing multi-modal result fusion of the image retrieval result, the text retrieval result, and the similar recommendation result to obtain a fused text result, which is re-ranked against the query text input by the user to obtain the final retrieval result. The invention also discloses a hierarchical multi-modal intellectual property search engine system. Through the hierarchical deep image retrieval model and the text semantic retrieval model, the invention increases retrieval speed while maintaining retrieval precision, and compared with paper-search methods the scheme better expresses users' retrieval requirements.

Description

Hierarchical multi-modal intellectual property search engine method and system
Technical Field
The invention relates to multi-modal search and deep learning, in particular to a hierarchical multi-modal intellectual property search engine method and system.
Background
In the big data era, artificial intelligence is widely applied across industries. For intellectual property retrieval, the knowledge network is large in scale and complex in its connections, and its knowledge nodes are heterogeneous. Faced with massive amounts of information, retrieval based on classified directories and keywords finds it increasingly difficult to meet users' search needs, and retrieval urgently needs to be upgraded from the word level to the semantic level. Researching and developing high-growth, high-timeliness, multi-modal intellectual property hypergraph network modeling technology makes it possible to accurately capture the real intention behind the sentence a user inputs and to retrieve accordingly, returning to the user the results that best meet their needs.
Intellectual property retrieval is a technical application in which a user inputs a text segment and the system returns the search results that best meet the user's needs. Most existing applications perform intellectual property retrieval in a text-only search mode. However, because this technique is single-modal, the user's segment input often struggles to express the retrieval requirements accurately. A multi-modal modeling technique is then required to accurately capture the user's true intention.
Multi-modal intellectual property search applies multi-modal retrieval technology to intellectual property search: the user can submit different types of input, such as searching for pictures with text or searching for text with drawings. In intellectual property search, the user can input both a search field and a picture related to the search intention, and the search engine combines the two types of input to obtain the results that best meet the user's needs. However, limited by the state of related artificial intelligence technologies, such engines find it difficult to perform fusion analysis of the two different input types, and the returned results often carry a certain bias that affects the final retrieval quality.
One existing technique generates related search results by analyzing the user's search terms: it acquires the search term input by the user, determines the user's demand type from the search term, and determines the corresponding guidance policy according to that type; it then generates related paper search results according to the guidance policy and the search term, displays them on a search result page, and provides that page to the user. The disadvantage of this approach is that it is single-modal, and the user's segment input often struggles to express the retrieval requirements accurately.
The second prior art is the design and implementation of a Chinese knowledge search system based on an encyclopedia. It realizes knowledge search over encyclopedia entities through word segmentation, part-of-speech tagging, synonym conversion, question-word conversion, core entity identification, retrieval, result re-ranking, and similar steps. Its disadvantage is that the similarity between the query text and the text to be retrieved is computed through synonym conversion and similar means rather than in a word-vector embedding feature space; it is realized through the encyclopedia's synonym vocabulary and requires the intervention of the encyclopedia knowledge base.
Disclosure of Invention
The invention aims to overcome the defects of the existing methods and provides a hierarchical multi-modal intellectual property search engine method and system. It addresses two main problems. First, the prior art determines the user's demand type from the acquired search term and chooses a guidance policy accordingly, but that technique is single-modal, and the user's segment input struggles to express the retrieval requirements accurately. Second, the existing encyclopedia-based Chinese knowledge search system computes the similarity between the query text and the text to be retrieved through synonym conversion and similar means, which must be realized through the encyclopedia's synonym vocabulary and requires the intervention of the encyclopedia knowledge base.
In order to solve the above problems, the present invention provides a hierarchical multi-modal intellectual property search engine method, including:
screening an input image and a text field from an intellectual property database, and processing the image into a uniform size;
inputting an input image from the intellectual property data set into a hierarchical deep image retrieval model to obtain an image retrieval result: for a query picture, first extract deep image features with an image deep-feature extraction network, then obtain the query picture's binary code through a hash coding network and a binarization operation, perform a coarse retrieval with the hash values, take the first K results, perform a fine retrieval based on the deep image features, and finally obtain the image retrieval result R_v of the hierarchical deep image retrieval model;
inputting text fields from the intellectual property data set into a text semantic retrieval model to obtain a text retrieval result: the query text is first classified by a text classification network, category screening effectively narrows the search range, and retrieval is then performed within the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model;
inputting the obtained text retrieval result into a similar intellectual property recommendation model to obtain a similar recommendation result;
performing multi-modal result fusion of the obtained image retrieval result, text retrieval result, and similar recommendation result to form the fused text result R_b, and re-ranking R_b against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
Preferably, the input image and the text field are screened from the intellectual property database, and the image is processed into a uniform size, specifically:
the images in the input intellectual property, such as flow charts and network structure diagrams, are processed to a uniform size by center proportional cropping and proportional scaling, and data augmentation such as a random rotation angle is applied to each input image.
Preferably, the input image from the intellectual property data set is input into the hierarchical deep image retrieval model to obtain an image retrieval result: deep image features of the query picture are extracted by the image deep-feature extraction network, the query picture's binary code is obtained through the hash coding network and binarization, a coarse retrieval is performed with the hash values, the first K results are taken for a fine retrieval based on the deep image features, and the retrieval result R_v of the hierarchical deep image retrieval model is finally obtained, specifically:
selecting ResNet-50 as a skeleton network of the model, inputting an input image into a ResNet model pre-trained on an image classification data set ImageNet, and extracting visual characteristics of the clothing image;
inputting the extracted visual features into a hash coding network for hash coding: the high-dimensional image features extracted by the feature extraction network are input into a fully connected layer that outputs n-dimensional binary-like values, which are converted into hash features by a binarization operation; during error back-propagation, the parameters of the whole network, including the image deep-feature extraction network, are updated so the weights better fit the hash coding task, with the loss function as follows:
$$L = \sum_{i=1}^{N} \left[ \frac{1}{2}\, s_i \left\| h_{i,1} - h_{i,2} \right\|_2^2 + \frac{1}{2}\,(1 - s_i)\, \max\!\left(t - \left\| h_{i,1} - h_{i,2} \right\|_2^2,\ 0\right) + \alpha \left( \left\| \left| h_{i,1} \right| - \mathbf{1} \right\|_1 + \left\| \left| h_{i,2} \right| - \mathbf{1} \right\|_1 \right) \right]$$
where N is the number of image pairs selected in a batch during training; $h_{i,1}, h_{i,2}$ denote the network outputs (binary-like feature representations) of the two images in the i-th pair; $s_i$ indicates whether the two images in the i-th pair are similar, i.e., in the data set, whether the two clothing images belong to the same category (1 if similar, 0 otherwise); t is a margin threshold parameter; and α is a regularization strength parameter.
The first two terms of the loss function push the binary-like codes generated for pictures of the same category as close together as possible and those of different categories as far apart as possible; the regularization term drives $h_{i,1}, h_{i,2}$ toward -1 or 1 during loss optimization, producing near-binary outputs. Finally, binarization with 0 as the dividing point yields a standard binary output;
a coarse retrieval is carried out in Hamming space: in the coarse retrieval stage, the query picture passes through the network to obtain an n-dimensional binary representation $b_q$; with $b_i$ the binary representation of any garment in the database, the clothing items in the database are ordered by the following Hamming distance:
$$dist_h(b_q, b_i) = \sum_{j=1}^{n} \mathbb{1}\!\left[ b_q^{(j)} \neq b_i^{(j)} \right]$$
a fine retrieval is carried out in the image feature space: in the fine retrieval stage, the top-K results from the coarse stage are taken; the ResNet output of the query picture is denoted $r_q$ and the ResNet output of any of the top-K coarse results is denoted $r_i$, and the results are sorted by the Euclidean distance between deep image features:
$$dist_f(r_q, r_i) = \left\| r_q - r_i \right\|_2$$
thereby obtaining more accurate retrieval results.
Preferably, the query text is classified in advance by a text classification network, the search range is effectively narrowed by category screening, and retrieval is then performed within the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model, yielding the text retrieval result, specifically:
a doc2vec-based text-embedding feature extraction model extracts the feature vector of the input query text; a text classification network built around an LSTM core component classifies the query text into a category; and retrieval is performed within the screened intellectual property via the Euclidean distance between document feature vectors from the text-embedding feature extraction model.
Preferably, the text retrieval result obtained above is input into a similar intellectual property recommendation model and a similar result is recommended, specifically:
in the recommendation model, the relevance is defined as the mutual information of two words x, y:
$$I(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$
the stronger the relevance of the two words, the larger the mutual information value, which usually shows that the two words often appear in the same sentence and often match. Training is carried out by using a skip-gram model in word2vec and an optimized acceleration module thereof, namely h-softmax, so as to obtain the word wiThe word w in the case of occurrencekProbability modeling of occurrence, namely:
$$p(w_k \mid w_i)$$
Preferably, the multi-modal result fusion of the obtained image retrieval result, text retrieval result, and similar recommendation result is specifically:
the top k_visual text results R_v whose images are similar to those in the search library are retrieved by the hierarchical deep image retrieval model, and the top k_text results R_t semantically similar to the text description are retrieved by the text semantic retrieval model; the first h_similar entries of R_t are then fed into the similar intellectual property recommendation model, and the first k_similar recommendations are taken as the similar result R_s. Finally, the three results are fused into the fused text result R_b, and R_b is re-ranked against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
Correspondingly, the invention also provides a hierarchical multi-modal intellectual property search engine system, which comprises:
the image preprocessing unit is used for screening the input images and the text fields from the intellectual property database and processing the images into a uniform size;
an image retrieval unit for inputting the input image in the intellectual property data set into the hierarchical depth image retrieval model to obtain an image retrieval result, extracting the deep features of the image by using an image deep feature extraction network for the query picture, then obtaining the binary code of the query picture through a Hash coding network and binarization operation, carrying out coarse retrieval by using the Hash value, taking the first K results, carrying out fine retrieval based on the deep features of the image, and finally obtaining the retrieval result R of the hierarchical depth image retrieval modelv
a text retrieval unit for inputting text fields from the intellectual property data set into the text semantic retrieval model to obtain a text retrieval result: the query text is first classified by a text classification network, category screening effectively narrows the search range, and the screened intellectual property is then retrieved via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model;
the text recommendation unit is used for inputting the obtained text retrieval into a similar intellectual property recommendation model and recommending a similar result;
a multi-modal fusion unit for fusing the obtained image retrieval result, text retrieval result, and similar recommendation result into the fused text result R_b, and re-ranking R_b against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
The implementation of the invention has the following beneficial effects:
the method applies a Chinese text semantic retrieval model, and utilizes a text classifier to classify in advance to reduce the retrieval range, so that the text is represented better, and the retrieval speed and precision are improved; the invention designs a hierarchical depth image retrieval model, extracts the deep features of the image captured by the network through the deep features of the image, generates simple Hash features by utilizing a Hash coding network, performs rough retrieval on the Hash features and then performs fine retrieval on the deep features, effectively improves the retrieval speed and simultaneously keeps the retrieval precision with equivalent effect; compared with the traditional visual characteristic characterization method, the method has wider adaptability in the pre-training process by using the pre-training and fine-tuning scheme; the similar intellectual property recommendation model designed by the invention adopts a probability-driven method to quantify the similarity, and can search a search result close to the user intention; according to the multi-mode fused intellectual property search engine system designed by the invention, a user can input the query text and the related query picture, the search engine can comprehensively consider the two types of input and return the retrieval result, and the multi-mode input mode can more accurately capture the search intention of the user.
Drawings
FIG. 1 is a general flow diagram of a hierarchical multi-modal intellectual property search engine method of an embodiment of the present invention;
FIG. 2 is a block diagram of a hierarchical multi-modal intellectual property search engine system in accordance with an embodiment of the present invention;
FIG. 3 is a flowchart of hierarchical deep image retrieval according to an embodiment of the present invention;
FIG. 4 is a text semantic retrieval flow diagram according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a general flow diagram of a hierarchical multi-modal intellectual property search engine method according to an embodiment of the present invention; as shown in FIG. 1, the method comprises:
s1, screening the input image and the text field from the intellectual property database, and processing the image into a uniform size;
S2, inputting the input image from the intellectual property data set into the hierarchical deep image retrieval model to obtain the image retrieval result: first extract deep image features of the query picture with the image deep-feature extraction network, then obtain the query picture's binary code through the hash coding network and binarization, perform a coarse retrieval with the hash values, take the first K results, perform a fine retrieval based on the deep image features, and finally obtain the retrieval result R_v of the hierarchical deep image retrieval model;
S3, inputting the text field from the intellectual property data set into a text semantic retrieval model to obtain a text retrieval result: classify the query text in advance with a text classification network, effectively narrow the search range by category screening, and then retrieve the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model;
S4, inputting the obtained text retrieval result into a similar intellectual property recommendation model and recommending a similar result;
and S5, performing multi-modal result fusion of the obtained image retrieval result, text retrieval result, and similar recommendation result.
Step S1 is specifically as follows:
S1-1, original images in the intellectual property, such as flow charts and network structure diagrams, are input and processed to a uniform size by center proportional cropping and proportional scaling, and data augmentation such as a random rotation angle is applied to each input image.
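As an illustration of this preprocessing step, the sketch below uses PyTorch/torchvision; the 224x224 target size, the 256-pixel short-side resize, and the 15-degree rotation range are assumptions for demonstration, since the patent specifies only center proportional cropping, proportional scaling, and random-rotation augmentation.

```python
# Minimal preprocessing sketch (sizes and rotation range are assumed).
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),         # proportional scaling of the shorter side
    transforms.CenterCrop(224),     # center crop to a uniform size
    transforms.RandomRotation(15),  # random rotation angle augmentation
    transforms.ToTensor(),          # convert the PIL image to a float tensor
])
```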
Step S2, as shown in fig. 3, is as follows:
The original image is input into a ResNet model pre-trained on the image classification data set ImageNet, and the visual features of the clothing image are extracted. ResNet-50 is selected as the model's skeleton network; it has a 50-layer structure, the feature dimensionality of the last layer's output is 2048, and the final fully connected layer of the original network is not included;
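A minimal PyTorch sketch of this backbone follows: ResNet-50 pre-trained on ImageNet with its classification head dropped, so the network outputs the 2048-dimensional deep feature described above.

```python
import torch
import torchvision.models as models

# ResNet-50 pre-trained on ImageNet; the original classification head is
# replaced with an identity so the output is the 2048-d feature vector.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

with torch.no_grad():
    feats = backbone(torch.randn(1, 3, 224, 224))  # feats.shape == (1, 2048)
```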
The extracted visual features are input into the hash coding network for hash coding: the high-dimensional image features extracted by the feature extraction network are fed into a fully connected layer that outputs n-dimensional binary-like values, which are finally converted into hash features by binarization. During error back-propagation, the parameters of the whole network, including the image deep-feature extraction network, are updated so the weights better fit the hash coding task. The loss function is as follows:
$$L = \sum_{i=1}^{N} \left[ \frac{1}{2}\, s_i \left\| h_{i,1} - h_{i,2} \right\|_2^2 + \frac{1}{2}\,(1 - s_i)\, \max\!\left(t - \left\| h_{i,1} - h_{i,2} \right\|_2^2,\ 0\right) + \alpha \left( \left\| \left| h_{i,1} \right| - \mathbf{1} \right\|_1 + \left\| \left| h_{i,2} \right| - \mathbf{1} \right\|_1 \right) \right]$$
where N is the number of image pairs selected in a batch during training; $h_{i,1}, h_{i,2}$ denote the network outputs (binary-like feature representations) of the two images in the i-th pair; $s_i$ indicates whether the two images in the i-th pair are similar, i.e., in the data set, whether the two clothing images belong to the same category (1 if similar, 0 otherwise); t is a margin threshold parameter; and α is a regularization strength parameter.
When the two images are of the same category, the first term of the loss function, $\tfrac{1}{2} s_i \lVert h_{i,1} - h_{i,2} \rVert_2^2$, penalizes image pairs whose binary-like outputs are dissimilar; when the two images are of different categories, the second term, $\tfrac{1}{2}(1 - s_i)\max(t - \lVert h_{i,1} - h_{i,2} \rVert_2^2,\, 0)$, penalizes image pairs whose binary-like outputs are similar, where t is the distance expected between the network outputs of two images of different categories. The first two terms thus push the binary-like codes generated for pictures of the same category as close together as possible and those of different categories as far apart as possible. The regularization term $\alpha(\lVert |h_{i,1}| - \mathbf{1} \rVert_1 + \lVert |h_{i,2}| - \mathbf{1} \rVert_1)$ drives $h_{i,1}, h_{i,2}$ toward -1 or 1 during loss optimization, so the network produces near-binary outputs. Finally, binarization with 0 as the dividing point yields a standard binary output.
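A minimal PyTorch sketch of this pairwise loss, as reconstructed above, is given below; averaging over the batch rather than summing, and the values of t and alpha, are assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_hash_loss(h1, h2, s, t=2.0, alpha=0.01):
    """Pairwise hashing loss over binary-like codes h1, h2 of shape (N, n_bits).

    s is 1.0 for same-category pairs and 0.0 otherwise; t is the margin
    threshold and alpha the regularization strength (illustrative values).
    """
    d = (h1 - h2).pow(2).sum(dim=1)       # squared L2 distance per pair
    same = 0.5 * s * d                    # pull same-category codes together
    diff = 0.5 * (1 - s) * F.relu(t - d)  # push different-category codes apart
    reg = alpha * ((h1.abs() - 1).abs().sum(dim=1)
                   + (h2.abs() - 1).abs().sum(dim=1))  # drive outputs toward +/-1
    return (same + diff + reg).mean()     # batch mean (the patent sums over i)
```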
S2-3, performing rough search in Hamming space. In the rough retrieval stage, the query picture is output through the network to obtain n-dimensional binary representation bqBinary representation of any garment in the database biThe clothing items in the database are ordered according to the following hamming distances:
$$dist_h(b_q, b_i) = \sum_{j=1}^{n} \mathbb{1}\!\left[ b_q^{(j)} \neq b_i^{(j)} \right]$$
S2-4, a fine retrieval is performed in the image feature space: in the fine retrieval stage, the top-K results from the coarse stage are taken; the ResNet output of the query picture is denoted $r_q$ and the ResNet output of any of the top-K coarse results is denoted $r_i$, and the results are sorted by the Euclidean distance between deep image features:
$$dist_f(r_q, r_i) = \left\| r_q - r_i \right\|_2$$
thereby obtaining more accurate retrieval results.
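The two-stage search can be sketched as follows in NumPy; the candidate count K and the final result count are illustrative parameters.

```python
import numpy as np

def coarse_to_fine_search(b_q, B, r_q, R, K=100, topk=10):
    """Hamming-coarse then Euclidean-fine retrieval sketch.

    b_q: (n_bits,) query binary code; B: (M, n_bits) database binary codes;
    r_q: (d,) query deep feature; R: (M, d) database deep features.
    """
    hamming = (B != b_q).sum(axis=1)              # coarse: Hamming distance to every item
    cand = np.argsort(hamming)[:K]                # keep the top-K coarse candidates
    eucl = np.linalg.norm(R[cand] - r_q, axis=1)  # fine: Euclidean distance on deep features
    return cand[np.argsort(eucl)[:topk]]          # indices of the final ranked results
```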
Step S3, as shown in fig. 4, is as follows:
S3-1: the query text is input into the text-embedding feature extraction model to extract the text's feature vector. The text-embedding feature extraction model of this method is based on doc2vec, whose training goal is to maximize the average log probability of predicting the current word:
$$\frac{1}{N} \sum_{i=k}^{N-k} \log p\!\left( w_i \mid w_{i-k}, \ldots, w_{i+k} \right)$$
where N is the document length, k is half the window size, and $w_i$ are words. The probability p is obtained via softmax; the pre-normalization score is:
$$p_u = b + K f\!\left( w_{i-k}, \ldots, w_{i+k}, \textit{para};\, W, D \right)$$
where b and K are softmax parameters, and f is an intermediate vector representation obtained by concatenating or averaging the word vectors extracted from W and the document vector extracted from D; this vector is then used to predict the next word.
doc2vec has another framework, PV-DBOW, which is the inverse of PV-DM: it predicts the words in a document window from the document vector, regardless of context word order. The text-embedding feature extraction model constructed by this method is realized on the basis of doc2vec and modified to make full use of both variants. PV-DM considers word order while capturing semantic information, and PV-DBOW stores less data (no word-vector matrix W is required); so, to exploit both advantages and obtain a more accurate and stable vector representation, the document vectors from the two models are combined: the two vectors of the same document are concatenated, yielding a higher-dimensional vector with higher document discriminability.
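A sketch of this combined embedding using the gensim library is shown below; the toy corpus, vector sizes, window, and epoch counts are illustrative assumptions, and gensim's Doc2Vec is one possible realization of PV-DM/PV-DBOW.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus standing in for the intellectual property text fields (assumed).
raw_documents = ["hash coding network for image retrieval",
                 "lstm text classification of patent abstracts"]
corpus = [TaggedDocument(words=doc.split(), tags=[i])
          for i, doc in enumerate(raw_documents)]

dm = Doc2Vec(corpus, dm=1, vector_size=100, window=5, min_count=1, epochs=20)    # PV-DM
dbow = Doc2Vec(corpus, dm=0, vector_size=100, window=5, min_count=1, epochs=20)  # PV-DBOW

def embed(tokens):
    # Concatenate the two inferred vectors into one higher-dimensional,
    # more discriminative document representation, as described above.
    return np.concatenate([dm.infer_vector(tokens), dbow.infer_vector(tokens)])

vec = embed("multi modal patent search".split())  # 200-d document vector
```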
And S3-2, inputting the query text into the text classification network to classify the text type. According to the method, an LSTM is used as a core component to construct a text classification network, and the whole network comprises an Embellding layer, a spatialdropout1d layer, an LSTM layer and an FC layer. Firstly, data cleaning is carried out on an input query text, a vocabulary is constructed, a document is labeled according to the vocabulary and constructed into a 250-dimensional label vector, and then the vector is input into an embedding layer, so that a 100-dimensional document embedded expression containing 250 timestamps is obtained. In order to properly reduce the dependency between each timeester, a Spatialropout1D layer is used for processing, then the document is embedded into the input LSTM, because a sequence does not need to be generated, and the final goal is classification, a many-to-one structure is adopted in the LSTM layer, only the output of the last timeester is taken, and the 100-dimensional output is obtained. And finally, generating an n-dimensional vector through a full connection layer, namely an FC layer, wherein n represents the number of clothing categories, which is a multi-classification task, so that the output is processed by adopting a softmax activation function, and finally the intellectual property category to which the query text belongs is obtained. .
S3-3, an in-class vector retrieval is performed on the results output by S3-1 and S3-2: the screened intellectual property is retrieved via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model, with the Euclidean distance computed as $dist_f(r_q, r_i) = \lVert r_q - r_i \rVert_2$ above;
step S4 is specifically as follows:
S4-1, in the recommendation model, relevance is first defined as the mutual information of two words x, y:
$$I(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$
the stronger the relevance of two words, the larger the mutual information value, which usually shows that two words often appear in the same sentence and often match. To solve the above formula, firstly, the word w is alignediThe word w in the case of occurrencekProbability modeling of occurrence, namely:
$$p(w_k \mid w_i)$$
to do this, the skip-gram model in word2vec and its optimized acceleration module, namely h-softmax, are the best choices.
H-softmax accelerates the training process because words are encoded with a Huffman tree: high-frequency words get short paths and low-frequency words long paths, which effectively compresses the dictionary and speeds up the probability computation. In the Huffman tree structure, each leaf node represents a word. Taking word $w_2$ as an example, let the intermediate nodes from the root to $w_2$ be $m(w_2,1)$, $m(w_2,2)$, $m(w_2,3)$; then the probability of predicting $w_2$ from input $w_i$ is the product of the probabilities along the path through these intermediate nodes:
$$p(w_2 \mid w_i) = p\!\left(m(w_2,1), \text{left}\right) \cdot p\!\left(m(w_2,2), \text{left}\right) \cdot p\!\left(m(w_2,3), \text{right}\right)$$
The probability of walking down from an intermediate node m(w, j) in direction d (left or right) takes the standard hierarchical-softmax form

$$p\!\left(m(w,j), d\right) = \sigma\!\left( \llbracket d \rrbracket \cdot \theta_{m(w,j)}^{\top} v_{w_i} \right)$$

where $\theta_{m(w,j)}$ is the parameter vector of the intermediate node, $v_{w_i}$ is the input word vector, and the sign function indicates whether the walk at node m(w, j) goes left or right:

$$\llbracket d \rrbracket = \begin{cases} +1, & d = \text{left} \\ -1, & d = \text{right} \end{cases}$$
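As a sketch, the skip-gram/h-softmax training and the mutual-information score can be realized with the gensim library and plain Python as below; the toy sentences and hyperparameters are assumptions.

```python
import math
from gensim.models import Word2Vec

# Toy corpus (assumed); in practice, the tokenized intellectual property texts.
sentences = [["image", "retrieval", "hash"], ["text", "retrieval", "semantic"]]

# Skip-gram (sg=1) with hierarchical softmax (hs=1, negative sampling off),
# i.e., the Huffman-coded output layer described above.
w2v = Word2Vec(sentences, sg=1, hs=1, negative=0,
               vector_size=100, window=5, min_count=1)

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information of a word pair, given estimated
    co-occurrence and marginal probabilities."""
    return math.log(p_xy / (p_x * p_y))
```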
step S5 is specifically as follows:
The top k_visual text results R_v whose images are similar to those in the retrieval library are retrieved by the hierarchical deep image retrieval model, and the top k_text results R_t semantically similar to the text description are retrieved by the text semantic retrieval model; the first h_similar entries of R_t are then fed into the similar intellectual property recommendation model, and the first k_similar recommendations are taken as the similar result R_s. Finally, the three results are fused into R_b, and R_b is re-ranked against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f, thereby realizing the fusion of the multi-modal intellectual property retrieval results.
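The fusion and re-ranking step can be sketched as follows; the result lists are assumed to be document ids and doc_vecs an id-to-embedding lookup, both illustrative.

```python
import numpy as np

def fuse_and_rerank(R_v, R_t, R_s, q_vec, doc_vecs, k_f=10):
    """Fuse the three result lists and re-rank by Euclidean distance in the
    text-embedding feature space. R_v, R_t, R_s: lists of document ids;
    q_vec: query text embedding; doc_vecs: id -> embedding map (assumed)."""
    R_b = list(dict.fromkeys(R_v + R_t + R_s))  # fused set, order-preserving dedup
    dists = [np.linalg.norm(doc_vecs[i] - q_vec) for i in R_b]
    order = np.argsort(dists)[:k_f]             # closest to the query text first
    return [R_b[j] for j in order]              # final result R_f
```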
Accordingly, the present invention also provides a hierarchical multi-modal intellectual property search engine system, as shown in fig. 2, comprising:
the image preprocessing unit 1 is used for screening input images and text fields from an intellectual property database and processing the images into a uniform size.
Specifically, images in the input intellectual property, such as flow charts and network structure diagrams, are processed to a uniform size by center proportional cropping and proportional scaling, and data augmentation such as a random rotation angle is applied to each input image.
An image retrieval unit 2 for inputting the input image from the intellectual property data set into the hierarchical deep image retrieval model to obtain the image retrieval result: deep image features of the query picture are extracted by the image deep-feature extraction network, the query picture's binary code is obtained through the hash coding network and binarization, a coarse retrieval is performed with the hash values, the first K results are taken for a fine retrieval based on the deep image features, and the image retrieval result R_v of the hierarchical deep image retrieval model is finally obtained.
Specifically, ResNet-50 is selected as the model's skeleton network; the input image is fed into a ResNet model pre-trained on the image classification data set ImageNet, and the visual features of the clothing image are extracted;
the extracted visual features are input into the hash coding network for hash coding: the high-dimensional image features extracted by the feature extraction network are fed into a fully connected layer that outputs n-dimensional binary-like values, which are converted into hash features by binarization; during error back-propagation, the parameters of the whole network, including the deep-feature extraction network, are updated so the weights better fit the hash coding task, with the loss function as follows:
$$L = \sum_{i=1}^{N} \left[ \frac{1}{2}\, s_i \left\| h_{i,1} - h_{i,2} \right\|_2^2 + \frac{1}{2}\,(1 - s_i)\, \max\!\left(t - \left\| h_{i,1} - h_{i,2} \right\|_2^2,\ 0\right) + \alpha \left( \left\| \left| h_{i,1} \right| - \mathbf{1} \right\|_1 + \left\| \left| h_{i,2} \right| - \mathbf{1} \right\|_1 \right) \right]$$
where N is the number of image pairs selected in a batch during training; $h_{i,1}, h_{i,2}$ denote the network outputs (binary-like feature representations) of the two images in the i-th pair; $s_i$ indicates whether the two images in the i-th pair are similar, i.e., in the data set, whether the two clothing images belong to the same category (1 if similar, 0 otherwise); t is a margin threshold parameter; and α is a regularization strength parameter.
The first two terms of the loss function push the binary-like codes generated for pictures of the same category as close together as possible and those of different categories as far apart as possible, so that during loss optimization $h_{i,1}, h_{i,2}$ approach -1 or 1, producing near-binary outputs; finally, binarization with 0 as the dividing point yields a standard binary output;
a coarse retrieval is carried out in Hamming space: in the coarse retrieval stage, the query picture passes through the network to obtain an n-dimensional binary representation $b_q$; with $b_i$ the binary representation of any garment in the database, the clothing items in the database are ordered by the following Hamming distance:
$$dist_h(b_q, b_i) = \sum_{j=1}^{n} \mathbb{1}\!\left[ b_q^{(j)} \neq b_i^{(j)} \right]$$
a fine retrieval is carried out in the image feature space: in the fine retrieval stage, the top-K results from the coarse stage are taken; the ResNet output of the query picture is denoted $r_q$ and the ResNet output of any of the top-K coarse results is denoted $r_i$, and the results are sorted by the Euclidean distance between deep image features:
$$dist_f(r_q, r_i) = \left\| r_q - r_i \right\|_2$$
thereby obtaining more accurate retrieval results.
A text retrieval unit 3 for classifying the query text in advance with a text classification network, effectively narrowing the search range by category screening, and then retrieving within the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model, yielding the text retrieval result.
Specifically, a doc2vec-based text-embedding feature extraction model extracts the feature vectors of the input query text; a text classification network built around an LSTM core component classifies the query text into a category; and retrieval is performed within the screened intellectual property via the Euclidean distance between document feature vectors from the text-embedding feature extraction model.
A text recommendation unit 4 for inputting the obtained text retrieval result into a similar intellectual property recommendation model and recommending a similar result.
Specifically, in the recommendation model, the relevance is defined as mutual information of two words x, y:
$$I(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$
the stronger the relevance of the two words, the larger the mutual information value, which usually shows that the two words often appear in the same sentence and often match. Training is carried out by using a skip-gram model in word2vec and an optimized acceleration module thereof, namely h-softmax, so as to obtain the word wiThe word w in the case of occurrencekProbability modeling of occurrence, namely:
$$p(w_k \mid w_i)$$
and a multi-modal fusion unit 5, configured to perform multi-modal result fusion on the obtained image retrieval result, the obtained text retrieval result, and the obtained similar recommendation result.
Specifically, the top k_visual text results R_v whose images are similar to those in the search library are retrieved by the hierarchical deep image retrieval model, and the top k_text results R_t semantically similar to the text description are retrieved by the text semantic retrieval model; the first h_similar entries of R_t are fed into the similar intellectual property recommendation model, and the first k_similar recommendations are taken as the similar result R_s. Finally, the three results are fused into the fused text result R_b, and R_b is re-ranked against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
Therefore, by using the hierarchical deep image retrieval model, which extracts deep image features with the deep-feature capture network, generates compact hash features with the hash coding network, and performs coarse retrieval on hash features followed by fine retrieval on deep features, the invention effectively improves retrieval speed while maintaining comparable retrieval precision. The text semantic retrieval model uses a text classifier to pre-classify and narrow the retrieval range, representing the text better. The similar intellectual property recommendation model quantifies similarity with a probability-driven method and can retrieve results close to the user's intention. Compared with traditional visual-feature characterization, the pre-training and fine-tuning scheme gives the pre-training process wider adaptability and can extract visual features that are both general and meaningful for the specific scene. In the multi-modal fused intellectual property search engine system, the user can input both a query text and a related query picture; the search engine considers the two types of input together and returns the retrieval result, and this multi-modal input mode captures the user's search intention more accurately.
The hierarchical multi-modal intellectual property search engine method and system provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is only intended to help understand the method and its core idea; meanwhile, a person skilled in the art may, according to the idea of the present invention, vary the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (12)

1. A hierarchical multi-modal intellectual property search engine method, the method comprising:
screening an input image and a text field from an intellectual property database, and processing the image into a uniform size;
inputting an input image from the intellectual property data set into a hierarchical deep image retrieval model to obtain an image retrieval result: for a query picture, first extracting deep image features with an image deep-feature extraction network, then obtaining the query picture's binary code through a hash coding network and a binarization operation, performing a coarse retrieval with the hash values, taking the first K results, performing a fine retrieval based on the deep image features, and finally obtaining the retrieval result R_v of the hierarchical deep image retrieval model;
inputting text fields from the intellectual property data set into a text semantic retrieval model to obtain a text retrieval result: classifying the query text in advance through a text classification network, effectively narrowing the search range by category screening, and then retrieving within the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model;
inputting the obtained text retrieval result into a similar intellectual property recommendation model and recommending a similar result;
performing multi-modal result fusion of the obtained image retrieval result, text retrieval result, and similar recommendation result to form the fused text result R_b, and re-ranking R_b against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
2. The method of claim 1, wherein the input images and text fields are screened from the intellectual property database and the images are processed to a uniform size, specifically:
processing input images in the intellectual property, such as flow charts and network structure diagrams, to a uniform size by center proportional cropping and proportional scaling, and applying data augmentation such as a random rotation angle to each input image.
3. The hierarchical multi-modal intellectual property search engine method according to claim 1, wherein the input image from the intellectual property data set is input into the hierarchical deep image retrieval model to obtain an image retrieval result: deep image features of the query picture are extracted by the deep-feature extraction network, the query picture's binary code is obtained through the hash coding network and binarization, a coarse retrieval is performed with the hash values, the first K results are taken for a fine retrieval based on the deep image features, and the retrieval result R_v of the hierarchical deep image retrieval model is finally obtained, specifically:
selecting ResNet-50 as a skeleton network of the model, inputting an input image into a ResNet model pre-trained on an image classification data set ImageNet, and extracting visual characteristics of the clothing image;
inputting the extracted visual features into a hash coding network for hash coding: the high-dimensional image features extracted by the feature extraction network are input into a fully connected layer that outputs n-dimensional binary-like values, which are converted into hash features by a binarization operation; during error back-propagation, the parameters of the whole network, including the deep-feature extraction network, are updated so the weights better fit the hash coding task, with the loss function as follows:
$$L = \sum_{i=1}^{N} \left[ \frac{1}{2}\, s_i \left\| h_{i,1} - h_{i,2} \right\|_2^2 + \frac{1}{2}\,(1 - s_i)\, \max\!\left(t - \left\| h_{i,1} - h_{i,2} \right\|_2^2,\ 0\right) + \alpha \left( \left\| \left| h_{i,1} \right| - \mathbf{1} \right\|_1 + \left\| \left| h_{i,2} \right| - \mathbf{1} \right\|_1 \right) \right]$$
where N is the number of image pairs selected in a batch during training; $h_{i,1}, h_{i,2}$ denote the network outputs (binary-like feature representations) of the two images in the i-th pair; $s_i$ indicates whether the two images in the i-th pair are similar, i.e., in the data set, whether the two clothing images belong to the same category (1 if similar, 0 otherwise); t is a margin threshold parameter; and α is a regularization strength parameter;
the first two terms of the loss function push the binary-like codes generated for pictures of the same category as close together as possible and those of different categories as far apart as possible, so that during loss optimization $h_{i,1}, h_{i,2}$ approach -1 or 1, producing near-binary outputs; finally, binarization with 0 as the dividing point yields a standard binary output;
carrying out a coarse retrieval in Hamming space: in the coarse retrieval stage, the query picture passes through the network to obtain an n-dimensional binary representation $b_q$; with $b_i$ the binary representation of any garment in the database, the clothing items in the database are ordered by the following Hamming distance:
$$dist_h(b_q, b_i) = \sum_{j=1}^{n} \mathbb{1}\!\left[ b_q^{(j)} \neq b_i^{(j)} \right]$$
searching in the image feature space: in the fine retrieval stage, the top-K results from the coarse stage are taken; the ResNet output of the query picture is denoted $r_q$ and the ResNet output of any of the top-K coarse results is denoted $r_i$, and the results are then sorted by the Euclidean distance between deep image features:
$$dist_f(r_q, r_i) = \left\| r_q - r_i \right\|_2$$
thereby obtaining more accurate retrieval results.
4. The method according to claim 1, wherein the query text is classified in advance by a text classification network, the search range is effectively narrowed by category screening, and retrieval is then performed within the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model, yielding the text retrieval result, specifically:
extracting the feature vector of the input query text with a doc2vec-based text-embedding feature extraction model;
inputting the query text into a text classification network built around an LSTM core component to obtain the classified text category;
and retrieving within the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model.
5. The method as claimed in claim 1, wherein the obtained text retrieval result is input into a similar intellectual property recommendation model and a similar result is recommended, specifically:
in the recommendation model, the relevance is defined as the mutual information of two words x, y:
$$I(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$
the stronger the relevance of the two words, the larger the mutual information value, which usually indicates that the two words often appear in the same sentence and often collocate; training uses the skip-gram model in word2vec and its optimized acceleration module to model the probability that word $w_k$ occurs given that word $w_i$ occurs, namely:
$$p(w_k \mid w_i).$$
6. The method according to claim 1, wherein the multi-modal result fusion of the image retrieval result, the text retrieval result, and the similar recommendation result is specifically:
the top k_visual text results R_v whose images are similar to those in the search library are retrieved by the hierarchical deep image retrieval model, and the top k_text results R_t semantically similar to the text description are retrieved by the text semantic retrieval model; the first h_similar entries of R_t are then fed into the similar intellectual property recommendation model, and the first k_similar recommendations are taken as the similar result R_s; the three results are then fused into the fused text result R_b, and R_b is re-ranked against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
7. A hierarchical multi-modal intellectual property search engine system, the system comprising:
the image preprocessing unit is used for screening the input images and the text fields from the intellectual property database and processing the images into a uniform size;
an image retrieval unit for extracting deep image features from the query picture with the image deep-feature extraction network, obtaining the query picture's binary code through the hash coding network and a binarization operation, performing a coarse retrieval with the hash values, performing a fine retrieval based on the deep image features after taking the first K results, and finally obtaining the retrieval result R_v of the hierarchical deep image retrieval model;
a text retrieval unit for classifying the query text in advance with a text classification network, effectively narrowing the search range by category screening, and then retrieving within the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model;
the text recommendation unit is used for inputting the obtained text retrieval into a similar intellectual property recommendation model and recommending a similar result;
a multi-modal fusion unit for fusing the obtained image retrieval result, text retrieval result, and similar recommendation result into the fused text result R_b, and re-ranking R_b against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
8. The system as claimed in claim 7, wherein the image preprocessing unit processes input intellectual property images such as flow charts and network structure diagrams to a uniform size by center proportional cropping and proportional scaling, and applies data augmentation such as a random rotation angle to each input image.
9. The system of claim 7, wherein the image retrieval unit is configured to input the input image into a ResNet model pre-trained on the image classification data set ImageNet, extract the visual features of the clothing image, input the extracted visual features into a hash coding network for hash coding, update the parameters of the whole network during error back-propagation, perform a coarse retrieval in Hamming space and a fine retrieval in the image feature space, and sort the retrieval results by the Euclidean distance between deep image features, thereby obtaining more accurate retrieval results.
10. The system of claim 7, wherein the text retrieval unit uses a doc2vec-based text-embedding feature extraction model to extract the feature vectors of the input query text, inputs the query text into a text classification network built around an LSTM core component to obtain the classified text category, and retrieves the screened intellectual property via the Euclidean distance between document feature vectors obtained from the text-embedding feature extraction model.
11. The system of claim 7, wherein the text recommendation unit defines relevance in the recommendation model as the mutual information of two words x and y, where stronger relevance of the two words yields a larger mutual information value, usually indicating that the two words often appear in the same sentence and often collocate, and trains with the skip-gram model in word2vec and its h-softmax acceleration module to model the probability that word $w_k$ occurs given that word $w_i$ occurs.
12. The system of claim 7, wherein the multi-modal fusion unit retrieves the top k_visual text results R_v whose images are similar to those in the search library using the hierarchical deep image retrieval model, retrieves the top k_text results R_t semantically similar to the text description using the text semantic retrieval model, then feeds the first h_similar entries of R_t into the similar intellectual property recommendation model and takes the first k_similar recommendations as the similar result R_s, fuses the three results into the fused text result R_b, and re-ranks R_b against the query text input by the user, by Euclidean distance in the text-embedding feature space, to obtain the final retrieval result R_f.
CN202111531155.4A 2021-12-15 2021-12-15 Hierarchical multi-modal intellectual property search engine method and system Pending CN114461890A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111531155.4A CN114461890A (en) 2021-12-15 2021-12-15 Hierarchical multi-modal intellectual property search engine method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111531155.4A CN114461890A (en) 2021-12-15 2021-12-15 Hierarchical multi-modal intellectual property search engine method and system

Publications (1)

Publication Number Publication Date
CN114461890A true CN114461890A (en) 2022-05-10

Family

ID=81405143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111531155.4A Pending CN114461890A (en) 2021-12-15 2021-12-15 Hierarchical multi-modal intellectual property search engine method and system

Country Status (1)

Country Link
CN (1) CN114461890A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115269882A (en) * 2022-09-28 2022-11-01 山东鼹鼠人才知果数据科技有限公司 Intellectual property retrieval system and method based on semantic understanding
CN115269882B (en) * 2022-09-28 2022-12-30 山东鼹鼠人才知果数据科技有限公司 Intellectual property retrieval system and method based on semantic understanding
CN116244306A (en) * 2023-01-10 2023-06-09 江苏理工学院 Academic paper quotation recommendation method and system based on knowledge organization semantic relation
CN116244306B (en) * 2023-01-10 2023-11-03 江苏理工学院 Academic paper quotation recommendation method and system based on knowledge organization semantic relation
CN116932731A (en) * 2023-09-18 2023-10-24 上海帜讯信息技术股份有限公司 Multi-mode knowledge question-answering method and system for 5G message
CN116932731B (en) * 2023-09-18 2024-01-30 上海帜讯信息技术股份有限公司 Multi-mode knowledge question-answering method and system for 5G message

Similar Documents

Publication Publication Date Title
Yang et al. Visual sentiment prediction based on automatic discovery of affective regions
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN112256939B (en) Text entity relation extraction method for chemical field
CN111324765A (en) Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
CN110008365B (en) Image processing method, device and equipment and readable storage medium
CN112417097A (en) Multi-modal data feature extraction and association method for public opinion analysis
CN112163114B (en) Image retrieval method based on feature fusion
Patel et al. Dynamic lexicon generation for natural scene images
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN113157859A (en) Event detection method based on upper concept information
CN116304066A (en) Heterogeneous information network node classification method based on prompt learning
CN113076483A (en) Case element heteromorphic graph-based public opinion news extraction type summarization method
CN114611491A (en) Intelligent government affair public opinion analysis research method based on text mining technology
Fan et al. A hierarchical Dirichlet process mixture of generalized Dirichlet distributions for feature selection
CN112256904A (en) Image retrieval method based on visual description sentences
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN115168590A (en) Text feature extraction method, model training method, device, equipment and medium
CN111581964A (en) Theme analysis method for Chinese ancient books
CN112307364B (en) Character representation-oriented news text place extraction method
Siddiqui et al. A survey on automatic image annotation and retrieval
CN117390299A (en) Interpretable false news detection method based on graph evidence
Al-Tameemi et al. Multi-model fusion framework using deep learning for visual-textual sentiment classification
Vijayaraju Image retrieval using image captioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination