CN112507109A - Retrieval method and device based on semantic analysis and keyword recognition - Google Patents
Retrieval method and device based on semantic analysis and keyword recognition
- Publication number
- CN112507109A (application CN202011442031.4A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- keywords
- vector
- weight
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides a retrieval method and a retrieval device based on semantic analysis and keyword recognition, which comprise the following steps: extracting patent keywords from the patent text by a Textrank algorithm to obtain a patent keyword data set, and performing vector conversion according to an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set; determining weights of titles, abstracts, first claims and technical effect sentences of patent texts through an analytic hierarchy process, matching keywords in index information from high weights to low weights according to the keywords to be retrieved to obtain a matching keyword vector set, inputting the matching keyword vector set into a weight model, calculating the weight values of corresponding patent texts, carrying out TOP-K sorting according to the weight values, and forming retrieval results and presenting the retrieval results to a user side. The invention can expand the coverage of related patents, and carry out semantic analysis and keyword recognition on the contents in the patent text, thereby improving the relevance of the retrieval result.
Description
Technical Field
The invention relates to the technical field of patent information, in particular to a retrieval method and a retrieval device based on semantic analysis and keyword recognition.
Background
The task of patent retrieval is to match the patent information that best meets a user's requirements according to the conditions the user provides. With the advent of the big data age, patent retrieval has become an important research topic in the field of information retrieval. Patent retrieval differs from traditional information retrieval tasks in that the retrieval object is a patent text, which has its own particularities. A patent text has many attributes, such as IPC classification numbers, claim numbers, technical efficacy, legal status and invention type, and making reasonable use of these data often requires specialized personnel. Patent texts also integrate many kinds of information, are technically sensitive, and cover a wide range of subjects and regions. These characteristics should be fully considered when constructing a patent retrieval model. A well-constructed model can improve retrieval efficiency, provide important help for scientific research and for social and economic activities, and promote the development of disciplines and the progress of science and technology.
The main existing patent retrieval approaches include the following:
(1) a patent retrieval method based on a topic model and a language model, which first constructs a candidate set containing the relevant patents found by the initial query, and then ranks the screened candidate patents based on the language model and the topic models (LDA and DMR), using a proposed weight evaluation standard as the ranking basis;
(2) a patent retrieval method based on citation (reference) relationships. Reasonable use of citation relationships reveals the associations between objects, the development of a technical route, and so on. For example, Fujii uses citation information between patents to calculate the correlation between patent documents, so as to expand the patent retrieval result. Starting from patent citation relationships, Mahdabi and Crestani construct a patent citation network and propose a time-aware PageRank algorithm based on that network;
(3) a patent retrieval method based on query expansion, the most commonly used method in the patent retrieval field, which mainly addresses the low recall caused by ambiguous or vague initial queries. For example, the query expansion method based on positional nearest neighbors uses IPC descriptions as an expansion dictionary to expand the query words; its main idea is to measure the closeness between candidate words and query words by their distance in the text;
(4) a patent retrieval method based on ontology, which applies ontology modeling to the description of patent information. By establishing a patent retrieval information association ontology, it resolves multiple descriptions of the same concept in the patent information database. It then associates the patent information ontology, its instances and the data in the patent database, and combines the self-organizing evolution process and method of the ontology to realize self-organizing optimization of patent retrieval information and ranking of patent resources.
However, approaches (1) and (2) have difficulty covering all related patents because: [1] different applicants, and even experts, may use different terms to describe the same technology; [2] applicants sometimes wish to keep a low profile and, to avoid drawing too much attention to their own patents, often choose rare words to describe their technology; [3] an immature technology has not been standardized during its development and has no uniform name; [4] there is no uniform standard for translating patents from different countries.
Approach (3) processes only the text of patent documents, but patent documents contain more than text and often include non-text information such as drawings and diagrams. Because the fields involved are so broad, this information cannot currently be processed and can only be ignored, even though it is important for understanding the content of patent documents. In addition, because the IPC classification scheme is very complex, some patent documents are not thoroughly classified, and patent documents that cannot be classified or that are cross-classified are occasionally encountered. The demands on training sets and classification algorithms are high, which greatly increases the difficulty of correcting these problems.
The rapid domain-ontology construction method provided by approach (4) depends on an existing, complete relational corpus. However, patent data covers every industry in society, the fields differ greatly, and finding a corpus for each field is almost impossible. Finding a suitable ontology construction method and integrating the fields with one another therefore remains a difficult task.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a search method and apparatus based on semantic analysis and keyword recognition.
A retrieval method based on semantic analysis and keyword recognition comprises the following steps:
acquiring search information, wherein the search information comprises keywords to be retrieved; acquiring a patent text from a patent database, extracting patent keywords from the patent text according to a Textrank algorithm, and acquiring a patent keyword data set; performing vector conversion on the patent keyword data set through an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set; determining relevant weights for each index information of a patent text through an analytic hierarchy process, matching keywords in the index information from high weights to low weights according to the keywords to be retrieved, and acquiring a matched keyword vector set, wherein the index information comprises: title, abstract, first claim and technical effect clauses; inputting the matched keyword vector set into a weight model, and calculating the weight value of the patent text corresponding to the matched keyword vector set; and carrying out TOP-K sorting according to the weight value of the patent text to form and present a retrieval result.
In one embodiment, extracting patent keywords from the patent text according to the Textrank algorithm to obtain a patent keyword data set specifically includes: segmenting the patent text and filtering out stop words to obtain candidate keywords; constructing a candidate keyword graph G = (V, E), wherein V is a node set consisting of the candidate keywords; constructing an edge between any two nodes by using the co-occurrence relation; iteratively updating the weight value of each node according to a weight update formula until the weight value of each node converges to within a range, whereupon the weight value obtained by the last update is taken as the weight value of the node; sorting the weight values of the nodes in descending order, wherein the keywords corresponding to the nodes ranked within a preset order are important keywords; and marking the important keywords in the corresponding patent texts, and constructing the patent keyword data set from the important keywords.
In one embodiment, the weight update formula is:
wherein V_i and V_j both represent nodes in the node set; WS(V_i) represents the weight value of node V_i; d is the damping coefficient, generally set to 0.85; In(V_i) represents the set of sentences in which keyword i occurs; Out(V_j) represents the set of sentences in which keyword j occurs; and the weight term ω_ji is the weight of the edge, i.e., the similarity between sentences.
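The formula itself is not reproduced in this text. For reference, the standard Textrank weight update, written with the symbols defined above, takes the following form. Treating it as the formula intended here is an assumption, and V_k is only a summation index introduced for the normalizing term.

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{\omega_{ji}}{\sum_{V_k \in Out(V_j)} \omega_{jk}} \, WS(V_j)
```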
In one embodiment, the vector conversion of the patent keyword data set by the Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set specifically includes: constructing a word vector conversion model through a double-layer biLSTM structure; training the word vector transformation model in a dataset; and inputting the patent keyword data set into the word vector conversion model, and outputting the word vectors of the corresponding patent keywords according to the patent keywords to obtain a patent keyword vector set.
In one embodiment, inputting the patent keyword data set into the word vector conversion model and outputting the corresponding patent keyword vectors to obtain a patent keyword vector set specifically includes: given a sentence that contains a corresponding keyword data set; looking up the word vectors E(1), ..., E(N) corresponding to the keywords in a static keyword vector table according to the keyword data set, and inputting them into the word vector conversion model, wherein the word vector conversion model comprises a first-layer forward LSTM, a first-layer backward LSTM, a second-layer forward LSTM and a second-layer backward LSTM; inputting the keyword vectors E(1), ..., E(N) into the first-layer forward LSTM and the first-layer backward LSTM respectively, to obtain forward outputs h(1,1,→), ..., h(N,1,→) and backward outputs h(1,1,←), ..., h(N,1,←); passing the forward outputs h(1,1,→), ..., h(N,1,→) into the second-layer forward LSTM to obtain second-layer forward outputs h(1,2,→), ..., h(N,2,→); passing the backward outputs h(1,1,←), ..., h(N,1,←) into the second-layer backward LSTM to obtain second-layer backward outputs h(1,2,←), ..., h(N,2,←); the word vectors finally available for keyword i include E(i), h(i,1,→), h(i,1,←), h(i,2,→) and h(i,2,←).
In one embodiment, the inputting the matching keyword vector set into a weight model, and calculating a weight value of the patent text corresponding to the matching keyword vector set specifically includes: presetting a keyword similarity threshold U; recording the times of keyword retrieval as n, and using x, y, z and h to count the number of words with keyword similarity larger than U; calculating the weight value of the corresponding keyword according to a weight calculation formula; the weight calculation formula is as follows:
wherein w1, w2, w3 and w4 each represent a corresponding ranking weight vector for the keyword.
A retrieval device based on semantic analysis and keyword recognition comprises:
the information acquisition module is used for acquiring search information, and the search information comprises keywords to be retrieved; the keyword extraction module is used for acquiring patent texts from a patent database, extracting patent keywords from the patent texts according to a Textrank algorithm and acquiring a patent keyword data set; the vector conversion module is used for carrying out vector conversion on the patent keyword data set through an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set; the keyword matching module is used for determining relevant weights for each index information of the patent text through an analytic hierarchy process, matching keywords in the index information from high weights to low weights according to the keywords to be retrieved, and acquiring a matching keyword vector set, wherein the index information comprises: title, abstract, first claim and technical effect clauses; the weight calculation module is used for inputting the matching keyword vector set into a weight model and calculating the weight value of the patent text corresponding to the matching keyword vector set; and the text sorting module is used for carrying out TOP-K sorting according to the weight value of the patent text to form a retrieval result and presenting the retrieval result to the user side.
Compared with the prior art, the invention has the following advantages and beneficial effects. Search information of a user is first obtained, the search information comprising keywords to be retrieved. Patent texts are obtained from a patent database, patent keywords are extracted from the patent texts according to a Textrank algorithm to obtain a patent keyword data set, and the patent keyword data set is vector-converted through an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set. Relevant weights are determined for each index information of the patent texts through an analytic hierarchy process, that is, the weights corresponding to the titles, abstracts, first claims and technical effect sentences are determined, and keywords are matched from high weight to low weight to obtain a matching keyword vector set. The matching keyword vector set is input into a weight model, the weight values of the corresponding patent texts relative to the keywords to be retrieved are calculated, TOP-K sorting is performed on the weight values from high to low, and a retrieval result is formed and presented to the user side. The coverage of related patents can thereby be enlarged, and semantic analysis and keyword recognition are performed on the content of the patent text, improving the relevance of the retrieval result.
Drawings
FIG. 1 is a schematic flow chart illustrating a search method based on semantic analysis and keyword recognition according to an embodiment;
FIG. 2 is a diagram illustrating a structure of a word vector transformation model in one embodiment;
FIG. 3 is a schematic structural diagram of a retrieval device based on semantic analysis and keyword recognition according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings by way of specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In one embodiment, as shown in fig. 1, there is provided a search method based on semantic analysis and keyword recognition, including the following steps:
Step S101, obtaining search information, wherein the search information comprises keywords to be retrieved.
Specifically, the user may input search information for retrieving a patent in a web page or an application, the search information including a keyword to be retrieved.
Step S102, obtaining patent texts from a patent database, extracting patent keywords from the patent texts according to a Textrank algorithm, and obtaining a patent keyword data set.
Specifically, patent texts are obtained from a patent database, which can be a patent database in an existing website or a high-value patent database after compilation, and patent keywords are extracted from the patent texts according to a Textrank algorithm to obtain a patent keyword data set.
The Textrank algorithm is a graph-based text ranking algorithm used to generate keywords and summaries for a text.
Step S103, carrying out vector conversion on the patent keyword data set through an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set.
Specifically, the Elmo (Embeddings from Language Models, i.e., word vector representations from language models) dynamic word vector conversion algorithm performs vector conversion on the obtained patent keyword data set to obtain a corresponding patent keyword vector set.
Step S104, determining relevant weights for each index information of the patent text through an analytic hierarchy process, matching keywords in the index information from high weights to low weights according to the keywords to be retrieved, and acquiring a matching keyword vector set, wherein the index information comprises: title, abstract, first claim and technical effect clauses.
Specifically, the related weight of the title, the abstract, the first claim and the technical effect sentence in the patent text to the patent text is determined through an analytic hierarchy process; and matching keywords in the titles, abstracts, first claims and technical effect sentences of the patent texts from high weight to low weight according to the keywords to be retrieved to obtain a matching keyword vector set.
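As a rough illustration of how an analytic hierarchy process can turn pairwise importance judgments into the four field weights, the following sketch uses the geometric-mean approximation of the priority vector together with Saaty's consistency ratio. The pairwise judgments in `PAIRWISE`, the field names and the helper functions are hypothetical; the source does not give the comparison matrix or the resulting weights.

```python
import math

def ahp_weights(matrix):
    """Approximate AHP priority weights with the geometric-mean (row) method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]

def consistency_ratio(matrix, weights):
    """Saaty's consistency ratio; values below about 0.1 are usually accepted."""
    n = len(matrix)
    # lambda_max is estimated as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    random_index = {3: 0.58, 4: 0.90, 5: 1.12}[n]  # subset of Saaty's RI table
    return ci / random_index

# Hypothetical pairwise judgments (an assumption, not from the source):
# entry [i][j] says how much more important field i is than field j.
FIELDS = ["title", "abstract", "first_claim", "technical_effect"]
PAIRWISE = [
    [1.0, 2.0, 3.0, 4.0],
    [1 / 2, 1.0, 2.0, 3.0],
    [1 / 3, 1 / 2, 1.0, 2.0],
    [1 / 4, 1 / 3, 1 / 2, 1.0],
]

weights = ahp_weights(PAIRWISE)
field_weights = dict(zip(FIELDS, weights))
# Fields ordered from high weight to low weight, the order used for matching.
match_order = sorted(FIELDS, key=field_weights.get, reverse=True)
cr = consistency_ratio(PAIRWISE, weights)
```

With these illustrative judgments the title receives the largest weight, so matching would start with the title and end with the technical effect clauses.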
Step S105, inputting the matched keyword vector set into the weight model, and calculating the weight value of the patent text corresponding to the matched keyword vector set.
Specifically, a weight model may be constructed according to an analytic hierarchy process, the matching keyword vector set is input to the weight model, and a weight value of the patent text corresponding to the matching keyword vector set is calculated.
Step S106, performing TOP-K sorting according to the weight value of the patent text to form a retrieval result and presenting the retrieval result to the user side.
Specifically, TOP-K sorting is performed according to the weighted value of the patent text, wherein the K value can be set according to the needs of the user, and the sorted patent text is taken as a retrieval result and presented to a retrieval page of the user.
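The TOP-K step can be sketched with a heap-based selection, which avoids sorting the full result list when K is small. The patent identifiers and weight values below are made up for illustration, and `top_k` is a hypothetical helper name.

```python
import heapq

def top_k(patent_weights, k):
    """Return the k (patent_id, weight) pairs with the largest weights, best first."""
    return heapq.nlargest(k, patent_weights.items(), key=lambda item: item[1])

# Hypothetical weight values produced by the weight model.
scores = {"patent_A": 0.42, "patent_B": 0.87, "patent_C": 0.15, "patent_D": 0.66}
results = top_k(scores, k=2)
```

If K exceeds the number of candidate patents, `heapq.nlargest` simply returns all of them in descending order.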
In this embodiment, search information of a user is first obtained, the search information comprising keywords to be retrieved. Patent texts are obtained from a patent database, patent keywords are extracted from the patent texts according to a Textrank algorithm to obtain a patent keyword data set, and the patent keyword data set is vector-converted through an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set. Relevant weights are determined for each index information of the patent texts through an analytic hierarchy process, that is, the weights corresponding to the title, abstract, first claim and technical effect sentences are determined, and keywords are matched from high weight to low weight to obtain a matching keyword vector set. The matching keyword vector set is input into a weight model, the weight values of the corresponding patent texts relative to the keywords to be retrieved are calculated, TOP-K sorting is performed on the weight values from high to low, and the sorted result is presented as the retrieval result. The coverage of related patents can thereby be improved, and semantic analysis and keyword recognition are performed on the content of the patent text, improving the relevance of the retrieval result.
In one embodiment, the relationship between patents can be obtained by constructing a patent knowledge graph, and query expansion and the like can be performed.
Wherein, step S102 specifically includes: segmenting the patent text and filtering out stop words to obtain candidate keywords; constructing a candidate keyword graph G = (V, E), wherein V is a node set consisting of the candidate keywords; constructing an edge between any two nodes by using the co-occurrence relation; iteratively updating the weight value of each node according to a weight update formula until the weight value of each node converges to within a range, whereupon the weight value obtained by the last update is taken as the weight value of the node; sorting the weight values of the nodes in descending order, wherein the keywords corresponding to the nodes ranked within a preset order are important keywords; and marking the important keywords in the corresponding patent texts, and constructing the patent keyword data set from the important keywords.
Specifically, when the weight value of a node only fluctuates within a certain range, that range can be taken as the preset range; the weight update of the node is then stopped, and the most recently updated weight value is used as the weight value of the node.
The reverse-order arrangement, that is, descending order, means that the weight values are sorted from largest to smallest. The keywords corresponding to the nodes ranked within the preset order are the important keywords. The preset order can be set according to actual needs; for example, it can be set to the top five, in which case the keywords whose node weight values rank in the top five are taken as the important keywords.
Specifically, after the important keywords are marked in the patent text, the user can conveniently review the patent text when the retrieval result is presented, which speeds up reading of the patent document.
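The keyword extraction steps described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes whitespace tokenization (Chinese patent text would need a word segmenter), a small hypothetical stop-word list, a co-occurrence window of three tokens, and the standard Textrank update with d = 0.85. The function name and all parameters are illustrative.

```python
from collections import defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "is", "in", "for", "by", "on", "with"}

def textrank_keywords(text, top_n=5, window=3, d=0.85, tol=1e-6, max_iter=200):
    # Segment the text and filter stop words to get candidate keywords.
    words = [w.strip(".,;:()").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]

    # Undirected graph: words co-occurring within `window` tokens share an edge,
    # and the edge weight counts how often they co-occur.
    edges = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != w:
                edges[w][other] += 1.0
                edges[other][w] += 1.0
    if not edges:
        return list(dict.fromkeys(words))[:top_n]

    # Iterate the weight update until the largest change falls below `tol`.
    scores = {w: 1.0 for w in edges}
    for _ in range(max_iter):
        new_scores = {}
        for w in edges:
            rank = sum(
                edges[v][w] / sum(edges[v].values()) * scores[v] for v in edges[w]
            )
            new_scores[w] = (1 - d) + d * rank
        converged = max(abs(new_scores[w] - scores[w]) for w in edges) < tol
        scores = new_scores
        if converged:
            break

    # Sort in descending order of weight and keep the preset number of keywords.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

sample = (
    "Patent retrieval uses keyword extraction. Keyword vector conversion follows "
    "keyword extraction. Keyword matching ranks the patent text by keyword weight."
)
keywords = textrank_keywords(sample, top_n=5)
```

The `tol` check plays the role of the preset convergence range, and `top_n=5` corresponds to the "top five" example.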
Wherein, the weight updating formula is as follows:
wherein V_i and V_j both represent nodes in the node set; WS(V_i) represents the weight value of node V_i; d is the damping coefficient, generally set to 0.85; In(V_i) represents the set of sentences in which keyword i occurs; Out(V_j) represents the set of sentences in which keyword j occurs; and the weight term ω_ji is the weight of the edge, i.e., the similarity between sentences.
Wherein, step S103 specifically includes: constructing a word vector conversion model with a two-layer biLSTM structure; training the word vector conversion model on a data set; and inputting the patent keyword data set into the word vector conversion model, outputting the word vectors of the corresponding patent keywords, and obtaining a patent keyword vector set.
FIG. 2 shows a schematic structural diagram of the word vector conversion model, wherein 10 is a forward language model (LSTM), 20 is a backward language model (LSTM), and T1, T2, ..., TN represent the remaining keyword data that has not been vector converted.
Specifically, the biLSTM (bidirectional LSTM language model) has both a forward LSTM and a backward LSTM, and can learn word vectors that capture both the preceding and the following context. The word vectors obtained at different layers of the biLSTM emphasize different things: the CNN-BIG-LSTM word vectors used by the input layer better encode part-of-speech information, the first LSTM layer better encodes syntactic information, and the second LSTM layer better encodes word-sense information. The final word vector is obtained by fusing the word vectors of the multiple layers and can therefore take the information of the different layers into account.
The step of inputting the patent keyword data set into the word vector conversion model and outputting the corresponding patent keyword vectors to obtain a patent keyword vector set specifically includes: given a sentence that contains a corresponding keyword data set; looking up the word vectors E(1), ..., E(N) corresponding to the keywords in a static keyword vector table according to the keyword data set, and inputting them into the word vector conversion model, wherein the word vector conversion model comprises a first-layer forward LSTM, a first-layer backward LSTM, a second-layer forward LSTM and a second-layer backward LSTM; inputting the keyword vectors E(1), ..., E(N) into the first-layer forward LSTM and the first-layer backward LSTM respectively, to obtain forward outputs h(1,1,→), ..., h(N,1,→) and backward outputs h(1,1,←), ..., h(N,1,←); passing the forward outputs h(1,1,→), ..., h(N,1,→) into the second-layer forward LSTM to obtain second-layer forward outputs h(1,2,→), ..., h(N,2,→); passing the backward outputs h(1,1,←), ..., h(N,1,←) into the second-layer backward LSTM to obtain second-layer backward outputs h(1,2,←), ..., h(N,2,←); the word vectors finally available for keyword i include E(i), h(i,1,→), h(i,1,←), h(i,2,→) and h(i,2,←).
Specifically, the static keyword vector table may be obtained by a static word vector algorithm, such as the Word2vec (word to vector) algorithm or the GloVe (Global Vectors for Word Representation) algorithm.
Specifically, if a biLSTM of L layers is employed, 2L +1 word vectors can be finally obtained.
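The 2L + 1 count can be checked with a small bookkeeping sketch. To stay dependency-free it replaces the LSTM cell with a toy tanh recurrence, so it illustrates only how the static vector and the per-layer forward and backward outputs are collected for each keyword, not a trained Elmo model. `toy_cell`, `run_direction`, `layer_vectors` and the sample vectors are all hypothetical.

```python
import math

def toy_cell(x, h_prev):
    """Stand-in for an LSTM cell: mixes the input with the previous state."""
    return [math.tanh(0.5 * xi + 0.5 * hi) for xi, hi in zip(x, h_prev)]

def run_direction(inputs, reverse=False):
    """Run the toy cell over the sequence and return one output per position."""
    dim = len(inputs[0])
    order = range(len(inputs) - 1, -1, -1) if reverse else range(len(inputs))
    state = [0.0] * dim
    outputs = [None] * len(inputs)
    for t in order:
        state = toy_cell(inputs[t], state)
        outputs[t] = state
    return outputs

def layer_vectors(static_vectors, num_layers=2):
    """Collect, per position, E(i) plus the forward and backward outputs of each layer."""
    per_position = [[e] for e in static_vectors]
    fwd_in, bwd_in = static_vectors, static_vectors
    for _ in range(num_layers):
        fwd_out = run_direction(fwd_in)
        bwd_out = run_direction(bwd_in, reverse=True)
        for i, vectors in enumerate(per_position):
            vectors.extend([fwd_out[i], bwd_out[i]])
        # Forward outputs feed the next forward layer, backward the next backward layer.
        fwd_in, bwd_in = fwd_out, bwd_out
    return per_position

# Hypothetical static vectors E(1)..E(3) for a three-keyword sentence.
E = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
vectors = layer_vectors(E, num_layers=2)
```

With two layers each keyword ends up with five vectors (one static vector plus a forward and a backward output per layer), matching 2L + 1 for L = 2.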
Wherein, step S105 specifically includes: presetting a keyword similarity threshold U; and recording the number of times of searching the keywords as n, using x, y, z and h to count the number of words with the keyword similarity larger than U, and calculating the weight value of the corresponding keyword according to a weight calculation formula.
Specifically, the weight calculation formula is:
wherein w1, w2, w3 and w4 each represent a corresponding ranking weight vector for the keyword.
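The counting part of this step can be sketched as below. The sketch rests on two assumptions that the text does not state: that keyword similarity is cosine similarity between word vectors, and that x, y, z and h are the counts for the title, abstract, first claim and technical effect clauses respectively. Because the weight calculation formula itself is not reproduced in this text, the sketch stops at the counts and does not combine them with w1 to w4. All names and vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def count_matches(query_vectors, field_vectors, threshold):
    """Count the field keywords whose similarity to any query keyword exceeds U."""
    return sum(
        1
        for fv in field_vectors
        if any(cosine(qv, fv) > threshold for qv in query_vectors)
    )

U = 0.8  # hypothetical similarity threshold
query = [[1.0, 0.0]]
# Hypothetical keyword vectors per index field of one patent text.
patent_fields = {
    "title": [[0.9, 0.1], [0.0, 1.0]],
    "abstract": [[1.0, 0.2], [0.8, 0.1], [0.1, 0.9]],
    "first_claim": [[0.2, 1.0]],
    "technical_effect": [[0.7, 0.7]],
}
counts = {name: count_matches(query, vecs, U) for name, vecs in patent_fields.items()}
```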
In one embodiment, as shown in fig. 3, there is provided a retrieval apparatus 30 based on semantic analysis and keyword recognition, including: an information obtaining module 31, a keyword extracting module 32, a vector converting module 33, a keyword matching module 34, a weight calculating module 35, and a text sorting module 36, wherein:
the information acquisition module 31 is configured to acquire search information, where the search information includes a keyword to be retrieved;
the keyword extraction module 32 is configured to obtain a patent text from a patent database, extract patent keywords from the patent text according to a Textrank algorithm, and obtain a patent keyword dataset;
the vector conversion module 33 is configured to perform vector conversion on the patent keyword data set through an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set;
the keyword matching module 34 is configured to determine a relevant weight for each index information of the patent text by an analytic hierarchy process, and match keywords in the index information from a high weight to a low weight according to the keyword to be retrieved to obtain a matching keyword vector set, where the index information includes: title, abstract, first claim and technical effect clauses;
the weight calculation module 35 is configured to input the matching keyword vector set into a weight model, and calculate a weight value of the patent text corresponding to the matching keyword vector set;
and the text sorting module 36 is configured to perform TOP-K sorting according to the weight values of the patent texts, form a retrieval result, and present the retrieval result to the user side.
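The analytic hierarchy process used by the keyword matching module 34 can be illustrated with a short sketch. The pairwise comparison values below are hypothetical judgements chosen only for the example, and the row geometric mean is used as a common approximation of the principal eigenvector; neither is taken from the embodiment.

```python
import math

FIELDS = ["title", "abstract", "first_claim", "technical_effect"]

# Hypothetical pairwise judgements: PAIRWISE[i][j] states how much more
# important field i is than field j (reciprocal matrix, 1..9 style scale).
PAIRWISE = [
    [1.0, 2.0, 3.0, 4.0],
    [1 / 2, 1.0, 2.0, 3.0],
    [1 / 3, 1 / 2, 1.0, 2.0],
    [1 / 4, 1 / 3, 1 / 2, 1.0],
]


def ahp_weights(matrix):
    """Approximate AHP priority weights by normalised row geometric means."""
    geo_means = [math.prod(row) ** (1.0 / len(row)) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


weights = ahp_weights(PAIRWISE)
# Fields are then matched from the highest weight to the lowest.
match_order = [f for _, f in sorted(zip(weights, FIELDS), reverse=True)]
print(match_order)
```

A production system would usually also check the consistency ratio of the comparison matrix before trusting the resulting weights.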
In one embodiment, the keyword extraction module 32 is further configured to: segment the patent text and filter out stop words to obtain candidate keywords; construct a candidate keyword graph G = (V, E), wherein V is the node set, which consists of the candidate keywords, and E is the edge set; construct an edge between any two nodes by using the co-occurrence relation; iteratively update the weight value of each node according to a weight updating formula until the weight value of each node converges to within a preset range, the weight value obtained by the last update being taken as the weight value of the node; sort the weight values of the nodes in descending order, wherein the keywords corresponding to the nodes ranked within a preset number of top positions are the important keywords; and mark the important keywords in the corresponding patent texts and construct the patent keyword data set from the important keywords.
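The graph construction and iterative update performed by the keyword extraction module can be sketched as follows. The sketch assumes already segmented and stop-word-filtered tokens, an undirected co-occurrence graph with a hypothetical window of 2 tokens, edge weights equal to co-occurrence counts, a damping coefficient of 0.85, and the standard weighted Textrank update as the weight updating formula; all function names and the sample tokens are illustrative.

```python
from collections import defaultdict


def build_graph(tokens, window=2):
    """Undirected co-occurrence graph; edge weight = number of co-occurrences."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return graph


def textrank(graph, d=0.85, tol=1e-6, max_iter=200):
    """Iterate WS(Vi) = (1 - d) + d * sum_j(w_ji / sum_k(w_jk) * WS(Vj)) until stable."""
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        new_scores = {}
        for node in graph:
            rank = 0.0
            for neighbour, weight in graph[node].items():
                out_sum = sum(graph[neighbour].values())
                rank += weight / out_sum * scores[neighbour]
            new_scores[node] = (1 - d) + d * rank
        delta = max(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < tol:  # every node weight has converged
            break
    return scores


def top_keywords(tokens, k=3):
    """Sort node weights in descending order and keep the first k keywords."""
    scores = textrank(build_graph(tokens))
    return sorted(scores, key=scores.get, reverse=True)[:k]


tokens = ["keyword", "vector", "retrieval", "keyword", "weight", "keyword", "vector"]
print(top_keywords(tokens, k=2))
```

The value of `k` plays the role of the preset order: nodes ranked within the first `k` positions are treated as the important keywords.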
In one embodiment, the vector conversion module 33 is further configured to: construct a word vector conversion model with a double-layer biLSTM structure; train the word vector conversion model on a data set; and input the patent keyword data set into the word vector conversion model and output the word vectors of the corresponding patent keywords according to the patent keywords to obtain a patent keyword vector set.
The foregoing is a more detailed description of the present invention that is presented in conjunction with specific embodiments, and the practice of the invention is not to be considered limited to those descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
Claims (7)
1. A retrieval method based on semantic analysis and keyword recognition is characterized by comprising the following steps:
acquiring search information, wherein the search information comprises keywords to be retrieved;
acquiring a patent text from a patent database, extracting patent keywords from the patent text according to a Textrank algorithm, and acquiring a patent keyword data set;
performing vector conversion on the patent keyword data set through an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set;
determining relevant weights for each index information of a patent text through an analytic hierarchy process, matching keywords in the index information from high weights to low weights according to the keywords to be retrieved, and acquiring a matched keyword vector set, wherein the index information comprises: title, abstract, first claim and technical effect clauses;
inputting the matched keyword vector set into a weight model, and calculating the weight value of the patent text corresponding to the matched keyword vector set;
and carrying out TOP-K sorting according to the weight value of the patent text to form a retrieval result and presenting the retrieval result to the user side.
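The final TOP-K sorting step can be illustrated with a small sketch. The patent identifiers and weight values are hypothetical; `heapq.nlargest` is used so that only the K best entries are kept instead of sorting every patent text.

```python
import heapq


def top_k_patents(weighted_patents, k):
    """Return the k (patent_id, weight) pairs with the largest weight, best first."""
    return heapq.nlargest(k, weighted_patents, key=lambda item: item[1])


# Hypothetical weight values produced by the weight model.
scored = [("P1", 0.42), ("P2", 0.91), ("P3", 0.10), ("P4", 0.77)]
results = top_k_patents(scored, k=2)
print(results)  # [('P2', 0.91), ('P4', 0.77)]
```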
2. The retrieval method based on semantic analysis and keyword recognition according to claim 1, wherein the extracting patent keywords from the patent text according to Textrank algorithm to obtain a patent keyword dataset specifically comprises:
segmenting the patent text and filtering out stop words to obtain candidate keywords;
constructing a candidate keyword graph G = (V, E), wherein V is a node set, and the node set consists of the candidate keywords;
constructing an edge between any two nodes by utilizing a co-occurrence relation;
iteratively updating the weight value of each node according to a weight updating formula until the weight value of each node converges to within a preset range, the weight value obtained by the last update being taken as the weight value of the node;
sorting the weight values of the nodes in descending order, wherein the keywords corresponding to the nodes ranked within a preset number of top positions are important keywords;
and marking the important keywords in the corresponding patent texts, and constructing a patent keyword data set through the important keywords.
3. The search method based on semantic analysis and keyword recognition according to claim 2, wherein the weight update formula is:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{\omega_{ji}}{\sum_{V_k \in Out(V_j)} \omega_{jk}} \, WS(V_j)$$
wherein $V_i$ and $V_j$ each represent a node, $WS(V_i)$ represents the weight value of node $V_i$, $d$ is a damping coefficient, generally set to 0.85, $In(V_i)$ represents the set of sentences in which keyword i is present, $Out(V_j)$ represents the set of sentences in which keyword j is present, and the weight term $\omega_{ji}$ is the weight of the edge, i.e., the similarity between sentences.
4. The search method based on semantic analysis and keyword recognition according to claim 1, wherein the vector conversion is performed on the patent keyword data set through an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set, specifically comprising:
constructing a word vector conversion model through a double-layer biLSTM structure;
training the word vector conversion model on a data set;
and inputting the patent keyword data set into the word vector conversion model, and outputting the word vectors of the corresponding patent keywords according to the patent keywords to obtain a patent keyword vector set.
5. The method according to claim 4, wherein the step of inputting the patent keyword data set into the word vector conversion model and outputting a corresponding patent keyword vector according to the patent keyword to obtain a patent keyword vector set comprises:
a sentence is given, and the sentence comprises a corresponding keyword data set;
looking up the word vectors E(1), ..., E(N) corresponding to the keywords in a static keyword vector table according to the keyword data set, and inputting them into the word vector conversion model, wherein the word vector conversion model comprises a first-layer forward LSTM, a first-layer backward LSTM, a second-layer forward LSTM and a second-layer backward LSTM;
inputting the keyword vectors E(1), ..., E(N) into the first-layer forward LSTM and the first-layer backward LSTM respectively, to obtain the forward outputs h(1,1,→), ..., h(N,1,→) and the backward outputs h(1,1,←), ..., h(N,1,←);
passing the forward outputs h(1,1,→), ..., h(N,1,→) into the second-layer forward LSTM to obtain the second-layer forward outputs h(1,2,→), ..., h(N,2,→); passing the backward outputs h(1,1,←), ..., h(N,1,←) into the second-layer backward LSTM to obtain the second-layer backward outputs h(1,2,←), ..., h(N,2,←);
the word vectors finally obtained for keyword i include E(i), h(i,1,→), h(i,1,←), h(i,2,→) and h(i,2,←).
6. The method as claimed in claim 1, wherein the step of inputting the matching keyword vector set into a weight model to calculate a weight value of the patent text corresponding to the matching keyword vector set includes:
presetting a keyword similarity threshold U;
recording the number of keyword retrievals as n, and using x, y, z and h to count the number of words whose keyword similarity is greater than U;
calculating the weight value of the corresponding keyword according to a weight calculation formula;
the weight calculation formula is as follows:
wherein w1, w2, w3 and w4 each represent a corresponding ranking weight vector for the keyword.
7. A retrieval device based on semantic analysis and keyword recognition is characterized by comprising:
the information acquisition module is used for acquiring search information, and the search information comprises keywords to be retrieved;
the keyword extraction module is used for acquiring patent texts from a patent database, extracting patent keywords from the patent texts according to a Textrank algorithm and acquiring a patent keyword data set;
the vector conversion module is used for carrying out vector conversion on the patent keyword data set through an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set;
the keyword matching module is used for determining relevant weights for each index information of the patent text through an analytic hierarchy process, matching keywords in the index information from high weights to low weights according to the keywords to be retrieved, and acquiring a matching keyword vector set, wherein the index information comprises: title, abstract, first claim and technical effect clauses;
the weight calculation module is used for inputting the matching keyword vector set into a weight model and calculating the weight value of the patent text corresponding to the matching keyword vector set;
and the text sorting module is used for carrying out TOP-K sorting according to the weight value of the patent text to form a retrieval result and presenting the retrieval result to the user side.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011442031.4A CN112507109A (en) | 2020-12-11 | 2020-12-11 | Retrieval method and device based on semantic analysis and keyword recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112507109A (en) | 2021-03-16 |
Family
ID=74970927
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202011442031.4A | Pending | CN112507109A (en) | 2020-12-11 | 2020-12-11 | Retrieval method and device based on semantic analysis and keyword recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112507109A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180121799A1 (en) * | 2016-11-03 | 2018-05-03 | Salesforce.Com, Inc. | Training a Joint Many-Task Neural Network Model using Successive Regularization |
CN107729312A (en) * | 2017-09-05 | 2018-02-23 | 苏州大学 | More granularity segmenting methods and system based on sequence labelling modeling |
CN109635150A (en) * | 2018-12-19 | 2019-04-16 | 腾讯科技(深圳)有限公司 | Document creation method, device and storage medium |
CN110807084A (en) * | 2019-05-15 | 2020-02-18 | 北京信息科技大学 | Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy |
CN111737560A (en) * | 2020-07-20 | 2020-10-02 | 平安国际智慧城市科技股份有限公司 | Content search method, field prediction model training method, device and storage medium |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114780690A (en) * | 2022-06-20 | 2022-07-22 | 成都信息工程大学 | Patent text retrieval method and device based on multi-mode matrix vector representation |
CN115794999A (en) * | 2023-02-01 | 2023-03-14 | 北京知呱呱科技服务有限公司 | Patent document query method based on diffusion model and computer equipment |
CN116010560A (en) * | 2023-03-28 | 2023-04-25 | 青岛阿斯顿工程技术转移有限公司 | International technology transfer data service system |
CN116842138A (en) * | 2023-07-24 | 2023-10-03 | 上海诚狐信息科技有限公司 | Document-based retrieval method, device, equipment and storage medium |
CN116738968A (en) * | 2023-08-14 | 2023-09-12 | 宁波深擎信息科技有限公司 | Content linking method, device, computer equipment and storage medium |
CN116738968B (en) * | 2023-08-14 | 2023-11-24 | 宁波深擎信息科技有限公司 | Content linking method, device, computer equipment and storage medium |
CN116881437A (en) * | 2023-09-08 | 2023-10-13 | 北京睿企信息科技有限公司 | Data processing system for acquiring text set |
CN116881437B (en) * | 2023-09-08 | 2023-12-01 | 北京睿企信息科技有限公司 | Data processing system for acquiring text set |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210316 |