CN112507109A - Retrieval method and device based on semantic analysis and keyword recognition - Google Patents

Retrieval method and device based on semantic analysis and keyword recognition

Info

Publication number
CN112507109A
Authority
CN
China
Prior art keywords
keyword
keywords
vector
weight
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011442031.4A
Other languages
Chinese (zh)
Inventor
刘伟
刘灿
吴永杰
钟延珍
陈善雄
李莉
李磊
王雪春
王仲煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Intellectual Property Big Data Research Institute Co ltd
Original Assignee
Chongqing Intellectual Property Big Data Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Intellectual Property Big Data Research Institute Co ltd
Priority to CN202011442031.4A
Publication of CN112507109A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a retrieval method and a retrieval device based on semantic analysis and keyword recognition, which comprise the following steps: extracting patent keywords from the patent text by a Textrank algorithm to obtain a patent keyword data set, and performing vector conversion according to an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set; determining weights of titles, abstracts, first claims and technical effect sentences of patent texts through an analytic hierarchy process, matching keywords in index information from high weights to low weights according to the keywords to be retrieved to obtain a matching keyword vector set, inputting the matching keyword vector set into a weight model, calculating the weight values of corresponding patent texts, carrying out TOP-K sorting according to the weight values, and forming retrieval results and presenting the retrieval results to a user side. The invention can expand the coverage of related patents, and carry out semantic analysis and keyword recognition on the contents in the patent text, thereby improving the relevance of the retrieval result.

Description

Retrieval method and device based on semantic analysis and keyword recognition
Technical Field
The invention relates to the technical field of patent information, in particular to a retrieval method and a retrieval device based on semantic analysis and keyword recognition.
Background
The task of patent retrieval is to match the patent information that best meets a user's requirements according to the conditions the user provides. With the advent of the big data era, patent retrieval has become an important research topic in the field of information retrieval. Patent retrieval differs from traditional information retrieval tasks in that the retrieval object is a patent text, which has its own particularities. A patent text has many attributes, such as IPC classification numbers, claim numbers, technical efficacy, legal status and invention type, and making proper use of these data often requires specialized personnel. A patent text also integrates many kinds of information, is technically sensitive, and covers a wide range of subjects and regions. If these characteristics are fully considered when constructing a patent retrieval model, the model can improve retrieval efficiency, provide important help for scientific research and for social and economic activities, and promote the development of disciplines and the progress of science and technology.
At present, the main patent retrieval methods are the following:
(1) a patent retrieval method based on a topic model and a language model, which first constructs a candidate set containing the relevant patents found by an initial query, and then ranks the screened candidate patents based on the language model and the topic model (LDA and DMR) according to a proposed weight evaluation standard;
(2) a patent retrieval method based on citation relationships: by making reasonable use of citation relationships, the associations between objects, the development of a technical route and so on can be seen. For example, Fujii uses citation information between patents to calculate the correlation between patent documents, so as to expand the patent retrieval result. Starting from patent citation relationships, Mahdabi and Crestani construct a patent citation network and, on the basis of this network, propose a time-aware PageRank algorithm;
(3) a patent retrieval method based on query expansion, which is the most commonly used method in the patent retrieval field and mainly addresses the low recall caused by an ambiguous or vague initial query. For example, a query expansion method based on positional nearest neighbors uses IPC descriptions as an expansion dictionary to expand the query words; its main idea is to measure the closeness between candidate words and query words by their distance in the text;
(4) a patent retrieval method based on ontology, which applies ontology modeling to the description of a patent information ontology. By establishing a patent retrieval information association ontology, it resolves the problem of multiple descriptions of the same concept in a patent information database; it then associates the patent information ontology and its instances with the data in the patent database, and realizes self-organizing optimization of patent retrieval information and ranking of patent resources by combining the ontology's self-organizing evolution process and method.
However, methods (1) and (2) can hardly cover all the related patents, because [1] different applicants may describe the same technology in different terms, and even experts may use different terms; [2] applicants sometimes wish to keep a low profile when filing a patent and, to avoid drawing too much attention to it, often choose rare words to describe their technology; [3] an immature technology is not standardized during its development and has no uniform name; [4] there is no uniform standard for translating patents between countries;
method (3) processes only the text of patent documents. However, a patent document contains far more than text; it often includes non-text information such as drawings and diagrams. Because the fields involved are so broad, this information currently cannot be processed and can only be ignored, even though it is important for understanding the content of the patent document. In addition, because the IPC classification system is highly complex, some patent documents are not thoroughly classified, and patent documents that cannot be classified or that are cross-classified are occasionally encountered; the demands on training sets and classification algorithms are high, which greatly increases the difficulty of correcting these problems;
the method for rapidly constructing a domain ontology proposed in method (4) depends on an existing, complete relational corpus. However, patent data covers every industry in society, the fields differ greatly from one another, and finding a corpus for every field is almost impossible. Finding a suitable ontology construction method and integrating the fields with one another therefore remains a difficult task.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a search method and apparatus based on semantic analysis and keyword recognition.
A retrieval method based on semantic analysis and keyword recognition comprises the following steps:
acquiring search information, wherein the search information comprises keywords to be retrieved; acquiring a patent text from a patent database, extracting patent keywords from the patent text according to a Textrank algorithm, and acquiring a patent keyword data set; performing vector conversion on the patent keyword data set through an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set; determining relevant weights for each index information of a patent text through an analytic hierarchy process, matching keywords in the index information from high weights to low weights according to the keywords to be retrieved, and acquiring a matched keyword vector set, wherein the index information comprises: title, abstract, first claim and technical effect clauses; inputting the matched keyword vector set into a weight model, and calculating the weight value of the patent text corresponding to the matched keyword vector set; and carrying out TOP-K sorting according to the weight value of the patent text to form and present a retrieval result.
In one embodiment, the extracting patent keywords from the patent text according to the Textrank algorithm to obtain a patent keyword data set specifically includes: segmenting the patent text and filtering out stop words to obtain candidate keywords; constructing a candidate keyword graph G = (V, E), wherein V is a node set consisting of the candidate keywords and E is the edge set; constructing an edge between any two nodes by utilizing a co-occurrence relation; iteratively updating the weight value of each node according to a weight update formula until the weight value of each node converges within a range, whereupon the weight value obtained by the last update is taken as the weight value of the node; sorting the weight values of the nodes in reverse order, wherein the keywords corresponding to the nodes ranked within a preset order are important keywords; and marking the important keywords in the corresponding patent texts, and constructing a patent keyword data set from the important keywords.
In one embodiment, the weight update formula is:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{\omega_{ji}}{\sum_{V_k \in Out(V_j)} \omega_{jk}} \, WS(V_j)$$
wherein $V_i$ and $V_j$ denote nodes, $WS(V_i)$ denotes the weight value of node $V_i$, $d$ is a damping coefficient, generally set to 0.85, $In(V_i)$ denotes the set of sentences in which keyword $i$ is present, $Out(V_j)$ denotes the set of sentences in which keyword $j$ is present, and the weight term $\omega_{ji}$ is the weight of the edge, i.e., the similarity between sentences.
In one embodiment, the vector conversion of the patent keyword data set by the Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set specifically includes: constructing a word vector conversion model through a double-layer biLSTM structure; training the word vector conversion model on a data set; and inputting the patent keyword data set into the word vector conversion model, and outputting the word vectors of the corresponding patent keywords according to the patent keywords to obtain a patent keyword vector set.
In one embodiment, the inputting the patent keyword data set into the word vector conversion model, and outputting the word vectors of the corresponding patent keywords according to the patent keywords to obtain a patent keyword vector set specifically includes: giving a sentence, wherein the sentence includes a corresponding keyword data set; looking up the word vectors E(1), ..., E(N) corresponding to the keywords in a static keyword vector table according to the keyword data set, and inputting them into the word vector conversion model, wherein the word vector conversion model comprises a first-layer forward LSTM, a first-layer backward LSTM, a second-layer forward LSTM and a second-layer backward LSTM; inputting the keyword vectors E(1), ..., E(N) into the first-layer forward LSTM and the first-layer backward LSTM respectively, to obtain the forward outputs h(1,1,→), ..., h(N,1,→) and the backward outputs h(1,1,←), ..., h(N,1,←); passing the forward outputs h(1,1,→), ..., h(N,1,→) into the second-layer forward LSTM to obtain the second-layer forward outputs h(1,2,→), ..., h(N,2,→); passing the backward outputs h(1,1,←), ..., h(N,1,←) into the second-layer backward LSTM to obtain the second-layer backward outputs h(1,2,←), ..., h(N,2,←); the word vectors finally obtained for keyword i include E(i), h(i,1,→), h(i,1,←), h(i,2,→) and h(i,2,←).
In one embodiment, the inputting the matching keyword vector set into a weight model, and calculating a weight value of the patent text corresponding to the matching keyword vector set specifically includes: presetting a keyword similarity threshold U; recording the times of keyword retrieval as n, and using x, y, z and h to count the number of words with keyword similarity larger than U; calculating the weight value of the corresponding keyword according to a weight calculation formula; the weight calculation formula is as follows:
Figure BDA0002830528390000041
wherein w1, w2, w3 and w4 each represent a corresponding ranking weight vector for the keyword.
A retrieval device based on semantic analysis and keyword recognition comprises:
the information acquisition module is used for acquiring search information, and the search information comprises keywords to be retrieved; the keyword extraction module is used for acquiring patent texts from a patent database, extracting patent keywords from the patent texts according to a Textrank algorithm and acquiring a patent keyword data set; the vector conversion module is used for carrying out vector conversion on the patent keyword data set through an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set; the keyword matching module is used for determining relevant weights for each index information of the patent text through an analytic hierarchy process, matching keywords in the index information from high weights to low weights according to the keywords to be retrieved, and acquiring a matching keyword vector set, wherein the index information comprises: title, abstract, first claim and technical effect clauses; the weight calculation module is used for inputting the matching keyword vector set into a weight model and calculating the weight value of the patent text corresponding to the matching keyword vector set; and the text sorting module is used for carrying out TOP-K sorting according to the weight value of the patent text to form a retrieval result and presenting the retrieval result to the user side.
Compared with the prior art, the invention has the following advantages and beneficial effects. Search information of a user is first obtained, the search information including keywords to be retrieved. Patent texts are obtained from a patent database, and patent keywords are extracted from the patent texts according to a Textrank algorithm to obtain a patent keyword data set. The patent keyword data set is vector-converted through an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set. Relevant weights are determined for each item of index information of the patent text through an analytic hierarchy process, that is, the weights corresponding to the title, abstract, first claim and technical effect sentences are determined, and keywords are matched from high weight to low weight to obtain a matching keyword vector set. The matching keyword vector set is input into a weight model, and the weight value of the corresponding patent text relative to the keywords to be retrieved is calculated. TOP-K sorting is performed on the weight values from high to low, and a retrieval result is formed and presented to the user side. In this way the coverage of related patents can be enlarged, and semantic analysis and keyword recognition are performed on the content of the patent text, thereby improving the relevance of the retrieval result.
Drawings
FIG. 1 is a schematic flow chart illustrating a search method based on semantic analysis and keyword recognition according to an embodiment;
FIG. 2 is a diagram illustrating a structure of a word vector transformation model in one embodiment;
FIG. 3 is a schematic structural diagram of a retrieval device based on semantic analysis and keyword recognition according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings by way of specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In one embodiment, as shown in fig. 1, there is provided a search method based on semantic analysis and keyword recognition, including the following steps:
Step S101, search information is obtained, and the search information comprises keywords to be retrieved.
Specifically, the user may input search information for retrieving a patent in a web page or an application, the search information including a keyword to be retrieved.
Step S102, obtaining patent texts from a patent database, extracting patent keywords from the patent texts according to a Textrank algorithm, and obtaining a patent keyword data set.
Specifically, patent texts are obtained from a patent database, which can be a patent database in an existing website or a high-value patent database after compilation, and patent keywords are extracted from the patent texts according to a Textrank algorithm to obtain a patent keyword data set.
The Textrank algorithm, i.e. a text ranking algorithm, is used to generate keywords and summaries for a text.
Step S103, carrying out vector conversion on the patent keyword data set through an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set.
Specifically, the Elmo (Embeddings from Language Models, i.e. word vector representations derived from language models) dynamic word vector conversion algorithm can perform vector conversion on the obtained patent keyword data set to obtain a corresponding patent keyword vector set.
Step S104, determining relevant weights for each index information of the patent text through an analytic hierarchy process, matching keywords in the index information from high weights to low weights according to the keywords to be retrieved, and acquiring a matching keyword vector set, wherein the index information comprises: title, abstract, first claim and technical effect clauses.
Specifically, the related weight of the title, the abstract, the first claim and the technical effect sentence in the patent text to the patent text is determined through an analytic hierarchy process; and matching keywords in the titles, abstracts, first claims and technical effect sentences of the patent texts from high weight to low weight according to the keywords to be retrieved to obtain a matching keyword vector set.
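As an illustration of how an analytic hierarchy process can turn pairwise judgments into weights for the four items of index information, the following is a minimal sketch and not the method of the embodiment. The pairwise comparison values in `PAIRWISE`, the item order, the function name `ahp_weights` and the use of power iteration are all assumptions made for the example; the random index values are the commonly tabulated ones for matrices of order 3 to 5.

```python
# Hypothetical pairwise comparison matrix (Saaty 1-9 scale) for the four
# index items; the judgments below are illustrative assumptions only.
ITEMS = ["title", "abstract", "first_claim", "technical_effect"]
PAIRWISE = [
    [1.0, 2.0, 3.0, 4.0],
    [1 / 2, 1.0, 2.0, 3.0],
    [1 / 3, 1 / 2, 1.0, 2.0],
    [1 / 4, 1 / 3, 1 / 2, 1.0],
]

# Commonly tabulated random consistency index values.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(matrix, iterations=100):
    """Return (weights, lambda_max, consistency_ratio) for a reciprocal matrix."""
    n = len(matrix)
    w = [1.0 / n] * n
    # Power iteration converges to the principal eigenvector of a positive matrix.
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        w = [v / total for v in nxt]
    # Estimate the principal eigenvalue from A*w = lambda*w.
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / w[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)   # consistency index
    cr = ci / RANDOM_INDEX[n]         # consistency ratio
    return w, lambda_max, cr


weights, lambda_max, cr = ahp_weights(PAIRWISE)
for name, weight in zip(ITEMS, weights):
    print(f"{name}: {weight:.3f}")
print(f"consistency ratio: {cr:.3f}")
```

A consistency ratio below 0.1 is the usual rule of thumb for accepting the judgments; matching would then proceed over the index items in descending order of the resulting weights.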
Step S105, inputting the matched keyword vector set into the weight model, and calculating the weight value of the patent text corresponding to the matched keyword vector set.
Specifically, a weight model may be constructed according to an analytic hierarchy process, the matching keyword vector set is input to the weight model, and a weight value of the patent text corresponding to the matching keyword vector set is calculated.
Step S106, performing TOP-K sorting according to the weight value of the patent text to form a retrieval result and presenting the retrieval result to the user side.
Specifically, TOP-K sorting is performed according to the weighted value of the patent text, wherein the K value can be set according to the needs of the user, and the sorted patent text is taken as a retrieval result and presented to a retrieval page of the user.
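The TOP-K step can be sketched with a heap-based selection, which avoids fully sorting every patent when K is small. This is only an illustrative sketch: the function name `top_k_patents` and the patent identifiers and weight values in the example are hypothetical.

```python
import heapq


def top_k_patents(weight_by_patent, k):
    """Return the k (patent_id, weight) pairs with the largest weights, highest first."""
    return heapq.nlargest(k, weight_by_patent.items(), key=lambda item: item[1])


# Hypothetical weight values for four patent texts.
example_weights = {"CN-A": 0.42, "CN-B": 0.87, "CN-C": 0.15, "CN-D": 0.66}
print(top_k_patents(example_weights, 2))  # [('CN-B', 0.87), ('CN-D', 0.66)]
```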
In this embodiment, search information of a user is first obtained, the search information including keywords to be retrieved. Patent texts are obtained from a patent database, and patent keywords are extracted from the patent texts according to a Textrank algorithm to obtain a patent keyword data set. The patent keyword data set is vector-converted through an Elmo dynamic word vector conversion algorithm to obtain a patent keyword vector set. Relevant weights are determined for each item of index information of the patent text through an analytic hierarchy process, that is, the weights corresponding to the title, abstract, first claim and technical effect sentences are determined, and keywords are matched from high weight to low weight to obtain a matching keyword vector set. The matching keyword vector set is input into a weight model, and the weight value of the corresponding patent text relative to the keywords to be retrieved is calculated. TOP-K sorting is performed on the weight values from high to low, and the sorted result is presented as the retrieval result. In this way the coverage of related patents can be improved, and semantic analysis and keyword recognition are performed on the content of the patent text, thereby improving the relevance of the retrieval result.
In one embodiment, the relationship between patents can be obtained by constructing a patent knowledge graph, and query expansion and the like can be performed.
Wherein, step S102 specifically includes: segmenting the patent text and filtering out stop words to obtain candidate keywords; constructing a candidate keyword graph G = (V, E), wherein V is a node set consisting of the candidate keywords and E is the edge set; constructing an edge between any two nodes by utilizing a co-occurrence relation; iteratively updating the weight value of each node according to a weight update formula until the weight value of each node converges within a range, whereupon the weight value obtained by the last update is taken as the weight value of the node; sorting the weight values of the nodes in reverse order, wherein the keywords corresponding to the nodes ranked within a preset order are important keywords; and marking the important keywords in the corresponding patent texts, and constructing a patent keyword data set from the important keywords.
Specifically, when the weight value of a node keeps fluctuating within a certain range, that range can be taken as the preset range; the weight update of the node is then stopped, and the most recently updated weight value is used as the weight value of the node.
The reverse order arrangement means that the weight values are sorted from largest to smallest. The keywords corresponding to the nodes ranked within the preset order are the important keywords, and the preset order can be set according to actual needs; for example, it can be set to the top five, in which case the keywords whose node weight values rank in the top five are taken as the important keywords.
Specifically, once the important keywords are marked in the patent text, the user can conveniently review the patent text when the retrieval result is presented, which speeds up reading of the patent document.
Wherein, the weight updating formula is as follows:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{\omega_{ji}}{\sum_{V_k \in Out(V_j)} \omega_{jk}} \, WS(V_j)$$
wherein $V_i$ and $V_j$ denote nodes, $WS(V_i)$ denotes the weight value of node $V_i$, $d$ is a damping coefficient, generally set to 0.85, $In(V_i)$ denotes the set of sentences in which keyword $i$ is present, $Out(V_j)$ denotes the set of sentences in which keyword $j$ is present, and the weight term $\omega_{ji}$ is the weight of the edge, i.e., the similarity between sentences.
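The keyword extraction of step S102 and the iterative weight update can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the embodiment: the text is assumed to be already segmented into tokens, edges are built between candidates that co-occur within a sliding window, the edge weight is taken to be the co-occurrence count, and the names `textrank_keywords`, `window`, `tol` and `max_iter` are hypothetical.

```python
from collections import defaultdict


def textrank_keywords(tokens, stop_words, window=2, d=0.85,
                      tol=1e-6, max_iter=200, top_n=5):
    """Return (top_n keywords in descending weight order, weight of every node)."""
    # Filter stop words to obtain the candidate keywords.
    candidates = [t for t in tokens if t not in stop_words]
    nodes = sorted(set(candidates))
    if not nodes:
        return [], {}

    # Build an undirected co-occurrence graph; the edge weight counts how often
    # two candidates appear within `window` positions of each other (assumption).
    edge = defaultdict(lambda: defaultdict(float))
    for i, a in enumerate(candidates):
        for b in candidates[i + 1 : i + 1 + window]:
            if a != b:
                edge[a][b] += 1.0
                edge[b][a] += 1.0

    out_sum = {v: sum(edge[v].values()) for v in nodes}
    scores = {v: 1.0 for v in nodes}

    # Iterate WS(Vi) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS(Vj)
    # until the largest change falls below the tolerance.
    for _ in range(max_iter):
        new_scores = {}
        for vi in nodes:
            rank = sum(edge[vj][vi] / out_sum[vj] * scores[vj] for vj in edge[vi])
            new_scores[vi] = (1 - d) + d * rank
        delta = max(abs(new_scores[v] - scores[v]) for v in nodes)
        scores = new_scores
        if delta < tol:
            break

    # Sort from largest to smallest weight and keep the preset number of keywords.
    ranked = sorted(nodes, key=lambda v: scores[v], reverse=True)
    return ranked[:top_n], scores


tokens = "retrieval keyword vector keyword weight keyword text retrieval keyword".split()
keywords, scores = textrank_keywords(tokens, stop_words=set(), top_n=3)
print(keywords)
```

The tolerance plays the role of the preset range described above: iteration stops once no node weight changes by more than `tol` between two updates.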
Wherein, step S103 specifically includes: constructing a word vector conversion model through a double-layer biLSTM structure; training the word vector conversion model on a data set; and inputting the patent keyword data set into the word vector conversion model, outputting the word vectors of the corresponding patent keywords according to the patent keywords, and acquiring a patent keyword vector set.
FIG. 2 is a schematic structural diagram of the word vector conversion model, wherein 10 is a forward language model (LSTM), 20 is a backward language model (LSTM), and T1, T2, ..., TN represent the remaining keyword data that has not yet been vector converted.
Specifically, the biLSTM (bidirectional LSTM language model) contains both a forward LSTM and a backward LSTM, and can learn word vectors that store both the preceding and the following context. The word vectors obtained at different layers of the biLSTM emphasize different information: the CNN-BIG-LSTM word vector used by the input layer better encodes part-of-speech information, the first-layer LSTM better encodes syntactic information, and the second-layer LSTM better encodes word semantic information. A final word vector is obtained by fusing the word vectors of the multiple layers, so that it takes into account the information of the different layers.
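One common way to fuse layer-wise word vectors is a softmax-weighted sum scaled by a single factor. The text does not specify the fusion rule, so the sketch below is an assumption: it presumes that the vectors of each layer have already been brought to the same dimension, and the names `fuse_layers`, `layer_logits` and `gamma` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_vectors, layer_logits, gamma=1.0):
    """Fuse per-layer vectors for one word: gamma * sum_j softmax(logits)_j * vector_j."""
    if len(layer_vectors) != len(layer_logits):
        raise ValueError("one logit is required per layer vector")
    s = softmax(layer_logits)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s[j] * layer_vectors[j][k] for j in range(len(s)))
        for k in range(dim)
    ]


# Three layers (input embedding, first layer, second layer) with equal logits
# reduce to a plain average of the layer vectors.
print(fuse_layers([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]))
```

In a trained model the logits and the scale factor would be learned for the downstream task; here they are fixed inputs.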
Wherein, inputting the patent keyword data set into the word vector conversion model, outputting the word vectors of the corresponding patent keywords according to the patent keywords, and obtaining a patent keyword vector set specifically includes: giving a sentence, wherein the sentence includes a corresponding keyword data set; looking up the word vectors E(1), ..., E(N) corresponding to the keywords in a static keyword vector table according to the keyword data set, and inputting them into the word vector conversion model, wherein the word vector conversion model comprises a first-layer forward LSTM, a first-layer backward LSTM, a second-layer forward LSTM and a second-layer backward LSTM; inputting the keyword vectors E(1), ..., E(N) into the first-layer forward LSTM and the first-layer backward LSTM respectively, to obtain the forward outputs h(1,1,→), ..., h(N,1,→) and the backward outputs h(1,1,←), ..., h(N,1,←); passing the forward outputs h(1,1,→), ..., h(N,1,→) into the second-layer forward LSTM to obtain the second-layer forward outputs h(1,2,→), ..., h(N,2,→); passing the backward outputs h(1,1,←), ..., h(N,1,←) into the second-layer backward LSTM to obtain the second-layer backward outputs h(1,2,←), ..., h(N,2,←); the word vectors finally obtained for keyword i include E(i), h(i,1,→), h(i,1,←), h(i,2,→) and h(i,2,←).
Specifically, the static keyword vector table may be obtained by a static word vector algorithm, such as the Word2vec (word to vector) algorithm or the GloVe (Global Vectors for Word Representation) algorithm.
Specifically, if a biLSTM of L layers is employed, 2L + 1 word vectors are finally obtained for each keyword.
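The data flow through the two forward and two backward layers, and the count of 2L + 1 vectors per keyword, can be sketched as below. To keep the example short and runnable, the recurrent cell is a simple untrained tanh recurrence used as a stand-in; it is not an LSTM and not the trained model of the embodiment. The names `run_direction` and `contextual_vectors` and the example input vectors are hypothetical.

```python
import math


def run_direction(inputs, reverse=False):
    """Stand-in recurrent pass (NOT an LSTM): h_t = tanh(0.5 * x_t + 0.5 * h_prev).

    With reverse=True the sequence is read right to left, and the outputs are
    returned re-aligned to the original token order.
    """
    seq = list(reversed(inputs)) if reverse else list(inputs)
    state = [0.0] * len(seq[0])
    outputs = []
    for x in seq:
        state = [math.tanh(0.5 * xi + 0.5 * si) for xi, si in zip(x, state)]
        outputs.append(state)
    return list(reversed(outputs)) if reverse else outputs


def contextual_vectors(static_vectors, num_layers=2):
    """For each token i, collect E(i) plus the forward and backward output of every layer."""
    if not static_vectors:
        return []
    per_token = [[e] for e in static_vectors]
    fwd = bwd = static_vectors
    for _ in range(num_layers):
        fwd = run_direction(fwd)                 # layer output h(i, layer, ->)
        bwd = run_direction(bwd, reverse=True)   # layer output h(i, layer, <-)
        for i in range(len(static_vectors)):
            per_token[i].extend([fwd[i], bwd[i]])
    return per_token


# Hypothetical static vectors E(1), E(2), E(3) looked up for three keywords.
E = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
vectors = contextual_vectors(E, num_layers=2)
print(len(vectors[0]))  # 5, i.e. 2L + 1 with L = 2
```

Each forward layer consumes the forward outputs of the layer below it, and each backward layer consumes the backward outputs of the layer below it, mirroring the wiring described above.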
Wherein, step S105 specifically includes: presetting a keyword similarity threshold U; and recording the number of times of searching the keywords as n, using x, y, z and h to count the number of words with the keyword similarity larger than U, and calculating the weight value of the corresponding keyword according to a weight calculation formula.
Specifically, the weight calculation formula is:
Figure BDA0002830528390000091
wherein w1, w2, w3 and w4 each represent a corresponding ranking weight vector for the keyword.
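The threshold counting in step S105 can be sketched as follows. Cosine similarity is assumed as the similarity measure, and n is assumed to be the number of keywords to be retrieved. Because the weight calculation formula is available here only as a figure reference, the final combination in `patent_weight` (a weighted sum of the four counts divided by n) is purely a hypothetical placeholder and should not be read as the formula of the embodiment. All function names and example vectors are likewise hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def count_matches(query_vectors, field_vectors, threshold):
    """Count (query keyword, field keyword) pairs whose similarity exceeds the threshold U."""
    return sum(
        1
        for q in query_vectors
        for v in field_vectors
        if cosine(q, v) > threshold
    )


def patent_weight(query_vectors, fields, field_weights, threshold):
    """Hypothetical combination: sum of field weight * match count, divided by n."""
    n = len(query_vectors)
    if n == 0:
        return 0.0
    total = sum(
        field_weights[name] * count_matches(query_vectors, vectors, threshold)
        for name, vectors in fields.items()
    )
    return total / n


# x, y, z and h correspond to the counts for the four index items.
fields = {
    "title": [[1.0, 0.0], [0.0, 1.0]],
    "abstract": [[0.9, 0.1]],
    "first_claim": [],
    "technical_effect": [[0.0, 1.0]],
}
field_weights = {"title": 0.4, "abstract": 0.3, "first_claim": 0.2, "technical_effect": 0.1}
print(patent_weight([[1.0, 0.0]], fields, field_weights, threshold=0.8))
```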
In one embodiment, as shown in fig. 3, there is provided a retrieval apparatus 30 based on semantic analysis and keyword recognition, including: an information obtaining module 31, a keyword extracting module 32, a vector converting module 33, a keyword matching module 34, a weight calculating module 35, and a text sorting module 36, wherein:
the information acquisition module 31 is configured to acquire search information, where the search information includes a keyword to be retrieved;
the keyword extraction module 32 is configured to obtain patent texts from a patent database, extract patent keywords from the patent texts according to the TextRank algorithm, and obtain a patent keyword data set;
the vector conversion module 33 is configured to perform vector conversion on the patent keyword data set through the ELMo dynamic word vector conversion algorithm to obtain a patent keyword vector set;
the keyword matching module 34 is configured to determine a relevant weight for each item of index information of the patent text by the analytic hierarchy process, and to match keywords in the index information in order from high weight to low weight according to the keyword to be retrieved, to obtain a matching keyword vector set, where the index information includes: the title, the abstract, the first claim and the technical effect clauses;
the weight calculation module 35 is configured to input the matching keyword vector set into a weight model, and calculate a weight value of the patent text corresponding to the matching keyword vector set;
and the text sorting module 36 is configured to perform TOP-K sorting according to the weight values of the patent texts, form a retrieval result, and present the retrieval result to the user side.
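Two of the module functions listed above, the analytic hierarchy process weighting of the index information and the TOP-K sorting, can be sketched briefly. The pairwise comparison values, the document identifiers and the function names below are hypothetical, and the row geometric-mean method is used as a common approximation of the AHP priority vector; the text does not state which AHP variant or judgment values are actually used.

```python
import heapq
import math

FIELDS = ["title", "abstract", "first claim", "technical effect"]

def ahp_weights(pairwise):
    """Approximate the AHP priority vector with the row geometric-mean method."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]

def matching_order(fields, weights):
    """Fields ordered from high weight to low weight, as used for matching."""
    return [f for f, _ in sorted(zip(fields, weights), key=lambda fw: -fw[1])]

def top_k(text_weights, k):
    """Return the k patent texts with the largest weight values."""
    return heapq.nlargest(k, text_weights.items(), key=lambda item: item[1])

# Hypothetical pairwise judgments: entry [i][j] says how much more important
# field i is than field j. This matrix is perfectly consistent (ratios 4:3:2:1).
PAIRWISE = [
    [1.0,   4 / 3, 2.0,   4.0],
    [3 / 4, 1.0,   3 / 2, 3.0],
    [1 / 2, 2 / 3, 1.0,   2.0],
    [1 / 4, 1 / 3, 1 / 2, 1.0],
]

weights = ahp_weights(PAIRWISE)
order = matching_order(FIELDS, weights)
ranked = top_k({"CN-A": 0.62, "CN-B": 0.91, "CN-C": 0.35}, k=2)
```

In practice the pairwise matrix would come from expert judgment and would normally be checked for consistency before its weights are used.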
In one embodiment, the keyword extraction module 32 is further configured to: segment the patent text into words and filter out stop words to obtain candidate keywords; construct a candidate keyword graph G = (V, E), wherein V is a node set consisting of the candidate keywords; construct an edge between any two nodes by using their co-occurrence relation; iteratively update the weight value of each node according to a weight update formula until the weight value of each node converges to within a set range, whereupon the weight value obtained by the last update is taken as the weight value of the node; sort the nodes in descending order of weight value, wherein the keywords corresponding to the nodes ranked within a preset number of top positions are the important keywords; and mark the important keywords in the corresponding patent texts, and construct the patent keyword data set from the important keywords.
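The extraction steps above can be sketched as a small TextRank-style routine. Several details are assumptions: the input is taken to be already segmented into tokens (Chinese patent text would need a word segmenter first), edges are weighted by co-occurrence counts inside a sliding window, and the update rule is the standard TextRank form WS(Vi) = (1 - d) + d * Σ (w_ji / Σ w_jk) * WS(Vj) with damping coefficient d = 0.85. The function name, window size and tolerance are hypothetical.

```python
from collections import defaultdict

def textrank_keywords(tokens, stop_words, window=3, d=0.85,
                      tol=1e-6, max_iter=200, top_k=3):
    """Rank candidate keywords on a weighted co-occurrence graph."""
    candidates = [t for t in tokens if t not in stop_words]

    # Undirected edges: two candidates co-occur if they fall in the same window.
    edge = defaultdict(lambda: defaultdict(float))
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:i + window]:
            if a != b:
                edge[a][b] += 1.0
                edge[b][a] += 1.0

    nodes = sorted(edge)
    if not nodes:
        return []
    score = {v: 1.0 for v in nodes}
    out_sum = {v: sum(edge[v].values()) for v in nodes}

    # Iterate the weight update until every node changes by less than tol.
    for _ in range(max_iter):
        new_score = {}
        for v in nodes:
            incoming = sum(edge[u][v] / out_sum[u] * score[u] for u in edge[v])
            new_score[v] = (1 - d) + d * incoming
        delta = max(abs(new_score[v] - score[v]) for v in nodes)
        score = new_score
        if delta < tol:
            break

    # Descending order of weight; ties broken alphabetically for determinism.
    return sorted(nodes, key=lambda v: (-score[v], v))[:top_k]

# Hypothetical, already segmented text.
tokens = ("patent retrieval uses keyword vector and keyword weight "
          "for patent keyword matching").split()
keywords = textrank_keywords(tokens, stop_words={"uses", "and", "for"})
```

The returned list plays the role of the important keywords that are marked in the patent text and collected into the patent keyword data set.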
In one embodiment, the vector conversion module 33 is further configured to: construct a word vector conversion model with a two-layer biLSTM structure; train the word vector conversion model on a data set; and input the patent keyword data set into the word vector conversion model, and output the word vectors of the corresponding patent keywords according to the patent keywords, to obtain a patent keyword vector set.
The foregoing is a more detailed description of the present invention that is presented in conjunction with specific embodiments, and the practice of the invention is not to be considered limited to those descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (7)

1. A retrieval method based on semantic analysis and keyword recognition is characterized by comprising the following steps:
acquiring search information, wherein the search information comprises keywords to be retrieved;
acquiring a patent text from a patent database, extracting patent keywords from the patent text according to the TextRank algorithm, and acquiring a patent keyword data set;
performing vector conversion on the patent keyword data set through the ELMo dynamic word vector conversion algorithm to obtain a patent keyword vector set;
determining a relevant weight for each item of index information of the patent text through the analytic hierarchy process, matching keywords in the index information in order from high weight to low weight according to the keywords to be retrieved, and acquiring a matched keyword vector set, wherein the index information comprises: the title, the abstract, the first claim and the technical effect clauses;
inputting the matched keyword vector set into a weight model, and calculating the weight value of the patent text corresponding to the matched keyword vector set;
and carrying out TOP-K sorting according to the weight value of the patent text to form a retrieval result and presenting the retrieval result to the user side.
2. The retrieval method based on semantic analysis and keyword recognition according to claim 1, wherein the extracting patent keywords from the patent text according to the TextRank algorithm to obtain a patent keyword data set specifically comprises:
segmenting the patent text into words and filtering out stop words to obtain candidate keywords;
constructing a candidate keyword graph G = (V, E), wherein V is a node set consisting of the candidate keywords;
constructing an edge between any two nodes by utilizing a co-occurrence relation;
iteratively updating the weight value of each node according to a weight update formula until the weight value of each node converges to within a set range, whereupon the weight value obtained by the last update is taken as the weight value of the node;
sorting the nodes in descending order of weight value, wherein the keywords corresponding to the nodes ranked within a preset number of top positions are the important keywords;
and marking the important keywords in the corresponding patent texts, and constructing a patent keyword data set through the important keywords.
3. The search method based on semantic analysis and keyword recognition according to claim 2, wherein the weight update formula is:
Figure FDA0002830528380000021
wherein Vi and Vj both represent nodes in the node set; WS(Vi) represents the weight value of node Vi; d is a damping coefficient, generally set to 0.85; In(Vi) represents the set of sentences in which keyword i appears; Out(Vj) represents the set of sentences in which keyword j appears; and the weight term ωji is the weight of the edge, i.e., the similarity between sentences.
4. The search method based on semantic analysis and keyword recognition according to claim 1, wherein the vector conversion is performed on the patent keyword data set through the ELMo dynamic word vector conversion algorithm to obtain a patent keyword vector set, specifically comprising:
constructing a word vector conversion model with a two-layer biLSTM structure;
training the word vector conversion model on a data set;
and inputting the patent keyword data set into the word vector conversion model, and outputting the word vectors of the corresponding patent keywords according to the patent keywords to obtain a patent keyword vector set.
5. The method according to claim 4, wherein the step of inputting the patent keyword data set into the word vector conversion model and outputting a corresponding patent keyword vector according to the patent keyword to obtain a patent keyword vector set comprises:
a sentence is given, and the sentence comprises a corresponding keyword data set;
looking up, according to the keyword data set, the word vectors E(1), ..., E(N) corresponding to the keywords in a static keyword vector table, and inputting them into the word vector conversion model, wherein the word vector conversion model comprises a first-layer forward LSTM, a first-layer backward LSTM, a second-layer forward LSTM and a second-layer backward LSTM;
inputting the keyword vectors E(1), ..., E(N) into the first-layer forward LSTM and the first-layer backward LSTM respectively, to obtain forward outputs h(1,1,→), ..., h(N,1,→) and backward outputs h(1,1,←), ..., h(N,1,←);
passing the forward outputs h(1,1,→), ..., h(N,1,→) into the second-layer forward LSTM to obtain second-layer forward outputs h(1,2,→), ..., h(N,2,→); passing the backward outputs h(1,1,←), ..., h(N,1,←) into the second-layer backward LSTM to obtain second-layer backward outputs h(1,2,←), ..., h(N,2,←);
the word vectors finally obtained for keyword i include E(i), h(i,1,→), h(i,1,←), h(i,2,→) and h(i,2,←).
6. The method as claimed in claim 1, wherein the step of inputting the matching keyword vector set into a weight model to calculate a weight value of the patent text corresponding to the matching keyword vector set includes:
presetting a keyword similarity threshold U;
recording the times of keyword retrieval as n, and using x, y, z and h to count the number of words with keyword similarity larger than U;
calculating the weight value of the corresponding keyword according to a weight calculation formula;
the weight calculation formula is as follows:
Figure FDA0002830528380000031
wherein w1, w2, w3 and w4 each represent a corresponding ranking weight vector for the keyword.
7. A retrieval device based on semantic analysis and keyword recognition is characterized by comprising:
the information acquisition module is used for acquiring search information, and the search information comprises keywords to be retrieved;
the keyword extraction module is used for acquiring patent texts from a patent database, extracting patent keywords from the patent texts according to the TextRank algorithm, and acquiring a patent keyword data set;
the vector conversion module is used for carrying out vector conversion on the patent keyword data set through the ELMo dynamic word vector conversion algorithm to obtain a patent keyword vector set;
the keyword matching module is used for determining a relevant weight for each item of index information of the patent text through the analytic hierarchy process, matching keywords in the index information in order from high weight to low weight according to the keywords to be retrieved, and acquiring a matching keyword vector set, wherein the index information comprises: the title, the abstract, the first claim and the technical effect clauses;
the weight calculation module is used for inputting the matching keyword vector set into a weight model and calculating the weight value of the patent text corresponding to the matching keyword vector set;
and the text sorting module is used for carrying out TOP-K sorting according to the weight value of the patent text to form a retrieval result and presenting the retrieval result to the user side.
CN202011442031.4A 2020-12-11 2020-12-11 Retrieval method and device based on semantic analysis and keyword recognition Pending CN112507109A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011442031.4A CN112507109A (en) 2020-12-11 2020-12-11 Retrieval method and device based on semantic analysis and keyword recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011442031.4A CN112507109A (en) 2020-12-11 2020-12-11 Retrieval method and device based on semantic analysis and keyword recognition

Publications (1)

Publication Number Publication Date
CN112507109A true CN112507109A (en) 2021-03-16

Family

ID=74970927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011442031.4A Pending CN112507109A (en) 2020-12-11 2020-12-11 Retrieval method and device based on semantic analysis and keyword recognition

Country Status (1)

Country Link
CN (1) CN112507109A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780690A (en) * 2022-06-20 2022-07-22 成都信息工程大学 Patent text retrieval method and device based on multi-mode matrix vector representation
CN115794999A (en) * 2023-02-01 2023-03-14 北京知呱呱科技服务有限公司 Patent document query method based on diffusion model and computer equipment
CN116010560A (en) * 2023-03-28 2023-04-25 青岛阿斯顿工程技术转移有限公司 International technology transfer data service system
CN116738968A (en) * 2023-08-14 2023-09-12 宁波深擎信息科技有限公司 Content linking method, device, computer equipment and storage medium
CN116842138A (en) * 2023-07-24 2023-10-03 上海诚狐信息科技有限公司 Document-based retrieval method, device, equipment and storage medium
CN116881437A (en) * 2023-09-08 2023-10-13 北京睿企信息科技有限公司 Data processing system for acquiring text set

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729312A (en) * 2017-09-05 2018-02-23 苏州大学 More granularity segmenting methods and system based on sequence labelling modeling
US20180121799A1 (en) * 2016-11-03 2018-05-03 Salesforce.Com, Inc. Training a Joint Many-Task Neural Network Model using Successive Regularization
CN109635150A (en) * 2018-12-19 2019-04-16 腾讯科技(深圳)有限公司 Document creation method, device and storage medium
CN110807084A (en) * 2019-05-15 2020-02-18 北京信息科技大学 Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy
CN111737560A (en) * 2020-07-20 2020-10-02 平安国际智慧城市科技股份有限公司 Content search method, field prediction model training method, device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180121799A1 (en) * 2016-11-03 2018-05-03 Salesforce.Com, Inc. Training a Joint Many-Task Neural Network Model using Successive Regularization
CN107729312A (en) * 2017-09-05 2018-02-23 苏州大学 More granularity segmenting methods and system based on sequence labelling modeling
CN109635150A (en) * 2018-12-19 2019-04-16 腾讯科技(深圳)有限公司 Document creation method, device and storage medium
CN110807084A (en) * 2019-05-15 2020-02-18 北京信息科技大学 Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy
CN111737560A (en) * 2020-07-20 2020-10-02 平安国际智慧城市科技股份有限公司 Content search method, field prediction model training method, device and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780690A (en) * 2022-06-20 2022-07-22 成都信息工程大学 Patent text retrieval method and device based on multi-mode matrix vector representation
CN115794999A (en) * 2023-02-01 2023-03-14 北京知呱呱科技服务有限公司 Patent document query method based on diffusion model and computer equipment
CN116010560A (en) * 2023-03-28 2023-04-25 青岛阿斯顿工程技术转移有限公司 International technology transfer data service system
CN116842138A (en) * 2023-07-24 2023-10-03 上海诚狐信息科技有限公司 Document-based retrieval method, device, equipment and storage medium
CN116738968A (en) * 2023-08-14 2023-09-12 宁波深擎信息科技有限公司 Content linking method, device, computer equipment and storage medium
CN116738968B (en) * 2023-08-14 2023-11-24 宁波深擎信息科技有限公司 Content linking method, device, computer equipment and storage medium
CN116881437A (en) * 2023-09-08 2023-10-13 北京睿企信息科技有限公司 Data processing system for acquiring text set
CN116881437B (en) * 2023-09-08 2023-12-01 北京睿企信息科技有限公司 Data processing system for acquiring text set

Similar Documents

Publication Publication Date Title
CN108763333B (en) Social media-based event map construction method
CN107193803B (en) Semantic-based specific task text keyword extraction method
CN110442777B (en) BERT-based pseudo-correlation feedback model information retrieval method and system
CN106776711B (en) Chinese medical knowledge map construction method based on deep learning
CN109960786A (en) Chinese Measurement of word similarity based on convergence strategy
US8751218B2 (en) Indexing content at semantic level
CN107180045B (en) Method for extracting geographic entity relation contained in internet text
CN112507109A (en) Retrieval method and device based on semantic analysis and keyword recognition
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
CN109829104A (en) Pseudo-linear filter model information search method and system based on semantic similarity
CN110674252A (en) High-precision semantic search system for judicial domain
Gupta et al. A novel hybrid text summarization system for Punjabi text
Mahata et al. Theme-weighted ranking of keywords from text documents using phrase embeddings
CN112559684A (en) Keyword extraction and information retrieval method
Yang et al. A new network model for extracting text keywords
CN114462392B (en) Short text feature expansion method based on association degree of subject and association of keywords
CN114997288B (en) Design resource association method
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
CN118245564B (en) Method and device for constructing feature comparison library supporting semantic review and repayment
Huang et al. An approach on Chinese microblog entity linking combining baidu encyclopaedia and word2vec
CN112417170A (en) Relation linking method for incomplete knowledge graph
Jia et al. A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth
Rogushina Use of Semantic Similarity Estimates for Unstructured Data Analysis.
CN111581326A (en) Method for extracting answer information based on heterogeneous external knowledge source graph structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210316