CN112559684A - Keyword extraction and information retrieval method - Google Patents

Publication number
CN112559684A
Authority
CN
China
Prior art keywords
keyword
synonym
weight
keywords
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011454537.7A
Other languages
Chinese (zh)
Inventor
任文强
冯凯
王元卓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Original Assignee
Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences filed Critical Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Priority to CN202011454537.7A priority Critical patent/CN112559684A/en
Publication of CN112559684A publication Critical patent/CN112559684A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention provides a keyword extraction and information retrieval method. The keyword extraction method comprises the following steps: knowledge base establishment, namely receiving original data and establishing a knowledge base from it, wherein the knowledge base comprises a synonym library that groups entries with the same meaning into synonym sets, each synonym set carrying a synonym label that is the standardized expression of the words in that set; keyword extraction, namely receiving a retrieval text and extracting the keywords in it as first keywords; and keyword standardization, namely receiving a first keyword, inputting it into the synonym library to obtain the synonym set in which it is located, and taking the synonym label of that set as a search keyword. Standardizing the keywords in the user's retrieval text makes the user's input more professional and improves the accuracy of the query results. The information retrieval method improves information retrieval efficiency.

Description

Keyword extraction and information retrieval method
Technical Field
The invention relates to the technical field of information retrieval, in particular to a keyword extraction and information retrieval method.
Background
Information retrieval generally refers to the process of organizing and storing information in a certain manner and finding relevant information according to a user's needs; it is also commonly called "information search". Information retrieval usually searches for matching information based on input keywords, so whether the query terms can be matched accurately has a large influence on the accuracy of the query results.
Currently, to search with a long query text, keywords are first extracted from the text by a keyword extraction method and then searched and matched to obtain the corresponding results. However, the sentences a user enters when querying are often colloquial. Colloquial descriptions are not standardized and lack professional terminology, so the query results are not accurate enough and it is difficult to meet the user's needs.
Therefore, a method for keyword extraction and information retrieval is needed in the art.
Accordingly, the present invention is directed to such a system.
Disclosure of Invention
The invention aims to provide a keyword extraction method that standardizes the user's input, makes it more professional, and improves the precision of the query results.
The first aspect of the invention provides a keyword extraction method, which comprises the following steps:
knowledge base establishment, namely receiving original data and establishing a knowledge base from the original data, wherein the knowledge base comprises a synonym library, the synonym library groups entries with the same meaning into synonym sets, each synonym set is provided with a synonym label, and the synonym label is the standardized expression of the words in the synonym set;
keyword extraction, namely receiving a retrieval text and extracting the keywords in the retrieval text as first keywords;
and keyword standardization, namely receiving a first keyword, inputting the first keyword into the synonym library to obtain the synonym set in which the first keyword is located, and taking the synonym label of that synonym set as a search keyword.
With this scheme, the keywords in the user's retrieval text are first extracted and then processed against the synonym library, and the standardized synonym labels are used as search keywords. Standardizing the keywords in this way makes the user's input more professional and improves the accuracy of the query results.
Preferably, the knowledge base further comprises a polysemous word library, and the keyword extraction method further comprises a polysemous word disambiguation step, which comprises:
receiving the first keyword;
judging whether the first keyword exists in the polysemous word library;
if not, performing keyword standardization;
if yes, converting the first keyword into a second keyword according to the polysemous word library, inputting the second keyword into the synonym library to obtain the synonym set in which the second keyword is located, and taking the synonym label of that synonym set as a search keyword.
With this scheme, when a word has multiple meanings, for example "apple" can mean either Apple the company or the fruit, polysemous word disambiguation is performed first to determine the user's real search intent, which improves the precision of the query results.
Further, converting the first keyword into the second keyword according to the polysemous word library comprises obtaining a word sense label from the other first keywords of the retrieval text in which the first keyword is located, and adding the word sense label to the first keyword to obtain the second keyword.
Further, the retrieval text is the text entered by the user, and the search keyword is the keyword that is finally used for searching after the user's retrieval text has been processed.
Further, the original data collected for knowledge base establishment may be processed by a computer program or integrated manually.
Preferably, the knowledge base establishment step comprises:
receiving original data, the original data comprising structured data and unstructured data;
when the original data is structured data, performing data integration on the structured data to obtain the knowledge base;
and when the original data is unstructured data, performing data extraction on the unstructured data to obtain the knowledge base.
With this scheme, a data processing method suited to each type of original data is used, so data is integrated quickly and both the efficiency of knowledge base establishment and the accuracy of the resulting data are improved.
Further, the structured data may be table data; the unstructured data may be text data, picture data, document data, or the like.
Preferably, the data extraction step comprises:
entity extraction, namely extracting the semantic words in the original data;
and relation extraction, namely receiving a relation template, obtaining the relations between the semantic words in the original data according to the relation template, and obtaining the knowledge base.
Further, a semantic word may be a noun in the original data.
Further, the keyword extraction step further comprises:
word segmentation, namely receiving the retrieval text and segmenting it to obtain candidate words;
and first keyword acquisition, namely receiving the candidate words, assigning each candidate word a weight as its first weight, and extracting the N candidate words with the largest first weights as the first keywords.
With this scheme, the retrieval text is first segmented and noise in it is removed, and each candidate word is then assigned a weight that represents its importance and representativeness in the retrieval text. Extracting the candidate words with the largest weights as the first keywords improves the accuracy of first keyword extraction.
The invention also provides an information retrieval method, which improves the information retrieval efficiency.
A second aspect of the present invention provides an information retrieval method, including:
weight receiving, namely receiving the search keyword and the first weight of the first keyword corresponding to the search keyword, and assigning that weight to the search keyword;
text retrieval, namely inputting the search keywords into a database and outputting the documents in the database that are related to the search keywords;
and document sorting, namely receiving the weight of the search keyword in each related document as a second weight, obtaining the document relevance from the first weight and the second weight, and sorting the documents by document relevance.
With this scheme, the first weight represents the importance of the search keyword in the retrieval text and the second weight represents its importance in a related document. The document relevance is obtained from the two weights, and documents with higher relevance are displayed first, which is convenient for the user and improves information retrieval efficiency.
Further, the document sorting step further comprises:
relevance calculation, namely receiving the first weight and the second weight and multiplying them to obtain the document relevance;
and relevance sorting, namely receiving the document relevance of each document related to the search keyword and sorting the documents in descending order of document relevance.
With this scheme, documents with higher document relevance are displayed ahead of those with lower relevance, which makes them easier for the user to view.
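The relevance calculation and sorting steps can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `rank_documents` and the dictionary layouts are hypothetical, and summing the products over several search keywords is an assumption, since the text only states that the first weight is multiplied by the second weight.

```python
from typing import Dict, List, Tuple


def rank_documents(
    first_weights: Dict[str, float],
    second_weights: Dict[str, Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Rank documents by relevance, highest first.

    first_weights:  search keyword -> weight in the retrieval text.
    second_weights: document id -> {search keyword -> weight in that document}.
    """
    scored = []
    for doc_id, doc_weights in second_weights.items():
        # Relevance = first weight * second weight; summed over keywords
        # (the summation is an assumption for multi-keyword queries).
        relevance = sum(
            w1 * doc_weights.get(keyword, 0.0)
            for keyword, w1 in first_weights.items()
        )
        scored.append((doc_id, relevance))
    # Sort from largest to smallest relevance.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


first = {"machine learning": 0.6, "neural network": 0.4}
second = {
    "doc_a": {"machine learning": 0.5},
    "doc_b": {"machine learning": 0.2, "neural network": 0.9},
    "doc_c": {},
}
ranking = rank_documents(first, second)
print([doc_id for doc_id, _ in ranking])  # ['doc_b', 'doc_a', 'doc_c']
```

A document that contains none of the search keywords scores zero here and sorts last.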
A third aspect of the present invention provides a keyword extraction system, including:
the knowledge base establishing module is used for receiving original data and establishing a knowledge base according to the original data, wherein the knowledge base comprises a synonym base, the synonym base collects entries with the same meaning into a synonym set, synonym labels are arranged on the synonym set, and the synonym labels are standardized expressions of words in the synonym set;
the keyword extraction module is used for receiving a search text and extracting keywords in the search text as first keywords;
and the keyword standard processing module is used for receiving the first keyword, inputting the first keyword into the synonym library to obtain a synonym set where the first keyword is located, and taking a synonym label of the synonym set as a search keyword.
Further, the knowledge base further comprises a polysemous word library, the keyword extraction system further comprises a polysemous word disambiguation module, and the polysemous word disambiguation module is used for:
receiving the first keyword;
judging whether the first keyword exists in the polysemous word library;
if not, performing keyword standardization;
if yes, converting the first keyword into a second keyword according to the polysemous word library, inputting the second keyword into the synonym library to obtain the synonym set in which the second keyword is located, and taking the synonym label of that synonym set as a search keyword.
Further, converting the first keyword into the second keyword according to the polysemous word library comprises obtaining a word sense label from the other first keywords of the retrieval text in which the first keyword is located, and adding the word sense label to the first keyword to obtain the second keyword.
Preferably, the knowledge base establishing module comprises:
a raw data receiving module, used for receiving original data, the original data comprising structured data and unstructured data;
a data integration module, used for performing data integration on the original data when it is structured data, to obtain the knowledge base;
and a data extraction module, used for performing data extraction on the original data when it is unstructured data, to obtain the knowledge base.
Preferably, the data extraction module comprises:
the entity extraction module is used for extracting the semantic words in the original data;
and the relation extraction module is used for receiving a relation template, obtaining the relations between the semantic words in the original data according to the relation template, and obtaining the knowledge base.
Further, the keyword extraction module further comprises:
the word segmentation module is used for receiving the retrieval text and segmenting it to obtain candidate words;
and the first keyword acquisition module is used for receiving the candidate words, assigning each candidate word a weight as its first weight, and extracting the N candidate words with the largest first weights as the first keywords.
A fourth aspect of the present invention provides an information retrieval system comprising:
the weight receiving module is used for receiving the search keywords and the first weights of the first keywords corresponding to the search keywords, and assigning the weights to the corresponding search keywords;
the text retrieval module is used for inputting the search keywords into a database and outputting the documents in the database that are related to the search keywords;
and the document sorting module is used for receiving the second weight of the search keyword in each related document, obtaining the document relevance from the first weight and the second weight, and sorting the documents by document relevance.
Further, the document sorting module further comprises:
the relevance calculation module is used for receiving the first weight and the second weight and multiplying them to obtain the document relevance;
and the relevance sorting module is used for receiving the document relevance of all documents related to the search keyword and sorting the documents in descending order of document relevance.
A fifth aspect of the present invention provides a keyword extraction apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the keyword extraction method when executing the program.
A sixth aspect of the present invention provides an information retrieval apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the information retrieval method described above when executing the program.
A seventh aspect of the present invention provides a storage medium including one or more programs executable by a processor to perform the keyword extraction method described above.
An eighth aspect of the present invention provides a storage medium including one or more programs executable by a processor to perform the above-described information retrieval method.
In conclusion, the invention has the following beneficial effects:
1. In the keyword extraction method, the keywords in the user's retrieval text are first extracted and then processed against the synonym library, and the standardized synonym labels are used as search keywords; standardizing the keywords in this way makes the user's input more professional and improves the accuracy of the query results;
2. In the keyword extraction method, when a word has multiple meanings, for example "apple" can mean either Apple the company or the fruit, polysemous word disambiguation is performed first to determine the user's real search intent, which improves the accuracy of the query results;
3. In the keyword extraction method, the retrieval text is first segmented and noise in it is removed, and each candidate word is then assigned a weight that represents its importance and representativeness in the retrieval text; extracting the candidate words with the largest weights as the first keywords improves the accuracy of first keyword extraction;
4. In the information retrieval method, the first weight represents the importance of the search keyword in the retrieval text and the second weight represents its importance in a related document; the document relevance is obtained from the two weights, and documents with higher relevance are displayed first, which is convenient for the user and improves information retrieval efficiency.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of one embodiment of a keyword extraction method of the present invention;
FIG. 2 is a flowchart illustrating another embodiment of a keyword extraction method according to the present invention;
FIG. 3 is a flowchart detailing the steps of knowledge base establishment in the keyword extraction method of the present invention;
FIG. 4 is a flowchart detailing the steps of FIG. 2;
FIG. 5 is a flow chart of an embodiment of an information retrieval method of the present invention;
FIG. 6 is a flowchart detailing the steps of FIG. 5;
FIG. 7 is a flowchart that adds an irrelevant document deletion step to FIG. 6;
FIG. 8 is a diagram illustrating a keyword extraction system according to an embodiment of the present invention;
FIG. 9 is a diagram of another embodiment of a keyword extraction system according to the present invention;
FIG. 10 is a schematic diagram of a refinement of the module of FIG. 9;
FIG. 11 is a schematic diagram of further details of the module of FIG. 10;
FIG. 12 is a diagram illustrating an embodiment of an information retrieval system according to the present invention;
FIG. 13 is a schematic diagram of a module refinement in FIG. 12.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
As shown in FIG. 1, a first aspect of the present invention provides a keyword extraction method, including the following steps:
S100, knowledge base establishment: receiving original data and establishing a knowledge base from the original data, wherein the knowledge base comprises a synonym library, the synonym library groups entries with the same meaning into synonym sets, each synonym set is provided with a synonym label, and the synonym label is the standardized expression of the words in the synonym set.
In a specific implementation, the original data may be text data, table data, or picture data, and the knowledge base is a word library.
In a specific implementation, a synonym label in the synonym library may be "artificial intelligence", and the synonym set corresponding to that label may contain "artificial intelligence", "AI", "neural network", "deep learning", and the like.
S200, keyword extraction: receiving a retrieval text and extracting the keywords in the retrieval text as first keywords.
In a specific implementation, the keywords are the terms that are representative of the retrieval text.
In a specific implementation, the retrieval text may be "a neural network is a part of artificial intelligence learning", and the keywords of the retrieval text may be "neural network" or "artificial intelligence".
S400, keyword standardization: receiving the first keyword, inputting the first keyword into the synonym library to obtain the synonym set in which the first keyword is located, and taking the synonym label of that synonym set as a search keyword.
In a specific implementation, the first keyword may be "data intelligence". The synonym set containing "data intelligence" also includes "machine learning", "machine intelligence", and "ML"; if the synonym label of that set is "machine learning", then "machine learning" is used as the search keyword for the search.
With this scheme, the keywords in the user's retrieval text are first extracted and then processed against the synonym library, and the standardized synonym labels are used as search keywords. Standardizing the keywords in this way makes the user's input more professional and improves the accuracy of the query results.
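The synonym library lookup of S400 can be sketched as a simple inverted index. The following is an illustrative sketch only: the dictionary layout, the names `SYNONYM_LIBRARY`, `build_index` and `standardize`, the case-insensitive matching, and the fallback of returning an unknown keyword unchanged are all assumptions; the library entries reuse the examples given above.

```python
from typing import Dict, List

# Hypothetical synonym library: synonym label -> synonym set.
SYNONYM_LIBRARY: Dict[str, List[str]] = {
    "machine learning": [
        "machine learning", "machine intelligence", "data intelligence", "ML",
    ],
    "artificial intelligence": [
        "artificial intelligence", "AI", "neural network", "deep learning",
    ],
}


def build_index(library: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the library so that every synonym maps to its synonym label."""
    index: Dict[str, str] = {}
    for label, synonyms in library.items():
        for word in synonyms:
            index[word.lower()] = label
    return index


def standardize(first_keyword: str, index: Dict[str, str]) -> str:
    """Return the synonym label of the set containing the first keyword.

    A keyword that is not in any synonym set is returned unchanged
    (an assumption; the source does not describe this case).
    """
    return index.get(first_keyword.lower(), first_keyword)


index = build_index(SYNONYM_LIBRARY)
print(standardize("data intelligence", index))  # machine learning
print(standardize("ML", index))                 # machine learning
```

Building the inverted index once keeps each lookup to a single dictionary access, which matters when the synonym library is large.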
As shown in FIG. 2, in a preferred embodiment of the present invention, the knowledge base further includes a polysemous word library, and the keyword extraction method further includes step S300, polysemous word disambiguation, which includes:
receiving the first keyword;
judging whether the first keyword exists in the polysemous word library;
if not, performing keyword standardization;
if yes, converting the first keyword into a second keyword according to the polysemous word library, inputting the second keyword into the synonym library to obtain the synonym set in which the second keyword is located, and taking the synonym label of that synonym set as a search keyword.
In a specific implementation, an entry in the polysemous word library may be "neural network", which has two meanings: the biological neural network and the neural network in computing.
With this scheme, when a word has multiple meanings, for example "apple" can mean either Apple the company or the fruit, polysemous word disambiguation is performed first to determine the user's real search intent, which improves the precision of the query results.
In a specific implementation, converting the first keyword into the second keyword according to the polysemous word library comprises obtaining a word sense label from the other first keywords of the retrieval text in which the first keyword is located, and adding the word sense label to the first keyword to obtain the second keyword.
In a specific implementation, if the first keyword is "machine learning", which has only one meaning, polysemous word disambiguation is not required. If the first keyword is "neural network", disambiguation is required: if the retrieval text is "neural network is an important machine learning technique" and the first keywords of the retrieval text include "neural network" and "machine learning", the word sense label "machine learning" can be added to "neural network".
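The disambiguation step S300 can be sketched as choosing the sense whose context cues overlap most with the other first keywords. This is a sketch under stated assumptions: the layout of `POLYSEMY_LIBRARY`, the cue words, the overlap-count scoring, and the `word[label]` output format are hypothetical; the source only says that the word sense label is obtained from the other first keywords of the retrieval text.

```python
from typing import Dict, Iterable, Set

# Hypothetical polysemous word library: word -> {word sense label -> context cues}.
POLYSEMY_LIBRARY: Dict[str, Dict[str, Set[str]]] = {
    "neural network": {
        "machine learning": {"machine learning", "deep learning", "artificial intelligence"},
        "human body structure": {"neuron", "brain", "biology"},
    },
    "apple": {
        "fruit": {"fruit", "orchard", "juice"},
        "company": {"iphone", "mac", "company"},
    },
}


def disambiguate(
    first_keyword: str,
    other_keywords: Iterable[str],
    library: Dict[str, Dict[str, Set[str]]] = POLYSEMY_LIBRARY,
) -> str:
    """Convert a first keyword into a second keyword such as 'apple[fruit]'."""
    senses = library.get(first_keyword)
    if senses is None:
        # Not polysemous: pass through to keyword standardization unchanged.
        return first_keyword
    context = {keyword.lower() for keyword in other_keywords}
    best_label, best_overlap = None, 0
    for label, cues in senses.items():
        overlap = len(context & cues)
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    if best_label is None:
        # No supporting context: leave the keyword unresolved (an assumption).
        return first_keyword
    return f"{first_keyword}[{best_label}]"


print(disambiguate("neural network", ["machine learning"]))
# neural network[machine learning]
```

The resulting second keyword would then be looked up in the synonym library in place of the first keyword.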
In a specific implementation, the retrieval text is the text entered by the user, and the search keyword is the keyword that is finally used for searching after the user's retrieval text has been processed.
In a specific implementation, the original data collected for knowledge base establishment may be processed by a computer program, may be integrated manually, or may be received and merged into an existing knowledge base.
In a specific implementation, the data of the synonym library may be as shown in Table 1 below, and the data of the polysemous word library may be as shown in Table 2 below. In Table 2, "apple[fruit]" may be a second keyword, where "apple" is the first keyword and "[fruit]" is the word sense label.
Table 1
Figure BDA0002828091380000081
Table 2
Figure BDA0002828091380000082
As shown in FIG. 4, in a specific implementation, step S100, knowledge base establishment, includes:
S110, receiving original data, wherein the original data comprises structured data and unstructured data.
When the original data is structured data, S120, data integration, is performed on the structured data to obtain the knowledge base.
In a specific implementation, S120, data integration, extracts terms that already have relations in the original data and adds them to the knowledge base.
In a specific implementation, the terms with an existing relation may be "neural network" and "machine learning": if a table records that "neural network" is a kind of "machine learning", this relation can be integrated into the knowledge base directly.
When the original data is unstructured data, S130, data extraction, is performed on the unstructured data to obtain the knowledge base.
With this scheme, a data processing method suited to each type of original data is used, so data is integrated quickly and both the efficiency of knowledge base establishment and the accuracy of the resulting data are improved.
In a specific implementation, the structured data may be table data; the unstructured data may be text data, picture data, document data, or the like.
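The branch between S120 and S130 can be sketched as a dispatch on the type of each piece of original data. Everything in this sketch is an assumption made for illustration: a table is represented as a list of row dictionaries with hypothetical column names `term`, `related_term` and `relation`, free text is represented as a string, and the extractor for unstructured data is passed in as a callable because the extraction itself is described separately.

```python
from typing import Callable, Dict, Iterable, List, Tuple, Union

Triple = Tuple[str, str, str]      # (term, related term, relation)
Table = List[Dict[str, str]]       # structured data: rows of a table


def integrate_structured(rows: Table) -> List[Triple]:
    """S120: copy relations that already exist in table rows into triples."""
    return [(row["term"], row["related_term"], row["relation"]) for row in rows]


def build_knowledge_base(
    raw_items: Iterable[Union[Table, str]],
    extract_unstructured: Callable[[str], List[Triple]],
) -> List[Triple]:
    """S100: route each raw item to data integration or data extraction."""
    knowledge_base: List[Triple] = []
    for item in raw_items:
        if isinstance(item, list):
            # Structured data (a table): data integration, S120.
            knowledge_base.extend(integrate_structured(item))
        else:
            # Unstructured data (free text): data extraction, S130.
            knowledge_base.extend(extract_unstructured(item))
    return knowledge_base


table = [{"term": "neural network", "related_term": "machine learning", "relation": "belongs to"}]
# Stand-in extractor that returns a fixed triple, used only to show the routing.
stub_extractor = lambda text: [("Shanghai", "China", "belongs to")]
kb = build_knowledge_base([table, "Shanghai is located in the southeast of China"], stub_extractor)
print(kb)
```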
As shown in FIG. 3, in a preferred embodiment of the present invention, step S130, data extraction, includes:
S131, entity extraction: extracting the semantic words in the original data.
In a specific implementation, S131, entity extraction, may extract the nouns in the original data.
In a preferred embodiment of the invention, entity extraction uses the classical Bi-LSTM + CRF model architecture. Bi-LSTM + CRF is an optimization of the earlier Bi-LSTM + maximum entropy approach: its main idea is to attach a conditional random field (CRF) model on top of the Bi-LSTM as the decoding layer, so that the plausibility of adjacent predicted labels is taken into account. LSTM is an improved recurrent neural network that addresses the difficulty plain RNNs have with long-distance dependencies. A Bi-LSTM feeds the input sequence into the model twice, once in the forward direction and once in the backward direction, and concatenates the two outputs to obtain the final representation of each word.
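The role of the CRF decoding layer can be illustrated with Viterbi decoding over per-token scores. This sketch covers only the decoding step: the emission scores below are hand-written stand-ins for what a trained Bi-LSTM would output, the transition scores are hand-set rather than learned, and the function name `viterbi_decode` and the B/I/O tag set are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[t][tag]       : score of `tag` at position t (stand-in for Bi-LSTM output).
    transitions[(prev, cur)]: score of moving from `prev` to `cur`; missing pairs score 0.
    """
    if not emissions:
        return []
    # best[t][tag] = best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["B", "I", "O"]
emissions = [
    {"B": 0.9, "I": 0.0, "O": 1.0},
    {"B": 0.1, "I": 1.0, "O": 0.2},
    {"B": 0.0, "I": 0.1, "O": 1.0},
]
# Penalize the implausible transition O -> I (an entity cannot start with I).
transitions = {("O", "I"): -5.0}
print(viterbi_decode(emissions, transitions, tags))  # ['B', 'I', 'O']
```

Picking the best tag per token independently would give O, I, O here, which contains the implausible O to I step; scoring transitions jointly is what lets the decoder prefer B, I, O.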
S132, extracting the relation, receiving the relation template, obtaining the relation between the meaning words in the original data according to the relation template, and obtaining a knowledge base.
In a specific implementation process, the S132 and the relationship extraction use a remote Supervision algorithm to perform extraction, and the remote supervised english name distance provision is a common practice in the relationship extraction at present. The method is firstly proposed by M Mintz on ACL2009, and is neither a pure supervised corpus in the traditional sense nor unsupervised. It is a method of using KB to align the markup of plain text (distance super for translation extraction with out labeled data).
In a specific implementation process, the original data may be "Shanghai is located in the southeast of China", from which the extracted entities are "Shanghai", "China" and "southeast"; the relation template may be "Los Angeles is located in the west of the United States; Los Angeles belongs to the United States"; and the relation extraction of S132 yields (Shanghai, China, belonging relationship).
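The template-based example above can be sketched as follows. This is only an illustration under stated assumptions: the relation template is approximated by a regular expression over English sentences, and the names `LOCATED_IN` and `extract_relation` are hypothetical and do not come from the original text.

```python
import re
from typing import Optional, Tuple

# Hypothetical template: "<head> is located in the <direction> of <tail>"
# is taken as evidence for the triple (head, tail, belonging relationship).
LOCATED_IN = re.compile(
    r"(?P<head>[A-Z][\w ]*?) is located in the \w+ of (?:the )?(?P<tail>[A-Z][\w ]*)"
)


def extract_relation(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Return (head, tail, relation) if the sentence fits the template, else None."""
    match = LOCATED_IN.fullmatch(sentence.strip().rstrip("."))
    if match is None:
        return None
    return (match.group("head"), match.group("tail"), "belonging relationship")


print(extract_relation("Shanghai is located in the southeast of China"))
print(extract_relation("Los Angeles is located in the west of the United States"))
```

A sentence that does not fit the template returns `None`, so it contributes no triple to the knowledge base.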
In a specific implementation process, after the relation extraction, a synonym library can be generated, for example (machine learning, machine intelligence, synonym relationship); likewise, by the method of S132, entries such as (neural network, machine learning, belonging relationship) and (neural network, human body structure, belonging relationship) can be generated after the relation extraction.
In a specific implementation, the semantic word may be a noun in the original data.
As shown in fig. 4, in a specific implementation process, the keyword extraction step S200 further includes:
S210, word segmentation processing, namely receiving the search text, and performing word segmentation on the search text to obtain candidate words;
in a specific implementation process, the word segmentation may be performed with the jieba, ansj or HanLP word segmenter.
S220, obtaining a first keyword, namely receiving the candidate words, assigning a weight to each candidate word as a first weight, and extracting the N candidate words with the largest first weights as the first keywords.
In a specific implementation process, the first weight may be assigned to the candidate words through the TF-IDF, LDA or TextRank algorithm.
In a preferred embodiment of the invention, the weights are assigned using the TextRank algorithm.
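As a minimal sketch of weight assignment with TextRank, the following builds an undirected co-occurrence graph over the candidate words and iterates the unweighted TextRank update S(v) = (1 - d) + d * sum of S(u) / degree(u) over the neighbours u of v. The window size, the damping factor 0.85, the fixed iteration count and the function name `textrank` are assumptions made for illustration, not values given in the original text.

```python
from collections import defaultdict
from typing import Dict, List


def textrank(
    tokens: List[str],
    window: int = 2,
    damping: float = 0.85,
    iterations: int = 100,
) -> Dict[str, float]:
    """Score words by iterating the unweighted TextRank update on a co-occurrence graph."""
    # Two words are linked when they appear within `window` tokens of each other.
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbours[n]) for n in neighbours[word])
            for word in neighbours
        }
    return scores


# Hypothetical candidate-word sequence in which "a" co-occurs with every other word.
weights = textrank(["a", "b", "a", "c", "a", "d"])
print(max(weights, key=weights.get))
```

The resulting scores can serve as the first weights of the candidate words; a word that co-occurs with many other words receives a larger weight.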
By adopting the scheme, the search text is segmented and the noise data in the search text are removed; weights are then assigned to the candidate words, the weight of a candidate word representing its importance and representativeness in the search text; the candidate words with larger weights are extracted as the first keywords, which improves the extraction accuracy of the first keywords.
In a specific implementation process, the search text may be "usage time of the Wuhan Changjiang River Bridge". After the word segmentation processing of S210, the candidate words are "Wuhan Changjiang River Bridge", "usage" and "time". The assigned weights of the candidate words may be 0.5 for "Wuhan Changjiang River Bridge", 0.1 for "usage" and 0.3 for "time". Extracting the N candidate words with the largest first weights may mean taking the 2 candidate words with the largest first weights, so that "Wuhan Changjiang River Bridge" and "time" are the first keywords.
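The selection in this example can be sketched in a few lines. The function name `first_keywords` is hypothetical; the weights are the ones given in the example above.

```python
from typing import Dict, List


def first_keywords(weights: Dict[str, float], n: int) -> List[str]:
    """Return the n candidate words with the largest first weights."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:n]


weights = {"Wuhan Changjiang River Bridge": 0.5, "usage": 0.1, "time": 0.3}
print(first_keywords(weights, 2))  # ['Wuhan Changjiang River Bridge', 'time']
```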
As shown in fig. 5, a second aspect of the present invention provides an information retrieval method, including:
S500, weight receiving, namely receiving the search keyword and the first weight of the first keyword corresponding to the search keyword, and assigning that weight to the corresponding search keyword;
in a specific implementation process, the search keyword may be "machine learning", the first keyword may be "neural network", and if the first weight of "machine learning" is 0.6, the weight of "neural network" is also 0.6.
S600, text retrieval, namely inputting the search keywords into a database and outputting documents related to the search keywords in the database;
in a specific implementation process, the database may be an ES (Elasticsearch) database. ES is an open-source distributed search engine that exposes a RESTful web interface and is built on Apache Lucene. ES is also a distributed document database in which every field can be indexed and the data of every field can be searched; it can scale horizontally to hundreds of servers to store and process PB-level data, and can store, search and analyze large amounts of data in a very short time. It is typically used as the core engine in complex search scenarios.
S700, document sorting, namely receiving the weight of the search keyword in the related documents as a second weight, obtaining the document relevance according to the first weight and the second weight, and sorting the documents according to the document relevance.
In a specific implementation process, if the search keyword is "machine learning", the second weight is a weight occupied by the neural network in the relevant documents searched from the database.
By adopting the scheme, the first weight represents the importance degree of the search keyword in the retrieval text, the second weight represents the importance degree of the search keyword in the related documents, the document relevance is obtained according to the first weight and the second weight, the documents with higher relevance are preferentially displayed, the user can conveniently check the documents, and the information retrieval efficiency is improved.
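For the text retrieval step S600 with an ES database, a request body could be assembled as sketched below. The `bool`, `should` and `match` constructs belong to the Elasticsearch query DSL; the field name `content` and the function name `build_query` are assumptions for illustration, and sending the request to a running cluster is deliberately left out.

```python
import json
from typing import Dict, List


def build_query(search_keywords: List[str], field: str = "content") -> Dict:
    """Build a query body that matches documents containing any search keyword."""
    should = [{"match": {field: keyword}} for keyword in search_keywords]
    return {"query": {"bool": {"should": should}}}


# Hypothetical search keywords obtained from the keyword standard processing.
body = build_query(["machine learning", "category"])
print(json.dumps(body, indent=2))
```

The documents returned for such a query would then be passed to the document sorting step S700, where the first and second weights are combined.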
As shown in fig. 6, in the specific implementation process, the document ranking step further includes:
S710, relevance calculation, namely receiving the first weight and the second weight, and multiplying the first weight by the second weight to obtain the document relevance;
in a specific implementation process, if the only first keyword in the search text is "machine learning", with a first weight of 0.6 and a second weight of 10, the document relevance is 0.6 × 10 = 6; if the first keywords in the search text are "machine learning" and "category", where "machine learning" has a first weight of 0.6 and a second weight of 10 and "category" has a first weight of 0.2 and a second weight of 20, the document relevance is 0.6 × 10 + 0.2 × 20 = 10.
S720, relevance ranking, namely receiving the document relevance of the documents related to the search keyword, and ranking the documents by document relevance from large to small.
By adopting the scheme, documents with higher document relevance are displayed ahead of documents with lower document relevance, which is convenient for the user to view.
As shown in fig. 7, in a preferred embodiment of the present invention, the document sorting step further includes S730, deleting irrelevant documents, and the step S730 of deleting irrelevant documents includes:
receiving the documents ranked by document relevance from large to small;
receiving a display quantity threshold M, taking the first M documents with the highest document relevance according to the display quantity threshold, and not displaying the documents whose document relevance is lower than that of the Mth document.
By adopting the scheme, the number of the documents required to be consulted by the user is reduced, the document refining degree is improved, and the retrieval efficiency is improved.
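The document sorting steps S710 to S730 can be sketched as follows. The function names and the document identifiers are hypothetical, and treating a keyword that is absent from a document as having a second weight of 0 is an assumption that the original text does not state.

```python
from typing import Dict, List, Tuple


def document_relevance(first: Dict[str, float], second: Dict[str, float]) -> float:
    """S710: sum of first weight x second weight over the search keywords."""
    # Assumption: a keyword missing from the document contributes 0.
    return sum(weight * second.get(keyword, 0.0) for keyword, weight in first.items())


def rank_documents(
    first: Dict[str, float],
    documents: Dict[str, Dict[str, float]],
    display_threshold: int,
) -> List[Tuple[str, float]]:
    """S720 and S730: rank by relevance, largest first, and keep the first M documents."""
    scored = [
        (doc_id, document_relevance(first, second))
        for doc_id, second in documents.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:display_threshold]


first_weights = {"machine learning": 0.6, "category": 0.2}
documents = {
    "doc_a": {"machine learning": 10, "category": 20},
    "doc_b": {"machine learning": 10},
    "doc_c": {"category": 5},
}
for doc_id, relevance in rank_documents(first_weights, documents, display_threshold=2):
    print(doc_id, relevance)
```

With these hypothetical documents, `doc_a` reproduces the 0.6 × 10 + 0.2 × 20 case from the example above, `doc_b` reproduces the 0.6 × 10 case, and `doc_c` falls below the display quantity threshold and is not shown.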
As shown in fig. 8, a third aspect of the present invention provides a keyword extraction system, including:
the knowledge base establishing module 100 is configured to receive raw data, establish a knowledge base according to the raw data, where the knowledge base includes a synonym base, the synonym base collects entries with the same meaning as a synonym set, and sets synonym tags on the synonym set, where the synonym tags are standardized expressions of words in the synonym set;
the keyword extraction module 200 is configured to receive a search text, and extract a keyword in the search text as a first keyword;
and the keyword standard processing module 400 is configured to receive the first keyword, input the first keyword into the synonym library, obtain a synonym set where the first keyword is located, and use a synonym tag of the synonym set as a search keyword.
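As a minimal sketch of the keyword standard processing, the synonym library can be modelled as a mapping from each synonym tag to its synonym set. The library contents, the choice of "machine learning" as the synonym tag, the function name `to_search_keyword` and the pass-through behaviour for words that are in no synonym set are all assumptions made for illustration.

```python
from typing import Dict, Set

# Hypothetical synonym library: synonym tag -> synonym set.
SYNONYM_LIBRARY: Dict[str, Set[str]] = {
    "machine learning": {"machine learning", "machine intelligence"},
}


def to_search_keyword(first_keyword: str, library: Dict[str, Set[str]]) -> str:
    """Return the synonym tag of the synonym set that contains the first keyword."""
    for tag, synonym_set in library.items():
        if first_keyword in synonym_set:
            return tag
    # Assumption: a word found in no synonym set is used unchanged.
    return first_keyword


print(to_search_keyword("machine intelligence", SYNONYM_LIBRARY))  # machine learning
```

Because every word of a synonym set maps to the same synonym tag, differently worded search texts lead to the same search keyword.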
As shown in fig. 9, in a specific implementation process, the knowledge base further includes a polysemous word library, the keyword extraction system further includes a polysemous word disambiguation module 300, and the polysemous word disambiguation module 300 is configured for:
receiving the first keyword;
judging whether the first keyword exists in the polysemous word library;
if not, performing keyword standard processing;
if yes, converting the first keyword into a second keyword according to the polysemous word library, inputting the second keyword into the synonym library to obtain the synonym set where the second keyword is located, and taking the synonym tag of that synonym set as the search keyword.
In a specific implementation process, converting the first keyword into the second keyword according to the polysemous word library comprises obtaining a word sense label according to the other first keywords of the search text in which the first keyword is located, and adding the word sense label to the first keyword to obtain the second keyword.
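The conversion from a first keyword to a second keyword can be sketched as follows. The layout of the polysemous word library (each sense label with a set of cue words), the overlap-based choice of the word sense label, the "word (label)" format of the second keyword and the name `to_second_keyword` are assumptions for illustration only; the original text does not specify how the word sense label is derived from the other first keywords.

```python
from typing import Dict, List, Set

# Hypothetical polysemous word library: word -> {word sense label: cue words}.
POLYSEMY_LIBRARY: Dict[str, Dict[str, Set[str]]] = {
    "neural network": {
        "machine learning": {"machine learning", "training", "model"},
        "human body structure": {"human body structure", "neuron", "brain"},
    },
}


def to_second_keyword(
    first_keyword: str,
    other_first_keywords: List[str],
    library: Dict[str, Dict[str, Set[str]]],
) -> str:
    """Attach a word sense label to a polysemous first keyword."""
    senses = library.get(first_keyword)
    if senses is None:
        # Not polysemous: the keyword goes directly to keyword standard processing.
        return first_keyword
    context = set(other_first_keywords)
    # Assumption: pick the sense whose cue words overlap most with the context.
    label = max(senses, key=lambda sense: len(senses[sense] & context))
    return f"{first_keyword} ({label})"


print(to_second_keyword("neural network", ["brain"], POLYSEMY_LIBRARY))
```

The labelled second keyword is then looked up in the synonym library in the same way as an ordinary first keyword.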
As shown in fig. 10, in a specific implementation process, the knowledge base building module 100 includes:
a raw data receiving module 110, configured to receive raw data, the raw data comprising structured data and unstructured data;
a data integration module 120, configured to perform data integration on the structured data to obtain a knowledge base when the original data is structured data;
a data extraction module 130, configured to perform data extraction on the unstructured data to obtain a knowledge base when the original data is unstructured data.
As shown in fig. 11, in a specific implementation process, the data extraction module 130 includes:
an entity extraction module 131, configured to extract semantic words in the original data;
the relation extracting module 132 is configured to receive the relation template, obtain a relation between the semantic words in the original data according to the relation template, and obtain a knowledge base.
As shown in fig. 10, in a specific implementation process, the keyword extraction module 200 further includes:
a word segmentation processing module 210, configured to receive the search text and perform word segmentation on the search text to obtain candidate words;
a first keyword obtaining module 220, configured to receive the candidate words, assign a weight to each candidate word as a first weight, and extract the N candidate words with the largest first weights as the first keywords.
As shown in fig. 12, a fourth aspect of the present invention provides an information retrieval system including:
a weight receiving module 500, configured to receive the search keyword and a first weight of a first keyword corresponding to the search keyword, and assign the weight to the corresponding search keyword;
a text retrieval module 600, configured to input the search keyword into a database, and output a document in the database related to the search keyword;
and the document sorting module 700 is configured to receive the second weight, which is the weight of the search keyword in the related documents, obtain the document relevancy according to the first weight and the second weight, and sort the documents according to the document relevancy.
As shown in fig. 13, in a specific implementation process, the document ranking module 700 further includes:
a relevancy calculating module 710, configured to receive the first weight and the second weight, and multiply the first weight by the second weight to obtain the document relevancy;
and the relevancy sorting module 720 is configured to receive the document relevancy of all the documents related to the search keyword, and sort the documents according to the document relevancy from large to small.
In a preferred embodiment of the present invention, the document ranking module 700 further includes an irrelevant document deleting module 730, and the irrelevant document deleting module 730 is configured for:
receiving the documents ranked by document relevance from large to small;
receiving a display quantity threshold M, taking the first M documents with the highest document relevance according to the display quantity threshold, and not displaying the documents whose document relevance is lower than that of the Mth document.
A fifth aspect of the present invention provides a keyword extraction apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the keyword extraction method when executing the program.
A sixth aspect of the present invention provides an information retrieval apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the information retrieval method described above when executing the program.
A seventh aspect of the present invention provides a storage medium including one or more programs executable by a processor to perform the keyword extraction method described above.
An eighth aspect of the present invention provides a storage medium including one or more programs executable by a processor to perform the above-described information retrieval method.
It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the protection scope of the claims of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
It should be understood that the technical problems can also be solved by combining the features of the embodiments recited in the claims.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A keyword extraction method is characterized by comprising the following steps:
establishing a knowledge base, namely receiving original data, establishing the knowledge base according to the original data, wherein the knowledge base comprises a synonym base, the synonym base collects entries with the same meaning into a synonym set, synonym labels are arranged on the synonym set, and the synonym labels are standardized expressions of words in the synonym set;
extracting keywords, namely receiving a retrieval text, and extracting the keywords in the retrieval text as first keywords;
and performing keyword standard processing, namely receiving the first keyword, inputting the first keyword into the synonym library to obtain a synonym set where the first keyword is located, and taking a synonym label of the synonym set as a search keyword.
2. The keyword extraction method according to claim 1, characterized in that: the knowledge base also comprises a polysemous word base, the keyword extraction method also comprises the step of disambiguating polysemous words, and the step of disambiguating the polysemous words comprises the following steps:
receiving the first keyword;
judging whether the first keyword exists in the polysemous word bank or not;
if not, performing keyword standard processing;
if yes, converting the first keyword into a second keyword according to the polysemous word library, inputting the second keyword into the synonym library to obtain a synonym set where the second keyword is located, and taking a synonym label of the synonym set as a search keyword.
3. The keyword extraction method according to claim 2, characterized in that: the step of converting the first keyword into the second keyword according to the polysemous word library comprises obtaining a word sense label according to the other first keywords of the search text in which the first keyword is located, and adding the word sense label to the first keyword to obtain the second keyword.
4. The keyword extraction method according to any one of claims 1 to 3, characterized in that: the step of establishing the knowledge base comprises the following steps:
receiving raw data, the raw data comprising structured data and unstructured data;
when the original data are structured data, integrating the data of the structured data to obtain a knowledge base;
and when the original data is unstructured data, performing data extraction on the unstructured data to obtain a knowledge base.
5. The keyword extraction method according to claim 4, characterized in that: the data extraction step comprises:
extracting entities, namely extracting the semantic words in the original data;
and extracting the relation, namely receiving the relation template, obtaining the relations between the semantic words in the original data according to the relation template, and obtaining a knowledge base.
6. The keyword extraction method according to claim 1 or 5, characterized in that: the keyword extraction step further comprises:
performing word segmentation processing, namely receiving the search text, and performing word segmentation on the search text to obtain candidate words;
and acquiring a first keyword, namely receiving the candidate words, assigning a weight to each candidate word as a first weight, and extracting the N candidate words with the largest first weights as the first keywords.
7. An information retrieval method, comprising:
weight receiving, namely receiving the search keyword obtained by the keyword extraction method according to any one of claims 1 to 6 and the first weight of the first keyword corresponding to the search keyword, and assigning that weight to the corresponding search keyword;
text retrieval, inputting the search keywords into a database, and outputting documents related to the search keywords in the database;
and document sorting, namely receiving the weight of the search keyword in the related documents as a second weight, obtaining the document relevance according to the first weight and the second weight, and sorting the documents according to the document relevance.
8. The information retrieval method according to claim 7, characterized in that: the document ranking step further comprises:
calculating the relevancy, namely receiving the first weight and the second weight, and multiplying the first weight by the second weight to obtain the relevancy of the document;
and ranking the relevancy, receiving the relevancy of the documents related to the search keyword, and ranking the documents from large to small according to the relevancy of the documents.
9. A keyword extraction apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the keyword extraction method according to any one of claims 1 to 6 when executing the program.
10. An information retrieval apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the information retrieval method as claimed in claim 7 or 8 when executing the program.
CN202011454537.7A 2020-12-10 2020-12-10 Keyword extraction and information retrieval method Pending CN112559684A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011454537.7A CN112559684A (en) 2020-12-10 2020-12-10 Keyword extraction and information retrieval method

Publications (1)

Publication Number Publication Date
CN112559684A true CN112559684A (en) 2021-03-26

Family

ID=75061645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011454537.7A Pending CN112559684A (en) 2020-12-10 2020-12-10 Keyword extraction and information retrieval method

Country Status (1)

Country Link
CN (1) CN112559684A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210326