CN114638225A - Automatic keyword extraction method based on scientific and technological literature graph network - Google Patents
- Publication number
- CN114638225A (application CN202210227126.7A)
- Authority
- CN
- China
- Prior art keywords
- scientific
- node
- literature
- information
- technical literature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an automatic keyword extraction method based on a scientific and technical literature graph network, comprising the following steps: establishing a scientific and technical literature graph network for a given collection of scientific and technical literature according to literature citation relations and co-author information; establishing a data organization model based on the scientific and technical literature graph network; and extracting the document's own information from the scientific and technical document to be processed, and combining it with the graph network information acquired through the data organization model to complete keyword extraction. The document's own information comprises publication time, title, abstract, body text, references and authors. By constructing a scientific and technical literature graph network from co-author and citation relations and extracting keywords on that basis, the invention further improves automatic keyword extraction for scientific and technical literature. The graph network construction method and the graph network data organization model allow the graph network information to be fully utilized and address the problem of how to make use of graph network data.
Description
Technical Field
The invention relates to the technical field of computer application, natural language processing and automatic keyword extraction, in particular to a method for automatically extracting keywords based on a scientific and technical literature graph network.
Background
Traditional keyword extraction methods for scientific and technical literature rely only on the information contained in the document itself. They ignore the network relations among documents and fail to apply the semantic information of the scientific and technical literature graph network to keyword extraction. A scientific document generally has several authors, and those authors also publish other scientific documents. A document is related through citation to the documents it cites, and it can in turn be cited by other documents. Through co-authorship and citation, scientific and technical literature forms a complex graph network.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an automatic keyword extraction method based on a scientific and technical literature graph network.
In order to achieve the above object, the present invention provides a method for automatically extracting keywords based on a scientific and technical literature graph network, wherein the method comprises:
step 1) establishing a scientific and technical literature graph network for a given collection of scientific and technical literature according to literature citation relations and co-author information;
step 2) establishing a data organization model based on the scientific and technical literature graph network;
step 3) extracting the document's own information from the scientific and technical document to be processed, and combining it with the scientific and technical literature graph network information acquired based on the data organization model to complete keyword extraction; the document's own information comprises publication time, title, abstract, body text, references and authors.
As an improvement of the above method, the step 1) specifically includes:
step 1-1) setting each document in the given collection of scientific and technical literature as a node;
step 1-2) traversing the nodes, repeating step 1-3) and step 1-4) for each node, and proceeding to step 1-5) once every node has been traversed;
step 1-3) according to the citation relations of the document, establishing an edge that points from the node of the citing document to the node of the cited document; setting the category of the citing document's node to reference node, and the category of the cited document's node to referenced node;
step 1-4) according to the co-author information, establishing an edge between the nodes of documents that share a co-author, and setting the category of those nodes to co-author node;
step 1-5) obtaining the scientific and technical literature graph network.
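Steps 1-1) to 1-5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the input layout (a dict of document ids with hypothetical "authors" and "references" keys), the function name `build_literature_graph`, and the choice to store citation edges as directed and co-author edges as undirected are all assumptions. Node category labels are omitted for brevity.

```python
from collections import defaultdict
from itertools import combinations


def build_literature_graph(documents):
    """Build citation and co-author edges for a document collection.

    `documents` maps a document id to a dict with the hypothetical keys
    "authors" (list of names) and "references" (list of cited ids).
    """
    cites = defaultdict(set)      # directed: citing id -> cited ids
    coauthor = defaultdict(set)   # undirected: id <-> ids sharing an author
    by_author = defaultdict(set)  # author name -> ids of their documents

    for doc_id, doc in documents.items():
        # Step 1-3: one directed edge per citation inside the collection.
        for ref_id in doc.get("references", []):
            if ref_id in documents and ref_id != doc_id:
                cites[doc_id].add(ref_id)
        for author in doc.get("authors", []):
            by_author[author].add(doc_id)

    # Step 1-4: connect every pair of documents that share an author.
    for doc_ids in by_author.values():
        for a, b in combinations(sorted(doc_ids), 2):
            coauthor[a].add(b)
            coauthor[b].add(a)

    return dict(cites), dict(coauthor)


# Hypothetical three-document collection.
docs = {
    "A": {"authors": ["Li", "Wang"], "references": ["B"]},
    "B": {"authors": ["Zhao"], "references": []},
    "C": {"authors": ["Wang"], "references": []},
}
cites, coauthor = build_literature_graph(docs)
```

Grouping documents by author first avoids comparing every pair of documents directly; author name disambiguation, which a real collection would need, is not handled here.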
As a modification of the above method, the step 2) specifically includes:
step 2-1) setting the node key information according to the category of each node;
step 2-2) calculating the weights of the node key information according to the category of the node and the node key information.
As an improvement of the above method, the step 2-1) specifically comprises:
for a node whose category is reference node, the node key information comprises: title, abstract, time, keywords, reference level, and reference fragment; wherein the time is the publication time of the document, and the reference level is the citation distance between documents;
for a node whose category is referenced node, the node key information comprises: title, abstract, time, keywords, referenced level, and referenced fragment; wherein the referenced level is the distance at which the document is cited;
for a node whose category is co-author node, the node key information comprises: title, abstract, time, keywords, and co-author level; wherein the co-author level is the co-author association distance between documents.
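One possible way to hold the node key information of step 2-1) is a small record type. The sketch below is an assumption-laden illustration: the names `NodeCategory` and `NodeKeyInfo`, the use of a publication year as the "time" field, and the single `level` and `fragment` fields shared by the three categories are hypothetical choices, not taken from the patent.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class NodeCategory(Enum):
    REFERENCE = "reference"
    REFERENCED = "referenced"
    CO_AUTHOR = "co-author"


@dataclass
class NodeKeyInfo:
    category: NodeCategory
    title: str
    abstract: str
    year: int            # publication time, assumed here to be a year
    keywords: List[str]
    level: int           # reference, referenced or co-author level
    fragment: Optional[str] = None  # citation fragment; not used for co-author nodes

    def __post_init__(self):
        # Co-author nodes list no fragment among their key information.
        if self.category is NodeCategory.CO_AUTHOR and self.fragment is not None:
            raise ValueError("co-author nodes carry no citation fragment")
```

Keeping one record type with a category field makes it easy to apply the category-specific weights later with a single lookup.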
As an improvement of the above method, the step 2-2) specifically includes:
for a node whose category is reference node, the information weight of the title is $1.5Q_1$, the information weight of the abstract is $Q_1$, the information weight of the keywords is $2Q_1$, and the information weight of the reference fragment is $1.2Q_1$; wherein $Q_1$ is the reference weight of reference-node literature and satisfies the following formula:
$$Q_1 = A \times \left(1 - \frac{\text{time difference}}{10}\right) \times \left(1 - \frac{\text{reference level}}{5}\right)$$
wherein $A$ is the weight base of reference-node literature, and the time difference is the difference between the publication times of the citing document and the cited document;
for a node whose category is referenced node, the information weight of the title is $1.5Q_2$, the information weight of the abstract is $Q_2$, the information weight of the keywords is $2Q_2$, and the information weight of the referenced fragment is $1.2Q_2$; wherein $Q_2$ is the reference weight of referenced-node literature and satisfies the following formula:
$$Q_2 = B \times \left(1 - \frac{\text{time difference}}{10}\right) \times \left(1 - \frac{\text{referenced level}}{5}\right)$$
wherein $B$ is the weight base of referenced-node literature, and the time difference is the difference between the publication times of the citing document and the cited document;
for a node whose category is co-author node, the information weight of the title is $1.5Q_3$, the information weight of the abstract is $Q_3$, and the information weight of the keywords is $2Q_3$; wherein $Q_3$ is the reference weight of co-author literature and satisfies the following formula:
$$Q_3 = C \times \left(1 - \frac{\text{time difference}}{10}\right) \times \left(1 - \frac{\text{co-author level}}{5}\right)$$
wherein $C$ is the weight base of co-author literature, and the time difference is the difference between the publication times of the two documents linked through a co-author.
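The weighting of step 2-2) can be sketched as a pair of small functions, reading the three formulas as Q = base × (1 − time difference/10) × (1 − level/5) with the field multipliers 1.5 (title), 1 (abstract), 2 (keywords) and 1.2 (fragment). The function and dictionary names are hypothetical, and the base values A, B and C are passed in by the caller because the text does not give them.

```python
# Field multipliers applied to the reference weight Q of a node.
FIELD_FACTORS = {"title": 1.5, "abstract": 1.0, "keywords": 2.0, "fragment": 1.2}


def reference_weight(base, time_difference, level):
    """Q = base * (1 - time_difference / 10) * (1 - level / 5)."""
    return base * (1 - time_difference / 10) * (1 - level / 5)


def field_weights(category, base, time_difference, level):
    """Return the weight of each key-information field for one node.

    `category` is one of "reference", "referenced" or "co-author";
    co-author nodes have no fragment field.
    """
    q = reference_weight(base, time_difference, level)
    fields = ["title", "abstract", "keywords"]
    if category != "co-author":
        fields.append("fragment")
    return {name: FIELD_FACTORS[name] * q for name in fields}
```

As written, the formula reaches zero when the time difference is 10 or the level is 5 and turns negative beyond that. The text does not say how such cases are handled, so this sketch leaves the value unclamped.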
As an improvement of the above method, the step 3) specifically includes:
extracting the document's own information from the scientific and technical document to be processed, to obtain its time, title, abstract, body text, references and authors;
constructing the scientific and technical literature graph network according to the co-author and citation relations;
organizing the scientific and technical literature graph network data based on the scientific and technical literature graph network data organization model;
fusing the information of the document to be processed with the scientific and technical literature graph network information to extract the keywords.
Compared with the prior art, the invention has the advantages that:
1. the invention constructs a complex scientific and technical literature graph network based on co-authors and citation relations, and provides an automatic extraction method based on the scientific and technical literature graph network, so that the automatic extraction effect of keywords of scientific and technical literature is further improved;
2. the invention provides a scientific and technical literature graph network construction method and a scientific and technical literature graph network data organization model, which can make full use of scientific and technical literature graph network information and solve the problem of how graph network data is utilized.
Drawings
FIG. 1 is a flow chart of the automatic keyword extraction method based on a scientific and technical literature graph network according to the present invention;
FIG. 2 shows the node categories of the scientific and technical literature graph network and the related node key information;
FIG. 3 shows the node categories and weight settings of the scientific and technical literature graph network according to the present invention;
FIG. 4 is an example of keyword extraction using the method of the present invention.
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Example 1
As shown in FIG. 1, Embodiment 1 of the present invention provides an automatic keyword extraction method based on a scientific and technical literature graph network, which specifically includes the following:
(1) Scientific and technical literature graph network construction method
For a collection of papers, each paper is treated as a node. The connections between nodes are constructed from citation relations and co-author information. For example, if document A cites document B, node A generates an edge pointing to node B. Through the author information of document A, a document C published by one of its co-authors can be associated, and an edge from node C to node A can then be generated. Through literature citation relations and co-author information, a scientific and technical literature graph network can be constructed conveniently.
(2) Scientific and technical literature graph network data organization model
A data organization model is constructed for the scientific and technical literature graph network data. The nodes of the graph network are divided into three categories: nodes of citing literature (reference nodes), nodes of cited literature (referenced nodes), and nodes of literature published by a co-author (co-author nodes). The specific categories and the related node key information are shown in FIG. 2.
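The level stored with each node is described as a distance in the graph. One plausible reading, assumed here and not confirmed by the text, is the hop count from the document being processed, which a breadth-first search can compute. The function names and the neutral group keys (`cited_by_target`, `citing_target`, `co_author`) are hypothetical; the edge dictionaries are assumed to map a document id to a set of neighbor ids.

```python
from collections import deque


def hop_levels(start, neighbors):
    """Breadth-first search returning {node: hop count from start}."""
    levels = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in levels:
                levels[nxt] = levels[node] + 1
                queue.append(nxt)
    del levels[start]  # the start document is not its own neighbor
    return levels


def invert(edges):
    """Reverse a directed edge map: dst -> set of srcs."""
    reverse = {}
    for src, dsts in edges.items():
        for dst in dsts:
            reverse.setdefault(dst, set()).add(src)
    return reverse


def categorize(target, cites, coauthor):
    """Group the neighbors of `target` by relation, with their levels."""
    return {
        "cited_by_target": hop_levels(target, cites),        # follow citations outward
        "citing_target": hop_levels(target, invert(cites)),  # follow citations backward
        "co_author": hop_levels(target, coauthor),           # follow shared-author links
    }
```

A document can appear in more than one group, for example a co-author's paper that the target also cites. How such overlaps should be resolved is not stated, so the sketch simply reports each group independently.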
(3) Keyword extraction method based on the scientific and technical literature graph network
For a given scientific and technical document, the data organization model makes it possible to obtain the relevant information of its citing literature, cited literature and co-author literature. During keyword extraction, this related information serves as supplementary information for the document. The supplementary information is given different weights according to the time difference, the level relation and the type of node key information. The smaller the difference in publication time between two documents, the higher their correlation and the higher the weight. The more levels that separate two documents, the lower their correlation and the lower the weight. Title information is more condensed than abstract information and is therefore given a higher weight than the abstract. The weight settings are shown in FIG. 3, specifically:
for a node whose category is reference node, the information weight of the title is $1.5Q_1$, the information weight of the abstract is $Q_1$, the information weight of the keywords is $2Q_1$, and the information weight of the reference fragment is $1.2Q_1$; wherein $Q_1$ is the reference weight of reference-node literature and satisfies the following formula:
$$Q_1 = A \times \left(1 - \frac{\text{time difference}}{10}\right) \times \left(1 - \frac{\text{reference level}}{5}\right)$$
wherein $A$ is the weight base of reference-node literature, and the time difference is the difference between the publication times of the citing document and the cited document;
for a node whose category is referenced node, the information weight of the title is $1.5Q_2$, the information weight of the abstract is $Q_2$, the information weight of the keywords is $2Q_2$, and the information weight of the referenced fragment is $1.2Q_2$; wherein $Q_2$ is the reference weight of referenced-node literature and satisfies the following formula:
$$Q_2 = B \times \left(1 - \frac{\text{time difference}}{10}\right) \times \left(1 - \frac{\text{referenced level}}{5}\right)$$
wherein $B$ is the weight base of referenced-node literature, and the time difference is the difference between the publication times of the citing document and the cited document;
for a node whose category is co-author node, the information weight of the title is $1.5Q_3$, the information weight of the abstract is $Q_3$, and the information weight of the keywords is $2Q_3$; wherein $Q_3$ is the reference weight of co-author literature and satisfies the following formula:
$$Q_3 = C \times \left(1 - \frac{\text{time difference}}{10}\right) \times \left(1 - \frac{\text{co-author level}}{5}\right)$$
wherein $C$ is the weight base of co-author literature, and the time difference is the difference between the publication times of the two documents linked through a co-author.
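As a worked illustration of the reference-node weights, take the purely hypothetical values A = 1, a time difference of 2 and a reference level of 1. None of these values appear in the source; they are chosen only to make the arithmetic easy to follow.

```latex
\begin{aligned}
Q_1 &= 1 \times \left(1 - \tfrac{2}{10}\right) \times \left(1 - \tfrac{1}{5}\right) = 0.8 \times 0.8 = 0.64 \\
\text{title weight} &= 1.5\,Q_1 = 0.96 \\
\text{abstract weight} &= Q_1 = 0.64 \\
\text{keyword weight} &= 2\,Q_1 = 1.28 \\
\text{reference fragment weight} &= 1.2\,Q_1 = 0.768
\end{aligned}
```

With these inputs, a keyword listed by the neighboring document counts twice as much as a term from its abstract.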
Finally, the document's own information is combined with the scientific and technical literature graph network information acquired through the graph network data organization model, and the keywords are then extracted.
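The text does not specify how the weighted supplementary information enters the extractor. As one possible illustration, the sketch below substitutes a plain weighted term-frequency ranking: terms from the document itself count 1.0 per occurrence (an assumed weight), and terms from each neighbor field count by that field's information weight. The function names, the tokenizer, and the absence of stop-word filtering or phrase detection are all simplifications introduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_terms(own_text, supplements, top_k=5):
    """Rank candidate terms by weighted frequency.

    own_text: text of the document being processed (weight 1.0, assumed).
    supplements: list of (text, weight) pairs, one per neighbor field,
    where weight is that field's information weight.
    """
    scores = Counter()
    for term in tokenize(own_text):
        scores[term] += 1.0
    for text, weight in supplements:
        for term in tokenize(text):
            scores[term] += weight
    return [term for term, _ in scores.most_common(top_k)]


# Hypothetical inputs: a neighbor title weighted 0.96 and its keywords weighted 1.28.
ranked = score_terms(
    "graph keyword extraction",
    [("graph network", 0.96), ("keyword graph", 1.28)],
    top_k=2,
)
```

Any existing keyword extractor could take the place of the frequency ranking; the point of the sketch is only that neighbor text contributes in proportion to its weight.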
Simulation example:
This document provides an automatic keyword extraction method based on a scientific and technical literature graph network. First, a scientific and technical literature graph network is built from citation relations and co-author information. CiteSeer is a digital library of academic literature built by the NEC Research Institute on the basis of an automatic citation indexing mechanism. Searching CiteSeer with the title of a scientific document returns its abstract, title, and citation fragments (Citation Context). From the CiteSeer search results, the author information, citation fragments, publication time, abstract, title, citing-paper information and so on can be obtained conveniently, as shown in FIG. 4.
Then, the graph network data is organized based on the scientific and technical literature graph network data organization model. Finally, the keywords are extracted based on the scientific and technical literature graph network keyword extraction framework.
To verify the keyword extraction effect, 500 scientific and technical documents were selected from CNKI as experimental data. First, the keywords of the documents were extracted with an existing keyword extractor, and the precision (P), recall (R) and F1 values were calculated; then the keywords were extracted with the method proposed here, and P, R and F1 were calculated again; finally, the two sets of results were compared. P, R and F1 are calculated as follows:
$$P = \frac{\text{number of correct automatically extracted results}}{\text{total number of automatically extracted results}}$$
$$R = \frac{\text{number of correct automatically extracted results}}{\text{number of keywords of the document itself}}$$
$$F1 = \frac{2PR}{P + R}$$
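The three measures can be computed for a single document as in the sketch below. The function name is hypothetical, exact string matching between extracted and author-assigned keywords is assumed, and returning 0.0 for empty inputs is a convention chosen here rather than something the text specifies.

```python
def precision_recall_f1(extracted, gold):
    """Compute P, R and F1 for one document.

    extracted: keywords produced by the extractor.
    gold: keywords of the document itself (author-assigned).
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, with four extracted keywords of which two match a gold list of three, P is 2/4, R is 2/3 and F1 is 4/7.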
Experimental results show that the method provided by the invention improves the keyword extraction effect by 5-15% over the existing keyword extractor.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (6)
1. An automatic keyword extraction method based on a scientific and technical literature graph network, comprising the following steps:
step 1) establishing a scientific and technical literature graph network for a given collection of scientific and technical literature according to literature citation relations and co-author information;
step 2) establishing a data organization model based on the scientific and technical literature graph network;
step 3) extracting the document's own information from the scientific and technical document to be processed, and combining it with the scientific and technical literature graph network information acquired based on the data organization model to complete keyword extraction; the document's own information comprises publication time, title, abstract, body text, references and authors.
2. The method for automatically extracting keywords based on scientific and technical literature graph network according to claim 1, wherein the step 1) specifically comprises:
step 1-1) setting each document in the given collection of scientific and technical literature as a node;
step 1-2) traversing the nodes, repeating step 1-3) and step 1-4) for each node, and proceeding to step 1-5) once every node has been traversed;
step 1-3) according to the citation relations of the document, establishing an edge that points from the node of the citing document to the node of the cited document; setting the category of the citing document's node to reference node, and the category of the cited document's node to referenced node;
step 1-4) according to the co-author information, establishing an edge between the nodes of documents that share a co-author, and setting the category of those nodes to co-author node;
step 1-5) obtaining the scientific and technical literature graph network.
3. The method for automatically extracting keywords based on scientific and technical literature graph network as claimed in claim 2, wherein the step 2) specifically comprises:
step 2-1) setting the node key information according to the category of each node;
step 2-2) calculating the weights of the node key information according to the category of the node and the node key information.
4. The method for automatically extracting keywords based on scientific and technical literature graph network according to claim 3, wherein the step 2-1) specifically comprises:
for a node whose category is reference node, the node key information comprises: title, abstract, time, keywords, reference level, and reference fragment; wherein the time is the publication time of the document, and the reference level is the citation distance between documents;
for a node whose category is referenced node, the node key information comprises: title, abstract, time, keywords, referenced level, and referenced fragment; wherein the referenced level is the distance at which the document is cited;
for a node whose category is co-author node, the node key information comprises: title, abstract, time, keywords, and co-author level; wherein the co-author level is the co-author association distance between documents.
5. The method for automatically extracting keywords based on scientific and technical literature graph network as claimed in claim 4, wherein the step 2-2) specifically comprises:
for a node whose category is reference node, the information weight of the title is $1.5Q_1$, the information weight of the abstract is $Q_1$, the information weight of the keywords is $2Q_1$, and the information weight of the reference fragment is $1.2Q_1$; wherein $Q_1$ is the reference weight of reference-node literature and satisfies the following formula:
$$Q_1 = A \times \left(1 - \frac{\text{time difference}}{10}\right) \times \left(1 - \frac{\text{reference level}}{5}\right)$$
wherein $A$ is the weight base of reference-node literature, and the time difference is the difference between the publication times of the citing document and the cited document;
for a node whose category is referenced node, the information weight of the title is $1.5Q_2$, the information weight of the abstract is $Q_2$, the information weight of the keywords is $2Q_2$, and the information weight of the referenced fragment is $1.2Q_2$; wherein $Q_2$ is the reference weight of referenced-node literature and satisfies the following formula:
$$Q_2 = B \times \left(1 - \frac{\text{time difference}}{10}\right) \times \left(1 - \frac{\text{referenced level}}{5}\right)$$
wherein $B$ is the weight base of referenced-node literature, and the time difference is the difference between the publication times of the citing document and the cited document;
for a node whose category is co-author node, the information weight of the title is $1.5Q_3$, the information weight of the abstract is $Q_3$, and the information weight of the keywords is $2Q_3$; wherein $Q_3$ is the reference weight of co-author literature and satisfies the following formula:
$$Q_3 = C \times \left(1 - \frac{\text{time difference}}{10}\right) \times \left(1 - \frac{\text{co-author level}}{5}\right)$$
wherein $C$ is the weight base of co-author literature, and the time difference is the difference between the publication times of the two documents linked through a co-author.
6. The method for automatically extracting keywords based on scientific and technical literature graph network according to claim 5, wherein the step 3) specifically comprises:
extracting the document's own information from the scientific and technical document to be processed, to obtain its time, title, abstract, body text, references and authors;
constructing the scientific and technical literature graph network according to the co-author and citation relations;
organizing the scientific and technical literature graph network data based on the scientific and technical literature graph network data organization model;
and fusing the information of the document to be processed with the scientific and technical literature graph network information to extract the keywords.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210227126.7A CN114638225A (en) | 2022-03-08 | 2022-03-08 | Automatic keyword extraction method based on scientific and technological literature graph network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210227126.7A CN114638225A (en) | 2022-03-08 | 2022-03-08 | Automatic keyword extraction method based on scientific and technological literature graph network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114638225A (en) | 2022-06-17 |
Family
ID=81948514
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210227126.7A (Pending), published as CN114638225A (en) | Automatic keyword extraction method based on scientific and technological literature graph network | 2022-03-08 | 2022-03-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114638225A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116644338A (en) * | 2023-06-01 | 2023-08-25 | 北京智谱华章科技有限公司 | Literature topic classification method, device, equipment and medium based on mixed similarity |
CN116644338B (en) * | 2023-06-01 | 2024-01-30 | 北京智谱华章科技有限公司 | Literature topic classification method, device, equipment and medium based on mixed similarity |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11475319B2 (en) | Extracting facts from unstructured information | |
CN108829858B (en) | Data query method and device and computer readable storage medium | |
Beliga et al. | An overview of graph-based keyword extraction methods and approaches | |
CN106649260B (en) | Product characteristic structure tree construction method based on comment text mining | |
CN102253930B (en) | A kind of method of text translation and device | |
Chen et al. | Websrc: A dataset for web-based structural reading comprehension | |
CN103544255A (en) | Text semantic relativity based network public opinion information analysis method | |
CN102622453A (en) | Body-based food security event semantic retrieval system | |
CN105550189A (en) | Ontology-based intelligent retrieval system for information security event | |
CN109255012B (en) | Method and device for machine reading understanding and candidate data set size reduction | |
US10678820B2 (en) | System and method for computerized semantic indexing and searching | |
CN107463548A (en) | Short phrase picking method and device | |
CN109271524A (en) | Entity link method in knowledge base question answering system | |
CN111831794A (en) | Knowledge map-based construction method for knowledge question-answering system in comprehensive pipe gallery industry | |
CN105373546A (en) | Information processing method and system for knowledge services | |
CN110807326A (en) | Short text keyword extraction method combining GPU-DMM and text features | |
CN115757689A (en) | Information query system, method and equipment | |
Menezes et al. | Building a massive corpus for named entity recognition using free open data sources | |
CN114638225A (en) | Automatic keyword extraction method based on scientific and technological literature graph network | |
WO2022121146A1 (en) | Method and apparatus for determining importance of code segment | |
Barbosa et al. | An approach to clustering and sequencing of textual requirements | |
US11861321B1 (en) | Systems and methods for structure discovery and structure-based analysis in natural language processing models | |
CN113536772A (en) | Text processing method, device, equipment and storage medium | |
CN111753540B (en) | Method and system for collecting text data to perform Natural Language Processing (NLP) | |
Zhang et al. | An improved ontology-based web information extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||