CN114638225A - Automatic keyword extraction method based on scientific and technological literature graph network - Google Patents


Info

Publication number
CN114638225A
Authority
CN
China
Prior art keywords
scientific
node
literature
information
technical literature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210227126.7A
Other languages
Chinese (zh)
Inventor
宋宇
罗准辰
武帅
罗威
谭玉珊
胡明昊
田昌海
毛彬
叶宇铭
赵晋巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Military Science Information Research Center Of Military Academy Of Chinese Pla
Original Assignee
Military Science Information Research Center Of Military Academy Of Chinese Pla
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Military Science Information Research Center Of Military Academy Of Chinese Pla

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic keyword extraction method based on a scientific and technical literature graph network, comprising the following steps: establishing a scientific and technical literature graph network for a given set of scientific and technical literature according to literature citation relations and co-author information; establishing a data organization model based on the scientific and technical literature graph network; and extracting the own information of the scientific and technical document to be processed and combining it with the graph network information obtained through the data organization model to complete keyword extraction, where the own information comprises publication time, title, abstract, body text, references and authors. The invention constructs a scientific and technical literature graph network based on co-author and citation relations and provides an automatic extraction method built on that network, which further improves automatic keyword extraction for scientific and technical literature. It also provides a graph network construction method and a graph network data organization model, so that the graph network information can be fully used, addressing the problem of how to make use of graph network data.

Description

Automatic keyword extraction method based on scientific and technological literature graph network
Technical Field
The invention relates to the technical field of computer application, natural language processing and automatic keyword extraction, in particular to a method for automatically extracting keywords based on a scientific and technical literature graph network.
Background
Traditional keyword extraction methods for scientific and technical literature rely only on the information contained in the document itself; they ignore the network relations among documents and fail to apply the semantic information of the scientific and technical literature graph network to keyword extraction. A scientific and technical document generally has several authors, and those authors also publish other documents. A document is related to the documents it cites through citation relations, and it can itself be cited by other documents. Through co-author and citation relations, scientific and technical literature thus forms a complex graph network.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an automatic keyword extraction method based on a scientific and technical literature graph network.
In order to achieve the above object, the present invention provides an automatic keyword extraction method based on a scientific and technical literature graph network, the method comprising:
step 1) establishing a scientific and technical literature graph network for a given set of scientific and technical literature according to literature citation relations and co-author information;
step 2) establishing a data organization model based on the scientific and technical literature graph network;
step 3) extracting the own information of the scientific and technical document to be processed, and combining it with the graph network information obtained through the data organization model to complete keyword extraction; the own information comprises publication time, title, abstract, body text, references and authors.
As an improvement of the above method, the step 1) specifically includes:
step 1-1) setting each document in the given set of scientific and technical literature as a node;
step 1-2) traversing the nodes, repeating step 1-3) and step 1-4) for each node, and proceeding to step 1-5) once every node has been traversed;
step 1-3) according to the citation relations of the document, establishing an edge pointing from the node of the citing document to the node of the cited document; setting the category of the node of the citing document as a reference node, and the category of the node of the cited document as a referenced node;
step 1-4) according to the co-author information, establishing an edge between the nodes of papers that share an author, and setting the category of such a node as a co-author node;
step 1-5) obtaining the scientific and technical literature graph network.
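Steps 1-1) to 1-5) can be illustrated with a minimal Python sketch. The input schema (a dictionary with `authors` and `cites` lists), the function name `build_literature_graph`, and the choice to store co-author edges as undirected pairs are assumptions made for illustration; the source does not prescribe a data structure.

```python
from collections import defaultdict
from itertools import combinations


def build_literature_graph(papers):
    """Build citation and co-author edges for a set of papers.

    papers: dict mapping paper id -> {"authors": [...], "cites": [...]}
            (hypothetical schema, not taken from the source).
    """
    citation_edges = set()   # directed pairs: (citing id, cited id)
    coauthor_edges = set()   # undirected pairs, stored in sorted order
    papers_by_author = defaultdict(set)

    # Steps 1-2) and 1-3): traverse every node and add citation edges.
    for pid, info in papers.items():
        for cited in info.get("cites", []):
            if cited in papers and cited != pid:
                citation_edges.add((pid, cited))
        for author in info.get("authors", []):
            papers_by_author[author].add(pid)

    # Step 1-4): link every pair of papers that share at least one author.
    for pids in papers_by_author.values():
        for a, b in combinations(sorted(pids), 2):
            coauthor_edges.add((a, b))

    # Step 1-5): the two edge sets together form the graph network.
    return citation_edges, coauthor_edges


papers = {
    "A": {"authors": ["Li", "Wang"], "cites": ["B"]},
    "B": {"authors": ["Zhao"], "cites": []},
    "C": {"authors": ["Wang"], "cites": []},
}
citation_edges, coauthor_edges = build_literature_graph(papers)
```

In this toy input, paper A cites paper B, and papers A and C share the author "Wang", so one citation edge and one co-author edge are produced.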
As a modification of the above method, the step 2) specifically includes:
step 2-1) setting the node key information according to the category of each node;
step 2-2) calculating the weight of the node key information according to the category of the node and the node key information.
As an improvement of the above method, the step 2-1) specifically comprises:
for a node whose category is reference node, the node key information comprises: title, abstract, time, keywords, reference level, and reference fragment; wherein the time represents the publication time of the document, and the reference level represents the citation distance between documents;
for a node whose category is referenced node, the node key information comprises: title, abstract, time, keywords, referenced level, and referenced fragment; the referenced level represents the distance over which the document is cited;
for a node whose category is co-author node, the node key information comprises: title, abstract, time, keywords, and co-author level; the co-author level represents the co-author association distance between documents.
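The level fields above are described only as distances between documents. One plausible reading, assumed here and not confirmed by the source, is the number of edges on the shortest path from the document being processed. Under that assumption a breadth-first search yields the level of every reachable node; the function name `hierarchy_levels` and the adjacency-dictionary input are hypothetical.

```python
from collections import deque


def hierarchy_levels(start, neighbours, max_level=None):
    """Return {node: hop count from start} using breadth-first search.

    neighbours: dict mapping node id -> iterable of adjacent node ids.
    max_level:  optional cut-off; nodes further away are not visited.
    """
    levels = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if max_level is not None and levels[node] >= max_level:
            continue  # do not expand beyond the cut-off
        for nxt in neighbours.get(node, ()):
            if nxt not in levels:
                levels[nxt] = levels[node] + 1
                queue.append(nxt)
    del levels[start]  # the start document is not its own neighbour
    return levels


# A cites B, and B cites D: B is at level 1 and D at level 2 from A.
cites = {"A": ["B"], "B": ["D"], "D": []}
levels = hierarchy_levels("A", cites)
```

Running the same search over the citation edges in one direction, in the reverse direction, or over the co-author edges would give the three level fields, one per node category.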
As an improvement of the above method, the step 2-2) specifically includes:
for a node whose category is reference node, the information weight of the title is $1.5Q_1$, the information weight of the abstract is $Q_1$, the information weight of the keywords is $2Q_1$, and the information weight of the reference fragment is $1.2Q_1$; wherein $Q_1$ is the reference weight of the citing scientific and technical literature and satisfies the following formula:
$Q_1 = A \times (1 - \text{time difference}/10) \times (1 - \text{reference level}/5)$
wherein $A$ represents the weight base of the citing scientific and technical literature, and the time difference represents the difference between the publication times of the citing and the cited scientific and technical literature;
for a node whose category is referenced node, the information weight of the title is $1.5Q_2$, the information weight of the abstract is $Q_2$, the information weight of the keywords is $2Q_2$, and the information weight of the referenced fragment is $1.2Q_2$; wherein $Q_2$ is the reference weight of the cited scientific and technical literature and satisfies the following formula:
$Q_2 = B \times (1 - \text{time difference}/10) \times (1 - \text{referenced level}/5)$
wherein $B$ represents the weight base of the cited scientific and technical literature, and the time difference represents the difference between the publication times of the citing and the cited scientific and technical literature;
for a node whose category is co-author node, the information weight of the title is $1.5Q_3$, the information weight of the abstract is $Q_3$, and the information weight of the keywords is $2Q_3$; wherein $Q_3$ is the reference weight of the co-author scientific and technical literature and satisfies the following formula:
$Q_3 = C \times (1 - \text{time difference}/10) \times (1 - \text{co-author level}/5)$
wherein $C$ represents the weight base of the co-author scientific and technical literature, and the time difference represents the difference between the publication times of the co-author scientific and technical literature.
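The three formulas share one shape and differ only in the weight base and in which level is used. The sketch below expresses that in Python. The category labels (`"reference"`, `"referenced"`, `"co_author"`), the field name `"fragment"`, and both function names are hypothetical. The source also does not state the unit of the time difference or what happens for large inputs, so the sketch applies the formula unchanged.

```python
# Multipliers applied to the reference weight Q for each kind of key information.
FIELD_MULTIPLIERS = {
    "title": 1.5,
    "abstract": 1.0,
    "keywords": 2.0,
    "fragment": 1.2,  # reference / referenced fragment; not used for co-author nodes
}


def reference_weight(base, time_difference, level):
    """Q = base * (1 - time_difference / 10) * (1 - level / 5).

    Note: the value reaches zero when time_difference is 10 or level is 5.
    """
    return base * (1 - time_difference / 10) * (1 - level / 5)


def field_weights(category, base, time_difference, level):
    """Return the information weight of each key-information field of a node."""
    q = reference_weight(base, time_difference, level)
    fields = ["title", "abstract", "keywords"]
    if category in ("reference", "referenced"):
        fields.append("fragment")
    return {name: FIELD_MULTIPLIERS[name] * q for name in fields}
```

Calling `field_weights("co_author", C, time_difference, level)` returns only the title, abstract and keyword weights, matching the list of key information for co-author nodes.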
As an improvement of the above method, the step 3) specifically includes:
extracting the own information of the scientific and technical document to be processed to obtain its time, title, abstract, body text, references and authors;
constructing the scientific and technical literature graph network according to the co-author and citation relations;
organizing the graph network data based on the scientific and technical literature citation network data organization model;
fusing the information of the document to be processed with the graph network information to extract keywords.
Compared with the prior art, the invention has the advantages that:
1. the invention constructs a complex scientific and technical literature graph network based on co-authors and citation relations, and provides an automatic extraction method based on the scientific and technical literature graph network, so that the automatic extraction effect of keywords of scientific and technical literature is further improved;
2. the invention provides a scientific and technical literature graph network construction method and a scientific and technical literature graph network data organization model, which can make full use of scientific and technical literature graph network information and solve the problem of how graph network data is utilized.
Drawings
FIG. 1 is a flow chart of the automatic keyword extraction method based on a scientific and technical literature graph network according to the present invention;
FIG. 2 is a diagram of the node types of the scientific and technical literature and the related node key information;
FIG. 3 is a diagram illustrating the node types and weight settings of the scientific and technical literature according to the present invention;
FIG. 4 is an example of keyword extraction using the method of the present invention.
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Example 1
As shown in FIG. 1, Embodiment 1 of the present invention provides an automatic keyword extraction method based on a scientific and technical literature graph network, which specifically includes the following parts:
(1) Scientific and technical literature graph network construction method
For a collection of papers, each paper is treated as a node. The connections between nodes are constructed from citation relations and co-author information. For example, if document A cites document B, node A generates an edge pointing to node B. Through the author information of document A, a document C published by one of its co-authors can be associated, and an edge from node C to node A can then be generated. With the citation relations and the co-author information, the scientific and technical literature graph network can be constructed conveniently.
(2) Scientific and technical literature graph network data organization model
A data organization model is constructed for the graph network data. The nodes of the graph network fall into three categories: nodes of citing literature (reference nodes), nodes of cited literature (referenced nodes), and nodes of literature published by a co-author (co-author nodes). The specific categories and the related node key information are shown in FIG. 2.
(3) Keyword extraction method based on the scientific and technical literature graph network
For a given scientific and technical document, the data organization model provides the relevant information of its citing literature, cited literature and co-author literature. During keyword extraction, this information serves as supplementary information for the document. The supplementary information is weighted according to the time difference between documents, their level relation, and the type of node key information. The smaller the difference in publication time between two documents, the higher their correlation and the higher the weight. The more levels separate two documents, the weaker their correlation and the lower the weight. Title information is more condensed than abstract information and is therefore given a higher weight than the abstract. The weight settings are shown in FIG. 3 and are specified as follows:
for a node whose category is reference node, the information weight of the title is $1.5Q_1$, the information weight of the abstract is $Q_1$, the information weight of the keywords is $2Q_1$, and the information weight of the reference fragment is $1.2Q_1$; wherein $Q_1$ is the reference weight of the citing scientific and technical literature and satisfies the following formula:
$Q_1 = A \times (1 - \text{time difference}/10) \times (1 - \text{reference level}/5)$
wherein $A$ represents the weight base of the citing scientific and technical literature, and the time difference represents the difference between the publication times of the citing and the cited scientific and technical literature;
for a node whose category is referenced node, the information weight of the title is $1.5Q_2$, the information weight of the abstract is $Q_2$, the information weight of the keywords is $2Q_2$, and the information weight of the referenced fragment is $1.2Q_2$; wherein $Q_2$ is the reference weight of the cited scientific and technical literature and satisfies the following formula:
$Q_2 = B \times (1 - \text{time difference}/10) \times (1 - \text{referenced level}/5)$
wherein $B$ represents the weight base of the cited scientific and technical literature, and the time difference represents the difference between the publication times of the citing and the cited scientific and technical literature;
for a node whose category is co-author node, the information weight of the title is $1.5Q_3$, the information weight of the abstract is $Q_3$, and the information weight of the keywords is $2Q_3$; wherein $Q_3$ is the reference weight of the co-author scientific and technical literature and satisfies the following formula:
$Q_3 = C \times (1 - \text{time difference}/10) \times (1 - \text{co-author level}/5)$
wherein $C$ represents the weight base of the co-author scientific and technical literature, and the time difference represents the difference between the publication times of the co-author scientific and technical literature.
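As a worked illustration of how the weight falls off, take a reference node with an assumed weight base $A = 1$. The values of the time difference and level below are chosen only for illustration and do not come from the source.

```latex
% Case 1: time difference = 2, reference level = 1
Q_1 = 1 \times \left(1 - \tfrac{2}{10}\right) \times \left(1 - \tfrac{1}{5}\right)
    = 0.8 \times 0.8 = 0.64
% resulting information weights
\text{title: } 1.5\,Q_1 = 0.96, \quad
\text{abstract: } Q_1 = 0.64, \quad
\text{keywords: } 2\,Q_1 = 1.28, \quad
\text{reference fragment: } 1.2\,Q_1 = 0.768

% Case 2: time difference = 5, reference level = 2
Q_1 = 1 \times \left(1 - \tfrac{5}{10}\right) \times \left(1 - \tfrac{2}{5}\right)
    = 0.5 \times 0.6 = 0.30
```

The older and more distant document in the second case contributes less than half the weight of the first, in line with the stated rule that larger time differences and more levels lower the weight.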
Finally, the information of the document itself is combined with the graph network information obtained through the graph network data organization model, and keywords are then extracted.
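The source does not specify how the document's own information and the weighted supplementary information are fused. The sketch below shows one simple possibility, weighted term counting, purely as an assumption: each token of the document itself counts with unit weight, and each token of a neighbouring node's field counts with that field's information weight. The function names, the single-word tokenizer, and the unit weight are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_terms(own_fields, supplementary):
    """Accumulate a score per term.

    own_fields:    list of strings from the document itself (assumed weight 1.0).
    supplementary: list of (text, information_weight) pairs from neighbour nodes.
    """
    scores = Counter()
    for text in own_fields:
        for token in tokenize(text):
            scores[token] += 1.0
    for text, weight in supplementary:
        for token in tokenize(text):
            scores[token] += weight
    return scores


def top_keywords(scores, k=5):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


own = ["graph keyword extraction"]
supplementary = [("keyword graph", 1.28), ("graph network", 0.64)]
scores = score_terms(own, supplementary)
best = top_keywords(scores, 2)
```

A practical extractor would work on candidate phrases and filter stop words; single tokens are used here only to keep the sketch short.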
Simulation example:
This document provides an automatic keyword extraction method based on a scientific and technical literature graph network. First, a scientific and technical literature graph network is built from citation relations and co-author information. CiteSeer is a digital library of academic literature built by the NEC Research Institute on the basis of an automatic citation indexing mechanism. Searching CiteSeer with the title of a scientific and technical document returns its abstract, title and citation fragments (Citation Context). From the CiteSeer search results, the author information, citation fragments, publication time, abstract information, title information, paper citation information and the like can be conveniently obtained, as shown in FIG. 4.
Then, the graph network data is organized based on the scientific and technical literature citation network data organization model. Finally, keywords are extracted based on the scientific and technical literature graph network keyword extraction framework.
To verify the keyword extraction effect, 500 scientific and technical documents were selected from CNKI (知网) as experimental data. First, the keywords of these documents were extracted with an existing keyword extractor, and the precision (P), recall (R) and F1 values were calculated; the keywords were then extracted with the method proposed here, and P, R and F1 were calculated again; finally, the two sets of results were compared. P, R and F1 are calculated as follows:
$P = \dfrac{\text{number of correct automatically extracted results}}{\text{total number of automatically extracted results}}$
$R = \dfrac{\text{number of correct automatically extracted results}}{\text{number of keywords of the document itself}}$
$F_1 = \dfrac{2PR}{P + R}$.
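The three metrics can be computed for a single document as in the following sketch. Exact string matching between extracted and author-assigned keywords is an assumption, since the source does not state the matching criterion. The function name and the sample keyword lists are hypothetical.

```python
def precision_recall_f1(extracted, gold):
    """Compute P, R and F1 for one document.

    extracted: keywords produced by the extractor.
    gold:      keywords assigned to the document itself.
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)  # exact-match overlap (assumed criterion)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


extracted = ["graph network", "keyword extraction", "citation", "author"]
gold = ["keyword extraction", "graph network", "co-author"]
p, r, f1 = precision_recall_f1(extracted, gold)
```

Here two of the four extracted keywords are correct and two of the three gold keywords are found, so P = 1/2, R = 2/3 and F1 = 4/7.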
Experimental results show that the method provided by the invention improves the keyword extraction effect by 5-15% over the existing extractor.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (6)

1. An automatic keyword extraction method based on a scientific and technical literature graph network, comprising the following steps:
step 1) establishing a scientific and technical literature graph network for a given set of scientific and technical literature according to literature citation relations and co-author information;
step 2) establishing a data organization model based on the scientific and technical literature graph network;
step 3) extracting the own information of the scientific and technical document to be processed, and combining it with the graph network information obtained through the data organization model to complete keyword extraction; the own information comprises publication time, title, abstract, body text, references and authors.
2. The method for automatically extracting keywords based on scientific and technical literature graph network according to claim 1, wherein the step 1) specifically comprises:
step 1-1) setting each document in the given set of scientific and technical literature as a node;
step 1-2) traversing the nodes, repeating step 1-3) and step 1-4) for each node, and proceeding to step 1-5) once every node has been traversed;
step 1-3) according to the citation relations of the document, establishing an edge pointing from the node of the citing document to the node of the cited document; setting the category of the node of the citing document as a reference node, and the category of the node of the cited document as a referenced node;
step 1-4) according to the co-author information, establishing an edge between the nodes of papers that share an author, and setting the category of such a node as a co-author node;
step 1-5) obtaining the scientific and technical literature graph network.
3. The method for automatically extracting keywords based on scientific and technical literature graph network as claimed in claim 2, wherein the step 2) specifically comprises:
step 2-1) setting the node key information according to the category of each node;
step 2-2) calculating the weight of the node key information according to the category of the node and the node key information.
4. The method for automatically extracting keywords based on scientific and technical literature graph network according to claim 3, wherein the step 2-1) specifically comprises:
for a node whose category is reference node, the node key information comprises: title, abstract, time, keywords, reference level, and reference fragment; wherein the time represents the publication time of the document, and the reference level represents the citation distance between documents;
for a node whose category is referenced node, the node key information comprises: title, abstract, time, keywords, referenced level, and referenced fragment; the referenced level represents the distance over which the document is cited;
for a node whose category is co-author node, the node key information comprises: title, abstract, time, keywords, and co-author level; the co-author level represents the co-author association distance between documents.
5. The method for automatically extracting keywords based on scientific and technical literature graph network as claimed in claim 4, wherein the step 2-2) specifically comprises:
for a node whose category is reference node, the information weight of the title is $1.5Q_1$, the information weight of the abstract is $Q_1$, the information weight of the keywords is $2Q_1$, and the information weight of the reference fragment is $1.2Q_1$; wherein $Q_1$ is the reference weight of the citing scientific and technical literature and satisfies the following formula:
$Q_1 = A \times (1 - \text{time difference}/10) \times (1 - \text{reference level}/5)$
wherein $A$ represents the weight base of the citing scientific and technical literature, and the time difference represents the difference between the publication times of the citing and the cited scientific and technical literature;
for a node whose category is referenced node, the information weight of the title is $1.5Q_2$, the information weight of the abstract is $Q_2$, the information weight of the keywords is $2Q_2$, and the information weight of the referenced fragment is $1.2Q_2$; wherein $Q_2$ is the reference weight of the cited scientific and technical literature and satisfies the following formula:
$Q_2 = B \times (1 - \text{time difference}/10) \times (1 - \text{referenced level}/5)$
wherein $B$ represents the weight base of the cited scientific and technical literature, and the time difference represents the difference between the publication times of the citing and the cited scientific and technical literature;
for a node whose category is co-author node, the information weight of the title is $1.5Q_3$, the information weight of the abstract is $Q_3$, and the information weight of the keywords is $2Q_3$; wherein $Q_3$ is the reference weight of the co-author scientific and technical literature and satisfies the following formula:
$Q_3 = C \times (1 - \text{time difference}/10) \times (1 - \text{co-author level}/5)$
wherein $C$ represents the weight base of the co-author scientific and technical literature, and the time difference represents the difference between the publication times of the co-author scientific and technical literature.
6. The method for automatically extracting keywords based on scientific and technical literature graph network according to claim 5, wherein the step 3) specifically comprises:
extracting the own information of the scientific and technical document to be processed to obtain its time, title, abstract, body text, references and authors;
constructing the scientific and technical literature graph network according to the co-author and citation relations;
organizing the graph network data based on the scientific and technical literature citation network data organization model;
and fusing the information of the document to be processed with the graph network information to extract keywords.
CN202210227126.7A 2022-03-08 2022-03-08 Automatic keyword extraction method based on scientific and technological literature graph network Pending CN114638225A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210227126.7A CN114638225A (en) 2022-03-08 2022-03-08 Automatic keyword extraction method based on scientific and technological literature graph network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210227126.7A CN114638225A (en) 2022-03-08 2022-03-08 Automatic keyword extraction method based on scientific and technological literature graph network

Publications (1)

Publication Number Publication Date
CN114638225A true CN114638225A (en) 2022-06-17

Family

ID=81948514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210227126.7A Pending CN114638225A (en) 2022-03-08 2022-03-08 Automatic keyword extraction method based on scientific and technological literature graph network

Country Status (1)

Country Link
CN (1) CN114638225A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116644338A (en) * 2023-06-01 2023-08-25 北京智谱华章科技有限公司 Literature topic classification method, device, equipment and medium based on mixed similarity
CN116644338B (en) * 2023-06-01 2024-01-30 北京智谱华章科技有限公司 Literature topic classification method, device, equipment and medium based on mixed similarity

Similar Documents

Publication Publication Date Title
US11475319B2 (en) Extracting facts from unstructured information
CN108829858B (en) Data query method and device and computer readable storage medium
Beliga et al. An overview of graph-based keyword extraction methods and approaches
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN102253930B (en) A kind of method of text translation and device
Chen et al. Websrc: A dataset for web-based structural reading comprehension
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN102622453A (en) Body-based food security event semantic retrieval system
CN105550189A (en) Ontology-based intelligent retrieval system for information security event
CN109255012B (en) Method and device for machine reading understanding and candidate data set size reduction
US10678820B2 (en) System and method for computerized semantic indexing and searching
CN107463548A (en) Short phrase picking method and device
CN109271524A (en) Entity link method in knowledge base question answering system
CN111831794A (en) Knowledge map-based construction method for knowledge question-answering system in comprehensive pipe gallery industry
CN105373546A (en) Information processing method and system for knowledge services
CN110807326A (en) Short text keyword extraction method combining GPU-DMM and text features
CN115757689A (en) Information query system, method and equipment
Menezes et al. Building a massive corpus for named entity recognition using free open data sources
CN114638225A (en) Automatic keyword extraction method based on scientific and technological literature graph network
WO2022121146A1 (en) Method and apparatus for determining importance of code segment
Barbosa et al. An approach to clustering and sequencing of textual requirements
US11861321B1 (en) Systems and methods for structure discovery and structure-based analysis in natural language processing models
CN113536772A (en) Text processing method, device, equipment and storage medium
CN111753540B (en) Method and system for collecting text data to perform Natural Language Processing (NLP)
Zhang et al. An improved ontology-based web information extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination