CN114638225A

CN114638225A - Automatic keyword extraction method based on scientific and technological literature graph network

Info

Publication number: CN114638225A
Application number: CN202210227126.7A
Authority: CN
Inventors: 宋宇; 罗准辰; 武帅; 罗威; 谭玉珊; 胡明昊; 田昌海; 毛彬; 叶宇铭; 赵晋巍
Original assignee: Military Science Information Research Center Of Military Academy Of Chinese Pla
Current assignee: Military Science Information Research Center Of Military Academy Of Chinese Pla
Priority date: 2022-03-08
Filing date: 2022-03-08
Publication date: 2022-06-17

Abstract

The invention discloses an automatic keyword extraction method based on a scientific and technical literature graph network, comprising the following steps: establishing a scientific and technical literature graph network for a given set of scientific and technical literature according to literature citation relations and co-author information; establishing a data organization model based on the scientific and technical literature graph network; and extracting the self information of the scientific and technical literature to be processed, and combining it with the graph network information acquired through the data organization model to complete keyword extraction, wherein the self information comprises publication time, title, abstract, text, references and authors. By constructing a scientific and technical literature graph network from co-author and citation relations and providing an automatic extraction method based on that network, the invention further improves automatic keyword extraction for scientific and technical literature. The graph network construction method and the graph network data organization model also allow the graph network information to be fully utilized, which addresses the problem of how to use graph network data.

Description

Automatic keyword extraction method based on scientific and technological literature graph network

Technical Field

The invention relates to the technical field of computer application, natural language processing and automatic keyword extraction, in particular to a method for automatically extracting keywords based on a scientific and technical literature graph network.

Background

Traditional keyword extraction methods for scientific and technical literature rely only on the information of the document itself. They ignore the network relations among scientific and technical documents and fail to apply the semantic information of the scientific and technical literature graph network to keyword extraction. A scientific and technical document generally has several authors, and those authors also publish other documents. A document is related to the documents it cites through citation relations, and it can in turn be cited by other documents. Through co-authors and citations, scientific and technical literature therefore forms a complex graph network.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides an automatic keyword extraction method based on a scientific and technical literature graph network.

In order to achieve the above object, the present invention provides a method for automatically extracting keywords based on a scientific and technical literature graph network, wherein the method comprises:

step 1) establishing a scientific and technical literature graph network for a given set of scientific and technical literature according to literature citation relations and co-author information;

step 2) establishing a data organization model based on the scientific and technical literature graph network;

step 3) extracting the self information of the scientific and technical literature to be processed, and combining it with the graph network information acquired through the data organization model to complete keyword extraction; the self information comprises publication time, title, abstract, text, references and authors.

As an improvement of the above method, the step 1) specifically includes:

step 1-1) setting each document in the given set of scientific and technical literature as a node;

step 1-2) traversing each node, repeating step 1-3) and step 1-4), and proceeding to step 1-5) once every node has been traversed;

step 1-3) establishing, according to the citation relations of the documents, an edge pointing from the node of the citing document to the node of the cited document; setting the category of the node of the citing document as a reference node, and setting the category of the node of the cited document as a referenced node;

step 1-4) establishing, according to the co-author information, an edge between the nodes of documents that share a co-author, and setting the category of such a node as a co-author node;

step 1-5) obtaining the scientific and technical literature graph network.
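Steps 1-1) to 1-5) can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the invention: the function name `build_graph`, the input layout (a dictionary with `authors` and `cites` lists per document) and the choice to store co-author edges as undirected pairs are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(docs):
    """Build a literature graph from a hypothetical input layout.

    docs maps a document id to {"authors": [...], "cites": [...]}.
    Returns (nodes, citation_edges, coauthor_edges).
    """
    nodes = set(docs)                    # step 1-1: one node per document
    citation_edges = set()               # directed pairs (citing, cited)
    coauthor_edges = set()               # undirected pairs (assumption)
    docs_by_author = defaultdict(set)

    for doc_id, info in docs.items():    # step 1-2: traverse every node
        for cited in info.get("cites", []):
            if cited in nodes:           # step 1-3: citing -> cited edge
                citation_edges.add((doc_id, cited))
        for author in info.get("authors", []):
            docs_by_author[author].add(doc_id)

    for shared in docs_by_author.values():   # step 1-4: co-author edges
        for a, b in combinations(sorted(shared), 2):
            coauthor_edges.add(frozenset((a, b)))

    return nodes, citation_edges, coauthor_edges   # step 1-5


example_docs = {
    "A": {"authors": ["x", "y"], "cites": ["B"]},
    "B": {"authors": ["z"], "cites": []},
    "C": {"authors": ["y"], "cites": []},
}
nodes, citation_edges, coauthor_edges = build_graph(example_docs)
```

In this example, document A cites B and shares author `y` with C, so the graph contains one citation edge from A to B and one co-author edge between A and C.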

As a modification of the above method, the step 2) specifically includes:

step 2-1) setting node key information according to the category of each node;

step 2-2) calculating the weight of the node key information according to the category of the node and the node key information.

As an improvement of the above method, the step 2-1) specifically comprises:

for a node whose category is reference node, the node key information comprises: title, abstract, time, keywords, reference level, and reference fragment; wherein the time represents the publication time of the document, and the reference level represents the citation distance between scientific and technical documents;

for a node whose category is referenced node, the node key information comprises: title, abstract, time, keywords, referenced level, and referenced fragment; wherein the referenced level represents the distance at which the scientific and technical document is cited;

for a node whose category is co-author node, the node key information comprises: title, abstract, time, keywords, and co-author level; wherein the co-author level represents the co-author association distance between documents.
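The node key information listed above could be held in a single record type, sketched below. The class name `NodeKeyInfo`, the category strings, and the use of an integer year for the time field are assumptions made for illustration; the source only lists which fields each node category carries.

```python
from dataclasses import dataclass, field
from typing import List, Optional

CATEGORIES = {"reference", "referenced", "co_author"}  # hypothetical labels


@dataclass
class NodeKeyInfo:
    category: str                     # one of CATEGORIES
    title: str
    abstract: str
    time: int                         # publication time, assumed to be a year
    keywords: List[str] = field(default_factory=list)
    level: int = 1                    # reference / referenced / co-author level
    fragment: Optional[str] = None    # citation fragment, if any

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown node category: {self.category}")
        # Co-author nodes carry no citation fragment in the listing above.
        if self.category == "co_author" and self.fragment is not None:
            raise ValueError("co-author nodes have no citation fragment")


ref_node = NodeKeyInfo(
    category="reference",
    title="An example title",
    abstract="An example abstract.",
    time=2020,
    keywords=["graph network"],
    level=1,
    fragment="an example citation fragment",
)
```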

As an improvement of the above method, the step 2-2) specifically includes:

for a node whose category is reference node, the information weight of the title is 1.5Q₁, the information weight of the abstract is Q₁, the information weight of the keywords is 2Q₁, and the information weight of the reference fragment is 1.2Q₁; wherein Q₁ is the reference weight of the reference scientific and technical literature and satisfies the following formula:

Q₁ = A × (1 − time difference / 10) × (1 − reference level / 5)

wherein A represents the weight base of the reference scientific and technical literature, and the time difference represents the difference between the publication times of the citing and the cited scientific and technical literature;

for a node whose category is referenced node, the information weight of the title is 1.5Q₂, the information weight of the abstract is Q₂, the information weight of the keywords is 2Q₂, and the information weight of the referenced fragment is 1.2Q₂; wherein Q₂ is the reference weight of the referenced scientific and technical literature and satisfies the following formula:

Q₂ = B × (1 − time difference / 10) × (1 − referenced level / 5)

wherein B represents the weight base of the referenced scientific and technical literature, and the time difference represents the difference between the publication times of the citing and the cited scientific and technical literature;

for a node whose category is co-author node, the information weight of the title is 1.5Q₃, the information weight of the abstract is Q₃, and the information weight of the keywords is 2Q₃; wherein Q₃ is the reference weight of the co-author scientific and technical literature and satisfies the following formula:

Q₃ = C × (1 − time difference / 10) × (1 − co-author level / 5)

wherein C represents the weight base of the co-author scientific and technical literature, and the time difference represents the difference between the publication times of the co-author scientific and technical documents.
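The weighting rules of step 2-2) translate directly into a small function, sketched here. The function and dictionary names are hypothetical, and the time difference is assumed to be measured in years; the base value (A, B or C) is passed in by the caller because the source does not fix its value.

```python
# Multipliers applied to the reference weight Q for each kind of key information.
FIELD_FACTORS = {"title": 1.5, "abstract": 1.0, "keywords": 2.0, "fragment": 1.2}


def reference_weight(base, time_diff, level):
    """Q = base * (1 - time_diff / 10) * (1 - level / 5)."""
    return base * (1 - time_diff / 10) * (1 - level / 5)


def field_weights(category, base, time_diff, level):
    """Return the information weight of each field for one related node."""
    q = reference_weight(base, time_diff, level)
    fields = ["title", "abstract", "keywords"]
    if category != "co_author":        # co-author nodes have no fragment
        fields.append("fragment")
    return {name: FIELD_FACTORS[name] * q for name in fields}


# Example with assumed values: base 1.0, time difference 5, level 2.
weights = field_weights("reference", base=1.0, time_diff=5, level=2)
```

Note that the formula as stated yields zero or negative values once the time difference reaches 10 or the level reaches 5. The source does not say how such cases are handled, so the sketch leaves the result unclamped.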

As an improvement of the above method, the step 3) specifically includes:

extracting the self information of the scientific and technical literature to be processed to obtain the time, title, abstract, text, references and authors;

constructing a scientific and technical literature graph network according to the co-author and citation relations;

organizing the scientific and technical literature graph network data based on the scientific and technical literature graph network data organization model;

fusing the information of the scientific and technical literature to be processed with the scientific and technical literature graph network information to extract keywords.

Compared with the prior art, the invention has the advantages that:

1. the invention constructs a complex scientific and technical literature graph network based on co-authors and citation relations, and provides an automatic extraction method based on the scientific and technical literature graph network, so that the automatic extraction effect of keywords of scientific and technical literature is further improved;

2. the invention provides a scientific and technical literature graph network construction method and a scientific and technical literature graph network data organization model, which can make full use of scientific and technical literature graph network information and solve the problem of how graph network data is utilized.

Drawings

FIG. 1 is a flow chart of the automatic keyword extraction method based on a scientific and technical literature graph network according to the present invention;

FIG. 2 is a diagram of node types and related node key information of the scientific and technical literature;

FIG. 3 is a diagram illustrating node types and weight settings of scientific and technical literature according to the present invention;

FIG. 4 is an example of keyword extraction using the method of the present invention.

Detailed Description

The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.

Example 1

As shown in fig. 1, an embodiment 1 of the present invention provides an automatic keyword extraction method based on a scientific and technical literature graph network, which specifically includes the following steps:

(1) Scientific and technical literature graph network construction method

For a collection of papers, each paper is treated as a node. The connections between nodes are constructed from citation relations and co-author information. For example, if document A cites document B, node A generates an edge pointing to node B. Through the author information of document A, a document C published by one of its co-authors can be associated, and an edge from node C to node A is then generated. With the literature citation relations and the co-author information, a scientific and technical literature graph network can be conveniently constructed.

(2) Scientific and technical literature graph network data organization model

A data organization model is constructed for the scientific and technical literature graph network data. The nodes of the graph network are divided into three categories: nodes of citing scientific and technical literature, nodes of cited scientific and technical literature, and nodes of scientific and technical literature published by a co-author. The specific categories and the key information of the related nodes are shown in FIG. 2.
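The level (hierarchy) attached to each related node is described as a distance in the graph. One plausible way to obtain it, assuming the distance is the number of citation edges on the shortest path from the document being processed, is a breadth-first search. The function name `citation_levels` and the cut-off `max_level` are hypothetical; the default of 4 is chosen only because the weight formula gives zero at level 5.

```python
from collections import deque


def citation_levels(cites, start, max_level=4):
    """Return {doc_id: level} for documents reachable from start.

    cites maps a document id to the list of document ids it cites.
    Passing the reversed map instead gives distances in the opposite direction.
    """
    levels = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        doc, depth = queue.popleft()
        if depth == max_level:
            continue                      # do not expand beyond the cut-off
        for nxt in cites.get(doc, []):
            if nxt not in seen:
                seen.add(nxt)
                levels[nxt] = depth + 1   # first visit is the shortest distance
                queue.append((nxt, depth + 1))
    return levels


example_cites = {"A": ["B"], "B": ["C"], "C": ["A"]}
levels_from_a = citation_levels(example_cites, "A")
```

Here B is one citation step from A and C is two steps away; the cycle back to A is ignored because A has already been visited.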

(3) Keyword extraction method based on the scientific and technical literature graph network

For a scientific and technical document, the data organization model makes it possible to obtain the relevant information of the citing literature, the cited literature and the co-author literature. When extracting keywords, this related information can be used as supplementary information for the document. Different weights are given to the supplementary information according to the time difference, the level relation and the key information type of the related nodes. The smaller the publication time difference between two documents, the higher their correlation and the higher the weight. The more levels between two documents, the weaker their correlation and the lower the weight. Title information is more condensed than abstract information, so it is given a higher weight than the abstract. The weight settings are shown in FIG. 3 and are specified as follows:

for a node whose category is reference node, the information weight of the title is 1.5Q₁, the information weight of the abstract is Q₁, the information weight of the keywords is 2Q₁, and the information weight of the reference fragment is 1.2Q₁; wherein Q₁ is the reference weight of the reference scientific and technical literature and satisfies the following formula:

Q₁ = A × (1 − time difference / 10) × (1 − reference level / 5)

wherein A represents the weight base of the reference scientific and technical literature, and the time difference represents the difference between the publication times of the citing and the cited scientific and technical literature;

Q₂ = B × (1 − time difference / 10) × (1 − referenced level / 5)

for a node whose category is co-author node, the information weight of the title is 1.5Q₃, the information weight of the abstract is Q₃, and the information weight of the keywords is 2Q₃; wherein Q₃ is the reference weight of the co-author scientific and technical literature and satisfies the following formula:

Q₃ = C × (1 − time difference / 10) × (1 − co-author level / 5)

wherein C represents the weight base of the co-author scientific and technical literature, and the time difference represents the difference between the publication times of the co-author scientific and technical documents.
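As a worked illustration of these formulas, consider a reference node with assumed values A = 1, a time difference of 2 and a reference level of 1. None of these values comes from the source; they are chosen only to make the arithmetic easy to follow.

```latex
\begin{aligned}
Q_1 &= A \times \left(1 - \frac{2}{10}\right) \times \left(1 - \frac{1}{5}\right)
     = 1 \times 0.8 \times 0.8 = 0.64, \\
\text{title weight} &= 1.5\,Q_1 = 0.96, \\
\text{abstract weight} &= Q_1 = 0.64, \\
\text{keyword weight} &= 2\,Q_1 = 1.28, \\
\text{reference fragment weight} &= 1.2\,Q_1 = 0.768.
\end{aligned}
```

The keywords of the related document therefore contribute twice as much as its abstract, and the whole contribution shrinks as the time difference or the level grows.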

And finally, combining the information of the scientific and technical literature with the information of the scientific and technical literature graph network acquired by means of the scientific and technical literature graph network data organization model, and then extracting keywords.
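The source does not specify which extraction algorithm consumes the fused information. As one possible reading, the sketch below scores candidate terms by weighted term counts: terms from the document itself count with weight 1, and terms from related nodes count with the information weight of the field they come from. The tokenizer, the function names and the weight of 1 for the document's own text are all assumptions, and no stop-word filtering is applied.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_terms(own_text, supplements):
    """supplements is an iterable of (text, weight) pairs from related nodes."""
    scores = Counter()
    for term in tokenize(own_text):
        scores[term] += 1.0            # own text at weight 1 (assumption)
    for text, weight in supplements:
        for term in tokenize(text):
            scores[term] += weight     # supplementary text at its field weight
    return scores


def top_keywords(own_text, supplements, k=3):
    scores = score_terms(own_text, supplements)
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


keywords = top_keywords(
    "graph network",
    [("keyword extraction", 1.28), ("graph", 0.64)],
)
```

A term such as "graph", which appears both in the document and in a related node, ends up ranked above terms that appear in only one of them.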

Simulation example:

This document provides an automatic keyword extraction method based on a scientific and technical literature graph network. First, a scientific and technical literature graph network is built from citation relations and co-author information. CiteSeer is a digital library of academic literature built by the NEC Research Institute on the basis of an automatic citation indexing mechanism. By searching CiteSeer with the title of a scientific and technical document, the abstract, title and citation fragment (Citation Context) can be obtained. The author information, citation fragments, publication time, abstract information, title information, citing paper information and the like can be conveniently obtained from the CiteSeer search results, as shown in FIG. 4.

Then, the graph network data is organized based on the scientific and technical literature graph network data organization model. Finally, keywords are extracted based on the scientific and technical literature graph network keyword extraction framework.

To verify the keyword extraction effect, 500 scientific and technical documents were selected from CNKI as experimental data. First, keywords were extracted with an existing keyword extractor, and the precision (P), recall (R) and F1 values were calculated. Then keywords were extracted with the method provided here, and the precision (P), recall (R) and F1 values were calculated again. Finally, the test results were compared. The P, R and F1 indices are calculated as follows:

P = number of correct automatically extracted results / total number of automatically extracted results

R = number of correct automatically extracted results / number of keywords of the document itself

F1 = 2PR / (P + R).
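The three indices can be computed as in the sketch below. The function name `prf1` is hypothetical, and the sketch assumes that an extracted keyword counts as correct only when it exactly matches one of the document's own keywords, which the source does not state explicitly.

```python
def prf1(extracted, gold):
    """Return (P, R, F1) for extracted keywords against the document's own keywords."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)                 # exact-match assumption
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Four extracted keywords, two of which match the three gold keywords.
p, r, f1 = prf1(["a", "b", "c", "d"], ["a", "b", "e"])
```

In this example P is 2/4, R is 2/3, and F1 is their harmonic mean, 4/7.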

Experimental results show that the method provided by the invention can improve the keyword extraction effect by 5-15% on the original basis.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A keyword automatic extraction method based on a scientific and technical literature graph network comprises the following steps:

2. The method for automatically extracting keywords based on scientific and technical literature graph network according to claim 1, wherein the step 1) specifically comprises:

step 1-5) obtaining the scientific and technical literature graph network.

3. The method for automatically extracting keywords based on scientific and technical literature graph network as claimed in claim 2, wherein the step 2) specifically comprises:

step 2-1) setting node key information according to the category of each node;

step 2-2) calculating the weight of the node key information according to the category of the node and the node key information.

4. The method for automatically extracting keywords based on scientific and technical literature graph network according to claim 3, wherein the step 2-1) specifically comprises:

for a node whose category is reference node, the node key information comprises: title, abstract, time, keywords, reference level, and reference fragment; wherein the time represents the publication time of the document, and the reference level represents the citation distance between scientific and technical documents;

5. The method for automatically extracting keywords based on scientific and technical literature graph network as claimed in claim 4, wherein the step 2-2) specifically comprises:

for a node whose category is reference node, the information weight of the title is 1.5Q₁, the information weight of the abstract is Q₁, the information weight of the keywords is 2Q₁, and the information weight of the reference fragment is 1.2Q₁; wherein Q₁ is the reference weight of the reference scientific and technical literature and satisfies the following formula:

Q₁ = A × (1 − time difference / 10) × (1 − reference level / 5)

for a node whose category is referenced node, the information weight of the title is 1.5Q₂, the information weight of the abstract is Q₂, the information weight of the keywords is 2Q₂, and the information weight of the referenced fragment is 1.2Q₂; wherein Q₂ is the reference weight of the referenced scientific and technical literature and satisfies the following formula:

Q₂ = B × (1 − time difference / 10) × (1 − referenced level / 5)

Q₃ = C × (1 − time difference / 10) × (1 − co-author level / 5)

6. The method for automatically extracting keywords based on scientific and technical literature graph network according to claim 5, wherein the step 3) specifically comprises:

and fusing the information of the scientific and technical literature to be processed with the scientific and technical literature graph network information to extract keywords.