CN110781297B

CN110781297B - Classification method of multi-label scientific research papers based on hierarchical discriminant trees

Info

Publication number: CN110781297B
Application number: CN201910881086.6A
Authority: CN
Inventors: 刘玮; 吴俊杰; 李超; 左源; 纪玉春; 袁石
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2019-09-18
Filing date: 2019-09-18
Publication date: 2022-06-21
Anticipated expiration: 2039-09-18
Also published as: CN110781297A

Abstract

The invention discloses a method for classifying multi-label scientific research papers based on a hierarchical discriminant tree, comprising the following steps: step one, acquiring papers with known labels together with their labels, extracting a feature word set for each label, and constructing a binary discriminant model; step two, replacing each label with its binary discriminant model to obtain a hierarchical discriminant tree model; step three, obtaining a text representation of a paper whose labels are unknown, inputting the text representation into the binary discriminant models of all root nodes in the hierarchical discriminant tree model, calculating the probability that the paper has the label corresponding to each node, and outputting the label corresponding to a root node if the probability is greater than a threshold value; inputting the text representation into the binary discriminant models of all child nodes of the node corresponding to each output label, calculating the probability that the paper has the label represented by each child node, outputting the label corresponding to a child node if the probability is greater than the threshold value, and continuing level by level until the leaf nodes are reached; all the output labels are the labels of the paper. The method fully mines the feature words of the papers and classifies the papers hierarchically, quickly and accurately.

Description

Classification method of multi-label scientific research papers based on hierarchical discriminant trees

Technical Field

The invention relates to the field of scientific research paper classification. More specifically, the invention relates to a classification method of multi-label scientific research papers based on a hierarchical discriminant tree.

Background

The organization and management of scientific research papers have long been a concern of publishing institutions, researchers and the like. In this field, the classification of scientific research papers is an important basic task. The task is to assign hierarchical labels to scientific research papers according to an existing class label system, and it is of great significance for the quick retrieval and summarization of scientific papers. On the one hand, paper classification helps a publishing institution quickly determine the category of the latest papers and add them to a citation database, so as to provide a high-quality paper data service. On the other hand, paper classification allows scientific research institutions and researchers to retrieve and summarize papers quickly according to the existing classification system, improving their retrieval and summarization efficiency. However, an existing class label system with a complex multi-layer structure makes classifying scientific research papers difficult: when a new paper is received, a reasonable and complete set of classification labels must be assigned to it within the multi-layer label system, which entails a heavy workload and considerable difficulty.

Disclosure of Invention

An object of the present invention is to solve at least the above problems and to provide at least the advantages described later.

The invention also aims to provide a classification method of multi-label scientific research papers based on the hierarchical discriminant tree, which can fully mine the characteristic words of the papers and quickly and accurately classify the papers in a hierarchical manner.

To achieve these objects and other advantages in accordance with the purpose of the invention, there is provided a classification method of multi-labeled scientific papers based on hierarchical discriminant trees, comprising:

step one, constructing a binary discriminant model:

acquiring all papers with known labels and labels of the papers in a multi-level label system, acquiring text representations of all papers by adopting a text word segmentation technology, screening the text representations to obtain a characteristic word set of each label, and constructing a binary discrimination model by using the corresponding relation between each label and the characteristic word set of the label;

step two, constructing a hierarchical discriminant tree model: replacing the label at every level of the multi-level label system with the binary discriminant model of that label to form a hierarchical discriminant tree model;

step three, classifying the papers with unknown labels: adopting a text word segmentation technology to obtain text representations of the paper, respectively inputting the text representations into all binary discriminant models of root nodes in a hierarchical discriminant tree model, calculating the probability that the paper has a label corresponding to the node by using the binary discriminant models, and outputting the label corresponding to the root node if the probability is greater than a threshold value;

inputting the text representation into the binary discriminant models of all child nodes of the node corresponding to each label output at the current level, calculating with the binary discriminant models the probability that the paper has the label represented by each child node, and outputting the label corresponding to a child node if the probability is greater than the threshold value;

continuing the judgment in hierarchical order from top to bottom until the text representation has been input into the binary discriminant models of the leaf nodes of the hierarchical discriminant tree model and their output has been judged;

all labels output on the path starting from the root node and ending with the leaf nodes are taken as labels of the paper.
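
As a non-limiting illustration, the top-down judgment of step three may be sketched as follows. The names `Node`, `classify` and `keyword_model`, the toy label names, and the use of a word-count dictionary as the text representation are assumptions introduced for this sketch only; the toy models merely stand in for trained binary discriminant models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: a binary discriminant model maps a text representation
# (here a word-count dictionary) to a probability in [0, 1].
Model = Callable[[Dict[str, int]], float]


@dataclass
class Node:
    label: str
    model: Model
    children: List["Node"] = field(default_factory=list)


def classify(roots: List[Node], text_repr: Dict[str, int],
             threshold: float = 0.5) -> List[str]:
    """Collect every label whose model scores above the threshold,
    descending only into the children of accepted nodes."""
    labels = []
    frontier = list(roots)
    while frontier:
        node = frontier.pop(0)
        if node.model(text_repr) > threshold:
            labels.append(node.label)
            frontier.extend(node.children)
    return labels


def keyword_model(word: str) -> Model:
    # Toy stand-in for a trained model: confident when the word is present.
    return lambda text: 0.9 if text.get(word, 0) > 0 else 0.1


tree = [
    Node("Computer Science", keyword_model("algorithm"), [
        Node("Machine Learning", keyword_model("classifier")),
        Node("Databases", keyword_model("transaction")),
    ]),
    Node("Physics", keyword_model("quantum")),
]

paper = {"algorithm": 3, "classifier": 2}
print(classify(tree, paper))
```

In this sketch a child node is evaluated only when its parent has been accepted, so a paper that matches a child's model but not the parent's receives no label from that branch.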

Preferably, the method for obtaining the text representation by adopting the text word segmentation technology comprises the following steps:

using a word segmentation and part-of-speech tagging tool to segment the paper into words and tag their parts of speech, and retaining all words tagged as nouns to form word set I;

using a BERT pre-trained language model to obtain, from the paper, the semantic vector of each word in word set I, thereby forming word set II;

word set I and word set II together constitute the text representation of the paper.
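
As a non-limiting illustration, the formation of word set I and word set II may be sketched as follows. The sketch assumes the paper has already been segmented and tagged by an external tool, that noun tags begin with "n", and that `embed` is a hypothetical callable standing in for a BERT-based encoder; none of these names come from the method itself.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def build_text_representation(
    tagged_tokens: Sequence[Tuple[str, str]],
    embed: Callable[[str], Vector],
    noun_prefixes: Tuple[str, ...] = ("n",),
) -> Tuple[List[str], Dict[str, Vector]]:
    # Word set I: every token whose part-of-speech tag marks a noun.
    word_set_1 = [w for w, tag in tagged_tokens if tag.startswith(noun_prefixes)]
    # Word set II: one semantic vector per distinct word of word set I.
    word_set_2 = {w: embed(w) for w in dict.fromkeys(word_set_1)}
    return word_set_1, word_set_2


def toy_embed(word: str) -> Vector:
    # Placeholder vector; a real system would query a language model here.
    return [float(len(word)), 1.0]


tagged = [("graph", "n"), ("propose", "v"), ("network", "n"), ("graph", "n")]
words, vectors = build_text_representation(tagged, toy_embed)
print(words)
print(vectors)
```

Word set I keeps repeated occurrences so that word frequencies can still be counted later, while word set II stores a single vector per distinct word.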

Preferably, the method for obtaining the feature word set of each label by screening comprises the following steps: starting from a top-level label of a multi-level label system, acquiring a characteristic word corresponding to each label by the following method according to the sequence from a root node to a leaf node;

the method comprises the following steps:

step a, calculating the weight of each word in the text representation of all papers according to all papers under each label, wherein the weight calculation formula is shown as a formula (1):

wherein $F_j(i)$ represents the frequency of word $i$ in paper $j$, calculated as shown in formula (2):

$\mathrm{count}(i)$ represents the number of times word $i$ appears in paper $j$, and $\mathrm{total\_word}_j$ represents the total number of words in paper $j$; $N_t$ represents the number of all papers under label $t$; $N_{\sim t}$ represents the number of all papers under the other labels that share the same upper-level label as label $t$; if label $t$ is a top-level label, $\sim t$ denotes the other top-level labels; if label $t$ is not a top-level label, $\sim t$ denotes the other labels under the upper-level label to which label $t$ belongs; $N^{i}_{\sim t}$ represents the number of papers containing word $i$ among all papers under the other labels that share the same upper-level label as label $t$;

b, sorting the weights of all words under the label in a descending order, taking M words at the top of the ranking as the characteristic words of the label, and forming an initial characteristic word set of the label;

step c, calculating semantic similarity of all the remaining words and all the words in the initial characteristic word set according to the semantic characteristics of the characteristic words, wherein a calculation formula is shown as a formula (3):

wherein M represents the number of words in the initial feature word set of the label, $\cos(j, i)$ represents the cosine distance between the semantic representations of word $j$ and word $i$, and $W_t(j)$ represents the weight of word $j$ under label $t$;

sorting all the remaining words under the label in descending order of semantic similarity, and taking the top K words as feature words of the label to form a supplementary feature word set of the label;

and the initial characteristic word set and the supplementary characteristic word set of the label form a characteristic word set of the label.
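
As a non-limiting illustration, step c may be sketched as follows. Formula (3) is not reproduced in this text, so the sketch assumes that the similarity of a remaining word $i$ is the average, over the M words $j$ of the initial set, of $W_t(j)\cdot\cos(j, i)$; this aggregation is an assumption inferred from the symbol definitions above, and $\cos$ is computed as cosine similarity. The function names and toy vectors are likewise illustrative.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def supplementary_feature_words(
    initial: List[str],
    weights: Dict[str, float],
    vectors: Dict[str, List[float]],
    k: int,
) -> List[str]:
    m = len(initial)
    if m == 0:
        return []
    scores = {}
    for word, vec in vectors.items():
        if word in initial:
            continue  # only the remaining words are candidates
        # Assumed form of formula (3): weight-averaged cosine similarity.
        scores[word] = sum(
            weights[j] * cosine(vectors[j], vec) for j in initial
        ) / m
    # Descending similarity; ties broken alphabetically (an assumption).
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]


vectors = {
    "neural": [1.0, 0.0],
    "network": [0.9, 0.1],
    "deep": [1.0, 0.1],
    "table": [0.0, 1.0],
}
weights = {"neural": 0.9, "network": 0.7}
initial = ["neural", "network"]
print(supplementary_feature_words(initial, weights, vectors, k=1))
```

The union of `initial` and the returned list then corresponds to the feature word set of the label.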

Preferably, the value of M is 5% of the total number of words of the text representation under the corresponding label.

Preferably, M is no greater than 1000.

Preferably, the total number of feature words per tag is no greater than 5000.

Preferably, after the binary discriminant model calculates the probability, the threshold of the probability is 0.5.

Preferably, the binary discriminant model is constructed by any one of a convolutional neural network, naive Bayes and a support vector machine.

The invention at least comprises the following beneficial effects:

First, the labels in an existing multi-level label system have no discriminating function and can only be assigned through subjective human judgment, so whether a label is relevant to a paper cannot be known accurately. After the hierarchical discriminant tree model is formed, each node has an automatic discriminating function: given only the text representation as input, it outputs whether the paper is relevant to the label corresponding to that node, which improves discrimination accuracy and makes the method more objective and less error-prone.

Second, the binary discriminant model accurately and comprehensively reflects the association between a label and the words used in the papers, and yields the feature words most strongly associated with the label. As papers are added and updated, the feature word set of each label grows and is updated accordingly, which can improve the accuracy of the whole classification system.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.

Drawings

Fig. 1 is a block diagram of one embodiment of the present invention.

Detailed Description

The present invention is further described in detail below with reference to the attached drawings so that those skilled in the art can implement the invention by referring to the description text.

As shown in fig. 1, the present invention provides a classification method for a multi-label scientific research paper based on a hierarchical discriminant tree, including:

step one, constructing a binary discriminant model:

acquiring all papers with known labels and the labels of the papers in a multi-level label system, acquiring text representations of all papers by a text word segmentation technology, screening the text representations to obtain a feature word set of each label, and constructing a binary discriminant model from the correspondence between each label and its feature word set. The discriminant model can judge whether a scientific research paper belongs to a label by a traditional data mining method such as a support vector machine, naive Bayes, logistic regression and the like. A binary discriminant model obtained in this way accurately and comprehensively reflects the association between a label and the words used in the papers, and yields the feature words most strongly associated with the label. As papers are added and updated, the feature word set of each label grows and is updated accordingly, which can improve the accuracy of the whole classification system.

Step two, constructing a hierarchical discriminant tree model: replacing the label at every level of the multi-level label system with the binary discriminant model of that label to form a hierarchical discriminant tree model. The labels in an existing multi-layer label system have no discriminating function and can only be assigned through subjective human judgment, so whether a label is relevant to a paper cannot be known accurately. After the hierarchical discriminant tree model is formed, each node has an automatic discriminating function: given only the text representation as input, it outputs whether the paper is relevant to the label corresponding to that node, which improves discrimination accuracy and makes the method more objective and less error-prone.

all labels output on the path starting from the root node and ending with the leaf nodes are taken as labels of the paper. Judging in hierarchical order from the root nodes to the leaf nodes avoids omissions, reduces the judging workload, and quickly and accurately outputs the hierarchical labels of a new paper, thereby classifying it.
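
As a non-limiting illustration, the construction of step two, in which every label of the multi-level label system is replaced by its binary discriminant model, may be sketched as follows. The names `TreeNode` and `build_tree`, the parent map used to describe the label system, and the constant placeholder models are assumptions introduced for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Model = Callable[[dict], float]


@dataclass
class TreeNode:
    label: str
    model: Model
    children: List["TreeNode"] = field(default_factory=list)


def build_tree(parent_of: Dict[str, Optional[str]],
               models: Dict[str, Model]) -> List[TreeNode]:
    """Attach each label's model to a node and link nodes by hierarchy.

    parent_of maps every label to its upper-level label, or to None
    for a top-level label. Returns the list of root nodes.
    """
    nodes = {label: TreeNode(label, models[label]) for label in parent_of}
    roots = []
    for label, parent in parent_of.items():
        if parent is None:
            roots.append(nodes[label])
        else:
            nodes[parent].children.append(nodes[label])
    return roots


parent_of = {
    "Computer Science": None,
    "Machine Learning": "Computer Science",
    "Databases": "Computer Science",
    "Physics": None,
}
# Placeholder models; trained binary discriminant models would go here.
models = {label: (lambda text: 0.5) for label in parent_of}

roots = build_tree(parent_of, models)
print([r.label for r in roots])
print([c.label for c in roots[0].children])
```

Because each node carries its own model, adding a label to the label system only requires training one further binary model and linking one further node.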

In the technical scheme, in view of the association between the words and terms used in scientific research papers and the labels, scientific research papers with known labels and their label information are used to obtain a feature word set corresponding to each label; then, according to the multi-level label system, a binary discriminant model is constructed for each label, and the discriminant models of all labels are fused into a hierarchical discriminant tree model; finally, the labels of scientific research papers with unknown labels are judged based on the hierarchical discriminant tree model. The method takes into account the relevance between the words used in scientific research papers and the labels, automatically screens the feature words related to each label, and constructs the corresponding binary discriminant model. The hierarchical discriminant tree model is then used to classify scientific research papers with unknown labels, fully exploiting the hierarchical relation among the labels.

In another technical scheme, a method for obtaining text representation by adopting a text word segmentation technology comprises the following steps:

word set I and word set II together constitute the text representation of the paper.

In another technical scheme, the method for obtaining the feature word set of each label by screening comprises the following steps: starting from a top-level label of a multi-level label system, acquiring a characteristic word corresponding to each label by the following method according to the sequence from a root node to a leaf node;

the method comprises the following steps:

step a, calculating the weight of each word in the text representation of the papers according to all papers under each label, wherein the weight calculation formula is shown as a formula (1):

$\mathrm{count}(i)$ represents the number of times word $i$ appears in paper $j$, and $\mathrm{total\_word}_j$ represents the total number of words in paper $j$; $N_t$ represents the number of all papers under label $t$; $N_{\sim t}$ represents the number of all papers under the other labels that share the same upper-level label as label $t$; if label $t$ is a top-level label, $\sim t$ denotes the other top-level labels; if label $t$ is not a top-level label, $\sim t$ denotes the other labels under the upper-level label to which label $t$ belongs; $N^{i}_{\sim t}$ represents the number of papers containing word $i$ among all papers under the other labels that share the same upper-level label as label $t$;

step b, sorting the weights of all words under the label in descending order, and taking the top M words as feature words of the label to form an initial feature word set of the label;
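
As a non-limiting illustration, the quantities entering step a may be computed as sketched below. Formulas (1) and (2) are not reproduced in this text; the sketch assumes that the frequency $F_j(i)$ is the ratio of $\mathrm{count}(i)$ to $\mathrm{total\_word}_j$, as the definitions suggest, and it computes only the counts $N_t$, $N_{\sim t}$ and $N^{i}_{\sim t}$ without combining them into a weight. The function names and toy data are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Paper = List[str]  # a paper represented as its list of retained words


def term_frequency(paper: Paper) -> Dict[str, float]:
    """Assumed form of formula (2): count(i) / total_word_j."""
    total = len(paper)
    if total == 0:
        return {}
    return {w: c / total for w, c in Counter(paper).items()}


def sibling_counts(papers_by_label: Dict[str, List[Paper]], label: str,
                   siblings: List[str], word: str) -> Tuple[int, int, int]:
    """Return (N_t, N_~t, N^i_~t) for one label and one word.

    siblings lists the other labels sharing the same upper-level label.
    Assumption: each paper is listed under one sibling label only.
    """
    n_t = len(papers_by_label[label])
    sibling_papers = [p for s in siblings for p in papers_by_label[s]]
    n_not_t = len(sibling_papers)
    n_i_not_t = sum(1 for p in sibling_papers if word in p)
    return n_t, n_not_t, n_i_not_t


papers_by_label = {
    "ML": [["model", "data", "model", "loss"]],
    "DB": [["query", "data"], ["index", "query"]],
}
print(term_frequency(papers_by_label["ML"][0]))
print(sibling_counts(papers_by_label, "ML", ["DB"], "data"))
```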

In the technical scheme, scientific research papers are lengthy and contain much information that is irrelevant to multi-level label classification; extracting the information relevant to multi-level label classification to obtain the text representation of a paper therefore improves both classification efficiency and classification accuracy.

In another technical scheme, the value of M is 5% of the total number of words in the text representations under the corresponding label. The value of M may be adjusted flexibly according to the total number of feature words under each label, and is generally 5% of that total.

In another technical scheme, the value of M is not more than 1000. The total number of the characteristic words of scientific research papers to which part of labels belong is large and can reach over ten thousand. This will result in an excessively large value of M, easily increasing noise words, and reducing the effect of the multi-level label classification model. Therefore, the present invention limits the value of M to 1000 to reduce the number of noise feature words.
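
As a non-limiting illustration, choosing M as 5% of the word total with an upper limit of 1000, and then taking the top M words by weight, may be sketched as follows. The word weights are taken as given, since formula (1) is not reproduced in this text; rounding down and alphabetical tie-breaking are assumptions, as are the function names.

```python
from typing import Dict, List


def choose_m(total_words: int, percent: int = 5, m_cap: int = 1000) -> int:
    # Integer arithmetic avoids floating-point rounding; rounding down
    # is an assumption of this sketch.
    return min(total_words * percent // 100, m_cap)


def initial_feature_words(weights: Dict[str, float], m: int) -> List[str]:
    # Descending weight; ties broken alphabetically (an assumption).
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:m]]


print(choose_m(12000))   # 5% of 12000 words
print(choose_m(40000))   # 5% would be 2000, limited to 1000

toy_weights = {"neural": 0.9, "data": 0.2, "network": 0.7, "result": 0.1}
print(initial_feature_words(toy_weights, 2))
```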

In another technical scheme, the total number of feature words of each label is not more than 5000. All the remaining words are sorted by the calculated semantic similarity, and the top K words are added to the feature word set of the label to expand it. To prevent the introduction of too many noise feature words, M + K (i.e., the total number of feature words per label) is limited to 5000.
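
As a non-limiting illustration, the limit on M + K may be enforced as sketched below. The text does not state how K itself is chosen, so `requested_k` is a hypothetical input and the function name is an assumption.

```python
def limit_k(m: int, requested_k: int, total_cap: int = 5000) -> int:
    """Largest K not exceeding requested_k such that M + K <= total_cap."""
    return max(0, min(requested_k, total_cap - m))


print(limit_k(600, 300))     # well under the limit, K is unchanged
print(limit_k(1000, 6000))   # K is reduced so that M + K equals 5000
```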

In another technical scheme, after a binary discriminant model calculates the probability, the probability threshold is 0.5 in every case, so as to improve the accuracy of the correspondence between labels and papers.

In another technical scheme, the binary discriminant model is constructed by any one of a convolutional neural network, naive Bayes and a support vector machine. These three methods capture the correspondence accurately, require little computation, and judge quickly.
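
As a non-limiting illustration of one of the listed options, a naive Bayes binary discriminant model over a label's feature words may be sketched as follows. The Bernoulli variant, the add-one smoothing, the class name `BernoulliNaiveBayes` and the toy data are assumptions of this sketch; the text does not specify which naive Bayes variant is used.

```python
import math
from typing import Dict, List


class BernoulliNaiveBayes:
    """Binary model: does a paper carry the label, given which of the
    label's feature words it contains?"""

    def __init__(self, feature_words: List[str]):
        self.vocab = sorted(set(feature_words))

    def fit(self, papers: List[List[str]], has_label: List[bool]):
        self.log_prior: Dict[bool, float] = {}
        self.p_word: Dict[bool, Dict[str, float]] = {}
        for cls in (True, False):
            docs = [set(p) for p, y in zip(papers, has_label) if y == cls]
            # Add-one smoothing keeps every probability strictly in (0, 1).
            self.log_prior[cls] = math.log((len(docs) + 1) / (len(papers) + 2))
            self.p_word[cls] = {
                w: (sum(w in d for d in docs) + 1) / (len(docs) + 2)
                for w in self.vocab
            }
        return self

    def predict_proba(self, paper: List[str]) -> float:
        present = set(paper)
        score = {}
        for cls in (True, False):
            s = self.log_prior[cls]
            for w in self.vocab:
                p = self.p_word[cls][w]
                s += math.log(p if w in present else 1.0 - p)
            score[cls] = s
        top = max(score.values())  # subtract the maximum for stability
        e_true = math.exp(score[True] - top)
        e_false = math.exp(score[False] - top)
        return e_true / (e_true + e_false)


train = [["neural", "network", "training"], ["gradient", "neural"],
         ["query", "index"], ["transaction", "index"]]
labels = [True, True, False, False]
nb = BernoulliNaiveBayes(["neural", "network", "gradient"]).fit(train, labels)

print(nb.predict_proba(["neural", "gradient"]) > 0.5)
print(nb.predict_proba(["query"]) > 0.5)
```

The returned probability can be compared directly with the 0.5 threshold when the model is placed at a node of the hierarchical discriminant tree.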

While embodiments of the invention have been described above, it is not limited to the applications set forth in the description and the embodiments, which are fully applicable in various fields of endeavor to which the invention pertains, and further modifications may readily be made by those skilled in the art, it being understood that the invention is not limited to the details shown and described herein without departing from the general concept defined by the appended claims and their equivalents.

Claims

1. The classification method of the multi-label scientific research paper based on the hierarchical discriminant tree is characterized by comprising the following steps:

step one, constructing a binary discriminant model:

taking all labels output on a path from a root node to a leaf node as labels of the paper;

the method for obtaining the feature word set of each label through screening comprises the following steps: starting from a top-level label of a multi-level label system, acquiring a characteristic word corresponding to each label by the following method according to the sequence from a root node to a leaf node;

the method comprises the following steps:

wherein $F_j(i)$ represents the frequency of word $i$ in paper $j$, calculated as shown in formula (2):

$\mathrm{count}(i)$ represents the number of times word $i$ appears in paper $j$, and $\mathrm{total\_word}_j$ represents the total number of words in paper $j$; $N_t$ represents the number of all papers under label $t$; $N_{\sim t}$ represents the number of all papers under the other labels that share the same upper-level label as label $t$; if label $t$ is a top-level label, $\sim t$ denotes the other top-level labels; if label $t$ is not a top-level label, $\sim t$ denotes the other labels under the upper-level label to which label $t$ belongs; $N^{i}_{\sim t}$ represents the number of papers containing word $i$ among all papers under the other labels that share the same upper-level label as label $t$;

wherein M represents the number of words in the initial feature word set of the label, $\cos(j, i)$ represents the cosine distance between the semantic representations of word $j$ and word $i$, and $W_t(j)$ represents the weight of word $j$ under label $t$;

2. The method for classifying a multi-label scientific research paper based on a hierarchical discriminant tree as claimed in claim 1, wherein the method for obtaining the text representation by using the text segmentation technology comprises:

word set I and word set II together constitute the text representation of the paper.

3. The method for classifying multi-label scientific research papers based on hierarchical discriminant trees as claimed in claim 1, wherein the value of M is 5% of the total number of words in the text representations under the corresponding label.

4. The method for classifying multi-label scientific papers based on hierarchical discriminant trees as claimed in claim 3, wherein a value of M is not greater than 1000.

5. The method of classifying a multi-label scientific paper based on hierarchical discriminant trees as claimed in claim 1, wherein the total number of feature words per label is not more than 5000.

6. The method for classifying multi-label scientific research papers based on hierarchical discriminant trees as claimed in claim 1, wherein after the binary discriminant model calculates the probability, the threshold values of the probability are all 0.5.

7. The classification method of multi-label scientific research papers based on hierarchical discriminant trees as claimed in claim 1, wherein the binary discriminant model is constructed by any one of a convolutional neural network, naive Bayes, and a support vector machine.