CN115017903A - Method and system for extracting key phrases by combining document hierarchical structure with global local information


Info

Publication number: CN115017903A
Application number: CN202210697632.2A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 赵姝, 殷俊, 郭双瑞, 张金磊, 段震, 陈洁
Original and current assignee: Anhui University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application filed by Anhui University; priority to CN202210697632.2A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G06F 40/30 Semantic analysis

Abstract

The invention provides a method and a system for extracting key phrases by combining a document hierarchical structure with global and local information. The method comprises the following steps: word segmentation, part-of-speech tagging and NP chunking; judging the length of the document, and embedding the document and its words with a BERT model according to that length; global similarity measurement, in which the invention innovatively uses the document title and ending for the global similarity evaluation of candidate key phrases, resolving the preference for longer candidate phrases caused by vector space misalignment; local similarity evaluation, in which a brand-new topic centrality is adopted to divide and cluster the candidate key phrases over the full text by topic, fully capturing locally salient information; comprehensive evaluation and scoring of the candidate phrases by combining the position information, the global similarity and the local similarity, followed by ranking according to the scores; and post-processing operations to select the key phrases. The invention solves the technical problems of low key phrase extraction accuracy caused by lost semantics, a preference for long phrases and insufficient topic information mining.

Description

Method and system for extracting key phrases by combining document hierarchical structure with global local information
Technical Field
The invention relates to the technical field of text analysis, and in particular to a method and a system for extracting key phrases by combining a document hierarchical structure with global and local information.
Background
The key phrases are phrases in the document that provide a concise abstract of the core content and help the reader to understand the content of the article in a short time. Because of the concise and accurate expression, key phrases are widely used in information retrieval, document classification, recommendation and search. The embedding-based approach is widely used for unsupervised key phrase extraction tasks. In general, these methods simply calculate the similarity between phrase embedding and document embedding, and there is room for improvement in both method utility and effectiveness.
A large number of scholars have studied keyword extraction of texts, and the related methods can be generally divided into an unsupervised extraction mode and a supervised extraction mode.
Supervised methods [Sterckx et al., 2016; Alzaidy et al., 2019; Sun et al., 2020; Mu et al., 2020] generally treat key phrase extraction as a binary classification problem; they not only require large-scale annotated training data, but also consistently underperform when migrated to data sets from different domains or of different types. Compared with supervised methods, unsupervised methods are more universal and adaptable because they extract phrases based only on the information in the input document. Therefore, in this patent, we focus on the unsupervised key phrase extraction model.
Unsupervised key phrase extraction has been studied by a large number of scholars, and recently, as text representation learning has progressed, embedding-based methods such as EmbedRank [Bennani-Smires, 2018] and SIFRank [Sun, 2020] have achieved good results. Generally, these methods embed the candidate phrases and the text through a static pre-training model (Word2Vec) or a dynamic pre-training model (BERT), then calculate the embedding similarity between each candidate phrase and the whole text, and sort by score. Although embedding-based methods may outperform conventional statistical methods (e.g., TF-IDF [Salton G, 1975]) and graph-based methods (e.g., PositionRank [Corina Florescu, 2017]), simply calculating the similarity between candidate phrases and the full text does not capture different types of context. CCRank [Liang et al., 2021] first proposed jointly modeling global and local information for keyword extraction, but that approach has three problems. First, due to the limitation of the BERT model, it automatically truncates long texts to the first 512 tokens, which causes a great deal of semantic loss. Second, because the full-text vector and the candidate-phrase vectors are not aligned in semantic space, its global similarity gives higher scores to the longer candidate phrases, so the model prefers long candidates. Third, it models local information simply with boundary properties and does not adequately mine the topic information of the article. Fig. 4 below shows a visualization of the key phrases and the full-text vector after embedding.
In Fig. 4, the five-pointed star is the vector embedding of the article, and the bold, mutually close nodes of the same color belong to the same topic. Traditional embedding-based methods consider only the global similarity, i.e., they select only the candidate phrases in the black dotted frame, and obviously ignore the importance of the article's local topics; yet phrases from the boundary usually represent only a small part of the article's topics, and the topic information among the candidate key phrases cannot be fully mined. In addition, previous methods do not account for the limitation that BERT can only encode 512 tokens, so when facing long texts they usually resort to truncation: only the candidate key phrases of the title and the abstract are obtained, those of the conclusion part are missed, sufficient semantic information is not captured, and the results are poor. The prior patent document CN111160017A, "a keyword extraction method, device, computer device, and storage medium", inputs the text data to be processed into a keyword extraction network model trained on sequence labeling samples carrying set codes, so that the semantic relevance of contexts can be fully discovered through standard keywords and the accuracy of keyword extraction improved. That application also provides a method and device for scoring scripts, computer equipment and a storage medium: by inputting the script to be scored into the trained keyword extraction network model, only the keywords relevant to the service can be extracted according to the different service scenarios.
The specification of that prior patent document also discloses that an initial keyword extraction network model composed of three network layers based on ERNIE-BiLSTM-CRF is deployed in the server 104. The ERNIE network element is an improved version of the BERT model, optimized for tasks at the Chinese vocabulary level so as to perform better on the extraction of Chinese entities and entity relationships; its main structure is the same as that of the BERT model, consisting of 12 encoder layers. That prior patent document does not disclose the technical solution of the present application and cannot achieve its technical effects. The method disclosed in the prior patent document CN113255340A, "a subject extraction method, device and storage medium for scientific and technological requirements", includes: acquiring scientific and technological requirement text data carrying first-level topic category labels for the industry field; obtaining word vectors and document vectors from the text data belonging to the same first-level topic category; obtaining topic-word vector representations and a topic-word set from the word vectors and document vectors with a deep-learning topic model; clustering the text data with a preset cluster count on the basis of the topic-word vectors; and extracting the topic words in the topic-word set as keywords with a text ranking algorithm, sorting the extracted topic words, screening the topic words that serve as second-level cluster category label words according to their scores, and taking the highest-scoring topic word as the second-level topic representative of the category.
The prior patent document does not disclose the technical solution of the present application, and the technical effect of the present application cannot be achieved.
In conclusion, the prior art suffers from low key phrase extraction accuracy caused by semantic loss, a preference for long phrases, and insufficient topic information mining.
Disclosure of Invention
The invention aims to solve the technical problem in the prior art of low key phrase extraction accuracy caused by semantic loss, a preference for long phrases, and insufficient topic information mining.
The invention adopts the following technical scheme to solve the technical problems: the method for extracting the key phrases by combining the document hierarchical structure with the global local information comprises the following steps:
S1, performing word segmentation and part-of-speech tagging on the input document with the Stanford CoreNLP tool, and performing NP chunking according to preset extraction rules to generate a candidate key phrase set;
S2, judging whether the length of the input document is smaller than or equal to a preset document length threshold; if so, embedding the input document with a BERT model to obtain its vector representation; otherwise, acquiring the content of the input document within a preset range and inputting it into the SimCSE model for embedding, to obtain the vector representations of the candidate key phrases, the title vector and the ending vector;
s3, processing the title vector and the ending vector to perform global similarity measurement on the candidate key phrases so as to obtain global similarity;
s4, using topic centrality to perform topic partitioning and clustering on the candidate key phrases in the full text of the input document with preset logic, and obtaining local similarity according to local similarity evaluation, wherein the step S4 further includes:
s41, taking the candidate key phrases as nodes and taking the similarity among the nodes as edges to construct a complete undirected graph;
s42, setting a self-adaptive noise filtering threshold value according to the maximum value and the minimum value of each input document;
s43, updating the weight of the edge according to the self-adaptive noise filtering threshold value to obtain local significance data, and obtaining an updated edge according to the local significance data;
s44, acquiring the position information of the candidate key phrases of the full text of the input document;
s45, calculating the local similarity according to the position information;
s5, comprehensively evaluating and scoring the candidate key phrases by combining and processing the position information, the global similarity and the local similarity, and processing the candidate key phrases according to the sequence to obtain key phrase ranking data;
S6, obtaining a ranked candidate key phrase data set from the key phrase ranking data and performing post-processing operations on the candidate key phrases: deleting subsets of the candidate key phrase set to obtain semantically diverse key phrases, and obtaining word frequency data and removing high-frequency generic phrases from the ranked candidate set to filter out the interference of high-frequency invalid phrases.
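The steps S1 to S6 above can be sketched as a pipeline. The following is a minimal Python illustration in which every stage is injected as a callable so that the real components (Stanford CoreNLP, BERT/SimCSE) could be plugged in; all function names and the multiplicative score combination shown here are illustrative assumptions, not the patent's exact formulas.

```python
# Illustrative pipeline skeleton for steps S1-S6. Each stage is injected as a
# callable so real models can be plugged in later; the trivial stand-ins used
# in testing are hypothetical.
def extract_key_phrases(doc, chunk, embed, global_sim, local_sim, position,
                        top_k=5):
    candidates = chunk(doc)                      # S1: NP chunking
    vectors = embed(doc, candidates)             # S2: length-aware embedding
    scored = []
    for phrase in candidates:                    # S3-S5: scoring
        s = (position(doc, phrase)
             * global_sim(vectors, phrase)
             * local_sim(vectors, phrase))
        scored.append((phrase, s))
    scored.sort(key=lambda x: -x[1])             # S5: rank by score
    return [p for p, _ in scored[:top_k]]        # S6: post-processing omitted
```

With stub stages (e.g., scoring a phrase by its length), the skeleton already produces a ranking, which makes the control flow easy to verify in isolation.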
The method provided by the invention greatly improves key phrase extraction. Comparative experiments show that the proposed global similarity and local similarity both contribute, as does the noise filtering threshold θ proposed for the calculation of the local similarity. In addition, the proposed model makes great progress on long texts, a benefit of fully exploiting the hierarchical structure of the document. Through the diversity operation, the invention gives the candidate phrases higher semantic diversity, making the results more acceptable.
In a more specific embodiment, step S2 includes:
s21, inserting a CLS mark at the starting position of the input document by using a BERT model, and inserting an SEP mark at the ending position;
s22, learning the input document in an embedding mode to obtain a vector of each token:
{H_1, H_2, …, H_n} = BERT({T_1, T_2, …, T_n})
s23, obtaining the vector representation of the candidate key phrase according to the preset extraction rule to obtain the candidate phrase vector set:
H_KP = {H_KP_0, H_KP_1, …, H_KP_m}
S24, sending the title and the ending of the input document into the BERT model to obtain a title vector H_title and an ending vector H_end;
S25, respectively inputting the conclusion part and the abstract part of the input document into the BERT model to carry out embedding operation so as to obtain the vector expression;
S26, representing long-text input documents with the SimCSE model.
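Step S23's treatment of multi-word candidates (e.g., "key phrase extraction") selects the phrase vector by max pooling over the token vectors. A hedged sketch follows; the 4-dimensional lists below are toy stand-ins for BERT hidden states, which a real pipeline would obtain from BERT({T_1, …, T_n}).

```python
# Hedged sketch of the max pooling in step S23: the phrase vector is the
# element-wise maximum over the vectors of the tokens it spans. Token vectors
# here are toy 4-dim lists standing in for BERT hidden states.
def max_pool(token_vectors):
    """Element-wise max over a list of equal-length vectors."""
    return [max(dims) for dims in zip(*token_vectors)]

# "key phrase extraction" spans three tokens:
H_key = [0.1, 0.9, 0.0, 0.2]
H_phrase = [0.4, 0.1, 0.3, 0.0]
H_extraction = [0.2, 0.5, 0.1, 0.7]
v = max_pool([H_key, H_phrase, H_extraction])
# v == [0.4, 0.9, 0.3, 0.7]
```

Max pooling keeps the most strongly activated dimension across the phrase's tokens, so a single salient token can dominate the phrase representation.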
Aiming at the problem that traditional embedding-based methods can only truncate long texts, causing a large amount of semantic loss because the BERT encoding length is limited, the invention proposes that, for long texts, the title and abstract be combined into one group and the conclusion form another, the two groups being sent to the pre-training model for embedding in two passes in accordance with the document's hierarchical structure and human writing habits; this saves time and space while preserving the semantic information of the full text to the greatest extent.
Aiming at the defects of traditional methods on long texts, the invention creatively proposes encoding the abstract and the conclusion of the document in segments with SimCSE, so that the proposed model can fully learn the information of the document and obtain key phrases of higher quality.
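The long-document strategy described above can be sketched as follows: the title and abstract form one segment and the conclusion another, each truncated to the encoder's token limit before being embedded (by SimCSE in the patent). The whitespace tokenization, the section arguments and the 512-token limit are assumptions for illustration only.

```python
# Hedged sketch of the two-group segmentation for long documents: group 1 is
# title + abstract, group 2 is the conclusion, each capped at the encoder's
# token limit. Real tokenization would use the encoder's own tokenizer.
MAX_TOKENS = 512

def segment_long_document(title, abstract, conclusion, limit=MAX_TOKENS):
    """Return the two token groups to be embedded in two passes."""
    group1 = (title + " " + abstract).split()[:limit]
    group2 = conclusion.split()[:limit]
    return group1, group2
```

Each group independently fits within the encoder's window, so no part of the title, abstract or conclusion is silently truncated away as it would be by encoding the whole document at once.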
In a more specific embodiment, in step S3, the title vector H_title and the ending vector H_end are processed with the following logic to obtain the global similarity of each candidate key phrase i, where ‖·‖ denotes the Manhattan distance and the result represents the global similarity of the candidate phrase i to the entire document (the formulas are shown as images in the original).
Aiming at the preference for longer or shorter phrases caused by semantic space misalignment, the method uses the title and the last sentence in place of the traditional full-text vector, in line with human writing habits, thereby solving the problem of long phrases receiving high scores.
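The global-similarity formula itself appears only as an image in the original; the sketch below follows the stated ingredients (Manhattan distance between the candidate phrase vector and the title and ending vectors), while the exact combination, an inverse-distance form here, is an assumption for illustration.

```python
def manhattan(a, b):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def global_similarity(h_phrase, h_title, h_end):
    # Assumption: closer to both the title vector and the ending vector
    # means a higher global similarity.
    d = manhattan(h_phrase, h_title) + manhattan(h_phrase, h_end)
    return 1.0 / (1.0 + d)

h_title = [1.0, 0.0]
h_end = [0.8, 0.2]
# A phrase near the title/ending outscores a distant one:
near = global_similarity([0.9, 0.1], h_title, h_end)
far = global_similarity([0.0, 1.0], h_title, h_end)
```

Because the title and a candidate phrase are comparably short, their vectors live in a better-aligned region of the embedding space than a full-document vector would, which is the motivation the patent gives for this substitution.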
In a more specific embodiment, step S42 includes:
S421, processing the candidate key phrase i with a graph centrality calculation, where e_ij denotes the weight of the edge between candidate phrases i and j (the formulas are shown as images in the original);
S422, setting the adaptive noise filtering threshold θ with the following logic:
θ = min(e_ij) + β × (max(e_ij) − min(e_ij))
the present invention performs a one-step post-processing operation on the candidate phrases. Setting a threshold value, filtering out candidate phrases with high frequency of top 20% in each specific field, avoiding the interference of high-frequency invalid words, and then improving the semantic diversity of the candidate key phrases by deleting subsets.
In a more specific embodiment, step S43 includes:
S431, obtaining the local significance data LS_i, representing the local saliency of the candidate phrase i, from the updated edge weights (the formulas are shown as images in the original);
S432, the updated edges are obtained according to the local saliency data; when an updated edge's weight is smaller than 0, it is set to 0.
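Steps S41 to S43 can be sketched as follows. The pairwise similarities e_ij and the β weight below are illustrative, and since the exact topic-centrality formula is an image in the original, the sketch substitutes the stated logic: compute the adaptive threshold θ per document, subtract it from each edge, clamp negative weights to 0, and (an assumption) sum a node's surviving edges as its local saliency.

```python
def local_saliency(sim, beta=0.5):
    """sim: symmetric matrix of pairwise similarities e_ij between candidates
    (the complete undirected graph of step S41, as an adjacency matrix)."""
    n = len(sim)
    edges = [sim[i][j] for i in range(n) for j in range(i + 1, n)]
    # S42: adaptive noise-filtering threshold, set per input document
    # from the max and min edge weights.
    theta = min(edges) + beta * (max(edges) - min(edges))
    # S43: update edge weights, clamping negatives to zero, and aggregate
    # each node's surviving edges (aggregation by sum is an assumption).
    saliency = [
        sum(max(sim[i][j] - theta, 0.0) for j in range(n) if j != i)
        for i in range(n)
    ]
    return theta, saliency

sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
theta, sal = local_saliency(sim, beta=0.5)
```

In this toy graph, the first two candidates form a tight topic cluster and keep positive saliency, while the outlier's edges all fall below θ and are filtered out as noise.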
In a more specific embodiment, step S44 includes:
S441, calculating the first occurrence position of each candidate key phrase in the input document with the following logic to serve as its position score:

pos_i = 1/p_1

wherein p_1 is the position of the first occurrence of the candidate phrase i;
S442, smoothing the candidate key phrase position scores with a softmax function, so that the position information is obtained with the following logic:

softmax(pos_i) = exp(pos_i) / Σ_j exp(pos_j)
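Steps S441 and S442 can be sketched directly: score each candidate by the reciprocal of its first-occurrence position, then smooth the scores with a softmax. The reciprocal form is inferred from the surrounding definition of p_1, since the original formula is an image.

```python
import math

def position_information(first_positions):
    """first_positions[i]: 1-based position of candidate i's first occurrence.
    Earlier occurrence => higher raw score; softmax smooths and normalizes."""
    raw = [1.0 / p for p in first_positions]          # S441: position score
    exps = [math.exp(r) for r in raw]                 # S442: softmax smoothing
    z = sum(exps)
    return [e / z for e in exps]

pos = position_information([1, 5, 40])
```

The softmax keeps the ordering (a phrase first seen at position 1 still beats one first seen at position 40) while compressing the otherwise very skewed reciprocal scores into a distribution that sums to 1.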
In a more specific technical solution, in step S45, the position information is processed with the following logic to obtain the local similarity of the candidate key phrase i (the formulas are shown as images in the original).
By adopting topic centrality in the modeling of local text information, the invention can identify the topic information across the full text and captures local topic information better than boundary centrality.
In a more specific embodiment, step S5 includes:
S51, multiplying and comprehensively processing the global similarity and the local similarity of each candidate key phrase with the following logic to obtain the candidate key phrase score (the formula is shown as an image in the original);
and S52, processing the candidate key phrases according to the score ordering of the candidate key phrases to obtain the ranking data of the key phrases.
In a more specific technical solution, in step S6, the coarse-grained key phrases are deleted according to the fine-grained key phrases, so as to obtain the semantic diversity key phrases.
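The subset-deletion diversity step described above can be sketched as follows. Here a lower-ranked phrase is dropped when its word set is contained in, or contains, an already-kept phrase; treating word-set containment as the "subset" relation and keeping the higher-ranked member are assumptions, since the patent does not spell out the exact rule.

```python
def diversify(ranked_phrases):
    """Walk the ranked list (best first) and keep a phrase only if it is not
    in a word-set containment relation with an already-kept phrase."""
    kept = []
    for phrase in ranked_phrases:
        words = set(phrase.split())
        if not any(words <= set(k.split()) or words >= set(k.split())
                   for k in kept):
            kept.append(phrase)
    return kept
```

For example, once "key phrase extraction" is kept, the lower-ranked "phrase extraction" is redundant and is removed, leaving room for a phrase from a different topic.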
In a more specific technical solution, a system for extracting key phrases by combining a document hierarchy structure and global and local information includes:
the candidate phrase generation module is used for performing word segmentation and part-of-speech tagging on the input document with the Stanford CoreNLP tool and performing NP (noun phrase) chunking according to preset extraction rules to generate a candidate key phrase set;
the BERT model embedding module is used for judging whether the length of the input document is smaller than or equal to a preset document length threshold; if so, the input document is embedded with the BERT model to obtain its vector representation; if not, the content of the input document within a preset range is acquired and input into the SimCSE model for embedding, to obtain the vector representations of the candidate key phrases, the title vector and the ending vector; the module is connected with the candidate phrase generation module;
a global similarity measurement module, configured to process the title vector and the end vector to perform global similarity measurement on the candidate key phrases, so as to obtain global similarity, where the global similarity measurement module is connected to the BERT model embedding module;
the local similarity evaluation module is used for performing topic division and clustering on the candidate key phrases in the full text of the input document by using topic centrality and preset logic, so as to obtain local similarity according to local similarity evaluation, and is connected with the candidate phrase generation module, wherein the local similarity evaluation module further comprises:
the undirected graph construction module is used for constructing a complete undirected graph by taking the candidate key phrases as nodes and taking the similarity among the nodes as edges;
the noise filtering threshold setting module is used for setting a self-adaptive noise filtering threshold according to the maximum value and the minimum value of each input document;
the noise filtering module is used for updating the weight of the edge according to the self-adaptive noise filtering threshold value to obtain local significance data, and obtaining an updated edge according to the local significance data, and the noise filtering module is connected with the undirected graph constructing module and the noise filtering threshold value setting module;
the position acquisition module is used for acquiring the position information of the candidate key phrases over the full text of the input document according to the updated complete undirected graph;
the local similarity calculation module is used for calculating the local similarity according to the position information, and is connected with the position acquisition module;
a key phrase ranking module, configured to perform comprehensive evaluation and scoring on the candidate key phrases in combination with processing the location information, the global similarity, and the local similarity, to process the candidate key phrases in order to obtain key phrase ranking data, where the key phrase ranking module is connected to the global similarity measurement module and the local similarity evaluation module;
the post-processing module is used for obtaining a candidate key phrase ordering data set according to the key phrase ranking data, performing post-processing operation on the candidate key phrases, deleting a subset of the candidate key phrase set to obtain semantic diversity key phrases, obtaining word frequency data, removing high-frequency general phrases on the candidate key phrase ordering data set to filter out high-frequency invalid phrase interference, and is connected with the key phrase ranking module.
Compared with the prior art, the invention has the following advantages: the method greatly improves key phrase extraction. Comparative experiments show that the proposed global similarity and local similarity both contribute, as does the noise filtering threshold θ proposed for the calculation of the local similarity. In addition, the proposed model makes great progress on long texts, a benefit of fully exploiting the hierarchical structure of the document. Through the diversity operation, the invention gives the candidate phrases higher semantic diversity, making the results more acceptable.
Aiming at the problem that traditional embedding-based methods can only truncate long texts, causing a large amount of semantic loss because the BERT encoding length is limited, the invention proposes that, for long texts, the title and abstract be combined into one group and the conclusion form another, the two groups being sent to the pre-training model for embedding in two passes in accordance with the document's hierarchical structure and human writing habits; this saves time and space while preserving the semantic information of the full text to the greatest extent.
Aiming at the defects of traditional methods on long texts, the invention creatively proposes encoding the beginning and the conclusion of the document in segments with SimCSE, so that the proposed model can fully learn the information of the document and obtain key phrases of higher quality.
Aiming at the preference for longer or shorter phrases caused by semantic space misalignment, the method uses the title and the last sentence in place of the traditional full-text vector, in line with human writing habits, thereby solving the problem of long phrases receiving high scores.
The present invention performs a one-step post-processing operation on the candidate phrases. Setting a threshold value, filtering out candidate phrases with high frequency of top 20% in each specific field, avoiding the interference of high-frequency invalid phrases, and then improving the semantic diversity of the candidate key phrases by deleting subsets.
By adopting topic centrality in the modeling of local text information, the invention can identify the topic information across the full text and captures local topic information better than boundary centrality. The invention solves the technical problems in the prior art of low key phrase extraction accuracy caused by semantic loss, a preference for long phrases, and insufficient topic information mining.
Drawings
FIG. 1 is a schematic diagram of the steps of a method for extracting keywords by combining a document hierarchy structure with global and local information according to embodiment 1 of the present invention;
FIG. 2 is a schematic overall flow chart of a method for extracting keywords by combining global and local information in a document hierarchy structure according to embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of the structural similarity vectors and similarities in embodiment 1 of the present invention;
FIG. 4 is a schematic diagram of the key phrase and full-text vector embedding visualization in embodiment 1 of the present invention;
fig. 5 is a schematic diagram of a test sample in embodiment 3 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 1 and fig. 2, the method for extracting key phrases by combining global and local information in a document hierarchy structure provided by the present invention includes the following steps:
S1, performing word segmentation and part-of-speech tagging on the input document, and generating candidate key phrases according to rules;
In this embodiment, the Stanford CoreNLP tool is used for word segmentation and part-of-speech tagging of the input document, and candidate key phrases are generated according to rules. Specifically, the document D is segmented and part-of-speech tagged with the standard natural language processing tool Stanford CoreNLP; after word segmentation, the document is represented as D = {T_1, T_2, …, T_n}.
In this embodiment, a generic stop word table is first set to filter out words and symbols that carry no meaning: a word's part-of-speech tag is rewritten to 'IN' if the word appears in the stop word list. The rule <NN.*|JJ>*<NN.*> is then applied, where NN denotes a noun and JJ an adjective; this rule extracts phrases of arbitrary length that start with a noun or adjective and end with a noun. Using this rule, a candidate key phrase set KP = {KP_0, KP_1, …, KP_n} is obtained.
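A plausible pure-Python reading of the rule <NN.*|JJ>*<NN.*> over an already-tagged token sequence follows (longest-match chunking, without depending on Stanford CoreNLP); the helper name and the tag sentinel are illustrative assumptions.

```python
import re

def np_chunk(tagged):
    """tagged: list of (word, pos) pairs. Returns maximal phrases matching
    <NN.*|JJ>*<NN.*>: runs of nouns/adjectives that end with a noun."""
    candidates, run = [], []
    for word, pos in tagged + [("", "END")]:      # sentinel flushes last run
        if re.fullmatch(r"NN.*|JJ", pos):
            run.append((word, pos))
        else:
            # Trim trailing adjectives so the chunk ends with a noun.
            while run and not run[-1][1].startswith("NN"):
                run.pop()
            if run:
                candidates.append(" ".join(w for w, _ in run))
            run = []
    return candidates
```

For a tagged sentence like "efficient/JJ keyphrase/NN extraction/NN is/VBZ very/RB hard/JJ", the chunker keeps "efficient keyphrase extraction" and discards the trailing adjective "hard", which cannot end a noun phrase under the rule.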
S2, short documents are sent to BERT in full for embedding; for long documents, the first 512 words and the last 512 words are selected for embedding, and the vector representations of the title, the ending sentence and the candidate key phrases are obtained;
In this embodiment, the length of the input document is judged: if it does not exceed the threshold, a BERT model is used directly for embedding to obtain the vector representations; if it exceeds the threshold, the content within the specified range of the document is input into the SimCSE model for embedding to obtain the vector representations;
In this embodiment, after preprocessing, the document D becomes a token set T = {T_1, T_2, …, T_n} and a candidate key phrase sequence KP = {KP_0, KP_1, …, KP_m}. Unlike prior work that encodes with static vectors, the invention uses BERT for vector embedding: a strong pre-training model that yields dynamic, context-aware vector representations. Vector representations of the candidate key phrases and of the document are thus obtained, where, in view of the hierarchical structure information of the document, the document representation uses the title vector H_title and the last-sentence vector H_end.
In this embodiment, as shown in fig. 3, the parameter symbols are defined in Table 1 (the table is shown as an image in the original).
As shown in fig. 3, the token sequence D = {T_1, T_2, …, T_n} obtained in the first step is input into the BERT model, a CLS mark is inserted at its beginning and an SEP mark at its end position, and embedding learning is performed to obtain a vector for each token, i.e.: {H_1, H_2, …, H_n} = BERT({T_1, T_2, …, T_n}). Vector representations of the candidate key phrases are then obtained according to the extraction rule of the second step; for a candidate key phrase consisting of several words, such as "key phrase extraction", max-pooling over its token vectors is selected to obtain the phrase vector. The set of candidate key phrase vectors is thus H^KP = {H_0^KP, H_1^KP, …, H_m^KP}.
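The max-pooling step described above can be sketched as follows; this is a minimal illustration, and the 4-dimensional token vectors are hypothetical stand-ins for real BERT outputs:

```python
import numpy as np

def phrase_vector(token_vectors, start, end):
    """Vector for a candidate phrase spanning tokens [start, end):
    element-wise max-pooling over its token vectors, as in the text."""
    span = token_vectors[start:end]
    return span.max(axis=0)

# Hypothetical 4-dimensional token vectors for a 5-token document.
H = np.array([
    [0.1, 0.9, 0.0, 0.2],
    [0.4, 0.1, 0.7, 0.3],
    [0.2, 0.5, 0.6, 0.1],
    [0.8, 0.0, 0.2, 0.5],
    [0.3, 0.3, 0.3, 0.3],
])

# A phrase covering tokens 1..3 (say, a three-word candidate phrase).
v = phrase_vector(H, 1, 4)
print(v.tolist())  # [0.8, 0.5, 0.7, 0.5]
```

The same function with start = 0 and end = n would pool over the whole document, which is why the method instead compares phrases against the shorter title and ending vectors later on.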
In this step, the title and the ending sentence are likewise fed into the BERT model to obtain the corresponding vectors H_title and H_end.
Since BERT can only encode 512 tokens, existing methods cannot handle long texts well. For most long documents and news articles, authors tend to write key information at the beginning and the end of the document, so to facilitate practical operation, the conclusion of the article and the abstract of the article are input separately into the model for embedding; in this way more, and more diverse, candidate key phrases are obtained and the document information is fully mined. However, the study of [Lingxiao Wang, 2020] indicates that the vector representations encoded by BERT are anisotropic: their distribution is non-uniform, with low-frequency words distributed sparsely and high-frequency words distributed densely, so the similarity between two sentences cannot be measured reliably. Inspired by the study of [Tianyu Gao, 2021], the invention uses SimCSE instead of BERT for representations of long texts. SimCSE is a model proposed by Danqi Chen's group at Princeton University, designed to address the anisotropy of BERT-encoded vectors.
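The length-based routing described in this step can be sketched as follows; the encoder functions are stubs standing in for the real BERT and SimCSE models (their names and signatures are assumptions for illustration), and only the 512-token limit and the grouping of title+abstract versus conclusion come from the text:

```python
MAX_BERT_TOKENS = 512

def encode_document(tokens, title, abstract, conclusion,
                    bert_encode, simcse_encode):
    """Short documents go to BERT whole; long documents are split by the
    document hierarchy (title + abstract, then conclusion) and each
    segment is sent to SimCSE separately."""
    if len(tokens) <= MAX_BERT_TOKENS:
        return bert_encode(tokens)
    head = simcse_encode(title + abstract)
    tail = simcse_encode(conclusion)
    return head + tail  # concatenate the two segment encodings

# Stub encoders that just record which model was called.
calls = []

def bert(tokens):
    calls.append("bert")
    return [[0.0]] * len(tokens)  # placeholder vectors

def simcse(tokens):
    calls.append("simcse")
    return [[0.0]] * len(tokens)  # placeholder vectors

encode_document(["w"] * 100, ["t"], ["a"], ["c"], bert, simcse)
encode_document(["w"] * 600, ["t"], ["a"], ["c"], bert, simcse)
print(calls)  # ['bert', 'simcse', 'simcse']
```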
S3, calculating the global similarity by using the title, the last sentence and the candidate key phrase vector;
in this embodiment, for the global similarity metric, the invention innovatively uses the document title and ending for the global similarity evaluation of candidate key phrases, mitigating the preference for longer candidate key phrases caused by semantic-space misalignment;
In this embodiment, for each candidate key phrase, its similarity to the title vector H_title and to the last-sentence vector H_end is calculated separately. Through the preceding steps, the vector representation H_i^KP of each candidate key phrase, the title vector H_title and the ending vector H_end have been obtained. The more similar a phrase is to the article, the more likely it is to be a key phrase; but the input sequence lengths of documents and phrases in the BERT model are unequal, so their semantic spaces are difficult to align, and longer phrases gain an advantage over single words. Inspired by the document structure, and considering that people usually put their core viewpoints at the beginning and the end of an article, the invention uses the title and the ending in place of a full-text vector: this reduces the length gap between the compared texts and the candidate key phrases as much as possible, and also removes part of the noise.
The global similarity for each candidate key phrase i is calculated by the following formula:
S_global(i) = 1/|H_i^KP - H_title|_1 + 1/|H_i^KP - H_end|_1

where |·|_1 denotes the Manhattan distance, S_global(i) denotes the global similarity of the candidate key phrase i to the entire document, H_title is the title vector and H_end is the ending vector.
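The global similarity computation can be sketched as follows; this is a minimal sketch assuming the score is the sum of inverse Manhattan distances to the title and ending vectors (the exact combination in the patent's image formula is not recoverable), and the 3-dimensional vectors are hypothetical:

```python
import numpy as np

def global_similarity(h_phrase, h_title, h_end):
    """Global score of a candidate phrase: inverse Manhattan distance to
    the title vector plus inverse Manhattan distance to the ending
    vector. Larger means closer to the document's boundary content."""
    d_title = np.abs(h_phrase - h_title).sum()
    d_end = np.abs(h_phrase - h_end).sum()
    return 1.0 / d_title + 1.0 / d_end

h_title = np.array([1.0, 0.0, 0.0])
h_end = np.array([0.0, 1.0, 0.0])
near_title = np.array([0.9, 0.1, 0.0])  # close to the title vector
far = np.array([0.0, 0.0, 5.0])         # far from both

assert global_similarity(near_title, h_title, h_end) > \
       global_similarity(far, h_title, h_end)
```

Comparing phrases against the short title and ending vectors, rather than a pooled full-document vector, is what keeps the two sides of the Manhattan distance at a comparable scale.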
S4, constructing a complete undirected graph in which the nodes are candidate key phrases and the edges carry the similarity between nodes, then setting an adaptive threshold from the maximum and minimum edge weights of each document, updating each edge weight to (weight - threshold) and setting any new edge weight below 0 directly to 0; topic partitioning is realized in this way, and the local similarity is calculated;
As shown in fig. 4, in the present embodiment, local similarity evaluation uses a new topic centrality to perform topic partitioning and clustering of the candidate key phrases over the full text, so as to fully capture locally salient information. The figure visualizes the embedded key phrases and the full-text vector: the pentagram is the vector embedding of the article, and nodes with the same fill belong to the same topic.
In this embodiment, a complete undirected graph is constructed, where the vertices are candidate key phrases. The initial weight of an edge is the dot product of the two phrase vectors; considering that an article is composed of several small topics, a dynamic threshold method is adopted to filter out the noise that irrelevant topics contribute to a candidate key phrase.
In this embodiment, a complete undirected graph G = (V, E) is first constructed, where the vertex set is V = {H_0^KP, H_1^KP, …, H_m^KP}, i.e. the vertices are the candidate key phrases. The edge set is E = {e_ij}, where e_ij represents the weight between candidate key phrases i and j. The traditional graph degree centrality is computed as:

d_i = Σ_{j≠i} e_ij

where e_ij = H_i^KP · H_j^KP.
As stated above, a document has several local topics and the candidate key phrases form several local topics; how to find these small topics accurately is a difficult problem. As can be seen from fig. 4, each small topic comprises candidate key phrases that cluster together. Moreover, when the topics are not particularly distinct, one candidate key phrase may belong to several topics. For a candidate key phrase, being included in more small topics indicates that it is more important; but phrases in diametrically opposed topics may introduce noise interference to a candidate key phrase. Based on this assumption, the invention designs a threshold θ to filter the noise.
θ = min(e_ij) + β × (max(e_ij) - min(e_ij))
Edge weights e_ij that fall below this threshold θ are set to zero, which filters out the interference of completely unrelated phrases. The traditional degree centrality formula is therefore rewritten as

d_local(i) = Σ_{j≠i} max(e_ij - θ, 0)

where d_local(i) represents the local saliency of the candidate key phrase i.
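The adaptive threshold and the topic-aware centrality can be sketched together as follows; the 2-dimensional phrase vectors and the β value are hypothetical, and the edge weights, threshold and clipping mirror the steps described above:

```python
import numpy as np

def local_saliency(phrase_vectors, beta=0.2):
    """Topic-aware degree centrality: edge weights are dot products,
    an adaptive threshold theta = min + beta*(max - min) is subtracted,
    and negative results are clipped to zero."""
    V = np.asarray(phrase_vectors, dtype=float)
    E = V @ V.T                        # e_ij = H_i . H_j
    np.fill_diagonal(E, 0.0)           # no self-loops
    off = E[~np.eye(len(V), dtype=bool)]
    theta = off.min() + beta * (off.max() - off.min())
    W = np.clip(E - theta, 0.0, None)  # (weight - threshold), floored at 0
    np.fill_diagonal(W, 0.0)
    return W.sum(axis=1)               # local saliency d_local(i)

# Two tight "topics" plus an outlier; hypothetical 2-d phrase vectors.
phrases = [[1.0, 0.0], [0.9, 0.1],    # topic A
           [0.0, 1.0], [0.1, 0.9],    # topic B
           [-1.0, -1.0]]              # unrelated phrase
d = local_saliency(phrases)
assert d[0] > d[4]  # in-topic phrases outrank the unrelated one
```

With β = 0, θ equals the minimum edge weight and nothing is filtered; raising β prunes progressively weaker cross-topic edges.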
For most documents, authors tend to write key information at the beginning of the document. [Florescu, 2017] noted that position-biased weights can greatly improve the performance of key phrase extraction, taking as the weight the sum of the reciprocals of the positions at which a word occurs in the document. For example, if a candidate key phrase occurs at the second, fifth and tenth positions, its position score is 1/2 + 1/5 + 1/10.
To prevent duplicate computation, the invention takes only the first occurrence of a candidate key phrase as its position score, i.e.

p(i) = 1/p_1

where p_1 is the position of the first occurrence of the candidate key phrase i. To prevent the position information from dominating the final score, the softmax function is used to smooth the position scores, so the invention modifies the position information formula to:

p*(i) = exp(1/p_1(i)) / Σ_j exp(1/p_1(j))
Therefore, after comprehensively considering the position information, the local similarity of the candidate key phrase i is rewritten as

S_local(i) = p*(i) × d_local(i)

and the invention finally uses S_local(i) to measure the local similarity of the candidate key phrase i.
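The smoothed position score can be sketched as follows; the first-occurrence positions are hypothetical, and the combination of the position score with the saliency d_local(i) by multiplication is an assumption reconstructed from the surrounding text:

```python
import math

def position_scores(first_positions):
    """Softmax over the reciprocals of each phrase's first-occurrence
    position (1-indexed), as in the smoothed position formula."""
    recip = [1.0 / p for p in first_positions]
    z = sum(math.exp(r) for r in recip)
    return [math.exp(r) / z for r in recip]

# Hypothetical first positions of three candidate phrases.
pos = position_scores([1, 5, 20])
assert abs(sum(pos) - 1.0) < 1e-9   # softmax scores sum to 1
assert pos[0] > pos[1] > pos[2]     # earlier phrases score higher

# Local similarity: position score times topic-aware saliency d_local(i).
local = [p * d for p, d in zip(pos, [2.0, 3.0, 1.0])]
```

The softmax keeps the position term bounded, so a phrase at position 1 cannot dominate the final score on position alone.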
S5, ranking algorithm: scoring and ranking the candidate key phrases by combining the position information, the global similarity and the local similarity;
in the embodiment, the candidate key phrases are comprehensively evaluated and scored according to the position information, the global similarity and the local similarity, and then ranked according to the scores;
In this embodiment, a large body of literature has demonstrated that for most papers and news articles, authors tend to write key information at the beginning and the end of the document. The position information of a candidate key phrase is therefore important, and the more often a phrase appears in an article, the more likely it is a key phrase. Since word frequency information is already used in the local similarity calculation, to prevent duplicate computation the invention records only the first position at which a phrase appears and takes the reciprocal of that position as the position score. The global similarity score, local similarity score and position score of a phrase are combined, and a score list is output. In this embodiment, the global similarity and local similarity of a candidate key phrase are combined by the simplest multiplication, so the final score of a candidate key phrase is:

S(i) = S_global(i) × S_local(i)
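The final scoring and ranking can be sketched as follows; the phrases and score values are hypothetical, and only the multiplicative combination and top-N selection mirror the text:

```python
def rank_phrases(phrases, global_scores, local_scores, top_n=3):
    """Final score = global x local (the simple multiplication above);
    returns the top-N phrases by score."""
    scored = [(g * l, p) for p, g, l
              in zip(phrases, global_scores, local_scores)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [p for _, p in scored[:top_n]]

ranked = rank_phrases(
    ["key phrase extraction", "document", "graph model", "noise"],
    [0.9, 0.4, 0.7, 0.2],   # hypothetical global similarities
    [0.8, 0.9, 0.6, 0.1],   # hypothetical local similarities
    top_n=2)
print(ranked)  # ['key phrase extraction', 'graph model']
```

Multiplication means a phrase must score well on both views; a phrase that is globally close to the title but locally isolated (or vice versa) is pushed down.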
S6, post-processing: for the candidate key phrases produced by the ranking algorithm, diversity is improved by deleting subsets within the candidate key phrases, and high-frequency common words are then deleted, improving the overall user experience;
in this embodiment, a post-processing operation is performed: high-frequency general words on the data set are removed to avoid interference from high-frequency invalid words, and the semantic diversity of the candidate key phrases is then improved by deleting subsets.
In this embodiment, a semantic diversity operation is performed on the key phrases, the high-frequency general phrases are then filtered, and finally the top N are selected as the key phrases.
In this embodiment, considering the diversity of key phrases, more detailed key phrases should replace coarse-grained key phrases within one article, so the invention chooses to use fine-grained key phrases to delete coarse-grained ones. For example, if "government policies", "government" and "policies" all appear in the candidate key phrase list, the two coarser-grained candidates "government" and "policies" are deleted, so that key phrases with more diversity, better matching human expectations, are obtained.
A candidate key phrase can be formally represented as KP_i = {w_1, w_2, …, w_n}; the method deletes any candidate that is a subset of KP_i, in particular the single words w_1, w_2, …, w_n, so that key phrases with greater semantic diversity are obtained.
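The subset-deletion step can be sketched as follows; this is a minimal sketch treating each phrase as its set of words, as in the formal representation above:

```python
def diversify(candidates):
    """Delete any candidate whose word set is a proper subset of another
    candidate's word set (e.g. drop "government" and "policies" when
    "government policies" is present)."""
    word_sets = {c: set(c.split()) for c in candidates}
    keep = []
    for c in candidates:
        if any(word_sets[c] < word_sets[o] for o in candidates if o != c):
            continue  # c is covered by a finer-grained phrase
        keep.append(c)
    return keep

cands = ["government policies", "government", "policies", "net income"]
print(diversify(cands))  # ['government policies', 'net income']
```

The high-frequency-word filter described above would then be applied to the surviving list, using a frequency table for the specific field.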
Example 2
The method for extracting key phrases by combining the document hierarchical structure with global and local information provided by the invention was verified on three public data sets, Inspec, DUC2001 and SemEval2010; the results show that the method and system provided by the invention can effectively extract the key phrases of a document.
The experimental results are as follows:
data set
The invention performed experiments on three common datasets, namely Inspec [Hulth, 2003], DUC2001 [Wan, 2008] and SemEval2010 [Kim, 2010]. The Inspec dataset contains 2000 abstracts from scientific journal papers; the 500 test documents and the reader-labeled version of the key phrases were used as ground truth in our experiments. DUC2001 is a collection of 308 long news articles. SemEval2010 contains full-length ACM papers; in this experiment, 100 test documents were used, together with key phrases annotated jointly by authors and readers.
As shown in the table below, following prior work, three metrics, F1@5, F1@10 and F1@15, were selected to evaluate the accuracy of the proposed method (DHSRank+).
(Results table rendered as an image in the original publication.)
The experimental results show that the proposed method greatly improves key phrase extraction. Comparative experiments show that the proposed global similarity and local similarity both contribute, and the proposed noise-filtering threshold θ is especially important in the calculation of local similarity. In addition, the proposed model makes great progress on long texts by making full use of the document hierarchy: to address the shortcomings of traditional methods on long texts, the beginning and the conclusion of the document are innovatively encoded segment by segment with SimCSE, so the model can fully learn the document's information and obtain higher-quality key phrases. Finally, the diversity operation gives the candidate key phrases greater semantic diversity, making the results more acceptable to users.
Example 3
Sample demonstration:
An example from DUC2001, a data set of news articles, is shown in FIG. 5. In the figure, black bold text marks the gold-standard key phrases, and text with dashed underlines marks the phrases extracted by our model.
We can see that the gold standard corresponds to each topic of the article. Our model extracts many correct phrases identical to the gold-standard key phrases, and also extracts the phrase "net income", which is semantically close to "lower net income" in the gold standard.
It is worth noting that our model focuses on the document boundaries: most of the extracted phrases lie at the beginning and end of the document, which supports the effectiveness of the proposed title-plus-ending global vector. From the figure we can also see that the erroneous phrases are highly correlated with individual small topics of the document, which supports the effectiveness of our topic-aware centrality. This example shows that jointly modeling global and local context improves key phrase extraction, and that our model indeed captures both local and global information.
In conclusion, the method provided by the invention greatly improves key phrase extraction. Comparative experiments show that the proposed global similarity and local similarity both contribute, with the proposed noise-filtering threshold θ playing a key role in the calculation of local similarity. In addition, the proposed model makes great progress on long texts, benefiting from full use of the document hierarchy. The diversity operation gives the candidate key phrases greater semantic diversity, making the results more acceptable to users.
To address the problem that traditional embedding-based methods, limited by BERT's encoding length, can only truncate long texts and thereby lose a large amount of semantics, the invention proposes, following the document hierarchy and human writing habits, to group the title and abstract together and the conclusion as another group when facing long texts, feeding the two groups into the pre-trained model for embedding in two passes; this saves time and space while preserving the semantic information of the full text to the greatest extent.
To address the shortcomings of traditional methods in processing long texts, the invention innovatively encodes the beginning and the conclusion of the document segment by segment with SimCSE, so that the proposed model can fully learn the document's information and obtain higher-quality key phrases.
To address the preference for longer or shorter phrases caused by semantic-space misalignment, the method uses the title and the last sentence of the ending in place of the traditional full-text vector, following human writing habits, thereby solving the problem of long phrases scoring too high.
The invention performs a further post-processing operation on the candidate key phrases: a threshold is set to filter out the top 20% highest-frequency candidate key phrases in each specific field, avoiding interference from high-frequency invalid phrases, and the semantic diversity of the candidate key phrases is then improved by deleting subsets.
For local text information modeling, the invention adopts topic centrality, which can identify topic information over the full text and captures local topic information better than boundary centrality. The invention thereby solves the technical problems in the prior art of low keyword extraction accuracy caused by semantic loss, preference for long phrases and insufficient mining of topic information.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. The method for extracting key phrases by combining the document hierarchical structure with global local information is characterized by comprising the following steps of:
s1, performing word segmentation and part-of-speech tagging on the input document using the StanfordCoreNLP tool, and performing NP chunking according to a preset extraction rule to generate a candidate key phrase set;
s2, judging whether the length of the input document is smaller than or equal to a preset document length threshold value or not, if so, embedding the input document by using a BERT model to obtain vector expression, otherwise, acquiring the specified range content of the input document according to a preset range, and inputting the specified range content into the SimCSE model to embed and acquire the vector expression, the title vector and the ending vector of the candidate key phrase;
s3, processing the title vector and the ending vector to perform global similarity measurement on the candidate key phrases so as to obtain global similarity;
s4, using topic centrality to perform topic partitioning and clustering on the candidate key phrases in the full text of the input document with preset logic, and obtaining local similarity according to local similarity evaluation, wherein the step S4 further includes:
s41, taking the candidate key phrases as nodes and taking the similarity among the nodes as edges to construct a complete undirected graph;
s42, setting a self-adaptive noise filtering threshold value according to the maximum value and the minimum value of each input document;
s43, updating the weight of the edge according to the self-adaptive noise filtering threshold value to obtain local significance data, and obtaining an updated edge according to the local significance data;
s44, acquiring the position information of the candidate key phrases of the full text of the input document;
s45, calculating the local similarity according to the position information;
s5, comprehensively evaluating and scoring the candidate key phrases by combining and processing the position information, the global similarity and the local similarity, and processing the candidate key phrases according to the sequence to obtain key phrase ranking data;
s6, obtaining a candidate key phrase sorting data set according to the key phrase ranking data, carrying out post-processing operation on the candidate key phrases, deleting a subset of the candidate key phrase set to obtain semantic diversity key phrases, obtaining word frequency data, and removing high-frequency general phrases on the candidate key phrase sorting data set to filter out high-frequency invalid phrase interference.
2. The method for extracting key phrases in combination with global local information in document hierarchy according to claim 1, wherein said step S2 includes:
s21, inserting a CLS mark at the starting position of the input document by using a BERT model, and inserting an SEP mark at the ending position;
s22, learning the input document in an embedding mode to obtain a vector of each token:
{H_1, H_2, …, H_n} = BERT({T_1, T_2, …, T_n});
s23, obtaining the vector representation of the candidate key phrase according to the preset extraction rule to obtain the candidate phrase vector set:
H^KP = {H_0^KP, H_1^KP, …, H_m^KP};
s24, sending the title and the end of the input document into the BERT model to obtain a title vector H title And an ending vector H end
S25, respectively inputting the conclusion and the abstract of the input document into the BERT model for embedding operation to obtain the vector expression;
s26 expresses the input document in long text using the SimCSE model.
3. The method for extracting key phrases in combination with global local information in document hierarchy according to claim 1, wherein in said step S3, the title vector H_title and the ending vector H_end are processed with the following logic to obtain the global similarity of each candidate key phrase i:

S_global(i) = 1/|H_i^KP - H_title|_1 + 1/|H_i^KP - H_end|_1

wherein |·|_1 denotes the Manhattan distance and S_global(i) represents the global similarity of the candidate phrase i to the entire document.
4. The method for extracting key phrases in combination with global local information in document hierarchy according to claim 1, wherein said step S42 includes:
s421, processing the candidate key phrase i using the graph degree centrality calculation method with the following logic:

d_i = Σ_{j≠i} e_ij

wherein e_ij = H_i^KP · H_j^KP;
s422, setting the adaptive noise filtering threshold θ using the following logic:

θ = min(e_ij) + β × (max(e_ij) - min(e_ij)).
5. the method for extracting key phrases in combination with global local information in document hierarchy according to claim 1, wherein said step S43 includes:
s431, obtaining the local saliency data using the following logic:

d_local(i) = Σ_{j≠i} max(e_ij - θ, 0)

wherein d_local(i) represents the local saliency of the candidate phrase i;
s432, obtaining the updated edges according to the local saliency data, and setting the weight of an updated edge to 0 when it is less than 0.
6. The method for extracting key phrases in combination with global local information in document hierarchy according to claim 1, wherein said step S44 includes:
s441, calculating the first occurrence position of the candidate key phrase in the input document with the following logic to serve as the candidate key phrase position score:

p(i) = 1/p_1

wherein p_1 is the position of the first occurrence of the candidate phrase i;

s442, smoothing the candidate key phrase position scores using the softmax function, so as to obtain the position information with the following logic:

p*(i) = exp(1/p_1(i)) / Σ_j exp(1/p_1(j));
7. The method according to claim 1, wherein in step S45, the position information is processed with the following logic to obtain the local similarity of the candidate key phrase i:

S_local(i) = p*(i) × d_local(i).
8. The method for extracting key phrases in combination with global local information in document hierarchy according to claim 1, wherein said step S5 includes:
s51, multiplicatively combining the global similarity and the local similarity of the candidate key phrase with the following logic, so as to obtain the candidate key phrase score:

S(i) = S_global(i) × S_local(i);
and S52, processing the candidate key phrases according to the score ordering of the candidate key phrases to obtain the ranking data of the key phrases.
9. The method for extracting key phrases in combination with global local information in a document hierarchy according to claim 1, wherein in step S6, the coarse-grained key phrases are deleted according to the fine-grained key phrases, so as to obtain the key phrases with semantic diversity.
10. The system for extracting key phrases by combining a document hierarchical structure with global and local information is characterized by comprising the following steps:
the candidate phrase generation module is used for performing word segmentation and part-of-speech tagging on the input document using the StanfordCoreNLP tool and performing NP chunking according to preset extraction rules to generate a candidate key phrase set;
the BERT model embedding module is used for judging whether the length of the input document is smaller than or equal to a preset document length threshold value; if so, the input document is embedded by the BERT model to obtain the vector representations; if not, the specified range content of the input document is obtained according to a preset range and input into the SimCSE model for embedding to obtain the vector representations, the title vector and the ending vector of the candidate key phrases; and the BERT model embedding module is connected with the candidate phrase generation module;
a global similarity measurement module, configured to process the title vector and the end vector to perform global similarity measurement on the candidate key phrases, so as to obtain global similarity, where the global similarity measurement module is connected to the BERT model embedding module;
the local similarity evaluation module is used for performing topic division and clustering on the candidate key phrases in the full text of the input document by using topic centrality and preset logic, so as to obtain local similarity according to local similarity evaluation, and is connected with the candidate phrase generation module, wherein the local similarity evaluation module further comprises:
the undirected graph construction module is used for constructing a complete undirected graph by taking the candidate key phrases as nodes and taking the similarity among the nodes as edges;
the noise filtering threshold setting module is used for setting a self-adaptive noise filtering threshold according to the maximum value and the minimum value of each input document;
the noise filtering module is used for updating the weight of the edge according to the self-adaptive noise filtering threshold value to obtain local significance data, and obtaining an updated edge according to the local significance data, and the noise filtering module is connected with the undirected graph constructing module and the noise filtering threshold value setting module;
the position acquisition module is used for acquiring the position information of the candidate key phrases of the full text of the input document according to the updated complete undirected graph;
the local similarity calculation module is used for calculating the local similarity according to the position information, and is connected with the position acquisition module;
a key phrase ranking module, configured to perform comprehensive evaluation and scoring on the candidate key phrases in combination with processing the location information, the global similarity, and the local similarity, to process the candidate key phrases in order to obtain key phrase ranking data, where the key phrase ranking module is connected to the global similarity measurement module and the local similarity evaluation module;
the post-processing module is used for obtaining a candidate key phrase ordering data set according to the key phrase ranking data, performing post-processing operation on the candidate key phrases, deleting a subset of the candidate key phrase set to obtain semantic diversity key phrases, obtaining phrase frequency data, removing high-frequency general phrases on the candidate key phrase ordering data set to filter out high-frequency invalid phrase interference, and is connected with the key phrase ranking module.
CN202210697632.2A 2022-06-20 2022-06-20 Method and system for extracting key phrases by combining document hierarchical structure with global local information Pending CN115017903A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210697632.2A CN115017903A (en) 2022-06-20 2022-06-20 Method and system for extracting key phrases by combining document hierarchical structure with global local information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210697632.2A CN115017903A (en) 2022-06-20 2022-06-20 Method and system for extracting key phrases by combining document hierarchical structure with global local information

Publications (1)

Publication Number Publication Date
CN115017903A true CN115017903A (en) 2022-09-06

Family

ID=83075764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210697632.2A Pending CN115017903A (en) 2022-06-20 2022-06-20 Method and system for extracting key phrases by combining document hierarchical structure with global local information

Country Status (1)

Country Link
CN (1) CN115017903A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115713085A (en) * 2022-10-31 2023-02-24 北京市农林科学院 Document theme content analysis method and device
CN115713085B (en) * 2022-10-31 2023-11-07 北京市农林科学院 Method and device for analyzing literature topic content
CN115687576A (en) * 2022-12-29 2023-02-03 安徽大学 Keyword extraction method and device represented by theme constraint
CN116187307A (en) * 2023-04-27 2023-05-30 吉奥时空信息技术股份有限公司 Method, device and storage device for extracting keywords of titles of government articles

Similar Documents

Publication Publication Date Title
US8484245B2 (en) Large scale unsupervised hierarchical document categorization using ontological guidance
WO2019080863A1 (en) Text sentiment classification method, storage medium and computer
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN111177365A (en) Unsupervised automatic abstract extraction method based on graph model
US20090300046A1 (en) Method and system for document classification based on document structure and written style
CN115017903A (en) Method and system for extracting key phrases by combining document hierarchical structure with global local information
CN107180026B (en) Event phrase learning method and device based on word embedding semantic mapping
CN107590219A (en) Webpage personage subject correlation message extracting method
CN113486667B (en) Medical entity relationship joint extraction method based on entity type information
Kmail et al. An automatic online recruitment system based on exploiting multiple semantic resources and concept-relatedness measures
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN107577671A (en) A kind of key phrases extraction method based on multi-feature fusion
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN108009135A (en) The method and apparatus for generating documentation summary
CN112256939A (en) Text entity relation extraction method for chemical field
JP7281905B2 (en) Document evaluation device, document evaluation method and program
CN108319583A (en) Method and system for extracting knowledge from Chinese language material library
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN114997288A (en) Design resource association method
Jing et al. Context-driven image caption with global semantic relations of the named entities
CN111008530A (en) Complex semantic recognition method based on document word segmentation
Manojkumar et al. An experimental investigation on unsupervised text summarization for customer reviews
CN111274354B (en) Referee document structuring method and referee document structuring device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination