CN114706972A - Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression - Google Patents
- Publication number
- CN114706972A (application CN202210275509.1A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- node
- text
- word
- nodes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/345—Summarisation for human users
- G06F16/35—Clustering; Classification
- G06F18/00—Pattern recognition
- G06F18/2155—Generating training patterns; Bootstrap methods characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
- G06F18/23213—Non-hierarchical clustering techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering
- G06F40/00—Handling natural language data
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/216—Parsing using statistical methods
- G06F40/30—Semantic analysis
Abstract
The invention relates to an unsupervised method for automatically generating scientific and technical information abstracts based on multi-sentence compression, and belongs to the technical field of natural language generation. For multi-document text generation in the field of scientific and technological intelligence, source data is first acquired by a topic crawler based on an LDA topic-similarity thesaurus expansion method. All text paragraphs are then ranked by a text information value evaluation model built on three indicators: the authority, timeliness and content relevance of the text information. The paragraphs with the highest scores are selected as the original text for generating the final scientific and technological intelligence. Finally, the abstract is generated automatically with an unsupervised multi-document summarization method based on spectral clustering and multi-sentence compression. The method effectively addresses two problems: the strict requirements of scientific and technological intelligence generation on data timeliness and authority during data screening, and the inapplicability of traditional neural-network-based multi-document generation methods caused by the lack of labelled data sets in the scientific and technological intelligence field.
Description
Technical Field
The invention relates to an unsupervised scientific and technical information abstract automatic generation method, in particular to an unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression, and belongs to the technical field of natural language generation.
Background
Scientific and technological intelligence work plays a key role in the planning of major national science and technology strategies, the deployment of major science and technology programs, and economic and social development. It contributes to social, economic and scientific progress and is a key component of the development of national science and technology plans and of the economy and society.
In the field of scientific and technological intelligence, collecting, sorting and screening valuable text data manually and writing intelligence reports by hand consume a large amount of labor and time in a big-data environment. The ordered acquisition of information resources and the document-centred processing, sorting and access analysis therefore no longer satisfy current demands on intelligence. Higher requirements are placed on the depth of information analysis, including rapid evaluation and recommendation of data resources, extraction and analysis of knowledge units, multi-dimensional data fusion, fine-grained data analysis, and visual and computerized data presentation and analysis. The aim is to remove the redundant, coarse and false parts of big data and to realize basic automated generation of intelligence abstracts.
However, in the era of information explosion, the sources of scientific and technical intelligence are numerous and disorderly, so quickly and accurately finding the useful information a user needs in a mass of information is a great challenge. To achieve basic automated intelligence generation, the first step is to collect effective information efficiently. In addition, since the timeliness and authority of information are very important in intelligence research, they need careful consideration during literature selection. Moreover, different information sources yield non-uniform information structures, and integrating multiple heterogeneous documents into a final report is also difficult. In summary, the main problems to be solved in automatically generating scientific and technical intelligence abstracts are: comprehensive evaluation and recommendation of heterogeneous texts that integrates factors such as time, and multi-document summarization.
Currently, the topic crawler is one of the better approaches to efficient information collection. Most researchers adopt a crawling strategy that combines link-based and content-based methods, with good results. However, in the field of scientific and technological intelligence, data is usually acquired from authoritative think tanks at home and abroad, whose web pages contain few links, so a content-based crawling method is more suitable in this field. In multi-document summarization research, most recent work first ranks the documents, screens out the N most important ones, and then applies a neural network or a combination of a neural network and a graph model; some authors also fuse pre-trained models such as BERT into the model. These methods work well for supervised multi-document summarization. However, in the field of scientific and technological intelligence, the lack of data sets is a problem that cannot be ignored, making supervised methods practically unusable in this field.
Disclosure of Invention
The invention aims to solve the technical problems of manual collection, screening and report generation in the scientific and technological intelligence field, and creatively provides an automatic scientific and technological intelligence abstract generation method covering data collection, data screening and intelligence generation. The method effectively addresses the strict requirements of scientific and technological intelligence generation on data timeliness and authority during data screening, and the inapplicability of traditional neural-network-based multi-document generation methods caused by the lack of data sets in the scientific and technological intelligence field.
The innovation points of the invention are as follows. For multi-document text generation in the field of scientific and technological intelligence, source data is obtained by a topic crawler based on a topic-similarity thesaurus expansion method using LDA (Latent Dirichlet Allocation, a document topic generation model, also called a three-layer Bayesian probability model, containing the three-layer structure of words, topics and documents). All text paragraphs are then ranked by a text information value evaluation model built on three indicators: the authority, timeliness and content relevance of the text information. The paragraphs with the highest scores are selected as the original text for generating the final scientific and technological intelligence. Finally, the abstract is generated automatically with an unsupervised multi-document summarization method based on spectral clustering and multi-sentence compression.
The invention is realized by the following technical scheme.
An unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression comprises the following steps:
Step 1: A topic crawler based on the LDA topic-similarity thesaurus expansion method is adopted to capture text content and acquire the source data.
Given only the initial keywords, the topic description is insufficient; the corpus is therefore continuously expanded through the crawler's collection of topic-related resources, the model is trained cyclically, and the topic description is continuously refined, expanded and updated, so that the desired content is acquired more comprehensively and accurately.
Step 2: The crawled texts are evaluated and ranked according to the relevance of their content to the keywords and the timeliness and authority of the source texts. The paragraphs whose scores rank in the top 40 are selected as the original text for generating the final scientific and technological intelligence.
Step 3: The result text obtained in step 2 is taken as the input of the model, and the summary result is obtained with an unsupervised multi-document summarization model based on spectral clustering and multi-sentence compression.
Advantageous effects
Compared with the prior art, the method of the invention has the following advantages:
1. The method provides a text information evaluation model for paper and patent texts and a text information evaluation model for think-tank articles, respectively. The models are highly general and applicable to all paper and patent texts and all think-tank articles.
2. The method provides an automatic scientific and technological intelligence abstract generation method spanning data acquisition to text generation. The topic crawler improves the relevance of the data to the topic keywords, reduces redundant data, and optimizes the efficiency of the data acquisition and cleaning stages. In the text generation stage, the combination of spectral clustering and multi-sentence compression improves the effect of unsupervised multi-document summarization.
Drawings
FIG. 1 is an overall flow diagram of the method of the present invention;
FIG. 2 is an architecture diagram of the topic crawler module of method step 1 and embodiment 1 of the present invention;
FIG. 3 is a flow chart of the text information value evaluation of method step 2 and embodiment 1 of the present invention;
FIG. 4 is a flow chart of the multi-document summarization algorithm of method step 3 and embodiment 1 of the present invention;
FIG. 5 is a flow chart of the multi-sentence compression algorithm used in the multi-document summarization of method step 3.4 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
An unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression comprises the following steps:
Step 1: A topic crawler based on the LDA topic-similarity thesaurus expansion method is adopted to capture text content and acquire the source data.
Because only a small number of keywords are given, the content crawled is not entirely consistent with the content actually desired. The topic crawler therefore improves accuracy and expands the crawling range while keeping the crawling as efficient as possible.
Given only the initial keywords, the topic description is insufficient; the corpus is therefore continuously expanded through the crawler's collection of topic-related resources, the model is trained cyclically, and the topic description is continuously refined, expanded and updated, so that the desired content is acquired more comprehensively and accurately.
Specifically, step 1 comprises the following steps:
Step 1.1: Crawl the result web pages corresponding to the given initial keywords, and extract the abstracts of the newly added web pages as the new LDA training corpus.
Step 1.2: Perform word embedding on the training corpus. This may be implemented with the word2vec model.
Step 1.3: Combine the original corpus and obtain a new topic document through LDA training; the new topic document is used to overwrite and update the topic document of the original topic crawler.
Step 2: and evaluating and sequencing the crawled texts according to the relevance of the contents of the crawled texts and the keywords and the timeliness and authority of the source texts.
The value of text information is generally evaluated by analyzing the information in terms of its propagation source, propagation characteristics, content characteristics and the like. The propagation source reflects the characteristics of the publishing subject, including the publishing channel, the authority of the publisher, and so on. The propagation characteristics reflect the form of the propagation process: only after wide, deep and fast propagation can the inherent value of information be fully embodied, and these characteristics generally comprise the number of propagators, the propagation speed, the depth of the propagation chain and the like. In addition, information has an obvious timeliness characteristic, and expired information is often useless.
Therefore, the text information value evaluation model is constructed by extracting three characteristic dimensions of authority, timeliness and content relevance of the text information.
Specifically, step 2 comprises the following steps:
Step 2.1: All texts are segmented by paragraph. Subsequent calculations are performed in units of paragraphs.
Step 2.2: The value of papers, patents and periodical articles is evaluated by the following method:
For texts such as papers, patents and periodical articles, the impact factor of the publication, the total number of articles published and the total download count of the first author, and the download count and citation count of the text are used as the authority indicators; the release time is used as the timeliness indicator; and the similarity between the abstract and the topic lexicon is used as the content relevance indicator. Corresponding parameters are set for each indicator, a text information value evaluation model is constructed, and the value score of the text is calculated comprehensively.
Further, the invention provides the following value score calculation method for texts such as papers, patents and periodical articles:
The first step: compute the authority x₁.
Factors related to the authority x₁ include the authority of the journal that published the text, the authority of the author in the field, and the evaluation of the text by other researchers in the field.
The journal authority x₁₁ is expressed as the ratio of the journal impact factor to the maximum impact factor among all documents, as shown in formula 1:
x₁₁ = F / F_max (1)
where F is the impact factor of the journal that published the text and F_max is the maximum impact factor among all documents.
The author authority x₁₂ of papers and patents is determined by the number of articles the author has published in the field as first author and the total download count of the articles published by the author as first author, as shown in formula 2:
The value x₁₃ of the paper itself is determined by its download count and citation count, as shown in formula 3:
the second step is that: calculating the timeliness x2。
If the attenuation coefficient of the text information value along with time is mu and the time interval between the information acquisition time and the information release time is delta t, the calculation of the time variation of the information value is as shown in formula 4:
x2=e-μΔt (4)
wherein e is a natural constant.
The third step: compute the content relevance x₃.
Specifically, the BM25 algorithm may be used to calculate the relevance of the text content. Each word in the topic lexicon acquired by the topic crawler is regarded as a query word qᵢ. For the abstract a of the text, the relevance of each word qᵢ to a is scored, and the relevance scores of all qᵢ with a are weighted and summed to obtain the relevance score Score(Q, a) of the current text with the topic lexicon, as shown in formula 5:
Score(Q, a) = Σᵢ₌₁ⁿ Wᵢ · R(qᵢ, a) (5)
where Wᵢ denotes the weight of the i-th word qᵢ, calculated with the TF-IDF algorithm; n represents the total number of words in the lexicon; and R(qᵢ, a) represents the relevance of the word qᵢ to a, calculated by formulas 6 and 7:
R(qᵢ, a) = tf(qᵢ, a) · (k + 1) / (tf(qᵢ, a) + K) (6)
K = k · (1 − b + b · Lₐ / L_ave) (7)
where tf(t, a) is the frequency of the word t in a; Lₐ is the length of a and L_ave is the average length of all texts; the variable k is a positive parameter used to standardize the range of word frequencies; b is an adjustable parameter (0 < b < 1) that controls how strongly the document length is used in representing the amount of information; and K is an intermediate result of the calculation.
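The scoring described by formulas 5 to 7 can be sketched as follows, assuming the standard BM25 formulation; the IDF-style weight below is a simple stand-in for the TF-IDF weight Wᵢ, and all names are illustrative:

```python
import math
from collections import Counter

def bm25(query, doc, corpus, k=1.5, b=0.75):
    # query: topic-lexicon words; doc: abstract tokens; corpus: list of token lists
    n_docs = len(corpus)
    l_ave = sum(len(d) for d in corpus) / n_docs   # average document length
    tf = Counter(doc)
    score = 0.0
    for q in query:
        df = sum(1 for d in corpus if q in d)
        w = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # weight W_i (IDF stand-in)
        K = k * (1 - b + b * len(doc) / l_ave)              # formula 7
        score += w * tf[q] * (k + 1) / (tf[q] + K)          # formulas 5-6
    return score
```

Words absent from the abstract contribute nothing, so an unrelated abstract scores zero against the topic lexicon.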
Step 2.3: The value of think-tank articles is evaluated by the following method:
For think-tank article texts, the number of followers and the number of published articles of the article's author are used as the authority indicators, the release time is used as the timeliness indicator, and the similarity between the article abstract and the topic lexicon is used as the content relevance indicator. Corresponding parameters are set for each indicator, and a think-tank article text information value evaluation model is constructed.
The invention provides the following value score calculation method for think-tank article texts:
The first step: compute the authority x₁.
For think-tank articles, data such as download counts and citation counts do not exist, and there is no quantitative measure of a think tank's authority, so the number of followers of the article's author and the number of documents the author has published are used as the authority indicators. Specifically, the calculation is performed with formulas 8 and 9:
The second step: compute the timeliness x₂.
The calculation method is the same as that described in the second step of step 2.2.
The third step: compute the content relevance x₃.
The calculation method is the same as that in the third step of step 2.2.
Step 2.4: Calculate the information value of the text.
The text information value is defined as a linear combination of the authority characteristic, the timeliness characteristic and the content relevance characteristic. Considering the multiplier effect of timeliness, the information value is measured as:
X = [δ₁(α₁x₁₁ + α₂x₁₂ + α₃x₁₃) + δ₂(βx₃)] · x₂ (10)
where X represents the value of the text information, and α₁, α₂, α₃, δ₁, δ₂ and β represent the influence factors of the different characteristics on the text value, whose values are selected according to actual needs. In the present invention, α₁ = α₂ = 0.3, α₃ = 0.4 and δ₁ = δ₂ = 0.5 may be taken.
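The linear value model of formula 10 combined with the timeliness decay of formula 4 can be sketched as follows; μ and β are illustrative values not fixed by the description, while the α and δ defaults follow the values suggested above:

```python
import math

def text_value(x11, x12, x13, x3, delta_t, mu=0.1,
               alphas=(0.3, 0.3, 0.4), deltas=(0.5, 0.5), beta=1.0):
    # formula 4: timeliness decays exponentially with the age of the text
    x2 = math.exp(-mu * delta_t)
    # formula 10: weighted authority and relevance, scaled by timeliness
    authority = alphas[0] * x11 + alphas[1] * x12 + alphas[2] * x13
    return (deltas[0] * authority + deltas[1] * beta * x3) * x2
```

A recently published paragraph with the same authority and relevance scores therefore always outranks an older one, reflecting the multiplier role of x₂.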
Step 2.5: The paragraphs are ranked by their text information value scores, and the top 40 paragraphs of the ranking result are selected as the text data for subsequent multi-document summarization.
Step 3: A summary result is obtained with an unsupervised multi-document summarization model based on spectral clustering and multi-sentence compression.
Because no labelled data set of paper, patent or think-tank article texts exists in the multi-document summarization field, the invention provides an unsupervised machine learning method based on spectral clustering and multi-sentence compression for summary generation. The original documents are converted into a sentence graph that takes both linguistic and deep representations into account, spectral clustering is then applied to obtain several sentence clusters, and finally each cluster is compressed to generate the final summary.
Specifically, step 3 includes the steps of:
step 3.1: the text data is processed.
For the paragraph set P = {p₁, p₂, …, pₙ} related to one topic, finally obtained in step 2, the goal is to generate a summary S that covers the important information in the original documents and contains no redundant information.
Sentences are taken as the minimal processing unit of the text, and since the last step requires sentence compression, all stop words are retained. Specifically, a sentence list is generated (this can be done by calling the NLP module of spaCy, a leading industrial-strength library for NLP tasks) and used as the input for the sentence graph constructed next.
Step 3.2: Build a structured sentence graph, in which nodes correspond to the sentences generated in step 3.1 and edges are drawn according to the lexical relations and deep semantic relations between sentences.
The goal of this step is to identify pairwise sentence connections that represent the discourse structure of the paragraph set P; the sentence graph is constructed with a technique based on the Approximate Discourse Graph (ADG) combined with deep embedding.
Specifically, a graph G = (V, E) is constructed, where a node vᵢ ∈ V represents a sentence, V is the set of nodes, eᵢ,ⱼ ∈ E represents the edge between node vᵢ and node vⱼ, and E is the set of edges. For any two distinct nodes vᵢ and vⱼ, if the sentences they represent have any of the following relationships, the nodes are connected by an edge of value 1, i.e. eᵢ,ⱼ = 1.
The construction rules for graph G are as follows:
Deverbal noun association: according to English grammar, when an event or entity is mentioned in a verb phrase, it is usually referred to in following sentences by the noun form of that verb. The noun form of the verb phrase is found through WordNet (an English dictionary based on cognitive linguistics that arranges words alphabetically and forms a "network of words" according to their meanings). If the noun form of a verb phrase in one sentence appears in a later sentence, the nodes represented by the two sentences are connected.
Entity continuation: this rule takes wording relevance into account. If sentence vᵢ and sentence vⱼ contain the same entity class (e.g., organization, person name, product), the two nodes are connected.
Discourse markers: if a semantic relationship between adjacent sentences is signalled by connective words such as however, meanwhile, furthermore, etc., the nodes represented by the two sentences are connected.
Sentence similarity: a sentence is represented by the average of all its word vectors, and the similarity score of two sentences is the cosine similarity of their sentence vectors. If the similarity score reaches a set threshold, the two nodes are judged to be connected.
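A minimal sketch of the sentence-graph construction: only the sentence-similarity rule is computed directly, while links found by the other three rules (deverbal nouns, shared entities, discourse markers) are assumed to be precomputed and passed in as index pairs. All names are illustrative:

```python
import math

def build_sentence_graph(sentence_vectors, threshold=0.5, extra_links=()):
    # sentence_vectors: one averaged word-vector per sentence
    # extra_links: (i, j) pairs found by the other connection rules
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    n = len(sentence_vectors)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cos(sentence_vectors[i], sentence_vectors[j]) >= threshold:
                adj[i][j] = adj[j][i] = 1   # e_ij = 1
    for i, j in extra_links:
        adj[i][j] = adj[j][i] = 1
    return adj
```

The resulting symmetric adjacency matrix is exactly the input the spectral clustering of step 3.3 requires.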
Step 3.3: Cluster using the graph to obtain the partitions within the graph.
Currently, most graph clustering methods identify node groups in a graph according to the edges connecting the nodes. The invention adopts the spectral clustering method, which comprises the following steps:
The first step: obtain the Laplacian matrix of the sentence graph constructed above (the degree matrix of the graph minus its adjacency matrix);
The second step: compute the m eigenvectors of the matrix with the smallest eigenvalues, which define the feature vector of each sentence;
The third step: divide the sentences into m categories by k-means clustering.
In this way, m sentence categories representing different pieces of key information are obtained. Next, multi-sentence compression is applied to the sentence set of each of the m categories to obtain m summary sentences; the compression process is described in step 3.4.
Step 3.4: Generate a summary from the extracted subgraphs.
Multi-Sentence Compression (MSC) generates one summary sentence from each cluster, where each cluster contains a set of semantically related sentences. The classic approach is to construct a word graph and select the sentence formed by the shortest path as the summary.
The invention provides a new implementation method that extends this classic approach, specifically as follows:
The first step: construct a word graph.
For a set of sentences S = {s₁, s₂, …, sₙ}, a node is first created for each word appearing in a sentence. Since lexical ambiguity is widespread in natural language, each node uses the bigram (token, tag) as its identifier. Whenever a repeated word is encountered, the word graph is adjusted according to the following rules:
For a word that is not a stop word, is not punctuation, and has no candidate node (no (token, tag) node in the current word graph corresponds to the word), a new node is created directly.
For a word that is not a stop word, is not punctuation, and has exactly one candidate node, the word is mapped directly to that candidate node.
For a word that is not a stop word, is not punctuation, and has multiple candidate nodes: the word is mapped to the node with the closest context, while keeping the word graph acyclic, that is, two identical words of the same sentence cannot be mapped to one node. If no node satisfies this condition, a new node is created.
For stop words and punctuation, if a node with the same context exists, the word is mapped to that node; otherwise, a new node is created.
For the weight of an edge between nodes, the co-occurrence probability between the nodes is considered: the larger the co-occurrence probability of two nodes, the smaller the edge weight. When there is an edge between two nodes, a multi-hop connection between them strengthens the edge weight, and this strengthening effect weakens as the path length grows. This is expressed by formula 11:
w(eᵢ,ⱼ) = (freq(i) + freq(j)) / Σ_{s∈S} diff(s,i,j)⁻¹ (11)
where w(eᵢ,ⱼ) represents the weight of the edge between node i and node j; freq(i) and freq(j) represent the number of words mapped to node i and node j, respectively; and diff(s,i,j) is the distance between the offset positions, in sentence s, of the words mapped to node i and node j.
the second step: a recall phase. Finding the shortest paths of F in the word graph, wherein each sentence formed by the paths is a candidate answer.
The essence of this step is to solve the limited F shortest path problem. In the invention, the Yen's algorithm is adopted to solve the problem. The algorithm is divided into two parts, the 1 st shortest path P (1) is calculated, and then other F-1 shortest paths are calculated in sequence on the basis. When calculating P (i +1), all nodes except the end node on P (i) are regarded as deviated nodes, the shortest path from each deviated node to the end node is calculated, and then the shortest path is spliced with the previous path from the start node to the deviated nodes on P (i) to form a candidate path, and then the shortest deviated path is calculated. The top 100 ranked path is selected as the candidate sentence path.
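A minimal sketch of Yen's algorithm on a weighted word graph represented as a dict of dicts; Dijkstra supplies the initial shortest path and the spur paths. This is an illustration of the algorithm described above, not the patent's code:

```python
import heapq

def dijkstra(graph, src, dst, banned_edges=frozenset(), banned_nodes=frozenset()):
    # graph: {u: {v: weight}}; returns (cost, path) or None
    dist, prev, pq = {src: 0.0}, {}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return d, path[::-1]
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, {}).items():
            if v in banned_nodes or (u, v) in banned_edges:
                continue
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    return None

def yen_k_shortest(graph, src, dst, k_paths):
    first = dijkstra(graph, src, dst)
    if first is None:
        return []
    accepted, candidates = [first], []
    for _ in range(1, k_paths):
        _, prev_path = accepted[-1]
        for i in range(len(prev_path) - 1):     # each node but the end is a deviation node
            root = prev_path[:i + 1]
            banned_edges = {(p[i], p[i + 1]) for _, p in accepted
                            if len(p) > i + 1 and p[:i + 1] == root}
            banned_nodes = frozenset(root[:-1])
            spur = dijkstra(graph, root[-1], dst, banned_edges, banned_nodes)
            if spur is not None:
                root_cost = sum(graph[root[j]][root[j + 1]] for j in range(i))
                cand = (root_cost + spur[0], root + spur[1][1:])
                if cand not in candidates and cand not in accepted:
                    heapq.heappush(candidates, cand)
        if not candidates:
            break
        accepted.append(heapq.heappop(candidates))
    return accepted
```

Running it with k_paths = 100 on the word graph yields the candidate sentence paths used in the reordering step.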
The third step: reorder the candidate answers and select the top-ranked candidate as the final answer.
Specifically, key phrases are extracted with TextRank, and a new score is designed for reordering. First, each node updates its score using formula 12 until convergence:
S(nᵢ) = (1 − d) + d · Σ_{nⱼ ∈ adj(nᵢ)} [ w(eⱼ,ᵢ) / Σ_{nₖ ∈ adj(nⱼ)} w(eⱼ,ₖ) ] · S(nⱼ) (12)
where S(nᵢ) represents the score of node nᵢ in the word graph; d is the damping coefficient, whose value may be taken as 0.85; adj(nᵢ) denotes the nodes adjacent to node nᵢ; and w(eⱼ,ᵢ) represents the weight of the edge between node nⱼ and node nᵢ.
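The iterative update can be sketched as follows, assuming the standard weighted TextRank update over a symmetric word graph; `textrank_scores` is an illustrative name:

```python
def textrank_scores(w, d=0.85, iters=100):
    # w: symmetric dict-of-dicts of edge weights w(e_ij)
    nodes = list(w)
    out = {j: sum(w[j].values()) for j in nodes}   # normalizer per neighbour
    scores = {n: 1.0 for n in nodes}
    for _ in range(iters):
        scores = {
            i: (1 - d) + d * sum(w[j][i] / out[j] * scores[j]
                                 for j in nodes if i in w[j])
            for i in nodes
        }
    return scores
```

With a damping factor of 0.85 the iteration contracts quickly, so a fixed iteration count is a common substitute for an explicit convergence test.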
Then, key phrases r are obtained by combining the keywords, and the score score(r) is:
score(r) = Σ_{w ∈ r} TextRank(w) / length(r) (13)
where TextRank(w) represents the score of word node w calculated by the TextRank algorithm. The denominator is the weighted length length(r) of the key phrase r, and this normalization of the scores favours longer phrases.
Finally, the paths are reordered by multiplying the weighted length of the total path in the candidate sentences obtained in the second step by the sum of the scores of the key phrases contained in the total path. From the key phrase scores, a final score is calculated for each sentence:
where length (c) represents the weighted length of sentence c, and path (c) represents the complete path of sentence c.
The sentence with the minimum score is selected as the generated summary; finally, the summaries generated for the m categories are concatenated to obtain the final complete summary.
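The re-ranking idea above can be sketched as follows; since the patent's scoring equations are not reproduced in this text, the combination used here (weighted path length normalized by summed key-phrase scores, minimum selected) is an assumed form for illustration only:

```python
def final_score(weighted_length, keyphrase_scores):
    """Assumed re-ranking score: weighted path length divided by the summed
    scores of the key phrases the path contains, so that short paths rich in
    key phrases score lowest (and the minimum-score path is kept)."""
    total = sum(keyphrase_scores)
    return weighted_length / total if total > 0 else float("inf")

# Toy candidates: (weighted length, key-phrase scores) -- values invented.
candidates = {
    "path_a": final_score(6.0, [0.9, 0.8]),   # short, key-phrase rich
    "path_b": final_score(6.0, [0.2]),        # same length, weak key phrases
    "path_c": final_score(12.0, [0.9, 0.8]),  # key-phrase rich but long
}
best = min(candidates, key=candidates.get)
```

Under this assumed scoring, the short key-phrase-rich path wins; a path containing no key phrases is effectively disqualified.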
Examples
This example describes a specific embodiment of the method of the present invention.
The implementation is illustrated by the overall flow chart of Fig. 1. The invention provides a complete pipeline for scientific and technical intelligence summary generation, covering text data acquisition, data processing and summary text generation. In a specific implementation, the topic crawler module runs first and acquires the data needed for analysis according to a keyword library provided by the user; the text information value evaluation module then analyzes and ranks the acquired data; finally, the ranking result is fed into the summary generation module as model input to obtain the final result.
First, according to the keywords provided by the user, the topic crawler module is applied to Google Scholar, DARPA, IARPA and the RAND think tank to acquire data. FIG. 2 shows the data-acquisition flow of the unsupervised scientific and technical intelligence generation method based on sentence graph clustering. Following step 1 of the invention, a certain number of web pages are crawled from the given initial keywords; the abstracts of the newly added web pages are taken as the new LDA training corpus; word2vec is then used to embed the words of the training corpus; and finally, combined with the original corpus, LDA training yields a new topic document that overwrites and updates the topic document of the original topic crawler.
After the required text data are acquired, the value of each text is evaluated from its attribute data, as shown in Fig. 3. All texts are segmented by paragraph; the authority, timeliness and content relevance of each text are computed from data such as its journal, authors and download count; and the value of the text information is then computed by combining the three. Finally, the texts are ranked by information value, and the top 40 are selected as the text data for subsequent multi-document summarization.
Finally comes the text generation stage, shown in Fig. 4. The text data are processed first: the paragraph data obtained in the previous step are segmented into sentences, and the NLP module of the SpaCy library is called to generate a sentence list. An undirected sentence graph is then constructed according to the rules described in step 3.2, the graph is clustered with the spectral clustering method of step 3.3 to produce m classes, and a summary is generated for each of the m classes by multi-sentence compression. The multi-sentence compression flow is shown in Fig. 5: a word graph is constructed according to the rules given in the first step of step 3.4, Yen's algorithm is used to find the 100 shortest paths in the graph, and the paths are then re-ranked. Re-ranking proceeds as follows: key phrases are extracted with the TextRank algorithm, sentence scores are recomputed from the key phrases, and the scores of the 100 paths are sorted; the sentence formed by connecting the words on the minimum-score path is the summary for that class. Finally, the summaries generated for the m classes are concatenated to produce the final summary.
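As a minimal sketch of the sentence-splitting step, a blank spaCy pipeline with the rule-based sentencizer can stand in for the SpaCy NLP call mentioned above (no trained model download needed; the example paragraph is invented):

```python
import spacy

# Lightweight stand-in for the SpaCy call in the pipeline: a blank English
# pipeline with the rule-based sentencizer (no trained model required).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

paragraph = ("Multi-document summarization is hard. "
             "Sentence compression helps. Stop words are kept for later steps.")
doc = nlp(paragraph)
sentences = [sent.text.strip() for sent in doc.sents]
```

The resulting sentence list is what the sentence-graph construction of step 3.2 consumes; note that stop words are deliberately retained for the later compression step.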
While the foregoing describes the preferred embodiment of the present invention, the invention is not limited to the embodiment and drawings disclosed herein. All equivalents and modifications that do not depart from the spirit of the invention are intended to fall within its scope.
Claims (6)
1. An unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression is characterized by comprising the following steps:
step 1: adopting a topic crawler mode based on an LDA topic similarity word bank expansion method to capture text content and acquire source data;
step 2: evaluating and ranking the crawled texts according to the relevance of the content to the keywords and the timeliness and authority of the source text; building a text information value evaluation model by extracting three feature dimensions of the text information: authority, timeliness and content relevance;
the method comprises the following steps:
step 2.1: segmenting all texts according to paragraphs; in subsequent calculations, the calculation is performed in paragraph units;
step 2.2: evaluating the value of papers, patents and periodical texts;
for texts such as papers, patents and periodicals, the first author's impact factor, total number of published texts and total download count, together with the text's own download count and citation count, are used as authority indexes; the release time is used as the timeliness index; the similarity between the abstract and the topic lexicon is used as the content-relevance index; a corresponding parameter is set for each index, a text information value evaluation model is constructed, and the value score of the text is computed comprehensively;
step 2.3: evaluating the value of think-tank articles; for think-tank article texts, the author's follower count and number of published articles are used as authority indexes, the publication time is used as the timeliness index, and the similarity between the article abstract and the topic lexicon is used as the content-relevance index; a corresponding parameter is set for each index, and a think-tank article text information value evaluation model is constructed;
step 2.4: calculating the information value of the text;
defining the text information value as a linear combination of the authority, timeliness and content-relevance features; meanwhile, considering the multiplier effect of timeliness, the measured information value is:
X = [δ1(α1x11 + α2x12 + α3x13) + δ2(βx3)]·x2 (10)
wherein X represents the value of the text information, and α1, α2, α3, β, δ1 and δ2 represent the influence factors of the different features on the text value, whose values are chosen according to actual requirements;
step 2.5: ranking the paragraphs by text information value score, and selecting the top 40 paragraphs of the ranking result as the text data for subsequent multi-document summarization;
step 3: taking the result text obtained in step 2 as the input of the model, and applying an unsupervised multi-document summarization model based on spectral clustering and multi-sentence compression to obtain the summary result;
first, the original documents are converted into a sentence graph that considers both linguistic and deep representations; spectral clustering is then applied to obtain several sentence clusters; finally, each cluster is compressed to generate the final summary.
2. The method for automatically generating unsupervised scientific and technical intelligence abstract based on multi-sentence compression as claimed in claim 1, wherein the step 1 comprises the following steps:
step 1.1: crawling corresponding result webpages according to given initial keywords, and extracting abstracts of the newly added webpages to serve as LDA new training corpora;
step 1.2: performing word embedding on the training corpus;
step 1.3: and combining the original corpus, and obtaining a new theme document through LDA training, wherein the new theme document is used for covering and updating the theme document of the original theme crawler.
3. The method for automatically generating unsupervised scientific and technical intelligence abstract based on multi-sentence compression as claimed in claim 1, wherein in the step 2, the method for calculating the value score of texts such as papers, patents and periodicals comprises the following steps:
the first step: computing the authority x1; the factors related to authority include the authority of the journal that published the text, the authority of the author in the field, and the evaluation of the text by other researchers in the field;
the journal authority x11 is expressed as the ratio of the journal's impact factor to the maximum impact factor among all the literature, as shown in Equation 1:
the author authority x12 of papers and patents is determined by the number of articles the author has published in the field as first author and the total download count of the articles published by the author as first author, as shown in Equation 2:
the value x13 of a paper itself is determined by its download count and citation count, as shown in Equation 3:
the second step: computing the timeliness x2; if the decay coefficient of the text information value over time is μ, and the interval between the information acquisition time and the information release time is Δt, the information value as a function of time is computed as shown in Equation 4:
x2 = e^(-μΔt) (4)
wherein e is a natural constant;
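Equation 4 is simple enough to exercise directly; in this sketch the decay coefficient μ = 0.1 and the time intervals are assumed values for illustration:

```python
import math

def timeliness(mu, delta_t):
    """Equation 4: the information value decays exponentially with the
    interval delta_t between acquisition time and release time."""
    return math.exp(-mu * delta_t)

# Illustrative values: an assumed decay coefficient of 0.1 per month.
fresh = timeliness(0.1, 0)       # brand-new text keeps full value
year_old = timeliness(0.1, 12)   # a year-old text is discounted
```

Because the decay is exponential, the timeliness factor x2 never reaches zero but can discount stale texts arbitrarily strongly, which suits its role as a multiplier in Equation 10.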
the third step: computing the content relevance x3; each word in the topic lexicon acquired by the topic crawler is regarded as qi; for the abstract a of a text, the relevance score of each word qi to a is computed, and the relevance scores of the qi to a are weighted and summed to obtain the relevance score Score(Q, a) of the current text to the topic lexicon, as shown in Equation 5:
wherein Wi denotes the weight of the i-th word qi, computed with the TF-IDF algorithm; N represents the total number of words in the lexicon; R(qi, a) represents the relevance of the word qi to a, computed with Equations 6 and 7:
wherein tfta is the frequency of the word t in a; La is the length of a, and Lave is the average length of all texts; the variable k is a positive parameter used to standardize the range of the word frequency in an article; b is an adjustable parameter, 0 < b < 1, which controls how strongly the document length is used to scale the amount of information; K is an intermediate result of the calculation;
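Equations 5-7 are not reproduced in this text; the variables named (word frequency tf, document length La, average length Lave, parameters k and b, intermediate K) match the standard BM25 term-scoring form, which the following sketch assumes; all numeric values are invented:

```python
def bm25_term_score(tf, doc_len, avg_len, k=1.2, b=0.75):
    """Assumed BM25-style relevance R(qi, a) of one topic word to an
    abstract; K is the length-normalization intermediate named in the text."""
    K = k * (1 - b + b * doc_len / avg_len)
    return tf * (k + 1) / (tf + K)

def relevance(word_weights, term_freqs, doc_len, avg_len):
    """Equation 5 as described: weighted sum of per-word relevance scores,
    with TF-IDF weights Wi supplied by the caller."""
    return sum(w * bm25_term_score(tf, doc_len, avg_len)
               for w, tf in zip(word_weights, term_freqs))

# Invented example: two topic words, one appearing three times, one absent.
score = relevance([0.6, 0.4], [3, 0], doc_len=100, avg_len=120)
```

An absent topic word contributes nothing, and shorter-than-average abstracts are slightly boosted by the length normalization, as the adjustable parameter b intends.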
the value score calculation method for think-tank article texts is as follows:
the first step: computing the authority x1; for a think-tank article, the author's follower count and number of published articles are used as the authority measures, computed with Equations 8 and 9:
the second step: computing the timeliness x2; the calculation is the same as the second step of the value score calculation for papers, patents and periodicals;
the third step: computing the content relevance x3; the calculation is the same as the third step of the value score calculation for papers, patents and periodicals.
4. The method for automatically generating unsupervised scientific and technical intelligence abstract based on multisentence compression as claimed in claim 1, wherein the step 3 comprises the following steps:
step 3.1: processing the text data;
for the paragraph set P = {p1, p2, …, pn} related to one topic finally obtained in step 2, the final aim is to generate a summary S that contains the important information of the original documents without redundant information; a sentence is taken as the minimal processing unit of the text, and since the last step requires sentence compression, all stop words are retained; the specific method is to generate a sentence list, which serves as the input for the subsequent sentence graph construction;
step 3.2: building a structured sentence graph, wherein nodes correspond to the sentences generated in step 3.1, and edges are drawn according to the lexical and deep semantic relations between sentences so as to identify pairwise sentence connections that represent the discourse structure of the paragraph set P; the sentence graph is constructed with an approximate-discourse-graph technique combined with deep embeddings;
step 3.3: and (3) applying graph clustering to obtain intra-graph partitions, wherein the specific steps are as follows:
the first step is as follows: acquiring a Laplace matrix of the constructed sentence graph;
the second step: computing the first m eigenvectors of the matrix, which define the feature vector of each sentence;
the third step: dividing the sentences into m categories by a k-means clustering mode;
obtaining m sentence categories which represent different key information, and then respectively carrying out multi-sentence compression operation on the sentence sets of the m categories to obtain m abstracts;
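The three clustering steps above can be sketched with NumPy on a toy sentence graph; the adjacency values are illustrative, and the k-means call of the third step is replaced here by thresholding the Fiedler vector, which suffices for this small example:

```python
import numpy as np

# Toy adjacency matrix of a sentence graph: two tight sentence groups
# joined by one weak edge (values are illustrative, not real edge rules).
A = np.array([
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
])

# First step: unnormalized graph Laplacian L = D - A.
L = np.diag(A.sum(axis=1)) - A

# Second step: the m eigenvectors with the smallest eigenvalues give each
# sentence an m-dimensional feature vector.
m = 2
eigvals, eigvecs = np.linalg.eigh(L)   # eigh returns ascending eigenvalues
embedding = eigvecs[:, :m]

# Third step: cluster the embeddings (k-means in the method); for this tiny
# graph, thresholding the Fiedler vector already separates the two groups.
labels = (embedding[:, 1] > 0).astype(int)
```

In the full method, the label assignment would come from a k-means run over the m-dimensional embeddings rather than a sign test, but the spectral embedding is the same.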
step 3.4: and generating a summary from the extracted subgraph.
5. The method as claimed in claim 4, wherein in step 3.2 a graph G(V, E) is constructed, wherein a node vi ∈ V represents a sentence, V represents the set of nodes, ei,j ∈ E represents the edge between node vi and node vj, and E represents the set of edges; for any two distinct nodes vi and vj, if the sentences they represent have any of the following relations, they are connected to each other by an edge of value 1, i.e. ei,j = 1;
Wherein, the graph G construction rule comprises the following steps:
deverbal noun reference: according to English grammar, when an event or entity is mentioned in a verb phrase, it is usually expressed as a deverbal noun or noun phrase in a following sentence; the noun form of the verb phrase is found through WordNet; if the noun form of a verb phrase of one sentence appears in a sentence after it, the nodes represented by the two sentences are connected to each other;
entity continuation: this rule considers the relevance of words; if sentence vi and sentence vj contain the same entity class, the two nodes are connected to each other;
discourse markers: if adjacent sentences have a semantic relation, the nodes represented by the two sentences are connected to each other;
sentence similarity: the similarity score of two sentences is computed by averaging all word vectors of each sentence as its representation and taking the cosine similarity of the two sentence vectors; if the similarity score reaches a set threshold, the two nodes are judged to be connected to each other.
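The sentence-similarity rule can be sketched as follows; the 3-dimensional word vectors and the threshold value 0.8 are assumptions for illustration (the claim leaves the threshold open, and real vectors would come from a trained embedding model):

```python
import numpy as np

def sentence_vector(word_vectors):
    """Average of a sentence's word vectors, used as the sentence vector."""
    return np.mean(word_vectors, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d word embeddings (invented). Sentences a and b share a topic
# direction in embedding space; sentence c points elsewhere.
sent_a = sentence_vector([[1.0, 0.0, 0.2], [0.9, 0.1, 0.0]])
sent_b = sentence_vector([[1.0, 0.1, 0.1], [0.8, 0.0, 0.3]])
sent_c = sentence_vector([[0.0, 1.0, 0.0], [0.1, 0.9, 0.1]])

THRESHOLD = 0.8  # assumed value; the claim leaves the threshold open
edge_ab = cosine(sent_a, sent_b) >= THRESHOLD
edge_ac = cosine(sent_a, sent_c) >= THRESHOLD
```

Only the pair above the threshold would receive an edge in the sentence graph; the other rules (deverbal nouns, entities, discourse markers) can add edges this similarity test misses.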
6. The method for automatically generating an unsupervised scientific and technical intelligence abstract based on multisentence compression as claimed in claim 4, wherein the step 3.4 of generating the abstract comprises the following steps:
the first step is as follows: constructing a word graph;
for a sentence set S = {s1, s2, …, sn}, each word appearing in the sentences is mapped to a node; since word ambiguity is widespread in natural language, each node uses a bigram (token, tag) as its identifier, and whenever a repeated word is encountered, the word graph is adjusted according to the following rules:
for a word that is not a stop word, not punctuation, and has no candidate node, a new node is created directly;
for a word that is not a stop word, not punctuation, and has exactly one candidate node, the word is mapped directly to that candidate node;
for a word that is not a stop word, not punctuation, and has multiple candidate nodes, the word is mapped to the candidate node closest in context while keeping the word graph loop-free, i.e. two identical words of the same sentence cannot be mapped to the same node; if no node satisfies the condition, a new node is created;
for stop words and punctuation, if a node with the same context exists, the word is mapped to that node; otherwise a new node is created;
regarding the weight of an edge between nodes, the co-occurrence probability between the nodes is considered: the larger the co-occurrence probability of two nodes, the smaller the edge weight; when there is an edge between two nodes, a multi-hop connection between them strengthens the edge weight, and the strengthening effect weakens as the path length grows, as expressed by Equation 11:
wherein w(ei,j) represents the weight of the edge between node i and node j; freq(i) and freq(j) represent the number of words mapped to node i and node j, respectively; diff(s, i, j) refers to the distance between the offset positions, in sentence s, of the word mapped to node i and the word mapped to node j;
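Equation 11 itself is not reproduced in this text; the behaviour described (stronger, closer co-occurrence yields smaller edge weights for the shortest-path search) matches the Filippova-style word-graph weighting that the following sketch assumes; the function name and numeric inputs are invented:

```python
def edge_weight(freq_i, freq_j, offsets):
    """Assumed Filippova-style weight for the edge between word nodes i
    and j; `offsets` holds, for each sentence containing both words, the
    positional distance diff(s, i, j). Frequent words that co-occur at
    short distances get a smaller weight (a stronger edge for the
    shortest-path search)."""
    if not offsets:
        return float("inf")  # no co-occurrence: effectively no edge
    # Cohesion: many close co-occurrences make the denominator large,
    # shrinking the ratio; frequency normalization shrinks it further.
    cohesion = (freq_i + freq_j) / sum(1.0 / d for d in offsets)
    return cohesion / (freq_i * freq_j)

strong = edge_weight(5, 4, [1, 1, 2])  # frequent words, nearly adjacent
weak = edge_weight(2, 2, [6])          # rare words, far apart
```

Because Yen's algorithm in the second step searches for minimum-weight paths, lower weights on well-attested word transitions steer the compressed sentence toward fluent, frequent phrasings.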
the second step: the recall phase; the F shortest paths are found in the word graph, and the sentence formed by each path is a candidate answer;
the problem is solved with Yen's algorithm; the algorithm has two parts: the 1st shortest path P(1) is computed, and the other F-1 shortest paths are then computed in sequence on that basis; when computing P(i+1), every node on P(i) except the end node is regarded as a deviation node, the shortest path from each deviation node to the end node is computed and spliced with the prefix of P(i) running from the start node to that deviation node to form a candidate path, and the shortest deviation path is then obtained; the 100 top-ranked paths are selected as candidate sentence paths;
the third step: re-ranking the candidate answers and selecting the top-ranked candidate as the final answer; key phrases are extracted with TextRank, and a new score is designed for re-ranking; first, each node updates its score using Equation 12 until convergence:
wherein S(ni) represents the score of node ni in the word graph; d is the damping coefficient, which may take the value 0.85; adj(ni) represents the nodes adjacent to node ni, and w(ej,i) represents the weight of the edge between node nj and node ni;
then key phrases r are obtained by combining key words, and each phrase's score score(r) is as follows:
wherein TextRank(w) represents the score of word node w computed by the TextRank algorithm; the denominator is the weighted length length(r) of the key phrase r, and this normalization of the score tends to select longer phrases;
finally, the paths are re-ranked by multiplying the weighted length of each candidate path obtained in the second step by the sum of the scores of the key phrases it contains; from the key-phrase scores, a final score is calculated for each sentence:
wherein length(c) represents the weighted length of sentence c, and path(c) represents the complete path of sentence c;
and selecting the sentence with the minimum score as the generated summary, and finally concatenating the summaries generated by the m categories to obtain the final complete summary.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210275509.1A CN114706972A (en) | 2022-03-21 | 2022-03-21 | Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210275509.1A CN114706972A (en) | 2022-03-21 | 2022-03-21 | Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114706972A true CN114706972A (en) | 2022-07-05 |
Family
ID=82169773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210275509.1A Pending CN114706972A (en) | 2022-03-21 | 2022-03-21 | Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114706972A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115687960A (en) * | 2022-12-30 | 2023-02-03 | 中国人民解放军61660部队 | Text clustering method for open source security information |
CN116127321A (en) * | 2023-02-16 | 2023-05-16 | 广东工业大学 | Training method, pushing method and system for ship news pushing model |
CN116541505A (en) * | 2023-07-05 | 2023-08-04 | 华东交通大学 | Dialogue abstract generation method based on self-adaptive dialogue segmentation |
CN116541505B (en) * | 2023-07-05 | 2023-09-19 | 华东交通大学 | Dialogue abstract generation method based on self-adaptive dialogue segmentation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106997382B (en) | Innovative creative tag automatic labeling method and system based on big data | |
US9971974B2 (en) | Methods and systems for knowledge discovery | |
CN109858028B (en) | Short text similarity calculation method based on probability model | |
RU2487403C1 (en) | Method of constructing semantic model of document | |
CN108197117B (en) | Chinese text keyword extraction method based on document theme structure and semantics | |
US20120158400A1 (en) | Methods and systems for knowledge discovery | |
EP1661031A1 (en) | System and method for processing text utilizing a suite of disambiguation techniques | |
CN114706972A (en) | Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression | |
CN114254653A (en) | Scientific and technological project text semantic extraction and representation analysis method | |
Chen et al. | Single document keyword extraction via quantifying higher-order structural features of word co-occurrence graph | |
US20140089246A1 (en) | Methods and systems for knowledge discovery | |
Batura et al. | A method for automatic text summarization based on rhetorical analysis and topic modeling | |
Jha et al. | Hsas: Hindi subjectivity analysis system | |
Tripathy et al. | Automated phrase mining using POST: The best approach | |
Heidary et al. | Automatic Persian text summarization using linguistic features from text structure analysis | |
Moghadam et al. | Comparative study of various Persian stemmers in the field of information retrieval | |
JP2003085181A (en) | Encyclopedia system | |
Ahmed et al. | A web statistics based conflation approach to improve Arabic text retrieval | |
Asrori et al. | Performance analysis graph-based keyphrase extraction in Indonesia scientific paper | |
IO et al. | Performance evaluation of an improved model for keyphrase extraction in documents | |
Beumer | Evaluation of Text Document Clustering using k-Means | |
Jain et al. | Investigating the Similarity of Court Decisions. | |
Das et al. | Relation recognition among named entities from a crime corpus using a web-based semantic similarity measurement | |
Rajpal et al. | A Novel Techinque For Ranking of Documents Using Semantic Similarity | |
Parra et al. | Unsupervised tagging of spanish lyrics dataset using clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||