CN114706972A - Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression - Google Patents


Info

Publication number
CN114706972A
CN114706972A
Authority
CN
China
Prior art keywords
sentence
node
text
word
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210275509.1A
Other languages
Chinese (zh)
Inventor
张隽驰
张华平
商建云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority application: CN202210275509.1A
Publication: CN114706972A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users
    • G06F 16/35 Clustering; Classification
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2155 Bootstrap methods characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques with a fixed number of clusters, e.g. K-means clustering
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/216 Parsing using statistical methods
    • G06F 40/30 Semantic analysis

Abstract

The invention relates to an unsupervised method for automatically generating scientific and technical intelligence abstracts based on multi-sentence compression, and belongs to the technical field of natural language generation. For multi-document text generation in the scientific and technical intelligence field, source data is first obtained by a topic crawler based on an LDA topic-similarity thesaurus-expansion method. All text paragraphs are then ranked by a text information value evaluation model built on three indicators: the authority, timeliness and content relevance of the text information. The paragraphs with higher scores are selected as the original text for generating the final intelligence report. Finally, the abstract is generated automatically by an unsupervised multi-document summarization method based on spectral clustering and multi-sentence compression. The method effectively addresses the high demands that intelligence generation places on data timeliness and authority during data screening, and the inapplicability of conventional neural-network-based multi-document generation methods caused by the lack of data sets in the scientific and technical intelligence field.

Description

Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression
Technical Field
The invention relates to a method for automatically generating scientific and technical intelligence abstracts, in particular to an unsupervised abstract generation method based on multi-sentence compression, and belongs to the technical field of natural language generation.
Background
Scientific and technical intelligence work plays a key role in the planning of major national science and technology strategies, the deployment of major science and technology programs, and economic and social development. It contributes to the development of society, the economy and science, and is a key component of national science and technology planning and economic and social development.
In the field of scientific and technical intelligence, manually collecting, sorting and screening valuable text data and manually writing intelligence reports consume a large amount of labor and time in today's big-data environment. The orderly acquisition of information resources, together with processing and access analysis centered on the document as the unit, no longer satisfies current demands on intelligence. Instead, higher demands are placed on the depth of intelligence analysis, including rapid evaluation and recommendation of data resources, extraction and analysis of knowledge units, multi-dimensional data fusion, fine-grained data analysis, visualization, and computerized data presentation and analysis. The aim is to sift big data, discarding the redundant, the coarse and the false, and to realize basic automated generation of intelligence abstracts.
However, in the era of information explosion, the sources of scientific and technical intelligence are complicated and disorderly, so quickly and accurately finding the useful information a user needs within a large amount of information is a great challenge. To achieve basic automated intelligence generation, the first step is to collect effective information efficiently. In addition, since the timeliness and authority of information are very important in intelligence research, they require careful consideration during literature selection. Moreover, different sources yield non-uniform document structures, and integrating multiple heterogeneous documents into a final report is also difficult. In summary, the main problems to be solved in automatically generating scientific and technical intelligence abstracts are comprehensive evaluation and recommendation of heterogeneous texts that integrates factors such as time, and multi-document summarization.
Currently, topic crawlers are among the better approaches to efficient information collection. Most researchers combine link-based and content-based crawling strategies, with good results. In the scientific and technical intelligence field, however, data is usually acquired from authoritative think tanks at home and abroad, and their webpages contain few links, so content-based crawling is more suitable in this field. In multi-document summarization research, most recent work first ranks the documents, screens out the N most important ones, and then applies a neural network or a combination of a neural network and a graph model; some authors also incorporate pre-trained models such as BERT. These methods perform well in supervised multi-document summarization. In the scientific and technical intelligence field, however, the scarcity of data sets is a non-negligible problem that makes supervised methods practically unusable.
Disclosure of Invention
The invention aims to solve the difficulty of manual collection, screening and report generation in the scientific and technical intelligence field, and creatively provides an automatic intelligence abstract generation method spanning data collection and data screening through intelligence generation. The method effectively addresses the high demands that intelligence generation places on data timeliness and authority during data screening, and the inapplicability of conventional neural-network-based multi-document generation methods caused by the lack of data sets in the field.
The innovation points of the invention are as follows. For multi-document text generation in the scientific and technical intelligence field, source data is obtained by a topic crawler based on a topic-similarity thesaurus-expansion method built on LDA (Latent Dirichlet Allocation, a document-topic generation model, also called a three-layer Bayesian probability model, containing a three-layer structure of words, topics and documents). All text paragraphs are ranked by a text information value evaluation model built on three indicators: the authority, timeliness and content relevance of the text information. The paragraphs with higher scores are selected as the original text for generating the final intelligence. Finally, the abstract is generated automatically by an unsupervised multi-document summarization method based on spectral clustering and multi-sentence compression.
The invention is realized by the following technical scheme.
An unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression comprises the following steps:
step 1: and adopting a topic crawler mode based on an LDA topic similarity word bank expansion method to capture text contents and acquire source data.
Starting from the given initial keywords, and even when the topic description is insufficient, the corpus is continuously expanded through the topic crawler's collection of topic-related resources; the model is trained in a loop, and the topic description is continuously refined, expanded and updated, so that the desired content is acquired more comprehensively and accurately.
Step 2: evaluate and rank the crawled texts according to the relevance of their content to the keywords and according to the timeliness and authority of the source texts. The paragraphs whose scores rank in the top 40 are selected as the original text for generating the final scientific and technical intelligence.
Step 3: take the result text obtained in step 2 as the input of the model, and obtain the abstract result using an unsupervised multi-document summarization model based on spectral clustering and multi-sentence compression.
Advantageous effects
Compared with the prior art, the method of the invention has the following advantages:
1. The method provides a text information value evaluation model for paper and patent texts and another for think-tank articles. The models are highly general and applicable to all paper and patent texts and all think-tank articles.
2. The method provides an automatic scientific and technical intelligence abstract generation pipeline spanning data acquisition through text generation. The topic crawler improves the relevance of the data to the topic keywords, reduces redundant data, and makes the data acquisition and cleaning stages more efficient. In the text generation stage, the combination of spectral clustering and multi-sentence compression improves the quality of unsupervised multi-document summarization.
Drawings
FIG. 1 is an overall flow diagram of the method of the present invention;
FIG. 2 is an architecture diagram of the topic crawler module of method step 1 and embodiment 1 of the present invention;
FIG. 3 is a flow chart of the text information value evaluation process of method step 2 and embodiment 1 of the present invention;
FIG. 4 is a flow chart of the multi-document summarization algorithm of method step 3 and embodiment 1 of the present invention;
FIG. 5 is a flow chart of the multi-sentence compression algorithm used in the multi-document summarization process of method step 3.4 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
An unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression comprises the following steps:
step 1: and adopting a topic crawler mode based on an LDA topic similarity word bank expansion method to capture text contents and acquire source data.
Because only a small number of keywords are given, the content crawled is not fully consistent with the content actually desired. The topic crawler is therefore adopted to improve accuracy and broaden the crawling range while keeping crawling as efficient as possible.
Starting from the given initial keywords, and even when the topic description is insufficient, the corpus is continuously expanded through the topic crawler's collection of topic-related resources; the model is trained in a loop, and the topic description is continuously refined, expanded and updated, so that the desired content is acquired more comprehensively and accurately.
Specifically, step 1 comprises the steps of:
step 1.1: crawling corresponding result webpages according to given initial keywords, and extracting abstracts of the newly added webpages to serve as LDA new training corpora.
Step 1.2: word embedding (word embedding) is done to the training expectation. May be implemented using the word2vec model.
Step 1.3: and combining the original corpus, and obtaining a new theme document through LDA training, wherein the new theme document is used for covering and updating the theme document of the original theme crawler.
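The corpus-expansion loop of steps 1.1 to 1.3 can be sketched as follows. The patent trains an LDA topic model over the crawled abstracts; as a dependency-free illustration, this sketch substitutes a simple frequency count for the LDA topic words, and all names and example texts are hypothetical:

```python
# Sketch of the step-1 lexicon-expansion loop. NOTE: the frequency count below
# is a stand-in for the LDA training of the actual method; names are made up.
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with", "via"}

def expand_lexicon(seed_keywords, crawled_abstracts, top_n=5):
    """Return the seed lexicon enlarged with the top_n most frequent
    content words of newly crawled abstracts (steps 1.1 and 1.3)."""
    counts = Counter()
    for abstract in crawled_abstracts:
        for raw in abstract.lower().split():
            word = raw.strip(".,;:()\"'")
            if word and word not in STOP_WORDS and word not in seed_keywords:
                counts[word] += 1
    return set(seed_keywords) | {w for w, _ in counts.most_common(top_n)}

lexicon = expand_lexicon(
    {"summarization"},
    ["Unsupervised summarization of multiple documents via sentence compression.",
     "Sentence compression builds a word graph for summarization."],
)
```

In the real loop, the expanded lexicon would be fed back to the crawler as the updated topic document before the next crawl round.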
Step 2: evaluate and rank the crawled texts according to the relevance of their content to the keywords and according to the timeliness and authority of the source texts.
The value of text information is generally evaluated by analyzing the information in terms of its propagation source, propagation characteristics, content characteristics and so on. The propagation source reflects the characteristics of the publishing subject, including the publishing channel and the authority of the publisher. The propagation characteristics reflect the form of the propagation process: only after wide, deep and fast propagation can the inherent value of information be fully realized, and these characteristics generally include the number of propagators, the propagation speed and the depth of the propagation chain. In addition, information has a clear timeliness characteristic, and expired information is often useless.
Therefore, the text information value evaluation model is constructed from three feature dimensions: the authority, timeliness and content relevance of the text information.
Specifically, step 2 comprises the steps of:
step 2.1: all text is segmented by paragraph. In the subsequent calculation, the calculation is performed in units of paragraphs.
Step 2.2: the value of papers, patents and periodicals is evaluated as follows:
For texts such as papers, patents and periodicals, the impact factor, the first author's total publication count and total download count, and the text's own download count and citation count are used as authority indicators; the release time is used as the timeliness indicator; and the similarity between the abstract and the topic lexicon is used as the content relevance indicator. Corresponding parameters are set for each indicator to construct a text information value evaluation model, and the value score of the text is computed comprehensively.
Further, the invention provides the following value-score calculation method for texts such as papers, patents and periodicals:
The first step: compute authority x1.
The factors related to authority x1 include the authority of the journal that published the text, the authority of the author in the field, and the evaluations of the text by other researchers in the field.
The journal authority x11 is expressed as the ratio of the journal impact factor to the maximum impact factor over all literature, as shown in formula 1:
x11 = IF / IFmax (1)
where IF is the impact factor of the journal and IFmax is the maximum impact factor among all considered literature.
The author authority x12 of papers and patents is determined by the number of articles the author has published in the field as first author and the total download count of the articles published by the author as first author, as shown in formula 2:
[formula 2 is rendered as an image in the original]
The intrinsic value x13 of a paper is determined by its download count and citation count, as shown in formula 3:
[formula 3 is rendered as an image in the original]
The second step: compute timeliness x2.
Let μ be the coefficient with which the text information value decays over time, and let Δt be the interval between the information acquisition time and the information release time. The time-varying information value is then computed as in formula 4:
x2 = e^(−μΔt) (4)
where e is the natural constant.
The third step: compute content relevance x3.
Specifically, the BM25 algorithm may be used to calculate the relevance of the text content. Each word in the topic lexicon acquired by the topic crawler is treated as a query term qi. For the abstract a of the text, the relevance score of each qi with respect to a is computed, and the weighted sum of these scores gives the relevance score Score(Q, a) of the current text with respect to the topic lexicon, as shown in formula 5:
Score(Q, a) = Σ(i=1..n) Wi · R(qi, a) (5)
where Wi denotes the weight of the i-th word qi, computed with the TF-IDF algorithm; n is the total number of words in the lexicon; and R(qi, a) denotes the relevance of word qi to a, computed by formulas 6 and 7:
R(qi, a) = tf(qi, a) · (k + 1) / (tf(qi, a) + K) (6)
K = k · (1 − b + b · La / Lave) (7)
where tf(t, a) is the frequency of word t in a; La is the length of a and Lave is the average length of all texts; k is a positive parameter used to standardize the range of article word frequencies; b is a tunable parameter with 0 < b < 1 that determines how much of the document length is used to represent the amount of information; and K is an intermediate result of the calculation.
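The decay of formula 4 is direct to compute; the decay coefficient μ and the day-based time unit below are illustrative choices, not values prescribed by the text:

```python
# Timeliness factor of formula 4: x2 = e^(-mu * delta_t).
import math

def timeliness(delta_t_days, mu=0.01):
    """Exponential decay of information value with age; mu and the
    day-based time unit are assumptions for illustration."""
    return math.exp(-mu * delta_t_days)

fresh = timeliness(0)    # just released: full value
old = timeliness(365)    # one year old: strongly discounted
```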
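The scoring of formulas 5 to 7 can be sketched as follows; for brevity, the TF-IDF weights Wi are replaced by uniform weights, and all example values are illustrative:

```python
# BM25 relevance of an abstract to the topic lexicon (formulas 5-7).
def bm25_score(lexicon, abstract_tokens, all_lengths, k=1.5, b=0.75):
    """Score(Q, a) = sum_i W_i * R(q_i, a), with
    R(q_i, a) = tf*(k+1)/(tf+K) and K = k*(1 - b + b*L_a/L_ave).
    W_i is TF-IDF in the patent; a uniform 1.0 stands in here."""
    L_a = len(abstract_tokens)
    L_ave = sum(all_lengths) / len(all_lengths)
    K = k * (1 - b + b * L_a / L_ave)
    score = 0.0
    for q in lexicon:
        tf = abstract_tokens.count(q)   # term frequency of q_i in abstract a
        score += 1.0 * tf * (k + 1) / (tf + K)
    return score

tokens = "spectral clustering compresses sentences for summarization".split()
s_rel = bm25_score({"clustering", "summarization"}, tokens, [6, 8, 10])
s_irr = bm25_score({"biology", "chemistry"}, tokens, [6, 8, 10])
```

An abstract sharing no words with the lexicon scores zero, and each matching term contributes a saturating, length-normalized amount.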
Step 2.3: the value of think-tank articles is evaluated as follows:
For think-tank article texts, the author's follower count and number of published articles are used as the authority indicators, the release time as the timeliness indicator, and the similarity between the article abstract and the topic lexicon as the content relevance indicator. Corresponding parameters are set for each indicator to construct a think-tank article text information value evaluation model.
The invention provides the following value-score calculation method for think-tank article texts:
The first step: compute authority x1.
Think-tank articles have no data such as download or citation counts, and there is no quantitative standard for the authority of a think tank itself, so the author's follower count and number of published documents are used as the authority indicators, computed by formulas 8 and 9:
[formula 8 is rendered as an image in the original]
[formula 9 is rendered as an image in the original]
the second step is that: calculating the timeliness x2
The calculation method is the same as that described in the second step of step 2.2.
The third step: computing content relevance x3
The calculation method is the same as that in the third step of step 2.2.
Step 2.4: and calculating the information value of the text.
The text information value is defined as a linear combination of the authority, timeliness and content-relevance features. Considering the multiplier effect of timeliness, the information value is measured as:
X = [δ1(α1x11 + α2x12 + α3x13) + δ2(βx3)] · x2 (10)
where X represents the value of the text information, and α1, α2, α3, δ1, δ2 represent the influence factors of the different features on the text value, chosen according to actual needs. In the invention, the values α1 = α2 = 0.3, α3 = 0.4, δ1 = δ2 = 0.5 may be used.
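Formula 10 translates directly into code; β is not assigned a value in the text, so it defaults to 1.0 here as an assumption, while the other parameters use the values suggested above:

```python
# Combined text information value of formula 10:
# X = [d1*(a1*x11 + a2*x12 + a3*x13) + d2*(beta*x3)] * x2
def text_value(x11, x12, x13, x2, x3,
               a1=0.3, a2=0.3, a3=0.4, d1=0.5, d2=0.5, beta=1.0):
    """Combine authority (x11, x12, x13), content relevance (x3) and the
    timeliness multiplier (x2) into one value score X. beta=1.0 is an
    assumption: the text does not specify its value."""
    return (d1 * (a1 * x11 + a2 * x12 + a3 * x13) + d2 * beta * x3) * x2

X = text_value(x11=0.8, x12=0.6, x13=0.7, x2=0.9, x3=0.5)
```

Note how x2 multiplies the whole bracket: a stale text is discounted regardless of how authoritative or relevant it is.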
Step 2.5: rank the paragraphs by their text information value scores, and select the top 40 paragraphs of the ranking as the text data for subsequent multi-document summarization.
Step 3: obtain the abstract result using an unsupervised multi-document summarization model based on spectral clustering and multi-sentence compression.
Because no labeled data set of paper, patent or think-tank article texts exists in the multi-document summarization field, the invention provides an unsupervised machine learning method based on spectral clustering and multi-sentence compression for summary generation. The method converts the original documents into a sentence graph that considers both linguistic and deep representations, applies spectral clustering to obtain several sentence clusters, and finally compresses each cluster to generate the final summary.
Specifically, step 3 includes the steps of:
step 3.1: the text data is processed.
For the paragraph set P = {p1, p2, …, pn} related to one topic, finally obtained in step 2, the goal is to generate a summary S that covers the important information of the original documents and contains no redundant information.
A sentence is taken as the minimal processing unit of the text, and because sentence compression is required in the final step, all stop words are retained. The specific method is to generate a sentence list (for example by calling the NLP pipeline of spaCy, a leading industrial-strength library for NLP tasks) and use it as input for the sentence graph constructed next.
Step 3.2: build a structured sentence graph whose nodes correspond to the sentences generated in step 3.1, with edges drawn according to lexical relations and deep semantic relations between sentences.
The goal of this step is to identify pairwise sentence connections that represent the discourse structure of the paragraph set P, and to construct the sentence graph using a technique based on the approximate discourse graph (ADG) combined with deep embeddings.
Specifically, a graph G = (V, E) is constructed, where a node vi ∈ V represents a sentence, V is the set of nodes, ei,j ∈ E represents the edge between nodes vi and vj, and E is the set of edges. For any two distinct nodes vi and vj, if the sentences they represent have one of the following relationships, the nodes are connected by an edge with value 1, i.e., ei,j = 1.
The rules for constructing graph G are as follows:
Deverbal noun reference: according to English grammar, when an event or entity is mentioned in a verb phrase, it is usually referred to in following sentences by the noun or noun phrase derived from that verb. The noun form of the verb phrase is found through WordNet (an English lexical database grounded in cognitive linguistics that organizes words into a network according to their meanings). If the noun form of a verb phrase in one sentence appears in a later sentence, the nodes represented by the two sentences are connected.
Entity continuation: this rule accounts for lexical relevance. If sentence vi and sentence vj contain the same entity class (e.g., organization, person name, product), the two nodes are connected.
Discourse markers: if adjacent sentences are joined by a semantic connective such as "however", "meanwhile" or "furthermore", the nodes represented by the two sentences are connected.
Sentence similarity: each sentence is represented by the average of its word vectors, and the similarity score of two sentences is the cosine similarity of the two sentence vectors. If the similarity score reaches a set threshold, the two nodes are connected.
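The similarity rule can be sketched as follows, with toy three-dimensional word vectors standing in for real embeddings and an arbitrary threshold:

```python
# Sentence-similarity edge rule of step 3.2: average word vectors per sentence,
# then compare the cosine of the two sentence vectors with a threshold.
import math

def sentence_vector(tokens, word_vecs):
    """Average the vectors of the in-vocabulary words of a sentence."""
    dims = len(next(iter(word_vecs.values())))
    known = [word_vecs[t] for t in tokens if t in word_vecs]
    if not known:
        return [0.0] * dims
    return [sum(col) / len(known) for col in zip(*known)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_edge(s1, s2, word_vecs, threshold=0.8):
    """e_{i,j} = 1 iff the cosine similarity reaches the threshold."""
    sim = cosine(sentence_vector(s1, word_vecs), sentence_vector(s2, word_vecs))
    return 1 if sim >= threshold else 0

VECS = {"cats": [1.0, 0.1, 0.0], "felines": [0.9, 0.2, 0.0],
        "purr": [0.8, 0.0, 0.1], "stocks": [0.0, 1.0, 0.9]}
e_same = similarity_edge(["cats", "purr"], ["felines", "purr"], VECS)
e_diff = similarity_edge(["cats", "purr"], ["stocks"], VECS)
```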
Step 3.3: cluster using the graph to obtain the partitions within the graph.
Currently, most graph clustering methods identify node groups in a graph according to edges connecting nodes. The invention adopts a spectral clustering method, which comprises the following steps:
the first step is as follows: acquiring a Laplace matrix of the sentence graph constructed in the above mode (which can be obtained by subtracting the adjacency matrix from the degree matrix of the graph);
the second step is that: calculating the first m eigenvectors of the matrix, which are used for defining the eigenvector of each sentence;
the third step: the sentences are divided into m categories by means of k-means clustering.
In this way, m sentence categories representing different key information are obtained. Next, multi-sentence compression is applied to each of the m sentence sets to obtain m summary sentences; the compression process is described in step 3.4.
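The three clustering steps can be sketched with NumPy on a toy adjacency matrix whose two connected components ({0, 1, 2} and {3, 4}) should land in separate clusters. The tiny k-means uses a deterministic farthest-point initialization for reproducibility; this is an illustration, not the patent's exact implementation:

```python
# Spectral clustering of the sentence graph: Laplacian -> m smallest
# eigenvectors -> k-means over the eigenvector rows.
import numpy as np

def init_centers(features, m):
    """Deterministic k-means seeding: start at row 0, then repeatedly
    add the row farthest from the already-chosen centers."""
    centers = [features[0]]
    for _ in range(m - 1):
        d = np.min(np.linalg.norm(features[:, None] - np.array(centers)[None],
                                  axis=2), axis=1)
        centers.append(features[int(d.argmax())])
    return np.array(centers)

def spectral_clusters(A, m, iters=20):
    D = np.diag(A.sum(axis=1))
    L = D - A                               # step 1: graph Laplacian
    _, eigvecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    features = eigvecs[:, :m]               # step 2: first m eigenvectors
    centers = init_centers(features, m)
    labels = np.zeros(len(A), dtype=int)
    for _ in range(iters):                  # step 3: plain k-means
        dists = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(m):
            if np.any(labels == c):
                centers[c] = features[labels == c].mean(axis=0)
    return labels

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
labels = spectral_clusters(A, m=2)
```

For a disconnected graph the Laplacian's null-space eigenvectors are constant on each component, so the k-means step recovers the components exactly.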
Step 3.4: generate summary sentences from the extracted subgraphs.
Multi-sentence compression (MSC) generates one summary sentence from each cluster, where a cluster is a set of semantically related sentences. The classical approach constructs a word graph and selects the sentence formed by the shortest path as the summary.
The invention provides a new implementation that extends the classical method, specifically:
the first step is as follows: and constructing a word graph.
For a set of sentences S ═ S1,s2,…,snThat is, a node is first mapped to each word that appears in a sentence. Since the situation of word ambiguity in natural language is widely existed, each node uses a bigram (token, tag) as its identifier, and each time a repeated word is considered, the word graph is adjusted according to the following rules:
and directly establishing a new node for the words which are not stop words, non punctuations and have no candidate nodes (no token, tag) in the current word graph corresponds to the word).
For words with non-stop words, non-punctuation and only one candidate node, the word is directly mapped to the candidate node.
For words that are not stop words, are not punctuated, and have multiple candidate nodes: the word is mapped to the node closest to the context, but the word graph is kept acyclic-i.e. two identical words of the same sentence cannot be mapped to one node. And if no node meeting the condition exists, a new node is created.
And for stop words and punctuation, if the nodes with the same context exist, mapping the stop words and punctuation as the nodes, otherwise, creating a new node.
Regarding the weight of the edge between two nodes, the co-occurrence probability of the nodes is considered: the larger the co-occurrence probability of two nodes, the smaller the edge weight. When an edge exists between two nodes, a multi-hop connection between them strengthens the edge, and this strengthening effect weakens as the path length grows. This is expressed by formula 11:
w′(ei,j) = (freq(i) + freq(j)) / Σ(s∈S) diff(s, i, j)^(−1)
w(ei,j) = w′(ei,j) / (freq(i) × freq(j)) (11)
where w(ei,j) represents the weight of the edge between node i and node j; freq(i) and freq(j) represent the number of words mapped to node i and node j, respectively; and diff(s, i, j) is the distance between the offset positions, within sentence s, of the words mapped to node i and node j.
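This weighting can be sketched as code. Since the exact formula is rendered as an image in the original, the form below follows the classical word-graph weighting used in multi-sentence compression: edges between frequent, close-together word pairs receive lower weight and are therefore preferred by the shortest-path search.

```python
# Edge weight between word-graph nodes i and j: strong, close co-occurrence
# lowers the weight (hedged reconstruction of formula 11).
def edge_weight(freq_i, freq_j, offsets):
    """offsets: one (pos_i, pos_j) pair per sentence in which both words
    occur, giving their word positions; diff(s, i, j) = |pos_i - pos_j|."""
    inv_diff = sum(1.0 / abs(pi - pj) for pi, pj in offsets if pi != pj)
    if inv_diff == 0:
        return float("inf")                   # the words never co-occur
    w_prime = (freq_i + freq_j) / inv_diff
    return w_prime / (freq_i * freq_j)        # normalize by node frequencies

w_close = edge_weight(3, 3, [(0, 1), (2, 3), (5, 6)])  # adjacent in 3 sentences
w_far = edge_weight(3, 3, [(0, 7)])                    # one distant co-occurrence
```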
the second step: a recall phase. Finding the shortest paths of F in the word graph, wherein each sentence formed by the paths is a candidate answer.
The essence of this step is to solve the limited F shortest path problem. In the invention, the Yen's algorithm is adopted to solve the problem. The algorithm is divided into two parts, the 1 st shortest path P (1) is calculated, and then other F-1 shortest paths are calculated in sequence on the basis. When calculating P (i +1), all nodes except the end node on P (i) are regarded as deviated nodes, the shortest path from each deviated node to the end node is calculated, and then the shortest path is spliced with the previous path from the start node to the deviated nodes on P (i) to form a candidate path, and then the shortest deviated path is calculated. The top 100 ranked path is selected as the candidate sentence path.
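The recall phase can be illustrated on a toy word graph. The patent uses Yen's algorithm for efficiency; the dependency-free sketch below (hypothetical graph) instead enumerates all simple paths by depth-first search and keeps the F lightest, which is only feasible for very small graphs:

```python
# Recall phase: keep the F lightest start-to-end sentence paths.
def f_shortest_paths(graph, start, end, F):
    """graph: {node: {neighbor: edge_weight}}; returns up to F
    (total_weight, path) pairs, lightest first. Brute force, for
    illustration only; the actual method uses Yen's algorithm."""
    results = []

    def dfs(node, path, weight):
        if node == end:
            results.append((weight, path))
            return
        for nxt, w in graph[node].items():
            if nxt not in path:            # keep paths simple (no cycles)
                dfs(nxt, path + [nxt], weight + w)

    dfs(start, [start], 0.0)
    return sorted(results)[:F]

G = {"<s>": {"a": 1.0, "b": 3.0},
     "a": {"b": 1.0, "</s>": 5.0},
     "b": {"</s>": 1.0},
     "</s>": {}}
paths = f_shortest_paths(G, "<s>", "</s>", F=2)
```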
The third step: re-rank the candidate answers and select the top-ranked candidate as the final answer.
Specifically, key phrases are extracted using TextRank, and a new score is designed for re-ranking. First, each node updates its score using equation 12 until convergence:
$$S(n_i) = (1 - d) + d \sum_{n_j \in adj(n_i)} \frac{w(e_{j,i})}{\sum_{n_k \in adj(n_j)} w(e_{j,k})} \, S(n_j) \qquad (12)$$
where S(n_i) represents the score of node n_i in the word graph; d is the damping coefficient, which may take the value 0.85; adj(n_i) denotes the nodes adjacent to node n_i; and w(e_{j,i}) represents the weight of the edge between node n_j and node n_i.
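The iteration of equation 12 can be sketched as follows (illustrative; the encoding of the graph as a dict of undirected weighted edges is an assumption of the example):

```python
def textrank_scores(nodes, edges, d=0.85, tol=1e-6, max_iter=200):
    """Sketch of the equation-12 update: weighted TextRank.

    nodes: iterable of node ids; edges: dict (i, j) -> weight, undirected.
    Each node's score is pulled toward its neighbours, each neighbour
    weighted by the share its edge takes of that neighbour's total edge
    weight, damped by d, until the update converges.
    """
    adj = {n: {} for n in nodes}
    for (i, j), w in edges.items():
        adj[i][j] = w
        adj[j][i] = w
    scores = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        new = {
            i: (1 - d) + d * sum(
                w / sum(adj[j].values()) * scores[j]
                for j, w in adj[i].items())
            for i in nodes
        }
        if max(abs(new[n] - scores[n]) for n in nodes) < tol:
            return new
        scores = new
    return scores
```

On a chain a–b–c with unit edge weights, the middle node accumulates the highest score, as expected of a centrality measure.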
Then, key phrases r are obtained by combining the key words, and the score score(r) is computed as:
$$score(r) = \frac{\sum_{w \in r} TextRank(w)}{length(r) + 1} \qquad (13)$$
where TextRank(w) represents the score of word node w computed by the TextRank algorithm. The denominator normalizes the score by the weighted length length(r) of key phrase r; this normalization tends to favor longer phrases.
Finally, the candidate paths obtained in the second step are re-ranked: the total edge weight of each path is set against the product of its weighted length and the sum of the scores of the key phrases it contains. From the key phrase scores, a final score is calculated for each sentence:
$$score(c) = \frac{w(path(c))}{length(c) \times \sum_{r \in path(c)} score(r)} \qquad (14)$$
where length(c) represents the weighted length of sentence c, path(c) represents the complete path of sentence c, and w(path(c)) is the total edge weight along that path.
The sentence with the minimum score is selected as the generated summary for the category, and the summaries generated for the m categories are finally concatenated to obtain the complete summary.
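The re-ranking scores of equations 13 and 14 can be sketched as below. This is illustrative only: both equations appear as images in the original, so the "+ 1" normalisation that favors longer phrases is an assumption in the style of Boudin-and-Morin reranking.

```python
def keyphrase_score(phrase, textrank):
    """Equation 13 sketch: summed TextRank scores of the phrase's words,
    normalised by length; the '+ 1' (an assumption) favors longer phrases."""
    return sum(textrank[w] for w in phrase) / (len(phrase) + 1)

def sentence_score(path_weight, weighted_length, phrase_scores):
    """Equation 14 sketch: lower is better, so strong (low-weight) paths
    that cover high-scoring key phrases rank first."""
    return path_weight / (weighted_length * sum(phrase_scores))
```

A candidate whose key phrases score higher receives a smaller (better) final score, matching the minimum-score selection rule above.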
Examples
This example describes a specific embodiment of the method of the present invention.
The implementation is illustrated by the overall flow chart of fig. 1. The invention provides a complete process for scientific and technical intelligence summary generation, from text data acquisition through data processing to summary text generation. In a specific implementation, the topic crawler module starts first and acquires the data required for analysis according to a keyword library provided by the user; the text information value evaluation module then analyzes and ranks the acquired data; finally, the ranking result is taken as the input of the summary generation module and fed into the model to obtain the final result.
First, according to the keywords provided by the user, the topic crawler module is applied to Google Scholar, DARPA, IARPA and the RAND think tank to acquire data. FIG. 2 shows the data acquisition flow of the unsupervised scientific and technical intelligence automatic generation method based on sentence graph clustering. According to step 1 of the invention, a certain number of web pages are crawled from the given initial keywords; the abstracts of the newly added web pages are taken as the new LDA training corpus; word2vec is used to perform word embedding on the training corpus; and finally, combined with the original corpus, a new topic document is obtained by LDA training and used to overwrite and update the topic document of the original topic crawler.
After the required text data is acquired, the text value is evaluated according to the attribute data of the text; the evaluation scheme is shown in fig. 3. All texts are segmented by paragraph; the authority, timeliness and content relevance of each text are calculated from data such as its journal, author and download count; these three features are then combined to compute the value of the text information. Finally, the text data are ranked by information value, and the top 40 texts are selected as the text data for the subsequent multi-document summarization.
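The value computation used in this evaluation (equations 4 and 10) can be sketched as follows. All coefficient values here are illustrative placeholders, since the patent leaves them to be chosen according to actual requirements.

```python
import math

def timeliness(delta_t_days, mu=0.01):
    """Equation 4: information value decays exponentially with age."""
    return math.exp(-mu * delta_t_days)

def information_value(x11, x12, x13, x3, delta_t_days,
                      alpha=(0.4, 0.3, 0.3), delta=(0.6, 0.4),
                      beta=1.0, mu=0.01):
    """Equation 10 sketch: X = [d1*(a1*x11 + a2*x12 + a3*x13) + d2*(b*x3)] * x2.

    Authority features (x11, x12, x13) and content relevance (x3) are
    combined linearly, then multiplied by timeliness x2 (its multiplier
    effect). alpha, delta, beta, mu are placeholder coefficients.
    """
    a1, a2, a3 = alpha
    d1, d2 = delta
    authority = a1 * x11 + a2 * x12 + a3 * x13
    return (d1 * authority + d2 * beta * x3) * timeliness(delta_t_days, mu)
```

A fresh text (Δt = 0) keeps its full combined score, and the score decays monotonically as the text ages.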
Finally comes the text generation stage, whose specific flow is shown in fig. 4. The text data is processed first: the paragraph data obtained in the previous step is split into sentences, and the NLP module of the SpaCy library is called to generate a sentence list. An undirected sentence graph is then constructed according to the rules described in step 3.2, the sentence graph is clustered into m classes by the spectral clustering method described in step 3.3, and a summary is finally generated for the m classes by multi-sentence compression. The flow of multi-sentence compression is shown in fig. 5: a word graph is constructed according to the rules described in the first step of step 3.4, Yen's algorithm is used to find the 100 shortest paths in the graph, and the paths are finally re-ranked. The re-ranking process is: extract key phrases with the TextRank algorithm, recompute the score of each sentence from the key phrases, and rank the scores of the 100 paths; the sentence formed by connecting the words of the minimum-score path is the summary result for that class. Finally, the summaries generated for the m classes are concatenated into the final summary.
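The clustering step of this stage (step 3.3: Laplacian, first m eigenvectors, k-means) can be sketched with numpy alone. This is illustrative; the deterministic farthest-point seeding of the toy k-means is an assumption made to keep the example reproducible.

```python
import numpy as np

def spectral_clusters(A, m, iters=50):
    """Sketch of step 3.3: Laplacian -> first m eigenvectors -> k-means.

    A: symmetric adjacency matrix of the sentence graph (n x n).
    Returns one cluster label per sentence. The toy farthest-point
    seeding plus plain Lloyd iterations stand in for a real k-means.
    """
    D = np.diag(A.sum(axis=1))
    L = D - A                          # unnormalised graph Laplacian
    _, vecs = np.linalg.eigh(L)        # eigh sorts eigenvalues ascending
    X = vecs[:, :m]                    # row i = spectral feature of sentence i
    centers = [X[0]]                   # farthest-point seeding (deterministic)
    for _ in range(1, m):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(iters):             # plain Lloyd iterations
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(axis=2),
                           axis=1)
        for k in range(m):
            mask = labels == k
            if mask.any():
                centers[k] = X[mask].mean(axis=0)
    return labels
```

On a graph with two disconnected sentence groups, the spectral embedding separates the groups and the labels recover them exactly.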
While the foregoing describes the preferred embodiment of the present invention, the invention is not limited to the embodiment and drawings disclosed herein. All equivalents and modifications that do not depart from the spirit of the invention are intended to fall within its scope.

Claims (6)

1. An unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression is characterized by comprising the following steps:
step 1: adopting a topic crawler mode based on an LDA topic similarity word bank expansion method to capture text content and acquire source data;
and 2, step: evaluating and sequencing the crawled texts according to the relevance of the contents and the keywords and the timeliness and authority of the source text; building a text information value evaluation model by extracting three characteristic dimensions of authority, timeliness and content relevance of text information;
the method comprises the following steps:
step 2.1: segmenting all texts according to paragraphs; in subsequent calculations, the calculation is performed in paragraph units;
the method for evaluating the value of the papers, patents and periodicals comprises the following steps:
for texts such as papers, patents and periodicals, the journal impact factor, the first author's total publication count and total download count, and the text's own download count and citation count are used as the authority indexes; the release time is used as the timeliness index; the similarity between the abstract and the topic lexicon is used as the content relevance index; corresponding parameters are set for each index, a text information value evaluation model is constructed, and the value score of the text is computed comprehensively;
step 2.3: evaluating the value of think-tank articles; for a think-tank article text, the number of followers and the number of published articles of the article's author are used as the authority indexes, the release time is used as the timeliness index, the similarity between the article abstract and the topic lexicon is used as the content relevance index, corresponding parameters are set for each index, and a think-tank article text information value evaluation model is constructed;
step 2.4: calculating the information value of the text;
defining the text information value as a linear combination of new authority characteristics, timeliness characteristics and content correlation characteristics; meanwhile, considering the multiplier effect of timeliness, the value of the obtained measurement and calculation information is as follows:
X = [δ1(α1x11 + α2x12 + α3x13) + δ2(βx3)] · x2 (10)
where X represents the value of the text information, and α1, α2, α3, δ1, δ2 represent the influence factors of the different features on the text value, whose values are selected according to actual requirements;
step 2.5: sequencing each paragraph according to the text information value score, and selecting the top 40 paragraphs of the sequencing result as the text data for subsequently performing multi-document summarization;
and 3, step 3: taking the result text obtained in the step 2 as the input of the model, and adopting an unsupervised multi-document abstract model based on spectral clustering and multi-sentence compression to obtain an abstract result;
firstly, the original documents are converted into a sentence graph, considering both linguistic and deep representations; spectral clustering is then applied to obtain a plurality of sentence clusters; finally, each cluster is compressed to generate the final summary.
2. The method for automatically generating unsupervised scientific and technical intelligence abstract based on multi-sentence compression as claimed in claim 1, wherein the step 1 comprises the following steps:
step 1.1: crawling corresponding result webpages according to given initial keywords, and extracting abstracts of the newly added webpages to serve as LDA new training corpora;
step 1.2: word embedding is carried out on the training corpus;
step 1.3: and combining the original corpus, and obtaining a new theme document through LDA training, wherein the new theme document is used for covering and updating the theme document of the original theme crawler.
3. The method for automatically generating unsupervised scientific and technical intelligence abstract based on multi-sentence compression as claimed in claim 1, wherein in the step 2, the method for calculating the value score of texts such as papers, patents and periodicals comprises the following steps:
the first step is as follows: computing authority x1
For authority x1Factors related to authority include publication journal authority of the text, authority of the author in the field, and evaluation of the text by other researchers in the field;
the authority x11 of a periodical is expressed as the ratio of the journal impact factor to the maximum impact factor over all literature, as shown in formula 1:
$$x_{11} = \frac{IF_{journal}}{IF_{max}} \qquad (1)$$
the authoritativeness of papers and patents is determined by the number of articles published in the field by the author as the first author and the total amount of downloaded articles published by the author as the first author, as shown in equation 2:
[Formula 2 is rendered as an image in the original publication.]
the value of a paper itself is determined by the download amount and the reference amount of the paper, as shown in formula 3:
[Formula 3 is rendered as an image in the original publication.]
the second step is that: calculating the timeliness x2
if μ is the attenuation coefficient of the text information value over time, and Δt is the time interval between the information acquisition time and the information release time, the information value as a function of time is calculated as shown in formula 4:
x2 = e^(−μΔt) (4)
wherein e is a natural constant;
the third step: computing content relevance x3
each word in the topic lexicon acquired by the topic crawler is regarded as q_i; for the abstract a of the text, the relevance score of each word q_i with a is calculated, and the weighted sum of these relevance scores gives the relevance score Score(Q, a) between the current text and the topic lexicon, as shown in formula 5:
$$Score(Q, a) = \sum_{i=1}^{n} W_i \cdot R(q_i, a) \qquad (5)$$
where W_i denotes the weight of the i-th word q_i, calculated using the TF-IDF algorithm; n denotes the total number of words in the lexicon; and R(q_i, a) denotes the relevance of word q_i to a, calculated by formulas 6 and 7:
$$R(q_i, a) = \frac{tf_{ta} \cdot (k + 1)}{tf_{ta} + K} \qquad (6)$$
$$K = k \cdot \left(1 - b + b \cdot \frac{L_a}{L_{ave}}\right) \qquad (7)$$
where tf_{ta} is the frequency of word t in a; L_a is the length of a and L_{ave} is the average length of all texts; the variable k is a positive parameter used to standardize the range of the term frequency; b is an adjustable parameter with 0 < b < 1 that decides how much the document length is used in representing the amount of information; and K is an intermediate result of the calculation;
the value score calculation method for the intelligent library file type text comprises the following steps of:
the first step is as follows: computing authority x1
for a think-tank article, the number of followers and the number of published articles of the article's author are used as the authority indexes, calculated by formulas 8 and 9:
[Formulas 8 and 9 are rendered as images in the original publication.]
the second step is that: calculating the timeliness x2
The calculation method is the same as the second step of the value score calculation method of the texts such as the papers, the patents and the periodicals;
the third step: computing content relevance x3
The calculation method is the same as the third step of the value score calculation method of the texts such as the papers, the patents and the periodicals.
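The content relevance computation of equations 5 through 7 is a classic BM25-style term score and can be sketched as follows (illustrative; k = 1.2 and b = 0.75 are conventional BM25 defaults, not values fixed by the patent):

```python
def bm25_relevance(tf_ta, doc_len, avg_len, k=1.2, b=0.75):
    """R(q_i, a) per equations 6-7 (a classic BM25 term score).

    k scales the influence of raw term frequency; b in (0, 1) controls
    how strongly document length normalises it.
    """
    K = k * (1 - b + b * doc_len / avg_len)   # equation 7
    return tf_ta * (k + 1) / (tf_ta + K)      # equation 6

def topic_relevance(weights, tfs, doc_len, avg_len, k=1.2, b=0.75):
    """Score(Q, a) per equation 5: TF-IDF-weighted sum over topic words."""
    return sum(w * bm25_relevance(tf, doc_len, avg_len, k, b)
               for w, tf in zip(weights, tfs))
```

For a document of average length, a topic word occurring once contributes exactly its TF-IDF weight, and absent words contribute nothing.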
4. The method for automatically generating unsupervised scientific and technical intelligence abstract based on multisentence compression as claimed in claim 1, wherein the step 3 comprises the following steps:
step 3.1: processing the text data;
for the paragraph set P = {p1, p2, …, pn} related to one topic finally obtained in step 2, the final aim is to generate a summary S which contains the important information of the original documents without redundant information; the sentence is taken as the minimum processing unit of the text, and since the final step requires sentence compression, all stop words are retained; the specific method is: generate a sentence list and use it as the input for the subsequently constructed sentence graph;
step 3.2: establishing a structured sentence graph, wherein the nodes correspond to the sentences generated in step 3.1, and edges are drawn according to the lexical relations and deep semantic relations among sentences so as to identify pairwise sentence connections that represent the discourse structure of the paragraph set P; the sentence graph is constructed with a technique based on an approximate discourse graph combined with deep embeddings;
step 3.3: and (3) applying graph clustering to obtain intra-graph partitions, wherein the specific steps are as follows:
the first step is as follows: acquiring a Laplace matrix of the constructed sentence graph;
the second step is that: calculating the first m eigenvectors of the matrix, which are used for defining the eigenvector of each sentence;
the third step: dividing the sentences into m categories by a k-means clustering mode;
obtaining m sentence categories which represent different key information, and then respectively carrying out multi-sentence compression operation on the sentence sets of the m categories to obtain m abstracts;
step 3.4: and generating a summary from the extracted subgraph.
5. The method as claimed in claim 4, wherein in step 3.2, a graph G(V, E) is constructed, where node v_i ∈ V represents a sentence, V is the set of nodes, e_{i,j} ∈ E represents the edge between node v_i and node v_j, and E is the set of edges; for any two distinct nodes v_i and v_j, if the sentences they represent have any of the following relations, they are connected to each other by an edge with value 1, i.e. e_{i,j} = 1;
Wherein, the graph G construction rule comprises the following steps:
de-lexical noun association: according to English grammar, when an event or entity is mentioned in a verb phrase, it is usually represented as a dependent noun or noun phrase of that verb in following sentences; the noun form of the verb phrase is found through WordNet; if the noun form of a verb phrase of a sentence appears in a later sentence, the nodes represented by the two sentences are connected to each other;
entity continuation: this rule takes the relevance of words into account; if sentence v_i and sentence v_j contain the same entity class, the two nodes are connected to each other;
discourse markers: if adjacent sentences have a semantic relation, the nodes represented by the two sentences are connected to each other;
sentence similarity: the sentence representation is obtained by averaging all word vectors of a sentence, and the similarity score of two sentences is the cosine similarity of their sentence vectors; if the similarity score reaches a set threshold, the two nodes are judged to be connected to each other.
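The "sentence similarity" rule can be sketched as follows (illustrative; the 0.8 threshold and the toy word vectors are assumptions of the example):

```python
import numpy as np

def sentence_vector(tokens, word_vectors):
    """Average of the word vectors as the sentence representation."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0)

def similarity_edge(tokens_a, tokens_b, word_vectors, threshold=0.8):
    """Return 1 (draw an edge, e_ij = 1) if the cosine similarity of the
    two sentence vectors reaches the threshold, else 0.
    The threshold value is an illustrative placeholder."""
    va = sentence_vector(tokens_a, word_vectors)
    vb = sentence_vector(tokens_b, word_vectors)
    cos = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return 1 if cos >= threshold else 0
```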
6. The method for automatically generating an unsupervised scientific and technical intelligence abstract based on multisentence compression as claimed in claim 4, wherein the step 3.4 of generating the abstract comprises the following steps:
the first step is as follows: constructing a word graph;
for a sentence set S = {s1, s2, …, sn}, each word appearing in the sentences is mapped to a node; since word-sense ambiguity is widespread in natural language, each node uses a bigram (token, tag) as its identifier; whenever a repeated word is encountered, the word graph is adjusted according to the following rules:
directly establishing a new node for words which are not stop words, non punctuations and have no candidate nodes;
for words of non-stop words, non-punctuation and only one candidate node, directly mapping the words to the candidate node;
for words that are not stop words, are not punctuation, and have multiple candidate nodes: the word is mapped to the node whose context is closest, while keeping the word graph acyclic, i.e. two identical words from the same sentence cannot be mapped to the same node; if no node meets the condition, a new node is created;
for stop words and punctuation, if a node with the same context exists, the word is mapped to that node; otherwise, a new node is created;
regarding the weight of the edges between nodes, the co-occurrence probability between nodes is considered: the larger the co-occurrence probability of two nodes, the smaller their edge weight; when an edge exists between two nodes, any multi-hop connection between them further strengthens the edge, and this strengthening effect weakens as the path length grows, which is expressed by equation 11:
$$w(e_{i,j}) = \frac{freq(i) + freq(j)}{freq(i) \cdot freq(j) \cdot \sum_{s \in S} diff(s,i,j)^{-1}} \qquad (11)$$
where w(e_{i,j}) represents the weight of the edge between node i and node j; freq(i) and freq(j) represent the number of words mapped to node i and node j, respectively; and diff(s, i, j) is the distance between the offset positions, within sentence s, of the words mapped to node i and to node j;
the second step: the recall phase; find the F shortest paths in the word graph, where the sentence formed by each path is a candidate answer;
the problem is solved by Yen's algorithm; the algorithm has two parts: the 1st shortest path P(1) is computed, and the other F−1 shortest paths are then computed in sequence on that basis; when computing P(i+1), every node on P(i) except the terminal node is regarded as a deviation node, the shortest path from each deviation node to the terminal node is computed and spliced with the prefix of P(i) running from the start node to that deviation node to form a candidate path, and the shortest of these deviation paths is then taken; the top 100 ranked paths are selected as the candidate sentence paths;
the third step: re-rank the candidate answers and select the top-ranked candidate as the final answer; key phrases are extracted with TextRank, and a new score is designed for re-ranking; first, each node updates its score using equation 12 until convergence:
$$S(n_i) = (1 - d) + d \sum_{n_j \in adj(n_i)} \frac{w(e_{j,i})}{\sum_{n_k \in adj(n_j)} w(e_{j,k})} \, S(n_j) \qquad (12)$$
where S(n_i) represents the score of node n_i in the word graph; d is the damping coefficient, which may take the value 0.85; adj(n_i) denotes the nodes adjacent to node n_i; and w(e_{j,i}) represents the weight of the edge between node n_j and node n_i;
then, key phrases r are obtained by combining the key words, and the score score(r) is computed as:
$$score(r) = \frac{\sum_{w \in r} TextRank(w)}{length(r) + 1} \qquad (13)$$
where TextRank(w) represents the score of word node w computed by the TextRank algorithm; the denominator normalizes the score by the weighted length length(r) of key phrase r, and this normalization tends to favor longer phrases;
finally, the candidate paths obtained in the second step are re-ranked: the total edge weight of each path is set against the product of its weighted length and the sum of the scores of the key phrases it contains; from the key phrase scores, a final score is calculated for each sentence:
$$score(c) = \frac{w(path(c))}{length(c) \times \sum_{r \in path(c)} score(r)} \qquad (14)$$
where length(c) represents the weighted length of sentence c, path(c) represents the complete path of sentence c, and w(path(c)) is the total edge weight along that path;
the sentence with the minimum score is selected as the generated summary for the category, and the summaries generated for the m categories are finally concatenated to obtain the complete summary.
CN202210275509.1A 2022-03-21 2022-03-21 Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression Pending CN114706972A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210275509.1A CN114706972A (en) 2022-03-21 2022-03-21 Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression


Publications (1)

Publication Number Publication Date
CN114706972A true CN114706972A (en) 2022-07-05

Family

ID=82169773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210275509.1A Pending CN114706972A (en) 2022-03-21 2022-03-21 Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression

Country Status (1)

Country Link
CN (1) CN114706972A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115687960A (en) * 2022-12-30 2023-02-03 中国人民解放军61660部队 Text clustering method for open source security information
CN116127321A (en) * 2023-02-16 2023-05-16 广东工业大学 Training method, pushing method and system for ship news pushing model
CN116541505A (en) * 2023-07-05 2023-08-04 华东交通大学 Dialogue abstract generation method based on self-adaptive dialogue segmentation
CN116541505B (en) * 2023-07-05 2023-09-19 华东交通大学 Dialogue abstract generation method based on self-adaptive dialogue segmentation


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination