CN107133213B - Method and system for automatically extracting text abstract based on algorithm - Google Patents

Method and system for automatically extracting text abstract based on algorithm

Info

Publication number
CN107133213B
Authority
CN
China
Prior art keywords
sentence
text
similarity
sentences
abstract
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710314598.5A
Other languages
Chinese (zh)
Other versions
CN107133213A (en)
Inventor
余珊珊
苏锦钿
连俊玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Pharmaceutical University
Original Assignee
Guangdong Pharmaceutical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Pharmaceutical University filed Critical Guangdong Pharmaceutical University
Priority to CN201710314598.5A
Publication of CN107133213A
Application granted
Publication of CN107133213B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an algorithm-based automatic text abstract extraction method, which relates to the technical field of text extraction and comprises the following steps: S1, preprocessing the text; S2, extracting features of the text; S3, calculating the similarity between sentences with an existing similarity calculation method and applying weighting during the calculation; S4, constructing an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges and the similarities as edge weights, and iteratively calculating until convergence to obtain a weight value for each node; S5, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, sorting them, and outputting them as the extraction result. The invention also discloses a system for extracting the abstract. The method and system help improve the accuracy of automatic text abstract extraction.

Description

Method and system for automatically extracting text abstract based on algorithm
Technical Field
The invention relates to the technical field of text extraction, in particular to a text abstract automatic extraction method and system based on an algorithm.
Background
Automatic text abstract extraction based on machine learning has been a hot spot in text mining research in recent years and has very broad application prospects in fields such as search engines, portal websites, the mobile internet and information retrieval systems. Realizing automatic abstract extraction with computer technology allows text information to be effectively mined and condensed, reduces users' reading time and improves the user experience.
Early automatic text abstract extraction mainly used rule-based or statistical machine learning approaches. In recent years, many researchers have improved automatic abstract extraction with various machine learning algorithms, such as regression models (including linear regression and ELM regression), the LDA (Latent Dirichlet Allocation) model, the support vector machine (SVM) and the LexRank algorithm, and have further improved extraction quality by incorporating linguistic research results such as chapter structure, word weight, keywords and topic models. Since linear regression, ELM regression, LDA and the like are supervised learning methods, they are easily affected by the training samples, so their cross-domain generality is poor and they are unsuitable for abstract extraction from massive texts. In 2004, Mihalcea and Tarau, building on the Google PageRank algorithm and research on automatic abstract extraction, proposed the unsupervised learning algorithm TextRank, which essentially constructs a TextRank network graph from the similarity between sentences and treats that similarity as a recommendation or voting relationship. Some researchers have since applied TextRank to information retrieval, keyword extraction and related tasks with good results. However, the representation of text in these works mainly adopts a bag-of-words approach, i.e. one-of-V encoding (where V is the size of the dictionary), relying mainly on co-occurrence information between words while neglecting word order and semantics. For example, similarity between words cannot be expressed (the vector inner product of any two different words is 0), and the dimension of the word vectors easily becomes excessively large.
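For illustration only (this example is not part of the patent), the limitation of the one-of-V representation can be seen with a toy vocabulary: every pair of distinct words has inner product 0, and the vector dimension grows with the dictionary size.

```python
import numpy as np

# Toy one-of-V (one-hot) representation; the vector dimension equals the
# dictionary size V, so realistic dictionaries yield very high-dimensional vectors.
vocab = ["text", "abstract", "summary", "sentence"]
V = len(vocab)

def one_hot(word):
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

# Near-synonyms still have inner product 0, so word-to-word similarity
# cannot be expressed in this representation.
print(one_hot("abstract") @ one_hot("summary"))   # 0.0
print(one_hot("abstract") @ one_hot("abstract"))  # 1.0
```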
Chinese patent application CN104216875A discloses a microblog text automatic summarization method based on unsupervised extraction of key binary word strings, comprising: preprocessing the microblog; normalizing the binary word strings; extracting key binary word strings based on a mixture of TF-IDF, TextRank and LDA; ranking sentences based on an intersection-similarity and mutual-information strategy; extracting abstract sentences based on a similarity threshold; and combining the abstract sentences to generate the abstract. That application is still confined to the traditional framework for automatic text abstract extraction and cannot solve problems such as the curse of dimensionality.
Another Chinese patent application, CN200710130576.X, discloses a data processing apparatus comprising a first unsupervised learning processing unit, a second unsupervised learning processing unit and a supervised learning processing unit. The first unsupervised learning processing unit classifies the data of a first data group by unsupervised learning so as to reduce the dimensionality of the first data group, obtaining a first classified data group. The second unsupervised learning processing unit classifies the data of a second data group by unsupervised learning so as to reduce the dimensionality of the second data group, obtaining a second classified data group. The supervised learning processing unit performs supervised learning using the first and second classified data groups as teacher data so as to determine a mapping relationship between them. That application can reduce data dimensionality, but it does not provide a method or system that can be effectively applied to automatic text abstract extraction.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an algorithm-based method and system for automatic text abstract extraction that combine the Doc2Vec and TextRank algorithms and improve the accuracy of automatic text abstract extraction.
In order to achieve the purpose, the invention adopts the following technical scheme:
an algorithm-based text abstract automatic extraction method comprises the following steps:
S1, preprocessing the text, wherein the preprocessing comprises paragraph segmentation, sentence segmentation and word segmentation of the text, and extraction of the chapter structure information of the text;
S2, extracting features from the preprocessed text, as follows: word vectors and paragraph vectors of the words in each sentence are learned through the Doc2Vec algorithm and its corresponding neural network model, so that each sentence corresponds to a dense, continuous, real-valued paragraph vector of specified dimension, which is taken as the feature representation of the sentence;
S3, after feature extraction is completed, calculating the similarity between sentences in the text with an existing similarity calculation method, applying weighting during the calculation in combination with the chapter structure of the text and the positions of the sentences, and obtaining the sentence similarity matrix of the text once the weighted similarity calculation is completed;
S4, according to the sentence similarity matrix, constructing an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges and the similarities between sentences as edge weights, and iteratively calculating until convergence to obtain the weight value of each node;
and S5, in combination with the set abstract length parameters, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, sorting them by their order of appearance, and outputting them as the text abstract extraction result, wherein the abstract length parameters comprise the number of abstract words, the number of abstract sentences and the proportion of abstract sentences to the total number of sentences in the article.
Further, in S3, the weighting applied when calculating the similarity follows these principles: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, its similarity calculation result is not weighted; 3) when the similarity between a sentence and the text title lies between the above two cases, the similarity of the sentence is weighted by the following formula:
[Formula image in the original document (BDA0001288098790000041): weighting coefficient defined in terms of sim(P0h', Pih')]
wherein P0h' and Pih' respectively denote the feature vectors, of length h', of the title sentence and of the i-th sentence, and sim denotes the vector product of the feature vectors of the two sentences;
4) sentences located in the first and last paragraphs of the text are weighted according to their forward-order and reverse-order positions, with the weighting coefficient calculated as follows:
[Formula image in the original document (BDA0001288098790000042): position-based weighting coefficient in terms of e1, e2, s and r]
wherein e1 and e2 are set thresholds greater than 0 and less than 1, and s and r are the numbers of sentences in the first and last paragraphs; 5) the weight of a key sentence is enlarged by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted.
Further, in S1, the paragraph segmentation, sentence segmentation and word segmentation of the text are specifically performed as follows: each sentence in the text is numbered, the text is divided into paragraphs and sentences according to punctuation marks, and the text is divided into words according to its encoding using a word segmentation tool.
Further, in S1, preprocessing the text further comprises punctuation filtering, abbreviation completion and space deletion.
An algorithm-based automatic text abstract extraction system, comprising:
the preprocessing module, used for paragraph segmentation, sentence segmentation and word segmentation of the text, and for extracting the chapter structure information of the text;
the Doc2Vec-based feature extraction module, used for learning word vectors and paragraph vectors of the words in each sentence through the Doc2Vec algorithm and its corresponding neural network model, so that each sentence corresponds to a dense, continuous, real-valued paragraph vector of specified dimension, which serves as the feature representation of the sentence;
the similarity calculation module, used for calculating the similarity between sentences in the text with an existing similarity calculation method, applying weighting during the calculation in combination with the chapter structure of the text and the positions of the sentences, and obtaining the sentence similarity matrix of the text once the weighted similarity calculation is completed;
the TextRank-based weight value calculation module, used for constructing, according to the sentence similarity matrix, an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges and the similarities between sentences as edge weights; and further used for iteratively calculating until convergence to obtain the weight value of each node;
and the abstract extraction module, used for selecting, in combination with the set abstract length parameters, core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, sorting them by their order of appearance, and outputting them as the text abstract extraction result, wherein the abstract length parameters comprise the number of abstract words, the number of abstract sentences and the proportion of abstract sentences to the total number of sentences in the article.
Further, the weighting applied by the similarity calculation module when calculating the similarity follows these principles: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, its similarity calculation result is not weighted; 3) when the similarity between a sentence and the text title lies between the above two cases, the similarity of the sentence is weighted with the weighting coefficient calculated by the following formula:
[Formula image in the original document (BDA0001288098790000061): weighting coefficient defined in terms of sim(P0h', Pih')]
wherein P0h' and Pih' respectively denote the feature vectors, of length h', of the title sentence and of the i-th sentence, and sim denotes the vector product of the feature vectors of the two sentences;
4) sentences located in the first and last paragraphs of the text are weighted according to their forward-order and reverse-order positions, with the weighting coefficient calculated as follows:
[Formula image in the original document (BDA0001288098790000062): position-based weighting coefficient in terms of e1, e2, s and r]
wherein e1 and e2 are set thresholds greater than 0 and less than 1, and s and r are the numbers of sentences in the first and last paragraphs;
5) the weight of a key sentence is enlarged by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted.
Further, the paragraph segmentation, sentence segmentation and word segmentation of the text in the preprocessing module are performed as follows: each sentence in the text is numbered, the text is divided into paragraphs and sentences according to punctuation marks, and the text is divided into words according to its encoding using a word segmentation tool.
Furthermore, the preprocessing module is also used for punctuation filtering, abbreviation completion and space deletion on the text.
The invention has the following beneficial effects. The Doc2Vec algorithm, based on Word2Vec and word embedding, learns from the variable-length sentences of the text to obtain low-dimensional, dense, real-valued vectors containing semantic information, which serve as fixed-size feature representations of the sentences; compared with traditional bag-of-words and word-frequency representations, the sentence paragraph vectors obtained with Doc2Vec not only capture the semantic information of the words but also avoid the curse-of-dimensionality problem in the feature representation and reduce the workload of similarity calculation. Doc2Vec thus has advantages in expressing sentence semantics and reducing feature dimensionality, while TextRank, as an unsupervised method, requires no pre-training, does not depend on a specific corpus and computes efficiently. The invention organically combines Doc2Vec and TextRank and applies them to automatic text abstract extraction, and further optimizes the TextRank network graph with information such as the chapter structure, giving the method high accuracy and high computation speed; compared with traditional automatic text abstract extraction methods/systems, the accuracy can be improved by 3%-5%, a significant improvement in this field.
Drawings
FIG. 1 is a block diagram of a method for automatically extracting a text abstract based on an algorithm according to the present invention;
FIG. 2 is a block diagram of an algorithm-based automatic text summarization system according to the present invention;
FIG. 3 is a block diagram of a process for pre-processing text in accordance with the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and the detailed description below:
example 1
As shown in fig. 1, an algorithm-based text abstract automatic extraction method includes the following steps:
S1, preprocessing the text, as shown in fig. 3; the preprocessing comprises: numbering each sentence in the text, dividing the text into paragraphs and sentences according to punctuation marks, and dividing the text into words according to its encoding using a word segmentation tool; extracting the chapter structure information of the text; and performing punctuation filtering, abbreviation completion, stemming and space deletion on the text;
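A minimal sketch of this preprocessing step is given below for illustration; it is not the patent's implementation. The use of jieba as the word segmentation tool and the specific sentence-ending punctuation set are assumptions.

```python
import re
import jieba  # assumed Chinese word-segmentation tool; any segmenter could be substituted

def preprocess(text):
    """Number sentences, split the text into paragraphs/sentences, and segment words."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    sentences, chapter_info = [], []
    for p_idx, para in enumerate(paragraphs):
        for pos, raw in enumerate(re.split(r"[。！？!?]", para)):
            sent = raw.strip()
            if not sent:
                continue  # sentences that become empty receive no weighting later
            words = [w for w in jieba.cut(sent) if w.strip()]
            sentences.append({"id": len(sentences), "text": sent, "words": words})
            # chapter-structure information: paragraph index and position inside it
            chapter_info.append({"para": p_idx, "pos_in_para": pos})
    return sentences, chapter_info
```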
S2, extracting features from the preprocessed text, as follows: word vectors and paragraph vectors of the words in each sentence are learned through the Doc2Vec algorithm and its corresponding neural network model, so that each sentence corresponds to a dense, continuous, real-valued paragraph vector of specified dimension, which is taken as the feature representation of the sentence;
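The following sketch shows one possible realization of this step using the gensim library's Doc2Vec implementation; the library choice and the hyperparameter values (vector dimension, epochs, window) are assumptions, not part of the patent.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def sentence_vectors(sentences, dim=100):
    """Learn a dense, fixed-dimension paragraph vector as each sentence's feature."""
    corpus = [TaggedDocument(words=s["words"], tags=[s["id"]]) for s in sentences]
    model = Doc2Vec(vector_size=dim, window=5, min_count=1, epochs=40)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    # model.dv[i] is the low-dimensional dense real-valued vector of sentence i
    return {s["id"]: model.dv[s["id"]] for s in sentences}
```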
S3, after feature extraction is completed, calculating the similarity between sentences in the text with an existing similarity calculation method (such as the cosine function, Euclidean distance or the Jaccard function), applying weighting during the calculation in combination with the chapter structure of the text and the positions of the sentences, where the weighting follows these principles: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, its similarity calculation result is not weighted; 3) when the similarity between a sentence and the text title lies between the above two cases, the similarity of the sentence is weighted by the following formula:
[Formula image in the original document (BDA0001288098790000081): weighting coefficient defined in terms of sim(P0h', Pih')]
wherein P0h' and Pih' respectively denote the feature vectors, of length h', of the title sentence and of the i-th sentence, and sim denotes the vector product of the feature vectors of the two sentences;
4) sentences located in the first and last paragraphs of the text are weighted according to their forward-order and reverse-order positions, with the weighting coefficient calculated as follows:
[Formula image in the original document (BDA0001288098790000082): position-based weighting coefficient in terms of e1, e2, s and r]
wherein e1 and e2 are set thresholds greater than 0 and less than 1, with default values of 0.2, and s and r are the numbers of sentences in the first and last paragraphs; 5) the weight of a key sentence is enlarged by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted; once the weighted similarity calculation is completed, the sentence similarity matrix of the text is obtained;
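An illustrative sketch of the weighted similarity matrix of S3 follows. Cosine similarity is used, and only the title-based weighting rules 1)-3) are approximated; the exact weighting formulas appear only as images in the original document, so the "1 + similarity" form used for rule 3) and the omission of the positional rules 4)-6) are assumptions made purely for illustration.

```python
import numpy as np

def weighted_similarity_matrix(vecs, title_vec):
    """Sentence similarity matrix with a simplified title-based weighting."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    n = len(vecs)
    title_w = np.ones(n)
    for i in range(n):
        s = cos(vecs[i], title_vec)
        if s >= 0.999:          # rule 1): sentence coincides with the title
            title_w[i] = 2.0
        elif s > 0.0:           # rule 3): assumed proxy for the image-only formula
            title_w[i] = 1.0 + s
        # rule 2): similarity 0 to the title leaves the weight unchanged

    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                M[i, j] = cos(vecs[i], vecs[j]) * title_w[i] * title_w[j]
    return M
```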
S4, according to the sentence similarity matrix, constructing an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges and the similarities between sentences as edge weights, and iteratively calculating until convergence to obtain the weight value of each node;
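For S4, the graph construction and the iteration to convergence can be sketched with the networkx library; using its PageRank routine as the weighted TextRank iteration, and the damping factor of 0.85, are assumptions for illustration.

```python
import networkx as nx

def textrank_weights(sim_matrix, damping=0.85, tol=1e-6):
    """Undirected weighted TextRank graph; iterate until convergence for node weights."""
    n = sim_matrix.shape[0]
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if sim_matrix[i, j] > 0:
                # the (weighted) similarity between sentences is the edge weight
                graph.add_edge(i, j, weight=float(sim_matrix[i, j]))
    # weighted PageRank iteration until convergence gives each node's weight value
    return nx.pagerank(graph, alpha=damping, tol=tol, weight="weight")
```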
S5, sorting the weight values of the nodes in descending order and, in combination with the set abstract length parameters, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, wherein the abstract length parameters comprise the number of abstract words, the number of abstract sentences and the proportion of abstract sentences to the total number of sentences in the article; finally, the core sentences are sorted by their order of appearance and output as the text abstract extraction result.
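A minimal sketch of the selection and output in S5 follows; it uses only a sentence-count budget, whereas the patent also allows a word-count budget and a sentence ratio and additionally consults the chapter structure and sentence positions, so those refinements are omitted here.

```python
def extract_abstract(sentences, node_weights, max_sentences=3):
    """Pick the highest-weighted sentences and output them in order of appearance."""
    # sort node weight values in descending order and keep the budgeted number
    top = sorted(node_weights, key=node_weights.get, reverse=True)[:max_sentences]
    # re-order the selected core sentences by their position in the original text
    core = sorted(top)
    return "。".join(sentences[i]["text"] for i in core) + "。"
```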
Example 2
As shown in fig. 2 and fig. 3, an algorithm-based automatic text abstract extraction system comprises a preprocessing module, a Doc2Vec-based feature extraction module, a similarity calculation module, a TextRank-based weight value calculation module and an abstract extraction module. The preprocessing module is used for paragraph segmentation, sentence segmentation and word segmentation of the text, for extracting the chapter structure information of the text, and for performing punctuation filtering, abbreviation completion, stemming and space deletion on the text; the paragraph, sentence and word segmentation in the preprocessing module is performed as follows: each sentence in the text is numbered, the text is divided into paragraphs and sentences according to punctuation marks, and the text is divided into words according to its encoding using a word segmentation tool. The Doc2Vec-based feature extraction module is used for learning word vectors and paragraph vectors of the words in each sentence through the Doc2Vec algorithm and its corresponding neural network model, so that each sentence corresponds to a dense, continuous, real-valued paragraph vector of specified dimension, which serves as the feature representation of the sentence. The similarity calculation module is used for calculating the similarity between sentences in the text with an existing similarity calculation method (such as the cosine function, Euclidean distance or the Jaccard function), applying weighting during the calculation in combination with the chapter structure of the text and the positions of the sentences, where the weighting follows these principles: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, its similarity calculation result is not weighted; 3) when the similarity between a sentence and the text title lies between the above two cases, the similarity of the sentence is weighted with the weighting coefficient calculated by the following formula:
[Formula image in the original document (BDA0001288098790000101): weighting coefficient defined in terms of sim(P0h', Pih')]
wherein P0h' and Pih' respectively denote the feature vectors, of length h', of the title sentence and of the i-th sentence, and sim denotes the vector product of the feature vectors of the two sentences;
4) sentences located in the first and last paragraphs of the text are weighted according to their forward-order and reverse-order positions, with the weighting coefficient calculated as follows:
[Formula image in the original document (BDA0001288098790000102): position-based weighting coefficient in terms of e1, e2, s and r]
wherein e1 and e2 are set thresholds greater than 0 and less than 1, with default values of 0.2, and s and r are the numbers of sentences in the first and last paragraphs; 5) the weight of a key sentence is enlarged by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted; once the weighted similarity calculation is completed, the sentence similarity matrix of the text is obtained. The TextRank-based weight value calculation module is used for constructing, according to the sentence similarity matrix, an undirected weighted TextRank network graph with each sentence of the text as a node, the similarity relations among sentences as edges and the similarities between sentences as edge weights; it is further used for iteratively calculating until convergence to obtain the weight value of each node. The abstract extraction module is used for sorting the weight values of the nodes in descending order and, in combination with the set abstract length parameters, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, wherein the abstract length parameters comprise the number of abstract words, the number of abstract sentences and the proportion of abstract sentences to the total number of sentences in the article; finally, the core sentences are sorted by their order of appearance and output as the text abstract extraction result.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.

Claims (6)

1. An algorithm-based text abstract automatic extraction method is characterized by comprising the following steps:
S1, preprocessing the text, wherein the preprocessing comprises paragraph segmentation, sentence segmentation and word segmentation of the text, and extraction of the chapter structure information of the text;
S2, extracting features from the preprocessed text, as follows: word vectors and paragraph vectors of the words in each sentence are learned through the Doc2Vec algorithm and its corresponding neural network model, so that each sentence corresponds to a dense, continuous, real-valued paragraph vector of specified dimension, which is taken as the feature representation of the sentence;
S3, after feature extraction is completed, calculating the similarity between sentences in the text with an existing similarity calculation method, applying weighting during the calculation in combination with the chapter structure of the text and the positions of the sentences, and obtaining the sentence similarity matrix of the text once the weighted similarity calculation is completed;
S4, according to the sentence similarity matrix, constructing an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges and the similarities between sentences as edge weights, and iteratively calculating until convergence to obtain the weight value of each node;
S5, in combination with the set abstract length parameters, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, sorting them by their order of appearance, and outputting them as the text abstract extraction result;
in S3, the weighting applied when calculating the similarity follows these principles: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, its similarity calculation result is not weighted; 3) when the similarity between a sentence and the text title lies between the above two cases, the similarity of the sentence is weighted by the following formula:
[Formula image in the original document (FDA0002590232610000021): weighting coefficient defined in terms of sim(P0h', Pih')]
wherein P0h' and Pih' respectively denote the feature vectors, of length h', of the title sentence and of the i-th sentence, and sim denotes the vector product of the feature vectors of the two sentences;
4) sentences located in the first and last paragraphs of the text are weighted according to their forward-order and reverse-order positions, with the weighting coefficient calculated as follows:
[Formula image in the original document (FDA0002590232610000022): position-based weighting coefficient in terms of e1, e2, s and r]
wherein e1 and e2 are set thresholds greater than 0 and less than 1, and s and r are the numbers of sentences in the first and last paragraphs; 5) the weight of a key sentence is enlarged by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted.
2. The method for automatically extracting a text abstract according to claim 1, wherein in S1, the paragraph segmentation, sentence segmentation and word segmentation of the text are specifically performed as follows: each sentence in the text is numbered, the text is divided into paragraphs and sentences according to punctuation marks, and the text is divided into words according to its encoding using a word segmentation tool.
3. The method for automatically extracting a text abstract according to claim 2, wherein in S1, preprocessing the text further comprises punctuation filtering, abbreviation completion and space deletion.
4. An algorithm-based text abstract automatic extraction system is characterized by comprising:
the preprocessing module, used for paragraph segmentation, sentence segmentation and word segmentation of the text, and for extracting the chapter structure information of the text;
the Doc2Vec-based feature extraction module, used for learning word vectors and paragraph vectors of the words in each sentence through the Doc2Vec algorithm and its corresponding neural network model, so that each sentence corresponds to a dense, continuous, real-valued paragraph vector of specified dimension, which serves as the feature representation of the sentence;
the similarity calculation module, used for calculating the similarity between sentences in the text with an existing similarity calculation method, applying weighting during the calculation in combination with the chapter structure of the text and the positions of the sentences, and obtaining the sentence similarity matrix of the text once the weighted similarity calculation is completed;
the TextRank-based weight value calculation module, used for constructing, according to the sentence similarity matrix, an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges and the similarities between sentences as edge weights; and further used for iteratively calculating until convergence to obtain the weight value of each node;
the abstract extraction module, used for selecting, in combination with the set abstract length parameters, core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, sorting them by their order of appearance, and outputting them as the text abstract extraction result;
the weighting applied by the similarity calculation module when calculating the similarity follows these principles: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, its similarity calculation result is not weighted; 3) when the similarity between a sentence and the text title lies between the above two cases, the similarity of the sentence is weighted with the weighting coefficient calculated by the following formula:
[Formula image in the original document (FDA0002590232610000041): weighting coefficient defined in terms of sim(P0h', Pih')]
wherein P0h' and Pih' respectively denote the feature vectors, of length h', of the title sentence and of the i-th sentence, and sim denotes the vector product of the feature vectors of the two sentences;
4) sentences located in the first and last paragraphs of the text are weighted according to their forward-order and reverse-order positions, with the weighting coefficient calculated as follows:
[Formula image in the original document (FDA0002590232610000042): position-based weighting coefficient in terms of e1, e2, s and r]
wherein e1 and e2 are set thresholds greater than 0 and less than 1, and s and r are the numbers of sentences in the first and last paragraphs; 5) the weight of a key sentence is enlarged by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted.
5. The system for automatically extracting a text abstract according to claim 4, wherein the paragraph segmentation, sentence segmentation and word segmentation of the text in the preprocessing module are performed as follows: each sentence in the text is numbered, the text is divided into paragraphs and sentences according to punctuation marks, and the text is divided into words according to its encoding using a word segmentation tool.
6. The system of claim 5, wherein the preprocessing module is further configured to perform punctuation filtering, abbreviation padding, and space removal on the text.
CN201710314598.5A 2017-05-06 2017-05-06 Method and system for automatically extracting text abstract based on algorithm Expired - Fee Related CN107133213B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710314598.5A CN107133213B (en) 2017-05-06 2017-05-06 Method and system for automatically extracting text abstract based on algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710314598.5A CN107133213B (en) 2017-05-06 2017-05-06 Method and system for automatically extracting text abstract based on algorithm

Publications (2)

Publication Number Publication Date
CN107133213A CN107133213A (en) 2017-09-05
CN107133213B true CN107133213B (en) 2020-09-25

Family

ID=59731409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710314598.5A Expired - Fee Related CN107133213B (en) 2017-05-06 2017-05-06 Method and system for automatically extracting text abstract based on algorithm

Country Status (1)

Country Link
CN (1) CN107133213B (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107526841A (en) * 2017-09-19 2017-12-29 中央民族大学 A kind of Tibetan language text summarization generation method based on Web
CN108062351A (en) * 2017-11-14 2018-05-22 厦门市美亚柏科信息股份有限公司 Text snippet extracting method, readable storage medium storing program for executing on particular topic classification
CN108304445B (en) * 2017-12-07 2021-08-03 新华网股份有限公司 Text abstract generation method and device
CN108363696A (en) * 2018-02-24 2018-08-03 李小明 A kind of processing method and processing device of text message
CN108664465B (en) * 2018-03-07 2023-06-27 珍岛信息技术(上海)股份有限公司 Method and related device for automatically generating text
CN108763191B (en) * 2018-04-16 2022-02-11 华南师范大学 Text abstract generation method and system
CN108920455A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of Chinese automatically generates the automatic evaluation method of text
CN110929022A (en) * 2018-09-18 2020-03-27 阿基米德(上海)传媒有限公司 Text abstract generation method and system
CN109325235B (en) * 2018-10-17 2022-12-02 武汉斗鱼网络科技有限公司 Text abstract extraction method based on word weight and computing device
CN109670047B (en) * 2018-11-19 2022-09-20 内蒙古大学 Abstract note generation method, computer device and readable storage medium
CN109684642B (en) * 2018-12-26 2023-01-13 重庆电信系统集成有限公司 Abstract extraction method combining page parsing rule and NLP text vectorization
CN111435405A (en) * 2019-01-15 2020-07-21 北京行数通科技有限公司 Method and device for automatically labeling key sentences of article
CN109977194B (en) * 2019-03-20 2021-08-10 华南理工大学 Text similarity calculation method, system, device and medium based on unsupervised learning
CN110162778B (en) * 2019-04-02 2023-05-26 创新先进技术有限公司 Text abstract generation method and device
CN110008313A (en) * 2019-04-11 2019-07-12 重庆华龙网海数科技有限公司 A kind of unsupervised text snippet method of extraction-type
CN110264792B (en) * 2019-06-17 2021-11-09 上海元趣信息技术有限公司 Intelligent tutoring system for composition of pupils
CN110287280B (en) * 2019-06-24 2023-09-29 腾讯科技(深圳)有限公司 Method and device for analyzing words in article, storage medium and electronic equipment
CN110263343B (en) * 2019-06-24 2021-06-15 北京理工大学 Phrase vector-based keyword extraction method and system
CN110728143A (en) * 2019-09-23 2020-01-24 上海蜜度信息技术有限公司 Method and equipment for identifying document key sentences
CN110737768B (en) * 2019-10-16 2022-04-08 信雅达科技股份有限公司 Text abstract automatic generation method and device based on deep learning and storage medium
CN111125349A (en) * 2019-12-17 2020-05-08 辽宁大学 Graph model text abstract generation method based on word frequency and semantics
CN111125424B (en) * 2019-12-26 2024-01-09 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and storage medium for extracting core lyrics of song
CN111159393B (en) * 2019-12-30 2023-10-10 电子科技大学 Text generation method for abstract extraction based on LDA and D2V
CN111241268B (en) * 2020-01-21 2023-04-14 上海七印信息科技有限公司 Automatic text abstract generation method
CN112016292B (en) * 2020-09-09 2022-10-11 平安科技(深圳)有限公司 Method and device for setting article interception point and computer equipment
CN112231468A (en) * 2020-10-15 2021-01-15 平安科技(深圳)有限公司 Information generation method and device, electronic equipment and storage medium
CN113076734B (en) * 2021-04-15 2023-01-20 云南电网有限责任公司电力科学研究院 Similarity detection method and device for project texts
CN113032584B (en) * 2021-05-27 2021-09-17 北京明略软件系统有限公司 Entity association method, entity association device, electronic equipment and storage medium
CN113254593B (en) * 2021-06-18 2021-10-19 平安科技(深圳)有限公司 Text abstract generation method and device, computer equipment and storage medium
CN117390173B (en) * 2023-11-02 2024-03-29 江苏优丞信息科技有限公司 Massive resume screening method for semantic similarity matching

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
AU2011203510A1 (en) * 2010-07-23 2012-02-09 Sony Corporation Information processing device, information processing method, and information processing program
CN103617158A (en) * 2013-12-17 2014-03-05 苏州大学张家港工业技术研究院 Method for generating emotion abstract of dialogue text
CN104503958A (en) * 2014-11-19 2015-04-08 百度在线网络技术(北京)有限公司 Method and device for generating document summarization
CN106227722A (en) * 2016-09-12 2016-12-14 中山大学 A kind of extraction method based on listed company's bulletin summary

Also Published As

Publication number Publication date
CN107133213A (en) 2017-09-05

Similar Documents

Publication Publication Date Title
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN110413986B (en) Text clustering multi-document automatic summarization method and system for improving word vector model
CN107066553B (en) Short text classification method based on convolutional neural network and random forest
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
CN111259127B (en) Long text answer selection method based on transfer learning sentence vector
CN106776562A (en) A kind of keyword extracting method and extraction system
CN106598940A (en) Text similarity solution algorithm based on global optimization of keyword quality
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN106708929B (en) Video program searching method and device
CN110879834B (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
WO2017193685A1 (en) Method and device for data processing in social network
CN111291188A (en) Intelligent information extraction method and system
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN111581392B (en) Automatic composition scoring calculation method based on statement communication degree
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
CN106570196B (en) Video program searching method and device
Wang et al. Named entity recognition method of brazilian legal text based on pre-training model
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
Chai Design and implementation of English intelligent communication platform based on similarity algorithm
CN115269834A (en) High-precision text classification method and device based on BERT

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200925
Termination date: 20210506