CN107133213B - Method and system for automatically extracting text abstract based on algorithm - Google Patents

Method and system for automatically extracting text abstract based on algorithm

Info

Publication number
CN107133213B
Authority
CN
China
Prior art keywords
sentence
text
similarity
sentences
abstract
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710314598.5A
Other languages
Chinese (zh)
Other versions
CN107133213A (en)
Inventor
余珊珊
苏锦钿
连俊玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Pharmaceutical University
Original Assignee
Guangdong Pharmaceutical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Pharmaceutical University filed Critical Guangdong Pharmaceutical University
Priority to CN201710314598.5A
Publication of CN107133213A
Application granted
Publication of CN107133213B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an algorithm-based automatic text abstract extraction method, which relates to the technical field of text extraction and comprises the following steps: S1, preprocessing the text; S2, extracting features of the text; S3, calculating the similarity between sentences with an existing similarity calculation method and applying weighting during the calculation; S4, constructing an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges and the similarities as edge weights, and iteratively calculating until convergence to obtain a weight value for each node; S5, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, sorting them, and outputting them as the extraction result. The invention also discloses a system for extracting the abstract. The method and system help improve the accuracy of automatic text abstract extraction.

Description

Method and system for automatically extracting text abstract based on algorithm
Technical Field
The invention relates to the technical field of text extraction, in particular to a text abstract automatic extraction method and system based on an algorithm.
Background
Automatic text abstract extraction based on machine learning has been a hot spot in text mining research in recent years and has very broad application prospects in fields such as search engines, portal websites, the mobile internet and information retrieval systems. Realizing automatic abstract extraction with computer technology allows text information to be effectively mined and condensed, reduces users' reading time and improves the user experience.
Early automatic text abstract extraction mainly used rule-based or statistical machine learning approaches. In recent years, many researchers have improved automatic abstract extraction with various machine learning algorithms, such as regression models (including linear regression and ELM regression), the LDA (Latent Dirichlet Allocation) model, the support vector machine (SVM) and the LexRank algorithm, and have further improved extraction quality by incorporating linguistic research results such as chapter structure, word weight, keywords and topic models. Since linear regression, ELM regression, LDA and the like are supervised learning methods, they are easily affected by the training samples, so their cross-domain generality is poor and they are unsuitable for abstract extraction from massive texts. In 2004, Mihalcea and Tarau, building on the Google PageRank algorithm and research on automatic abstract extraction, proposed the unsupervised learning algorithm TextRank, which essentially constructs a TextRank network graph from the similarity between sentences and treats that similarity as a recommendation or voting relationship. Some researchers have since applied TextRank to information retrieval, keyword extraction and related tasks with good results. However, the representation of text in these works mainly adopts a bag-of-words approach, i.e. one-of-V encoding (where V is the size of the dictionary), relying mainly on co-occurrence information between words while neglecting word order and semantics. For example, similarity between words cannot be expressed (the vector inner product of any two different words is 0), and the dimension of the word vectors easily becomes excessively large.
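For illustration only (this example is not part of the patent), the limitation of the one-of-V representation can be seen with a toy vocabulary: every pair of distinct words has inner product 0, and the vector dimension grows with the dictionary size.

```python
import numpy as np

# Toy one-of-V (one-hot) representation; the vector dimension equals the
# dictionary size V, so realistic dictionaries yield very high-dimensional vectors.
vocab = ["text", "abstract", "summary", "sentence"]
V = len(vocab)

def one_hot(word):
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

# Near-synonyms still have inner product 0, so word-to-word similarity
# cannot be expressed in this representation.
print(one_hot("abstract") @ one_hot("summary"))   # 0.0
print(one_hot("abstract") @ one_hot("abstract"))  # 1.0
```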
Chinese patent application CN104216875A discloses a microblog text automatic summarization method based on unsupervised extraction of key binary word strings, comprising: preprocessing the microblog; normalizing the binary word strings; extracting key binary word strings based on a mixture of TF-IDF, TextRank and LDA; ranking sentences based on an intersection-similarity and mutual-information strategy; extracting abstract sentences based on a similarity threshold; and combining the abstract sentences to generate the abstract. That application is still confined to the traditional framework for automatic text abstract extraction and cannot solve problems such as the curse of dimensionality.
Another Chinese patent application, CN200710130576.X, discloses a data processing apparatus comprising a first unsupervised learning processing unit, a second unsupervised learning processing unit and a supervised learning processing unit. The first unsupervised learning processing unit classifies the data of a first data group by unsupervised learning so as to reduce the dimensionality of the first data group, obtaining a first classified data group. The second unsupervised learning processing unit classifies the data of a second data group by unsupervised learning so as to reduce the dimensionality of the second data group, obtaining a second classified data group. The supervised learning processing unit performs supervised learning using the first and second classified data groups as teacher data so as to determine a mapping relationship between them. That application can reduce data dimensionality, but it does not provide a method or system that can be effectively applied to automatic text abstract extraction.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an algorithm-based method and system for automatic text abstract extraction that combine the Doc2Vec and TextRank algorithms and improve the accuracy of automatic text abstract extraction.
In order to achieve the purpose, the invention adopts the following technical scheme:
an algorithm-based text abstract automatic extraction method comprises the following steps:
S1, preprocessing the text, wherein the preprocessing comprises paragraph segmentation, sentence segmentation and word segmentation of the text, and extraction of the chapter structure information of the text;
S2, extracting features from the preprocessed text, as follows: word vectors and paragraph vectors of the words in each sentence are learned through the Doc2Vec algorithm and its corresponding neural network model, so that each sentence corresponds to a dense, continuous, real-valued paragraph vector of specified dimension, which is taken as the feature representation of the sentence;
S3, after feature extraction is completed, calculating the similarity between sentences in the text with an existing similarity calculation method, applying weighting during the calculation in combination with the chapter structure of the text and the positions of the sentences, and obtaining the sentence similarity matrix of the text once the weighted similarity calculation is completed;
S4, according to the sentence similarity matrix, constructing an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges and the similarities between sentences as edge weights, and iteratively calculating until convergence to obtain the weight value of each node;
and S5, in combination with the set abstract length parameters, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, sorting them by their order of appearance, and outputting them as the text abstract extraction result, wherein the abstract length parameters comprise the number of abstract words, the number of abstract sentences and the proportion of abstract sentences to the total number of sentences in the article.
Further, in S3, the weighting applied when calculating the similarity follows these principles: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, its similarity calculation result is not weighted; 3) when the similarity between a sentence and the text title lies between the above two cases, the similarity of the sentence is weighted by the following formula:
[Formula image in the original document (BDA0001288098790000041): weighting coefficient defined in terms of sim(P0h', Pih')]
wherein P0h' and Pih' respectively denote the feature vectors, of length h', of the title sentence and of the i-th sentence, and sim denotes the vector product of the feature vectors of the two sentences;
4) sentences located in the first and last paragraphs of the text are weighted according to their forward-order and reverse-order positions, with the weighting coefficient calculated as follows:
[Formula image in the original document (BDA0001288098790000042): position-based weighting coefficient in terms of e1, e2, s and r]
wherein e1 and e2 are set thresholds greater than 0 and less than 1, and s and r are the numbers of sentences in the first and last paragraphs; 5) the weight of a key sentence is enlarged by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted.
Further, in S1, the paragraph segmentation, sentence segmentation and word segmentation of the text are specifically performed as follows: each sentence in the text is numbered, the text is divided into paragraphs and sentences according to punctuation marks, and the text is divided into words according to its encoding using a word segmentation tool.
Further, in S1, preprocessing the text further comprises punctuation filtering, abbreviation completion and space deletion.
An algorithm-based automatic text abstract extraction system, comprising:
the preprocessing module, used for paragraph segmentation, sentence segmentation and word segmentation of the text, and for extracting the chapter structure information of the text;
the Doc2Vec-based feature extraction module, used for learning word vectors and paragraph vectors of the words in each sentence through the Doc2Vec algorithm and its corresponding neural network model, so that each sentence corresponds to a dense, continuous, real-valued paragraph vector of specified dimension, which serves as the feature representation of the sentence;
the similarity calculation module, used for calculating the similarity between sentences in the text with an existing similarity calculation method, applying weighting during the calculation in combination with the chapter structure of the text and the positions of the sentences, and obtaining the sentence similarity matrix of the text once the weighted similarity calculation is completed;
the TextRank-based weight value calculation module, used for constructing, according to the sentence similarity matrix, an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges and the similarities between sentences as edge weights; and further used for iteratively calculating until convergence to obtain the weight value of each node;
and the abstract extraction module, used for selecting, in combination with the set abstract length parameters, core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, sorting them by their order of appearance, and outputting them as the text abstract extraction result, wherein the abstract length parameters comprise the number of abstract words, the number of abstract sentences and the proportion of abstract sentences to the total number of sentences in the article.
Further, the weighting applied by the similarity calculation module when calculating the similarity follows these principles: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, its similarity calculation result is not weighted; 3) when the similarity between a sentence and the text title lies between the above two cases, the similarity of the sentence is weighted with the weighting coefficient calculated by the following formula:
[Formula image in the original document (BDA0001288098790000061): weighting coefficient defined in terms of sim(P0h', Pih')]
wherein P0h' and Pih' respectively denote the feature vectors, of length h', of the title sentence and of the i-th sentence, and sim denotes the vector product of the feature vectors of the two sentences;
4) sentences located in the first and last paragraphs of the text are weighted according to their forward-order and reverse-order positions, with the weighting coefficient calculated as follows:
[Formula image in the original document (BDA0001288098790000062): position-based weighting coefficient in terms of e1, e2, s and r]
wherein e1 and e2 are set thresholds greater than 0 and less than 1, and s and r are the numbers of sentences in the first and last paragraphs;
5) the weight of a key sentence is enlarged by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted.
Further, the paragraph segmentation, sentence segmentation and word segmentation of the text in the preprocessing module are performed as follows: each sentence in the text is numbered, the text is divided into paragraphs and sentences according to punctuation marks, and the text is divided into words according to its encoding using a word segmentation tool.
Furthermore, the preprocessing module is also used for punctuation filtering, abbreviation completion and space deletion on the text.
The invention has the following beneficial effects. The Doc2Vec algorithm, based on Word2Vec and word embedding, learns from the variable-length sentences of the text to obtain low-dimensional, dense, real-valued vectors containing semantic information, which serve as fixed-size feature representations of the sentences; compared with traditional bag-of-words and word-frequency representations, the sentence paragraph vectors obtained with Doc2Vec not only capture the semantic information of the words but also avoid the curse-of-dimensionality problem in the feature representation and reduce the workload of similarity calculation. Doc2Vec thus has advantages in expressing sentence semantics and reducing feature dimensionality, while TextRank, as an unsupervised method, requires no pre-training, does not depend on a specific corpus and computes efficiently. The invention organically combines Doc2Vec and TextRank and applies them to automatic text abstract extraction, and further optimizes the TextRank network graph with information such as the chapter structure, giving the method high accuracy and high computation speed; compared with traditional automatic text abstract extraction methods/systems, the accuracy can be improved by 3%-5%, a significant improvement in this field.
Drawings
FIG. 1 is a block diagram of a method for automatically extracting a text abstract based on an algorithm according to the present invention;
FIG. 2 is a block diagram of an algorithm-based automatic text summarization system according to the present invention;
FIG. 3 is a block diagram of a process for pre-processing text in accordance with the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and the detailed description below:
example 1
As shown in fig. 1, an algorithm-based text abstract automatic extraction method includes the following steps:
S1, preprocessing the text, as shown in fig. 3; the preprocessing comprises: numbering each sentence in the text, dividing the text into paragraphs and sentences according to punctuation marks, and dividing the text into words according to its encoding using a word segmentation tool; extracting the chapter structure information of the text; and performing punctuation filtering, abbreviation completion, stemming and space deletion on the text;
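A minimal sketch of this preprocessing step is given below for illustration; it is not the patent's implementation. The use of jieba as the word segmentation tool and the specific sentence-ending punctuation set are assumptions.

```python
import re
import jieba  # assumed Chinese word-segmentation tool; any segmenter could be substituted

def preprocess(text):
    """Number sentences, split the text into paragraphs/sentences, and segment words."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    sentences, chapter_info = [], []
    for p_idx, para in enumerate(paragraphs):
        for pos, raw in enumerate(re.split(r"[。！？!?]", para)):
            sent = raw.strip()
            if not sent:
                continue  # sentences that become empty receive no weighting later
            words = [w for w in jieba.cut(sent) if w.strip()]
            sentences.append({"id": len(sentences), "text": sent, "words": words})
            # chapter-structure information: paragraph index and position inside it
            chapter_info.append({"para": p_idx, "pos_in_para": pos})
    return sentences, chapter_info
```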
S2, extracting features from the preprocessed text, as follows: word vectors and paragraph vectors of the words in each sentence are learned through the Doc2Vec algorithm and its corresponding neural network model, so that each sentence corresponds to a dense, continuous, real-valued paragraph vector of specified dimension, which is taken as the feature representation of the sentence;
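The following sketch shows one possible realization of this step using the gensim library's Doc2Vec implementation; the library choice and the hyperparameter values (vector dimension, epochs, window) are assumptions, not part of the patent.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def sentence_vectors(sentences, dim=100):
    """Learn a dense, fixed-dimension paragraph vector as each sentence's feature."""
    corpus = [TaggedDocument(words=s["words"], tags=[s["id"]]) for s in sentences]
    model = Doc2Vec(vector_size=dim, window=5, min_count=1, epochs=40)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    # model.dv[i] is the low-dimensional dense real-valued vector of sentence i
    return {s["id"]: model.dv[s["id"]] for s in sentences}
```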
S3, after feature extraction is completed, calculating the similarity between sentences in the text with an existing similarity calculation method (such as the cosine function, Euclidean distance or the Jaccard function), applying weighting during the calculation in combination with the chapter structure of the text and the positions of the sentences, where the weighting follows these principles: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, its similarity calculation result is not weighted; 3) when the similarity between a sentence and the text title lies between the above two cases, the similarity of the sentence is weighted by the following formula:
[Formula image in the original document (BDA0001288098790000081): weighting coefficient defined in terms of sim(P0h', Pih')]
wherein P0h' and Pih' respectively denote the feature vectors, of length h', of the title sentence and of the i-th sentence, and sim denotes the vector product of the feature vectors of the two sentences;
4) sentences located in the first and last paragraphs of the text are weighted according to their forward-order and reverse-order positions, with the weighting coefficient calculated as follows:
[Formula image in the original document (BDA0001288098790000082): position-based weighting coefficient in terms of e1, e2, s and r]
wherein e1 and e2 are set thresholds greater than 0 and less than 1, with default values of 0.2, and s and r are the numbers of sentences in the first and last paragraphs; 5) the weight of a key sentence is enlarged by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted; once the weighted similarity calculation is completed, the sentence similarity matrix of the text is obtained;
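An illustrative sketch of the weighted similarity matrix of S3 follows. Cosine similarity is used, and only the title-based weighting rules 1)-3) are approximated; the exact weighting formulas appear only as images in the original document, so the "1 + similarity" form used for rule 3) and the omission of the positional rules 4)-6) are assumptions made purely for illustration.

```python
import numpy as np

def weighted_similarity_matrix(vecs, title_vec):
    """Sentence similarity matrix with a simplified title-based weighting."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    n = len(vecs)
    title_w = np.ones(n)
    for i in range(n):
        s = cos(vecs[i], title_vec)
        if s >= 0.999:          # rule 1): sentence coincides with the title
            title_w[i] = 2.0
        elif s > 0.0:           # rule 3): assumed proxy for the image-only formula
            title_w[i] = 1.0 + s
        # rule 2): similarity 0 to the title leaves the weight unchanged

    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                M[i, j] = cos(vecs[i], vecs[j]) * title_w[i] * title_w[j]
    return M
```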
S4, according to the sentence similarity matrix, constructing an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges and the similarities between sentences as edge weights, and iteratively calculating until convergence to obtain the weight value of each node;
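For S4, the graph construction and the iteration to convergence can be sketched with the networkx library; using its PageRank routine as the weighted TextRank iteration, and the damping factor of 0.85, are assumptions for illustration.

```python
import networkx as nx

def textrank_weights(sim_matrix, damping=0.85, tol=1e-6):
    """Undirected weighted TextRank graph; iterate until convergence for node weights."""
    n = sim_matrix.shape[0]
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if sim_matrix[i, j] > 0:
                # the (weighted) similarity between sentences is the edge weight
                graph.add_edge(i, j, weight=float(sim_matrix[i, j]))
    # weighted PageRank iteration until convergence gives each node's weight value
    return nx.pagerank(graph, alpha=damping, tol=tol, weight="weight")
```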
S5, sorting the weight values of the nodes in descending order and, in combination with the set abstract length parameters, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, wherein the abstract length parameters comprise the number of abstract words, the number of abstract sentences and the proportion of abstract sentences to the total number of sentences in the article; finally, the core sentences are sorted by their order of appearance and output as the text abstract extraction result.
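A minimal sketch of the selection and output in S5 follows; it uses only a sentence-count budget, whereas the patent also allows a word-count budget and a sentence ratio and additionally consults the chapter structure and sentence positions, so those refinements are omitted here.

```python
def extract_abstract(sentences, node_weights, max_sentences=3):
    """Pick the highest-weighted sentences and output them in order of appearance."""
    # sort node weight values in descending order and keep the budgeted number
    top = sorted(node_weights, key=node_weights.get, reverse=True)[:max_sentences]
    # re-order the selected core sentences by their position in the original text
    core = sorted(top)
    return "。".join(sentences[i]["text"] for i in core) + "。"
```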
Example 2
As shown in fig. 2 and fig. 3, an algorithm-based automatic text abstract extraction system comprises a preprocessing module, a Doc2Vec-based feature extraction module, a similarity calculation module, a TextRank-based weight value calculation module and an abstract extraction module. The preprocessing module is used for paragraph segmentation, sentence segmentation and word segmentation of the text, for extracting the chapter structure information of the text, and for performing punctuation filtering, abbreviation completion, stemming and space deletion on the text; the paragraph, sentence and word segmentation in the preprocessing module is performed as follows: each sentence in the text is numbered, the text is divided into paragraphs and sentences according to punctuation marks, and the text is divided into words according to its encoding using a word segmentation tool. The Doc2Vec-based feature extraction module is used for learning word vectors and paragraph vectors of the words in each sentence through the Doc2Vec algorithm and its corresponding neural network model, so that each sentence corresponds to a dense, continuous, real-valued paragraph vector of specified dimension, which serves as the feature representation of the sentence. The similarity calculation module is used for calculating the similarity between sentences in the text with an existing similarity calculation method (such as the cosine function, Euclidean distance or the Jaccard function), applying weighting during the calculation in combination with the chapter structure of the text and the positions of the sentences, where the weighting follows these principles: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, its similarity calculation result is not weighted; 3) when the similarity between a sentence and the text title lies between the above two cases, the similarity of the sentence is weighted with the weighting coefficient calculated by the following formula:
[Formula image in the original document (BDA0001288098790000101): weighting coefficient defined in terms of sim(P0h', Pih')]
wherein P0h' and Pih' respectively denote the feature vectors, of length h', of the title sentence and of the i-th sentence, and sim denotes the vector product of the feature vectors of the two sentences;
4) sentences located in the first and last paragraphs of the text are weighted according to their forward-order and reverse-order positions, with the weighting coefficient calculated as follows:
[Formula image in the original document (BDA0001288098790000102): position-based weighting coefficient in terms of e1, e2, s and r]
wherein e1 and e2 are set thresholds greater than 0 and less than 1, with default values of 0.2, and s and r are the numbers of sentences in the first and last paragraphs; 5) the weight of a key sentence is enlarged by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted; once the weighted similarity calculation is completed, the sentence similarity matrix of the text is obtained. The TextRank-based weight value calculation module is used for constructing, according to the sentence similarity matrix, an undirected weighted TextRank network graph with each sentence of the text as a node, the similarity relations among sentences as edges and the similarities between sentences as edge weights; it is further used for iteratively calculating until convergence to obtain the weight value of each node. The abstract extraction module is used for sorting the weight values of the nodes in descending order and, in combination with the set abstract length parameters, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, wherein the abstract length parameters comprise the number of abstract words, the number of abstract sentences and the proportion of abstract sentences to the total number of sentences in the article; finally, the core sentences are sorted by their order of appearance and output as the text abstract extraction result.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.

Claims (6)

1. An algorithm-based text abstract automatic extraction method is characterized by comprising the following steps:
S1, preprocessing the text, wherein the preprocessing comprises paragraph segmentation, sentence segmentation and word segmentation of the text, and extraction of the chapter structure information of the text;
S2, extracting features from the preprocessed text, as follows: word vectors and paragraph vectors of the words in each sentence are learned through the Doc2Vec algorithm and its corresponding neural network model, so that each sentence corresponds to a dense, continuous, real-valued paragraph vector of specified dimension, which is taken as the feature representation of the sentence;
S3, after feature extraction is completed, calculating the similarity between sentences in the text with an existing similarity calculation method, applying weighting during the calculation in combination with the chapter structure of the text and the positions of the sentences, and obtaining the sentence similarity matrix of the text once the weighted similarity calculation is completed;
S4, according to the sentence similarity matrix, constructing an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges and the similarities between sentences as edge weights, and iteratively calculating until convergence to obtain the weight value of each node;
S5, in combination with the set abstract length parameters, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, sorting them by their order of appearance, and outputting them as the text abstract extraction result;
in S3, the weighting applied when calculating the similarity follows these principles: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, its similarity calculation result is not weighted; 3) when the similarity between a sentence and the text title lies between the above two cases, the similarity of the sentence is weighted by the following formula:
[Formula image in the original document (FDA0002590232610000021): weighting coefficient defined in terms of sim(P0h', Pih')]
wherein P0h' and Pih' respectively denote the feature vectors, of length h', of the title sentence and of the i-th sentence, and sim denotes the vector product of the feature vectors of the two sentences;
4) sentences located in the first and last paragraphs of the text are weighted according to their forward-order and reverse-order positions, with the weighting coefficient calculated as follows:
[Formula image in the original document (FDA0002590232610000022): position-based weighting coefficient in terms of e1, e2, s and r]
wherein e1 and e2 are set thresholds greater than 0 and less than 1, and s and r are the numbers of sentences in the first and last paragraphs; 5) the weight of a key sentence is enlarged by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted.
2. The method for automatically extracting a text abstract according to claim 1, wherein in S1, the paragraph segmentation, sentence segmentation and word segmentation of the text are specifically performed as follows: each sentence in the text is numbered, the text is divided into paragraphs and sentences according to punctuation marks, and the text is divided into words according to its encoding using a word segmentation tool.
3. The method for automatically extracting a text abstract according to claim 2, wherein in S1, preprocessing the text further comprises punctuation filtering, abbreviation completion and space deletion.
4. An algorithm-based text abstract automatic extraction system is characterized by comprising:
the preprocessing module, used for paragraph segmentation, sentence segmentation and word segmentation of the text, and for extracting the chapter structure information of the text;
the Doc2Vec-based feature extraction module, used for learning word vectors and paragraph vectors of the words in each sentence through the Doc2Vec algorithm and its corresponding neural network model, so that each sentence corresponds to a dense, continuous, real-valued paragraph vector of specified dimension, which serves as the feature representation of the sentence;
the similarity calculation module, used for calculating the similarity between sentences in the text with an existing similarity calculation method, applying weighting during the calculation in combination with the chapter structure of the text and the positions of the sentences, and obtaining the sentence similarity matrix of the text once the weighted similarity calculation is completed;
the TextRank-based weight value calculation module, used for constructing, according to the sentence similarity matrix, an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges and the similarities between sentences as edge weights; and further used for iteratively calculating until convergence to obtain the weight value of each node;
the abstract extraction module, used for selecting, in combination with the set abstract length parameters, core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, sorting them by their order of appearance, and outputting them as the text abstract extraction result;
the weighting applied by the similarity calculation module when calculating the similarity follows these principles: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, its similarity calculation result is not weighted; 3) when the similarity between a sentence and the text title lies between the above two cases, the similarity of the sentence is weighted with the weighting coefficient calculated by the following formula:
[Formula image in the original document (FDA0002590232610000041): weighting coefficient defined in terms of sim(P0h', Pih')]
wherein P0h' and Pih' respectively denote the feature vectors, of length h', of the title sentence and of the i-th sentence, and sim denotes the vector product of the feature vectors of the two sentences;
4) sentences located in the first and last paragraphs of the text are weighted according to their forward-order and reverse-order positions, with the weighting coefficient calculated as follows:
[Formula image in the original document (FDA0002590232610000042): position-based weighting coefficient in terms of e1, e2, s and r]
wherein e1 and e2 are set thresholds greater than 0 and less than 1, and s and r are the numbers of sentences in the first and last paragraphs; 5) the weight of a key sentence is enlarged by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted.
5. The system for automatically extracting a text abstract according to claim 4, wherein the paragraph segmentation, sentence segmentation and word segmentation of the text in the preprocessing module are performed as follows: each sentence in the text is numbered, the text is divided into paragraphs and sentences according to punctuation marks, and the text is divided into words according to its encoding using a word segmentation tool.
6. The system of claim 5, wherein the preprocessing module is further configured to perform punctuation filtering, abbreviation padding, and space removal on the text.
CN201710314598.5A 2017-05-06 2017-05-06 Method and system for automatically extracting text abstract based on algorithm Expired - Fee Related CN107133213B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710314598.5A CN107133213B (en) 2017-05-06 2017-05-06 Method and system for automatically extracting text abstract based on algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710314598.5A CN107133213B (en) 2017-05-06 2017-05-06 Method and system for automatically extracting text abstract based on algorithm

Publications (2)

Publication Number Publication Date
CN107133213A CN107133213A (en) 2017-09-05
CN107133213B true CN107133213B (en) 2020-09-25

Family

ID=59731409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710314598.5A Expired - Fee Related CN107133213B (en) 2017-05-06 2017-05-06 Method and system for automatically extracting text abstract based on algorithm

Country Status (1)

Country Link
CN (1) CN107133213B (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107526841A (en) * 2017-09-19 2017-12-29 中央民族大学 A kind of Tibetan language text summarization generation method based on Web
CN108062351A (en) * 2017-11-14 2018-05-22 厦门市美亚柏科信息股份有限公司 Text snippet extracting method, readable storage medium storing program for executing on particular topic classification
CN108304445B (en) * 2017-12-07 2021-08-03 新华网股份有限公司 Text abstract generation method and device
CN108363696A (en) * 2018-02-24 2018-08-03 李小明 A kind of processing method and processing device of text message
CN108664465B (en) * 2018-03-07 2023-06-27 珍岛信息技术(上海)股份有限公司 Method and related device for automatically generating text
CN108763191B (en) * 2018-04-16 2022-02-11 华南师范大学 Text abstract generation method and system
CN108920455A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of Chinese automatically generates the automatic evaluation method of text
CN110929022A (en) * 2018-09-18 2020-03-27 阿基米德(上海)传媒有限公司 Text abstract generation method and system
CN109325235B (en) * 2018-10-17 2022-12-02 武汉斗鱼网络科技有限公司 Text abstract extraction method based on word weight and computing device
CN109670047B (en) * 2018-11-19 2022-09-20 内蒙古大学 Abstract note generation method, computer device and readable storage medium
CN109684642B (en) * 2018-12-26 2023-01-13 重庆电信系统集成有限公司 Abstract extraction method combining page parsing rule and NLP text vectorization
CN111435405A (en) * 2019-01-15 2020-07-21 北京行数通科技有限公司 Method and device for automatically labeling key sentences of article
CN109977194B (en) * 2019-03-20 2021-08-10 华南理工大学 Text similarity calculation method, system, device and medium based on unsupervised learning
CN110162778B (en) * 2019-04-02 2023-05-26 创新先进技术有限公司 Text abstract generation method and device
CN110008313A (en) * 2019-04-11 2019-07-12 重庆华龙网海数科技有限公司 A kind of unsupervised text snippet method of extraction-type
CN110264792B (en) * 2019-06-17 2021-11-09 上海元趣信息技术有限公司 Intelligent tutoring system for composition of pupils
CN110287280B (en) * 2019-06-24 2023-09-29 腾讯科技(深圳)有限公司 Method and device for analyzing words in article, storage medium and electronic equipment
CN110263343B (en) * 2019-06-24 2021-06-15 北京理工大学 Phrase vector-based keyword extraction method and system
CN110728143A (en) * 2019-09-23 2020-01-24 上海蜜度信息技术有限公司 Method and equipment for identifying document key sentences
CN110737768B (en) * 2019-10-16 2022-04-08 信雅达科技股份有限公司 Text abstract automatic generation method and device based on deep learning and storage medium
CN111125349A (en) * 2019-12-17 2020-05-08 辽宁大学 Graph model text abstract generation method based on word frequency and semantics
CN111125424B (en) * 2019-12-26 2024-01-09 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and storage medium for extracting core lyrics of song
CN111159393B (en) * 2019-12-30 2023-10-10 电子科技大学 Text generation method for abstract extraction based on LDA and D2V
CN111241268B (en) * 2020-01-21 2023-04-14 上海七印信息科技有限公司 Automatic text abstract generation method
CN112016292B (en) * 2020-09-09 2022-10-11 平安科技(深圳)有限公司 Method and device for setting article interception point and computer equipment
CN112231468A (en) * 2020-10-15 2021-01-15 平安科技(深圳)有限公司 Information generation method and device, electronic equipment and storage medium
CN113076734B (en) * 2021-04-15 2023-01-20 云南电网有限责任公司电力科学研究院 Similarity detection method and device for project texts
CN113032584B (en) * 2021-05-27 2021-09-17 北京明略软件系统有限公司 Entity association method, entity association device, electronic equipment and storage medium
CN113254593B (en) * 2021-06-18 2021-10-19 平安科技(深圳)有限公司 Text abstract generation method and device, computer equipment and storage medium
CN117390173B (en) * 2023-11-02 2024-03-29 江苏优丞信息科技有限公司 Massive resume screening method for semantic similarity matching

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
AU2011203510A1 (en) * 2010-07-23 2012-02-09 Sony Corporation Information processing device, information processing method, and information processing program
CN103617158A (en) * 2013-12-17 2014-03-05 苏州大学张家港工业技术研究院 Method for generating emotion abstract of dialogue text
CN104503958A (en) * 2014-11-19 2015-04-08 百度在线网络技术(北京)有限公司 Method and device for generating document summarization
CN106227722A (en) * 2016-09-12 2016-12-14 中山大学 A kind of extraction method based on listed company's bulletin summary

Also Published As

Publication number Publication date
CN107133213A (en) 2017-09-05

Similar Documents

Publication Publication Date Title
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN110413986B (en) Text clustering multi-document automatic summarization method and system for improving word vector model
CN107066553B (en) Short text classification method based on convolutional neural network and random forest
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
CN111259127B (en) Long text answer selection method based on transfer learning sentence vector
CN106776562A (en) A kind of keyword extracting method and extraction system
CN106598940A (en) Text similarity solution algorithm based on global optimization of keyword quality
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN106708929B (en) Video program searching method and device
CN110879834B (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
WO2017193685A1 (en) Method and device for data processing in social network
CN111291188A (en) Intelligent information extraction method and system
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN111581392B (en) Automatic composition scoring calculation method based on statement communication degree
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
CN106570196B (en) Video program searching method and device
Wang et al. Named entity recognition method of brazilian legal text based on pre-training model
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
Chai Design and implementation of English intelligent communication platform based on similarity algorithm
CN115269834A (en) High-precision text classification method and device based on BERT

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200925
Termination date: 20210506