CN107133213B - Method and system for automatically extracting text abstract based on algorithm - Google Patents
- Publication number
- CN107133213B CN107133213B CN201710314598.5A CN201710314598A CN107133213B CN 107133213 B CN107133213 B CN 107133213B CN 201710314598 A CN201710314598 A CN 201710314598A CN 107133213 B CN107133213 B CN 107133213B
- Authority
- CN
- China
- Prior art keywords
- sentence
- text
- similarity
- sentences
- abstract
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an algorithm-based automatic text abstract extraction method, which relates to the technical field of text extraction and comprises the following steps: S1, preprocessing the text; S2, extracting features of the text; S3, calculating the similarity between sentences by an existing similarity calculation method, with weighting applied during the calculation; S4, constructing an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges, and the similarities as edge weights, then obtaining a weight value for each node through iterative calculation until convergence; S5, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, and outputting them, after sorting, as the extraction result. The invention also discloses a system for extracting the abstract. The method and system help improve the accuracy of automatic text abstract extraction.
Description
Technical Field
The invention relates to the technical field of text extraction, in particular to a text abstract automatic extraction method and system based on an algorithm.
Background
The automatic extraction of the text abstract based on machine learning is a hot spot in the research field of text mining in recent years, and has very wide application prospect in the fields of search engines, portal websites, mobile internet, information retrieval systems and the like. The automatic extraction of the text abstract is realized by utilizing a computer technology, so that the text information can be effectively mined and concentrated, the reading time of a user is reduced, and the user experience is improved.
Early automatic extraction of text summaries mainly used rule-based or statistical machine-learning approaches. In recent years, many researchers have studied automatic text abstract extraction using various machine learning algorithms, such as regression models (including linear regression and ELM regression), the LDA (Latent Dirichlet Allocation) model, the support vector machine (SVM), and the LexRank algorithm, further improving the extraction effect by combining relevant results from linguistics, such as chapter structure, word weight, keywords and topic models. Since linear regression, ELM regression, LDA and the like are supervised learning methods, they are easily affected by the training samples, and thus generalize poorly across domains and are unsuitable for abstract extraction from massive texts. In 2004, Mihalcea and Tarau proposed TextRank, an unsupervised learning algorithm based on Google's PageRank algorithm, in connection with research on automatic abstract extraction; in essence it constructs a TextRank network graph from the similarities between sentences, treating similarity between sentences as a recommendation or voting relationship. Building on the work of Mihalcea and Tarau, some researchers have applied TextRank to information retrieval, keyword extraction and the like with good results. But the representation of text in these works mainly adopts a bag-of-words approach, i.e. one-of-V encoding (where V is the size of the dictionary), based chiefly on co-occurrence information between words, neglecting word order and semantics. For example, word-to-word similarity cannot be expressed (the vector inner product of any two different words is 0), and the dimension of the word vectors easily becomes too large.
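The weighted TextRank score iterated over such graphs is commonly written as follows. This is the standard form of the algorithm from Mihalcea and Tarau, not a formula quoted from this patent; d is the damping factor, and w_ji is the edge weight between sentence nodes V_j and V_i:

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)
```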
Chinese patent application CN104216875A discloses a microblog text automatic abstracting method based on unsupervised keyword binary word string extraction, comprising: preprocessing a microblog; standardizing the binary word strings; extracting key binary word strings based on mixed TF-IDF, TextRank and LDA; sentence sequencing based on the intersection similarity and mutual information strategy; abstract sentence extraction based on a similarity threshold; and reasonably combining the abstract sentences to generate the abstract. The patent application is still limited by a traditional thinking framework for automatically extracting text abstract, and can not solve the problems of dimension disaster and the like.
Another Chinese patent application, CN200710130576.X, discloses a data processing apparatus comprising a first unsupervised learning processing unit, a second unsupervised learning processing unit and a supervised learning processing unit. The first unsupervised learning processing unit classifies data of a first data group by unsupervised learning so as to reduce the dimensionality of the first data group, thereby obtaining a first classified data group. The second unsupervised learning processing unit classifies data of a second data group by unsupervised learning so as to reduce the dimensionality of the second data group, thereby obtaining a second classified data group. The supervised learning processing unit performs supervised learning using the first and second classified data groups as teacher data so as to determine a mapping relationship between them. That patent application can reduce data dimensionality, but to date there has been no method or system of this kind that can be effectively applied to automatic extraction of text summaries.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide an algorithm-based text abstract automatic extraction method and system, which combines the Doc2Vec algorithm and the TextRank algorithm to be applied to the text abstract automatic extraction and improves the accuracy of the text abstract automatic extraction.
In order to achieve the purpose, the invention adopts the following technical scheme:
an algorithm-based text abstract automatic extraction method comprises the following steps:
S1, preprocessing the text, wherein the preprocessing comprises paragraph segmentation, sentence segmentation and word segmentation of the text, and extraction of the chapter structure information of the text;
S2, extracting features from the preprocessed text, wherein the feature extraction is as follows: learning the word vectors and paragraph vectors of the words in each sentence through the Doc2Vec algorithm and a corresponding neural network model, so that each sentence corresponds to a dense, continuous real-number paragraph word vector of specified dimensionality, which is taken as the feature representation of the sentence;
S3, after completing the feature extraction, calculating the similarity between sentences in the text by an existing similarity calculation method, applying weighting in combination with the chapter structure of the text and the positions of the sentences during the calculation, and obtaining a sentence similarity matrix of the text after the weighted similarity calculation is completed;
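As a concrete illustration of S3, the unweighted similarity computation can be sketched in plain Python. The cosine function used here is one of the "existing similarity calculation methods" the description later names (alongside Euclidean distance and the Jaccard function); the function names are illustrative, not from the patent.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two sentence feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(vectors):
    # Pairwise sentence similarities, the input to the TextRank graph of S4.
    n = len(vectors)
    return [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
```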
S4, constructing, according to the sentence similarity matrix, an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges, and the similarities between sentences as edge weights; obtaining the weight value of each node through iterative calculation until convergence;
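The iterative calculation of S4 can be sketched as a power iteration over the similarity matrix. The damping factor d, tolerance and iteration cap below are common defaults for TextRank, not values taken from the patent.

```python
def textrank_weights(sim, d=0.85, tol=1e-6, max_iter=100):
    # Weighted TextRank over a symmetric sentence-similarity matrix:
    # sentences are nodes, sim[i][j] is the weight of edge (i, j).
    n = len(sim)
    # Zero the diagonal so a sentence does not "vote" for itself.
    w = [[sim[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]          # total outgoing edge weight per node
    ws = [1.0] * n                          # initial node weights
    for _ in range(max_iter):
        new = [(1 - d) + d * sum(w[j][i] / out[j] * ws[j]
                                 for j in range(n) if out[j] > 0)
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, ws)) < tol:  # converged
            return new
        ws = new
    return ws
```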
S5, in combination with the set abstract space parameters, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, sorting them according to their order of appearance, and outputting them as the extraction result of the text abstract, wherein the abstract space parameters comprise the abstract word count, the abstract sentence count and the proportion of abstract sentences to the total number of sentences in the article.
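The selection-and-reordering logic of S5 can be sketched as follows. The chapter-structure and position criteria are omitted here, and max_sentences stands in for the "abstract space parameters" (word count / sentence count / proportion); both simplifications are illustrative assumptions.

```python
def extract_summary(sentences, weights, max_sentences=3):
    # Rank sentences by their TextRank weight value, take the top ones as
    # core sentences, then restore their original order of appearance
    # before output, as step S5 requires.
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    core = sorted(ranked[:max_sentences])   # re-sort by position in the text
    return [sentences[i] for i in core]
```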
Further, in S3, the weighting applied when calculating the similarity follows these principles: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, the similarity calculation result of the sentence is not weighted; 3) when the similarity between a sentence and the text title lies between the two former cases, the similarity of the sentence is weighted by the following formula:
wherein P_{0h'} and P_{ih'} respectively represent the feature vectors, of length h', of the title sentence and the i-th sentence, and sim represents the vector product of the feature vectors of the two sentences;
4) for sentences located at the first segment and the last segment in the text, weighting is carried out according to the positive sequence position and the negative sequence position, and the calculation formula of the weighting coefficient is as follows:
wherein e_1 and e_2 are set thresholds greater than 0 and less than 1, and s and r are the numbers of sentences in the first and last segments; 5) the weight of a key sentence is amplified by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted.
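The title-based rules 1)-3) above can be sketched as follows. The patent's actual formula for rule 3) appears only as an image in the original publication and is not reproduced in this text, so the interpolation used below (scaling by 1 + similarity, which reduces to the two extreme cases at similarity 1 and 0) is an assumption for illustration only.

```python
def title_weighting(base_sim, sim_to_title):
    # Apply the title-similarity weighting rules to one sentence's score.
    if sim_to_title >= 1.0:        # rule 1: sentence coincides with the title
        return base_sim * 2.0
    if sim_to_title == 0.0:        # rule 2: no similarity to the title
        return base_sim
    # rule 3: ASSUMED linear interpolation between the two extremes above.
    return base_sim * (1.0 + sim_to_title)
```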
Further, in S1, the manner of paragraph segmentation, sentence segmentation and word segmentation of the text is specifically: numbering each sentence in the text, segmenting the text into paragraphs and sentences according to punctuation marks, and segmenting the text into words with an encoding and word segmentation tool.
Further, in S1, the content of preprocessing the text further includes: and (5) carrying out punctuation filtering, abbreviation filling and space deletion on the text.
An algorithm-based automatic text summarization extraction system, comprising:
the preprocessing module is used for paragraph segmentation, sentence segmentation and word segmentation of the text and for extracting the chapter structure information of the text;
the Doc2Vec-based feature extraction module is used for learning the word vectors and paragraph vectors of the words in each sentence through the Doc2Vec algorithm and a corresponding neural network model, so that each sentence corresponds to a dense, continuous real-number paragraph word vector of specified dimensionality serving as the feature representation of the sentence;
the similarity calculation module is used for calculating the similarity between sentences in the text by adopting the conventional similarity calculation method, performing weighting processing by combining the chapter structure of the text and the positions of the sentences in the calculation process, and obtaining a sentence similarity matrix of the text after completing the similarity calculation combined with the weighting processing;
the TextRank-based weight value calculation module is used for constructing, according to the sentence similarity matrix, an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges, and the similarities between sentences as edge weights; it is also used for obtaining the weight value of each node through iterative calculation until convergence;
and the abstract extraction module is used for selecting, in combination with the set abstract space parameters, core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, sorting the core sentences according to their order of appearance, and outputting them as the extraction result of the text abstract, wherein the abstract space parameters comprise the abstract word count, the abstract sentence count and the proportion of abstract sentences to the total number of sentences in the article.
Further, the principles of the weighting applied by the similarity calculation module when calculating the similarity are as follows: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, the similarity calculation result of the sentence is not weighted; 3) when the similarity between a sentence and the text title lies between the two former cases, the similarity of the sentence is weighted according to the following formula for the weighting coefficient:
wherein P_{0h'} and P_{ih'} respectively represent the feature vectors, of length h', of the title sentence and the i-th sentence, and sim represents the vector product of the feature vectors of the two sentences;
4) for sentences located at the first segment and the last segment in the text, weighting is carried out according to the positive sequence position and the negative sequence position, and the calculation formula of the weighting coefficient is as follows:
wherein e_1 and e_2 are set thresholds greater than 0 and less than 1, and s and r are the numbers of sentences in the first and last segments;
5) the weight of the key sentence is amplified by 1.1 times, and the key sentence is a sentence of which the number of words is greater than a set value and which directly forms a paragraph; 6) for sentences that are preprocessed to be empty, no weighting is performed.
Further, the manner of paragraph segmentation, sentence segmentation and word segmentation of the text in the preprocessing module is: numbering each sentence in the text, segmenting the text into paragraphs and sentences according to punctuation marks, and segmenting the text into words with an encoding and word segmentation tool.
Furthermore, the preprocessing module is also used for carrying out punctuation filtering, abbreviation filling and space deletion on the text.
The invention has the following beneficial effects. The Doc2Vec algorithm, based on Word2Vec and word embedding, learns from the variable-length sentences of a text to obtain low-dimensional, dense real-number word vectors containing semantic information, which serve as fixed-size sentence feature representations; compared with the traditional bag-of-words and word-frequency representations, the sentence paragraph word vectors obtained with Doc2Vec not only contain the semantic information of words but also avoid the dimension-disaster problem of the feature space and reduce the workload of similarity calculation. Doc2Vec thus has the advantages of expressing sentence semantics and reducing the dimensionality of the feature representation, while the TextRank algorithm, as an unsupervised method, requires no pre-training, does not depend on a specific corpus, and has high computational performance. The invention organically combines the Doc2Vec and TextRank algorithms for automatic text abstract extraction and further optimizes the TextRank network structure graph with information such as the chapter structure, giving the method high accuracy and high calculation speed. Compared with traditional automatic text abstract extraction methods and systems, the accuracy can be improved by 3%-5%, a significant improvement in this field.
Drawings
FIG. 1 is a block diagram of a method for automatically extracting a text abstract based on an algorithm according to the present invention;
FIG. 2 is a block diagram of an algorithm-based automatic text summarization system according to the present invention;
FIG. 3 is a block diagram of a process for pre-processing text in accordance with the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and the detailed description below:
example 1
As shown in fig. 1, an algorithm-based text abstract automatic extraction method includes the following steps:
s1, preprocessing the text, as shown in fig. 3, the preprocessed content includes: numbering each sentence in the text, segmenting and segmenting the text according to punctuation marks, and segmenting words of the text according to a coding and word segmentation tool; extracting chapter structure information of the text; the method also comprises the steps of punctuation filtering, abbreviation filling, word stem processing and blank space deletion on the text;
S2, extracting features from the preprocessed text, wherein the feature extraction is as follows: learning the word vectors and paragraph vectors of the words in each sentence through the Doc2Vec algorithm and a corresponding neural network model, so that each sentence corresponds to a dense, continuous real-number paragraph word vector of specified dimensionality, which is taken as the feature representation of the sentence;
S3, after completing the feature extraction, calculating the similarity between sentences in the text by an existing similarity calculation method (such as the cosine function, Euclidean distance or Jaccard function), applying weighting in combination with the chapter structure of the text and the positions of the sentences during the calculation; the weighting principles when calculating the similarity are: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, the similarity calculation result of the sentence is not weighted; 3) when the similarity between a sentence and the text title lies between the two former cases, the similarity of the sentence is weighted by the following formula:
wherein P_{0h'} and P_{ih'} respectively represent the feature vectors, of length h', of the title sentence and the i-th sentence, and sim represents the vector product of the feature vectors of the two sentences;
4) for sentences located at the first segment and the last segment in the text, weighting is carried out according to the positive sequence position and the negative sequence position, and the calculation formula of the weighting coefficient is as follows:
wherein e_1 and e_2 are set thresholds greater than 0 and less than 1, with default values of 0.2, and s and r are the numbers of sentences in the first and last segments; 5) the weight of a key sentence is amplified by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted; a sentence similarity matrix of the text is obtained after the similarity calculation combined with the weighting processing is completed;
S4, constructing, according to the sentence similarity matrix, an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges, and the similarities between sentences as edge weights; obtaining the weight value of each node through iterative calculation until convergence;
S5, sorting the weight values of the nodes in descending order and, in combination with the set abstract space parameters, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, wherein the abstract space parameters comprise the abstract word count, the abstract sentence count and the proportion of abstract sentences to the total number of sentences in the article; finally, the core sentences are sorted according to their order of appearance and output as the extraction result of the text abstract.
Example 2
As shown in fig. 2 and fig. 3, an automatic text abstract extraction system based on an algorithm includes a preprocessing module, a Doc2Vec-based feature extraction module, a similarity calculation module, a TextRank-based weight value calculation module, and an abstract extraction module. The preprocessing module is used for paragraph segmentation, sentence segmentation and word segmentation of the text, extracting the chapter structure information of the text, and performing punctuation filtering, abbreviation filling, word stem processing and space deletion on the text; the manner of paragraph, sentence and word segmentation in the preprocessing module is: numbering each sentence in the text, segmenting the text into paragraphs and sentences according to punctuation marks, and segmenting the text into words with an encoding and word segmentation tool. The Doc2Vec-based feature extraction module is used for learning the word vectors and paragraph vectors of the words in each sentence through the Doc2Vec algorithm and a corresponding neural network model, so that each sentence corresponds to a dense, continuous real-number paragraph word vector of specified dimensionality serving as the feature representation of the sentence. The similarity calculation module is used for calculating the similarity between sentences in the text by an existing similarity calculation method (such as the cosine function, Euclidean distance or Jaccard function), applying weighting in combination with the chapter structure of the text and the positions of the sentences during the calculation; the principles of the weighting applied during similarity calculation are as follows: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, the similarity calculation result of the sentence is not weighted; 3) when the similarity between a sentence and the text title lies between the two former cases, the similarity of the sentence is weighted according to the following formula for the weighting coefficient:
wherein P_{0h'} and P_{ih'} respectively represent the feature vectors, of length h', of the title sentence and the i-th sentence, and sim represents the vector product of the feature vectors of the two sentences;
4) for sentences located at the first segment and the last segment in the text, weighting is carried out according to the positive sequence position and the negative sequence position, and the calculation formula of the weighting coefficient is as follows:
wherein e_1 and e_2 are set thresholds greater than 0 and less than 1, with default values of 0.2, and s and r are the numbers of sentences in the first and last segments; 5) the weight of a key sentence is amplified by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted. A sentence similarity matrix of the text is obtained after the similarity calculation combined with the weighting processing is completed. The TextRank-based weight value calculation module is used for constructing, according to the sentence similarity matrix, an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges, and the similarities between sentences as edge weights; it is also used for obtaining the weight value of each node through iterative calculation until convergence. The abstract extraction module is used for sorting the weight values of the nodes in descending order and, in combination with the set abstract space parameters, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, wherein the abstract space parameters comprise the abstract word count, the abstract sentence count and the proportion of abstract sentences to the total number of sentences in the article; finally, the core sentences are sorted according to their order of appearance and output as the extraction result of the text abstract.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.
Claims (6)
1. An algorithm-based text abstract automatic extraction method is characterized by comprising the following steps:
S1, preprocessing the text, wherein the preprocessing comprises paragraph segmentation, sentence segmentation and word segmentation of the text, and extraction of the chapter structure information of the text;
s2, extracting the features of the preprocessed text, wherein the content of the extracted features is as follows: learning word vectors and paragraph vectors of words in each sentence through a Doc2Vec algorithm and a corresponding neural network model, enabling each sentence to correspond to a real number paragraph word vector with a specified dimension and continuous density, and taking the real number paragraph word vector as a characteristic representation of the sentence;
S3, after completing the feature extraction, calculating the similarity between sentences in the text by an existing similarity calculation method, applying weighting in combination with the chapter structure of the text and the positions of the sentences during the calculation, and obtaining a sentence similarity matrix of the text after the weighted similarity calculation is completed;
S4, constructing, according to the sentence similarity matrix, an undirected weighted TextRank network graph with the sentences of the text as nodes, the similarity relations among sentences as edges, and the similarities between sentences as edge weights; obtaining the weight value of each node through iterative calculation until convergence;
S5, in combination with the set abstract space parameters, selecting core sentences according to the weight value of the sentence corresponding to each node, the chapter structure of the text and the position information of the sentences, sorting them according to their order of appearance, and outputting them as the extraction result of the text abstract;
in S3, the weighting processing is performed when calculating the similarity as follows: 1) multiplying the similarity calculation result of the sentence by 2 as a weighted result when the sentence coincides with the text title; 2) when the similarity calculation result of the sentence and the text title is 0, the similarity calculation result of the sentence is not weighted; 3) when the similarity calculation result of the sentence and the text title is between the two cases, the similarity of the sentence is weighted by adopting the following formula:
wherein P_{0h'} and P_{ih'} respectively represent the feature vectors, of length h', of the title sentence and the i-th sentence, and sim represents the vector product of the feature vectors of the two sentences;
4) for sentences located at the first segment and the last segment in the text, weighting is carried out according to the positive sequence position and the negative sequence position, and the calculation formula of the weighting coefficient is as follows:
wherein e_1 and e_2 are set thresholds greater than 0 and less than 1, and s and r are the numbers of sentences in the first and last segments; 5) the weight of a key sentence is amplified by a factor of 1.1, a key sentence being a sentence whose word count exceeds the set value and which by itself constitutes a paragraph; 6) sentences that are empty after preprocessing are not weighted.
2. The method for automatically extracting a text abstract according to claim 1, wherein in S1 the manner of paragraph segmentation, sentence segmentation and word segmentation of the text is specifically: numbering each sentence in the text, segmenting the text into paragraphs and sentences according to punctuation marks, and segmenting the text into words with an encoding and word segmentation tool.
3. The method for automatically extracting a text abstract according to claim 2, wherein in S1, the content of preprocessing the text further comprises: and (5) carrying out punctuation filtering, abbreviation filling and space deletion on the text.
4. An algorithm-based text abstract automatic extraction system is characterized by comprising:
the preprocessing module is used for paragraph segmentation, sentence segmentation and word segmentation of the text and for extracting the chapter structure information of the text;
the characteristic extraction module based on Doc2Vec is used for learning word vectors and paragraph vectors of words in each sentence through a Doc2Vec algorithm and a corresponding neural network model, so that each sentence corresponds to a real paragraph word vector with specified dimensionality and continuous density, and real paragraph word vectors which are expressed as characteristics of the sentences are obtained;
the similarity calculation module is used for calculating the similarity between sentences in the text by adopting the conventional similarity calculation method, performing weighting processing by combining the chapter structure of the text and the positions of the sentences in the calculation process, and obtaining a sentence similarity matrix of the text after completing the similarity calculation combined with the weighting processing;
the weight value calculation module based on the TextRank is used for constructing an undirected weighted TextRank network graph by using the sentences in the text as nodes, the similarity relation among the sentences as edges and the similarity among the sentences as weights of the edges according to the sentence similarity matrix; the node weight calculation method is also used for obtaining each node containing the weight value through iterative calculation till convergence;
the abstract extraction module is used for combining the set abstract space parameters, selecting a core sentence according to the weight value of the sentence corresponding to each node, the space chapter structure of the text and the position information of the sentence, sequencing according to the appearance sequence of the core sentence, and outputting the core sentence as the extraction result of the text abstract;
the principle of the weighting processing performed by the similarity calculation module when calculating the similarity is as follows: 1) when a sentence coincides with the text title, its similarity calculation result is multiplied by 2 as the weighted result; 2) when the similarity between a sentence and the text title is 0, the sentence's similarity result is not weighted; 3) when the similarity between a sentence and the text title falls between the two preceding cases, the sentence's similarity is weighted with the weighting coefficient given by the following formula:
wherein P0h' and Pih' respectively represent the feature vectors, of length h', of the title sentence and the i-th sentence, and sim represents the vector product of the feature vectors of the two sentences;
4) sentences located in the first and last paragraphs of the text are weighted according to their forward-order and reverse-order positions, the weighting coefficient being calculated by the following formula:
wherein e1 and e2 are set thresholds greater than 0 and less than 1, and s and r are the numbers of sentences in the first and last paragraphs respectively; 5) the weight of a key sentence is amplified by a factor of 1.1, a key sentence being one whose word count exceeds a set value and which forms a paragraph on its own; 6) sentences that are empty after preprocessing are not weighted.
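The iterative weight calculation the TextRank module performs can be sketched as a standard power iteration over the undirected weighted sentence graph; the damping factor 0.85 and the convergence tolerance are conventional TextRank defaults, not values given in the claims:

```python
def textrank(sim, d=0.85, tol=1e-6, max_iter=200):
    """Iterate node weights on an undirected weighted sentence graph.

    sim -- symmetric n x n similarity matrix (diagonal ignored)
    Returns one weight per sentence, iterated until convergence.
    """
    n = len(sim)
    # Total edge weight leaving each node; isolated nodes get 1 to avoid
    # division by zero.
    out = [sum(sim[j][k] for k in range(n) if k != j) or 1.0
           for j in range(n)]
    w = [1.0 / n] * n
    for _ in range(max_iter):
        new = [(1 - d) + d * sum(sim[j][i] / out[j] * w[j]
                                 for j in range(n) if j != i)
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, w)) < tol:
            return new
        w = new
    return w
```

Sentences with stronger similarity ties to the rest of the text accumulate higher weights, which the abstract extraction module then consumes.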
5. The system for automatically extracting a text abstract according to claim 4, wherein the preprocessing module segments the text into paragraphs, sentences, and words as follows: each sentence in the text is numbered, the text is divided into paragraphs and sentences according to punctuation marks, and the text is tokenized using an encoding and word-segmentation tool.
6. The system of claim 5, wherein the preprocessing module is further configured to perform punctuation filtering, abbreviation filling, and space deletion on the text.
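Given the per-sentence weights produced by the TextRank module, the abstract extraction step of claim 4 reduces to keeping the highest-weighted sentences within the abstract-length budget and restoring their order of appearance; a sketch, with `max_sentences` standing in for the set abstract-length parameter:

```python
def extract_abstract(sentences, weights, max_sentences=3):
    """Select core sentences by TextRank weight, output in appearance order."""
    # Rank sentence indices by descending weight.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: weights[i], reverse=True)
    # Keep the top-k, then re-sort by original position in the text.
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]
```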
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710314598.5A CN107133213B (en) | 2017-05-06 | 2017-05-06 | Method and system for automatically extracting text abstract based on algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107133213A CN107133213A (en) | 2017-09-05 |
CN107133213B true CN107133213B (en) | 2020-09-25 |
Family
ID=59731409
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710314598.5A Expired - Fee Related CN107133213B (en) | 2017-05-06 | 2017-05-06 | Method and system for automatically extracting text abstract based on algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107133213B (en) |
Families Citing this family (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107526841A (en) * | 2017-09-19 | 2017-12-29 | 中央民族大学 | A kind of Tibetan language text summarization generation method based on Web |
CN108062351A (en) * | 2017-11-14 | 2018-05-22 | 厦门市美亚柏科信息股份有限公司 | Text snippet extracting method, readable storage medium storing program for executing on particular topic classification |
CN108304445B (en) * | 2017-12-07 | 2021-08-03 | 新华网股份有限公司 | Text abstract generation method and device |
CN108363696A (en) * | 2018-02-24 | 2018-08-03 | 李小明 | A kind of processing method and processing device of text message |
CN108664465B (en) * | 2018-03-07 | 2023-06-27 | 珍岛信息技术(上海)股份有限公司 | Method and related device for automatically generating text |
CN108763191B (en) * | 2018-04-16 | 2022-02-11 | 华南师范大学 | Text abstract generation method and system |
CN108920455A (en) * | 2018-06-13 | 2018-11-30 | 北京信息科技大学 | A kind of Chinese automatically generates the automatic evaluation method of text |
CN110929022A (en) * | 2018-09-18 | 2020-03-27 | 阿基米德(上海)传媒有限公司 | Text abstract generation method and system |
CN109325235B (en) * | 2018-10-17 | 2022-12-02 | 武汉斗鱼网络科技有限公司 | Text abstract extraction method based on word weight and computing device |
CN109670047B (en) * | 2018-11-19 | 2022-09-20 | 内蒙古大学 | Abstract note generation method, computer device and readable storage medium |
CN109684642B (en) * | 2018-12-26 | 2023-01-13 | 重庆电信系统集成有限公司 | Abstract extraction method combining page parsing rule and NLP text vectorization |
CN111435405A (en) * | 2019-01-15 | 2020-07-21 | 北京行数通科技有限公司 | Method and device for automatically labeling key sentences of article |
CN109977194B (en) * | 2019-03-20 | 2021-08-10 | 华南理工大学 | Text similarity calculation method, system, device and medium based on unsupervised learning |
CN110162778B (en) * | 2019-04-02 | 2023-05-26 | 创新先进技术有限公司 | Text abstract generation method and device |
CN110008313A (en) * | 2019-04-11 | 2019-07-12 | 重庆华龙网海数科技有限公司 | A kind of unsupervised text snippet method of extraction-type |
CN110264792B (en) * | 2019-06-17 | 2021-11-09 | 上海元趣信息技术有限公司 | Intelligent tutoring system for composition of pupils |
CN110287280B (en) * | 2019-06-24 | 2023-09-29 | 腾讯科技(深圳)有限公司 | Method and device for analyzing words in article, storage medium and electronic equipment |
CN110263343B (en) * | 2019-06-24 | 2021-06-15 | 北京理工大学 | Phrase vector-based keyword extraction method and system |
CN110728143A (en) * | 2019-09-23 | 2020-01-24 | 上海蜜度信息技术有限公司 | Method and equipment for identifying document key sentences |
CN110737768B (en) * | 2019-10-16 | 2022-04-08 | 信雅达科技股份有限公司 | Text abstract automatic generation method and device based on deep learning and storage medium |
CN111125349A (en) * | 2019-12-17 | 2020-05-08 | 辽宁大学 | Graph model text abstract generation method based on word frequency and semantics |
CN111125424B (en) * | 2019-12-26 | 2024-01-09 | 腾讯音乐娱乐科技(深圳)有限公司 | Method, device, equipment and storage medium for extracting core lyrics of song |
CN111159393B (en) * | 2019-12-30 | 2023-10-10 | 电子科技大学 | Text generation method for abstract extraction based on LDA and D2V |
CN111241268B (en) * | 2020-01-21 | 2023-04-14 | 上海七印信息科技有限公司 | Automatic text abstract generation method |
CN112016292B (en) * | 2020-09-09 | 2022-10-11 | 平安科技(深圳)有限公司 | Method and device for setting article interception point and computer equipment |
CN112231468A (en) * | 2020-10-15 | 2021-01-15 | 平安科技(深圳)有限公司 | Information generation method and device, electronic equipment and storage medium |
CN113076734B (en) * | 2021-04-15 | 2023-01-20 | 云南电网有限责任公司电力科学研究院 | Similarity detection method and device for project texts |
CN113032584B (en) * | 2021-05-27 | 2021-09-17 | 北京明略软件系统有限公司 | Entity association method, entity association device, electronic equipment and storage medium |
CN113254593B (en) * | 2021-06-18 | 2021-10-19 | 平安科技(深圳)有限公司 | Text abstract generation method and device, computer equipment and storage medium |
CN117390173B (en) * | 2023-11-02 | 2024-03-29 | 江苏优丞信息科技有限公司 | Massive resume screening method for semantic similarity matching |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101398814A (en) * | 2007-09-26 | 2009-04-01 | 北京大学 | Method and system for simultaneously abstracting document summarization and key words |
AU2011203510A1 (en) * | 2010-07-23 | 2012-02-09 | Sony Corporation | Information processing device, information processing method, and information processing program |
CN103617158A (en) * | 2013-12-17 | 2014-03-05 | 苏州大学张家港工业技术研究院 | Method for generating emotion abstract of dialogue text |
CN104503958A (en) * | 2014-11-19 | 2015-04-08 | 百度在线网络技术(北京)有限公司 | Method and device for generating document summarization |
CN106227722A (en) * | 2016-09-12 | 2016-12-14 | 中山大学 | A kind of extraction method based on listed company's bulletin summary |
Also Published As
Publication number | Publication date |
---|---|
CN107133213A (en) | 2017-09-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN110866117B (en) | Short text classification method based on semantic enhancement and multi-level label embedding | |
CN110413986B (en) | Text clustering multi-document automatic summarization method and system for improving word vector model | |
CN107066553B (en) | Short text classification method based on convolutional neural network and random forest | |
CN104765769B (en) | The short text query expansion and search method of a kind of word-based vector | |
CN111259127B (en) | Long text answer selection method based on transfer learning sentence vector | |
CN106776562A (en) | A kind of keyword extracting method and extraction system | |
CN106598940A (en) | Text similarity solution algorithm based on global optimization of keyword quality | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
CN106708929B (en) | Video program searching method and device | |
CN110879834B (en) | Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof | |
CN113761890B (en) | Multi-level semantic information retrieval method based on BERT context awareness | |
WO2017193685A1 (en) | Method and device for data processing in social network | |
CN111291188A (en) | Intelligent information extraction method and system | |
CN113255320A (en) | Entity relation extraction method and device based on syntax tree and graph attention machine mechanism | |
CN111581392B (en) | Automatic composition scoring calculation method based on statement communication degree | |
CN115203421A (en) | Method, device and equipment for generating label of long text and storage medium | |
CN111581364B (en) | Chinese intelligent question-answer short text similarity calculation method oriented to medical field | |
Chang et al. | A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING. | |
CN106570196B (en) | Video program searching method and device | |
Wang et al. | Named entity recognition method of brazilian legal text based on pre-training model | |
CN115759119A (en) | Financial text emotion analysis method, system, medium and equipment | |
CN114064901B (en) | Book comment text classification method based on knowledge graph word meaning disambiguation | |
Chai | Design and implementation of English intelligent communication platform based on similarity algorithm | |
CN115269834A (en) | High-precision text classification method and device based on BERT |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20200925; Termination date: 20210506