CN109101489B - Text automatic summarization method and device and electronic equipment - Google Patents
Text automatic summarization method and device and electronic equipment
- Publication number
- CN109101489B (application CN201810787848.1A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- sentences
- calculating
- document
- summarized
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F40/00—Handling natural language data > G06F40/20—Natural language analysis > G06F40/258—Heading extraction; Automatic titling; Numbering
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F40/00—Handling natural language data > G06F40/20—Natural language analysis > G06F40/205—Parsing > G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F40/00—Handling natural language data > G06F40/30—Semantic analysis
Abstract
The invention discloses a text automatic summarization method, which comprises the steps of: segmenting a document to be summarized according to predefined sentence ending symbols; calculating the topic vector of each segmented sentence according to an existing text corpus; determining the relevance of every two sentences according to the number of words that co-occur between them; calculating the semantic similarity between every two sentences according to their topic vectors; calculating the score of each sentence according to the inter-sentence relevance and semantic similarity; and selecting the sentences whose scores meet a threshold, adding preset connecting words, and outputting them in the selected output order to obtain the summary content. The invention calculates the score of each sentence using both inter-sentence relevance and similarity, jointly considering the word co-occurrence rate and the semantic relevance of sentences, which improves the accuracy of sentence scoring. The summarization method provided by the invention is convenient to compute and highly general. The invention also discloses a text automatic summarization device and an electronic device.
Description
Technical Field
The invention relates to the technical field of natural language understanding, in particular to a text automatic summarization method and device and electronic equipment.
Background
An abstract reflects the central content of the original document comprehensively and accurately in a short, coherent text. With the explosion of information, the number of documents people must read before completing a task keeps growing, and so does the time spent reading them. Automatic summarization can effectively shorten reading time and improve working efficiency in many fields, and therefore has broad application prospects.
Automatic summarization techniques can be divided into two categories according to the relationship between the original text and the summary: extractive and generative summarization. Extractive summarization extracts key sentences from the sentence set of the original text without modifying them and then combines the key sentences into a summary; in essence it converts the summarization problem into a ranking problem, scoring each sentence and forming the summary of the document from the high-scoring sentences. Generative summarization tries to understand the content of the document and summarize its central content in newly composed sentences, which is closer to the essence of summarization; seq2seq methods have made some progress on the short-text summarization problem, but for long texts the technical difficulty is high and the results are poor.
At present, the widely used technology is still extraction-based summary generation, which generally measures the relevance of sentences by the words that compose them. In an actual document, however, both sentences with high word-level relevance and sentences with high semantic relevance may be key sentences, so considering only one of the two kinds of relevance is unreasonable.
Disclosure of Invention
In view of the above, there is a need for a text automatic summarization method and apparatus that overcome the defects of existing extraction-based methods and offer both generality and high accuracy.
The invention comprises the following contents:
A text automatic summarization method comprises the following steps:
segmenting the document to be summarized according to predefined sentence ending symbols;
calculating a topic vector for each segmented sentence according to an existing text corpus;
determining the relevance of every two sentences according to the number of words that co-occur between them;
calculating the semantic similarity between every two sentences according to the topic vector of each sentence;
calculating the score of each sentence using the inter-sentence relevance and semantic similarity;
and selecting the sentences whose scores meet a threshold, adding preset connecting words, and outputting them in the selected output order to obtain the summary content.
On the other hand, the invention also discloses a text automatic summarization device, which is characterized by comprising the following components:
the segmentation module, used for segmenting the document to be summarized according to predefined sentence ending symbols;
the first calculation module, used for calculating the topic vector of each segmented sentence according to an existing text corpus;
the second calculation module, used for determining the relevance of every two sentences according to the number of words that co-occur between them;
the third calculation module, used for calculating the semantic similarity between every two sentences according to the topic vector of each sentence;
the scoring module, used for calculating the score of each sentence using the inter-sentence relevance and semantic similarity;
and the summary output module, used for selecting the sentences whose scores meet the threshold, adding preset connecting words, and outputting them in the selected output order to obtain the summary content.
The invention also provides an electronic device, which comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus; the memory is used for storing a computer program; and the processor is used for implementing the steps of the above method when executing the program stored in the memory.
Compared with the prior art, the invention has the beneficial effects that:
the invention calculates the score of each sentence by utilizing the relevance and the similarity among the sentences, comprehensively considers the collinearity rate of the sentence words and the semantic relevance, and improves the accuracy of sentence scoring. The summarization method provided by the invention is convenient to calculate and strong in universality.
Drawings
FIG. 1 is a flow diagram of a method for automatically summarizing text, in some embodiments.
FIG. 2 is a flow diagram of a method for automatically summarizing text, in further embodiments.
Fig. 3 is a flow chart of the text automatic summarization method in the case of Chinese text.
Fig. 4 is a schematic diagram of an apparatus for automatically summarizing text in some embodiments.
Fig. 5 is a schematic diagram of an electronic device in some embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
In some embodiments, as shown in fig. 1, a method for automatically summarizing text includes the steps of:
and Step110, segmenting the document to be summarized according to predefined sentence ending symbols.
For example, after segmentation the document is denoted d = {s_1, s_2, …, s_m}, where 1, 2, …, m number the segmented sentences. In the present embodiment, the sentence ending symbol is not particularly limited: it may be ".", "!" or "?", may also be "," or ";", or may even be a designated segmentation symbol.
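As a minimal illustration of this segmentation step (a sketch only; the concrete symbol set is an assumption, since the patent leaves the ending symbols configurable), the splitting can be written in Python as:

```python
import re

# Predefined sentence ending symbols. This particular set (Chinese and
# Western full stops, exclamation and question marks) is an assumption.
SENTENCE_END = r"(?<=[。！？.!?])"

def split_sentences(document: str) -> list:
    """Segment document d into sentences s_1, ..., s_m."""
    parts = re.split(SENTENCE_END, document)
    return [p.strip() for p in parts if p.strip()]
```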
Step120, calculating a topic vector of each segmented sentence according to the existing text corpus.
In some embodiments, an LDA (Latent Dirichlet Allocation) topic model algorithm may be used, that is, the co-occurrence matrix of topics and words is counted over a number of topic-labeled text corpora. The specific calculation method is shown in formula (1), written here in the standard LDA form consistent with the definitions that follow:

$$p(t = r \mid \omega_i, d_k) \propto \varphi(d_k, r)\,\theta(r, \omega_i), \qquad r = 1, 2, \dots, n$$

where p represents probability, the word ω is a word in article d, t is a topic of the article, n represents the number of topics, d_k denotes the k-th document, and ω_i denotes the i-th word in d_k.
The method for calculating the topic vector of each segmented sentence comprises the following steps:
calculating the conditional probability of the words and topics in each sentence according to the preset formula, and repeating the conditional-probability step until the calculation result converges, thereby obtaining the topic vectors of the segmented sentences, denoted $\vec{t}_{s_1}, \vec{t}_{s_2}, \dots, \vec{t}_{s_m}$ respectively.
The formula for calculating the conditional probability of the words and the topics in the document to be summarized can be formula (1) above, where φ(d_k, r) indicates the probability of association of the document d_k with topic r, and θ(r, ω_i) represents the probability of association of the word ω_i with the topic. In the smoothed estimates of φ and θ, a_r is a prior parameter associated with the topic r, and b_t is a prior parameter related to the word t.
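One way to realize this step is with an off-the-shelf LDA implementation. The sketch below uses gensim, which is an assumption (the patent names no library); it trains on a tokenized corpus and infers a dense topic vector for each sentence:

```python
from gensim import corpora, models

def train_lda(tokenized_docs, n_topics):
    """Train an LDA topic model on an existing tokenized text corpus."""
    dictionary = corpora.Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(bow_corpus, num_topics=n_topics,
                          id2word=dictionary, passes=10, random_state=0)
    return lda, dictionary

def topic_vector(lda, dictionary, tokens):
    """Infer the dense n-dimensional topic vector of one tokenized sentence."""
    bow = dictionary.doc2bow(tokens)
    dist = lda.get_document_topics(bow, minimum_probability=0.0)
    return [prob for _, prob in dist]
```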
Step130, determining the relevance of the two sentences according to the number of the words which commonly appear between the two sentences.
The specific calculation method is shown in formula (2), written here in the form implied by the definitions below:

$$\mathrm{Sim}(s_i, s_j) = \frac{\left|\{\omega_k \mid \omega_k \in s_i \ \mathrm{and}\ \omega_k \in s_j\}\right|}{|s_i| + |s_j|}$$

where s_i, s_j are differently numbered sentences, ω_k is a word belonging to a sentence, |s_i| is the number of words in sentence s_i, and |s_j| is the number of words in sentence s_j. The higher the value of the calculation result, the higher the sentence relevance.
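In code, the reconstructed formula (2) is a few lines (a sketch; sentences are assumed to be already tokenized into word lists):

```python
def relevance(si, sj):
    """Formula (2): word co-occurrence relevance of two tokenized sentences.
    For a sentence without repeated words, relevance(s, s) = 0.5, which
    matches the diagonal of Table 1 in Example 3 below."""
    common = len(set(si) & set(sj))
    return common / (len(si) + len(sj))
```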
Step140, calculating semantic similarity between every two sentences according to the topic vectors of each sentence.
The specific calculation method is shown in formula (3), written here as the cosine of the topic vectors:

$$\mathrm{TopicSim}(s_i, s_j) = \frac{\vec{t}_{s_i} \cdot \vec{t}_{s_j}}{\|\vec{t}_{s_i}\|\,\|\vec{t}_{s_j}\|}$$

where $\vec{t}_{s_i}$, $\vec{t}_{s_j}$ represent the topic vectors of sentences s_i, s_j. The higher the value of the calculation result, the higher the semantic relevance of the sentences.
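A corresponding sketch of formula (3), assuming the cosine form reconstructed above:

```python
import math

def topic_sim(ti, tj):
    """Formula (3): cosine similarity between two topic vectors."""
    dot = sum(a * b for a, b in zip(ti, tj))
    ni = math.sqrt(sum(a * a for a in ti))
    nj = math.sqrt(sum(b * b for b in tj))
    return dot / (ni * nj) if ni > 0 and nj > 0 else 0.0
```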
It is understood that the steps of Step130 and Step140 may be interchanged.
Step150, calculating the score of each sentence according to the inter-sentence relevance and semantic similarity, and repeatedly iterating the score calculation until the result converges.
The score of each sentence is calculated as follows (written here in the TextRank-style form implied by the parameter definitions below):

$$TR(s_i) = \frac{1-e}{n} + e \sum_{s_j \in IN(i)} \frac{\alpha\,\mathrm{Sim}(s_j, s_i) + \beta\,\mathrm{TopicSim}(s_j, s_i)}{\sum_{s_k \in OUT(j)} \left[\alpha\,\mathrm{Sim}(s_j, s_k) + \beta\,\mathrm{TopicSim}(s_j, s_k)\right]}\, TR(s_j)$$

where TR(s_i) represents the score of sentence s_i; s_i, s_j, s_k respectively represent the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all sentences connected to s_i; n represents the number of topics; and α, e, β are calculation parameters, with α and β taking values in [0,1] and e taking values in [0,1]. (1-e)/n is a correction amount, taken as the random probability that sentence s_i is chosen as a key sentence. The method jointly considers the influence of the word co-occurrence rate and the semantic relevance of sentences on sentence scoring, so as to improve the scoring accuracy.
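The Step150 iteration can be sketched as follows, assuming the reconstructed TR formula above; rel and sim are the m x m matrices produced by Step130 and Step140, and the default parameter values are those used in Example 3 (the n_topics default is an arbitrary placeholder):

```python
def score_sentences(rel, sim, alpha=0.5, beta=0.85, e=0.5, n_topics=10,
                    tol=1e-6, max_iter=200):
    """Iterate TR(s_i) until convergence (Embodiment 1 form)."""
    m = len(rel)
    # Combined edge weight: alpha * word relevance + beta * topic similarity.
    w = [[alpha * rel[j][i] + beta * sim[j][i] for i in range(m)]
         for j in range(m)]
    out_sum = [sum(w[j][k] for k in range(m) if k != j) for j in range(m)]
    tr = [1.0 / m] * m  # uniform initial scores
    for _ in range(max_iter):
        new = [(1 - e) / n_topics
               + e * sum(w[j][i] / out_sum[j] * tr[j]
                         for j in range(m) if j != i and out_sum[j] > 0)
               for i in range(m)]
        if max(abs(a - b) for a, b in zip(new, tr)) < tol:
            return new
        tr = new
    return tr
```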
Example 2
In order to further improve the accuracy of sentence scoring, the semantic similarity between each sentence and the document to be summarized can be used as a correction parameter when the score of each sentence is calculated. As shown in fig. 2, an automatic text summarization method includes the following steps:
step210, segmenting the document d to be summarized according to the predefined sentence ending symbol.
For example, after segmentation the document is denoted d = {s_1, s_2, …, s_m}, where 1, 2, …, m number the segmented sentences. In the present embodiment, the sentence ending symbol is not particularly limited: it may be ".", "!" or "?", may also be "," or ";", or may even be a designated segmentation symbol.
Step220, respectively calculating the topic vector of the document to be summarized and the topic vector of each segmented sentence s_1, s_2, …, s_m according to the existing text corpus.
In some embodiments, an LDA (Latent Dirichlet Allocation) topic model algorithm may be used, that is, the co-occurrence matrix of topics and words is counted over a number of topic-labeled text corpora; the specific calculation method is shown in formula (1) of Embodiment 1, where p represents probability, ω is a word in article d, t is a topic of the article, n represents the number of topics, d_k denotes the k-th document, and ω_i denotes the i-th word in d_k.
The method for calculating the topic vector of the document to be summarized and the topic vector of each segmented sentence according to a preset formula comprises the following steps:
calculating the conditional probability of words and topics in the document, and repeating the conditional-probability step until the calculation result converges, to obtain the topic vector $\vec{t}_d$ of the document to be summarized;
calculating the conditional probability of the words and topics in each segmented sentence, and repeating the conditional-probability step until the calculation result converges, to obtain the topic vectors $\vec{t}_{s_1}, \vec{t}_{s_2}, \dots, \vec{t}_{s_m}$ of the segmented sentences, where s represents a sentence and 1, 2, …, m number the segmented sentences;
the formula for calculating the conditional probability of the words and the topics in the document, and in each segmented sentence, can be any existing feasible calculation formula; the formula in Embodiment 1 can also be used.
Step230, measuring the relevance of every two sentences by the number of words that co-occur between them in the document to be summarized. The specific calculation method is as in formula (2) of Embodiment 1, where s_i, s_j are differently numbered sentences, ω_k is a word belonging to a sentence, and |s_i| is the number of words in sentence s_i; the higher the value of the calculation result, the higher the sentence relevance.
Step240, calculating the semantic similarity between every two sentences from the topic vector of each segmented sentence, and calculating the semantic similarity between the document to be summarized and each sentence from the topic vector of the document and the topic vector of each segmented sentence.
The formula for calculating the semantic similarity between two sentences is formula (3) of Embodiment 1, where $\vec{t}_{s_i}$, $\vec{t}_{s_j}$ represent the topic vectors of sentences s_i, s_j; the higher the value of the calculation result, the higher the semantic relevance of the sentences.
The formula for calculating the semantic similarity between the document to be summarized and each sentence is the same cosine form:

$$\mathrm{TopicSim}(s_i, DOC) = \frac{\vec{t}_{s_i} \cdot \vec{t}_d}{\|\vec{t}_{s_i}\|\,\|\vec{t}_d\|}$$
it is understood that the steps of Step230 and Step240 may be interchanged.
Step250, calculating the score of each sentence using the inter-sentence relevance, the inter-sentence semantic similarity, and the semantic similarity between each sentence and the document to be summarized, and repeatedly iterating the score calculation until the result converges.
The score of each sentence is calculated as follows:

$$TR(s_i) = (1-e)\,\mathrm{TopicSim}(s_i, DOC) + e \sum_{s_j \in IN(i)} \frac{\alpha\,\mathrm{Sim}(s_j, s_i) + \beta\,\mathrm{TopicSim}(s_j, s_i)}{\sum_{s_k \in OUT(j)} \left[\alpha\,\mathrm{Sim}(s_j, s_k) + \beta\,\mathrm{TopicSim}(s_j, s_k)\right]}\, TR(s_j)$$

where TR(s_i) represents the score of sentence s_i; s_i, s_j, s_k respectively represent the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all sentences connected to s_i; DOC represents the sentence set of the document to be summarized; and α, e, β are calculation parameters, with α and β taking values in [0,1] and e taking values in [0,1]. The invention jointly considers the influence of the word co-occurrence rate and the semantic relevance of sentences on the sentence score. When correcting the score result, the fixed value (1-e) is not used directly as the random probability that sentence s_i is chosen as a key sentence; instead, a semantics-related random probability is used, that is, (1-e) is multiplied by the topic relevance TopicSim(s_i, DOC) to express the random probability that the sentence is selected as a key summary sentence, fully taking the influence of semantics on sentence scoring into account.
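Relative to the Embodiment 1 sketch above, only the correction term changes; a sketch of the Step250 iteration, where doc_sim[i] holds TopicSim(s_i, DOC) from Step240:

```python
def score_sentences_doc(rel, sim, doc_sim, alpha=0.5, beta=0.85, e=0.5,
                        tol=1e-6, max_iter=200):
    """Embodiment 2 scoring: the random-selection probability is weighted
    by each sentence's topic similarity to the whole document."""
    m = len(rel)
    w = [[alpha * rel[j][i] + beta * sim[j][i] for i in range(m)]
         for j in range(m)]
    out_sum = [sum(w[j][k] for k in range(m) if k != j) for j in range(m)]
    tr = [1.0 / m] * m
    for _ in range(max_iter):
        new = [(1 - e) * doc_sim[i]
               + e * sum(w[j][i] / out_sum[j] * tr[j]
                         for j in range(m) if j != i and out_sum[j] > 0)
               for i in range(m)]
        if max(abs(a - b) for a, b in zip(new, tr)) < tol:
            return new
        tr = new
    return tr
```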
Step260, selecting the sentences whose scores meet the threshold, adding preset connecting words, and outputting them in the selected output order to obtain the summary content.
The invention calculates the score of each sentence using both inter-sentence relevance and similarity, jointly considering the word co-occurrence rate and the semantic relevance of sentences, which improves the accuracy of sentence scoring. The summarization method provided by the invention is convenient to compute, highly general, and suitable for summary extraction of text content in many fields.
Example 3
The text automatic summarization method of the invention can be applied to Chinese, English and other languages. In some embodiments, as shown in fig. 3, a method for automatically summarizing a Chinese text includes the following steps:
step1, counting a co-occurrence matrix of 'theme-words' in a mass text corpus with theme labels. The method comprises the following specific steps:
step 1.1, preprocessing, initializing a massive corpus, and defining the kth document as dk,dkThe ith word in (1) is represented by ωiAnd (4) showing.
Step 1.2, calculating the conditional probability of the words and the topics in the document, with the specific formula as in formula (1):

$$p(t = r \mid \omega_i, d_k) \propto \varphi(d_k, r)\,\theta(r, \omega_i), \qquad r = 1, 2, \dots, n$$

where p represents probability, ω is a word in the article d, t is a topic of the article, n represents the number of topics, d_k denotes the k-th document, ω_i denotes the i-th word in d_k, φ(d_k, r) indicates the probability of association of the document with the topic, and θ(r, ω_i) represents the probability of association of a word with the topic; a_r is a prior parameter associated with the topic r, and b_t is a prior parameter related to the word t, used to smooth the estimates of φ and θ.
Step 1.3, repeating step 1.2 until the calculation result converges, obtaining the final topic-word co-occurrence matrix M.
Step 2, the document d to be summarized is segmented according to the sentence ending symbols, i.e., d = {s_1, s_2, …, s_m}, where 1, 2, …, m number the segmented sentences. The sentence ending symbol is generally ".", "!", "?" or the like.
For example, the document d to be summarized is the following paragraph:

"Futures news: on the evening of May 9, XXX issued several opinions on further promoting the healthy development of the capital market, and the new 'National Nine Articles' officially landed. The National Nine Articles emphasize relaxing business admission. Implement an open, transparent and orderly securities and futures business license management system, study cross-holding of licenses by securities companies, fund management companies, futures companies, securities investment consulting companies and the like, and support other qualified financial institutions in applying for securities and futures business licenses on the basis of risk isolation. Wang Hongying (blog, microblog), vice general manager of international futures reached by the futures news service, explained the influence of the National Nine Articles on the futures industry. Wang Hongying believes the National Nine Articles will have a positive influence on the futures industry, that powerful futures companies will develop further toward international business, and that more and more large financial institutions will take stakes in futures companies, which will have a positive influence on the capital strength and business strength of futures companies. In addition, futures companies need to operate in a differentiated way, and professional futures businesses such as hedging will become more and more important, which will undoubtedly influence the future development of futures companies."

The segmented sentences are:

s1: Futures news: on the evening of May 9, XXX issued several opinions on further promoting the healthy development of the capital market, and the new 'National Nine Articles' officially landed.

s2: The National Nine Articles emphasize relaxing business admission.

s3: Implement an open, transparent and orderly securities and futures business license management system, study cross-holding of licenses by securities companies, fund management companies, futures companies, securities investment consulting companies and the like, and support other qualified financial institutions in applying for securities and futures business licenses on the basis of risk isolation.

s4: Wang Hongying (blog, microblog), vice general manager of international futures reached by the futures news service, explained the influence of the National Nine Articles on the futures industry.

s5: Wang Hongying believes the National Nine Articles will have a positive influence on the futures industry, that powerful futures companies will develop further toward international business, and that more and more large financial institutions will take stakes in futures companies, which will have a positive influence on the capital strength and business strength of futures companies.

s6: In addition, futures companies need to operate in a differentiated way, and professional futures businesses such as hedging will become more and more important, which will undoubtedly influence the future development of futures companies.
Step 3, respectively calculating the topic vector of the document d and the topic vector of each segmented sentence s_1, s_2, …, s_m according to the co-occurrence matrix M. The specific steps are as follows:

3.1. Calculate the conditional probability of the words and topics in document d according to the formula in step 1.2.

3.2. Repeat step 3.1 until the calculation result converges, obtaining the topic vector $\vec{t}_d$ of document d.

3.3. Calculate the conditional probability of the words and topics in each segmented sentence s_1, s_2, …, s_m according to the formula in step 1.2.

3.4. Repeat step 3.3 until the calculation result converges, obtaining the topic vectors $\vec{t}_{s_1}, \vec{t}_{s_2}, \dots, \vec{t}_{s_m}$ of the segmented sentences.
Step 4, measuring the relevance of sentences by the number of words co-occurring in two sentences, with the calculation formula as in formula (2):

$$\mathrm{Sim}(s_i, s_j) = \frac{\left|\{\omega_k \mid \omega_k \in s_i \ \mathrm{and}\ \omega_k \in s_j\}\right|}{|s_i| + |s_j|}$$

where s_i, s_j are differently numbered sentences, ω_k is a word belonging to a sentence, and |s_i| is the number of words in sentence s_i. The higher the value of the calculation result, the higher the sentence relevance.
In the present embodiment, the relevance between every two sentences is shown in Table 1.

TABLE 1

|    | S1   | S2   | S3   | S4   | S5   | S6   |
|----|------|------|------|------|------|------|
| S1 | 0.5  | 0    | 0.15 | 0.25 | 0.26 | 0.21 |
| S2 | 0    | 0.5  | 0    | 0    | 0.12 | 0    |
| S3 | 0.15 | 0    | 0.5  | 0.1  | 0.23 | 0.22 |
| S4 | 0.25 | 0    | 0.1  | 0.5  | 0.28 | 0.17 |
| S5 | 0.26 | 0.12 | 0.23 | 0.28 | 0.5  | 0.3  |
| S6 | 0.21 | 0    | 0.22 | 0.17 | 0.30 | 0.5  |
Step 5, calculating the semantic similarity between every two sentences using the topic vector of each segmented sentence, with the calculation formula as in formula (3):

$$\mathrm{TopicSim}(s_i, s_j) = \frac{\vec{t}_{s_i} \cdot \vec{t}_{s_j}}{\|\vec{t}_{s_i}\|\,\|\vec{t}_{s_j}\|}$$

where $\vec{t}_{s_i}$, $\vec{t}_{s_j}$ represent the topic vectors of sentences s_i, s_j. The higher the value of the calculation result, the higher the semantic relevance of the sentences.
Table 2 shows the semantic similarity between every two sentences and the similarity between each sentence and the document to be summarized (denoted Doc) in this embodiment.

TABLE 2
Step 6, scoring each sentence of the text content to be summarized, and repeatedly iterating the score calculation until the result converges. The scoring formula adopted in this embodiment is:

$$TR(s_i) = (1-e)\,\mathrm{TopicSim}(s_i, DOC) + e \sum_{s_j \in IN(i)} \frac{\alpha\,\mathrm{Sim}(s_j, s_i) + \beta\,\mathrm{TopicSim}(s_j, s_i)}{\sum_{s_k \in OUT(j)} \left[\alpha\,\mathrm{Sim}(s_j, s_k) + \beta\,\mathrm{TopicSim}(s_j, s_k)\right]}\, TR(s_j)$$

where TR(s_i) represents the score of sentence s_i; s_i, s_j, s_k respectively represent the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all sentences connected to s_i; DOC represents the sentence set of the document to be summarized; and α, e, β are calculation parameters, with α and β taking values in [0,1] and e taking values in [0,1]. The invention jointly considers the word co-occurrence rate and the semantic relevance of sentences to improve the accuracy of sentence scoring, and multiplies (1-e) by the topic relevance TopicSim(s_i, DOC) to express the random probability that the sentence is selected as a key summary sentence. In this embodiment, α, e, β are 0.5, 0.5, 0.85 respectively, and the score of each sentence is calculated as:

{S1: 0.15868276207017054, S2: 0.12009075569229676, S3: 0.18508810964258574, S4: 0.15716454363543905, S5: 0.21333489450057458, S6: 0.165638934458933}
Step 7, selecting the sentences with TR(s_i) > Φ, where Φ is a fixed threshold, then adding the rule-given connecting words, and outputting the summary content in the order of appearance in the original text.

If Φ = 0.1, the summary content finally output for the document d to be summarized is:

['Wang Hongying believes the National Nine Articles will have a positive influence on the futures industry, that powerful futures companies will develop further toward international business, and that more and more large financial institutions will take stakes in futures companies, which will have a positive influence on the capital strength and business strength of futures companies', 'Implement an open, transparent and orderly securities and futures business license management system, study cross-holding of licenses by securities companies, fund management companies, futures companies, securities investment consulting companies and the like, and support other qualified financial institutions in applying for securities and futures business licenses on the basis of risk isolation', 'In addition, futures companies need to operate in a differentiated way, and professional futures businesses such as hedging will become more and more important, which will undoubtedly influence the future development of futures companies', 'Futures news: on the evening of May 9, XXX issued several opinions on further promoting the healthy development of the capital market, and the new National Nine Articles officially landed', 'Wang Hongying (blog, microblog), vice general manager of international futures reached by the futures news service, explained the influence of the National Nine Articles on the futures industry', 'The National Nine Articles emphasize relaxing business admission']
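The selection and output of step 7 can be sketched as follows; the connector string and the ordering switch are placeholders for the patent's rule-given connecting words and selected output sequence (the example output above happens to be in descending score order):

```python
def build_summary(sentences, scores, phi=0.1, connector=", ", by_position=True):
    """Select the sentences with TR(s_i) > phi and join them.
    by_position=True outputs in original-text order; False outputs in
    descending score order, as in the example output above."""
    chosen = [i for i, sc in enumerate(scores) if sc > phi]
    if not by_position:
        chosen.sort(key=lambda i: scores[i], reverse=True)
    return connector.join(sentences[i] for i in chosen)
```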
Example 4
Correspondingly, the present invention also discloses a text automatic summarization device, in some embodiments, as shown in fig. 4, the device includes:
and the segmentation module 11 is used for segmenting the document to be summarized according to predefined sentence ending symbols.
For example, after segmentation the document to be summarized is denoted d = {s_1, s_2, …, s_m}, where 1, 2, …, m number the segmented sentences. The sentence ending symbol is generally ".", "!", "?" or the like.
The first calculating module 12 is configured to calculate a topic vector of each segmented sentence according to an existing text corpus.
The first calculation module 12 further builds the topic model using an LDA (Latent Dirichlet Allocation) topic model algorithm, that is, the co-occurrence matrix of topics and words is counted over a number of topic-labeled text corpora; the specific calculation method is shown in formula (1) of Embodiment 1, where p represents probability, ω is a word in article d, t is a topic of the article, n represents the number of topics, d_k denotes the k-th document, and ω_i denotes the i-th word in d_k.
The method by which the first calculation module 12 calculates the topic vector of each segmented sentence comprises the following steps:
calculating the conditional probability of the words and topics in each segmented sentence, and repeating the conditional-probability step until the calculation result converges, to obtain the topic vectors $\vec{t}_{s_1}, \vec{t}_{s_2}, \dots, \vec{t}_{s_m}$ of the segmented sentences. Any existing feasible calculation formula can be selected; the formula in Embodiment 1 can also be used.
And the second calculating module 13 is configured to measure the relevance between two sentences by the number of terms commonly occurring between two sentences in the document to be summarized.
The specific calculation method is as in formula (2):

$$\mathrm{Sim}(s_i, s_j) = \frac{\left|\{\omega_k \mid \omega_k \in s_i \ \mathrm{and}\ \omega_k \in s_j\}\right|}{|s_i| + |s_j|}$$

where s_i, s_j are differently numbered sentences, ω_k is a word belonging to a sentence, and |s_i| is the number of words in sentence s_i. The higher the value of the calculation result, the higher the sentence relevance.
And the third calculating module 14 is configured to calculate semantic similarity between every two sentences according to the topic vector of each segmented sentence.
The specific calculation method is as in formula (3):

$$\mathrm{TopicSim}(s_i, s_j) = \frac{\vec{t}_{s_i} \cdot \vec{t}_{s_j}}{\|\vec{t}_{s_i}\|\,\|\vec{t}_{s_j}\|}$$

where $\vec{t}_{s_i}$, $\vec{t}_{s_j}$ represent the topic vectors of sentences s_i, s_j. The higher the value of the calculation result, the higher the semantic relevance of the sentences.
And the scoring module 15 is configured to calculate a score of each sentence according to the relevance and similarity between the sentences.
The score of each sentence is calculated as follows:

$$TR(s_i) = \frac{1-e}{n} + e \sum_{s_j \in IN(i)} \frac{\alpha\,\mathrm{Sim}(s_j, s_i) + \beta\,\mathrm{TopicSim}(s_j, s_i)}{\sum_{s_k \in OUT(j)} \left[\alpha\,\mathrm{Sim}(s_j, s_k) + \beta\,\mathrm{TopicSim}(s_j, s_k)\right]}\, TR(s_j)$$

where TR(s_i) represents the score of sentence s_i; s_i, s_j, s_k respectively represent the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all sentences connected to s_i; n represents the number of topics; DOC represents the sentence set of the document to be summarized; and α, e, β are calculation parameters, with α and β taking values in [0,1] and e taking values in [0,1]. (1-e)/n is a correction amount, taken as the random probability that sentence s_i is chosen as a key sentence. The method jointly considers the influence of the word co-occurrence rate and the semantic relevance of sentences on sentence scoring, so as to improve the scoring accuracy.
The summary output module 16 is used for selecting the sentences whose scores meet the threshold, adding preset connecting words, and outputting them in the selected output order to obtain the summary content.
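Putting the modules together, a hedged end-to-end sketch of the device's data flow, reusing the illustrative helpers sketched earlier (split_sentences, train_lda, topic_vector, relevance, topic_sim, score_sentences, build_summary; none of these names come from the patent):

```python
def summarize(document, tokenized_corpus, n_topics=10, phi=0.1):
    """End-to-end pipeline mirroring modules 11-16."""
    sentences = split_sentences(document)                       # module 11
    lda, dic = train_lda(tokenized_corpus, n_topics)
    # Naive whitespace tokenization; real use (e.g. Chinese) needs a
    # proper word segmenter.
    tokens = [s.split() for s in sentences]
    thetas = [topic_vector(lda, dic, t) for t in tokens]        # module 12
    m = len(sentences)
    rel = [[relevance(tokens[i], tokens[j]) for j in range(m)]  # module 13
           for i in range(m)]
    sim = [[topic_sim(thetas[i], thetas[j]) for j in range(m)]  # module 14
           for i in range(m)]
    scores = score_sentences(rel, sim, n_topics=n_topics)       # module 15
    return build_summary(sentences, scores, phi)                # module 16
```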
Example 5
In other embodiments, compared with Embodiment 4, the first calculation module 12 is further configured to calculate the topic vector of the document to be summarized according to the existing text corpus. The method for calculating the topic vector of the document to be summarized comprises: calculating the conditional probability of the words and topics in the document, and repeating the conditional-probability step until the calculation result converges, to obtain the topic vector $\vec{t}_d$ of the document to be summarized. The third calculation module 14 is further configured to calculate the semantic similarity between the document to be summarized and each sentence. The scoring module 15 is further used for correcting the scoring result using the similarity between each sentence and the document to be summarized. The final scoring formula used is as follows:

$$TR(s_i) = (1-e)\,\mathrm{TopicSim}(s_i, DOC) + e \sum_{s_j \in IN(i)} \frac{\alpha\,\mathrm{Sim}(s_j, s_i) + \beta\,\mathrm{TopicSim}(s_j, s_i)}{\sum_{s_k \in OUT(j)} \left[\alpha\,\mathrm{Sim}(s_j, s_k) + \beta\,\mathrm{TopicSim}(s_j, s_k)\right]}\, TR(s_j)$$

where TR(s_i) represents the score of sentence s_i; s_i, s_j, s_k respectively represent the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all sentences connected to s_i; DOC represents the sentence set of the document to be summarized; α, e, β are calculation parameters, with α and β taking values in [0,1] and e taking values in [0,1]; and TopicSim(s_i, DOC) represents the semantic similarity between the document to be summarized and sentence s_i, calculated from the topic vector $\vec{t}_{s_i}$ of the sentence and the topic vector $\vec{t}_d$ of the document to be summarized.
This embodiment jointly considers the influence of the word co-occurrence rate and the semantic relevance of sentences on the sentence score. When correcting the score result, the fixed value (1-e) is not used directly as the random probability that sentence s_i is chosen as a key sentence; instead, a semantics-related random probability is used, that is, (1-e) is multiplied by the topic relevance TopicSim(s_i, DOC) to express the random probability that the sentence is selected as a key summary sentence, fully taking the influence of semantics on sentence scoring into account.
Example 6
Corresponding to the method embodiment, the embodiment of the invention also provides electronic equipment. Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, where the electronic device includes: a processor 510, a communication interface 520, a memory 530, and a communication bus 540, wherein:
the processor 510, the communication interface 520 and the memory 530 communicate with one another via the communication bus 540, and the memory 530 is used for storing a computer program;
the processor 510 is configured to implement the text automatic summarization method provided by the present invention when executing the program stored in the memory 530. Specifically, the text automatic summarization method includes:
dividing the document to be summarized according to predefined sentence ending symbols;
calculating a topic vector of each sentence after segmentation according to an existing text corpus;
determining the relevancy of every two sentences according to the number of words which commonly appear between every two sentences;
calculating semantic similarity between every two sentences according to the topic vector of each sentence;
calculating the score of each sentence by utilizing the relevance and semantic similarity among the sentences;
and selecting sentences the score values of which meet the threshold value, adding preset connecting words, and outputting the sentences according to the selected output sequence to obtain the abstract content.
Of course, in order to further improve the accuracy of the summary, the text automatic summary method may further include:
dividing the document to be summarized according to predefined sentence ending symbols;
respectively calculating the topic vector of the document to be abstracted and the topic vector of each sentence after segmentation according to the existing text corpus;
measuring the relevance of every two sentences through the quantity of words commonly appearing between every two sentences in the document to be summarized;
calculating semantic similarity between every two sentences through the topic vector of each segmented sentence, and calculating the semantic similarity between the document to be abstracted and each sentence through the topic vector of the document to be abstracted and the topic vector of each segmented sentence;
calculating the score of each sentence according to the relevance between the sentences, the semantic similarity and the semantic similarity between the sentences and the document to be summarized;
and selecting sentences the scores of which meet the threshold, adding preset connecting words, and outputting according to the selected output sequence to obtain the abstract content.
The implementation manner of the text automatic summarization method is the same as that of the text automatic summarization method provided in the foregoing method embodiment section, and details are not repeated here.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in the figure, but this does not mean there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. The embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the embodiments of the apparatus and the electronic device, since they are substantially similar to the embodiments of the method, the description is simple, and the relevant points can be referred to only in the partial description of the embodiments of the method.
It will be understood by those skilled in the art that the foregoing is only an exemplary embodiment of the present invention, and is not intended to limit the invention to the particular forms disclosed, since various modifications, substitutions and improvements within the spirit and scope of the invention are possible and within the scope of the appended claims.
Claims (9)
1. A text automatic summarization method is characterized by comprising the following steps:
dividing the document to be summarized according to predefined sentence ending symbols;
calculating a topic vector of each sentence after segmentation according to an existing text corpus;
determining the relevance of every two sentences according to the quantity of the words which commonly appear between every two sentences;
calculating semantic similarity of every two sentences according to the topic vector of each sentence;
the method for calculating the score of each sentence by utilizing the relevance and the semantic similarity adopts the following formula:

$$TR(s_i) = \frac{1-e}{n} + e \sum_{s_j \in IN(i)} \frac{\alpha\,\mathrm{Sim}(s_j, s_i) + \beta\,\mathrm{TopicSim}(s_j, s_i)}{\sum_{s_k \in OUT(j)} \left[\alpha\,\mathrm{Sim}(s_j, s_k) + \beta\,\mathrm{TopicSim}(s_j, s_k)\right]}\, TR(s_j)$$

wherein TR(s_i) represents the score of sentence s_i; α, e, β are preset calculation parameters; s_i, s_j, s_k respectively represent the sentences numbered i, j, k; Sim(s_i, s_j) indicates the relevance of sentences s_i, s_j; TopicSim(s_i, s_j) represents the semantic similarity of sentences s_i, s_j; OUT(j) denotes the sentences other than s_j; IN(i) denotes all sentences connected to s_i; and n represents the number of topics;
and selecting sentences the scores of which meet the threshold, adding preset connecting words, and outputting according to the selected output sequence to obtain the abstract content.
2. The method for automatically summarizing text according to claim 1, further comprising:
calculating the topic vector of the document to be summarized according to an existing text corpus, and calculating the semantic similarity between the document to be summarized and each sentence using the topic vector of the document and the topic vector of each segmented sentence;
calculating the score of each sentence further comprises: using the semantic similarity between the document to be summarized and each sentence as a correction amount.
3. The method for automatically summarizing text according to claim 1, wherein said calculating a topic vector for each sentence after segmentation comprises:
and calculating the conditional probability of the words and the topics in each sentence after segmentation according to a preset formula, and repeating the conditional probability step until the calculation result is converged to obtain the topic vector of each sentence after segmentation.
4. The method for automatically summarizing text according to claim 2, wherein the method for calculating the topic vector of the document to be summarized comprises:
and calculating the conditional probability of the words and the topics in the document to be summarized according to a preset formula, and repeating the step of the conditional probability until the calculation result is converged to obtain the topic vector of the document to be summarized.
5. The method for automatically summarizing text according to claim 1 or 2, wherein the relevance of two sentences, determined according to the number of words that co-occur between the two sentences in the document to be summarized, uses the following calculation formula:

$$\mathrm{Sim}(s_i, s_j) = \frac{\left|\{\omega_k \mid \omega_k \in s_i \ \mathrm{and}\ \omega_k \in s_j\}\right|}{|s_i| + |s_j|}$$

wherein s_i, s_j are differently numbered sentences, ω_k is a word belonging to a sentence, |s_i| is the number of words in sentence s_i, and |s_j| is the number of words in sentence s_j; the higher the value of the calculation result, the higher the sentence relevance.
6. The method for automatically summarizing text according to claim 5, wherein the semantic similarity between every two sentences, calculated from the topic vector of each segmented sentence, uses the following calculation formula:

$$\mathrm{TopicSim}(s_i, s_j) = \frac{\vec{t}_{s_i} \cdot \vec{t}_{s_j}}{\|\vec{t}_{s_i}\|\,\|\vec{t}_{s_j}\|}$$

wherein $\vec{t}_{s_i}$, $\vec{t}_{s_j}$ represent the topic vectors of sentences s_i, s_j.
7. The method for automatically summarizing text according to claim 6, wherein the score of each sentence is calculated using the following formula:

$$TR(s_i) = (1-e)\,\mathrm{TopicSim}(s_i, DOC) + e \sum_{s_j \in IN(i)} \frac{\alpha\,\mathrm{Sim}(s_j, s_i) + \beta\,\mathrm{TopicSim}(s_j, s_i)}{\sum_{s_k \in OUT(j)} \left[\alpha\,\mathrm{Sim}(s_j, s_k) + \beta\,\mathrm{TopicSim}(s_j, s_k)\right]}\, TR(s_j)$$

wherein TR(s_i) represents the score of sentence s_i; α, e, β are preset calculation parameters; s_i, s_j, s_k respectively represent the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all sentences connected to s_i; DOC represents the sentence set of the document to be summarized; and TopicSim(s_i, DOC) represents the semantic similarity between the document to be summarized and sentence s_i, calculated from the topic vector $\vec{t}_{s_i}$ of the sentence and the topic vector $\vec{t}_d$ of the document to be summarized.
8. An apparatus for automatically summarizing text, the apparatus comprising:
the segmentation module is used for segmenting the document to be summarized according to predefined sentence ending symbols;
the first calculation module is used for calculating the topic vector of each sentence after segmentation according to the existing text corpus;
the second calculation module is used for determining the relevance of every two sentences according to the quantity of the words which commonly appear between every two sentences;
the third calculation module is used for calculating semantic similarity between every two sentences according to the topic vector of each sentence;
the scoring module, used for calculating the score of each sentence by utilizing the relevance and the semantic similarity according to the following formula:

$$TR(s_i) = \frac{1-e}{n} + e \sum_{s_j \in IN(i)} \frac{\alpha\,\mathrm{Sim}(s_j, s_i) + \beta\,\mathrm{TopicSim}(s_j, s_i)}{\sum_{s_k \in OUT(j)} \left[\alpha\,\mathrm{Sim}(s_j, s_k) + \beta\,\mathrm{TopicSim}(s_j, s_k)\right]}\, TR(s_j)$$

wherein TR(s_i) represents the score of sentence s_i; α, e, β are preset calculation parameters; s_i, s_j, s_k respectively represent the sentences numbered i, j, k; Sim(s_i, s_j) indicates the relevance of sentences s_i, s_j; TopicSim(s_i, s_j) represents the semantic similarity of sentences s_i, s_j; OUT(j) denotes the sentences other than s_j; IN(i) denotes all sentences connected to s_i; and n represents the number of topics;
and the summary output module, used for selecting the sentences whose scores meet the threshold, adding preset connecting words, and outputting them in the selected output order to obtain the summary content.
9. An electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein:
the processor, the communication interface and the memory complete mutual communication through a communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 7 when executing a program stored in the memory.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810787848.1A (CN109101489B) | 2018-07-18 | 2018-07-18 | Text automatic summarization method and device and electronic equipment |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810787848.1A (CN109101489B) | 2018-07-18 | 2018-07-18 | Text automatic summarization method and device and electronic equipment |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN109101489A (en) | 2018-12-28 |
| CN109101489B (en) | 2022-05-20 |
Family
ID=64846627

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201810787848.1A | CN109101489B (en), Active | 2018-07-18 | 2018-07-18 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN109101489B (en) |
Families Citing this family (8)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110162618B * | 2019-02-22 | 2021-09-17 | 北京捷风数据技术有限公司 | Text summary generation method and device of non-contrast corpus |
| CN110134942B * | 2019-04-01 | 2020-10-23 | 北京中科闻歌科技股份有限公司 | Text hotspot extraction method and device |
| CN110162778B * | 2019-04-02 | 2023-05-26 | 创新先进技术有限公司 | Text abstract generation method and device |
| CN112699657A * | 2020-12-30 | 2021-04-23 | 广东德诚大数据科技有限公司 | Abnormal text detection method and device, electronic equipment and storage medium |
| CN112711662A * | 2021-03-29 | 2021-04-27 | 贝壳找房(北京)科技有限公司 | Text acquisition method and device, readable storage medium and electronic equipment |
| CN113204637B * | 2021-04-13 | 2022-09-27 | 北京三快在线科技有限公司 | Text processing method and device, storage medium and electronic equipment |
| CN113407710A * | 2021-06-07 | 2021-09-17 | 维沃移动通信有限公司 | Information display method and device, electronic equipment and readable storage medium |
| CN114201600A * | 2021-12-10 | 2022-03-18 | 北京金堤科技有限公司 | Public opinion text abstract extraction method, device, equipment and computer storage medium |
Patent Citations (7)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101620596A * | 2008-06-30 | 2010-01-06 | 东北大学 | Multi-document auto-abstracting method facing to inquiry |
| CN102945228A * | 2012-10-29 | 2013-02-27 | 广西工学院 | Multi-document summarization method based on text segmentation |
| CN103136359A * | 2013-03-07 | 2013-06-05 | 宁波成电泰克电子信息技术发展有限公司 | Generation method of single document summaries |
| CN106294863A * | 2016-08-23 | 2017-01-04 | 电子科技大学 | A kind of abstract method for mass text fast understanding |
| CN106445920A * | 2016-09-29 | 2017-02-22 | 北京理工大学 | Sentence similarity calculation method based on sentence meaning structure characteristics |
| CN107491434A * | 2017-08-10 | 2017-12-19 | 北京邮电大学 | Text snippet automatic generation method and device based on semantic dependency |
| CN107729300A * | 2017-09-18 | 2018-02-23 | 百度在线网络技术(北京)有限公司 | Processing method, device, equipment and the computer-readable storage medium of text similarity |
Non-Patent Citations (2)

| Title |
|---|
| Automatic summarization based on topic-word weights and sentence features (基于主题词权重和句子特征的自动文摘); Jiang Changjin et al.; Journal of South China University of Technology (Natural Science Edition) (《华南理工大学学报(自然科学版)》); 2010-07-15 (No. 07) * |
| Automatic text summarization based on comprehensive sentence features (基于综合的句子特征的文本自动摘要); Cheng Yuan et al.; Computer Science (《计算机科学》); 2015-04-15 (No. 04) * |
Also Published As

| Publication Number | Publication Date |
|---|---|
| CN109101489A (en) | 2018-12-28 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| 2023-04-10 | TR01 | Transfer of patent right | Address after: 430074 Room 01, Floor 6, Building A4, Financial Port, 77 Guanggu Avenue, Donghu New Technology Development Zone, Wuhan, Hubei Province. Patentee after: WUHAN SHUBO TECHNOLOGY Co.,Ltd.; WUHAN University. Address before: 430072 Fenghuo Innovation Valley, No. 88, YouKeYuan Road, Hongshan District, Wuhan City, Hubei Province. Patentee before: WUHAN SHUBO TECHNOLOGY Co.,Ltd. |