CN109101489B - Text automatic summarization method and device and electronic equipment - Google Patents


Info

Publication number
CN109101489B
CN109101489B · Application CN201810787848.1A
Authority
CN
China
Prior art keywords: sentence, sentences, calculating, document, summarized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810787848.1A
Other languages
Chinese (zh)
Other versions
CN109101489A (en)
Inventor
文卫东
刘健博
王忠璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Shubo Technology Co ltd
Wuhan University WHU
Original Assignee
Wuhan Shubo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Shubo Technology Co ltd filed Critical Wuhan Shubo Technology Co ltd
Priority to CN201810787848.1A priority Critical patent/CN109101489B/en
Publication of CN109101489A publication Critical patent/CN109101489A/en
Application granted granted Critical
Publication of CN109101489B publication Critical patent/CN109101489B/en
Legal status: Active (current)

Classifications

    • G06F40/00 Handling natural language data (G PHYSICS; G06 COMPUTING; CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING)
    • G06F40/20 Natural language analysis; G06F40/258 Heading extraction; Automatic titling; Numbering
    • G06F40/205 Parsing; G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic text summarization method comprising the steps of: segmenting the document to be summarized at predefined sentence-ending symbols; computing a topic vector for each segmented sentence from an existing text corpus; determining the relevance of each pair of sentences from the number of words the two sentences have in common; computing the semantic similarity of each pair of sentences from their topic vectors; scoring each sentence using the inter-sentence relevance and semantic similarity; and selecting the sentences whose scores meet a threshold, adding preset connecting words, and outputting them in the chosen order to obtain the summary content. By scoring each sentence with both inter-sentence relevance and similarity, the invention jointly accounts for the word co-occurrence rate and the semantic relatedness of sentences, improving the accuracy of sentence scoring. The proposed summarization method is computationally convenient and highly general. The invention also discloses an automatic text summarization apparatus and an electronic device.

Description

Text automatic summarization method and device and electronic equipment
Technical Field
The invention relates to the technical field of natural language understanding, in particular to a text automatic summarization method and device and electronic equipment.
Background
A summary conveys the central content of the original document, comprehensively and accurately, in a short and coherent text. With the explosion of information, the number of documents people must read before completing a task keeps growing, and so does the time spent reading. Automatic summarization can effectively shorten reading time and improve working efficiency in many fields, so it has broad application prospects.
Automatic summarization techniques fall into two categories according to the relationship between the original text and the summary: extractive and generative (abstractive). Extractive summarization selects key sentences from the clause set of the original text without modifying them and combines them into a summary; in essence it converts the summarization problem into a ranking problem, scoring each sentence and composing the summary of a document from its highest-scoring sentences. Generative summarization tries to understand the content of the document and condense its central ideas into refined sentences, which better matches the nature of a summary; seq2seq methods have made some progress on the short-text summarization problem, but on long texts the technical difficulty is high and the results are poor.
At present the widely used technology is still extractive summary generation, which generally measures the relevance of sentences by the words that compose them. In an actual document, however, both sentences with high lexical relevance and sentences with high semantic relevance may be key sentences, so it is unreasonable to consider only one of the two kinds of relevance.
Disclosure of Invention
In view of the above, there is a need for an automatic text summarization method and apparatus that overcome the defects of existing extraction-based methods and offer both broad applicability and high accuracy.
The invention provides the following.
a text automatic summarization method comprises the following steps:
segmenting the document to be summarized at predefined sentence-ending symbols;
calculating a topic vector for each segmented sentence from an existing text corpus;
determining the relevance of each pair of sentences from the number of words the two sentences have in common;
calculating the semantic similarity of each pair of sentences from their topic vectors;
calculating the score of each sentence using the inter-sentence relevance and semantic similarity;
and selecting the sentences whose scores meet a threshold, adding preset connecting words, and outputting them in the chosen order to obtain the summary content.
On the other hand, the invention also discloses an automatic text summarization apparatus, comprising:
a segmentation module, used to segment the document to be summarized at predefined sentence-ending symbols;
a first calculation module, used to calculate a topic vector for each segmented sentence from an existing text corpus;
a second calculation module, used to determine the relevance of each pair of sentences from the number of words the two sentences have in common;
a third calculation module, used to calculate the semantic similarity of each pair of sentences from their topic vectors;
a scoring module, used to calculate the score of each sentence using the inter-sentence relevance and semantic similarity;
and a summary output module, used to select the sentences whose scores meet a threshold, add preset connecting words, and output them in the chosen order to obtain the summary content.
The invention also provides an electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus; the memory is used for storing a computer program; and the processor is used for implementing the steps of the above method when executing the program stored in the memory.
Compared with the prior art, the invention has the beneficial effects that:
the invention calculates the score of each sentence by utilizing the relevance and the similarity among the sentences, comprehensively considers the collinearity rate of the sentence words and the semantic relevance, and improves the accuracy of sentence scoring. The summarization method provided by the invention is convenient to calculate and strong in universality.
Drawings
FIG. 1 is a flow diagram of a method for automatically summarizing text, in some embodiments.
FIG. 2 is a flow diagram of a method for automatically summarizing text, in further embodiments.
Fig. 3 is a flow chart of the text automatic summarization method for the case of Chinese text.
Fig. 4 is a schematic diagram of an apparatus for automatically summarizing text in some embodiments.
Fig. 5 is a schematic diagram of an electronic device in some embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
In some embodiments, as shown in fig. 1, a method for automatically summarizing text includes the steps of:
and Step110, segmenting the document to be summarized according to predefined sentence ending symbols.
For example, after segmentation the document is denoted d = {s_1, s_2, …, s_m}, where 1, 2, …, m number the segmented sentences. In this embodiment the sentence-ending symbols are not particularly limited: they may be "。", "！" or "？", may also be "，" or "；", or may even be designated segmentation symbols.
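For orientation only, this segmentation step can be sketched in Python as follows; the symbol set and the sample text are illustrative assumptions, not taken from the patent:

    import re

    # Hypothetical set of sentence-ending symbols; the patent leaves the set open.
    SENTENCE_ENDINGS = "。！？!?"

    def split_sentences(document):
        """Split a document at the predefined ending symbols, keeping each
        symbol attached to its sentence."""
        parts = re.split("(?<=[%s])" % re.escape(SENTENCE_ENDINGS), document)
        return [p.strip() for p in parts if p.strip()]

    print(split_sentences("第一句。第二句！第三句？"))
    # ['第一句。', '第二句！', '第三句？']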
Step 120: compute a topic vector for each segmented sentence from an existing text corpus.
In some embodiments, an LDA (Latent Dirichlet Allocation) topic model algorithm may be used to build the corpus statistics; that is, the "topic–word" co-occurrence matrix is counted over a number of topic-labeled text corpora. The specific calculation is given in formula (1):

    p(t = r | d_k, ω_i) ∝ p(t = r | d_k) · p(ω_i | t = r)        (1)

where p denotes probability, ω is a word in document d, t is a topic of the document, n denotes the number of topics, d_k denotes the kth document, and ω_i denotes the ith word in d_k; p(t = r | d_k) denotes the association probability of the document with the topic, and p(ω_i | t = r) denotes the association probability of the word with the topic. These are estimated as

    p(t = r | d_k) = (n_{k,r} + a_r) / Σ_{r'=1..n} (n_{k,r'} + a_{r'})
    p(ω_i | t = r) = (n_{r,ω_i} + b_{ω_i}) / Σ_{ω'} (n_{r,ω'} + b_{ω'})

where n_{k,r} counts the words of d_k assigned to topic r and n_{r,ω} counts how often the word ω is assigned to topic r; a_r is a parameter associated with topic r, and b_ω is a parameter associated with the word ω; both are generally set to fixed prior values.
The topic vector of each segmented sentence is then computed as follows: calculate the conditional probability of the words and topics in each sentence according to the preset formula, and repeat this conditional-probability step until the calculation result converges, yielding the topic vectors of the segmented sentences

    v_{s_1}, v_{s_2}, …, v_{s_m},  where v_{s_i} = (p(t_1 | s_i), …, p(t_n | s_i)).

The conditional probability of the words and topics in the document to be summarized may be calculated with formula (1) above.
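As a sketch only (not from the patent text), sentence topic vectors can be obtained with the third-party gensim library; the training texts, the topic count n, and the tokenization below are illustrative assumptions:

    from gensim import corpora
    from gensim.models import LdaModel

    # Hypothetical tokenized corpus standing in for the topic-labeled corpora.
    train_texts = [["stock", "market", "fund", "reform"],
                   ["futures", "company", "business", "licence"]]
    dictionary = corpora.Dictionary(train_texts)
    bows = [dictionary.doc2bow(t) for t in train_texts]

    n_topics = 10  # n, the number of topics (assumed value)
    lda = LdaModel(corpus=bows, id2word=dictionary,
                   num_topics=n_topics, passes=10)

    def topic_vector(tokens):
        """Return the n-dimensional topic vector v_s of a tokenized sentence."""
        bow = dictionary.doc2bow(tokens)
        dist = lda.get_document_topics(bow, minimum_probability=0.0)
        return [float(p) for _, p in sorted(dist)]

    v_s = topic_vector(["futures", "market"])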
Step 130: determine the relevance of each pair of sentences from the number of words the two sentences have in common.
The specific calculation is given in formula (2):

    Similarity(s_i, s_j) = |{ω_k : ω_k ∈ s_i and ω_k ∈ s_j}| / (log|s_i| + log|s_j|)        (2)

where s_i, s_j are sentences with different numbers, ω_k is a word belonging to a sentence, |s_i| is the number of words in sentence s_i, and |s_j| is the number of words in sentence s_j. The higher the value of the result, the higher the relevance of the two sentences.
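A direct transcription of formula (2) as reconstructed above; sentences are passed as token lists, and the guard against a zero denominator (two one-word sentences) is an addition:

    import math

    def relevance(si, sj):
        """Formula (2): shared-word count normalized by log sentence lengths."""
        shared = len(set(si) & set(sj))
        denom = math.log(len(si)) + math.log(len(sj))
        return shared / denom if denom > 0 else 0.0

    print(relevance(["futures", "company", "business"],
                    ["futures", "market", "business"]))  # two shared words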
Step 140: compute the semantic similarity of each pair of sentences from their topic vectors.
The specific calculation is given in formula (3):

    TopicSim(s_i, s_j) = (v_{s_i} · v_{s_j}) / (‖v_{s_i}‖ ‖v_{s_j}‖)        (3)

where v_{s_i} and v_{s_j} denote the topic vectors of sentences s_i and s_j. The higher the value of the result, the higher the semantic relevance of the two sentences.
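Likewise, a transcription of formula (3) in the cosine form assumed above, operating on the topic vectors:

    import math

    def topic_sim(vi, vj):
        """Formula (3): cosine similarity of two topic vectors."""
        dot = sum(a * b for a, b in zip(vi, vj))
        ni = math.sqrt(sum(a * a for a in vi))
        nj = math.sqrt(sum(b * b for b in vj))
        return dot / (ni * nj) if ni > 0 and nj > 0 else 0.0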
It is understood that the steps of Step130 and Step140 may be interchanged.
Step 150: compute the score of each sentence from the inter-sentence relevance and semantic similarity, iterating the score computation until the result converges.
The score of each sentence is calculated as follows:

    TR(s_i) = (1−e)/n + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

where TR(s_i) denotes the score of sentence s_i; s_i, s_j, s_k denote the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all the sentences linking to s_i; n denotes the number of topics; and α, e, β are calculation parameters, with α, β ∈ [0,1] and e ∈ [0,1]. The term (1−e)/n is a correction amount, taken as the random probability that sentence s_i is chosen as a key sentence. The method jointly considers the influence of the sentences' word co-occurrence rate and semantic relevance on sentence scoring, so as to improve the accuracy of sentence scoring.
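For concreteness, a sketch of the Embodiment 1 iteration reusing relevance() and topic_sim() from the sketches above; the fully connected sentence graph, the uniform initialization, and the convergence tolerance are assumptions, and the parameter values α = 0.5, e = 0.5, β = 0.85 are borrowed from Embodiment 3:

    def score_sentences(sentences, topic_vecs, n_topics,
                        alpha=0.5, e=0.5, beta=0.85,
                        tol=1e-6, max_iter=100):
        """Iterate TR(s_i) = (1-e)/n + e * sum_j w_ji / out(j) * TR(s_j)."""
        m = len(sentences)
        # Edge weight: alpha * word-overlap relevance + beta * topic similarity.
        w = [[0.0 if i == j else
              alpha * relevance(sentences[i], sentences[j]) +
              beta * topic_sim(topic_vecs[i], topic_vecs[j])
              for j in range(m)] for i in range(m)]
        out_sum = [sum(row) for row in w]   # denominator over OUT(j)
        tr = [1.0 / m] * m                  # uniform start (assumption)
        for _ in range(max_iter):
            new = [(1 - e) / n_topics +
                   e * sum(w[j][i] / out_sum[j] * tr[j]
                           for j in range(m) if j != i and out_sum[j] > 0)
                   for i in range(m)]
            if max(abs(a - b) for a, b in zip(new, tr)) < tol:
                return new
            tr = new
        return tr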
Example 2
To further improve the accuracy of sentence scoring, the semantic similarity between each sentence and the document to be summarized can be used as a correction term when computing each sentence's score. As shown in fig. 2, the text automatic summarization method then includes the following steps:
step210, segmenting the document d to be summarized according to the predefined sentence ending symbol.
For example, after segmentation the document is denoted d = {s_1, s_2, …, s_m}, where 1, 2, …, m number the segmented sentences. In this embodiment the sentence-ending symbols are not particularly limited: they may be "。", "！" or "？", may also be "，" or "；", or may even be designated segmentation symbols.
Step 220: from the existing text corpus, compute the topic vector of the document to be summarized and the topic vectors of the segmented sentences s_1, s_2, …, s_m.
In some embodiments, an LDA (Latent Dirichlet Allocation) topic model algorithm may be used, as in Embodiment 1: the "topic–word" co-occurrence matrix is counted over a number of topic-labeled text corpora according to formula (1), where p denotes probability, the word ω is a word in document d, t is a topic of the document, n denotes the number of topics, d_k denotes the kth document, and ω_i denotes the ith word in d_k.
The topic vector of the document to be summarized and the topic vector of each segmented sentence are computed according to the preset formula as follows:
calculate the conditional probability of the words and topics in the document, and repeat this conditional-probability step until the calculation result converges, yielding the topic vector v_d of the document to be summarized;
calculate the conditional probability of the words and topics in each segmented sentence, and repeat this conditional-probability step until the calculation result converges, yielding the topic vectors v_{s_1}, v_{s_2}, …, v_{s_m} of the segmented sentences, where s denotes a sentence and 1, 2, …, m number the segmented sentences.
The formulas for these conditional probabilities may be any existing feasible calculation formula, for example the formula given in Embodiment 1.
Step 230: measure the relevance of each pair of sentences by the number of words the two sentences share in the document to be summarized. The specific calculation is given in formula (2):
    Similarity(s_i, s_j) = |{ω_k : ω_k ∈ s_i and ω_k ∈ s_j}| / (log|s_i| + log|s_j|)        (2)

where s_i, s_j are sentences with different numbers, ω_k is a word belonging to a sentence, and |s_i| is the number of words in sentence s_i. The higher the value of the result, the higher the relevance of the two sentences.
Step 240: compute the semantic similarity of each pair of sentences from the topic vectors of the segmented sentences, and compute the semantic similarity between the document to be summarized and each sentence from the topic vector of the document and the topic vectors of the segmented sentences.
The semantic similarity of two sentences is calculated as in formula (3):

    TopicSim(s_i, s_j) = (v_{s_i} · v_{s_j}) / (‖v_{s_i}‖ ‖v_{s_j}‖)        (3)

where v_{s_i} and v_{s_j} denote the topic vectors of sentences s_i and s_j; the higher the value of the result, the higher the semantic relevance of the two sentences.
The semantic similarity between the document d to be summarized and each sentence is calculated analogously:

    TopicSim(s_i, d) = (v_{s_i} · v_d) / (‖v_{s_i}‖ ‖v_d‖)
it is understood that the steps of Step230 and Step240 may be interchanged.
Step 250: compute the score of each sentence using the inter-sentence relevance, the inter-sentence semantic similarity, and the semantic similarity between the sentences and the document to be summarized, iterating the score computation until the result converges.
The score of each sentence is calculated as follows:

    TR(s_i) = (1−e)·TopicSim(s_i, d) + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

where TR(s_i) denotes the score of sentence s_i; s_i, s_j, s_k denote the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all the sentences linking to s_i; DOC denotes the sentence set of the document to be summarized; and α, e, β are calculation parameters, with α, β ∈ [0,1] and e ∈ [0,1]. The invention jointly considers the influence of the sentences' word co-occurrence rate and semantic relevance on sentence scoring. When correcting the scoring result it does not use the fixed value (1−e) directly as the random probability that sentence s_i is chosen as a key sentence; instead it uses a random probability tied to semantics, multiplying (1−e) by the topical relevance TopicSim(s_i, d) to express the random probability that the sentence is chosen as a key summary sentence, thereby fully accounting for the influence of semantics on sentence scoring.
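The Embodiment 2 variant changes only the correction term, replacing (1−e)/n with (1−e)·TopicSim(s_i, d); a sketch reusing the helpers above, where doc_vec stands for the document topic vector v_d:

    def score_sentences_doc(sentences, topic_vecs, doc_vec,
                            alpha=0.5, e=0.5, beta=0.85,
                            tol=1e-6, max_iter=100):
        """Embodiment 2 scoring: semantics-dependent correction term."""
        m = len(sentences)
        w = [[0.0 if i == j else
              alpha * relevance(sentences[i], sentences[j]) +
              beta * topic_sim(topic_vecs[i], topic_vecs[j])
              for j in range(m)] for i in range(m)]
        out_sum = [sum(row) for row in w]
        doc_sim = [topic_sim(v, doc_vec) for v in topic_vecs]  # TopicSim(s_i, d)
        tr = [1.0 / m] * m
        for _ in range(max_iter):
            new = [(1 - e) * doc_sim[i] +
                   e * sum(w[j][i] / out_sum[j] * tr[j]
                           for j in range(m) if j != i and out_sum[j] > 0)
                   for i in range(m)]
            if max(abs(a - b) for a, b in zip(new, tr)) < tol:
                return new
            tr = new
        return tr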
Step 260: select the sentences whose scores meet the threshold, add the preset connecting words, and output them in the chosen order to obtain the summary content.
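A minimal sketch of this selection step; the threshold value and the connective are assumptions (Embodiment 3 uses Φ = 0.1), and sentences are kept in their original order:

    def build_summary(raw_sentences, scores, phi=0.1, connective="，"):
        """Keep the sentences whose score exceeds phi; join with a connective."""
        picked = [s for s, t in zip(raw_sentences, scores) if t > phi]
        return connective.join(picked)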
The invention computes the score of each sentence from inter-sentence relevance and similarity, jointly considering the word co-occurrence rate and the semantic relevance of sentences, which improves the accuracy of sentence scoring. The proposed summarization method is computationally convenient, highly general, and suitable for summary extraction of text content in many fields.
Example 3
The text automatic summarization method of the invention can be applied to Chinese, English and other languages. In some embodiments, as shown in fig. 3, a method for automatically summarizing a Chinese text includes the following steps:
Step 1: count the "topic–word" co-occurrence matrix over a massive topic-labeled text corpus. The specific steps are as follows:
Step 1.1: preprocess and initialize the massive corpus; define the kth document as d_k, and denote the ith word in d_k by ω_i.
Step 1.2: compute the conditional probability of the words and topics in the documents, with the specific formula:

    p(t = r | d_k, ω_i) ∝ p(t = r | d_k) · p(ω_i | t = r)

where p denotes probability, ω is a word in document d, t is a topic of the document, n denotes the number of topics, d_k denotes the kth document, and ω_i denotes the ith word in d_k; p(t = r | d_k) denotes the association probability of the document with the topic, and p(ω_i | t = r) denotes the association probability of the word with the topic, estimated as

    p(t = r | d_k) = (n_{k,r} + a_r) / Σ_{r'=1..n} (n_{k,r'} + a_{r'})
    p(ω_i | t = r) = (n_{r,ω_i} + b_{ω_i}) / Σ_{ω'} (n_{r,ω'} + b_{ω'})

where a_r is a parameter associated with topic r and b_ω is a parameter associated with the word ω; both are generally set to fixed prior values.
Step 1.3: repeat Step 1.2 until the calculation result converges, yielding the final "topic–word" co-occurrence matrix M.
Step 2: segment the document d to be summarized at the sentence-ending symbols, i.e., d = {s_1, s_2, …, s_m}, where 1, 2, …, m number the segmented sentences. Sentence-ending symbols are generally "。", "！", "？" and the like.
For example, suppose the document d to be summarized is the following paragraph:
"Futures news, the evening of May 9: XXX issued several opinions on further promoting the healthy development of the capital market, and the new 'National Nine Articles' officially landed. The National Nine Articles emphasize relaxing business admission. They implement an open, transparent and orderly securities and futures business licence management system, study cross-holding of licences by securities companies, fund management companies, futures companies, securities investment consulting companies and the like, and support other qualified financial institutions in applying for securities and futures business licences on the basis of risk isolation. The futures news outlet connected with Wang Hongying (blog, microblog), vice general manager of International Futures, to interpret the influence of the National Nine Articles on the futures industry. Wang Hongying believes the National Nine Articles will have a positive influence on the futures industry: powerful futures companies will develop further toward international business, and he believes more and more large financial institutions will take stakes in futures companies, which will have a positive influence on the capital strength and business strength of futures companies. In addition, futures companies need to operate in differentiated ways, and professional futures businesses such as hedging will become more and more important, which will undoubtedly influence the future development of futures companies."
The segmented sentences are:
s1: Futures news, the evening of May 9: XXX issued several opinions on further promoting the healthy development of the capital market, and the new "National Nine Articles" officially landed.
s2: The National Nine Articles emphasize relaxing business admission.
s3: They implement an open, transparent and orderly securities and futures business licence management system, study cross-holding of licences by securities companies, fund management companies, futures companies, securities investment consulting companies and the like, and support other qualified financial institutions in applying for securities and futures business licences on the basis of risk isolation.
s4: The futures news outlet connected with Wang Hongying (blog, microblog), vice general manager of International Futures, to interpret the influence of the National Nine Articles on the futures industry.
s5: Wang Hongying believes the National Nine Articles will have a positive influence on the futures industry: powerful futures companies will develop further toward international business, and he believes more and more large financial institutions will take stakes in futures companies, which will have a positive influence on the capital strength and business strength of futures companies.
s6: In addition, futures companies need to operate in differentiated ways, and professional futures businesses such as hedging will become more and more important, which will undoubtedly influence the future development of futures companies.
Step 3: from the co-occurrence matrix M, compute the topic vector of document d and the topic vectors of the segmented sentences s_1, s_2, …, s_m. The specific steps are as follows:
3.1. Compute the conditional probability of the words and topics in document d according to the formula in Step 1.2.
3.2. Repeat Step 3.1 until the calculation result converges, yielding the topic vector v_d of document d.
3.3. Compute the conditional probability of the words and topics in each segmented sentence s_1, s_2, …, s_m according to the formula in Step 1.2.
3.4. Repeat Step 3.3 until the calculation result converges, yielding the topic vectors of the segmented sentences, v_{s_1}, v_{s_2}, …, v_{s_m}.
Step 4: measure the relevance of sentences by the number of words two sentences have in common, with the calculation formula:

    Similarity(s_i, s_j) = |{ω_k : ω_k ∈ s_i and ω_k ∈ s_j}| / (log|s_i| + log|s_j|)

where s_i, s_j are sentences with different numbers, ω_k is a word belonging to a sentence, and |s_i| is the number of words in sentence s_i. The higher the value of the result, the higher the relevance of the two sentences.
In the present embodiment, the relevance between each pair of sentences is shown in Table 1.
TABLE 1

          S1      S2      S3      S4      S5      S6
    S1    0.5     0       0.15    0.25    0.26    0.21
    S2    0       0.5     0       0       0.12    0
    S3    0.15    0       0.5     0.1     0.23    0.22
    S4    0.25    0       0.1     0.5     0.28    0.17
    S5    0.26    0.12    0.23    0.28    0.5     0.3
    S6    0.21    0       0.22    0.17    0.30    0.5
Step 5: compute the semantic similarity of each pair of sentences using the topic vectors of the segmented sentences. The calculation formula is as follows:
    TopicSim(s_i, s_j) = (v_{s_i} · v_{s_j}) / (‖v_{s_i}‖ ‖v_{s_j}‖)

where v_{s_i} and v_{s_j} denote the topic vectors of sentences s_i and s_j. The higher the value of the result, the higher the semantic relevance of the two sentences.
Table 2 shows, for this embodiment, the semantic similarity between each pair of sentences and the similarity between each sentence and the document to be summarized (denoted Doc).
TABLE 2 (given as an image in the original publication)
Step 6: score each sentence of the text content to be summarized, iterating the score computation until the result converges. The scoring formula adopted in this embodiment is as follows:
    TR(s_i) = (1−e)·TopicSim(s_i, d) + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

where TR(s_i) denotes the score of sentence s_i; s_i, s_j, s_k denote the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all the sentences linking to s_i; DOC denotes the sentence set of the document to be summarized; and α, e, β are calculation parameters, with α, β ∈ [0,1] and e ∈ [0,1]. The invention jointly considers the word co-occurrence rate and semantic relevance of sentences to improve the accuracy of sentence scoring, multiplying (1−e) by the topical relevance TopicSim(s_i, d) to express the random probability that a sentence is chosen as a key summary sentence. In this embodiment α, e, β are 0.5, 0.5, 0.85 respectively, and the computed score of each sentence is:
{S1: 0.15868276207017054, S2: 0.12009075569229676, S3: 0.18508810964258574, S4: 0.15716454363543905, S5: 0.21333489450057458, S6: 0.165638934458933}
step 7, selecting TR(s)n)>The sentence of phi is that,phi is a fixed threshold value, then adding connecting words giving rules, and outputting the summary content according to the appearance sequence of the original text.
If Φ is equal to 0.1, the summary content finally output for the document d to be summarized is:
["Wang Hongying believes the National Nine Articles will have a positive influence on the futures industry: powerful futures companies will develop further toward international business, and he believes more and more large financial institutions will take stakes in futures companies, which will have a positive influence on the capital strength and business strength of futures companies", "They implement an open, transparent and orderly securities and futures business licence management system, study cross-holding of licences by securities companies, fund management companies, futures companies, securities investment consulting companies and the like, and support other qualified financial institutions in applying for securities and futures business licences on the basis of risk isolation", "In addition, futures companies need to operate in differentiated ways, and professional futures businesses such as hedging will become more and more important, which will undoubtedly influence the future development of futures companies", "Futures news, the evening of May 9: XXX issued several opinions on further promoting the healthy development of the capital market, and the new 'National Nine Articles' officially landed", "The futures news outlet connected with Wang Hongying (blog, microblog), vice general manager of International Futures, to interpret the influence of the National Nine Articles on the futures industry", "The National Nine Articles emphasize relaxing business admission"]
Example 4
Correspondingly, the present invention also discloses a text automatic summarization device, in some embodiments, as shown in fig. 4, the device includes:
and the segmentation module 11 is used for segmenting the document to be summarized according to predefined sentence ending symbols.
For example, after segmentation the document to be summarized is denoted d = {s_1, s_2, …, s_m}, where 1, 2, …, m number the segmented sentences. Sentence-ending symbols are generally "。", "！", "？" and the like.
The first calculating module 12 is configured to calculate a topic vector of each segmented sentence according to an existing text corpus.
The first calculation module 12 further builds the corpus statistics with an LDA (Latent Dirichlet Allocation) topic model algorithm; that is, the "topic–word" co-occurrence matrix is counted over a number of topic-labeled text corpora according to formula (1) of Embodiment 1, where p denotes probability, ω is a word in document d, t is a topic of the document, n denotes the number of topics, d_k denotes the kth document, and ω_i denotes the ith word in d_k.
The first calculation module 12 computes the topic vector of each segmented sentence as follows: calculate the conditional probability of the words and topics in each segmented sentence, and repeat this conditional-probability step until the calculation result converges, yielding the topic vectors v_{s_1}, v_{s_2}, …, v_{s_m} of the segmented sentences. Any existing feasible calculation formula may be chosen, for example the formula in Embodiment 1.
And the second calculating module 13 is configured to measure the relevance between two sentences by the number of terms commonly occurring between two sentences in the document to be summarized.
The specific calculation is given in formula (2):

    Similarity(s_i, s_j) = |{ω_k : ω_k ∈ s_i and ω_k ∈ s_j}| / (log|s_i| + log|s_j|)        (2)

where s_i, s_j are sentences with different numbers, ω_k is a word belonging to a sentence, and |s_i| is the number of words in sentence s_i. The higher the value of the result, the higher the relevance of the two sentences.
And the third calculating module 14 is configured to calculate semantic similarity between every two sentences according to the topic vector of each segmented sentence.
The specific calculation is given in formula (3):

    TopicSim(s_i, s_j) = (v_{s_i} · v_{s_j}) / (‖v_{s_i}‖ ‖v_{s_j}‖)        (3)

where v_{s_i} and v_{s_j} denote the topic vectors of sentences s_i and s_j. The higher the value of the result, the higher the semantic relevance of the two sentences.
And the scoring module 15 is configured to calculate a score of each sentence according to the relevance and similarity between the sentences.
The score of each sentence is calculated as follows:

    TR(s_i) = (1−e)/n + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

where TR(s_i) denotes the score of sentence s_i; s_i, s_j, s_k denote the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all the sentences linking to s_i; n denotes the number of topics; DOC denotes the sentence set of the document to be summarized; and α, e, β are calculation parameters, with α, β ∈ [0,1] and e ∈ [0,1]. The term (1−e)/n is a correction amount, taken as the random probability that sentence s_i is chosen as a key sentence. The method jointly considers the influence of the sentences' word co-occurrence rate and semantic relevance on sentence scoring, so as to improve the accuracy of sentence scoring.
The summary output module 16 is used to select the sentences whose scores meet the threshold, add the preset connecting words, and output them in the chosen order to obtain the summary content.
Example 5
Compared with Embodiment 4, in other embodiments the first calculation module 12 is further configured to compute the topic vector of the document to be summarized from the existing text corpus. The method is: calculate the conditional probability of the words and topics in the document, and repeat this conditional-probability step until the calculation result converges, yielding the topic vector v_d of the document to be summarized.
The third calculation module 14 is further configured to compute the semantic similarity between the document to be summarized and each sentence. The scoring module 15 is further configured to correct the scoring result with the similarity between each sentence and the document to be summarized. The final scoring formula is as follows:

    TR(s_i) = (1−e)·TopicSim(s_i, Title) + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

where TR(s_i) denotes the score of sentence s_i; s_i, s_j, s_k denote the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all the sentences linking to s_i; DOC denotes the sentence set of the document to be summarized; α, e, β are calculation parameters, with α, β ∈ [0,1] and e ∈ [0,1]; and TopicSim(s_i, Title) denotes the semantic similarity between the document to be summarized and sentence s_i, computed from the topic vector v_{s_i} of sentence s_i and the topic vector v_{Title} of the document to be summarized as

    TopicSim(s_i, Title) = (v_{s_i} · v_{Title}) / (‖v_{s_i}‖ ‖v_{Title}‖)

This embodiment jointly considers the influence of the sentences' word co-occurrence rate and semantic relevance on sentence scoring. When correcting the scoring result it does not use the fixed value (1−e) directly as the random probability that sentence s_i is chosen as a key sentence; instead it uses a random probability tied to semantics, multiplying (1−e) by the topical relevance TopicSim(s_i, Title) to express the random probability that the sentence is chosen as a key summary sentence, thereby fully accounting for the influence of semantics on sentence scoring.
Example 6
Corresponding to the method embodiments, an embodiment of the invention also provides an electronic device. Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the invention, the electronic device comprising a processor 510, a communication interface 520, a memory 530, and a communication bus 540, wherein:
the processor 510, the communication interface 520 and the memory 530 communicate with one another via the communication bus 540, and the memory 530 is used for storing a computer program;
the processor 510 is configured to implement the text automatic summarization method provided by the invention when executing the program stored in the memory 530. Specifically, the text automatic summarization method includes:
segmenting the document to be summarized at predefined sentence-ending symbols;
calculating a topic vector for each segmented sentence from an existing text corpus;
determining the relevance of each pair of sentences from the number of words the two sentences have in common;
calculating the semantic similarity of each pair of sentences from their topic vectors;
calculating the score of each sentence using the inter-sentence relevance and semantic similarity;
and selecting the sentences whose scores meet the threshold, adding the preset connecting words, and outputting them in the chosen order to obtain the summary content.
Of course, to further improve the accuracy of the summary, the text automatic summarization method may also include:
segmenting the document to be summarized at predefined sentence-ending symbols;
computing, from the existing text corpus, the topic vector of the document to be summarized and the topic vectors of the segmented sentences;
measuring the relevance of each pair of sentences by the number of words the two sentences share in the document to be summarized;
computing the semantic similarity of each pair of sentences from the topic vectors of the segmented sentences, and computing the semantic similarity between the document to be summarized and each sentence from the document's topic vector and the sentence topic vectors;
computing the score of each sentence from the inter-sentence relevance, the inter-sentence semantic similarity, and the semantic similarity between the sentences and the document to be summarized;
and selecting the sentences whose scores meet the threshold, adding the preset connecting words, and outputting them in the chosen order to obtain the summary content.
These text automatic summarization methods are implemented in the same manner as in the foregoing method embodiments, and the details are not repeated here.
The communication bus mentioned for the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration only one thick line is shown in the figure, but this does not mean there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. The embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the embodiments of the apparatus and the electronic device, since they are substantially similar to the embodiments of the method, the description is simple, and the relevant points can be referred to only in the partial description of the embodiments of the method.
It will be understood by those skilled in the art that the foregoing is only an exemplary embodiment of the present invention, and is not intended to limit the invention to the particular forms disclosed, since various modifications, substitutions and improvements within the spirit and scope of the invention are possible and within the scope of the appended claims.

Claims (9)

1. A text automatic summarization method is characterized by comprising the following steps:
dividing the document to be summarized according to predefined sentence ending symbols;
calculating a topic vector of each sentence after segmentation according to an existing text corpus;
determining the relevance of each pair of sentences according to the number of words the two sentences have in common;
calculating the semantic similarity of each pair of sentences according to the topic vector of each sentence;
the score of each sentence being calculated using the relevance and the semantic similarity by the following formula:

    TR(s_i) = (1−e)/n + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

wherein TR(s_i) represents the score of sentence s_i; α, e, β are preset calculation parameters; s_i, s_j, s_k respectively represent the sentences numbered i, j, k; Similarity(s_i, s_j) represents the relevance of sentences s_i and s_j; TopicSim(s_i, s_j) represents the semantic similarity of sentences s_i and s_j; OUT(j) represents the sentences other than s_j; IN(i) represents all the sentences linking to s_i; and n represents the number of topics;
and selecting sentences the scores of which meet the threshold, adding preset connecting words, and outputting according to the selected output sequence to obtain the abstract content.
2. The method for automatically summarizing text according to claim 1, further comprising:
calculating a topic vector of the document to be summarized according to the existing text corpus, and calculating the semantic similarity between the document to be summarized and each sentence using the topic vector of the document to be summarized and the topic vectors of the segmented sentences;
wherein calculating the score of each sentence further comprises: using the semantic similarity between the document to be summarized and each sentence as a correction amount.
3. The method for automatically summarizing text according to claim 1, wherein said calculating a topic vector for each sentence after segmentation comprises:
and calculating the conditional probability of the words and the topics in each sentence after segmentation according to a preset formula, and repeating the conditional probability step until the calculation result is converged to obtain the topic vector of each sentence after segmentation.
4. The method for automatically summarizing text according to claim 2, wherein the method for calculating the topic vector of the document to be summarized comprises:
and calculating the conditional probability of the words and the topics in the document to be summarized according to a preset formula, and repeating the step of the conditional probability until the calculation result is converged to obtain the topic vector of the document to be summarized.
5. The method for automatically summarizing text according to claim 1 or 2, wherein the relevance of two sentences, determined from the number of words co-occurring between the two sentences in the document to be summarized, is calculated using the following formula:

    Similarity(s_i, s_j) = |{ω_k : ω_k ∈ s_i and ω_k ∈ s_j}| / (log|s_i| + log|s_j|)

wherein s_i, s_j are sentences with different numbers, ω_k is a word belonging to a sentence, |s_i| is the number of words in sentence s_i, and |s_j| is the number of words in sentence s_j; the higher the value of the calculation result, the higher the relevance of the sentences.
6. The method for automatically summarizing text according to claim 5, wherein the semantic similarity of two sentences, calculated from the topic vectors of the segmented sentences, uses the following formula:

    TopicSim(s_i, s_j) = (v_{s_i} · v_{s_j}) / (‖v_{s_i}‖ ‖v_{s_j}‖)

wherein v_{s_i} and v_{s_j} represent the topic vectors of sentences s_i and s_j; the higher the value of the calculation result, the higher the semantic relevance of the sentences.
7. The method for automatically summarizing text according to claim 6, wherein the score of each sentence is calculated using the following formula:

    TR(s_i) = (1−e)·TopicSim(s_i, Title) + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

wherein TR(s_i) represents the score of sentence s_i; α, e, β are preset calculation parameters; s_i, s_j, s_k respectively represent the sentences numbered i, j, k; OUT(j) represents the sentences other than s_j; IN(i) represents all the sentences linking to s_i; DOC represents the sentence set of the document to be summarized; and TopicSim(s_i, Title) represents the semantic similarity between the document to be summarized and sentence s_i, computed from the topic vector v_{s_i} of sentence s_i and the topic vector v_{Title} of the document to be summarized as

    TopicSim(s_i, Title) = (v_{s_i} · v_{Title}) / (‖v_{s_i}‖ ‖v_{Title}‖).
8. An apparatus for automatically summarizing text, the apparatus comprising:
the segmentation module is used for segmenting the document to be summarized according to predefined sentence ending symbols;
the first calculation module is used for calculating the topic vector of each sentence after segmentation according to the existing text corpus;
the second calculation module is used for determining the relevance of each pair of sentences according to the number of words the two sentences have in common;
the third calculation module is used for calculating the semantic similarity of each pair of sentences according to the topic vector of each sentence;
the scoring module is used for calculating the score of each sentence using the relevance and the semantic similarity, by the following formula:

    TR(s_i) = (1−e)/n + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

wherein TR(s_i) represents the score of sentence s_i; α, e, β are preset calculation parameters; s_i, s_j, s_k respectively represent the sentences numbered i, j, k; Similarity(s_i, s_j) represents the relevance of sentences s_i and s_j; TopicSim(s_i, s_j) represents the semantic similarity of sentences s_i and s_j; OUT(j) represents the sentences other than s_j; IN(i) represents all the sentences linking to s_i; and n represents the number of topics;
and the summary output module is used for selecting the sentences whose scores meet the threshold, adding the preset connecting words, and outputting them in the chosen order to obtain the summary content.
9. An electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein:
the processor, the communication interface and the memory complete mutual communication through a communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 7 when executing a program stored in the memory.
CN201810787848.1A 2018-07-18 2018-07-18 Text automatic summarization method and device and electronic equipment Active CN109101489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810787848.1A CN109101489B (en) 2018-07-18 2018-07-18 Text automatic summarization method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810787848.1A CN109101489B (en) 2018-07-18 2018-07-18 Text automatic summarization method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN109101489A CN109101489A (en) 2018-12-28
CN109101489B true CN109101489B (en) 2022-05-20

Family

ID=64846627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810787848.1A Active CN109101489B (en) 2018-07-18 2018-07-18 Text automatic summarization method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN109101489B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162618B (en) * 2019-02-22 2021-09-17 北京捷风数据技术有限公司 Text summary generation method and device of non-contrast corpus
CN110134942B (en) * 2019-04-01 2020-10-23 北京中科闻歌科技股份有限公司 Text hotspot extraction method and device
CN110162778B (en) * 2019-04-02 2023-05-26 创新先进技术有限公司 Text abstract generation method and device
CN112699657A (en) * 2020-12-30 2021-04-23 广东德诚大数据科技有限公司 Abnormal text detection method and device, electronic equipment and storage medium
CN112711662A (en) * 2021-03-29 2021-04-27 贝壳找房(北京)科技有限公司 Text acquisition method and device, readable storage medium and electronic equipment
CN113204637B (en) * 2021-04-13 2022-09-27 北京三快在线科技有限公司 Text processing method and device, storage medium and electronic equipment
CN113407710A (en) * 2021-06-07 2021-09-17 维沃移动通信有限公司 Information display method and device, electronic equipment and readable storage medium
CN114201600A (en) * 2021-12-10 2022-03-18 北京金堤科技有限公司 Public opinion text abstract extraction method, device, equipment and computer storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620596A (en) * 2008-06-30 2010-01-06 东北大学 Multi-document auto-abstracting method facing to inquiry
CN102945228A (en) * 2012-10-29 2013-02-27 广西工学院 Multi-document summarization method based on text segmentation
CN103136359A (en) * 2013-03-07 2013-06-05 宁波成电泰克电子信息技术发展有限公司 Generation method of single document summaries
CN106294863A (en) * 2016-08-23 2017-01-04 电子科技大学 A kind of abstract method for mass text fast understanding
CN106445920A (en) * 2016-09-29 2017-02-22 北京理工大学 Sentence similarity calculation method based on sentence meaning structure characteristics
CN107491434A (en) * 2017-08-10 2017-12-19 北京邮电大学 Text snippet automatic generation method and device based on semantic dependency
CN107729300A (en) * 2017-09-18 2018-02-23 百度在线网络技术(北京)有限公司 Processing method, device, equipment and the computer-readable storage medium of text similarity

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Automatic summarization based on topic-word weights and sentence features (基于主题词权重和句子特征的自动文摘); Jiang Changjin et al.; Journal of South China University of Technology (Natural Science Edition) (《华南理工大学学报(自然科学版)》), No. 07, 2010-07-15; full text *
Automatic text summarization based on comprehensive sentence features (基于综合的句子特征的文本自动摘要); Cheng Yuan et al.; Computer Science (《计算机科学》), No. 04, 2015-04-15; full text *

Also Published As

Publication number Publication date
CN109101489A (en) 2018-12-28

Similar Documents

Publication Publication Date Title
CN109101489B (en) Text automatic summarization method and device and electronic equipment
Smetanin et al. Deep transfer learning baselines for sentiment analysis in Russian
JP5936698B2 (en) Word semantic relation extraction device
Kumar et al. Mastering text mining with R
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
Lind et al. Building the bridge: Topic modeling for comparative research
Takala et al. Gold-standard for Topic-specific Sentiment Analysis of Economic Texts.
Khattak et al. A survey on sentiment analysis in Urdu: A resource-poor language
Kolesnikova Survey of word co-occurrence measures for collocation detection
Nguyen et al. Statistical approach for figurative sentiment analysis on social networking services: a case study on twitter
CN110442872A (en) A kind of text elements integrality checking method and device
Ferreira et al. A new sentence similarity assessment measure based on a three-layer sentence representation
Hu et al. Self-supervised synonym extraction from the web.
Woltmann et al. Tracing university–industry knowledge transfer through a text mining approach
Ingólfsdóttir et al. Named entity recognition for icelandic: Annotated corpus and models
Petrović et al. The influence of text preprocessing methods and tools on calculating text similarity
Sarkar A hidden markov model based system for entity extraction from social media english text at fire 2015
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
WO2023198696A1 (en) Method for extracting information from an unstructured data source
Ningtyas et al. The Influence of Negation Handling on Sentiment Analysis in Bahasa Indonesia
Omurca et al. An annotated corpus for Turkish sentiment analysis at sentence level
Wang et al. Unsupervised opinion phrase extraction and rating in Chinese blog posts
Zhang et al. Sentiment identification by incorporating syntax, semantics and context information
Zhang et al. Extracting Product Features and Sentiments from Chinese Customer Reviews.
CN111178038B (en) Document similarity recognition method and device based on latent semantic analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230410

Address after: 430074 Room 01, Floor 6, Building A4, Financial Port, 77 Guanggu Avenue, Donghu New Technology Development Zone, Wuhan, Hubei Province

Patentee after: WUHAN SHUBO TECHNOLOGY Co.,Ltd.

Patentee after: WUHAN University

Address before: 430072 Fenghuo innovation Valley, No. 88, YouKeYuan Road, Hongshan District, Wuhan City, Hubei Province

Patentee before: WUHAN SHUBO TECHNOLOGY Co.,Ltd.
