CN109101489B - Text automatic summarization method and device and electronic equipment - Google Patents


Info

Publication number
CN109101489B
CN109101489B · Application CN201810787848.1A
Authority
CN
China
Prior art keywords: sentence, sentences, calculating, document, summarized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810787848.1A
Other languages
Chinese (zh)
Other versions
CN109101489A (en)
Inventor
文卫东
刘健博
王忠璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Shubo Technology Co ltd
Wuhan University WHU
Original Assignee
Wuhan Shubo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Shubo Technology Co ltd filed Critical Wuhan Shubo Technology Co ltd
Priority to CN201810787848.1A priority Critical patent/CN109101489B/en
Publication of CN109101489A publication Critical patent/CN109101489A/en
Application granted granted Critical
Publication of CN109101489B publication Critical patent/CN109101489B/en
Legal status: Active (current)

Classifications

    • G06F40/00 Handling natural language data (G PHYSICS; G06 COMPUTING; CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING)
    • G06F40/20 Natural language analysis; G06F40/258 Heading extraction; Automatic titling; Numbering
    • G06F40/205 Parsing; G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic text summarization method comprising the steps of: segmenting the document to be summarized at predefined sentence-ending symbols; computing a topic vector for each segmented sentence from an existing text corpus; determining the relevance of each pair of sentences from the number of words the two sentences have in common; computing the semantic similarity of each pair of sentences from their topic vectors; scoring each sentence using the inter-sentence relevance and semantic similarity; and selecting the sentences whose scores meet a threshold, adding preset connecting words, and outputting them in the chosen order to obtain the summary content. By scoring each sentence with both inter-sentence relevance and similarity, the invention jointly accounts for the word co-occurrence rate and the semantic relatedness of sentences, improving the accuracy of sentence scoring. The proposed summarization method is computationally convenient and highly general. The invention also discloses an automatic text summarization apparatus and an electronic device.

Description

Text automatic summarization method and device and electronic equipment
Technical Field
The invention relates to the technical field of natural language understanding, in particular to a text automatic summarization method and device and electronic equipment.
Background
A summary conveys the central content of the original document, comprehensively and accurately, in a short and coherent text. With the explosion of information, the number of documents people must read before completing a task keeps growing, and so does the time spent reading. Automatic summarization can effectively shorten reading time and improve working efficiency in many fields, so it has broad application prospects.
Automatic summarization techniques fall into two categories according to the relationship between the original text and the summary: extractive and generative (abstractive). Extractive summarization selects key sentences from the clause set of the original text without modifying them and combines them into a summary; in essence it converts the summarization problem into a ranking problem, scoring each sentence and composing the summary of a document from its highest-scoring sentences. Generative summarization tries to understand the content of the document and condense its central ideas into refined sentences, which better matches the nature of a summary; seq2seq methods have made some progress on the short-text summarization problem, but on long texts the technical difficulty is high and the results are poor.
At present the widely used technology is still extractive summary generation, which generally measures the relevance of sentences by the words that compose them. In an actual document, however, both sentences with high lexical relevance and sentences with high semantic relevance may be key sentences, so it is unreasonable to consider only one of the two kinds of relevance.
Disclosure of Invention
In view of the above, there is a need for an automatic text summarization method and apparatus that overcome the defects of existing extraction-based methods and offer both broad applicability and high accuracy.
The invention provides the following.
a text automatic summarization method comprises the following steps:
segmenting the document to be summarized at predefined sentence-ending symbols;
calculating a topic vector for each segmented sentence from an existing text corpus;
determining the relevance of each pair of sentences from the number of words the two sentences have in common;
calculating the semantic similarity of each pair of sentences from their topic vectors;
calculating the score of each sentence using the inter-sentence relevance and semantic similarity;
and selecting the sentences whose scores meet a threshold, adding preset connecting words, and outputting them in the chosen order to obtain the summary content.
On the other hand, the invention also discloses an automatic text summarization apparatus, comprising:
a segmentation module, used to segment the document to be summarized at predefined sentence-ending symbols;
a first calculation module, used to calculate a topic vector for each segmented sentence from an existing text corpus;
a second calculation module, used to determine the relevance of each pair of sentences from the number of words the two sentences have in common;
a third calculation module, used to calculate the semantic similarity of each pair of sentences from their topic vectors;
a scoring module, used to calculate the score of each sentence using the inter-sentence relevance and semantic similarity;
and a summary output module, used to select the sentences whose scores meet a threshold, add preset connecting words, and output them in the chosen order to obtain the summary content.
The invention also provides an electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus; the memory is used for storing a computer program; and the processor is used for implementing the steps of the above method when executing the program stored in the memory.
Compared with the prior art, the invention has the beneficial effects that:
the invention calculates the score of each sentence by utilizing the relevance and the similarity among the sentences, comprehensively considers the collinearity rate of the sentence words and the semantic relevance, and improves the accuracy of sentence scoring. The summarization method provided by the invention is convenient to calculate and strong in universality.
Drawings
FIG. 1 is a flow diagram of a method for automatically summarizing text, in some embodiments.
FIG. 2 is a flow diagram of a method for automatically summarizing text, in further embodiments.
Fig. 3 is a flow chart of the text automatic summarization method for the case of Chinese text.
Fig. 4 is a schematic diagram of an apparatus for automatically summarizing text in some embodiments.
Fig. 5 is a schematic diagram of an electronic device in some embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
In some embodiments, as shown in fig. 1, a method for automatically summarizing text includes the steps of:
and Step110, segmenting the document to be summarized according to predefined sentence ending symbols.
For example, after segmentation the document is denoted d = {s_1, s_2, …, s_m}, where 1, 2, …, m number the segmented sentences. In this embodiment the sentence-ending symbols are not particularly limited: they may be "。", "！" or "？", may also be "，" or "；", or may even be designated segmentation symbols.
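For orientation only, this segmentation step can be sketched in Python as follows; the symbol set and the sample text are illustrative assumptions, not taken from the patent:

    import re

    # Hypothetical set of sentence-ending symbols; the patent leaves the set open.
    SENTENCE_ENDINGS = "。！？!?"

    def split_sentences(document):
        """Split a document at the predefined ending symbols, keeping each
        symbol attached to its sentence."""
        parts = re.split("(?<=[%s])" % re.escape(SENTENCE_ENDINGS), document)
        return [p.strip() for p in parts if p.strip()]

    print(split_sentences("第一句。第二句！第三句？"))
    # ['第一句。', '第二句！', '第三句？']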
Step 120: compute a topic vector for each segmented sentence from an existing text corpus.
In some embodiments, an LDA (Latent Dirichlet Allocation) topic model algorithm may be used to build the corpus statistics; that is, the "topic–word" co-occurrence matrix is counted over a number of topic-labeled text corpora. The specific calculation is given in formula (1):

    p(t = r | d_k, ω_i) ∝ p(t = r | d_k) · p(ω_i | t = r)        (1)

where p denotes probability, ω is a word in document d, t is a topic of the document, n denotes the number of topics, d_k denotes the kth document, and ω_i denotes the ith word in d_k; p(t = r | d_k) denotes the association probability of the document with the topic, and p(ω_i | t = r) denotes the association probability of the word with the topic. These are estimated as

    p(t = r | d_k) = (n_{k,r} + a_r) / Σ_{r'=1..n} (n_{k,r'} + a_{r'})
    p(ω_i | t = r) = (n_{r,ω_i} + b_{ω_i}) / Σ_{ω'} (n_{r,ω'} + b_{ω'})

where n_{k,r} counts the words of d_k assigned to topic r and n_{r,ω} counts how often the word ω is assigned to topic r; a_r is a parameter associated with topic r, and b_ω is a parameter associated with the word ω; both are generally set to fixed prior values.
The topic vector of each segmented sentence is then computed as follows: calculate the conditional probability of the words and topics in each sentence according to the preset formula, and repeat this conditional-probability step until the calculation result converges, yielding the topic vectors of the segmented sentences

    v_{s_1}, v_{s_2}, …, v_{s_m},  where v_{s_i} = (p(t_1 | s_i), …, p(t_n | s_i)).

The conditional probability of the words and topics in the document to be summarized may be calculated with formula (1) above.
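As a sketch only (not from the patent text), sentence topic vectors can be obtained with the third-party gensim library; the training texts, the topic count n, and the tokenization below are illustrative assumptions:

    from gensim import corpora
    from gensim.models import LdaModel

    # Hypothetical tokenized corpus standing in for the topic-labeled corpora.
    train_texts = [["stock", "market", "fund", "reform"],
                   ["futures", "company", "business", "licence"]]
    dictionary = corpora.Dictionary(train_texts)
    bows = [dictionary.doc2bow(t) for t in train_texts]

    n_topics = 10  # n, the number of topics (assumed value)
    lda = LdaModel(corpus=bows, id2word=dictionary,
                   num_topics=n_topics, passes=10)

    def topic_vector(tokens):
        """Return the n-dimensional topic vector v_s of a tokenized sentence."""
        bow = dictionary.doc2bow(tokens)
        dist = lda.get_document_topics(bow, minimum_probability=0.0)
        return [float(p) for _, p in sorted(dist)]

    v_s = topic_vector(["futures", "market"])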
Step 130: determine the relevance of each pair of sentences from the number of words the two sentences have in common.
The specific calculation is given in formula (2):

    Similarity(s_i, s_j) = |{ω_k : ω_k ∈ s_i and ω_k ∈ s_j}| / (log|s_i| + log|s_j|)        (2)

where s_i, s_j are sentences with different numbers, ω_k is a word belonging to a sentence, |s_i| is the number of words in sentence s_i, and |s_j| is the number of words in sentence s_j. The higher the value of the result, the higher the relevance of the two sentences.
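A direct transcription of formula (2) as reconstructed above; sentences are passed as token lists, and the guard against a zero denominator (two one-word sentences) is an addition:

    import math

    def relevance(si, sj):
        """Formula (2): shared-word count normalized by log sentence lengths."""
        shared = len(set(si) & set(sj))
        denom = math.log(len(si)) + math.log(len(sj))
        return shared / denom if denom > 0 else 0.0

    print(relevance(["futures", "company", "business"],
                    ["futures", "market", "business"]))  # two shared words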
Step 140: compute the semantic similarity of each pair of sentences from their topic vectors.
The specific calculation is given in formula (3):

    TopicSim(s_i, s_j) = (v_{s_i} · v_{s_j}) / (‖v_{s_i}‖ ‖v_{s_j}‖)        (3)

where v_{s_i} and v_{s_j} denote the topic vectors of sentences s_i and s_j. The higher the value of the result, the higher the semantic relevance of the two sentences.
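Likewise, a transcription of formula (3) in the cosine form assumed above, operating on the topic vectors:

    import math

    def topic_sim(vi, vj):
        """Formula (3): cosine similarity of two topic vectors."""
        dot = sum(a * b for a, b in zip(vi, vj))
        ni = math.sqrt(sum(a * a for a in vi))
        nj = math.sqrt(sum(b * b for b in vj))
        return dot / (ni * nj) if ni > 0 and nj > 0 else 0.0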
It is understood that the steps of Step130 and Step140 may be interchanged.
Step 150: compute the score of each sentence from the inter-sentence relevance and semantic similarity, iterating the score computation until the result converges.
The score of each sentence is calculated as follows:

    TR(s_i) = (1−e)/n + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

where TR(s_i) denotes the score of sentence s_i; s_i, s_j, s_k denote the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all the sentences linking to s_i; n denotes the number of topics; and α, e, β are calculation parameters, with α, β ∈ [0,1] and e ∈ [0,1]. The term (1−e)/n is a correction amount, taken as the random probability that sentence s_i is chosen as a key sentence. The method jointly considers the influence of the sentences' word co-occurrence rate and semantic relevance on sentence scoring, so as to improve the accuracy of sentence scoring.
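For concreteness, a sketch of the Embodiment 1 iteration reusing relevance() and topic_sim() from the sketches above; the fully connected sentence graph, the uniform initialization, and the convergence tolerance are assumptions, and the parameter values α = 0.5, e = 0.5, β = 0.85 are borrowed from Embodiment 3:

    def score_sentences(sentences, topic_vecs, n_topics,
                        alpha=0.5, e=0.5, beta=0.85,
                        tol=1e-6, max_iter=100):
        """Iterate TR(s_i) = (1-e)/n + e * sum_j w_ji / out(j) * TR(s_j)."""
        m = len(sentences)
        # Edge weight: alpha * word-overlap relevance + beta * topic similarity.
        w = [[0.0 if i == j else
              alpha * relevance(sentences[i], sentences[j]) +
              beta * topic_sim(topic_vecs[i], topic_vecs[j])
              for j in range(m)] for i in range(m)]
        out_sum = [sum(row) for row in w]   # denominator over OUT(j)
        tr = [1.0 / m] * m                  # uniform start (assumption)
        for _ in range(max_iter):
            new = [(1 - e) / n_topics +
                   e * sum(w[j][i] / out_sum[j] * tr[j]
                           for j in range(m) if j != i and out_sum[j] > 0)
                   for i in range(m)]
            if max(abs(a - b) for a, b in zip(new, tr)) < tol:
                return new
            tr = new
        return tr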
Example 2
To further improve the accuracy of sentence scoring, the semantic similarity between each sentence and the document to be summarized can be used as a correction term when computing each sentence's score. As shown in fig. 2, the text automatic summarization method then includes the following steps:
step210, segmenting the document d to be summarized according to the predefined sentence ending symbol.
For example, after segmentation the document is denoted d = {s_1, s_2, …, s_m}, where 1, 2, …, m number the segmented sentences. In this embodiment the sentence-ending symbols are not particularly limited: they may be "。", "！" or "？", may also be "，" or "；", or may even be designated segmentation symbols.
Step 220: from the existing text corpus, compute the topic vector of the document to be summarized and the topic vectors of the segmented sentences s_1, s_2, …, s_m.
In some embodiments, an LDA (Latent Dirichlet Allocation) topic model algorithm may be used, as in Embodiment 1: the "topic–word" co-occurrence matrix is counted over a number of topic-labeled text corpora according to formula (1), where p denotes probability, the word ω is a word in document d, t is a topic of the document, n denotes the number of topics, d_k denotes the kth document, and ω_i denotes the ith word in d_k.
The topic vector of the document to be summarized and the topic vector of each segmented sentence are computed according to the preset formula as follows:
calculate the conditional probability of the words and topics in the document, and repeat this conditional-probability step until the calculation result converges, yielding the topic vector v_d of the document to be summarized;
calculate the conditional probability of the words and topics in each segmented sentence, and repeat this conditional-probability step until the calculation result converges, yielding the topic vectors v_{s_1}, v_{s_2}, …, v_{s_m} of the segmented sentences, where s denotes a sentence and 1, 2, …, m number the segmented sentences.
The formulas for these conditional probabilities may be any existing feasible calculation formula, for example the formula given in Embodiment 1.
Step 230: measure the relevance of each pair of sentences by the number of words the two sentences share in the document to be summarized. The specific calculation is given in formula (2):
    Similarity(s_i, s_j) = |{ω_k : ω_k ∈ s_i and ω_k ∈ s_j}| / (log|s_i| + log|s_j|)        (2)

where s_i, s_j are sentences with different numbers, ω_k is a word belonging to a sentence, and |s_i| is the number of words in sentence s_i. The higher the value of the result, the higher the relevance of the two sentences.
Step 240: compute the semantic similarity of each pair of sentences from the topic vectors of the segmented sentences, and compute the semantic similarity between the document to be summarized and each sentence from the topic vector of the document and the topic vectors of the segmented sentences.
The semantic similarity of two sentences is calculated as in formula (3):

    TopicSim(s_i, s_j) = (v_{s_i} · v_{s_j}) / (‖v_{s_i}‖ ‖v_{s_j}‖)        (3)

where v_{s_i} and v_{s_j} denote the topic vectors of sentences s_i and s_j; the higher the value of the result, the higher the semantic relevance of the two sentences.
The semantic similarity between the document d to be summarized and each sentence is calculated analogously:

    TopicSim(s_i, d) = (v_{s_i} · v_d) / (‖v_{s_i}‖ ‖v_d‖)
it is understood that the steps of Step230 and Step240 may be interchanged.
Step 250: compute the score of each sentence using the inter-sentence relevance, the inter-sentence semantic similarity, and the semantic similarity between the sentences and the document to be summarized, iterating the score computation until the result converges.
The score of each sentence is calculated as follows:

    TR(s_i) = (1−e)·TopicSim(s_i, d) + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

where TR(s_i) denotes the score of sentence s_i; s_i, s_j, s_k denote the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all the sentences linking to s_i; DOC denotes the sentence set of the document to be summarized; and α, e, β are calculation parameters, with α, β ∈ [0,1] and e ∈ [0,1]. The invention jointly considers the influence of the sentences' word co-occurrence rate and semantic relevance on sentence scoring. When correcting the scoring result it does not use the fixed value (1−e) directly as the random probability that sentence s_i is chosen as a key sentence; instead it uses a random probability tied to semantics, multiplying (1−e) by the topical relevance TopicSim(s_i, d) to express the random probability that the sentence is chosen as a key summary sentence, thereby fully accounting for the influence of semantics on sentence scoring.
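The Embodiment 2 variant changes only the correction term, replacing (1−e)/n with (1−e)·TopicSim(s_i, d); a sketch reusing the helpers above, where doc_vec stands for the document topic vector v_d:

    def score_sentences_doc(sentences, topic_vecs, doc_vec,
                            alpha=0.5, e=0.5, beta=0.85,
                            tol=1e-6, max_iter=100):
        """Embodiment 2 scoring: semantics-dependent correction term."""
        m = len(sentences)
        w = [[0.0 if i == j else
              alpha * relevance(sentences[i], sentences[j]) +
              beta * topic_sim(topic_vecs[i], topic_vecs[j])
              for j in range(m)] for i in range(m)]
        out_sum = [sum(row) for row in w]
        doc_sim = [topic_sim(v, doc_vec) for v in topic_vecs]  # TopicSim(s_i, d)
        tr = [1.0 / m] * m
        for _ in range(max_iter):
            new = [(1 - e) * doc_sim[i] +
                   e * sum(w[j][i] / out_sum[j] * tr[j]
                           for j in range(m) if j != i and out_sum[j] > 0)
                   for i in range(m)]
            if max(abs(a - b) for a, b in zip(new, tr)) < tol:
                return new
            tr = new
        return tr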
Step 260: select the sentences whose scores meet the threshold, add the preset connecting words, and output them in the chosen order to obtain the summary content.
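A minimal sketch of this selection step; the threshold value and the connective are assumptions (Embodiment 3 uses Φ = 0.1), and sentences are kept in their original order:

    def build_summary(raw_sentences, scores, phi=0.1, connective="，"):
        """Keep the sentences whose score exceeds phi; join with a connective."""
        picked = [s for s, t in zip(raw_sentences, scores) if t > phi]
        return connective.join(picked)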
The invention computes the score of each sentence from inter-sentence relevance and similarity, jointly considering the word co-occurrence rate and the semantic relevance of sentences, which improves the accuracy of sentence scoring. The proposed summarization method is computationally convenient, highly general, and suitable for summary extraction of text content in many fields.
Example 3
The text automatic summarization method of the invention can be applied to Chinese, English and other languages. In some embodiments, as shown in fig. 3, a method for automatically summarizing a Chinese text includes the following steps:
Step 1: count the "topic–word" co-occurrence matrix over a massive topic-labeled text corpus. The specific steps are as follows:
Step 1.1: preprocess and initialize the massive corpus; define the kth document as d_k, and denote the ith word in d_k by ω_i.
Step 1.2: compute the conditional probability of the words and topics in the documents, with the specific formula:

    p(t = r | d_k, ω_i) ∝ p(t = r | d_k) · p(ω_i | t = r)

where p denotes probability, ω is a word in document d, t is a topic of the document, n denotes the number of topics, d_k denotes the kth document, and ω_i denotes the ith word in d_k; p(t = r | d_k) denotes the association probability of the document with the topic, and p(ω_i | t = r) denotes the association probability of the word with the topic, estimated as

    p(t = r | d_k) = (n_{k,r} + a_r) / Σ_{r'=1..n} (n_{k,r'} + a_{r'})
    p(ω_i | t = r) = (n_{r,ω_i} + b_{ω_i}) / Σ_{ω'} (n_{r,ω'} + b_{ω'})

where a_r is a parameter associated with topic r and b_ω is a parameter associated with the word ω; both are generally set to fixed prior values.
Step 1.3: repeat Step 1.2 until the calculation result converges, yielding the final "topic–word" co-occurrence matrix M.
Step 2: segment the document d to be summarized at the sentence-ending symbols, i.e., d = {s_1, s_2, …, s_m}, where 1, 2, …, m number the segmented sentences. Sentence-ending symbols are generally "。", "！", "？" and the like.
For example, suppose the document d to be summarized is the following paragraph:
"Futures news, the evening of May 9: XXX issued several opinions on further promoting the healthy development of the capital market, and the new 'National Nine Articles' officially landed. The National Nine Articles emphasize relaxing business admission. They implement an open, transparent and orderly securities and futures business licence management system, study cross-holding of licences by securities companies, fund management companies, futures companies, securities investment consulting companies and the like, and support other qualified financial institutions in applying for securities and futures business licences on the basis of risk isolation. The futures news outlet connected with Wang Hongying (blog, microblog), vice general manager of International Futures, to interpret the influence of the National Nine Articles on the futures industry. Wang Hongying believes the National Nine Articles will have a positive influence on the futures industry: powerful futures companies will develop further toward international business, and he believes more and more large financial institutions will take stakes in futures companies, which will have a positive influence on the capital strength and business strength of futures companies. In addition, futures companies need to operate in differentiated ways, and professional futures businesses such as hedging will become more and more important, which will undoubtedly influence the future development of futures companies."
The segmented sentences are:
s1: Futures news, the evening of May 9: XXX issued several opinions on further promoting the healthy development of the capital market, and the new "National Nine Articles" officially landed.
s2: The National Nine Articles emphasize relaxing business admission.
s3: They implement an open, transparent and orderly securities and futures business licence management system, study cross-holding of licences by securities companies, fund management companies, futures companies, securities investment consulting companies and the like, and support other qualified financial institutions in applying for securities and futures business licences on the basis of risk isolation.
s4: The futures news outlet connected with Wang Hongying (blog, microblog), vice general manager of International Futures, to interpret the influence of the National Nine Articles on the futures industry.
s5: Wang Hongying believes the National Nine Articles will have a positive influence on the futures industry: powerful futures companies will develop further toward international business, and he believes more and more large financial institutions will take stakes in futures companies, which will have a positive influence on the capital strength and business strength of futures companies.
s6: In addition, futures companies need to operate in differentiated ways, and professional futures businesses such as hedging will become more and more important, which will undoubtedly influence the future development of futures companies.
Step 3: from the co-occurrence matrix M, compute the topic vector of document d and the topic vectors of the segmented sentences s_1, s_2, …, s_m. The specific steps are as follows:
3.1. Compute the conditional probability of the words and topics in document d according to the formula in Step 1.2.
3.2. Repeat Step 3.1 until the calculation result converges, yielding the topic vector v_d of document d.
3.3. Compute the conditional probability of the words and topics in each segmented sentence s_1, s_2, …, s_m according to the formula in Step 1.2.
3.4. Repeat Step 3.3 until the calculation result converges, yielding the topic vectors of the segmented sentences, v_{s_1}, v_{s_2}, …, v_{s_m}.
Step 4: measure the relevance of sentences by the number of words two sentences have in common, with the calculation formula:

    Similarity(s_i, s_j) = |{ω_k : ω_k ∈ s_i and ω_k ∈ s_j}| / (log|s_i| + log|s_j|)

where s_i, s_j are sentences with different numbers, ω_k is a word belonging to a sentence, and |s_i| is the number of words in sentence s_i. The higher the value of the result, the higher the relevance of the two sentences.
In the present embodiment, the relevance between each pair of sentences is shown in Table 1.
TABLE 1

          S1      S2      S3      S4      S5      S6
    S1    0.5     0       0.15    0.25    0.26    0.21
    S2    0       0.5     0       0       0.12    0
    S3    0.15    0       0.5     0.1     0.23    0.22
    S4    0.25    0       0.1     0.5     0.28    0.17
    S5    0.26    0.12    0.23    0.28    0.5     0.3
    S6    0.21    0       0.22    0.17    0.30    0.5
Step 5: compute the semantic similarity of each pair of sentences using the topic vectors of the segmented sentences. The calculation formula is as follows:
    TopicSim(s_i, s_j) = (v_{s_i} · v_{s_j}) / (‖v_{s_i}‖ ‖v_{s_j}‖)

where v_{s_i} and v_{s_j} denote the topic vectors of sentences s_i and s_j. The higher the value of the result, the higher the semantic relevance of the two sentences.
Table 2 shows, for this embodiment, the semantic similarity between each pair of sentences and the similarity between each sentence and the document to be summarized (denoted Doc).
TABLE 2 (given as an image in the original publication)
Step 6: score each sentence of the text content to be summarized, iterating the score computation until the result converges. The scoring formula adopted in this embodiment is as follows:
    TR(s_i) = (1−e)·TopicSim(s_i, d) + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

where TR(s_i) denotes the score of sentence s_i; s_i, s_j, s_k denote the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all the sentences linking to s_i; DOC denotes the sentence set of the document to be summarized; and α, e, β are calculation parameters, with α, β ∈ [0,1] and e ∈ [0,1]. The invention jointly considers the word co-occurrence rate and semantic relevance of sentences to improve the accuracy of sentence scoring, multiplying (1−e) by the topical relevance TopicSim(s_i, d) to express the random probability that a sentence is chosen as a key summary sentence. In this embodiment α, e, β are 0.5, 0.5, 0.85 respectively, and the computed score of each sentence is:
{S1: 0.15868276207017054, S2: 0.12009075569229676, S3: 0.18508810964258574, S4: 0.15716454363543905, S5: 0.21333489450057458, S6: 0.165638934458933}
step 7, selecting TR(s)n)>The sentence of phi is that,phi is a fixed threshold value, then adding connecting words giving rules, and outputting the summary content according to the appearance sequence of the original text.
If Φ is equal to 0.1, the summary content finally output for the document d to be summarized is:
["Wang Hongying believes the National Nine Articles will have a positive influence on the futures industry: powerful futures companies will develop further toward international business, and he believes more and more large financial institutions will take stakes in futures companies, which will have a positive influence on the capital strength and business strength of futures companies", "They implement an open, transparent and orderly securities and futures business licence management system, study cross-holding of licences by securities companies, fund management companies, futures companies, securities investment consulting companies and the like, and support other qualified financial institutions in applying for securities and futures business licences on the basis of risk isolation", "In addition, futures companies need to operate in differentiated ways, and professional futures businesses such as hedging will become more and more important, which will undoubtedly influence the future development of futures companies", "Futures news, the evening of May 9: XXX issued several opinions on further promoting the healthy development of the capital market, and the new 'National Nine Articles' officially landed", "The futures news outlet connected with Wang Hongying (blog, microblog), vice general manager of International Futures, to interpret the influence of the National Nine Articles on the futures industry", "The National Nine Articles emphasize relaxing business admission"]
Example 4
Correspondingly, the present invention also discloses a text automatic summarization device, in some embodiments, as shown in fig. 4, the device includes:
and the segmentation module 11 is used for segmenting the document to be summarized according to predefined sentence ending symbols.
For example, after segmentation the document to be summarized is denoted d = {s_1, s_2, …, s_m}, where 1, 2, …, m number the segmented sentences. Sentence-ending symbols are generally "。", "！", "？" and the like.
The first calculating module 12 is configured to calculate a topic vector of each segmented sentence according to an existing text corpus.
The first calculation module 12 further builds the corpus statistics with an LDA (Latent Dirichlet Allocation) topic model algorithm; that is, the "topic–word" co-occurrence matrix is counted over a number of topic-labeled text corpora according to formula (1) of Embodiment 1, where p denotes probability, ω is a word in document d, t is a topic of the document, n denotes the number of topics, d_k denotes the kth document, and ω_i denotes the ith word in d_k.
The first calculation module 12 computes the topic vector of each segmented sentence as follows: calculate the conditional probability of the words and topics in each segmented sentence, and repeat this conditional-probability step until the calculation result converges, yielding the topic vectors v_{s_1}, v_{s_2}, …, v_{s_m} of the segmented sentences. Any existing feasible calculation formula may be chosen, for example the formula in Embodiment 1.
And the second calculating module 13 is configured to measure the relevance between two sentences by the number of terms commonly occurring between two sentences in the document to be summarized.
The specific calculation is given in formula (2):

    Similarity(s_i, s_j) = |{ω_k : ω_k ∈ s_i and ω_k ∈ s_j}| / (log|s_i| + log|s_j|)        (2)

where s_i, s_j are sentences with different numbers, ω_k is a word belonging to a sentence, and |s_i| is the number of words in sentence s_i. The higher the value of the result, the higher the relevance of the two sentences.
And the third calculating module 14 is configured to calculate semantic similarity between every two sentences according to the topic vector of each segmented sentence.
The specific calculation is given in formula (3):

    TopicSim(s_i, s_j) = (v_{s_i} · v_{s_j}) / (‖v_{s_i}‖ ‖v_{s_j}‖)        (3)

where v_{s_i} and v_{s_j} denote the topic vectors of sentences s_i and s_j. The higher the value of the result, the higher the semantic relevance of the two sentences.
And the scoring module 15 is configured to calculate a score of each sentence according to the relevance and similarity between the sentences.
The score of each sentence is calculated as follows:

    TR(s_i) = (1−e)/n + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

where TR(s_i) denotes the score of sentence s_i; s_i, s_j, s_k denote the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all the sentences linking to s_i; n denotes the number of topics; DOC denotes the sentence set of the document to be summarized; and α, e, β are calculation parameters, with α, β ∈ [0,1] and e ∈ [0,1]. The term (1−e)/n is a correction amount, taken as the random probability that sentence s_i is chosen as a key sentence. The method jointly considers the influence of the sentences' word co-occurrence rate and semantic relevance on sentence scoring, so as to improve the accuracy of sentence scoring.
The summary output module 16 is used to select the sentences whose scores meet the threshold, add the preset connecting words, and output them in the chosen order to obtain the summary content.
Example 5
Compared with Embodiment 4, in other embodiments the first calculation module 12 is further configured to compute the topic vector of the document to be summarized from the existing text corpus. The method is: calculate the conditional probability of the words and topics in the document, and repeat this conditional-probability step until the calculation result converges, yielding the topic vector v_d of the document to be summarized.
The third calculation module 14 is further configured to compute the semantic similarity between the document to be summarized and each sentence. The scoring module 15 is further configured to correct the scoring result with the similarity between each sentence and the document to be summarized. The final scoring formula is as follows:

    TR(s_i) = (1−e)·TopicSim(s_i, Title) + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

where TR(s_i) denotes the score of sentence s_i; s_i, s_j, s_k denote the sentences numbered i, j, k; OUT(j) denotes the sentences other than s_j; IN(i) denotes all the sentences linking to s_i; DOC denotes the sentence set of the document to be summarized; α, e, β are calculation parameters, with α, β ∈ [0,1] and e ∈ [0,1]; and TopicSim(s_i, Title) denotes the semantic similarity between the document to be summarized and sentence s_i, computed from the topic vector v_{s_i} of sentence s_i and the topic vector v_{Title} of the document to be summarized as

    TopicSim(s_i, Title) = (v_{s_i} · v_{Title}) / (‖v_{s_i}‖ ‖v_{Title}‖)

This embodiment jointly considers the influence of the sentences' word co-occurrence rate and semantic relevance on sentence scoring. When correcting the scoring result it does not use the fixed value (1−e) directly as the random probability that sentence s_i is chosen as a key sentence; instead it uses a random probability tied to semantics, multiplying (1−e) by the topical relevance TopicSim(s_i, Title) to express the random probability that the sentence is chosen as a key summary sentence, thereby fully accounting for the influence of semantics on sentence scoring.
Example 6
Corresponding to the method embodiments, an embodiment of the invention also provides an electronic device. Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the invention, the electronic device comprising a processor 510, a communication interface 520, a memory 530, and a communication bus 540, wherein:
the processor 510, the communication interface 520 and the memory 530 communicate with one another via the communication bus 540, and the memory 530 is used for storing a computer program;
the processor 510 is configured to implement the text automatic summarization method provided by the invention when executing the program stored in the memory 530. Specifically, the text automatic summarization method includes:
segmenting the document to be summarized at predefined sentence-ending symbols;
calculating a topic vector for each segmented sentence from an existing text corpus;
determining the relevance of each pair of sentences from the number of words the two sentences have in common;
calculating the semantic similarity of each pair of sentences from their topic vectors;
calculating the score of each sentence using the inter-sentence relevance and semantic similarity;
and selecting the sentences whose scores meet the threshold, adding the preset connecting words, and outputting them in the chosen order to obtain the summary content.
Of course, to further improve the accuracy of the summary, the text automatic summarization method may also include:
segmenting the document to be summarized at predefined sentence-ending symbols;
computing, from the existing text corpus, the topic vector of the document to be summarized and the topic vectors of the segmented sentences;
measuring the relevance of each pair of sentences by the number of words the two sentences share in the document to be summarized;
computing the semantic similarity of each pair of sentences from the topic vectors of the segmented sentences, and computing the semantic similarity between the document to be summarized and each sentence from the document's topic vector and the sentence topic vectors;
computing the score of each sentence from the inter-sentence relevance, the inter-sentence semantic similarity, and the semantic similarity between the sentences and the document to be summarized;
and selecting the sentences whose scores meet the threshold, adding the preset connecting words, and outputting them in the chosen order to obtain the summary content.
These text automatic summarization methods are implemented in the same manner as in the foregoing method embodiments, and the details are not repeated here.
The communication bus mentioned for the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration only one thick line is shown in the figure, but this does not mean there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. The embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the embodiments of the apparatus and the electronic device, since they are substantially similar to the embodiments of the method, the description is simple, and the relevant points can be referred to only in the partial description of the embodiments of the method.
It will be understood by those skilled in the art that the foregoing is only an exemplary embodiment of the present invention, and is not intended to limit the invention to the particular forms disclosed, since various modifications, substitutions and improvements within the spirit and scope of the invention are possible and within the scope of the appended claims.

Claims (9)

1. A text automatic summarization method is characterized by comprising the following steps:
dividing the document to be summarized according to predefined sentence ending symbols;
calculating a topic vector of each sentence after segmentation according to an existing text corpus;
determining the relevance of each pair of sentences according to the number of words the two sentences have in common;
calculating the semantic similarity of each pair of sentences according to the topic vector of each sentence;
the score of each sentence being calculated using the relevance and the semantic similarity by the following formula:

    TR(s_i) = (1−e)/n + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

wherein TR(s_i) represents the score of sentence s_i; α, e, β are preset calculation parameters; s_i, s_j, s_k respectively represent the sentences numbered i, j, k; Similarity(s_i, s_j) represents the relevance of sentences s_i and s_j; TopicSim(s_i, s_j) represents the semantic similarity of sentences s_i and s_j; OUT(j) represents the sentences other than s_j; IN(i) represents all the sentences linking to s_i; and n represents the number of topics;
and selecting sentences the scores of which meet the threshold, adding preset connecting words, and outputting according to the selected output sequence to obtain the abstract content.
2. The method for automatically summarizing text according to claim 1, further comprising:
calculating a topic vector of the document to be summarized according to the existing text corpus, and calculating the semantic similarity between the document to be summarized and each sentence using the topic vector of the document to be summarized and the topic vectors of the segmented sentences;
wherein calculating the score of each sentence further comprises: using the semantic similarity between the document to be summarized and each sentence as a correction amount.
3. The method for automatically summarizing text according to claim 1, wherein said calculating a topic vector for each sentence after segmentation comprises:
and calculating the conditional probability of the words and the topics in each sentence after segmentation according to a preset formula, and repeating the conditional probability step until the calculation result is converged to obtain the topic vector of each sentence after segmentation.
4. The method for automatically summarizing text according to claim 2, wherein the method for calculating the topic vector of the document to be summarized comprises:
and calculating the conditional probability of the words and the topics in the document to be summarized according to a preset formula, and repeating the step of the conditional probability until the calculation result is converged to obtain the topic vector of the document to be summarized.
5. The method for automatically summarizing text according to claim 1 or 2, wherein the relevance of two sentences, determined from the number of words co-occurring between the two sentences in the document to be summarized, is calculated using the following formula:

    Similarity(s_i, s_j) = |{ω_k : ω_k ∈ s_i and ω_k ∈ s_j}| / (log|s_i| + log|s_j|)

wherein s_i, s_j are sentences with different numbers, ω_k is a word belonging to a sentence, |s_i| is the number of words in sentence s_i, and |s_j| is the number of words in sentence s_j; the higher the value of the calculation result, the higher the relevance of the sentences.
6. The method for automatically summarizing text according to claim 5, wherein the semantic similarity of two sentences, calculated from the topic vectors of the segmented sentences, uses the following formula:

    TopicSim(s_i, s_j) = (v_{s_i} · v_{s_j}) / (‖v_{s_i}‖ ‖v_{s_j}‖)

wherein v_{s_i} and v_{s_j} represent the topic vectors of sentences s_i and s_j; the higher the value of the calculation result, the higher the semantic relevance of the sentences.
7. The method for automatically summarizing text according to claim 6, wherein the score of each sentence is calculated using the following formula:

    TR(s_i) = (1−e)·TopicSim(s_i, Title) + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

wherein TR(s_i) represents the score of sentence s_i; α, e, β are preset calculation parameters; s_i, s_j, s_k respectively represent the sentences numbered i, j, k; OUT(j) represents the sentences other than s_j; IN(i) represents all the sentences linking to s_i; DOC represents the sentence set of the document to be summarized; and TopicSim(s_i, Title) represents the semantic similarity between the document to be summarized and sentence s_i, computed from the topic vector v_{s_i} of sentence s_i and the topic vector v_{Title} of the document to be summarized as

    TopicSim(s_i, Title) = (v_{s_i} · v_{Title}) / (‖v_{s_i}‖ ‖v_{Title}‖).
8. An apparatus for automatically summarizing text, the apparatus comprising:
the segmentation module is used for segmenting the document to be summarized according to predefined sentence ending symbols;
the first calculation module is used for calculating the topic vector of each sentence after segmentation according to the existing text corpus;
the second calculation module is used for determining the relevance of each pair of sentences according to the number of words the two sentences have in common;
the third calculation module is used for calculating the semantic similarity of each pair of sentences according to the topic vector of each sentence;
the scoring module is used for calculating the score of each sentence using the relevance and the semantic similarity, by the following formula:

    TR(s_i) = (1−e)/n + e · Σ_{s_j ∈ IN(i)} [ (α·Similarity(s_i, s_j) + β·TopicSim(s_i, s_j)) / Σ_{s_k ∈ OUT(j)} (α·Similarity(s_j, s_k) + β·TopicSim(s_j, s_k)) ] · TR(s_j)

wherein TR(s_i) represents the score of sentence s_i; α, e, β are preset calculation parameters; s_i, s_j, s_k respectively represent the sentences numbered i, j, k; Similarity(s_i, s_j) represents the relevance of sentences s_i and s_j; TopicSim(s_i, s_j) represents the semantic similarity of sentences s_i and s_j; OUT(j) represents the sentences other than s_j; IN(i) represents all the sentences linking to s_i; and n represents the number of topics;
and the summary output module is used for selecting the sentences whose scores meet the threshold, adding the preset connecting words, and outputting them in the chosen order to obtain the summary content.
9. An electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein:
the processor, the communication interface and the memory complete mutual communication through a communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 7 when executing a program stored in the memory.
CN201810787848.1A 2018-07-18 2018-07-18 Text automatic summarization method and device and electronic equipment Active CN109101489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810787848.1A CN109101489B (en) 2018-07-18 2018-07-18 Text automatic summarization method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810787848.1A CN109101489B (en) 2018-07-18 2018-07-18 Text automatic summarization method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN109101489A CN109101489A (en) 2018-12-28
CN109101489B true CN109101489B (en) 2022-05-20

Family

ID=64846627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810787848.1A Active CN109101489B (en) 2018-07-18 2018-07-18 Text automatic summarization method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN109101489B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162618B (en) * 2019-02-22 2021-09-17 北京捷风数据技术有限公司 Text summary generation method and device of non-contrast corpus
CN110134942B (en) * 2019-04-01 2020-10-23 北京中科闻歌科技股份有限公司 Text hotspot extraction method and device
CN110162778B (en) * 2019-04-02 2023-05-26 创新先进技术有限公司 Text abstract generation method and device
CN112699657A (en) * 2020-12-30 2021-04-23 广东德诚大数据科技有限公司 Abnormal text detection method and device, electronic equipment and storage medium
CN112711662A (en) * 2021-03-29 2021-04-27 贝壳找房(北京)科技有限公司 Text acquisition method and device, readable storage medium and electronic equipment
CN113204637B (en) * 2021-04-13 2022-09-27 北京三快在线科技有限公司 Text processing method and device, storage medium and electronic equipment
CN113407710A (en) * 2021-06-07 2021-09-17 维沃移动通信有限公司 Information display method and device, electronic equipment and readable storage medium
CN114201600A (en) * 2021-12-10 2022-03-18 北京金堤科技有限公司 Public opinion text abstract extraction method, device, equipment and computer storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620596A (en) * 2008-06-30 2010-01-06 东北大学 Multi-document auto-abstracting method facing to inquiry
CN102945228A (en) * 2012-10-29 2013-02-27 广西工学院 Multi-document summarization method based on text segmentation
CN103136359A (en) * 2013-03-07 2013-06-05 宁波成电泰克电子信息技术发展有限公司 Generation method of single document summaries
CN106294863A (en) * 2016-08-23 2017-01-04 电子科技大学 A kind of abstract method for mass text fast understanding
CN106445920A (en) * 2016-09-29 2017-02-22 北京理工大学 Sentence similarity calculation method based on sentence meaning structure characteristics
CN107491434A (en) * 2017-08-10 2017-12-19 北京邮电大学 Text snippet automatic generation method and device based on semantic dependency
CN107729300A (en) * 2017-09-18 2018-02-23 百度在线网络技术(北京)有限公司 Processing method, device, equipment and the computer-readable storage medium of text similarity

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Automatic summarization based on topic-word weights and sentence features (基于主题词权重和句子特征的自动文摘); Jiang Changjin et al.; Journal of South China University of Technology (Natural Science Edition) (《华南理工大学学报(自然科学版)》), No. 07, 2010-07-15; full text *
Automatic text summarization based on comprehensive sentence features (基于综合的句子特征的文本自动摘要); Cheng Yuan et al.; Computer Science (《计算机科学》), No. 04, 2015-04-15; full text *

Also Published As

Publication number Publication date
CN109101489A (en) 2018-12-28

Similar Documents

Publication Publication Date Title
CN109101489B (en) Text automatic summarization method and device and electronic equipment
Smetanin et al. Deep transfer learning baselines for sentiment analysis in Russian
JP5936698B2 (en) Word semantic relation extraction device
Kumar et al. Mastering text mining with R
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
Lind et al. Building the bridge: Topic modeling for comparative research
Takala et al. Gold-standard for Topic-specific Sentiment Analysis of Economic Texts.
Khattak et al. A survey on sentiment analysis in Urdu: A resource-poor language
Kolesnikova Survey of word co-occurrence measures for collocation detection
Nguyen et al. Statistical approach for figurative sentiment analysis on social networking services: a case study on twitter
CN110442872A (en) A kind of text elements integrality checking method and device
Ferreira et al. A new sentence similarity assessment measure based on a three-layer sentence representation
Hu et al. Self-supervised synonym extraction from the web.
Woltmann et al. Tracing university–industry knowledge transfer through a text mining approach
Ingólfsdóttir et al. Named entity recognition for icelandic: Annotated corpus and models
Petrović et al. The influence of text preprocessing methods and tools on calculating text similarity
Sarkar A hidden markov model based system for entity extraction from social media english text at fire 2015
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
WO2023198696A1 (en) Method for extracting information from an unstructured data source
Ningtyas et al. The Influence of Negation Handling on Sentiment Analysis in Bahasa Indonesia
Omurca et al. An annotated corpus for Turkish sentiment analysis at sentence level
Wang et al. Unsupervised opinion phrase extraction and rating in Chinese blog posts
Zhang et al. Sentiment identification by incorporating syntax, semantics and context information
Zhang et al. Extracting Product Features and Sentiments from Chinese Customer Reviews.
CN111178038B (en) Document similarity recognition method and device based on latent semantic analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230410

Address after: 430074 Room 01, Floor 6, Building A4, Financial Port, 77 Guanggu Avenue, Donghu New Technology Development Zone, Wuhan, Hubei Province

Patentee after: WUHAN SHUBO TECHNOLOGY Co.,Ltd.

Patentee after: WUHAN University

Address before: 430072 Fenghuo innovation Valley, No. 88, YouKeYuan Road, Hongshan District, Wuhan City, Hubei Province

Patentee before: WUHAN SHUBO TECHNOLOGY Co.,Ltd.
