WO2022262266A1

WO2022262266A1 - Text abstract generation method and apparatus, and computer device and storage medium

Info

Publication number: WO2022262266A1
Application number: PCT/CN2022/071791
Authority: WO
Inventors: 李夏昕
Original assignee: 平安科技（深圳）有限公司
Priority date: 2021-06-18
Filing date: 2022-01-13
Publication date: 2022-12-22
Also published as: CN113254593B; CN113254593A

Abstract

The present application relates to the field of artificial intelligence. Provided are a text abstract generation method and apparatus, and a computer device and a storage medium. The defects whereby a traditional TextRank algorithm, when calculating the similarity of plain text, does not distinguish the importance of different terms and also does not filter out unimportant words according to the parts of speech to which said words belong can be effectively overcome, thereby improving the possibility of sentences having high business relevance being selected for an abstract; the before-after proximity relationship between sentences and the positions of the sentences in an original article are fully considered during a modeling process, thereby effectively overcoming the problem of inaccurate text abstract generation caused by the fact that the importance of the positional sequence of sentences in an article is not considered in a traditional mode; and post-processing is added on the basis of traditional text abstract generation, and an abstract result acquired by means of a graph algorithm is corrected, thereby improving the quality of a finally-output abstract, and thus realizing more accurate text abstract generation on the basis of artificial intelligence. The present application further relates to blockchain technology, and an abstract sentence may be stored in a blockchain node.

Description

Text summary generation method, device, computer equipment and storage medium

This application claims priority to Chinese patent application No. 202110679639.7, titled "Text abstract generation method, device, computer equipment and storage medium" and filed with the China Patent Office on June 18, 2021, the entire content of which is incorporated into this application by reference.

Technical field

The present application relates to the technical field of artificial intelligence, and in particular to a text abstract generation method, device, computer equipment and storage medium.

Background technique

Text summarization is an important technology in the field of artificial intelligence. For humans, reading a long text and extracting its core content is an innate ability, but for computers it remains one of the most challenging problems in artificial intelligence. Today's Internet carries a large amount of text information, including many medium-length and long texts. Understanding these texts by machine and extracting their core summaries can support various applications that benefit society, such as media monitoring, search engine marketing and optimization, financial and legal text analysis, social media marketing, book and document content indexing, video conference summaries, automatic content authoring, and more.

Existing text summarization techniques can be divided along one axis into supervised and unsupervised, and along another into extractive and generative. Supervised text summarization requires a large amount of manually labeled data. Manual labeling of text summaries is laborious and costly, and different labelers also differ in how they judge the core content of an article. Therefore, unsupervised solutions are generally adopted in industrial implementations. Extractive summarization generally extracts important content from the original article in units of sentences and then stitches the sentences together as the article abstract. Generative summarization directly generates the abstract through deep learning seq2seq (Sequence to Sequence) models, which involves semantic representation, inference, and natural language generation, all of which are difficult to implement. Generative summarization is therefore more of an academic research hotspot, and its results in industrial deployment are not ideal.

At present, unsupervised extractive schemes are the most commonly used in industrial implementations of text summarization. Specific methods include graph-based, topic-model-based, centrality-based, and information-redundancy-based methods. Among them, the graph-based TextRank algorithm is the most classic and widely used. The TextRank algorithm is highly versatile and suits texts in various fields as well as medium-length and long texts, but it also has some defects: (1) In the TextRank algorithm, the edge connecting two graph nodes is a single undirected edge with a single weight, so from the perspective of this edge the node sentences at its two ends have equal weight. However, when any two sentences in an article are compared, one should be more important than the other. (2) In the TextRank algorithm, any two nodes in the graph are connected by an edge, which is equivalent to mixing all the sentences of the article together for modeling, without considering the adjacency of sentences or their positions in the original article. However, when extracting an abstract, the position of a sentence and its context play an important role in judging whether it is a summary sentence. For example, sentences at the beginning or end of an article or paragraph, as well as concluding sentences, are likely to be summary sentences. (3) When the TextRank algorithm calculates the weights of the edges in the graph, it considers only the plain-text similarity between two sentences and not their semantic similarity, that is, it does not handle texts that are worded differently but have similar semantics. (4) When the TextRank algorithm calculates plain-text similarity, it neither distinguishes the importance of different terms nor filters out unimportant words by part of speech, so the accuracy of the plain-text similarity calculation needs to be improved.

The inventor realized that the above-mentioned defects affect the quality of the final generated text summary. In addition, existing text summary generation technology lacks a correction step for the generated summary, and the summary output by the TextRank algorithm generally has some problems, so the generated summary is not ideal.

Contents of the invention

Embodiments of the present application provide a text abstract generation method, device, computer equipment, and storage medium, which can realize more accurate text abstract generation based on artificial intelligence means.

In the first aspect, the embodiment of the present application provides a method for generating a text abstract, which includes:

Responding to a text summary generation instruction, acquiring data to be processed according to the text summary generation instruction;

Obtaining a dictionary according to the task scene and segmenting the data to be processed with the dictionary to obtain multiple clauses;

calculating the degree of mutual recommendation between every two clauses in the plurality of clauses;

calculating the semantic similarity between every two clauses in the plurality of clauses;

calculating the positional similarity between every two clauses in the multiple clauses;

The mutual recommendation degree between each two clauses, the semantic similarity between each two clauses, and the positional similarity between each two clauses are fused to obtain a graph adjacency matrix;

Inputting the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause;

Screening according to the importance of each clause to obtain candidate clauses;

Perform post-processing on the candidate clauses to obtain a summary sentence.

In a second aspect, the embodiment of the present application provides a text abstract generating device, which includes:

An acquisition unit, configured to respond to a text summary generation instruction, and obtain data to be processed according to the text summary generation instruction;

A segmentation unit, configured to obtain a dictionary according to the task scene and segment the data to be processed with the dictionary to obtain multiple clauses;

a calculation unit, configured to calculate the mutual recommendation degree between every two clauses in the plurality of clauses;

The calculation unit is also used to calculate the semantic similarity between every two clauses in the plurality of clauses;

The calculation unit is also used to calculate the positional similarity between every two clauses in the plurality of clauses;

The fusion unit is used to fuse the mutual recommendation degree between each two clauses, the semantic similarity between each two clauses, and the positional similarity between each two clauses to obtain a graph adjacency matrix;

The calculation unit is also used to input the graph adjacency matrix to the TextRank algorithm to calculate the importance of each clause;

A screening unit, configured to screen according to the importance of each clause to obtain candidate clauses;

A post-processing unit, configured to post-process the candidate clauses to obtain a summary sentence.

In a third aspect, the embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:

Perform post-processing on the candidate clauses to obtain a summary sentence.

In a fourth aspect, the embodiment of the present application further provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor performs the following steps:

Perform post-processing on the candidate clauses to obtain a summary sentence.

The embodiments of the present application provide a text abstract generation method, device, computer equipment, and storage medium. They effectively overcome the defects that the traditional TextRank algorithm, when calculating plain-text similarity, neither distinguishes the importance of different terms nor filters out unimportant words by part of speech, thereby improving the likelihood that sentences with strong business relevance are selected for the abstract. During modeling, the adjacency of sentences and their positions in the original article are fully considered, which effectively overcomes the inaccurate abstract generation caused in traditional methods by ignoring the importance of sentence order in the article. Post-processing is added on top of traditional text abstract generation to correct the abstract obtained by the graph algorithm, which improves the quality of the final output and thus realizes more accurate text abstract generation based on artificial intelligence.

Description of drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow diagram of a method for generating a text abstract provided in an embodiment of the present application;

FIG. 2 is a schematic block diagram of a text abstract generation device provided by an embodiment of the present application;

FIG. 3 is a schematic block diagram of a computer device provided by an embodiment of the present application.

Detailed description

The following will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, not all of them. Based on the embodiments in this application, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the scope of protection of this application.

It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "comprises" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.

It should also be understood that the terminology used in the specification of this application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include plural referents unless the context clearly dictates otherwise.

It should also be further understood that the term "and/or" used in the description of the present application and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.

Please refer to FIG. 1, which is a schematic flowchart of a method for generating a text abstract provided in an embodiment of the present application.

S10, in response to a text summary generation instruction, acquire data to be processed according to the text summary generation instruction.

In this embodiment, the text summary generation instruction may be triggered by relevant staff, such as media monitors, online educators, and the like.

In at least one embodiment of the present application, said obtaining the data to be processed according to said text summary generation instruction includes:

Detecting synchronously uploaded information when the text summary generation instruction is triggered;

Obtain an address from said information as a target address;

Link to the target address, and acquire the data stored at the target address as the data to be processed.

Wherein, the target address may include, but is not limited to: a web page address, a folder address, a database address, and the like.

Of course, in other embodiments, when the synchronously uploaded information includes the data to be processed, the data to be processed is directly extracted. For example, if the user synchronously uploads the data to be processed when triggering the text summary generation instruction, the data to be processed can be obtained directly from the text summary generation instruction.

S11. Obtain a dictionary according to the task scene and segment the data to be processed with the dictionary to obtain multiple clauses.

In this embodiment, obtaining a dictionary according to the task scene and segmenting the data to be processed to obtain multiple clauses includes:

Identify the current task scenario;

Retrieving a dictionary matching the current task scene as a target dictionary;

Segmenting the data to be processed according to the target dictionary to obtain the plurality of clauses.

For example, when the current task scenario is a financial scenario, a financial dictionary matching that scenario is obtained as the target dictionary and used to segment the data to be processed into clauses and words, so that finance-related terms are segmented correctly. The result is $\mathit{sents} = [s_1, s_2, \ldots, s_i]$, where $s_i = [w_1/t_1, w_2/t_2, \ldots, w_n/t_n]$. Here $\mathit{sents}$ is the list of clauses of the data to be processed, $s_i$ is the $i$-th clause in $\mathit{sents}$, $w_n$ is the $n$-th word of the clause, $t_n$ is the part of speech of the $n$-th word, and $i$ and $n$ are positive integers.

In this embodiment, the data to be processed can be segmented according to the punctuation mark of the sentence, such as a period, a question mark, an exclamation mark, and the like. In this embodiment, a word segmentation tool (such as a Chinese word segmentation tool) can be used to load the target dictionary, so as to segment business-related entries well.

Through the above embodiments, sentences can be segmented according to specific dictionaries associated with specific task scenarios, so as to better segment business-related entries.
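As a non-limiting illustration, the segmentation of step S11 can be sketched as follows. The dictionary `FINANCE_DICT`, its part-of-speech tags, and the helper names `split_clauses` and `tag_clause` are hypothetical and are not part of the application; a real system would load the target dictionary into a word segmentation tool rather than use a plain lookup table.

```python
import re

# Hypothetical scene-specific dictionary: term -> part-of-speech tag
# (n = noun, v = verb, d = adverb; the tag names are assumptions).
FINANCE_DICT = {"bond": "n", "yield": "n", "rose": "v", "sharply": "d"}

def split_clauses(text):
    """Split text into clauses at periods, question marks and exclamation marks.

    This is a simplification: a period inside a number such as 3.5 would
    also be treated as a clause boundary.
    """
    parts = re.findall(r"[^.?!。？！]+[.?!。？！]?", text)
    return [p.strip() for p in parts if p.strip()]

def tag_clause(clause, dictionary):
    """Return [(word, tag), ...]; words missing from the dictionary get tag 'x'."""
    words = re.findall(r"\w+", clause.lower())
    return [(w, dictionary.get(w, "x")) for w in words]

clauses = split_clauses("Bond yield rose sharply. Why? Rates moved!")
sents = [tag_clause(c, FINANCE_DICT) for c in clauses]
```

Each element of `sents` then has the shape of $s_i = [w_1/t_1, \ldots, w_n/t_n]$ described above, with pairs `(word, tag)` standing in for `w/t`.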

S12. Calculate mutual recommendation degrees between every two clauses in the plurality of clauses.

In this embodiment, the calculating the mutual recommendation degree between every two clauses in the plurality of clauses includes:

configuring the word weight of each word in the plurality of clauses according to the received configuration requirements;

For the multiple clauses, obtain the words that appear simultaneously in every two clauses as the target word;

Determine the word weight and part of speech of the target word;

Calculate the text similarity between every two clauses according to the word weight and the part of speech of the target word to obtain a recommendation degree matrix;

Perform L2 normalization on the recommendation degree matrix to obtain the mutual recommendation degree between every two clauses.

Wherein, the configuration requirement may be uploaded by the user.

Among them, when calculating the similarity between texts, the formula used is as follows:

Here, $\mathrm{mat}_t(S_i, S_j)$ represents the mutual recommendation degree between any two clauses $S_i$ and $S_j$, $W_k$ represents a word that appears in both $S_i$ and $S_j$, TermWeight represents the weight of a word, $T_k$ represents the part of speech of $W_k$, and valid_postags represents the set of valid parts of speech.

Wherein, the effective parts of speech include nouns, verbs, adjectives, and adverbs, which are closely related to sentence semantics.

Moreover, when the common-word (Wk) score of two sentences is calculated, business words of different importance are assigned different weights. For example, the weight of a product name entry can be twice that of a general entry, and the weight of a disease name or competing company name entry can be 1.5 times that of a general entry. The specific weight values can be determined by a parameter search based on regression test results.

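Further, L2 normalization is applied to $\mathrm{mat}_t(S_i, S_j)$, that is, each element of the matrix is divided by norm_val, where

$$\mathrm{norm\_val} = \sqrt{\sum_{i,j} \mathrm{mat}_t(S_i, S_j)^2}$$

The quantity under the root sign is the sum of squares of all elements in the matrix $\mathrm{mat}_t(S_i, S_j)$.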

In its usual sense, an L2 regularization term, also called a penalty term, is added after the loss function to constrain the parameters of a model and prevent overfitting; it corresponds to a Gaussian prior on the parameters and is differentiable everywhere. In this step, the L2 norm is used to scale the recommendation degree matrix as described above.

In the above embodiment, only nouns, verbs, adjectives, and adverbs, which are closely related to sentence semantics, are retained, and business words of different importance are given different weights when the common-word (Wk) score is calculated. Combined with L2 normalization, this effectively overcomes the defects that the traditional TextRank algorithm, when calculating plain-text similarity, neither distinguishes the importance of different entries nor filters out unimportant words by part of speech, and it improves the likelihood that sentences with strong business relevance are selected for the summary.
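As a non-limiting illustration, step S12 can be sketched as follows. Because the similarity formula itself is not reproduced in this text, the sketch simply sums the weights of the common words that have a valid part of speech; any length normalization the original formula may contain is omitted. The tag names, the zero diagonal, and the function names are assumptions made for the example.

```python
import math

# Assumed tag names for noun, verb, adjective and adverb.
VALID_POSTAGS = {"n", "v", "a", "d"}

def recommendation_matrix(sents, term_weight, default_weight=1.0):
    """sents: list of clauses, each a list of (word, tag) pairs.

    Cell [i][j] is the summed weight of the words shared by clauses i and j
    whose part of speech is valid. The diagonal is left at zero (assumption).
    """
    n = len(sents)
    vocab = [{w for w, t in s if t in VALID_POSTAGS} for s in sents]
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                common = vocab[i] & vocab[j]
                mat[i][j] = sum(term_weight.get(w, default_weight) for w in common)
    return mat

def l2_normalize(mat):
    """Divide every element by the square root of the sum of squares of all elements."""
    norm_val = math.sqrt(sum(x * x for row in mat for x in row))
    if norm_val == 0:
        return [row[:] for row in mat]
    return [[x / norm_val for x in row] for row in mat]

sents = [
    [("fund", "n"), ("rose", "v"), ("the", "x")],
    [("fund", "n"), ("fell", "v"), ("the", "x")],
    [("rain", "n")],
]
term_weight = {"fund": 2.0}  # e.g. a product name weighted twice a general entry
mat_t = l2_normalize(recommendation_matrix(sents, term_weight))
```

In this toy input the shared word "the" is ignored because its tag is not valid, while the shared business word "fund" contributes its doubled weight.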

S13. Calculate the semantic similarity between every two clauses in the plurality of clauses.

In this embodiment, the calculating the semantic similarity between every two clauses in the plurality of clauses includes:

Vectorize each clause to obtain the embedded vector representation of each clause;

Calculate the cosine similarity between each two clauses according to the embedding vector representation of each clause;

The cosine similarity between every two clauses is determined as the semantic similarity between every two clauses.

Specifically, when calculating the semantic similarity between sentences, the formula used is as follows:

$$\mathrm{mat}_s(S_i, S_j) = \mathrm{cosine\_similarity}(s_i\text{-embed},\ s_j\text{-embed})$$

Here, $\mathrm{mat}_s(S_i, S_j)$ represents the semantic similarity between any two clauses $S_i$ and $S_j$, $s_i\text{-embed}$ represents the embedding vector of clause $S_i$, $s_j\text{-embed}$ represents the embedding vector of clause $S_j$, and cosine_similarity denotes computing the cosine similarity.

In the above implementation manner, the defect that the traditional algorithm only considers the plain text similarity between two sentences and does not consider the semantic similarity is avoided.
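As a non-limiting illustration, the semantic similarity matrix of step S13 can be sketched as follows. The embedding vectors are assumed to be supplied by some sentence encoder that is not specified here; the toy vectors and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def semantic_matrix(embeddings):
    """Cell [i][j] holds the cosine similarity of the embeddings of clauses i and j."""
    n = len(embeddings)
    return [[cosine_similarity(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]

# Toy 2-dimensional embeddings standing in for real sentence vectors.
mat_s = semantic_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The resulting matrix is symmetric, since the cosine similarity does not depend on the order of its arguments.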

S14. Calculate the positional similarity between every two clauses in the plurality of clauses.

In this embodiment, the calculating the position similarity between every two clauses in the plurality of clauses includes:

Determining every two clauses as a sentence group, wherein one of the two clauses in each sentence group is the recommending sentence and the other is the recommended sentence;

When any clause is the recommended sentence, determine the position of the recommended sentence in the corresponding paragraph, and when the recommended sentence is in the front preset position or the rear preset position of the corresponding paragraph, determine that the corresponding matrix cell value is the first value;

When any clause is the recommending sentence, determine the position of the recommending sentence in the corresponding paragraph, and when the recommending sentence is in the front preset position or the rear preset position of the corresponding paragraph, determine that the corresponding matrix cell value is the second value;

When both the recommending sentence and the recommended sentence in any sentence group are in the front preset position or the rear preset position of the corresponding paragraph, determine that the corresponding matrix cell value is the third value;

When neither the recommending sentence nor the recommended sentence in any sentence group is in the front preset position or the rear preset position of the corresponding paragraph, determine that the corresponding matrix cell value is the fourth value;

When any clause is the recommended sentence and has the specified attribute, determine that the corresponding matrix cell value is the first value;

Matrix transformation is performed according to the matrix cell value to obtain the positional similarity between every two clauses.

Wherein, the first value, the second value, the third value, and the fourth value can be customized. For example, in this embodiment, the first value can be configured as 2, the second value as 1.5, the third value as 2.5, and the fourth value as 1.

Wherein, the front preset position or the rear preset position can also be customized, for example, the front preset position can be configured as the first 5%, and correspondingly, the rear preset position can be configured as the rear 5%.

Wherein, the specified attribute may be a summary attribute, that is, the sentence with the specified attribute is a summary sentence.

Through the above implementation, the adjacency of sentences and their positions in the original article are fully considered during modeling, which effectively overcomes the inaccurate text summary generation caused in the traditional approach by ignoring the importance of sentence order in the article.
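As a non-limiting illustration, the position matrix of step S14 can be sketched as follows for a single paragraph. The sketch assumes that row i is the recommending clause and column j is the recommended clause, that at least one clause at each end of a paragraph counts as a preset position, and that the rules are applied in the order shown; none of these choices is fixed by the text, and the function names are hypothetical.

```python
import math

# Example values given in the text for the first to fourth values.
FIRST, SECOND, THIRD, FOURTH = 2.0, 1.5, 2.5, 1.0

def edge_flags(paragraph_len, frac=0.05):
    """Mark clauses lying in the leading or trailing `frac` of a paragraph.

    Assumption: at least one clause at each end always counts as an edge.
    """
    k = max(1, math.ceil(paragraph_len * frac))
    return [i < k or i >= paragraph_len - k for i in range(paragraph_len)]

def position_matrix(is_edge, is_summary_like=None):
    """Cell [i][j]: i is the recommending clause, j the recommended clause (assumed)."""
    n = len(is_edge)
    is_summary_like = is_summary_like or [False] * n
    mat = [[FOURTH] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if is_edge[i] and is_edge[j]:
                mat[i][j] = THIRD       # both clauses at a preset position
            elif is_edge[j] or is_summary_like[j]:
                mat[i][j] = FIRST       # recommended clause at an edge or summary-like
            elif is_edge[i]:
                mat[i][j] = SECOND      # only the recommending clause at an edge
    return mat

flags = edge_flags(5)
mat_o = position_matrix(flags)
```

Under these assumptions the matrix is not symmetric: a middle clause recommending an edge clause receives the first value, while the reverse direction receives the second value.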

S15, performing fusion processing on the mutual recommendation degree between each two clauses, the semantic similarity between each two clauses, and the position similarity between each two clauses to obtain a graph adjacency matrix.

In at least one embodiment of the present application, the mutual recommendation degree between each two clauses, the semantic similarity between each two clauses, and the positional similarity between each two clauses are fused using the following formula , to get the graph adjacency matrix:

$$\mathrm{mat}_{adjc} = (\alpha\,\mathrm{mat}_t + \beta\,\mathrm{mat}_s) \odot \mathrm{mat}_o$$

Here, $\mathrm{mat}_{adjc}$ represents the graph adjacency matrix, $\mathrm{mat}_t$ represents the mutual recommendation degree between every two clauses, $\mathrm{mat}_s$ represents the semantic similarity between every two clauses, $\mathrm{mat}_o$ represents the positional similarity between every two clauses, $\alpha$ represents the weight of the mutual recommendation degree, $\beta$ represents the weight of the semantic similarity, $\alpha > 0$, $\beta > 0$, and $\alpha + \beta = 1$.

In this embodiment, $(\alpha\,\mathrm{mat}_t + \beta\,\mathrm{mat}_s)$ and $\mathrm{mat}_o$ are multiplied element-wise ($\odot$), so that the product is no longer symmetric even though $(\alpha\,\mathrm{mat}_t + \beta\,\mathrm{mat}_s)$ is. In $\mathrm{mat}_{adjc}$, the similarity between sentences is therefore affected by the positions of the sentences in the text.

It should be noted that in the traditional summary extraction scheme, the edge connecting two graph nodes is a single undirected edge with a single weight. From the perspective of this single undirected edge, the sentences at its two ends have equal weight. However, when any two sentences in an article are compared, one should be more important than the other, so treating the two sentences as equally important is clearly inappropriate.

In this embodiment, by contrast, fusing the mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses yields a graph adjacency matrix that models the connection between two graph nodes as two directed edges instead of a single undirected edge, which overcomes the defect of having only a single undirected edge in the traditional scheme.
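As a non-limiting illustration, the fusion of step S15 can be sketched as follows. The function name `fuse`, the default weight, and the toy matrices are hypothetical; the weight of the semantic similarity is derived as one minus the weight of the mutual recommendation degree so that the two weights sum to 1.

```python
def fuse(mat_t, mat_s, mat_o, alpha=0.5):
    """Return (alpha*mat_t + beta*mat_s) multiplied element-wise by mat_o, beta = 1 - alpha."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    beta = 1.0 - alpha
    n = len(mat_t)
    return [[(alpha * mat_t[i][j] + beta * mat_s[i][j]) * mat_o[i][j]
             for j in range(n)]
            for i in range(n)]

# Symmetric inputs combined with an asymmetric position matrix.
mat_t = [[0.0, 0.4], [0.4, 0.0]]
mat_s = [[1.0, 0.8], [0.8, 1.0]]
mat_o = [[1.0, 2.0], [1.5, 1.0]]
mat_adjc = fuse(mat_t, mat_s, mat_o, alpha=0.5)
```

Here the two off-diagonal cells start from the same fused similarity but end up different after the element-wise product, which is what turns one undirected edge into two directed edges with separate weights.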

S16. Input the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause.

In this embodiment, after the graph adjacency matrix is input into the TextRank algorithm, the TextRank value of each node is iteratively calculated as the importance of each corresponding clause, which will not be repeated here.
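As a non-limiting illustration, the iterative calculation of step S16 can be sketched as a weighted, directed PageRank-style iteration. The damping factor of 0.85, the stopping rule, and the convention that cell [i][j] is the weight of the edge from clause i to clause j are assumptions of this sketch and are not fixed by the text.

```python
def textrank(adj, damping=0.85, max_iter=100, tol=1e-6):
    """Iteratively score nodes of a weighted directed graph.

    adj[i][j] is taken as the weight of the edge from node i to node j
    (assumption). Each node passes its score to its targets in proportion
    to the outgoing edge weights.
    """
    n = len(adj)
    if n == 0:
        return []
    out_sum = [sum(row) for row in adj]
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for j in range(n):
            incoming = sum(adj[i][j] / out_sum[i] * scores[i]
                           for i in range(n) if out_sum[i] > 0)
            new_scores.append((1 - damping) + damping * incoming)
        delta = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores

# Node 0 is recommended by both other nodes; nodes 1 and 2 share node 0's vote.
scores = textrank([[0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```

In this toy graph the first clause ends up with the highest importance because it receives the full weight of both other clauses.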

S17. Screening is performed according to the importance of each clause to obtain candidate clauses.

In at least one embodiment of the present application, screening according to the importance of each clause to obtain the candidate clauses includes:

Get the preset threshold;

A clause whose importance is greater than or equal to the preset threshold is acquired as the candidate clause.

Wherein, the preset threshold can be customized, such as 95%.

In at least one embodiment of the present application, screening according to the importance of each clause to obtain the candidate clauses may alternatively include:

Sort the clauses in descending order of importance;

Obtain the preset position;

The clauses arranged before the preset position are determined as the candidate clauses.

Wherein, the preset positions can be customized, such as 20 positions.

The preset position is equivalent to a hyperparameter and can be obtained through experiments or debugging. For example, on a regression test set, a hyperparameter search over the preset position is performed with the ROUGE value of the abstract as the metric, and the value that yields the best ROUGE value is used as the preset position.
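As a non-limiting illustration, the two screening variants of step S17 can be sketched as follows. The function names are hypothetical, the threshold is treated as a plain score supplied by the caller, and returning the selected clauses in document order is an assumption made so that later steps can reason about neighbouring clauses.

```python
def select_by_threshold(scores, threshold):
    """Indices of clauses whose importance is >= threshold, in document order."""
    return [i for i, s in enumerate(scores) if s >= threshold]

def select_top_k(scores, k):
    """Indices of the k most important clauses, returned in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

scores = [0.2, 0.9, 0.5, 0.7]
by_threshold = select_by_threshold(scores, 0.5)  # [1, 2, 3]
top_two = select_top_k(scores, 2)                # [1, 3]
```

The value of `k` plays the role of the preset position and would be tuned on a regression test set as described above.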

S18. Post-processing the candidate clauses to obtain a summary sentence.

It should be noted that the candidate clauses form the initially obtained summary, but they may include sentence patterns such as questions, results, progress, transitions, and guidance. Such sentences should not appear independently of their context, so if their context has not been selected as summary sentences, further correction is required.

Specifically, the post-processing of the candidate clauses to obtain a summary sentence includes:

identifying the type of each of the alternative clauses;

When the type of a target clause among the candidate clauses is an interrogative sentence, obtain the clause immediately following the target clause and add the obtained clause to the summary sentences;

When a candidate clause contains one of the constituent words of a specified associated phrase (a correlative conjunction pair such as "although ... but ..."), obtain the clause containing the other constituent word of that phrase and add the obtained clause to the summary sentences.

Wherein, the types of clauses may include, but are not limited to: interrogative sentences and sentences composed of associated phrases.

For example, the type of each clause in the candidate clauses may be judged according to keywords or symbols obtained through text recognition. For example: when "?" is recognized, it is judged as an interrogative sentence.

For example, if a summary sentence is a question, the adjacent next sentence should usually also be included in the summary; and if a summary sentence is one half of a construction such as "although ... but ..." or "because ... so ...", the other half should usually be included in the summary as well.

Through the above implementation, post-processing is added on the basis of traditional text summarization, and the summarization result obtained by the graph algorithm is corrected to improve the quality of the final output summarization.
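As a non-limiting illustration, the post-processing of step S18 can be sketched as follows. The sketch assumes that the two halves of an associated phrase sit in adjacent clauses and that the constituent word is the first word of its clause; the pair list, the function names, and the sample clauses are hypothetical.

```python
import re

# Example pairs taken from the text; a real system would configure its own list.
CORRELATIVE_PAIRS = [("although", "but"), ("because", "so")]

def first_word(clause):
    """Lower-cased first word of a clause, or '' if it has none."""
    words = re.findall(r"\w+", clause.lower())
    return words[0] if words else ""

def post_process(clauses, selected):
    """clauses: all clauses in document order; selected: indices of candidate clauses."""
    chosen = set(selected)
    for i in selected:
        # Rule 1: an interrogative clause pulls in the clause that follows it.
        if clauses[i].rstrip().endswith(("?", "？")) and i + 1 < len(clauses):
            chosen.add(i + 1)
        # Rule 2: one half of a pair pulls in the neighbour holding the other half.
        for first, second in CORRELATIVE_PAIRS:
            if (first_word(clauses[i]) == first and i + 1 < len(clauses)
                    and first_word(clauses[i + 1]) == second):
                chosen.add(i + 1)
            if (first_word(clauses[i]) == second and i > 0
                    and first_word(clauses[i - 1]) == first):
                chosen.add(i - 1)
    return [clauses[i] for i in sorted(chosen)]

clauses = ["Why did sales fall?", "Demand was weak.", "Although costs rose,",
           "but margins held.", "The outlook is stable."]
summary = post_process(clauses, [0, 3])
```

With candidates 0 and 3, the question pulls in its answer and the "but" clause pulls in its "although" half, so the summary contains the first four clauses in their original order.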

It should be noted that, in order to further ensure the security of the data and prevent the data from being maliciously tampered with, the summary sentences can be stored on the blockchain nodes.

It can be seen from the above technical solutions that the present application responds to a text summary generation instruction, obtains the data to be processed according to that instruction, and segments the data according to a dictionary obtained for the task scene to obtain multiple clauses. Segmenting with a dictionary associated with the specific task scenario allows business-related entries to be segmented more accurately. The mutual recommendation degree between every two clauses is then calculated. Only nouns, verbs, adjectives, and adverbs, the four parts of speech closely related to sentence semantics, are retained, and business words of different importance are given different weights when the common-word score is calculated. Combined with L2 normalization, this effectively overcomes the defects that the traditional TextRank algorithm, when calculating plain-text similarity, neither distinguishes the importance of different entries nor filters out unimportant words by part of speech, and it improves the likelihood that sentences with strong business relevance are selected for the abstract. The semantic similarity between every two clauses is calculated, which avoids the defect that the traditional algorithm considers only the plain-text similarity between two sentences and not their semantic similarity. The positional similarity between every two clauses is calculated, so that the adjacency of sentences and their positions in the original article are fully considered during modeling, which effectively overcomes the inaccurate summary generation caused by ignoring the importance of sentence order in the article. The mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses are fused to obtain a graph adjacency matrix that models the connection between two graph nodes as two directed edges instead of a single undirected edge, which overcomes the defect of having only a single undirected edge in the traditional scheme. The graph adjacency matrix is input into the TextRank algorithm to calculate the importance of each clause, the clauses are screened by importance to obtain candidate clauses, and the candidate clauses are post-processed to obtain summary sentences. Adding post-processing on top of traditional text summary generation corrects the summary obtained by the graph algorithm and improves the quality of the final output, thereby realizing more accurate text summary generation based on artificial intelligence.

The embodiment of the present application further provides a text abstract generation device, and the text abstract generation device is configured to execute any embodiment of the foregoing text abstract generation method. Specifically, please refer to FIG. 2. FIG. 2 is a schematic block diagram of an apparatus for generating a text summary provided by an embodiment of the present application.

As shown in FIG. 2, the device 100 for generating a text summary includes: an acquisition unit 101, a segmentation unit 102, a calculation unit 103, a fusion unit 104, a screening unit 105, and a post-processing unit 106.

In response to the text summary generation instruction, the acquisition unit 101 acquires the data to be processed according to the text summary generation instruction.

In at least one embodiment of the present application, the acquisition unit 101 acquiring the data to be processed according to the text summary generation instruction includes:

Obtain an address from said information as a target address;

The segmentation unit 102 performs segmentation processing on the data to be processed using a dictionary acquired according to the task scenario, to obtain multiple clauses.

In this embodiment, the segmentation unit 102 performing segmentation processing on the data to be processed using the dictionary acquired according to the task scenario to obtain multiple clauses includes:

Identify the current task scenario;

Retrieving a dictionary matching the current task scene as a target dictionary;

The calculation unit 103 calculates the degree of mutual recommendation between every two clauses in the plurality of clauses.

In this embodiment, the calculation unit 103 calculating the mutual recommendation degree between every two clauses in the plurality of clauses includes:

Determine the word weight and part of speech of the target word;

Wherein, the configuration requirement may be uploaded by the user.
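The mutual recommendation calculation can be illustrated with a minimal sketch. It assumes each clause is already tokenized into (word, part-of-speech) pairs; the tag set `KEPT_POS`, the function name `recommendation_matrix`, the default weight of 1.0, the sum-of-weights score, and the reading of "L2 regularization" as row-wise L2 normalization are all illustrative assumptions rather than details stated in the application.

```python
import math

# Assumed tag set standing for noun, verb, adjective and adverb.
KEPT_POS = {"n", "v", "a", "d"}

def recommendation_matrix(clauses, word_weights, default_weight=1.0):
    """clauses: list of clauses, each a list of (word, pos) pairs.

    Returns an n x n matrix whose rows are L2-normalized.
    """
    # Keep only words whose part of speech is in KEPT_POS.
    kept = [{w for w, pos in clause if pos in KEPT_POS} for clause in clauses]
    n = len(clauses)
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            common = kept[i] & kept[j]  # target words shared by both clauses
            mat[i][j] = sum(word_weights.get(w, default_weight) for w in common)
    # Row-wise L2 normalization (assumed meaning of "L2 regularization").
    for row in mat:
        norm = math.sqrt(sum(v * v for v in row))
        if norm > 0:
            for j in range(n):
                row[j] /= norm
    return mat
```

Because each row is normalized separately, the entry for (i, j) can differ from the entry for (j, i), which fits the directed-edge modeling described in this application.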

The calculation unit 103 calculates the semantic similarity between every two clauses in the plurality of clauses.

In this embodiment, the calculation unit 103 calculating the semantic similarity between every two clauses in the plurality of clauses includes:

$\mathrm{mat}_s(S_i, S_j) = \mathrm{cosine\_similarity}(s_{i\text{-embed}}, s_{j\text{-embed}})$
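As a hedged illustration of the formula above, the following sketch computes the pairwise cosine similarity of clause embeddings. The embeddings are assumed to be plain lists of floats produced by some sentence-vectorization step that is not shown; the function names are hypothetical, and returning 0.0 for a zero vector is a choice made for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def semantic_similarity_matrix(embeddings):
    """Entry (i, j) is the cosine similarity of embeddings i and j."""
    n = len(embeddings)
    return [
        [cosine_similarity(embeddings[i], embeddings[j]) for j in range(n)]
        for i in range(n)
    ]
```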

The calculation unit 103 calculates the position similarity between every two clauses in the plurality of clauses.

In this embodiment, the calculation unit 103 calculating the position similarity between every two clauses in the plurality of clauses includes:

The fusion unit 104 performs fusion processing on the mutual recommendation degree between each two clauses, the semantic similarity between each two clauses, and the position similarity between each two clauses to obtain a graph adjacency matrix.

$\mathrm{mat}_{adjc} = (\alpha\,\mathrm{mat}_t + \beta\,\mathrm{mat}_s) \odot \mathrm{mat}_o$
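The fusion formula can be sketched as follows, reading ⊙ as the element-wise (Hadamard) product and using the constraints α>0, β>0 and α+β=1 given in the claims. The function name `fuse` and the default α of 0.5 are assumptions made only for this example.

```python
def fuse(mat_t, mat_s, mat_o, alpha=0.5):
    """Return (alpha * mat_t + beta * mat_s) multiplied element-wise by mat_o."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    beta = 1.0 - alpha  # enforces alpha + beta = 1
    n = len(mat_t)
    return [
        [(alpha * mat_t[i][j] + beta * mat_s[i][j]) * mat_o[i][j] for j in range(n)]
        for i in range(n)
    ]
```

If the position matrix is asymmetric, the fused adjacency matrix is asymmetric as well, so each pair of clauses is connected by two directed edges with possibly different weights.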

The calculation unit 103 inputs the graph adjacency matrix to the TextRank algorithm to calculate the importance of each clause.
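A minimal sketch of running TextRank on a weighted, directed adjacency matrix is given below. It uses the standard iterative TextRank update; the damping factor of 0.85, the iteration limits, and the convention that `adj[j][i]` is the weight of the edge from clause j to clause i are assumptions of this sketch, not values taken from the application.

```python
def textrank(adj, damping=0.85, max_iter=100, tol=1e-6):
    """Iteratively score the nodes of a weighted directed graph.

    adj[j][i] is assumed to be the weight of the edge from node j to node i.
    """
    n = len(adj)
    out_sums = [sum(row) for row in adj]  # total outgoing weight per node
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each node j passes on its score in proportion to the edge weight j -> i.
            incoming = sum(
                adj[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1.0 - damping) + damping * incoming)
        delta = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores
```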

The screening unit 105 screens according to the importance of each clause to obtain candidate clauses.

In at least one embodiment of the present application, the screening unit 105 screening according to the importance of each clause to obtain candidate clauses includes:

Get the preset threshold;

Wherein, the preset threshold can be customized, such as 95%.

In at least one embodiment of the present application, the screening unit 105 screening according to the importance of each clause to obtain candidate clauses further includes:

Sort the importance of each clause in descending order;

Get the preset position;

Wherein, the preset positions can be customized, such as 20 positions.
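The position-based screening can be sketched as below: clauses are ranked by importance in descending order and those in the first preset positions are kept. The function name `select_top_clauses` is hypothetical, and returning the kept clauses in their original document order is an assumption of this sketch.

```python
def select_top_clauses(clauses, scores, preset_positions=20):
    """Keep the clauses ranked within the first preset_positions by importance."""
    # Indices sorted by importance, highest first.
    ranked = sorted(range(len(clauses)), key=lambda i: scores[i], reverse=True)
    # Restore document order for readability (an assumption of this sketch).
    chosen = sorted(ranked[:preset_positions])
    return [clauses[i] for i in chosen]
```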

The post-processing unit 106 performs post-processing on the candidate clauses to obtain a summary sentence.

Specifically, the post-processing unit 106 performs post-processing on the candidate clauses to obtain a summary sentence including:

identifying the type of each of the alternative clauses;
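The two post-processing rules recited in the claims (adding the clause that follows an interrogative candidate, and adding the clause that contains the partner of an associated phrase) can be sketched as follows. Detecting an interrogative sentence by a trailing question mark, matching phrases by substring, and adding every clause that contains the partner phrase are simplifying assumptions, and the function name `post_process` is hypothetical.

```python
def post_process(clauses, candidate_indices, associated_phrases=()):
    """Extend the candidate clauses and return the summary in document order.

    associated_phrases: pairs of phrases that belong together,
    e.g. ("On the one hand", "On the other hand").
    """
    selected = set(candidate_indices)
    for i in candidate_indices:
        # Rule 1: an interrogative candidate pulls in the next adjacent clause.
        if clauses[i].rstrip().endswith(("?", "？")) and i + 1 < len(clauses):
            selected.add(i + 1)
        # Rule 2: one half of an associated phrase pulls in clauses with the other half.
        for first, second in associated_phrases:
            for present, partner in ((first, second), (second, first)):
                if present in clauses[i]:
                    for j, other in enumerate(clauses):
                        if j != i and partner in other:
                            selected.add(j)
    return [clauses[i] for i in sorted(selected)]
```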

The above-mentioned apparatus for generating a text summary can be realized in the form of a computer program, and the computer program can be run on the computer device as shown in FIG. 3 .

Please refer to FIG. 3 . FIG. 3 is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.

Referring to FIG. 3, the computer device 500 includes a processor 502, a memory and a network interface 505 connected through a system bus 501, wherein the memory may include a storage medium 503 and an internal memory 504.

The storage medium 503 can store an operating system 5031 and a computer program 5032 . When the computer program 5032 is executed, it can cause the processor 502 to execute the method for generating a text summary.

The processor 502 is used to provide calculation and control capabilities and support the operation of the entire computer device 500 .

The internal memory 504 provides an environment for the running of the computer program 5032 in the storage medium 503. When the computer program 5032 is executed by the processor 502, the processor 502 can execute the method for generating a text summary.

The network interface 505 is used for network communication, such as providing data transmission and the like. Those skilled in the art can understand that the structure shown in FIG. 3 is only a block diagram of a partial structure related to the solution of this application, and does not constitute a limitation to the computer device 500 on which the solution of this application is applied. The specific computer device 500 may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.

Wherein, the processor 502 is configured to run a computer program 5032 stored in the memory, so as to implement the method for generating a text abstract disclosed in the embodiment of the present application.

Those skilled in the art can understand that the embodiment of the computer device shown in FIG. 3 does not constitute a limitation on the specific composition of the computer device. In other embodiments, the computer device may include more or fewer components than those shown, combine certain components, or arrange the components differently. For example, in some embodiments, the computer device may include only a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in FIG. 3, and will not be repeated here.

It should be understood that in the embodiment of the present application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

In another embodiment of the present application, a computer-readable storage medium is provided. The computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores a computer program, wherein when the computer program is executed by a processor, the method for generating a text abstract disclosed in the embodiment of the present application is implemented.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the above-described equipment, devices and units can refer to the corresponding processes in the foregoing method embodiments, and details are not repeated here. Those of ordinary skill in the art can appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. In order to clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are implemented by hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.

In the several embodiments provided in this application, it should be understood that the disclosed equipment, devices and methods can be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods: units with the same function may be combined into one unit, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present application.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk or an optical disk.

The above is only a specific embodiment of the application, but the scope of protection of the application is not limited thereto. Any person familiar with the technical field can easily think of various equivalent modifications or replacements within the technical scope disclosed in the application, and these modifications or replacements shall be covered within the scope of protection of this application. Therefore, the protection scope of the present application should be based on the protection scope of the claims.

Claims

A text summary generation method, comprising:

Responding to a text summary generation instruction, acquiring data to be processed according to the text summary generation instruction;

Obtaining a dictionary according to the task scene to perform segmentation processing on the data to be processed to obtain multiple clauses;

calculating the degree of mutual recommendation between every two clauses in the plurality of clauses;

calculating the semantic similarity between every two clauses in the plurality of clauses;

calculating the positional similarity between every two clauses in the multiple clauses;

The mutual recommendation degree between each two clauses, the semantic similarity between each two clauses, and the positional similarity between each two clauses are fused to obtain a graph adjacency matrix;

Inputting the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause;

Filter according to the importance of each clause to obtain alternative clauses;

Perform post-processing on the candidate clauses to obtain a summary sentence.
The method for generating a text abstract according to claim 1, wherein obtaining the dictionary according to the task scene to perform segmentation processing on the data to be processed to obtain multiple clauses comprises:

Identify the current task scenario;

Retrieving a dictionary matching the current task scene as a target dictionary;

Segmenting the data to be processed according to the target dictionary to obtain the plurality of clauses.
The method for generating a text abstract according to claim 1, wherein said calculating the degree of mutual recommendation between every two clauses in said plurality of clauses comprises:

configuring the word weight of each word in the plurality of clauses according to the received configuration requirements;

For the multiple clauses, obtain the words that appear simultaneously in every two clauses as the target word;

Determine the word weight and part of speech of the target word;

Calculate the similarity between the texts of every two clauses according to the word weight and the part of speech of the target word, to obtain a recommendation degree matrix;

L2 regularization is performed on the recommendation degree matrix to obtain the mutual recommendation degree between every two clauses.
The text summary generating method according to claim 1, wherein said calculating the semantic similarity between every two clauses in said plurality of clauses comprises:

Vectorize each clause to obtain the embedded vector representation of each clause;

Calculate the cosine similarity between each two clauses according to the embedding vector representation of each clause;

The cosine similarity between every two clauses is determined as the semantic similarity between every two clauses.
The text summary generating method according to claim 1, wherein said calculating the positional similarity between every two clauses in said plurality of clauses comprises:

Determining every two clauses as a group of sentences, wherein the two clauses in each group of sentences are a recommending sentence and a recommended sentence;

When any clause is the recommended sentence, determine the position of the recommended sentence in the corresponding paragraph, and determine that the corresponding matrix cell value is the first value;

When the arbitrary clause is the recommended sentence, determine the position of the recommended sentence in the corresponding paragraph, and when the recommended sentence is arranged at the front preset position or the rear preset position in the corresponding paragraph, determine that the corresponding matrix cell value is the second value;

When the recommending sentence and the recommended sentence in any group of sentences are both arranged at the front preset position or the rear preset position in the corresponding paragraph, determine that the corresponding matrix cell value is the third value;

When the recommending sentence and the recommended sentence in any group of sentences are not arranged at the front preset position or the rear preset position in the corresponding paragraph, determine that the corresponding matrix cell value is the fourth value;

When the arbitrary clause is the recommended sentence and has the specified attribute, determine that the corresponding matrix cell value is the first value;

Matrix transformation is performed according to the matrix cell values to obtain the positional similarity between every two clauses.
The text abstract generation method according to claim 1, wherein the mutual recommendation degree between each two clauses, the semantic similarity between each two clauses and the positional similarity between each two clauses are fused using the following formula to obtain the graph adjacency matrix:

$\mathrm{mat}_{adjc} = (\alpha\,\mathrm{mat}_t + \beta\,\mathrm{mat}_s) \odot \mathrm{mat}_o$

wherein $\mathrm{mat}_{adjc}$ represents the graph adjacency matrix, $\mathrm{mat}_t$ represents the mutual recommendation degree between each two clauses, $\mathrm{mat}_s$ represents the semantic similarity between each two clauses, $\mathrm{mat}_o$ represents the positional similarity between each two clauses, α represents the weight of the mutual recommendation degree, β represents the weight of the semantic similarity, α>0, β>0, and α+β=1.
The method for generating a text abstract according to claim 1, wherein said post-processing said candidate clauses to obtain an abstract sentence comprises:

identifying the type of each of the alternative clauses;

When the type of the target clause in the alternative clauses is an interrogative sentence, the next clause adjacent to the target clause is obtained, and the obtained clause is added to the summary sentence;

When one of the constituent words of a specified associated phrase is detected in a candidate clause, the clause containing the word associated with that constituent word is obtained, and the obtained clause is added to the summary sentence.
A text summary generation device, comprising:

An acquisition unit, configured to respond to a text summary generation instruction, and obtain data to be processed according to the text summary generation instruction;

A segmentation unit, configured to segment the data to be processed using a dictionary acquired according to the task scene, to obtain multiple clauses;

a calculation unit, configured to calculate the mutual recommendation degree between every two clauses in the plurality of clauses;

The calculation unit is also used to calculate the semantic similarity between every two clauses in the plurality of clauses;

The calculation unit is also used to calculate the positional similarity between every two clauses in the plurality of clauses;

The fusion unit is used to fuse the mutual recommendation degree between each two clauses, the semantic similarity between each two clauses, and the positional similarity between each two clauses to obtain a graph adjacency matrix;

The calculation unit is also used to input the graph adjacency matrix to the TextRank algorithm to calculate the importance of each clause;

A screening unit is used to screen according to the importance of each clause to obtain alternative clauses;

A post-processing unit, configured to post-process the candidate clauses to obtain a summary sentence.
A computer device, comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor implements the following steps when executing the computer program:

Responding to a text summary generation instruction, acquiring data to be processed according to the text summary generation instruction;

Obtaining a dictionary according to the task scene to perform segmentation processing on the data to be processed to obtain multiple clauses;

calculating the degree of mutual recommendation between every two clauses in the plurality of clauses;

calculating the semantic similarity between every two clauses in the plurality of clauses;

calculating the positional similarity between every two clauses in the multiple clauses;

The mutual recommendation degree between each two clauses, the semantic similarity between each two clauses, and the positional similarity between each two clauses are fused to obtain a graph adjacency matrix;

Inputting the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause;

Filter according to the importance of each clause to obtain alternative clauses;

Perform post-processing on the candidate clauses to obtain a summary sentence.
The computer device according to claim 9, wherein obtaining the dictionary according to the task scene to perform segmentation processing on the data to be processed to obtain multiple clauses comprises:

Identify the current task scenario;

Retrieving a dictionary matching the current task scene as a target dictionary;

Segmenting the data to be processed according to the target dictionary to obtain the plurality of clauses.
The computer device according to claim 9, wherein said calculating the degree of mutual recommendation between every two clauses in said plurality of clauses comprises:

configuring the word weight of each word in the plurality of clauses according to the received configuration requirements;

For the multiple clauses, obtain the words that appear simultaneously in every two clauses as the target word;

Determine the word weight and part of speech of the target word;

Calculate the similarity between the texts of every two clauses according to the word weight and the part of speech of the target word, to obtain a recommendation degree matrix;

L2 regularization is performed on the recommendation degree matrix to obtain the mutual recommendation degree between every two clauses.
The computer device according to claim 9, wherein said calculating the semantic similarity between every two clauses in said plurality of clauses comprises:

Vectorize each clause to obtain the embedded vector representation of each clause;

Calculate the cosine similarity between each two clauses according to the embedding vector representation of each clause;

The cosine similarity between every two clauses is determined as the semantic similarity between every two clauses.
The computer device according to claim 9, wherein said calculating the positional similarity between every two clauses in said plurality of clauses comprises:

Determining every two clauses as a group of sentences, wherein the two clauses in each group of sentences are a recommending sentence and a recommended sentence;

When any clause is the recommended sentence, determine the position of the recommended sentence in the corresponding paragraph, and determine that the corresponding matrix cell value is the first value;

When the arbitrary clause is the recommended sentence, determine the position of the recommended sentence in the corresponding paragraph, and when the recommended sentence is arranged at the front preset position or the rear preset position in the corresponding paragraph, determine that the corresponding matrix cell value is the second value;

When the recommending sentence and the recommended sentence in any group of sentences are both arranged at the front preset position or the rear preset position in the corresponding paragraph, determine that the corresponding matrix cell value is the third value;

When the recommending sentence and the recommended sentence in any group of sentences are not arranged at the front preset position or the rear preset position in the corresponding paragraph, determine that the corresponding matrix cell value is the fourth value;

When the arbitrary clause is the recommended sentence and has the specified attribute, determine that the corresponding matrix cell value is the first value;

Matrix transformation is performed according to the matrix cell values to obtain the positional similarity between every two clauses.
The computer device as claimed in claim 9, wherein the mutual recommendation degree between each two clauses, the semantic similarity between each two clauses and the positional similarity between each two clauses are fused using the following formula to obtain the graph adjacency matrix:

$\mathrm{mat}_{adjc} = (\alpha\,\mathrm{mat}_t + \beta\,\mathrm{mat}_s) \odot \mathrm{mat}_o$

wherein $\mathrm{mat}_{adjc}$ represents the graph adjacency matrix, $\mathrm{mat}_t$ represents the mutual recommendation degree between each two clauses, $\mathrm{mat}_s$ represents the semantic similarity between each two clauses, $\mathrm{mat}_o$ represents the positional similarity between each two clauses, α represents the weight of the mutual recommendation degree, β represents the weight of the semantic similarity, α>0, β>0, and α+β=1.
The computer device according to claim 9, wherein said performing post-processing on said candidate clauses to obtain a summary sentence comprises:

identifying the type of each of the alternative clauses;

When the type of the target clause in the alternative clauses is an interrogative sentence, the next clause adjacent to the target clause is obtained, and the obtained clause is added to the summary sentence;

When one of the constituent words of a specified associated phrase is detected in a candidate clause, the clause containing the word associated with that constituent word is obtained, and the obtained clause is added to the summary sentence.
A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to:

Responding to a text summary generation instruction, acquiring data to be processed according to the text summary generation instruction;

Obtaining a dictionary according to the task scene to perform segmentation processing on the data to be processed to obtain multiple clauses;

calculating the degree of mutual recommendation between every two clauses in the plurality of clauses;

calculating the semantic similarity between every two clauses in the plurality of clauses;

calculating the positional similarity between every two clauses in the multiple clauses;

The mutual recommendation degree between each two clauses, the semantic similarity between each two clauses, and the positional similarity between each two clauses are fused to obtain a graph adjacency matrix;

Inputting the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause;

Filter according to the importance of each clause to obtain alternative clauses;

Perform post-processing on the candidate clauses to obtain a summary sentence.
The computer-readable storage medium as claimed in claim 16, wherein obtaining the dictionary according to the task scene to perform segmentation processing on the data to be processed to obtain multiple clauses comprises:

Identify the current task scenario;

Retrieving a dictionary matching the current task scene as a target dictionary;

Segmenting the data to be processed according to the target dictionary to obtain the plurality of clauses.
The computer-readable storage medium according to claim 16, wherein said calculating the degree of mutual recommendation between every two clauses in said plurality of clauses comprises:

configuring the word weight of each word in the plurality of clauses according to the received configuration requirements;

For the multiple clauses, obtain the words that appear simultaneously in every two clauses as the target word;

Determine the word weight and part of speech of the target word;

Calculate the similarity between the texts of every two clauses according to the word weight and the part of speech of the target word, to obtain a recommendation degree matrix;

L2 regularization is performed on the recommendation degree matrix to obtain the mutual recommendation degree between every two clauses.
The computer-readable storage medium according to claim 16, wherein said calculating the semantic similarity between every two clauses in said plurality of clauses comprises:

Vectorize each clause to obtain the embedded vector representation of each clause;

Calculate the cosine similarity between each two clauses according to the embedding vector representation of each clause;

The cosine similarity between every two clauses is determined as the semantic similarity between every two clauses.
The computer-readable storage medium according to claim 16, wherein said calculating the positional similarity between every two clauses in said plurality of clauses comprises:

Determining every two clauses as a group of sentences, wherein the two clauses in each group of sentences are a recommending sentence and a recommended sentence;

When any clause is the recommended sentence, determine the position of the recommended sentence in the corresponding paragraph, and determine that the corresponding matrix cell value is the first value;

When the arbitrary clause is the recommended sentence, determine the position of the recommended sentence in the corresponding paragraph, and when the recommended sentence is arranged at the front preset position or the rear preset position in the corresponding paragraph, determine that the corresponding matrix cell value is the second value;

When the recommending sentence and the recommended sentence in any group of sentences are both arranged at the front preset position or the rear preset position in the corresponding paragraph, determine that the corresponding matrix cell value is the third value;

When the recommending sentence and the recommended sentence in any group of sentences are not arranged at the front preset position or the rear preset position in the corresponding paragraph, determine that the corresponding matrix cell value is the fourth value;

When the arbitrary clause is the recommended sentence and has the specified attribute, determine that the corresponding matrix cell value is the first value;

Matrix transformation is performed according to the matrix cell values to obtain the positional similarity between every two clauses.