CN114036927A - Text abstract extraction method and device, computer equipment and storage medium - Google Patents

Text abstract extraction method and device, computer equipment and storage medium

Info

Publication number
CN114036927A
Authority
CN
China
Prior art keywords: data, text, matched, sentence, value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111359470.3A
Other languages
Chinese (zh)
Inventor
张剑
陈青青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan University of Technology
Original Assignee
Dongguan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongguan University of Technology filed Critical Dongguan University of Technology
Priority to CN202111359470.3A
Publication of CN114036927A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a text abstract extraction method, which is applied in the field of data processing and is used to improve the accuracy of text abstract extraction. The method comprises the following steps: acquiring and parsing text data to obtain a text sentence set; acquiring all selection results of selecting K candidate sentences from the text sentence set to obtain data to be matched; sequentially selecting data to be matched and calculating its matching degree with the text data to obtain a matching value; deleting the data to be matched whose matching value is lower than a preset threshold, and updating the text sentence set; updating the K value, and comparing the updated K value with a preset abstract sentence value to obtain a comparison result; when the comparison result is that the K value is smaller than the preset abstract sentence value, returning to the step of acquiring all selection results of selecting K candidate sentences from the text sentence set and continuing execution; and when the comparison result is that the K value is not smaller than the preset abstract sentence value, taking the data to be matched with the maximum matching value as the abstract corresponding to the text data.

Description

Text abstract extraction method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for extracting a text abstract, a computer device, and a storage medium.
Background
A text abstract is a lossy semantic compression of a text based on background knowledge, and its core problem is how to locate and screen out the important information in the text. A text abstract can be presented directly for the user to read: it does not impair comprehension, and it reduces reading time and improves reading efficiency. The text abstract can also assist other downstream tasks in the natural language processing field, such as long-text sentiment analysis, search engines and recommendation systems.
Existing text abstract extraction schemes select abstract sentences by setting a similarity threshold and applying a greedy algorithm: each abstract sentence is segmented with a tokenizer and the sentences are then vectorized. However, this approach has several problems. It is difficult to justify the chosen threshold, and ambiguous words may appear during word segmentation, which introduces errors into the segmentation result. In addition, the greedy algorithm guarantees that the sentences ranked highest in the previous round are always retained, ignoring the fact that the current optimum is not necessarily the global optimum. These problems lead to low accuracy of text abstract extraction.
Therefore, the existing methods suffer from low accuracy of text abstract extraction.
Disclosure of Invention
The embodiment of the invention provides a method and a device for extracting a text abstract, a computer device and a storage medium, which are used for improving the accuracy of extracting the text abstract.
A method for extracting text summaries comprises the following steps:
acquiring text data, and performing text data analysis on the text data to obtain a text sentence set corresponding to the text data;
acquiring all selection results of K candidate sentences selected from the text sentence set, and adding the selection results meeting preset conditions into a set to be matched as data to be matched;
based on an abstract matching model, sequentially selecting one piece of data to be matched from the set to be matched according to a preset selection sequence and calculating its matching degree with the text data, so as to obtain a matching value corresponding to the data to be matched;
deleting the data to be matched with the matching value lower than a preset threshold value, and updating the text sentence set according to the deleted data to be matched;
updating the K value in the K candidate sentences based on a preset updating mode, and comparing the updated K value with a preset abstract sentence value to obtain a comparison result;
when the comparison result is that the K value is smaller than the preset abstract sentence value, returning to the step of obtaining all selection results of selecting K candidate sentences from the text sentence set, taking the selection results meeting preset conditions as data to be matched, and adding the selection results into the set to be matched for continuous execution;
and when the comparison result shows that the K value is not smaller than the preset abstract sentence value, taking the data to be matched with the maximum matching value as the abstract corresponding to the text data.
An apparatus for extracting a text abstract, comprising:
the text sentence set acquisition module is used for acquiring text data and analyzing the text data to obtain a text sentence set corresponding to the text data;
a to-be-matched set acquisition module, configured to acquire all selection results of K candidate sentences selected from the text sentence set, and add the selection results meeting preset conditions as to-be-matched data to the to-be-matched set;
the matching value calculation module is used for, based on the abstract matching model, sequentially selecting one piece of data to be matched from the set to be matched according to a preset selection sequence and calculating its matching degree with the text data, so as to obtain a matching value corresponding to the data to be matched;
the updating module is used for deleting the data to be matched, of which the matching value is lower than a preset threshold value, and updating the text sentence set according to the deleted data to be matched;
the comparison module is used for updating the K value in the K candidate sentences based on a preset updating mode and comparing the updated K value with a preset abstract sentence value to obtain a comparison result;
the circulation module is used for returning to the step of obtaining all the selection results of selecting the K candidate sentences from the text sentence set when the comparison result indicates that the K value is smaller than the preset abstract sentence value, and adding the selection results meeting preset conditions into the set to be matched as data to be matched for continuous execution;
and the abstract acquisition module is used for taking the data to be matched with the maximum matching value as the abstract corresponding to the text data when the comparison result shows that the K value is not less than the preset abstract sentence value.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above text summarization extraction method when executing the computer program.
A computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the steps of the above-described method for extracting a text excerpt.
According to the method and device for extracting the text abstract, the computer equipment and the storage medium, text data is acquired and parsed to obtain the text sentence set corresponding to the text data. All selection results of selecting K candidate sentences from the text sentence set are acquired, and the selection results meeting the preset conditions are added to the set to be matched as data to be matched. Based on the abstract matching model, one piece of data to be matched is sequentially selected from the set to be matched according to the preset selection sequence and its matching degree with the text data is calculated, so as to obtain the matching value corresponding to the data to be matched. The data to be matched whose matching value is lower than the preset threshold is deleted, and the text sentence set is updated according to the data to be matched after deletion. The K value in the K candidate sentences is updated based on the preset updating mode, and the updated K value is compared with the preset abstract sentence value to obtain a comparison result. When the comparison result is that the K value is smaller than the preset abstract sentence value, the method returns to the step of acquiring all selection results of selecting K candidate sentences from the text sentence set, taking the selection results meeting the preset conditions as data to be matched and adding them to the set to be matched, and continues execution. When the comparison result is that the K value is not smaller than the preset abstract sentence value, the data to be matched with the maximum matching value is taken as the abstract corresponding to the text data. Through these steps, the candidate sentences selected as the abstract are the best results that can be obtained from the sentences currently available, so the accuracy of text abstract extraction is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of a method for extracting a text abstract according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for extracting text excerpts according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an apparatus for extracting a text abstract according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The text abstract extraction method provided by the present application can be applied to the application environment shown in fig. 1, in which a computer device communicates with a server through a network. The computer device may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
In an embodiment, as shown in fig. 2, a method for extracting a text abstract is provided, which is described by taking the method applied to the server in fig. 1 as an example, and includes the following steps S101 to S107:
s101, acquiring text data, and analyzing the text data to obtain a text sentence set corresponding to the text data.
In step S101, the methods of obtaining the text data include, but are not limited to, being given a specific article or crawling the web.
The implementation methods of the text data analysis include, but are not limited to, word analysis and sentence analysis. Word analysis refers to an analysis method that extracts the words in the text data, and sentence analysis refers to an analysis method that extracts the sentences in the text data.
Preferably, sentence analysis is employed here.
The text data is parsed with sentence analysis to obtain the text sentence set corresponding to the text data, so that each sentence in the set can be analyzed subsequently and the abstract corresponding to the text data can be extracted, which improves the accuracy of text abstract extraction.
S102, obtaining all selection results of K candidate sentences selected from the text sentence set, taking the selection results meeting preset conditions as data to be matched, and adding the data to be matched into the set to be matched.
In step S102, the K candidate sentences are candidate sentences extracted from the text sentence set to serve as the abstract.
It should be noted that the K value refers to the number of candidate sentences. The K value ranges from 1 to the total number of sentences in the text sentence set; for example, if the text sentence set contains a sentences, the K value is a positive integer between 1 and a.
The selection results refer to all possible ways of selecting K candidate sentences from the text sentence set.
For example, when the text sentence set contains 10 sentences and K is 1, a selection result is the choice of 1 of the 10 sentences as the candidate sentence. Since 1 sentence is chosen out of 10, there are 10 possible cases, and these 10 cases are all the selection results of selecting 1 candidate sentence from a set of 10 text sentences.
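As an illustration of how all selection results can be enumerated (a sketch for illustration only; the embodiment does not prescribe a particular implementation), Python's itertools.combinations lists every way of choosing K candidate sentences from the text sentence set:

```python
from itertools import combinations

def all_selection_results(text_sentences, k):
    """Return every way of choosing k candidate sentences from the text sentence set."""
    # Each selection result is a tuple of k sentences taken from the set.
    return list(combinations(text_sentences, k))

# Example: a set of 10 sentences and K = 1 yields 10 selection results.
sentences = [f"sentence {i}" for i in range(1, 11)]
print(len(all_selection_results(sentences, 1)))  # 10
print(len(all_selection_results(sentences, 2)))  # 45
```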
The preset condition is a condition for screening a selection result which can be used for calculating the matching degree. The preset conditions include, but are not limited to, redundant sentence elimination and relevancy elimination. The redundant sentence elimination method refers to a method for screening a selection result by analyzing whether redundancy exists between candidate sentences in the selection result. The relevance elimination method refers to a method for screening the selection result by analyzing the relevance between the candidate sentences in the selection result.
By acquiring all the selection results and adding the selection results meeting the preset conditions as the data to be matched into the set to be matched, corresponding analysis can be performed on all candidate sentences in each selection result in the follow-up process, so that the abstracts corresponding to the text data are extracted, and the accuracy of extracting the text abstracts is improved.
S103, based on the abstract matching model, sequentially selecting one piece of data to be matched from the set to be matched according to a preset selection sequence and calculating its matching degree with the text data, so as to obtain a matching value corresponding to the data to be matched.
In step S103, the abstract matching model is a model used to calculate the matching degree between the data to be matched and the text data.
The implementation method of the abstract matching model comprises but is not limited to an abstract matching model based on a Word2Vec algorithm and an abstract matching model based on Doc2Vec.
Preferably, a Doc2Vec based abstract matching model is employed here. It should be understood that Doc2Vec is an unsupervised method similar in principle to Word2Vec; its advantages include not requiring a fixed sentence length and accepting sentences of different lengths as training samples.
Implementation methods of the Doc2Vec based abstract matching model include, but are not limited to, using the third-party Python library gensim. The method comprises the following steps. Given n input texts and a label corresponding to each input text, the title can be used as the label, and each label is unique. The input texts are recorded as doc1, doc2, …, docn, and the corresponding labels as docLabel1, docLabel2, …, docLabeln. The input content is split into individual characters, and stop words are removed. The input data is then adjusted to the input sample format required by Doc2Vec: because the input required by the Doc2Vec model in gensim has a fixed format (text, text label), the input text needs to be wrapped with the TaggedDocument class of gensim's Doc2Vec module. The Doc2Vec model is loaded and training begins. After training is completed, the trained model is stored and recorded as M0. M0 may be used to predict the vector of new text, here by calling the infer_vector() method of the Doc2Vec model in gensim.
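A minimal sketch of the gensim-based training and inference described above is given below; the example corpus, vector size and epoch count are illustrative assumptions rather than values fixed by this embodiment, and stop-word removal is omitted for brevity.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Character-level inputs (no word segmentation), one unique label per text.
texts = ["今天天气很好", "文本摘要抽取方法"]   # doc1, doc2, ...
labels = ["docLabel1", "docLabel2"]

# Wrap each input text with TaggedDocument: (list of characters, [label]).
corpus = [TaggedDocument(words=list(t), tags=[lab]) for t, lab in zip(texts, labels)]

# Build and train the Doc2Vec model (hyperparameters are illustrative).
model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Store the trained model as M0 and use it to infer the vector of new text.
model.save("M0.model")
m0 = Doc2Vec.load("M0.model")
new_vector = m0.infer_vector(list("待匹配的数据"))
print(new_vector.shape)  # (100,)
```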
The preset selection sequence includes a forward-order selection sequence over the set to be matched, a reverse-order selection sequence over the set to be matched, and a random selection sequence over the set to be matched.
The matching degree calculation is a method of calculating the degree of match between the vector corresponding to the data to be matched and the vector corresponding to the text data. The matching degree calculation is realized by, but not limited to, cosine similarity and Euclidean distance.
By sequentially selecting one piece of data to be matched from the set to be matched and calculating its matching degree with the text data through the abstract matching model, a matching value corresponding to the data to be matched is obtained; calculating the matching value between the data to be matched and the text data effectively improves the accuracy of text abstract extraction.
And S104, deleting the data to be matched with the matching value lower than the preset threshold value, and updating the text sentence set according to the deleted data to be matched.
In step S104, the preset threshold is obtained by methods including, but not limited to, taking an empirical value or deriving it from an elimination rate. Deriving it from an elimination rate means that the threshold is calculated from a preset elimination rate, where the elimination rate is the proportion of lowest-ranked sentences that are removed after each matching degree calculation. For example, when the elimination rate is T, the sentences ranked in the bottom T% are deleted after each matching degree calculation.
It should be understood that the elimination rate may be dynamic or static. A dynamic elimination rate means that, after the sentences are ranked by matching degree, the rate can be increased when many sentences have similar matching degrees and reduced when the matching degrees differ greatly.
Preferably, the elimination rate is used here to obtain the preset threshold.
The updating refers to sorting the sentences in the text sentence set according to the matching values and, based on the elimination rate, removing the sentences ranked in the bottom proportion to obtain a new text sentence set.
The data to be matched with the matching value lower than the preset threshold value is deleted, and the text sentence set is updated according to the deleted data to be matched, so that the accuracy rate of extracting the text abstract is effectively improved.
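As an illustrative sketch of the culling step (an assumed implementation; the embodiment only specifies ranking by matching value and applying an elimination rate), the data to be matched can be sorted and the bottom fraction discarded as follows:

```python
def cull_candidates(matched, elimination_rate):
    """matched: list of (data_to_be_matched, matching_value) pairs.
    Keep the highest-ranked items and drop the bottom elimination_rate fraction."""
    ranked = sorted(matched, key=lambda pair: pair[1], reverse=True)
    keep_count = max(1, int(len(ranked) * (1.0 - elimination_rate)))
    kept = ranked[:keep_count]
    # The updated text sentence set keeps only sentences that survive culling.
    updated_sentences = {s for data, _ in kept for s in data}
    return kept, updated_sentences

kept, updated = cull_candidates([(("s1",), 0.91), (("s2",), 0.42), (("s3",), 0.77)], 0.3)
print([d for d, _ in kept])  # the lowest-ranked result ("s2",) is removed
```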
And S105, updating the K value in the K candidate sentence based on a preset updating mode, and comparing the updated K value with a preset abstract sentence value to obtain a comparison result.
In step S105, the preset update method is a method of updating the K value.
For example, with the update method K = K + 1, the K value is increased by 1 in each cycle. It should be understood that the preset updating mode is not specifically limited here.
Preferably, the embodiment of the present invention updates the K value by using K = K + 1.
The preset abstract sentence value refers to the number of sentences of the preset text abstract.
The K value is updated and compared with the preset abstract sentence value to obtain a comparison result, and corresponding processing is carried out according to the comparison result, so that the accuracy of text abstract extraction is improved.
S106, when the comparison result is that the K value is smaller than the preset abstract sentence value, returning to obtain all the selection results of the K candidate sentences selected from the text sentence set, taking the selection results meeting the preset conditions as the data to be matched, and adding the data to be matched into the set to be matched to continue executing.
In step S106, when the comparison result indicates that the K value is smaller than the preset abstract sentence value, it indicates that the data to be matched at this time does not satisfy the algorithm end condition.
By returning to the step of selecting the candidate sentences, the text data is subjected to iterative processing, so that the accuracy of text abstract extraction is improved.
And S107, when the comparison result is that the K value is not less than the preset abstract sentence value, taking the data to be matched with the maximum matching value as the abstract corresponding to the text data.
In step S107, when the comparison result indicates that the K value is not less than the preset abstract sentence value, it indicates that the data to be matched here meets the algorithm end condition.
The data to be matched with the maximum matching value is used as the abstract corresponding to the text data, so that the accuracy of text abstract extraction is improved.
According to the text abstract extraction method, text data is acquired and parsed to obtain the text sentence set corresponding to the text data. All selection results of selecting K candidate sentences from the text sentence set are acquired, and the selection results meeting the preset conditions are added to the set to be matched as data to be matched. Based on the abstract matching model, one piece of data to be matched is sequentially selected from the set to be matched according to the preset selection sequence and its matching degree with the text data is calculated, so as to obtain the matching value corresponding to the data to be matched. The data to be matched whose matching value is lower than the preset threshold is deleted, and the text sentence set is updated according to the data to be matched after deletion. The K value in the K candidate sentences is updated based on the preset updating mode, and the updated K value is compared with the preset abstract sentence value to obtain a comparison result. When the comparison result is that the K value is smaller than the preset abstract sentence value, the method returns to the step of acquiring all selection results of selecting K candidate sentences from the text sentence set, taking the selection results meeting the preset conditions as data to be matched and adding them to the set to be matched, and continues execution. When the comparison result is that the K value is not smaller than the preset abstract sentence value, the data to be matched with the maximum matching value is taken as the abstract corresponding to the text data. Through these steps, the candidate sentences selected as the abstract are the best results that can be obtained from the sentences currently available, so the accuracy of text abstract extraction is improved.
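To make the overall loop of steps S101 to S107 concrete, the following sketch ties the steps together; the helper names parse_sentences, passes_preset_conditions and matching_value are hypothetical placeholders for the operations described above, and the termination handling is one possible reading of the comparison step.

```python
from itertools import combinations

def extract_abstract(text_data, summary_sentence_count, threshold,
                     parse_sentences, passes_preset_conditions, matching_value):
    """Iterative abstract extraction following steps S101 to S107 (illustrative sketch)."""
    sentences = parse_sentences(text_data)                                # S101
    k = 1
    while True:
        candidates = [c for c in combinations(sentences, k)
                      if passes_preset_conditions(c)]                     # S102
        scored = [(c, matching_value(c, text_data)) for c in candidates]  # S103
        scored = [(c, v) for c, v in scored if v >= threshold]            # S104: cull
        # Updated text sentence set: sentences that survive culling, in original order.
        surviving = {s for c, _ in scored for s in c}
        sentences = [s for s in sentences if s in surviving]
        k += 1                                                            # S105
        if k >= summary_sentence_count:                                   # S106 / S107
            return max(scored, key=lambda pair: pair[1])[0] if scored else ()
```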
In some optional implementations of the present embodiment, step S101 further includes the following steps S1011 to S1013:
S1011, acquiring text data.
And S1012, analyzing the text data to obtain at least one sentence terminator in the text data.
And S1013, dividing the text data based on all the sentence end signs to obtain a text sentence set corresponding to the text data.
In step S1011, the methods of acquiring text data include, but are not limited to, being given a specific article or crawling the web.
It should be noted here that the acquired text data may contain meaningless characters or redundant punctuation marks that interfere with the text data, so data cleaning with regular expressions or other techniques is required. Meanwhile, the processing may differ for different languages, so the language of the text data must be detected first. For example, language detection can be implemented with a third-party Python library such as langdetect. If the text data is Chinese, word segmentation would normally be performed, but errors are often introduced at the segmentation stage. Therefore, no word segmentation is performed, and the text is processed in units of characters, in the same way as English is processed.
In step S1012, a sentence terminator is a symbol that marks the end of a sentence. For example, in a Chinese sentence the sentence terminators include, but are not limited to, "？", "。" and "！"; in an English sentence the sentence terminators include, but are not limited to, "?" and "!".
In step S1013, the text data is divided based on all the sentence terminators to obtain the text sentence set corresponding to the text data, and the number of sentences in the text sentence set is counted.
The text data is parsed with sentence analysis to obtain the text sentence set corresponding to the text data, so that each sentence in the set can be analyzed subsequently and the abstract corresponding to the text data can be extracted, which improves the accuracy of text abstract extraction.
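A possible implementation of the cleaning and sentence-splitting steps is sketched below; the cleaning pattern and the use of the langdetect library for language detection are illustrative assumptions rather than requirements of this embodiment.

```python
import re
from langdetect import detect  # third-party language detection library

def split_into_sentences(text_data):
    """Clean the text data and split it into a text sentence set on sentence terminators."""
    lang = detect(text_data)  # e.g. 'zh-cn' or 'en'
    # Remove zero-width characters and collapse redundant whitespace (illustrative cleaning).
    text_data = re.sub(r"[\u200b\xa0\s]+", " ", text_data).strip()
    terminators = r"[。！？]" if lang.startswith("zh") else r"[.!?]"
    return [s.strip() for s in re.split(terminators, text_data) if s.strip()]

doc = "文本摘要是一种有损的语义压缩过程。它可以减少阅读时间！如何定位重要信息？"
print(split_into_sentences(doc))       # the three sentences of the example text
print(len(split_into_sentences(doc)))  # 3
```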
In some optional implementations of this embodiment, step S102 further includes step S201 to step S205:
s201, obtaining all selection results of selecting K candidate sentences from the text sentence set, wherein each selection result comprises the K candidate sentences.
S202, for each selection result, sequentially selecting one sentence from the K candidate sentences as the sentence to be judged according to a preset selection sequence.
S203, comparing the sentence to be judged with the remaining K-1 candidate sentences to obtain a comparison result.
S204, if the comparison result indicates that the sentence to be judged coincides with one of the remaining K-1 candidate sentences, deleting the selection result.
S205, if the comparison result indicates that the sentence to be judged does not coincide with any of the remaining K-1 candidate sentences, taking the selection result as data to be matched and adding it to the set to be matched.
In step S201, the K candidate sentences are candidate sentences extracted from the text sentence set to serve as the abstract.
In step S202, the preset selection sequence includes a forward-order selection sequence, a reverse-order selection sequence and a random selection sequence. Preferably, the forward-order selection sequence is adopted here, that is, the sentences are selected in order from the first sentence to the last sentence among the K candidate sentences of each selection result.
The sentence to be judged is the sentence currently selected from the K candidate sentences, which needs to be compared with the other, not yet selected, sentences among the K candidate sentences.
In step S203, the comparison includes redundancy comparison.
It should be understood that the redundancy comparison means comparing the sentence to be judged with each of the remaining K-1 candidate sentences; if a coincident word block exists in both sentences, the comparison result indicates that the sentence to be judged coincides with that candidate sentence.
It should be noted that the presence of a coincident word block may be determined by dividing each sentence into blocks of several words or characters; blocks whose words or characters are the same or similar are regarded as repeated. For example, each sentence may be divided into blocks of five words or characters, and blocks that are identical or similar are considered repeated.
Through the comparison of the sentence to be judged with the remaining K-1 candidate sentences, selection results containing redundancy can be effectively identified and deleted, which reduces the redundancy of the selection results and improves the accuracy of text abstract extraction.
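One way to realize the coincident-block check described above is sketched below, under the assumption that blocks of five consecutive characters are compared for exact equality; the block size and the exact-match criterion are illustrative choices.

```python
def has_coincident_block(sentence_a, sentence_b, block_size=5):
    """Return True if the two sentences share a block of block_size consecutive characters."""
    blocks_a = {sentence_a[i:i + block_size]
                for i in range(len(sentence_a) - block_size + 1)}
    blocks_b = {sentence_b[i:i + block_size]
                for i in range(len(sentence_b) - block_size + 1)}
    return bool(blocks_a & blocks_b)

def selection_is_redundant(selection):
    """A selection result is redundant if any two of its candidate sentences coincide."""
    return any(has_coincident_block(a, b)
               for i, a in enumerate(selection) for b in selection[i + 1:])

print(selection_is_redundant(("今天的天气非常好适合出行", "今天的天气非常好我们去爬山")))  # True
```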
In some optional implementations of this embodiment, step S103 further includes step S301 to step S303:
S301, sequentially selecting one piece of data to be matched from the set to be matched as the data to be processed according to a preset selection sequence.
S302, extracting the features of the data to be processed and the text data based on the abstract matching model to obtain a vector to be processed corresponding to the data to be processed and a text vector corresponding to the text data.
S303, calculating cosine similarity of the vector to be processed and the text vector, and taking the obtained similarity value as a matching value corresponding to the data to be processed.
In step S301, the preset selection sequence includes a forward-order selection sequence, a reverse-order selection sequence and a random selection sequence. Preferably, the forward-order selection sequence is adopted here, that is, the data to be matched are selected from the first to the last in the set to be matched.
In step S303, a matching value corresponding to the data to be processed is calculated according to the following formula.
cos(doc_vec, sum_vec) = ( Σ_{i=1}^{n} doc_vec_i · sum_vec_i ) / ( sqrt( Σ_{i=1}^{n} doc_vec_i² ) · sqrt( Σ_{i=1}^{n} sum_vec_i² ) )
where cos(doc_vec, sum_vec) denotes the cosine similarity between the vector to be processed and the text vector, doc_vec_i denotes the i-th component of the text vector, sum_vec_i denotes the i-th component of the vector to be processed, and i ranges from 1 to n.
In this way, the matching degree between the text data and the data to be matched is obtained through calculation, and the data with the highest matching degree is used as the basis for the subsequent abstract, which improves the accuracy of text abstract extraction.
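For illustration, the cosine similarity above can be computed with NumPy as follows (a sketch; in this scheme the two vectors would come from the Doc2Vec based abstract matching model):

```python
import numpy as np

def cosine_similarity(doc_vec, sum_vec):
    """Cosine similarity between the text vector and the vector to be processed."""
    doc_vec = np.asarray(doc_vec, dtype=float)
    sum_vec = np.asarray(sum_vec, dtype=float)
    return float(np.dot(doc_vec, sum_vec) /
                 (np.linalg.norm(doc_vec) * np.linalg.norm(sum_vec)))

print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 0.5]))  # ≈ 0.949
```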
In some optional implementations of this embodiment, before step S104, the method for extracting a text abstract includes steps S401 to S402:
S401, obtaining a preset elimination rate.
S402, multiplying the preset elimination rate by the number of the selection results to obtain a preset threshold value.
In step S401, the preset elimination rate is the proportion of lowest-ranked sentences that are eliminated after each matching degree calculation.
Obtaining the preset threshold from the preset elimination rate gives the threshold generalization ability, because it adapts to changes in the number of selection results, which improves the accuracy of text abstract extraction.
In some optional implementations of this embodiment, step S107 further includes the following steps S701 to S702:
S701, when the comparison result is that the K value is not less than the preset abstract sentence value, acquiring the data to be matched with the maximum matching value.
And S702, converting the data to be matched based on the text data to obtain the abstract corresponding to the text data.
In step S702, the conversion processing adjusts the order of the data to be matched. It should be understood that, during selection from the text data, the selection results are scored by matching degree and sorted accordingly, so the order of the sentences in the abstract finally obtained for the text data may be disturbed; the matched data therefore needs to be rearranged based on the sentence order of the text data.
Step S702 is specifically as follows: if the text data is time text data, converting the data to be matched based on a time sorting method to obtain the abstract corresponding to the text data;
and if the text data is non-time text data, converting the data to be matched based on an expansion sorting method to obtain the abstract corresponding to the text data.
The time sorting method selects a certain time as a reference node and orders the other absolute times relative to it. For example, if the selected reference time point is 12 pm, the time one hour later is 1 pm, which is ranked after 12 pm.
The expansion sorting method places sentences with closely related content together, so that one topic is described completely before another is described, which avoids a loss of fluency.
By converting the data to be matched, the sequence of the data to be matched can be adjusted according to the sentence sequence of the text data, so that the abstract corresponding to the text data is obtained, and the accuracy rate of extracting the text abstract is improved.
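As a simple sketch of the conversion step (an assumption for illustration; the time sorting and expansion sorting methods described above are more specific), the matched sentences can be rearranged according to their positions in the original text data:

```python
def reorder_by_source(text_sentences, matched_sentences):
    """Rearrange the matched sentences according to their order in the original text."""
    position = {s: i for i, s in enumerate(text_sentences)}
    return sorted(matched_sentences, key=lambda s: position.get(s, len(text_sentences)))

original = ["A introduces the topic.", "B gives details.", "C concludes."]
print(reorder_by_source(original, ["C concludes.", "A introduces the topic."]))
# ['A introduces the topic.', 'C concludes.']
```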
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, an apparatus for extracting a text abstract is provided, where the apparatus for extracting a text abstract corresponds to the method for extracting a text abstract in the above embodiment one to one. As shown in fig. 3, the text abstract extracting apparatus includes a text sentence collection obtaining module 11, a to-be-matched collection obtaining module 12, a matching value calculating module 13, an updating module 14, a comparing module 15, a looping module 16, and an abstract obtaining module 17. The functional modules are explained in detail as follows:
and the text sentence set acquisition module 11 is configured to acquire text data and perform text data analysis on the text data to obtain a text sentence set corresponding to the text data.
And the to-be-matched set acquisition module 12 is configured to acquire all selection results of selecting K candidate sentences from the text sentence set, and add the selection results meeting preset conditions as to-be-matched data to the to-be-matched set.
And the matching value calculating module 13 is configured to, based on the abstract matching model, sequentially select one piece of data to be matched from the set to be matched according to a preset selection sequence and calculate its matching degree with the text data, so as to obtain a matching value corresponding to the data to be matched.
And the updating module 14 is configured to delete the to-be-matched data with the matching value lower than the preset threshold, and update the text sentence set according to the deleted to-be-matched data.
And the comparison module 15 is configured to update the K value in the K candidate sentence based on a preset update mode, and compare the updated K value with a preset abstract sentence value to obtain a comparison result.
And the circulation module 16 is configured to, when the comparison result is that the K value is smaller than the preset abstract sentence value, return to obtain all the selection results of selecting the K candidate sentences from the text sentence set, and add the selection results meeting the preset condition as the data to be matched to the set to be matched for continued execution.
And the abstract acquiring module 17 is configured to, when the comparison result is that the K value is not less than the preset abstract sentence value, take the data to be matched with the largest matching value as the abstract corresponding to the text data.
In one embodiment, the text sentence collection obtaining module 11 further includes:
a text data acquisition unit for acquiring text data.
And the sentence end character acquisition unit is used for analyzing and processing the text data to obtain at least one sentence end character in the text data.
And the text sentence set acquisition unit is used for dividing the text data based on all the sentence end symbols to obtain a text sentence set corresponding to the text data.
In one embodiment, the to-be-matched set obtaining module 12 further includes:
and the selection result acquisition unit is used for acquiring all selection results for selecting the K candidate sentences from the text sentence set, wherein each selection result comprises the K candidate sentences.
And the sentence to be judged acquiring unit is used for sequentially selecting one sentence from the K candidate sentences as the sentence to be judged according to the preset selection sequence aiming at each selection result.
And the comparison unit is used for comparing the sentence to be judged with the remaining K-1 candidate sentences to obtain a comparison result.
And the selection result deleting unit is used for deleting the selection result if the comparison result indicates that the sentence to be judged coincides with one of the remaining K-1 candidate sentences.
And the data to be matched acquisition unit is used for, if the comparison result indicates that the sentence to be judged does not coincide with any of the remaining K-1 candidate sentences, taking the selection result as data to be matched and adding it to the set to be matched.
In one embodiment, the matching value calculating module 13 further includes:
and the to-be-processed data acquisition unit is used for sequentially selecting one to-be-matched data from the to-be-matched set as the to-be-processed data according to a preset selection sequence.
And the feature extraction unit is used for extracting features of the data to be processed and the text data based on the abstract matching model to obtain a vector to be processed corresponding to the data to be processed and a text vector corresponding to the text data.
And the matching value calculating unit is used for calculating cosine similarity of the vector to be processed and the text vector and taking the obtained similarity value as a matching value corresponding to the data to be processed.
In one embodiment, before the updating module 14, the text abstract extracting device further includes:
and the preset elimination rate acquisition module is used for acquiring the preset elimination rate.
And the preset threshold value acquisition module is used for multiplying the preset elimination rate and the number of the selection results to obtain a preset threshold value.
In one embodiment, the summary obtaining module 17 further includes:
and the data acquisition unit is used for acquiring the data to be matched with the maximum matching value when the comparison result is that the K value is not less than the preset abstract sentence value.
And the conversion unit is used for converting the data to be matched based on the text data to obtain the abstract corresponding to the text data.
In one embodiment, the conversion unit further comprises:
and the first conversion unit is used for converting the data to be matched based on a time sorting method if the text data is time text data to obtain the abstract corresponding to the text data.
And the second conversion unit is used for performing conversion processing on the data to be matched based on an expansion sorting method to obtain the abstract corresponding to the text data if the text data is non-time text data.
Wherein the meaning of "first" and "second" in the above modules/units is only to distinguish different modules/units, and is not used to define which module/unit has higher priority or other defining meaning. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to such process, method, article, or apparatus, and such that a division of modules presented in this application is merely a logical division and may be implemented in a practical application in a further manner.
For the specific limitation of the text abstract extraction device, reference may be made to the above limitation on the text abstract extraction method, which is not described herein again. The modules in the text abstract extracting device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data involved in the text abstract extraction method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of text summarization.
In one embodiment, a computer device is provided, which includes a memory, a processor and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the steps of the method for extracting a text abstract in the above embodiments are implemented, for example, steps S101 to S107 shown in fig. 2 and other extensions of the method and related steps. Alternatively, the processor, when executing the computer program, implements the functions of the modules/units of the text summarization extraction apparatus in the above embodiments, such as the functions of the modules 11 to 17 shown in fig. 3. To avoid repetition, further description is omitted here.
The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like which is the control center for the computer device and which connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, video data, etc.) created according to the use of the cellular phone, etc.
The memory may be integrated in the processor or may be provided separately from the processor.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the method for extracting a text abstract in the above-described embodiments, such as the steps S101 to S107 shown in fig. 2 and extensions of other extensions and related steps of the method. Alternatively, the computer program, when executed by the processor, implements the functions of the modules/units of the text summarization extraction apparatus in the above-described embodiments, such as the functions of the modules 11 to 17 shown in fig. 3. To avoid repetition, further description is omitted here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method for extracting a text abstract is characterized by comprising the following steps:
acquiring text data, and performing text data analysis on the text data to obtain a text sentence set corresponding to the text data;
acquiring all selection results of K candidate sentences selected from the text sentence set, and adding the selection results meeting preset conditions into a set to be matched as data to be matched;
based on an abstract matching model, sequentially selecting one piece of data to be matched from the set to be matched according to a preset selection sequence and calculating its matching degree with the text data, so as to obtain a matching value corresponding to the data to be matched;
deleting the data to be matched with the matching value lower than a preset threshold value, and updating the text sentence set according to the deleted data to be matched;
updating the K value in the K candidate sentences based on a preset updating mode, and comparing the updated K value with a preset abstract sentence value to obtain a comparison result;
when the comparison result is that the K value is smaller than the preset abstract sentence value, returning to the step of obtaining all selection results of selecting K candidate sentences from the text sentence set, taking the selection results meeting preset conditions as data to be matched, and adding the selection results into the set to be matched for continuous execution;
and when the comparison result shows that the K value is not smaller than the preset abstract sentence value, taking the data to be matched with the maximum matching value as the abstract corresponding to the text data.
2. The method according to claim 1, wherein the step of obtaining and analyzing the text data to obtain a set of text sentences corresponding to the text data comprises:
acquiring text data;
analyzing the text data to obtain at least one sentence terminator in the text data;
and dividing the text data based on all the sentence end symbols to obtain a text sentence set corresponding to the text data.
3. The method according to claim 1, wherein the step of obtaining all selection results for selecting K candidate sentences from the text sentence set, and adding the selection results satisfying a preset condition as data to be matched to the set to be matched comprises:
acquiring all selection results for selecting K candidate sentences from the text sentence set, wherein each selection result comprises K candidate sentences;
for each selection result, sequentially selecting one sentence from the K candidate sentences as a sentence to be judged according to a preset selection sequence;
comparing the sentence to be judged with the remaining K-1 candidate sentences to obtain a comparison result;
if the comparison result indicates that the sentence to be judged coincides with one of the remaining K-1 candidate sentences, deleting the selection result;
and if the comparison result indicates that the sentence to be judged does not coincide with any of the remaining K-1 candidate sentences, taking the selection result as data to be matched and adding it to the set to be matched.
4. The method according to claim 1, wherein the step of, based on the abstract matching model, sequentially selecting one piece of data to be matched from the set to be matched according to a preset selection sequence and calculating its matching degree with the text data to obtain a matching value corresponding to the data to be matched comprises:
sequentially selecting one piece of data to be matched from the set to be matched as data to be processed according to a preset selection sequence;
extracting the characteristics of the data to be processed and the text data based on the abstract matching model to obtain a vector to be processed corresponding to the data to be processed and a text vector corresponding to the text data;
and calculating cosine similarity of the vector to be processed and the text vector, and taking the obtained similarity value as a matching value corresponding to the data to be processed.
5. The method according to claim 1, wherein before deleting the data to be matched whose matching value is lower than a preset threshold and updating the text sentence set according to the deleted data to be matched, the method further comprises:
obtaining a preset elimination rate;
and multiplying the preset elimination rate by the number of the selection results to obtain a preset threshold value.
6. The method according to claim 1, wherein when the comparison result indicates that the K value is not less than the preset abstract sentence value, the step of using the data to be matched with the largest matching value as the abstract corresponding to the text data comprises:
when the comparison result is that the K value is not smaller than the preset abstract sentence value, acquiring data to be matched with the maximum matching value;
and converting the data to be matched based on the text data to obtain the abstract corresponding to the text data.
7. The method according to claim 6, wherein the step of converting the data to be matched based on the text data to obtain the abstract corresponding to the text data comprises:
if the text data is time text data, converting the data to be matched based on a time sorting method to obtain an abstract corresponding to the text data;
and if the text data is non-time text data, converting the data to be matched based on an expansion sorting method to obtain an abstract corresponding to the text data.
8. An apparatus for extracting a text abstract, comprising:
the text sentence set acquisition module is used for acquiring text data and analyzing the text data to obtain a text sentence set corresponding to the text data;
a to-be-matched set acquisition module, configured to acquire all selection results of K candidate sentences selected from the text sentence set, and add the selection results meeting preset conditions as to-be-matched data to the to-be-matched set;
the matching value calculation module is used for, based on the abstract matching model, sequentially selecting one piece of data to be matched from the set to be matched according to a preset selection sequence and calculating its matching degree with the text data, so as to obtain a matching value corresponding to the data to be matched;
the updating module is used for deleting the data to be matched, of which the matching value is lower than a preset threshold value, and updating the text sentence set according to the deleted data to be matched;
the comparison module is used for updating the K value in the K candidate sentences based on a preset updating mode and comparing the updated K value with a preset abstract sentence value to obtain a comparison result;
the circulation module is used for returning to the step of obtaining all the selection results of selecting the K candidate sentences from the text sentence set when the comparison result indicates that the K value is smaller than the preset abstract sentence value, and adding the selection results meeting preset conditions into the set to be matched as data to be matched for continuous execution;
and the abstract acquisition module is used for taking the data to be matched with the maximum matching value as the abstract corresponding to the text data when the comparison result shows that the K value is not less than the preset abstract sentence value.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the text summarization extraction method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for extracting a text abstract according to any one of claims 1 to 7.
CN202111359470.3A 2021-11-16 2021-11-16 Text abstract extraction method and device, computer equipment and storage medium Pending CN114036927A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111359470.3A (published as CN114036927A) | 2021-11-16 | 2021-11-16 | Text abstract extraction method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111359470.3A (published as CN114036927A) | 2021-11-16 | 2021-11-16 | Text abstract extraction method and device, computer equipment and storage medium

Publications (1)

Publication Number | Publication Date
CN114036927A | 2022-02-11

Family

ID=80137901

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111359470.3A (published as CN114036927A) | Text abstract extraction method and device, computer equipment and storage medium | 2021-11-16 | 2021-11-16

Country Status (1)

Country | Link
CN | CN114036927A

Similar Documents

Publication Publication Date Title
WO2020220539A1 (en) Data increment method and device, computer device and storage medium
CN109858010B (en) Method and device for recognizing new words in field, computer equipment and storage medium
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
WO2021189951A1 (en) Text search method and apparatus, and computer device and storage medium
CN109710834B (en) Similar webpage detection method and device, storage medium and electronic equipment
CN107861948B (en) Label extraction method, device, equipment and medium
CN111291177A (en) Information processing method and device and computer storage medium
CN111737979B (en) Keyword correction method, device, correction equipment and storage medium for voice text
CN111159546A (en) Event pushing method and device, computer readable storage medium and computer equipment
CN114154487A (en) Text automatic error correction method and device, electronic equipment and storage medium
CN111859013A (en) Data processing method, device, terminal and storage medium
CN112784009A (en) Subject term mining method and device, electronic equipment and storage medium
CN107329964B (en) Text processing method and device
CN111382570B (en) Text entity recognition method, device, computer equipment and storage medium
CN115392235A (en) Character matching method and device, electronic equipment and readable storage medium
CN112990290A (en) Sample data generation method, device, equipment and storage medium
CN112765976A (en) Text similarity calculation method, device and equipment and storage medium
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN116303968A (en) Semantic search method, device, equipment and medium based on technical keyword extraction
CN114547257B (en) Class matching method and device, computer equipment and storage medium
CN111859079A (en) Information searching method and device, computer equipment and storage medium
CN116028626A (en) Text matching method and device, storage medium and electronic equipment
CN114003685B (en) Word segmentation position index construction method and device, and document retrieval method and device
CN114036927A (en) Text abstract extraction method and device, computer equipment and storage medium

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination