CN110750637A - Text abstract extraction method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN110750637A
CN110750637A (application number CN201910753710.4A)
Authority
CN
China
Prior art keywords
text, processed, training, processing, residual
Prior art date
Legal status
Granted
Application number
CN201910753710.4A
Other languages
Chinese (zh)
Other versions
CN110750637B (en)
Inventor
张思亮
Current Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN201910753710.4A
Publication of CN110750637A
Application granted
Publication of CN110750637B
Active legal status
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34: Browsing; Visualisation therefor
    • G06F 16/345: Summarisation for human users
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a text abstract extraction method and device, computer equipment and a storage medium, wherein the method comprises the following steps: processing a text to be processed by using a pre-trained target text classification model to obtain the category of the text to be processed; executing the following loop processing on the text to be processed until all sentences in the text to be processed have been deleted: randomly deleting one undeleted sentence from the text to be processed to obtain a residual text; processing the residual text by using the target text classification model to obtain the category of the residual text; judging whether the category of the residual text is the same as that of the text to be processed, and if not, restoring the deleted sentence to the text to be processed; and taking the residual text obtained after the loop processing as the target text abstract. By deriving the abstract from the overall semantics of the text, the invention improves the accuracy of text abstract extraction.

Description

Text abstract extraction method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text abstract extraction method, a text abstract extraction device, computer equipment and a storage medium.
Background
The abstract is a simple and coherent short text capable of reflecting the central content of a certain text, and can help people to shorten the reading time when reading a large amount of texts. The automatic text summarization technology is used for analyzing and processing a lengthy text by utilizing a series of text processing technologies through a computer, extracting main central ideas of the text, generating a brief and generalized summary, and helping a user to locate the content desired by the user.
The automatic text summarization technology is a research hotspot in the field of natural language processing, and is divided into extractive summarization and generative summarization according to how the summary content is produced. At present, the generative technology is not mature, and the industry commonly generates the abstract by an extractive method. However, such methods work at the surface level of the text: the semantic relations of the context are not utilized, the extracted abstract lacks coherence, key content cannot be selected according to the context, and user requirements cannot be met.
Disclosure of Invention
In view of the above deficiencies of the prior art, the present invention provides a method, an apparatus, a computer device and a storage medium for extracting a text abstract, so as to solve the problem that the prior art does not extract an abstract by using a context semantic relationship.
In order to achieve the above object, the present invention provides a text abstract extracting method, which comprises the following steps:
processing a text to be processed by using a target text classification model obtained by pre-training to obtain the category of the text to be processed;
and executing the following cyclic processing on the text to be processed until all sentences in the text to be processed are deleted:
randomly deleting a sentence which is not deleted from the text to be processed to obtain a residual text;
processing the residual texts by using the target text classification model to obtain the categories of the residual texts;
judging whether the type of the residual text is the same as that of the text to be processed or not, and if not, restoring the deleted sentences to the text to be processed;
and taking the residual text obtained after the loop processing as the target text abstract.
Further, the target text classification model is obtained by training through the following steps:
collecting a sample data set, wherein the sample data set comprises a plurality of training texts, and each training text is marked with a corresponding category;
dividing the sample data set into a training set and a verification set according to a preset proportion;
training to obtain the target text classification model based on the training set;
and verifying the target text classification model based on the verification set, and finishing training if the verification is passed.
Further, the text to be processed and the training text are complaint texts.
Further, the categories of the text to be processed and the training text include timeliness failure, price dispute, service attitude and the like.
Further, the target text classification model is a TextCNN model that includes an embedding layer, a convolutional layer, a pooling layer, a fully-connected layer, and a Softmax classification layer.
Further, the step of processing the text to be processed by using the pre-trained target text classification model is as follows:
vectorizing the text to be processed through the embedding layer to obtain a word vector of the text to be processed;
performing convolution processing on the word vectors of the text to be processed through the convolution layer to extract the features of the text to be processed;
performing pooling processing on the features of the text to be processed through the pooling layer to obtain the dimensionality reduction features of the text to be processed;
transmitting the dimensionality reduction features of the text to be processed to the Softmax classification layer through the full connection layer;
and processing the dimensionality reduction features of the text to be processed through the Softmax classification layer to obtain the category of the text to be processed.
Further, the text abstract extraction method further comprises the following steps: and preprocessing the text to be processed before processing the text to be processed by using a target text classification model obtained by pre-training.
In order to achieve the above object, the present invention further provides a text abstract extracting apparatus, including:
the category acquisition module is used for processing the text to be processed by utilizing a target text classification model obtained by pre-training to obtain the category of the text to be processed;
the cyclic deletion processing module is used for executing the following cyclic processing on the text to be processed until all sentences in the text to be processed are deleted:
randomly deleting a sentence which is not deleted from the text to be processed to obtain a residual text;
processing the residual texts by using the target text classification model to obtain the categories of the residual texts;
judging whether the type of the residual text is the same as that of the text to be processed or not, and if not, restoring the deleted sentences to the text to be processed;
and the abstract acquisition module is used for acquiring the residual text obtained after the circulation processing is finished as the abstract of the target text.
Further, the text abstract extracting device further comprises: a model training module for training the target text classification model, the model training module comprising:
the system comprises a sample data set acquisition unit, a data processing unit and a data processing unit, wherein the sample data set is used for acquiring a sample data set, the sample data set comprises a plurality of training texts, and each training text is labeled with a corresponding category label;
the sample data set dividing unit is used for dividing the sample data set into a training set and a verification set according to a preset proportion;
the training unit is used for training to obtain the target text classification model based on the training set;
and the verification unit is used for verifying the target text classification model based on the verification set, and if the verification passes, the training is finished.
Further, the text to be processed and the training text are complaint texts.
Further, the categories of the text to be processed and the training text include timeliness failure, price dispute, service attitude and the like.
Further, the target text classification model is a TextCNN model that includes an embedding layer, a convolutional layer, a pooling layer, a fully-connected layer, and a Softmax classification layer.
Further, the category acquisition module is specifically configured to:
vectorizing the text to be processed through the embedding layer to obtain a word vector of the text to be processed;
performing convolution processing on the word vectors of the text to be processed through the convolution layer to extract the features of the text to be processed;
performing pooling processing on the features of the text to be processed through the pooling layer to obtain the dimensionality reduction features of the text to be processed;
transmitting the dimensionality reduction features of the text to be processed to the Softmax classification layer through the full connection layer;
and processing the dimensionality reduction features of the text to be processed through the Softmax classification layer to obtain the category of the text to be processed.
Further, the text abstract extracting device further comprises: and the preprocessing module is used for preprocessing the text to be processed before processing the text to be processed by utilizing the target text classification model obtained by pre-training.
In order to achieve the above object, the present invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the aforementioned method when executing the computer program.
In order to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the aforementioned method.
By adopting the technical scheme, the invention has the following beneficial effects:
the method comprises the steps of deleting sentences in a text to be processed through random circulation, calculating whether the text type after the sentences are deleted is the same as that before the sentences are deleted, if so, indicating that the deleted sentences have small semantic contribution to the text and should be deleted, otherwise, indicating that the deleted sentences have large semantic contribution to the text and should not be deleted, recovering the deleted sentences in the text, and obtaining the abstract of the text when all the sentences in the text are deleted. Since the above process is implemented based on the classification model which is trained based on the semantic meaning, the abstract obtained based on the present invention is an abstract combined with the overall semantic meaning of the text, i.e., the abstract can truly outline the overall information of the text from the semantic aspect. In addition, the invention randomly deletes sentences when deleting sentences, ensures that key semantics are not influenced by sequence, and improves the accuracy of text abstract generation while giving consideration to the performance of text processing speed.
Drawings
FIG. 1 is a flow chart of one embodiment of the text abstract extraction method of the present invention;
FIG. 2 is a block diagram of one embodiment of the text abstract extraction apparatus of the present invention;
FIG. 3 is a hardware architecture diagram of one embodiment of the computer device of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
As shown in fig. 1, the present invention provides a text abstract extracting method, which specifically includes the following steps:
s0, training according to the collected sample data set to obtain a target text classification model, wherein the specific training process comprises the following steps:
And S01, collecting a sample data set, wherein the sample data set comprises a plurality of training texts, and each training text is labeled with a corresponding category. In this embodiment, the training text may be a complaint text. For example, assuming that a vehicle insurance company needs to quickly obtain a complaint summary from a customer's complaint text, the collected sample data set should contain complaint texts labeled with different categories, including but not limited to timeliness failure, price dispute, and service attitude. It should be understood that, for application scenarios other than complaint texts, corresponding sample data sets may be collected according to different needs.
And S02, dividing the collected sample data set into a training set and a verification set according to a preset proportion, wherein the training set accounts for 80% and the verification set accounts for 20%.
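The dataset split of steps S01 and S02 can be sketched as follows. This is an illustrative sketch only: the function name, the fixed random seed, and the (text, category) pair format are assumptions, not details taken from the patent.

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Shuffle labeled samples and split them into a training set and a
    verification set according to a preset proportion (80%/20% here)."""
    shuffled = samples[:]                      # leave the caller's list intact
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With the default ratio, a set of ten labeled texts yields eight training samples and two verification samples.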
And S03, training by adopting a gradient descent algorithm based on the training set to obtain the target text classification model. In the invention, the target text classification model is preferably a commonly used text classification model, namely the TextCNN model, which classifies texts by using a convolutional neural network and comprises an embedding layer, a convolutional layer, a pooling layer, a fully-connected layer and a Softmax classification layer.
And S04, verifying, based on the verification set, whether performance metrics of the trained target text classification model, such as accuracy, precision, recall and F1 score, meet preset conditions; if so, the target text classification model passes verification and the training is finished; otherwise, the number of training texts in the training set is increased and the target text classification model is retrained.
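The verification check of step S04 can be sketched as follows for a single category; the helper name and the threshold dictionary are assumptions of this sketch, and a real validation run would also aggregate these metrics across all categories.

```python
def passes_validation(y_true, y_pred, positive, thresholds):
    """Compute accuracy, precision, recall and F1 score for one category
    and compare them against preset conditions (minimum thresholds)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    metrics = {"accuracy": accuracy, "precision": precision,
               "recall": recall, "f1": f1}
    # Training finishes only when every metric meets its preset condition.
    return all(metrics[k] >= v for k, v in thresholds.items()), metrics
```

If the check fails, the embodiment enlarges the training set and retrains the model.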
S1, obtaining a text to be processed, wherein the text to be processed can be a complaint text, for example, a complaint text of a car insurance customer.
S2, processing the text to be processed by using the trained target text classification model (TextCNN model) to obtain the category of the text to be processed, which is implemented by the following steps:
S21, vectorizing the text to be processed through the embedding layer of the TextCNN model to obtain word vectors of the text to be processed;
S22, performing convolution processing on the word vectors of the text to be processed through the convolutional layer of the TextCNN model to extract the features of the text to be processed;
S23, performing pooling processing on the features of the text to be processed through the pooling layer of the TextCNN model to obtain the dimensionality reduction features of the text to be processed;
S24, transmitting the dimensionality reduction features of the text to be processed to the Softmax classification layer through the fully-connected layer of the TextCNN model;
and S25, calculating, through the Softmax classification layer of the TextCNN model, the probability of the text to be processed corresponding to each classification label according to the dimensionality reduction features, and taking the classification label with the maximum probability as the category of the text to be processed.
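Steps S21 to S25 amount to a single forward pass through the five layers. The NumPy sketch below uses random (untrained) weights purely to show the data flow; the dimensions and the weight initialisation are assumptions, not the patent's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def textcnn_forward(token_ids, vocab_size=1000, embed_dim=16,
                    num_filters=8, kernel_size=3, num_classes=3):
    """One forward pass through the five layers of steps S21-S25:
    embedding -> convolution -> max pooling -> fully-connected -> Softmax."""
    # S21: the embedding layer turns token ids into word vectors
    E = rng.standard_normal((vocab_size, embed_dim))
    x = E[token_ids]                                   # (seq_len, embed_dim)
    # S22: a 1-D convolution over the sequence extracts n-gram features
    W = rng.standard_normal((num_filters, kernel_size, embed_dim))
    seq_len = len(token_ids)
    conv = np.array([[np.sum(x[i:i + kernel_size] * W[f])
                      for i in range(seq_len - kernel_size + 1)]
                     for f in range(num_filters)])     # (filters, positions)
    conv = np.maximum(conv, 0)                         # ReLU activation
    # S23: max pooling reduces each filter map to one dimensionality-reduced feature
    pooled = conv.max(axis=1)                          # (filters,)
    # S24: the fully-connected layer projects the features to class scores
    Wd = rng.standard_normal((num_classes, num_filters))
    logits = Wd @ pooled
    # S25: Softmax yields a probability for each classification label
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = textcnn_forward([3, 17, 42, 7, 99])
```

The predicted category is the label with the maximum probability, i.e. `probs.argmax()`.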
And S3, performing sentence segmentation on the text to be processed. Specifically, the invention may segment the text to be processed at sentence-level punctuation marks, such as the period "。", the exclamation mark "!" and the question mark "?". For example, suppose the text to be processed is the following complaint: "I applied for non-accident rescue; there was only one telephone contact in between, telling me to wait two hours, but I ended up waiting more than four hours and nobody came, and I am dissatisfied and wish to complain about this. The customer says: I do not need rescue now and have found someone to help on my own. I contacted Ann Union rescue (028-65200801) several times but nobody answered, and the customer requires the company to give an explanation. Please have the relevant department handle and reply as soon as possible, thanks!" After sentence segmentation, the following four sentences are obtained: the 1st sentence is "I applied for non-accident rescue; there was only one telephone contact in between, telling me to wait two hours, but I ended up waiting more than four hours and nobody came, and I am dissatisfied and wish to complain about this."; the 2nd sentence is "The customer says: I do not need rescue now and have found someone to help on my own."; the 3rd sentence is "I contacted Ann Union rescue (028-65200801) several times but nobody answered, and the customer requires the company to give an explanation."; the 4th sentence is "Please have the relevant department handle and reply as soon as possible, thanks!".
After sentence segmentation is finished, a corresponding deletion flag bit is set for each sentence; the initial value of the flag bit is 0, and a value of 0 indicates that the corresponding sentence has not been deleted.
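The sentence segmentation and deletion flag bits of step S3 can be sketched as follows; the regular expression and the dictionary representation are assumptions of this sketch.

```python
import re

# Split after sentence-level punctuation: Chinese and ASCII period,
# exclamation mark and question mark.
SENTENCE_END = re.compile(r"(?<=[。！？.!?])")

def split_sentences(text):
    """Return the sentences of `text`, each with a deletion flag bit
    initialised to 0 (0 = not deleted, 1 = deleted)."""
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    return [{"sentence": s, "deleted": 0} for s in sentences]
```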
And S4, randomly selecting an undeleted sentence from the text to be processed and deleting it to obtain the residual text. After the sentence is selected and deleted, it is marked as deleted so that it will not be selected again when this step is subsequently repeated. In this embodiment, marking a sentence as deleted means setting the deletion flag bit of the sentence to 1; when the deletion flag bit is 1, the corresponding sentence has been deleted.
S5, processing the residual text by using the target text classification model, namely the TextCNN model, to obtain the category of the residual text; the specific flow is as follows:
S51, vectorizing the residual text through the embedding layer of the TextCNN model to obtain word vectors of the residual text;
S52, performing convolution processing on the word vectors of the residual text through the convolutional layer of the TextCNN model to extract the features of the residual text;
S53, performing pooling processing on the features of the residual text through the pooling layer of the TextCNN model to obtain the dimensionality reduction features of the residual text;
S54, transmitting the dimensionality reduction features of the residual text to the Softmax classification layer through the fully-connected layer of the TextCNN model;
and S55, calculating the probability of the residual text corresponding to each classification label through the Softmax classification layer of the TextCNN model, and taking the classification label with the maximum probability as the category of the residual text.
S6, determining whether the category of the residual text obtained by deleting the sentence is the same as the category of the text to be processed. If so, the deleted sentence is not important to the overall semantics of the text to be processed, i.e. the sentence should be excluded from the target text abstract, and step S8 is executed; if not, step S7 is executed.
S7, if the category of the residual text is different from the category of the text to be processed, the deleted sentence is important to the overall semantics of the text, i.e. the sentence should not be excluded from the target text abstract. Therefore, the deleted sentence is restored to the text to be processed, and step S8 is executed.
S8, determining whether all sentences in the text to be processed have been deleted, that is, determining whether all deletion flags of all sentences are 1, if yes, executing step S9, otherwise, returning to step S4 to execute the next loop processing.
And S9, taking the residual text finally obtained after all sentences in the text to be processed are deleted as the target text abstract to be extracted.
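Steps S4 to S9 amount to the following loop. Here `classify` stands in for the trained TextCNN model and is an assumption of this sketch: any callable mapping a text string to a category label will do.

```python
import random

def extract_summary(sentences, classify, seed=0):
    """Randomly delete one untried sentence per iteration (S4); if the
    category of the residual text differs from the original category,
    restore the sentence (S7); stop when every sentence has been tried
    (S8) and return the residual text as the abstract (S9)."""
    rnd = random.Random(seed)
    original = classify("".join(sentences))    # category of the full text
    removed, tried = set(), set()
    while len(tried) < len(sentences):
        i = rnd.choice([k for k in range(len(sentences)) if k not in tried])
        tried.add(i)                           # deletion flag bit set to 1
        residual = "".join(s for k, s in enumerate(sentences)
                           if k not in removed and k != i)
        if classify(residual) == original:
            removed.add(i)                     # same category: stays deleted
        # different category: the sentence is restored (kept out of `removed`)
    return "".join(s for k, s in enumerate(sentences) if k not in removed)
```

With a stub classifier that labels any text mentioning "rescue" as the complaint category, only the rescue sentence survives the loop, matching the worked example below.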
One application scenario of the present invention is as follows. Suppose a text X to be processed includes four sentences A, B, C and D, and the category obtained by processing the text with the target text classification model is M. When the method is applied, sentence D is randomly deleted first; if the category of the text after deleting sentence D is still M, sentence D is not important to text X and can be deleted, leaving a residual text comprising sentences A, B and C. Next, sentence C is randomly deleted from the residual text; if the category of the text after deleting sentence C is no longer M, sentence C is important to text X and cannot be deleted, so sentence C is restored and the residual text still comprises sentences A, B and C. The loop then continues to randomly delete sentences that have not yet been tried (sentence C, having already been tried, is not deleted again), and once every sentence in text X has been processed, the residual text is taken as the abstract. Taking the complaint text given in step S3 as an example, assume the category obtained after the TextCNN model processes the text is "timeliness failure", the category changes after the 1st sentence is deleted, and the category is still "timeliness failure" after the 2nd, 3rd or 4th sentence is deleted. This indicates that the 1st sentence is critical to the complaint text while the 2nd to 4th sentences are non-critical and should be excluded from the abstract, so the abstract of the complaint text is the 1st sentence.
It can be seen that the invention randomly deletes sentences from the text to be processed in a loop and checks whether the category of the text after each deletion is the same as before. If the category is the same, the deleted sentence contributes little to the semantics of the text and should be deleted; otherwise, the deleted sentence contributes much to the semantics of the text and should not be deleted, so it is restored to the text. When every sentence in the text has been processed, the abstract of the text is obtained. Since the invention is realized with a classification model trained on semantics, the abstract obtained reflects the overall semantics of the text, i.e. it can truly outline the overall information of the text from the semantic aspect, improving the accuracy of text abstract generation while maintaining text processing speed.
As a preferable scheme of this embodiment, before executing step S2, the method further includes preprocessing the acquired text to be processed, such as stop word filtering: detecting whether a word in the text to be processed matches a stop word in a preset stop word table, and if so, deleting the matched word. It will be understood that stop words are generally words without substantive meaning, such as particles and auxiliary words.
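The stop word filtering described above can be sketched as follows; the contents of the stop-word table are illustrative assumptions, and a real deployment would load a full list.

```python
# Illustrative stop-word table; a real deployment would load a full list.
STOP_WORDS = {"的", "了", "在", "the", "a", "of"}

def filter_stop_words(words, stop_words=STOP_WORDS):
    """Delete every word that matches an entry in the preset stop-word
    table, keeping the original order of the remaining words."""
    return [w for w in words if w not in stop_words]
```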
It should be noted that, for the sake of simplicity, the present embodiment is described as a series of acts, but those skilled in the art should understand that the present invention is not limited by the described order of acts, because some steps can be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Example two
As shown in fig. 2, the present embodiment provides a text abstract extracting apparatus 10, which includes:
the model training module 11 is used for training to obtain a target text classification model;
the category obtaining module 12 is configured to process a text to be processed by using a target text classification model obtained through pre-training to obtain the category of the text to be processed, where the text to be processed may be a complaint text;
the loop deletion processing module 13 is configured to perform the following loop processing on the text to be processed until all sentences in the text to be processed are deleted:
randomly deleting a sentence which is not deleted from the text to be processed to obtain a residual text;
processing the residual texts by using the target text classification model to obtain the categories of the residual texts;
judging whether the types of the residual texts are the same as the types of the texts to be processed, if not, restoring the deleted sentences to the texts to be processed;
and the abstract acquiring module 14 is configured to acquire a remaining text obtained after the loop processing is finished as a target text abstract.
In this embodiment, the model training module 11 includes:
the sample data set acquisition unit is used for acquiring a sample data set, the sample data set comprises a plurality of training texts, and each training text is marked with a corresponding category, wherein the training texts can be complaint texts;
the sample data set dividing unit is used for dividing the sample data set into a training set and a verification set according to a preset proportion;
the training unit is used for training to obtain a target text classification model based on a training set;
and the verification unit is used for verifying the target text classification model based on the verification set; if the verification passes, the training is finished; if not, the number of training texts in the training set is increased and the target text classification model is retrained.
In this embodiment, the target text classification model is a TextCNN model, which includes an embedding layer, a convolutional layer, a pooling layer, a fully-connected layer, and a Softmax classification layer.
In this embodiment, the category obtaining module 12 is specifically configured to:
vectorizing the text to be processed through the embedding layer of the TextCNN model to obtain word vectors of the text to be processed;
performing convolution processing on the word vectors of the text to be processed through the convolutional layer of the TextCNN model to extract the features of the text to be processed;
performing pooling processing on the features of the text to be processed through the pooling layer of the TextCNN model to obtain the dimensionality reduction features of the text to be processed;
transmitting the dimensionality reduction features of the text to be processed to the Softmax classification layer through the fully-connected layer of the TextCNN model;
and calculating, through the Softmax classification layer of the TextCNN model, the probability of the text to be processed corresponding to each classification label according to the dimensionality reduction features, and taking the classification label with the maximum probability as the category of the text to be processed.
In this embodiment, the text abstract extracting apparatus 10 may further include a preprocessing module, configured to preprocess the text to be processed before it is processed by the pre-trained target text classification model, such as stop word filtering: detecting whether a word in the text to be processed matches a stop word in a preset stop word list, and if so, deleting the matched word. It will be understood that stop words are generally words without substantive meaning, such as particles and auxiliary words.
It should also be understood by those skilled in the art that the embodiments described in the specification are preferred embodiments and that the modules referred to are not necessarily essential to the invention.
Example three
The present invention also provides a computer device, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of a plurality of servers) capable of executing programs. The computer device 20 of the present embodiment at least includes, but is not limited to: a memory 21 and a processor 22, which may be communicatively coupled to each other via a system bus, as shown in FIG. 3. It is noted that FIG. 3 only shows the computer device 20 with components 21-22, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
In the present embodiment, the memory 21 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 20, such as a hard disk or a memory of the computer device 20. In other embodiments, the memory 21 may also be an external storage device of the computer device 20, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory card (Flash Card) provided on the computer device 20. Of course, the memory 21 may also include both the internal and external storage devices of the computer device 20. In this embodiment, the memory 21 is generally used for storing an operating system and various application software installed on the computer device 20, such as the program code of the text abstract extracting apparatus 10 of the second embodiment. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 22 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip in some embodiments. The processor 22 is typically used to control the overall operation of the computer device 20. In this embodiment, the processor 22 is configured to run the program code stored in the memory 21 or to process data, for example, to run the text abstract extracting apparatus 10, so as to implement the text abstract extracting method of the first embodiment.
Example four
The present invention also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, or an application store, on which a computer program is stored, which when executed by a processor implements a corresponding function. The computer-readable storage medium of this embodiment is used for storing the text abstract extracting apparatus 10, and when executed by a processor, implements the text abstract extracting method of the first embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that the above embodiment method can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better embodiment.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the present specification and drawings, or used directly or indirectly in other related fields, are included in the scope of the present invention.

Claims (10)

1. A text abstract extraction method is characterized by comprising the following steps:
processing a text to be processed by using a target text classification model obtained by pre-training to obtain the category of the text to be processed;
and executing the following cyclic processing on the text to be processed until all sentences in the text to be processed are deleted:
randomly deleting a sentence that has not yet been deleted from the text to be processed to obtain a residual text;
processing the residual text by using the target text classification model to obtain the category of the residual text;
judging whether the category of the residual text is the same as the category of the text to be processed, and if not, restoring the deleted sentence to the text to be processed;
and taking the residual text obtained after the loop processing as the target text abstract.
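The loop of claim 1 can be sketched as follows. Here `classify` is a toy placeholder for the trained target text classification model, and the keyword rule and sample sentences are illustrative assumptions.

```python
import random

random.seed(1)

def classify(sentences):
    """Placeholder for the trained target text classification model:
    a toy rule that labels the text by a dominant keyword."""
    text = " ".join(sentences)
    return "price" if "price" in text else "service"

def extract_summary(sentences):
    """Randomly try deleting each sentence once; keep a deletion only if
    the category of the residual text matches that of the full text."""
    category = classify(sentences)        # category of the text to be processed
    remaining = list(sentences)
    not_tried = list(sentences)
    while not_tried:                      # until every sentence has been tried
        s = random.choice(not_tried)      # randomly pick an untried sentence
        not_tried.remove(s)
        residual = [x for x in remaining if x is not s]
        if classify(residual) == category:
            remaining = residual          # category unchanged: keep the deletion
        # otherwise the deleted sentence is restored (remaining is unchanged)
    return remaining                      # residual text = target text abstract

doc = ["The agent was polite.",
       "But the price charged was wrong.",
       "I want the price corrected."]
summary = extract_summary(doc)
```

With this toy classifier, sentences that do not affect the "price" category are deleted, and the abstract keeps exactly one price-related sentence.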
2. The method for extracting a text abstract according to claim 1, wherein the target text classification model is obtained by training through the following steps:
collecting a sample data set, wherein the sample data set comprises a plurality of training texts, and each training text is labeled with a corresponding category;
dividing the sample data set into a training set and a verification set according to a preset proportion;
training to obtain the target text classification model based on the training set;
and verifying the target text classification model based on the verification set, and finishing training if the verification is passed.
3. The method of claim 2, wherein the text to be processed and the training text are complaint texts.
4. The method according to claim 3, wherein the categories of the text to be processed and the training text include timeout, price disagreement, and service attitude.
5. The text summarization extraction method of claim 1 wherein the target text classification model is a TEXTCNN model comprising an embedding layer, a convolution layer, a pooling layer, a fully-connected layer, and a Softmax classification layer.
6. The method for extracting a text abstract according to claim 5, wherein the step of processing the text to be processed by using the pre-trained target text classification model comprises:
vectorizing the text to be processed through the embedding layer to obtain word vectors of the text to be processed;
performing convolution processing on the word vectors of the text to be processed through the convolution layer to extract features of the text to be processed;
performing pooling processing on the features of the text to be processed through the pooling layer to obtain dimension reduction features of the text to be processed;
transmitting the dimension reduction features of the text to be processed to the Softmax classification layer through the fully-connected layer;
and processing the dimension reduction features of the text to be processed through the Softmax classification layer to obtain the category of the text to be processed.
7. The method of claim 1, further comprising: preprocessing the text to be processed before processing the text to be processed by using the target text classification model obtained through pre-training.
8. An apparatus for extracting a text abstract, comprising:
the category acquisition module is used for processing the text to be processed by utilizing a target text classification model obtained by pre-training to obtain the category of the text to be processed;
the cyclic deletion processing module is used for executing the following cyclic processing on the text to be processed until all sentences in the text to be processed are deleted:
randomly deleting a sentence that has not yet been deleted from the text to be processed to obtain a residual text;
processing the residual text by using the target text classification model to obtain the category of the residual text;
judging whether the category of the residual text is the same as the category of the text to be processed, and if not, restoring the deleted sentence to the text to be processed;
and the abstract acquisition module is used for taking the residual text obtained after the loop processing is finished as the target text abstract.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented by the processor when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN201910753710.4A 2019-08-15 2019-08-15 Text abstract extraction method, device, computer equipment and storage medium Active CN110750637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910753710.4A CN110750637B (en) 2019-08-15 2019-08-15 Text abstract extraction method, device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN110750637A true CN110750637A (en) 2020-02-04
CN110750637B CN110750637B (en) 2024-05-24

Family

ID=69275839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910753710.4A Active CN110750637B (en) 2019-08-15 2019-08-15 Text abstract extraction method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110750637B (en)


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1040267A (en) * 1996-07-26 1998-02-13 Nec Corp Document summary viewer
CN101446940A (en) * 2007-11-27 2009-06-03 北京大学 Method and device of automatically generating a summary for document set
CN106133772A (en) * 2013-12-18 2016-11-16 谷歌公司 With entity mark video based on comment summary
KR20170089369A (en) * 2016-01-26 2017-08-03 주식회사 마커 Method for automatic summarizing document by user learning
WO2018036555A1 (en) * 2016-08-25 2018-03-01 腾讯科技(深圳)有限公司 Session processing method and apparatus
CN109376242A (en) * 2018-10-18 2019-02-22 西安工程大学 Text classification algorithm based on Recognition with Recurrent Neural Network variant and convolutional neural networks
CN109492091A (en) * 2018-09-28 2019-03-19 科大国创软件股份有限公司 A kind of complaint work order intelligent method for classifying based on convolutional neural networks
CN110069624A (en) * 2019-04-28 2019-07-30 北京小米智能科技有限公司 Text handling method and device
KR20190090944A (en) * 2018-01-26 2019-08-05 주식회사 두유비 System and method for machine learning to sort sentence importance and generating summary sentence based on keyword importance


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667815A (en) * 2020-12-30 2021-04-16 北京捷通华声科技股份有限公司 Text processing method and device, computer readable storage medium and processor
CN113033216A (en) * 2021-03-03 2021-06-25 东软集团股份有限公司 Text preprocessing method and device, storage medium and electronic equipment
CN113033216B (en) * 2021-03-03 2024-05-28 东软集团股份有限公司 Text preprocessing method and device, storage medium and electronic equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant