CN110750637B - Text abstract extraction method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN110750637B
CN110750637B (application CN201910753710.4A)
Authority
CN
China
Prior art keywords
text
processed
training
category
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910753710.4A
Other languages
Chinese (zh)
Other versions
CN110750637A (en)
Inventor
张思亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN201910753710.4A
Publication of CN110750637A
Application granted
Publication of CN110750637B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text abstract extraction method, an apparatus, computer equipment and a storage medium. The method comprises the following steps: processing a text to be processed with a target text classification model obtained by pre-training, to obtain the category of the text to be processed; performing the following loop on the text to be processed until every sentence in it has been deleted once: randomly deleting one not-yet-deleted sentence from the text to be processed to obtain a residual text; processing the residual text with the target text classification model to obtain the category of the residual text; judging whether the category of the residual text is the same as that of the text to be processed and, if not, restoring the deleted sentence to the text to be processed; and taking the residual text obtained after the loop ends as the target text abstract. The method derives the abstract from the overall semantics of the text and improves the accuracy of text abstract extraction.

Description

Text abstract extraction method, device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a text abstract extraction method, an apparatus, computer equipment and a storage medium.
Background
A text abstract is a short, coherent passage that reflects the central content of a text and helps readers save time when facing large volumes of material. Automatic text summarization uses a computer to analyze lengthy text through a series of text-processing techniques, extract its main ideas and generate a brief, general summary, helping users locate the content they want.
Automatic text summarization is a research hotspot in the field of natural language processing and, according to how the summary content is produced, is divided into extractive summarization and generative (abstractive) summarization. Because generative techniques are still immature, industry generally uses extractive methods: the text is preprocessed (word segmentation, stop-word removal, etc.), a text matrix is built with the TF-IDF algorithm, sentence scores are computed, and the highest-scoring sentences are selected as the abstract. However, such methods stay at the literal level and ignore the semantic relations of the context; the extracted abstract lacks coherence, cannot capture key content from context, and fails to meet user needs.
Disclosure of Invention
To overcome the above defects of the prior art, the invention provides a text abstract extraction method, a text abstract extraction apparatus, computer equipment and a storage medium, aiming to solve the problem that the prior art does not exploit the semantic relations of the context when extracting an abstract.
To achieve the above object, the invention provides a text abstract extraction method comprising the following steps:
processing a text to be processed with a target text classification model obtained by pre-training, to obtain the category of the text to be processed;
performing the following loop on the text to be processed until every sentence in it has been deleted once:
randomly deleting one not-yet-deleted sentence from the text to be processed to obtain a residual text;
processing the residual text with the target text classification model to obtain the category of the residual text;
judging whether the category of the residual text is the same as that of the text to be processed and, if not, restoring the deleted sentence to the text to be processed;
and taking the residual text obtained after the loop ends as the target text abstract.
Further, the target text classification model is obtained through the following training steps:
collecting a sample data set, wherein the sample data set comprises a plurality of training texts, each labeled with its corresponding category;
dividing the sample data set into a training set and a validation set at a predetermined ratio;
training the target text classification model based on the training set;
and validating the target text classification model based on the validation set; if the validation passes, the training ends.
Further, the text to be processed and the training texts are complaint texts.
Further, the categories of the text to be processed and the training texts include timeliness not met, price objection, service attitude, and the like.
Further, the target text classification model is a TextCNN model, which includes an embedding layer, a convolutional layer, a pooling layer, a fully connected layer and a Softmax classification layer.
Further, processing the text to be processed with the pre-trained target text classification model comprises the following steps:
vectorizing the text to be processed through the embedding layer to obtain the word vectors of the text to be processed;
convolving the word vectors of the text to be processed through the convolutional layer to extract the features of the text to be processed;
pooling the features of the text to be processed through the pooling layer to obtain the reduced-dimension features of the text to be processed;
passing the reduced-dimension features of the text to be processed to the Softmax classification layer through the fully connected layer;
and processing the reduced-dimension features of the text to be processed through the Softmax classification layer to obtain the category of the text to be processed.
Further, the text abstract extraction method also comprises: preprocessing the text to be processed before it is processed with the pre-trained target text classification model.
To achieve the above object, the invention further provides a text abstract extraction apparatus, comprising:
a category acquisition module, configured to process a text to be processed with a pre-trained target text classification model to obtain the category of the text to be processed;
a loop pruning module, configured to perform the following loop on the text to be processed until every sentence in it has been deleted once:
randomly deleting one not-yet-deleted sentence from the text to be processed to obtain a residual text;
processing the residual text with the target text classification model to obtain the category of the residual text;
judging whether the category of the residual text is the same as that of the text to be processed and, if not, restoring the deleted sentence to the text to be processed;
and an abstract acquisition module, configured to take the residual text obtained after the loop ends as the target text abstract.
Further, the text abstract extraction apparatus also comprises a model training module for training the target text classification model, the model training module comprising:
a sample data set collection unit, configured to collect a sample data set comprising a plurality of training texts, each labeled with its corresponding category;
a sample data set division unit, configured to divide the sample data set into a training set and a validation set at a predetermined ratio;
a training unit, configured to train the target text classification model based on the training set;
and a validation unit, configured to validate the target text classification model based on the validation set; if the validation passes, the training ends.
Further, the text to be processed and the training texts are complaint texts.
Further, the categories of the text to be processed and the training texts include timeliness not met, price objection, service attitude, and the like.
Further, the target text classification model is a TextCNN model, which includes an embedding layer, a convolutional layer, a pooling layer, a fully connected layer and a Softmax classification layer.
Further, the category acquisition module is specifically configured to:
vectorize the text to be processed through the embedding layer to obtain the word vectors of the text to be processed;
convolve the word vectors of the text to be processed through the convolutional layer to extract the features of the text to be processed;
pool the features of the text to be processed through the pooling layer to obtain the reduced-dimension features of the text to be processed;
pass the reduced-dimension features of the text to be processed to the Softmax classification layer through the fully connected layer;
and process the reduced-dimension features of the text to be processed through the Softmax classification layer to obtain the category of the text to be processed.
Further, the text abstract extraction apparatus also comprises a preprocessing module, configured to preprocess the text to be processed before it is processed with the pre-trained target text classification model.
To achieve the above object, the invention also provides a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the aforementioned method when executing the computer program.
To achieve the above object, the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the aforementioned method.
The above technical scheme brings the following beneficial effects:
The invention deletes sentences from the text to be processed in a random loop and checks whether the text category after each deletion is the same as before it. If it is, the deleted sentence contributes little to the semantics of the text and stays deleted; if not, the deleted sentence contributes much to the semantics of the text and is restored. When every sentence in the text has been tried, the remaining text is the abstract. Because this process is built on a classification model, and the classification model is trained on semantics, the abstract obtained by the invention reflects the overall semantics of the text, i.e., it truly summarizes the whole text at the semantic level. In addition, sentences are deleted in random order, which ensures that key semantics are not affected by sentence position, improving the accuracy of the generated abstract while keeping the text-processing speed acceptable.
Drawings
FIG. 1 is a flow chart of one embodiment of a text summarization method of the present invention;
FIG. 2 is a block diagram of one embodiment of a text summarization apparatus of the present invention;
FIG. 3 is a hardware architecture diagram of one embodiment of a computer device of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the drawings and embodiments, so that its objects, technical solutions and advantages become clearer. It should be understood that the specific embodiments described here are for illustration only and are not intended to limit the scope of the invention. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of the invention.
Example 1
As shown in FIG. 1, the invention provides a text abstract extraction method comprising the following steps:
S0: Train a target text classification model on a collected sample data set. The training process comprises:
S01: Collect a sample data set comprising a plurality of training texts, each labeled with its corresponding category. In this embodiment, the training texts may be complaint texts. For example, if a vehicle insurance company needs to quickly obtain a complaint abstract from a customer's complaint text, the collected sample data set should contain complaint texts labeled with different categories, including but not limited to timeliness not met, price objection and service attitude. It should be appreciated that, beyond complaint texts, corresponding sample data sets may be collected for other application scenarios as needed.
S02: Divide the collected sample data set into a training set and a validation set at a predetermined ratio, e.g., 80% for the training set and 20% for the validation set.
S03: Train the target text classification model on the training set using a gradient descent algorithm. In the invention, the target text classification model is preferably the commonly used TextCNN model, a convolutional neural network for text classification that includes an embedding layer, a convolutional layer, a pooling layer, a fully connected layer and a Softmax classification layer.
S04: On the validation set, check whether the accuracy, precision, recall, F1 score and other metrics of the trained target text classification model meet preset conditions. If so, training ends; otherwise, increase the number of training texts in the training set and retrain the model.
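The bookkeeping in steps S02 and S04, an 80/20 split followed by a metric gate on the validation set, can be sketched as follows. This is a minimal illustration: the classifier itself, the seed and the sample data are stand-ins, not the patent's trained model.

```python
import random


def split_dataset(samples, train_ratio=0.8, seed=42):
    """Step S02: shuffle labeled samples and split them at a predetermined
    ratio into a training set and a validation set (80/20 by default)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def one_vs_rest_metrics(y_true, y_pred, positive):
    """Step S04: accuracy, precision, recall and F1 for one category,
    computed one-vs-rest from validation-set predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

If the returned metrics fall below the preset thresholds, more training texts are collected and the model is retrained, as step S04 describes.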
S1: Obtain the text to be processed, which may be a complaint text, such as the complaint text of a vehicle insurance customer.
S2: Process the text to be processed with the trained target text classification model (the TextCNN model) to obtain its category, through the following steps:
S21: Vectorize the text to be processed through the embedding layer of the TextCNN model to obtain the word vectors of the text to be processed;
S22: Convolve the word vectors of the text to be processed through the convolutional layer of the TextCNN model to extract the features of the text to be processed;
S23: Pool the features of the text to be processed through the pooling layer of the TextCNN model to obtain the reduced-dimension features of the text to be processed;
S24: Pass the reduced-dimension features of the text to be processed to the Softmax classification layer through the fully connected layer of the TextCNN model;
S25: Compute, through the Softmax classification layer of the TextCNN model, the probability of each category label from the reduced-dimension features of the text to be processed, and take the label with the highest probability as the category of the text to be processed.
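The forward pass of steps S21 to S25 can be sketched in NumPy as below. All dimensions and weights are illustrative stand-ins (a real TextCNN is trained with gradient descent as in step S03 and typically uses several kernel widths and a nonlinearity); the sketch only shows how embed, convolve, pool, connect and classify chain together.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, EMB, FILTERS, KERNEL, CLASSES = 100, 8, 4, 3, 3  # toy sizes

W_emb = rng.normal(size=(VOCAB, EMB))             # embedding layer weights
W_conv = rng.normal(size=(FILTERS, KERNEL, EMB))  # convolution filters
W_fc = rng.normal(size=(FILTERS, CLASSES))        # fully connected layer weights


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def textcnn_forward(token_ids):
    """Forward pass mirroring S21-S25: embed, convolve, max-pool, FC, softmax."""
    x = W_emb[token_ids]                           # S21: (seq_len, EMB) word vectors
    seq_len = len(token_ids)
    feats = np.empty((FILTERS, seq_len - KERNEL + 1))
    for f in range(FILTERS):                       # S22: 1-D convolution over the sequence
        for i in range(seq_len - KERNEL + 1):
            feats[f, i] = np.sum(x[i:i + KERNEL] * W_conv[f])
    pooled = feats.max(axis=1)                     # S23: max-pooling reduces dimension
    logits = pooled @ W_fc                         # S24: fully connected layer
    probs = softmax(logits)                        # S25: probability per category label
    return int(np.argmax(probs)), probs            # highest-probability label wins
```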
S3: Sentence splitting is performed on the text to be processed. Specifically, the invention may split the text on sentence-level punctuation marks such as the period "。", the exclamation mark "！" and the question mark "？". For example, suppose the text to be processed is the following complaint: "I applied for non-accident rescue, and there was only one phone call in between telling me it would take two hours, but in the end more than four hours passed; I am dissatisfied and lodging a complaint. The customer called to say they no longer need our rescue and will arrange rescue themselves. The customer contacted the Ann-Lian rescue hotline 028-65200801 several times with no answer, and asks the department for an explanation. Please have the organization handle this and reply as soon as possible, thank you!" After sentence splitting, the following four sentences are obtained: sentence 1 is "I applied for non-accident rescue, and there was only one phone call in between telling me it would take two hours, but in the end more than four hours passed; I am dissatisfied and lodging a complaint."; sentence 2 is "The customer called to say they no longer need our rescue and will arrange rescue themselves."; sentence 3 is "The customer contacted the Ann-Lian rescue hotline 028-65200801 several times with no answer, and asks the department for an explanation."; sentence 4 is "Please have the organization handle this and reply as soon as possible, thank you!".
After sentence splitting is completed, a deletion flag bit is set for each sentence, with initial value 0; a flag bit of 0 indicates that the corresponding sentence has not been deleted.
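The sentence splitting of step S3 and the per-sentence flag bits can be sketched with a regular expression over the sentence-ending punctuation named above (handling of quotation marks and ellipses is omitted for brevity):

```python
import re

# Split *after* sentence-level punctuation, keeping the mark with its sentence.
SENT_END = re.compile(r"(?<=[。！？.!?])")


def split_sentences(text):
    """Step S3: split the text to be processed into sentences."""
    return [s for s in SENT_END.split(text) if s.strip()]


sentences = split_sentences("第一句。第二句！第三句？")
deleted = [0] * len(sentences)  # deletion flag bits, initialized to 0 (not deleted)
```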
S4: Randomly select one not-yet-deleted sentence from the text to be processed and delete it, obtaining a residual text. After the sentence is deleted, it is marked as deleted so that it is not selected again when this step repeats. In this embodiment, marking a sentence as deleted means setting its deletion flag bit to 1; a flag bit of 1 indicates that the corresponding sentence has been deleted.
S5: Process the residual text with the target text classification model, i.e., the TextCNN model, to obtain the category of the residual text. The specific flow is as follows:
S51: Vectorize the residual text through the embedding layer of the TextCNN model to obtain the word vectors of the residual text;
S52: Convolve the word vectors of the residual text through the convolutional layer of the TextCNN model to extract the features of the residual text;
S53: Pool the features of the residual text through the pooling layer of the TextCNN model to obtain the reduced-dimension features of the residual text;
S54: Pass the reduced-dimension features of the residual text to the Softmax classification layer through the fully connected layer of the TextCNN model;
S55: Compute, through the Softmax classification layer of the TextCNN model, the probability of each category label for the residual text, and take the label with the highest probability as the category of the residual text.
S6: Judge whether the category of the residual text obtained by deleting the sentence is the same as the category of the text to be processed. If so, the deleted sentence is unimportant to the overall semantics of the text to be processed, i.e., it should be excluded from the target text abstract, and step S8 is executed; if not, step S7 is executed.
S7: If the category of the residual text differs from that of the text to be processed, the deleted sentence is important to the overall semantics of the text, i.e., it should not be excluded from the target text abstract. Therefore, the deleted sentence is restored to the text to be processed, and step S8 is executed.
S8: Judge whether every sentence in the text to be processed has been deleted once, i.e., whether all deletion flag bits are 1. If so, execute step S9; otherwise, return to step S4 for the next iteration.
S9: Take the residual text obtained after every sentence in the text to be processed has been tried as the target text abstract to be extracted.
One application scenario of the invention is as follows. Suppose a text to be processed X contains four sentences A, B, C and D, and the category obtained after processing X with the target text classification model is M. Using the method, sentence D is first deleted at random; if the category of the text with D removed is still M, D is unimportant to X and may be deleted, leaving a residual text of sentences A, B and C. Sentence C is then deleted at random from the residual text; if the category without C is no longer M, C is important to X and cannot be deleted, so C is restored and the residual text still contains A, B and C. The loop then continues to randomly delete not-yet-tried sentences; since C has already been tried, it is not selected again. The residual text left after every sentence in X has been tried is taken as the abstract. Taking the complaint text given in step S3 as an example, suppose the category obtained from the TextCNN model is "timeliness not met"; the category changes after deleting sentence 1, but remains "timeliness not met" after deleting sentence 2, 3 or 4. This indicates that sentence 1 is critical to the complaint text while sentences 2 to 4 are not and are removed from its abstract, so the abstract of the complaint text is sentence 1.
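The loop of steps S4 to S9, including the A/B/C/D scenario above, can be sketched as follows. Here `classify` is a stand-in for the trained TextCNN classifier; any function mapping a text to a category label works.

```python
import random


def extract_summary(sentences, classify, seed=0):
    """Steps S4-S9: try deleting each sentence in random order and keep it
    only if its removal changes the predicted category of the text."""
    rng = random.Random(seed)
    base = classify("".join(sentences))     # category of the full text (step S2)
    kept = [True] * len(sentences)          # which sentences remain in the text
    order = list(range(len(sentences)))
    rng.shuffle(order)                      # S4: random deletion order
    for i in order:                         # each sentence is tried exactly once (S8)
        kept[i] = False                     # tentatively delete sentence i
        remaining = "".join(s for s, k in zip(sentences, kept) if k)
        if classify(remaining) != base:     # S6: did the category change?
            kept[i] = True                  # S7: important sentence, restore it
    # S9: the surviving sentences form the target text abstract
    return "".join(s for s, k in zip(sentences, kept) if k)
```

For example, with a toy classifier that labels any text containing "A" as category "M", only sentence "A。" changes the category when removed, so it alone survives as the abstract regardless of the random deletion order.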
It can thus be seen that the invention deletes sentences from the text to be processed in a random loop and checks whether the text category after each deletion matches the category before it: if it does, the deleted sentence contributes little to the semantics of the text and stays deleted; otherwise it contributes much and is restored. When every sentence has been tried, the abstract of the text is obtained. Because the invention is built on a classification model, and the classification model is trained on semantics, the resulting abstract reflects the overall semantics of the text, i.e., it truly summarizes the whole text at the semantic level, improving the accuracy of the generated abstract while keeping the text-processing speed acceptable.
As a preferred scheme of this embodiment, the invention also preprocesses the obtained text to be processed before executing step S2, in particular by stop-word filtering: detecting whether any word in the text to be processed matches a stop word in a preset stop-word list and, if so, deleting the matched word. It should be understood that stop words are generally function words without concrete meaning, such as the Chinese particles "地", "得" and "了".
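The stop-word filtering described above is a simple set-membership check. In the sketch below the stop-word list is a tiny illustrative stand-in; a real system would load a full preset table.

```python
# Hypothetical miniature stop-word list (illustrative only).
STOP_WORDS = {"的", "地", "得", "了", "着"}


def filter_stop_words(tokens):
    """Delete every token that matches an entry in the preset stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]
```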
It should be noted that, for simplicity of description, this embodiment is presented as a series of actions, but those skilled in the art will understand that the invention is not limited by the order of actions, as some steps may be performed in another order or simultaneously. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily all required by the invention.
Example two
As shown in FIG. 2, this embodiment provides a text abstract extraction apparatus 10, comprising:
a model training module 11, configured to train the target text classification model;
a category acquisition module 12, configured to process a text to be processed with the pre-trained target text classification model to obtain the category of the text to be processed, where the text to be processed may be a complaint text;
a loop pruning module 13, configured to perform the following loop on the text to be processed until every sentence in it has been deleted once:
randomly deleting one not-yet-deleted sentence from the text to be processed to obtain a residual text;
processing the residual text with the target text classification model to obtain the category of the residual text;
judging whether the category of the residual text is the same as that of the text to be processed and, if not, restoring the deleted sentence to the text to be processed;
and an abstract acquisition module 14, configured to take the residual text obtained after the loop ends as the target text abstract.
In this embodiment, the model training module 11 comprises:
a sample data set collection unit, configured to collect a sample data set comprising a plurality of training texts, each labeled with its corresponding category, where the training texts may be complaint texts;
a sample data set division unit, configured to divide the sample data set into a training set and a validation set at a predetermined ratio;
a training unit, configured to train the target text classification model based on the training set;
and a validation unit, configured to validate the target text classification model based on the validation set; if the validation passes, training ends, and if not, the number of training texts in the training set is increased and the model is retrained.
In this embodiment, the target text classification model is a TextCNN model, which includes an embedding layer, a convolutional layer, a pooling layer, a fully connected layer and a Softmax classification layer.
In this embodiment, the category acquisition module 12 is specifically configured to:
vectorize the text to be processed through the embedding layer of the TextCNN model to obtain the word vectors of the text to be processed;
convolve the word vectors of the text to be processed through the convolutional layer of the TextCNN model to extract the features of the text to be processed;
pool the features of the text to be processed through the pooling layer of the TextCNN model to obtain the reduced-dimension features of the text to be processed;
pass the reduced-dimension features of the text to be processed to the Softmax classification layer through the fully connected layer of the TextCNN model;
and compute, through the Softmax classification layer of the TextCNN model, the probability of each category label from the reduced-dimension features of the text to be processed, taking the label with the highest probability as the category of the text to be processed.
In this embodiment, the text abstract extraction apparatus 10 may further comprise a preprocessing module, configured to preprocess the text to be processed before it is processed with the pre-trained target text classification model, in particular by stop-word filtering: detecting whether any word in the text to be processed matches a stop word in a preset stop-word list and, if so, deleting the matched word. It should be understood that stop words are generally function words without concrete meaning, such as the Chinese particles "地", "得" and "了".
Those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments and that the modules referred to are not necessarily essential to the invention.
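The sentence-deletion loop performed by the loop pruning processing module (set out in the claims below) can be sketched as follows. The classifier here is a hypothetical stand-in for the trained TextCNN model, and the function and variable names are illustrative, not taken from the patent.

```python
import random

random.seed(1)

def extract_summary(sentences, classify):
    """Loop pruning: try deleting each sentence once, in random order, and
    keep a deletion only if the remaining text still has the same category."""
    original_category = classify(sentences)
    kept = list(sentences)
    # Visit every sentence exactly once, in random order.
    for sentence in random.sample(sentences, len(sentences)):
        trial = [s for s in kept if s is not sentence]
        if classify(trial) == original_category:
            kept = trial          # deletion preserved the category: commit it
        # otherwise the sentence is "restored", i.e. kept stays unchanged
    return kept                   # residual text = target text abstract

# Hypothetical classifier standing in for the TextCNN model: the category is
# "complaint" if any sentence mentions a refund, else "other".
def toy_classify(sentences):
    return "complaint" if any("refund" in s for s in sentences) else "other"

doc = ["I called twice.", "No one answered.", "I demand a refund."]
summary = extract_summary(doc, toy_classify)
print(summary)  # only the sentence the category depends on survives
```

With this toy classifier the two sentences that do not affect the category are pruned, leaving the category-bearing sentence as the abstract, which is exactly the semantics the claims describe.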
Example III
The invention also provides a computer device capable of executing programs, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of a plurality of servers). The computer device 20 of this embodiment includes at least, but is not limited to, a memory 21 and a processor 22 that can be communicatively connected to each other via a system bus, as shown in fig. 3. It should be noted that fig. 3 shows only a computer device 20 having the components 21-22, but it should be understood that not all of the illustrated components need be implemented; more or fewer components may be implemented instead.
In this embodiment, the memory 21 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 20, such as a hard disk or internal memory of the computer device 20. In other embodiments, the memory 21 may instead be an external storage device of the computer device 20, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device 20. Of course, the memory 21 may also include both an internal storage unit and an external storage device of the computer device 20. In this embodiment, the memory 21 is typically used to store the operating system and the various types of application software installed on the computer device 20, such as the program code of the text digest extraction apparatus 10 of the second embodiment. Further, the memory 21 may be used to temporarily store various types of data that have been output or are to be output.
The processor 22 may in some embodiments be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is typically used to control the overall operation of the computer device 20. In this embodiment, the processor 22 is configured to run the program code stored in the memory 21 or to process data, for example to run the text digest extraction apparatus 10, thereby implementing the text digest extraction method of the first embodiment.
Example IV
The present invention also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, or an App application store, on which a computer program is stored that performs a corresponding function when executed by a processor. The computer-readable storage medium of this embodiment is used to store the text digest extraction apparatus 10, which, when executed by a processor, implements the text digest extraction method of the first embodiment.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware alone, although in many cases the former is the preferred implementation.
The foregoing description covers only preferred embodiments of the present invention and is not intended to limit its scope; any equivalent structure or equivalent process transformation made using the contents of the present specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, is likewise encompassed within the scope of the present invention.

Claims (10)

1. A text abstract extraction method, characterized by comprising the following steps:
processing a text to be processed by using a target text classification model obtained in advance based on semantic training, to obtain the category of the text to be processed;
performing the following loop processing on the text to be processed until all sentences in the text to be processed have been deleted:
randomly deleting one not-yet-deleted sentence from the text to be processed to obtain a residual text;
processing the residual text by using the target text classification model to obtain the category of the residual text;
judging whether the category of the residual text is the same as the category of the text to be processed, and if not, restoring the deleted sentence to the text to be processed;
and taking the residual text obtained after the loop processing ends as a target text abstract.
2. The text abstract extraction method of claim 1, wherein the target text classification model is trained by:
collecting a sample data set, wherein the sample data set comprises a plurality of training texts and each training text is marked with its corresponding category;
dividing the sample data set into a training set and a verification set according to a preset proportion;
training the target text classification model on the training set;
and verifying the target text classification model on the verification set, and if the verification passes, ending the training.
3. The text abstract extraction method of claim 2, wherein the text to be processed and the training texts are complaint texts.
4. The text abstract extraction method of claim 3, wherein the categories of the text to be processed and the training texts include age-out, price objection, and service attitude.
5. The text abstract extraction method of claim 1, wherein the target text classification model is a TextCNN model, and the TextCNN model comprises an embedding layer, a convolution layer, a pooling layer, a fully connected layer, and a Softmax classification layer.
6. The text abstract extraction method of claim 5, wherein processing the text to be processed using the target text classification model obtained in advance based on semantic training comprises the steps of:
vectorizing the text to be processed through the embedding layer to obtain word vectors of the text to be processed;
performing convolution processing on the word vectors of the text to be processed through the convolution layer to extract features of the text to be processed;
pooling the features of the text to be processed through the pooling layer to obtain dimension-reduced features of the text to be processed;
transmitting the dimension-reduced features of the text to be processed to the Softmax classification layer through the fully connected layer;
and processing the dimension-reduced features of the text to be processed through the Softmax classification layer to obtain the category of the text to be processed.
7. The text abstract extraction method of claim 1, further comprising: preprocessing the text to be processed before processing it with the pre-trained target text classification model.
8. A text digest extraction apparatus, comprising:
a category obtaining module, configured to process a text to be processed by using a target text classification model obtained in advance based on semantic training, to obtain the category of the text to be processed;
a loop pruning processing module, configured to perform the following loop processing on the text to be processed until all sentences in the text to be processed have been deleted:
randomly deleting one not-yet-deleted sentence from the text to be processed to obtain a residual text;
processing the residual text by using the target text classification model to obtain the category of the residual text;
judging whether the category of the residual text is the same as the category of the text to be processed, and if not, restoring the deleted sentence to the text to be processed;
and an abstract obtaining module, configured to take the residual text obtained after the loop processing ends as a target text abstract.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN201910753710.4A 2019-08-15 2019-08-15 Text abstract extraction method, device, computer equipment and storage medium Active CN110750637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910753710.4A CN110750637B (en) 2019-08-15 2019-08-15 Text abstract extraction method, device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN110750637A CN110750637A (en) 2020-02-04
CN110750637B true CN110750637B (en) 2024-05-24

Family

ID=69275839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910753710.4A Active CN110750637B (en) 2019-08-15 2019-08-15 Text abstract extraction method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110750637B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667815A (en) * 2020-12-30 2021-04-16 北京捷通华声科技股份有限公司 Text processing method and device, computer readable storage medium and processor
CN113761175A (en) * 2021-02-01 2021-12-07 北京沃东天骏信息技术有限公司 Text processing method and device, electronic equipment and storage medium
CN113033216B (en) * 2021-03-03 2024-05-28 东软集团股份有限公司 Text preprocessing method and device, storage medium and electronic equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1040267A (en) * 1996-07-26 1998-02-13 Nec Corp Document summary viewer
CN101446940A (en) * 2007-11-27 2009-06-03 北京大学 Method and device of automatically generating a summary for document set
CN106133772A (en) * 2013-12-18 2016-11-16 Google Inc. Annotating videos with entities based on comment summaries
KR20170089369A (en) * 2016-01-26 2017-08-03 주식회사 마커 Method for automatic summarizing document by user learning
WO2018036555A1 (en) * 2016-08-25 2018-03-01 Tencent Technology (Shenzhen) Co., Ltd. Session processing method and apparatus
CN109376242A (en) * 2018-10-18 2019-02-22 西安工程大学 Text classification algorithm based on Recognition with Recurrent Neural Network variant and convolutional neural networks
CN109492091A (en) * 2018-09-28 2019-03-19 科大国创软件股份有限公司 A kind of complaint work order intelligent method for classifying based on convolutional neural networks
CN110069624A (en) * 2019-04-28 2019-07-30 北京小米智能科技有限公司 Text handling method and device
KR20190090944A (en) * 2018-01-26 2019-08-05 주식회사 두유비 System and method for machine learning to sort sentence importance and generating summary sentence based on keyword importance


Also Published As

Publication number Publication date
CN110750637A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN110704633B Named entity recognition method, device, computer equipment and storage medium
CN110347835B (en) Text clustering method, electronic device and storage medium
CN110502608B (en) Man-machine conversation method and man-machine conversation device based on knowledge graph
CN110781276B (en) Text extraction method, device, equipment and storage medium
CN109815487B (en) Text quality inspection method, electronic device, computer equipment and storage medium
CN110362822B (en) Text labeling method, device, computer equipment and storage medium for model training
CN110750637B (en) Text abstract extraction method, device, computer equipment and storage medium
CN110750965B English text sequence labeling method, system and computer equipment
CN110334186B (en) Data query method and device, computer equipment and computer readable storage medium
CN111858843B (en) Text classification method and device
CN111783471B (en) Semantic recognition method, device, equipment and storage medium for natural language
CN110866115B (en) Sequence labeling method, system, computer equipment and computer readable storage medium
CN112052682A (en) Event entity joint extraction method and device, computer equipment and storage medium
CN112632278A (en) Labeling method, device, equipment and storage medium based on multi-label classification
CN110321426B (en) Digest extraction method and device and computer equipment
CN114238629A (en) Language processing method and device based on automatic prompt recommendation and terminal
CN112052305A (en) Information extraction method and device, computer equipment and readable storage medium
CN109461016B (en) Data scoring method, device, computer equipment and storage medium
CN111831920A (en) User demand analysis method and device, computer equipment and storage medium
CN111126056B (en) Method and device for identifying trigger words
CN113239702A (en) Intention recognition method and device and electronic equipment
CN114238602A (en) Dialogue analysis method, device, equipment and storage medium based on corpus matching
CN112581297B (en) Information pushing method and device based on artificial intelligence and computer equipment
CN110705258A (en) Text entity identification method and device
CN116166858A (en) Information recommendation method, device, equipment and storage medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant