CN110362674B - Microblog news abstract extraction type generation method based on convolutional neural network - Google Patents
- Publication number
- CN110362674B (application CN201910650915.XA)
- Authority
- CN
- China
- Prior art keywords
- abstract
- data set
- text
- content
- news
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a microblog news abstract extractive generation method based on a convolutional neural network, which relates to the field of natural language processing and comprises the following steps: capturing microblog website content as an initial news data set Q by using a data acquisition module; processing the news data set Q to obtain a data set Q'; constructing a convolutional neural network to extract event elements from the processed news data set Q' and obtain the abstract content S; and further processing the abstract content S with a text similarity algorithm and a maximal marginal relevance (MMR) model to obtain the extracted abstract text summary. The method enables news workers and others to further and rapidly analyze and retrieve the generated abstract content, removes semantically repeated content with the text similarity algorithm, and balances the relevance and diversity of the extracted content with the MMR model, yielding a more comprehensive and accurate content abstract.
Description
Technical Field
The invention relates to the field of natural language processing, in particular to a microblog news abstract extraction type generation method based on a convolutional neural network.
Background
Automatic text generation is an important research direction in the field of natural language processing and has broad application prospects: it can be applied to human-computer interaction tasks such as intelligent question answering and machine translation, and automatic text generation systems can also be used for automatic writing of news manuscripts, library retrieval, and the like.
In the fields of natural language processing and artificial intelligence, automatic text generation technology already has several influential achievements and applications. For example, the Associated Press has used news-writing software to automatically write news manuscripts reporting company earnings since July 2014, which greatly reduces the workload of journalists.
The key technology in automatic text generation is text summary generation: a given document or document set is automatically analyzed, its key information is extracted, and a short summary is finally output. Current text summarization methods fall into two main categories: extractive and abstractive. The extractive approach is mainly based on sentence extraction, that is, the sentences of the original text are taken as units to be scored and extracted. The abstractive approach generally performs syntactic and semantic analysis of the text with natural language understanding techniques, fuses the information, and generates new summary sentences with natural language generation techniques.
Among prior art documents, the summary generation system based on a deep neural network proposed in patent CN201610232659.9 and the summary generation system based on deep learning and an attention mechanism proposed in patent CN201811416029.2 are both abstractive. The summaries they generate contain only partial keywords, so a correct word order cannot be formed and the performance of the generated summaries is not satisfactory.
Disclosure of Invention
The invention aims to provide a microblog news abstract extraction type generation method based on a convolutional neural network, so that the problems in the prior art are solved.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a microblog news abstract extraction type generation method based on a convolutional neural network comprises the following steps:
s1, capturing microblog website contents as an initial news data set Q by using a data acquisition module;
s2, processing the news data set Q to obtain a data set Q';
s3, constructing a convolutional neural network to extract event elements from the processed news data set Q' to obtain abstract content S;
and S4, further processing the abstract content S by using a text similarity algorithm and a maximal marginal relevance (MMR) model to obtain the extracted abstract text summary.
Preferably, the processing of the news data set Q in step S2 consists of filtering, merging of similar items and deduplication, and specifically comprises:
S21, traversing all samples of the news data set Q and removing pictures, videos and HTML tags to obtain the news data set Q_tmp;
S22, traversing all samples of the news data set Q_tmp obtained in step S21, extracting the time and place of each sample, and recording them as a time-place label matrix T = {(t_i, loc_i) | i = 1, 2, ..., N_tmp}, wherein t_i is the time value, loc_i is the place value, and N_tmp is the total number of samples;
S23, traversing the label matrix T obtained in step S22 and merging the samples of the news data set Q_tmp whose label vectors are identical, obtaining the news data set Q' = {q'_1, q'_2, ..., q'_M}, wherein M is the total number of samples.
Preferably, step S3 specifically includes:
S31, traversing all samples of the news data set Q', performing single-sentence segmentation and manual labeling on the samples to obtain a model data set D = {(c_j, l_j) | j = 1, 2, ..., K},
wherein l_j is the label of the text single sentence c_j obtained after segmentation, l_j ∈ {time, place, event description, cause, passage, result}, and K is the total number of single sentences in the model data set;
S32, extracting the feature vector of each text single sentence in the model data set D to obtain the news data set feature matrix;
S33, constructing a convolutional neural network, denoted TextCNN, whose structure comprises a convolutional layer, a max-pooling layer, two fully-connected layers and a softmax layer;
S34, randomly dividing the feature data of the model data set D into a training set, a test set and a validation set in the ratio 4:2:1;
S35, training the convolutional neural network TextCNN constructed in step S33 with the training set and validation set divided in step S34 to obtain the trained network Model;
and S36, extracting a summary from the test set of step S34 with the Model obtained in step S35, obtaining a set of text single sentences comprising only time, place, event description, passage, cause and result, which is recorded as the abstract content S.
Preferably, step S32 specifically includes:
1) extracting the TF-IDF features of the text single sentence c_1 of the model data set D to obtain the weight matrix W_1 = diag(w_1, w_2, ..., w_n),
wherein w_i is the TF-IDF feature value of the i-th word of the text single sentence c_1 with respect to its vocabulary V_1 = {v_1, v_2, ..., v_n}, and n is the total number of words of the text single sentence c_1;
2) extracting the Word2Vec features of the vocabulary V_1 to obtain the feature matrix F_{n×m} of the text single sentence c_1,
wherein the i-th row f_i is the Word2Vec feature vector of the i-th word of the vocabulary V_1, and m is the dimension of the feature vector, here taken as m = 300;
3) combining the weight matrix W_1 obtained in step 1) with the feature matrix F_{n×m} obtained in step 2) to obtain the weighted feature matrix F' = W_1 · F_{n×m} of the text single sentence c_1;
4) normalizing the feature matrix F' obtained in step 3) row by row to obtain the normalized feature matrix F̃;
5) traversing the model data set D and repeating steps 1) to 4) to obtain the feature set {(F̃_j, l_j) | j = 1, 2, ..., K} of the model data set, wherein l_j is the j-th label of the model data set and K is the total number of single sentences of the model data set.
Preferably, step S4 specifically includes:
S41, traversing all text single sentences in the abstract content S and calculating the cosine similarity value between each pair of text single sentences;
S42, filtering out of the abstract content S the single sentences whose cosine similarity value exceeds a given threshold, obtaining the deduplicated abstract content S';
S43, processing the abstract content S' with the maximal marginal relevance (MMR) model to obtain the extracted abstract text.
Preferably, step S43 specifically includes:
(1) traversing the text single sentences of the abstract content S' and obtaining a candidate abstract sentence s by the following formula;
(2) adding the candidate abstract sentence s obtained in the above step to the candidate abstract set summary;
(3) repeating steps (1) to (2) C times to obtain the candidate abstract set summary, i.e. the extracted abstract text, wherein C is a positive integer no greater than the total number of sentences in S'.
Preferably, the formula adopted in step (1) is:
s = argmax_{s_i ∈ S'\summary} [ λ · sim_1(s_i, S') − (1 − λ) · max_{s_j ∈ summary} sim_2(s_i, s_j) ],
wherein λ takes the value 0.9, sim_1(s_i, S') denotes the cosine similarity between sentence s_i of the abstract content S' and the whole abstract content S', and sim_2(s_i, s_j) denotes the cosine similarity between sentence s_i and a sentence s_j of the candidate abstract set summary, the set summary being initially empty.
Preferably, the data collection module in step S1 is a real-time crawler module.
The invention has the beneficial effects that:
the microblog news abstract extraction type generation method based on the convolutional neural network has the following advantages:
1. The microblog news abstract extractive generation method based on a convolutional neural network extracts the microblog news content as a summary; the extracted sentences have better readability, making it convenient for news workers and others to further rapidly analyze and retrieve the generated abstract content.
2. The abstract extraction method adopts TF-IDF-weighted Word2Vec word vectors and further uses a convolutional neural network that comprehensively considers multiple sentence features to classify sentences by importance, completing the extraction of content covering the six news elements, namely time, place, event description, passage, cause and result, and thereby completing summary generation.
3. The invention adopts a text similarity algorithm to remove semantically repeated content and a maximal marginal relevance (MMR) model to balance the relevance and diversity of the extracted content, obtaining a more comprehensive and accurate content abstract.
Drawings
FIG. 1 is a flowchart of the extractive abstract generation method in embodiment 1 of the present invention;
fig. 2 is a schematic diagram of a convolutional neural network in embodiment 1 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Example 1
The embodiment provides a convolutional neural network-based microblog news abstract extractive generation method, as shown in FIG. 1, which includes the following steps:
S1, capturing microblog website content as an initial news data set with a real-time crawler module and recording it as the news data set Q = {q_1, q_2, ..., q_N}, wherein q_i is the i-th sample of the news data set, i = 1, 2, ..., N, and N is the total number of samples of the news data set;
S2, filtering the news data set Q, merging similar items and removing duplicates to obtain the data set Q'. The specific steps are as follows:
S21, traversing all samples of the news data set Q and removing pictures, videos and HTML tags to obtain the news data set Q_tmp;
S22, traversing all samples of the news data set Q_tmp obtained in step S21, extracting the time and place of each sample, and recording them as a time-place label matrix T = {(t_i, loc_i) | i = 1, 2, ..., N_tmp}, wherein t_i is the time value, loc_i is the place value, and N_tmp is the total number of samples;
S23, traversing the label matrix T obtained in step S22 and merging the samples of the news data set Q_tmp whose label vectors are identical, obtaining the news data set Q' = {q'_1, q'_2, ..., q'_M}, wherein M is the total number of samples.
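The cleaning and merging steps S21 to S23 above can be sketched in Python as follows. The HTML-stripping regex, the toy date and place patterns and the (time, place) merge key are illustrative assumptions; the patent does not specify how t_i and loc_i are extracted from each sample.

```python
# Sketch of steps S21-S23: clean crawled samples, label each with a
# (time, place) pair, and merge samples whose label vectors coincide.
import re

def clean_sample(text):
    """S21: strip HTML tags (pictures/videos arrive as tags in crawled text)."""
    return re.sub(r"<[^>]+>", "", text).strip()

def extract_time_place(text):
    """Toy stand-in for the patent's time/place extraction (t_i, loc_i)."""
    time = re.search(r"\d{4}-\d{2}-\d{2}", text)
    place = re.search(r"in ([A-Z][a-z]+)", text)
    return (time.group() if time else None,
            place.group(1) if place else None)

def merge_by_label(samples):
    """S22-S23: group samples by their (time, place) label vector."""
    groups = {}
    for s in samples:
        cleaned = clean_sample(s)
        groups.setdefault(extract_time_place(cleaned), []).append(cleaned)
    # Q' keeps one merged sample per distinct label vector
    return [" ".join(g) for g in groups.values()]

samples = ["<img src='a.jpg'>Flood in Wuhan on 2019-07-18.",
           "Flood in Wuhan on 2019-07-18. Rescue under way.",
           "Storm in Beijing on 2019-07-17."]
q_prime = merge_by_label(samples)
```

Here the first two samples share the label vector ("2019-07-18", "Wuhan") and are merged into one element of Q'.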
S3, constructing a convolutional neural network to extract event elements from the processed news data set Q' to obtain abstract content S, and the specific steps are as follows:
S31, traversing all samples of the news data set Q', performing single-sentence segmentation and manual labeling on the samples to obtain a model data set D = {(c_j, l_j) | j = 1, 2, ..., K},
wherein l_j is the label of the text single sentence c_j obtained after segmentation, l_j ∈ {time, place, event description, cause, passage, result}, and K is the total number of single sentences in the model data set;
S32, extracting the feature vector of each text single sentence in the model data set D to obtain the news data set feature matrix:
1) extracting the TF-IDF features of the text single sentence c_1 of the model data set D to obtain the weight matrix W_1 = diag(w_1, w_2, ..., w_n),
wherein w_i is the TF-IDF feature value of the i-th word of the text single sentence c_1 with respect to its vocabulary V_1 = {v_1, v_2, ..., v_n}, and n is the total number of words of the text single sentence c_1;
2) extracting the Word2Vec features of the vocabulary V_1 to obtain the feature matrix F_{n×m} of the text single sentence c_1,
wherein the i-th row f_i is the Word2Vec feature vector of the i-th word of the vocabulary V_1, and m is the dimension of the feature vector, here taken as m = 300;
3) combining the weight matrix W_1 obtained in step 1) with the feature matrix F_{n×m} obtained in step 2) to obtain the weighted feature matrix F' = W_1 · F_{n×m} of the text single sentence c_1;
4) normalizing the feature matrix F' obtained in step 3) row by row to obtain the normalized feature matrix F̃;
5) traversing the model data set D and repeating steps 1) to 4) to obtain the feature set {(F̃_j, l_j) | j = 1, 2, ..., K} of the model data set, wherein l_j is the j-th label of the model data set and K is the total number of single sentences of the model data set.
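Steps 1) to 4) above can be sketched with NumPy as follows. Real Word2Vec vectors (m = 300) and corpus-level TF-IDF values are replaced by small random stand-ins; only the diagonal weighting and row normalization follow the scheme described above.

```python
# Sketch of step S32: TF-IDF-weighted Word2Vec sentence features.
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 300                      # n words in sentence c_1, m-dim embeddings

w = rng.random(n)                  # stand-in TF-IDF values w_1..w_n
W = np.diag(w)                     # weight matrix W_1 = diag(w_1, ..., w_n)
F = rng.normal(size=(n, m))        # stand-in Word2Vec matrix F_{n x m}

F_weighted = W @ F                 # step 3): each row scaled by its TF-IDF weight
# step 4): row-wise L2 normalization
norms = np.linalg.norm(F_weighted, axis=1, keepdims=True)
F_tilde = F_weighted / np.where(norms == 0, 1.0, norms)
```

After normalization every row of F̃ is a unit vector, so the later cosine-similarity steps reduce to dot products between rows.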
S33, constructing a convolutional neural network, denoted TextCNN, as shown in FIG. 2; the TextCNN network structure consists of a convolutional layer, a max-pooling layer, two fully-connected layers and a softmax layer.
In this embodiment, the convolutional layer has 256 convolution kernels of size 5, the activation function is the ReLU function, the fully-connected layer has 128 neurons, the learning rate is 0.001, and the dropout rate is 0.5;
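A minimal NumPy forward pass with the shapes named in this embodiment (256 kernels of width 5 over the word dimension, ReLU, global max pooling, a 128-unit fully-connected layer and a 6-way softmax for the six sentence labels) can be sketched as follows. The weights are random stand-ins, not the patent's trained model, and the sentence length is an arbitrary example.

```python
# Untrained forward pass through a TextCNN-shaped network.
import numpy as np

rng = np.random.default_rng(1)
n_words, emb = 20, 300                       # one sentence's feature matrix
x = rng.normal(size=(n_words, emb))

K1 = rng.normal(size=(256, 5, emb)) * 0.01   # 256 conv kernels of width 5
W1 = rng.normal(size=(256, 128)) * 0.01      # fully-connected layer 1
W2 = rng.normal(size=(128, 6)) * 0.01        # fully-connected layer 2 -> 6 labels

# convolution over word positions, then ReLU
conv = np.stack([[np.sum(K1[k] * x[i:i + 5]) for i in range(n_words - 4)]
                 for k in range(256)])       # shape (256, n_words - 4)
conv = np.maximum(conv, 0.0)
pooled = conv.max(axis=1)                    # global max pooling -> (256,)
h = np.maximum(pooled @ W1, 0.0)             # dense 128 + ReLU
logits = h @ W2
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax over the six classes
```

The dropout rate (0.5) and learning rate (0.001) quoted above apply only during training and are omitted from this inference-only sketch.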
S34, randomly dividing the feature data of the model data set D into a training set, a test set and a validation set in the ratio 4:2:1;
S35, training the convolutional neural network TextCNN constructed in step S33 with the training set and validation set divided in step S34 to obtain the trained network Model;
and S36, extracting a summary from the test set of step S34 with the Model obtained in step S35, obtaining a set of text single sentences comprising only time, place, event description, passage, cause and result, which is recorded as the abstract content S.
S4, further processing the abstract content S with a text similarity algorithm and a maximal marginal relevance (MMR) model to obtain the extracted abstract text summary. Step S4 specifically comprises the following steps:
S41, traversing all text single sentences in the abstract content S and calculating the cosine similarity value between each pair of text single sentences;
S42, filtering out of the abstract content S the single sentences whose cosine similarity value exceeds a given threshold, obtaining the deduplicated abstract content S';
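The pairwise-similarity deduplication of steps S41 and S42 can be sketched as follows. The 0.8 threshold is an assumption for illustration; the patent only states that sentences above a similarity threshold are filtered out.

```python
# Sketch of steps S41-S42: cosine similarity over sentence vectors and
# greedy removal of near-duplicate sentences.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dedup(sent_vecs, threshold=0.8):
    """Keep a sentence only if it is not too similar to any kept one."""
    keep = []
    for i, v in enumerate(sent_vecs):
        if all(cosine(v, sent_vecs[j]) < threshold for j in keep):
            keep.append(i)
    return keep

vecs = [np.array([1.0, 0.0]),   # sentence 0
        np.array([0.99, 0.1]),  # near-duplicate of sentence 0
        np.array([0.0, 1.0])]   # dissimilar sentence
kept = dedup(vecs)
```

With these toy vectors the second sentence is dropped as a near-duplicate of the first.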
S43, processing the abstract content S' obtained in the above steps with the maximal marginal relevance (MMR) model to obtain the extracted abstract text.
Step S43 specifically includes:
(1) traversing the text single sentences of the abstract content S' and obtaining a candidate abstract sentence s by the following formula:
s = argmax_{s_i ∈ S'\summary} [ λ · sim_1(s_i, S') − (1 − λ) · max_{s_j ∈ summary} sim_2(s_i, s_j) ],
wherein λ takes the value 0.9, sim_1(s_i, S') denotes the cosine similarity between sentence s_i of the abstract content S' and the whole abstract content S', and sim_2(s_i, s_j) denotes the cosine similarity between sentence s_i and a sentence s_j of the candidate abstract set summary, the set summary being initially empty.
(2) adding the candidate abstract sentence s obtained in the above step to the candidate abstract set summary;
(3) repeating steps (1) to (2) C times to obtain the candidate abstract set summary, i.e. the extracted abstract text, wherein C is a positive integer no greater than the total number of sentences in S'.
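The selection loop of steps (1) to (3) can be sketched as follows, with λ = 0.9 as in the embodiment. Scoring each sentence against the centroid of the whole deduplicated content is one plausible reading of sim_1 (the patent only says "the whole abstract content"), so treat it as an assumption.

```python
# Sketch of step S43: maximal marginal relevance (MMR) selection.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(sent_vecs, C, lam=0.9):
    centroid = np.mean(sent_vecs, axis=0)          # proxy for the whole content
    chosen = []                                    # the candidate set 'summary'
    while len(chosen) < C:
        best, best_score = None, -np.inf
        for i, v in enumerate(sent_vecs):
            if i in chosen:
                continue
            redundancy = max((cosine(v, sent_vecs[j]) for j in chosen),
                             default=0.0)          # summary starts empty
            score = lam * cosine(v, centroid) - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen

vecs = [np.array([1.0, 0.2]),
        np.array([0.9, 0.3]),
        np.array([0.1, 1.0])]
order = mmr_select(vecs, C=2)
```

The redundancy term penalizes sentences close to ones already selected, which is how the model balances relevance against diversity.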
By adopting the technical scheme disclosed by the invention, the following beneficial effects are obtained:
1. The microblog news abstract extractive generation method based on a convolutional neural network extracts the microblog news content as a summary; the extracted sentences have better readability, making it convenient for news workers and others to further rapidly analyze and retrieve the generated abstract content.
2. The abstract extraction method adopts TF-IDF-weighted Word2Vec word vectors and further uses a convolutional neural network that comprehensively considers multiple sentence features to classify sentences by importance, completing the extraction of content covering the six news elements, namely time, place, event description, passage, cause and result, and thereby completing summary generation.
3. The invention adopts a text similarity algorithm to remove semantically repeated content and a maximal marginal relevance (MMR) model to balance the relevance and diversity of the extracted content, obtaining a more comprehensive and accurate content abstract.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.
Claims (2)
1. A microblog news abstract extraction type generation method based on a convolutional neural network is characterized by comprising the following steps:
s1, capturing microblog website contents as an initial news data set Q by using a data acquisition module;
s2, processing the news data set Q to obtain a data set Q';
s3, constructing a convolutional neural network to extract event elements from the processed news data set Q' to obtain abstract content S;
S4, further processing the abstract content S by using a text similarity algorithm and a maximal marginal relevance (MMR) model to obtain the extracted abstract text summary;
in step S2, the processing of the news data set Q consists of filtering, merging of similar items and deduplication, and specifically comprises:
S21, traversing all samples of the news data set Q and removing pictures, videos and HTML tags to obtain the news data set Q_tmp;
S22, traversing all samples of the news data set Q_tmp obtained in step S21, extracting the time and place of each sample, and recording them as a time-place label matrix T = {(t_i, loc_i) | i = 1, 2, ..., N_tmp}, wherein t_i is the time value, loc_i is the place value, and N_tmp is the total number of samples;
S23, traversing the label matrix T obtained in step S22 and merging the samples of the news data set Q_tmp whose label vectors are identical, obtaining the news data set Q' = {q'_1, q'_2, ..., q'_M}, wherein M is the total number of samples;
step S3 specifically includes:
S31, traversing all samples of the news data set Q', performing single-sentence segmentation and manual labeling on the samples to obtain a model data set D = {(c_j, l_j) | j = 1, 2, ..., K},
wherein l_j is the label of the text single sentence c_j obtained after segmentation, l_j ∈ {time, place, event description, cause, passage, result}, and K is the total number of single sentences in the model data set;
S32, extracting the feature vector of each text single sentence in the model data set D to obtain the news data set feature matrix;
S33, constructing a convolutional neural network, denoted TextCNN, whose structure comprises a convolutional layer, a max-pooling layer, two fully-connected layers and a softmax layer;
S34, randomly dividing the feature data of the model data set D into a training set, a test set and a validation set in the ratio 4:2:1;
S35, training the convolutional neural network TextCNN constructed in step S33 with the training set and validation set divided in step S34 to obtain the trained network Model;
S36, extracting a summary from the test set of step S34 with the Model obtained in step S35, obtaining a set of text single sentences comprising only time, place, event description, passage, cause and result, which is recorded as the abstract content S;
step S32 specifically includes:
1) extracting the TF-IDF features of the text single sentence c_1 of the model data set D to obtain the weight matrix W_1 = diag(w_1, w_2, ..., w_n),
wherein w_i is the TF-IDF feature value of the i-th word of the text single sentence c_1 with respect to its vocabulary V_1 = {v_1, v_2, ..., v_n}, and n is the total number of words of the text single sentence c_1;
2) extracting the Word2Vec features of the vocabulary V_1 to obtain the feature matrix F_{n×m} of the text single sentence c_1,
wherein the i-th row f_i is the Word2Vec feature vector of the i-th word of the vocabulary V_1, and m is the dimension of the feature vector, here taken as m = 300;
3) combining the weight matrix W_1 obtained in step 1) with the feature matrix F_{n×m} obtained in step 2) to obtain the weighted feature matrix F' = W_1 · F_{n×m} of the text single sentence c_1;
4) normalizing the feature matrix F' obtained in step 3) row by row to obtain the normalized feature matrix F̃;
5) traversing the model data set D and repeating steps 1) to 4) to obtain the feature set {(F̃_j, l_j) | j = 1, 2, ..., K} of the model data set, wherein l_j is the j-th label of the model data set and K is the total number of single sentences of the model data set;
step S4 specifically includes:
S41, traversing all text single sentences in the abstract content S and calculating the cosine similarity value between each pair of text single sentences;
S42, filtering out of the abstract content S the single sentences whose cosine similarity value exceeds a given threshold, obtaining the deduplicated abstract content S';
S43, processing the abstract content S' with the maximal marginal relevance (MMR) model to obtain the extracted abstract text;
step S43 specifically includes:
(1) traversing the text single sentences of the abstract content S' and obtaining a candidate abstract sentence s by a formula;
(2) adding the candidate abstract sentence s obtained in the above step to the candidate abstract set summary;
(3) repeating steps (1) to (2) C times to obtain the candidate abstract set summary, i.e. the extracted abstract text, wherein C is a positive integer no greater than the total number of sentences in S';
the formula adopted in step (1) is:
s = argmax_{s_i ∈ S'\summary} [ λ · sim_1(s_i, S') − (1 − λ) · max_{s_j ∈ summary} sim_2(s_i, s_j) ],
wherein λ takes the value 0.9, sim_1(s_i, S') denotes the cosine similarity between sentence s_i of the abstract content S' and the whole abstract content S', and sim_2(s_i, s_j) denotes the cosine similarity between sentence s_i and a sentence s_j of the candidate abstract set summary, the set summary being initially empty.
2. The extraction-type generation method of microblog news digests based on the convolutional neural network as claimed in claim 1, wherein the data acquisition module in the step S1 is a real-time crawler module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910650915.XA CN110362674B (en) | 2019-07-18 | 2019-07-18 | Microblog news abstract extraction type generation method based on convolutional neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910650915.XA CN110362674B (en) | 2019-07-18 | 2019-07-18 | Microblog news abstract extraction type generation method based on convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110362674A CN110362674A (en) | 2019-10-22 |
CN110362674B true CN110362674B (en) | 2020-08-04 |
Family
ID=68221249
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910650915.XA Active CN110362674B (en) | 2019-07-18 | 2019-07-18 | Microblog news abstract extraction type generation method based on convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110362674B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110933518B (en) * | 2019-12-11 | 2020-10-02 | 浙江大学 | Method for generating query-oriented video abstract by using convolutional multi-layer attention network mechanism |
CN111191413B (en) * | 2019-12-30 | 2021-11-12 | 北京航空航天大学 | Method, device and system for automatically marking event core content based on graph sequencing model |
CN111274776B (en) * | 2020-01-21 | 2020-12-15 | 中国搜索信息科技股份有限公司 | Article generation method based on keywords |
CN111507090A (en) * | 2020-02-27 | 2020-08-07 | 平安科技(深圳)有限公司 | Abstract extraction method, device, equipment and computer readable storage medium |
CN111639176B (en) * | 2020-05-29 | 2022-07-01 | 厦门大学 | Real-time event summarization method based on consistency monitoring |
CN111859887A (en) * | 2020-07-21 | 2020-10-30 | 北京北斗天巡科技有限公司 | Scientific and technological news automatic writing system based on deep learning |
TR202022040A1 (en) * | 2020-12-28 | 2022-07-21 | Sestek Ses Ve Iletisim Bilgisayar Tek San Tic A S | A METHOD OF MEASURING TEXT SUMMARY SUCCESS THAT IS SENSITIVE TO SUBJECT CLASSIFICATION AND A SUMMARY SYSTEM USING THIS METHOD |
CN112883716B (en) * | 2021-02-03 | 2022-05-03 | 重庆邮电大学 | Twitter abstract generation method based on topic correlation |
CN112906382B (en) * | 2021-02-05 | 2022-06-21 | 山东省计算中心(国家超级计算济南中心) | Policy text multi-label labeling method and system based on graph neural network |
CN112989031B (en) * | 2021-04-28 | 2021-08-03 | 成都索贝视频云计算有限公司 | Broadcast television news event element extraction method based on deep learning |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104834735B (en) * | 2015-05-18 | 2018-01-23 | 大连理工大学 | A kind of documentation summary extraction method based on term vector |
CN106055658A (en) * | 2016-06-02 | 2016-10-26 | 中国人民解放军国防科学技术大学 | Extraction method aiming at Twitter text event |
US10706349B2 (en) * | 2017-05-25 | 2020-07-07 | Texas Instruments Incorporated | Secure convolutional neural networks (CNN) accelerator |
CN109977219B (en) * | 2019-03-19 | 2021-04-09 | 国家计算机网络与信息安全管理中心 | Text abstract automatic generation method and device based on heuristic rule |
- 2019
- 2019-07-18: CN CN201910650915.XA, patent CN110362674B (en), active
Also Published As
Publication number | Publication date |
---|---|
CN110362674A (en) | 2019-10-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110362674B (en) | Microblog news abstract extraction type generation method based on convolutional neural network | |
CN110413986B (en) | Text clustering multi-document automatic summarization method and system for improving word vector model | |
CN111914558B (en) | Course knowledge relation extraction method and system based on sentence bag attention remote supervision | |
CN111966917B (en) | Event detection and summarization method based on pre-training language model | |
CN110825877A (en) | Semantic similarity analysis method based on text clustering | |
Bisandu et al. | Clustering news articles using efficient similarity measure and N-grams | |
CN113569050B (en) | Method and device for automatically constructing government affair field knowledge map based on deep learning | |
CN108268875B (en) | Image semantic automatic labeling method and device based on data smoothing | |
CN107480200A (en) | Word mask method, device, server and the storage medium of word-based label | |
CN112667940B (en) | Webpage text extraction method based on deep learning | |
CN115718792A (en) | Sensitive information extraction method based on natural semantic processing and deep learning | |
CN112131453A (en) | Method, device and storage medium for detecting network bad short text based on BERT | |
CN111597423B (en) | Performance evaluation method and device of interpretable method of text classification model | |
CN114492425B (en) | Method for communicating multi-dimensional data by adopting one set of field label system | |
CN112685549B (en) | Document-related news element entity identification method and system integrating discourse semantics | |
Thilagavathi et al. | Document clustering in forensic investigation by hybrid approach | |
CN115017404A (en) | Target news topic abstracting method based on compressed space sentence selection | |
CN110019814B (en) | News information aggregation method based on data mining and deep learning | |
Jadhav et al. | Unstructured big data information extraction techniques survey: Privacy preservation perspective | |
CN113326371A (en) | Event extraction method fusing pre-training language model and anti-noise interference remote monitoring information | |
Zeng et al. | Fake news detection by using common latent semantics matching method | |
CN112765940A (en) | Novel webpage duplicate removal method based on subject characteristics and content semantics | |
Souvannavong et al. | Latent semantic indexing for semantic content detection of video shots | |
Labanan et al. | A Study on the Usability of Text Analysis on Web Artifacts for Digital Forensic Investigation | |
Nakanishi | Semantic Waveform Model for Similarity Measure by Time-series Variation in Meaning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |