CN110019814B - News information aggregation method based on data mining and deep learning - Google Patents


Info

Publication number
CN110019814B
Authority
CN
China
Prior art keywords
text
words
texts
similarity
word
Prior art date
Legal status
Active
Application number
CN201810743949.9A
Other languages
Chinese (zh)
Other versions
CN110019814A (en)
Inventor
翁健
黄芝琪
李文灏
陈杰彬
罗伟其
张悦
Current Assignee
Jinan University
Original Assignee
Jinan University
Priority date: 2018-07-09
Filing date: 2018-07-09
Publication date: 2021-07-27
Application filed by Jinan University
Priority to CN201810743949.9A
Publication of CN110019814A
Application granted
Publication of CN110019814B
Status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/34 - Browsing; Visualisation therefor
    • G06F16/345 - Summarisation for human users
    • G06F16/35 - Clustering; Classification
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/247 - Thesauruses; Synonyms
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 - Semantic analysis

Abstract

The invention discloses a news information aggregation method based on data mining and deep learning. The method uses a crawler to capture data from news portal websites over the same time period to obtain news and comment information; it then classifies and deduplicates the news by applying a vector space model, TF-IDF weighting, a synonym-forest method, and the cosine distance measure, aggregating news items with the same content; comments are summarized through a text-summarization algorithm; and finally an abstract of each article is generated automatically by a deep neural network model. The method allows readers to acquire the content and reader comments of all news platforms efficiently and quickly.

Description

News information aggregation method based on data mining and deep learning
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method for news information aggregation, abstract generation, and comment summarization.
Background
In the Internet era, the volume of daily information grows explosively, and news is one of the main ways people acquire information. Unlike traditional print news, online news spreads widely, reaches a large audience, updates quickly, and costs far less to operate, so it has been widely accepted by society. For readers, online news is cheap to read, rich in content, and time-saving, and readers can choose the content that interests them rather than being limited to the fixed content of a traditional newspaper. In addition, almost all news sites provide a platform where readers can speak and discuss freely. For popular events, the main content of reader comments can reflect the direction of public opinion, and a number of companies analyzing online public opinion have emerged. Popular news and comments are also the content most readers prefer to read.
At the same time, news platforms are numerous and the quality of their content is uneven, which causes problems for readers: news describing the same content is scattered across different platforms in different forms of expression, and the inconsistent operation of the platforms harms the reading experience. Therefore, how to find useful information among news platforms with inconsistent, multi-form content, generate abstracts of it, and summarize reader comments, so that readers can read efficiently, is a problem to be solved.
In existing text similarity identification methods, text similarity is usually represented directly by the angle between feature weight vectors. Although such systems work well in most cases, the method lacks effective handling of synonyms. In addition, conventional aggregation systems use the same extractive method for both news abstracts and reader comments, pulling out key content for display. The extractive method performs well syntactically, and works especially well on reader comments because they are short and refined, but for whole articles the extraction quality and fluency are poor, and the result lacks clear writing logic.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art by providing a news information aggregation method based on data mining and deep learning: it improves the feature-weight-vector included-angle method by combining it with a synonym-forest method, generates abstracts for news content with a deep learning method, and summarizes comments with an extractive method.
The purpose of the invention is realized by the following technical scheme: a news information aggregation method based on data mining and deep learning comprises the following steps:
1. adopting a crawler framework to crawl news and comments from specified website platforms;
2. classifying all news: the content of each news item is classified by combining a vector space model, cosine similarity, the TF-IDF algorithm, and a synonym-forest method;
3. generating article abstracts: a text abstract is generated with a deep neural network structure;
4. summarizing the comments corresponding to each news item: the comment text is preprocessed, and key comments are extracted directly with the TF-IDF algorithm to summarize the text.
Preferably, the data crawl is performed using the Scrapy crawler framework.
Preferably, classifying news is equivalent to the following problem: given two texts, judge whether their contents are the same;
a text is treated as a space vector \vec{d}: each word in the text represents one dimension of the vector space, and the number of times the word appears in the text is the length of \vec{d} along that dimension, so that the text is completely converted into a vector in the space;
assuming there are n texts, there are n such vectors, and the space in which they lie is formed by the dimensions represented by all distinct words in the n texts; to judge whether two texts are similar, the cosine of the angle between the corresponding vectors is calculated: the closer the value is to 1, the more similar the two texts, and the closer to 0, the more dissimilar;
suppose two vectors \vec{a} and \vec{b} are both n-dimensional; the cosine of the angle between them is calculated as:

cos(\vec{a}, \vec{b}) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}}
the text content is considered identical if the calculated similarity exceeds a set threshold.
Further, when judging whether two texts are similar, one of them is selected as the reference text, k keywords of the reference text are selected according to the TF-IDF weights of its words, and a vector space R_k is established with those keywords as its dimensions; the occurrences of each keyword in the two texts are counted separately to form the corresponding k-dimensional vectors \vec{v_1} and \vec{v_2}; the cosine of the angle between the two vectors is calculated, and when the value is greater than a set threshold, the texts corresponding to the two vectors are considered similar;
when calculating with the TF-IDF model, suppose a word is denoted a, the number of occurrences of a in text i is n_{a,i}, the total number of words in text i is N_i, the number of all texts is D, and a occurs in d_a of them; the weight of this word in text i is:

w_{a,i} = \frac{n_{a,i}}{N_i} \times \log\frac{D}{d_a}

The larger w_{a,i} is, the more important a is in text i; the text is segmented into words, the TF-IDF values of all distinct words are calculated and sorted from largest to smallest, and the first X words are taken as keywords.
Furthermore, synonyms and near-synonyms are considered: the degree of synonymy between words is measured by word similarity, a numerical value whose range is set to [0, 1] and which is computed on the basis of the synonym forest; when the similarity between two words exceeds a certain threshold, the two words are judged to be the same. On this basis, after the word a and its synonyms are obtained by the synonym-forest method, the word-weight formula is modified: the definition of d_a changes from "the number of texts in which the word a appears" to "the number of texts in which the word a or one of its synonyms appears", while the formula itself remains unchanged:

w_{a,i} = \frac{n_{a,i}}{N_i} \times \log\frac{D}{d_a}
specifically, the word similarity calculation method includes:
in Chinese, one word often expresses many meanings, i.e. has several sense items; all sense items are considered when calculating word similarity. The calculation of sense-item similarity is based on the structure of the synonym forest, using the codes of the sense items within the forest to compute according to the semantic distance between two sense items;
first determine at which layer the two sense items, as leaf nodes of the synonym forest, branch apart, i.e. at which layer their codes first differ; starting from layer 1, multiply by 1 wherever the codes are the same, and at the branching layer multiply by the corresponding coefficient; then multiply by

\cos\left(n \times \frac{\pi}{180}\right)

as a normalization that keeps the sense-item similarity within [0, 1], where n is the total number of nodes of the branching layer;
the density of the tree where the words are located and the number of branches directly influence the similarity, and a similarity value at higher density is more accurate than one at lower density, so a further parameter (n-k+1)/n is multiplied in, where n is the total number of nodes of the branching layer and k is the distance between the two branches;
assuming the codes of the two sense items differ at layer S, with corresponding coefficient s, and letting Sim denote the similarity of the two sense items A and B:

Sim(A, B) = s \times \cos\left(n \times \frac{\pi}{180}\right) \times \frac{n-k+1}{n}
when calculating the similarity of two words, their sense items are compared pairwise, and the maximum value is taken as the similarity of the two words.
Preferably, the text abstract is generated by means of a deep neural network structure using the Seq2Seq technique, also called the Encoder-Decoder architecture, in which the Encoder and the Decoder are each composed of several layers of RNN/LSTM units; the Encoder encodes the original text into a vector C, and the Decoder extracts information from the vector C, obtains the semantics, and generates the text abstract.
Preferably, the specific implementation of comment summarization comprises the following steps:
a) obtaining comment contents in real time, segmenting words of the text, and counting the times of occurrence of all the words respectively;
b) selecting N words with the highest word frequency as keywords;
c) dividing the text into sentences, calculating the number of keywords in each sentence, and dividing the number of keywords by the length of the sentence to obtain a value as a weight value of the sentence;
d) the sentences with the largest weights are spliced in their order of appearance in the text to form the summary text for output.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The system provided by the invention can collect and classify information from different news platforms, avoiding the poor reading experience caused by fragmented and inconsistent information across platforms and reducing the impact of redundant news on readers.
2. In processing similar texts, the invention adopts the synonym-forest technique, which avoids the misjudgment of synonyms better than existing systems.
3. The invention uses deep learning to produce article abstracts, avoiding defects of existing abstract-generation methods such as poor readability and incoherent content, so that article abstracts and comment summaries come closer to standard natural language and readers can digest news content more efficiently.
4. The invention extracts and summarizes comments, completing the functionality of the news information aggregation system and further improving the efficiency with which readers read comments.
Drawings
Fig. 1 is a basic flowchart of an embodiment news information aggregation method.
FIG. 2 is a flowchart of an embodiment news classification step.
FIG. 3 is a flowchart illustrating summarizing news review content according to an embodiment.
FIG. 4 shows the 5-level structure of the synonym forest used in the embodiment.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example 1
A news information aggregation method based on data mining and deep learning comprises the following steps:
1. and crawling news and comments of a specified website platform, and performing data crawling on five news portal websites including a new wave, a search, an update, a network exchange and the world wide web by adopting a script crawler frame.
2. And classifying all news, and classifying the contents of each news by adopting methods such as a vector space model, a cos similarity value, a TF-IDF algorithm, a synonym forest and the like.
3. And generating an article abstract, and generating a text abstract by adopting an improved deep neural network structure.
4. And summarizing the comments corresponding to the news, and directly extracting key comments by adopting a TF-IDF algorithm through preprocessing the comment text to summarize the text.
Step one, data crawling.
The Scrapy crawler framework is used to crawl five news portal websites: Sina, Sohu, Tencent, NetEase, and Huanqiu. Scrapy is a Python data-crawling framework that can be used to crawl data from web pages and save it locally.
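As a concrete illustration of this step, the following is a minimal sketch of such a spider using Scrapy's public API. The site URL and CSS selectors are hypothetical placeholders, not taken from the patent; in practice each portal needs its own spider and selectors.

    import scrapy

    class NewsSpider(scrapy.Spider):
        name = "news"
        # hypothetical list page; each real portal needs its own start URLs
        start_urls = ["https://news.example.com/latest"]

        def parse(self, response):
            # follow every article link found on the list page
            for href in response.css("a.article-link::attr(href)").getall():
                yield response.follow(href, callback=self.parse_article)

        def parse_article(self, response):
            # the selectors below are assumptions; adapt them per portal
            yield {
                "title": response.css("h1::text").get(),
                "body": " ".join(response.css("div.content p::text").getall()),
                "comments": response.css("div.comment p::text").getall(),
            }

Running such a file with `scrapy runspider` and an output option (e.g. `-o news.json`) stores the scraped items locally, matching the crawl-and-save behavior described above.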
Step two: classify the obtained news.
(1) The main flow of classifying local news is shown in fig. 2.
(2) Vector space model and cosine similarity
To classify the downloaded news content, the problem that needs to be solved is: given two texts, how to judge whether their contents are approximately the same.
A text is treated as a space vector \vec{d}: each word in the text represents one dimension of the vector space, and the number of times the word appears in the text is the length of \vec{d} along that dimension, so that the text is completely converted into a vector in the space. For example, if the content of a text is "It is raining in Beijing today; I am very happy", the segmentation result is "today / Beijing / raining / , / I / very / happy", and the text can be regarded as a vector in a seven-dimensional space:

[1 1 1 1 1 1 1]^T

where the rows correspond in order to "today", "Beijing", "raining", ",", "I", "very", and "happy". The text is thus abstracted into a vector. Assuming there are now n texts, there are n such vectors, and the space in which these n vectors lie is made up of the dimensions represented by all the distinct words in the n texts. To judge whether two texts are similar, the cosine of the angle between the corresponding vectors is calculated: the closer the value is to 1, the more similar the two texts; the closer to 0, the more dissimilar.
For example, if there are three texts:
\vec{d_1}: the content of the first text is "No matter where we are, we all feel very happy"; the segmentation result is "no matter / we / at / what / place / , / we / all / feel / very / happy".

\vec{d_2}: the content of the second text is "We feel very happy no matter where we are"; the segmentation result is "we / no matter / at / what / place / all / feel / very / happy".

\vec{d_3}: the content of the third text is "Today we are very happy, because today is Friday"; the segmentation result is "today / we / very / happy / , / because / today / is / Friday".

The similarity between the three texts can now be judged. First, the words appearing across all the texts are collected (separated by "|"): no matter | we | at | what | place | , | all | feel | very | happy | today | because | is | Friday. There are fourteen words in total, so the three texts can each be represented by a fourteen-dimensional vector, each dimension recording the number of occurrences of the corresponding word. The vectors corresponding to the three texts are:

the first text:
\vec{d_1} = [1 2 1 1 1 1 1 1 1 1 0 0 0 0]^T

the second text:
\vec{d_2} = [1 1 1 1 1 0 1 1 1 1 0 0 0 0]^T

the third text:
\vec{d_3} = [0 1 0 0 0 1 0 0 1 1 2 1 1 1]^T

Then the similarity between the three texts is calculated. Suppose two vectors \vec{a} and \vec{b} are both n-dimensional; the cosine of the angle between them is:

cos(\vec{a}, \vec{b}) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}}

Cosine similarity of text one and text two:
cos(\vec{d_1}, \vec{d_2}) = 10 / (\sqrt{13} \times 3) ≈ 0.925

Cosine similarity of text one and text three:
cos(\vec{d_1}, \vec{d_3}) = 5 / (\sqrt{13} \times \sqrt{11}) ≈ 0.418

Cosine similarity of text two and text three:
cos(\vec{d_2}, \vec{d_3}) = 3 / (3 \times \sqrt{11}) ≈ 0.302
from the calculation result, it can be known that the similarity between the text one and the text two is the highest, and the similarity between the text two and the text three is the lowest. In practical applications, a threshold may be set, for example, if the calculated similarity exceeds 0.75, the text content is considered to be almost the same, and punctuation marks obtained during word segmentation or various words such as "yes" and "on" should be eliminated.
In the design of this algorithm, the similarity comparison of two texts needs to be optimized for performance. The algorithm above has a problem: when the number of texts is very large, the dimension of the corresponding vector space becomes very high, and the vector for a single text is 0 in most dimensions of the space, resulting in very high time complexity and space complexity.
To avoid the performance degradation caused by an excessively high vector-space dimension, the algorithm is improved as follows: when judging whether two texts are similar, one of them is selected as the reference text, k keywords of the reference text are selected according to the TF-IDF weights of its words, and a vector space R_k is established with those keywords as its dimensions. The occurrences of each keyword in the two texts are counted separately to form the corresponding k-dimensional vectors \vec{v_1} and \vec{v_2}, and the cosine of the angle between the two vectors is calculated; when the value is greater than a set threshold, the texts corresponding to the two vectors are considered similar (after many experiments, a threshold of 0.75 is preferred).
Still taking the three texts above as an example, suppose we judge whether text two and text three are similar to text one. Text one is the reference text; its segmentation result is "no matter / we / at / what / place / , / we / all / feel / very / happy", and the terms that appear are: no matter | we | at | what | place | , | all | feel | very | happy. There are 10 terms in total, so text one, text two, and text three are now described by three 10-dimensional vectors:

text one:
\vec{v_1} = [1 2 1 1 1 1 1 1 1 1]^T

text two:
\vec{v_2} = [1 1 1 1 1 0 1 1 1 1]^T

text three:
\vec{v_3} = [0 1 0 0 0 1 0 0 1 1]^T

Calculating the similarity:

cosine similarity of text one and text two:
cos(\vec{v_1}, \vec{v_2}) = 10 / (\sqrt{13} \times 3) ≈ 0.925

cosine similarity of text one and text three:
cos(\vec{v_1}, \vec{v_3}) = 5 / (\sqrt{13} \times 2) ≈ 0.693

If instead we judge whether text three is similar to text two, then text two is the reference text; its segmentation result is "we / no matter / at / what / place / all / feel / very / happy", and the terms that appear are: we | no matter | at | what | place | all | feel | very | happy. There are 9 terms in total, so text two and text three are described by two 9-dimensional vectors:

text two:
\vec{v_2} = [1 1 1 1 1 1 1 1 1]^T

text three:
\vec{v_3} = [1 0 0 0 0 0 0 1 1]^T

Calculating the similarity:

cosine similarity of text two and text three:
cos(\vec{v_2}, \vec{v_3}) = 3 / (3 \times \sqrt{3}) ≈ 0.577
for the classification of news, the optimization described above was used based on the following experience: if the story contents are the same, the keywords of the two news are almost the same. From the practical operation result, the optimization is worthy of affirmation, and the storage space and the operation time are saved.
(3) TF-IDF algorithm
The main idea of TF-IDF consists of two points: 1. the more times a word appears in a text, the more important it is in that text; 2. the less often a word appears across all texts, the more important it is. Here "all texts" means all the texts in a corpus. TF is the frequency of a word in one text, and IDF is the inverse document frequency, reflecting how rarely the word occurs across all texts.
When calculating with the TF-IDF model, suppose a word is denoted a, the number of occurrences of a in text i is n_{a,i}, the total number of words in text i is N_i, the number of all texts is D, and a occurs in d_a of them. The weight of this word in text i is:

w_{a,i} = \frac{n_{a,i}}{N_i} \times \log\frac{D}{d_a}

The larger w_{a,i} is, the more important a is in text i.
With this TF-IDF calculation, the keywords of a text can be extracted. The basic idea is to segment the text into words, calculate the TF-IDF values of all distinct words, sort the words, and take those with the largest values as keywords. How many words to take requires a balance in practice: too few keywords may fail to describe the main content of the text, while too many keywords cost more computing resources when computing text similarity. In this project, the number of keywords selected is 10.
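A minimal sketch of this keyword extractor, directly following the weight formula w_{a,i} = (n_{a,i}/N_i) × log(D/d_a); the corpus is represented as a list of token lists, and the function name is ours.

    import math
    from collections import Counter

    def extract_keywords(corpus, i, top_x=10):
        """Top-X TF-IDF keywords of text i within corpus (a list of token lists)."""
        D = len(corpus)
        doc = corpus[i]
        tf = Counter(doc)                 # n_{a,i} for every word a in text i
        N_i = len(doc)
        weights = {}
        for a, n_ai in tf.items():
            d_a = sum(1 for text in corpus if a in text)  # texts containing a
            weights[a] = (n_ai / N_i) * math.log(D / d_a)
        # sort by weight, descending, and keep the first X words
        ranked = sorted(weights, key=weights.get, reverse=True)
        return ranked[:top_x]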
(4) Improvement of TF-IDF algorithm based on synonym forest
The algorithm above successfully performs keyword extraction and similarity matching on many texts with high accuracy. In actual testing, however, a problem remains: after keywords are extracted by the TF-IDF algorithm, if word importance weights are calculated and similarity is judged only by whether the keywords are literally identical, synonyms are misjudged. For example, words such as "school", "college", and "university" may occur across several texts that plainly belong to one category, yet the algorithm above cannot group them, even though in some contexts these words need to be treated as the same word. Therefore the algorithm is optimized by computing word similarity on the basis of the synonym forest.
The synonym forest is a dictionary that contains not only the synonyms of a word but also a certain number of similar words, i.e. related words in a broad sense. It organizes all its entries in a tree-like hierarchy. The vocabulary is divided into large, medium, and small classes; each small class contains many words, which are grouped into word groups according to the closeness and relatedness of their senses, and the words in each group are further divided into lines, where words on the same line have identical or strongly related senses. For example, the Chinese words for soybean, green soybean, and yellow soybean lie on one line; the two Chinese words for tomato lie on one line; and the terms for hired, poor, lower-middle, upper-middle, and rich peasants also lie on one line.
The thesaurus classification of the synonym forest is hierarchical, with a 5-layer structure, as shown in fig. 4. As the level deepens, word senses become increasingly specific, and by layer 5 the number of words in each category is already small. The layer-5 category is selected to replace the keyword: because synonyms and related words share a layer-5 category, they are regarded there as the same word.
The degree of synonymy between words is measured by word similarity, a numerical value whose range is set to [0, 1]. A similarity of 1 means the two words can replace each other in any context; 0 means they are never interchangeable.
To calculate word similarity, sense-item similarity must be calculated first. In Chinese, a word often has many meanings, i.e. several sense items; for example, the word for "proud" can carry both a commendatory and a derogatory sense. Therefore all sense items are considered when calculating word similarity. The calculation of sense-item similarity is based on the structure of the synonym forest, using the codes of the sense items to compute the semantic distance between two sense items.
First determine at which layer the two sense items, as leaf nodes of the synonym forest, branch apart, i.e. at which layer their codes first differ. Suppose they branch at layer 4: judging from layer 1, multiply by 1 for each layer where the codes are the same, multiply by the corresponding coefficient at the branching layer, and then multiply by

\cos\left(n \times \frac{\pi}{180}\right)

as a normalization that keeps the sense-item similarity within [0, 1], where n is the total number of nodes of the branching layer.
The density of the tree where the words are located and the number of branches directly influence the similarity: a similarity value at higher density is more accurate than one at lower density. A further parameter (n-k+1)/n is therefore multiplied in, where n is the total number of nodes of the branching layer and k is the distance between the two branches. This refines the calculated value and makes the result more accurate.
Let the similarity of two sense items A and B be denoted Sim(A, B), and let X = (n-k+1)/n.

If the two words are not on the same tree:
Sim(A, B) = f

If they branch at layer 2, with coefficient a:
Sim(A, B) = a \times \cos\left(n \times \frac{\pi}{180}\right) \times X

If they branch at layer 3, with coefficient b:
Sim(A, B) = b \times \cos\left(n \times \frac{\pi}{180}\right) \times X

If they branch at layer 4, with coefficient c:
Sim(A, B) = c \times \cos\left(n \times \frac{\pi}{180}\right) \times X

If they branch at layer 5, with coefficient d:
Sim(A, B) = d \times \cos\left(n \times \frac{\pi}{180}\right) \times X
when the similarity of the words is calculated, the meaning items of the two words are respectively calculated pairwise, and the maximum value is taken as the similarity value of the two words.
Using the above method, similarity calculations were performed taking the Chinese word for "people" as an example; the results are shown in Table 1:
Table 1. Semantic similarity of the word "people" to other words (the table is rendered as an image in the source; its values are not recoverable)
The calculated semantic similarities are largely consistent with similarity as judged by human cognition and reflect objective reality: the algorithm accurately and objectively reflects the semantic relatedness between words and provides an effective measure of it.
After testing on many words, a threshold of 0.7 was adopted: when the similarity of two words is greater than 0.7, the two words are treated as approximately the same.
After the word a and its synonyms are obtained by the synonym-forest method, the word-weight formula is modified: the definition of d_a changes from "the number of texts in which the word a appears" to "the number of texts in which the word a or one of its synonyms appears", while the formula itself remains unchanged:

w_{a,i} = \frac{n_{a,i}}{N_i} \times \log\frac{D}{d_a}
from the practical operation result, the improvement can well alleviate the defect that the synonyms cannot be judged.
The news items being compared are replaced continually until all similar pairs are found: each news item of platform i is compared once against all news items of the other platforms; then the second news item of platform i is compared against them, and so on.
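A sketch of this comparison loop (the names are ours; `similarity` can be the keyword-space function sketched earlier):

    def find_same_event_pairs(platforms, similarity, threshold=0.75):
        """platforms: one list of news texts per platform; returns pairs of
        items from different platforms judged to have the same content."""
        pairs = []
        for i in range(len(platforms)):
            for j in range(i + 1, len(platforms)):  # each platform pair once
                for a in platforms[i]:
                    for b in platforms[j]:
                        if similarity(a, b) > threshold:
                            pairs.append((a, b))
        return pairs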
Step three: automatically generate the text abstract using a deep neural network.
Because a text abstract must be coherent and highly summarized, extractive methods cannot achieve a good effect; an abstractive text summary realized by a deep neural network structure solves this problem well. The invention adopts the Seq2Seq technique proposed by the Google Brain team, also called the Encoder-Decoder architecture, in which the Encoder and the Decoder are each composed of several layers of RNN/LSTM units: the Encoder encodes the original text into a vector C, and the Decoder extracts information from the vector C, obtains the semantics, and generates the text abstract.
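A minimal sketch of this Encoder-Decoder pair in PyTorch is shown below. The framework choice, layer counts, and sizes are illustrative assumptions; the patent specifies only that both sides consist of several layers of RNN/LSTM.

    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=128, hid_dim=256, layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=layers,
                                batch_first=True)

        def forward(self, src_ids):
            # the final hidden state plays the role of the vector C
            _, state = self.lstm(self.embed(src_ids))
            return state

    class Decoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=128, hid_dim=256, layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=layers,
                                batch_first=True)
            self.out = nn.Linear(hid_dim, vocab_size)

        def forward(self, tgt_ids, state):
            # decode from C, producing next-word logits at each step
            output, state = self.lstm(self.embed(tgt_ids), state)
            return self.out(output), state

During training, the Encoder consumes the article tokens and the Decoder is trained to emit the reference abstract token by token from the Encoder's final state; at inference time the Decoder generates the abstract greedily or with beam search.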
Step four, summarizing news comment contents.
(1) The main flow of summarizing and presenting news review content is shown in fig. 3.
(2) Concrete implementation of comment summarization
The comment content of a news item differs from the news text: comments are updated in real time, while the news text is essentially fixed once published. Therefore the news text is handled by crawler collection and analysis, while for comment summarization all popular comments are stored in memory through real-time collection and then summarized and displayed.
To summarize the comments, a natural idea is to splice all comments into one text and then summarize that text. The method used here relies on the idea that the higher a word's frequency (TF), the more likely it is to be a keyword, as the basis for text summarization. The specific steps are as follows (a code sketch follows the list):
a) segmenting words of the text, and counting the times of occurrence of all the words respectively;
b) selecting a plurality of words with the highest word frequency as keywords;
c) dividing the text into sentences, calculating the number of keywords in each sentence, and dividing the number of keywords by the length of the sentence to obtain a value as a weight value of the sentence;
d) the sentences with the largest weights are spliced in their order of appearance in the text to form the summary text for output.
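These four steps translate directly into a short Python sketch; the function name is ours, and step a)'s word segmentation is assumed to be done upstream, so the input is tokenized sentences.

    from collections import Counter

    def summarize_comments(sentences, n_keywords=10, n_top=3):
        """sentences: the spliced comment text split into sentences, each a
        list of word tokens. Returns the spliced summary text."""
        freq = Counter(w for s in sentences for w in s)           # step a)
        keywords = {w for w, _ in freq.most_common(n_keywords)}   # step b)
        # step c): weight = number of keywords in sentence / sentence length
        weighted = [(sum(w in keywords for w in s) / len(s), i)
                    for i, s in enumerate(sentences) if s]
        top = sorted(weighted, reverse=True)[:n_top]
        # step d): splice the selected sentences in order of appearance
        return " ".join(" ".join(sentences[i])
                        for i in sorted(idx for _, idx in top))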
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; any change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the present invention should be construed as an equivalent replacement and is included within the protection scope of the present invention.

Claims (8)

1. A news information aggregation method based on data mining and deep learning is characterized by comprising the following steps:
s1, data crawling is carried out on news and comments of the specified website platform by adopting a crawler frame;
s2, classifying all news, and classifying the contents of each news by combining a vector space model and a cos similarity value;
when judging whether two texts are similar, one of them is selected as the reference text, k keywords of the reference text are selected according to the TF-IDF weights of its words, and a vector space R_k is established with those keywords as its dimensions; the occurrences of each keyword in the two texts are counted separately to form the corresponding k-dimensional vectors \vec{v_1} and \vec{v_2}; the cosine of the angle between the two vectors is calculated, and when the value is greater than a set threshold, the texts corresponding to the two vectors are considered similar;
the TF-IDF algorithm is improved on the basis of the synonym forest: using the codes of the sense items within the forest, computation proceeds according to the semantic distance between two sense items; when word similarity is calculated, the sense items of the two words are compared pairwise and the maximum value is taken as the similarity of the two words;
s3, generating an article abstract, and generating a text abstract by adopting a deep neural network structure;
and S4, summarizing the comments corresponding to the news, preprocessing the comment text, and directly extracting key comments by adopting a TF-IDF algorithm to summarize the text.
2. The news information aggregation method based on data mining and deep learning of claim 1, wherein data crawling is performed using the Scrapy crawler framework.
3. The news information aggregation method based on data mining and deep learning of claim 1, wherein in step S2 the classification of news is equivalent to the following problem: given two texts, judge whether their contents are the same;
a text is treated as a space vector \vec{d}: each word in the text represents one dimension of the vector space, and the number of times the word appears in the text is the length of \vec{d} along that dimension, so that the text is completely converted into a vector in the space;
assuming there are n texts, there are n such vectors, and the space in which they lie is formed by the dimensions represented by all distinct words in the n texts; to judge whether two texts are similar, the cosine of the angle between the corresponding vectors is calculated: the closer the value is to 1, the more similar the two texts, and the closer to 0, the more dissimilar;
suppose two vectors \vec{a} and \vec{b} are both n-dimensional; the cosine of the angle between them is calculated as:

cos(\vec{a}, \vec{b}) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}}
the text content is considered identical if the calculated similarity exceeds a set threshold.
4. The news information aggregation method based on data mining and deep learning of claim 3, wherein when judging whether two texts are similar, one of them is selected as the reference text, k keywords of the reference text are selected according to the TF-IDF weights of its words, and a vector space R_k is established with those keywords as its dimensions; the occurrences of each keyword in the two texts are counted separately to form the corresponding k-dimensional vectors \vec{v_1} and \vec{v_2}; the cosine of the angle between the two vectors is calculated, and when the value is greater than a set threshold, the texts corresponding to the two vectors are considered similar;
when calculating with the TF-IDF model, suppose a word is denoted a, the number of occurrences of a in text i is n_{a,i}, the total number of words in text i is N_i, the number of all texts is D, and a occurs in d_a of them; the weight of this word in text i is:

w_{a,i} = \frac{n_{a,i}}{N_i} \times \log\frac{D}{d_a}

The larger w_{a,i} is, the more important a is in text i; the text is segmented into words, the TF-IDF values of all distinct words are calculated and sorted from largest to smallest, and the first X words are taken as keywords.
5. The news information aggregation method based on data mining and deep learning of claim 4, wherein synonyms and near-synonyms of words are considered: the degree of synonymy between words is measured by word similarity, a numerical value whose range is set to [0, 1], computed on the basis of the synonym forest; when the similarity between two words exceeds a certain threshold, the two words are judged to be the same; after the word a and its synonyms are obtained by the synonym-forest method, the word-weight formula is modified: the definition of d_a changes from "the number of texts in which the word a appears" to "the number of texts in which the word a or one of its synonyms appears", while the formula itself remains unchanged:

w_{a,i} = \frac{n_{a,i}}{N_i} \times \log\frac{D}{d_a}
6. the news information aggregation method based on data mining and deep learning of claim 5, wherein the word similarity calculation method is as follows:
in Chinese, one word often expresses many meanings, i.e. has several sense items; all sense items are considered when calculating word similarity. The calculation of sense-item similarity is based on the structure of the synonym forest, using the codes of the sense items within the forest to compute according to the semantic distance between two sense items;
first determine at which layer the two sense items, as leaf nodes of the synonym forest, branch apart, i.e. at which layer their codes first differ; starting from layer 1, multiply by 1 wherever the codes are the same, and at the branching layer multiply by the corresponding coefficient; then multiply by

\cos\left(n \times \frac{\pi}{180}\right)

as a normalization that keeps the sense-item similarity within [0, 1], where n is the total number of nodes of the branching layer;
the density of the tree where the words are located and the number of branches directly influence the similarity, and a similarity value at higher density is more accurate than one at lower density, so a further parameter (n-k+1)/n is multiplied in, where n is the total number of nodes of the branching layer and k is the distance between the two branches;
assuming the codes of the two sense items differ at layer S, with corresponding coefficient s, and letting Sim denote the similarity of the two sense items A and B:

Sim(A, B) = s \times \cos\left(n \times \frac{\pi}{180}\right) \times \frac{n-k+1}{n}
when calculating the similarity of two words, their sense items are compared pairwise, and the maximum value is taken as the similarity of the two words.
7. The news information aggregation method based on data mining and deep learning of claim 1, wherein the text abstract is generated by means of a deep neural network structure using the Seq2Seq technique, also called the Encoder-Decoder architecture, in which the Encoder and the Decoder are each composed of several layers of RNN/LSTM units; the Encoder encodes the original text into a vector C, and the Decoder extracts information from the vector C, obtains the semantics, and generates the text abstract.
8. The news information aggregation method based on data mining and deep learning of claim 1, wherein a specific implementation method of summarizing comment content comprises:
a) obtaining comment contents in real time, segmenting words of the text, and counting the times of occurrence of all the words respectively;
b) selecting N words with the highest word frequency as keywords;
c) dividing the text into sentences, calculating the number of keywords in each sentence, and dividing the number of keywords by the length of the sentence to obtain a value as a weight value of the sentence;
d) the sentences with the largest weights are spliced in their order of appearance in the text to form the summary text for output.
CN201810743949.9A (priority date 2018-07-09, filing date 2018-07-09): News information aggregation method based on data mining and deep learning. Status: Active. Granted as CN110019814B.

Priority Applications (1)

Application Number: CN201810743949.9A
Priority Date: 2018-07-09
Filing Date: 2018-07-09
Title: News information aggregation method based on data mining and deep learning

Publications (2)

CN110019814A (en), published 2019-07-16
CN110019814B (en), granted 2021-07-27

Family ID: 67188331




Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant