CN110019814B - News information aggregation method based on data mining and deep learning - Google Patents


Info

Publication number
CN110019814B
Authority
CN
China
Prior art keywords
text
words
texts
similarity
word
Prior art date
Legal status
Active
Application number
CN201810743949.9A
Other languages
Chinese (zh)
Other versions
CN110019814A (en)
Inventor
翁健
黄芝琪
李文灏
陈杰彬
罗伟其
张悦
Current Assignee
Jinan University
Original Assignee
Jinan University
Priority date: 2018-07-09
Filing date: 2018-07-09
Publication date: 2021-07-27
Application filed by Jinan University
Priority to CN201810743949.9A
Publication of CN110019814A
Application granted
Publication of CN110019814B
Status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/34 - Browsing; Visualisation therefor
    • G06F16/345 - Summarisation for human users
    • G06F16/35 - Clustering; Classification
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/247 - Thesauruses; Synonyms
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 - Semantic analysis

Abstract

The invention discloses a news information aggregation method based on data mining and deep learning. The method uses a crawler to capture data from news portal websites over the same time period to obtain news and comment information; it then classifies and deduplicates the news by applying a vector space model, TF-IDF weighting, a synonym-forest method, and the cosine distance measure, aggregating news items with the same content; comments are summarized through a text-summarization algorithm; and finally an abstract of each article is generated automatically by a deep neural network model. The method allows readers to acquire the content and reader comments of all news platforms efficiently and quickly.

Description

News information aggregation method based on data mining and deep learning
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method for news information aggregation, abstract generation, and comment summarization.
Background
In the Internet era, the volume of daily information grows explosively, and news is one of the main ways people acquire information. Unlike traditional print news, online news spreads widely, reaches a large audience, updates quickly, and costs far less to operate, so it has been widely accepted by society. For readers, online news is cheap to read, rich in content, and time-saving, and readers can choose the content that interests them rather than being limited to the fixed content of a traditional newspaper. In addition, almost all news sites provide a platform where readers can speak and discuss freely. For popular events, the main content of reader comments can reflect the direction of public opinion, and a number of companies analyzing online public opinion have emerged. Popular news and comments are also the content most readers prefer to read.
At the same time, news platforms are numerous and the quality of their content is uneven, which causes problems for readers: news describing the same content is scattered across different platforms in different forms of expression, and the inconsistent operation of the platforms harms the reading experience. Therefore, how to find useful information among news platforms with inconsistent, multi-form content, generate abstracts of it, and summarize reader comments, so that readers can read efficiently, is a problem to be solved.
In existing text similarity identification methods, text similarity is usually represented directly by the angle between feature weight vectors. Although such systems work well in most cases, the method lacks effective handling of synonyms. In addition, conventional aggregation systems use the same extractive method for both news abstracts and reader comments, pulling out key content for display. The extractive method performs well syntactically, and works especially well on reader comments because they are short and refined, but for whole articles the extraction quality and fluency are poor, and the result lacks clear writing logic.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art by providing a news information aggregation method based on data mining and deep learning: it improves the feature-weight-vector included-angle method by combining it with a synonym-forest method, generates abstracts for news content with a deep learning method, and summarizes comments with an extractive method.
The purpose of the invention is realized by the following technical scheme: a news information aggregation method based on data mining and deep learning comprises the following steps:
1. adopting a crawler framework to crawl news and comments from specified website platforms;
2. classifying all news: the content of each news item is classified by combining a vector space model, cosine similarity, the TF-IDF algorithm, and a synonym-forest method;
3. generating article abstracts: a text abstract is generated with a deep neural network structure;
4. summarizing the comments corresponding to each news item: the comment text is preprocessed, and key comments are extracted directly with the TF-IDF algorithm to summarize the text.
Preferably, the data crawl is performed using the Scrapy crawler framework.
Preferably, classifying news is equivalent to the following problem: given two texts, judge whether their contents are the same;
a text is treated as a space vector \vec{d}: each word in the text represents one dimension of the vector space, and the number of times the word appears in the text is the length of \vec{d} along that dimension, so that the text is completely converted into a vector in the space;
assuming there are n texts, there are n such vectors, and the space in which they lie is formed by the dimensions represented by all distinct words in the n texts; to judge whether two texts are similar, the cosine of the angle between the corresponding vectors is calculated: the closer the value is to 1, the more similar the two texts, and the closer to 0, the more dissimilar;
suppose two vectors \vec{a} and \vec{b} are both n-dimensional; the cosine of the angle between them is calculated as:

cos(\vec{a}, \vec{b}) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}}
the text content is considered identical if the calculated similarity exceeds a set threshold.
Further, when judging whether two texts are similar, one of them is selected as the reference text, k keywords of the reference text are selected according to the TF-IDF weights of its words, and a vector space R_k is established with those keywords as its dimensions; the occurrences of each keyword in the two texts are counted separately to form the corresponding k-dimensional vectors \vec{v_1} and \vec{v_2}; the cosine of the angle between the two vectors is calculated, and when the value is greater than a set threshold, the texts corresponding to the two vectors are considered similar;
when calculating with the TF-IDF model, suppose a word is denoted a, the number of occurrences of a in text i is n_{a,i}, the total number of words in text i is N_i, the number of all texts is D, and a occurs in d_a of them; the weight of this word in text i is:

w_{a,i} = \frac{n_{a,i}}{N_i} \times \log\frac{D}{d_a}

The larger w_{a,i} is, the more important a is in text i; the text is segmented into words, the TF-IDF values of all distinct words are calculated and sorted from largest to smallest, and the first X words are taken as keywords.
Furthermore, synonyms and near-synonyms are considered: the degree of synonymy between words is measured by word similarity, a numerical value whose range is set to [0, 1] and which is computed on the basis of the synonym forest; when the similarity between two words exceeds a certain threshold, the two words are judged to be the same. On this basis, after the word a and its synonyms are obtained by the synonym-forest method, the word-weight formula is modified: the definition of d_a changes from "the number of texts in which the word a appears" to "the number of texts in which the word a or one of its synonyms appears", while the formula itself remains unchanged:

w_{a,i} = \frac{n_{a,i}}{N_i} \times \log\frac{D}{d_a}
specifically, the word similarity calculation method includes:
in Chinese, one word often expresses many meanings, i.e. has several sense items; all sense items are considered when calculating word similarity. The calculation of sense-item similarity is based on the structure of the synonym forest, using the codes of the sense items within the forest to compute according to the semantic distance between two sense items;
first determine at which layer the two sense items, as leaf nodes of the synonym forest, branch apart, i.e. at which layer their codes first differ; starting from layer 1, multiply by 1 wherever the codes are the same, and at the branching layer multiply by the corresponding coefficient; then multiply by

\cos\left(n \times \frac{\pi}{180}\right)

as a normalization that keeps the sense-item similarity within [0, 1], where n is the total number of nodes of the branching layer;
the density of the tree where the words are located and the number of branches directly influence the similarity, and a similarity value at higher density is more accurate than one at lower density, so a further parameter (n-k+1)/n is multiplied in, where n is the total number of nodes of the branching layer and k is the distance between the two branches;
assuming the codes of the two sense items differ at layer S, with corresponding coefficient s, and letting Sim denote the similarity of the two sense items A and B:

Sim(A, B) = s \times \cos\left(n \times \frac{\pi}{180}\right) \times \frac{n-k+1}{n}
when calculating the similarity of two words, their sense items are compared pairwise, and the maximum value is taken as the similarity of the two words.
Preferably, the text abstract is generated by means of a deep neural network structure using the Seq2Seq technique, also called the Encoder-Decoder architecture, in which the Encoder and the Decoder are each composed of several layers of RNN/LSTM units; the Encoder encodes the original text into a vector C, and the Decoder extracts information from the vector C, obtains the semantics, and generates the text abstract.
Preferably, the specific implementation of comment summarization comprises the following steps:
a) obtaining comment contents in real time, segmenting words of the text, and counting the times of occurrence of all the words respectively;
b) selecting N words with the highest word frequency as keywords;
c) dividing the text into sentences, calculating the number of keywords in each sentence, and dividing the number of keywords by the length of the sentence to obtain a value as a weight value of the sentence;
d) the sentences with the largest weights are spliced in their order of appearance in the text to form the summary text for output.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The system provided by the invention can collect and classify information from different news platforms, avoiding the poor reading experience caused by fragmented and inconsistent information across platforms and reducing the impact of redundant news on readers.
2. In processing similar texts, the invention adopts the synonym-forest technique, which avoids the misjudgment of synonyms better than existing systems.
3. The invention uses deep learning to produce article abstracts, avoiding defects of existing abstract-generation methods such as poor readability and incoherent content, so that article abstracts and comment summaries come closer to standard natural language and readers can digest news content more efficiently.
4. The invention extracts and summarizes comments, completing the functionality of the news information aggregation system and further improving the efficiency with which readers read comments.
Drawings
Fig. 1 is a basic flowchart of an embodiment news information aggregation method.
FIG. 2 is a flowchart of an embodiment news classification step.
FIG. 3 is a flowchart illustrating summarizing news review content according to an embodiment.
FIG. 4 shows the 5-level structure of the synonym forest used in the embodiment.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example 1
A news information aggregation method based on data mining and deep learning comprises the following steps:
1. and crawling news and comments of a specified website platform, and performing data crawling on five news portal websites including a new wave, a search, an update, a network exchange and the world wide web by adopting a script crawler frame.
2. And classifying all news, and classifying the contents of each news by adopting methods such as a vector space model, a cos similarity value, a TF-IDF algorithm, a synonym forest and the like.
3. And generating an article abstract, and generating a text abstract by adopting an improved deep neural network structure.
4. And summarizing the comments corresponding to the news, and directly extracting key comments by adopting a TF-IDF algorithm through preprocessing the comment text to summarize the text.
Step one, data crawling.
The Scrapy crawler framework is used to crawl five news portal websites: Sina, Sohu, Tencent, NetEase, and Huanqiu. Scrapy is a Python data-crawling framework that can be used to crawl data from web pages and save it locally.
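As a concrete illustration of this step, the following is a minimal sketch of such a spider using Scrapy's public API. The site URL and CSS selectors are hypothetical placeholders, not taken from the patent; in practice each portal needs its own spider and selectors.

    import scrapy

    class NewsSpider(scrapy.Spider):
        name = "news"
        # hypothetical list page; each real portal needs its own start URLs
        start_urls = ["https://news.example.com/latest"]

        def parse(self, response):
            # follow every article link found on the list page
            for href in response.css("a.article-link::attr(href)").getall():
                yield response.follow(href, callback=self.parse_article)

        def parse_article(self, response):
            # the selectors below are assumptions; adapt them per portal
            yield {
                "title": response.css("h1::text").get(),
                "body": " ".join(response.css("div.content p::text").getall()),
                "comments": response.css("div.comment p::text").getall(),
            }

Running such a file with `scrapy runspider` and an output option (e.g. `-o news.json`) stores the scraped items locally, matching the crawl-and-save behavior described above.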
Step two: classify the obtained news.
(1) The main flow of classifying local news is shown in fig. 2.
(2) Vector space model and cosine similarity
To classify the downloaded news content, the problem that needs to be solved is: given two texts, how to judge whether their contents are approximately the same.
A text is treated as a space vector \vec{d}: each word in the text represents one dimension of the vector space, and the number of times the word appears in the text is the length of \vec{d} along that dimension, so that the text is completely converted into a vector in the space. For example, if the content of a text is "It is raining in Beijing today; I am very happy", the segmentation result is "today / Beijing / raining / , / I / very / happy", and the text can be regarded as a vector in a seven-dimensional space:

[1 1 1 1 1 1 1]^T

where the rows correspond in order to "today", "Beijing", "raining", ",", "I", "very", and "happy". The text is thus abstracted into a vector. Assuming there are now n texts, there are n such vectors, and the space in which these n vectors lie is made up of the dimensions represented by all the distinct words in the n texts. To judge whether two texts are similar, the cosine of the angle between the corresponding vectors is calculated: the closer the value is to 1, the more similar the two texts; the closer to 0, the more dissimilar.
For example, if there are three texts:
\vec{d_1}: the content of the first text is "No matter where we are, we all feel very happy"; the segmentation result is "no matter / we / at / what / place / , / we / all / feel / very / happy".

\vec{d_2}: the content of the second text is "We feel very happy no matter where we are"; the segmentation result is "we / no matter / at / what / place / all / feel / very / happy".

\vec{d_3}: the content of the third text is "Today we are very happy, because today is Friday"; the segmentation result is "today / we / very / happy / , / because / today / is / Friday".

The similarity between the three texts can now be judged. First, the words appearing across all the texts are collected (separated by "|"): no matter | we | at | what | place | , | all | feel | very | happy | today | because | is | Friday. There are fourteen words in total, so the three texts can each be represented by a fourteen-dimensional vector, each dimension recording the number of occurrences of the corresponding word. The vectors corresponding to the three texts are:

the first text:
\vec{d_1} = [1 2 1 1 1 1 1 1 1 1 0 0 0 0]^T

the second text:
\vec{d_2} = [1 1 1 1 1 0 1 1 1 1 0 0 0 0]^T

the third text:
\vec{d_3} = [0 1 0 0 0 1 0 0 1 1 2 1 1 1]^T

Then the similarity between the three texts is calculated. Suppose two vectors \vec{a} and \vec{b} are both n-dimensional; the cosine of the angle between them is:

cos(\vec{a}, \vec{b}) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}}

Cosine similarity of text one and text two:
cos(\vec{d_1}, \vec{d_2}) = 10 / (\sqrt{13} \times 3) ≈ 0.925

Cosine similarity of text one and text three:
cos(\vec{d_1}, \vec{d_3}) = 5 / (\sqrt{13} \times \sqrt{11}) ≈ 0.418

Cosine similarity of text two and text three:
cos(\vec{d_2}, \vec{d_3}) = 3 / (3 \times \sqrt{11}) ≈ 0.302
from the calculation result, it can be known that the similarity between the text one and the text two is the highest, and the similarity between the text two and the text three is the lowest. In practical applications, a threshold may be set, for example, if the calculated similarity exceeds 0.75, the text content is considered to be almost the same, and punctuation marks obtained during word segmentation or various words such as "yes" and "on" should be eliminated.
In the design of this algorithm, the similarity comparison of two texts needs to be optimized for performance. The algorithm above has a problem: when the number of texts is very large, the dimension of the corresponding vector space becomes very high, and the vector for a single text is 0 in most dimensions of the space, resulting in very high time complexity and space complexity.
To avoid the performance degradation caused by an excessively high vector-space dimension, the algorithm is improved as follows: when judging whether two texts are similar, one of them is selected as the reference text, k keywords of the reference text are selected according to the TF-IDF weights of its words, and a vector space R_k is established with those keywords as its dimensions. The occurrences of each keyword in the two texts are counted separately to form the corresponding k-dimensional vectors \vec{v_1} and \vec{v_2}, and the cosine of the angle between the two vectors is calculated; when the value is greater than a set threshold, the texts corresponding to the two vectors are considered similar (after many experiments, a threshold of 0.75 is preferred).
Still taking the three texts above as an example, suppose we judge whether text two and text three are similar to text one. Text one is the reference text; its segmentation result is "no matter / we / at / what / place / , / we / all / feel / very / happy", and the terms that appear are: no matter | we | at | what | place | , | all | feel | very | happy. There are 10 terms in total, so text one, text two, and text three are now described by three 10-dimensional vectors:

text one:
\vec{v_1} = [1 2 1 1 1 1 1 1 1 1]^T

text two:
\vec{v_2} = [1 1 1 1 1 0 1 1 1 1]^T

text three:
\vec{v_3} = [0 1 0 0 0 1 0 0 1 1]^T

Calculating the similarity:

cosine similarity of text one and text two:
cos(\vec{v_1}, \vec{v_2}) = 10 / (\sqrt{13} \times 3) ≈ 0.925

cosine similarity of text one and text three:
cos(\vec{v_1}, \vec{v_3}) = 5 / (\sqrt{13} \times 2) ≈ 0.693

If instead we judge whether text three is similar to text two, then text two is the reference text; its segmentation result is "we / no matter / at / what / place / all / feel / very / happy", and the terms that appear are: we | no matter | at | what | place | all | feel | very | happy. There are 9 terms in total, so text two and text three are described by two 9-dimensional vectors:

text two:
\vec{v_2} = [1 1 1 1 1 1 1 1 1]^T

text three:
\vec{v_3} = [1 0 0 0 0 0 0 1 1]^T

Calculating the similarity:

cosine similarity of text two and text three:
cos(\vec{v_2}, \vec{v_3}) = 3 / (3 \times \sqrt{3}) ≈ 0.577
for the classification of news, the optimization described above was used based on the following experience: if the story contents are the same, the keywords of the two news are almost the same. From the practical operation result, the optimization is worthy of affirmation, and the storage space and the operation time are saved.
(3) TF-IDF algorithm
The main idea of TF-IDF consists of two points: 1. the more times a word appears in a text, the more important it is in that text; 2. the less often a word appears across all texts, the more important it is. Here "all texts" means all the texts in a corpus. TF is the frequency of a word in one text, and IDF is the inverse document frequency, reflecting how rarely the word occurs across all texts.
When calculating with the TF-IDF model, suppose a word is denoted a, the number of occurrences of a in text i is n_{a,i}, the total number of words in text i is N_i, the number of all texts is D, and a occurs in d_a of them. The weight of this word in text i is:

w_{a,i} = \frac{n_{a,i}}{N_i} \times \log\frac{D}{d_a}

The larger w_{a,i} is, the more important a is in text i.
With this TF-IDF calculation, the keywords of a text can be extracted. The basic idea is to segment the text into words, calculate the TF-IDF values of all distinct words, sort the words, and take those with the largest values as keywords. How many words to take requires a balance in practice: too few keywords may fail to describe the main content of the text, while too many keywords cost more computing resources when computing text similarity. In this project, the number of keywords selected is 10.
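A minimal sketch of this keyword extractor, directly following the weight formula w_{a,i} = (n_{a,i}/N_i) × log(D/d_a); the corpus is represented as a list of token lists, and the function name is ours.

    import math
    from collections import Counter

    def extract_keywords(corpus, i, top_x=10):
        """Top-X TF-IDF keywords of text i within corpus (a list of token lists)."""
        D = len(corpus)
        doc = corpus[i]
        tf = Counter(doc)                 # n_{a,i} for every word a in text i
        N_i = len(doc)
        weights = {}
        for a, n_ai in tf.items():
            d_a = sum(1 for text in corpus if a in text)  # texts containing a
            weights[a] = (n_ai / N_i) * math.log(D / d_a)
        # sort by weight, descending, and keep the first X words
        ranked = sorted(weights, key=weights.get, reverse=True)
        return ranked[:top_x]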
(4) Improvement of TF-IDF algorithm based on synonym forest
The algorithm above successfully performs keyword extraction and similarity matching on many texts with high accuracy. In actual testing, however, a problem remains: after keywords are extracted by the TF-IDF algorithm, if word importance weights are calculated and similarity is judged only by whether the keywords are literally identical, synonyms are misjudged. For example, words such as "school", "college", and "university" may occur across several texts that plainly belong to one category, yet the algorithm above cannot group them, even though in some contexts these words need to be treated as the same word. Therefore the algorithm is optimized by computing word similarity on the basis of the synonym forest.
The synonym forest is a dictionary that contains not only the synonyms of a word but also a certain number of similar words, i.e. related words in a broad sense. It organizes all its entries in a tree-like hierarchy. The vocabulary is divided into large, medium, and small classes; each small class contains many words, which are grouped into word groups according to the closeness and relatedness of their senses, and the words in each group are further divided into lines, where words on the same line have identical or strongly related senses. For example, the Chinese words for soybean, green soybean, and yellow soybean lie on one line; the two Chinese words for tomato lie on one line; and the terms for hired, poor, lower-middle, upper-middle, and rich peasants also lie on one line.
The thesaurus classification of the synonym forest is hierarchical, with a 5-layer structure, as shown in fig. 4. As the level deepens, word senses become increasingly specific, and by layer 5 the number of words in each category is already small. The layer-5 category is selected to replace the keyword: because synonyms and related words share a layer-5 category, they are regarded there as the same word.
The degree of synonymy between words is measured by word similarity, a numerical value whose range is set to [0, 1]. A similarity of 1 means the two words can replace each other in any context; 0 means they are never interchangeable.
To calculate word similarity, sense-item similarity must be calculated first. In Chinese, a word often has many meanings, i.e. several sense items; for example, the word for "proud" can carry both a commendatory and a derogatory sense. Therefore all sense items are considered when calculating word similarity. The calculation of sense-item similarity is based on the structure of the synonym forest, using the codes of the sense items to compute the semantic distance between two sense items.
First determine at which layer the two sense items, as leaf nodes of the synonym forest, branch apart, i.e. at which layer their codes first differ. Suppose they branch at layer 4: judging from layer 1, multiply by 1 for each layer where the codes are the same, multiply by the corresponding coefficient at the branching layer, and then multiply by

\cos\left(n \times \frac{\pi}{180}\right)

as a normalization that keeps the sense-item similarity within [0, 1], where n is the total number of nodes of the branching layer.
The density of the tree where the words are located and the number of branches directly influence the similarity: a similarity value at higher density is more accurate than one at lower density. A further parameter (n-k+1)/n is therefore multiplied in, where n is the total number of nodes of the branching layer and k is the distance between the two branches. This refines the calculated value and makes the result more accurate.
Let the similarity of two sense items A and B be denoted Sim(A, B), and let X = (n-k+1)/n.

If the two words are not on the same tree:
Sim(A, B) = f

If they branch at layer 2, with coefficient a:
Sim(A, B) = a \times \cos\left(n \times \frac{\pi}{180}\right) \times X

If they branch at layer 3, with coefficient b:
Sim(A, B) = b \times \cos\left(n \times \frac{\pi}{180}\right) \times X

If they branch at layer 4, with coefficient c:
Sim(A, B) = c \times \cos\left(n \times \frac{\pi}{180}\right) \times X

If they branch at layer 5, with coefficient d:
Sim(A, B) = d \times \cos\left(n \times \frac{\pi}{180}\right) \times X
when the similarity of the words is calculated, the meaning items of the two words are respectively calculated pairwise, and the maximum value is taken as the similarity value of the two words.
Using the above method, similarity calculations were performed taking the Chinese word for "people" as an example; the results are shown in Table 1:
Table 1. Semantic similarity of the word "people" to other words (the table is rendered as an image in the source; its values are not recoverable)
The calculated semantic similarities are largely consistent with similarity as judged by human cognition and reflect objective reality: the algorithm accurately and objectively reflects the semantic relatedness between words and provides an effective measure of it.
After testing on many words, a threshold of 0.7 was adopted: when the similarity of two words is greater than 0.7, the two words are treated as approximately the same.
After the word a and its synonyms are obtained by the synonym-forest method, the word-weight formula is modified: the definition of d_a changes from "the number of texts in which the word a appears" to "the number of texts in which the word a or one of its synonyms appears", while the formula itself remains unchanged:

w_{a,i} = \frac{n_{a,i}}{N_i} \times \log\frac{D}{d_a}
from the practical operation result, the improvement can well alleviate the defect that the synonyms cannot be judged.
The news items being compared are replaced continually until all similar pairs are found: each news item of platform i is compared once against all news items of the other platforms; then the second news item of platform i is compared against them, and so on.
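A sketch of this comparison loop (the names are ours; `similarity` can be the keyword-space function sketched earlier):

    def find_same_event_pairs(platforms, similarity, threshold=0.75):
        """platforms: one list of news texts per platform; returns pairs of
        items from different platforms judged to have the same content."""
        pairs = []
        for i in range(len(platforms)):
            for j in range(i + 1, len(platforms)):  # each platform pair once
                for a in platforms[i]:
                    for b in platforms[j]:
                        if similarity(a, b) > threshold:
                            pairs.append((a, b))
        return pairs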
Step three: automatically generate the text abstract using a deep neural network.
Because a text abstract must be coherent and highly summarized, extractive methods cannot achieve a good effect; an abstractive text summary realized by a deep neural network structure solves this problem well. The invention adopts the Seq2Seq technique proposed by the Google Brain team, also called the Encoder-Decoder architecture, in which the Encoder and the Decoder are each composed of several layers of RNN/LSTM units: the Encoder encodes the original text into a vector C, and the Decoder extracts information from the vector C, obtains the semantics, and generates the text abstract.
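A minimal sketch of this Encoder-Decoder pair in PyTorch is shown below. The framework choice, layer counts, and sizes are illustrative assumptions; the patent specifies only that both sides consist of several layers of RNN/LSTM.

    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=128, hid_dim=256, layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=layers,
                                batch_first=True)

        def forward(self, src_ids):
            # the final hidden state plays the role of the vector C
            _, state = self.lstm(self.embed(src_ids))
            return state

    class Decoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=128, hid_dim=256, layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=layers,
                                batch_first=True)
            self.out = nn.Linear(hid_dim, vocab_size)

        def forward(self, tgt_ids, state):
            # decode from C, producing next-word logits at each step
            output, state = self.lstm(self.embed(tgt_ids), state)
            return self.out(output), state

During training, the Encoder consumes the article tokens and the Decoder is trained to emit the reference abstract token by token from the Encoder's final state; at inference time the Decoder generates the abstract greedily or with beam search.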
Step four, summarizing news comment contents.
(1) The main flow of summarizing and presenting news review content is shown in fig. 3.
(2) Concrete implementation of comment summarization
The comment content of a news item differs from the news text: comments are updated in real time, while the news text is essentially fixed once published. Therefore the news text is handled by crawler collection and analysis, while for comment summarization all popular comments are stored in memory through real-time collection and then summarized and displayed.
To summarize the comments, a natural idea is to splice all comments into one text and then summarize that text. The method used here relies on the idea that the higher a word's frequency (TF), the more likely it is to be a keyword, as the basis for text summarization. The specific steps are as follows (a code sketch follows the list):
a) segmenting words of the text, and counting the times of occurrence of all the words respectively;
b) selecting a plurality of words with the highest word frequency as keywords;
c) dividing the text into sentences, calculating the number of keywords in each sentence, and dividing the number of keywords by the length of the sentence to obtain a value as a weight value of the sentence;
d) the sentences with the largest weights are spliced in their order of appearance in the text to form the summary text for output.
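These four steps translate directly into a short Python sketch; the function name is ours, and step a)'s word segmentation is assumed to be done upstream, so the input is tokenized sentences.

    from collections import Counter

    def summarize_comments(sentences, n_keywords=10, n_top=3):
        """sentences: the spliced comment text split into sentences, each a
        list of word tokens. Returns the spliced summary text."""
        freq = Counter(w for s in sentences for w in s)           # step a)
        keywords = {w for w, _ in freq.most_common(n_keywords)}   # step b)
        # step c): weight = number of keywords in sentence / sentence length
        weighted = [(sum(w in keywords for w in s) / len(s), i)
                    for i, s in enumerate(sentences) if s]
        top = sorted(weighted, reverse=True)[:n_top]
        # step d): splice the selected sentences in order of appearance
        return " ".join(" ".join(sentences[i])
                        for i in sorted(idx for _, idx in top))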
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; any change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the present invention should be construed as an equivalent replacement and is included within the protection scope of the present invention.

Claims (8)

1. A news information aggregation method based on data mining and deep learning is characterized by comprising the following steps:
s1, data crawling is carried out on news and comments of the specified website platform by adopting a crawler frame;
s2, classifying all news, and classifying the contents of each news by combining a vector space model and a cos similarity value;
when judging whether two texts are similar, one of them is selected as the reference text, k keywords of the reference text are selected according to the TF-IDF weights of its words, and a vector space R_k is established with those keywords as its dimensions; the occurrences of each keyword in the two texts are counted separately to form the corresponding k-dimensional vectors \vec{v_1} and \vec{v_2}; the cosine of the angle between the two vectors is calculated, and when the value is greater than a set threshold, the texts corresponding to the two vectors are considered similar;
the TF-IDF algorithm is improved on the basis of the synonym forest: using the codes of the sense items within the forest, computation proceeds according to the semantic distance between two sense items; when word similarity is calculated, the sense items of the two words are compared pairwise and the maximum value is taken as the similarity of the two words;
s3, generating an article abstract, and generating a text abstract by adopting a deep neural network structure;
and S4, summarizing the comments corresponding to the news, preprocessing the comment text, and directly extracting key comments by adopting a TF-IDF algorithm to summarize the text.
2. The news information aggregation method based on data mining and deep learning of claim 1, wherein data crawling is performed using the Scrapy crawler framework.
3. The news information aggregation method based on data mining and deep learning of claim 1, wherein in step S2 the classification of news is equivalent to the following problem: given two texts, judge whether their contents are the same;
a text is treated as a space vector \vec{d}: each word in the text represents one dimension of the vector space, and the number of times the word appears in the text is the length of \vec{d} along that dimension, so that the text is completely converted into a vector in the space;
assuming there are n texts, there are n such vectors, and the space in which they lie is formed by the dimensions represented by all distinct words in the n texts; to judge whether two texts are similar, the cosine of the angle between the corresponding vectors is calculated: the closer the value is to 1, the more similar the two texts, and the closer to 0, the more dissimilar;
suppose two vectors \vec{a} and \vec{b} are both n-dimensional; the cosine of the angle between them is calculated as:

cos(\vec{a}, \vec{b}) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}}
the text content is considered identical if the calculated similarity exceeds a set threshold.
4. The news information aggregation method based on data mining and deep learning of claim 3, wherein when judging whether two texts are similar, one of them is selected as the reference text, k keywords of the reference text are selected according to the TF-IDF weights of its words, and a vector space R_k is established with those keywords as its dimensions; the occurrences of each keyword in the two texts are counted separately to form the corresponding k-dimensional vectors \vec{v_1} and \vec{v_2}; the cosine of the angle between the two vectors is calculated, and when the value is greater than a set threshold, the texts corresponding to the two vectors are considered similar;
when calculating with the TF-IDF model, suppose a word is denoted a, the number of occurrences of a in text i is n_{a,i}, the total number of words in text i is N_i, the number of all texts is D, and a occurs in d_a of them; the weight of this word in text i is:

w_{a,i} = \frac{n_{a,i}}{N_i} \times \log\frac{D}{d_a}

The larger w_{a,i} is, the more important a is in text i; the text is segmented into words, the TF-IDF values of all distinct words are calculated and sorted from largest to smallest, and the first X words are taken as keywords.
5. The news information aggregation method based on data mining and deep learning of claim 4, wherein synonyms and near-synonyms of words are considered: the degree of synonymy between words is measured by word similarity, a numerical value whose range is set to [0, 1], computed on the basis of the synonym forest; when the similarity between two words exceeds a certain threshold, the two words are judged to be the same; after the word a and its synonyms are obtained by the synonym-forest method, the word-weight formula is modified: the definition of d_a changes from "the number of texts in which the word a appears" to "the number of texts in which the word a or one of its synonyms appears", while the formula itself remains unchanged:

w_{a,i} = \frac{n_{a,i}}{N_i} \times \log\frac{D}{d_a}
6. the news information aggregation method based on data mining and deep learning of claim 5, wherein the word similarity calculation method is as follows:
in Chinese, one word often expresses many meanings, i.e. has several sense items; all sense items are considered when calculating word similarity. The calculation of sense-item similarity is based on the structure of the synonym forest, using the codes of the sense items within the forest to compute according to the semantic distance between two sense items;
first determine at which layer the two sense items, as leaf nodes of the synonym forest, branch apart, i.e. at which layer their codes first differ; starting from layer 1, multiply by 1 wherever the codes are the same, and at the branching layer multiply by the corresponding coefficient; then multiply by

\cos\left(n \times \frac{\pi}{180}\right)

as a normalization that keeps the sense-item similarity within [0, 1], where n is the total number of nodes of the branching layer;
the density of the tree where the words are located and the number of branches directly influence the similarity, and a similarity value at higher density is more accurate than one at lower density, so a further parameter (n-k+1)/n is multiplied in, where n is the total number of nodes of the branching layer and k is the distance between the two branches;
assuming the codes of the two sense items differ at layer S, with corresponding coefficient s, and letting Sim denote the similarity of the two sense items A and B:

Sim(A, B) = s \times \cos\left(n \times \frac{\pi}{180}\right) \times \frac{n-k+1}{n}
when calculating the similarity of two words, their sense items are compared pairwise, and the maximum value is taken as the similarity of the two words.
7. The news information aggregation method based on data mining and deep learning of claim 1, wherein the text abstract is generated by means of a deep neural network structure using the Seq2Seq technique, also called the Encoder-Decoder architecture, in which the Encoder and the Decoder are each composed of several layers of RNN/LSTM units; the Encoder encodes the original text into a vector C, and the Decoder extracts information from the vector C, obtains the semantics, and generates the text abstract.
8. The news information aggregation method based on data mining and deep learning of claim 1, wherein a specific implementation method of summarizing comment content comprises:
a) obtaining comment contents in real time, segmenting words of the text, and counting the times of occurrence of all the words respectively;
b) selecting N words with the highest word frequency as keywords;
c) dividing the text into sentences, calculating the number of keywords in each sentence, and dividing the number of keywords by the length of the sentence to obtain a value as a weight value of the sentence;
d) the sentences with the largest weights are spliced in their order of appearance in the text to form the summary text for output.
CN201810743949.9A (priority date 2018-07-09, filing date 2018-07-09): News information aggregation method based on data mining and deep learning. Status: Active. Granted as CN110019814B.

Priority Applications (1)

Application Number: CN201810743949.9A
Priority Date: 2018-07-09
Filing Date: 2018-07-09
Title: News information aggregation method based on data mining and deep learning

Publications (2)

CN110019814A (en), published 2019-07-16
CN110019814B (en), granted 2021-07-27

Family ID: 67188331




Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant