CN112580355B

CN112580355B - News information topic detection and real-time aggregation method

Info

Publication number: CN112580355B
Application number: CN202011613849.8A
Authority: CN
Inventors: 吴琼; 刘武雷; 王元卓; 郭建永
Original assignee: Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Current assignee: Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2021-08-31
Anticipated expiration: 2040-12-30
Also published as: CN112580355A

Abstract

The invention belongs to the technical field of natural language processing, and particularly relates to a news information topic detection and real-time aggregation method. The method completes real-time pushing of news information through data acquisition, data processing, text fusion model construction and real-time aggregation. On the basis of constructing a text feature model with a multi-feature fusion method, a distributed real-time streaming computation method distributes topic clustering tasks to different computing nodes, which improves the accuracy and real-time performance of news aggregation and resolves the performance bottleneck of a single node. The aggregation result can finally be pushed to end users through a terminal device, making the method convenient and practical.

Description

News information topic detection and real-time aggregation method

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a multi-data news information topic detection and real-time aggregation method.

Background

The continuous innovation and rapid development of information technology have profoundly influenced the spread of news. Internet-based media platforms keep multiplying, and the speed and volume of news dissemination grow day by day. As a result, online news has become cluttered: different media platforms forward and copy the same news items, and homogenization is a serious problem. How to use computer technology to automatically mine and analyze the hot topics currently worth attention in a vast sea of information, and to present the aggregated hot news to users comprehensively and in real time, is therefore a research hotspot in the field of online news. In addition, as the scale of online news data grows rapidly, the original serial topic discovery and tracking methods cannot run effectively on massive news data sets because of constraints such as memory capacity, and they struggle to meet requirements such as timeliness.

Disclosure of Invention

News data is growing sharply, and conventional serial topic discovery and tracking methods often cannot run effectively on massive news data sets because of constraints such as memory capacity, so they struggle to meet requirements such as timeliness. To address these defects, the invention provides a topic detection and aggregation method for multi-data news information. On the basis of constructing a text feature model with a multi-feature fusion method, the method distributes topic clustering tasks to different computing nodes, which improves the accuracy and real-time performance of news aggregation and resolves the performance bottleneck of a single node.

The technical scheme adopted by the invention for solving the technical problems is as follows: a news information topic detection and real-time aggregation method comprises the following steps:

step one, distributed data acquisition: collecting news information from an internet news media website in real time through a distributed collection program to serve as original data;

step two, data preprocessing: carrying out text denoising, Chinese word segmentation, stop-word filtering, part-of-speech tagging, keyword extraction and named entity recognition on the original data to obtain a document set D to be processed;

step three, constructing a text feature model: the text feature model is constructed by utilizing a multi-feature fusion method, and the model construction method comprises the following steps:

(1) obtaining topic features of the text by combining named entity recognition with an LDA model, receiving the document set D as input, and calculating the similarity sim(p, q)_lda of texts p and q,

In the formula: p and q are the probability vectors of the quantized texts, and D_KL is the vector distance calculated using relative entropy;
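A minimal sketch of a topic-level similarity built on relative entropy is given below. The text states only that D_KL is a distance computed with relative entropy over the texts' probability vectors; the symmetrisation and the `1 / (1 + d)` mapping from distance to similarity are assumptions made for illustration, as are the function names and the toy topic distributions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D_KL(p || q) between two discrete distributions."""
    # eps guards against division by zero when q has a zero component.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def sim_lda(p, q):
    """Hypothetical similarity: symmetrised KL distance mapped into (0, 1]."""
    d = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return 1.0 / (1.0 + d)

# Toy topic distributions (assumed values, e.g. LDA output over three topics).
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

print(sim_lda(doc_a, doc_b) > sim_lda(doc_a, doc_c))
```

Symmetrising makes the score independent of argument order, which is convenient when the same pair of texts is compared on different computing nodes.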

(2) obtaining semantic features of the text by using a Word2Vec model, and calculating the semantic similarity sim(p, q)_v2q of text p and text q by cosine similarity,
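Cosine similarity between two text vectors can be sketched as follows. The function name and the zero-vector convention (returning 0.0) are assumptions; the vectors would come from the Word2Vec representation of each text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: an empty or all-zero text vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)
```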

(3) obtaining a text fusion model by fusing the topic features and the semantic features with weighting factors,

sim(p, q) = α * sim(p, q)_lda + β * sim(p, q)_v2q

in the formula: α and β are weighting factors, and α + β = 1;
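The weighted fusion can be sketched as a small function. The default `alpha=0.5` is an assumed value; the text requires only that the two weights sum to 1.

```python
def fused_similarity(sim_lda, sim_v2q, alpha=0.5):
    """Weighted fusion of topic and semantic similarity, with beta = 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha  # enforces alpha + beta = 1
    return alpha * sim_lda + beta * sim_v2q
```

Deriving `beta` from `alpha` rather than passing both keeps the constraint from being violated by a caller.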

(4) adding time attenuation factors to the text fusion model to update the model, calculating the similarity of the updated text,

sim(p, q) = e^(-k*(t₂ - t₁)) * α * sim(p, q)_lda + e^(-k*(t₂ - t₁)) * β * sim(p, q)_v2q

in the formula: k is the attenuation factor, and t₂ and t₁ are the update times of the two articles;
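A sketch of the time-decayed similarity follows. The time unit (hours), the default `k=0.1`, and the assumption that `t2 >= t1` are illustrative choices not given in the text.

```python
import math

def decayed_similarity(sim_lda, sim_v2q, t1, t2, alpha=0.5, k=0.1):
    """Fused similarity attenuated by exp(-k * (t2 - t1)).

    t1 and t2 are the update times of the two articles, assumed here to be
    in hours with t2 >= t1; k is the attenuation factor.
    """
    beta = 1.0 - alpha
    decay = math.exp(-k * (t2 - t1))
    return decay * alpha * sim_lda + decay * beta * sim_v2q

# Two articles with identical content scores, published 0 h and 24 h apart.
same_time = decayed_similarity(0.8, 0.4, t1=0.0, t2=0.0)
day_apart = decayed_similarity(0.8, 0.4, t1=0.0, t2=24.0)
print(same_time > day_apart)
```

With this form, articles updated far apart in time are less likely to exceed the clustering threshold even when their content is similar.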

step four, distributed real-time clustering: clustering the news information in real time with a distributed real-time clustering algorithm, comprising the following steps:

(1) vectorizing the collected and preprocessed text, transferring the vector data to task scheduling nodes of a distributed real-time aggregation algorithm according to an input sequence, uniformly numbering the tasks by the task scheduling nodes, and then issuing the tasks to task execution nodes;

(2) traversing the feature vectors of the text on the task execution nodes, and calculating the similarity between each vector and the other vectors on the computing node according to the updated text fusion model to obtain a candidate similarity set;

(3) selecting the maximum similarity from the candidate similarity set and recording the feature vector corresponding to the maximum similarity to form a feature vector similarity set;

(4) filtering combinations whose similarity is smaller than a specified threshold out of the feature vector similarity set to obtain a filtering set, and outputting the result to the message middleware;

(5) taking the filtering set out of the message middleware, and merging and outputting the sets that contain the same text until no cluster is updated any more, to obtain the real-time clustered news information.
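Sub-steps (2) to (5) above can be sketched on a single process as follows. This is an illustration only: the distribution across scheduling and execution nodes and the message middleware are omitted, plain cosine similarity stands in for the updated text fusion model, and the names `cluster_texts`, `cosine` and the threshold value are assumptions.

```python
import math

def cosine(u, v):
    """Stand-in similarity; the method would use the updated fusion model."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_texts(vectors, similarity, threshold):
    """vectors maps a text id to its feature vector; returns sorted clusters."""
    ids = list(vectors)
    pairs = []
    for i in ids:
        # (2) candidate similarity set for text i
        candidates = [(similarity(vectors[i], vectors[j]), j) for j in ids if j != i]
        if not candidates:
            continue
        # (3) keep only the most similar partner
        best_sim, best_j = max(candidates, key=lambda c: c[0])
        # (4) drop pairs below the threshold
        if best_sim >= threshold:
            pairs.append((i, best_j))
    # (5) merge sets that share a text (union-find) until nothing changes
    parent = {i: i for i in ids}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())

vectors = {"a": (1.0, 0.0), "b": (0.9, 0.1), "c": (0.0, 1.0),
           "d": (0.1, 0.9), "e": (-1.0, -1.0)}
clusters = cluster_texts(vectors, cosine, threshold=0.9)
print(clusters)
```

Union-find is used for the merge in sub-step (5) because it reaches the fixed point ("until no cluster is updated") in one pass over the filtered pairs.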

Step five, real-time pushing: pushing the real-time clustered news information to users in real time through a visualization tool.

In the above news information topic detection and real-time aggregation method, the internet news media data is various news information from various media platforms.

According to the news information topic detection and real-time aggregation method, in step one, data acquisition adopts a distributed architecture design: a task generation module generates the acquisition tasks, and a task execution module executes them.

According to the news information topic detection and real-time aggregation method, message middleware can be arranged between the task generation module and the task execution module, and the two modules are respectively in communication connection with the message middleware to finish data transmission.

According to the news information topic detection and real-time aggregation method, the distributed acquisition program comprises a task scheduling center and task acquisition nodes. The task scheduling center obtains tasks from the task list and issues them through the message middleware to specific task acquisition nodes, generating the corresponding acquisition tasks to be executed; each task acquisition node executes its acquisition tasks and downloads the news page data.

The invention has the beneficial effects that:

The invention constructs a text feature model with a multi-feature fusion method, obtains the topic features of the text with named entity recognition and an LDA model, and takes both named-entity factors and time factors into account to build a framework for text similarity calculation.

The invention adopts a distributed real-time clustering algorithm to distribute topic clustering tasks to different computing nodes, improves the accuracy and real-time performance of news information real-time aggregation, and solves the performance problem under a single node.

According to the topic detection and real-time aggregation method for multi-data news information, real-time pushing of news information is completed through data acquisition, data processing, text fusion model construction and real-time aggregation. During data acquisition, the task generation module generates acquisition tasks and the task execution module executes them; the scheduling program can dynamically expand or reduce the resources of both modules according to the task volume without affecting normal operation of the system, which ensures acquisition efficiency.

Drawings

FIG. 1 is a schematic view of the overall process of the present invention.

FIG. 2 is a schematic diagram of the data acquisition process of the present invention.

Detailed Description

The invention provides a topic detection and aggregation method for multi-data news information on the basis of constructing a text feature model by using a multi-feature fusion method, and topic clustering tasks are distributed to different computing nodes, so that the accuracy and the real-time performance of news information real-time aggregation are improved, and the performance problem under a single node is solved. The invention is further illustrated with reference to the following figures and examples.

As shown in fig. 1, the news information topic detection and real-time aggregation method of the present invention includes the following steps.

Step one, distributed data acquisition: the news information from the internet news media website is collected in real time through a distributed collection program to serve as original data.

The method comprises the following steps:

(1) generating acquisition tasks: generating the corresponding acquisition tasks according to the data volume of the data source, and transmitting the generated tasks to the message middleware;

(2) receiving and executing acquisition tasks: receiving the acquisition tasks from the message middleware, and acquiring data according to each task to obtain first data.

The internet news media data is various news information from various media platforms (including but not limited to the websites of traditional news media in various cities and internet news sites). In a specific implementation, script can be used as the framework of the acquisition program: the task acquisition module extracts tasks according to the initialized data sources and the task extraction rules, and writes the parsed tasks into Kafka as acquisition tasks; the acquisition module reads the tasks from Kafka, performs data acquisition, and completes preprocessing and storage. During implementation, the scheduling program can dynamically start and suspend some of the task acquisition or execution nodes according to the number of tasks in Kafka.

Alternatively, a distributed architecture design is adopted in which the task generation module generates the acquisition tasks and the task execution module executes them. A message middleware can be arranged between the task generation module and the task execution module, and the two modules are each in communication connection with the message middleware to complete data transmission. The scheduling program can dynamically expand or reduce the resources of the task generation module and the task execution module according to the task volume without affecting normal operation of the system, which ensures acquisition efficiency.
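The producer and consumer arrangement described above can be sketched in a single process. Here `queue.Queue` is an in-memory stand-in for the message middleware (the text mentions Kafka), threads stand in for task execution nodes, and `fetch` is a hypothetical download function supplied by the caller; none of these names come from the source.

```python
import queue
import threading

def generate_tasks(sources, task_queue):
    """Task generation module: turn each data source into an acquisition task."""
    for url in sources:
        task_queue.put({"url": url})

def execute_tasks(task_queue, results, fetch):
    """Task execution module: consume tasks until a stop sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:  # sentinel: no more work for this worker
            break
        results.append(fetch(task["url"]))

def run_acquisition(sources, fetch, n_workers=2):
    task_queue = queue.Queue()  # stand-in for the message middleware
    results = []
    workers = [threading.Thread(target=execute_tasks, args=(task_queue, results, fetch))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    generate_tasks(sources, task_queue)
    for _ in workers:
        task_queue.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    return results

# Hypothetical fetch function that performs no network access.
pages = run_acquisition(["site-a", "site-b", "site-c"], lambda url: "page:" + url)
print(sorted(pages))
```

Because the two modules communicate only through the queue, the number of workers can be changed without touching the task generation side, which mirrors the dynamic scaling described in the text.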

For Chinese word segmentation, current techniques are relatively mature and the mainstream segmentation tools give similar results, so an open-source Chinese word segmentation tool such as jieba can be used directly in the implementation.

For stop-word filtering, a common natural language processing stop-word set is taken as the basis, and new stop words that reflect the characteristics of news aggregation can be added to further refine the stop-word library.
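Stop-word filtering with a base set extended by news-specific terms can be sketched as below. The input is assumed to be already segmented into tokens (for example by a tool such as jieba), and the words in both sets are illustrative assumptions, not the patent's stop-word library.

```python
# Assumed examples of general-purpose stop words.
BASE_STOP_WORDS = {"的", "了", "是", "在"}
# Assumed examples of news-specific additions.
NEWS_STOP_WORDS = {"记者", "报道", "本报讯"}

def filter_stop_words(tokens, extra=frozenset()):
    """Remove stop words and whitespace-only tokens from a segmented text."""
    stop_words = BASE_STOP_WORDS | NEWS_STOP_WORDS | set(extra)
    return [t for t in tokens if t.strip() and t not in stop_words]

tokens = ["记者", "报道", "北京", "的", "天气", " "]
print(filter_stop_words(tokens))
```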

Step three, constructing a text feature model: constructing a text feature model by using a multi-feature fusion method, wherein the model construction method comprises the following steps:

(1) obtaining topic features of the texts by combining named entity recognition with an LDA model, receiving the document set D as input, and calculating the text similarity sim(p, q)_lda of texts p and q,

In the formula: p and q are the probability vectors of the texts, and D_KL is the vector distance calculated using relative entropy.

In the formula: p_i and q_i respectively represent the different texts.
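For reference, relative entropy between two discrete probability vectors is conventionally defined as shown below. Reading p_i and q_i as the i-th components of the probability vectors p and q is an assumption, and how the method maps D_KL to sim(p, q)_lda is not stated in this text.

```latex
D_{KL}(p \,\|\, q) = \sum_{i} p_i \log \frac{p_i}{q_i}
```

Relative entropy is not symmetric in p and q, so a similarity built on it depends on the argument order unless it is symmetrised.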

The significance of the Word2Vec model (word vectors) is that it converts natural language into vectors that a computer can compute with. Word2Vec is a word vector computation model proposed by Google. The Word2Vec tool mainly comprises two models: the continuous bag-of-words model (CBOW) and the skip-gram model. CBOW obtains word vectors by training to predict a target word from its context; skip-gram obtains word vectors by training to predict the surrounding words from the target word. In the specific implementation, because skip-gram performs well on large-scale corpora, skip-gram is adopted to construct the word vectors, the news2016zh corpus is used as the training corpus, and the trained model is used to represent texts.
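One common way to represent a text with trained word vectors is to average the vectors of its words, sketched below. The text does not say how word vectors are combined into a text vector, so the averaging, the function name and the toy two-dimensional embeddings are all assumptions.

```python
def text_vector(tokens, embeddings, dim):
    """Average the word vectors of the tokens found in the embedding table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # Assumed convention: a text with no known words maps to the zero vector.
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Toy embedding table standing in for a trained skip-gram model.
toy_embeddings = {"北京": [1.0, 0.0], "天气": [0.0, 1.0]}
print(text_vector(["北京", "天气", "未知词"], toy_embeddings, dim=2))
```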

(3) Fusing the topic features and the semantic features by using weighting factors to obtain a text fusion model,

sim(p, q) = α * sim(p, q)_lda + β * sim(p, q)_v2q

in the formula: α and β are weighting factors, and α + β = 1.

(4) Adding a time attenuation factor to the text fusion model to update the model, wherein the similarity of the updated text is calculated as follows:

sim(p, q) = e^(-k*(t₂ - t₁)) * α * sim(p, q)_lda + e^(-k*(t₂ - t₁)) * β * sim(p, q)_v2q

in the formula: k is the attenuation factor, and t₂ and t₁ are the update times of the two articles.
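Because the same exponential multiplies both terms, the updated similarity is the fused similarity scaled by a single decay weight. The short derivation below makes this explicit and gives the time gap at which the weight halves; it assumes k > 0 and t₂ ≥ t₁, which the text does not state.

```latex
\mathrm{sim}(p,q)
  = e^{-k(t_2 - t_1)} \left[ \alpha \, \mathrm{sim}(p,q)_{lda} + \beta \, \mathrm{sim}(p,q)_{v2q} \right],
\qquad
e^{-k(t_2 - t_1)} = \tfrac{1}{2}
  \iff t_2 - t_1 = \frac{\ln 2}{k}
```

For example, under these assumptions k = 0.1 per hour halves the similarity of two otherwise identical articles when their update times are about 6.9 hours apart.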

(1) Vectorizing the collected and preprocessed text, transferring the vector data to the task scheduling nodes of the distributed real-time aggregation algorithm in input order, uniformly numbering the tasks at the task scheduling nodes, and then issuing the tasks to the task execution nodes.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and scope of the present invention are intended to be covered thereby.

Claims

1. A news information topic detection and real-time aggregation method is characterized in that: the method comprises the following steps:

In the formula: p and q are probability vectors of texts, and D_KL is a vector distance calculated by adopting relative entropy;

sim(p, q) = α * sim(p, q)_lda + β * sim(p, q)_v2q

in the formula: α and β are weighting factors, and α + β = 1;

sim(p, q) = e^(-k*(t₂ - t₁)) * α * sim(p, q)_lda + e^(-k*(t₂ - t₁)) * β * sim(p, q)_v2q

(5) taking out a filtering set from the message middleware, merging and outputting the sets with the same text until all clusters are not updated any more, and obtaining real-time clustered news information;

2. The news information topic detection and real-time aggregation method as claimed in claim 1, wherein: the internet news media website data is various news information from various media platforms.

3. The news information topic detection and real-time aggregation method as claimed in claim 1, wherein: in step one, data acquisition adopts a distributed architecture design, a task generation module generates the acquisition tasks, and a task execution module executes the acquisition tasks.

4. The news information topic detection and real-time aggregation method as claimed in claim 3, wherein: a message middleware can be arranged between the task generation module and the task execution module, and the two modules are each in communication connection with the message middleware to complete data transmission.

5. The news information topic detection and real-time aggregation method as claimed in claim 1, wherein: the distributed acquisition program comprises a task scheduling center and task acquisition nodes, wherein the task scheduling center acquires tasks from a task list and issues the acquisition tasks to the specific task acquisition nodes to generate corresponding acquisition tasks to be executed; the task acquisition node is used for executing an acquisition task and downloading and acquiring page news data.