CN111090811B - Massive news hot topic extraction method and system - Google Patents

Massive news hot topic extraction method and system

Info

Publication number
CN111090811B
CN111090811B CN201911344883.7A
Authority
CN
China
Prior art keywords
news
text data
model
similarity
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911344883.7A
Other languages
Chinese (zh)
Other versions
CN111090811A (en)
Inventor
宿红毅
王军义
闫波
郑宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201911344883.7A priority Critical patent/CN111090811B/en
Publication of CN111090811A publication Critical patent/CN111090811A/en
Application granted granted Critical
Publication of CN111090811B publication Critical patent/CN111090811B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for extracting hot topics from massive news data. The method obtains the similarity between news text data with a parallelized training model, classifies the news text data according to that similarity with an improved convolutional neural network model, and then clusters the classified news text data by topic with a clustering algorithm, so that hot topics in each category are detected from the massive news data and the accuracy of news hot topic extraction is ensured. Because the entire extraction process runs in a parallelized manner, the extraction efficiency of news hot topics is further improved.

Description

Massive news hot topic extraction method and system
Technical Field
The invention relates to the technical field of big data processing and analysis, in particular to a method and a system for extracting massive news hot topics.
Background
In recent years, with the rapid development of the Internet and the explosive growth of online information, it has become particularly important for people to quickly and accurately obtain, from massive network news, the news event topics that interest them and that are currently popular in society. To this end, existing techniques are used to classify news information on the Internet and to discover hot news that reflects social dynamics and social focal points, so that users can quickly and easily find, in the disordered flood of Internet information, the news that is closely related to their lives and that reflects current trends. Such news has attracted increasing attention from researchers in many fields.
While Internet information grows rapidly, news information in the network is becoming ever larger and more heterogeneous, with all kinds of information interleaved and without obvious regularity. For these reasons, obtaining news events of interest from the huge amount of network information is a great challenge. How to quickly extract focal events and the related development of those events from massive network news, filter out useless information, and thereby help users dig out social hot events in an organized and timely manner and follow the current hot dynamics of society has therefore become a hot spot of current research.
The explosive growth of network information brings great difficulty to the computation and processing of data: traditional data processing methods can no longer meet the requirements of large-scale data processing, and the processing of massive data has become a bottleneck for current production and technological development.
As news information in the network keeps growing, traditional TDT (topic detection and tracking) technology finds it more and more difficult to handle massive news data under the huge pressure of massive information processing. The rise of distributed computing alleviates this problem: by introducing distributed technology, with its advantages in processing massive data, into the processing of massive network news data, the efficiency of analyzing network hot topics can be greatly improved. Judging from related research results, current methods for detecting and discovering network news hot topics have gradually achieved some results, but they still cannot both extract news hot topics accurately and improve the efficiency of that extraction.
Disclosure of Invention
The invention aims to provide a method and a system for extracting massive hot news topics, which can improve the extraction efficiency of the hot news topics while improving the extraction accuracy of the massive hot news topics.
In order to achieve the above object, the present invention provides the following solutions:
a massive news hot topic extraction method comprises the following steps:
acquiring news text data;
preprocessing the acquired news text data;
acquiring a parallelized training model; the parallelization training model is a network training model which takes preprocessed news text data as input and takes similarity among the news text data as output;
obtaining the similarity between the news text data according to the preprocessed news text data by using the parallelized training model;
acquiring an improved convolutional neural network model; the improved convolutional neural network model takes similarity among news text data as input and classification of the news text data as output;
obtaining the classification of the news text data according to the similarity among the news text data by utilizing the improved convolutional neural network model;
and clustering topics of the classified news text data by adopting a clustering algorithm to obtain news hot topics.
Optionally, the parallelized training model is a linear combination model of a parallelized word vector model and a parallelized topic model.
Optionally, before the obtaining the parallelized training model, the method further includes:
obtaining a training sample, and performing calibration sampling on news text data in the training sample to obtain a training sample set;
and performing parallelization training on the training sample set to obtain the similarity between the news text data in the training sample set.
Optionally, after obtaining the classification of the news text data according to the similarity between the news text data by using the improved convolutional neural network model, the method further includes:
performing word frequency distribution analysis, regional distribution analysis and site distribution analysis on the classified news text data;
and respectively counting the news text data with the same word frequency distribution, the news text data with the same region distribution and the news text data with the same site distribution.
Optionally, the clustering algorithm is used for performing topic clustering on the classified news text data to obtain news hot topics, and the method includes:
clustering topics of the classified news text data in the unit window by adopting a clustering algorithm to obtain a first hot topic set;
clustering the first hot topic sets in a specific time period by adopting a clustering algorithm to obtain second hot topic sets; the news text data in the second hot topic collection is the extracted hot news topics.
A massive news hot topic extraction system, comprising:
the data acquisition module is used for acquiring news text data;
the preprocessing module is used for preprocessing the acquired news text data;
the training model acquisition module is used for acquiring a parallelized training model; the parallelization training model is a network training model which takes preprocessed news text data as input and takes similarity among the news text data as output;
the first similarity determining module is used for obtaining the similarity between the news text data according to the preprocessed news text data by utilizing the parallelized training model;
the convolutional neural network model acquisition module is used for acquiring an improved convolutional neural network model; the improved convolutional neural network model takes similarity among news text data as input and classification of the news text data as output;
the data classification module is used for obtaining the classification of the news text data according to the similarity among the news text data by utilizing the improved convolutional neural network model;
the news hot topic acquisition module is used for clustering topics of the classified news text data by adopting a clustering algorithm to obtain news hot topics.
Optionally, the parallelized training model is a linear combination model of a parallelized word vector model and a parallelized topic model.
Optionally, the system further comprises:
the training sample set acquisition module is used for acquiring training samples, and performing calibration sampling on news text data in the training samples to obtain training sample sets;
and the second similarity determining module is used for carrying out parallelization training on the training sample set to obtain the similarity among the news text data in the training sample set.
Optionally, the system further comprises:
the analysis module is used for performing word frequency distribution analysis, regional distribution analysis and site distribution analysis on the classified news text data;
and the statistics module is used for respectively carrying out statistics on the news text data with the same word frequency distribution, the news text data with the same regional distribution and the news text data with the same site distribution.
Optionally, the news hot topic obtaining module includes:
the first hot topic collection acquisition unit is used for carrying out topic clustering on the classified news text data in the unit window by adopting a clustering algorithm to obtain a first hot topic collection;
the second hot topic set acquisition unit is used for clustering the first hot topic set in a specific time period by adopting a clustering algorithm to obtain a second hot topic set; the news text data in the second hot topic collection is the extracted hot news topics.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects: in the massive news hot topic extraction method and system provided by the invention, the similarity between news text data is obtained with a parallelized training model, the classification of the news text data is obtained from that similarity with an improved convolutional neural network model, and the classified news text data are then clustered by topic with a Single-Pass clustering algorithm, so that hot topics in each category are detected from the massive news data and the accuracy of news hot topic extraction is ensured. Because the whole extraction process runs in a parallelized manner, the extraction efficiency of news hot topics is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for extracting massive hot news topics provided by an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a massive news hot topic extraction system provided by an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a method and a system for extracting massive hot news topics, which can improve the extraction efficiency of the hot news topics while improving the extraction accuracy of the massive hot news topics.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Fig. 1 is a flowchart of a method for extracting massive news hot topics provided by an embodiment of the present invention, as shown in fig. 1, and the method for extracting massive news hot topics includes:
s100, acquiring news text data.
S101, preprocessing the acquired news text data.
S102, acquiring a parallelized training model. The parallelization training model is a network training model which takes preprocessed news text data as input and takes similarity among the news text data as output.
S103, obtaining the similarity between the news text data according to the preprocessed news text data by using the parallelization training model.
S104, acquiring an improved convolutional neural network model. The improved convolutional neural network model takes similarity among news text data as input and classification of the news text data as output.
S105, obtaining the classification of the news text data according to the similarity among the news text data by using the improved convolutional neural network model.
And S106, clustering topics of the classified news text data by adopting a clustering algorithm to obtain news hot topics.
In order to improve the comprehensiveness of acquiring news text data, the invention adopts a parallelized web crawler technology when acquiring the news text data in S100, and the specific process of acquiring the data is as follows:
aiming at the characteristics of each portal, firstly, the address link information of the news webpage is acquired from the portal, and then the webpage content in each address information is crawled. Finally, extracting effective information contents such as text, title, time, keywords and the like of the news webpage from the webpage source code information. In order to facilitate the crawling of large-scale data and the improvement of crawling efficiency, a distributed parallelization calculation framework is adopted by a web crawler for crawling, and the crawling mode of being applied to a plurality of servers in a cluster can greatly improve the crawling efficiency.
The specific form of the distributed parallel computing framework is as follows. The distributed crawler system adopts a master-slave structure: one master node controls all slave nodes to execute crawling tasks, and the master node is responsible for distributing tasks and keeping the load of the slave nodes in the cluster balanced. A distributed crawler can be regarded as a combination of multiple centralized crawler systems, where each slave node is equivalent to a centralized crawler controlled and managed by the master node so that the nodes work cooperatively. On top of the distributed cluster, a Map/Reduce-based parallel computing framework is used to improve crawling efficiency.
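As an illustration only (not part of the patent text), the sketch below shows the master/slave idea of handing crawl tasks to parallel workers. The portal URLs, the fetch helper, and the use of a local process pool in place of a real cluster framework such as Map/Reduce are assumptions made for the example.

```python
# Hypothetical sketch of master/slave crawl-task distribution (not the patented system).
# A real deployment would use a cluster scheduler; a local process pool stands in for it here.
from multiprocessing import Pool
from urllib.request import urlopen

PORTAL_PAGES = [
    "https://news.example.com/politics",   # placeholder portal section URLs
    "https://news.example.com/finance",
]

def fetch_page(url: str) -> str:
    """Slave-side task: download one page's HTML source."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def crawl(urls, workers=4):
    """Master side: distribute URLs to worker processes and collect the HTML."""
    with Pool(processes=workers) as pool:       # master node hands out tasks
        return pool.map(fetch_page, urls)       # slaves fetch pages in parallel

if __name__ == "__main__":
    pages = crawl(PORTAL_PAGES)
    print(f"fetched {len(pages)} pages")
```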
In order to facilitate subsequent processing of the news text data, the preprocessing of the acquired news text data in S101 includes operations such as Chinese word segmentation and stop-word filtering.
Chinese word segmentation is an important stage of text processing. The biggest difference from English is that English words are separated by spaces and can be distinguished easily, whereas Chinese word segmentation is considerably more difficult; the mainstream Chinese segmentation techniques are currently based mainly on hidden Markov models and conditional random fields. The invention adopts the open-source jieba segmentation framework, that is, a hidden Markov model is used to segment words and tag parts of speech, and a custom word library is updated in time for words that the jieba dictionary segments improperly.
In stop-word filtering, because news articles are long texts, they contain a large number of useless words, for example modal particles and other function words that are irrelevant to topic detection. Such words contribute little to representing a document and degrade the efficiency of later models, so they must be filtered out. A stop-word list is therefore established to remove irrelevant words; the list used here is the Chinese stop-word list of the Sohu laboratory. In addition, after part-of-speech tagging with jieba, some pronouns, function words and prepositions are filtered out.
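A minimal preprocessing sketch along these lines is shown below; the stop-word file name and the part-of-speech tags that are dropped are assumptions for the example, not values fixed by the patent.

```python
# Illustrative preprocessing sketch: jieba segmentation with part-of-speech tagging,
# followed by stop-word and POS-based filtering.
import jieba.posseg as pseg

def load_stopwords(path="stopwords_zh.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def preprocess(text, stopwords, drop_pos=("r", "p", "u")):  # pronouns, prepositions, particles
    words = []
    for word, pos in pseg.cut(text):            # HMM-based segmentation + POS tagging
        if word in stopwords:
            continue
        if pos and pos[0] in drop_pos:          # filter by coarse POS class
            continue
        words.append(word)
    return words
```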
In order to increase the efficiency of processing news with a huge amount of data, a parallelization processing method is also required in this process.
The parallelized training model described in S102 is a linear combination model of a parallelized word vector model and a parallelized topic model.
In step S102, specifically, a Word2vec word vector model and an LDA (Latent Dirichlet Allocation) topic model are used to model the news texts separately. When calculating text similarity, a linear combination of the two model vectors is used and a weighting factor α is added, so that text similarity can be calculated by combining Word2vec with LDA.
The text similarity is calculated with cosine similarity under the combined, weighted Word2vec/LDA modeling, specifically as follows. Assume that the vectors of document X after vectorization by Word2vec and by LDA are X_w and X_L respectively, and that the vectors of document Y after vectorization by Word2vec and by LDA are Y_w and Y_L respectively. The text similarity formula under Word2vec modeling is:
sim_Word2vec(X, Y) = cos(X_w, Y_w),
and the text similarity formula under LDA modeling is:
sim_LDA(X, Y) = cos(X_L, Y_L).
In the topic detection model, the combined Word2vec/LDA modeling takes a linear combination of the two formulas above when calculating text similarity and adds a weighting factor α, giving the similarity formula for two documents X and Y:
sim(X, Y) = α · sim_Word2vec(X, Y) + (1 − α) · sim_LDA(X, Y).
The text similarity can thus be calculated by combining Word2vec with LDA.
Since the model training speed of the serial model is slow, the serial mode is improved, and the training of the model is accelerated by using a distributed parallel mode.
The specific method for accelerating model training by using the parallelization mode is as follows:
during model training, a previous serial training method is modified into distributed GPU parallel training, each time the host end transmits each parameter of the initialized network model and a SENTENCE with the LENGTH of MAX_SENTECE_LENGTH to the GPU end, the GPU end starts MAX_SENTECE_LENGTH GPU threads to train the MAX_SENTECE_LENGTH words simultaneously, and therefore task-level parallelism is achieved, and acceleration of model training is achieved. The number of threads is also optimized. Since a thread bundle (warp) contains 32 threads, which is the basic unit of GPU thread scheduling and execution, the total number of threads recommended to be used is a multiple of 32, which can achieve higher throughput.
In the serial training code, the MAX_SENTENCE_LENGTH macro is defined as 1000; this value determines that when the corpus is split into sentences, each sentence (except possibly the last) contains 1000 words. The number of GPU threads launched at each step is determined by the sentence length, with one thread per word in the sentence, so 1000 threads would be launched each time. However, 1000 is not a multiple of 32, so the MAX_SENTENCE_LENGTH macro is changed from 1000 to 1024. With 1024 threads launched each time, the thread count is a multiple of 32, which yields higher throughput and improves execution efficiency.
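Purely to illustrate the warp-alignment arithmetic behind the 1000 → 1024 change (the helper name below is hypothetical, not from the patent's code):

```python
# Round the per-launch thread count up to a multiple of the warp size (32).
WARP_SIZE = 32

def aligned_thread_count(sentence_length: int) -> int:
    """Smallest multiple of the warp size covering one thread per word."""
    return ((sentence_length + WARP_SIZE - 1) // WARP_SIZE) * WARP_SIZE

assert aligned_thread_count(1000) == 1024   # the adjustment described above
assert aligned_thread_count(1024) == 1024
```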
In addition, in order to improve the accuracy of news text data processing, the method further includes, before S102:
and acquiring a training sample, and performing calibration sampling on news text data in the training sample to obtain a training sample set.
And performing parallelization training on the training sample set to obtain the similarity between the news text data in the training sample set.
When S104 and S105 are executed, news classification is performed by combining pre-trained Word2vec with a convolutional neural network. Compared with a traditional convolutional neural network, the network structure differs in that an embedding layer of word vectors is added in front of the convolutional layers.
The current common practice is to train the embedding layer inside the neural network itself, which inflates the number of training parameters, prolongs training, and makes the model prone to overfitting. The improved convolutional neural network model therefore builds its embedding layer from the pre-trained Word2vec word vector model, which reduces both the number of training parameters and the training time.
The specific embedding procedure is as follows. The text dataset is the news text data; assume the dataset has d documents and that each word of a document is mapped to an index, the word being denoted w_i. Assuming the maximum length of a news text is m words, the original input has dimension d × m. A pre-trained Word2vec model is then applied, so that every word has a word-vector mapping; assuming the word-vector dimension is n, the original input is converted to dimension d × m × n. One more dimension is then added to meet the requirements of two-dimensional convolution, so the input layer has dimension d × m × n × 1. This yields the embedding layer built from the Word2vec model.
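To make the dimension bookkeeping concrete, the following shape-only sketch (with arbitrary d, m, n and a random stand-in for the pre-trained lookup table) reproduces the d × m → d × m × n → d × m × n × 1 transformation; it is illustrative only.

```python
# Shape-only sketch of the embedding step described above.
import numpy as np

d, m, n = 8, 50, 100                     # documents, max words per document, vector size
vocab_size = 5000

token_ids = np.random.randint(0, vocab_size, size=(d, m))   # original input: d x m word indices
w2v_table = np.random.rand(vocab_size, n)                   # stand-in for pre-trained word vectors

embedded = w2v_table[token_ids]          # d x m x n after the Word2vec mapping
conv_input = embedded[..., np.newaxis]   # d x m x n x 1, ready for 2-D convolution

print(token_ids.shape, embedded.shape, conv_input.shape)    # (8, 50) (8, 50, 100) (8, 50, 100, 1)
```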
The convolutional neural network also differs from a traditional CNN. A traditional CNN stacks convolutional layers in series, one after another; here the embedding layer is convolved with convolution kernels of three different sizes, and the three resulting convolutional branches run in parallel with a parallelism of 3. The convolutional layer extracts features from the embedding layer with the three kernel sizes, the pooling layer then extracts the most representative feature from each feature map, and the fully connected layer finally completes the multi-class classification.
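The structure described above follows the well-known parallel multi-kernel TextCNN pattern. The sketch below is a generic PyTorch rendering of that pattern; the kernel sizes, channel counts, and class count are assumptions, and it is a sketch of the described structure rather than the patent's exact network.

```python
# Generic sketch of the parallel three-kernel convolutional classifier (assumed sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeKernelTextCNN(nn.Module):
    def __init__(self, embed_dim=100, num_classes=10, kernel_sizes=(3, 4, 5), channels=64):
        super().__init__()
        # One 2-D convolution per kernel size, applied to the embedded input in parallel.
        self.convs = nn.ModuleList(
            nn.Conv2d(1, channels, kernel_size=(k, embed_dim)) for k in kernel_sizes
        )
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, x):                     # x: (batch, 1, max_len, embed_dim)
        feats = []
        for conv in self.convs:
            h = F.relu(conv(x)).squeeze(3)                  # (batch, channels, max_len - k + 1)
            h = F.max_pool1d(h, h.size(2)).squeeze(2)       # keep the strongest feature per map
            feats.append(h)
        return self.fc(torch.cat(feats, dim=1))             # parallel branches -> fully connected
```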
In addition, in order to perform statistics and analysis on news text data under each category, after S105, the massive news hot topic extraction method provided by the invention further includes:
and performing word frequency distribution analysis, regional distribution analysis and site distribution analysis on the classified news text data.
And respectively counting the news text data with the same word frequency distribution, the news text data with the same region distribution and the news text data with the same site distribution.
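As an illustration, a minimal sketch of these statistics using Python's collections.Counter is given below; the record fields ("tokens", "region", "site") are an assumed schema, since the patent does not fix one.

```python
# Minimal sketch of the per-category statistics step over classified news records.
from collections import Counter

def category_statistics(docs):
    """docs: iterable of dicts with 'tokens', 'region', and 'site' keys (assumed schema)."""
    word_freq = Counter(tok for d in docs for tok in d["tokens"])   # word-frequency distribution
    by_region = Counter(d["region"] for d in docs)                  # regional distribution
    by_site = Counter(d["site"] for d in docs)                      # site distribution
    return word_freq, by_region, by_site
```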
In order to further improve both the accuracy and the efficiency of news hot topic extraction, S106 uses a Single-Pass clustering algorithm to perform topic clustering on the classified news text data and obtain news hot topics, specifically including:
and carrying out topic clustering on the classified news text data in the unit window by adopting a Single-Pass clustering algorithm to obtain a first hot topic set.
And clustering the first hot topic set in a specific time period by adopting a Single-Pass clustering algorithm to obtain a second hot topic set. The news text data in the second hot topic collection is the extracted hot news topics.
The specific implementation process of S106 is as follows:
by adopting the improved double-layer Single-Pass clustering method, news hot topics in a period of time can be extracted, and the hot topics in each period of time can be combined to obtain an integral hot topic set in a plurality of periods of time.
The general idea of the two-layer Single-Pass clustering designed in the invention is as follows. Single-Pass clustering is used to cluster the network news within a unit time window (here one day) by topic, which yields the hot topics of each time window; for example, a first clustering pass yields the hot topics of each day within one week. This alone does not meet the requirement, because a single layer of Single-Pass clustering cannot produce the overall hot topics of the whole week. The invention therefore adds a second clustering pass on top of the topic sets produced by the first pass, again using Single-Pass clustering: if the first pass yields the hot topic sets of every day in a week, the objects processed by the second pass are those topic sets. By merging the daily hot topic sets through the second pass, the hot topics of the whole week are obtained.
The advantage of this two-layer Single-Pass design is that it obtains the hot topic set of each unit time window and can also merge the topics of the unit windows to detect the hot topics of a larger time window. Because the objects processed in the second pass are the topic sets from the first pass, the clustering cost is greatly reduced and the clustering efficiency is improved.
The specific algorithm steps of the two-layer Single-Pass clustering designed by the invention are described in detail below; the method is divided into a primary (first-pass) clustering part and a secondary (second-pass) clustering part, and a code sketch follows the steps.
Wherein, primary clustering includes:
input: d news document vectors, similarity threshold s
And (3) outputting: topic collection T
The method comprises the following specific steps:
step 1, a first document d in the documents is input 1 Make a first new topic t 1 . Where the topic center vector is the average of the vectors that enter the topic.
Step 2, inputting document d i (i=2, 3,4, ·, d), d i Similarity is calculated with existing class sets. And taking the topic set with the maximum similarity as a candidate set, entering the topic set if the maximum similarity is greater than s, and forming a new topic if the maximum similarity is not greater than s.
And 3, repeating the step 2 until the last document is input, and outputting all topic sets T.
The secondary clustering includes:
step 1: all topic sets T obtained according to primary clustering i Enter the first topic T 1 Build up of T-involved 1 Setting a similarity threshold s;
step 2: the next topic set T i Incoming, calculate T i The similarity between each topic i in the topic sets and topics in the known topic sets is obtained, and the topic set with the largest similarity is taken as a candidate set.
Step 3: if the similarity value is larger than s, combining the two topic sets, otherwise, establishing a new topic set containing topic i;
step 4: repeating the steps 2 and 3 until the last topic set is processed, and outputting a combined topic set T'.
Compared with the prior art, the method for extracting the mass hot news topics has the following characteristics and beneficial effects:
1. The invention collects the latest news data from the main domestic portal sites, mainly by crawling network news on a large scale with web crawler technology. The invention further models the texts by combining a Word2vec word vector model with an LDA topic model: the two models are combined in a weighted manner for similarity calculation, and this combined modeling clearly outperforms using either model alone.
2. The invention performs text classification before topic detection and classifies the crawled news, which both satisfies the needs of different users for different fields and saves clustering time. For text classification, the invention combines pre-trained Word2vec with a convolutional neural network, so that the number of trainable parameters is reduced during neural network training and overfitting is prevented. In the design of the convolutional network, the invention adopts a structure with three parallel convolutional branches: instead of the traditional serial structure, three convolution kernels of different sizes are applied to the input for the convolution operation.
3. The invention extracts the hot topics of each category by clustering the classified data. Because network news is updated continuously, traditional clustering algorithms are not applicable, so a two-layer Single-Pass clustering is designed; it not only extracts the news hot topics within a period of time, but also merges the hot topics of each period to obtain an overall hot topic set covering several periods.
4. Parallelized processing is an inevitable trend in big data processing, so the invention changes the traditional serial mode of operation into a parallel one and keeps the whole process parallelized. This not only improves the accuracy of the hot topics but also greatly improves the efficiency of processing massive news data. Moreover, the invention helps users obtain the topics they care about accurately and quickly.
In addition, corresponding to the above method, the invention provides a massive news hot topic extraction system, as shown in fig. 2, which comprises: a data acquisition module 1, a preprocessing module 2, a training model acquisition module 3, a first similarity determination module 4, a convolutional neural network model acquisition module 5, a data classification module 6 and a news hot topic acquisition module 7.
The data acquisition module 1 is used for acquiring news text data.
The preprocessing module 2 is used for preprocessing the acquired news text data.
The training model acquisition module 3 is used for acquiring a parallelized training model. The parallelization training model is a network training model which takes preprocessed news text data as input and takes similarity among the news text data as output.
The first similarity determining module 4 is configured to obtain a similarity between news text data according to the preprocessed news text data by using the parallelized training model.
The convolutional neural network model acquisition module 5 is used for acquiring an improved convolutional neural network model. The improved convolutional neural network model takes similarity among news text data as input and classification of the news text data as output.
The data classification module 6 is configured to obtain a classification of the news text data according to the similarity between the news text data by using the improved convolutional neural network model.
The news hot topic acquisition module 7 is used for clustering topics of the classified news text data by adopting a clustering algorithm to obtain news hot topics.
In addition, to further improve the accuracy of the data processing, the system may further include: a training sample set acquisition module and a second similarity determination module.
The training sample set acquisition module is used for acquiring training samples, and performing calibration sampling on news text data in the training samples to obtain a training sample set.
The second similarity determining module is used for performing parallelization training on the training sample set to obtain similarity among news text data in the training sample set.
In order to improve the comprehensiveness of the data analysis process, the system may further include: an analysis module and a statistics module.
And the analysis module is used for performing word frequency distribution analysis, regional distribution analysis and site distribution analysis on the classified news text data.
And the statistics module is used for respectively carrying out statistics on the news text data with the same word frequency distribution, the news text data with the same regional distribution and the news text data with the same site distribution.
The hot news topic acquisition module 7 may specifically further include: the device comprises a first hot topic set acquisition unit and a second hot topic set acquisition unit.
The first hot topic collection acquisition unit is used for clustering topics of the classified news text data in the unit window by adopting a clustering algorithm to obtain a first hot topic collection.
The second hot topic set acquisition unit is used for clustering the first hot topic set in a specific time period by adopting a clustering algorithm to obtain a second hot topic set. The news text data in the second hot topic collection is the extracted hot news topics.
In the present specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts the embodiments may refer to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.
The principles and embodiments of the present invention have been described herein with reference to specific examples; the description is intended only to help in understanding the method of the present invention and its core ideas. At the same time, a person of ordinary skill in the art may, based on the idea of the invention, make modifications to the specific embodiments and the scope of application. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (9)

1. The method for extracting the massive news hot topics is characterized by comprising the following steps of:
acquiring news text data;
preprocessing the acquired news text data;
acquiring a parallelized training model; the parallelization training model is a network training model which takes preprocessed news text data as input and takes similarity among the news text data as output;
obtaining the similarity between the news text data according to the preprocessed news text data by using the parallelized training model;
acquiring an improved convolutional neural network model; the improved convolutional neural network model takes similarity among news text data as input and classification of the news text data as output;
obtaining the classification of the news text data according to the similarity among the news text data by utilizing the improved convolutional neural network model;
clustering topics of the classified news text data by adopting a clustering algorithm to obtain news hot topics;
the parallelization training model is a linear combination model of a parallelization word vector model and a parallelization topic model;
model training is carried out on news texts by using a Word2vec Word vector model and an LDA topic model respectively, linear combination of two model vectors is adopted when text similarity is calculated, and a weighting factor alpha is added;
the text similarity is calculated with cosine similarity under the weighted combination of the Word2vec word vector model and the LDA topic model, specifically: assume that the vectors of document X after vectorization by the Word2vec word vector model and by the LDA topic model are X_w and X_L respectively, and that the vectors of document Y after vectorization by the Word2vec word vector model and by the LDA topic model are Y_w and Y_L respectively; the text similarity formula under Word2vec modeling is:
sim_Word2vec(X, Y) = cos(X_w, Y_w);
the text similarity formula under LDA modeling is:
sim_LDA(X, Y) = cos(X_L, Y_L);
in the topic detection model, the combined Word2vec/LDA modeling takes a linear combination of the above two formulas when calculating text similarity and adds a weighting factor α, giving the similarity formula for two documents X and Y:
sim(X, Y) = α · sim_Word2vec(X, Y) + (1 − α) · sim_LDA(X, Y);
where sim_Word2vec(X, Y) is the text similarity under Word2vec modeling, sim_LDA(X, Y) is the text similarity under LDA modeling, and sim(X, Y) is the similarity of the two documents X and Y.
2. The method for extracting massive news hot topics according to claim 1, further comprising, before the obtaining the parallelized training model:
obtaining a training sample, and performing calibration sampling on news text data in the training sample to obtain a training sample set;
and performing parallelization training on the training sample set to obtain the similarity between the news text data in the training sample set.
3. The method for extracting massive news hot topics according to claim 1, wherein the obtaining the classification of the news text data according to the similarity between the news text data by using the improved convolutional neural network model further comprises:
performing word frequency distribution analysis, regional distribution analysis and site distribution analysis on the classified news text data;
and respectively counting the news text data with the same word frequency distribution, the news text data with the same region distribution and the news text data with the same site distribution.
4. The massive news hot topic extraction method according to claim 1, wherein the topic clustering of the classified news text data by adopting a Single-Pass clustering algorithm is performed to obtain news hot topics, and the method comprises the following steps:
clustering topics of the classified news text data in the unit window by adopting a clustering algorithm to obtain a first hot topic set;
clustering the first hot topic sets in a specific time period by adopting a clustering algorithm to obtain second hot topic sets; the news text data in the second hot topic collection is the extracted hot news topics.
5. A massive news hot topic extraction system, comprising:
the data acquisition module is used for acquiring news text data;
the preprocessing module is used for preprocessing the acquired news text data;
the training model acquisition module is used for acquiring a parallelized training model; the parallelization training model is a network training model which takes preprocessed news text data as input and takes similarity among the news text data as output;
the first similarity determining module is used for obtaining the similarity between the news text data according to the preprocessed news text data by utilizing the parallelized training model;
the convolutional neural network model acquisition module is used for acquiring an improved convolutional neural network model; the improved convolutional neural network model takes similarity among news text data as input and classification of the news text data as output;
the data classification module is used for obtaining the classification of the news text data according to the similarity among the news text data by utilizing the improved convolutional neural network model;
the news hot topic acquisition module is used for clustering topics of the classified news text data by adopting a clustering algorithm to obtain news hot topics;
the parallelization training model is a linear combination model of a parallelization word vector model and a parallelization topic model;
model training is carried out on news texts by using a Word2vec Word vector model and an LDA topic model respectively, linear combination of two model vectors is adopted when text similarity is calculated, and a weighting factor alpha is added;
the text similarity is calculated with cosine similarity under the weighted combination of the Word2vec word vector model and the LDA topic model, specifically: assume that the vectors of document X after vectorization by the Word2vec word vector model and by the LDA topic model are X_w and X_L respectively, and that the vectors of document Y after vectorization by the Word2vec word vector model and by the LDA topic model are Y_w and Y_L respectively; the text similarity formula under Word2vec modeling is:
sim_Word2vec(X, Y) = cos(X_w, Y_w);
the text similarity formula under LDA modeling is:
sim_LDA(X, Y) = cos(X_L, Y_L);
in the topic detection model, the combined Word2vec/LDA modeling takes a linear combination of the above two formulas when calculating text similarity and adds a weighting factor α, giving the similarity formula for two documents X and Y:
sim(X, Y) = α · sim_Word2vec(X, Y) + (1 − α) · sim_LDA(X, Y);
where sim_Word2vec(X, Y) is the text similarity under Word2vec modeling, sim_LDA(X, Y) is the text similarity under LDA modeling, and sim(X, Y) is the similarity of the two documents X and Y.
6. The massive news hot topic extraction system of claim 5, wherein the parallelized training model is a linear combination model of a parallelized word vector model and a parallelized topic model.
7. The massive news hot topic extraction system of claim 5, further comprising:
the training sample set acquisition module is used for acquiring training samples, and performing calibration sampling on news text data in the training samples to obtain training sample sets;
and the second similarity determining module is used for carrying out parallelization training on the training sample set to obtain the similarity among the news text data in the training sample set.
8. The massive news hot topic extraction system of claim 5, further comprising:
the analysis module is used for performing word frequency distribution analysis, regional distribution analysis and site distribution analysis on the classified news text data;
and the statistics module is used for respectively carrying out statistics on the news text data with the same word frequency distribution, the news text data with the same regional distribution and the news text data with the same site distribution.
9. The massive news hot topic extraction system of claim 5, wherein the news hot topic acquisition module includes:
the first hot topic collection acquisition unit is used for carrying out topic clustering on the classified news text data in the unit window by adopting a clustering algorithm to obtain a first hot topic collection;
the second hot topic set acquisition unit is used for clustering the first hot topic set in a specific time period by adopting a clustering algorithm to obtain a second hot topic set; the news text data in the second hot topic collection is the extracted hot news topics.
CN201911344883.7A 2019-12-24 2019-12-24 Massive news hot topic extraction method and system Active CN111090811B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911344883.7A CN111090811B (en) 2019-12-24 2019-12-24 Massive news hot topic extraction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911344883.7A CN111090811B (en) 2019-12-24 2019-12-24 Massive news hot topic extraction method and system

Publications (2)

Publication Number Publication Date
CN111090811A CN111090811A (en) 2020-05-01
CN111090811B true CN111090811B (en) 2023-09-01

Family

ID=70395273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911344883.7A Active CN111090811B (en) 2019-12-24 2019-12-24 Massive news hot topic extraction method and system

Country Status (1)

Country Link
CN (1) CN111090811B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113064990A (en) * 2021-01-04 2021-07-02 上海金融期货信息技术有限公司 Hot event identification method and system based on multi-level clustering
CN113157857B (en) * 2021-03-13 2023-06-02 中国科学院新疆理化技术研究所 Hot topic detection method, device and equipment for news
CN112905751B (en) * 2021-03-19 2024-03-29 常熟理工学院 Topic evolution tracking method combining topic model and twin network model
CN113486176B (en) * 2021-07-08 2022-11-04 桂林电子科技大学 News classification method based on secondary feature amplification

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831193A (en) * 2012-08-03 2012-12-19 人民搜索网络股份公司 Topic detecting device and topic detecting method based on distributed multistage cluster
CN103793503A (en) * 2014-01-24 2014-05-14 北京理工大学 Opinion mining and classification method based on web texts
CN105320646A (en) * 2015-11-17 2016-02-10 天津大学 Incremental clustering based news topic mining method and apparatus thereof
CN105718590A (en) * 2016-01-27 2016-06-29 福州大学 Multi-tenant oriented SaaS public opinion monitoring system and method
CN107832456A (en) * 2017-11-24 2018-03-23 云南大学 A kind of parallel KNN file classification methods based on the division of critical Value Data
CN108197144A (en) * 2017-11-28 2018-06-22 河海大学 A kind of much-talked-about topic based on BTM and Single-pass finds method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10148700B2 (en) * 2016-06-30 2018-12-04 Fortinet, Inc. Classification of top-level domain (TLD) websites based on a known website classification

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831193A (en) * 2012-08-03 2012-12-19 人民搜索网络股份公司 Topic detecting device and topic detecting method based on distributed multistage cluster
CN103793503A (en) * 2014-01-24 2014-05-14 北京理工大学 Opinion mining and classification method based on web texts
CN105320646A (en) * 2015-11-17 2016-02-10 天津大学 Incremental clustering based news topic mining method and apparatus thereof
CN105718590A (en) * 2016-01-27 2016-06-29 福州大学 Multi-tenant oriented SaaS public opinion monitoring system and method
CN107832456A (en) * 2017-11-24 2018-03-23 云南大学 A kind of parallel KNN file classification methods based on the division of critical Value Data
CN108197144A (en) * 2017-11-28 2018-06-22 河海大学 A kind of much-talked-about topic based on BTM and Single-pass finds method

Also Published As

Publication number Publication date
CN111090811A (en) 2020-05-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant