CN111090811A - Method and system for extracting massive news hot topics - Google Patents

Publication number
CN111090811A
Authority
CN
China
Prior art keywords
text data
news
news text
model
hot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911344883.7A
Other languages
Chinese (zh)
Other versions
CN111090811B (en)
Inventor
宿红毅
王军义
闫波
郑宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN201911344883.7A
Publication of CN111090811A
Application granted
Publication of CN111090811B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for extracting hot topics from massive news data. The method obtains the similarity among news text data using a parallelized training model, classifies the news text data according to that similarity using an improved convolutional neural network model, and then performs topic clustering on the classified news text data using a clustering algorithm, thereby detecting the hot topics under each category from massive news data and ensuring the accuracy of news hot topic extraction. Because the whole extraction process runs in a parallelized mode, the efficiency of news hot topic extraction is further improved.

Description

Method and system for extracting massive news hot topics
Technical Field
The invention relates to the technical field of big data processing and analysis, in particular to a method and a system for extracting massive hot news topics.
Background
In recent years, with the rapid development of the internet, people have paid particular attention to how to quickly and accurately obtain news events of interest from massive network news information, and in particular the news events that currently attract the most public attention.
As internet information grows rapidly, the news information on the network becomes bloated, with all kinds of information interleaved and lacking any regular organization. For these reasons, obtaining news events of interest from massive network information is a great challenge. How to quickly extract, from massive network news, the focus events that people are interested in together with news on how those events develop, and how to filter useless information out of the vast volume of network news so that the result is clearly organized, helps users discover social hot events in time and follow current social trends, has therefore become a research hot spot.
The explosive growth of network information makes data computation and processing very difficult. Traditional data processing methods cannot meet the requirements of large-scale data processing, and the processing of massive data has become a bottleneck for current production and technological development.
As news information on the network keeps growing and the processing of massive information brings huge pressure, traditional TDT (topic detection and tracking) technology finds it increasingly difficult to process massive news data. The rise of distributed computing technology offers a way to solve this problem: by exploiting its advantages in processing massive data, introducing distributed technology into the processing of massive network news data can greatly improve the efficiency of analyzing network hot topics. Judging from related research results, existing methods for detecting and discovering hot topics in network news have gradually achieved some results, but they still cannot improve the efficiency of extracting hot news data while also extracting it accurately.
Disclosure of Invention
The invention aims to provide a method and a system for extracting massive news hot topics, which can improve the accuracy of extracting massive news hot topics and improve the efficiency of extracting the news hot topics.
In order to achieve the purpose, the invention provides the following scheme:
a method for extracting massive news hot topics comprises the following steps:
acquiring news text data;
preprocessing the acquired news text data;
acquiring a parallelized training model; the parallelized training model is a network training model which takes preprocessed news text data as input and takes the similarity among the news text data as output;
obtaining the similarity between the news text data according to the preprocessed news text data by using the parallelized training model;
obtaining an improved convolutional neural network model; the improved convolutional neural network model is a neural network model which takes the similarity among news text data as input and takes the classification of the news text data as output;
obtaining the classification of the news text data according to the similarity between the news text data by using the improved convolutional neural network model;
and carrying out topic clustering on the classified news text data by adopting a clustering algorithm to obtain hot news topics.
Optionally, the parallelized training model is a linear combination model of a parallelized word vector model and a parallelized topic model.
Optionally, before the obtaining the parallelized training model, the method further includes:
acquiring a training sample, and performing calibration sampling on news text data in the training sample to obtain a training sample set;
and performing parallelization training on the training sample set to obtain the similarity among the news text data in the training sample set.
Optionally, after obtaining the classification of the news text data according to the similarity between the news text data by using the improved convolutional neural network model, the method further includes:
performing word frequency distribution analysis, region distribution analysis and site distribution analysis on the classified news text data;
and respectively counting the news text data with the same word frequency distribution, the news text data with the same regional distribution and the news text data with the same site distribution.
Optionally, the clustering algorithm is used to perform topic clustering on the classified news text data to obtain a news hot topic, including:
carrying out topic clustering on the classified news text data in the unit window by adopting a clustering algorithm to obtain a first hot topic set;
clustering the first hot topic set in a specific time period by adopting a clustering algorithm to obtain a second hot topic set; and the news text data in the second hot topic set is the extracted news hot topic.
A massive news hot topic extraction system comprises:
the data acquisition module is used for acquiring news text data;
the preprocessing module is used for preprocessing the acquired news text data;
the training model acquisition module is used for acquiring a parallelized training model; the parallelized training model is a network training model which takes preprocessed news text data as input and takes the similarity among the news text data as output;
the first similarity determining module is used for obtaining the similarity between the news text data according to the preprocessed news text data by utilizing the parallelized training model;
the convolutional neural network model acquisition module is used for acquiring an improved convolutional neural network model; the improved convolutional neural network model is a neural network model which takes the similarity among news text data as input and takes the classification of the news text data as output;
the data classification module is used for obtaining classification of the news text data according to the similarity between the news text data by utilizing the improved convolutional neural network model;
and the news hot topic acquisition module is used for carrying out topic clustering on the classified news text data by adopting a clustering algorithm to obtain the news hot topic.
Optionally, the parallelized training model is a linear combination model of a parallelized word vector model and a parallelized topic model.
Optionally, the system further includes:
the training sample set acquisition module is used for acquiring a training sample, and performing calibration sampling on news text data in the training sample to obtain a training sample set;
and the second similarity determining module is used for performing parallel training on the training sample set to obtain the similarity between the news text data in the training sample set.
Optionally, the system further includes:
the analysis module is used for carrying out word frequency distribution analysis, region distribution analysis and site distribution analysis on the classified news text data;
and the statistical module is used for respectively carrying out statistics on the news text data with the same word frequency distribution, the news text data with the same regional distribution and the news text data with the same site distribution.
Optionally, the news hot topic obtaining module includes:
the first hot topic set acquisition unit is used for carrying out topic clustering on the classified news text data in the unit window by adopting a clustering algorithm to obtain a first hot topic set;
the second hot topic set acquisition unit is used for clustering the first hot topic set in a specific time period by adopting a clustering algorithm to obtain a second hot topic set; and the news text data in the second hot topic set is the extracted news hot topic.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects. In the method and system for extracting massive news hot topics provided by the invention, the similarity among the news text data is obtained using the parallelized training model, and the classification of the news text data is obtained from that similarity using the improved convolutional neural network model. The classified news text data is then topic-clustered using a Single-Pass clustering algorithm, so that the hot topics under each category are detected from the massive news data and the accuracy of news hot topic extraction is ensured. Because the whole extraction process is based on a parallelized mode, the efficiency of news hot topic extraction is further improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a method for extracting massive news hot topics according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a massive news hot topic extraction system provided in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a method and a system for extracting massive news hot topics, which can improve the accuracy of extracting massive news hot topics and improve the efficiency of extracting the news hot topics.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of a method for extracting massive news hot topics provided by an embodiment of the present invention, and as shown in fig. 1, the method for extracting massive news hot topics includes:
S100: Obtain news text data.
S101: Preprocess the acquired news text data.
S102: Acquire a parallelized training model. The parallelized training model is a network training model which takes the preprocessed news text data as input and the similarity among the news text data as output.
S103: Obtain the similarity between the news text data from the preprocessed news text data by using the parallelized training model.
S104: Obtain the improved convolutional neural network model. The improved convolutional neural network model is a neural network model which takes the similarity among news text data as input and the classification of the news text data as output.
S105: Obtain the classification of the news text data from the similarity between the news text data by using the improved convolutional neural network model.
S106: Perform topic clustering on the classified news text data with a clustering algorithm to obtain the news hot topics.
To make the acquisition of news text data more comprehensive, the invention uses parallelized web crawler technology when acquiring the news text data in S100. The specific data acquisition process is as follows:
According to the characteristics of each portal website, the address links of news web pages are first obtained from the portal website, and the web page content at each address is then crawled. Finally, useful content such as the body text, title, time and keywords of each news web page is extracted from the page source code. To support crawling of large-scale data and improve crawling efficiency, the web crawler uses a distributed parallel computing framework; crawling with the servers of a cluster working in parallel can greatly improve crawling efficiency.
The specific form of the distributed parallelized computation framework is as follows: the distributed crawler system adopts a master-slave structure, namely, one master node controls all slave nodes to execute crawling tasks, and the master node is responsible for distributing the tasks, so that the load balance of all the slave nodes in the cluster is ensured. The distributed crawler can be regarded as a combination of a plurality of centralized crawler systems, each slave node is equivalent to one centralized crawler system, and the centralized crawler systems are controlled and managed by one master control node in the distributed crawler systems so as to work cooperatively. And a parallelization computing framework based on Map/Reduce is used on the basis of the distributed cluster to improve the crawling efficiency.
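The master-slave arrangement described above can be illustrated with a minimal single-machine sketch. It is not the patent's crawler: threads stand in for the slave nodes of a cluster, round-robin assignment stands in for whatever load-balancing policy the master node uses, and `assign_urls`, `crawl_batch`, `run_crawl` and `fake_fetch` are hypothetical names introduced only for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def assign_urls(urls, n_workers):
    """Master step: deal URLs out round-robin so each worker gets a near-equal share."""
    batches = [[] for _ in range(n_workers)]
    for i, url in enumerate(urls):
        batches[i % n_workers].append(url)
    return batches


def crawl_batch(batch, fetch):
    """Map step run by one worker: fetch each page and emit (site, 1) pairs."""
    return [(fetch(url)["site"], 1) for url in batch]


def run_crawl(urls, n_workers, fetch):
    batches = assign_urls(urls, n_workers)
    # Threads are an assumption standing in for the slave nodes of a cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        mapped = list(pool.map(lambda b: crawl_batch(b, fetch), batches))
    # Reduce step: merge the per-worker outputs into page counts per site.
    counts = Counter()
    for pairs in mapped:
        for site, n in pairs:
            counts[site] += n
    return batches, counts


def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request plus HTML parsing.
    return {"site": url.split("/")[2], "title": "", "text": ""}
```

Calling `run_crawl(urls, 2, fake_fetch)` returns the per-worker batches together with the reduced page count per site, which makes the load balance easy to inspect.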
To facilitate subsequent processing of the news text data, the preprocessing in S101 applies operations such as Chinese word segmentation and stop word filtering to the news text.
Chinese word segmentation is an important stage of text processing. Its biggest difference from English word segmentation is that English words are separated by spaces and can therefore be distinguished easily, whereas Chinese text has no such delimiters, which makes Chinese word segmentation more difficult. Current mainstream Chinese word segmentation techniques are mainly based on hidden Markov models and conditional random fields. The invention uses the open-source jieba word segmentation framework, that is, a hidden Markov model is used for word segmentation and part-of-speech tagging, and the jieba word bank is updated in time for words that the jieba dictionary segments improperly.
In the stop word filtering process, since news is long text data, it contains a large number of useless words such as "in", "out" and "yes" that are irrelevant to topic detection. These words represent documents poorly and reduce the running efficiency of the later model, so they must be filtered out. A stop word list is therefore established to remove irrelevant words; the stop word list refers to a Chinese stop word list of a fox search laboratory. In addition, after part-of-speech tagging with jieba, pronouns, modal particles and prepositions are filtered out.
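The filtering step can be sketched as follows. The sketch works on already segmented and tagged tokens so that it runs without any third-party package; the (word, flag) pairs are assumed to have the form produced by `jieba.posseg.cut(text)`, the flag prefixes `r`, `y` and `p` are assumed to mark pronouns, modal particles and prepositions, and the stop word set is a tiny hypothetical sample rather than the list the patent refers to.

```python
# Assumed part-of-speech flag prefixes: r = pronoun, y = modal particle,
# p = preposition (the ICTCLAS-style tag set used by jieba).
DROPPED_FLAGS = {"r", "y", "p"}


def filter_tokens(tagged_tokens, stop_words):
    """Keep words that are not stop words and whose POS flag is not dropped."""
    kept = []
    for word, flag in tagged_tokens:
        if word in stop_words:
            continue
        if flag and flag[0] in DROPPED_FLAGS:
            continue
        kept.append(word)
    return kept


# Hypothetical segmenter output for one sentence, as (word, flag) pairs.
tagged = [("我们", "r"), ("在", "p"), ("北京", "ns"), ("发布", "v"),
          ("了", "ul"), ("新闻", "n"), ("吗", "y")]
stop_words = {"了", "是"}  # hypothetical sample, not a real stop word list
filtered = filter_tokens(tagged, stop_words)
```

Here `filtered` keeps only the place name, the verb and the noun, which are the tokens most likely to carry topic information.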
In order to improve the efficiency of processing a huge amount of news, it is necessary to use a parallelized processing method.
The parallelized training model in S102 is a linear combination model of the parallelized word vector model and the parallelized topic model.
In S102, specifically, a Word2vec word vector model and an LDA (Latent Dirichlet Allocation) topic model are each trained on the news text. When calculating text similarity, a linear combination of the two models is used with a weighting factor a, so that text similarity is calculated by combining Word2vec and LDA.
The method models text with a weighted combination of the Word2vec word vector model and the LDA topic model and uses cosine similarity to calculate text similarity. Specifically, suppose that the vectors of document X after vectorization by Word2vec and LDA are $X_w$ and $X_L$ respectively, and that the vectors of document Y after vectorization by Word2vec and LDA are $Y_w$ and $Y_L$ respectively. The text similarity formula under the Word2vec modeling mode is:
$$\mathrm{sim}_{\mathrm{Word2vec}}(X, Y) = \cos(X_w, Y_w),$$
and the text similarity formula under the LDA modeling mode is:
$$\mathrm{sim}_{\mathrm{LDA}}(X, Y) = \cos(X_L, Y_L).$$
In the topic detection model, Word2vec and LDA are modeled jointly: text similarity is calculated as a linear combination of the above two formulas with a weighting factor $a$, which gives the similarity of two documents X and Y:
$$\mathrm{sim}(X, Y) = a\,\mathrm{sim}_{\mathrm{Word2vec}}(X, Y) + (1 - a)\,\mathrm{sim}_{\mathrm{LDA}}(X, Y).$$
Thus, the text similarity can be calculated by combining Word2vec and LDA.
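As a small worked illustration of the weighted similarity, the sketch below computes the two cosine terms and combines them with the factor a. The vectors are assumed to be already available as plain lists of floats; how they are produced by Word2vec and LDA is outside the sketch, and the default a = 0.5 is an arbitrary assumption, since the text does not give a value.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(x_w, x_l, y_w, y_l, a=0.5):
    """sim(X, Y) = a * cos(X_w, Y_w) + (1 - a) * cos(X_L, Y_L)."""
    return a * cosine(x_w, y_w) + (1.0 - a) * cosine(x_l, y_l)
```

With a = 1 the score reduces to the Word2vec similarity alone and with a = 0 to the LDA similarity alone, so a controls how much each model contributes.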
Because the serial model training speed is slow, the serial mode is improved, and the distributed parallel mode is used for accelerating the training of the model.
The specific method for accelerating the model training by using the parallelization mode is as follows:
During model training, the previous serial training method is changed to distributed GPU parallel training. Each time, the host transfers the initialized parameters of the network model and a sentence of length MAX_SENTENCE_LENGTH to the GPU, and the GPU starts MAX_SENTENCE_LENGTH threads to train the MAX_SENTENCE_LENGTH words simultaneously, so that parallel training is achieved at the task level. The number of threads is also optimized. Since a thread bundle (warp) contains 32 threads and is the basic unit of GPU thread scheduling and execution, the recommended total number of threads is a multiple of 32, which achieves higher throughput.
In the serial training mode, the macro MAX_SENTENCE_LENGTH is defined as 1000. This value determines that, when the text is divided into sentences, every sentence except the last has length 1000, that is, 1000 words. When the model is trained on the GPU, the number of threads launched each time is determined by the sentence length, with one thread per word, so 1000 threads would be launched each time. Because 1000 is not a multiple of 32, the macro definition of MAX_SENTENCE_LENGTH is changed from 1000 to 1024, so that 1024 threads are launched each time. This thread count is a multiple of 32, which gives higher throughput and improves execution efficiency.
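The arithmetic behind the change from 1000 to 1024 can be checked with a short host-side sketch. It only shows the rounding and the splitting of a word stream into sentences; the GPU kernel itself is not reproduced, and `round_up_to_warp` and `split_into_sentences` are hypothetical helper names.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def round_up_to_warp(n, warp_size=WARP_SIZE):
    """Smallest multiple of warp_size that is greater than or equal to n."""
    return ((n + warp_size - 1) // warp_size) * warp_size


def split_into_sentences(words, max_len):
    """Cut a word stream into chunks of max_len words; only the last may be shorter."""
    return [words[i:i + max_len] for i in range(0, len(words), max_len)]


# Rounding the serial setting of 1000 up to a full number of warps gives 1024.
MAX_SENTENCE_LENGTH = round_up_to_warp(1000)
sentences = split_into_sentences(list(range(2500)), MAX_SENTENCE_LENGTH)
```

With one thread per word, each full sentence then occupies exactly 32 warps, so no warp is launched partially filled.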
Further, in order to improve the accuracy of the processing of the news text data, before S102, the method includes:
obtaining a training sample, and performing calibration sampling on news text data in the training sample to obtain a training sample set.
And performing parallelization training on the training sample set to obtain the similarity among the news text data in the training sample set.
In S104 and S105, news is classified by combining the pre-trained Word2vec model with a convolutional neural network. The network structure differs from a conventional convolutional neural network in that an embedding layer, i.e. an embedding layer of word vectors, is added before the convolutional layer.
In the currently common approach, the embedding layer is placed in the neural network and trained together with it, which enlarges the number of training parameters, lengthens training time and leads to over-fitting. The invention therefore processes data with an improved convolutional neural network model whose embedding layer is formed by embedding the Word2vec word vector model, which reduces both the training parameters and the training time.
The specific embedding method is as follows. The data set is news text data. Assume that the data set has d documents and that each word of a document has a mapping $w_i$, where i denotes the i-th word of the document. Assuming the maximum length of a news text is m, the original input has dimension $d \times m$. A pre-trained Word2vec model is then applied, in which each word maps to a word vector; assuming the word vector dimension is n, the original input is converted to dimension $d \times m \times n$. One more dimension is then added to meet the requirement of two-dimensional convolution, so the dimension of the input layer is $d \times m \times n \times 1$. This yields an embedding layer embedded by the Word2vec model.
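The shape bookkeeping of the embedding step can be traced with a dependency-free sketch. The lookup table `word_vectors` stands in for a pre-trained Word2vec model, and padding short documents with zero vectors, truncating long ones to m words and mapping unknown words to zeros are all assumptions made for the example.

```python
def embed_documents(docs, word_vectors, m, n):
    """Turn d tokenized documents into a nested list of shape d x m x n x 1."""
    zero = [0.0] * n
    batch = []
    for words in docs:
        # Look up each of the first m words; unknown words map to zeros (assumption).
        rows = [word_vectors.get(w, zero) for w in words[:m]]
        # Pad short documents with zero vectors up to m rows (assumption).
        rows += [zero] * (m - len(rows))
        # Wrap every scalar in a list to add the trailing channel axis of size 1.
        batch.append([[[x] for x in row] for row in rows])
    return batch


def shape(nested):
    """Report the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Hypothetical 3-dimensional word vectors for two known words.
word_vectors = {"news": [0.1, 0.2, 0.3], "hot": [0.4, 0.5, 0.6]}
docs = [["news", "hot"], ["hot", "topic", "news", "extra", "more"]]
batch = embed_documents(docs, word_vectors, m=4, n=3)
```

For these two documents with m = 4 and n = 3, `shape(batch)` reports a 2 × 4 × 3 × 1 input, matching the $d \times m \times n \times 1$ layout described above.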
The convolutional part of the network also differs from a traditional CNN. A traditional CNN connects convolutional layers in series, with each convolutional layer feeding directly into the next. Here, the embedding layer is instead convolved three times, each time with convolution kernels of a different size, and the resulting convolutional layers are connected in parallel with a degree of parallelism of 3. The convolutional layers thus extract features from the embedding layer using kernels of 3 different sizes, the pooling layer then extracts the most representative feature in each feature map, and finally a fully connected layer completes the multi-class classification.
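The parallel branches can be sketched for a single document as follows. Each kernel spans the full word vector width and a different number of consecutive words, a ReLU activation and max pooling over positions are assumed, and the kernel heights 2, 3 and 4 with all-ones weights are illustrative values only. The fully connected classification layer is omitted.

```python
def conv_feature_map(matrix, kernel):
    """Slide a k x n kernel down an m x n matrix; return m - k + 1 ReLU activations."""
    k = len(kernel)
    out = []
    for start in range(len(matrix) - k + 1):
        total = 0.0
        for i in range(k):
            total += sum(x * w for x, w in zip(matrix[start + i], kernel[i]))
        out.append(max(0.0, total))  # ReLU is an assumed activation
    return out


def parallel_conv_features(matrix, kernels):
    """Apply every kernel to the same input and max-pool each resulting feature map."""
    return [max(conv_feature_map(matrix, kernel)) for kernel in kernels]


# A hypothetical 5-word document with 2-dimensional word vectors.
doc_matrix = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 1]]
# Three kernels of different heights, applied in parallel to the same input.
kernels = [[[1, 1]] * 2, [[1, 1]] * 3, [[1, 1]] * 4]
features = parallel_conv_features(doc_matrix, kernels)
```

The three pooled values in `features` would be concatenated and passed to the fully connected layer; in a real model each branch would hold many learned kernels rather than one fixed kernel.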
In addition, in order to count and analyze the news text data under each category, after S105, the method for extracting massive news hot topics provided by the present invention further includes:
and performing word frequency distribution analysis, region distribution analysis and site distribution analysis on the classified news text data.
And respectively counting the news text data with the same word frequency distribution, the news text data with the same regional distribution and the news text data with the same site distribution.
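One simple way to realize these counts is sketched below with `collections.Counter`. The article fields `words`, `region` and `site` are hypothetical names chosen for the example; the text does not specify how the classified articles are represented.

```python
from collections import Counter


def distribution_stats(articles):
    """Count word frequencies and the number of articles per region and per site."""
    word_freq, by_region, by_site = Counter(), Counter(), Counter()
    for art in articles:
        word_freq.update(art["words"])
        by_region[art["region"]] += 1
        by_site[art["site"]] += 1
    return word_freq, by_region, by_site


# Hypothetical classified articles from one category.
articles = [
    {"words": ["flood", "rescue"], "region": "Beijing", "site": "site-a"},
    {"words": ["flood", "relief"], "region": "Beijing", "site": "site-b"},
    {"words": ["rescue"], "region": "Shanghai", "site": "site-a"},
]
word_freq, by_region, by_site = distribution_stats(articles)
```

`word_freq.most_common(k)` then gives the k most frequent words of the category.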
In order to further improve the extraction accuracy and the high efficiency of the news hot topics, a Single-Pass clustering algorithm is adopted in S106 to perform topic clustering on the classified news text data to obtain the news hot topics, and the method specifically includes:
and carrying out topic clustering on the classified news text data in the unit window by adopting a Single-Pass clustering algorithm to obtain a first hot topic set.
And clustering the first hot topic set in a specific time period by adopting a Single-Pass clustering algorithm to obtain a second hot topic set. And the news text data in the second hot topic set is the extracted news hot topic.
The specific implementation process of S106 is as follows:
With the improved double-layer Single-Pass clustering method, the news hot topics within a period of time can be extracted, and the hot topics of each time period can be merged to obtain an overall hot topic set across several time periods.
The general idea of the double-layer Single-Pass clustering designed by the invention is as follows. Single-Pass clustering is first applied to the network news within each unit time window (in the invention, the unit time window is one day) to obtain the hot topics in each time window; for example, primary clustering yields the hot topics of each day within 1 week. This alone does not meet the requirement, because single-layer Single-Pass clustering cannot produce the overall hot topics for the whole week. Secondary clustering is therefore designed on top of the topic sets obtained by primary clustering, and it also uses Single-Pass clustering. For example, once primary clustering has produced the daily hot topic sets within a week, these hot topic sets become the objects processed by secondary clustering. Secondary clustering merges the hot topic sets of each day in the week, yielding the hot topics of the whole week.
The advantage of designing double-layer Single-Pass clustering is that a hot topic set of a unit time window can be obtained, and topics under the unit time window can be combined to detect hot topics of a larger time window. Because the object processed in the secondary clustering is the topic set of the primary clustering, the clustering cost is greatly reduced, and the clustering efficiency is improved.
The specific algorithm steps of the double-layer Single-Pass clustering designed by the invention are described in detail below. The method is divided into a primary clustering part and a secondary clustering part.
Wherein the primary clustering comprises:
Input: d news document vectors, similarity threshold s
Output: topic set T
The specific steps are as follows:
Step 1: Input the first document $d_1$ and create the first new topic $t_1$ from it. The center vector of a topic is the average of the vectors that have entered the topic.
Step 2: Input a document $d_i$ ($i = 2, 3, 4, \ldots, d$) and calculate the similarity between $d_i$ and each existing topic set. Take the topic set with the maximum similarity as the candidate set. If the maximum similarity is greater than s, $d_i$ enters that topic set; otherwise $d_i$ forms a new topic.
Step 3: Repeat step 2 until the last document has been input, then output all topic sets T.
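The primary clustering steps can be sketched as follows. Cosine similarity between a document vector and the topic center is assumed as the similarity measure, and the center is kept as a running mean of the member vectors; the dictionary layout of a topic is a hypothetical choice for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_pass(vectors, s):
    """Cluster document vectors in one pass; return a list of topics."""
    topics = []
    for idx, vec in enumerate(vectors):
        # Find the existing topic whose center is most similar to this document.
        best, best_sim = None, -1.0
        for topic in topics:
            sim = cosine(vec, topic["center"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim > s:
            best["members"].append(idx)
            n = len(best["members"])
            # Keep the center equal to the mean of all member vectors.
            best["center"] = [(c * (n - 1) + x) / n
                              for c, x in zip(best["center"], vec)]
        else:
            # No topic is similar enough: the document starts a new topic.
            topics.append({"members": [idx], "center": list(vec)})
    return topics


# Four hypothetical 2-dimensional document vectors and a threshold of 0.8.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
topics = single_pass(vectors, s=0.8)
```

Because each document is compared only with the current topic centers, the result depends on input order, which is a known property of single-pass clustering.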
The secondary clustering comprises the following steps:
step 1: all topic sets T obtained according to primary clusteringiGo to the first topic T1Creating an inclusion T1Setting a similarity threshold value s for the known topic set;
step 2: the next topic set TiIn, calculate TiThe similarity between each topic i in the set of known topics and the topic in the set of known topics is taken as a candidate set.
And step 3: if the similarity value is obtained to be s, the two topic sets are merged, otherwise, a new topic set is established, and the new topic set comprises a topic i;
and 4, step 4: and (5) repeating the steps 2 and 3 until the last topic set is processed, and outputting a combined topic set T'.
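The secondary clustering can be sketched in the same style. Each daily topic is assumed to be summarized by a center vector and a size (number of member documents), topics are compared by the cosine similarity of their centers, and merged centers are weighted by size; all three choices are assumptions made for the example rather than details given in the text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_topic_sets(daily_topic_sets, s):
    """Merge per-day topic sets into one topic set for the whole period."""
    if not daily_topic_sets:
        return []
    # Step 1: the first day's topics form the initial known topic set.
    known = [{"center": list(t["center"]), "size": t["size"]}
             for t in daily_topic_sets[0]]
    for topic_set in daily_topic_sets[1:]:
        for topic in topic_set:
            # Step 2: find the most similar known topic.
            best, best_sim = None, -1.0
            for k in known:
                sim = cosine(topic["center"], k["center"])
                if sim > best_sim:
                    best, best_sim = k, sim
            if best is not None and best_sim > s:
                # Step 3a: merge, weighting each center by its topic size.
                total = best["size"] + topic["size"]
                best["center"] = [(a * best["size"] + b * topic["size"]) / total
                                  for a, b in zip(best["center"], topic["center"])]
                best["size"] = total
            else:
                # Step 3b: no match, so the topic joins the known set as a new topic.
                known.append({"center": list(topic["center"]), "size": topic["size"]})
    return known


# Two hypothetical days of primary clustering output.
day1 = [{"center": [1.0, 0.0], "size": 3}, {"center": [0.0, 1.0], "size": 2}]
day2 = [{"center": [0.9, 0.1], "size": 1}, {"center": [0.7, 0.7], "size": 4}]
merged = merge_topic_sets([day1, day2], s=0.8)
```

Since the second layer compares topic centers rather than individual documents, the number of similarity computations is governed by the number of topics, not the number of articles.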
Compared with the prior art, the method for extracting massive news hot topics provided by the invention has the following characteristics and beneficial effects:
1. The invention acquires the latest news data from major domestic portal websites, mainly using web crawler technology to crawl online news at large scale. The text is then modeled by combining a Word2vec word vector model with an LDA topic model; similarity is calculated from a weighted combination of the two models, which performs markedly better than using either model alone.
2. The invention classifies the crawled news before topic detection, which both meets the needs of users interested in different fields and reduces the time cost of clustering. For text classification, the invention combines pre-trained Word2vec with a convolutional neural network, which reduces the number of parameters to be trained in the neural network and helps prevent overfitting.
3. The invention clusters the classified data to extract the hot topics under each category. Because online news is updated continuously, traditional clustering algorithms are not suitable, so the invention designs a double-layer Single-Pass clustering. This method can not only extract the hot topics of news within a given time period, but also merge the hot topics of each time period to obtain an overall hot topic set across multiple time periods.
4. Parallel processing is an inevitable trend in big data processing, so the invention replaces the traditional serial mode of operation with a parallel mode throughout the whole process. The method as a whole not only improves the accuracy of the hot topics but also greatly improves the efficiency of processing massive news data. In addition, the invention helps users accurately and quickly obtain the topics they care about.
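The weighted combination mentioned in feature 1 can be illustrated with a short sketch. It assumes that each document has a Word2vec-based vector and an LDA topic-distribution vector, that both are compared with cosine similarity, and that a single weight `alpha` balances the two; the weight value and the function name `combined_similarity` are hypothetical and are not taken from the patent text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def combined_similarity(w2v_a, w2v_b, lda_a, lda_b, alpha=0.5):
    """Linear combination of word-vector similarity and topic-model similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * cosine(w2v_a, w2v_b) + (1.0 - alpha) * cosine(lda_a, lda_b)
```

With `alpha` near 1 the score is dominated by word-level semantics, and with `alpha` near 0 by topic-level overlap; in practice the weight would be tuned on labeled data.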
Corresponding to the above method, the invention also provides a system for extracting massive news hot topics. As shown in fig. 2, the system comprises: a data acquisition module 1, a preprocessing module 2, a training model acquisition module 3, a first similarity determination module 4, a convolutional neural network model acquisition module 5, a data classification module 6 and a news hot topic acquisition module 7.
The data acquisition module 1 is used for acquiring news text data.
The preprocessing module 2 is used for preprocessing the acquired news text data.
The training model acquisition module 3 is used for acquiring a parallelized training model. The parallelized training model is a network training model that takes the preprocessed news text data as input and the similarity among the news text data as output.
The first similarity determination module 4 is used for obtaining the similarity among the news text data from the preprocessed news text data by using the parallelized training model.
The convolutional neural network model acquisition module 5 is used for acquiring an improved convolutional neural network model. The improved convolutional neural network model is a neural network model that takes the similarity among news text data as input and the classification of the news text data as output.
The data classification module 6 is used for obtaining the classification of the news text data from the similarity among the news text data by using the improved convolutional neural network model.
The news hot topic acquisition module 7 is used for performing topic clustering on the classified news text data with a clustering algorithm to obtain the news hot topics.
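The data flow between modules 1 to 7 can be sketched as a chain of pluggable stages. This is only an architectural illustration: the class name `HotTopicPipeline`, the callable signatures, and the choice to pass both the preprocessed text and the class labels to the clustering stage are assumptions, and each stage is left as a placeholder to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class HotTopicPipeline:
    acquire: Callable[[], Any]            # module 1: fetch news text data
    preprocess: Callable[[Any], Any]      # module 2: clean the text
    similarity: Callable[[Any], Any]      # modules 3 and 4: parallelized model
    classify: Callable[[Any], Any]        # modules 5 and 6: improved CNN
    cluster: Callable[[Any, Any], Any]    # module 7: topic clustering

    def run(self) -> Any:
        raw = self.acquire()
        clean = self.preprocess(raw)
        sims = self.similarity(clean)
        labels = self.classify(sims)
        return self.cluster(clean, labels)
```

Keeping each stage behind a plain callable makes it possible to swap a serial implementation for a parallel one without changing the wiring.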
In addition, in order to further improve the accuracy of data processing, the system may further comprise a training sample set acquisition module and a second similarity determination module.
The training sample set acquisition module is used for acquiring training samples and performing calibration sampling on the news text data in the training samples to obtain a training sample set.
The second similarity determination module is used for performing parallelized training on the training sample set to obtain the similarity among the news text data in the training sample set.
In order to make the data analysis more comprehensive, the system may further comprise an analysis module and a statistical module.
The analysis module is used for performing word frequency distribution analysis, region distribution analysis and site distribution analysis on the classified news text data.
The statistical module is used for separately counting the news text data with the same word frequency distribution, the news text data with the same region distribution and the news text data with the same site distribution.
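A minimal sketch of the counting performed by the analysis and statistical modules is given below. It assumes each classified article is a dictionary with hypothetical keys `tokens`, `region` and `site`; the patent text does not specify a record layout, so these field names are illustrative only.

```python
from collections import Counter


def distribution_stats(articles):
    """Count word frequencies and the number of articles per region and per site."""
    word_freq = Counter()
    by_region = Counter()
    by_site = Counter()
    for article in articles:
        word_freq.update(article["tokens"])   # word frequency distribution
        by_region[article["region"]] += 1     # region distribution
        by_site[article["site"]] += 1         # site distribution
    return word_freq, by_region, by_site
```

Calling `word_freq.most_common(10)` on the first result would then list the ten most frequent words in a category.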
The news hot topic acquisition module 7 may further comprise a first hot topic set acquisition unit and a second hot topic set acquisition unit.
The first hot topic set acquisition unit is used for performing topic clustering on the classified news text data within a unit window with a clustering algorithm to obtain a first hot topic set.
The second hot topic set acquisition unit is used for clustering the first hot topic sets within a specific time period with a clustering algorithm to obtain a second hot topic set. The news text data in the second hot topic set are the extracted news hot topics.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced. Because the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention. Meanwhile, a person skilled in the art may, following the idea of the present invention, modify the specific embodiments and the scope of application. In view of the above, the contents of this description should not be construed as limiting the invention.

Claims (10)

1. A method for extracting massive news hot topics is characterized by comprising the following steps:
acquiring news text data;
preprocessing the acquired news text data;
acquiring a parallelized training model; the parallelized training model is a network training model which takes preprocessed news text data as input and takes the similarity among the news text data as output;
obtaining the similarity between the news text data according to the preprocessed news text data by using the parallelized training model;
obtaining an improved convolutional neural network model; the improved convolutional neural network model is a neural network model which takes the similarity among news text data as input and takes the classification of the news text data as output;
obtaining the classification of the news text data according to the similarity between the news text data by using the improved convolutional neural network model;
and carrying out topic clustering on the classified news text data by adopting a clustering algorithm to obtain hot news topics.
2. The method of claim 1, wherein the parallelized training model is a linear combination model of a parallelized word vector model and a parallelized topic model.
3. The method for extracting massive news hot topics according to claim 1, wherein before the obtaining of the parallelized training model, the method further comprises:
acquiring a training sample, and performing calibration sampling on news text data in the training sample to obtain a training sample set;
and performing parallelization training on the training sample set to obtain the similarity among the news text data in the training sample set.
4. The method as claimed in claim 1, wherein after the classification of the news text data is obtained according to the similarity between the news text data by using the improved convolutional neural network model, the method further comprises:
performing word frequency distribution analysis, region distribution analysis and site distribution analysis on the classified news text data;
and respectively counting the news text data with the same word frequency distribution, the news text data with the same regional distribution and the news text data with the same site distribution.
5. The method for extracting massive news hot topics as claimed in claim 1, wherein the step of performing topic clustering on the classified news text data by using a Single-Pass clustering algorithm to obtain news hot topics comprises the following steps:
carrying out topic clustering on the classified news text data in the unit window by adopting a clustering algorithm to obtain a first hot topic set;
clustering the first hot topic set in a specific time period by adopting a clustering algorithm to obtain a second hot topic set; and the news text data in the second hot topic set is the extracted news hot topic.
6. A massive news hot topic extraction system is characterized by comprising:
the data acquisition module is used for acquiring news text data;
the preprocessing module is used for preprocessing the acquired news text data;
the training model acquisition module is used for acquiring a parallelized training model; the parallelized training model is a network training model which takes preprocessed news text data as input and takes the similarity among the news text data as output;
the first similarity determining module is used for obtaining the similarity between the news text data according to the preprocessed news text data by utilizing the parallelized training model;
the convolutional neural network model acquisition module is used for acquiring an improved convolutional neural network model; the improved convolutional neural network model is a neural network model which takes the similarity among news text data as input and takes the classification of the news text data as output;
the data classification module is used for obtaining classification of the news text data according to the similarity between the news text data by utilizing the improved convolutional neural network model;
and the news hot topic acquisition module is used for carrying out topic clustering on the classified news text data by adopting a clustering algorithm to obtain the news hot topic.
7. The massive news hot topic extraction system according to claim 6, wherein the parallelized training model is a linear combination model of a parallelized word vector model and a parallelized topic model.
8. The massive news hot topic extraction system as claimed in claim 6, wherein the system further comprises:
the training sample set acquisition module is used for acquiring a training sample, and performing calibration sampling on news text data in the training sample to obtain a training sample set;
and the second similarity determining module is used for performing parallel training on the training sample set to obtain the similarity between the news text data in the training sample set.
9. The massive news hot topic extraction system as claimed in claim 6, wherein the system further comprises:
the analysis module is used for carrying out word frequency distribution analysis, region distribution analysis and site distribution analysis on the classified news text data;
and the statistical module is used for respectively carrying out statistics on the news text data with the same word frequency distribution, the news text data with the same regional distribution and the news text data with the same site distribution.
10. The system for extracting massive news hot topics as claimed in claim 6, wherein the news hot topic acquisition module comprises:
the first hot topic set acquisition unit is used for carrying out topic clustering on the classified news text data in the unit window by adopting a clustering algorithm to obtain a first hot topic set;
the second hot topic set acquisition unit is used for clustering the first hot topic set in a specific time period by adopting a clustering algorithm to obtain a second hot topic set; and the news text data in the second hot topic set is the extracted news hot topic.
CN201911344883.7A 2019-12-24 2019-12-24 Massive news hot topic extraction method and system Active CN111090811B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911344883.7A CN111090811B (en) 2019-12-24 2019-12-24 Massive news hot topic extraction method and system

Publications (2)

Publication Number Publication Date
CN111090811A true CN111090811A (en) 2020-05-01
CN111090811B CN111090811B (en) 2023-09-01

Family

ID=70395273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911344883.7A Active CN111090811B (en) 2019-12-24 2019-12-24 Massive news hot topic extraction method and system

Country Status (1)

Country Link
CN (1) CN111090811B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905751A (en) * 2021-03-19 2021-06-04 常熟理工学院 Topic evolution tracking method combining topic model and twin network model
CN113064990A (en) * 2021-01-04 2021-07-02 上海金融期货信息技术有限公司 Hot event identification method and system based on multi-level clustering
CN113157857A (en) * 2021-03-13 2021-07-23 中国科学院新疆理化技术研究所 Hot topic detection method, device and equipment for news
CN113486176A (en) * 2021-07-08 2021-10-08 桂林电子科技大学 News classification method based on secondary feature amplification

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831193A (en) * 2012-08-03 2012-12-19 人民搜索网络股份公司 Topic detecting device and topic detecting method based on distributed multistage cluster
CN103793503A (en) * 2014-01-24 2014-05-14 北京理工大学 Opinion mining and classification method based on web texts
CN105320646A (en) * 2015-11-17 2016-02-10 天津大学 Incremental clustering based news topic mining method and apparatus thereof
CN105718590A (en) * 2016-01-27 2016-06-29 福州大学 Multi-tenant oriented SaaS public opinion monitoring system and method
US20180007090A1 (en) * 2016-06-30 2018-01-04 Fortinet, Inc. Classification of top-level domain (tld) websites based on a known website classification
CN107832456A (en) * 2017-11-24 2018-03-23 云南大学 A kind of parallel KNN file classification methods based on the division of critical Value Data
CN108197144A (en) * 2017-11-28 2018-06-22 河海大学 A kind of much-talked-about topic based on BTM and Single-pass finds method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113064990A (en) * 2021-01-04 2021-07-02 上海金融期货信息技术有限公司 Hot event identification method and system based on multi-level clustering
CN113157857A (en) * 2021-03-13 2021-07-23 中国科学院新疆理化技术研究所 Hot topic detection method, device and equipment for news
CN113157857B (en) * 2021-03-13 2023-06-02 中国科学院新疆理化技术研究所 Hot topic detection method, device and equipment for news
CN112905751A (en) * 2021-03-19 2021-06-04 常熟理工学院 Topic evolution tracking method combining topic model and twin network model
CN112905751B (en) * 2021-03-19 2024-03-29 常熟理工学院 Topic evolution tracking method combining topic model and twin network model
CN113486176A (en) * 2021-07-08 2021-10-08 桂林电子科技大学 News classification method based on secondary feature amplification

Also Published As

Publication number Publication date
CN111090811B (en) 2023-09-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant