CN107122478B

CN107122478B - Method for extracting hot topics based on keywords

Info

Publication number: CN107122478B
Application number: CN201710304817.1A
Authority: CN
Inventors: 陆川; 孙健; 杨伟
Original assignee: Chengdu Yunshu Future Information Science Co ltd
Current assignee: Chengdu Yunshu Future Information Science Co ltd
Priority date: 2017-05-03
Filing date: 2017-05-03
Publication date: 2020-05-08
Anticipated expiration: 2037-05-03
Also published as: CN107122478A

Abstract

The invention discloses a method for extracting hot topics based on keywords. Massive data are unified into a single format and segmented into words to form a corpus. The corpus is divided into blocks that are processed in parallel to obtain a candidate word set for each block. Each candidate word set is TF-IDF weighted and deduplicated to obtain a reference text. The cosine similarity between the reference text and the other texts in the block is computed to extract the texts similar to the reference text, and the candidate words of these similar texts are sorted by descending word frequency to find several hot topics. Finally, hot topics are extracted again from the texts of these hot topics, yielding topics that represent the main viewpoints of the massive data.

Description

Method for extracting hot topics based on keywords

Technical Field

The invention belongs to the technical field of network public opinion monitoring, and particularly relates to a method for extracting hot topics based on keywords.

Background

With the vigorous development of internet technology and the rapid spread of related applications, people are no longer only consumers of information but also producers of it. Using terminals such as computers and mobile phones, a netizen can obtain or publish information at any time and place on microblogs, social networks, news sites, blogs and other websites, and many commercial portals such as Sina and NetEase collect and provide rich news reports for users. For example, the rich and varied content that users publish on a microblog platform not only creates hotly debated topics within the social network but also leads many traditional media outlets to follow up the related events through the microblog.

Due to the explosive growth of internet data and its fast-food, fragmented character, information overload and lack of completeness have become increasingly obvious. People living at a fast pace feel overwhelmed by the new information that emerges constantly, yet they urgently want to learn quickly and in time which hot topics society is discussing. Hot topics are characterized by timeliness, diversity and generality.

Efficiently mining useful information from the internet raises many difficulties for network monitoring. For example, users demand ever more intelligent network monitoring, and hot spots obtained from a small amount of text no longer meet the needs of modern network monitoring; in data mining, the user no longer faces simple, small amounts of text information … …. Hot topic detection is a data mining technology that automatically discovers and organizes semantic associations in network information and helps users quickly obtain an overview of that information, and it has attracted strong attention in academia and industry in recent years.

As a research focus in the field of information processing, public opinion topic detection and tracking technology took news media information streams as its research object in its early stage: by monitoring the topics described in news, it found and tracked new information of interest to users, and finally organized the news related to a given topic and presented it to users in some form. As the rapid development of computer technology and the spread of the internet made social media increasingly popular, researchers turned to the social media forms representative of the time, such as blogs, e-mail, communities and forums. Unlike news reports, whose language is standardized and whose content is substantive, social media text is highly arbitrary, contains a large amount of worthless information, and shows little relevance between documents.

Faced with the continuous stream of massive internet information, purely manual supervision is difficult. Traditional hot topic discovery techniques target a small number of texts with little content, and hot spot detection generally searches from known topics: a newly detected hot spot is merged into an existing hot topic, which raises the heat of that topic, and the topic is then tracked. Because such detection and tracking techniques are designed for a small number of documents, they can hardly meet the practical requirement of detecting hot topics in a massive, continuous information stream. Even where detection is possible, the time complexity is high and the delay is pronounced, and since a user's energy is limited, the user cannot obtain the useful knowledge about a topic by reading all the documents.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a method for extracting hot topics based on keywords, which extracts hot topics from keywords and enables hot topics to be discovered and tracked every day.

In order to achieve the above object, the present invention provides a method for extracting a hot topic based on a keyword, comprising the steps of:

(1) crawling massive text data through a crawler, unifying the text data into a txt text format, and storing the txt text format in a database;

(2) extracting the text data from the database, and performing word segmentation on the text data with a Chinese word segmentation package to obtain a corpus consisting of words;

(3) equally dividing the corpus into M blocks, and filtering the word segmentation result of each block with a stop word list and filtering rules to obtain the block candidate word sets of the M blocks;

(4) performing TF-IDF weighting on the ith candidate keyword in the candidate word set of the pth block (p = 1, 2, …, M) to obtain M weighted block candidate word sets;

(4.1) calculating the frequency of the ith candidate keyword of the pth block in the jth text:

$$TF_{i,j}^{p} = \frac{n_{i,j}^{p}}{\sum_{s=1}^{k} n_{s,j}^{p}}$$

where $n_{i,j}^{p}$ represents the number of times that the ith candidate keyword of the pth block appears in the jth text, $\sum_{s=1}^{k} n_{s,j}^{p}$ represents the sum of the occurrence counts of all candidate keywords of the pth block in the jth text, and k represents the total number of candidate keywords of the jth text in the pth block;

(4.2) calculating the general importance measure $IDF_{i}^{p}$ of the ith candidate keyword in the pth block:

$$IDF_{i}^{p} = \log\frac{|D^{p}|}{|L^{p}|}$$

where $|D^{p}|$ represents the total number of texts in the pth block, and $|L^{p}|$ represents the number of texts in the pth block that contain the ith candidate keyword;

(4.3) calculating the weight of the ith candidate keyword of the pth block in the jth text:

$$w_{i,j}^{p} = TF_{i,j}^{p} \times IDF_{i}^{p}$$

(4.4) processing the remaining k-1 candidate keywords of the jth text in the pth block according to steps (4.1) - (4.3), then processing the other texts in the pth block; once the candidate word set of the pth block has been processed, processing the remaining block candidate word sets in the same way to finally obtain M weighted block candidate word sets;

(5) acquiring a reference text

(5.1) selecting the identical candidate keywords in the weighted candidate word set of the pth block, summing the weights of identical candidate keywords and keeping the original weights of distinct candidate keywords, thereby completing the deduplication of the weighted candidate word set of the pth block and obtaining a block keyword set;

(5.2) sorting the candidate keywords of the block keyword set in descending order of weight, then, taking the candidate keyword with the largest weight as the reference, finding the first text in the pth block candidate word set that contains this candidate keyword, and marking that text as the reference text;

(5.3) processing the remaining weighted block candidate word sets according to steps (5.1) - (5.2) to finally obtain M reference texts;

(6) finding a text set

(6.1) finding the reference text corresponding to the keyword set of block p, and forming a weight vector w₀ from the weights of the candidate keywords in the reference text;

(6.2) finding the other texts in the keyword set of block p, and forming weight vectors w₁, w₂, …, w_t, …, w_T from the weights of the candidate keywords in each of these texts, where T represents the total number of texts contained in the pth block keyword set;

(6.3) calculating the cosine similarity between each of the other texts in the keyword set of block p and the reference text using the cosine similarity formula;

(6.4) forming a cosine vector Q from the T cosine similarity values of the pth block keyword set, and comparing each element of Q with a preset threshold θ; if an element is greater than θ, the text corresponding to that element is judged to be highly similar to the reference text and is added, together with the reference text, to the text set of the pth block;

(6.5) processing the remaining block keyword sets according to steps (6.1) - (6.4) to finally obtain M text sets;

(7) in the text set of the pth block, adding 1 to the word frequency of a candidate keyword each time it appears again in the texts, thereby counting the word frequency of all candidate keywords in the texts; then sorting the word frequencies in descending order, taking the first several candidate keywords as hot keywords, and marking them as the hot topic of that text set; the hot topics of the remaining M-1 text sets are obtained in the same way;

(8) removing from the candidate word set of block p the similar texts that appear in its text set, and extracting h-1 further hot topics from the block candidate word set formed by the remaining texts according to steps (4) to (7); in the same way, extracting h-1 hot topics from each of the remaining M-1 block candidate word sets;

(9) storing all texts corresponding to the h hot topics obtained from each of the M block candidate word sets in the hot text set of the corresponding block, merging the M hot text sets into a new corpus, extracting candidate keyword sets according to step (3), and repeating steps (4) to (7) to obtain further hot topics.
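Step (6.3) relies on a cosine similarity formula that the text does not spell out. Assuming the standard definition, and treating each weight vector as indexed by the candidate keywords of the block (a keyword absent from a text contributes 0), the similarity between the reference vector and the vector of the t-th text can be written as below. The component index s and the vocabulary size n are notation introduced here for illustration only.

```latex
\cos(w_0, w_t) = \frac{w_0 \cdot w_t}{\lVert w_0 \rVert \, \lVert w_t \rVert}
= \frac{\sum_{s=1}^{n} w_{0,s}\, w_{t,s}}
       {\sqrt{\sum_{s=1}^{n} w_{0,s}^{2}}\;\sqrt{\sum_{s=1}^{n} w_{t,s}^{2}}},
\qquad t = 1, 2, \dots, T
```

Under this reading, text t joins the text set of step (6.4) when the value exceeds the threshold θ.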

The object of the invention is achieved as follows:

The method for extracting hot topics based on keywords unifies massive data into one format and performs word segmentation to form a corpus. The corpus is divided into blocks that are processed in parallel, and the same preprocessing is applied to each block to obtain its candidate word set. The candidate words of each text in each block are then weighted by TF-IDF (term frequency-inverse document frequency), and the weighted candidate word set of each block is deduplicated to obtain a reference text. Next, the cosine similarity between the reference text of each block and the other texts in the block is computed to extract the texts similar to the reference text, and the candidate keywords of the similar texts in each block are sorted by descending word frequency to find the hot topic of those texts. The candidate keyword sets of the similar texts are then removed from the overall candidate keyword set of the block, and TF-IDF weighting, reference text selection and cosine similarity processing are repeated on the remaining texts to obtain several hot topics per block. Finally, the candidate keyword sets corresponding to the hot topics of each block are merged into a new candidate keyword set, the new sets of all blocks are merged into one large candidate keyword set, and the above steps are repeated to find several hot topics of the large set. Hot topics are thus extracted from hot topics, and the result can represent the main viewpoints of the massive data.

Meanwhile, the method for extracting the hot topics based on the keywords further has the following beneficial effects:

(1) The text data are processed in parallel blocks. Since a large proportion of the texts in each block are not hot spots, the useless texts can be discarded once the texts covered by the hot topics amount to about half of the texts in the block. This is much faster than extracting hot topics from the whole corpus at once, because a large number of useless texts slow processing down, and block processing also avoids the memory shortage that arises when the whole corpus is processed at once. Efficiency improves and results are obtained sooner.

(2) Because computed results are limited to a finite number of digits, the weights obtained by block-wise TF-IDF weighting are more accurate than those obtained by processing the whole corpus. Block processing reduces the number of texts handled at a time, which raises calculation speed while preserving calculation precision. TF-IDF weighting is the basis of all subsequent steps, so its accuracy must be guaranteed.

(3) The greatest advantage of the method is that hot topics are extracted from hot topics, which has not been done before; the effect obtained is satisfactory, and the required results can be fully obtained.
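The text does not say which parallel framework is used for the blocks. As a minimal sketch, assuming Python's standard `concurrent.futures` module and a hypothetical per-block function `process_block` (here a stand-in that just returns the most frequent keyword of the block), block-wise parallel processing could look like this:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def process_block(block):
    """Stand-in for the per-block work: return the block's most frequent keyword."""
    counts = Counter(word for text in block for word in text)
    return counts.most_common(1)[0][0]


def process_blocks(blocks, max_workers=2):
    """Run process_block on every block concurrently; results keep block order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_block, blocks))


# Two toy blocks, each a list of already segmented and filtered texts.
blocks = [
    [["road", "traffic"], ["road", "vehicle"]],
    [["film", "actor"], ["film", "concert"]],
]
results = process_blocks(blocks)
print(results)
```

Threads are used here only to keep the sketch short. For CPU-bound weighting on large blocks, `ProcessPoolExecutor` from the same module is the more likely choice, since each block can be processed independently of the others.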

Drawings

FIG. 1 is a flowchart of a method for extracting hot topics based on keywords according to the present invention;

FIG. 2 is a graphical representation of the weights of each candidate keyword for each text in the block candidate word sets.

Detailed Description

The following describes embodiments of the present invention with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note that detailed descriptions of known functions and designs are omitted below where they might obscure the subject matter of the invention.

Examples

FIG. 1 is a flowchart of a method for extracting hot topics based on keywords according to the present invention.

In this embodiment, as shown in fig. 1, the method for extracting a hot topic based on a keyword according to the present invention includes the following steps:

S1, crawling news websites with a crawler, for example Sina, Baidu, Tencent … …, to collect a data set A of 100 news texts published on the same day, unifying the text data into txt text format, and storing the text data in a database;

a [' driving school services, charges are not publicly opaque; … … the driver qualified by the medical examination can be recovered for the person who has not submitted the medical examination and is logged off. ' policemen go to the hospital to take pictures of the old people for changing the certificates. "the old people … … also take the device to go home to exchange certificates for the old people, and for the special case, the old people can inform us by telephone, and people can arrange policemen to handle the certificate at home. "', … …,' main field 0: 1, the 'Fudi' Western-style safety is not provided for the national foot for more time, and … … national foot is weak to attack and is worthy of scaling. Next, opponents of 12-strength matches are only strong or weak compared with Syrian, and when the national foot of backwater one-battle attacks the rule and vigor, a more stable gate is selected, which is the urgent affairs', … … ];

in this embodiment, the text data may also be unified into a plurality of text formats such as csv, json, and the like.

S2, extracting the text data from the database, and performing word segmentation on the text data with a Chinese word segmentation package to obtain a corpus B consisting of the words in the texts.

B [ 'driving school', 'service', 'charge', 'public', 'transparent', 'test', 'subject', 'three four', 'back and forth', 'driving license', 'experience', 'heart', '10 month', '13 day', 'Chinese', 'city', 'province', 'government', 'learning', 'aim', 'change', 'Sichuan province', 'vehicle', 'driving', 'training', 'test', 'institution', 'reform', 'implementation', 'scheme', 'abbreviation', … … horse gas ',' drive correction ',' driving license ',' drive ',' high voltage ',' high fire ',' peak ',' high altitude ',' high new technology ',' high wave ',' high school ',' high altitude ',' high voltage ', and the like' can be generated in the environment, 'yellow Bowen', 'dim', 'encouragement', 'Dragon fountain zone', … … ];

S3, equally dividing the corpus into 2 blocks, and filtering the word segmentation result of each block with a stop word list and filtering rules to obtain the block candidate word sets of the 2 blocks, which are respectively:

1 [ [ 'driving school', 'service', 'fee charging', 'public', 'transparent', 'test', 'subject', 'three four', 'back and forth', 'driving license', 'experience', 'heart', '10 month', '13 day', 'Chinese', 'city', 'province', 'government', 'learning', 'aim', 'change', 'Sichuan province', 'vehicle', 'driving', 'training', 'test', 'institution', 'reform', 'implementation', 'scheme', 'short', '… …' training ',' facility ',' disabled ',' driving ',' adjustment ',' old 'physical examination', 'body', 'inspection', 'age', '60', 'week', 'adjustment', '70', 'week', 'adjustment', 'week', 'adjustment', 'delivery of' physical examination ',' adjustment ',' age ',' adjustment ',' transmission ', and' for example ',' device ',' adjustment ',' device ',' body ',' inspection ',' adjustment, 'logout', 'driver', 'person', 'physical', 'eligibility', 'recovery', 'driving', 'eligibility', 'western', 'city', '… …, [' electric shock ',' movie-circle ',' writer ',' group ',' domestic ',' force ',' writer ',' Liu Jiu ',' Liu Yun ',' undoubtedly ',' movie ',' comedy ',' participation ',' movie ',' mobile ',' von Xiao ',' movie ',' actor ',' identity ','2016 ',' Liu Yun ',' movie ',' work ',' audience ',' V ',' participation ',' 36 ',' movie ',' Von little ',' director ',' little adaptation ',' little word using ',' Liu 'etc', 'Liu' woman ',' rain ',' bird ',' pig ',' 36 ', a', 'consumer', '… …', 'commercial', 'trade', 'pig' and 'can' are included in the device, 'practice', 'test', 'round', 'fill', 'confidence', 'west', 'city', 'zhangjie' ];

2 [ [ 'Sichuan', 'daily', 'education', 'department', 'guarantee', 'household', 'office', 'registration', 'age-appropriate', 'child', 'accept', 'compulsory education', 'office', 'study', 'office', 'registration', 'management', 'notice', 'whole-time', 'make', 'household', 'person', 'clear', 'registration', 'check up', 'management', 'notice', 'make', 'place', 'check up', 'death', 'household', 'person', 'counter', 'clear up', 'place', 'clear up', 'death', 'household', 'place', 'clear up', 'place', 'communication', 'real', 'communication', 'touch up', 'muzzle', 'person', 'prime', 'person', 'resident', 'first place', 'notice', … …, 'muzzle', 'original place', 'rural', 'migration', 'residential', 'change into', 'town', 'resident', 'muzzle', 'Sichuan', 'date', 'Association', 'Sichuan', 'school society', 'Association', 'city institute', 'season', 'movable', '7 month', '13 day', 'open', 'real study', 'city', 'gay', 'city' and 'combined', 'city' ',' city school society ',' association ',' city initiation ',' city side ',' activity ',' city side ', etc.' voting ', the' and 'are' provided in the 'city, the' and the 'are provided in the' way of the 'city' and the 'city' voting system, the 'is provided with the' guide ',' city 'and the' guide ', the' guide, 'voting', 'focus', 'chinese', 'public', 'dialog', 'page', 'select', 'vote' ];

To obtain the required result quickly, the corpus formed by all documents is divided into blocks whose sizes are kept as uniform as possible. Each block is then preprocessed by filtering the word segmentation results with a stop word list and filtering rules. The stop word list contains function words such as auxiliary words, prepositions and conjunctions, as well as words of length 1 that carry no actual meaning. Corresponding rules are designed to filter out useless strings with obvious patterns, such as frequently occurring numeral-quantifier collocations and common but meaningless prefixes and suffixes.
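The blocking and filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the stop word list `STOP_WORDS`, the regular expression `NUMERAL_QUANTIFIER`, and the function names are all hypothetical, and English tokens stand in for segmented Chinese words.

```python
import re

# Hypothetical stop word list; the actual list used by the method is not given.
STOP_WORDS = {"the", "of", "and", "in", "to"}

# Hypothetical filtering rule: digits followed by at most two non-digit
# characters, e.g. numeral-quantifier strings such as "10月" or "13日".
NUMERAL_QUANTIFIER = re.compile(r"^\d+\D{0,2}$")


def split_into_blocks(texts, m):
    """Split a list of segmented texts into m blocks of near-equal size."""
    size, extra = divmod(len(texts), m)
    blocks, start = [], 0
    for p in range(m):
        end = start + size + (1 if p < extra else 0)
        blocks.append(texts[start:end])
        start = end
    return blocks


def filter_words(words):
    """Drop stop words, one-character words, and numeral-quantifier strings."""
    return [
        w for w in words
        if w not in STOP_WORDS and len(w) > 1 and not NUMERAL_QUANTIFIER.match(w)
    ]


texts = [
    ["driving", "school", "the", "fee", "10月"],
    ["film", "of", "director", "a", "actor"],
    ["city", "and", "airport", "13日"],
    ["olympic", "diving", "in", "gold"],
]
blocks = split_into_blocks(texts, 2)
candidate_sets = [[filter_words(text) for text in block] for block in blocks]
print(candidate_sets)
```

When the number of texts is not divisible by M, this sketch gives the first blocks one extra text each, which is one simple way to keep block sizes as uniform as possible.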

S4, TF-IDF weighting is applied to each candidate keyword of each text in each candidate word set: the TF of each candidate keyword is calculated first, then its IDF, and finally its TF-IDF weight. The weight matrices of the two blocks are shown in FIG. 2, where each row represents a text and each column holds the weights of a candidate keyword;
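A minimal sketch of the per-block weighting in S4 follows. It assumes the unsmoothed TF-IDF form with a natural logarithm; the text does not state the logarithm base or any smoothing, so both are assumptions, and the function name `tfidf_block` is hypothetical.

```python
import math
from collections import Counter


def tfidf_block(block):
    """Return one {keyword: weight} dict per text in a block.

    block: list of texts, each a list of candidate keywords.
    """
    n_texts = len(block)
    # Number of texts in the block that contain each keyword.
    doc_freq = Counter()
    for text in block:
        doc_freq.update(set(text))

    weights = []
    for text in block:
        counts = Counter(text)
        total = sum(counts.values())
        weights.append({
            # TF (share of the text's keyword occurrences) times IDF.
            word: (count / total) * math.log(n_texts / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


block = [
    ["road", "traffic", "road", "vehicle"],
    ["road", "film", "actor"],
    ["film", "director"],
]
weights = tfidf_block(block)
print(weights[0])
```

Each returned dict corresponds to one row of the weight matrix, stored sparsely. Note that with this unsmoothed form a keyword that occurs in every text of the block receives weight 0.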

S5, selecting the identical candidate keywords in each weighted block candidate word set, summing the weights of identical candidate keywords and keeping the original weights of distinct candidate keywords, thereby completing the deduplication of each weighted block candidate word set and obtaining a block keyword set;

the keyword set of block 1: [ … … ' bus ', ' public pre ', ' public place ', ' public key ', ' public affairs ', ' public officer ', ' company ', ' public announcement ', ' park ', ' public security hall ', ' public security office ', ' public security department ', ' public security ', ' public key ', ' public match ', ' kilogram ', ' public money ', ' public key ', ' public benefit ', ' public key ', ' kilometer ', ' six seventy ', ' six nails ', ' shared ', ' common totality ', ' close ', ' key ', ' concern ', ' close ', ' excitation ', ' … … ', and ' foreign ' can ' be recognized;

the keyword set of block 2: [ … … 'princess', 'public toilet', 'company', 'public announcement', 'park', 'public security bureau', 'apartment', 'public', 'kilogram', 'public', 'highway', 'kilometer', 'shared', 'resonant', 'concerned', 'off', 'concerned', 'associated', 'key', 'common', 'excited', 'its parent', 'provided with', 'typical', 'model', 'aged', 'internal', and 'internal' … …;

Each block keyword set is sorted in descending order of weight, the keyword with the highest weight is extracted, and the first text containing that keyword is found in the block candidate word set and defined as the reference text. The reference text of each block is represented by the index value of that text: the reference text of block 1 has index value 1, and that of block 2 has index value 32.
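The deduplication and reference-text selection of S5 can be sketched as below, assuming the per-text weights are held as `{keyword: weight}` dicts. The function names and the toy weights are hypothetical.

```python
from collections import defaultdict


def block_keyword_set(weights):
    """Merge per-text weights by summing the weights of repeated keywords."""
    merged = defaultdict(float)
    for text_weights in weights:
        for word, weight in text_weights.items():
            merged[word] += weight
    return dict(merged)


def reference_text_index(weights):
    """Index of the first text containing the keyword with the largest summed weight."""
    merged = block_keyword_set(weights)
    top_word = max(merged, key=merged.get)
    return next(i for i, tw in enumerate(weights) if top_word in tw)


# Toy weights for three texts of one block.
weights = [
    {"road": 0.2, "film": 0.1},
    {"film": 0.3, "actor": 0.25},
    {"road": 0.15},
]
merged = block_keyword_set(weights)
ref_index = reference_text_index(weights)
print(merged, ref_index)
```

In this toy block "film" has the largest summed weight, and the first text that contains it is the text at index 0. If several keywords tie for the largest weight, `max` returns the one encountered first; the text does not specify a tie-breaking rule.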

S6, forming a weight vector w₀ from the weights of the candidate keywords in each reference text, and forming weight vectors w₁, w₂, …, w_t, …, w₄₉ from the weights of the candidate keywords of the other texts in each block; performing cosine similarity calculation to obtain a cosine vector Q for each block; and comparing each value in Q with the user-set threshold of 0.5. A text whose cosine value is greater than 0.5 is highly similar to the reference text and is added, together with the reference text, to the text set.
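As a sketch of S6, the following computes the cosine similarity between sparse weight vectors and keeps the texts above the threshold. It assumes the standard cosine definition and dict-based vectors; the function names are hypothetical, and `theta` defaults to the 0.5 used in this embodiment.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {keyword: weight} dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_text_set(weights, ref_index, theta=0.5):
    """Indices of the reference text and of every text whose cosine with it exceeds theta."""
    ref = weights[ref_index]
    selected = [ref_index]
    for index, vec in enumerate(weights):
        if index != ref_index and cosine(ref, vec) > theta:
            selected.append(index)
    return sorted(selected)


weights = [
    {"road": 1.0, "traffic": 1.0},
    {"road": 1.0, "traffic": 0.5, "car": 0.5},
    {"film": 1.0},
]
text_set = similar_text_set(weights, ref_index=0)
print(text_set)
```

Keywords missing from a text are treated as weight 0, so the dicts behave like rows of the weight matrix without storing the zeros.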

S7, extracting candidate keywords corresponding to the similar texts of each block, arranging the candidate keywords according to the word frequency from big to small, and taking the first 6 keywords as text topics corresponding to the texts.

Block 1 first hotspot topic: reconstructing roads by high-speed traffic vehicles in scenic spots;

block 2 first hotspot topic: the entrepreneurship automobile market reform acceleration project;
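The word-frequency ranking of S7 can be sketched with `collections.Counter`. The function name `hot_topic` is hypothetical, and `top_n` defaults to the 6 keywords used in this embodiment.

```python
from collections import Counter


def hot_topic(text_set, top_n=6):
    """Return the top_n most frequent candidate keywords across a text set."""
    counts = Counter()
    for text in text_set:
        counts.update(text)  # every occurrence adds 1 to the keyword's frequency
    return [word for word, _ in counts.most_common(top_n)]


text_set = [
    ["road", "traffic", "road"],
    ["road", "traffic", "vehicle"],
    ["scenic", "road"],
]
topic = hot_topic(text_set, top_n=2)
print(topic)
```

The hot topic is simply the ordered list of hot keywords, which matches the keyword-string form of the topics listed in this embodiment.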

S8, removing the candidate word sets of the similar texts from the candidate word set of each block, and repeating steps S4-S8 to obtain further hot topics; here the next 3 hot topics are found in each block.

Block 1 second hotspot topic: tan Weiwei nature at the musical release party of the concert Wang Fei;

block 1 third hotspot topic: the match fans can take care of the super-opened relay department players;

block 1 fourth hotspot topic: the cultural south China fills and researches cultural relic museum Sichuan;

block 2 second hotspot topic: the city develops double-flow to build airport Jianyang;

block 2 third hotspot topic: playing gold medals women at the Rencai diving Olympic Games;

block 2 fourth hotspot topic: turning over the motor vehicles at the intersection of the road intersection for public transportation;
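The iteration of S4 to S8 within one block can be sketched as a loop that weights the remaining texts, selects a reference text, collects the similar texts, names a topic, and removes those texts. This is a compact illustration under the same assumptions as before (unsmoothed TF-IDF with natural logarithm, standard cosine similarity); the function names and the toy block are hypothetical, while `h` and `theta` follow the symbols used in the text.

```python
import math
from collections import Counter


def tfidf(block):
    """Per-text {keyword: TF-IDF weight} dicts for one block."""
    n = len(block)
    df = Counter(w for text in block for w in set(text))
    return [
        {w: c / len(text) * math.log(n / df[w]) for w, c in Counter(text).items()}
        for text in block
    ]


def cosine(u, v):
    """Cosine similarity of two sparse {keyword: weight} vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def extract_topics(block, h=4, theta=0.5, top_n=6):
    """Extract up to h topics, removing the similar texts after each round."""
    remaining = list(block)
    topics = []
    while remaining and len(topics) < h:
        weights = tfidf(remaining)
        merged = Counter()
        for text_weights in weights:
            merged.update(text_weights)  # sums the weights of repeated keywords
        if not merged:
            break
        top_word = max(merged, key=merged.get)
        ref = next(i for i, tw in enumerate(weights) if top_word in tw)
        chosen = {ref} | {
            i for i, tw in enumerate(weights) if cosine(weights[ref], tw) > theta
        }
        counts = Counter(w for i in chosen for w in remaining[i])
        topics.append([w for w, _ in counts.most_common(top_n)])
        remaining = [t for i, t in enumerate(remaining) if i not in chosen]
    return topics


block = [
    ["road", "traffic", "scenic", "road"],
    ["road", "traffic", "road", "vehicle"],
    ["film", "actor", "concert"],
    ["film", "concert", "singer"],
    ["diving", "olympic", "gold"],
    ["airport", "city"],
]
topics = extract_topics(block, h=2, top_n=2)
print(topics)
```

Because the weights are recomputed on the remaining texts in every round, the keyword that ranks highest can change from round to round, which is what lets later rounds surface topics that the first round did not cover.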

S9, extracting the candidate word sets of the texts corresponding to the hot topics of each block to form a new candidate word set for that block, merging these new candidate word sets into one larger candidate word set, and repeating steps S4-S8 to extract the hot topics that represent most of the text content of the whole data set.

First hotspot topic: urban development is healthy and is constructed in both flows;

second hotspot topic: the market of the automobile reform service battery industry;

the third hotspot topic: modification and control of high-speed traffic vehicles in scenic spots;

fourth hotspot topic: playing gold medals women at the Rencai diving Olympic Games;

Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, the invention is not limited to the scope of these embodiments. Changes that are apparent to those skilled in the art remain within the spirit and scope of the invention as defined by the appended claims, and everything that makes use of the inventive concept is protected.

Claims

1. A method for extracting hot topics based on keywords is characterized by comprising the following steps:

(4) performing TF-IDF weighting on the ith candidate keyword in the pth block candidate word set to obtain M weighted block candidate word sets, where p = 1, 2, …, M;

(4.3) calculating the weight of the ith candidate keyword of the pth block in the jth text;

(4.4) continuously processing the residual k-1 candidate keywords of the jth text in the pth block according to the steps (4.1) - (4.3), then processing other texts in the pth block, and continuously processing the residual block candidate word sets according to the steps (4.1) - (4.3) after the processing of the pth block candidate word set is finished, so as to finally obtain M weighted block candidate word sets;

(5) acquiring a reference text

(5.2) carrying out descending order arrangement on the weights of the candidate keywords corresponding to the block keyword set, then taking the candidate keyword with the largest weight as a reference, finding a first text containing the candidate keyword with the largest weight in the pth block candidate word set, and marking the first text as a reference text;

(5.3) processing the remaining weighted block candidate word sets according to steps (5.1) - (5.2) to finally obtain M reference texts;

(6) finding a text set

(6.4) forming a cosine vector Q from the T cosine similarity values of the pth block keyword set, and comparing each element of Q with a preset threshold θ; if an element is greater than θ, the text corresponding to that element is judged to be highly similar to the reference text and is added, together with the reference text, to the text set of the pth block;

(6.5) processing the remaining block keyword sets according to steps (6.1) - (6.4) to finally obtain M text sets;

(7) determining the hot topic of the text set of the pth block, and obtaining the remaining M-1 hot topics in the same way;

(8) removing from the candidate word set of block p the similar texts that appear in its text set, and extracting h-1 further hot topics from the block candidate word set formed by the remaining texts according to steps (4) to (7);

(9) storing all texts corresponding to the h hot topics obtained from each of the M block candidate word sets in the hot text set of the corresponding block, merging the M hot text sets into a new corpus, extracting candidate keyword sets according to step (3), and repeating steps (4) - (7) to obtain further hot topics.

2. The method for extracting hot topics based on keywords as claimed in claim 1, wherein in step (1), the text data is further unified into csv and json text formats.

3. The method as claimed in claim 1, wherein the stop word list includes auxiliary words, prepositions, conjunctions, and words of length 1 without actual meaning; and the filtering rules filter out frequently occurring numeral-quantifier collocations, common but meaningless suffixes, and useless strings.