CN107122478B - Method for extracting hot topics based on keywords - Google Patents

Info

Publication number
CN107122478B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710304817.1A
Other languages
Chinese (zh)
Other versions
CN107122478A (en)
Inventor
陆川
孙健
杨伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Yunshu Future Information Science Co ltd
Original Assignee
Chengdu Yunshu Future Information Science Co ltd
Application filed by Chengdu Yunshu Future Information Science Co ltd
Priority to CN201710304817.1A
Publication of CN107122478A
Application granted
Publication of CN107122478B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles

Abstract

The invention discloses a method for extracting hot topics based on keywords. Massive data are unified into one format and segmented into words to form a corpus. The corpus is divided into blocks that are processed in parallel to obtain a candidate word set for each block. Each candidate word set is weighted by TFIDF and de-duplicated to obtain a reference text. Cosine similarity between the reference text and the other texts in the block is computed to extract the texts similar to the reference text. The candidate words of the similar texts are sorted in descending order of word frequency to find several hot topics, and topic hot spots are finally extracted from these hot topics; the topic hot spots can represent the main viewpoints of the massive data.

Description

Method for extracting hot topics based on keywords
Technical Field
The invention belongs to the technical field of network public opinion monitoring, and particularly relates to a method for extracting hot topics based on keywords.
Background
With the vigorous development of internet technology and the rapid spread of related applications, everyone is no longer just a consumer of information but also a producer of it. Using network terminals such as computers and mobile phones, netizens can obtain or publish information at any time and place on microblogs, social networks, news sites, blogs and other websites, and many commercial portals, such as Sina and NetEase, collect and provide rich news reports for users. For example, the rich and comprehensive content that users publish on such platforms not only creates hotly debated topics in the social network but also leads many traditional media outlets to follow up the related events through microblogs.
Because internet data grows explosively and is consumed quickly and in fragments, information overload and lack of completeness are becoming more obvious. People living at a fast pace feel overwhelmed by the new information that emerges constantly, yet they urgently want to learn quickly and promptly which hot topics society is discussing. Hot topics are characterized by timeliness, diversity and generality.
Mining effective information from the internet efficiently poses many difficulties for network monitoring. For example, users demand increasingly intelligent network monitoring, and hot spots derived from a small amount of text no longer meet the needs of modern network monitoring; during data mining, users no longer face simple, small amounts of text information, …… Hot topic detection is a data mining technology that automatically discovers and organizes the semantic associations of network information and helps users obtain an overview of network information quickly, and it has attracted strong attention in academia and industry in recent years.
Public opinion topic detection and tracking is a research hotspot in the field of information processing. In its early stage the technology took news media information streams as its research object: it monitored the topics described in news, found and tracked new information of interest to users, and finally organized the news related to a topic and presented it to users in some form. As computer technology developed rapidly and the internet became widespread, social media grew increasingly popular, and researchers turned to representative social media forms such as blogs, mail, communities and forums. Unlike news reports, whose language is standardized and whose content is reliable, social media text is highly casual, is full of worthless information, and shows low relevance between documents. Faced with the mass of internet information that keeps emerging, purely manual supervision is difficult. Traditional hot topic discovery targets a small number of short texts, and hot spot detection generally starts from known topics: a newly found hot text is added to an existing hot topic, which raises the heat of that topic, and the topic is then tracked. Because such detection and tracking is designed for a small number of documents, it can hardly meet the practical need to detect hot topics in a massive, continuous information stream; even where detection is possible, the time complexity is high and the delay is obvious. Moreover, users have limited energy and cannot obtain the useful knowledge about a topic by reading every document.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for extracting hot topics based on keywords, which extracts hot topics from keywords and enables daily discovery and tracking of hot topics.
In order to achieve the above object, the present invention provides a method for extracting a hot topic based on a keyword, comprising the steps of:
(1) crawling massive text data with a crawler, unifying the text data into txt format, and storing it in a database;
(2) extracting the text data from the database, and performing word segmentation on the text data with a Chinese word segmentation package to obtain a corpus consisting of words;
(3) dividing the corpus equally into M blocks, and filtering the word segmentation result in each block with a stop word list and filtering rules to obtain the block candidate word sets of the M blocks;
(4) performing TFIDF weighting on the ith candidate keyword in the pth block candidate word set (p = 1, 2, …, M) to obtain M weighted block candidate word sets;
(4.1) calculating the frequency $TF_{i,j}^{p}$ of the ith candidate keyword of the pth block in the jth text:
$$TF_{i,j}^{p} = \frac{n_{i,j}^{p}}{\sum_{l=1}^{k} n_{l,j}^{p}}$$
wherein $n_{i,j}^{p}$ represents the number of times that the ith candidate keyword in the pth block appears in the jth text, $\sum_{l=1}^{k} n_{l,j}^{p}$ represents the sum of the occurrence times of all candidate keywords in the pth block in the jth text, and k represents the total number of all candidate keywords in the jth text in the pth block;
(4.2) calculating the general importance measure $IDF_{i}^{p}$ of the ith candidate keyword in the pth block:
$$IDF_{i}^{p} = \log\frac{|D^{p}|}{|I^{p}|}$$
wherein $|D^{p}|$ represents the total number of texts in the pth block, and $|I^{p}|$ represents the number of texts containing the ith candidate keyword in the pth block;
(4.3) calculating the weight value of the ith candidate keyword of the pth block in the jth text:
$$TFIDF_{i,j}^{p} = TF_{i,j}^{p} \times IDF_{i}^{p}$$
(4.4) processing the remaining k-1 candidate keywords of the jth text in the pth block according to steps (4.1)-(4.3), then processing the other texts in the pth block; after the pth block candidate word set has been processed, processing the remaining block candidate word sets according to steps (4.1)-(4.3), finally obtaining M weighted block candidate word sets;
(5) acquiring reference texts
(5.1) selecting the identical candidate keywords in the weighted block candidate word set of the pth block, adding up the weights corresponding to identical candidate keywords, and keeping the original weights of the candidate keywords that occur only once, thereby completing the de-duplication of the weighted block candidate word set of the pth block and obtaining a block keyword set;
(5.2) sorting the candidate keywords in the block keyword set in descending order of weight, then, taking the candidate keyword with the largest weight as the reference, finding the first text in the pth block candidate word set that contains this candidate keyword and marking it as the reference text;
(5.3) processing the remaining weighted block candidate word sets according to steps (5.1)-(5.2), finally obtaining M reference texts;
(6) finding the text sets
(6.1) finding the reference text corresponding to the keyword set of block p, and forming a weight vector $w_0$ from the weights corresponding to the candidate keywords in the reference text;
(6.2) finding the other texts in the keyword set of block p, and forming weight vectors $w_1, w_2, \ldots, w_t, \ldots, w_T$ from the weights corresponding to the candidate keywords in each of these texts, where T represents the total number of texts contained in the pth block keyword set;
(6.3) calculating the cosine similarity between each other text in the keyword set of block p and the reference text with the cosine similarity formula:
$$\cos(w_0, w_t) = \frac{w_0 \cdot w_t}{\|w_0\| \, \|w_t\|}$$
(6.4) forming a cosine vector Q from the T cosine similarity values in the pth block keyword set, and comparing each element of Q with a preset threshold θ; if an element is greater than θ, the text corresponding to that element is judged to be highly similar to the reference text, and the text is added, together with the reference text, to the text set of block p;
(6.5) processing the remaining block keyword sets according to steps (6.1)-(6.4), finally obtaining M text sets;
(7) in the text set of block p, adding 1 to the word frequency of a candidate keyword each time it appears again in the texts, thereby counting the word frequency of all candidate keywords in the text set; sorting the word frequencies in descending order, taking the leading candidate keywords as hot keywords, and recording them as the hot topic of that text set; obtaining the hot topics of the remaining M-1 text sets in the same way;
(8) in the candidate word set of block p, removing the similar texts that appear in the text set of block p, and extracting h-1 further hot topics from the block candidate word set formed by the remaining texts according to steps (4) to (7); in the same way, extracting h-1 hot topics from each of the remaining M-1 block candidate word sets;
(9) storing all texts corresponding to the h hot topics obtained from each of the M block candidate word sets into the hot text set of the corresponding block, merging the M hot text sets into a new corpus, extracting the candidate keyword set according to step (3), and repeating steps (4) to (7) to obtain further hot topics.
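As a worked illustration of steps (4.1)-(4.3), using hypothetical numbers: suppose a block holds 4 texts, a candidate keyword occurs 3 times in a text whose candidate keywords occur 10 times in total, and the keyword appears in 2 of the 4 texts. The calculation below assumes the common unsmoothed TF-IDF form with a natural logarithm; the logarithm base is an assumption, since the text does not state it.

```latex
\begin{aligned}
TF &= \frac{3}{10} = 0.3 \\
IDF &= \ln\frac{4}{2} = \ln 2 \approx 0.693 \\
TF \times IDF &\approx 0.3 \times 0.693 \approx 0.208
\end{aligned}
```

A keyword that appears in every text of the block would have an IDF of ln(4/4) = 0 under this form, so its weight vanishes regardless of how often it occurs.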
The object of the invention is achieved as follows:
The invention relates to a method for extracting hot topics based on keywords. Massive data are unified into one format and segmented into words to form a corpus. The corpus is divided into blocks that are processed in parallel, and the same preprocessing is applied to each block to obtain its candidate word set. The candidate words of each text in each block are then weighted by TFIDF (term frequency-inverse document frequency), and the weighted candidate word set of each block is de-duplicated to obtain a reference text. Next, cosine similarity is computed between the reference text of each block and the other texts in that block to extract the texts similar to the reference text, and the candidate keywords of these similar texts are sorted in descending order of word frequency to find the hot topic of the similar texts. The candidate keyword sets of the similar texts are then removed from the block's full candidate keyword set, TFIDF weighting is applied again to the remaining texts to find a new reference text, and cosine similarity is computed again, so that several hot topics are obtained in each block. Finally, the candidate keyword sets corresponding to the hot topics in each block are merged into a new candidate keyword set, the new candidate keyword sets of all blocks are merged into one large candidate keyword set, and the above steps are repeated to find several hot topics of the large set. In this way topic hot spots are extracted from the hot topics, and they can represent the main viewpoints of the massive data.
Meanwhile, the method for extracting the hot topics based on the keywords further has the following beneficial effects:
(1) The text data are processed in parallel by blocks. Within each block a large share of the texts are not hot, so once the texts covered by the hot topics amount to about half of the texts in the block, the remaining useless texts can be discarded. This is far faster than extracting hot topics from the whole corpus at once, because a large number of useless texts slow the run down, and block processing also avoids the memory shortage that arises when the whole corpus is processed together. Efficiency improves and results arrive sooner.
(2) In TFIDF weighting, the weights obtained per block are more accurate than those obtained by processing the whole corpus, because computed results are limited to a certain number of digits. Block processing reduces the number of texts handled at once, which raises the calculation speed while preserving calculation precision. TFIDF weighting is the basis of the subsequent steps, so its accuracy must be guaranteed.
(3) The greatest advantage of the method is that topic hot spots are extracted from the hot topics themselves, which has not been done before; the effect obtained is satisfactory, and the required result can be obtained in full.
Drawings
FIG. 1 is a flowchart of a method for extracting hot topics based on keywords according to the present invention;
FIG. 2 is a graphical representation of the weights of each candidate keyword for each text in the block candidate word set.
Detailed Description
The following describes embodiments of the invention with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the invention.
Examples
FIG. 1 is a flowchart of a method for extracting hot topics based on keywords according to the present invention.
In this embodiment, as shown in fig. 1, the method for extracting a hot topic based on a keyword according to the present invention includes the following steps:
S1, crawling news websites with a crawler, for example Sina, Baidu, Tencent ……; crawling 100 news texts from the same day as data set A, unifying the text data into txt format, and storing it in a database;
a [' driving school services, charges are not publicly opaque; … … the driver qualified by the medical examination can be recovered for the person who has not submitted the medical examination and is logged off. ' policemen go to the hospital to take pictures of the old people for changing the certificates. "the old people … … also take the device to go home to exchange certificates for the old people, and for the special case, the old people can inform us by telephone, and people can arrange policemen to handle the certificate at home. "', … …,' main field 0: 1, the 'Fudi' Western-style safety is not provided for the national foot for more time, and … … national foot is weak to attack and is worthy of scaling. Next, opponents of 12-strength matches are only strong or weak compared with Syrian, and when the national foot of backwater one-battle attacks the rule and vigor, a more stable gate is selected, which is the urgent affairs', … … ];
In this embodiment, the text data may instead be unified into another text format such as csv or json.
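The storage half of step S1 can be sketched as follows. This is only an illustration under stated assumptions: the crawler is omitted, an in-memory SQLite database stands in for the unspecified database, and the table name `news` and the helpers `store_texts` and `load_texts` are hypothetical.

```python
import sqlite3

def store_texts(conn, texts):
    """Store crawled texts as plain-text rows; return the number of rows stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news (id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
    )
    # Collapse whitespace so every record is one plain-text string; skip empty ones.
    rows = [(" ".join(t.split()),) for t in texts if t and t.strip()]
    conn.executemany("INSERT INTO news (body) VALUES (?)", rows)
    conn.commit()
    return len(rows)

def load_texts(conn):
    """Read the stored texts back in insertion order."""
    return [body for (body,) in conn.execute("SELECT body FROM news ORDER BY id")]

conn = sqlite3.connect(":memory:")  # assumption: any SQL database would do
stored = store_texts(conn, ["first  news\nitem", "   ", "second news item"])
texts = load_texts(conn)
```

Normalizing each record to a single plain-text string before storage keeps the later segmentation step independent of how the source pages were formatted.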
S2, extracting the text data from the database, and performing word segmentation on it with a Chinese word segmentation package to obtain a corpus B consisting of the words in the texts.
B [ 'driving school', 'service', 'charge', 'public', 'transparent', 'test', 'subject', 'three four', 'back and forth', 'driving license', 'experience', 'heart', '10 month', '13 day', 'Chinese', 'city', 'province', 'government', 'learning', 'aim', 'change', 'Sichuan province', 'vehicle', 'driving', 'training', 'test', 'institution', 'reform', 'implementation', 'scheme', 'abbreviation', … … horse gas ',' drive correction ',' driving license ',' drive ',' high voltage ',' high fire ',' peak ',' high altitude ',' high new technology ',' high wave ',' high school ',' high altitude ',' high voltage ', and the like' can be generated in the environment, 'yellow Bowen', 'dim', 'encouragement', 'Dragon fountain zone', … … ];
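A minimal sketch of the segmentation in S2 is given below. The text does not name the segmentation package, so the tokenizer is left pluggable: a Chinese segmenter such as `jieba.lcut` could be passed in (an assumption), while the default here is a simple stand-in that splits on non-word characters. The helpers `segment` and `build_corpus` are hypothetical names.

```python
import re

def segment(text, tokenizer=None):
    """Split one text into words.

    tokenizer is any callable str -> list[str]; a Chinese segmenter such as
    jieba.lcut could be supplied (assumption: the source names no package).
    The default is a stand-in that only splits on non-word characters.
    """
    if tokenizer is None:
        tokenizer = lambda s: re.findall(r"\w+", s)
    return [w for w in tokenizer(text) if w.strip()]

def build_corpus(texts, tokenizer=None):
    """Return one word list per text, so text boundaries are preserved."""
    return [segment(t, tokenizer) for t in texts]

corpus = build_corpus(["driving school service fee", "city road reform"])
```

Keeping one word list per text, rather than one flat list, is a design choice of this sketch: the later per-text frequency counts need to know which words belong to which text.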
S3, dividing the corpus equally into 2 blocks, and filtering the word segmentation result in each block with a stop word list and filtering rules to obtain the block candidate word sets of the 2 blocks, which are respectively:
1 [ [ 'driving school', 'service', 'fee charging', 'public', 'transparent', 'test', 'subject', 'three four', 'back and forth', 'driving license', 'experience', 'heart', '10 month', '13 day', 'Chinese', 'city', 'province', 'government', 'learning', 'aim', 'change', 'Sichuan province', 'vehicle', 'driving', 'training', 'test', 'institution', 'reform', 'implementation', 'scheme', 'short', '… …' training ',' facility ',' disabled ',' driving ',' adjustment ',' old 'physical examination', 'body', 'inspection', 'age', '60', 'week', 'adjustment', '70', 'week', 'adjustment', 'week', 'adjustment', 'delivery of' physical examination ',' adjustment ',' age ',' adjustment ',' transmission ', and' for example ',' device ',' adjustment ',' device ',' body ',' inspection ',' adjustment, 'logout', 'driver', 'person', 'physical', 'eligibility', 'recovery', 'driving', 'eligibility', 'western', 'city', '… …, [' electric shock ',' movie-circle ',' writer ',' group ',' domestic ',' force ',' writer ',' Liu Jiu ',' Liu Yun ',' undoubtedly ',' movie ',' comedy ',' participation ',' movie ',' mobile ',' von Xiao ',' movie ',' actor ',' identity ','2016 ',' Liu Yun ',' movie ',' work ',' audience ',' V ',' participation ',' 36 ',' movie ',' Von little ',' director ',' little adaptation ',' little word using ',' Liu 'etc', 'Liu' woman ',' rain ',' bird ',' pig ',' 36 ', a', 'consumer', '… …', 'commercial', 'trade', 'pig' and 'can' are included in the device, 'practice', 'test', 'round', 'fill', 'confidence', 'west', 'city', 'zhangjie' ];
2 [ [ 'Sichuan', 'daily', 'education', 'department', 'guarantee', 'household', 'office', 'registration', 'age-appropriate', 'child', 'accept', 'compulsory education', 'office', 'study', 'office', 'registration', 'management', 'notice', 'whole-time', 'make', 'household', 'person', 'clear', 'registration', 'check up', 'management', 'notice', 'make', 'place', 'check up', 'death', 'household', 'person', 'counter', 'clear up', 'place', 'clear up', 'death', 'household', 'place', 'clear up', 'place', 'communication', 'real', 'communication', 'touch up', 'muzzle', 'person', 'prime', 'person', 'resident', 'first place', 'notice', … …, 'muzzle', 'original place', 'rural', 'migration', 'residential', 'change into', 'town', 'resident', 'muzzle', 'Sichuan', 'date', 'Association', 'Sichuan', 'school society', 'Association', 'city institute', 'season', 'movable', '7 month', '13 day', 'open', 'real study', 'city', 'gay', 'city' and 'combined', 'city' ',' city school society ',' association ',' city initiation ',' city side ',' activity ',' city side ', etc.' voting ', the' and 'are' provided in the 'city, the' and the 'are provided in the' way of the 'city' and the 'city' voting system, the 'is provided with the' guide ',' city 'and the' guide ', the' guide, 'voting', 'focus', 'chinese', 'public', 'dialog', 'page', 'select', 'vote' ];
To obtain the required result quickly, the corpus formed by all documents is divided into blocks, keeping the blocks as uniform in size as possible. Each block is then preprocessed: the word segmentation results are filtered with a stop word list and filtering rules. The stop word list contains function words such as auxiliary words, prepositions and conjunctions, as well as words of length 1 that carry no actual meaning. Corresponding rules are designed to filter useless strings with obvious patterns, such as frequently occurring numeral-quantifier collocations and some common but meaningless prefixes and suffixes.
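The blocking and filtering described above might look like the following sketch. The stop word list and the numeral-plus-unit pattern are hypothetical placeholders, since the actual list and rules are not given in the text.

```python
import re

STOP_WORDS = {"the", "of", "and", "in"}          # hypothetical stop word list
NUM_QUANT = re.compile(r"^\d+(\.\d+)?[a-z%]*$")  # hypothetical rule: numeral plus unit

def split_blocks(docs, m):
    """Split a list of tokenised texts into m blocks of near-equal size."""
    size, extra = divmod(len(docs), m)
    blocks, start = [], 0
    for p in range(m):
        # The first `extra` blocks take one additional text each.
        end = start + size + (1 if p < extra else 0)
        blocks.append(docs[start:end])
        start = end
    return blocks

def filter_words(words):
    """Drop stop words, single-character tokens and numeral-plus-unit strings."""
    return [w for w in words
            if w not in STOP_WORDS and len(w) > 1 and not NUM_QUANT.match(w)]

docs = [["the", "road", "of", "city", "5km"], ["a", "driving", "school"],
        ["13", "olympic", "games"], ["market", "and", "reform"], ["airport"]]
blocks = [[filter_words(d) for d in b] for b in split_blocks(docs, 2)]
```

With five texts and two blocks, the first block receives three texts and the second two, which is the closest this split can come to equal sizes.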
S4, performing TFIDF weighting on each candidate keyword of each text in each candidate word set: the TF of each candidate keyword is calculated first, then its IDF, and finally its TFIDF weight is obtained. The weight matrices of the two blocks are shown in fig. 2, where each row represents a text and each column holds the weights of a candidate keyword;
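A minimal sketch of the per-block weighting in S4 follows. It assumes the common unsmoothed form, TF as a relative frequency within the text and IDF as the natural logarithm of block size over document frequency; the logarithm base and the absence of smoothing are assumptions, and `tfidf_block` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_block(block):
    """block: list of texts, each a list of candidate keywords.
    Returns one {keyword: weight} dict per text."""
    n_texts = len(block)
    # Number of texts in the block that contain each keyword.
    doc_freq = Counter(w for text in block for w in set(text))
    weights = []
    for text in block:
        counts = Counter(text)
        total = sum(counts.values())
        weights.append({
            w: (c / total) * math.log(n_texts / doc_freq[w])
            for w, c in counts.items()
        })
    return weights

block = [["road", "road", "city"], ["city", "market"]]
w = tfidf_block(block)
```

In this toy block, "city" occurs in both texts, so its IDF and therefore its weight is zero, while "road" and "market" each receive a positive weight.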
S5, selecting the identical candidate keywords in each weighted block candidate word set, adding up the weights corresponding to identical candidate keywords, and keeping the original weights of candidate keywords that occur only once, thereby completing the de-duplication of each weighted block candidate word set and obtaining a block keyword set;
the keyword set of block 1: [ … … ' bus ', ' public pre ', ' public place ', ' public key ', ' public affairs ', ' public officer ', ' company ', ' public announcement ', ' park ', ' public security hall ', ' public security office ', ' public security department ', ' public security ', ' public key ', ' public match ', ' kilogram ', ' public money ', ' public key ', ' public benefit ', ' public key ', ' kilometer ', ' six seventy ', ' six nails ', ' shared ', ' common totality ', ' close ', ' key ', ' concern ', ' close ', ' excitation ', ' … … ', and ' foreign ' can ' be recognized;
the keyword set of block 2: [ … … 'princess', 'public toilet', 'company', 'public announcement', 'park', 'public security bureau', 'apartment', 'public', 'kilogram', 'public', 'highway', 'kilometer', 'shared', 'resonant', 'concerned', 'off', 'concerned', 'associated', 'key', 'common', 'excited', 'its parent', 'provided with', 'typical', 'model', 'aged', 'internal', and 'internal' … …;
Each block keyword set is sorted in descending order of weight and the keyword with the highest weight is extracted; a text containing that keyword is found in the block's candidate word set and defined as the reference text. The reference text of each block is identified by the index value of that text: the reference text index of block 1 is 1, and that of block 2 is 32.
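The merging of weights and the choice of a reference text can be sketched as below. The input is assumed to be one `{keyword: weight}` dictionary per text; the function names are hypothetical, and text indices here are 0-based, which may differ from the index convention used in the embodiment.

```python
from collections import defaultdict

def block_keyword_set(weights):
    """Merge per-text {keyword: weight} dicts: weights of a keyword that occurs
    in several texts are summed, keywords seen once keep their weight."""
    merged = defaultdict(float)
    for text_weights in weights:
        for word, w in text_weights.items():
            merged[word] += w
    return dict(merged)

def reference_text(weights):
    """Return (top keyword, index of the first text that contains it)."""
    merged = block_keyword_set(weights)
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    top_word = ranked[0][0]
    index = next(i for i, tw in enumerate(weights) if top_word in tw)
    return top_word, index

weights = [{"road": 0.2, "city": 0.1}, {"market": 0.3, "road": 0.25}, {"market": 0.1}]
top_word, ref_index = reference_text(weights)
```

Here "road" wins with a summed weight of 0.45 against 0.4 for "market", even though "market" has the largest single weight, which shows why the summation step changes the outcome.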
S6, the weights corresponding to the candidate keywords in each reference text form a weight vector $w_0$, and the weights corresponding to the candidate keywords of the other texts in each block form weight vectors $w_1, w_2, \ldots, w_t, \ldots, w_{49}$. Cosine similarity is calculated to obtain a cosine vector Q for each block, and each value in Q is compared with the threshold of 0.5 set by the user. A text whose cosine value exceeds 0.5 is considered highly similar to the reference text and is grouped with it.
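The similarity test in S6 can be illustrated with the sketch below, which stores each weight vector sparsely as a `{keyword: weight}` dictionary (a representation chosen for this example, not stated in the text). The names `cosine` and `similar_texts` are hypothetical; the default threshold of 0.5 is the value used in this embodiment.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {keyword: weight}."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a text with no weighted keywords matches nothing
    return dot / (norm_u * norm_v)

def similar_texts(weights, ref_index, theta=0.5):
    """Indices of the reference text plus every text whose cosine exceeds theta."""
    ref = weights[ref_index]
    hits = [i for i, w in enumerate(weights)
            if i != ref_index and cosine(ref, w) > theta]
    return sorted([ref_index] + hits)

weights = [{"road": 1.0, "city": 1.0}, {"road": 1.0}, {"market": 1.0}]
group = similar_texts(weights, ref_index=0)
```

The second text shares "road" with the reference and scores about 0.707, so it joins the group; the third shares nothing and scores 0.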
S7, extracting the candidate keywords of the similar texts of each block, sorting them in descending order of word frequency, and taking the first 6 keywords as the topic of those texts.
Block 1 first hotspot topic: reconstructing roads by high-speed traffic vehicles in scenic spots;
block 2 first hotspot topic: the entrepreneurship automobile market reform acceleration project;
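Step S7 amounts to a frequency count over the similar texts. A minimal sketch, with `hot_topic` as a hypothetical name and a smaller `top_n` than the 6 used in this embodiment so that the toy data stays short:

```python
from collections import Counter

def hot_topic(texts, top_n=6):
    """Return the top_n most frequent candidate keywords across texts."""
    freq = Counter(word for text in texts for word in text)
    return [word for word, _ in freq.most_common(top_n)]

similar = [["road", "traffic", "road"], ["traffic", "road", "vehicle"]]
topic = hot_topic(similar, top_n=2)
```

The resulting topic is an ordered keyword list rather than a sentence, which matches the keyword-style topics listed for each block.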
S8, removing the candidate word sets of the similar texts from the candidate word set of each block, and repeating steps S4-S8 to obtain further hot topics; here the next 3 hot topics are found in each block.
Block 1 second hotspot topic: tan Weiwei nature at the musical release party of the concert Wang Fei;
block 1 third hotspot topic: the match fans can take care of the super-opened relay department players;
block 1 fourth hotspot topic: the cultural south China fills and researches cultural relic museum Sichuan;
block 2 second hotspot topic: the city develops double-flow to build airport Jianyang;
block 2 third hotspot topic: playing gold medals women at the Rencai diving Olympic Games;
block 2 fourth hotspot topic: turning over the motor vehicles at the intersection of the road intersection for public transportation;
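The remove-and-repeat loop of S8 can be sketched independently of how a single topic is found. In the sketch below, `extract_one` is a hypothetical callable standing in for steps S4-S7, and `by_first_word` is a toy replacement used only to make the example runnable.

```python
def extract_topics(block, h, extract_one):
    """Repeatedly extract one hot topic, remove its texts, and continue.

    block       : list of texts (each a list of candidate keywords)
    h           : maximum number of hot topics to extract from the block
    extract_one : hypothetical callable(texts) -> (topic, indices of similar texts)
    Returns a list of (topic, texts belonging to that topic).
    """
    remaining = list(block)
    results = []
    while remaining and len(results) < h:
        topic, indices = extract_one(remaining)
        chosen = set(indices)
        if not chosen:
            break  # nothing matched; stop instead of looping forever
        results.append((topic, [remaining[i] for i in sorted(chosen)]))
        remaining = [t for i, t in enumerate(remaining) if i not in chosen]
    return results

def by_first_word(texts):
    """Toy stand-in: group every text that contains the first text's first word."""
    key = texts[0][0]
    return key, [i for i, t in enumerate(texts) if key in t]

block = [["road", "car"], ["film", "actor"], ["road", "toll"], ["film"]]
topics = extract_topics(block, h=4, extract_one=by_first_word)
```

Because each round removes the texts it has just grouped, later rounds see only the remaining texts, so a second topic can surface that was overshadowed in the first round.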
S9, extracting the candidate word sets of the texts corresponding to the hot topics in each block to form a new candidate word set per block, merging these new candidate word sets into one larger candidate word set, and repeating steps S4-S8 to extract the hot topics that represent most of the text content of the whole data set.
First hotspot topic: urban development is healthy and is constructed in both flows;
second hotspot topic: the market of the automobile reform service battery industry;
the third hotspot topic: modification and control of high-speed traffic vehicles in scenic spots;
fourth hotspot topic: playing gold medals women at the Rencai diving Olympic Games;
Although illustrative embodiments of the invention have been described above so that those skilled in the art can understand it, the invention is not limited to the scope of these embodiments. Variations that are apparent to those skilled in the art remain within the spirit and scope of the invention as defined by the appended claims, and every creation that makes use of the inventive concept is protected.

Claims (3)

1. A method for extracting hot topics based on keywords is characterized by comprising the following steps:
(1) crawling massive text data with a crawler, unifying the text data into txt format, and storing it in a database;
(2) extracting the text data from the database, and performing word segmentation on the text data with a Chinese word segmentation package to obtain a corpus consisting of words;
(3) dividing the corpus equally into M blocks, and filtering the word segmentation result in each block with a stop word list and filtering rules to obtain the block candidate word sets of the M blocks;
(4) performing TFIDF weighting on the ith candidate keyword in the pth block candidate word set to obtain M weighted block candidate word sets, wherein p = 1, 2, …, M;
(4.1) calculating the frequency $TF_{i,j}^{p}$ of the ith candidate keyword of the pth block in the jth text:
$$TF_{i,j}^{p} = \frac{n_{i,j}^{p}}{\sum_{l=1}^{k} n_{l,j}^{p}}$$
wherein $n_{i,j}^{p}$ represents the number of times that the ith candidate keyword in the pth block appears in the jth text, $\sum_{l=1}^{k} n_{l,j}^{p}$ represents the sum of the occurrence times of all candidate keywords in the pth block in the jth text, and k represents the total number of all candidate keywords in the jth text in the pth block;
(4.2) calculating the general importance measure $IDF_{i}^{p}$ of the ith candidate keyword in the pth block:
$$IDF_{i}^{p} = \log\frac{|D^{p}|}{|I^{p}|}$$
wherein $|D^{p}|$ represents the total number of texts in the pth block, and $|I^{p}|$ represents the number of texts containing the ith candidate keyword in the pth block;
(4.3) calculating the weight value of the ith candidate keyword of the pth block in the jth text:
$$TFIDF_{i,j}^{p} = TF_{i,j}^{p} \times IDF_{i}^{p}$$
(4.4) processing the remaining k-1 candidate keywords of the jth text in the pth block according to steps (4.1) - (4.3), then processing the other texts in the pth block, and, after the pth block candidate word set has been processed, processing the remaining block candidate word sets according to steps (4.1) - (4.3), so as to finally obtain M weighted block candidate word sets;
(5) acquiring a reference text:
(5.1) selecting the identical candidate keywords in the weighted block candidate word set of the pth block, adding together the weights corresponding to the identical candidate keywords, and keeping the original weights of the candidate keywords that occur only once, thereby completing the de-duplication of the weighted block candidate word set of the pth block and obtaining a block keyword set;
(5.2) arranging the weights of the candidate keywords in the block keyword set in descending order, then taking the candidate keyword with the largest weight as a reference, finding the first text in the pth block candidate word set that contains that candidate keyword, and marking this text as the reference text;
(5.3) processing the remaining weighted block candidate word sets according to steps (5.1) - (5.2) to finally obtain M reference texts;
(6) finding a text set:
(6.1) finding the reference text corresponding to the pth block keyword set, and forming a weight vector $w_0$ from the weights corresponding to the candidate keywords in the reference text;
(6.2) finding the other texts in the pth block keyword set, and forming weight vectors $w_1, w_2, \ldots, w_t, \ldots, w_T$ from the weights corresponding to the candidate keywords in each of these texts, wherein T represents the total number of texts contained in the pth block keyword set;
(6.3) calculating the similarity cosine value between each of the other texts in the pth block keyword set and the reference text using the cosine similarity formula:
$$\cos(w_0, w_t) = \frac{w_0 \cdot w_t}{\|w_0\| \, \|w_t\|}$$
(6.4) forming a cosine vector Q from the T similarity cosine values in the pth block keyword set, and comparing each element of Q with a preset threshold θ; if an element is greater than θ, the text corresponding to that element is judged to be highly similar to the reference text, and that text is added to the text set together with the reference text;
(6.5) processing the remaining block keyword sets according to steps (6.1) - (6.4) to finally obtain M text sets;
(7) in the text set of the pth block, adding 1 to the word frequency of a candidate keyword each time it appears again, thereby counting the word frequencies of all candidate keywords in the text set, then arranging the word frequencies in descending order, taking the top-ranked candidate keywords as hot keywords, and marking the hot keywords as the hot topic of that text set; the remaining M-1 hot topics are obtained in the same way;
(8) in the pth block candidate word set, removing the similar texts that appear in the text set, and extracting h-1 further hot topics from the block candidate word set formed by the remaining texts according to steps (4) to (7); in the same way, h-1 hot topics are respectively extracted from each of the remaining M-1 block candidate word sets;
(9) storing all texts corresponding to the h hot topics obtained in each of the M block candidate word sets into the hot text set of the corresponding block, combining the M hot text sets into a single corpus, extracting candidate keyword sets from it according to step (3), and repeating steps (4) - (7) to obtain more hot topics.
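For illustration only, steps (4) to (7) for a single block might be sketched in Python as follows. This is not the patented implementation: the function names, the natural-logarithm unsmoothed IDF, the default threshold `theta`, and the `top_n` cut-off are all assumptions introduced here, and each text is assumed to be a list of already segmented and filtered candidate keywords.

```python
import math
from collections import Counter


def tfidf_block(texts):
    """Steps (4.1)-(4.3): per-text TF-IDF weights inside one block.

    `texts` is a list of token lists. Assumes IDF = ln(|D| / |I|).
    """
    n_texts = len(texts)
    # Number of texts in the block that contain each keyword.
    doc_freq = Counter(word for tokens in texts for word in set(tokens))
    weighted = []
    for tokens in texts:
        counts = Counter(tokens)
        total = sum(counts.values())
        weighted.append({
            word: (n / total) * math.log(n_texts / doc_freq[word])
            for word, n in counts.items()
        })
    return weighted


def reference_index(weighted):
    """Step (5): sum the weights of identical keywords across the block,
    then return the index of the first text containing the heaviest one."""
    totals = Counter()
    for weights in weighted:
        totals.update(weights)  # Counter.update adds values per key
    top_word = max(totals, key=totals.get)
    return next(i for i, w in enumerate(weighted) if top_word in w)


def cosine(a, b):
    """Step (6.3): cosine similarity of two sparse weight vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hot_topic(texts, theta=0.2, top_n=2):
    """Steps (4)-(7) for one block: returns (hot keywords, member indices)."""
    weighted = tfidf_block(texts)
    ref = reference_index(weighted)
    # Step (6.4): keep texts whose similarity to the reference exceeds theta.
    members = [ref] + [
        t for t in range(len(texts))
        if t != ref and cosine(weighted[ref], weighted[t]) > theta
    ]
    # Step (7): count keyword frequencies over the text set, take the top ones.
    freq = Counter(word for t in members for word in texts[t])
    return [word for word, _ in freq.most_common(top_n)], members
```

Calling `hot_topic` on a block returns both the hot keywords and the indices of the texts in the text set, so the texts in that set can be removed before extracting further topics as in step (8).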
2. The method for extracting hot topics based on keywords as claimed in claim 1, wherein in step (1), the text data is further unified into csv and json text formats.
3. The method for extracting hot topics based on keywords as claimed in claim 1, wherein the stop word list includes auxiliary words, prepositions, conjunctions, and words of length 1 that carry no actual meaning; and the filtering rule filters out frequently occurring numeral-quantifier collocations, common but meaningless suffixes, and useless strings.
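As a rough sketch of step (3) and of the filtering described in claim 3, the block division and candidate-word filtering could look like the Python below. The stop word list, the numeral-quantifier pattern, and the function names are illustrative assumptions, not taken from the patent; dropping every single-character token is a simplification of "words of length 1 that carry no actual meaning".

```python
import re

# Illustrative stop words only; a real list would be far larger.
STOP_WORDS = {"的", "了", "在", "和", "与", "因为", "所以"}

# Hypothetical rule: a run of numerals optionally followed by one quantifier.
NUMERAL_QUANTIFIER = re.compile(
    r"^[0-9一二三四五六七八九十百千万两]+[个只条次年月日名件]?$"
)


def filter_tokens(tokens):
    """Drop stop words, single-character tokens, and numeral-quantifier
    collocations from one segmented text."""
    return [
        token for token in tokens
        if len(token) > 1
        and token not in STOP_WORDS
        and not NUMERAL_QUANTIFIER.match(token)
    ]


def split_blocks(texts, m):
    """Divide `texts` into exactly m contiguous blocks whose sizes
    differ by at most one."""
    quotient, remainder = divmod(len(texts), m)
    blocks, start = [], 0
    for i in range(m):
        end = start + quotient + (1 if i < remainder else 0)
        blocks.append(texts[start:end])
        start = end
    return blocks


def block_candidate_sets(texts, m):
    """Step (3): filtered candidate words for each of the m blocks."""
    return [[filter_tokens(t) for t in block] for block in split_blocks(texts, m)]
```

Contiguous blocks are used here so that texts stored close together stay in the same block; whether the patent divides the corpus by position, by time, or otherwise is not stated in the claims.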
CN201710304817.1A 2017-05-03 2017-05-03 Method for extracting hot topics based on keywords Active CN107122478B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710304817.1A CN107122478B (en) 2017-05-03 2017-05-03 Method for extracting hot topics based on keywords

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710304817.1A CN107122478B (en) 2017-05-03 2017-05-03 Method for extracting hot topics based on keywords

Publications (2)

Publication Number Publication Date
CN107122478A CN107122478A (en) 2017-09-01
CN107122478B true CN107122478B (en) 2020-05-08

Family

ID=59728105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710304817.1A Active CN107122478B (en) 2017-05-03 2017-05-03 Method for extracting hot topics based on keywords

Country Status (1)

Country Link
CN (1) CN107122478B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992477B (en) * 2017-11-30 2019-03-29 北京神州泰岳软件股份有限公司 Text subject determines method and device
CN108897861A (en) * 2018-07-01 2018-11-27 东莞市华睿电子科技有限公司 A kind of information search method
JP7052617B2 (en) * 2018-07-26 2022-04-12 トヨタ自動車株式会社 Information processing equipment, information processing system, and information processing method
CN112269852A (en) * 2020-10-23 2021-01-26 深圳中泓在线股份有限公司 Method, system and storage medium for generating public opinion topic
CN117271710B (en) * 2023-11-17 2024-01-30 山东接力教育集团有限公司 Teaching assistance hot spot data intelligent analysis system based on big data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5361090B2 (en) * 2011-08-26 2013-12-04 日本電信電話株式会社 Topic word acquisition apparatus, method, and program
CN103577593A (en) * 2013-11-14 2014-02-12 中国科学院声学研究所 Method and system for video aggregation based on microblog hot topics
CN103617169A (en) * 2013-10-23 2014-03-05 杭州电子科技大学 Microblog hot topic extracting method based on Hadoop
CN103678670A (en) * 2013-12-25 2014-03-26 福州大学 Micro-blog hot word and hot topic mining system and method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5361090B2 (en) * 2011-08-26 2013-12-04 日本電信電話株式会社 Topic word acquisition apparatus, method, and program
CN103617169A (en) * 2013-10-23 2014-03-05 杭州电子科技大学 Microblog hot topic extracting method based on Hadoop
CN103577593A (en) * 2013-11-14 2014-02-12 中国科学院声学研究所 Method and system for video aggregation based on microblog hot topics
CN103678670A (en) * 2013-12-25 2014-03-26 福州大学 Micro-blog hot word and hot topic mining system and method

Also Published As

Publication number Publication date
CN107122478A (en) 2017-09-01

Similar Documents

Publication Publication Date Title
CN107122478B (en) Method for extracting hot topics based on keywords
Baroni et al. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors
Haidar et al. Multilingual cyberbullying detection system: Detecting cyberbullying in Arabic content
CN103793503B (en) Opinion mining and classification method based on web texts
Hashemi et al. Guided transformer: Leveraging multiple external sources for representation learning in conversational search
CN107315778A (en) A kind of natural language the analysis of public opinion method based on big data sentiment analysis
KR101536520B1 (en) Method and server for extracting topic and evaluating compatibility of the extracted topic
CN103500175B (en) A kind of method based on sentiment analysis on-line checking microblog hot event
CN107220352A (en) The method and apparatus that comment collection of illustrative plates is built based on artificial intelligence
JP2017511922A (en) Method, system, and storage medium for realizing smart question answer
CA2720842A1 (en) System and method for value significance evaluation of ontological subjects of network and the applications thereof
Riadi Detection of cyberbullying on social media using data mining techniques
Kwak et al. What we read, what we search: Media attention and public attention among 193 countries
CN112417127B (en) Dialogue model training and dialogue generation methods, devices, equipment and media
Pota et al. A subword-based deep learning approach for sentiment analysis of political tweets
US20140006317A1 (en) Automatic content composition generation
McCreadie et al. Relevance in microblogs: Enhancing tweet retrieval using hyperlinked documents
Elgesem et al. Bloggers’ responses to the Snowden affair: Combining automated and manual methods in the analysis of news blogging
Campbell et al. Content+ context networks for user classification in twitter
Krokos et al. A look into twitter hashtag discovery and generation
Sukel et al. Multimodal classification of urban micro-events
CN112836109A (en) Heritage tourist site recommendation method and system
Wasim et al. Extracting and modeling user interests based on social media
Karsdorp et al. The love equation: Computational modeling of romantic relationships in french classical drama
Jin Yue Opera Start a New Journey under the China’s Cultural Policy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant