CN110377823A - Construction of a hotspot mining system under the Hadoop framework - Google Patents

Construction of a hotspot mining system under the Hadoop framework

Info

Publication number
CN110377823A
Authority
CN
China
Prior art keywords
hot
module
keyword
information
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910570822.6A
Other languages
Chinese (zh)
Inventor
肖清林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central Mdt Infotech Ltd Of United States Of Xiamen
Original Assignee
Central Mdt Infotech Ltd Of United States Of Xiamen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central Mdt Infotech Ltd Of United States Of Xiamen filed Critical Central Mdt Infotech Ltd Of United States Of Xiamen
Priority to CN201910570822.6A priority Critical patent/CN110377823A/en
Publication of CN110377823A publication Critical patent/CN110377823A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The construction of a hotspot mining system under the Hadoop framework, comprising the following steps: collecting data A from the network with a cloud Hadoop cluster module, preprocessing it to obtain preprocessed data B, and sending preprocessed data B to the mining system; segmenting preprocessed data B into words to obtain keyword set C; screening each keyword D in keyword set C against a dictionary of previous hot information; sorting the keywords D from high to low, selecting hot words E, and constructing hot word set F; constructing a word co-occurrence network from the correlations between the hot words E in hot word set F; and partitioning hot word set F with a clustering algorithm according to the word co-occurrence network to obtain a set of hot topics. The invention obtains the hot topic set quickly and improves the accuracy of the hot topics obtained from the network.

Description

Construction of a hotspot mining system under the Hadoop framework
Technical field
The present invention relates to the technical field of language data processing, and in particular to the construction of a hotspot mining system under the Hadoop framework.
Background
With the rapid development of the internet, more and more users take part in it. Users can publish information or obtain the information they need online, and the volume of information on the internet grows daily. Because that volume is so large, users have difficulty finding and following hot information. More and more users now pay attention to hot topics, yet they often cannot effectively obtain the hot information they need from the network. This application therefore proposes the construction of a hotspot mining system under the Hadoop framework, which helps users quickly obtain a set of hot topics from the network and improves the accuracy of the hot topics obtained.
Summary of the invention
(1) Object of the invention
To solve the technical problems described in the background, the present invention proposes the construction of a hotspot mining system under the Hadoop framework. The invention obtains the hot topic set quickly and improves the accuracy of the hot topics obtained from the network.
(2) Technical solution
To solve the above problems, the present invention provides the construction of a hotspot mining system under the Hadoop framework, comprising the following steps:
S1. A cloud Hadoop cluster module collects data A from the network and preprocesses data A to obtain preprocessed data B;
S2. The cloud Hadoop cluster module sends preprocessed data B to the mining system;
S3. Preprocessed data B is segmented into words to obtain keyword set C;
S4. Each keyword D in keyword set C is screened against the previous hot-information dictionary;
If keyword D is a hot word that appears in the previous hot-information dictionary, keyword D is discarded;
If keyword D is not a hot word that appears in the previous hot-information dictionary, step S5 is performed;
S5. The keywords D are sorted from high to low by a combined ranking of their frequency of occurrence and repost count within the current time and a given historical time window; hot words E are selected and hot word set F is constructed;
S6. A word co-occurrence network is constructed from the correlations between the hot words E in hot word set F;
S7. Hot word set F is partitioned with a clustering algorithm according to the word co-occurrence network to obtain the hot topic set.
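The text does not name a word segmentation algorithm for step S3. As one possible illustration, the sketch below uses forward maximum matching, a classic dictionary-based technique for Chinese text. The function names, the `vocabulary` set, and the `max_len` parameter are assumptions made for this example, not part of the described system.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Segment text by greedily taking the longest vocabulary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a vocabulary word
        # is found; a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def extract_keywords(text, vocabulary, stop_words=frozenset()):
    """Keep multi-character tokens that are not stop words (a stand-in for keyword set C)."""
    return [t for t in forward_max_match(text, vocabulary)
            if len(t) > 1 and t not in stop_words]
```

For instance, with the hypothetical vocabulary `{"热点", "话题"}`, `extract_keywords("热点的话题", ...)` keeps the two dictionary words and drops the single-character particle between them. A production system would more likely rely on a trained segmenter; the greedy approach here is only meant to make the step concrete.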
Preferably, preprocessing data A includes deduplication and garbled-text filtering.
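A minimal sketch of these two preprocessing operations follows. The text does not say how duplicates or garbled text are detected, so the choices here are assumptions: duplicates are exact matches after whitespace normalization, and a record counts as garbled when the share of Unicode replacement characters and control characters exceeds a hypothetical `max_ratio` threshold.

```python
import hashlib
import re

# Characters that typically signal a decoding failure: the Unicode replacement
# character and C0 control characters other than tab, newline, carriage return.
_GARBLED = re.compile(r"[\ufffd\x00-\x08\x0b\x0c\x0e-\x1f]")


def is_garbled(text, max_ratio=0.1):
    """Treat empty records and records dominated by bad characters as garbled."""
    if not text.strip():
        return True
    bad = len(_GARBLED.findall(text))
    return bad / len(text) > max_ratio


def preprocess(records, max_ratio=0.1):
    """Turn raw records (data A) into deduplicated, filtered records (data B)."""
    seen = set()
    cleaned = []
    for text in records:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        if is_garbled(normalized, max_ratio):
            continue
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen:  # exact duplicate after normalization
            continue
        seen.add(digest)
        cleaned.append(normalized)
    return cleaned
```

Storing digests rather than full strings keeps the `seen` set small, which matters if the same logic is moved into a distributed job. Near-duplicate detection would need a different technique and is outside this sketch.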
Preferably, the cloud Hadoop cluster module includes a data acquisition module and a data processing module. The data acquisition module is communicatively connected to the data processing module; it collects data A from the network and sends data A to the data processing module;
The data processing module is communicatively connected to the mining system; it preprocesses data A to obtain preprocessed data B.
Preferably, the data acquisition module obtains data A from the network by means of a web crawler.
Preferably, the mining system includes a word segmentation module, a screening judgment module, a hot information sorting module, a word co-occurrence network construction module, a cluster calculation module, and a storage module;
The storage module stores previous hot information; the previous hot information stored in the storage module forms the previous hot-information dictionary;
The word segmentation module is communicatively connected to the data processing module; it segments preprocessed data B into words to obtain keyword set C;
The screening judgment module is communicatively connected to the word segmentation module and to the storage module; it screens each keyword D in keyword set C against the previous hot-information dictionary;
The hot information sorting module is communicatively connected to the screening judgment module; it sorts the screened keywords D from high to low by a combined ranking of their frequency of occurrence and repost count within the current time and a given historical time window, selects hot words E, and constructs hot word set F;
The word co-occurrence network construction module is communicatively connected to the hot information sorting module; it calculates the correlations between the hot words E in hot word set F and constructs the word co-occurrence network;
The cluster calculation module is communicatively connected to the word co-occurrence network construction module; it partitions hot word set F with a word clustering algorithm according to the word co-occurrence network to obtain the hot topic set.
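The word co-occurrence network construction module can be sketched as follows. The text does not define how the correlation between hot words is measured; this example assumes the simplest option, the number of documents in which two hot words appear together. The function name and the `min_weight` parameter are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(documents, hot_words, min_weight=1):
    """Count how often each pair of hot words appears in the same document.

    documents: iterable of token lists.
    Returns {(w1, w2): count} with w1 < w2, keeping pairs with count >= min_weight.
    """
    hot = set(hot_words)
    edges = Counter()
    for tokens in documents:
        # Sorting gives every pair one canonical orientation (w1 < w2).
        present = sorted(hot.intersection(tokens))
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return {pair: n for pair, n in edges.items() if n >= min_weight}
```

Raising `min_weight` prunes weak edges before clustering. Other correlation measures, such as pointwise mutual information, could be substituted without changing the shape of the returned edge dictionary.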
Preferably, when the screening judgment module screens each keyword D:
If keyword D is a hot word that appears in the previous hot-information dictionary, keyword D is discarded;
If keyword D is not a hot word that appears in the previous hot-information dictionary, keyword D is sent to the hot information sorting module.
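The screening rule amounts to a set-membership filter. The sketch below assumes the previous hot-information dictionary can be loaded into memory as a collection of strings; `screen_keywords` is a name chosen for this illustration.

```python
def screen_keywords(keywords, previous_hot_words):
    """Discard keywords found in the previous hot-information dictionary.

    Order and repeated occurrences of the surviving keywords are preserved,
    so later frequency counting is unaffected.
    """
    previous = set(previous_hot_words)  # O(1) membership tests
    return [k for k in keywords if k not in previous]
```

For example, `screen_keywords(["flood", "election", "flood"], {"election"})` passes both occurrences of `"flood"` on to the sorting stage and drops `"election"`.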
The above technical solution of the present invention has the following beneficial technical effects:
In the present invention, the Hadoop cloud computing platform is combined with the mining system, and the highly scalable storage of Hadoop provides the foundation for implementing the data mining system. The cloud Hadoop cluster module obtains data A from the network and processes it; the mining system then segments the processed data into words and screens the resulting keywords against the previous hot-information dictionary to obtain hot words E and construct hot word set F. After the correlations between the hot words E are calculated, a hot word co-occurrence network is constructed, and a hot word clustering algorithm partitions hot word set F so that the hot topic set is obtained quickly, which greatly improves the efficiency and accuracy of obtaining hot topics from the network.
Brief description of the drawings
Fig. 1 is a flow chart of the construction of a hotspot mining system under the Hadoop framework proposed by the present invention.
Fig. 2 is a system block diagram of the construction of a hotspot mining system under the Hadoop framework proposed by the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings. It should be understood that these descriptions are merely exemplary and are not intended to limit the scope of the invention. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the invention.
Fig. 1 is a flow chart of the construction of a hotspot mining system under the Hadoop framework proposed by the present invention.
Embodiment 1
As shown in Fig. 1, the construction of a hotspot mining system under the Hadoop framework proposed by the present invention comprises the following steps:
S1. A cloud Hadoop cluster module collects data A from the network and preprocesses data A to obtain preprocessed data B;
S2. The cloud Hadoop cluster module sends preprocessed data B to the mining system;
S3. Preprocessed data B is segmented into words to obtain keyword set C;
S4. Each keyword D in keyword set C is screened against the previous hot-information dictionary;
If keyword D is a hot word that appears in the previous hot-information dictionary, keyword D is discarded;
If keyword D is not a hot word that appears in the previous hot-information dictionary, step S5 is performed;
S5. The keywords D are sorted from high to low by a combined ranking of their frequency of occurrence and repost count within the current time and a given historical time window; hot words E are selected and hot word set F is constructed;
S6. A word co-occurrence network is constructed from the correlations between the hot words E in hot word set F;
S7. Hot word set F is partitioned with a clustering algorithm according to the word co-occurrence network to obtain the hot topic set.
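Step S5 combines frequency of occurrence and repost count but gives no formula. The sketch below is one plausible reading under stated assumptions: both quantities are normalized by their maximum and mixed with a hypothetical `repost_weight`, and the input is assumed to be already restricted to the chosen time window. The names `rank_hot_words` and `top_k` are illustrative only.

```python
from collections import Counter


def rank_hot_words(posts, top_k=10, repost_weight=0.5):
    """Rank keywords by a weighted mix of document frequency and repost count.

    posts: iterable of (keywords, repost_count) pairs from the time window.
    Returns the top_k keywords, highest score first (a stand-in for hot word set F).
    """
    frequency = Counter()
    reposts = Counter()
    for keywords, repost_count in posts:
        for word in set(keywords):  # count each word once per post
            frequency[word] += 1
            reposts[word] += repost_count
    max_freq = max(frequency.values(), default=1)
    max_reposts = max(reposts.values(), default=1) or 1  # avoid division by zero
    scores = {
        w: (1 - repost_weight) * frequency[w] / max_freq
        + repost_weight * reposts[w] / max_reposts
        for w in frequency
    }
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

Normalizing both signals keeps a single viral post from overwhelming frequency entirely. A rank-sum scheme, in which each keyword's position in the two separate rankings is added, would be an equally defensible interpretation of "combined ranking".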
In an optional embodiment, preprocessing data A includes deduplication and garbled-text filtering.
Fig. 2 is a system block diagram of the construction of a hotspot mining system under the Hadoop framework proposed by the present invention.
As shown in Fig. 2, the hotspot mining system under the Hadoop framework proposed by the present invention comprises a cloud Hadoop cluster module and a mining system. The cloud Hadoop cluster module is communicatively connected to the mining system; it collects data A from the network, processes data A, and sends the result to the mining system. The mining system processes the received data to obtain the hot topic set.
In an optional embodiment, the cloud Hadoop cluster module includes a data acquisition module and a data processing module;
The data acquisition module is communicatively connected to the data processing module; it collects data A from the network and sends data A to the data processing module;
The data processing module is communicatively connected to the mining system; it preprocesses data A to obtain preprocessed data B.
In an optional embodiment, the data acquisition module obtains data A from the network by means of a web crawler.
In an optional embodiment, the mining system includes a word segmentation module, a screening judgment module, a hot information sorting module, a word co-occurrence network construction module, a cluster calculation module, and a storage module;
The storage module stores previous hot information; the previous hot information stored in the storage module forms the previous hot-information dictionary;
The word segmentation module is communicatively connected to the data processing module; it segments preprocessed data B into words to obtain keyword set C;
The screening judgment module is communicatively connected to the word segmentation module and to the storage module; it screens each keyword D in keyword set C against the previous hot-information dictionary;
The hot information sorting module is communicatively connected to the screening judgment module; it sorts the screened keywords D from high to low by a combined ranking of their frequency of occurrence and repost count within the current time and a given historical time window, selects hot words E, and constructs hot word set F;
The word co-occurrence network construction module is communicatively connected to the hot information sorting module; it calculates the correlations between the hot words E in hot word set F and constructs the word co-occurrence network;
The cluster calculation module is communicatively connected to the word co-occurrence network construction module; it partitions hot word set F with a hot word clustering algorithm based on multi-label propagation, according to the word co-occurrence network, to obtain the hot topic set.
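The embodiment names multi-label propagation but gives no further detail. The sketch below is a deliberate simplification: it implements plain single-label propagation on a weighted co-occurrence graph, so each hot word ends up in exactly one topic, whereas a multi-label variant would let a word belong to several. The edge format and all names are assumptions made for this example.

```python
from collections import defaultdict


def label_propagation(edges, max_iter=100):
    """Cluster nodes of a weighted undirected graph by label propagation.

    edges: {(u, v): weight}. Returns a sorted list of clusters, each a sorted
    list of nodes. Nodes without any edge do not appear in the result.
    """
    neighbors = defaultdict(dict)
    for (u, v), w in edges.items():
        neighbors[u][v] = w
        neighbors[v][u] = w
    labels = {node: node for node in neighbors}  # every node starts alone
    for _ in range(max_iter):
        changed = False
        for node in sorted(neighbors):  # fixed visiting order for determinism
            votes = defaultdict(float)
            for other, w in neighbors[node].items():
                votes[labels[other]] += w
            best = max(votes.values())
            candidates = sorted(l for l, s in votes.items() if s == best)
            # Keep the current label when it ties for best; otherwise adopt
            # the smallest of the best-scoring labels.
            new_label = labels[node] if labels[node] in candidates else candidates[0]
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    clusters = defaultdict(list)
    for node, label in labels.items():
        clusters[label].append(node)
    return sorted(sorted(members) for members in clusters.values())
```

The fixed visiting order and tie-breaking rule make the result reproducible, at the cost of the randomization that label propagation usually relies on to avoid order bias. Each returned cluster can be read as one candidate hot topic.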
In an optional embodiment, when the screening judgment module screens each keyword D:
If keyword D is a hot word that appears in the previous hot-information dictionary, keyword D is discarded;
If keyword D is not a hot word that appears in the previous hot-information dictionary, keyword D is sent to the hot information sorting module.
In the present invention, the Hadoop cloud computing platform is combined with the mining system, and the highly scalable storage of Hadoop provides the foundation for implementing the data mining system. The cloud Hadoop cluster module obtains data A from the network and processes it; the mining system then segments the processed data into words and screens the resulting keywords against the previous hot-information dictionary to obtain hot words E and construct hot word set F. After the correlations between the hot words E are calculated, a hot word co-occurrence network is constructed, and a hot word clustering algorithm partitions hot word set F so that the hot topic set is obtained quickly, which greatly improves the efficiency and accuracy of obtaining hot topics from the network.
It should be understood that the above specific embodiments of the present invention serve only to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the present invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or equivalents of such scope and boundaries.

Claims (6)

1. The construction of a hotspot mining system under the Hadoop framework, characterized by comprising the following steps:
S1. A cloud Hadoop cluster module collects data A from the network and preprocesses data A to obtain preprocessed data B;
S2. The cloud Hadoop cluster module sends preprocessed data B to the mining system;
S3. Preprocessed data B is segmented into words to obtain keyword set C;
S4. Each keyword D in keyword set C is screened against the previous hot-information dictionary;
If keyword D is a hot word that appears in the previous hot-information dictionary, keyword D is discarded;
If keyword D is not a hot word that appears in the previous hot-information dictionary, step S5 is performed;
S5. The keywords D are sorted from high to low by a combined ranking of their frequency of occurrence and repost count within the current time and a given historical time window; hot words E are selected and hot word set F is constructed;
S6. A word co-occurrence network is constructed from the correlations between the hot words E in hot word set F;
S7. Hot word set F is partitioned with a clustering algorithm according to the word co-occurrence network to obtain the hot topic set.
2. The construction of a hotspot mining system under the Hadoop framework according to claim 1, characterized in that preprocessing data A includes deduplication and garbled-text filtering.
3. The construction of a hotspot mining system under the Hadoop framework according to claim 1, characterized in that the cloud Hadoop cluster module includes a data acquisition module and a data processing module; the data acquisition module is communicatively connected to the data processing module, collects data A from the network, and sends data A to the data processing module;
The data processing module is communicatively connected to the mining system and preprocesses data A to obtain preprocessed data B.
4. The construction of a hotspot mining system under the Hadoop framework according to claim 3, characterized in that the data acquisition module obtains data A from the network by means of a web crawler.
5. The construction of a hotspot mining system under the Hadoop framework according to claim 3, characterized in that the mining system includes a word segmentation module, a screening judgment module, a hot information sorting module, a word co-occurrence network construction module, a cluster calculation module, and a storage module;
The storage module stores previous hot information; the previous hot information stored in the storage module forms the previous hot-information dictionary;
The word segmentation module is communicatively connected to the data processing module and segments preprocessed data B into words to obtain keyword set C;
The screening judgment module is communicatively connected to the word segmentation module and to the storage module, and screens each keyword D in keyword set C against the previous hot-information dictionary;
The hot information sorting module is communicatively connected to the screening judgment module; it sorts the screened keywords D from high to low by a combined ranking of their frequency of occurrence and repost count within the current time and a given historical time window, selects hot words E, and constructs hot word set F;
The word co-occurrence network construction module is communicatively connected to the hot information sorting module; it calculates the correlations between the hot words E in hot word set F and constructs the word co-occurrence network;
The cluster calculation module is communicatively connected to the word co-occurrence network construction module; it partitions hot word set F with a word clustering algorithm according to the word co-occurrence network to obtain the hot topic set.
6. The construction of a hotspot mining system under the Hadoop framework according to claim 5, characterized in that, when the screening judgment module screens each keyword D:
If keyword D is a hot word that appears in the previous hot-information dictionary, keyword D is discarded;
If keyword D is not a hot word that appears in the previous hot-information dictionary, keyword D is sent to the hot information sorting module.
CN201910570822.6A 2019-06-28 2019-06-28 A kind of building of hot spot digging system under Hadoop frame Pending CN110377823A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910570822.6A CN110377823A (en) 2019-06-28 2019-06-28 A kind of building of hot spot digging system under Hadoop frame

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910570822.6A CN110377823A (en) 2019-06-28 2019-06-28 A kind of building of hot spot digging system under Hadoop frame

Publications (1)

Publication Number Publication Date
CN110377823A true CN110377823A (en) 2019-10-25

Family

ID=68251202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910570822.6A Pending CN110377823A (en) 2019-06-28 2019-06-28 A kind of building of hot spot digging system under Hadoop frame

Country Status (1)

Country Link
CN (1) CN110377823A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114938477A (en) * 2022-06-23 2022-08-23 阿里巴巴(中国)有限公司 Video topic determination method, device and equipment
CN114938477B (en) * 2022-06-23 2024-05-03 阿里巴巴(中国)有限公司 Video topic determination method, device and equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030101187A1 (en) * 2001-10-19 2003-05-29 Xerox Corporation Methods, systems, and articles of manufacture for soft hierarchical clustering of co-occurring objects
CN103617169A (en) * 2013-10-23 2014-03-05 杭州电子科技大学 Microblog hot topic extracting method based on Hadoop
CN103678670A (en) * 2013-12-25 2014-03-26 福州大学 Micro-blog hot word and hot topic mining system and method

Similar Documents

Publication Publication Date Title
CN105389349B (en) Dictionary update method and device
CN105160038B (en) Data analysis method and system based on audit database
CN104965905B (en) A kind of method and apparatus of Web page classifying
Liu et al. Weighted graph clustering for community detection of large social networks
CN110223168A (en) A kind of anti-fraud detection method of label propagation and system based on business connection map
CN106844640B (en) Webpage data analysis processing method
CN103678670A (en) Micro-blog hot word and hot topic mining system and method
CN106815307A (en) Public Culture knowledge mapping platform and its use method
CN106940679A (en) Data processing method and device
CN108038205A (en) For the viewpoint analysis prototype system of Chinese microblogging
CN104199974A (en) Microblog-oriented dynamic topic detection and evolution tracking method
CN105069025A (en) Intelligent aggregation visualization and management control system for big data
CN102411638A (en) Method for generating multimedia summary of news search result
CN104504024B (en) Keyword method for digging based on content of microblog and system
CN104102658B (en) Content of text method for digging and device
CN111382956A (en) Enterprise group relationship mining method and device
CN106557558A (en) A kind of data analysing method and device
CN111831802A (en) Urban domain knowledge detection system and method based on LDA topic model
CN111198897B (en) Scientific research hotspot topic analysis method and device and electronic equipment
CN109947934A (en) For the data digging method and system of short text
CN103218368B (en) A kind of method and apparatus excavating hot word
CN105518644A (en) Method for processing and displaying real-time social data on map
CN108984514A (en) Acquisition methods and device, storage medium, the processor of word
CN109597926A (en) A kind of information acquisition method and system based on social media emergency event
CN111598700A (en) Financial wind control system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191025
