CN110377823A - Construction of a hot topic mining system under the Hadoop framework
- Publication number: CN110377823A
- Application number: CN201910570822.6A
- Authority: CN (China)
- Prior art keywords: hot, module, keyword, information, word
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
Construction of a hot topic mining system under the Hadoop framework, comprising the following steps: a cloud Hadoop cluster module collects data information A from the network, preprocesses it to obtain preprocessed data information B, and sends the preprocessed data information B to a mining system; the preprocessed data information B is segmented into words to obtain a keyword set C; each keyword D in the keyword set C is screened against a previous hot-information dictionary; the keywords D are sorted from high to low, hot words E are selected, and a hot word set F is constructed; a word co-occurrence network is constructed according to the correlation between the hot words E in the hot word set F; and the hot word set F is partitioned with a clustering algorithm according to the word co-occurrence network to obtain a set of hot topics. The invention can obtain the set of hot topics quickly and can improve the accuracy with which hot topics are obtained from the network.
Description
Technical field
The present invention relates to the technical field of language data processing, and in particular to the construction of a hot topic mining system under the Hadoop framework.
Background art
With the rapid development of the internet, more and more users participate in it. Users can publish information on the internet or obtain the information they need, and the amount of information on the internet grows daily. Because the volume of information is so large, however, users have difficulty finding and following hot information. More and more users now pay attention to hot topics, but they often cannot effectively obtain the hot information they need from the network. For this reason, the present application proposes the construction of a hot topic mining system under the Hadoop framework, to help users quickly obtain a set of hot topics from the network and to improve the accuracy with which hot topics are obtained from the network.
Summary of the invention
(1) Object of the invention
To solve the technical problems described in the background art, the present invention proposes the construction of a hot topic mining system under the Hadoop framework. The invention can obtain a set of hot topics quickly and can improve the accuracy with which hot topics are obtained from the network.
(2) Technical solution
To solve the above problems, the present invention provides the construction of a hot topic mining system under the Hadoop framework, comprising the following steps:
S1. A cloud Hadoop cluster module collects data information A from the network and preprocesses the data information A to obtain preprocessed data information B.
S2. The cloud Hadoop cluster module sends the preprocessed data information B to the mining system.
S3. The preprocessed data information B is segmented into words to obtain a keyword set C.
S4. Each keyword D in the keyword set C is screened against a previous hot-information dictionary:
if the keyword D is a hot word that appears in the previous hot-information dictionary, the keyword D is discarded;
if the keyword D is not a hot word that appears in the previous hot-information dictionary, the operation in S5 is performed.
S5. The keywords D are sorted from high to low according to a composite ranking of their frequency of occurrence and repost count within the current time and a given historical time window; hot words E are selected, and a hot word set F is constructed.
S6. A word co-occurrence network is constructed according to the correlation between the hot words E in the hot word set F.
S7. The hot word set F is partitioned with a clustering algorithm according to the word co-occurrence network to obtain a set of hot topics.
Preferably, preprocessing the data information A includes deduplication and filtering out garbled text.
Preferably, the cloud Hadoop cluster module includes a data acquisition module and a data processing module. The data acquisition module is communicatively connected to the data processing module; the data acquisition module collects the data information A from the network and sends it to the data processing module.
The data processing module is communicatively connected to the mining system; the data processing module preprocesses the data information A to obtain the preprocessed data information B.
Preferably, the data acquisition module obtains the data information A from the network by means of a web crawler.
Preferably, the mining system includes a word segmentation module, a screening and judgment module, a hot-information ranking module, a word co-occurrence network construction module, a clustering module, and a storage module.
The storage module stores previous hot information; the previous hot information stored in the storage module forms the previous hot-information dictionary.
The word segmentation module is communicatively connected to the data processing module and segments the preprocessed data information B to obtain the keyword set C.
The screening and judgment module is communicatively connected to the word segmentation module and to the storage module, and screens each keyword D in the keyword set C against the previous hot-information dictionary.
The hot-information ranking module is communicatively connected to the screening and judgment module; it sorts the screened keywords D from high to low according to a composite ranking of their frequency of occurrence and repost count within the current time and a given historical time window, selects hot words E, and constructs the hot word set F.
The word co-occurrence network construction module is communicatively connected to the hot-information ranking module; it calculates the correlation between the hot words E in the hot word set F and constructs the word co-occurrence network.
The clustering module is communicatively connected to the word co-occurrence network construction module; it partitions the hot word set F with a word clustering algorithm according to the word co-occurrence network to obtain the set of hot topics.
Preferably, when the screening and judgment module screens each keyword D:
if the keyword D is a hot word that appears in the previous hot-information dictionary, the keyword D is discarded;
if the keyword D is not a hot word that appears in the previous hot-information dictionary, the keyword D is sent to the hot-information ranking module.
The above technical solution of the present invention has the following beneficial technical effects:
In the present invention, the cloud computing platform Hadoop is used together with the mining system, and the strong storage scalability of Hadoop provides the foundation for implementing the data mining system. The cloud Hadoop cluster module obtains the data information A from the network and processes it. The mining system then segments the processed data information into words, screens the resulting keywords against the previous hot-information dictionary to obtain hot words E, and constructs the hot word set F. After the correlation between the hot words E is calculated, a hot word co-occurrence network is constructed, and the hot word set F is partitioned with a hot word clustering algorithm, so that the set of hot topics is obtained quickly. This greatly improves the efficiency and accuracy of obtaining hot topics from the network.
Brief description of the drawings
Fig. 1 is a flow chart of the construction of a hot topic mining system under the Hadoop framework proposed by the present invention.
Fig. 2 is a block diagram of the system principle of the construction of a hot topic mining system under the Hadoop framework proposed by the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below with reference to specific embodiments and the accompanying drawings. It should be understood that these descriptions are merely illustrative and are not intended to limit the scope of the present invention. In addition, descriptions of well-known structures and technologies are omitted below to avoid unnecessarily obscuring the concepts of the present invention.
Embodiment 1
As shown in Fig. 1, the construction of a hot topic mining system under the Hadoop framework proposed by the present invention comprises the following steps:
S1. A cloud Hadoop cluster module collects data information A from the network and preprocesses the data information A to obtain preprocessed data information B.
S2. The cloud Hadoop cluster module sends the preprocessed data information B to the mining system.
S3. The preprocessed data information B is segmented into words to obtain a keyword set C.
S4. Each keyword D in the keyword set C is screened against a previous hot-information dictionary:
if the keyword D is a hot word that appears in the previous hot-information dictionary, the keyword D is discarded;
if the keyword D is not a hot word that appears in the previous hot-information dictionary, the operation in S5 is performed.
S5. The keywords D are sorted from high to low according to a composite ranking of their frequency of occurrence and repost count within the current time and a given historical time window; hot words E are selected, and a hot word set F is constructed.
S6. A word co-occurrence network is constructed according to the correlation between the hot words E in the hot word set F.
S7. The hot word set F is partitioned with a clustering algorithm according to the word co-occurrence network to obtain a set of hot topics.
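Step S5 does not spell out how the frequency of occurrence and the repost count are combined into one ranking. As a minimal sketch, assuming a simple rank-sum rule (an assumption for illustration, not the method claimed here), the selection of hot words E could look like the following Python. The function name `select_hot_words`, the `stats` mapping, and the `top_k` cutoff are hypothetical.

```python
def select_hot_words(stats, top_k):
    """Pick hot words E from screened keywords D (sketch of step S5).

    stats maps each keyword to (frequency, reposts), both counted over the
    current time plus the chosen historical window. The rank-sum rule below
    is an assumed way of forming the composite ranking.
    """
    # Rank keywords by each signal separately, highest value first.
    by_freq = sorted(stats, key=lambda k: (-stats[k][0], k))
    by_repost = sorted(stats, key=lambda k: (-stats[k][1], k))
    freq_rank = {k: i for i, k in enumerate(by_freq)}
    repost_rank = {k: i for i, k in enumerate(by_repost)}

    # A lower combined rank means a hotter keyword; ties break alphabetically.
    combined = sorted(stats, key=lambda k: (freq_rank[k] + repost_rank[k], k))
    return combined[:top_k]


# Illustrative counts only; these numbers are made up for the example.
example_stats = {
    "flood": (50, 300),
    "rain": (60, 10),
    "goal": (45, 250),
    "weather": (5, 2),
}
hot_word_set = set(select_hot_words(example_stats, top_k=3))
```

A weighted score or a threshold on the combined rank would fit the same description; the rank-sum form is used here only because it needs no tuning parameters.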
In an optional embodiment, preprocessing the data information A includes deduplication and filtering out garbled text.
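The two preprocessing operations can be illustrated with a short Python sketch. It assumes that data information A arrives as a list of text records and that "garbled" means a high share of replacement, private-use, unassigned, or control characters. Both the heuristic and the names `is_garbled`, `preprocess`, and `max_bad_ratio` are assumptions made for illustration.

```python
import unicodedata


def is_garbled(text, max_bad_ratio=0.1):
    """Heuristic check for garbled text (assumed definition, see lead-in)."""
    if not text:
        return True
    bad = 0
    for ch in text:
        cat = unicodedata.category(ch)
        # U+FFFD marks undecodable bytes; Co/Cn are private-use/unassigned;
        # Cc covers control characters other than ordinary whitespace.
        if ch == "\ufffd" or cat in ("Co", "Cn") or (cat == "Cc" and ch not in "\n\t\r"):
            bad += 1
    return bad / len(text) > max_bad_ratio


def preprocess(records):
    """Turn data information A into preprocessed data information B."""
    seen = set()
    cleaned = []
    for text in records:
        # Collapse whitespace so trivially different copies count as duplicates.
        normalized = " ".join(text.split())
        if normalized in seen or is_garbled(normalized):
            continue
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned
```

On a Hadoop cluster the same logic would more likely run as a distributed job; the single-process version above only shows the filtering rules.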
As shown in Fig. 2, the hot topic mining system under the Hadoop framework proposed by the present invention includes a cloud Hadoop cluster module and a mining system. The cloud Hadoop cluster module is communicatively connected to the mining system; it collects the data information A from the network, processes it, and sends the processed data information to the mining system. The mining system processes the received data information to obtain the set of hot topics.
In an optional embodiment, the cloud Hadoop cluster module includes a data acquisition module and a data processing module.
The data acquisition module is communicatively connected to the data processing module; the data acquisition module collects the data information A from the network and sends it to the data processing module.
The data processing module is communicatively connected to the mining system; the data processing module preprocesses the data information A to obtain the preprocessed data information B.
In an optional embodiment, the data acquisition module obtains the data information A from the network by means of a web crawler.
In an optional embodiment, the mining system includes a word segmentation module, a screening and judgment module, a hot-information ranking module, a word co-occurrence network construction module, a clustering module, and a storage module.
The storage module stores previous hot information; the previous hot information stored in the storage module forms the previous hot-information dictionary.
The word segmentation module is communicatively connected to the data processing module and segments the preprocessed data information B to obtain the keyword set C.
The screening and judgment module is communicatively connected to the word segmentation module and to the storage module, and screens each keyword D in the keyword set C against the previous hot-information dictionary.
The hot-information ranking module is communicatively connected to the screening and judgment module; it sorts the screened keywords D from high to low according to a composite ranking of their frequency of occurrence and repost count within the current time and a given historical time window, selects hot words E, and constructs the hot word set F.
The word co-occurrence network construction module is communicatively connected to the hot-information ranking module; it calculates the correlation between the hot words E in the hot word set F and constructs the word co-occurrence network.
The clustering module is communicatively connected to the word co-occurrence network construction module; it partitions the hot word set F according to the word co-occurrence network, using a hot word clustering algorithm based on multi-label propagation, to obtain the set of hot topics.
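The last two modules can be sketched together in Python. The sketch assumes that "correlation" is measured as the number of documents in which two hot words appear together, and it uses plain single-label propagation as a simplified stand-in for the multi-label propagation algorithm named above, whose details the text does not give. The names `build_cooccurrence` and `label_propagation` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(docs, hot_words):
    """Count, for each pair of hot words, the documents containing both."""
    edges = Counter()
    for tokens in docs:
        present = sorted(set(tokens) & hot_words)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


def label_propagation(edges, max_iter=20):
    """Group hot words into topics by propagating labels over weighted edges.

    Hot words with no co-occurrence edge are not part of the network and
    therefore do not appear in the result.
    """
    neighbours = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight

    labels = {word: word for word in neighbours}  # every word starts alone
    for _ in range(max_iter):
        changed = False
        for word in sorted(neighbours):  # fixed order keeps runs repeatable
            scores = Counter()
            for other, weight in neighbours[word].items():
                scores[labels[other]] += weight
            # Heaviest label wins; ties go to the alphabetically smallest label.
            best = min(scores, key=lambda lab: (-scores[lab], lab))
            if best != labels[word]:
                labels[word] = best
                changed = True
        if not changed:
            break

    topics = defaultdict(set)
    for word, lab in labels.items():
        topics[lab].add(word)
    return sorted(sorted(words) for words in topics.values())
```

Each returned list is one candidate hot topic. A multi-label variant would let a hot word keep several labels with weights, so that a word could belong to more than one topic.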
In an optional embodiment, when the screening and judgment module screens each keyword D:
if the keyword D is a hot word that appears in the previous hot-information dictionary, the keyword D is discarded;
if the keyword D is not a hot word that appears in the previous hot-information dictionary, the keyword D is sent to the hot-information ranking module.
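The screening rule amounts to a set-membership test. A minimal Python sketch, assuming the previous hot-information dictionary is available as an in-memory set of words (the function name `screen_keywords` is hypothetical):

```python
def screen_keywords(keywords, previous_hot_words):
    """Keep only keywords D that are absent from the previous dictionary.

    Keywords found in the dictionary are discarded; the rest are passed on
    to the ranking stage in their original order.
    """
    return [kw for kw in keywords if kw not in previous_hot_words]


# Illustrative data only.
forwarded = screen_keywords(["flood", "election", "rescue"], {"election"})
```

Holding the dictionary in a `set` keeps each lookup at roughly constant time, which matters when the keyword set C is large.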
In the present invention, the cloud computing platform Hadoop is used together with the mining system, and the strong storage scalability of Hadoop provides the foundation for implementing the data mining system. The cloud Hadoop cluster module obtains the data information A from the network and processes it. The mining system then segments the processed data information into words, screens the resulting keywords against the previous hot-information dictionary to obtain hot words E, and constructs the hot word set F. After the correlation between the hot words E is calculated, a hot word co-occurrence network is constructed, and the hot word set F is partitioned with a hot word clustering algorithm, so that the set of hot topics is obtained quickly. This greatly improves the efficiency and accuracy of obtaining hot topics from the network.
It should be understood that the above specific embodiments of the present invention are used only to illustrate or explain the principles of the present invention and do not limit the present invention. Therefore, any modification, equivalent replacement, improvement, and the like made without departing from the spirit and scope of the present invention shall fall within the protection scope of the present invention. In addition, the appended claims of the present invention are intended to cover all variations and modifications that fall within the scope and boundaries of the appended claims, or equivalents of such scope and boundaries.
Claims (6)
1. Construction of a hot topic mining system under the Hadoop framework, characterized by comprising the following steps:
S1. A cloud Hadoop cluster module collects data information A from the network and preprocesses the data information A to obtain preprocessed data information B.
S2. The cloud Hadoop cluster module sends the preprocessed data information B to the mining system.
S3. The preprocessed data information B is segmented into words to obtain a keyword set C.
S4. Each keyword D in the keyword set C is screened against a previous hot-information dictionary:
if the keyword D is a hot word that appears in the previous hot-information dictionary, the keyword D is discarded;
if the keyword D is not a hot word that appears in the previous hot-information dictionary, the operation in S5 is performed.
S5. The keywords D are sorted from high to low according to a composite ranking of their frequency of occurrence and repost count within the current time and a given historical time window; hot words E are selected, and a hot word set F is constructed.
S6. A word co-occurrence network is constructed according to the correlation between the hot words E in the hot word set F.
S7. The hot word set F is partitioned with a clustering algorithm according to the word co-occurrence network to obtain a set of hot topics.
2. The construction of a hot topic mining system under the Hadoop framework according to claim 1, characterized in that preprocessing the data information A includes deduplication and filtering out garbled text.
3. The construction of a hot topic mining system under the Hadoop framework according to claim 1, characterized in that the cloud Hadoop cluster module includes a data acquisition module and a data processing module; the data acquisition module is communicatively connected to the data processing module, collects the data information A from the network, and sends the data information A to the data processing module;
the data processing module is communicatively connected to the mining system and preprocesses the data information A to obtain the preprocessed data information B.
4. The construction of a hot topic mining system under the Hadoop framework according to claim 3, characterized in that the data acquisition module obtains the data information A from the network by means of a web crawler.
5. The construction of a hot topic mining system under the Hadoop framework according to claim 3, characterized in that the mining system includes a word segmentation module, a screening and judgment module, a hot-information ranking module, a word co-occurrence network construction module, a clustering module, and a storage module;
the storage module stores previous hot information, and the previous hot information stored in the storage module forms the previous hot-information dictionary;
the word segmentation module is communicatively connected to the data processing module and segments the preprocessed data information B to obtain the keyword set C;
the screening and judgment module is communicatively connected to the word segmentation module and to the storage module, and screens each keyword D in the keyword set C against the previous hot-information dictionary;
the hot-information ranking module is communicatively connected to the screening and judgment module, sorts the screened keywords D from high to low according to a composite ranking of their frequency of occurrence and repost count within the current time and a given historical time window, selects hot words E, and constructs the hot word set F;
the word co-occurrence network construction module is communicatively connected to the hot-information ranking module, calculates the correlation between the hot words E in the hot word set F, and constructs the word co-occurrence network;
the clustering module is communicatively connected to the word co-occurrence network construction module and partitions the hot word set F with a word clustering algorithm according to the word co-occurrence network to obtain the set of hot topics.
6. The construction of a hot topic mining system under the Hadoop framework according to claim 5, characterized in that, when the screening and judgment module screens each keyword D:
if the keyword D is a hot word that appears in the previous hot-information dictionary, the keyword D is discarded;
if the keyword D is not a hot word that appears in the previous hot-information dictionary, the keyword D is sent to the hot-information ranking module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910570822.6A CN110377823A (en) | 2019-06-28 | 2019-06-28 | Construction of a hot topic mining system under the Hadoop framework |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910570822.6A CN110377823A (en) | 2019-06-28 | 2019-06-28 | Construction of a hot topic mining system under the Hadoop framework |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110377823A (en) | 2019-10-25 |
Family ID: 68251202
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910570822.6A Pending CN110377823A (en) | Construction of a hot topic mining system under the Hadoop framework | 2019-06-28 | 2019-06-28 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110377823A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114938477A (en) * | 2022-06-23 | 2022-08-23 | 阿里巴巴(中国)有限公司 | Video topic determination method, device and equipment |
CN114938477B (en) * | 2022-06-23 | 2024-05-03 | 阿里巴巴(中国)有限公司 | Video topic determination method, device and equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030101187A1 (en) * | 2001-10-19 | 2003-05-29 | Xerox Corporation | Methods, systems, and articles of manufacture for soft hierarchical clustering of co-occurring objects |
CN103617169A (en) * | 2013-10-23 | 2014-03-05 | 杭州电子科技大学 | Microblog hot topic extracting method based on Hadoop |
CN103678670A (en) * | 2013-12-25 | 2014-03-26 | 福州大学 | Micro-blog hot word and hot topic mining system and method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105389349B (en) | Dictionary update method and device | |
CN105160038B (en) | Data analysis method and system based on audit database | |
CN104965905B (en) | A kind of method and apparatus of Web page classifying | |
Liu et al. | Weighted graph clustering for community detection of large social networks | |
CN110223168A (en) | A kind of anti-fraud detection method of label propagation and system based on business connection map | |
CN106844640B (en) | Webpage data analysis processing method | |
CN103678670A (en) | Micro-blog hot word and hot topic mining system and method | |
CN106815307A (en) | Public Culture knowledge mapping platform and its use method | |
CN106940679A (en) | Data processing method and device | |
CN108038205A (en) | For the viewpoint analysis prototype system of Chinese microblogging | |
CN104199974A (en) | Microblog-oriented dynamic topic detection and evolution tracking method | |
CN105069025A (en) | Intelligent aggregation visualization and management control system for big data | |
CN102411638A (en) | Method for generating multimedia summary of news search result | |
CN104504024B (en) | Keyword method for digging based on content of microblog and system | |
CN104102658B (en) | Content of text method for digging and device | |
CN111382956A (en) | Enterprise group relationship mining method and device | |
CN106557558A (en) | A kind of data analysing method and device | |
CN111831802A (en) | Urban domain knowledge detection system and method based on LDA topic model | |
CN111198897B (en) | Scientific research hotspot topic analysis method and device and electronic equipment | |
CN109947934A (en) | For the data digging method and system of short text | |
CN103218368B (en) | A kind of method and apparatus excavating hot word | |
CN105518644A (en) | Method for processing and displaying real-time social data on map | |
CN108984514A (en) | Acquisition methods and device, storage medium, the processor of word | |
CN109597926A (en) | A kind of information acquisition method and system based on social media emergency event | |
CN111598700A (en) | Financial wind control system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20191025 |