CN109241432A - Discrete data acquisition analysis system and method - Google Patents
Discrete data acquisition analysis system and method
- Publication number
- CN109241432A (application CN201811045808.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- module
- term
- discrete
- database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The present invention discloses a discrete data acquisition and analysis system and method. The system includes: a data acquisition module for acquiring discrete data in real time and across multiple dimensions, the data including web page media text data, internet data crawled by web crawlers, Hadoop data, server operation log data, and data from other business systems that can be accessed; a data analysis module for parsing and cleaning the collected data; a data storage module for storing the cleaned data by means of an inverted index and establishing a mapping between the segmented words (tokens) in the data and the documents in which they occur; a data retrieval module for analyzing an input search term, extracting keywords, and fetching the target data corresponding to the search term from the data storage module; and a data visualization module for scoring and ranking the target data against the search term and presenting data correlations as a topology graph. The invention draws on more data acquisition channels, and its retrieval is more efficient and more accurate.
Description
Technical field
The present invention relates to the technical field of big data crawling, and in particular to a discrete data acquisition and analysis system and method.
Background
In the post-big-data era, once data has been collected at scale, the key question is how to clean petabyte-scale data that is multi-source, heterogeneous, of different types and discrete, then mine the intrinsic and latent relationships in that data, even distinguish popularity and relevance, and visualize the internal relationships of the data with various charts. At present, however, data acquisition usually relies on few collection channels, so the collected data cannot be comprehensive, which makes search results imprecise and retrieval inefficient. On the data mining side, there is a lack of effective methods for deeply mining the intrinsic and latent relationships of discrete data, which keeps data utilization low.
Summary of the invention
In view of the problems and shortcomings of the prior art, the present invention provides a discrete data acquisition and analysis system and method.
The present invention solves the above technical problem through the following technical solutions:
The present invention provides a discrete data acquisition and analysis system, characterized in that it comprises a data acquisition module, a data analysis module, a data storage module, a data retrieval module and a data visualization module;
The data acquisition module is used to acquire discrete data in real time and across multiple dimensions, the data including web page media text data, internet data crawled by web crawlers, Hadoop data, server operation log data, and data from other business systems that can be accessed;
The data analysis module is used to parse and clean the collected data;
The data storage module is used to store the cleaned data by means of an inverted index and to establish a mapping between the segmented words (tokens) in the data and the documents in which they occur;
The data retrieval module is used to analyze an input search term, extract keywords, and fetch the target data corresponding to the search term from the data storage module;
The data visualization module is used to score and rank the target data against the search term and to present data correlations as a topology graph.
Preferably, the data retrieval module is used to provide a corresponding word segmenter for each language.
Preferably, the other business systems include traditional databases, the traditional databases including Oracle, MySQL and SQL Server databases.
The present invention also provides a discrete data acquisition and analysis method, characterized in that it comprises the following steps:
S1, acquiring discrete data in real time and across multiple dimensions, the data including web page media text data, internet data crawled by web crawlers, Hadoop data, server operation log data, and data from other business systems that can be accessed;
S2, parsing and cleaning the collected data;
S3, storing the cleaned data by means of an inverted index, and establishing a mapping between the segmented words (tokens) in the data and the documents in which they occur;
S4, analyzing an input search term, extracting keywords, and fetching from the data storage module the data in the documents corresponding to the search term as target data;
S5, scoring and ranking the target data against the search term, and presenting data correlations as a topology graph.
Preferably, in step S4, a corresponding word segmenter is provided for each language.
Preferably, the other business systems include traditional databases, the traditional databases including Oracle, MySQL and SQL Server databases.
On the basis of common knowledge in the art, the above preferred conditions can be combined in any way to obtain the preferred embodiments of the present invention.
The positive effects of the present invention are as follows:
1. The invention acquires data through more channels, so the data is more diverse and more comprehensive, which provides a solid foundation for big data acquisition and analysis within a given industry.
2. The invention provides near-complete vocabulary recognition across languages by supplying a corresponding word segmenter for each language, which gives real-time search broader coverage and higher quality.
3. The invention mines and establishes a complete set of intrinsic relationships among large volumes of non-relational data and presents them as a topology graph; it can also connect to any mainstream graph presentation tool or plug-in to render any form of graph.
4. The invention can, as required, dump data in the cluster to storage targets such as files, mail, logs, traditional databases and Hadoop.
Brief description of the drawings
Fig. 1 is a structural block diagram of the discrete data acquisition and analysis system of a preferred embodiment of the present invention;
Fig. 2 shows the relationship between the original documents, the terms and the inverted index structure in a preferred embodiment of the present invention;
Fig. 3 is a flowchart of the discrete data acquisition and analysis method of a preferred embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it. On the contrary, the embodiments of the present invention include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.
As shown in Fig. 1, this embodiment provides a discrete data acquisition and analysis system comprising a data acquisition module 1, a data analysis module 2, a data storage module 3, a data retrieval module 4 and a data visualization module 5.
The data acquisition module 1 is used to acquire discrete data in real time and across multiple dimensions. The data include web page media text data, internet data crawled by web crawlers, Hadoop data, server operation log data, and data from other business systems that can be accessed (for example traditional databases such as Oracle, MySQL and SQL Server). With more acquisition channels, the data is more diverse and more comprehensive, which provides a solid foundation for big data acquisition and analysis within a given industry.
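The multi-channel acquisition described above can be illustrated with a minimal sketch. The source does not specify any interface for the acquisition module, so everything named here (`RawRecord`, `collect_all`, and the three stand-in collectors) is a hypothetical illustration in which each channel is wrapped behind the same callable shape and tagged with its origin.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class RawRecord:
    source: str  # acquisition channel the record came from
    doc_id: str  # identifier made unique by prefixing the channel name
    text: str    # unparsed payload, to be cleaned in a later stage


# A collector is a zero-argument callable yielding (local_id, text) pairs.
Collector = Callable[[], Iterable[Tuple[str, str]]]


def collect_all(collectors: Dict[str, Collector]) -> Iterator[RawRecord]:
    """Pull every record from every channel into one uniform stream."""
    for source, collector in collectors.items():
        for local_id, text in collector():
            yield RawRecord(source, f"{source}:{local_id}", text)


# Hypothetical stand-ins for a crawler, a log reader and a database query.
def crawl_pages():
    return [("p1", "<p>Market news</p>")]


def read_server_log():
    return [("42", "2018-09-07 ERROR disk full")]


def query_orders_table():
    return [("7", "order 7 shipped")]


records = list(collect_all({
    "web": crawl_pages,
    "log": read_server_log,
    "db": query_orders_table,
}))
```

Tagging each record with its channel keeps later stages independent of where the data came from, so a new channel only needs a new collector function.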
The data analysis module 2 is used to parse and clean the collected data so as to filter out invalid data.
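The source does not say which parsing or cleaning rules are applied. As a rough sketch, assuming that "invalid data" means markup remnants, empty records and exact duplicates, the step could look like the following; the helper names `parse_text` and `clean` are illustrative only.

```python
import html
import re
from typing import Iterable, List

_TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag matcher, enough for a sketch
_WS_RE = re.compile(r"\s+")


def parse_text(raw: str) -> str:
    """Strip tags, decode HTML entities and collapse whitespace."""
    no_tags = _TAG_RE.sub(" ", raw)
    return _WS_RE.sub(" ", html.unescape(no_tags)).strip()


def clean(raw_texts: Iterable[str]) -> List[str]:
    """Parse each record, then drop empty records and exact duplicates."""
    seen = set()
    cleaned = []
    for raw in raw_texts:
        text = parse_text(raw)
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

For example, `clean(["<p>Hello &amp; world</p>", "   ", "Hello & world"])` keeps a single `"Hello & world"` record, because the blank record is dropped and the third record duplicates the first after parsing.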
The data storage module 3 is used to store the cleaned data by means of an inverted index and to establish a mapping between the segmented words (tokens) in the data and the documents in which they occur. In an inverted index, the data is organized by term rather than by document.
An inverted index is a structure suited to fast full-text search. It consists of a list of all the distinct words that appear in the documents and, for each word, a list of the documents that contain it. The relationship between the original documents, the terms and the inverted index structure is shown in Fig. 2.
The data retrieval module 4 is used to analyze an input search term, extract keywords, and fetch the target data corresponding to the search term from the data storage module 3. A corresponding word segmenter is provided for each language, which gives real-time search broader coverage and higher quality.
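One way to picture per-language segmentation is a registry that maps a language code to a segmenter function. The sketch below is an assumption-laden illustration: the source names no concrete segmenters, the character-bigram approach for Chinese is a simple stand-in for a dictionary-based segmenter, and the stop-word list is invented for the example.

```python
import re
from typing import Callable, Dict, FrozenSet, List


def segment_english(text: str) -> List[str]:
    """Lowercase and keep runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def segment_chinese(text: str) -> List[str]:
    """Overlapping character bigrams over CJK ideographs (a crude stand-in)."""
    chars = re.findall(r"[\u4e00-\u9fff]", text)
    if len(chars) < 2:
        return chars
    return ["".join(pair) for pair in zip(chars, chars[1:])]


# Hypothetical registry: language code -> segmenter.
SEGMENTERS: Dict[str, Callable[[str], List[str]]] = {
    "en": segment_english,
    "zh": segment_chinese,
}

# Illustrative stop words only; a real list would be far longer.
STOP_WORDS: FrozenSet[str] = frozenset({"the", "a", "of", "and"})


def extract_keywords(query: str, lang: str) -> List[str]:
    """Segment the query with the segmenter for `lang`, then drop stop words."""
    segmenter = SEGMENTERS.get(lang)
    if segmenter is None:
        raise ValueError(f"no segmenter registered for language {lang!r}")
    return [tok for tok in segmenter(query) if tok not in STOP_WORDS]
```

With this registry, `extract_keywords("The analysis of discrete data", "en")` yields `["analysis", "discrete", "data"]`, and supporting another language means registering one more function.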
The data visualization module 5 is used to score and rank the target data against the search term and to present data correlations as a topology graph. Example applications include behavior analysis, anti-fraud, network security, drug discovery, personalized medicine, and building personalized recommendations from continuous real-time data.
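The source does not state which scoring formula or graph construction is used. Purely as an illustration, the sketch below assumes a smoothed TF-IDF score for ranking and treats two documents as related when they share at least one term; the function names `rank` and `relatedness_edges` are hypothetical, and the resulting edge list is the kind of input a graph drawing tool could render as a topology graph.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def rank(query_terms: List[str], docs: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Score each document by summed TF-IDF over the query terms, best first."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = sum(
            # Smoothed IDF, always positive, so any matching document scores > 0.
            term_freq[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t in query_terms
            if t in term_freq
        )
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def relatedness_edges(docs: Dict[str, List[str]], min_shared: int = 1) -> List[Tuple[str, str, int]]:
    """Return (doc_a, doc_b, shared_term_count) edges for related documents."""
    edges = []
    for (a, tokens_a), (b, tokens_b) in combinations(sorted(docs.items()), 2):
        shared = len(set(tokens_a) & set(tokens_b))
        if shared >= min_shared:
            edges.append((a, b, shared))
    return edges


docs = {
    "d1": ["discrete", "data", "data"],
    "d2": ["data", "cleaning"],
    "d3": ["topology", "graph"],
}
ranked = rank(["data"], docs)
edges = relatedness_edges(docs)
```

Here `d1` ranks above `d2` because it contains the query term twice, and the only edge links `d1` and `d2` through their shared term.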
As shown in Fig. 3, this embodiment also provides a discrete data acquisition and analysis method comprising the following steps:
Step 101, acquiring discrete data in real time and across multiple dimensions, the data including web page media text data, internet data crawled by web crawlers, Hadoop data, server operation log data, and data stored in traditional databases;
Step 102, parsing and cleaning the collected data;
Step 103, storing the cleaned data by means of an inverted index, and establishing a mapping between the segmented words (tokens) in the data and the documents in which they occur;
Step 104, analyzing an input search term, extracting keywords, and fetching from the data storage module the data in the documents corresponding to the search term as target data;
Step 105, scoring and ranking the target data against the search term, and presenting data correlations as a topology graph.
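To show how steps 101 to 105 chain together, here is a compact end-to-end sketch. It is deliberately simplified and rests on assumptions not found in the source: sources are plain in-memory lists, cleaning is whitespace normalization and lowercasing, segmentation is a whitespace split, the score is a raw keyword count, and the topology graph output of step 105 is omitted. The function name `run_pipeline` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def run_pipeline(sources: Dict[str, List[str]], query: str) -> List[Tuple[str, int]]:
    # Step 101: acquire records from every channel, tagging ids with the channel.
    raw = [
        (f"{name}:{i}", text)
        for name, texts in sorted(sources.items())
        for i, text in enumerate(texts)
    ]

    # Step 102: clean by normalizing whitespace and case, dropping blank records.
    cleaned = {
        doc_id: " ".join(text.split()).lower()
        for doc_id, text in raw
        if text.strip()
    }

    # Step 103: build the inverted index (term -> ids of containing documents).
    index = defaultdict(set)
    for doc_id, text in cleaned.items():
        for term in text.split():
            index[term].add(doc_id)

    # Step 104: extract keywords known to the index and fetch matching documents.
    keywords = [t for t in query.lower().split() if t in index]
    hits = set().union(*(index[t] for t in keywords)) if keywords else set()

    # Step 105: score each hit by keyword occurrences and rank, best first.
    scores = {d: sum(cleaned[d].split().count(t) for t in keywords) for d in hits}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


sources = {
    "logs": ["ERROR disk full", "  "],
    "web": ["Disk prices fall", "disk disk failure"],
}
results = run_pipeline(sources, "disk error")
```

In this run the blank log record is discarded during cleaning, and the two documents that match the keywords twice are ranked ahead of the one that matches once.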
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the present invention, the scope of which is defined by the appended claims and their equivalents.
Claims (6)
1. A discrete data acquisition and analysis system, characterized in that it comprises a data acquisition module, a data analysis module, a data storage module, a data retrieval module and a data visualization module;
The data acquisition module is used to acquire discrete data in real time and across multiple dimensions, the data including web page media text data, internet data crawled by web crawlers, Hadoop data, server operation log data, and data from other business systems that can be accessed;
The data analysis module is used to parse and clean the collected data;
The data storage module is used to store the cleaned data by means of an inverted index and to establish a mapping between the segmented words (tokens) in the data and the documents in which they occur;
The data retrieval module is used to analyze an input search term, extract keywords, and fetch the target data corresponding to the search term from the data storage module;
The data visualization module is used to score and rank the target data against the search term and to present data correlations as a topology graph.
2. The discrete data acquisition and analysis system according to claim 1, characterized in that the data retrieval module is used to provide a corresponding word segmenter for each language.
3. The discrete data acquisition and analysis system according to claim 1, characterized in that the other business systems include traditional databases, the traditional databases including Oracle, MySQL and SQL Server databases.
4. A discrete data acquisition and analysis method, characterized in that it comprises the following steps:
S1, acquiring discrete data in real time and across multiple dimensions, the data including web page media text data, internet data crawled by web crawlers, Hadoop data, server operation log data, and data from other business systems that can be accessed;
S2, parsing and cleaning the collected data;
S3, storing the cleaned data by means of an inverted index, and establishing a mapping between the segmented words (tokens) in the data and the documents in which they occur;
S4, analyzing an input search term, extracting keywords, and fetching from the data storage module the data in the documents corresponding to the search term as target data;
S5, scoring and ranking the target data against the search term, and presenting data correlations as a topology graph.
5. The discrete data acquisition and analysis method according to claim 4, characterized in that, in step S4, a corresponding word segmenter is provided for each language.
6. The discrete data acquisition and analysis method according to claim 4, characterized in that the other business systems include traditional databases, the traditional databases including Oracle, MySQL and SQL Server databases.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811045808.6A CN109241432A (en) | 2018-09-07 | 2018-09-07 | Discrete data acquisition analysis system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109241432A true CN109241432A (en) | 2019-01-18 |
Family
ID=65067373
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811045808.6A Pending CN109241432A (en) | 2018-09-07 | 2018-09-07 | Discrete data acquisition analysis system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109241432A (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622443A (en) * | 2012-03-13 | 2012-08-01 | 北京邮电大学 | Customized screening system and method for microblog |
CN102915381A (en) * | 2012-11-20 | 2013-02-06 | 公安部第三研究所 | Multi-dimensional semantic based visualized network retrieval rendering system and rendering control method |
CN104731851A (en) * | 2014-12-16 | 2015-06-24 | 芜湖乐锐思信息咨询有限公司 | Big data analysis method based on topological network |
CN106354772A (en) * | 2016-08-23 | 2017-01-25 | 成都卡莱博尔信息技术股份有限公司 | Mass data system with data cleaning function |
CN106776719A (en) * | 2016-11-21 | 2017-05-31 | 北海高创电子信息孵化器有限公司 | A kind of on-line information consultant search method |
CN107038225A (en) * | 2017-03-31 | 2017-08-11 | 江苏飞搏软件股份有限公司 | The search method of information intelligent retrieval system |
CN107818130A (en) * | 2017-09-15 | 2018-03-20 | 深圳市电陶思创科技有限公司 | The method for building up and system of a kind of search engine |
CN107633075A (en) * | 2017-09-22 | 2018-01-26 | 吉林大学 | A kind of multi-source heterogeneous data fusion platform and fusion method |
CN108491438A (en) * | 2018-02-12 | 2018-09-04 | 陆夏根 | A kind of technology policy retrieval analysis method |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111625719A (en) * | 2020-05-21 | 2020-09-04 | 四川九八村信息科技有限公司 | Propaganda channel expanding system and method for plasma single-collection station |
CN111625719B (en) * | 2020-05-21 | 2023-06-13 | 四川九八村信息科技有限公司 | Propaganda channel expanding system and method for single plasma collecting station |
CN112434209A (en) * | 2020-12-07 | 2021-03-02 | 广东电网有限责任公司佛山供电局 | Multi-channel and rapid knowledge point collecting system |
CN113051234A (en) * | 2021-04-19 | 2021-06-29 | 国际关系学院 | Mobile on-site big data analysis platform |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190118 |