CN104679827A - Big data-based public information association method and mining engine - Google Patents

Big data-based public information association method and mining engine

Info

Publication number
CN104679827A
CN104679827A CN201510017418.8A CN201510017418A CN104679827A CN 104679827 A CN104679827 A CN 104679827A CN 201510017418 A CN201510017418 A CN 201510017418A CN 104679827 A CN104679827 A CN 104679827A
Authority
CN
China
Prior art keywords
public information
data
information
source
allowing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510017418.8A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Get Great Information Technology Co Ltd
Original Assignee
Beijing Get Great Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Get Great Information Technology Co Ltd
Priority to CN201510017418.8A
Publication of CN104679827A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a big data-based public information association method and mining engine. The method comprises the following steps: 1, collecting internet public information sources, gathering the data sources associated with massive public information by either direct collection or authenticated collection; 2, a multi-source matching system matches the information pattern corresponding to each data source; 3, a multi-format information extraction system extracts the specified data and elements according to the format of the information carrier; 4, a multi-dimensional association and integration analysis system integrates and analyzes the gathered data through operations such as deduplication, denoising, false-information removal, and clustering, according to the association algorithm of the public information model; 5, an expert correction system corrects the deep learning algorithms and the systems used in the preceding steps, based on the acquired indexes and a quality assessment model; 6, a visual display system presents the specified public information in an integrated, visual form ordered as a time series.

Description

A big data-based public information association method and mining engine
Technical field
The present invention relates to the technical field of big data-based public information association methods and mining engines, and specifically to an association analysis method applied to the full-lifecycle data of a specified legal-person entity over the course of its development, and to the implementation of the corresponding mining engine.
Background art
In the Internet era, data and information have become important enterprise resources, and valuable information must be extracted quickly from rapidly changing mass data. Because information on the internet is numerous, disordered, and scattered, general-purpose search engines have become an essential tool for obtaining information. A search engine actively gathers and automatically indexes information and provides a query service: when a user enters a keyword, it returns the web addresses that contain the keyword, with links to the information. Many search engine systems exist on the internet, but all have shortcomings in function and performance; in particular, queries for public information lack relevance and accuracy.
Hadoop is a distributed system architecture: a software platform that makes it easier to develop and run applications that process large-scale data.
NoSQL refers broadly to non-relational databases, which are characterized by easy scaling, support for large data volumes, high performance, flexible data models, and high availability.
A microblog (Weibo) is a platform for sharing, propagating, and obtaining information based on user relationships. It emphasizes timeliness and spontaneity, and microblogs are better able to express moment-to-moment thoughts and the latest developments.
The WeChat public platform is a new service platform that provides business services and user management capabilities to individuals, enterprises, and organizations.
Deeply mining the public information and the association relationships that circulate on platforms such as websites, microblogs, and WeChat, in order to understand the full-lifecycle data of a legal-person entity truthfully, comprehensively, and objectively, has become a practical need. At the same time, the growing maturity of the distributed storage, computing, NoSQL databases, data association analysis tools, and data mining algorithms provided by the big data ecosystem offers technical support for big data mining of public information. At present, however, there is no mature big data-based public information association method or mining engine.
Summary of the invention
To overcome the limitations and deficiencies of the existing technical solutions described above, the invention provides a big data-based public information association method and mining engine.
The technical solution adopted by the invention is implemented as follows; the specific steps are:
(1) Collect internet public information, using direct collection and authenticated collection to obtain the data sources of massive public information.
The engine collects all public information on the internet, covering commercial, proprietary, and public data sets. While observing each data set's original access rules, it uses the two modes of direct collection and authenticated collection to maximize the scope of public information obtained and the data sources behind it.
(2) The multi-source matching system matches the information pattern corresponding to each kind of information source (website, microblog, WeChat, mobile application). Different information sources have different data source models, and the information patterns of websites, microblogs, WeChat, and mobile application clients also differ, so a pattern matching system that adapts to multiple sources is developed.
(3) The multi-format information extraction system extracts the specified data and elements according to the format of the information carrier. The platform integrates multi-source data and places information sets with different information patterns in one unified quantitative analysis environment. By building multi-level models, simple extraction models become elements of complex models, forming a pipelined, modular information extraction flow.
Format modeling is the basis of data extraction. A format model is responsible for identifying and converting key information, and it also contains descriptive information about the source data. The objects it represents carry the social attribute information of legal-person entities: a model can represent an institution, a company, or a corporate legal person. Information about natural persons is outside the scope of this data.
(4) The multi-dimensional association and integration analysis system integrates and analyzes the gathered data through operations such as deduplication, denoising, false-information removal, and clustering, according to the association index of the public information model. It includes multiple sets of association analysis tools to meet the needs of multi-dimensional analysis and complex association.
The system performs deep learning operations on the data such as combination, aggregation, transformation, comparison, and clustering, covering categorical and related variables, time series, and data dimensions of all kinds. Numerous isolated data are gathered into a specific environment, valuable results are then inferred through time series and other in-depth analyses, and the analysis is performed in real time.
(5) The expert correction system corrects the deep learning algorithms based on the indexes obtained and the data quality model.
Rapid iteration combined with fine-tuning analysis continually increases the value of the data, so the whole system becomes smarter with each cycle.
(6) The visual display system presents the public information of the legal-person entity in an integrated, visual form ordered as a time series. The system unifies multi-source data into one multi-dimensional model for display, turns the abstract into the intuitive through rich visual forms, and gives users an overall view of the data associated with the entity of interest. The display is updated in real time along with the source data, so users can see the most current information at any time.
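As a rough illustration of steps (4) and (6), the following Python sketch deduplicates and denoises collected records, groups them by entity, and orders each group by date. The record keys (`entity`, `date`, `text`), the `min_length` cutoff, and grouping by an exact entity name (a simple stand-in for real clustering) are assumptions made for this example; the patent does not specify an implementation.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def integrate(records, min_length=10):
    """Deduplicate, denoise, and group records by entity, ordered by date.

    Each record is a dict with the hypothetical keys 'entity', 'date', 'text'.
    """
    seen = set()
    timelines = defaultdict(list)
    for rec in records:
        text = normalize(rec["text"])
        # Denoising (assumed rule): drop fragments too short to be useful.
        if len(text) < min_length:
            continue
        # Deduplication: same entity and same normalized text counts as a repeat.
        key = (rec["entity"], text)
        if key in seen:
            continue
        seen.add(key)
        # Grouping by entity name stands in for clustering.
        timelines[rec["entity"]].append(rec)
    # Time-series ordering, ready for a timeline display.
    return {
        entity: sorted(items, key=lambda r: r["date"])
        for entity, items in timelines.items()
    }
```

Calling `integrate` on records gathered from several sources yields one date-ordered list per entity, which a display layer could then render as a timeline.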
In addition, the system provides extensibility, customizability, and application programming interfaces to external parties, supports customized information flows from underlying data integration and user-defined models through to the user interface, and is designed as an open platform. Such customized information can be shared, linked, and recombined. It is not an unmodifiable product but a material that can be flexibly added to new workflows: it can be iterated on, and it can also be added to new analysis models as material.
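One way to picture the customizable information flow described above is a registry of named processing stages that callers can recombine in any order. The sketch below is only an assumption about how such an open platform might be organized; the `Pipeline` class, its method names, and the stage names are hypothetical and do not come from the patent.

```python
from typing import Callable, Dict, List

# A stage takes a list of records and returns a (possibly changed) list.
Stage = Callable[[List[dict]], List[dict]]


class Pipeline:
    """Registry of named stages that can be recombined into custom flows."""

    def __init__(self) -> None:
        self._stages: Dict[str, Stage] = {}

    def register(self, name: str, stage: Stage) -> None:
        """Add or replace a stage under the given name."""
        self._stages[name] = stage

    def run(self, order: List[str], records: List[dict]) -> List[dict]:
        """Apply the named stages in the requested order."""
        for name in order:
            records = self._stages[name](records)
        return records
```

Because stages are looked up by name at run time, a partner could register a new stage and place it in an existing flow without changing the other stages.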
Compared with the prior art, the invention has the following advantages:
The invention proposes a new public information association method and develops a new data mining engine. It traces the source characteristics of information, extracts and analyzes the carrier formats of information, and on this basis implements an association and integration analysis system for massive public information, consisting of an integration analysis module that orders information by time and an association dimension model based on the expert correction system. These two systems influence and complement each other, forming a complete data mining engine for public information.
The technical solution helps individuals, enterprises, and institutions conveniently and dynamically track the full-lifecycle data of a specified entity over the course of its development, providing complete and accurate data support for decision analysis and behavior prediction, so that the data ultimately delivers its greatest value.
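The feedback between expert correction and the association analysis can be illustrated with a deliberately small example. Here a plain threshold search stands in for the deep-learning correction that the text mentions: expert judgments on candidate record pairs are used to pick the similarity threshold with the best F1 score. The judgment fields (`score`, `expert_says_same`) and both function names are hypothetical.

```python
def f1_at(threshold, judgments):
    """F1 score of 'score >= threshold' predictions against expert labels."""
    tp = fp = fn = 0
    for j in judgments:
        predicted = j["score"] >= threshold
        if predicted and j["expert_says_same"]:
            tp += 1
        elif predicted:
            fp += 1
        elif j["expert_says_same"]:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def corrected_threshold(judgments, candidates):
    """Pick the candidate threshold that agrees best with the experts."""
    return max(candidates, key=lambda t: f1_at(t, judgments))
```

Re-running `corrected_threshold` whenever new expert judgments arrive gives the repeating correction cycle in miniature; a production system would presumably retrain a model rather than tune one number.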
Embodiment
The invention is further described below with reference to the accompanying drawing.
(1) According to the information model of the specified legal-person entity, determine where its public information is distributed on the internet. According to the nature of each information source (for example government websites, portal websites, professional media, and specialized agencies), determine whether direct collection or authenticated collection is the appropriate technical means, and which data elements can be collected.
The engine collects all public information on the internet, covering commercial, proprietary, and public data sets. While observing each data set's original access rules, it uses the two modes of direct collection and authenticated collection to maximize the scope of public information obtained and the data sources behind it.
(2) Different information sources (website, microblog, WeChat, mobile application) have different information patterns and different data source models, and the corresponding information templates differ greatly. Even an update to a website's structure requires a new template to be developed. The multi-source pattern matching system should therefore match templates automatically and also detect template matching anomalies promptly.
(3) The platform integrates multi-source data and places information sets with different information patterns in one unified qualitative and quantitative analysis environment. By building multi-level models, simple extraction models become elements of complex models, forming a pipelined, modular information extraction flow. According to the format of the information carrier, the multi-format information extraction system can recognize multiple file formats, including Word, Excel, WPS, and PDF, and accurately extracts the specified data and elements on the basis of Chinese word segmentation.
Format modeling is the basis of data extraction. A format model is responsible for identifying and converting key information, and it also contains descriptive information about the source data. The objects it represents carry the social attribute information of legal-person entities: a model can represent an institution, a company, or a corporate legal person, but never a natural person.
(4) The multi-dimensional association and integration analysis system develops multiple sets of data association analysis tools according to the association index of the independently developed information model, to meet the needs of multi-dimensional analysis and complex association. The system must not only complete tasks such as deduplication, denoising, false-information removal, and clustering, but also perform deep learning operations on the data such as combination, aggregation, transformation, comparison, and clustering, covering categorical and related variables, time series, and data dimensions of all kinds.
The system is required to gather numerous isolated data into a specific environment, infer valuable results through time series and other in-depth analyses, and perform the analysis in real time.
(5) To cope with the complexity and variability of information, the expert correction system becomes all the more important. Based on the indexes obtained and the data quality model, and taking industry-specific characteristics into account, it continually corrects the deep learning algorithms, and rapid iteration combined with local analysis continually increases the value of the data. The aim is for the whole system to become more intelligent; this is also a continually repeating cycle.
(6) The system unifies multi-source data into one multi-dimensional time-series model for display and turns the abstract into the intuitive through rich visual forms such as histograms, pie charts, and line charts, giving users an overall view of the data associated with the entity of interest. The display is updated along with the source data and shows the most current information in real time.
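Step (2) of this embodiment asks for automatic template matching together with prompt detection of matching anomalies, for example after a website redesign. A minimal sketch of that idea follows, assuming a template is just a set of regular expressions with one capture group each; the `Template` class, the `TemplateMismatch` exception, and the field names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict


@dataclass
class Template:
    source_type: str        # e.g. "website", "microblog", "wechat", "mobile_app"
    fields: Dict[str, str]  # field name -> regex with exactly one capture group


class TemplateMismatch(Exception):
    """Raised when content no longer fits its template."""


def apply_template(template: Template, raw: str) -> Dict[str, str]:
    """Extract every field the template defines, or report an anomaly."""
    extracted = {}
    missing = []
    for name, pattern in template.fields.items():
        match = re.search(pattern, raw)
        if match:
            extracted[name] = match.group(1).strip()
        else:
            missing.append(name)
    if missing:
        # Treat any missing field as a sign that the source layout changed.
        raise TemplateMismatch(f"{template.source_type}: missing fields {missing}")
    return extracted
```

Raising an exception rather than returning partial data makes a layout change visible immediately, so a new template can be written before bad records reach the analysis stage.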
In addition, the system provides extensibility, customizability, and application programming interfaces to external parties, supports customized information flows from underlying data integration and user-defined models through to the user interface, and is designed as an open platform. Such customized information can be shared, linked, and recombined. It is not an unmodifiable product but a material that can be flexibly added to new workflows: it can be iterated on, and it can also be added as material to new industry analysis models and supplied to partners with various professional data requirements.
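The format recognition in step (3), combined with the extensibility described above, could be organized as a registry that maps file extensions to extractor functions. The sketch below registers only a plain-text extractor; parsers for Word, Excel, WPS, and PDF would need third-party libraries and are deliberately left out. The `register` decorator, the `extract` function, and the `.txt` example are assumptions for illustration.

```python
from pathlib import Path
from typing import Callable, Dict

Extractor = Callable[[bytes], str]

# Hypothetical registry: lowercase file extension -> extractor function.
_EXTRACTORS: Dict[str, Extractor] = {}


def register(*extensions: str):
    """Decorator that registers an extractor for one or more extensions."""
    def decorator(func: Extractor) -> Extractor:
        for ext in extensions:
            _EXTRACTORS[ext.lower()] = func
        return func
    return decorator


@register(".txt")
def extract_plain_text(data: bytes) -> str:
    """Decode plain text, replacing undecodable bytes instead of failing."""
    return data.decode("utf-8", errors="replace")


def extract(filename: str, data: bytes) -> str:
    """Choose an extractor by file extension and return the document text."""
    ext = Path(filename).suffix.lower()
    try:
        extractor = _EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"no extractor registered for {ext!r}") from None
    return extractor(data)
```

The returned text would then be passed to a Chinese word segmenter before the specified elements are extracted; that step is outside this sketch.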
Description of the drawing
The accompanying drawing is a diagram of the public information association method and mining engine.

Claims (5)

1. A big data-based public information association method and mining engine, characterized in that the method comprises the following steps:
(1) collecting internet public information sources: collecting the data sources associated with massive public information, classified into direct collection and authenticated collection;
(2) multi-source matching system: matching the information pattern corresponding to each kind of data source (website, microblog, WeChat, mobile application);
(3) multi-format information extraction system: extracting the specified data and elements according to the format of the information carrier;
(4) multi-dimensional association and integration analysis system: integrating and analyzing the gathered data through operations such as deduplication, denoising, false-information removal, and clustering, according to the association algorithm of the public information model;
(5) expert correction system: correcting the deep learning algorithms and each system in the above steps, based on the indexes obtained and the quality evaluation model;
(6) visual display system: presenting the specified public information in an integrated, visual form ordered as a time series.
2. The big data-based public information association method and mining engine according to claim 1, characterized in that, in step (1), the public information sources comprise data domains that can be collected directly and data domains that require authenticated collection.
3. The big data-based public information association method and mining engine according to claim 1, characterized in that, in step (3), multiple file formats including Word, Excel, WPS, and PDF must be recognized, and the required data accurately extracted on the basis of Chinese word segmentation.
4. The big data-based public information association method and mining engine according to claim 1, characterized in that, in step (4), according to the association algorithm of the public information model, through operations such as deduplication, denoising, false-information removal, and clustering, the gathered data are integrated and new information sources are discovered.
5. The big data-based public information association method and mining engine according to claim 1, characterized in that, in step (5), the expert correction system adopts a mode that combines a non-relational database with machine learning.
CN201510017418.8A 2015-01-14 2015-01-14 Big data-based public information association method and mining engine Pending CN104679827A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510017418.8A CN104679827A (en) 2015-01-14 2015-01-14 Big data-based public information association method and mining engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510017418.8A CN104679827A (en) 2015-01-14 2015-01-14 Big data-based public information association method and mining engine

Publications (1)

Publication Number Publication Date
CN104679827A true CN104679827A (en) 2015-06-03

Family

ID=53314869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510017418.8A Pending CN104679827A (en) 2015-01-14 2015-01-14 Big data-based public information association method and mining engine

Country Status (1)

Country Link
CN (1) CN104679827A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677768A (en) * 2015-12-30 2016-06-15 芜湖乐锐思信息咨询有限公司 Networked classification analysis system based on complex products
CN106203676A (en) * 2016-06-27 2016-12-07 浪潮(北京)电子信息产业有限公司 A kind of Work Flow Optimizing method based on cloud computing framework
CN106227896A (en) * 2016-08-28 2016-12-14 杭州合众数据技术有限公司 A kind of big data visualization fractional analysis method
CN106453554A (en) * 2016-10-11 2017-02-22 上海携程商务有限公司 Monitoring system and monitoring method for application dependency in distributed information system
CN106649298A (en) * 2015-07-22 2017-05-10 中国科学院微电子研究所 Method and system for carrying out interdisciplinary association establishment on the basis of the Internet of Things
WO2017092696A1 (en) * 2015-12-02 2017-06-08 中国银联股份有限公司 Method for safe integration of big data without leaking privacy
CN107093019A (en) * 2017-04-21 2017-08-25 北京恒冠网络数据处理有限公司 A kind of big data analysis system for macro adjustments and controls
CN107391686A (en) * 2017-07-24 2017-11-24 威创软件南京有限公司 A kind of visual configuration data collecting system implementation method
CN108763565A (en) * 2018-06-04 2018-11-06 广东京信软件科技有限公司 A kind of matched construction method of data auto-associating based on deep learning
CN110008251A (en) * 2019-03-07 2019-07-12 平安科技(深圳)有限公司 Data processing method, device and computer equipment based on time series data
CN110874356A (en) * 2020-01-19 2020-03-10 南京创维信息技术研究院有限公司 Cloud big data system and construction method thereof
CN111625537A (en) * 2020-04-24 2020-09-04 山东电子职业技术学院 Multidimensional data analysis system and multidimensional data analysis method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009003281A1 (en) * 2007-07-03 2009-01-08 Tlg Partnership System, method, and data structure for providing access to interrelated sources of information
CN102523246A (en) * 2011-11-23 2012-06-27 陈刚 Cloud computation treating system and method
CN104123317A (en) * 2013-04-28 2014-10-29 成都勤智数码科技股份有限公司 Service organization assessing and analyzing method based on knowledge base
CN104123323A (en) * 2013-04-28 2014-10-29 成都勤智数码科技股份有限公司 Method for collecting and recognizing service activities based on knowledge base

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649298A (en) * 2015-07-22 2017-05-10 中国科学院微电子研究所 Method and system for carrying out interdisciplinary association establishment on the basis of the Internet of Things
CN106649298B (en) * 2015-07-22 2021-01-22 中国科学院微电子研究所 Cross-domain association establishment method and system based on Internet of things
WO2017092696A1 (en) * 2015-12-02 2017-06-08 中国银联股份有限公司 Method for safe integration of big data without leaking privacy
CN105677768A (en) * 2015-12-30 2016-06-15 芜湖乐锐思信息咨询有限公司 Networked classification analysis system based on complex products
CN106203676A (en) * 2016-06-27 2016-12-07 浪潮(北京)电子信息产业有限公司 A kind of Work Flow Optimizing method based on cloud computing framework
CN106227896A (en) * 2016-08-28 2016-12-14 杭州合众数据技术有限公司 A kind of big data visualization fractional analysis method
CN106453554B (en) * 2016-10-11 2019-11-19 上海携程商务有限公司 The monitoring system and monitoring method of dependence are applied in distributed information system
CN106453554A (en) * 2016-10-11 2017-02-22 上海携程商务有限公司 Monitoring system and monitoring method for application dependency in distributed information system
CN107093019A (en) * 2017-04-21 2017-08-25 北京恒冠网络数据处理有限公司 A kind of big data analysis system for macro adjustments and controls
CN107391686A (en) * 2017-07-24 2017-11-24 威创软件南京有限公司 A kind of visual configuration data collecting system implementation method
CN108763565A (en) * 2018-06-04 2018-11-06 广东京信软件科技有限公司 A kind of matched construction method of data auto-associating based on deep learning
CN110008251A (en) * 2019-03-07 2019-07-12 平安科技(深圳)有限公司 Data processing method, device and computer equipment based on time series data
CN110008251B (en) * 2019-03-07 2023-07-04 平安科技(深圳)有限公司 Data processing method and device based on time sequence data and computer equipment
CN110874356A (en) * 2020-01-19 2020-03-10 南京创维信息技术研究院有限公司 Cloud big data system and construction method thereof
CN111625537A (en) * 2020-04-24 2020-09-04 山东电子职业技术学院 Multidimensional data analysis system and multidimensional data analysis method


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150603