CN101141456A - Vertical search based network data excavation method - Google Patents

Publication number
CN101141456A
Authority
CN
China
Prior art keywords
data
information
vertical search
network
excavation method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2007101329463A
Other languages
Chinese (zh)
Inventor
曹杰
章舜仲
刘军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Finance and Economics
Original Assignee
Nanjing University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Finance and Economics
Priority to CNA2007101329463A
Publication of CN101141456A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a network data mining method based on vertical search. First, data is collected from the network by vertical search. The collected information is preprocessed, and the structured data that remains after data cleaning is saved in a database. The data in the database is analyzed to find rules and build a model, and the feature vectors of the collected data are matched against the feature vector of a target sample to obtain the degree of association of the collected information. Unknown data is then predicted, the prediction is evaluated against the actual result, the original model parameters are revised accordingly, and authoritative information is supplied to the user. Because relevant information is obtained through vertical search, the method retrieves specialized information effectively, yields little duplicate or junk information, and can satisfy users' specialized queries.

Description

Network data mining method based on vertical search
Technical field
The present invention relates to a method for collecting network data, specifically a network data mining method based on vertical search.
Background art
With the rapid development of network communication technology, the World Wide Web has become a huge distributed information space containing knowledge of potential value. Network data holds much useful, latent knowledge and many patterns that are not easily discovered, and people urgently need methods and tools that can obtain them. A search engine collects and discovers information on the Internet, then understands, extracts, organizes, and processes it to provide a retrieval service for users. Network data mining, by contrast, uses data mining techniques to extract useful patterns and implicit information from network data. Search engines therefore prepare the data for network data mining, and Web mining is an advanced application of search engines.
However, most data on the network exists in the form of web pages, and most web pages are unstructured text, whereas data mining generally has to be carried out on structured data. The key problem of network data mining is therefore obtaining structured data from the network.
Current search engines use general search, also called horizontal search. Although general search covers a very wide range of information, it is difficult to use it to find directly the information the user wants. The main problems are as follows:
(1) Disordered information leaves little effective information. Because most network data resides in unstructured web page text, it is hard to organize effectively, and it contains large amounts of duplicate and junk information. This creates heavy noise, so that querying for information is like looking for a needle in a haystack.
(2) Information search lacks division by specialty, by industry boundary, and by user scope and level, so it cannot satisfy users' customized, specialized queries.
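As a rough illustration of what "obtaining structured data" from a web page can mean, the sketch below pulls named fields out of an HTML fragment using Python's standard `html.parser`. The page markup, the field names (`title`, `price`), and the convention that a field is marked by its `class` attribute are all assumptions made for this example; the patent does not specify an extraction technique.

```python
from html.parser import HTMLParser

class RecordExtractor(HTMLParser):
    """Collect the text of elements whose class attribute names a wanted field."""

    def __init__(self, fields):
        super().__init__()
        self.fields = set(fields)   # hypothetical field names, e.g. {"title", "price"}
        self.record = {}
        self._current = None        # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.fields:
            self._current = cls

    def handle_data(self, data):
        # Only text inside a wanted element is kept; everything else is ignored.
        if self._current and data.strip():
            self.record[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None

def extract_record(html, fields):
    parser = RecordExtractor(fields)
    parser.feed(html)
    return parser.record

# Hypothetical page fragment: two wanted fields plus unrelated text.
page = '<div><h1 class="title">Steel pipe</h1><span class="price">4200</span><p>ad text</p></div>'
record = extract_record(page, ["title", "price"])
print(record)
```

The unstructured fragment becomes a flat record with one entry per field, which is the kind of unit a database or a mining step can work with directly.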
Summary of the invention
To overcome the problems of the prior art, the object of the invention is to provide a network data mining method based on vertical search that can directly find the information the user wants.
The object of the invention is achieved through the following technical solution:
A network data mining method based on vertical search, characterized in that it comprises the following steps:
(1) Data sampling based on vertical search, that is, collecting data from the network. Vertical search is a specialized search tool created for querying information about a particular industry or topic; it relies mainly on distinguishing the topics of web pages on the Internet and on a Web spider crawling program. A vertical search engine classifies more finely, covers its data more deeply and comprehensively, and keeps its content more up to date. It integrates a particular class of specialized information in the web page library, extracts the required data in a targeted way, processes it, and returns it to the user in some form. More importantly, vertical search uses structured data as its search unit, unlike the web page blocks of a general search engine, so the data collected by vertical search is structured. This not only gives users a more targeted information retrieval service but also allows further data mining on that basis.
(2) Information preprocessing: the network data is preprocessed. For web page text, keywords need to be extracted and an index file built. Unstructured data in the web pages is converted into structured data; the knowledge representation is simplified, and the data is organized so that correct, reliable data is extracted in a uniform data format. This includes handling incomplete or noisy data, converting data formats, and cleaning inconsistent data.
(3) Data storage: the structured data that remains after data cleaning is saved, generally in a database, or further built into a data warehouse together with a knowledge base, a model base, and a data management system. Because storage and maintenance only have to cover web pages relevant to a given topic, collection and mining in that field can go deeper, which ensures the breadth and completeness of the data sources for data mining and yields a data collection that better supports decision analysis for enterprises or organizations.
(4) Data modeling: the data in the database is analyzed to find rules and build a model. This is typically the process of discovering useful feature information in the preprocessed structured data: the feature vector of the target is mined and the corresponding weights are calculated, and the feature vectors of the collected data are matched against the feature vector of the target sample to obtain the degree of association of the collected information.
(5) Prediction, evaluation, and revision: unknown data is predicted, the prediction is evaluated by comparison with the actual result, the original model parameters are revised accordingly, and the relevant authoritative information that has been mined is provided to the user.
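The feature-vector matching in step (4) can be sketched as follows. The text does not say how weights are computed or how vectors are compared, so this example assumes relative term frequency as the weight and cosine similarity as the degree of association; the function names and the sample tokens are illustrative only.

```python
import math
from collections import Counter

def feature_vector(tokens):
    """Weight each term by its relative frequency in the token list (an assumed weighting)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def association_degree(vec_a, vec_b):
    """Cosine similarity of two sparse vectors; lies in [0, 1] for non-negative weights."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical target sample and two collected documents, already tokenized.
target = feature_vector(["steel", "price", "forecast"])
collected = {
    "doc1": feature_vector(["steel", "price", "steel", "market"]),
    "doc2": feature_vector(["football", "match", "result"]),
}

# Rank collected documents by their degree of association with the target.
ranked = sorted(collected, key=lambda d: association_degree(collected[d], target), reverse=True)
print(ranked)
```

A document that shares no terms with the target scores zero, so off-topic material falls to the bottom of the ranking. Other weightings (for example TF-IDF) could be substituted without changing the matching step.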
In the present invention, data mining has two key features: modeling and prediction. Modeling means finding rules by analyzing the data in the database and building a predictive model. Prediction means using the model so built to predict unknown things, providing decision support for enterprises and departments.
The invention obtains relevant information by a network data mining method that uses vertical search. A vertical search engine classifies more finely, covers its data more deeply and comprehensively, and keeps its content more up to date. It integrates a particular class of specialized information in the web page library, extracts the required data in a targeted way, processes it, and returns it to the user in some form. Because vertical search uses structured data as its search unit, unlike the web page blocks of a general search engine, the data collected by vertical search is structured. It can therefore give users a more targeted information retrieval service, and further data mining can be carried out on that basis. Compared with the prior art, the invention obtains relevant specialized information effectively, with little duplicate or junk information, and can satisfy users' specialized queries.
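The predict, evaluate, and revise cycle can be illustrated with a deliberately small model. The sketch below assumes a one-feature linear model whose parameters are nudged by the prediction error after each comparison with an actual result; the model form, the learning rate, and the toy data are assumptions for illustration, since the text does not name a model or an update rule.

```python
def predict(params, x):
    """Predict an unknown value with the current model parameters (w, b)."""
    w, b = params
    return w * x + b

def revise(params, x, actual, rate=0.1):
    """Compare the prediction with the actual result and adjust the parameters."""
    w, b = params
    error = predict(params, x) - actual          # evaluation against the actual result
    return (w - rate * error * x, b - rate * error)  # revision of the original parameters

params = (0.0, 0.0)
# Toy observations following y = 2x, repeated so the revisions can settle.
observations = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 500
for x, actual in observations:
    params = revise(params, x, actual)

print(predict(params, 4.0))
```

Each pass through the loop is one round of prediction, evaluation, and revision; over many rounds the prediction error on the toy data shrinks and the model's prediction for an unseen input approaches the underlying trend.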
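A topic-focused crawl of the kind a vertical search spider performs might look like the sketch below. To keep it runnable without network access it walks a hypothetical in-memory "web" (`TOY_WEB`); the topic test (at least one topic term appears in the page text) and the policy of not following links from off-topic pages are assumptions, not details taken from the patent.

```python
from collections import deque

# Hypothetical in-memory "web": url -> (page text, outgoing links).
TOY_WEB = {
    "a": ("steel price news", ["b", "c"]),
    "b": ("steel market report", ["d"]),
    "c": ("celebrity gossip", ["e"]),
    "d": ("steel export data", []),
    "e": ("steel rumours", []),
}

def is_on_topic(text, topic_terms):
    """Crude topic test: the page mentions at least one topic term."""
    return any(term in text.split() for term in topic_terms)

def focused_crawl(seed, topic_terms, web=TOY_WEB):
    """Breadth-first crawl that keeps, and expands, only on-topic pages."""
    seen, kept = {seed}, []
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = web[url]
        if not is_on_topic(text, topic_terms):
            continue  # off-topic pages are dropped and their links are not followed
        kept.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept

pages = focused_crawl("a", {"steel"})
print(pages)
```

Page "e" is on topic but is never reached, because it is linked only from the off-topic page "c". That trade-off (less noise at the cost of some missed pages) is a design choice of this sketch; a real crawler could score links instead of discarding them outright.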
Description of drawings
The accompanying drawing is a schematic diagram of the structure of network data mining using vertical search.
Embodiment
The invention is further described below with reference to the drawing and an embodiment.
When a user needs to look up specific specialized information quickly and conveniently, the network data mining method based on vertical search of the present invention is used, comprising the following steps:
(1) Data sampling based on vertical search, that is, collecting data from the network. Vertical search is a specialized search tool created for querying information about a particular industry or topic; it relies mainly on distinguishing the topics of web pages on the Internet and on a Web spider crawling program. A vertical search engine classifies more finely, covers its data more deeply and comprehensively, and keeps its content more up to date. It integrates a particular class of specialized information in the web page library, extracts the required data in a targeted way, processes it, and returns it to the user in some form. More importantly, vertical search uses structured data as its search unit, unlike the web page blocks of a general search engine, so the data collected by vertical search is structured. This not only gives users a more targeted information retrieval service but also allows further data mining on that basis.
(2) Information preprocessing: the network data is preprocessed. For web page text, keywords need to be extracted and an index file built. Unstructured data in the web pages is converted into structured data; the knowledge representation is simplified, and the data is organized so that correct, reliable data is extracted in a uniform data format. This includes handling incomplete or noisy data, converting data formats, and cleaning inconsistent data.
(3) Data storage: the structured data that remains after data cleaning is saved, generally in a database, or further built into a data warehouse together with a knowledge base, a model base, and a data management system. Because storage and maintenance only have to cover web pages relevant to a given topic, collection and mining in that field can go deeper, which ensures the breadth and completeness of the data sources for data mining and yields a data collection that better supports decision analysis for enterprises or organizations.
(4) Data modeling: modeling means finding rules by analyzing the data in the database and building a predictive model. This is typically the process of discovering useful feature information in the preprocessed structured data: the feature vector of the target is mined and the corresponding weights are calculated, and the feature vectors of the collected data are matched against the feature vector of the target sample to obtain the degree of association of the collected information.
(5) Prediction, evaluation, and revision: unknown data is predicted, the prediction is evaluated by comparison with the actual result, the original model parameters are revised accordingly, and the relevant authoritative information that has been mined is provided to the user.
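Steps (2) and (3) of the embodiment, cleaning the collected records and saving the structured result in a database, can be sketched together. The example assumes records with hypothetical `title` and `price` fields and uses an in-memory SQLite database from Python's standard library; the table layout and the cleaning rules are illustrative choices, not requirements of the method.

```python
import sqlite3

def clean(records):
    """Drop incomplete rows, convert to a uniform format, and remove duplicates."""
    seen, out = set(), []
    for rec in records:
        title = (rec.get("title") or "").strip().lower()
        price = rec.get("price")
        if not title or price is None:
            continue                      # incomplete data
        try:
            price = float(str(price).replace(",", ""))
        except ValueError:
            continue                      # noisy value that cannot be parsed
        if title in seen:
            continue                      # duplicate information
        seen.add(title)
        out.append((title, price))
    return out

# Hypothetical raw records as they might come out of the collection step.
raw = [
    {"title": " Steel Pipe ", "price": "4,200"},
    {"title": "steel pipe", "price": "4200"},     # duplicate after normalization
    {"title": "Copper wire", "price": "n/a"},     # noise
    {"title": "", "price": "10"},                 # incomplete
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (title TEXT PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO items VALUES (?, ?)", clean(raw))
conn.commit()
rows = conn.execute("SELECT title, price FROM items").fetchall()
print(rows)
```

Only one of the four raw records survives cleaning, which mirrors the claim that the stored data contains little duplicate or junk information. A persistent database file or a data warehouse could replace the in-memory connection without changing the cleaning logic.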

Claims (4)

1. A network data mining method based on vertical search, characterized in that it comprises the following steps:
(1) collecting data from the network by means of vertical search; wherein vertical search is a specialized search tool created for querying information about a particular industry or topic, which uses structured data as its search unit, extracts the required data in a targeted way, processes it, and returns it to the user in some form;
(2) preprocessing the information obtained: converting unstructured data in web pages into structured data, simplifying the knowledge representation, and organizing the data so that correct, reliable data is extracted in a uniform data format;
(3) data storage: saving the structured data that remains after data cleaning in a database;
(4) data modeling: analyzing the data in the database to find rules and build a model, and matching the feature vectors of the collected data against the feature vector of a target sample to obtain the degree of association of the collected information;
(5) prediction, evaluation, and revision: predicting unknown data, evaluating the prediction by comparison with the actual result, revising the original model parameters accordingly, and providing authoritative information to the user.
2. The network data mining method based on vertical search according to claim 1, characterized in that in step (1), the vertical search relies on distinguishing the topics of web pages on the Internet and on a Web spider crawling program.
3. The network data mining method based on vertical search according to claim 1, characterized in that in step (4), the data modeling means finding rules by analyzing the data in the database and building a predictive model.
4. The network data mining method based on vertical search according to claim 1, characterized in that in step (5), the prediction means using the model so built to predict unknown things, providing decision support for enterprises and departments.
CNA2007101329463A 2007-10-09 2007-10-09 Vertical search based network data excavation method Pending CN101141456A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2007101329463A CN101141456A (en) 2007-10-09 2007-10-09 Vertical search based network data excavation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2007101329463A CN101141456A (en) 2007-10-09 2007-10-09 Vertical search based network data excavation method

Publications (1)

Publication Number Publication Date
CN101141456A true CN101141456A (en) 2008-03-12

Family

ID=39193200

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2007101329463A Pending CN101141456A (en) 2007-10-09 2007-10-09 Vertical search based network data excavation method

Country Status (1)

Country Link
CN (1) CN101141456A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101777147A (en) * 2008-11-05 2010-07-14 埃森哲环球服务有限公司 Forecast modeling
CN101777147B (en) * 2008-11-05 2014-07-30 埃森哲环球服务有限公司 Predictive modeling
CN101876985B (en) * 2009-11-26 2012-08-29 西北工业大学 WEB text sentiment theme recognizing method based on mixed model
CN102200979A (en) * 2010-03-26 2011-09-28 上海市浦东科技信息中心 Distributed parallel information retrieval system and distributed parallel information retrieval method
CN102262648A (en) * 2010-05-31 2011-11-30 索尼公司 Evaluation predicting device, evaluation predicting method, and program
CN104641314A (en) * 2012-03-22 2015-05-20 帝威克有限公司 Computerized internet search system and method
CN104063411A (en) * 2013-09-12 2014-09-24 江苏金鸽网络科技有限公司 Enterprise intelligence gathering method based on Michael Porter's Five Forces Model
CN104063411B (en) * 2013-09-12 2016-05-25 江苏金鸽网络科技有限公司 Based on the corporate information collection method of baud five power models
CN103617279A (en) * 2013-12-09 2014-03-05 南京邮电大学 Method for achieving microblog information spreading influence assessment model on basis of Pagerank method
CN103678701A (en) * 2013-12-31 2014-03-26 福建四创软件有限公司 Disaster prevention and reduction information processing system and method based on WebService
CN104123659A (en) * 2014-07-30 2014-10-29 杭州野工科技有限公司 Commodity networked gene based brand intellectual property protection platform
CN105893360A (en) * 2014-09-24 2016-08-24 唐锐 A talented person screening method evaluating user generated data based on a search technique
CN104616180A (en) * 2015-03-09 2015-05-13 浪潮集团有限公司 Method for predicting hot sellers
CN104657515A (en) * 2015-03-24 2015-05-27 深圳中兴网信科技有限公司 Data real-time analytical method and system
CN106802890A (en) * 2015-11-25 2017-06-06 富士通株式会社 Information processor and method and Information locating device
CN106776710A (en) * 2016-11-18 2017-05-31 广东技术师范学院 A kind of picture and text construction of knowledge base method based on vertical search engine
CN106779827A (en) * 2016-12-02 2017-05-31 上海晶樵网络信息技术有限公司 A kind of Internet user's behavior collection and the big data method of analysis detection
CN107798124A (en) * 2017-11-10 2018-03-13 深圳市华讯方舟软件信息有限公司 Search system and method based on prediction modeling technique
CN113239140A (en) * 2021-04-30 2021-08-10 西南电子技术研究所(中国电子科技集团公司第十研究所) Complex information analysis engine architecture
CN115422909A (en) * 2022-08-25 2022-12-02 杭州有才信息技术有限公司 Background investigation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN101141456A (en) Vertical search based network data excavation method
CN102663000B (en) The maliciously recognition methods of the method for building up of network address database, maliciously network address and device
Song et al. Industrial symbiosis: Exploring big-data approach for waste stream discovery
CN104063411B (en) Based on the corporate information collection method of baud five power models
CN102355488A (en) Crawler seed obtaining method and equipment and crawler crawling method and equipment
CN108052632A (en) A kind of method for obtaining network information, system and company information search system
CN107104951B (en) Method and device for detecting network attack source
CN102567494A (en) Website classification method and device
CN106649461A (en) Method for automatically cleaning and maintaining elastic search log index file
Dong Exploration on web usage mining and its application
Ahamed et al. An Efficient Mechanism for Deep Web Data Extraction Based on Tree‐Structured Web Pattern Matching
CN111538839A Real-time text clustering method based on Jaccard distance
Shanthi Survey on web usage mining using association rule mining
Tamilselvi et al. Handling high web access utility mining using intelligent hybrid hill climbing algorithm based tree construction
Alim et al. Sampling-based estimation method for parameter estimation in big data business era
CN101610284B (en) Service parameter relational matching method and system based on calling data
CN109190010B (en) Internet data acquisition system based on user-defined keyword acquisition mode
CN107391695A (en) A kind of information extracting method based on big data
KR20170059758A (en) Method and System for Managing Business Information using Security Gateway
CN113297447A (en) Keyword-based related intellectual property information capturing, mining and visual analysis system and method
KR20180057470A (en) System and Method for Analyzing Social Problem Using Data Mining
Arnoux et al. Automatic clustering for the web usage mining
Marijić et al. Analysis of the Impact of Smartphone on the Environment Using the LCA Method
CN117172322B (en) Method for establishing digital rural knowledge graph
Bayoude et al. A Predictive Approach Based on Feature Selection to Improve Email Marketing Campaign Success Rate

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20080312