CN102207867B - Customizable intelligent vertical search engine system based on .NET - Google Patents

Customizable intelligent vertical search engine system based on .NET

Info

Publication number
CN102207867B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110145461.4A
Other languages
Chinese (zh)
Other versions
CN102207867A (en)
Inventor
郝矿荣
黄军君
丁永生
郭崇滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University
Priority to CN201110145461.4A
Publication of CN102207867A
Application granted
Publication of CN102207867B
Expired - Fee Related
Anticipated expiration

Abstract

The invention relates to a customizable intelligent vertical search engine system based on .NET. The system comprises an application layer, a network layer, a kernel layer and a control layer. The network layer exchanges data with the application layer and the kernel layer respectively through a unified WebService interface; the control layer controls the network layer and the kernel layer; the application layer, the network layer, the kernel layer and the control layer each have a corresponding class library, the class libraries comprising ControlDLL, WebFetchDLL, WebParseDLL and WebServiceDLL; and the engine system also includes a web page analysis tool. Serving personal clients, search homepages, enterprise clients and business websites, the system allows customization not only of the search domain and scope but also of the data return format.

Description

A customizable intelligent vertical search engine system based on .NET
Technical field
The invention belongs to the technical field of systems and vertical search, and in particular relates to a customizable intelligent vertical search engine system based on .NET.
Background art
Vertical search refers to specialized search engines that serve a particular industry. It is a subdivision and extension of general search: a certain class of specialized information in the web page library is integrated, the required data are extracted field by field and processed, and the results are returned to the user in a more intuitive form. Its hallmarks are being "specialized, refined and deep", with a distinct industry flavor. Compared with the massive, disordered information of a general-purpose search engine, a vertical search engine is more focused, more specific and more in-depth.
A search engine must handle a huge number of web pages whose data structures are complex and variable, so a practical search engine needs strong adaptability. In vertical search engine design, the relevant papers and patents currently either offer only a general idea or give a specific method for certain specific websites. In practice these methods prove very limited, their core technology is not truly disclosed, and they lack intelligence and generality. Vertical search systems already on the market, such as Qunar and ChinaHR, keep their technology closed for competitive reasons, and obtaining the relevant technology from them would likely be very costly. Moreover, existing vertical search systems are constrained by their structure and extend poorly: users cannot customize the search domain and scope. Qunar, for example, is a vertical search engine, but it can only search the travel domain; a user who wants to search another domain must switch to another search engine.
Some small and medium-sized websites may need to offer a specific search service but, limited by technology or cost, cannot develop a system themselves. A vertical search engine is therefore needed that provides them with an interface, lets them customize the search domain and scope, and returns query results to a predetermined page, significantly reducing cost. In addition, such an interface can periodically collect large volumes of data in a given domain from the Internet, greatly improving an enterprise's efficiency. The Chinese patent "Self-help intelligent vertical search method" (CN101114294), published on January 30, 2008, focuses on learning user preferences; users cannot customize the search topic and scope, and it targets only individual users, offering no interface support for enterprises or business websites. The Chinese patent "Method for supporting commercial activity with vertical search embedded in instant network calls" (CN101119328), published on February 6, 2008, is characterized by real-time communication through voice communication devices and application programs, but it cannot meet the differing search needs of different users: the search scope is fixed once the system is built, and users cannot change it.
Summary of the invention
The technical problem to be solved by the invention is to provide a customizable intelligent vertical search engine system based on .NET, so as to solve the problem that the search domain and scope cannot be customized.
The technical solution adopted by the invention is a customizable intelligent vertical search engine system based on .NET, comprising an application layer, a network layer, a kernel layer and a control layer. The network layer exchanges data with the application layer and the kernel layer respectively through a unified WebService interface; the control layer controls the network layer and the kernel layer; the application layer, network layer, kernel layer and control layer each have a corresponding class library; the class libraries comprise ControlDLL, WebFetchDLL, WebParseDLL and WebServiceDLL; and the engine system further comprises a web page analysis tool.
The kernel layer comprises: a parameter receiving module, which receives from the WebService interface the parameters submitted by the application layer; a web spider module, which parses the user's initial URL table according to the submitted parameters; a page acquisition module, which, after the URL table has been parsed, parses the pages and extracts data; a word segmentation and indexing module, which processes, deduplicates and classifies the structured data extracted from web pages according to the domain knowledge of each industry; and a result return module, which returns the processed and classified query results in the customized format.
The control layer comprises: a script control module, which executes scripts inside and outside the page and increases the interactivity of the program, so that execution can be controlled through external scripts without changing the program code; a spider control module, which controls the working mode of the program's web spider module; and a thread control module, which handles thread management for the multithreaded program.
The ControlDLL comprises ScriptControlLib, ThreadControlLib and SpiderControlLib.
The WebFetchDLL comprises AutoLoginLib, AutoRegisterLib, WebInfoLib and ImageProcessLib.
The WebParseDLL comprises XmlLib, RegexLIb, TagLib and DomLib.
The WebServiceDLL comprises ServiceLib and InterfaceLib.
The web page analysis tool provides three views: browser, HTML and XML.
Beneficial effects
The invention relates to a customizable intelligent vertical search engine system based on .NET that serves personal clients, search homepages, enterprise clients and business websites. Through the unified WebService interface and two development kits, namely the class libraries and the web page analysis software, users can customize not only the search domain and scope but also the data return format. Individual users can easily search for information in any domain through a desktop client. Enterprise users can build their own applications on the provided interface, periodically crawl data in their customized domain and import it into their databases. Business websites can use the interface to display query results for their customized domain in real time on their own sites, and can also act as secondary agents providing the service to other websites. With the provided interface, a search homepage like those of Baidu or Google can be built easily, without purchasing a large number of servers.
Brief description of the drawings
Fig. 1 is a structural diagram of the system.
Fig. 2 shows the system after completion.
Fig. 3 is a structural diagram of the class libraries.
Fig. 4 shows the analysis software.
Fig. 5 is a workflow diagram of the engine.
Fig. 6 is a flowchart of the web spider module.
Fig. 7 shows an application case.
Embodiments
The invention is further described below with reference to specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope. It should also be understood that, after reading the teachings of the invention, those skilled in the art may make various changes or modifications to it, and such equivalents likewise fall within the scope defined by the appended claims of this application.
Embodiment 1
As shown in Fig. 1 and Fig. 2, the invention is an intelligent vertical search engine system serving personal clients, search homepages, enterprise clients and business websites. Its most prominent feature is user customizability: not only the search domain and scope but also the data return format can be customized. The system provides its service externally through a unified WebService interface.
As shown in Fig. 3, Fig. 4, Fig. 5 and Fig. 6, the invention provides a customizable intelligent vertical search engine system based on .NET, comprising an application layer, a network layer, a kernel layer and a control layer. The network layer exchanges data with the application layer and the kernel layer respectively through a unified WebService interface; the control layer controls the network layer and the kernel layer; the application layer, network layer, kernel layer and control layer each have a corresponding class library; the class libraries comprise ControlDLL, WebFetchDLL, WebParseDLL and WebServiceDLL; and the engine system further comprises a web page analysis tool.
Through the unified WebService interface provided by the network layer, the application layer submits data to the kernel layer. At this layer, personal clients, search homepages, enterprise clients and business websites can customize the intelligent vertical search engine system. In the initialization phase, the user can customize not only the search domain and scope but also the data return format. Because the system provides its service through a unified WebService interface, users can easily implement vertical search in their own websites through the interface, or obtain data in a given domain from the Internet through a client. During customization, advanced enterprise users can analyze the websites they want to customize with the provided web page analysis tool, so that the required data can be filtered out more accurately.
In the network layer, a Web service (Web Service) is a service based on XML and HTTPS. Its communication protocol is mainly based on SOAP, the service is described with WSDL, and service metadata are discovered and obtained through UDDI. The WebService interface exposes to the outside world an API that can be called over the Web, so users can invoke the system's functions programmatically. To obtain query information, a client only needs to send an HTTP GET request and receives the corresponding result from the interface. The operations for building a WebService interface are encapsulated in the WebServiceDLL class library; adding a reference to it in a program is enough to call the relevant methods and quickly build a WebService interface.
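As a rough illustration of the HTTP GET call described above, the following Python sketch builds a query URL and reads its parameters back. The endpoint `BASE_URL` and the parameter names `keyword`, `domain` and `format` are assumptions made for this example; the patent does not specify the interface's actual address or parameters, and no network request is made.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint and parameter names; the patent does not specify them.
BASE_URL = "https://engine.example.com/SearchService.asmx/Query"


def build_query_url(keyword, domain, result_format="xml"):
    """Build the GET URL a client would send to the search interface."""
    params = {"keyword": keyword, "domain": domain, "format": result_format}
    return BASE_URL + "?" + urlencode(params)


def read_query_params(url):
    """Recover the parameters on the receiving side (first value of each key)."""
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}


url = build_query_url("Shanghai Beijing", "flights", "json")
params = read_query_params(url)
print(url)
print(params)
```

A real client would pass the resulting URL to an HTTP library and parse the response in whatever return format was customized.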
The kernel layer is the core of the whole system; it reads and processes the data submitted by the network layer. First, according to the parameters entered by the user in the parameter receiving module and the customized content, the web spider module starts working; its workflow is shown in Fig. 6. The web spider module parses the user's initial URL table. When crawling web pages it has two strategies: breadth-first and depth-first. Breadth-first means that the web spider first crawls all pages linked from the start page, then selects one of those linked pages and crawls all pages linked from it. This is the most common mode, because it allows the web spider to process pages in parallel and so improves crawl speed. Depth-first means that the web spider starts from the start page and follows one link after another; after finishing that path it moves on to the next start page and continues following links. The operations on the web spider module are encapsulated in ControlDLL.SpiderControlLib, and its working mode can be set programmatically.

Next comes the page acquisition module. Some information can be obtained only after login, some only after registration, and some requires a verification code; verification codes are mostly images, so the text in them has to be recognized. The methods encapsulated in WebFetchDLL address these page acquisition problems: AutoLoginLib handles automatic login, AutoRegisterLib handles automatic registration, ImageProcessLib handles image verification codes, and WebInfoLib obtains page parameters such as the encoding and cookies.

After a page has been obtained, it must be parsed and its data extracted. Extracting structured data from complex, messy web pages requires suitable techniques, such as tag positioning, image position positioning, regular expressions, XML processing and LINQ. The methods encapsulated in the WebParseDLL class library solve common problems in page parsing and data extraction: DomLib converts HTML into a DOM object model, XmlLib encapsulates common operations on XML documents, RegexLIb is mainly used for regular expression operations, and TagLib is mainly used for tag positioning.

After the data have been obtained, the word segmentation and indexing module segments and indexes them: the structured data extracted from web pages are processed, deduplicated, classified and so on according to the domain knowledge of each industry. Finally, the result return module returns the query results in the customized data format.
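The two crawl strategies described above differ only in which pending URL is taken next. The Python sketch below shows this on a small in-memory link graph; the graph `LINKS` and the function `crawl_order` are illustrative assumptions and are not part of SpiderControlLib, whose API the patent does not disclose.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> links on that page.
LINKS = {
    "start": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": [],
    "d": [],
    "e": [],
}


def crawl_order(start, links, mode="breadth"):
    """Return the order in which pages are visited under the chosen strategy."""
    frontier = deque([start])
    visited = set()
    order = []
    while frontier:
        # Breadth-first takes the oldest pending URL (queue),
        # depth-first takes the newest one (stack).
        url = frontier.popleft() if mode == "breadth" else frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        children = links.get(url, [])
        if mode == "depth":
            # Reverse so that the first link on the page is followed first.
            children = list(reversed(children))
        frontier.extend(children)
    return order


print(crawl_order("start", LINKS, "breadth"))  # ['start', 'a', 'b', 'c', 'd', 'e']
print(crawl_order("start", LINKS, "depth"))    # ['start', 'a', 'c', 'd', 'b', 'e']
```

In the breadth-first case every page at one link distance is known before the next level starts, which is what makes it straightforward to hand a whole level to parallel workers.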
The control layer mainly controls the kernel layer and the network layer of the system, and comprises a script control module, a spider control module and a thread control module. The script control module executes scripts inside and outside the page and increases the interactivity of the program, so that execution can be controlled through external scripts without changing the program code. The spider control module controls the working mode of the program's web spider module. The thread control module handles thread management for the multithreaded program. ThreadControlLib, SpiderControlLib and ScriptControlLib in ControlDLL encapsulate the solutions to these three classes of problems respectively.
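One common way to keep a multithreaded crawler manageable is a bounded worker pool. The Python sketch below illustrates that idea only; it is an assumption about what thread control could look like, not the design of ThreadControlLib, and `fetch` is a stand-in that performs no real download.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for a real page download; returns a fake page body."""
    return f"<html>{url}</html>"


def fetch_all(urls, max_workers=4):
    """Fetch pages concurrently with a bounded pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


pages = fetch_all(["site-a", "site-b", "site-c"])
print(pages)
```

Capping `max_workers` bounds the number of live threads, and leaving the `with` block waits for all workers to finish, which avoids leaked threads.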
The class libraries are packaged specifically for vertical search engine design. They encapsulate methods that the application layer, network layer, kernel layer and control layer can call to solve the technical problems of vertical search engine design. The class libraries consist of four parts: ControlDLL, WebFetchDLL, WebParseDLL and WebServiceDLL. ControlDLL comprises ScriptControlLib, ThreadControlLib and SpiderControlLib; WebFetchDLL comprises AutoLoginLib, AutoRegisterLib, WebInfoLib and ImageProcessLib; WebParseDLL comprises XmlLib, RegexLIb, TagLib and DomLib; and WebServiceDLL comprises ServiceLib and InterfaceLib. Simply by adding references to the class libraries in a program, the following functions can be implemented easily:
1. Search engine crawler: crawls relevant web pages on the Internet, adjusting the crawl direction with a depth-first or breadth-first strategy so that the system concentrates on the customized web pages.
2. Multithread management: multithreaded programming is an effective way to improve program efficiency, and thread management avoids problems such as deadlock and overflow.
3. Page acquisition: automatic login and automatic registration are used to obtain more information, together with automatic image recognition, because some information requires a verification code, and verification codes are mostly images whose text must be recognized.
4. Information extraction: extracting structured data from complex, messy web pages requires suitable techniques, such as tag positioning, image position positioning, regular expressions, XML processing and LINQ.
5. Word segmentation and indexing: the vertical search engine processes, deduplicates and classifies the structured data extracted from web pages according to the domain knowledge of each industry, then segments it into words and returns the query results in the customized data format.
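Items 4 and 5 above can be illustrated together: locate elements by tag, pull fields out with a regular expression, and drop duplicate records. The Python sketch below uses only the standard library; the markup in `HTML`, the `offer` class name and the name-plus-price record shape are assumptions chosen for the example and do not come from TagLib or RegexLIb.

```python
import re
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text of every <tag class="css_class"> element (no nesting)."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and dict(attrs).get("class") == self.css_class:
            self.inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data


# A record is assumed to be "<name> <price>", e.g. "Hotel A 320".
RECORD = re.compile(r"(?P<name>.+?)\s+(?P<price>\d+(?:\.\d+)?)\s*$")


def extract_records(html):
    """Tag positioning, then regex extraction, then order-preserving dedup."""
    parser = TagTextExtractor("li", "offer")
    parser.feed(html)
    records, seen = [], set()
    for text in parser.texts:
        match = RECORD.match(text.strip())
        if not match:
            continue
        record = (match["name"], float(match["price"]))
        if record not in seen:
            seen.add(record)
            records.append(record)
    return records


HTML = """
<ul>
  <li class="offer">Hotel A 320</li>
  <li class="ad">Buy now</li>
  <li class="offer">Hotel B 280.5</li>
  <li class="offer">Hotel A 320</li>
</ul>
"""

print(extract_records(HTML))
```

Industry-specific processing and classification would follow this step; they depend on domain knowledge that the patent leaves to each deployment.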
The web page analysis tool provides three views, namely browser, HTML and XML, so that not only the content of a web page but also its HTML code and XML code can be inspected. The parameter bar shows parameters of the current page such as its encoding, POST parameters and cookie records. The toolbar includes an XPath tool, a regular expression tool, an encoding and decoding tool, a POST testing tool, an IP tool and a script testing tool. After analysis mode is entered, the browser's navigation function is turned off; any web page element can then be clicked in the browser view, and each attribute value and the HTML code of the selected tag can be inspected in this window. To restore the browser's navigation function, click the "browser mode" button.
The specific use of the invention is illustrated below with a concrete application case.
A business service company wants to provide its customers with price-ranked information on hotels and air tickets, so it needs to search in real time the prices on the official websites of airlines, hotel groups and intermediary agencies. The user enters query conditions on a page of the company's website; after the search, the results are sorted by price and returned to that page in a unified format. A general-purpose search engine cannot handle this task; only a vertical search engine can.
For the engine to meet the company's requirements, customization must be completed first. Step one, customize the search scope: submit to the engine the URLs of the websites to be searched. Step two, customize the return format: describe to the engine the data structure required for the returned data. Step three, connect the interface: add the WebService interface provided by the engine to the company's website. Once these three steps are completed, the engine can work normally.
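The first two customization steps amount to a small configuration object plus a formatter that applies it. The Python sketch below is one possible shape under stated assumptions: the class `SearchCustomization`, its field names and the requirement that every record carries a `price` field are all hypothetical, since the patent does not define how the customization is represented.

```python
import json
from dataclasses import dataclass
from xml.etree import ElementTree as ET


@dataclass
class SearchCustomization:
    """Hypothetical container for the customization steps."""
    site_urls: list          # step one: websites that define the search scope
    fields: list             # step two: fields kept in each returned record
    output: str = "json"     # step two: "json" or "xml"


def format_results(config, records):
    """Keep the customized fields, sort by price, and serialize the result."""
    rows = [{name: record[name] for name in config.fields} for record in records]
    rows.sort(key=lambda row: row["price"])  # assumes "price" is a customized field
    if config.output == "json":
        return json.dumps(rows)
    root = ET.Element("results")
    for row in rows:
        item = ET.SubElement(root, "item")
        for name, value in row.items():
            ET.SubElement(item, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


config = SearchCustomization(["https://air.example.com"], ["airline", "price"])
records = [
    {"airline": "X", "price": 900, "site": "s1"},
    {"airline": "Y", "price": 750, "site": "s2"},
]
print(format_results(config, records))
```

Switching `output` to `"xml"` returns the same rows as an XML document, which mirrors the idea that the return format is chosen by the customer rather than fixed by the engine.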
The actual effect is shown in Fig. 7, which displays the air ticket query interface: the user fills in the departure city, the destination and the date to obtain the relevant flight information, and the information from the different websites is returned in the predetermined format.

Claims (5)

1. A customizable intelligent vertical search engine system based on .NET, comprising an application layer, a network layer, a kernel layer and a control layer, characterized in that: the network layer exchanges data with the application layer and the kernel layer respectively through a unified WebService interface; the control layer controls the network layer and the kernel layer; the application layer, network layer, kernel layer and control layer each have a corresponding class library; the class libraries comprise ControlDLL, WebFetchDLL, WebParseDLL and WebServiceDLL; the engine system further comprises a web page analysis tool; the kernel layer comprises: a parameter receiving module, which receives from the WebService interface the parameters submitted by the application layer; a web spider module, which parses the user's initial URL table according to the submitted parameters; a page acquisition module, which, after the URL table has been parsed, parses the pages and extracts data; a word segmentation and indexing module, which processes, deduplicates and classifies the structured data extracted from web pages according to the domain knowledge of each industry; and a result return module, which returns the processed and classified query results in the customized format; and the control layer comprises: a script control module, which executes scripts inside and outside the page and increases the interactivity of the program, so that execution is controlled through external scripts without changing the program code; a spider control module, which controls the working mode of the program's web spider module; and a thread control module, which handles thread management for the multithreaded program.
2. The customizable intelligent vertical search engine system based on .NET according to claim 1, characterized in that the ControlDLL comprises ScriptControlLib, ThreadControlLib and SpiderControlLib.
3. The customizable intelligent vertical search engine system based on .NET according to claim 1, characterized in that the WebFetchDLL comprises AutoLoginLib, AutoRegisterLib, WebInfoLib and ImageProcessLib.
4. The customizable intelligent vertical search engine system based on .NET according to claim 1, characterized in that the WebParseDLL comprises XmlLib, RegexLIb, TagLib and DomLib.
5. The customizable intelligent vertical search engine system based on .NET according to claim 1, characterized in that the WebServiceDLL comprises ServiceLib and InterfaceLib.
CN201110145461.4A 2011-06-01 2011-06-01 Customizable intelligent vertical search engine system based on.NET Expired - Fee Related CN102207867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110145461.4A CN102207867B (en) 2011-06-01 2011-06-01 Customizable intelligent vertical search engine system based on.NET

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110145461.4A CN102207867B (en) 2011-06-01 2011-06-01 Customizable intelligent vertical search engine system based on.NET

Publications (2)

Publication Number Publication Date
CN102207867A CN102207867A (en) 2011-10-05
CN102207867B true CN102207867B (en) 2014-08-13

Family

ID=44696715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110145461.4A Expired - Fee Related CN102207867B (en) 2011-06-01 2011-06-01 Customizable intelligent vertical search engine system based on.NET

Country Status (1)

Country Link
CN (1) CN102207867B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102393858A (en) * 2011-11-17 2012-03-28 陈洪 Meta search engine system based on client side real time aggregation
CN104125306A (en) * 2014-08-14 2014-10-29 浪潮电子信息产业股份有限公司 HTTPS (Hypertext Transfer Protocol Secure)-based acquiring method of webpage content of encryption protocol

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079033A (en) * 2006-06-30 2007-11-28 腾讯科技(深圳)有限公司 Integrative searching result sequencing system and method
CN101089856A (en) * 2007-07-20 2007-12-19 李沫南 Method for abstracting network data and web crawler system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8312014B2 (en) * 2003-12-29 2012-11-13 Yahoo! Inc. Lateral search

Also Published As

Publication number Publication date
CN102207867A (en) 2011-10-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140813

Termination date: 20170601