CN103838732A - Vertical search engine in life service field - Google Patents

Vertical search engine in life service field

Info

Publication number
CN103838732A
Authority
CN
China
Prior art keywords
word
webpage
stu
information
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210475513.9A
Other languages
Chinese (zh)
Inventor
梅昱婷
刘博�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Original Assignee
DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2012-11-21
Filing date
2012-11-21
Publication date
2014-06-04
Application filed by DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Priority to CN201210475513.9A
Publication of CN103838732A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a vertical search engine for the life service field. The search engine operates through the following steps: collecting information with a specialized web spider, extracting information, building an index, and retrieving information. Using web spider technology, the engine traverses the portal websites of the life service field, collects and stores topic-related web pages, and performs link analysis and extraction on them. Structured information is extracted by combining DOM topic block extraction with regular expression extraction. An index is built over the structured information in the database to provide users with full-text search, and weights are set per field so that search results are ranked sensibly. Finally, in view of the characteristics of life services, data presentation is not limited to the Internet: the search service is also available anytime and anywhere through mobile-phone WAP.

Description

A vertical search engine for the life service field
Technical field
The present invention relates to search engine technology, and in particular to a vertical search engine for the life service field.
Background art
With the rapid development of the Internet, the volume of online information has grown sharply, and retrieving the required information quickly and accurately from massive network data has become an urgent problem. The search engine is the tool we use most often to obtain information from the network. However, most general-purpose search engines are queried by keyword and tend to return knowledge-oriented results: the volume of returned information is large, the query is not accurate enough, and the depth is insufficient. The vertical search engine emerged in response. Its search scope is no longer hundreds or even tens of millions of related web pages; instead, it searches specifically the domain knowledge of a particular industry, as a subdivision and extension of the general search engine. Although a vertical search engine also offers keyword search, the keywords are usually interpreted in the context of the domain knowledge, and the returned results consist mostly of messages and entries. Unlike a general search engine, a vertical search engine collects web page information only for a specific topic and converts the unstructured web page information into structured data, which serves as the smallest unit of search. The data are then stored in a database, segmented into words, and indexed, so that user requests can be met through search.
A vertical search engine provides users with information corresponding to its professional domain. A search engine for the life service field is one of the most important applications of vertical search. Catering, entertainment, shopping, real estate, and other services that people encounter in daily life can be retrieved quickly and accurately through such an engine. Users no longer need to sift through large amounts of information to find what is useful to them. A vertical search engine for the life service field provides a large amount of valuable information on everyday needs such as clothing, food, housing, and transportation, and makes daily life considerably more convenient.
Compared with other search engines, a vertical search engine for the life service field must do the following: use a specialized web spider that traverses the portal websites of the life service field, collects and saves topic-related web pages, and performs link analysis and extraction on them; extract structured data from web pages accurately and save it to a database; define weights, segment words, and build an index according to the structured fields; and serve user search requests anytime and anywhere through two presentation modes, WEB and WAP.
Summary of the invention
To better meet user requirements, the present invention designs and implements a vertical search engine for the life service field. The engine currently targets three domains: catering, entertainment, and yellow pages.
To achieve the above object, the technical solution of the present invention is as follows: a vertical search engine for the life service field, comprising the following steps:
A. Collecting information with a specialized web spider
The specialized web spider selects dozens of the more authoritative portal websites in a given domain as initial seed URLs and extracts topic features from them. For each collected web page, it analyzes the page and extracts as many of its links as possible, inserting them into the URL queue. It also analyzes the page text, extracts the feature terms, and calls the relevance analysis module to compute the relevance between the page and the topic. Pages whose computed relevance meets the condition are stored in the web page library.
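The collection loop described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `focused_crawl`, `toy_fetch`, and `toy_relevance`, the threshold value, and the in-memory `TOY_WEB` standing in for real HTTP fetching are all assumptions introduced here.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple


def focused_crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], Tuple[str, List[str]]],
    relevance: Callable[[str], float],
    threshold: float = 0.3,   # hypothetical cut-off; the patent gives no value
    max_pages: int = 100,
) -> Dict[str, str]:
    """Crawl from the seed URLs and keep only topic-relevant pages."""
    queue = deque(seeds)          # the URL queue
    seen = set(queue)             # URLs already queued, to avoid revisits
    library: Dict[str, str] = {}  # the web page library
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        fetched += 1
        # Extract as many links as possible, whatever the page's own relevance.
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # Store the page only if its topic relevance meets the condition.
        if relevance(text) >= threshold:
            library[url] = text
    return library


# Toy stand-in for the web: URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("restaurant menu dinner", ["a", "b"]),
    "a": ("weather forecast", ["c"]),
    "b": ("restaurant reviews", []),
    "c": ("dinner restaurant booking", ["seed"]),
}


def toy_fetch(url: str) -> Tuple[str, List[str]]:
    return TOY_WEB[url]


def toy_relevance(text: str) -> float:
    """Fraction of words that belong to a tiny hypothetical topic dictionary."""
    words = text.split()
    return sum(w in {"restaurant", "dinner"} for w in words) / len(words)


stored = focused_crawl(["seed"], toy_fetch, toy_relevance)
print(sorted(stored))
```

Note that the off-topic page "a" is not stored, yet its links are still followed, which is how the relevant page "c" is reached.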
Topic relevance analysis uses the vector space model. The basic idea is to treat the dictionary of the domain topic $(T_1, T_2, \ldots, T_n)$ as an n-dimensional coordinate system. For any term $T_i$, if the web page contains the term, the term is assigned a weight $W_i$ according to its importance; otherwise $W_i$ is 0. Each web page can therefore be converted into a term vector $(W_1, W_2, \ldots, W_n)$. The system assigns $W_i$ using the TF-IDF model; the TF-IDF value of term $T_i$ in web page $D_j$ is defined by the following formula:
$$W_{i,j} = TF_i \cdot \log\left(\frac{N}{DF_i}\right)$$
where $TF_i$ is the number of times term $T_i$ occurs in the web page; $DF_i$ is the number of web pages in the whole web page collection $D$ that contain term $T_i$; and $N$ is the total number of web pages.
Because a web page contains various tags, links, text, and so on, feature words that occur at different positions differ in importance, so the term frequency should be computed as a weighted sum:
$$TF_i = a \cdot TF_M + b \cdot TF_T + c \cdot TF_K + d \cdot TF_D + e \cdot TF_A$$
where $TF_M$, $TF_T$, $TF_K$, $TF_D$, and $TF_A$ are the feature word frequencies counted in the body text, the title, the page keywords, the page description, and the anchor text, respectively, and $a$, $b$, $c$, $d$, $e$ are the corresponding weighting coefficients.
The cosine of the angle between the two vectors represents the relevance between the web page and the topic. The result lies between 0 and 1, and a larger value means the web page matches the topic more closely. The formula is as follows:
$$Sim(D) = \cos\theta = \frac{\sum_{i=1}^{n} D_i \times T_i}{\sqrt{\left(\sum_{i=1}^{n} D_i^2\right) \times \left(\sum_{i=1}^{n} T_i^2\right)}}$$
Here $D_i$ and $T_i$ denote the i-th components of the weight vector of the web page and of the topic vector, respectively.
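The three formulas above (position-weighted term frequency, TF-IDF weight, and cosine relevance) can be combined in a short sketch. The coefficient values in `POSITION_WEIGHTS`, the page representation as a dictionary of word lists, and all function names are assumptions made for illustration; the patent does not state concrete coefficients.

```python
import math
from typing import Dict, List

# Hypothetical values for the coefficients a, b, c, d, e (body, title,
# keywords, description, anchor text); the patent does not specify them.
POSITION_WEIGHTS = {
    "body": 1.0,
    "title": 3.0,
    "keywords": 2.0,
    "description": 2.0,
    "anchor": 1.5,
}

Page = Dict[str, List[str]]  # position name -> list of words found there


def weighted_tf(term: str, page: Page) -> float:
    """TF_i = a*TF_M + b*TF_T + c*TF_K + d*TF_D + e*TF_A."""
    return sum(
        weight * page.get(position, []).count(term)
        for position, weight in POSITION_WEIGHTS.items()
    )


def tfidf_vector(
    page: Page, vocabulary: List[str], doc_freq: Dict[str, int], n_pages: int
) -> List[float]:
    """W_i = TF_i * log(N / DF_i), or 0 when the term is absent."""
    vector = []
    for term in vocabulary:
        tf = weighted_tf(term, page)
        df = doc_freq.get(term, 0)
        vector.append(tf * math.log(n_pages / df) if tf and df else 0.0)
    return vector


def cosine(d: List[float], t: List[float]) -> float:
    """Cosine of the angle between page vector d and topic vector t."""
    dot = sum(x * y for x, y in zip(d, t))
    norm = math.sqrt(sum(x * x for x in d) * sum(y * y for y in t))
    return dot / norm if norm else 0.0


vocabulary = ["restaurant", "menu", "cinema"]
doc_freq = {"restaurant": 10, "menu": 20, "cinema": 50}
page = {"title": ["restaurant"], "body": ["restaurant", "menu", "menu"]}
page_vector = tfidf_vector(page, vocabulary, doc_freq, n_pages=100)
topic_vector = [1.0, 1.0, 0.0]  # a made-up catering topic vector
print(cosine(page_vector, topic_vector))
```

Because all weights are non-negative, the cosine stays within the 0 to 1 range stated in the text.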
B. Structured information extraction
The system extracts structured life service information precisely in two steps: it first uses the DOM tree to automatically filter out blocks unrelated to the topic information of the web page, and then uses regular expressions to extract the information precisely and save it to the database.
B1. DOM-based topic block extraction
Using the STU-DOM tree model, tag nodes of the web page such as <table>, <tr>, <div>, and <tbody> are taken as block nodes. Whether a block is kept or discarded is decided by two measures: local correlativity (Local Correlativity) and contextual correlativity (Contextual Correlativity). Local correlativity is determined by the links and the content within the block, and is computed as follows:
$$LinkCount(STU_i) = \sum_{j=1}^{N} LinkCount(STUC_{ij})$$
$$ContentLength(STU_i) = \sum_{j=1}^{N} ContentLength(STUC_{ij})$$
$$LocalCorrelativity(STU_i) = \frac{LinkCount(STU_i)}{ContentLength(STU_i)}$$
where ContentLength and LinkCount denote the number of characters and the number of links in a block, respectively, and $STUC_{ij}$ denotes the j-th sub-block of $STU_i$.
Contextual correlativity is determined by the links within the block and the content of its parent block, and is computed as follows:
$$ContextualCorrelativity(STU_i) = \frac{LinkCount(STU_i)}{ContentLength(STU_{Pi})}$$
where $STU_{Pi}$ denotes the parent node of $STU_i$.
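The two block measures can be computed recursively over a block tree, as in the sketch below. The `Block` class, the `own_links` and `own_chars` fields (counts placed directly in a block, which give leaf blocks their values), and the handling of empty or root blocks are assumptions for illustration; the patent states only the ratios and gives no thresholds for keeping or discarding a block.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    """A block node, e.g. a <table>, <tr>, <div>, or <tbody> element."""
    own_links: int = 0   # links directly in this block (assumed field)
    own_chars: int = 0   # text characters directly in this block (assumed field)
    children: List["Block"] = field(default_factory=list)
    parent: Optional["Block"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Block") -> "Block":
        child.parent = self
        self.children.append(child)
        return child


def link_count(block: Block) -> int:
    """LinkCount(STU_i): own links plus the sum over all sub-blocks."""
    return block.own_links + sum(link_count(c) for c in block.children)


def content_length(block: Block) -> int:
    """ContentLength(STU_i): own characters plus the sum over all sub-blocks."""
    return block.own_chars + sum(content_length(c) for c in block.children)


def ratio(links: int, length: int) -> float:
    # Assumption: a block with links but no text counts as maximally link-heavy.
    if length == 0:
        return float("inf") if links else 0.0
    return links / length


def local_correlativity(block: Block) -> float:
    return ratio(link_count(block), content_length(block))


def contextual_correlativity(block: Block) -> float:
    # Assumption: the root block has no parent, so it is compared with itself.
    reference = block.parent if block.parent is not None else block
    return ratio(link_count(block), content_length(reference))


page = Block()
nav = page.add(Block(own_links=10, own_chars=50))        # navigation bar
article = page.add(Block(own_links=2, own_chars=1000))   # main content
print(local_correlativity(nav), local_correlativity(article))
print(contextual_correlativity(nav), contextual_correlativity(article))
```

In this made-up page the navigation block has a far higher link-to-text ratio than the article block, which is the kind of contrast a filtering threshold could act on.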
B2. Regular expression extraction
For the topic-related blocks retained in a web page, regular expressions are written by hand according to the characteristics of the metadata and serve as the extraction rules; each rule must guarantee that the data it matches is unique. First, the HTML source fragments that correspond to the information to be extracted are selected and marked. Then, for each marked item, a general match pattern string is built using a regular expression. Next, the useless HTML tags contained in the successfully matched string are filtered out to obtain plain text. Finally, the correctness of the match pattern string is verified against sample web pages; if it is incorrect, the pattern is rebuilt.
The finished rules are saved in an XML configuration file, independent of the system code, so that they are easy to modify and maintain. At run time, the system automatically loads the regular expression rules for the selected website, according to the selected website and web page library path, and completes the structured information extraction.
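The rule-driven extraction described above might look like the following sketch. The XML layout (`<rules>`, `<rule field="...">`), the sample patterns, and the sample HTML are all made up for illustration; the patent does not disclose the format of its configuration file.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, Pattern

# Hypothetical rule file for one website; CDATA keeps the HTML in the
# patterns from being parsed as XML.
RULES_XML = """
<rules site="example-restaurant-site">
  <rule field="name"><![CDATA[<h1 class="shop">(.*?)</h1>]]></rule>
  <rule field="phone"><![CDATA[<span class="tel">(.*?)</span>]]></rule>
</rules>
"""

HTML_TAG = re.compile(r"<[^>]+>")


def load_rules(xml_text: str) -> Dict[str, Pattern[str]]:
    """Load one compiled match pattern per metadata field."""
    root = ET.fromstring(xml_text.strip())
    return {
        rule.get("field"): re.compile(rule.text, re.S)
        for rule in root.findall("rule")
    }


def extract(html: str, rules: Dict[str, Pattern[str]]) -> Dict[str, str]:
    """Apply each rule and strip leftover HTML tags from the matched string."""
    record = {}
    for field_name, pattern in rules.items():
        matches = pattern.findall(html)
        # A rule must match uniquely; missing or ambiguous matches are skipped.
        if len(matches) == 1:
            record[field_name] = HTML_TAG.sub("", matches[0]).strip()
    return record


SAMPLE_HTML = (
    '<div><h1 class="shop"><b>Lao Wang</b> Noodles</h1>'
    '<span class="tel">0411-0000000</span></div>'
)
rules = load_rules(RULES_XML)
record = extract(SAMPLE_HTML, rules)
print(record)
```

Keeping the patterns in a separate file means a site redesign can be handled by editing the rules rather than the code, which is the maintenance benefit the text describes. Checking `record` against hand-labelled sample pages corresponds to the final verification step.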
C. Index building
C1. Index module
The system uses Lucene to generate index files from the structured data in the database. Lucene offers a very simple way to build an index: when a Document object is created, the fields (Field) of the document correspond to the structure of the database table or view. Retrieval weights can therefore be controlled by metadata category, and it is also possible to specify which fields are indexed, which fields are segmented into words, and so on.
C2. Chinese word segmentation
The system uses a dictionary-based forward maximum matching segmentation algorithm together with a double-character hash index dictionary mechanism. The dictionary is loaded first, and hash tables are built over the first two characters of each dictionary entry, forming a three-level index structure. Then, for the string Str to be segmented, its first character is read. If that character is not found in the first-level hash table, it is cut off as a single-character word and the pointer moves forward one character to continue matching. Otherwise, the system checks whether the following character is in the second-level hash table. If it is not, the first character is still cut off as a single-character word. If it is, the pair maps to a sorted array of the word strings that begin with these two characters; the array is traversed to find the longest match, and if the match succeeds, the matched string is cut from Str. Processing of Str then continues until it is empty.
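The segmentation procedure above can be sketched as follows. Several details are assumptions rather than statements from the patent: the third level stores the remainder of each word after its first two characters, the array is ordered longest first so that the first hit is the longest match, and when the two-character prefix exists but no entry matches, the first character is cut off as a single-character word. The tiny dictionary is made up.

```python
from typing import Dict, List

# first character -> second character -> remainders of words, longest first
Index = Dict[str, Dict[str, List[str]]]


def build_index(words: List[str]) -> Index:
    """Build the three-level structure from a word list."""
    index: Index = {}
    for word in words:
        if len(word) < 2:
            continue  # single characters are handled by the fallback
        index.setdefault(word[0], {}).setdefault(word[1], []).append(word[2:])
    for second_level in index.values():
        for tails in second_level.values():
            tails.sort(key=len, reverse=True)  # longest candidate first
    return index


def segment(text: str, index: Index) -> List[str]:
    """Forward maximum matching over the double-character hash index."""
    result = []
    i = 0
    while i < len(text):
        second_level = index.get(text[i])
        tails = (
            second_level.get(text[i + 1])
            if second_level and i + 1 < len(text)
            else None
        )
        match_len = 1  # default: cut the first character as a single word
        if tails is not None:
            for tail in tails:
                if text.startswith(tail, i + 2):
                    match_len = 2 + len(tail)
                    break
        result.append(text[i:i + match_len])
        i += match_len
    return result


dictionary = ["大连", "大连市", "餐厅", "火锅"]
index = build_index(dictionary)
print(segment("大连市火锅餐厅好", index))  # ['大连市', '火锅', '餐厅', '好']
```

The two hash lookups reject most non-words after inspecting at most two characters, so the array is only scanned when a dictionary word can actually start at the current position.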
D. Information retrieval
The search keywords entered by the user are first segmented into words. The documents containing each segmented word are then looked up in the index file, and these document sets are merged to obtain the final result set. The system computes the score of each document with a TF-IDF scoring function and sorts the documents by score. The formula is as follows:
$$Score = \sum_{t \in q} tf(d.t) \cdot idf(t) \cdot boost(d.t.field) \cdot lengthNorm(d.t.field)$$
where the tf factor is the frequency with which an index term occurs in a document; the idf factor reflects the number of documents that contain the index term, and the more such documents there are, the smaller the factor; the boost factor controls the importance of a field within a document, as well as the importance of a document among all documents; and the lengthNorm factor reflects the size of the document, so that a larger document yields a lower value.
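A toy version of this scoring function is sketched below. The concrete forms used here (square-root tf, logarithmic idf, inverse-square-root length norm) and the boost values are illustrative assumptions in the spirit of classic TF-IDF scoring; the patent describes the factors only qualitatively, and this is not a reproduction of Lucene's own code.

```python
import math
from typing import Dict, List

Document = Dict[str, List[str]]  # field name -> segmented terms in that field


def score(
    query_terms: List[str],
    doc: Document,
    all_docs: List[Document],
    boosts: Dict[str, float],
) -> float:
    """Sum tf * idf * boost * lengthNorm over query terms and fields."""
    n_docs = len(all_docs)
    total = 0.0
    for term in query_terms:
        # Number of documents containing the term in any field.
        df = sum(
            1 for d in all_docs if any(term in terms for terms in d.values())
        )
        idf = 1.0 + math.log(n_docs / (df + 1))  # rarer term -> larger idf
        for field_name, terms in doc.items():
            freq = terms.count(term)
            if freq == 0:
                continue
            tf = math.sqrt(freq)
            length_norm = 1.0 / math.sqrt(len(terms))  # longer field -> lower
            total += tf * idf * boosts.get(field_name, 1.0) * length_norm
    return total


docs = [
    {"name": ["hotpot", "restaurant"], "intro": ["a", "cozy", "place"]},
    {"name": ["noodle", "house"],
     "intro": ["serves", "hotpot", "and", "noodle", "dishes"]},
    {"name": ["cinema"], "intro": ["movie", "theater"]},
]
boosts = {"name": 3.0, "intro": 1.0}  # hypothetical per-field weights
ranked = sorted(docs, key=lambda d: score(["hotpot"], d, docs, boosts),
                reverse=True)
print([d["name"] for d in ranked])
```

With these made-up boosts, a match in the short, heavily weighted name field outranks a match buried in a longer introduction, which illustrates how per-field weights shape the ordering.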
Compared with the prior art, the present invention has the following beneficial effects:
1. Because the present invention uses a specialized web spider to collect web page information in a given domain, the collected information has wide coverage and high credibility.
2. The present invention extracts structured information by combining DOM topic block extraction with regular expression extraction. DOM topic block extraction filters out web page content unrelated to the topic, which greatly improves the efficiency of information extraction; regular expression extraction matches precisely, which improves the accuracy of information extraction.
Brief description of the drawings
The present invention has 4 drawings in total, in which:
Fig. 1 is the basic function diagram of the present invention;
Fig. 2 is the workflow of the specialized web spider in the present invention;
Fig. 3 is the word segmentation workflow in the present invention;
Fig. 4 is the framework diagram of the present invention.
Embodiment
The basic functions of the life service vertical search engine system developed in the present invention are shown in Fig. 1. As shown in Fig. 4, the system is divided into a front end and a back end. The back end runs on the server and provides data support for the front-end retrieval service; it comprises three parts: information collection, information extraction, and index building. The front end uses the B/S (browser/server) model. The system currently offers vertical search in three domains (catering, entertainment, and yellow pages) and covers 37 major cities nationwide. The system also supports WAP search requests through www.zhaocha.mobi, so that users can conveniently access relevant life service information from a mobile phone anytime and anywhere. Information collection, information extraction, index building, and information retrieval are implemented as follows:
Steps A (collecting information with the specialized web spider), B (structured information extraction), C (index building), and D (information retrieval) are implemented as set out in the Summary of the invention above, using the same formulas. The detailed workflow of the specialized web spider in step A is shown in Fig. 2, and the word segmentation workflow of step C2 is shown in Fig. 3.

Claims (1)

1. A vertical search engine for the life service field, characterized in that it comprises the following steps:
A. Collecting information with a specialized web spider
The specialized web spider selects dozens of the more authoritative portal websites in a given domain as initial seed URLs and extracts topic features from them; for each collected web page, it analyzes the page and extracts as many of its links as possible, inserting them into the URL queue; it also analyzes the page text, extracts the feature terms, and calls the relevance analysis module to compute the relevance between the page and the topic, and pages whose computed relevance meets the condition are stored in the web page library;
Topic relevance analysis uses the vector space model; the basic idea is to treat the dictionary of the domain topic $(T_1, T_2, \ldots, T_n)$ as an n-dimensional coordinate system; for any term $T_i$, if the web page contains the term, the term is assigned a weight $W_i$ according to its importance, otherwise $W_i$ is 0; each web page can therefore be converted into a term vector $(W_1, W_2, \ldots, W_n)$; the system assigns $W_i$ using the TF-IDF model, and the TF-IDF value of term $T_i$ in web page $D_j$ is defined by the following formula:
$$W_{i,j} = TF_i \cdot \log\left(\frac{N}{DF_i}\right)$$
where $TF_i$ is the number of times term $T_i$ occurs in the web page; $DF_i$ is the number of web pages in the whole web page collection $D$ that contain term $T_i$; and $N$ is the total number of web pages;
Because a web page contains various tags, links, text, and so on, feature words that occur at different positions differ in importance, so the term frequency should be computed as a weighted sum:
$$TF_i = a \cdot TF_M + b \cdot TF_T + c \cdot TF_K + d \cdot TF_D + e \cdot TF_A$$
where $TF_M$, $TF_T$, $TF_K$, $TF_D$, and $TF_A$ are the feature word frequencies counted in the body text, the title, the page keywords, the page description, and the anchor text, respectively, and $a$, $b$, $c$, $d$, $e$ are the corresponding weighting coefficients;
The cosine of the angle between the two vectors represents the relevance between the web page and the topic; the result lies between 0 and 1, and a larger value means the web page matches the topic more closely; the formula is as follows:
$$Sim(D) = \cos\theta = \frac{\sum_{i=1}^{n} D_i \times T_i}{\sqrt{\left(\sum_{i=1}^{n} D_i^2\right) \times \left(\sum_{i=1}^{n} T_i^2\right)}}$$
B. Structured information extraction
The system extracts structured life service information precisely in two steps: it first uses the DOM tree to automatically filter out blocks unrelated to the topic information of the web page, and then uses regular expressions to extract the information precisely and save it to the database;
B1. DOM-based topic block extraction
Using the STU-DOM tree model, tag nodes of the web page such as <table>, <tr>, <div>, and <tbody> are taken as block nodes; whether a block is kept or discarded is decided by local correlativity and contextual correlativity; local correlativity is determined by the links and the content within the block, and is computed as follows:
$$LinkCount(STU_i) = \sum_{j=1}^{N} LinkCount(STUC_{ij})$$
$$ContentLength(STU_i) = \sum_{j=1}^{N} ContentLength(STUC_{ij})$$
$$LocalCorrelativity(STU_i) = \frac{LinkCount(STU_i)}{ContentLength(STU_i)}$$
where ContentLength and LinkCount denote the number of characters and the number of links in a block, respectively, and $STUC_{ij}$ denotes the j-th sub-block of $STU_i$;
Contextual correlativity is determined by the links within the block and the content of its parent block, and is computed as follows:
$$ContextualCorrelativity(STU_i) = \frac{LinkCount(STU_i)}{ContentLength(STU_{Pi})}$$
where $STU_{Pi}$ denotes the parent node of $STU_i$;
B2. Regular expression extraction
For the topic-related blocks retained in a web page, regular expressions are written by hand according to the characteristics of the metadata and serve as the extraction rules, and each rule must guarantee that the data it matches is unique; first, the HTML source fragments that correspond to the information to be extracted are selected and marked; then, for each marked item, a general match pattern string is built using a regular expression; next, the useless HTML tags contained in the successfully matched string are filtered out to obtain plain text; finally, the correctness of the match pattern string is verified against sample web pages, and if it is incorrect, the pattern is rebuilt;
The finished rules are saved in an XML configuration file, independent of the system code, so that they are easy to modify and maintain; at run time, the system automatically loads the regular expression rules for the selected website, according to the selected website and web page library path, and completes the structured information extraction;
C. Index building
C1. Index module
The system uses Lucene to generate index files from the structured data in the database; Lucene offers a very simple way to build an index: when a Document object is created, the fields (Field) of the document correspond to the structure of the database table or view, so retrieval weights can be controlled by metadata category, and it is also possible to specify which fields are indexed, which fields are segmented into words, and so on;
C2. Chinese word segmentation
The system uses a dictionary-based forward maximum matching segmentation algorithm together with a double-character hash index dictionary mechanism; the dictionary is loaded first, and hash tables are built over the first two characters of each dictionary entry, forming a three-level index structure; then, for the string Str to be segmented, its first character is read, and if that character is not found in the first-level hash table, it is cut off as a single-character word and the pointer moves forward one character to continue matching; otherwise, the system checks whether the following character is in the second-level hash table; if it is not, the first character is still cut off as a single-character word; if it is, the pair maps to a sorted array of the word strings that begin with these two characters, the array is traversed to find the longest match, and if the match succeeds, the matched string is cut from Str; processing of Str then continues until it is empty;
D. Information retrieval
The search keywords entered by the user are first segmented into words; the documents containing each segmented word are then looked up in the index file, and these document sets are merged to obtain the final result set; the system computes the score of each document with a TF-IDF scoring function and sorts the documents by score, and the formula is as follows:
$$Score = \sum_{t \in q} tf(d.t) \cdot idf(t) \cdot boost(d.t.field) \cdot lengthNorm(d.t.field)$$
where the tf factor is the frequency with which an index term occurs in a document; the idf factor reflects the number of documents that contain the index term, and the more such documents there are, the smaller the factor; the boost factor controls the importance of a field within a document, as well as the importance of a document among all documents; and the lengthNorm factor reflects the size of the document, so that a larger document yields a lower value.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210475513.9A CN103838732A (en) 2012-11-21 2012-11-21 Vertical search engine in life service field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210475513.9A CN103838732A (en) 2012-11-21 2012-11-21 Vertical search engine in life service field

Publications (1)

Publication Number Publication Date
CN103838732A true CN103838732A (en) 2014-06-04

Family

ID=50802246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210475513.9A Pending CN103838732A (en) 2012-11-21 2012-11-21 Vertical search engine in life service field

Country Status (1)

Country Link
CN (1) CN103838732A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537053A (en) * 2014-12-26 2015-04-22 北京奇虎科技有限公司 Classified site mining method and device and searching method and system
CN105045838A (en) * 2015-07-01 2015-11-11 华东师范大学 Network crawler system based on distributed storage system
CN105912662A (en) * 2016-04-11 2016-08-31 天津大学 Coreseek-based vertical search engine research and optimization method
CN106294885A (en) * 2016-10-09 2017-01-04 华东师范大学 A kind of data collection towards isomery webpage and mask method
WO2017185277A1 (en) * 2016-04-28 2017-11-02 华为技术有限公司 File storage method and electronic device
CN108491438A (en) * 2018-02-12 2018-09-04 陆夏根 A kind of technology policy retrieval analysis method
CN109299411A (en) * 2018-09-26 2019-02-01 湖北函数科技有限公司 A kind of network information cognitive method
CN110134851A (en) * 2019-05-05 2019-08-16 北京科技大学 A kind of search engine system and construction method based on field Intranet
WO2019174132A1 (en) * 2018-03-12 2019-09-19 平安科技(深圳)有限公司 Data processing method, server and computer storage medium
CN112597370A (en) * 2020-12-22 2021-04-02 荆门汇易佳信息科技有限公司 Webpage information autonomous collecting and screening system with specified demand range
CN112989163A (en) * 2021-03-15 2021-06-18 中国美术学院 Vertical search method and system
CN113191123A (en) * 2021-04-08 2021-07-30 中广核工程有限公司 Indexing method and device for engineering design archive information and computer equipment
CN113704589A (en) * 2021-09-03 2021-11-26 海粟智链(青岛)科技有限公司 Internet system for collecting industrial chain data
CN113761426A (en) * 2021-09-24 2021-12-07 南方电网数字电网研究院有限公司 System, method, device, equipment and medium for page service authentication access to middleboxes
CN116627973A (en) * 2023-05-25 2023-08-22 成都融见软件科技有限公司 Data positioning system
CN117349295A (en) * 2023-12-04 2024-01-05 江苏瑞宁信创科技有限公司 Word frequency statistics method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101089856A (en) * 2007-07-20 2007-12-19 李沫南 Method for abstracting network data and web reptile system
US20110161307A1 (en) * 2008-09-08 2011-06-30 Huawei Technologies Co., Ltd. Method, system, and device for searching for information and method for registering vertical search engine
CN102200975A (en) * 2010-03-25 2011-09-28 北京师范大学 Vertical search engine system and method using semantic analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
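Ji Ye (汲业) et al.: "Design and Implementation of a Vertical Search Engine in the Life Service Field", Computer Engineering (《计算机工程》) *
Wang Zhijiang (王治江): "Research and Implementation of a Domain-Oriented Vertical Search System", China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库·信息科技辑》) *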

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537053A (en) * 2014-12-26 2015-04-22 北京奇虎科技有限公司 Classified site mining method and device and searching method and system
CN105045838A (en) * 2015-07-01 2015-11-11 华东师范大学 Network crawler system based on distributed storage system
CN105912662A (en) * 2016-04-11 2016-08-31 天津大学 Coreseek-based vertical search engine research and optimization method
US11308029B2 (en) 2016-04-28 2022-04-19 Huawei Technologies Co., Ltd. File saving method and electronic device
WO2017185277A1 (en) * 2016-04-28 2017-11-02 华为技术有限公司 File storage method and electronic device
CN106294885A (en) * 2016-10-09 2017-01-04 华东师范大学 A kind of data collection towards isomery webpage and mask method
CN108491438A (en) * 2018-02-12 2018-09-04 陆夏根 A kind of technology policy retrieval analysis method
WO2019174132A1 (en) * 2018-03-12 2019-09-19 平安科技(深圳)有限公司 Data processing method, server and computer storage medium
CN109299411A (en) * 2018-09-26 2019-02-01 湖北函数科技有限公司 A kind of network information cognitive method
CN110134851B (en) * 2019-05-05 2021-10-15 北京科技大学 Search engine system based on domain intranet and construction method
CN110134851A (en) * 2019-05-05 2019-08-16 北京科技大学 A kind of search engine system and construction method based on field Intranet
CN112597370A (en) * 2020-12-22 2021-04-02 荆门汇易佳信息科技有限公司 Webpage information autonomous collecting and screening system with specified demand range
CN112989163A (en) * 2021-03-15 2021-06-18 中国美术学院 Vertical search method and system
CN113191123A (en) * 2021-04-08 2021-07-30 中广核工程有限公司 Indexing method and device for engineering design archive information and computer equipment
CN113704589A (en) * 2021-09-03 2021-11-26 海粟智链(青岛)科技有限公司 Internet system for collecting industrial chain data
CN113704589B (en) * 2021-09-03 2023-10-13 海粟智链(青岛)科技有限公司 Internet system for collecting industrial chain data
CN113761426A (en) * 2021-09-24 2021-12-07 南方电网数字电网研究院有限公司 System, method, device, equipment and medium for page service authentication access to middleboxes
CN113761426B (en) * 2021-09-24 2024-02-13 南方电网数字平台科技(广东)有限公司 System, method, device, equipment and medium for page service authentication access center
CN116627973A (en) * 2023-05-25 2023-08-22 成都融见软件科技有限公司 Data positioning system
CN116627973B (en) * 2023-05-25 2024-02-09 成都融见软件科技有限公司 Data positioning system
CN117349295A (en) * 2023-12-04 2024-01-05 江苏瑞宁信创科技有限公司 Word frequency statistics method and device
CN117349295B (en) * 2023-12-04 2024-02-13 江苏瑞宁信创科技有限公司 Word frequency statistics method and device

Similar Documents

Publication Publication Date Title
CN103838732A (en) Vertical search engine in life service field
CN103049575B (en) A kind of academic conference search system of topic adaptation
CN100416570C (en) FAQ based Chinese natural language ask and answer method
CN105488196B (en) A kind of hot topic automatic mining system based on interconnection corpus
CN102902806B (en) A kind of method and system utilizing search engine to carry out query expansion
US8140579B2 (en) Method and system for subject relevant web page filtering based on navigation paths information
JP5084858B2 (en) Summary creation device, summary creation method and program
US20140006408A1 (en) Identifying points of interest via social media
CN103064956A (en) Method, computing system and computer-readable storage media for searching electric contents
CN104035972B (en) A kind of knowledge recommendation method and system based on microblogging
CN103838785A (en) Vertical search engine in patent field
Asadi et al. Pseudo test collections for learning web search ranking functions
CN104331449A (en) Method and device for determining similarity between inquiry sentence and webpage, terminal and server
CN103678412A (en) Document retrieval method and device
CN103399862B (en) Determine the method and apparatus of search index information corresponding to target query sequence
Chuang et al. Enabling maps/location searches on mobile devices: Constructing a POI database via focused crawling and information extraction
KR101011726B1 (en) Apparatus and method for providing snippet
CN103257975A (en) Search method, search device and search system
KR100671077B1 (en) Server, Method and System for Providing Information Search Service by Using Sheaf of Pages
CN106202312B (en) A kind of interest point search method and system for mobile Internet
US11586824B2 (en) System and method for link prediction with semantic analysis
Jeong et al. Determining the titles of Web pages using anchor text and link analysis
CN104281693A (en) Semantic search method and semantic search system
CN101661480B (en) Method and system for ensuring name of organization in different languages
Qiu et al. Detection and optimized disposal of near-duplicate pages

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20140604