CN103838732A - Vertical search engine in life service field - Google Patents
- Publication number: CN103838732A
- Authority: CN (China)
- Prior art keywords: word, webpage, STU, information, document
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a vertical search engine for the life service field. The vertical search engine works in the following steps: collecting information with a specialized web spider, extracting information, building an index, and searching information. Using web spider technology, the engine traverses the web portals of the life service field, collects and stores the webpages related to the topic, and performs link analysis and extraction on those webpages. To extract structured information, it combines DOM topic block extraction with regular expression extraction. It builds an index over the structured information in the database to provide users with a full-text search service, sets weights for the different fields, and ranks the search results accordingly. Finally, in keeping with the nature of life services, results are presented not only on the Web but also through mobile phone WAP, so that the search service is available anytime and anywhere.
Description
Technical field
The present invention relates to search engine technology, and in particular to a vertical search engine for the life service field.
Background art
With the rapid development of the Internet, the amount of information on the network is growing sharply, and retrieving the required information quickly and accurately from massive network data has become an urgent problem. The search engine is the tool we use most often to obtain information from the network. However, most general-purpose search engines are queried by keyword, and the results they return lean toward knowledge-oriented content: the volume of information is large, the queries are not accurate enough, and the depth is insufficient. Vertical search engines emerged in response. The search scope of a vertical search engine is no longer the millions upon millions of webpages on the Web; instead, it searches the domain knowledge of one specific industry, and is thus a subdivision and extension of the general search engine. A vertical search engine also offers keyword search, but the keywords are usually interpreted in the context of the domain knowledge, and the returned results consist mostly of messages and entries. Unlike a general-purpose search engine, a vertical search engine collects webpage information only for a particular topic, converts and extracts the unstructured webpage information into structured data, and uses the structured data as the smallest unit of search. The data are then stored in a database, and finally segmented and indexed so that user requests can be answered through search.
A vertical search engine provides users with information specific to its professional domain. Search engines for the life service field are among the most important applications of vertical search. Dining, entertainment, shopping, real estate and other matters that people encounter in daily life can be retrieved quickly and accurately through such a search engine. Users no longer have to sift through large amounts of information to find what is useful to them; a vertical search engine for the life service field supplies a large amount of valuable information about everyday needs such as clothing, food, housing and transportation, which greatly facilitates daily life.
Compared with other search engines, a vertical search engine for the life service field requires: a specialized web spider that traverses the portal websites of the life service field, collects and stores the webpages related to the topic, and performs link analysis and extraction on them; the ability to extract structured data from webpages accurately and save it to a database; weights, word segmentation and an index defined per structured field; and both Web and WAP presentation of data, so that users' search requests can be served anytime and anywhere.
Summary of the invention
To better meet users' requirements, the present invention designs and implements a vertical search engine for the life service field. The search engine currently targets three domains: dining, entertainment and Yellow Pages.
To achieve the above object, the technical solution of the present invention is as follows: a vertical search engine for the life service field, comprising the following steps:
A. Information collection using a specialized web spider
The specialized web spider selects a few dozen of the more authoritative portal websites in a given domain as the initial seed URLs and extracts topic features from them. For each collected webpage, it analyzes the page, extracts as many of its links as possible, and inserts them into the URL queue. It also analyzes the text of the webpage, extracts its feature items, and calls the relevance analysis module to compute the relevance between the webpage and the topic; webpages whose computed relevance meets the requirement are stored in the webpage library.
Topic relevance analysis uses the vector space model. The basic idea is to regard the topic dictionary of the domain, $(T_1, T_2, \ldots, T_n)$, as an n-dimensional coordinate system. For any word $T_i$, if the webpage contains the word, it is assigned a weight $W_i$ according to its importance; otherwise $W_i$ is 0. Each webpage can therefore be converted into a term vector $(W_1, W_2, \ldots, W_n)$. In the system, $W_i$ is assigned using the TF-IDF model. The TF-IDF value of term $T_j$ in webpage $D_i$ is defined from the following quantities: $TF_i$ is the number of times term $T_j$ occurs in the webpage; $DF_i$ is the number of webpages in the whole webpage collection D that contain term $T_j$; and N is the total number of webpages.
Because a webpage contains various tags, links, text and so on, feature words appearing at different positions differ in importance, so the term frequency should be computed as a weighted sum:
$TF_i = a \cdot TF_M + b \cdot TF_T + c \cdot TF_K + d \cdot TF_D + e \cdot TF_A$
Here, $TF_M$, $TF_T$, $TF_K$, $TF_D$ and $TF_A$ are the feature-word frequencies counted in the body text, the title, the page keywords, the page description and the anchor text, respectively, and a, b, c, d and e are the corresponding weighting coefficients.
The relevance between a webpage and the topic is expressed as the cosine of the angle between their vectors. The result lies between 0 and 1, and a larger value indicates that the webpage matches the topic more closely.
B. Structured information extraction
The system structures and accurately extracts life service information in two steps: it first uses the DOM tree to automatically filter out the blocks that are unrelated to the topic of the webpage, and then uses regular expressions to extract the information precisely and save it to the database.
B1. DOM-based topic block extraction
Using the STU-DOM tree model, tag nodes of the webpage such as <table>, <tr>, <div> and <tbody> are taken as block nodes. Whether a block is kept or discarded is decided by its local correlativity (Local Correlativity) and its contextual correlativity (Contextual Correlativity). Local correlativity is determined by the links and the content within the block. Its formula is expressed in terms of ContentLength and LinkCount, which denote the number of words and the number of links in a block respectively, and $STUC_{ij}$, which denotes the j-th sub-block of $STU_i$.
Contextual correlativity is determined by the links within the block and the content of its parent block. In its formula, $STU_{pi}$ denotes the parent node of $STU_i$.
B2. Regular expression extraction
For the topic-related blocks retained in a webpage, regular expressions are written by hand according to the characteristics of the metadata and serve as the extraction rules; each rule must guarantee that the data it matches are unique. First, the HTML source fragments that correspond to the information to be extracted are selected and marked. Then, for each marked piece of information, a general match pattern string is built with a regular expression. Next, the useless HTML tags contained in a successfully matched string are filtered out to obtain plain text. Finally, the correctness of the match pattern string is verified against sample webpages; if it is incorrect, the pattern is rebuilt.
The finished rules are stored in an XML configuration file, independent of the system code, which makes them easy to modify and maintain. At run time, the system automatically loads the regular expression rules for the selected website, according to the website and the webpage library path chosen, and carries out the structured information extraction.
C. Index construction
C1. Index module
The system uses Lucene to generate index files for the structured data in the database. Lucene provides a very simple way to build an index: when an object of type Document is created, its fields (Field) correspond to the structure of a database table or view, so the retrieval weight can be controlled per metadata category, and it is also possible to specify which fields are indexed, which fields are segmented, and so on.
C2. Chinese word segmentation
The system uses a dictionary-based forward maximum matching segmentation algorithm together with a double-character hash index dictionary mechanism. First, the dictionary is loaded and hash tables are built over the first two characters of the dictionary entries, forming a three-level index structure. Then, for the string Str to be segmented, the first character is read. If the character is not found in the first-level hash table, it is cut off as a single character, and the pointer moves forward one position to continue matching. Otherwise, if the first-level hash table contains the character, the system checks whether the following character is in the second-level hash table. If it is not, the first character is still cut off as a single character. If it is, the lookup maps to the ordered array of strings that begin with these two characters, and the array is traversed to find the longest match. If the match succeeds, the matched string is cut from Str, and processing of Str continues until it is empty.
D. Information retrieval
The search keywords entered by the user are first segmented into words. The index files are then searched for the documents that contain each segmented word, and these document sets are merged to obtain the final result set. The system computes the score of each document with a TF-IDF scoring function and sorts the documents by score from high to low. The scoring function combines the following factors.
The tf factor is the frequency with which an index term occurs in a document. The idf factor reflects the number of documents that contain the index term: the more such documents there are, the smaller the value of the factor. The boost factor controls the importance of a field within a document, as well as the importance of a document among all documents. The lengthNorm factor reflects the size of the document: the larger the document, the lower the value.
Compared with the prior art, the present invention has the following beneficial effects:
1. Because the present invention uses a specialized web spider to collect the webpage information of a given domain, the collected information has wide coverage and high credibility.
2. The present invention extracts structured information by combining DOM topic block extraction with regular expression extraction. DOM topic block extraction filters out the parts of a webpage that are unrelated to the topic, which greatly improves the efficiency of information extraction; regular expression extraction matches precisely, which improves the accuracy of information extraction.
Brief description of the drawings
The present invention has 4 accompanying drawings, wherein:
Fig. 1 is the basic function diagram of the present invention;
Fig. 2 is the workflow of the specialized spider in the present invention;
Fig. 3 is the word segmentation workflow in the present invention;
Fig. 4 is the framework diagram of the present invention.
Detailed description
The basic functions of the life service vertical search engine system developed in the present invention are shown in Fig. 1. As shown in Fig. 4, the system is divided into a front end and a back end. The back end runs on the server and provides data support for the front-end retrieval service; it comprises three parts: information collection, information extraction and index construction. The front end uses the B/S (browser/server) model. The system currently offers vertical search in three domains, dining, entertainment and Yellow Pages, and covers 37 major cities nationwide. The system also supports WAP search requests through www.zhaocha.mobi, so that users can conveniently access life service information from a mobile phone anytime and anywhere. Information collection, information extraction, index construction and information retrieval are implemented as follows:
A. Information collection using a specialized web spider
The specialized web spider selects a few dozen of the more authoritative portal websites in a given domain as the initial seed URLs and extracts topic features from them. For each collected webpage, it analyzes the page, extracts as many of its links as possible, and inserts them into the URL queue. It also analyzes the text of the webpage, extracts its feature items, and calls the relevance analysis module to compute the relevance between the webpage and the topic; webpages whose computed relevance meets the requirement are stored in the webpage library. The detailed workflow is shown in Fig. 2.
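The collection loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented spider: the names `crawl`, `LinkCollector`, `fetch`, `relevance`, `threshold` and `max_pages` are all hypothetical, page download is delegated to a caller-supplied `fetch` function, and the relevance analysis module is represented by a caller-supplied `relevance` function returning a value between 0 and 1.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seeds, fetch, relevance, threshold=0.5, max_pages=100):
    """Breadth-first focused crawl starting from the seed URLs.

    fetch(url) returns the HTML text or None (hypothetical downloader).
    relevance(html) returns a topic relevance score (hypothetical).
    """
    queue = deque(seeds)      # the URL queue
    seen = set(seeds)         # avoids queuing the same URL twice
    library = {}              # the webpage library: url -> html
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch(url)
        fetched += 1
        if html is None:
            continue
        # Extract as many links as possible and queue the new ones.
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # Store the page only if it is relevant enough to the topic.
        if relevance(html) >= threshold:
            library[url] = html
    return library
```

Links are queued even from pages that fail the relevance test, which follows the text's instruction to extract as many links as possible from every collected page; a stricter spider could skip the links of irrelevant pages instead.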
Topic relevance analysis uses the vector space model. The basic idea is to regard the topic dictionary of the domain, $(T_1, T_2, \ldots, T_n)$, as an n-dimensional coordinate system. For any word $T_i$, if the webpage contains the word, it is assigned a weight $W_i$ according to its importance; otherwise $W_i$ is 0. Each webpage can therefore be converted into a term vector $(W_1, W_2, \ldots, W_n)$. In the system, $W_i$ is assigned using the TF-IDF model. The TF-IDF value of term $T_j$ in webpage $D_i$ is defined from the following quantities: $TF_i$ is the number of times term $T_j$ occurs in the webpage; $DF_i$ is the number of webpages in the whole webpage collection D that contain term $T_j$; and N is the total number of webpages.
Because a webpage contains various tags, links, text and so on, feature words appearing at different positions differ in importance, so the term frequency should be computed as a weighted sum:
$TF_i = a \cdot TF_M + b \cdot TF_T + c \cdot TF_K + d \cdot TF_D + e \cdot TF_A$
Here, $TF_M$, $TF_T$, $TF_K$, $TF_D$ and $TF_A$ are the feature-word frequencies counted in the body text, the title, the page keywords, the page description and the anchor text, respectively, and a, b, c, d and e are the corresponding weighting coefficients.
The relevance between a webpage and the topic is expressed as the cosine of the angle between their vectors. The result lies between 0 and 1, and a larger value indicates that the webpage matches the topic more closely.
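The relevance computation can be illustrated with a short sketch. Several points are assumptions rather than details from the text: the TF-IDF weight is taken in the textbook form TF × log(N / DF), since the exact formula is not reproduced here; the position weights stand in for the coefficients a to e, whose values the text does not give; and the function and variable names are hypothetical.

```python
import math

# Hypothetical values for the coefficients a, b, c, d, e.
POSITION_WEIGHTS = {
    "body": 1.0,         # a, applied to TF_M
    "title": 3.0,        # b, applied to TF_T
    "keywords": 2.0,     # c, applied to TF_K
    "description": 2.0,  # d, applied to TF_D
    "anchor": 1.5,       # e, applied to TF_A
}


def weighted_tf(term, page):
    """Position-weighted term frequency of one term.

    `page` maps a position name to the list of words found there.
    """
    return sum(weight * page.get(position, []).count(term)
               for position, weight in POSITION_WEIGHTS.items())


def tfidf_vector(page, vocabulary, doc_freq, n_pages):
    """Weight vector (W_1, ..., W_n) of a page over the topic dictionary."""
    vector = []
    for term in vocabulary:
        tf = weighted_tf(term, page)
        df = doc_freq.get(term, 0)
        # Assumed form: W = TF * log(N / DF); 0 when the term is absent.
        vector.append(tf * math.log(n_pages / df) if tf and df else 0.0)
    return vector


def cosine(u, v):
    """Cosine of the angle between two vectors, 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


vocabulary = ["restaurant", "menu", "hotel"]
doc_freq = {"restaurant": 10, "menu": 20, "hotel": 50}
page = {"title": ["restaurant"], "body": ["menu", "menu", "restaurant"]}
topic_vector = [1.0, 1.0, 1.0]  # hypothetical topic vector
relevance = cosine(tfidf_vector(page, vocabulary, doc_freq, 100), topic_vector)
```

With non-negative weights the cosine stays between 0 and 1, which matches the range stated above. A page would be kept when `relevance` reaches whatever threshold the spider is configured with.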
B. Structured information extraction
The system structures and accurately extracts life service information in two steps: it first uses the DOM tree to automatically filter out the blocks that are unrelated to the topic of the webpage, and then uses regular expressions to extract the information precisely and save it to the database.
B1. DOM-based topic block extraction
Using the STU-DOM tree model, tag nodes of the webpage such as <table>, <tr>, <div> and <tbody> are taken as block nodes. Whether a block is kept or discarded is decided by its local correlativity (Local Correlativity) and its contextual correlativity (Contextual Correlativity). Local correlativity is determined by the links and the content within the block. Its formula is expressed in terms of ContentLength and LinkCount, which denote the number of words and the number of links in a block respectively, and $STUC_{ij}$, which denotes the j-th sub-block of $STU_i$.
Contextual correlativity is determined by the links within the block and the content of its parent block. In its formula, $STU_{pi}$ denotes the parent node of $STU_i$.
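The two raw quantities used by the block measures, ContentLength and LinkCount, can be gathered per block node with a small parser. The sketch below only collects these counts; it does not implement the correlativity formulas, which are not reproduced in this text. The `link_density` helper is a hypothetical stand-in heuristic, counting non-whitespace characters as the content length is an assumption, and the class and function names are illustrative.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"table", "tr", "div", "tbody"}  # block nodes named in the text


class BlockStats(HTMLParser):
    """Collects ContentLength and LinkCount for every block node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # blocks that are currently open
        self.blocks = []   # finished blocks, in the order they close

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "content_length": 0, "link_count": 0})
        elif tag == "a":
            # A link counts toward every block that encloses it.
            for block in self.stack:
                block["link_count"] += 1

    def handle_data(self, data):
        # Assumption: content length is the number of non-whitespace characters.
        length = len("".join(data.split()))
        for block in self.stack:
            block["content_length"] += length

    def handle_endtag(self, tag):
        # Simplification: a mismatched closing tag is ignored.
        if tag in BLOCK_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            self.blocks.append(self.stack.pop())


def link_density(block):
    """Hypothetical stand-in score: links per character of content."""
    return block["link_count"] / max(block["content_length"], 1)


parser = BlockStats()
parser.feed('<div><div><a href="#">Home</a><a href="#">Map</a></div>'
            '<div>Open daily from nine to five</div></div>')
parser.close()
```

In this example the navigation block has a much higher link density than the block of descriptive text, which is the kind of contrast a block filter relies on when separating navigation from topic content.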
B2. Regular expression extraction
For the topic-related blocks retained in a webpage, regular expressions are written by hand according to the characteristics of the metadata and serve as the extraction rules; each rule must guarantee that the data it matches are unique. First, the HTML source fragments that correspond to the information to be extracted are selected and marked. Then, for each marked piece of information, a general match pattern string is built with a regular expression. Next, the useless HTML tags contained in a successfully matched string are filtered out to obtain plain text. Finally, the correctness of the match pattern string is verified against sample webpages; if it is incorrect, the pattern is rebuilt.
The finished rules are stored in an XML configuration file, independent of the system code, which makes them easy to modify and maintain. At run time, the system automatically loads the regular expression rules for the selected website, according to the website and the webpage library path chosen, and carries out the structured information extraction.
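A rule file and its use might look like the sketch below. The XML layout, the element and attribute names (`site`, `rule`, `field`), the sample patterns and the function names are all hypothetical, since the text does not describe the actual configuration schema. The sketch shows the steps named above: load the rules, match each pattern, enforce uniqueness of the match, and strip leftover HTML tags.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file for one website; one regular expression per field.
RULES_XML = """
<site name="example-dining">
  <rule field="name"><![CDATA[<h1 class="shop">(.*?)</h1>]]></rule>
  <rule field="phone"><![CDATA[<span class="tel">(.*?)</span>]]></rule>
</site>
"""

TAG_RE = re.compile(r"<[^>]+>")  # matches any leftover HTML tag


def load_rules(xml_text):
    """Parse the rule file into a mapping of field name to compiled pattern."""
    root = ET.fromstring(xml_text.strip())
    return {rule.get("field"): re.compile(rule.text, re.S)
            for rule in root.iter("rule")}


def extract(html, rules):
    """Apply every rule to a topic block and return the structured record."""
    record = {}
    for field, pattern in rules.items():
        matches = pattern.findall(html)
        # A rule must match exactly once; ambiguous matches are rejected.
        if len(matches) == 1:
            record[field] = TAG_RE.sub("", matches[0]).strip()
    return record


rules = load_rules(RULES_XML)
block = ('<div><h1 class="shop"><b>Noodle House</b></h1>'
         '<span class="tel">010-1234</span></div>')
record = extract(block, rules)
```

Keeping the patterns in data rather than in code means a site redesign only requires editing the rule file, which is the maintenance benefit the text attributes to the XML configuration.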
C. Index construction
C1. Index module
The system uses Lucene to generate index files for the structured data in the database. Lucene provides a very simple way to build an index: when an object of type Document is created, its fields (Field) correspond to the structure of a database table or view, so the retrieval weight can be controlled per metadata category, and it is also possible to specify which fields are indexed, which fields are segmented, and so on.
C2. Chinese word segmentation
The system uses a dictionary-based forward maximum matching segmentation algorithm together with a double-character hash index dictionary mechanism. As shown in Fig. 3, the dictionary is first loaded and hash tables are built over the first two characters of the dictionary entries, forming a three-level index structure. Then, for the string Str to be segmented, the first character is read. If the character is not found in the first-level hash table, it is cut off as a single character, and the pointer moves forward one position to continue matching. Otherwise, if the first-level hash table contains the character, the system checks whether the following character is in the second-level hash table. If it is not, the first character is still cut off as a single character. If it is, the lookup maps to the ordered array of strings that begin with these two characters, and the array is traversed to find the longest match. If the match succeeds, the matched string is cut from Str, and processing of Str continues until it is empty.
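The segmentation procedure above can be sketched in a few lines. The sketch assumes that nested Python dictionaries are an acceptable stand-in for the two hash tables and that each word array is ordered from longest to shortest; the function names and the sample dictionary are illustrative only.

```python
def build_index(words):
    """Three-level index: first character -> second character -> word list.

    Each word list is sorted longest first so that the first hit found
    during traversal is the longest match.
    """
    index = {}
    for word in words:
        if len(word) >= 2:
            index.setdefault(word[0], {}).setdefault(word[1], []).append(word)
    for second_level in index.values():
        for bucket in second_level.values():
            bucket.sort(key=len, reverse=True)
    return index


def segment(text, index):
    """Forward maximum matching over the double-character hash index."""
    tokens, pos = [], 0
    while pos < len(text):
        second_level = index.get(text[pos])           # first-level lookup
        candidates = []
        if second_level is not None and pos + 1 < len(text):
            candidates = second_level.get(text[pos + 1], [])  # second level
        for word in candidates:
            if text.startswith(word, pos):
                tokens.append(word)                   # longest match found
                pos += len(word)
                break
        else:
            tokens.append(text[pos])                  # cut a single character
            pos += 1
    return tokens


index = build_index(["中国", "中国人", "人民", "生活", "生活服务"])
tokens = segment("中国人民生活服务好", index)
# tokens == ["中国人", "民", "生活服务", "好"]
```

The example shows the greedy nature of forward maximum matching: "中国人" is taken as the longest match at the start, so the following "民" is left as a single character even though "人民" is in the dictionary.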
D. Information retrieval
The search keywords entered by the user are first segmented into words. The index files are then searched for the documents that contain each segmented word, and these document sets are merged to obtain the final result set. The system computes the score of each document with a TF-IDF scoring function and sorts the documents by score from high to low. The scoring function combines the following factors.
The tf factor is the frequency with which an index term occurs in a document. The idf factor reflects the number of documents that contain the index term: the more such documents there are, the smaller the value of the factor. The boost factor controls the importance of a field within a document, as well as the importance of a document among all documents. The lengthNorm factor reflects the size of the document: the larger the document, the lower the value.
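How the four factors interact can be illustrated with a simplified scoring function. This is a sketch under assumptions, not Lucene's implementation and not the formula of the patent, which is not reproduced in this text: the score is taken as the sum of tf × idf × boost × lengthNorm over query terms and fields, and the particular factor definitions (square root of the frequency, a logarithmic idf, the inverse square root of the field length) are assumed forms chosen only because they behave as the text describes. All names are hypothetical.

```python
import math


def score(query_terms, doc_fields, doc_freq, n_docs, field_boosts):
    """Sum of tf * idf * boost * lengthNorm over query terms and fields.

    doc_fields maps a field name to its list of segmented words.
    doc_freq maps a term to the number of documents containing it.
    field_boosts maps a field name to its weight (default 1.0).
    """
    total = 0.0
    for field, words in doc_fields.items():
        if not words:
            continue
        length_norm = 1.0 / math.sqrt(len(words))    # longer field, lower value
        boost = field_boosts.get(field, 1.0)
        for term in query_terms:
            freq = words.count(term)
            if freq == 0:
                continue
            tf = math.sqrt(freq)                     # grows with the frequency
            # More documents containing the term gives a smaller idf.
            idf = 1.0 + math.log(n_docs / (doc_freq.get(term, 0) + 1))
            total += tf * idf * boost * length_norm
    return total


doc_a = {"name": ["hotpot", "house"], "address": ["north", "street"]}
doc_b = {"name": ["noodle", "bar"], "address": ["hotpot", "street"]}
doc_freq = {"hotpot": 5}
boosts = {"name": 3.0}  # hypothetical: a hit in the name field counts more
ranked = sorted(
    [("a", doc_a), ("b", doc_b)],
    key=lambda item: score(["hotpot"], item[1], doc_freq, 100, boosts),
    reverse=True,
)
```

Both documents contain the query term once in a field of the same length, so the field boost alone decides the order: the document that matches in its name field ranks first.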
Claims (1)
1. A vertical search engine for the life service field, characterized in that it comprises the following steps:
A. Information collection using a specialized web spider
The specialized web spider selects a few dozen of the more authoritative portal websites in a given domain as the initial seed URLs and extracts topic features from them; for each collected webpage, it analyzes the page, extracts as many of its links as possible, and inserts them into the URL queue; it also analyzes the text of the webpage, extracts its feature items, and calls the relevance analysis module to compute the relevance between the webpage and the topic, and webpages whose computed relevance meets the requirement are stored in the webpage library;
Topic relevance analysis uses the vector space model; the basic idea is to regard the topic dictionary of the domain, $(T_1, T_2, \ldots, T_n)$, as an n-dimensional coordinate system; for any word $T_i$, if the webpage contains the word, it is assigned a weight $W_i$ according to its importance, otherwise $W_i$ is 0; each webpage can therefore be converted into a term vector $(W_1, W_2, \ldots, W_n)$; in the system, $W_i$ is assigned using the TF-IDF model, and the TF-IDF value of term $T_j$ in webpage $D_i$ is defined from the following quantities: $TF_i$ is the number of times term $T_j$ occurs in the webpage; $DF_i$ is the number of webpages in the whole webpage collection D that contain term $T_j$; and N is the total number of webpages;
Because a webpage contains various tags, links, text and so on, feature words appearing at different positions differ in importance, so the term frequency should be computed as a weighted sum:
$TF_i = a \cdot TF_M + b \cdot TF_T + c \cdot TF_K + d \cdot TF_D + e \cdot TF_A$
Here, $TF_M$, $TF_T$, $TF_K$, $TF_D$ and $TF_A$ are the feature-word frequencies counted in the body text, the title, the page keywords, the page description and the anchor text, respectively, and a, b, c, d and e are the corresponding weighting coefficients;
The relevance between a webpage and the topic is expressed as the cosine of the angle between their vectors; the result lies between 0 and 1, and a larger value indicates that the webpage matches the topic more closely;
B. Structured information extraction
The system structures and accurately extracts life service information in two steps: it first uses the DOM tree to automatically filter out the blocks that are unrelated to the topic of the webpage, and then uses regular expressions to extract the information precisely and save it to the database;
B1. DOM-based topic block extraction
Using the STU-DOM tree model, tag nodes of the webpage such as <table>, <tr>, <div> and <tbody> are taken as block nodes; whether a block is kept or discarded is decided by its local correlativity and its contextual correlativity; local correlativity is determined by the links and the content within the block, and its formula is expressed in terms of ContentLength and LinkCount, which denote the number of words and the number of links in a block respectively, and $STUC_{ij}$, which denotes the j-th sub-block of $STU_i$;
Contextual correlativity is determined by the links within the block and the content of its parent block; in its formula, $STU_{pi}$ denotes the parent node of $STU_i$;
B2. Regular expression extraction
For the topic-related blocks retained in a webpage, regular expressions are written by hand according to the characteristics of the metadata and serve as the extraction rules, and each rule must guarantee that the data it matches are unique; first, the HTML source fragments that correspond to the information to be extracted are selected and marked; then, for each marked piece of information, a general match pattern string is built with a regular expression; next, the useless HTML tags contained in a successfully matched string are filtered out to obtain plain text; finally, the correctness of the match pattern string is verified against sample webpages, and if it is incorrect, the pattern is rebuilt;
The finished rules are stored in an XML configuration file, independent of the system code, which makes them easy to modify and maintain; at run time, the system automatically loads the regular expression rules for the selected website, according to the website and the webpage library path chosen, and carries out the structured information extraction;
C. Index construction
C1. Index module
The system uses Lucene to generate index files for the structured data in the database; Lucene provides a very simple way to build an index: when an object of type Document is created, its fields (Field) correspond to the structure of a database table or view, so the retrieval weight can be controlled per metadata category, and it is also possible to specify which fields are indexed, which fields are segmented, and so on;
C2. Chinese word segmentation
The system uses a dictionary-based forward maximum matching segmentation algorithm together with a double-character hash index dictionary mechanism; first, the dictionary is loaded and hash tables are built over the first two characters of the dictionary entries, forming a three-level index structure; then, for the string Str to be segmented, the first character is read, and if the character is not found in the first-level hash table, it is cut off as a single character, and the pointer moves forward one position to continue matching; otherwise, if the first-level hash table contains the character, the system checks whether the following character is in the second-level hash table; if it is not, the first character is still cut off as a single character; if it is, the lookup maps to the ordered array of strings that begin with these two characters, and the array is traversed to find the longest match; if the match succeeds, the matched string is cut from Str, and processing of Str continues until it is empty;
D. Information retrieval
The search keywords entered by the user are first segmented into words; the index files are then searched for the documents that contain each segmented word, and these document sets are merged to obtain the final result set; the system computes the score of each document with a TF-IDF scoring function and sorts the documents by score from high to low; the scoring function combines the following factors:
The tf factor is the frequency with which an index term occurs in a document; the idf factor reflects the number of documents that contain the index term, and the more such documents there are, the smaller the value of the factor; the boost factor controls the importance of a field within a document, as well as the importance of a document among all documents; the lengthNorm factor reflects the size of the document, and the larger the document, the lower the value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210475513.9A CN103838732A (en) | 2012-11-21 | 2012-11-21 | Vertical search engine in life service field |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210475513.9A CN103838732A (en) | 2012-11-21 | 2012-11-21 | Vertical search engine in life service field |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103838732A (en) | 2014-06-04 |
Family
ID=50802246
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210475513.9A Pending CN103838732A (en) | 2012-11-21 | 2012-11-21 | Vertical search engine in life service field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103838732A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104537053A (en) * | 2014-12-26 | 2015-04-22 | 北京奇虎科技有限公司 | Classified site mining method and device and searching method and system |
CN105045838A (en) * | 2015-07-01 | 2015-11-11 | 华东师范大学 | Network crawler system based on distributed storage system |
CN105912662A (en) * | 2016-04-11 | 2016-08-31 | 天津大学 | Coreseek-based vertical search engine research and optimization method |
CN106294885A (en) * | 2016-10-09 | 2017-01-04 | 华东师范大学 | A kind of data collection towards isomery webpage and mask method |
WO2017185277A1 (en) * | 2016-04-28 | 2017-11-02 | 华为技术有限公司 | File storage method and electronic device |
CN108491438A (en) * | 2018-02-12 | 2018-09-04 | 陆夏根 | A kind of technology policy retrieval analysis method |
CN109299411A (en) * | 2018-09-26 | 2019-02-01 | 湖北函数科技有限公司 | A kind of network information cognitive method |
CN110134851A (en) * | 2019-05-05 | 2019-08-16 | 北京科技大学 | A kind of search engine system and construction method based on field Intranet |
WO2019174132A1 (en) * | 2018-03-12 | 2019-09-19 | 平安科技(深圳)有限公司 | Data processing method, server and computer storage medium |
CN112597370A (en) * | 2020-12-22 | 2021-04-02 | 荆门汇易佳信息科技有限公司 | Webpage information autonomous collecting and screening system with specified demand range |
CN112989163A (en) * | 2021-03-15 | 2021-06-18 | 中国美术学院 | Vertical search method and system |
CN113191123A (en) * | 2021-04-08 | 2021-07-30 | 中广核工程有限公司 | Indexing method and device for engineering design archive information and computer equipment |
CN113704589A (en) * | 2021-09-03 | 2021-11-26 | 海粟智链(青岛)科技有限公司 | Internet system for collecting industrial chain data |
CN113761426A (en) * | 2021-09-24 | 2021-12-07 | 南方电网数字电网研究院有限公司 | System, method, device, equipment and medium for page service authentication access to middleboxes |
CN116627973A (en) * | 2023-05-25 | 2023-08-22 | 成都融见软件科技有限公司 | Data positioning system |
CN117349295A (en) * | 2023-12-04 | 2024-01-05 | 江苏瑞宁信创科技有限公司 | Word frequency statistics method and device |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101089856A (en) * | 2007-07-20 | 2007-12-19 | 李沫南 | Method for abstracting network data and web reptile system |
US20110161307A1 (en) * | 2008-09-08 | 2011-06-30 | Huawei Technologies Co., Ltd. | Method, system, and device for searching for information and method for registering vertical search engine |
CN102200975A (en) * | 2010-03-25 | 2011-09-28 | 北京师范大学 | Vertical search engine system and method using semantic analysis |
Non-Patent Citations (2)
Title |
---|
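Ji Ye et al.: "Design and Implementation of a Vertical Search Engine for the Life Service Field" (生活服务领域垂直搜索引擎的设计与实现), 《计算机工程》 (Computer Engineering) * |
Wang Zhijiang: "Research and Implementation of a Domain-Oriented Vertical Search System" (面向领域的垂直搜索系统研究与实现), 《中国优秀硕士学位论文全文数据库·信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) * |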
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104537053A (en) * | 2014-12-26 | 2015-04-22 | 北京奇虎科技有限公司 | Classified site mining method and device and searching method and system |
CN105045838A (en) * | 2015-07-01 | 2015-11-11 | 华东师范大学 | Network crawler system based on distributed storage system |
CN105912662A (en) * | 2016-04-11 | 2016-08-31 | 天津大学 | Coreseek-based vertical search engine research and optimization method |
US11308029B2 (en) | 2016-04-28 | 2022-04-19 | Huawei Technologies Co., Ltd. | File saving method and electronic device |
WO2017185277A1 (en) * | 2016-04-28 | 2017-11-02 | 华为技术有限公司 | File storage method and electronic device |
CN106294885A (en) * | 2016-10-09 | 2017-01-04 | 华东师范大学 | A data collection and annotation method for heterogeneous webpages |
CN108491438A (en) * | 2018-02-12 | 2018-09-04 | 陆夏根 | A kind of technology policy retrieval analysis method |
WO2019174132A1 (en) * | 2018-03-12 | 2019-09-19 | 平安科技(深圳)有限公司 | Data processing method, server and computer storage medium |
CN109299411A (en) * | 2018-09-26 | 2019-02-01 | 湖北函数科技有限公司 | A kind of network information cognitive method |
CN110134851B (en) * | 2019-05-05 | 2021-10-15 | 北京科技大学 | Search engine system based on domain intranet and construction method |
CN110134851A (en) * | 2019-05-05 | 2019-08-16 | 北京科技大学 | A kind of search engine system and construction method based on field Intranet |
CN112597370A (en) * | 2020-12-22 | 2021-04-02 | 荆门汇易佳信息科技有限公司 | Webpage information autonomous collecting and screening system with specified demand range |
CN112989163A (en) * | 2021-03-15 | 2021-06-18 | 中国美术学院 | Vertical search method and system |
CN113191123A (en) * | 2021-04-08 | 2021-07-30 | 中广核工程有限公司 | Indexing method and device for engineering design archive information and computer equipment |
CN113704589A (en) * | 2021-09-03 | 2021-11-26 | 海粟智链(青岛)科技有限公司 | Internet system for collecting industrial chain data |
CN113704589B (en) * | 2021-09-03 | 2023-10-13 | 海粟智链(青岛)科技有限公司 | Internet system for collecting industrial chain data |
CN113761426A (en) * | 2021-09-24 | 2021-12-07 | 南方电网数字电网研究院有限公司 | System, method, device, equipment and medium for page service authentication access to middleboxes |
CN113761426B (en) * | 2021-09-24 | 2024-02-13 | 南方电网数字平台科技(广东)有限公司 | System, method, device, equipment and medium for page service authentication access center |
CN116627973A (en) * | 2023-05-25 | 2023-08-22 | 成都融见软件科技有限公司 | Data positioning system |
CN116627973B (en) * | 2023-05-25 | 2024-02-09 | 成都融见软件科技有限公司 | Data positioning system |
CN117349295A (en) * | 2023-12-04 | 2024-01-05 | 江苏瑞宁信创科技有限公司 | Word frequency statistics method and device |
CN117349295B (en) * | 2023-12-04 | 2024-02-13 | 江苏瑞宁信创科技有限公司 | Word frequency statistics method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103838732A (en) | | Vertical search engine in life service field |
CN103049575B (en) | | A kind of academic conference search system of topic adaptation |
CN100416570C (en) | | FAQ based Chinese natural language ask and answer method |
CN105488196B (en) | | A kind of hot topic automatic mining system based on interconnection corpus |
CN102902806B (en) | | A kind of method and system utilizing search engine to carry out query expansion |
US8140579B2 (en) | | Method and system for subject relevant web page filtering based on navigation paths information |
JP5084858B2 (en) | | Summary creation device, summary creation method and program |
US20140006408A1 (en) | | Identifying points of interest via social media |
CN103064956A (en) | | Method, computing system and computer-readable storage media for searching electric contents |
CN104035972B (en) | | A kind of knowledge recommendation method and system based on microblogging |
CN103838785A (en) | | Vertical search engine in patent field |
Asadi et al. | | Pseudo test collections for learning web search ranking functions |
CN104331449A (en) | | Method and device for determining similarity between inquiry sentence and webpage, terminal and server |
CN103678412A (en) | | Document retrieval method and device |
CN103399862B (en) | | Determine the method and apparatus of search index information corresponding to target query sequence |
Chuang et al. | | Enabling maps/location searches on mobile devices: Constructing a POI database via focused crawling and information extraction |
KR101011726B1 (en) | | Apparatus and method for providing snippet |
CN103257975A (en) | | Search method, search device and search system |
KR100671077B1 (en) | | Server, Method and System for Providing Information Search Service by Using Sheaf of Pages |
CN106202312B (en) | | A kind of interest point search method and system for mobile Internet |
US11586824B2 (en) | | System and method for link prediction with semantic analysis |
Jeong et al. | | Determining the titles of Web pages using anchor text and link analysis |
CN104281693A (en) | | Semantic search method and semantic search system |
CN101661480B (en) | | Method and system for ensuring name of organization in different languages |
Qiu et al. | | Detection and optimized disposal of near-duplicate pages |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20140604 |