CN103425735B

CN103425735B - Method and system for establishing website topic word query

Info

Publication number: CN103425735B
Application number: CN201310223294.XA
Authority: CN
Inventors: 车天文; 雷大伟; 石志伟; 周步恋; 杨振东; 王喜民
Original assignee: Shenzhen Yisou Science & Technology Development Co Ltd
Current assignee: Shenzhen easou world Polytron Technologies Inc
Priority date: 2013-06-06
Filing date: 2013-06-06
Publication date: 2017-08-11
Anticipated expiration: 2033-06-06
Also published as: CN103425735A

Abstract

The present invention relates to the field of information retrieval and provides a method for establishing website topic word query, comprising: obtaining web page data; computing the cross-site importance of words from the web page data; extracting website topic words from the web page data; building a stored resource dictionary from the extracted topic word information; and establishing a website topic query interface. The invention also provides a system for establishing website topic word query. The technical solution has a simple flow that is easy to implement, can be updated quickly, and can be used offline and online to improve the experience of specialized search.

Description

Method and system for establishing website topic word query

Technical field

The present invention relates to the field of information retrieval, and in particular to a method and system for establishing website topic word query.

Background

With the development of information technology, the information on the Internet has grown ever richer and has penetrated every aspect of people's lives. The emergence of search engines in particular lets users quickly find the information they need in massive amounts of data. A traditional search engine aims to satisfy user needs in general: everyone shares one search engine, and satisfying the needs of most people is considered sufficient. As a mass-market network tool, most search engines find it hard to meet the demands of specific industries and specific users for specific information or services. Specialized search engines have therefore appeared, which focus on collecting the important pages related to a given topic and ensure that information in that field is indexed and updated promptly.

A search engine should not merely be a tool that satisfies people's need for lifestyle and entertainment information; it should also be a tool that serves broader and more specialized needs. How to make search engines more useful, more professional and practical, so that people in every trade can use them and find what they need, is the problem search engines face.

Vertical search is one kind of search engine function, and most search engines offer it. Vertical search can be regarded as search within a particular class of specialized domain, covering fields such as novels, music, video, and pictures. For example, when a user searches for a song, the user can directly obtain the song's information, a preview, a download, and so on, which directly satisfies the search need and pleases the user. However, vertical search covers resource-type fields that lean toward lifestyle and entertainment, and in such fields the user's actual need can be satisfied directly.

But the fields in which users are distributed and the specialized domains they are interested in are not limited to vertical resource domains. As Internet resources become richer and more varied, more and more professionals turn to search engines when they run into problems, searching for work-related and professional questions. Faced with this growing professional demand, search engines fall short and cannot provide professional results. This is not because the Internet lacks specialized data, but because search engines' analysis and understanding of Internet data is still insufficient: they simply crawl and retrieve, without deeper analysis and understanding. Massive data needs to be mined and organized into higher-quality data.

To make search results more specialized and more convincing, the most important thing is for a search engine to make its own data specialized: to understand Internet data offline, and to have a clear understanding and organization of the data types, distribution, topics, and so on in its own search database. Then, after a user searches for a keyword, the computer can return more specialized and more authoritative websites related to the keyword, and the user genuinely benefits. Internet data therefore needs to be analyzed and understood offline, extracting information on a large number of websites and understanding each website's topic. Once a website's topic is known, the purpose of the website is known, and when a user supplies a keyword, data from the websites corresponding to that keyword can be provided. The prior art offers no effective retrieval scheme for this.

Summary of the invention

The technical problem solved by the present invention is to provide a method and system for establishing website topic word query, so as to improve the effectiveness of search engines.

To solve the above problem, the present invention provides a method for establishing website topic word query, comprising:

obtaining web page data;

computing the cross-site importance of words from the web page data;

extracting website topic words from the web page data;

building a stored resource dictionary from the extracted website topic word information;

establishing a website topic query interface.

In the above method, obtaining the web page data comprises:

obtaining data of the web pages that a website contains, mainly the title and the URL of each web page.

In the above method, computing the cross-site importance of words from the web page data comprises:

segmenting the web page titles into a series of words, filtering the words by part of speech, and outputting each word together with its computed inverse site frequency (ISF) value as the measure of importance.

In the above method, the inverse site frequency (ISF) value is defined as

ISF(T_j) = log(N / n_j)

where n_j is the number of websites containing the word T_j, and N is the total number of websites.

In the above method, extracting the website topic words comprises:

segmenting the titles of web pages into a series of words (Terms);

filtering the words by part of speech;

scoring the words that pass the above steps, and selecting words as the topic words of the website according to the scores.

In the above method, scoring the words comprises scoring the words in a web page title, with the scoring formula

p_score(T_j) = index_score(T_j) * pos_score(T_j)

where index_score(T_j) is the position score of the word T_j, and pos_score(T_j) is the part-of-speech score of the word T_j.

Here N is the number of words the title contains, and dpos(T_j) is the part-of-speech grade of the word T_j.

Scoring further comprises scoring the words within a website,

where s_score(T_j) is the score of the word T_j in a given website, page_num(T_j) is the number of web pages in that website whose titles contain the word T_j, and page_num is the number of web pages the website contains.

In the above method, building the stored resource dictionary comprises:

building a structured resource dictionary from the extracted website topic word information, including a forward query module from website to topic words, and/or a reverse query module from topic word to websites.

In the above method, the forward query module and the reverse query module each include a data area and a structure area; the structure area stores the objects that are queried directly, and the data area stores the data shared by the structure area.

In the above method, the website topic query interface includes a forward query and a reverse query. The forward query looks up the topic words of a website and their weights by website address; the reverse query looks up the websites covered by a topic word and their weights by topic word.

The present invention also provides a system for establishing website topic word query, comprising:

an acquisition module, for obtaining web page data;

a statistics module, for computing the cross-site importance of words from the web page data;

an extraction module, for extracting website topic words from the web page data;

a building module, for building a stored resource dictionary from the extracted website topic word information;

an interface module, for establishing a website topic query interface.

With the technical solution of the present invention, the cross-site importance of words is computed from the search engine's own web page data as the ISF (inverse site frequency); important words are extracted from the key fields of web pages and aggregated into the topic words of the website they belong to; and the final result is stored as a resource dictionary with forward and reverse topic word query interfaces, which makes lookups between websites and topic words convenient. In summary, the scheme has a simple flow that is easy to implement, can be updated quickly, and can be used offline and online to improve the experience of specialized search.

Brief description of the drawings

The drawings described here provide a further understanding of the present invention and form a part of it. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flow chart of the first embodiment of the present invention;

Fig. 2 is a flow chart of the second embodiment of the present invention.

Detailed description

To make the technical problems to be solved, the technical solutions, and the advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.

Given that search engines currently do too little to understand and analyze Internet data, the present invention extracts the topic words of websites from Internet web page data, builds a resource dictionary, and provides forward and reverse interfaces between websites and topic words, giving search engines a basis for making their own results more specialized.

As shown in Fig. 1, a flow chart of the first embodiment of the present invention, a method for establishing website topic word query is provided, comprising:

Step S101: obtain web page data.

As one embodiment, data of the web pages that a website contains is obtained, mainly the title and the URL of each web page.

A search engine acts as the hub between the user's query terms and massive data, so it has two indispensable data sources: query logs and web page data. Query logs can be used to analyze and understand user intent and core needs; web page data can be used to analyze and understand the engine's own back-end data, including for data mining and for building knowledge bases.

The present invention extracts the topic words of websites. Because a website's home page has little content, it is unsuitable for content mining, and what a search engine typically crawls is web page data; the data used is therefore not the home page but the web pages the website contains. The actual web page data in the search engine's index is used; since that index is updated periodically, the invention can be updated on the same cycle. The most important fields of a web page are its title and its URL, so these two attributes are the main carriers used for topic words: the URL is used to extract the corresponding website, and the title is used to extract key words.
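The mapping from URL to website can be sketched in a few lines of Python. This is only an illustration: the description does not say how the website is derived from the URL, so treating the host name as the site key (and dropping a leading "www.") is an assumption, as are the function names `site_of` and `group_titles_by_site`.

```python
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Return the lower-cased host of a page URL, used here as the site key."""
    host = urlsplit(url).hostname or ""
    # Assumption: "www.sina.com" and "sina.com" count as the same website.
    return host[4:] if host.startswith("www.") else host


def group_titles_by_site(pages):
    """pages: iterable of (url, title) pairs -> {site: [title, ...]}."""
    grouped = {}
    for url, title in pages:
        grouped.setdefault(site_of(url), []).append(title)
    return grouped
```

Grouping titles by site in this way yields the per-website input that the later steps operate on.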

Step S102: compute the cross-site importance of words from the web page data.

As one embodiment, the web page titles are first segmented into a series of words; stop words, punctuation marks, and the like are then filtered out by part of speech; finally, each word is output together with its computed inverse site frequency (ISF) value as the measure of importance.

To compute the topic words of a website, one first needs a rough picture of how words are distributed across websites. If a word T occurs with high frequency (TF) in one website and rarely in other websites, it is considered to discriminate well between websites. This step computes an ISF (inverse site frequency), analogous to the IDF (inverse document frequency) used in text classification.

As a measure of a word's general importance, the ISF of a given word is obtained by dividing the total number of websites by the number of websites containing the word and taking the logarithm of the quotient.

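The ISF formula is defined as

ISF(T_j) = log(N / n_j)

where n_j is the number of websites containing the word T_j, and N is the total number of websites.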
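The ISF computation described in this step (total number of websites divided by the number of websites containing the word, then the logarithm) can be sketched as follows. The function name and the input layout are hypothetical, and the natural logarithm is an assumption, since the text does not state a base.

```python
import math


def inverse_site_frequency(site_terms):
    """site_terms: {site: set of words seen in that site's page titles}.

    Returns {word: log(N / n_j)}, where N is the number of sites and
    n_j is the number of sites whose titles contain the word.
    """
    total_sites = len(site_terms)
    sites_with_word = {}
    for words in site_terms.values():
        for word in set(words):  # count each site at most once per word
            sites_with_word[word] = sites_with_word.get(word, 0) + 1
    # Natural logarithm is an assumption; the source gives no base.
    return {w: math.log(total_sites / n) for w, n in sites_with_word.items()}
```

A word that appears on every website gets an ISF of 0, while a word confined to a single website gets the maximum value log(N).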

Step S103: extract website topic words.

The ISF of a word expresses its importance across websites and is a global measure. The higher the ISF, the better the word can be taken to reflect the topic of a website. This step, by contrast, works with the data of each individual web page: it analyzes Terms locally and extracts the important words in each web page title.

A web page contains many parts: URL, title, content, links, and so on. The title is the key field of a web page; it is the most important part and the one that best reflects the topic of the page content. Only in cheating or low-quality pages are the title and the content inconsistent. The web page data used here, however, comes from the search engine's index and has already passed offline quality analysis, so it can be regarded as high-quality, with few dead links or cheating pages.

First, the title of each web page is segmented into a series of words (Terms).

Second, not every word is worth analyzing. To improve efficiency, the words are filtered by part of speech, removing words that are obviously unimportant, such as particles, punctuation marks, conjunctions, and prepositions.

Third, the words that pass the above steps are scored. Scoring considers two factors: position and part of speech.

Position refers to where the word T sits in the web page title: front, middle, or rear. The front is defined as a Term position below 20% of the number of Terms, the rear as a Term position above 80% of the number of Terms, and everything else is the middle. Words in different positions differ in importance; in a sentence or an article, the important words related to the topic generally sit at the front or the rear.

The importance of part of speech is self-evident. A web page title largely follows natural language habits and logic and is a fairly regular form of language, so part of speech can be used to judge which words are important and which are not. From the point of view of a title's topic, parts of speech are divided into three grades: grade one mainly comprises nouns, verbs, personal names, place names, organization names, proper nouns, idioms, and abbreviations; grade two mainly comprises adjectives, adverbs, nouns of locality, and measure words; all other parts of speech are grade three, which covers morpheme-like words that only play a modifying role.
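The two classifications described above can be sketched as follows. The 20% and 80% thresholds and the three part-of-speech grades come from the text; the 1-based Term position, the strict comparisons at the boundaries, and the tag names are assumptions made for illustration, since a real segmenter would emit its own tag set.

```python
def position_zone(index: int, term_count: int) -> str:
    """Classify a 1-based Term position (assumption) as front, middle or rear.

    Integer arithmetic is used so the 20% / 80% boundaries are exact.
    """
    if index * 5 < term_count:        # position < 20% of the Term count
        return "front"
    if index * 5 > 4 * term_count:    # position > 80% of the Term count
        return "rear"
    return "middle"


# Hypothetical tag names standing in for a segmenter's part-of-speech tags.
GRADE_ONE = {"noun", "verb", "person_name", "place_name",
             "organization_name", "proper_noun", "idiom", "abbreviation"}
GRADE_TWO = {"adjective", "adverb", "locality_noun", "measure_word"}


def pos_grade(tag: str) -> int:
    """Return the part-of-speech grade (1, 2 or 3) for a tag."""
    if tag in GRADE_ONE:
        return 1
    if tag in GRADE_TWO:
        return 2
    return 3
```

Under these boundary assumptions, very short titles may have no word in the front zone; an implementation could just as well choose inclusive comparisons.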

The scoring formula is

p_score(T_j) = index_score(T_j) * pos_score(T_j)

where N is the number of words the title contains, and dpos(T_j) is the part-of-speech grade of the word T_j.
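The product form of p_score can be illustrated with a short sketch. The definitions of index_score and pos_score are not reproduced in this text, so the sketch takes them as parameters; `toy_index_score` and `toy_pos_score` are placeholders invented purely for demonstration and are not the patent's formulas. Keeping the best score for a repeated word is likewise an assumption.

```python
from typing import Callable, Dict, List, Tuple


def score_title(
    terms: List[Tuple[str, str]],
    index_score: Callable[[int, int], float],
    pos_score: Callable[[str], float],
) -> Dict[str, float]:
    """Return p_score = index_score * pos_score for each word of one title.

    terms: (word, part_of_speech) pairs in title order, after filtering.
    index_score(i, n): position score of the i-th of n words (1-based).
    pos_score(tag): part-of-speech score.
    """
    n = len(terms)
    scores: Dict[str, float] = {}
    for i, (word, tag) in enumerate(terms, start=1):
        p = index_score(i, n) * pos_score(tag)
        # Assumption: a word repeated in the title keeps its best score.
        scores[word] = max(p, scores.get(word, 0.0))
    return scores


# Placeholder scorers, NOT the patent's formulas: the two ends of the
# title and nouns get 1.0, everything else gets 0.5.
def toy_index_score(i: int, n: int) -> float:
    return 1.0 if i in (1, n) else 0.5


def toy_pos_score(tag: str) -> float:
    return 1.0 if tag == "noun" else 0.5
```

Passing the two scorers in as functions keeps the sketch independent of whatever concrete position and part-of-speech scores an implementation settles on.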

This gives the score of each word contained in the title of a web page. That score is only at the web page level, however; the score of a word within a website must also take into account the number of web pages the website contains and the number of those pages that contain the word.

This step completes the scoring of the words of a website, drawing both on the importance of words across websites and on the specifics of the pages containing each word and of the word within those pages. Because a website contains a huge number of words and not all of them are needed, only the top 1% of words by score are kept as the topic words of the website.
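The final selection (keeping the top 1% of a website's words by score) can be sketched as below. The function name is hypothetical; rounding the cut-off up and always keeping at least one word are assumptions, since the text does not say how fractional counts are handled.

```python
def select_topic_words(site_scores, percent=1):
    """site_scores: {word: site-level score}. Returns the top `percent`
    percent of words as (word, score) pairs, highest score first."""
    if not site_scores:
        return []
    # Round up with integer arithmetic; keep at least one word (assumption).
    keep = max(1, (len(site_scores) * percent + 99) // 100)
    ranked = sorted(site_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:keep]
```

For a website with 200 scored words this keeps the two best-scoring words.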

Step S104: build the stored resource dictionary.

The extracted website topic word information is built into a structured resource dictionary for fast and convenient lookup. The dictionary of the present invention is divided into two modules: a forward query module from website to topic words, and a reverse query module from topic word to websites.

Each module has two parts, a data area and a structure area. The structure area stores the objects that are queried directly, and the data area stores the data shared by the structure area.

The forward query module looks up the topic words corresponding to a website address. Its structure area stores website information, such as the website address, the number of topic words the website has, and the weights of the topic words; its data area holds topic word information, storing the topic word strings. For example, the two websites sina.com and sohu.com both have the topic word "portal". The structure area stores the information of the two websites, including the address, the number of topic words, the position of each topic word in the data area, and the weight of each topic word, while the data area stores words such as "portal" once, without repetition. A query for sina.com finds the topic word "portal", and a query for sohu.com finds the topic word "portal" as well.

The reverse query module looks up the website addresses covered by a topic word. Its structure area holds topic word information, including the topic word, the position of each corresponding website in the data area, and the weight in that website; its data area holds website information, storing the website address strings. For example, a query for "portal" outputs the websites whose topic is "portal", such as "sina.com" and "sohu.com".

Designing the storage structure this way both shares resources and uses space sensibly, reducing the loading cost at use time.
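A minimal in-memory sketch of this layout is given below. It only illustrates the idea of a structure area that points into a shared, de-duplicated data area; the class name, method names, and the use of Python lists and dicts are assumptions, and the weights in the usage example are illustrative. The patent does not specify a concrete storage format.

```python
class ResourceDictionary:
    """One query module: keys in the structure area, shared strings in the data area."""

    def __init__(self):
        self.data_area = []        # each distinct value string stored once
        self._position = {}        # value string -> position in data_area
        self.structure_area = {}   # key -> list of (position, weight)

    def add(self, key, value, weight):
        pos = self._position.get(value)
        if pos is None:            # first time this string is seen
            pos = len(self.data_area)
            self.data_area.append(value)
            self._position[value] = pos
        self.structure_area.setdefault(key, []).append((pos, weight))

    def lookup(self, key):
        entries = self.structure_area.get(key, [])
        return [(self.data_area[pos], weight) for pos, weight in entries]


def build_modules(records):
    """records: iterable of (site, topic_word, weight) triples."""
    forward, reverse = ResourceDictionary(), ResourceDictionary()
    for site, word, weight in records:
        forward.add(site, word, weight)   # website -> topic words
        reverse.add(word, site, weight)   # topic word -> websites
    return forward, reverse
```

With records for sina.com and sohu.com that both carry "portal", the forward module's data area holds the string "portal" only once, while each website's structure entry refers to it by position.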

The structure of the two dictionary modules is visualized in Fig. 2.

Step S105: establish the website topic query interface.

The website topic word information is saved in the resource dictionary to make queries fast and convenient. A query interface is exposed externally; when a keyword is input, different query interfaces perform queries against different modules.

Following the design of the resource dictionary above, the present invention provides two query interfaces: forward query and reverse query. A forward query looks up the topic words of a website and their weights by website address; a reverse query looks up the websites covered by a topic word and their weights by topic word.

Two query modes are provided so that different query terms can be handled flexibly. In practice, the different modes serve different purposes. Offline, when doing web page quality analysis or scoring the Terms of a web page, the forward query can be used and the returned topic words given special treatment. Online, when analyzing user search terms, the reverse query can be used: if the user's search terms include a topic word, the websites covered by that topic word can be considered for preferential display, which can improve the professionalism and authority of the search results to some extent.
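The online use of the reverse query can be sketched as a simple re-ranking step. The text only says that covered websites "can be considered for preferential display"; adding the topic weight to a base score, the `bonus` parameter, and the data layout are all assumptions made for illustration.

```python
def boost_topic_sites(query_terms, results, reverse_index, bonus=1.0):
    """Re-rank results so websites covered by a topic word in the query move up.

    results: list of (site, base_score) pairs.
    reverse_index: {topic_word: {site: weight}}.
    """
    boost = {}
    for term in query_terms:
        for site, weight in reverse_index.get(term, {}).items():
            # If several query terms cover a site, keep the largest weight.
            boost[site] = max(boost.get(site, 0.0), weight)
    rescored = [(site, score + bonus * boost.get(site, 0.0))
                for site, score in results]
    # sorted() is stable, so ties keep their original order.
    return sorted(rescored, key=lambda item: -item[1])
```

An additive boost is only one option; an implementation could equally pin covered websites to the top or use the weight as a tie-breaker.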

A simple practical example: for a forward query the query mode is 1. For the input "sina.com", the query result is "portal 0.8; Sina website 0.8; Sina 0.8" and so on. In offline web page analysis, topic words of this kind deserve attention for the pages of the sina.com website.

For a reverse query the query mode is 2. For the input "literature", the query result is "rongshuxia.com 0.4; tianyibook.com 0.4; d5wx.com 0.3" and so on. In online analysis of user searches, when a user's need shows a "literature" intent, results from these websites can be returned or ranked first, giving the user more authoritative results.
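The two query modes can be sketched as one function that dispatches on the mode number. The mode values 1 and 2 and the sample data come from the examples above; the function signature, the dict-based indexes, and the ordering of the output (weight descending, then name) are assumptions.

```python
FORWARD, REVERSE = 1, 2  # query modes used in the examples above


def query(mode, key, forward_index, reverse_index):
    """Return (name, weight) pairs for a website address or a topic word.

    forward_index: {site: {topic_word: weight}}
    reverse_index: {topic_word: {site: weight}}
    """
    if mode == FORWARD:
        table = forward_index
    elif mode == REVERSE:
        table = reverse_index
    else:
        raise ValueError(f"unknown query mode: {mode}")
    hits = table.get(key, {})
    # Assumption: highest weight first, names break ties alphabetically.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


forward_index = {"sina.com": {"portal": 0.8, "Sina website": 0.8, "Sina": 0.8}}
reverse_index = {"literature": {"rongshuxia.com": 0.4,
                                "tianyibook.com": 0.4,
                                "d5wx.com": 0.3}}
```

Calling `query(2, "literature", forward_index, reverse_index)` then returns the three literature websites with their weights, and an unknown key yields an empty list.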

As shown in Fig. 2, a structural diagram of the second embodiment of the present invention, a system for establishing website topic word query is provided, comprising:

an acquisition module 201, for obtaining web page data;

a statistics module 202, for computing the cross-site importance of words from the web page data;

an extraction module 203, for extracting website topic words from the web page data;

a building module 204, for building a stored resource dictionary from the extracted website topic word information;

an interface module 205, for establishing a website topic query interface.

The acquisition module, for obtaining web page data, is specifically used to obtain data of the web pages that a website contains, mainly the title and the URL of each web page.

The statistics module, for computing the cross-site importance of words from the web page data, is specifically used to segment the web page titles into a series of words, filter the words by part of speech, and output each word together with its computed inverse site frequency (ISF) value as the measure of importance.

The extraction module, for extracting website topic words from the web page data, is used to segment the titles of web pages into a series of words (Terms), filter the words by part of speech, score the words that pass the above steps, and select words as the topic words of the website according to the scores.

The building module, for building a stored resource dictionary from the extracted website topic word information, is used to build a structured resource dictionary from that information, including a forward query module from website to topic words, and/or a reverse query module from topic word to websites.

Existing search engines understand and analyze their data insufficiently, so their search results cannot meet higher demands in some specialized domains. The technical solution provided by the present invention draws both on the importance of words across websites and on the specifics of the pages containing each word and of the word within those pages, to give each word a score within a website. To highlight important words, the top 1% of words by score in a website can be taken as its topic words and saved as a clearly structured resource dictionary for offline and online use. Offline, it can serve as a reference when processing important words in web page analysis; online, it can be applied to users' search terms to return results from more authoritative and more professional websites. The method is simple to carry out and quick to update, and helps meet the demands of specific industries and specific users for specific information or services.

The above description shows and describes preferred embodiments of the present invention. As stated above, it should be understood that the invention is not limited to the forms disclosed here, which should not be taken as excluding other embodiments; the invention can be used in various other combinations, modifications, and environments, and can be altered within the scope of the inventive concept described here through the above teachings or the skill or knowledge of the relevant art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.

Claims

1. A method for establishing website topic word query, characterized by comprising:

obtaining web page data;

computing the cross-site importance of words from the web page data, specifically comprising: segmenting web page titles into a series of words, filtering the words by part of speech, and outputting each word together with its computed inverse site frequency (ISF) value as the measure of importance;

extracting website topic words from the web page data, specifically comprising: segmenting the titles of web pages into a series of words (Terms); filtering the words by part of speech; scoring the words that pass the above steps; and selecting words as the topic words of the website according to the scores; wherein scoring the words comprises scoring the words in a web page title, with the scoring formula

p_score(T_j) = index_score(T_j) * pos_score(T_j)

where index_score(T_j) is the position score of the word T_j, and pos_score(T_j) is the part-of-speech score of the word T_j;

and further comprises scoring the words within a website,

where s_score(T_j) is the score of the word T_j in a given website, page_num(T_j) is the number of web pages in that website whose titles contain the word T_j, and page_num is the number of web pages the website contains;

establishing a website topic query interface.

2. The method according to claim 1, characterized in that obtaining the web page data comprises:

obtaining data of the web pages that a website contains, mainly the title and the URL of each web page.

3. The method according to claim 2, characterized in that the inverse site frequency (ISF) value is defined as

ISF(T_j) = log(N / n_j)

where n_j is the number of websites containing the word T_j, and N is the total number of websites.

4. The method according to claim 3, characterized in that building the stored resource dictionary comprises:

building a structured resource dictionary from the extracted website topic word information, including a forward query module from website to topic words, and/or a reverse query module from topic word to websites.

5. The method according to claim 4, characterized in that the forward query module and the reverse query module each include a data area and a structure area, the structure area storing the objects that are queried directly and the data area storing the data shared by the structure area.

6. The method according to claim 5, characterized in that the website topic query interface includes a forward query and a reverse query; the forward query looks up the topic words of a website and their weights by website address; the reverse query looks up the websites covered by a topic word and their weights by topic word.

7. A system for establishing website topic word query, characterized by comprising:

an acquisition module, for obtaining web page data;

a statistics module, for computing the cross-site importance of words from the web page data, specifically by segmenting web page titles into a series of words, filtering the words by part of speech, and outputting each word together with its computed inverse site frequency (ISF) value as the measure of importance;

an extraction module, for extracting website topic words from the web page data, specifically by segmenting the titles of web pages into a series of words (Terms); filtering the words by part of speech; scoring the words that pass the above steps; and selecting words as the topic words of the website according to the scores; wherein scoring the words comprises scoring the words in a web page title, with the scoring formula

p_score(T_j) = index_score(T_j) * pos_score(T_j)

and further comprises scoring the words within a website;

an interface module, for establishing a website topic query interface.