CN105095175A - Method and device for obtaining truncated web title - Google Patents

Publication number
CN105095175A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410158987.XA
Other languages
Chinese (zh)
Other versions
CN105095175B (en)
Inventor
商胜
徐俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201410158987.XA
Publication of CN105095175A
Application granted
Publication of CN105095175B
Legal status: Active

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for obtaining a truncated web page title. The method comprises: obtaining web page uniform resource locator (URL) information and the to-be-truncated web page title mapped to it; and processing the to-be-truncated title so that only the part reflecting the page content is retained. The processing comprises one or any combination of the following: performing word segmentation on the title and removing meaningless words; querying a preset web page title matching library to obtain the matching rule corresponding to the URL information, and processing the title according to that rule to obtain the truncated title; and truncating the title with a general rule. The preset web page title matching library comprises a web page whitelist library and/or a web page title template library and/or a web page title prefix/suffix recognition library. The method and device can effectively improve the removal of redundancy from web page titles.

Description

Method and device for obtaining a truncated web page title
Technical field
The present invention relates to browser display processing technology, and in particular to a method and a device for obtaining a truncated web page title.
Background
At present, because of browser interface layout constraints, the display area used to show the web page titles stored in a browser's favorites bar or favorites folder is relatively limited, yet the titles shown there are what let the user learn about the corresponding page (website). How to make the stored titles convey as much information as possible within this limited display area, so that the user obtains more useful information about each page and has a better experience, has therefore become a technical problem in urgent need of a solution. A web page title is a phrase that summarizes the page content; it is a highly condensed form of that content and can give the user refined, useful information about the page.
In existing browsers, the title of a page that the user adds to favorites is generally extracted automatically from the title (Title) at the top of the page. For example, for the web page uniform resource locator (URL, Uniform Resource Locator) information www.sohu.com, the browser automatically takes the title set at the top of the page, "Go to Sohu, watch the Olympics", as the title of www.sohu.com and stores it in favorites. The user can of course also edit the stored title manually as needed.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and a device for obtaining a truncated web page title that overcome, or at least partially solve, those problems.
According to one aspect of the present invention, a method for obtaining a truncated web page title is provided, the method comprising:
Obtaining web page URL information and the to-be-truncated web page title mapped to that URL information;
Processing the to-be-truncated web page title so that only the part that reflects the page content is retained;
The processing of the to-be-truncated web page title comprises one or any combination of the following: performing word segmentation on the title and removing meaningless words; querying a preset web page title matching library to obtain the matching rule corresponding to the to-be-truncated web page URL information, and processing the to-be-truncated title according to that rule to obtain the truncated web page title; and truncating the title using a general rule;
The web page title matching library comprises a web page whitelist library and/or a web page title template library and/or a web page title prefix/suffix recognition library.
According to another aspect of the present invention, a device for obtaining a truncated web page title is provided, comprising a truncation request processing module and a truncated web page title acquisition module, wherein:
The truncation request processing module is configured to obtain, from a received request for web page title truncation, the to-be-truncated web page URL information and the web page title mapped to that URL information;
The truncated web page title acquisition module is configured to query a preset web page title matching library, obtain the matching rule corresponding to the to-be-truncated web page URL information, and process the to-be-truncated web page title according to that rule to obtain the truncated web page title. The web page title matching library comprises a web page whitelist library and/or a web page title template library and/or a web page title prefix/suffix recognition library.
According to the method and device of the present invention, a web page title is truncated on the basis of the input web page URL information and title, using a pre-built web page whitelist library and/or web page title template library and/or web page title prefix/suffix recognition library and/or general truncation rules. This solves the technical problem that the truncated titles obtained by existing extraction methods still contain descriptive expressions and prefixes or suffixes. The prefixes, suffixes and descriptive expressions in a title can be removed effectively, giving good redundancy removal, so that the truncated title fits the browser display area and provides more useful information to the user, which improves the user experience.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a schematic flow chart of the method for obtaining a truncated web page title according to an embodiment of the present invention;
Fig. 2 shows a schematic structural diagram of the device for obtaining a truncated web page title according to an embodiment of the present invention; and
Fig. 3 shows a detailed schematic flow chart of the method for obtaining a truncated web page title according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.
With the development of network technology, in order to provide more useful information to the user while fitting the browser display area, non-essential information contained in the web page titles stored in favorites also needs to be filtered out. That is, key words are extracted from the title so as to truncate it, so that as much useful information as possible is given to the user within the limited display area.
In one embodiment, the obtained web page title can be processed by word segmentation: the title is first segmented into words, meaningless words are then removed from the segmented words, and the remaining words are finally recombined to give the truncated web page title.
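The segment, filter and recombine steps above can be sketched as follows. This is a minimal illustration only: the whitespace-based `segment` function and the `MEANINGLESS_WORDS` set are hypothetical stand-ins, since the patent names neither a segmenter nor a word list, and real Chinese titles would need a proper word segmenter. The English title is a stand-in for the Chinese original.

```python
# Hypothetical stop list; the patent does not enumerate its "meaningless words".
MEANINGLESS_WORDS = {"the", "a", "an", ","}


def segment(title):
    """Stand-in segmenter: isolates commas and splits on whitespace."""
    return title.replace(",", " , ").split()


def truncate_by_segmentation(title):
    """Segment the title, drop meaningless words, and recombine the rest."""
    tokens = segment(title)
    kept = [tok for tok in tokens if tok.lower() not in MEANINGLESS_WORDS]
    return " ".join(kept)


if __name__ == "__main__":
    print(truncate_by_segmentation("Go to Sohu, watch the Olympics"))
```

Note that a descriptive phrase such as "Go to" is not in the word list, so it survives this kind of processing unchanged.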
In practice, segmenting the title and removing meaningless words from the result cannot effectively remove information in the title that is irrelevant to the user. For example, after segmentation, meaningless-word removal and recombination, the title "Go to Sohu, watch the Olympics" is still "Go to Sohu, watch the Olympics", yet "Go to" and "watch the Olympics" may be useless to the user, so the amount of useful information shown in the limited display area is reduced and the user experience suffers. As another example, for the title "Welcome to Sohu", the truncated title obtained by existing methods is still "Welcome to Sohu", where "Welcome to" is a descriptive expression that gives the user no useful information. Because such descriptive expressions remain in the truncated title, the title may not fit the browser display area, and it also provides less useful information to the user; the redundancy removal is poor. Preferably, the embodiment of the present invention proposes a web page title truncation technique that retains as much of each title's useful information as possible, namely the method for obtaining a truncated web page title. By building a web page whitelist library and/or a web page title template library and/or a web page title prefix/suffix recognition library and/or general truncation rules, the title is truncated so that it contains more refined keywords or key phrases and the information irrelevant to the user is removed, so that the title fits the browser display area and provides more useful information to the user.
Fig. 1 shows a schematic flow chart of the method for obtaining a truncated web page title according to an embodiment of the present invention. Referring to Fig. 1, the flow comprises:
Step 101: obtain the to-be-truncated web page URL information and the web page title mapped to that URL information;
In this step, unlike existing techniques that truncate using the web page title alone, the embodiment of the present invention also obtains and uses the page's URL information when obtaining the title. This allows more effective truncation and allows matching against the web page whitelist library and/or web page title template library and/or web page title prefix/suffix recognition library proposed here. Also unlike the prior art, in this embodiment the to-be-truncated web page title may be an invalid title that does not express the page's subject, such as an empty title or a URL.
This step specifically comprises:
Receiving a request for web page title truncation;
In this step, a user browsing a web page who decides to add the page to favorites clicks the add-to-favorites item in the favorites drop-down menu on the page's display interface, which triggers title truncation. The browser extracts the title of the page being browsed, which is the to-be-truncated web page title, encapsulates the extracted title and the page's URL information (the to-be-truncated web page URL information) in a title truncation request, and sends the request to the server. Alternatively, a user who wants to optimize titles already stored in favorites (to-be-truncated web page titles) clicks the rename item in the favorites drop-down menu, which triggers title truncation. The user can select the titles to be truncated, and the browser encapsulates the selected titles and the corresponding URL information in a title truncation request and sends it to the server. If the user has selected several titles, each title and its URL information form a mapping in the request.
Parsing the title truncation request to obtain the to-be-truncated web page title and the to-be-truncated web page URL information.
In this step, on receiving the title truncation request, the server decapsulates and parses it to obtain the web page title and the web page URL information carried in the request.
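The parsing step in Step 101 might look like the sketch below. The patent does not specify a wire format, so the JSON layout, the `items`, `url` and `title` field names, and the function name are all assumptions made purely for illustration.

```python
import json


def parse_truncation_request(raw):
    """Return the (url, title) pairs carried in a truncation request.

    Assumed, hypothetical payload shape:
    {"items": [{"url": "...", "title": "..."}, ...]}
    A missing title is returned as an empty string, since the
    to-be-truncated title may be empty.
    """
    payload = json.loads(raw)
    return [(item["url"], item.get("title", "")) for item in payload["items"]]


if __name__ == "__main__":
    request = '{"items": [{"url": "www.sohu.com", "title": "Welcome to Sohu"}]}'
    for url, title in parse_truncation_request(request):
        print(url, "->", title)
```

Keeping each title paired with its URL mirrors the mapping that the request is described as carrying when several titles are selected.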
Step 102: query the preset web page title matching library, obtain the matching rule corresponding to the to-be-truncated web page URL information, and process the to-be-truncated web page title according to that rule to obtain the truncated web page title. The web page title matching library comprises a web page whitelist library and/or a web page title template library and/or a web page title prefix/suffix recognition library and/or general truncation rules, wherein:
The web page whitelist library stores the truncated web page title corresponding to each piece of web page URL information;
The web page title template library stores the regular-expression truncation rules corresponding to web page URL information;
The web page title prefix/suffix recognition library stores a title prefix/suffix list and/or prefix/suffix recognition rules. The prefix/suffix recognition rule is a term frequency-inverse document frequency (TF-IDF) calculation strategy configured for identifying title prefixes and suffixes, and is described in detail later.
In this step, as a preferred embodiment, the web page title matching library can also be loaded into a cache in advance.
In the embodiment of the present invention, suppose the web page title matching library holds the web page whitelist library, the web page title template library and the web page title prefix/suffix recognition library. Matching against the whitelist library takes the least time and quickly filters out the titles that are not covered by the whitelist, which reduces subsequent processing. Matching against the template library takes longer, and matching against the prefix/suffix recognition library takes the longest. Thus, if the three are combined to truncate a title, the preferred matching order is: web page whitelist library, then web page title template library, then web page title prefix/suffix recognition library.
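The preferred lookup order can be sketched as a simple cascade. The data structures here are assumptions: the whitelist as a dict from URL to title, the template library as a list of (URL pattern, title pattern) regular-expression pairs, and the prefix/suffix library as two plain lists. None of the names or sample entries come from the patent.

```python
import re


def strip_affixes(title, prefixes, suffixes):
    """Remove at most one known prefix and one known suffix, then tidy separators."""
    for prefix in prefixes:
        if prefix and title.startswith(prefix):
            title = title[len(prefix):]
            break
    for suffix in suffixes:
        if suffix and title.endswith(suffix):
            title = title[:-len(suffix)]
            break
    return title.strip(" -_|,")


def truncate_title(url, title, whitelist, templates, prefixes, suffixes):
    """Try the cheapest library first: whitelist, then templates, then affixes."""
    if url in whitelist:                      # 1. whitelist: direct lookup
        return whitelist[url]
    for url_pattern, title_pattern in templates:  # 2. template: regex capture
        if re.search(url_pattern, url):
            match = re.search(title_pattern, title)
            if match and match.group(1):
                return match.group(1)
    return strip_affixes(title, prefixes, suffixes)  # 3. prefix/suffix removal


if __name__ == "__main__":
    print(truncate_title("www.sohu.com", "Welcome to Sohu",
                         {"www.sohu.com": "Sohu"}, [], [], []))
```

Ordering the checks by cost means the dictionary lookup short-circuits the regular-expression and affix work whenever the URL is already in the whitelist.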
Generating the web page whitelist library comprises:
A11: extract each piece of web page URL information contained in users' favorites, together with the web page title mapped to it;
In this step, the web page titles set by users and the corresponding web page URL information are extracted from user favorites that contain web page titles (stored page titles).
In the embodiment of the present invention, when a user visits pages with the Sogou browser, the browser can store the user's page access records, for example the title the user has set for a page and the page's URL information. By extracting the favorites data in the Sogou browser, the server can obtain a large amount of web page URL information and the titles that individual users have set for it. Web page URL information and a web page title form a mapping (a title pair), and different users may set different titles for the same URL information. The same URL information may therefore be mapped to a large number of different titles set by different users.
A12: for each piece of web page URL information, obtain all web page titles mapped to it, and count the number of users corresponding to each of those titles;
In this step, because users differ, each piece of web page URL information is mapped to different web page titles. In the embodiment of the present invention, for each piece of URL information, the number of users corresponding to each mapped title is counted separately. For example, for the URL information www.sohu.com, the mapped titles include "Go to Sohu, watch the Olympics", "Welcome to Sohu", "Sohu" and "Sohu official website". According to the statistics, "Go to Sohu, watch the Olympics" corresponds to 10,000 users, that is, 10,000 users have set the title mapped to www.sohu.com to "Go to Sohu, watch the Olympics"; "Welcome to Sohu" corresponds to 15,000 users; "Sohu" corresponds to 50,000 users; and "Sohu official website" corresponds to 25,000 users.
A13: apply the web page titles and their user counts to a preset web page whitelist calculation strategy to obtain the weight value of each web page title;
In this step, in one embodiment, the whitelist calculation strategy can be a strategy based on user count, in which case the title weight value is the number of users. With the figures from step A12, the weight value of "Go to Sohu, watch the Olympics" is 10,000, that of "Welcome to Sohu" is 15,000, that of "Sohu" is 50,000, and that of "Sohu official website" is 25,000.
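Under the user-count strategy, the weight of a title is simply its user count, and the best candidate is the title with the largest count. A minimal sketch using the figures from step A12 follows; the English titles stand in for the Chinese originals, and the function name is illustrative.

```python
# User counts per title for www.sohu.com, taken from the example in step A12.
USER_COUNTS = {
    "Go to Sohu, watch the Olympics": 10_000,
    "Welcome to Sohu": 15_000,
    "Sohu": 50_000,
    "Sohu official website": 25_000,
}


def best_title_by_user_count(user_counts):
    """Return the title whose weight (here, its user count) is largest."""
    return max(user_counts, key=user_counts.get)


if __name__ == "__main__":
    print(best_title_by_user_count(USER_COUNTS))
```

With these counts, the title chosen for www.sohu.com is "Sohu".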
In another embodiment, the weight of the field to which each user belongs can also be taken into account. For web page URL information in a given field, the titles chosen by users in that field should be more accurate than the titles chosen for the same URL information by users outside the field; that is, titles named by users in the field are more likely to be widely adopted. For example, a user in the mechanical field should name pages in the mechanical field more accurately than users from other fields do. The whitelist calculation strategy can therefore be a strategy based on preset per-user field weights. Field weights are set for each user in advance; for example, a given user may be assigned a mechanical field weight of 0.5, an electrical field weight of 0.3, a communications field weight of 0.2, and so on. The field a user belongs to can be determined by feature matching on user tags, which is a known technique and is not described in detail here. In this case, applying the web page titles and their user counts to the preset whitelist calculation strategy to obtain the title weight values comprises:
B11: extract the feature words contained in the web page titles mapped to the web page URL information, match them against each preset field feature lexicon, and determine the field to which the URL information belongs;
In this step, for example, one of the titles mapped to the URL information, "Welcome to Sohu", is chosen at random, and the feature word extracted from it is "Sohu". If the feature lexicon of the communications field contains the feature word "Sohu", the field to which the title belongs is the communications field.
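Step B11 can be sketched as a lexicon lookup. The lexicon contents below are invented for illustration, and picking the field with the largest overlap is an assumption; the patent only states that a field is chosen when its feature lexicon contains the extracted feature word.

```python
# Hypothetical field feature lexicons; the entries are illustrative only.
FIELD_LEXICONS = {
    "communications": {"Sohu", "portal", "network"},
    "mechanical": {"gear", "bearing", "lathe"},
}


def field_of(feature_words):
    """Return the field whose lexicon shares the most feature words, or None."""
    words = set(feature_words)
    overlaps = {name: len(lexicon & words) for name, lexicon in FIELD_LEXICONS.items()}
    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] > 0 else None


if __name__ == "__main__":
    print(field_of(["Sohu"]))
```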
B12: according to the field weights set for each user in advance, obtain, for each web page title mapped to the web page URL information, the field weight that each of its users has in the field determined for that URL information;
In this step, take "Go to Sohu, watch the Olympics" as an example. Of its 10,000 users, 2,000 have a communications field weight of 0.2, 3,000 have a communications field weight of 0.3, 1,000 have a communications field weight of 0.6, and 4,000 have a communications field weight of 0.9. The other titles mapped to the URL information www.sohu.com are counted in the same way.
B13: apply the number of users of each web page title, and the users' field weights in the field determined for the URL information, to a preset weight calculation formula to obtain the title weight value.
In this step, the weight calculation formula can be a total weight formula or a relative weight formula. The total weight formula is as follows:
$$X_i = \sum_{j=1}^{K} U_{i,j}\,\xi_{i,j}$$
In the formula,
$X_i$ is the weight value of the i-th web page title, where i is a natural number;
$U_{i,j}$ is the j-th user corresponding to the i-th web page title;
$\xi_{i,j}$ is the field weight of the j-th user of the i-th web page title in the field to which the title belongs;
$K$ is the total number of users corresponding to the i-th web page title, and K is a natural number.
The relative weight formula is as follows:
$$X_i = \frac{1}{K}\sum_{j=1}^{K} U_{i,j}\,\xi_{i,j}$$
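As a worked illustration of the two formulas, the sketch below uses the user groups given in step B12 for the title "Go to Sohu, watch the Olympics". It assumes that each user contributes $U_{i,j} = 1$, which the text does not state explicitly, and it groups users that share a field weight so the sum can be taken per group. Function names are illustrative.

```python
# (number of users, communications field weight) groups from step B12.
GROUPS = [(2000, 0.2), (3000, 0.3), (1000, 0.6), (4000, 0.9)]


def total_weight(groups):
    """Total weight: sum of field weights over all users (assuming U = 1 per user)."""
    return sum(count * weight for count, weight in groups)


def relative_weight(groups):
    """Relative weight: total weight divided by K, the number of users."""
    groups = list(groups)
    k = sum(count for count, _ in groups)
    return total_weight(groups) / k if k else 0.0


if __name__ == "__main__":
    print(total_weight(GROUPS), relative_weight(GROUPS))
```

For these groups the total weight comes to about 5,500 (400 + 900 + 600 + 3,600) and the relative weight to about 0.55. The relative form keeps a title with many low-weight users from outranking one chosen by fewer, more relevant users.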
In a further embodiment, a priority coefficient can be set for each web page title in advance and combined with the field weights of the users who set that title when calculating the title weight value. That is, the whitelist calculation strategy can be a strategy based on preset per-user field weights combined with title priority coefficients. The method further comprises:
Multiplying the obtained title weight value by the title priority coefficient to give the final output title weight value.
In this step, taking the total weight formula as an example, the final output title weight value is calculated as follows:
$$X_i = \psi_i \left( \sum_{j=1}^{K} U_{i,j}\,\xi_{i,j} \right)$$
In the formula,
$\psi_i$ is the priority coefficient of the i-th web page title.
In the embodiment of the present invention, the title priority coefficients are set manually. For example, the web page titles in the Sogou browser are obtained and a corresponding priority coefficient is set for each of them.
A14: for the same web page URL information, select the web page title with the largest title weight value, take it as the title mapped to that URL information, and place it in the web page whitelist library.
In this step, for the same web page URL information, after the weight value of each mapped title has been calculated, the title with the largest weight value is selected as the title mapped to that URL information in the whitelist library. The largest title weight value can be either the largest total weight value or the largest relative weight value: the title with the largest total weight value can be chosen as the title mapped to the URL information, or the title with the largest relative weight value can be chosen instead.
In another embodiment, for the same web page URL information, the title weight values can be sorted by size and the titles with the top N weight values selected, so that each piece of URL information is mapped to N titles, which are placed in the whitelist library, where N is a natural number. That is, in the web page whitelist library each piece of URL information is mapped to N web page titles, and N can be chosen as needed.
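Selecting the top N titles, optionally after multiplying by a priority coefficient, can be sketched as below. The weights are the user counts from step A12; the priority coefficient value in the test is invented for illustration, and the function name is an assumption.

```python
import heapq


def top_n_titles(weights, n, priority=None):
    """Return the n titles with the largest final weight, largest first.

    weights:  dict mapping title -> weight value
    priority: optional dict mapping title -> priority coefficient (default 1.0)
    """
    priority = priority or {}
    final = {title: w * priority.get(title, 1.0) for title, w in weights.items()}
    return heapq.nlargest(n, final, key=final.get)


if __name__ == "__main__":
    weights = {
        "Go to Sohu, watch the Olympics": 10_000,
        "Welcome to Sohu": 15_000,
        "Sohu": 50_000,
        "Sohu official website": 25_000,
    }
    print(top_n_titles(weights, 2))
```

With N = 1 this reduces to the single-title selection of step A14.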
In practice, the titles obtained by the above method are selected according to user behavior, and a title mapped to URL information in a user's favorites may not reflect the page accurately. In the web page navigation data provided by navigation websites, by contrast, the pages have been summarized by professionals, so the titles are more refined and contain more useful information. Thus, in the embodiment of the present invention, after the web page whitelist library has been generated, the method can further comprise:
C11: obtain web page navigation data, and extract the web page URL information contained in it together with the web page title mapped to that URL information;
In this step, web page navigation data can be fetched from navigation websites by a web crawler and parsed to extract the web page URL information and the mapped web page titles. Crawling navigation data and extracting titles and URL information are known techniques and are not described in detail here.
C12: traverse each piece of extracted web page URL information and query whether it exists in the web page whitelist library. If it does not exist, write the URL information and its mapped title into the whitelist library. If it exists, obtain the title mapped to that URL information from the extracted data and from the whitelist library respectively, compare the two, and decide whether to update the title mapped to that URL information in the whitelist library.
In the embodiment of the present invention, the amount of web page URL information provided in navigation data is relatively limited and cannot cover all URL information on a large scale, so this method serves as a useful supplement to the web page whitelist library. The navigation data of each navigation website is crawled, the URL information and mapped titles are extracted, and for each piece of URL information the title stored in the whitelist library is compared with the title from the crawled navigation data, so that the title that expresses the page more accurately is chosen. If the title stored in the whitelist library is the more accurate one, nothing is done; if the title extracted from the navigation data is the more accurate one, the title stored in the whitelist library is updated.
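Step C12 amounts to a merge of two URL-to-title mappings. In the sketch below the comparison is delegated to an `is_better` callback, because the patent does not say how the "more accurate" title is judged; the shorter-is-better rule in the usage example is only a placeholder assumption.

```python
def merge_navigation_titles(whitelist, navigation_titles, is_better):
    """Supplement the whitelist with titles extracted from navigation data.

    whitelist, navigation_titles: dicts mapping URL -> title
    is_better(candidate, current): hypothetical predicate deciding whether the
    navigation title should replace the stored one.
    """
    for url, nav_title in navigation_titles.items():
        current = whitelist.get(url)
        if current is None:
            whitelist[url] = nav_title          # URL not yet in the whitelist
        elif is_better(nav_title, current):
            whitelist[url] = nav_title          # navigation title judged better
    return whitelist


if __name__ == "__main__":
    stored = {"www.sohu.com": "Sohu official website"}
    crawled = {"www.sohu.com": "Sohu", "www.example.com": "Example"}
    shorter = lambda candidate, current: len(candidate) < len(current)
    print(merge_navigation_titles(stored, crawled, shorter))
```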
This completes the flow for generating the web page whitelist library.
Generating the web page title template library comprises:
Setting in advance a classification strategy for the web page titles mapped to web page URL information, and setting a corresponding regular expression for each class of web page title.
In this step, although each website has a large number of web page titles, the titles in a large body of title data can be classified according to the preset classification strategy. The classification strategy can be based on categories such as social sites and technical blogs; that is, titles are classified as social-site titles, technical-blog titles, and so on. A corresponding regular expression is set for each class of title, forming a web page title template.
Subsequently, once a title has been classified, the regular expression for its class is applied to it to extract the truncated web page title. For example, regular expressions for social-site titles and technical-blog titles are set in the template library in advance. After a title has been classified as a social-site title or a technical-blog title, it is cut according to the preset regular expression for its class, giving the corresponding truncated title.
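A template lookup of this kind can be sketched with Python's `re` module. Everything concrete here is an assumption for illustration: the two patterns, the host names, and the idea of classifying by host substring. The patent does not give its actual regular expressions or say how a title is assigned to a class.

```python
import re

# Hypothetical per-class templates: capture the part of the title to keep.
TEMPLATES = {
    "social": re.compile(r"^(.*?)\s*\|\s*.*$"),
    "tech_blog": re.compile(r"^(.*?)\s+-\s+.*$"),
}

# Hypothetical classification strategy: decide the class from the host name.
HOST_CLASSES = {
    "social.example.com": "social",
    "blog.example.com": "tech_blog",
}


def classify(url):
    """Return the title class for a URL, or None if no class applies."""
    for host, category in HOST_CLASSES.items():
        if host in url:
            return category
    return None


def truncate_with_template(url, title):
    """Apply the class's regular expression; fall back to the original title."""
    category = classify(url)
    if category is None:
        return title
    match = TEMPLATES[category].match(title)
    return match.group(1) if match and match.group(1) else title


if __name__ == "__main__":
    print(truncate_with_template("http://blog.example.com/p/1",
                                 "Notes on TF-IDF - Bob's Blog - ExampleBlog"))
```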
The regular expression for each class of web page title can be obtained by data mining on the classified web page titles; a detailed description is omitted here.
In this embodiment of the present invention, the beginning or end of a web page title often contains prefix/suffix information such as "homepage", "it is reported", "western media" or "focus", which is intended to make the title eye-catching, or prefix/suffix information that reflects the structure of the title but is unrelated to its topic. Removing this prefix/suffix information with the aforementioned regular expressions or the whitelist library is a cumbersome process. Therefore, in this embodiment, the title library (the stored web page titles mapped by webpage URL information) can be used for large-scale data analysis, and the TF-IDF method can be used for periodic data mining to extract the prefix/suffix information.
Generating the web page title prefix/suffix recognition library comprises:
Obtaining and storing the web page titles mapped by webpage URL information in user favorites;
Setting a term frequency-inverse document frequency (TF-IDF, Term Frequency-Inverse Document Frequency) calculation strategy for identifying the prefixes and suffixes of web page titles.
In this embodiment of the present invention, TF-IDF is a commonly used weighted statistical method for information retrieval. It evaluates how important a word is to one document in a document library (document set or corpus): the weight of the word increases in proportion to the number of times it occurs in the document, and decreases in inverse proportion to the frequency with which it occurs across the document library. The inverse document frequency is a measure of the general importance of a word.
The TF weight is calculated as:
$$TF = \frac{P_w}{P}$$
where:
$TF$ is the term frequency weight;
$P_w$ is the number of times the word $w$ occurs in the document library;
$P$ is the length of the document library, that is, the total number of words it contains.
The IDF weight is calculated as:
$$IDF = \log\frac{D}{D_w}$$
where:
$IDF$ is the inverse document frequency weight;
$D_w$ is the number of individuals (documents) in the sample (document library, document set or corpus) that contain the word $w$;
$D$ is the total number of samples, that is, the total number of documents.
A smaller IDF value indicates that more documents in the sample contain the word, so the word carries less information; a larger IDF value indicates that only a few documents in the sample contain the word, so the word carries more information.
Combining the term frequency and the inverse document frequency gives the TF-IDF weight:
$$\mathrm{Weight}_w = TF \times IDF = \frac{P_w}{P} \times \log\frac{D}{D_w}$$
where $\mathrm{Weight}_w$ is the TF-IDF weight of the word $w$.
A larger TF-IDF weight indicates that the word is more indicative.
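As a small numerical illustration of the weight defined above, the sketch below computes TF × IDF from the four counts. The function name `tf_idf_weight` and its parameter names are introduced here for illustration, and the natural logarithm is an assumption, since the source does not state the base of the logarithm.

```python
import math


def tf_idf_weight(p_w: int, p: int, d: int, d_w: int) -> float:
    """TF-IDF weight of a word w.

    p_w: occurrences of w in the document library
    p:   total number of words in the document library
    d:   total number of documents
    d_w: number of documents containing w
    """
    tf = p_w / p
    idf = math.log(d / d_w)  # natural log; the base is an assumption
    return tf * idf
```

A word that appears in every document gets an IDF of zero and therefore a weight of zero, while a word with the same term frequency that appears in only a few documents gets a larger weight.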
Obtaining the truncated web page title is described in detail below.
In this embodiment of the present invention, if the web page title matching library comprises the webpage whitelist library, then querying the preset web page title matching library to obtain the matching rule corresponding to the to-be-truncated webpage URL information, and processing the to-be-truncated web page title according to the obtained matching rule to obtain the truncated web page title, comprises:
Querying the webpage whitelist library to obtain the web page title mapped by the to-be-truncated webpage URL information, and using the obtained web page title as the truncated web page title.
In this step, if the to-be-truncated webpage URL information is not stored in the webpage whitelist library, the web page title can be truncated using existing techniques, which are not described again here.
If the web page title matching library comprises the web page title template library, then querying the preset web page title matching library to obtain the matching rule corresponding to the to-be-truncated webpage URL information, and processing the to-be-truncated web page title according to the obtained matching rule to obtain the truncated web page title, comprises:
D11, extracting the naming rule of the web page title mapped by the to-be-truncated webpage URL information, matching the extracted naming rule against the preset classification strategy, and obtaining the class to which that web page title belongs;
In this step, the class to which a web page title belongs can be determined by analyzing its naming rule. Classifying web page titles is a known technique, and a detailed description is omitted here.
As an embodiment, if the web page title mapped by the webpage URL information is invalid, that is, the web page title cannot reflect the web page content at all (for example, it is empty, containing no content or only symbols), the domain name of the webpage URL information can be returned as the truncated web page title.
D12, querying the web page title template library to obtain the regular expression corresponding to the class to which the web page title mapped by the to-be-truncated webpage URL information belongs;
In this step, if the web page title mapped by the to-be-truncated webpage URL information is classified as a social web page title, the regular expression set for social web page titles is read from the web page title template library.
D13, applying the obtained regular expression to the web page title mapped by the to-be-truncated webpage URL information to obtain the truncated web page title.
If the web page title matching library comprises the web page title prefix/suffix recognition library, then querying the preset web page title matching library to obtain the matching rule corresponding to the to-be-truncated webpage URL information, and processing the to-be-truncated web page title according to the obtained matching rule to obtain the truncated web page title, comprises:
E11, obtaining the web page title mapped by the to-be-truncated webpage URL information, and splitting the obtained web page title according to a preset splitting strategy to obtain one or more webpage subtitles;
In this step, the components of a web page title have certain characteristics when titles are collected: a title generally comprises a prefix (or descriptive expression), the title text, and one or more suffixes. Statistical analysis of these components shows that they can be distinguished by certain specific punctuation marks. Moreover, the title text is the information that is useful to the user and can be provided to the user as a whole.
Therefore, in this embodiment of the present invention, the splitting strategy may be to split according to preset punctuation marks contained in the web page title. For example, the preset punctuation marks may include _, -, +, &, #, :, |, ┊, ‖, ;, the comma, the period, and so on. If a web page title contains any of the preset symbols, the title is split at that symbol.
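A minimal sketch of such a punctuation-based splitting strategy is given below. The separator set is an assumption limited to a subset of the symbols listed above (the comma and period are left out here so that ordinary sentence punctuation inside the title text is not split), and the names `SEPARATORS` and `split_title` are introduced for illustration.

```python
import re
from typing import List

# Assumed separator set: a subset of the preset punctuation marks.
SEPARATORS = "_-+&#:|┊‖;"
_SPLIT_RE = re.compile("[" + re.escape(SEPARATORS) + "]")


def split_title(title: str) -> List[str]:
    """Split a title at every preset separator and drop empty pieces."""
    parts = _SPLIT_RE.split(title)
    return [part.strip() for part in parts if part.strip()]
```

Each returned piece corresponds to one webpage subtitle; a title containing none of the separators is returned as a single subtitle.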
E12, for each webpage subtitle, calculating the TF-IDF value of the webpage subtitle by using the web page titles mapped by the webpage URL information stored in the web page title prefix/suffix recognition library and the TF-IDF calculation strategy set in that library;
In this step, as an embodiment, before the TF-IDF value of each webpage subtitle is calculated, the method may further comprise:
Combining the multiple webpage subtitles obtained; for each combined webpage subtitle, calculating its TF-IDF value by using the web page titles mapped by the webpage URL information stored in the web page title prefix/suffix recognition library and the TF-IDF calculation strategy; and, when none of the combined webpage subtitles is a prefix/suffix, performing the calculation of the TF-IDF value of each individual webpage subtitle.
In this step, webpage subtitles may be combined as follows. Suppose that a web page title, after splitting, yields three webpage subtitles in order, A, B and C. Combining them yields two combined webpage subtitles, AB and BC. The prefix/suffix judgment is first performed on AB: if AB is a prefix/suffix, C is used as the truncated web page title. If AB is not a prefix/suffix, the judgment is performed on BC: if BC is a prefix/suffix, A is used as the truncated web page title. If BC is not a prefix/suffix either, the prefix/suffix judgment is performed on A, B and C individually.
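The three-subtitle example (A, B, C) can be sketched as follows. The predicate `is_affix`, which decides whether a combined subtitle is a prefix/suffix, is a hypothetical stand-in for the TF-IDF threshold test; representing a combination as a tuple of its parts is also an assumption, since the source does not say how combined subtitles are joined.

```python
from typing import Callable, Optional, Sequence, Tuple


def truncate_three_parts(
    parts: Sequence[str],
    is_affix: Callable[[Tuple[str, str]], bool],
) -> Optional[str]:
    """Check AB, then BC; return the remaining part, or None if neither hits."""
    a, b, c = parts
    if is_affix((a, b)):   # AB is a prefix/suffix, so C is the title
        return c
    if is_affix((b, c)):   # BC is a prefix/suffix, so A is the title
        return a
    return None            # caller falls back to judging A, B and C one by one
```

Returning `None` in the last case leaves the individual judgment of A, B and C to the caller, matching the fallback described above.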
In this embodiment of the present invention, the TF-IDF value of a webpage subtitle can be calculated as follows:
$$TFIDF = TF \times IDF = \frac{n'}{n} \times \log\left(\frac{D}{D'} + 1\right)$$
where:
$TF$ is the term frequency of the webpage subtitle;
$IDF$ is the inverse document frequency of the webpage subtitle;
$n'$ is the number of times the webpage subtitle occurs in the sample set;
$n$ is the total number of webpage subtitles in the sample set;
$D$ is the total number of documents in the sample set that contain the webpage subtitle;
$D'$ is the total number of documents in the sample set;
$+1$ is a smoothing term.
It should be noted that the TF-IDF value of a combined webpage subtitle is calculated in a similar way to the TF-IDF value of a single webpage subtitle; a detailed description is omitted here.
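The subtitle-level calculation can be sketched as below, under one loudly stated assumption: the formula is read as (n'/n) × log(D/D' + 1), with D the number of titles containing the subtitle and D' the total number of titles, as the symbol definitions state. The grouping of the "+1" inside the logarithm, the natural logarithm, and the function name `subtitle_tfidf` are assumptions; each title in the sample is represented as the list of its subtitles.

```python
import math
from typing import List


def subtitle_tfidf(subtitle: str, sample: List[List[str]]) -> float:
    """TF-IDF value of one subtitle over a sample of already-split titles."""
    n_prime = sum(parts.count(subtitle) for parts in sample)  # occurrences of the subtitle
    n = sum(len(parts) for parts in sample)                   # all subtitles in the sample
    d = sum(1 for parts in sample if subtitle in parts)       # titles containing the subtitle
    d_prime = len(sample)                                     # all titles in the sample
    if n == 0 or d_prime == 0:
        return 0.0
    # Assumed reading of the formula: the +1 is added to the ratio D / D'.
    return (n_prime / n) * math.log(d / d_prime + 1)
```

Under this reading, a subtitle that recurs across many titles (such as a site name) scores higher than a subtitle that occurs once, which is consistent with treating values above a threshold as prefixes/suffixes.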
E13, judging whether the calculated TF-IDF value is greater than a preset prefix/suffix threshold; if so, determining that the webpage subtitle is a prefix/suffix, filtering this prefix/suffix out of the web page title, and using the web page title with the prefix/suffix removed as the truncated web page title.
In this step, if the TF-IDF value of a webpage subtitle calculated in step E12 is greater than the preset prefix/suffix threshold, the webpage subtitle (as a whole) is a prefix/suffix and is deleted.
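The threshold test of step E13 can be sketched as a simple filter. The scoring function `tfidf` is passed in as a parameter and stands for whatever TF-IDF calculation strategy is configured; the name `filter_affixes` and the fallback for the case where every subtitle exceeds the threshold are assumptions made here, since the source does not cover that case.

```python
from typing import Callable, List


def filter_affixes(
    subtitles: List[str],
    tfidf: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Drop every subtitle whose TF-IDF value is greater than the threshold."""
    kept = [s for s in subtitles if tfidf(s) <= threshold]
    # Assumption: if everything would be filtered out, keep the original
    # subtitles rather than return an empty title.
    return kept if kept else list(subtitles)
```

The remaining subtitles, in their original order, make up the truncated web page title.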
Further, because each web editor writes web page titles in his or her own style or template, in practical applications, after the above prefix/suffix judgment is performed (that is, after step E13), performing prefix/suffix filtering again on each webpage subtitle contained in the truncated web page title can further improve the validity of the truncated web page title that is output. Thus, the method may further comprise:
E14, according to the to-be-truncated webpage URL information, extracting, from the web page titles mapped by the webpage URL information stored in the web page title prefix/suffix recognition library, the web page titles corresponding to the to-be-truncated webpage URL information;
E15, for each webpage subtitle of the truncated web page title, calculating the TF-IDF value of the webpage subtitle by using the extracted web page titles and the TF-IDF calculation strategy set in the web page title prefix/suffix recognition library;
In this step, for each webpage subtitle remaining after prefix/suffix filtering, the TF-IDF value is calculated using the web page titles, extracted from the web page title prefix/suffix recognition library, of the website to which the web page title belongs.
E16, judging whether the calculated TF-IDF value is greater than the preset prefix/suffix threshold; if so, determining that the webpage subtitle is a prefix/suffix, filtering this prefix/suffix out of the truncated web page title, and updating the truncated web page title.
In this embodiment of the present invention, steps E14 to E16 are the specific flow, mentioned above, of identifying the prefixes/suffixes of a web page title by using the prefix/suffix list and/or the prefix/suffix recognition rules stored in the web page title prefix/suffix recognition library.
In this embodiment of the present invention, in steps E14 to E16, the web page titles of all sites are used as the sample library, and the prefix/suffix judgment is then performed on each web page title within that sample library.
As another embodiment, web page titles may first be classified by site, for example into Sohu, Sina, 163, Netease and so on. The web page titles mapped by the webpage URL information of the corresponding class are then extracted from the web page title prefix/suffix recognition library, the TF-IDF values are calculated with the TF-IDF calculation strategy set in that library, and the prefix/suffix judgment is performed, thereby removing the prefixes/suffixes. Compared with the aforementioned case, in which the web page titles of all sites form the sample library, this embodiment uses the web page titles of each class of site as a sample library, determines the class to which the to-be-truncated webpage URL information belongs, and performs the prefix/suffix judgment on the corresponding web page title within the sample library of that class.
As another embodiment, the prefixes/suffixes mined by the TF-IDF method may also be stored in a web page title prefix/suffix library. In the subsequent flow, after a web page title is split, preliminary prefix/suffix matching is first performed against this library, and the prefixes/suffixes in the title that match the library are filtered out. The filtered web page title then undergoes the prefix/suffix judgment by the TF-IDF method, and the prefixes/suffixes identified in this way are added incrementally to the pre-stored web page title prefix/suffix library.
If the web page title matching library comprises the webpage whitelist library, the web page title template library and the web page title prefix/suffix recognition library, then querying the preset web page title matching library to obtain the matching rule corresponding to the to-be-truncated webpage URL information, and processing the to-be-truncated web page title according to the obtained matching rule to obtain the truncated web page title, comprises:
F11, querying the webpage whitelist library; if the web page title mapped by the to-be-truncated webpage URL information is obtained, using the obtained web page title as the truncated web page title; otherwise, performing step F12;
F12, extracting the naming rule of the web page title mapped by the to-be-truncated webpage URL information, matching the extracted naming rule against the preset classification strategy, and obtaining the class to which that web page title belongs;
F13, querying the web page title template library; if the regular expression corresponding to the class to which the web page title belongs is obtained, applying the obtained regular expression to the web page title mapped by the to-be-truncated webpage URL information to obtain the truncated web page title; otherwise, performing step F14;
F14, obtaining the web page title mapped by the to-be-truncated webpage URL information, and splitting the obtained web page title according to the preset splitting strategy to obtain one or more webpage subtitles;
F15, for each webpage subtitle, calculating the TF-IDF value of the webpage subtitle by using the web page titles mapped by the webpage URL information stored in the web page title prefix/suffix recognition library and the TF-IDF calculation strategy set in that library;
F16, judging whether the calculated TF-IDF value is greater than the preset prefix/suffix threshold; if so, determining that the webpage subtitle is a prefix/suffix, filtering this prefix/suffix out of the web page title, and using the web page title with the prefix/suffix removed as the truncated web page title.
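Steps F11 to F16 form a cascade of three lookups, which can be sketched as below. All names are hypothetical: `whitelist` maps URL to title, `classify` returns a class name or `None`, `templates` maps a class name to a regular expression whose first capture group is the text to keep, and `affix_filter` stands for the split-and-filter procedure of steps F14 to F16. Falling through to the prefix/suffix path when a regular expression exists but does not match is an assumption not stated in the source.

```python
import re
from typing import Callable, Dict, Optional


def truncate_title(
    url: str,
    title: str,
    whitelist: Dict[str, str],
    classify: Callable[[str], Optional[str]],
    templates: Dict[str, re.Pattern],
    affix_filter: Callable[[str], str],
) -> str:
    # F11: a whitelist hit returns the stored title directly.
    if url in whitelist:
        return whitelist[url]
    # F12: classify the title by its naming rule.
    category = classify(title)
    # F13: apply the regular expression of that class, if one is configured.
    pattern = templates.get(category) if category is not None else None
    if pattern is not None:
        match = pattern.match(title)
        if match:
            return match.group(1)
    # F14-F16: otherwise split the title and filter prefixes/suffixes.
    return affix_filter(title)
```

Ordering the lookups this way tries the cheapest and most reliable source first and reaches the statistical prefix/suffix filter only when the other two do not apply.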
As an embodiment, the method may further comprise:
Step 103, delivering the obtained truncated web page title to the user favorites and storing it.
In this step, the server may also deliver the obtained truncated web page title to the user for display; after the user chooses whether to modify it, the title is stored in the favorites according to the user's choice.
As another embodiment, the method may further comprise:
Truncating the to-be-truncated web page title using a preset general truncation rule. Truncation using the general truncation rule is described in detail later.
Fig. 2 is a schematic structural diagram of the device for obtaining a truncated web page title according to an embodiment of the present invention. Referring to Fig. 2, the device comprises a truncation request processing module and a truncated web page title acquisition module, wherein:
The truncation request processing module is configured to obtain, from a received web page title truncation request, the to-be-truncated webpage URL information and the web page title mapped by that webpage URL information;
The truncated web page title acquisition module is configured to query the preset web page title matching library, obtain the matching rule corresponding to the to-be-truncated webpage URL information, and process the to-be-truncated web page title according to the obtained matching rule to obtain the truncated web page title. The web page title matching library comprises a webpage whitelist library, and/or a web page title template library, and/or a web page title prefix/suffix recognition library, wherein:
The webpage whitelist library stores the truncated web page titles corresponding to webpage URL information;
The web page title template library stores the regular-expression truncation rules corresponding to webpage URL information;
The web page title prefix/suffix recognition library stores a web page title prefix/suffix list and/or prefix/suffix recognition rules.
Wherein,
The truncation request processing module comprises a receiving unit and a parsing unit (not shown), wherein:
The receiving unit is configured to receive the web page title truncation request;
The parsing unit is configured to parse the web page title truncation request to obtain the to-be-truncated web page title and the to-be-truncated webpage URL information.
As an embodiment, the truncated web page title acquisition module comprises a webpage whitelist library generation unit and a truncated web page title query unit (not shown), wherein:
The webpage whitelist library generation unit is configured to: extract each piece of webpage URL information contained in the user favorites and the web page titles mapped by it; for each piece of webpage URL information, obtain all web page titles mapped by it and count the number of users corresponding to each of those web page titles; apply each web page title and its number of users to the preset webpage whitelist calculation strategy to obtain the weight value of that web page title; and, for the same webpage URL information, select the web page title with the largest weight value, take the selected web page title as the web page title mapped by that webpage URL information, and place it in the webpage whitelist library;
The truncated web page title query unit is configured to query the webpage whitelist library generation unit, obtain the web page title mapped by the to-be-truncated webpage URL information, and use the obtained web page title as the truncated web page title.
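The behaviour of the webpage whitelist library generation unit can be sketched as follows. The source does not specify the webpage whitelist calculation strategy in this passage, so the sketch assumes the simplest possible one, in which the weight of a title is just the number of distinct users who saved the URL under that title. The record layout `(user_id, url, title)` and the name `build_whitelist` are also assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_whitelist(favorites: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Build URL -> title from (user_id, url, title) favorites records."""
    # url -> title -> set of users who saved the URL under that title
    users: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for user_id, url, title in favorites:
        users[url][title].add(user_id)

    whitelist: Dict[str, str] = {}
    for url, titles in users.items():
        # Assumed weighting: number of distinct users per title.
        whitelist[url] = max(titles, key=lambda t: len(titles[t]))
    return whitelist
```

The query unit then reduces to a dictionary lookup on the to-be-truncated webpage URL information.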
In this embodiment of the present invention, preferably, the truncated web page title acquisition module may further comprise:
A web page title updating unit, configured to: obtain web page navigation data and extract the webpage URL information contained in it and the web page titles mapped by that webpage URL information; traverse each piece of extracted webpage URL information and query whether it exists in the webpage whitelist library generation unit; if it does not exist, write the webpage URL information and its mapped web page title into the webpage whitelist library generation unit; if it exists, obtain the web page title mapped by this webpage URL information from both the extracted data and the webpage whitelist library generation unit, compare the two, and determine whether to update the web page title mapped by this webpage URL information in the webpage whitelist library generation unit.
As another embodiment, the truncated web page title acquisition module comprises a web page title template library generation unit and a truncated web page title acquiring unit, wherein:
The web page title template library generation unit is configured to set, in advance, a classification strategy for the web page titles mapped by webpage URL information, and to set a corresponding regular expression for each class of web page title;
The truncated web page title acquiring unit is configured to: extract the naming rule of the web page title mapped by the to-be-truncated webpage URL information, match the extracted naming rule against the preset classification strategy, and obtain the class to which that web page title belongs; query the web page title template library generation unit to obtain the regular expression corresponding to that class; and apply the obtained regular expression to the web page title mapped by the to-be-truncated webpage URL information to obtain the truncated web page title.
As yet another embodiment, the truncated web page title acquisition module comprises a web page title prefix/suffix recognition library generation unit and a truncated web page title processing unit, wherein:
The web page title prefix/suffix recognition library generation unit is configured to obtain and store the web page titles mapped by webpage URL information in user favorites, and to set a TF-IDF calculation strategy for identifying the prefixes and suffixes of web page titles.
The truncated web page title processing unit is configured to: obtain the web page title mapped by the to-be-truncated webpage URL information, and split the obtained web page title according to the preset splitting strategy to obtain one or more webpage subtitles; for each webpage subtitle, calculate the TF-IDF value of the webpage subtitle by using the web page titles mapped by the webpage URL information stored in the web page title prefix/suffix recognition library and the TF-IDF calculation strategy set in that library; and judge whether the calculated TF-IDF value is greater than the preset prefix/suffix threshold and, if so, determine that the webpage subtitle is a prefix/suffix, filter this prefix/suffix out of the web page title, and use the web page title with the prefix/suffix removed as the truncated web page title.
As yet another embodiment, the truncated web page title acquisition module comprises a webpage whitelist library generation unit, a web page title template library generation unit, a web page title prefix/suffix recognition library generation unit, a truncated web page title query unit, a truncated web page title acquiring unit and a truncated web page title processing unit, wherein:
The webpage whitelist library generation unit is configured to: extract each piece of webpage URL information contained in the user favorites and the web page titles mapped by it; for each piece of webpage URL information, obtain all web page titles mapped by it and count the number of users corresponding to each of those web page titles; apply each web page title and its number of users to the preset webpage whitelist calculation strategy to obtain the weight value of that web page title; and, for the same webpage URL information, select the web page title with the largest weight value, take the selected web page title as the web page title mapped by that webpage URL information, and place it in the webpage whitelist library;
The web page title template library generation unit is configured to set, in advance, a classification strategy for the web page titles mapped by webpage URL information, and to set a corresponding regular expression for each class of web page title;
The web page title prefix/suffix recognition library generation unit is configured to obtain and store the web page titles mapped by webpage URL information in user favorites, and to set a TF-IDF calculation strategy for identifying the prefixes and suffixes of web page titles;
The truncated web page title query unit is configured to query the webpage whitelist library generation unit according to the to-be-truncated webpage URL information; if the web page title mapped by the to-be-truncated webpage URL information is obtained, to use the obtained web page title as the truncated web page title; otherwise, to notify the truncated web page title acquiring unit;
The truncated web page title acquiring unit is configured to: extract the naming rule of the web page title mapped by the to-be-truncated webpage URL information, match the extracted naming rule against the preset classification strategy, and obtain the class to which that web page title belongs; query the web page title template library generation unit and, if the regular expression corresponding to that class is obtained, apply the obtained regular expression to the web page title mapped by the to-be-truncated webpage URL information to obtain the truncated web page title; otherwise, notify the truncated web page title processing unit;
The truncated web page title processing unit is configured to: obtain the web page title mapped by the to-be-truncated webpage URL information, and split the obtained web page title according to the preset splitting strategy to obtain one or more webpage subtitles; for each webpage subtitle, calculate the TF-IDF value of the webpage subtitle by using the web page titles mapped by the webpage URL information stored in the web page title prefix/suffix recognition library and the TF-IDF calculation strategy set in that library; and judge whether the calculated TF-IDF value is greater than the preset prefix/suffix threshold and, if so, determine that the webpage subtitle is a prefix/suffix, filter this prefix/suffix out of the web page title, and use the web page title with the prefix/suffix removed as the truncated web page title.
A further specific embodiment is given below to describe the method for obtaining a truncated web page title.
Fig. 3 is a schematic flow chart of the method for obtaining a truncated web page title according to an embodiment of the present invention. Referring to Fig. 3, the flow comprises:
Step 301, inputting the to-be-truncated web page title and the webpage URL information mapped to it;
In this step, the user may be saving a web page title to the favorites while browsing, or may be optimizing (that is, truncating) a web page title already stored in the favorites. For example, after the user clicks a web page title to be optimized, the web browser is triggered to send the server the to-be-truncated web page title and the webpage URL information mapped to it.
Step 302, according to webpage URL information query webpage white list storehouse, if store described webpage URL information in webpage white list storehouse, performs step 303, otherwise, perform step 304;
Step 303, reads the web page title that described in webpage white list storehouse, webpage URL information maps, and the web page title as brachymemma exports and process ends;
Step 304, judges that whether the web page title treating brachymemma inputted is effective, if invalid, performs step 305, otherwise, perform step 306;
In steps 302 to 304, the input web page URL information is looked up in the web page whitelist library. If the library stores the input web page URL information, the whitelist is hit: the web page title mapped to that URL information is returned directly as the truncated web page title, and the flow ends. Otherwise, the validity of the input web page title to be truncated must be determined. For example, if the input web page title to be truncated is "Baidu it, and you will know" and the mapped web page URL information is http://www.baidu.com/, then after the whitelist library is queried and matched, the web page title "Baidu" stored in the whitelist library is returned as the truncated web page title.
As an embodiment, the web page whitelist library may also be loaded into a cache in advance, and the web page URL information is then matched in the cache. This improves the efficiency of obtaining the truncated web page title and shortens the processing time.
In step 304, a web page title is invalid when the input title cannot reflect the web page content at all, for example when it is empty or contains no words (for example, it contains only symbols).
Step 305: return the domain name corresponding to the web page URL information as the truncated web page title, and end the flow;
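Steps 302 to 305 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the `WHITELIST` dictionary, the function names, and the rule that a title counts as valid when it contains at least one letter or digit are all hypothetical choices made for the example.

```python
from urllib.parse import urlparse

# Hypothetical in-memory whitelist: URL information -> preferred short title.
WHITELIST = {"http://www.baidu.com/": "Baidu"}


def is_valid_title(title: str) -> bool:
    """Assumed validity rule: non-empty and contains at least one letter or digit."""
    return any(ch.isalnum() for ch in title)


def whitelist_or_domain(url: str, title: str):
    """Return a truncated title for steps 302-305, or None to continue at step 306."""
    if url in WHITELIST:                # steps 302-303: whitelist hit
        return WHITELIST[url]
    if not is_valid_title(title):       # steps 304-305: fall back to the domain name
        return urlparse(url).netloc
    return None                         # valid title: later steps handle it
```

Keeping the whitelist in a dictionary mirrors the cached lookup mentioned above; a production system would more likely query a store keyed by normalized URL.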
Step 306: query the web page title template library according to the input web page URL information; if the input web page URL information exists in the web page title template library, perform step 307; otherwise, perform step 308;
Step 307: read the regular expression corresponding to the input web page URL information from the web page title template library, apply it to the web page title to be truncated to obtain the truncated web page title, and end the flow;
In steps 306 and 307, it is determined whether the input web page URL information hits the web page title template library. For example, suppose the input web page title to be truncated is "Russian girl outdoor bathing place get sun very sexy_Liu Xingyun_sina blog" and the web page URL information is http://blog.sina.com.cn/s/blog_49b0d2b50102eyxt.html?tj=1. If the web page title template library stores http://blog.sina.com.cn together with its corresponding regular expression, the template library is hit, and the stored regular expression is used to extract "Russian girl outdoor bathing place get sun very sexy_Liu Xingyun" from the input title as the truncated web page title.
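The template lookup of steps 306 and 307 can be sketched as below. The `TEMPLATES` table, its keying by host name, and the sample regular expression (which drops the last underscore-separated segment) are assumptions for illustration; the source does not give the stored expressions. Returning `None` when the expression does not match, so that the caller can continue with step 308, is also a design choice of this sketch.

```python
import re
from urllib.parse import urlparse

# Hypothetical template library: host -> regular expression whose group 1 is kept.
TEMPLATES = {
    "blog.sina.com.cn": re.compile(r"^(.*)_[^_]*$"),
}


def apply_template(url: str, title: str):
    """Return the title truncated by the site's template, or None if no template applies."""
    pattern = TEMPLATES.get(urlparse(url).netloc)
    if pattern is None:
        return None                     # template library not hit
    match = pattern.match(title)
    return match.group(1) if match else None
```

With this sample expression, a title of the form "post_author_site" keeps "post_author", which matches the shape of the example above.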
Step 308: split the input web page title to be truncated according to the preset splitting strategy to obtain one or more web page sub-titles;
Step 309: for each web page sub-title, using the web page titles mapped to the web page URL information stored in the web page title prefix/suffix recognition library and the term frequency-inverse document frequency (TF-IDF) calculation strategy set in that library, calculate the TF-IDF value of the sub-title;
Step 310: determine whether the calculated TF-IDF value is greater than the preset prefix/suffix threshold; if so, perform step 311; otherwise, return to step 309 to calculate the TF-IDF value of the next web page sub-title;
Step 311: determine that the web page sub-title whose value is greater than the preset prefix/suffix threshold is a prefix or suffix, and filter it out of the web page title to be truncated;
Step 312: determine whether the length of the web page title after prefix/suffix filtering is greater than the preset web page title length threshold; if not, perform step 313; otherwise, perform step 314;
Step 313: use the filtering result as the truncated web page title, and end the flow;
In steps 308 to 313, the input web page title to be truncated is split by matching against preset punctuation marks. Then, using the web page titles mapped to the web page URL information stored in the web page title prefix/suffix recognition library and the TF-IDF calculation strategy set in that library, prefixes and suffixes are identified on the maximum-matching principle. For example, if a web page title contains only one prefix or suffix (web page sub-title), it is enough to remove the identified prefix or suffix. A web page title may, however, contain several prefixes and suffixes, for example when it carries the site hierarchy in addition to the page information, so splitting can produce several candidate prefixes and suffixes. To remove them accurately, the embodiment of the present invention uses permutation and combination, for example forward maximum matching, reverse maximum matching, or both at once, includes all candidate prefixes and suffixes in the statistics, and extracts the prefixes and suffixes so that they can be filtered out of the web page title. If the length of the filtered web page title satisfies the preset web page title length threshold, the filtered web page title is returned as the truncated web page title.
For example, in the embodiment of the present invention, suppose the input web page title to be truncated is "high definition: Wuhan reporter investigates first quarter moon secretly and takes off dealer kidney shady deal_news_www.qq.com" and the web page URL information is http://news.qq.com/a/20130820/003196.htm#p=3. Based on the web page titles mapped to the web page URL information stored in the aforementioned prefix/suffix recognition library, the splitting strategy, the maximum-matching principle, and the TF-IDF calculation strategy, the result after prefix/suffix filtering is "Wuhan reporter investigates first quarter moon secretly and takes off dealer kidney shady deal", and this result is used as the truncated web page title.
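The prefix/suffix filtering of steps 308 to 311 can be sketched as follows. The source does not give the TF-IDF formula or the full punctuation set, so this sketch swaps in a simpler score, the fraction of stored titles from the same site that contain a sub-title, and splits on an assumed subset of separators. The names `split_title`, `affix_score`, `strip_affixes`, and the default threshold are hypothetical.

```python
import re

# Assumed subset of the preset punctuation marks used for splitting.
SPLIT_RE = re.compile(r"\s*[_|:]\s*")


def split_title(title):
    """Step 308: split a title into sub-titles at the assumed separators."""
    return [part for part in SPLIT_RE.split(title) if part]


def affix_score(segment, site_titles):
    """Stand-in for the TF-IDF value: share of stored site titles containing the segment."""
    if not site_titles:
        return 0.0
    hits = sum(1 for stored in site_titles if segment in split_title(stored))
    return hits / len(site_titles)


def strip_affixes(title, site_titles, threshold=0.5):
    """Steps 309-311: drop sub-titles whose score exceeds the threshold."""
    parts = split_title(title)
    kept = [p for p in parts if affix_score(p, site_titles) <= threshold]
    # Rejoining with "_" is a simplification; the original separators are not preserved.
    return "_".join(kept) if kept else title
```

A sub-title such as a site name recurs across most titles of that site and therefore scores high, while the page-specific part scores low and survives.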
Step 314: determine whether the web page title to be truncated contains enclosed content; if so, perform step 315; otherwise, perform step 316;
In this step, enclosed content refers to content enclosed in symbols such as book-title marks or brackets.
Step 315: use the enclosed content as the truncated web page title, and end the flow;
Step 316: split the web page title to be truncated using the preset first group of punctuation marks;
Step 317: determine whether any segment produced by the split has a length not greater than the preset segment threshold; if so, perform step 318; otherwise, perform step 321;
Step 318: for each segment whose length is not greater than the preset segment threshold, remove the common phrases from the segment and determine whether the length of the segment after removal is not greater than the preset web page title length threshold; if so, perform step 319; otherwise, perform step 320;
Step 319: return the segment with common phrases removed as the truncated web page title, and end the flow;
Step 320: split the segment with common phrases removed using the preset second group of punctuation marks, and return to step 317;
Step 321: starting from the beginning of the web page title to be truncated, take a character string whose length equals the web page title length threshold as the truncated web page title.
In the embodiment of the present invention, steps 314 to 321 truncate the web page title to be truncated using the preset general truncation rule. For example, for the web page title to be truncated "trivial games; 4399 trivial games; trivial games is complete works of; the game of double trivial games complete works-www.4399.com Largest In China" with the web page URL information "http://www.4399.com/?sogou", the truncated web page title obtained by applying the above general truncation rule is "4399 trivial games".
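The control flow of steps 314 to 321 can be sketched as below. The bracket symbols, both punctuation groups, the common-phrase list, the thresholds, and the choice to return the first qualifying segment are all assumptions of this sketch, since the source does not list them; it is therefore not expected to reproduce the example output above exactly.

```python
import re

# Assumed enclosing symbols: book-title marks and two bracket styles.
ENCLOSED_RE = re.compile(r"《([^》]+)》|【([^】]+)】|\[([^\]]+)\]")
FIRST_PUNCT = re.compile(r"\s*[-_|]\s*")      # assumed first punctuation group
SECOND_PUNCT = re.compile(r"\s*[,;，；]\s*")   # assumed second punctuation group
COMMON_PHRASES = ("Welcome to ", "Home page")  # hypothetical common phrases


def remove_common(segment):
    """Strip the assumed common phrases from a segment."""
    for phrase in COMMON_PHRASES:
        segment = segment.replace(phrase, "")
    return segment.strip()


def general_truncate(title, title_max=20, segment_max=40):
    match = ENCLOSED_RE.search(title)                       # steps 314-315
    if match:
        return next(g for g in match.groups() if g)
    segments = [s for s in FIRST_PUNCT.split(title) if s]   # step 316
    for splitter in (SECOND_PUNCT, None):
        # Step 317: keep only segments within the segment threshold.
        short = [remove_common(s) for s in segments if len(s) <= segment_max]
        for candidate in short:                             # steps 318-319
            if candidate and len(candidate) <= title_max:
                return candidate
        if splitter is None or not short:
            break
        # Step 320: re-split with the second punctuation group, then re-check.
        segments = [p for s in short for p in splitter.split(s) if p]
    return title[:title_max]                                # step 321
```

Note that splitting on a hyphen also breaks hyphenated words; a real rule set would need a more careful separator definition.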
As can be seen from the above, the embodiment of the present invention addresses the technical problem that web page titles in favorites are too long, which harms the display and reduces the useful information shown, by combining multiple strategies to truncate the web page title. Specifically, favorites data is used to statistically analyze how users name web pages, and the web page whitelist library is generated from the extracted names; the regular-expression truncation rules corresponding to web page URL information are stored in advance in the web page title template library; and, based on a large amount of web page URL information and the web page titles mapped to it, a web page title prefix/suffix list and/or prefix/suffix recognition rules are set in the web page title prefix/suffix recognition library. Prefixes, suffixes, and descriptive expressions contained in web page titles are thereby removed effectively, which improves the redundancy-elimination effect and lets the truncated web page title fit the display area of the browser. Furthermore, the truncation method combining multiple strategies is more accurate, so the truncated web page title gives the user more useful information, which improves the user experience.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. In addition, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided herein. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described herein include some features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for obtaining a truncated web page title according to the embodiments of the present invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (23)

1. A method for obtaining a truncated web page title, comprising:
obtaining web page URL information and the web page title to be truncated to which the web page URL information is mapped;
processing the web page title to be truncated so that only the part that reflects the web page content is retained;
wherein the processing of the web page title to be truncated comprises one of the following methods, or any combination of several of them: performing word segmentation on the title and removing meaningless words; querying a preset web page title matching library to obtain the matching rule corresponding to the web page URL information of the title to be truncated, and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title; and truncating the title using a general rule;
wherein the web page title matching library comprises a web page whitelist library, and/or a web page title template library, and/or a web page title prefix/suffix recognition library.
2. The method of claim 1, wherein the web page whitelist library stores the truncated web page title corresponding to web page URL information;
the web page title template library stores the regular-expression truncation rule corresponding to web page URL information; and
the web page title prefix/suffix recognition library stores a web page title prefix/suffix list and/or prefix/suffix recognition rules.
3. The method of claim 2, wherein the matching rules are applied in the following order: web page whitelist library, web page title template library, web page title prefix/suffix recognition library.
4. The method of claim 1, wherein obtaining the web page URL information and the web page title to be truncated to which it is mapped comprises:
receiving a request to truncate a web page title;
parsing the request to obtain the web page title to be truncated and its web page URL information.
5. The method of claim 1, wherein generating the web page whitelist library comprises:
extracting each item of web page URL information contained in the organized data of multiple users, together with the web page titles to which it is mapped;
for each item of web page URL information, obtaining all web page titles mapped to it, and counting the number of users corresponding to each of those web page titles;
applying each web page title and its number of users to a preset web page whitelist calculation strategy to obtain a weight value for that web page title;
for each item of web page URL information, selecting the web page title with the largest weight value and placing it in the web page whitelist library as the web page title mapped to that web page URL information.
6. The method of claim 5, wherein the web page whitelist calculation strategy is a calculation strategy based on the number of users, and the web page title weight value is the number of users.
7. The method of claim 5, wherein the web page whitelist calculation strategy is a calculation strategy based on preset field weights of users, and obtaining the web page title weight value comprises:
extracting the feature words contained in the web page titles mapped to the web page URL information and matching them against each preset field feature lexicon to determine the field to which the web page URL information belongs;
according to the field weights set in advance for each user, obtaining, for each user counted under each web page title mapped to the web page URL information, that user's field weight in the determined field;
applying the number of users counted under the web page title and those users' field weights in the determined field to a preset weight calculation formula to obtain the web page title weight value.
8. The method of claim 6 or 7, further comprising:
obtaining web page navigation data, and extracting the web page URL information contained in it and the web page titles to which that information is mapped;
traversing each item of extracted web page URL information and querying whether it exists in the web page whitelist library; if it does not exist, writing the web page URL information and its mapped web page title into the whitelist library; if it exists, obtaining the web page titles mapped to the web page URL information from the extracted titles and from the web page whitelist library respectively, comparing them, and then determining whether to update the web page title mapped to the web page URL information in the web page whitelist library.
9. The method of claim 8, wherein the web page title matching library comprises the web page whitelist library, and querying the preset web page title matching library to obtain the matching rule corresponding to the web page URL information of the title to be truncated and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title comprises:
querying the web page whitelist library to obtain the web page title mapped to the web page URL information of the title to be truncated, and using the obtained web page title as the truncated web page title.
10. The method of claim 1, wherein generating the web page title template library comprises:
setting in advance a classification strategy for the web page titles mapped to web page URL information, and setting a corresponding regular expression for each class of web page titles.
11. The method of claim 10, wherein the web page title matching library comprises the web page title template library, and querying the preset web page title matching library to obtain the matching rule corresponding to the web page URL information of the title to be truncated and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title comprises:
extracting the naming rule of the web page title mapped to the web page URL information of the title to be truncated, and matching the extracted naming rule against the preset classification strategy to obtain the class to which that web page title belongs;
querying the web page title template library to obtain the regular expression corresponding to that class;
applying the obtained regular expression to the web page title mapped to the web page URL information of the title to be truncated to obtain the truncated web page title.
12. The method of claim 1, wherein generating the web page title prefix/suffix recognition library comprises:
obtaining and storing the web page titles mapped to the web page URL information of titles to be truncated;
setting a term frequency-inverse document frequency (TF-IDF) calculation strategy for identifying prefixes and suffixes of web page titles, and forming a web page title prefix/suffix list and/or prefix/suffix recognition rules.
13. The method of claim 12, wherein the web page title matching library comprises the web page title prefix/suffix recognition library, and querying the preset web page title matching library to obtain the matching rule corresponding to the web page URL information of the title to be truncated and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title comprises:
obtaining the web page title mapped to the web page URL information of the title to be truncated, and splitting the obtained web page title according to a preset splitting strategy to obtain one or more web page sub-titles;
for each web page sub-title, using the web page titles mapped to the web page URL information stored in the web page title prefix/suffix recognition library and the TF-IDF calculation strategy set in that library, calculating the TF-IDF value of the sub-title;
determining whether the calculated TF-IDF value is greater than a preset prefix/suffix threshold; if so, determining that the web page sub-title is a prefix or suffix, filtering it out of the web page title, using the filtered web page title as the truncated web page title, and storing the determined prefix or suffix in the web page title prefix/suffix library.
14. The method of claim 13, further comprising, before calculating the TF-IDF value of each web page sub-title:
combining the obtained web page sub-titles and, for each combined web page sub-title, calculating its TF-IDF value using the web page titles mapped to the web page URL information stored in the web page title prefix/suffix recognition library and the TF-IDF calculation strategy; and, when none of the combined web page sub-titles is a prefix or suffix, performing the calculation of the TF-IDF value of each web page sub-title.
15. The method of claim 13, further comprising, after filtering the prefix or suffix out of the web page title and before using the filtered web page title as the truncated web page title:
determining whether the length of the filtered web page title is greater than a preset web page title length threshold, and using a filtered web page title whose length is not greater than the preset web page title length threshold as the truncated web page title.
16. The method of claim 13, wherein the splitting strategy splits the web page title at the preset punctuation marks contained in it, the preset punctuation marks comprising: _ ,-,-,+, &, # ...:.,, |:, ┊, ‖; ,.,, s ,-,-and.
17. The method of claim 1, wherein truncating the web page title to be truncated according to the preset general truncation rule comprises:
G1: determining whether the web page title to be truncated contains enclosed content, the enclosed content being content enclosed in symbols; if so, performing step G2; otherwise, performing step G3;
G2: using the enclosed content as the truncated web page title, and ending the flow;
G3: splitting the web page title to be truncated using a preset first group of punctuation marks;
G4: determining whether any segment produced by the split has a length not greater than a preset segment threshold; if so, performing step G5; otherwise, performing step G8;
G5: for each segment whose length is not greater than the preset segment threshold, removing the common phrases from the segment and determining whether the length of the segment after removal is not greater than a preset web page title length threshold; if so, performing step G6; otherwise, performing step G7;
G6: returning the segment with common phrases removed as the truncated web page title, and ending the flow;
G7: splitting the segment with common phrases removed using a preset second group of punctuation marks, and returning to step G4;
G8: starting from the beginning of the web page title to be truncated, taking a character string whose length equals the web page title length threshold as the truncated web page title.
18. A device for obtaining a truncated web page title, comprising a truncation request processing module and a truncated web page title acquisition module, wherein:
the truncation request processing module is configured to obtain, from a received request to truncate a web page title, the web page URL information of the title to be truncated and the web page title to which that information is mapped;
the truncated web page title acquisition module is configured to query a preset web page title matching library to obtain the matching rule corresponding to the web page URL information of the title to be truncated, and to process the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title; the web page title matching library comprises a web page whitelist library, and/or a web page title template library, and/or a web page title prefix/suffix recognition library.
19. The device of claim 18, wherein the truncation request processing module comprises a receiving unit and a parsing unit, wherein:
the receiving unit is configured to receive a request to truncate a web page title;
the parsing unit is configured to parse the request to obtain the web page title to be truncated and its web page URL information.
20. The device of claim 18, wherein the truncated web page title acquisition module comprises a web page whitelist library generation unit and a truncated web page title query unit, wherein:
the web page whitelist library generation unit is configured to extract each item of web page URL information contained in user favorites together with the web page titles to which it is mapped; for each item of web page URL information, to obtain all web page titles mapped to it and count the number of users corresponding to each of those web page titles; to apply each web page title and its number of users to a preset web page whitelist calculation strategy to obtain a weight value for that web page title; and, for each item of web page URL information, to select the web page title with the largest weight value and place it in the web page whitelist library as the web page title mapped to that web page URL information;
the truncated web page title query unit is configured to query the web page whitelist library generation unit to obtain the web page title mapped to the web page URL information of the title to be truncated, and to use the obtained web page title as the truncated web page title.
21. The device of claim 20, wherein the truncated web page title acquisition module further comprises:
a web page title updating unit, configured to obtain web page navigation data and extract the web page URL information contained in it and the web page titles to which that information is mapped; to traverse each item of extracted web page URL information and query whether it exists in the web page whitelist library generation unit; if it does not exist, to write the web page URL information and its mapped web page title into the web page whitelist library generation unit; and, if it exists, to obtain the web page titles mapped to the web page URL information from the extracted titles and from the web page whitelist library generation unit respectively, compare them, and then determine whether to update the web page title mapped to the web page URL information in the web page whitelist library generation unit.
22. The device of claim 18, wherein the truncated web page title acquisition module comprises a web page title template library generation unit and a truncated web page title acquiring unit, wherein:
the web page title template library generation unit is configured to set in advance a classification strategy for the web page titles mapped to web page URL information, and to set a corresponding regular expression for each class of web page titles;
the truncated web page title acquiring unit is configured to extract the naming rule of the web page title mapped to the web page URL information of the title to be truncated and match the extracted naming rule against the preset classification strategy to obtain the class to which that web page title belongs; to query the web page title template library generation unit to obtain the regular expression corresponding to that class; and to apply the obtained regular expression to the web page title mapped to the web page URL information of the title to be truncated to obtain the truncated web page title.
23. The device as claimed in claim 18, wherein the truncated web page title acquisition module comprises: a web page title prefix-suffix recognition base generation unit and a truncated web page title processing unit, wherein,
The web page title prefix-suffix recognition base generation unit is configured to obtain and store the web page titles mapped by the webpage URL information in user collections (favorites), and to set a term frequency-inverse document frequency (TF-IDF) calculation strategy for identifying the prefixes and suffixes of web page titles;
The truncated web page title processing unit is configured to: obtain the web page title mapped by the to-be-truncated webpage URL information; split the obtained web page title according to a preset splitting strategy to obtain one or more web page subtitles; for each web page subtitle, calculate its TF-IDF value using the TF-IDF calculation strategy set in the web page title prefix-suffix recognition base, in combination with the web page titles stored in that base; judge whether the calculated TF-IDF value is greater than a preset prefix-suffix threshold; and, if so, determine that the web page subtitle is a prefix or suffix, filter it out of the web page title, and take the web page title with the prefix or suffix filtered out as the truncated web page title.
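The split, score, and threshold flow of claim 23 can be illustrated as below. The claim does not give the TF-IDF formula, so this sketch swaps in a much simpler stand-in score (the fraction of stored titles that begin or end with the segment); it is not the patent's TF-IDF strategy. The separator pattern, the threshold of 0.5, and all function names are likewise assumptions.

```python
import re
from typing import List

# Hypothetical splitting strategy: split on " - ", " _ " or " | ".
SEPARATORS = re.compile(r"\s+[-_|]\s+")


def split_title(title: str) -> List[str]:
    """Split a title into non-empty subtitles."""
    return [part for part in SEPARATORS.split(title) if part]


def affix_score(segment: str, stored_titles: List[str]) -> float:
    """Stand-in for the TF-IDF value: the fraction of stored titles whose
    first or last subtitle equals `segment`."""
    if not stored_titles:
        return 0.0
    hits = 0
    for stored in stored_titles:
        parts = split_title(stored)
        if len(parts) > 1 and segment in (parts[0], parts[-1]):
            hits += 1
    return hits / len(stored_titles)


def strip_affixes(title: str, stored_titles: List[str], threshold: float = 0.5) -> str:
    """Drop every subtitle whose score exceeds the threshold; fall back to
    the original title if nothing would remain."""
    kept = [
        part
        for part in split_title(title)
        if affix_score(part, stored_titles) <= threshold
    ]
    return " - ".join(kept) if kept else title
```

Note that this sketch rejoins the kept subtitles with " - ", so the original separators are not preserved.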
CN201410158987.XA 2014-04-18 2014-04-18 Method and device for obtaining a truncated web page title Active CN105095175B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410158987.XA CN105095175B (en) 2014-04-18 2014-04-18 Method and device for obtaining a truncated web page title

Publications (2)

Publication Number Publication Date
CN105095175A (en) 2015-11-25
CN105095175B (en) 2019-04-30

Family

ID=54575649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410158987.XA Active CN105095175B (en) 2014-04-18 2014-04-18 Method and device for obtaining a truncated web page title

Country Status (1)

Country Link
CN (1) CN105095175B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130262430A1 (en) * 2012-03-29 2013-10-03 Microsoft Corporation Dominant image determination for search results
CN102831199A (en) * 2012-08-07 2012-12-19 北京奇虎科技有限公司 Method and device for establishing interest model
CN102831248A (en) * 2012-09-18 2012-12-19 北京奇虎科技有限公司 Network hotspot mining method and network hotspot mining device
US20140095521A1 (en) * 2012-10-01 2014-04-03 DISCERN, Inc. Data augmentation
CN103324665A (en) * 2013-05-14 2013-09-25 亿赞普(北京)科技有限公司 Hot spot information extraction method and device based on micro-blog

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
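谢创丰 (Xie Chuangfeng): "基于兴趣模型的个性化信息推荐系统研究与设计" [Research and Design of a Personalized Information Recommendation System Based on an Interest Model], 《中国优秀硕士学位论文全文数据库 信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology] *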

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574175A (en) * 2015-12-21 2016-05-11 北京奇虎科技有限公司 Processing method and device for optimizing search result title
CN105630909A (en) * 2015-12-21 2016-06-01 北京奇虎科技有限公司 Method and device for displaying normalized header information
CN107045529A (en) * 2017-01-16 2017-08-15 广州爱九游信息技术有限公司 Network-content acquisition method, device and service terminal
CN107045529B (en) * 2017-01-16 2021-01-22 阿里巴巴(中国)有限公司 Network content acquisition method and device and service terminal
CN106959945A (en) * 2017-03-23 2017-07-18 北京百度网讯科技有限公司 Method and apparatus for generating a slug for news based on artificial intelligence
WO2021072850A1 (en) * 2019-10-15 2021-04-22 平安科技(深圳)有限公司 Feature word extraction method and apparatus, text similarity calculation method and apparatus, and device
CN111460307A (en) * 2020-04-03 2020-07-28 渭南双盈未来科技有限公司 Mobile terminal accurate searching method and device
CN111680482A (en) * 2020-05-07 2020-09-18 车智互联(北京)科技有限公司 Title image-text generation method and computing device
CN111680482B (en) * 2020-05-07 2024-04-12 车智互联(北京)科技有限公司 Title image-text generation method and computing device
CN112437356A (en) * 2020-11-13 2021-03-02 珠海大横琴科技发展有限公司 Streaming media data processing method and device
CN112437356B (en) * 2020-11-13 2021-09-28 珠海大横琴科技发展有限公司 Streaming media data processing method and device

Also Published As

Publication number Publication date
CN105095175B (en) 2019-04-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant