CN105095175B - Method and device for obtaining a truncated web page title - Google Patents


Info

Publication number
CN105095175B
Authority
CN
China
Prior art keywords
web page
page title
webpage
truncated
title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410158987.XA
Other languages
Chinese (zh)
Other versions
CN105095175A (en)
Inventor
商胜
徐俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201410158987.XA
Publication of CN105095175A
Application granted
Publication of CN105095175B
Legal status: Active
Anticipated expiration


Abstract

The invention discloses a method and a device for obtaining a truncated web page title. The method comprises: obtaining web page URL information and the to-be-truncated web page title mapped to that information; and processing the to-be-truncated web page title so that only the part that reflects the web page content is retained. The processing comprises any one or any combination of the following: performing word segmentation on the title and removing meaningless words; querying a preset web page title matching library to obtain the matching rule corresponding to the URL information of the web page to be truncated, and processing the to-be-truncated web page title according to that rule to obtain the truncated web page title; and truncating the title using general rules. The web page title matching library comprises a web page whitelist library, and/or a web page title template library, and/or a web page title prefix/suffix identification library. The invention can effectively improve redundancy removal for web page titles.

Description

Method and device for obtaining a truncated web page title
Technical field
The present invention relates to browser display processing technology, and in particular to a method and a device for obtaining a truncated web page title.
Background art
At present, because of browser interface layout, the browser display area used to show the web page titles a user has saved in browser favorites is limited, and the titles shown in that area should let the user obtain relevant information about the web page (website). How to make stored web page titles convey as much information as possible within the limited display area, so that the user obtains more useful information about the web page and has a better experience, has therefore become a technical problem in urgent need of a solution. A web page title is a single sentence summarizing the web page content; it is a highly condensed form of that content and can give the user refined, useful information about the page.
In existing browsers, the title of a page that a user adds to favorites is generally the title (Title) set at the top of the web page, extracted automatically by the browser. For example, for the web page uniform resource locator (URL, Uniform Resource Locator) information www.sohu.com to be saved, the browser automatically stores the title set at the top of www.sohu.com, "Go to Sohu, see the Olympic Games", in favorites as the title of www.sohu.com. The user can, of course, also modify the web page title in favorites manually as needed.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and a device for obtaining a truncated web page title that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for obtaining a truncated web page title is provided, the method comprising:
obtaining web page uniform resource locator (URL) information and the to-be-truncated web page title mapped to that web page URL information;
processing the to-be-truncated web page title so that only the part that reflects the web page content is retained;
wherein processing the to-be-truncated web page title comprises any one or any combination of the following: performing word segmentation on the title and removing meaningless words; querying a preset web page title matching library to obtain the matching rule corresponding to the URL information of the web page to be truncated, and processing the to-be-truncated web page title according to the obtained matching rule to obtain the truncated web page title; and truncating the title using general rules;
the web page title matching library comprises: a web page whitelist library, and/or a web page title template library, and/or a web page title prefix/suffix identification library.
According to another aspect of the present invention, a device for obtaining a truncated web page title is provided, comprising a truncation request processing module and a truncated web page title obtaining module, wherein:
the truncation request processing module is configured to obtain, from a received web page title truncation request, the uniform resource locator information of the web page to be truncated and the web page title mapped to that URL information;
the truncated web page title obtaining module is configured to query the preset web page title matching library to obtain the matching rule corresponding to the URL information of the web page to be truncated, and to process the to-be-truncated web page title according to the obtained matching rule to obtain the truncated web page title; the web page title matching library comprises: a web page whitelist library, and/or a web page title template library, and/or a web page title prefix/suffix identification library.
According to the method and device of the present invention, a web page title is truncated on the basis of the input web page uniform resource locator information and web page title, using a pre-established web page whitelist library, and/or web page title template library, and/or web page title prefix/suffix identification library, and/or general truncation rules. This solves the technical problem that, after existing methods extract a web page title, the truncated title still contains descriptive expressions and prefixes or suffixes. The prefixes, suffixes, and descriptive expressions contained in the web page title can be removed effectively, redundancy is well reduced, and the truncated title meets the browser display area requirement while giving the user more useful information, which improves the user experience.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features, and advantages of the invention are easier to understand, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference numbers denote the same parts. In the drawings:
Fig. 1 shows a flow diagram of the method for obtaining a truncated web page title according to an embodiment of the present invention;
Fig. 2 shows a structural diagram of the device for obtaining a truncated web page title according to an embodiment of the present invention; and
Fig. 3 shows a detailed flow diagram of the method for obtaining a truncated web page title according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
With the development of network technology, in order to provide users with more useful information and to fit the browser display area, nonessential information contained in the web page titles stored in favorites also needs to be filtered out; that is, key words are extracted from a web page title to truncate it, so that as much useful information as possible is presented to the user within the limited browser display area.
As an optional embodiment, the obtained web page title can be processed by a word-segmentation method: the web page title is first segmented into words, meaningless words are then removed from the segmented words, and finally the remaining words are recombined to obtain the truncated web page title.
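The three stages of this segmentation-based approach can be sketched as follows. This is a minimal illustration, not the patent's implementation: whitespace splitting stands in for a real word segmenter, and `MEANINGLESS_WORDS` and `truncate_by_segmentation` are hypothetical names, since the source does not enumerate a word list.

```python
# Hypothetical list of "meaningless" words; the source does not specify one.
MEANINGLESS_WORDS = {"the", "a", "of", "-", "|", ","}


def truncate_by_segmentation(title: str) -> str:
    """Segment a title, drop meaningless words, and recombine the rest."""
    words = title.split()  # stand-in for a real word segmenter
    kept = [w for w in words if w.lower() not in MEANINGLESS_WORDS]
    return " ".join(kept)
```

Any word that is not in the list survives, so a descriptive phrase such as "Welcome to" passes through this kind of filter unchanged.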
In practice, segmenting the web page title into words and removing meaningless words from them cannot effectively remove the information in the title that is irrelevant to the user. For example, after segmentation, meaningless-word removal, and recombination, the web page title "Go to Sohu, see the Olympic Games" is still extracted as "Go to Sohu, see the Olympic Games", yet "Go to" and "see the Olympic Games" may be useless to the user, so the amount of useful information presented in the limited browser display area is reduced and the user experience suffers. As another example, for the web page title "Welcome to Sohu", the truncated title obtained by the existing method is still "Welcome to Sohu", where "Welcome to" is a descriptive expression that gives the user no useful information. Because the truncated title contains such descriptive expressions, on the one hand it may not meet the browser display area requirement, and on the other hand it gives the user less useful information, and redundancy removal for the title is poor. Preferably, the embodiment of the present invention therefore proposes a web page title truncation technique that retains as much of the useful information in each title as possible, that is, a method for obtaining a truncated web page title: by establishing a web page whitelist library, and/or a web page title template library, and/or a web page title prefix/suffix identification library, and/or general truncation rules, web page titles are truncated usefully so that they contain more refined keywords or key phrases and the information irrelevant to the user is removed, thereby meeting the browser display area requirement and providing the user with more useful information.
Fig. 1 shows a flow diagram of the method for obtaining a truncated web page title according to an embodiment of the present invention. Referring to Fig. 1, the process includes:
Step 101: obtain the uniform resource locator information of the web page to be truncated and the web page title mapped to that URL information;
In this step, unlike existing techniques that truncate using only the web page title, the embodiment of the present invention also obtains and uses the web page uniform resource locator information when obtaining the web page title, in order to truncate titles more effectively and to work with the web page whitelist library, and/or web page title template library, and/or web page title prefix/suffix identification library proposed here. As an optional embodiment, and unlike the prior art, the to-be-truncated web page title in the embodiment of the present invention may be an invalid title that does not represent the page subject, such as an empty title or a URL.
This step specifically includes:
receiving a request to truncate a web page title;
In this step, while browsing a web page, a user who decides to save the page clicks the add-to-favorites submenu in the favorites drop-down menu on the page's display interface, which triggers web page title truncation. The web browser extracts the title of the page being browsed, that is, the to-be-truncated web page title, encapsulates the extracted title and the web page uniform resource locator information (the URL information of the web page to be truncated) in a web page title truncation request, and sends the request to the server. Alternatively, a user who wants to optimize web page titles already stored in favorites (to-be-truncated web page titles) clicks the rename submenu in the favorites drop-down menu, which triggers web page title truncation; the user can select the web page titles to be truncated, and the web browser encapsulates the selected titles and the corresponding web page uniform resource locator information in a web page title truncation request and sends it to the server. If the user selects multiple web page titles, each web page title in the request forms a mapping with its web page uniform resource locator information.
parsing the web page title truncation request to obtain the to-be-truncated web page title and the uniform resource locator information of the web page to be truncated.
In this step, on receiving the web page title truncation request, the server decapsulates and parses it to obtain the web page title and the web page uniform resource locator information carried in the request.
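The encapsulation and parsing of a truncation request might look like the sketch below. The source does not specify a wire format, so the JSON layout, the `action` and `items` field names, and both function names are assumptions made only for illustration.

```python
import json


def build_truncation_request(pairs: dict) -> str:
    """Encapsulate URL -> to-be-truncated title mappings (hypothetical format)."""
    items = [{"url": url, "title": title} for url, title in pairs.items()]
    return json.dumps({"action": "truncate_title", "items": items})


def parse_truncation_request(payload: str) -> list:
    """Recover the (url, title) pairs carried in a truncation request."""
    data = json.loads(payload)
    return [(item["url"], item["title"]) for item in data["items"]]
```

Each item keeps a title together with its URL, which preserves the one-to-one mapping the text requires when several titles are selected at once.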
Step 102: query the preset web page title matching library to obtain the matching rule corresponding to the uniform resource locator information of the web page to be truncated, and process the to-be-truncated web page title according to the obtained matching rule to obtain the truncated web page title. The web page title matching library includes: a web page whitelist library, and/or a web page title template library, and/or a web page title prefix/suffix identification library, and/or general truncation rules, wherein:
the web page whitelist library stores the truncated web page title corresponding to web page uniform resource locator information;
the web page title template library stores the regular-expression truncation rules corresponding to web page uniform resource locator information;
the web page title prefix/suffix identification library stores a list of web page title prefixes and suffixes and/or prefix/suffix recognition rules. The prefix/suffix recognition rule is a preset term frequency-inverse document frequency calculation strategy for identifying prefixes and suffixes in web page titles, described in detail later.
In this step, as a preferred embodiment, the web page title matching library can also be loaded into a cache in advance.
In the embodiment of the present invention, if the web page title matching library stores the web page whitelist library, the web page title template library, and the web page title prefix/suffix identification library, matching against the whitelist library takes little time and quickly separates out the web page titles not contained in the whitelist library, which reduces subsequent processing; matching against the template library takes longer; and matching against the prefix/suffix identification library takes longest. Therefore, if the three are combined to truncate a title, the matching order is preferably: web page whitelist library, web page title template library, web page title prefix/suffix identification library.
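The preferred matching order can be sketched as a cascade that tries the cheapest lookup first and falls through to the next library only when nothing matches. The library contents, the regular expression, and all function names below are hypothetical placeholders, not data from the patent.

```python
import re

# Hypothetical contents of the three libraries.
WHITELIST = {"www.sohu.com": "Sohu"}
TEMPLATES = {"blog.example.com": re.compile(r"^(?P<keep>.+?) - .*$")}
AFFIXES = ["Welcome to ", " - homepage"]


def match_whitelist(url, title):
    """Fastest lookup: a stored truncated title keyed by URL."""
    return WHITELIST.get(url)


def match_template(url, title):
    """Apply the regular-expression truncation rule registered for the URL."""
    pattern = TEMPLATES.get(url)
    match = pattern.match(title) if pattern else None
    return match.group("keep") if match else None


def strip_affixes(url, title):
    """Slowest stage: remove any known prefix or suffix."""
    result = title
    for affix in AFFIXES:
        if result.startswith(affix):
            result = result[len(affix):]
        if result.endswith(affix):
            result = result[:-len(affix)]
    return result if result != title else None


def truncate_title(url, title):
    """Try whitelist, then template, then prefix/suffix library, in that order."""
    for matcher in (match_whitelist, match_template, strip_affixes):
        result = matcher(url, title)
        if result:
            return result
    return title  # nothing matched; keep the original title
```

For instance, `truncate_title("www.sohu.com", "Welcome to Sohu")` is answered by the whitelist alone, so the two slower stages never run.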
Generating the web page whitelist library includes:
A11: extract each piece of web page uniform resource locator information contained in users' favorites and the web page title mapped to that URL information;
In this step, the web page titles set by users and the corresponding web page uniform resource locator information are extracted from the user favorites that contain web page titles.
In the embodiment of the present invention, when a user accesses web pages with the Sogou browser, the browser stores the user's web page access records, for example the web page title the user set for a page and the page's URL information. By extracting the favorites data in the Sogou browser, the server can obtain a large amount of web page URL information together with the web page titles that individual users set for that URL information. Web page URL information and a web page title form a mapping (a title pair), and different users may set different web page titles for the same web page URL information. The same web page URL information may therefore be mapped to a large number of different web page titles set by different users.
A12: for each piece of web page URL information, obtain all the web page titles mapped to it, and count the number of users corresponding to each of those web page titles;
In this step, because users differ, each piece of web page URL information is mapped to different web page titles. In the embodiment of the present invention, for each piece of web page URL information, the number of users corresponding to each web page title mapped to it is counted. For example, for the web page URL information www.sohu.com, the mapped web page titles include "Go to Sohu, see the Olympic Games", "Welcome to Sohu", "Sohu", and "Sohu official website". The statistics show that "Go to Sohu, see the Olympic Games" corresponds to 10,000 users, that is, 10,000 users set it as the title mapped to www.sohu.com; "Welcome to Sohu" corresponds to 15,000 users; "Sohu" corresponds to 50,000 users; and "Sohu official website" corresponds to 25,000 users.
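Steps A11 and A12 amount to grouping favorites records by URL and counting distinct users per title. A minimal sketch, assuming the favorites data is available as `(user_id, url, title)` tuples (a hypothetical record layout):

```python
from collections import Counter, defaultdict


def count_users_per_title(records):
    """Return {url: Counter({title: number_of_users})} from favorites records."""
    seen = set()
    counts = defaultdict(Counter)
    for user_id, url, title in records:
        key = (user_id, url, title)
        if key in seen:  # count each user at most once per (url, title)
            continue
        seen.add(key)
        counts[url][title] += 1
    return counts
```

Under the user-count strategy, these counts are used directly as the title weight values.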
A13: apply the web page title and its corresponding number of users to a preset web page whitelist calculation strategy to obtain the weight value of the web page title;
In this step, as an optional embodiment, the whitelist calculation strategy can be based on the number of users, in which case the web page title weight value is the user count. Following the example in step A12, under the user-count strategy the weight value of "Go to Sohu, see the Olympic Games" is 10,000, that of "Welcome to Sohu" is 15,000, that of "Sohu" is 50,000, and that of "Sohu official website" is 25,000.
Of course, as another optional embodiment, the field weight of the field a user belongs to can also be considered in practice. A user in a given field should name the web page titles mapped to web page URL information in that field more accurately than users outside the field name the titles for the same URL information; that is, the names given by users in the field can be applied and adopted more widely. For example, a user in the mechanical field should name the titles mapped to mechanical-field web page URL information more accurately than users outside the mechanical field do. The whitelist calculation strategy can therefore be based on preset user field weights, with field weights set for each user in advance; for example, for a given user the mechanical field weight can be set to 0.5, the electrical field weight to 0.3, the communications field weight to 0.2, and so on. The field a user belongs to can be determined by matching features of user tags, which is a well-known technique and is not described in detail here. In this case, applying the web page title and its corresponding number of users to the preset whitelist calculation strategy to obtain the web page title weight value includes:
B11: extract the feature words contained in the web page titles mapped to the web page URL information and match them against the preset feature dictionary of each field to determine the field to which the web page URL information belongs;
In this step, for example, for a piece of web page URL information, the web page title "Welcome to Sohu" is selected at random and the extracted feature word is "Sohu"; if the feature dictionary of the communications field contains the feature word "Sohu", the field to which the web page title belongs is the communications field.
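Step B11 can be illustrated with a simple dictionary lookup. The dictionary contents and the "most matching feature words wins" tie-break are assumptions for this sketch; the source only says that feature words are matched against each field's feature dictionary.

```python
# Hypothetical per-field feature dictionaries.
FIELD_DICTIONARIES = {
    "communications": {"Sohu", "portal"},
    "mechanical": {"gearbox", "lathe"},
}


def determine_field(feature_words):
    """Return the field whose dictionary matches the most feature words, or None."""
    words = set(feature_words)
    best_field, best_hits = None, 0
    for field, vocabulary in FIELD_DICTIONARIES.items():
        hits = len(vocabulary & words)
        if hits > best_hits:
            best_field, best_hits = field, hits
    return best_field
```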
B12: according to the field weights set in advance for each user, obtain, for the users corresponding to each web page title mapped to the web page URL information, their field weights in the field determined for that URL information;
In this step, for "Go to Sohu, see the Olympic Games", among its 10,000 users, 2,000 users have a field weight of 0.2 in the communications field, 3,000 users have a field weight of 0.3, 1,000 users have a field weight of 0.6, and 4,000 users have a field weight of 0.9. The other web page titles mapped to the web page URL information www.sohu.com are counted in the same way.
B13: apply the number of users corresponding to the web page title and the users' field weights in the field determined for the web page URL information to a preset weight calculation formula to obtain the web page title weight value.
In this step, the weight calculation formula can be a total weight formula or a relative weight formula. The total weight formula is as follows:
In the formula,
X_i is the weight value of the i-th web page title, where i is a natural number;
U_{i,j} is the j-th user corresponding to the i-th web page title;
ξ_{i,j} is the field weight of the j-th user corresponding to the i-th web page title in the field to which the web page title belongs;
K is the total number of users corresponding to the i-th web page title, and K is a natural number.
The relative weight formula is as follows:
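Neither formula is reproduced in this text. One reading consistent with the symbol definitions above, offered as an assumption and not as the patent's own notation, treats U_{i,j} as a count of 1 for each user, sums the users' field weights to get the total weight, and normalizes that sum by the number of users K to get the relative weight:

```latex
X_i = \sum_{j=1}^{K} U_{i,j}\,\xi_{i,j},
\qquad
X_i^{\mathrm{rel}} = \frac{1}{K}\sum_{j=1}^{K} U_{i,j}\,\xi_{i,j}
```

Under this reading, a title whose 10,000 users split into 2,000 at weight 0.2, 3,000 at 0.3, 1,000 at 0.6, and 4,000 at 0.9 would have a total weight of 400 + 900 + 600 + 3,600 = 5,500 and a relative weight of 0.55.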
As other optional embodiments, a web page title priority coefficient can also be set in advance for each web page title and combined with the field weights of the users who set that title to calculate the web page title weight value; that is, the whitelist calculation strategy can be based on preset user field weights combined with web page title priority coefficients. The method further includes:
multiplying the obtained web page title weight value by the web page title priority coefficient to give the final output web page title weight value.
In this step, for the total weight formula, the final output web page title weight value is calculated as follows:
In the formula,
ψ_i is the priority coefficient of the i-th web page title.
In the embodiment of the present invention, the web page title priority coefficient can be set manually, for example by obtaining each web page title in the Sogou browser and setting a corresponding priority coefficient for each one.
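The weighting with a priority coefficient can be sketched numerically. The multiplication by the coefficient follows the text; the form of the total and relative weights (a sum of user field weights, and that sum divided by the user count) is an assumption, because the formulas themselves are not reproduced in this text. The function names and the coefficient value 1.2 are hypothetical.

```python
def total_weight(groups):
    """groups: list of (user_count, field_weight); assumed sum of field weights."""
    return sum(count * weight for count, weight in groups)


def relative_weight(groups):
    """Assumed normalization of the total weight by the number of users."""
    users = sum(count for count, _ in groups)
    return total_weight(groups) / users if users else 0.0


def final_weight(groups, priority_coefficient):
    """Multiply the title weight by its priority coefficient, as the text states."""
    return priority_coefficient * total_weight(groups)


# User groups from the step B12 example: (users, communications-field weight).
groups = [(2000, 0.2), (3000, 0.3), (1000, 0.6), (4000, 0.9)]
weight = final_weight(groups, priority_coefficient=1.2)
```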
A14: for the same web page URL information, select the web page title with the largest web page title weight value, and place the web page URL information and the selected title, as the title mapped to that URL information, in the preset web page whitelist library.
In this step, for the same web page URL information, after the weight value of each mapped web page title has been calculated, the title with the largest weight value is selected as the title mapped to that URL information in the whitelist library. The weight value can be a total weight value or a relative weight value: either the title with the largest total weight value or the title with the largest relative weight value can be selected as the title mapped to the web page URL information.
As an optional embodiment, for the same web page URL information the web page title weight values can also be sorted by size and the titles with the top N weight values selected, so that each piece of web page URL information is mapped to N web page titles, and these N titles are placed in the preset whitelist library, where N is a natural number. That is, in the whitelist library each piece of web page URL information is mapped to N web page titles, and N can be chosen according to actual needs.
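Selecting the best title, or the top N titles, per URL is a straightforward ranking. A minimal sketch, assuming the weight values are already available as a `{title: weight}` dictionary per URL (the function names are hypothetical):

```python
import heapq


def top_titles(weights, n=1):
    """Return the n titles with the largest weight values, best first."""
    return heapq.nlargest(n, weights, key=weights.get)


def build_whitelist(weights_by_url, n=1):
    """Map each URL to its top-n titles."""
    return {url: top_titles(weights, n) for url, weights in weights_by_url.items()}
```

With the user-count weights from the www.sohu.com example, N = 1 selects "Sohu" and N = 2 adds "Sohu official website".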
In practice, the web page titles obtained by the above method are selected according to user behavior, and the titles mapped to web page URL information in user favorites may not accurately reflect the web page. In the web page navigation data provided by navigation websites, by contrast, pages have been summarized concisely by professionals, so the titles provided are comparatively refined and contain more useful information. Therefore, in the embodiment of the present invention, after the web page whitelist library has been generated, the method can further include:
C11: obtain web page navigation data and extract the web page URL information contained in it and the web page titles mapped to that URL information;
In this step, web page navigation data can be crawled from navigation websites with a web crawler and parsed to extract web page URL information and the mapped web page titles. Crawling web page navigation data and extracting web page titles and URL information are well-known techniques and are not described in detail here.
C12: traverse each piece of extracted web page URL information and query whether it exists in the web page whitelist library. If it does not exist, write the web page URL information and its mapped web page title into the whitelist library. If it exists, obtain the web page titles mapped to that URL information from the extracted data and from the whitelist library respectively, compare them, and determine whether to update the title mapped to that URL information in the whitelist library.
In the embodiment of the present invention, the amount of web page URL information provided in web page navigation data is relatively limited and cannot cover all web page URL information on a large scale, so this method serves as a useful supplement to the web page whitelist library. The navigation data of each navigation website is crawled, the web page URL information and mapped titles are extracted, and, keyed by web page URL information, the title stored in the whitelist library is compared with the title in the crawled navigation data to select the one that expresses the meaning more accurately. If the title stored in the whitelist library is more accurate, nothing is done; if the title extracted from the navigation data is more accurate, the title stored in the whitelist library is updated.
At this point, the process of generating the web page whitelist library ends.
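The supplement step C12 can be sketched as a merge over two URL-to-title dictionaries. The source does not say how "more accurate" is judged, so the comparison is passed in as a caller-supplied predicate; `shorter_is_better` below is only a placeholder assumption, and all names are hypothetical.

```python
def merge_navigation_data(whitelist, navigation, is_more_accurate):
    """Add missing URLs from navigation data and update entries judged better."""
    for url, nav_title in navigation.items():
        stored = whitelist.get(url)
        if stored is None or is_more_accurate(nav_title, stored):
            whitelist[url] = nav_title
    return whitelist


def shorter_is_better(candidate, stored):
    """Placeholder criterion: prefer the shorter title (an assumption)."""
    return len(candidate) < len(stored)
```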
Generating the web page title template library includes:
setting a classification strategy in advance for the web page titles mapped to web page URL information, and setting a corresponding regular-expression rule for the web page titles of each category.
In this step, although the web page titles of websites are numerous and varied, a large amount of web page title data can be classified according to the preset classification strategy. The classification strategy can, for example, divide titles into a social category, a technology blog category, and so on; that is, web page titles are classified as social web page titles, technology blog web page titles, and the like. A corresponding regular-expression rule is set for the web page titles of each category, forming a web page title template.
Subsequently, after a web page title has been classified, it is cut using the regular-expression rule of its category to obtain the truncated web page title. For example, regular-expression rules for social web page titles and for technology blog web page titles are set in the template library in advance; once a web page title has been classified as a social web page title or a technology blog web page title, it is cut according to the preset rule of that category to obtain the corresponding truncated web page title.
The regular-expression rule for each category of web page titles can be obtained by data mining on the titles of that category and is not described in detail here.
In the embodiment of the present invention, the beginning or end of a web page title often contains prefix/suffix information such as "homepage", "it is reported", "foreign media" or "hot spot", which is used to make the title eye-catching, or which indicates the structure of the title but is unrelated to its subject. Removing such prefix/suffix information through the aforementioned regular expression rules or the white list library is relatively cumbersome. Thus, in the embodiment of the present invention, the title library (which stores the web page titles mapped to webpage URL information) can be analyzed at scale, with periodic data mining by the TF-IDF method, to extract the prefix/suffix information.
Generating the web page title prefix/suffix identification library includes:
Obtaining and storing the web page titles mapped to webpage URL information in the user's collection;
Setting a term frequency-inverse document frequency (TF-IDF, Term Frequency-Inverse Document Frequency) calculation strategy for prefix/suffix identification on web page titles.
In the embodiment of the present invention, TF-IDF is a common weighted statistical method for information retrieval. It is used to assess the weight of a word for a document in a document library (file set or corpus): the weight of the word increases in proportion to the number of times it occurs, and decreases in inverse proportion to the frequency with which it occurs across the document library. The inverse document frequency is a measure of the general importance of a word.
The TF weight is calculated as follows:
$$TF = \frac{P_w}{P}$$
In the formula,
$TF$ is the term frequency weight;
$P_w$ is the number of times the word w appears in the document library;
$P$ is the length of the document library, i.e. the total number of words it contains.
The IDF weight is calculated as follows:
$$IDF = \log\frac{D}{D_w}$$
In the formula,
$IDF$ is the inverse document frequency weight;
$D_w$ is the number of individuals (documents) in the sample (document library, file set or corpus) that contain the word w;
$D$ is the total number of samples, i.e. the total number of documents.
A smaller IDF value means that more documents in the sample contain the word, so the word carries less information; a larger IDF value means that fewer documents in the sample contain the word, so the word carries more information.
Combining the term frequency and the inverse document frequency gives the TF-IDF weight:
$$Weight_w = TF \times IDF$$
In the formula, $Weight_w$ is the TF-IDF weight of the word w.
The larger the TF-IDF weight, the better the word serves as an indicator.
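As an illustration only, the word-level weights described above could be computed as in the following minimal sketch. It assumes TF = P_w / P, IDF = log(D / D_w) with a natural logarithm and no smoothing, and a document library given as lists of already segmented words; the function name `tf_idf_weights` and the sample library are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return {word: TF * IDF} for a library given as a list of word lists."""
    total_words = sum(len(doc) for doc in documents)                 # P
    occurrences = Counter(w for doc in documents for w in doc)       # P_w
    doc_counts = Counter(w for doc in documents for w in set(doc))   # D_w
    total_docs = len(documents)                                      # D
    weights = {}
    for word, p_w in occurrences.items():
        tf = p_w / total_words
        idf = math.log(total_docs / doc_counts[word])  # assumed: natural log, no smoothing
        weights[word] = tf * idf
    return weights

# Hypothetical three-document library of segmented titles.
library = [["sina", "blog", "travel"], ["sina", "blog", "food"], ["news", "travel"]]
weights = tf_idf_weights(library)
```

Under these assumptions a word that occurs in every document receives a weight of zero, because its IDF is log(1).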
Obtaining the truncated web page title is described in detail below.
In the embodiment of the present invention, if the web page title matching library includes the webpage white list library, then querying the preset web page title matching library, obtaining the matching rule corresponding to the webpage URL information to be truncated, and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title includes:
Querying the webpage white list library to obtain the web page title mapped to the webpage URL information to be truncated, and taking the obtained web page title as the truncated web page title.
In this step, if the webpage white list library does not store the webpage URL information to be truncated, the web page title can be truncated according to the prior art, which is not described again here.
If the web page title matching library includes the web page title template library, then querying the preset web page title matching library, obtaining the matching rule corresponding to the webpage URL information to be truncated, and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title includes:
D11: extracting the naming rule of the web page title mapped to the webpage URL information to be truncated, matching the extracted naming rule against the preset classification strategy, and obtaining the category to which that web page title belongs;
In this step, the category of a web page title can be distinguished by analyzing its naming rule. Classifying web page titles is a well-known technique, and a detailed description is omitted here.
As an alternative embodiment, if the web page title mapped to the webpage URL information is invalid, i.e. the title does not reflect the web page content at all (for example, it is empty, contains no text, or contains only symbols), the domain name of the webpage URL information can be returned as the truncated web page title.
D12: querying the web page title template library to obtain the regular expression rule corresponding to the category to which the web page title mapped to the webpage URL information to be truncated belongs;
In this step, if the web page title mapped to the webpage URL information to be truncated is classified as a social web page title, the regular expression rule set for social web page titles is read from the web page title template library.
D13: applying the obtained regular expression rule to the web page title mapped to the webpage URL information to be truncated, to obtain the truncated web page title.
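Steps D11 to D13, together with the domain-name fallback for invalid titles, can be sketched as follows. This is a hedged illustration rather than the patented implementation: the category names, the two regular expressions, the toy `classify` strategy and the validity test (at least one letter or digit) are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Hypothetical template library: category -> regular expression rule.
TEMPLATE_LIBRARY = {
    "tech_blog": re.compile(r"^(?P<core>.+?)\s*-\s*[^-]+ blog$"),
    "social": re.compile(r"^(?P<core>.+?)\s*\|\s*.+$"),
}

def classify(title):
    """Toy classification strategy based on the title's naming pattern (D11)."""
    if title.endswith("blog"):
        return "tech_blog"
    if "|" in title:
        return "social"
    return None

def is_valid(title):
    """Assumed validity test: the title contains at least one letter or digit."""
    return any(ch.isalnum() for ch in title)

def truncate_with_template(url, title):
    if not is_valid(title):
        return urlparse(url).netloc              # fall back to the domain name
    rule = TEMPLATE_LIBRARY.get(classify(title))  # D12: look up the category's rule
    if rule is None:
        return None                               # no template for this title
    match = rule.match(title)                     # D13: apply the rule
    return match.group("core") if match else None
```

A return value of `None` signals that no template applied, so a caller could then fall back to another truncation method.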
If the web page title matching library includes the web page title prefix/suffix identification library, then querying the preset web page title matching library, obtaining the matching rule corresponding to the webpage URL information to be truncated, and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title includes:
E11: obtaining the web page title mapped to the webpage URL information to be truncated, and splitting the obtained web page title according to a preset splitting strategy to obtain one or more webpage subtitles;
In this step, the component parts of a web page title show certain characteristics when titles are collected: a title generally comprises a prefix (or descriptive expression), the title text, and one or more suffixes. Statistical analysis of these components shows that they can be distinguished by certain specific punctuation marks. The title text is the information useful to the user and can be provided to the user as a whole.
Accordingly, in the embodiment of the present invention, the splitting strategy may split a web page title at preset punctuation marks contained in it. For example, the preset punctuation marks may include _, -, +, &, #, :, |, ┊, ‖, ;, the comma and the full stop, among others. If a web page title contains any of these preset symbols, the title is split at that symbol.
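A splitting strategy of this kind can be sketched with a single regular expression character class. The delimiter set below is an assumed subset chosen for illustration, and `split_title` is a hypothetical helper name.

```python
import re

# Assumed subset of the preset punctuation marks.
DELIMITERS = "_-|:;,#&+"
SPLIT_PATTERN = re.compile("[" + re.escape(DELIMITERS) + "]")

def split_title(title):
    """Split a web page title into subtitles at any preset delimiter."""
    parts = (part.strip() for part in SPLIT_PATTERN.split(title))
    return [part for part in parts if part]  # drop empty fragments
```

Note that a plain hyphen in the delimiter set also splits hyphenated words, so a production strategy would probably need a more careful rule for that symbol.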
E12: for each webpage subtitle, calculating its TF-IDF value using the TF-IDF calculation strategy set in the web page title prefix/suffix identification library, in combination with the web page titles mapped to webpage URL information stored in that library;
In this step, as an alternative embodiment, before the TF-IDF value of each webpage subtitle is calculated, the method further comprises:
Combining the obtained webpage subtitles and, for each combined webpage subtitle, calculating its TF-IDF value in combination with the web page titles mapped to webpage URL information stored in the web page title prefix/suffix identification library and the TF-IDF calculation strategy; and, if none of the combined webpage subtitles is a prefix/suffix, performing the calculation of the TF-IDF value of each individual webpage subtitle.
In this step, webpage subtitles may be combined as follows. For example, suppose that splitting a web page title yields three webpage subtitles in order, A, B and C. Combining them gives two combined webpage subtitles, AB and BC. AB is judged first: if AB is a prefix/suffix, C is taken as the truncated web page title. If AB is not a prefix/suffix, BC is judged: if BC is a prefix/suffix, A is taken as the truncated web page title. If BC is not a prefix/suffix either, A, B and C are then judged individually.
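The three-subtitle example can be written down directly. In this sketch `is_affix` is a hypothetical predicate supplied by the caller (for instance a TF-IDF threshold test); how it is implemented is outside the scope of the example.

```python
def truncate_with_combinations(subtitles, is_affix):
    """Apply the AB / BC check to exactly three subtitles A, B, C.

    Returns the truncated title, or None when neither combination is a
    prefix/suffix and A, B and C should be judged individually.
    """
    a, b, c = subtitles
    if is_affix((a, b)):   # AB is a prefix: keep C
        return c
    if is_affix((b, c)):   # BC is a suffix: keep A
        return a
    return None
```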
In the embodiment of the present invention, the TF-IDF value of a webpage subtitle can be calculated as follows:
$$TF = \frac{N'}{N}, \qquad IDF = \log\frac{D'}{D + 1}, \qquad TFIDF = TF \times IDF$$
In the formula,
$TF$ is the term frequency of the webpage subtitle;
$IDF$ is the inverse document frequency of the webpage subtitle;
$N'$ is the number of times the webpage subtitle occurs in the sample set;
$N$ is the total number of webpage subtitles in the sample set;
$D$ is the number of documents in the sample set that contain the webpage subtitle;
$D'$ is the total number of documents in the sample set;
$+1$ is a smoothing term.
It should be noted that the TF-IDF value of a combined webpage subtitle is calculated in a similar way to that of a single webpage subtitle; a detailed description is omitted here.
E13: judging whether the calculated TF-IDF value is greater than a preset prefix/suffix threshold; if so, determining that the webpage subtitle is a prefix/suffix, filtering it out of the web page title, and taking the web page title with prefixes/suffixes filtered out as the truncated web page title.
In this step, if the TF-IDF value of a webpage subtitle calculated in step E12 is greater than the preset prefix/suffix threshold, the webpage subtitle as a whole is a prefix/suffix, and it is deleted.
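Steps E12 and E13 can be illustrated with the sketch below. It assumes the subtitle score takes the form (N'/N) × log(D'/(D+1)), with the +1 placed in the denominator as the smoothing term; that placement, the natural logarithm, the threshold of 0.1 and the synthetic sample set are all assumptions made for this example.

```python
import math

def subtitle_tfidf(subtitle, sample_titles):
    """Score one subtitle against a sample set of already split titles."""
    n_occ = sum(t.count(subtitle) for t in sample_titles)        # N'
    n_total = sum(len(t) for t in sample_titles)                 # N
    d_contain = sum(1 for t in sample_titles if subtitle in t)   # D
    d_total = len(sample_titles)                                 # D'
    return (n_occ / n_total) * math.log(d_total / (d_contain + 1))

def filter_affixes(subtitles, sample_titles, threshold):
    """Keep only the subtitles whose score does not exceed the threshold (E13)."""
    return [s for s in subtitles if subtitle_tfidf(s, sample_titles) <= threshold]

# Synthetic sample: 100 titles, 30 of which end with the same suffix.
sample = [[f"Article {i}"] + (["sina blog"] if i < 30 else []) for i in range(100)]
kept = filter_affixes(["Article 0", "sina blog"], sample, threshold=0.1)
```

With this sample the shared suffix scores well above a unique title text, which is why a "greater than threshold" test singles it out; a subtitle present in nearly every title would instead score close to zero or below under this particular smoothing.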
Further, since the web page titles written by each website's editors tend to follow their own style or template, in practical applications, filtering prefixes/suffixes again from each webpage subtitle contained in the truncated web page title after the above prefix/suffix judgement (i.e. after step E13) can further improve the validity of the output truncated web page title. Thus, the method can further include:
E14: according to the webpage URL information to be truncated, extracting the web page titles mapped to that webpage URL information from the web page titles stored in the web page title prefix/suffix identification library;
E15: for each webpage subtitle of the truncated web page title, calculating its TF-IDF value using the TF-IDF calculation strategy set in the web page title prefix/suffix identification library, in combination with the extracted web page titles;
In this step, for each webpage subtitle remaining after prefixes/suffixes have been filtered out, the TF-IDF value is calculated in combination with the web page titles, in the web page title prefix/suffix identification library, of the website to which the extracted web page title belongs.
E16: judging whether the calculated TF-IDF value is greater than the preset prefix/suffix threshold; if so, determining that the webpage subtitle is a prefix/suffix, filtering it out of the truncated web page title, and updating the truncated web page title.
In the embodiment of the present invention, steps E14 to E16 are the detailed process of performing prefix/suffix identification on a web page title using the web page title prefix/suffix list and/or the prefix/suffix recognition rules stored in the aforementioned web page title prefix/suffix identification library.
In the embodiment of the present invention, steps E14 to E16 take the web page titles of all websites as the sample library and perform the prefix/suffix judgement on each web page title within that sample library.
As another alternative embodiment, web page titles can first be classified by website, for example into Sohu, Sina, 163, NetEase, etc. The web page titles mapped to the webpage URL information of the corresponding category are then extracted from the web page title prefix/suffix identification library, TF-IDF values are calculated using the TF-IDF calculation strategy set in that library, and the prefix/suffix judgement is made, thereby removing the prefixes/suffixes. Compared with the foregoing case, in which the web page titles of all websites form the sample library, this embodiment takes the web page titles of each website category as a separate sample library; the category to which the webpage URL information to be truncated belongs is determined, and the prefix/suffix judgement on the corresponding web page title is made within that category's sample library.
As another alternative embodiment, the prefixes/suffixes mined by the TF-IDF method can also be stored in a web page title prefix/suffix library. In subsequent processing, a web page title is first split and then preliminarily matched against the web page title prefix/suffix library, and the prefixes/suffixes in the title that match the library are filtered out. The prefix/suffix judgement by the TF-IDF method is then made on the filtered web page title, and any prefixes/suffixes so identified are added incrementally to the pre-stored web page title prefix/suffix library.
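The last alternative, a stored prefix/suffix library that is matched first and then grown incrementally, might look like the following sketch. The library is modelled as a plain Python set and `is_affix` is a hypothetical stand-in for the TF-IDF judgement; both are assumptions for illustration.

```python
def truncate_with_affix_library(subtitles, affix_library, is_affix):
    """Filter known affixes first, then judge the rest and learn new affixes."""
    # Preliminary match: drop subtitles already stored in the library.
    remaining = [s for s in subtitles if s not in affix_library]
    kept = []
    for subtitle in remaining:
        if is_affix(subtitle):           # e.g. TF-IDF value above the threshold
            affix_library.add(subtitle)  # incremental update of the library
        else:
            kept.append(subtitle)
    return kept
```

The point of the preliminary match is that subtitles already known to be affixes never reach the more expensive TF-IDF judgement.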
If the web page title matching library includes the webpage white list library, the web page title template library and the web page title prefix/suffix identification library, then querying the preset web page title matching library, obtaining the matching rule corresponding to the webpage URL information to be truncated, and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title includes:
F11: querying the webpage white list library; if the web page title mapped to the webpage URL information to be truncated is obtained, taking the obtained web page title as the truncated web page title; otherwise, executing step F12;
F12: extracting the naming rule of the web page title mapped to the webpage URL information to be truncated, matching the extracted naming rule against the preset classification strategy, and obtaining the category to which that web page title belongs;
F13: querying the web page title template library; if the regular expression rule corresponding to the category of the web page title mapped to the webpage URL information to be truncated is obtained, applying that rule to the web page title to obtain the truncated web page title; otherwise, executing step F14;
F14: obtaining the web page title mapped to the webpage URL information to be truncated, and splitting the obtained web page title according to the preset splitting strategy to obtain one or more webpage subtitles;
F15: for each webpage subtitle, calculating its TF-IDF value using the TF-IDF calculation strategy set in the web page title prefix/suffix identification library, in combination with the web page titles mapped to webpage URL information stored in that library;
F16: judging whether the calculated TF-IDF value is greater than the preset prefix/suffix threshold; if so, determining that the webpage subtitle is a prefix/suffix, filtering it out of the web page title, and taking the web page title with prefixes/suffixes filtered out as the truncated web page title.
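The fall-through order of steps F11 to F16 can be summarised in a few lines. In this sketch the three libraries are abstracted as a dictionary and two caller-supplied functions; `template_lookup` and `affix_filter` are hypothetical names, and the white list entry reuses the Baidu example from this description.

```python
def truncate_title(url, title, whitelist, template_lookup, affix_filter):
    """Try the white list, then the template library, then affix filtering."""
    if url in whitelist:                      # F11: white list hit
        return whitelist[url]
    templated = template_lookup(url, title)   # F12-F13: classify and apply a rule
    if templated is not None:
        return templated
    return affix_filter(title)                # F14-F16: split, score, filter

whitelist = {"http://www.baidu.com/": "Baidu"}
```

Ordering the lookups from cheapest and most reliable to most expensive means the TF-IDF computation only runs for titles that neither earlier library can handle.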
As an alternative embodiment, the method can further include:
Step 103: delivering the obtained truncated web page title to the user's collection for storage.
In this step, the server may also deliver the obtained truncated web page title to the user for display; after the user chooses whether to modify it, the title is stored in the collection according to the user's choice.
As another alternative embodiment, the method can further include:
Truncating the web page title to be truncated using preset general truncation rules. Truncation using general truncation rules is described in detail later.
Fig. 2 shows a schematic structural diagram of the device for obtaining a truncated web page title according to the embodiment of the present invention. Referring to Fig. 2, the device includes a truncation request processing module and a truncated web page title obtaining module, wherein
The truncation request processing module is configured to obtain, from a received request to truncate a web page title, the webpage URL information to be truncated and the web page title mapped to that webpage URL information;
The truncated web page title obtaining module is configured to query the preset web page title matching library, obtain the matching rule corresponding to the webpage URL information to be truncated, and process the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title. The web page title matching library includes the webpage white list library, and/or the web page title template library, and/or the web page title prefix/suffix identification library, wherein
The webpage white list library stores the truncated web page title corresponding to webpage URL information;
The web page title template library stores the regular expression truncation rule corresponding to webpage URL information;
The web page title prefix/suffix identification library stores a web page title prefix/suffix list and/or prefix/suffix recognition rules.
Wherein,
The truncation request processing module includes a receiving unit and a parsing unit (not shown), wherein
The receiving unit is configured to receive the request to truncate a web page title;
The parsing unit is configured to parse the request to truncate a web page title and obtain the web page title to be truncated and the webpage URL information to be truncated.
As an alternative embodiment, the truncated web page title obtaining module includes a webpage white list library generation unit and a truncated web page title query unit (not shown), wherein
The webpage white list library generation unit is configured to: extract each piece of webpage URL information contained in users' collections and the web page titles mapped to it; for each piece of webpage URL information, obtain all web page titles mapped to it and count the number of users corresponding to each of those titles; apply each web page title and its user count to a preset webpage white list calculation strategy to obtain a web page title weight; and, for the same webpage URL information, select the web page title with the largest weight and place the webpage URL information, together with the selected title as the web page title mapped to it, into the webpage white list library;
The truncated web page title query unit is configured to query the webpage white list library generation unit, obtain the web page title mapped to the webpage URL information to be truncated, and take the obtained web page title as the truncated web page title.
In the embodiment of the present invention, preferably, the truncated web page title obtaining module can also include:
A web page title updating unit, configured to: obtain web page navigation data and extract the webpage URL information it contains and the web page titles mapped to that information; traverse each extracted piece of webpage URL information and query whether it exists in the webpage white list library generation unit; if it does not exist, write the webpage URL information and its mapped web page title into the webpage white list library generation unit; if it exists, obtain the web page title mapped to the webpage URL information from the extracted titles and from the webpage white list library generation unit respectively, compare them, and determine whether to update the web page title mapped to that webpage URL information in the webpage white list library generation unit.
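The white list generation described for this unit can be sketched as a simple aggregation. The description leaves the white list calculation strategy unspecified, so the sketch assumes the simplest possible weight, namely the number of distinct users who saved a title; the record layout `(user_id, url, title)` and the function name `build_whitelist` are likewise hypothetical.

```python
from collections import defaultdict

def build_whitelist(collection_records):
    """Build {url: title} from (user_id, url, title) collection records."""
    users_per_title = defaultdict(lambda: defaultdict(set))
    for user_id, url, title in collection_records:
        users_per_title[url][title].add(user_id)
    whitelist = {}
    for url, titles in users_per_title.items():
        # Assumed weight: number of distinct users who saved this title.
        whitelist[url] = max(titles, key=lambda t: len(titles[t]))
    return whitelist
```

A real calculation strategy could combine the user count with other signals, such as title length, before the maximum is taken.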
As another alternative embodiment, the truncated web page title obtaining module includes a web page title template library generation unit and a truncated web page title acquiring unit, wherein
The web page title template library generation unit is configured to set in advance a classification strategy for the web page titles mapped to webpage URL information, and to set a corresponding regular expression rule for the web page titles of each category;
The truncated web page title acquiring unit is configured to: extract the naming rule of the web page title mapped to the webpage URL information to be truncated, match the extracted naming rule against the preset classification strategy, and obtain the category to which that web page title belongs; query the web page title template library generation unit to obtain the regular expression rule corresponding to that category; and apply the obtained regular expression rule to the web page title mapped to the webpage URL information to be truncated, to obtain the truncated web page title.
As yet another alternative embodiment, the truncated web page title obtaining module includes a web page title prefix/suffix identification library generation unit and a truncated web page title processing unit, wherein
The web page title prefix/suffix identification library generation unit is configured to obtain and store the web page titles mapped to webpage URL information in the user's collection, and to set a TF-IDF calculation strategy for prefix/suffix identification on web page titles.
The truncated web page title processing unit is configured to: obtain the web page title mapped to the webpage URL information to be truncated and split it according to the preset splitting strategy to obtain one or more webpage subtitles; for each webpage subtitle, calculate its TF-IDF value using the TF-IDF calculation strategy set in the web page title prefix/suffix identification library, in combination with the web page titles mapped to webpage URL information stored in that library; and judge whether the calculated TF-IDF value is greater than the preset prefix/suffix threshold and, if so, determine that the webpage subtitle is a prefix/suffix, filter it out of the web page title, and take the web page title with prefixes/suffixes filtered out as the truncated web page title.
As yet another alternative embodiment, the truncated web page title obtaining module includes a webpage white list library generation unit, a web page title template library generation unit, a web page title prefix/suffix identification library generation unit, a truncated web page title query unit, a truncated web page title acquiring unit and a truncated web page title processing unit, wherein
The webpage white list library generation unit is configured to: extract each piece of webpage URL information contained in users' collections and the web page titles mapped to it; for each piece of webpage URL information, obtain all web page titles mapped to it and count the number of users corresponding to each of those titles; apply each web page title and its user count to a preset webpage white list calculation strategy to obtain a web page title weight; and, for the same webpage URL information, select the web page title with the largest weight and place the webpage URL information, together with the selected title as the web page title mapped to it, into the webpage white list library;
The web page title template library generation unit is configured to set in advance a classification strategy for the web page titles mapped to webpage URL information, and to set a corresponding regular expression rule for the web page titles of each category;
The web page title prefix/suffix identification library generation unit is configured to obtain and store the web page titles mapped to webpage URL information in the user's collection, and to set a TF-IDF calculation strategy for prefix/suffix identification on web page titles;
The truncated web page title query unit is configured to query the webpage white list library generation unit according to the webpage URL information to be truncated; if the web page title mapped to the webpage URL information to be truncated is obtained, take the obtained web page title as the truncated web page title; otherwise, notify the truncated web page title acquiring unit;
The truncated web page title acquiring unit is configured to: extract the naming rule of the web page title mapped to the webpage URL information to be truncated, match the extracted naming rule against the preset classification strategy, and obtain the category to which that web page title belongs; query the web page title template library generation unit and, if the regular expression rule corresponding to that category is obtained, apply it to the web page title mapped to the webpage URL information to be truncated to obtain the truncated web page title; otherwise, notify the truncated web page title processing unit;
The truncated web page title processing unit is configured to: obtain the web page title mapped to the webpage URL information to be truncated and split it according to the preset splitting strategy to obtain one or more webpage subtitles; for each webpage subtitle, calculate its TF-IDF value using the TF-IDF calculation strategy set in the web page title prefix/suffix identification library, in combination with the web page titles mapped to webpage URL information stored in that library; and judge whether the calculated TF-IDF value is greater than the preset prefix/suffix threshold and, if so, determine that the webpage subtitle is a prefix/suffix, filter it out of the web page title, and take the web page title with prefixes/suffixes filtered out as the truncated web page title.
A specific embodiment is given below to illustrate the method for obtaining a truncated web page title.
Fig. 3 shows a detailed flow diagram of the method for obtaining a truncated web page title according to the embodiment of the present invention. Referring to Fig. 3, the process includes:
Step 301: inputting the web page title to be truncated and the webpage URL information mapped to it;
In this step, the user may be collecting a web page title while browsing, or optimizing (i.e. truncating) a web page title already stored in the collection. For example, when the user clicks a web page title to be optimized, the web browser is triggered to send the server the web page title to be truncated and the webpage URL information mapped to it.
Step 302: querying the webpage white list library according to the webpage URL information; if the webpage URL information is stored in the webpage white list library, executing step 303; otherwise, executing step 304;
Step 303: reading the web page title mapped to the webpage URL information in the webpage white list library, outputting it as the truncated web page title, and ending the process;
Step 304: judging whether the input web page title to be truncated is valid; if it is invalid, executing step 305; otherwise, executing step 306;
In step 303 to step 304, the webpage URL information of input is retrieved from webpage white list library, if the white name of webpage It is stored with the webpage URL information of input in single library, then hits webpage white list, directly returns to the net of webpage URL information mapping Page head exports as truncated web page title and terminates process;Otherwise, the having to truncated web page title to input is needed Effect property is judged.For example, input is " using Baidu.com, you are known that " to truncated web page title, the webpage URL of mapping believes Breath ishttp://www.baidu.com/, then pass through webpage white list library inquiry and matching, return and stored in webpage white list library Web page title " Baidu " be used as truncated web page title.
As alternative embodiment, webpage white list library can also be loaded into caching in advance, carry out webpage in the buffer URL information matching shortens the processing time in this way, the efficiency for obtaining truncated web page title can be improved.
In this step, an invalid web page title is an input title that cannot reflect the web page content at all, for example, a title that is empty or contains no text (for example, one that contains only symbols).
Step 305: return the domain name corresponding to the web page URL information as the truncated web page title, and end the process.
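The early exits of steps 302 to 305 can be sketched as follows. This is a minimal illustration, not the patented implementation: the in-memory `WHITELIST` dictionary, the function names, and the rule that a title is "valid" when it contains at least one letter or digit are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical in-memory whitelist (URL -> stored title); the single entry
# mirrors the Baidu example in the text.
WHITELIST = {"http://www.baidu.com/": "Baidu"}


def is_valid_title(title: str) -> bool:
    """Assumed validity rule: the title contains at least one letter or digit.

    str.isalnum() is also true for CJK characters, so symbol-only and empty
    titles are the ones rejected.
    """
    return any(ch.isalnum() for ch in title)


def early_exit(url: str, title: str):
    """Return a truncated title for steps 302-305, or None to go on to step 306."""
    hit = WHITELIST.get(url)            # step 302: whitelist lookup
    if hit is not None:
        return hit                      # step 303: whitelist hit
    if not is_valid_title(title):       # step 304: validity check
        return urlparse(url).hostname   # step 305: fall back to the domain name
    return None
```

Keeping the whitelist in a dictionary also corresponds to the cached variant mentioned above, since the lookup never leaves memory.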
Step 306: query the web page title template library with the input web page URL information. If the template library contains the input URL information, execute step 307; otherwise, execute step 308.
Step 307: read the regular-expression rule corresponding to the input web page URL information from the template library, apply it to the web page title to be truncated to obtain the truncated web page title, and end the process.
In this step, it is checked whether the input web page URL information hits the web page title template library. For example, suppose the input title to be truncated is "Russian girl outdoor bathing place get sun very sexy _ Liu Xingyun _ sina blog" and the web page URL information is http://blog.sina.com.cn/s/blog_49b0d2b50102eyxt.html?tj=1. If the template library stores http://blog.sina.com.cn together with its regular-expression rule, the template library is hit, and the stored rule extracts "Russian girl outdoor bathing place get sun very sexy _ Liu Xingyun" from the input title as the truncated web page title.
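A template-library hit of this kind might look like the sketch below. The text does not give the stored rule, so the pattern for blog.sina.com.cn is an assumed one that simply drops the final "_ sina blog" segment; keying the library by host name rather than by URL prefix, and the names `TEMPLATES` and `apply_template`, are likewise assumptions.

```python
import re
from urllib.parse import urlparse

# Hypothetical template library: host -> compiled rule. The pattern is an
# assumption that keeps everything before the trailing "_ sina blog" segment.
TEMPLATES = {
    "blog.sina.com.cn": re.compile(r"^(?P<keep>.+?)\s*_\s*sina blog\s*$"),
}


def apply_template(url: str, title: str):
    """Return the part of the title kept by the host's rule, or None on a miss."""
    rule = TEMPLATES.get(urlparse(url).hostname or "")
    if rule is None:
        return None                       # template library not hit: go to step 308
    match = rule.match(title)
    return match.group("keep") if match else None
```

Returning `None` on a miss lets the caller fall through to the splitting stage that begins at step 308.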
Step 308: split the input web page title to be truncated according to a preset splitting strategy to obtain one or more web page sub-titles.
Step 309: for each sub-title, using the web page titles mapped to the web page URL information that are stored in the web page title prefix/suffix identification library, compute the sub-title's TF-IDF (term frequency-inverse document frequency) value with the TF-IDF calculation strategy set in that library.
Step 310: judge whether the computed TF-IDF value is greater than a preset prefix/suffix threshold. If so, execute step 311; otherwise, return to step 309 and compute the TF-IDF value of the next sub-title.
Step 311: determine that a sub-title whose value is greater than the preset prefix/suffix threshold is a prefix or suffix, and filter it out of the web page title to be truncated.
Step 312: judge whether the length of the web page title after prefixes and suffixes are filtered out is greater than a preset web page title length threshold. If not, execute step 313; otherwise, execute step 314.
Step 313: take the filtered result as the truncated web page title, and end the process.
In steps 308 to 313, the input web page title to be truncated is split by matching against preset punctuation marks. Then, using the web page titles mapped to the web page URL information that are stored in the prefix/suffix identification library, together with the TF-IDF calculation strategy set in that library, prefixes and suffixes are identified on a maximum-matching principle. If a web page title contains only one prefix or suffix (for example, the website title), it is enough to remove the prefix or suffix that is identified. A web page title may, however, contain several prefixes and suffixes, for example the website hierarchy in addition to the page information, so splitting can produce several candidate prefixes and suffixes. To remove them accurately, the embodiment of the present invention combines the candidates, for example by maximum forward matching, maximum reverse matching, or both at once, includes all candidate prefixes and suffixes in the statistics, and extracts the prefixes and suffixes so that they can be filtered out of the web page title. If the length of the filtered web page title satisfies the preset web page title length threshold, the filtered title is returned as the truncated web page title.
In the embodiment of the present invention, for example, suppose the input title to be truncated is "high definition: Wuhan reporter investigates half a month secretly and takes off dealer's kidney shady deal _ news _ www.qq.com" and the web page URL information is http://news.qq.com/a/20130820/003196.htm#p=3. Based on the web page titles mapped to the web page URL information that are stored in the prefix/suffix identification library, the splitting strategy, the maximum-matching principle, and the TF-IDF calculation strategy, the result after the prefixes and suffixes are filtered out is "Wuhan reporter investigates half a month secretly and takes off dealer's kidney shady deal", and this result is used as the truncated web page title.
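The prefix/suffix (affix) filtering of steps 308 to 313 can be illustrated with a small sketch. The text does not spell out its TF-IDF variant, so the code below swaps in a plainly simpler stand-in score: the share of titles stored for the same site in which a sub-title occurs. The separator set, the 0.5 threshold, and all function names are assumptions for illustration only.

```python
import re
from collections import Counter

# Assumed subset of the preset punctuation; the capture group keeps the
# separators so the untouched middle of the title can be rejoined verbatim.
SEP = re.compile(r"(\s*[_|:]\s*)")


def sub_titles(title):
    """Split a title into its non-empty sub-titles (step 308)."""
    return [part for part in SEP.split(title)[::2] if part]


def affix_scores(site_titles):
    """Stand-in for step 309: share of the site's titles containing each sub-title."""
    counts = Counter()
    for stored in site_titles:
        counts.update(set(sub_titles(stored)))
    total = len(site_titles)
    return {sub: n / total for sub, n in counts.items()}


def strip_affixes(title, site_titles, threshold=0.5):
    """Drop leading and trailing sub-titles whose score exceeds the threshold."""
    scores = affix_scores(site_titles)
    tokens = SEP.split(title)          # [part, sep, part, sep, part, ...]
    lo, hi = 0, len(tokens) - 1        # parts sit at even indices
    while lo < hi and scores.get(tokens[lo], 0.0) > threshold:   # steps 310-311
        lo += 2
    while hi > lo and scores.get(tokens[hi], 0.0) > threshold:
        hi -= 2
    return "".join(tokens[lo:hi + 1])
```

Stripping only from the two ends, and always leaving at least one sub-title, is a design choice of this sketch; it does not attempt the combination of adjacent sub-titles described above.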
Step 314: judge whether the web page title to be truncated contains enclosed content. If so, execute step 315; otherwise, execute step 316.
In this step, enclosed content is content enclosed in symbols such as book title marks or brackets.
Step 315: take the enclosed content as the truncated title content, and end the process.
Step 316: segment the web page title to be truncated using a preset first group of punctuation marks.
Step 317: judge whether any segment produced has a length not greater than a preset segment threshold. If so, execute step 318; otherwise, execute step 321.
Step 318: for each segment not longer than the preset segment threshold, remove the common phrases in the segment, and judge whether the length of the segment after removal is not greater than the preset web page title length threshold. If so, execute step 319; otherwise, execute step 320.
Step 319: return the segment with common phrases removed as the truncated web page title, and end the process.
Step 320: segment the segment with common phrases removed using a preset second group of punctuation marks, and return to step 317.
Step 321: starting from the beginning of the web page title to be truncated, take a character string whose length equals the web page title length threshold as the truncated web page title.
In the embodiment of the present invention, steps 314 to 321 truncate the web page title to be truncated using a preset general truncation rule. For example, for the title to be truncated "trivial games, 4399 trivial games, trivial games are big Entirely, the game of double trivial games complete works-www.4399.com Largest In China" with the web page URL information "http://www.4399.com/ sogou", applying the general truncation rule above yields the truncated web page title "4399 Trivial games".
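The general truncation rule of steps 314 to 321 can be approximated as below. The two punctuation groups, the common-phrase list, and both length thresholds are not given in the text, so every constant here is a hypothetical placeholder; the loop also stops after the second punctuation group instead of returning to step 317 indefinitely.

```python
import re

ENCLOSED_RE = re.compile(r"《([^》]+)》|\(([^)]+)\)|\[([^\]]+)\]")
FIRST_GROUP = r"[-_|]"                             # assumed first punctuation group
SECOND_GROUP = r"[,;:]"                            # assumed second punctuation group
COMMON_PHRASES = ("home page", "official site")    # hypothetical phrase list


def strip_common(segment):
    """Remove the (assumed) common phrases from one segment."""
    for phrase in COMMON_PHRASES:
        segment = segment.replace(phrase, "")
    return segment.strip()


def general_truncate(title, segment_max=20, title_max=12):
    match = ENCLOSED_RE.search(title)                    # step 314
    if match:
        return next(g for g in match.groups() if g)      # step 315
    pieces = [title]
    for group in (FIRST_GROUP, SECOND_GROUP):            # steps 316 and 320
        pieces = [p.strip() for piece in pieces for p in re.split(group, piece)]
        short = [strip_common(p) for p in pieces if p and len(p) <= segment_max]
        if not short:                                    # step 317, "no" branch
            break
        for segment in short:                            # step 318
            if segment and len(segment) <= title_max:
                return segment                           # step 319
        pieces = short                                   # re-split cleaned segments
    return title[:title_max]                             # step 321
```

The sketch returns the first qualifying segment in title order; the text does not say how ties between qualifying segments are resolved, so that ordering is an assumption.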
As can be seen from the above, the embodiment of the present invention addresses, for the first time, the technical problem that web page titles in the favorites are too long, which impairs display and leaves little useful information visible. It proposes combining several strategies to truncate the web page title to be truncated. Specifically, favorites data are used: the web page titles that users give to web pages are statistically analyzed and extracted to generate the web page whitelist library; the regular-expression truncation rule corresponding to each piece of web page URL information is stored in advance in the web page title template library; and, from a large amount of web page URL information and the web page titles mapped to it, a web page title prefix/suffix list and/or prefix/suffix identification rules are set in the web page title prefix/suffix identification library. Prefixes, suffixes, and descriptive expressions contained in web page titles are thereby removed effectively, so that the truncated web page title meets the requirements of the browser display area and the de-redundancy effect for web page titles is improved. Further, the truncation method combining several strategies has higher accuracy, so the truncated web page title gives the user more useful information, which improves the user's service experience.
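The whitelist generation summarized above can be sketched for the simplest weighting, in which a title's weight is the number of users who gave the page that title. The record shape `(user_id, url, title)` and the function name are assumptions; ties are resolved in favor of the title seen first, which the text does not specify.

```python
from collections import defaultdict


def build_whitelist(favorites):
    """Build URL -> title from (user_id, url, title) records (assumed shape)."""
    users_per_title = defaultdict(set)
    for user_id, url, title in favorites:
        users_per_title[(url, title)].add(user_id)   # each user counts once
    best = {}
    for (url, title), users in users_per_title.items():
        weight = len(users)                          # weight = number of users
        if url not in best or weight > best[url][1]:
            best[url] = (title, weight)
    return {url: title for url, (title, _) in best.items()}
```

A field-weighted strategy would replace `len(users)` with a weighted sum over the same users.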
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such a system is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided here. It will be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment can be combined into one module, unit, or component, and can furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will understand that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the junk short message identification equipment according to an embodiment of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.

Claims (23)

1. A method for obtaining a truncated web page title, comprising:
obtaining web page URL information and the web page title to be truncated that is mapped to the web page URL information;
processing the web page title to be truncated so as to retain only the part that reflects the web page content;
wherein processing the web page title to be truncated comprises one of the following methods or any combination thereof: performing word segmentation on the title and removing meaningless words; querying a preset web page title matching library to obtain the matching rule corresponding to the web page URL information to be truncated, and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title; and truncating the title using a general rule;
wherein the web page title matching library comprises a web page whitelist library, and/or a web page title template library, and/or a web page title prefix/suffix identification library; the web page whitelist library stores the truncated web page title corresponding to web page URL information, and the web page title template library stores the regular-expression truncation rule corresponding to web page URL information.
2. The method of claim 1, wherein the web page whitelist library stores the truncated web page title corresponding to web page URL information;
the web page title template library stores the regular-expression truncation rule corresponding to web page URL information; and
the web page title prefix/suffix identification library stores a web page title prefix/suffix list and/or prefix/suffix identification rules.
3. The method of claim 2, wherein processing the web page title to be truncated according to the obtained matching rule comprises:
processing the web page title to be truncated using, in order, the web page whitelist library, the web page title template library, and the web page title prefix/suffix identification library.
4. The method of claim 1, wherein obtaining the web page URL information and the web page title to be truncated that is mapped to the web page URL information comprises:
receiving a request to truncate a web page title;
parsing the request to obtain the web page title to be truncated and the web page URL information of that title.
5. the method as described in claim 1, generating webpage white list library includes:
It extracts multiple users and arranges each webpage URL information for including in data and webpage uniform resource locator The web page title of information MAP;
For each web page resources locator information, all web page titles of web page resources locator information mapping are obtained, with And count the corresponding number of users of each web page title of web page resources locator information mapping;
The corresponding number of users of web page title and web page title are applied to pre-set webpage white list calculative strategy, obtained The web page title weighted value;
In same webpage URL information, the corresponding web page title of maximum web page title weighted value is chosen, by net The web page title that page URL information and the web page title of selection are mapped as webpage URL information, It is placed in the webpage white list library of setting.
6. method as claimed in claim 5, the webpage white list calculative strategy is the calculative strategy according to number of users, described Web page title weighted value is number of users.
7. method as claimed in claim 5, the webpage white list calculative strategy is to lead according to belonging to pre-set user The calculative strategy of domain weight, the described web page title weighted value that obtains include:
The Feature Words for including in the web page title of webpage URL information mapping are extracted, with pre-set each field Feature dictionary is matched, and determines field belonging to the webpage URL information;
According to being in advance the respectively arranged each field weight of each user, the mapping of webpage URL information is obtained respectively Each web page title user for including the determining webpage URL information fields field weight;
The number of users for including by web page title and user are in the determining webpage URL information fields Field weight is applied to pre-set weight calculation formula, obtains web page title weighted value;
Wherein, the weight calculation formula, specifically:
In formula,
XiFor i-th of web page title weighted value, wherein i is natural number;
Ui,jFor corresponding j-th of the user of i-th of web page title;
ξi,jFor corresponding j-th of the user of i-th of web page title the web page title fields field weight;
K is the corresponding total number of users of i-th of web page title, and K is natural number.
8. method according to claim 6 or 7, the method further includes:
Web page navigation data are obtained, the webpage URL information and the webpage for including in web page navigation data are extracted The web page title of URL information mapping;
The each webpage URL information extracted is traversed, it is unified to whether there is the webpage in query webpage white list library Resource Locator information, if it does not, by the webpage URL information and the webpage uniform resource locator White list library is written in the web page title of information MAP, if it does, from the web page title of extraction and webpage white list library, point The web page title for not obtaining webpage URL information mapping, determines whether more new web page white list after being compared The web page title that the webpage URL information maps in library.
9. method according to claim 8, the web page title matching library includes webpage white list library, and the inquiry is set in advance The web page title matching library set obtains the corresponding matching rule of webpage URL information to be truncated, according to what is obtained Matching rule handles the web page title to be truncated, and obtains truncated web page title and includes:
Query webpage white list library obtains the web page title of webpage URL information mapping to be truncated, and will obtain Web page title as truncated web page title.
10. the method as described in claim 1, generating the web page title template library includes:
Sort out strategy in advance for the web page title setting of webpage URL information mapping, and is the webpage of each classification Corresponding regularity is arranged in title.
11. method as claimed in claim 10, the web page title matching library includes web page title template library, and the inquiry is pre- The web page title matching library being first arranged obtains the corresponding matching rule of webpage URL information to be truncated, according to To matching rule the web page title to be truncated is handled, obtaining truncated web page title includes:
The naming rule for extracting the web page title of webpage URL information mapping to be truncated, by the naming rule of extraction Pre-set classification strategy is matched, is obtained belonging to the web page title of the webpage URL information mapping to be truncated Classification;
Query webpage title template library obtains belonging to the web page title that the webpage URL information to be truncated maps The corresponding regularity of classification;
It is treated and is truncated at the web page title progress canonical of webpage URL information mapping using the regularity of acquisition Reason, obtains truncated web page title.
12. the method as described in claim 1, generates and sew identification library before and after the web page title and include:
Obtain the web page title mapped to truncated webpage URL information and storage;
It is arranged for carrying out the term frequency-inverse document word frequency calculative strategy that identification is sewed in front and back to web page title, before forming web page title Recognition rule is sewed in suffix list and/or front and back.
13. method as claimed in claim 12, the web page title matching library includes that identification library is sewed in web page title front and back, described Pre-set web page title matching library is inquired, the corresponding matching rule of webpage URL information to be truncated is obtained, The web page title to be truncated is handled according to obtained matching rule, obtaining truncated web page title includes:
The web page title for obtaining webpage URL information to be truncated mapping, it is tactful to obtaining according to pre-set fractionations The web page title taken is split, and one or more webpage subtitles are obtained;
Sew the web page title that the webpage URL information stored in identification library maps in conjunction with web page title front and back, for Each webpage subtitle, using the term frequency-inverse document word frequency calculative strategy being arranged in identification library is sewed before and after web page title, calculating should The term frequency-inverse document word frequency value of each webpage subtitle;
Judge whether the term frequency-inverse document word frequency value calculated is greater than pre-set front and back and sews threshold value, if so, determining the webpage Subtitle is sewed for front and back, which is sewed and is filtered out from web page title, and the web page title sewed before and after filtering out is as truncated Web page title, and, determining front and back is sewed and is sewed in library before and after being stored in web page title.
14. method as claimed in claim 13, the term frequency-inverse document word frequency value for calculating each webpage subtitle it Before, the method further includes:
Obtained multiple webpage subtitles are combined, and are directed to each combined webpage subtitle, before web page title The web page title and term frequency-inverse document word frequency of the webpage URL information mapping stored in suffix identification library calculate Strategy calculates the TFIDF value of each combined webpage subtitle, and sew in each all non-front and back of combined webpage subtitle In the case of, execute the term frequency-inverse document word frequency value for calculating each webpage subtitle.
15. method as claimed in claim 13, it is described by this before and after sew after being filtered out in web page title, and before filtering out Before the web page title of suffix is as truncated web page title, the method further includes:
Judgement filters out whether the web page title length that front and back is sewed is greater than pre-set web page title length threshold, and will be not more than Pre-set web page title length threshold filters out the web page title sewed of front and back as the truncated web page title.
16. method as claimed in claim 13, the fractionation strategy is according to the pre-set mark for including in web page title Point symbol is split, the pre-set punctuation mark includes: _ ,-,-,+, &, # ...:,, |:, ┊,‖,;,,.,, s ,-,-and?.
17. The method of claim 1, wherein truncating the web page title to be truncated according to a preset general truncation rule comprises:
G1: judging whether the web page title to be truncated contains enclosed content, the enclosed content being content enclosed in symbols; if so, executing step G2; otherwise, executing step G3;
G2: taking the enclosed content as the truncated title content, and ending the process;
G3: segmenting the web page title to be truncated using a preset first group of punctuation marks;
G4: judging whether any segment produced has a length not greater than a preset segment threshold; if so, executing step G5; otherwise, executing step G8;
G5: for each segment not longer than the preset segment threshold, removing the common phrases in the segment, and judging whether the length of the segment after removal is not greater than a preset web page title length threshold; if so, executing step G6; otherwise, executing step G7;
G6: returning the segment with common phrases removed as the truncated web page title, and ending the process;
G7: segmenting the segment with common phrases removed using a preset second group of punctuation marks, and returning to step G4;
G8: starting from the beginning of the web page title to be truncated, taking a character string whose length equals the web page title length threshold as the truncated web page title.
18. a kind of device for obtaining truncated web page title, which includes: to truncate request processing module and truncated webpage Title obtains module, wherein
Request processing module is truncated, webpage to be truncated is unified to be provided for obtaining from the truncated request of received progress web page title Source locator information and the web page title for being somebody's turn to do webpage URL information mapping to be truncated;
Truncated web page title obtains module, for inquiring pre-set web page title matching library, obtains webpage system to be truncated The corresponding matching rule of one Resource Locator information, according to obtained matching rule to the web page title to be truncated at Reason, obtains truncated web page title;The web page title matching library include: webpage white list library, and/or, web page title template Library, and/or, identification library is sewed before and after web page title;Wherein, webpage uniform resource locator is stored in webpage white list library It is corresponding just to be stored with webpage URL information in web page title template library for information corresponding truncated web page title Then truncate rule.
19. device as claimed in claim 18, the truncation request processing module includes: receiving unit and resolution unit, Wherein,
Receiving unit carries out the truncated request of web page title for receiving;
Resolution unit carries out the truncated request of web page title for parsing, and obtains web page title to be truncated and is somebody's turn to do net to be truncated Page URL information.
20. device as claimed in claim 19, it includes: that webpage white list library generates that the truncated web page title, which obtains module, Unit and truncated web page title query unit, wherein
Webpage white list library generation unit, for extract each webpage URL information for including in user's collection and The web page title of webpage URL information mapping;For each web page resources locator information, webpage money is obtained All web page titles of source locator information mapping, and, count each webpage mark of web page resources locator information mapping Inscribe corresponding number of users;The corresponding number of users of web page title and web page title are applied to pre-set webpage white list meter Strategy is calculated, the web page title weighted value is obtained;In same webpage URL information, maximum web page title power is chosen The corresponding web page title of weight values is determined using webpage URL information and the web page title of selection as webpage unified resource The web page title of position symbol information MAP, is placed in the webpage white list library of setting;
Truncated web page title query unit is used for query webpage white list library generation unit, obtains the unified money of webpage to be truncated The web page title of source locator information mapping, and using obtained web page title as truncated web page title.
21. device as claimed in claim 20, the truncated web page title obtains module and further comprises:
Web page title updating unit extracts the unified money of the webpage for including in web page navigation data for obtaining web page navigation data Source locator information and the web page title of webpage URL information mapping;Each webpage that traversal is extracted is unified Resource Locator information whether there is the webpage URL information in the generation unit of query webpage white list library, such as Fruit is not present, the web page title that the webpage URL information and the webpage URL information are mapped Webpage white list library generation unit is written, if it does, from the web page title of extraction and webpage white list library generation unit, The web page title for obtaining webpage URL information mapping respectively, determines whether the white name of more new web page after being compared The web page title that the webpage URL information maps in single library generation unit.
22. device as claimed in claim 18, it includes: that web page title template library is raw that the truncated web page title, which obtains module, At unit and truncated web page title acquiring unit, wherein
Web page title template library generation unit, for being in advance the web page title setting of webpage URL information mapping Sort out strategy, and corresponding regularity is set for the web page title of each classification;
Truncated web page title acquiring unit, for extracting the web page title of webpage URL information mapping to be truncated Naming rule, the naming rule of extraction is matched into pre-set classifications strategy, obtain described in webpage unified resource to be truncated Classification belonging to the web page title of locator information mapping;Query webpage title template library generation unit obtains described wait truncate The corresponding regularity of classification belonging to the web page title of webpage URL information mapping;It is advised using the canonical of acquisition The web page title progress canonical processing for truncating the mapping of webpage URL information is then treated, truncated webpage mark is obtained Topic.
23. The device as claimed in claim 18, wherein the truncated web page title obtaining module comprises a web page title prefix/suffix identification library generation unit and a truncated web page title processing unit, wherein
the web page title prefix/suffix identification library generation unit is configured to obtain and store the web page titles mapped to web page URL information in the user's collection, and to set a term frequency-inverse document frequency (TF-IDF) calculation strategy for identifying prefixes and suffixes of web page titles;
the truncated web page title processing unit is configured to: obtain the web page title mapped to the web page URL information to be truncated; split the obtained web page title according to a preset splitting strategy to obtain one or more web page subtitles; for each web page subtitle, calculate its TF-IDF value using the TF-IDF calculation strategy set in the prefix/suffix identification library together with the web page titles stored in that library; judge whether the calculated TF-IDF value is greater than a preset prefix/suffix threshold; if so, determine that the web page subtitle is a prefix or suffix, filter it out of the web page title, and take the web page title with the prefixes and suffixes filtered out as the truncated web page title.
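The prefix/suffix filtering in claim 23 can be pictured with the Python sketch below. It is only a sketch under stated assumptions: the splitting pattern, the threshold of 0.5, and all function names are hypothetical, and the score used here is a simple share of stored titles containing the segment, swapped in as a stand-in because this section does not spell out the TF-IDF calculation strategy itself.

```python
import re
from typing import Iterable, List

# Hypothetical splitting strategy: split on " - ", " | " or "_".
SPLIT_PATTERN = re.compile(r"\s+[-|]\s+|_")


def split_title(title: str) -> List[str]:
    """Split a title into subtitles, dropping empty pieces."""
    return [part for part in SPLIT_PATTERN.split(title) if part]


def affix_score(segment: str, stored_titles: List[str]) -> float:
    """Stand-in score: share of stored titles that contain this segment."""
    if not stored_titles:
        return 0.0
    hits = sum(1 for t in stored_titles if segment in split_title(t))
    return hits / len(stored_titles)


def strip_affixes(title: str, stored_titles: Iterable[str],
                  threshold: float = 0.5) -> str:
    """Drop subtitles whose score exceeds the prefix/suffix threshold."""
    titles = list(stored_titles)
    segments = split_title(title)
    kept = [s for s in segments if affix_score(s, titles) <= threshold]
    if not kept or len(kept) == len(segments):
        return title  # nothing (or everything) flagged: keep the original
    return " ".join(kept)
```

A segment such as a site name that recurs across many stored titles scores high and is removed, while the page-specific part of the title scores low and is kept.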
CN201410158987.XA 2014-04-18 2014-04-18 Obtain the method and device of truncated web page title Active CN105095175B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410158987.XA CN105095175B (en) 2014-04-18 2014-04-18 Obtain the method and device of truncated web page title

Publications (2)

Publication Number Publication Date
CN105095175A CN105095175A (en) 2015-11-25
CN105095175B true CN105095175B (en) 2019-04-30

Family

ID=54575649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410158987.XA Active CN105095175B (en) 2014-04-18 2014-04-18 Obtain the method and device of truncated web page title

Country Status (1)

Country Link
CN (1) CN105095175B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574175A (en) * 2015-12-21 2016-05-11 北京奇虎科技有限公司 Processing method and device for optimizing search result title
CN105630909A (en) * 2015-12-21 2016-06-01 北京奇虎科技有限公司 Method and device for displaying normalized header information
CN107045529B (en) * 2017-01-16 2021-01-22 阿里巴巴(中国)有限公司 Network content acquisition method and device and service terminal
CN106959945B (en) * 2017-03-23 2021-01-05 北京百度网讯科技有限公司 Method and device for generating short titles for news based on artificial intelligence
CN110852097B (en) * 2019-10-15 2022-02-01 平安科技(深圳)有限公司 Feature word extraction method, text similarity calculation method, device and equipment
CN111460307B (en) * 2020-04-03 2020-11-06 渭南双盈未来科技有限公司 Mobile terminal accurate searching method and device
CN111680482B (en) * 2020-05-07 2024-04-12 车智互联(北京)科技有限公司 Title image-text generation method and computing device
CN112437356B (en) * 2020-11-13 2021-09-28 珠海大横琴科技发展有限公司 Streaming media data processing method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831199A (en) * 2012-08-07 2012-12-19 北京奇虎科技有限公司 Method and device for establishing interest model
CN102831248A (en) * 2012-09-18 2012-12-19 北京奇虎科技有限公司 Network hotspot mining method and network hotspot mining device
CN103324665A (en) * 2013-05-14 2013-09-25 亿赞普(北京)科技有限公司 Hot spot information extraction method and device based on micro-blog

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130262430A1 (en) * 2012-03-29 2013-10-03 Microsoft Corporation Dominant image determination for search results
US8799278B2 (en) * 2012-10-01 2014-08-05 DISCERN, Inc. Data augmentation based on second-phase metadata

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
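Research and Design of a Personalized Information Recommendation System Based on an Interest Model; Xie Chuangfeng; China Master's Theses Full-text Database, Information Science and Technology; 2010-10-15 (No. 10); full text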

Also Published As

Publication number Publication date
CN105095175A (en) 2015-11-25

Similar Documents

Publication Publication Date Title
CN105095175B (en) Obtain the method and device of truncated web page title
Adar et al. The web changes everything: understanding the dynamics of web content
CN102831199B (en) Method and device for establishing interest model
CN105843965B (en) A kind of Deep Web Crawler form filling method and apparatus based on URL subject classification
Peters et al. Content extraction using diverse feature sets
CN108920434A (en) A kind of general Web page subject method for extracting content and system
JP2009151749A (en) Method and system for filtering subject related web page based on navigation path information
CN104391978B (en) Web page storage processing method and processing device for browser
CN104978408A (en) Berkeley DB database based topic crawler system
US20160103913A1 (en) Method and system for calculating a degree of linkage for webpages
TW202001620A (en) Automatic website data collection method using a complex semantic computing model to form a seed vocabulary data set
JP4875911B2 (en) Content identification method and apparatus
US20150302093A1 (en) Method and system for filtering of a website
CN103116635A (en) Field-oriented method and system for collecting invisible web resources
CN104899215A (en) Data processing method, recommendation source information organization, information recommendation method and information recommendation device
Mehta et al. DOM tree based approach for web content extraction
CN106776640A (en) A kind of stock information information displaying method and device
KR20090120843A (en) A system and method generating multi-concept networks based on user's web usage data
Sluban et al. URL Tree: Efficient unsupervised content extraction from streams of web documents
CN108470046B (en) News event sequencing method and system based on news event search sentence
CN104462613B (en) Hot spot polymerization and device
CN105787032B (en) The generation method and device of snapshots of web pages
Peng et al. Tunneling enhanced by web page content block partition for focused crawling
Saberi et al. What does the future of search engine optimization hold?
Blanco et al. Efficiently Locating Collections of Web Pages to Wrap.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant