CN105095175B - Method and device for obtaining a truncated web page title - Google Patents
Method and device for obtaining a truncated web page title
- Publication number
- CN105095175B CN201410158987.XA
- Authority
- CN
- China
- Prior art keywords
- web page
- page title
- webpage
- truncated
- title
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses a method and device for obtaining a truncated web page title. The method includes: obtaining the uniform resource locator (URL) information of a web page and the to-be-truncated web page title mapped to that information; and processing the to-be-truncated web page title so that only the part that reflects the web page content is retained. The processing includes one of the following methods or any combination of them: performing word segmentation on the title and removing meaningless words; querying a preset web page title matching library to obtain the matching rule corresponding to the URL information of the web page to be truncated, and processing the to-be-truncated web page title according to the obtained matching rule to obtain the truncated web page title; and truncating the title using general rules. The web page title matching library includes a web page whitelist library, and/or a web page title template library, and/or a web page title prefix/suffix identification library. Applying the invention effectively improves redundancy removal for web page titles.
Description
Technical field
The present invention relates to browser display processing technology, and in particular to a method and device for obtaining a truncated web page title.
Background
Currently, because of browser interface layout constraints, the display area a browser uses to show the web page titles stored in the user's favorites is limited, and the title shown in that area is what lets the user obtain information about the web page (website). How to make a stored web page title convey as much information as possible within the limited display area, so that the user obtains more useful information about the page and has a better experience, has therefore become a technical problem in urgent need of a solution. A web page title is a single sentence that summarizes the web page content; it is a highly condensed form of that content and gives the user refined, useful information about the page.
In existing browsers, the title of a page that the user saves to favorites is generally taken automatically from the title (Title) at the top of the page. For example, for the web page uniform resource locator (URL, Uniform Resource Locator) information www.sohu.com, the browser automatically stores the title set at the top of www.sohu.com, "upper Sohu, see the Olympic Games", in favorites as the title of www.sohu.com. The user can of course also edit the title in favorites manually as needed.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and device for obtaining a truncated web page title that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for obtaining a truncated web page title is provided. The method includes:
Obtaining web page URL information and the to-be-truncated web page title mapped to that URL information;
Processing the to-be-truncated web page title so that only the part that reflects the web page content is retained;
The processing of the to-be-truncated web page title includes one of the following methods or any combination of them: performing word segmentation on the title and removing meaningless words; querying a preset web page title matching library to obtain the matching rule corresponding to the URL information of the web page to be truncated, and processing the to-be-truncated web page title according to the obtained matching rule to obtain the truncated web page title; truncating the title using general rules;
The web page title matching library includes: a web page whitelist library, and/or a web page title template library, and/or a web page title prefix/suffix identification library.
According to another aspect of the present invention, a device for obtaining a truncated web page title is provided, comprising a truncation request processing module and a truncated web page title obtaining module, wherein:
The truncation request processing module is configured to obtain, from a received web page title truncation request, the URL information of the web page to be truncated and the web page title mapped to that URL information;
The truncated web page title obtaining module is configured to query a preset web page title matching library, obtain the matching rule corresponding to the URL information of the web page to be truncated, and process the to-be-truncated web page title according to the obtained matching rule to obtain the truncated web page title. The web page title matching library includes: a web page whitelist library, and/or a web page title template library, and/or a web page title prefix/suffix identification library.
According to the method and device of the present invention, the web page title is truncated on the basis of the input URL information and web page title, using a pre-established web page whitelist library, and/or web page title template library, and/or web page title prefix/suffix identification library, and/or general truncation rules. This solves the technical problem that, after existing methods extract a web page title, the truncated title still contains descriptive expressions and prefixes or suffixes. The prefixes, suffixes and descriptive expressions contained in the web page title are removed effectively, redundancy is reduced, the truncated title fits the browser display area, and more useful information reaches the user, which improves the user experience.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numbers refer to the same parts. In the drawings:
Fig. 1 is a schematic flowchart of the method for obtaining a truncated web page title according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the device for obtaining a truncated web page title according to an embodiment of the present invention; and
Fig. 3 is a schematic detailed flowchart of the method for obtaining a truncated web page title according to an embodiment of the present invention.
Specific embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
With the development of network technology, in order to provide the user with more useful information and fit the browser display area, the non-essential information contained in the web page titles stored in favorites also needs to be filtered out; that is, key words are extracted from the web page title to truncate it, so that as much useful information as possible is provided to the user within the limited browser display area.
As an alternative embodiment, the obtained web page title can be split by a word segmentation method: the title is first segmented into words, meaningless words are then removed from the segmented words, and finally the remaining words are recombined to obtain the truncated web page title.
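The segment, filter and recombine sequence described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the stop-word list and the whitespace segmenter are hypothetical stand-ins, and a Chinese title would need a real word segmenter.

```python
# Hypothetical list of "meaningless" words; a real system would use a curated dictionary.
MEANINGLESS_WORDS = {"the", "a", "of", "to", "welcome"}


def segment(title):
    """Placeholder segmenter: splits on whitespace only."""
    return title.split()


def truncate_by_segmentation(title):
    """Segment the title, drop meaningless words, and recombine the rest."""
    words = segment(title)
    kept = [w for w in words if w.lower() not in MEANINGLESS_WORDS]
    return " ".join(kept)
```

With this sketch, `truncate_by_segmentation("Welcome to the Sohu homepage")` keeps only `Sohu homepage`; words that are not in the list pass through unchanged, which is the limitation the following paragraph discusses.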
In practice, segmenting the web page title and removing meaningless words from the result cannot effectively remove the information in the title that is irrelevant to the user. For example, after segmentation, meaningless-word removal and recombination, the web page title "upper Sohu, see the Olympic Games" is still extracted as "upper Sohu, see the Olympic Games"; yet "upper" and "see the Olympic Games" may be useless to the user, so less useful information is delivered in the limited browser display area and the user experience suffers. As another example, for the web page title "welcoming access Sohu", the truncated title obtained by the existing method is still "welcoming access Sohu", where "welcoming access" is a descriptive expression that gives the user no useful information. Because the truncated title still contains descriptive expressions, it may fail to fit the browser display area, it delivers less useful information to the user, and redundancy removal is poor.
Preferably, the embodiment of the present invention therefore proposes a web page title truncation technique that retains as much of each title's useful information as possible, that is, a method for obtaining a truncated web page title: by establishing a web page whitelist library, and/or a web page title template library, and/or a web page title prefix/suffix identification library, and/or general truncation rules, the web page title is truncated purposefully so that it contains more refined keywords or key phrases and the information irrelevant to the user is removed, thereby fitting the browser display area and providing the user with more useful information.
Fig. 1 is a schematic flowchart of the method for obtaining a truncated web page title according to an embodiment of the present invention. Referring to Fig. 1, the process includes:
Step 101: obtain the URL information of the web page to be truncated and the web page title mapped to that URL information;
In this step, in contrast to existing techniques that truncate using the web page title alone, the embodiment of the present invention also obtains and uses the web page's URL information when obtaining the title, in order to truncate titles more effectively and to match the web page whitelist library, and/or web page title template library, and/or web page title prefix/suffix identification library proposed here. In addition, as an alternative embodiment and unlike the prior art, the web page title to be truncated in the embodiment of the present invention must not be an invalid title that fails to represent the page's subject, such as an empty title or a bare URL.
This step specifically includes:
Receiving a request to truncate a web page title;
In this step, while browsing a web page, if the user decides to save the page to favorites, the user clicks the add-to-favorites submenu in the favorites drop-down menu on the page's display interface, which triggers web page title truncation. The browser extracts the title of the page the user is browsing, that is, the web page title to be truncated, encapsulates the extracted title and the page's URL information (the URL information of the web page to be truncated) in a web page title truncation request, and sends the request to the server. Alternatively, if the user wants to optimize the web page titles stored in favorites (the web page titles to be truncated), the user clicks the rename submenu in the favorites drop-down menu to trigger web page title truncation and can select the titles to be truncated; the browser encapsulates the selected titles and the corresponding URL information in a web page title truncation request and sends it to the server. If the user selects multiple web page titles, each web page title and its URL information form a mapping in the request.
Parsing the web page title truncation request to obtain the web page title to be truncated and the URL information of the web page to be truncated.
In this step, on receiving the web page title truncation request, the server decapsulates and parses it to obtain the web page title and URL information carried in the request.
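The encapsulate-and-parse exchange described above might look like the following sketch. The text does not specify a wire format, so the JSON encoding and the field names `items`, `url` and `title` are assumptions made purely for illustration.

```python
import json


def build_truncation_request(pairs):
    """Browser side: pack (url, title) pairs into one request body (assumed JSON)."""
    return json.dumps({"items": [{"url": u, "title": t} for u, t in pairs]})


def parse_truncation_request(payload):
    """Server side: recover the (url, title) pairs carried in the request."""
    body = json.loads(payload)
    return [(item["url"], item["title"]) for item in body["items"]]
```

Keeping each title next to its URL in the request preserves the mapping between them when the user selects several titles at once.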
Step 102: query the preset web page title matching library to obtain the matching rule corresponding to the URL information of the web page to be truncated, and process the to-be-truncated web page title according to the obtained matching rule to obtain the truncated web page title. The web page title matching library includes: a web page whitelist library, and/or a web page title template library, and/or a web page title prefix/suffix identification library, and/or general truncation rules, wherein:
The web page whitelist library stores the truncated web page title corresponding to web page URL information;
The web page title template library stores the regular-expression truncation rules corresponding to web page URL information;
The web page title prefix/suffix identification library stores a list of web page title prefixes and suffixes and/or prefix/suffix recognition rules. The prefix/suffix recognition rule is a preset term frequency-inverse document frequency calculation strategy for identifying the prefixes and suffixes of web page titles, described in detail later.
In this step, as a preferred embodiment, the web page title matching library can also be loaded into a cache in advance.
In the embodiment of the present invention, suppose the web page title matching library stores a web page whitelist library, a web page title template library and a web page title prefix/suffix identification library. Matching against the whitelist library takes the least time; it quickly screens out the web page titles that are not in the whitelist library and reduces subsequent processing. Matching against the web page title template library takes longer, and matching against the prefix/suffix identification library takes the longest. Therefore, if all three are combined to truncate titles, the matching rules are preferably applied in the order: web page whitelist library, web page title template library, web page title prefix/suffix identification library.
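The preferred matching order can be sketched as a simple cascade that tries the cheapest lookup first. All table contents below (the URLs, the regular expression and the affix list) are hypothetical examples, and in-memory dictionaries stand in for the cached libraries.

```python
import re

# Hypothetical library contents, for illustration only.
WHITELIST = {"www.sohu.com": "Sohu"}                            # URL -> truncated title
TEMPLATES = {"blog.example.com": re.compile(r"^(.*?) - .*$")}   # URL -> rule, group 1 is kept
AFFIXES = ["homepage", "official website"]                      # known prefixes/suffixes


def strip_affixes(title):
    """Remove any known affix found at the start or end of the title."""
    for affix in AFFIXES:
        if title.endswith(affix):
            title = title[: -len(affix)].rstrip(" -|_")
        if title.startswith(affix):
            title = title[len(affix):].lstrip(" -|_")
    return title


def truncate_title(url, title):
    """Whitelist first, then the template rule, then affix stripping."""
    if url in WHITELIST:
        return WHITELIST[url]
    pattern = TEMPLATES.get(url)
    if pattern:
        match = pattern.match(title)
        if match:
            return match.group(1)
    return strip_affixes(title)
```

A title whose URL hits the whitelist never reaches the slower stages, which is the reason the text gives for this ordering.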
Generating the web page whitelist library includes:
A11: extract each piece of web page URL information contained in users' favorites and the web page title mapped to it;
In this step, the web page titles set by users and the corresponding web page URL information are extracted from the users' favorites, which contain the stored web page titles.
In the embodiment of the present invention, when a user accesses web pages with the Sogou browser, the browser stores the user's web page access records, for example the web page title the user sets for a page and the page's URL information. By extracting the favorites data from the Sogou browser, the server can obtain a large amount of web page URL information together with the titles each user has set for it. Web page URL information and a web page title form a mapping (a title pair), and different users may set different titles for the same URL information. The same URL information may therefore be mapped to a large number of different titles set by different users.
A12: for each piece of web page URL information, obtain all the web page titles mapped to it and count the number of users corresponding to each of those titles;
In this step, because users differ, each piece of web page URL information is mapped to different web page titles. In the embodiment of the present invention, for each piece of URL information, the number of users corresponding to each mapped title is counted separately. For example, for the URL information www.sohu.com, the mapped titles are "upper Sohu, see the Olympic Games", "welcoming access Sohu", "Sohu" and "Sohu official website". The statistics show that "upper Sohu, see the Olympic Games" has 10,000 users, that is, 10,000 users set the title mapped to www.sohu.com to "upper Sohu, see the Olympic Games"; "welcoming access Sohu" has 15,000 users; "Sohu" has 50,000 users; and "Sohu official website" has 25,000 users.
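The per-URL, per-title user count in step A12 can be sketched with a nested counter. The `(user_id, url, title)` record layout is an assumption about how favorites data might be exported; the text does not define one.

```python
from collections import Counter, defaultdict


def count_users_per_title(bookmarks):
    """bookmarks: iterable of (user_id, url, title) records (hypothetical layout).

    Returns {url: Counter({title: number_of_users})}.
    """
    seen = set()
    counts = defaultdict(Counter)
    for user_id, url, title in bookmarks:
        if (user_id, url, title) in seen:   # count each user once per title
            continue
        seen.add((user_id, url, title))
        counts[url][title] += 1
    return counts
```

Under the user-count strategy, these counts can be used directly as title weights.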
A13: apply the web page title and its number of users to a preset web page whitelist calculation strategy to obtain the web page title weight;
In this step, as an alternative embodiment, the whitelist calculation strategy can be based on the number of users, in which case the web page title weight equals the number of users. Continuing the example in step A12, under this strategy the weight of "upper Sohu, see the Olympic Games" is 10,000, the weight of "welcoming access Sohu" is 15,000, the weight of "Sohu" is 50,000, and the weight of "Sohu official website" is 25,000.
Of course, as another alternative embodiment, practical applications can also take into account a domain weight for the domain a user belongs to. For a user in a given domain, the titles that user gives to web page URL information in that domain should be more accurate than the titles given to the same URL information by users outside the domain; that is, the titles named by users in the domain are likely to be applied and adopted more widely. For example, a user in the mechanical domain should name the titles of mechanical-domain web page URL information more accurately than users outside the mechanical domain do. The whitelist calculation strategy can therefore be based on preset user domain weights, with a weight for each domain set in advance for each user. For example, for a given user the mechanical-domain weight can be set to 0.5, the electrical-domain weight to 0.3, the communications-domain weight to 0.2, and so on. The domain a user belongs to can be determined by matching the features of the user's tags; this is a well-known technique and is not described in detail here. In this case, applying the web page title and its number of users to the preset web page whitelist calculation strategy to obtain the web page title weight includes:
B11: extract the feature words contained in the web page titles mapped to the web page URL information, match them against the preset feature dictionary of each domain, and determine the domain to which the URL information belongs;
In this step, for example, a web page title mapped to the URL information, "welcoming access Sohu", is selected at random and the feature word extracted from it is "Sohu". If the feature dictionary of the communications domain contains the feature word "Sohu", the domain to which the web page title belongs is the communications domain.
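Step B11 can be sketched as a dictionary lookup. The domain dictionaries below are invented examples, and picking the domain with the most matching feature words is an assumed tie-breaking choice that the text does not prescribe.

```python
# Hypothetical per-domain feature dictionaries.
DOMAIN_DICTIONARIES = {
    "communications": {"Sohu", "telecom"},
    "mechanical": {"gearbox", "lathe"},
}


def determine_domain(title_words):
    """Return the domain sharing the most feature words with the title, or None."""
    best_domain, best_hits = None, 0
    for domain, vocabulary in DOMAIN_DICTIONARIES.items():
        hits = len(vocabulary & set(title_words))
        if hits > best_hits:
            best_domain, best_hits = domain, hits
    return best_domain
```

For the segmented title `["welcoming", "access", "Sohu"]`, this sketch returns `"communications"`, matching the example in the text.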
B12: according to the domain weights set in advance for each user, obtain, for the users of each web page title mapped to the URL information, their domain weights in the determined domain of the URL information;
In this step, for "upper Sohu, see the Olympic Games", among its 10,000 users, 2,000 users have a communications-domain weight of 0.2, 3,000 users have a communications-domain weight of 0.3, 1,000 users have a communications-domain weight of 0.6, and 4,000 users have a communications-domain weight of 0.9. The other web page titles mapped to the URL information www.sohu.com are counted in the same way.
B13: apply the number of users of each web page title and those users' domain weights in the determined domain of the URL information to a preset weight formula to obtain the web page title weight.
In this step, the weight formula can be a total-weight formula or a relative-weight formula. The total-weight formula is as follows:
In the formula,
$X_i$ is the weight of the i-th web page title, where i is a natural number;
$U_{i,j}$ is the j-th user corresponding to the i-th web page title;
$\xi_{i,j}$ is the domain weight of the j-th user of the i-th web page title in the domain to which the title belongs;
K is the total number of users corresponding to the i-th web page title, and K is a natural number.
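The total-weight formula itself is not reproduced in this text. One plausible reading of the symbol definitions, offered here only as an assumption, is that each of the K users of a title contributes their domain weight once, so that the title weight is the sum of the domain weights over its users. The sketch below applies that reading to the grouped figures from step B12; the function name and the grouped input format are illustrative.

```python
def total_weight(groups):
    """groups: list of (user_count, domain_weight) pairs for one title.

    Assumed reading of the total weight: every user adds their own
    domain weight, so a group adds user_count * domain_weight.
    """
    return sum(count * weight for count, weight in groups)


# Figures for "upper Sohu, see the Olympic Games" from step B12.
sohu_olympics_groups = [(2000, 0.2), (3000, 0.3), (1000, 0.6), (4000, 0.9)]
```

Under this assumed reading, the example title scores 400 + 900 + 600 + 3,600 = 5,500, compared with 10,000 under the plain user-count strategy.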
The relative-weight formula is as follows:
As a further alternative embodiment, a web page title priority coefficient can also be set in advance for each web page title and combined with the domain weights of the users who set that title to calculate the web page title weight; that is, the whitelist calculation strategy can be based on preset user domain weights combined with web page title priority coefficients. The method further includes:
Multiplying the obtained web page title weight by the web page title priority coefficient to give the final output web page title weight.
In this step, for the total-weight formula, the final output web page title weight is calculated as follows:
In the formula,
$\psi_i$ is the priority coefficient of the i-th web page title.
In the embodiment of the present invention, the web page title priority coefficient can be set manually. For example, the web page titles in the Sogou browser are obtained and a corresponding priority coefficient is set for each of them.
A14: for the same web page URL information, select the web page title with the largest web page title weight, and place the URL information and the selected title, as the title mapped to that URL information, in the preset web page whitelist library.
In this step, after the weight of each title mapped to the same URL information has been calculated, the title with the largest weight is selected as the title mapped to that URL information in the whitelist library. The web page title weight may be the total weight or the relative weight: the title with the largest total weight can be selected as the title mapped to the URL information, or the title with the largest relative weight can be selected.
As an alternative embodiment, for the same URL information the web page title weights can also be sorted by size and the titles with the top N weights selected, so that each piece of URL information is mapped to N web page titles; the N titles mapped to the URL information are placed in the preset whitelist library, where N is a natural number. That is, in the whitelist library each piece of URL information is mapped to N web page titles, and N can be chosen according to actual needs.
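Selecting the best title, or the top N titles, for one URL can be sketched as a sort over the final weights. The optional `priority` argument models the priority coefficient as a per-title multiplier; the function name and argument layout are illustrative assumptions.

```python
def select_titles(weights, n=1, priority=None):
    """weights: {title: weight}; priority: optional {title: coefficient}.

    Returns the n titles with the largest final weight, best first.
    """
    priority = priority or {}
    final = {t: w * priority.get(t, 1.0) for t, w in weights.items()}
    return sorted(final, key=final.get, reverse=True)[:n]


# User-count weights for www.sohu.com from the example in step A13.
sohu_weights = {
    "upper Sohu, see the Olympic Games": 10000,
    "welcoming access Sohu": 15000,
    "Sohu": 50000,
    "Sohu official website": 25000,
}
```

With these figures, `select_titles(sohu_weights)` picks "Sohu", and `n=2` adds "Sohu official website" as the second entry.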
In practice, the web page titles obtained by the above method are selected on the basis of user behavior, and the titles mapped to URL information in users' favorites may not describe the page accurately. In the web page navigation data provided by navigation websites, by contrast, professionals have summarized the pages concisely, so the titles provided are comparatively refined and contain more useful information. Therefore, in the embodiment of the present invention, after the web page whitelist library is generated, the method can further include:
C11: obtain web page navigation data and extract the web page URL information it contains together with the web page titles mapped to that URL information;
In this step, web page navigation data can be crawled from navigation websites with a web crawler and parsed to extract the URL information and the mapped titles. Crawling navigation data and extracting titles and URL information are well-known techniques and are not described in detail here.
C12: traverse each piece of extracted URL information and query whether it exists in the web page whitelist library. If it does not exist, write the URL information and its mapped title into the whitelist library. If it exists, obtain the title mapped to the URL information from the extracted data and from the whitelist library respectively, compare them, and decide whether to update the title mapped to the URL information in the whitelist library.
In the embodiment of the present invention, the amount of URL information provided in web page navigation data is relatively limited and cannot cover all URL information on a large scale, so this method serves as a useful supplement to the web page whitelist library. The navigation data of each navigation website is crawled and the URL information and mapped titles are extracted; then, for each piece of URL information, the title stored in the whitelist library is compared with the title in the crawled navigation data and the one that expresses the page more accurately is kept. If the title stored in the whitelist library is more accurate, nothing is done; if the title extracted from the navigation data is more accurate, the title stored in the whitelist library is updated.
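The supplement step C12 amounts to an insert-or-compare merge. The text does not say how "more accurate" is judged, so the sketch below takes the comparison as a caller-supplied function `is_better`, which is a hypothetical parameter.

```python
def merge_navigation_titles(whitelist, navigation, is_better):
    """Merge crawled navigation titles into the whitelist in place.

    whitelist, navigation: {url: title}
    is_better(new_title, old_title) -> bool: caller-supplied comparison,
    because the text does not define how accuracy is judged.
    """
    for url, nav_title in navigation.items():
        if url not in whitelist:
            whitelist[url] = nav_title          # new URL: add it
        elif is_better(nav_title, whitelist[url]):
            whitelist[url] = nav_title          # existing URL: replace only if better
    return whitelist
```

For a quick experiment, a stand-in such as `lambda new, old: len(new) < len(old)` prefers the shorter title; a production comparison would need a real quality measure.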
This completes the process of generating the web page whitelist library.
Generating the web page title template library includes:
Setting a classification strategy in advance for the web page titles mapped to web page URL information, and setting a corresponding regular-expression rule for the titles of each category.
In this step, although websites have a great variety of web page titles, the titles in a large body of title data can be classified according to a preset classification strategy. The strategy can, for example, classify titles into a social category, a technology blog category and so on; that is, titles are classified as social-category web page titles, technology-blog-category web page titles, and so on. A corresponding regular-expression rule is set for the titles of each category to form a web page title template.
Subsequently, once a web page title has been classified, it is cut using the regular-expression rule of its category to obtain the truncated web page title. For example, regular-expression rules for social-category titles and technology-blog-category titles are set in advance in the web page title template library; after a title has been classified as a social-category or technology-blog-category title, it is cut according to the preset rule for its category to obtain the corresponding truncated title. The regular-expression rule for each category can be obtained by data mining on the titles in that category, which is not described in detail here.
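Cutting a classified title with its category rule can be sketched as follows. Both regular expressions are invented examples of what mined rules might look like, and the category names are illustrative; the source does not list concrete rules.

```python
import re

# Hypothetical per-category rules; group 1 is the part of the title to keep.
CATEGORY_RULES = {
    "social": re.compile(r"^(.+?)'s (?:profile|homepage)\b"),
    "tech_blog": re.compile(r"^(.+?)\s+[-|_]\s+.+$"),
}


def truncate_by_category(title, category):
    """Apply the category's rule; fall back to the original title if nothing matches."""
    rule = CATEGORY_RULES.get(category)
    match = rule.search(title) if rule else None
    return match.group(1) if match else title
```

Falling back to the unchanged title keeps the function safe for categories that have no rule yet.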
In the embodiment of the present invention, the beginning or end of a web page title often contains prefix or suffix information such as "homepage", "it is reported that", "foreign media" or "hot spot", which is used to make the title eye-catching or indicates the title's structure but is unrelated to the title's subject. Removing this prefix and suffix information with the regular-expression rules or the whitelist library described above is cumbersome. In the embodiment of the present invention, therefore, large-scale data analysis can be performed on the title library (which stores the web page titles mapped to web page URL information), with periodic data mining using the TF-IDF method, to capture the prefix and suffix information.
Generating the web page title prefix/suffix identification library includes:
obtaining and storing the web page titles mapped by web page URL information in user favorites;
setting a term frequency-inverse document frequency (TF-IDF, Term Frequency-Inverse Document Frequency) calculation strategy for identifying the prefixes and suffixes of web page titles.
In the embodiment of the present invention, TF-IDF is a common weighting statistical method for information retrieval. Term frequency is used to evaluate the weight of a word for one document in a document library (file set or corpus): the weight of the word increases in proportion to the number of times it appears in the document, and decreases in inverse proportion to the frequency with which it appears across the document library. Inverse document frequency is a measure of the general importance of a word.
The weight calculation formula of TF is:
$$\mathrm{TF} = \frac{P_w}{P}$$
In the formula,
TF is the term frequency weight;
P_w is the number of times the word w appears in the document library;
P is the length of the document library, that is, the total number of words it contains.
The weight calculation formula of IDF is:
$$\mathrm{IDF} = \log\frac{D}{D_w}$$
In the formula,
IDF is the inverse document frequency weight;
D_w is the number of individuals (documents) in the sample (document library, file set or corpus) that contain the word w;
D is the total number of samples, that is, the total number of documents.
A smaller IDF value indicates that more documents in the sample contain the word and that the word carries less information; a larger IDF value indicates that fewer documents in the sample contain the word and that the word carries more information.
Combining the term frequency and the inverse document frequency gives the term frequency-inverse document frequency:
$$\mathrm{Weight}_w = \mathrm{TF} \times \mathrm{IDF}$$
In the formula, Weight_w is the TF-IDF weight of the word w.
A larger TF-IDF weight indicates that the word is more indicative.
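As a minimal sketch of this weighting, the function below computes TF as P_w / P over the whole document library and IDF as log(D / D_w). The natural logarithm and the representation of each document as a list of words are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return {word: TF * IDF} for a document library given as lists of words."""
    occurrences = Counter(w for doc in documents for w in doc)      # P_w
    total_words = sum(occurrences.values())                          # P
    doc_counts = Counter(w for doc in documents for w in set(doc))   # D_w
    total_docs = len(documents)                                      # D
    return {
        w: (occurrences[w] / total_words) * math.log(total_docs / doc_counts[w])
        for w in occurrences
    }
```

A word that occurs in every document gets IDF = log(1) = 0 under this reading, so its weight is zero regardless of how often it occurs.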
Obtaining the truncated web page title is described in detail below.
In the embodiment of the present invention, if the web page title matching library includes the web page white list library, then querying the pre-set web page title matching library, obtaining the matching rule corresponding to the web page URL information to be truncated, and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title includes:
querying the web page white list library, obtaining the web page title mapped by the web page URL information to be truncated, and taking the obtained web page title as the truncated web page title.
In this step, if the web page URL information to be truncated is not stored in the web page white list library, the web page title can be truncated according to the prior art, which is not described again here.
If the web page title matching library includes the web page title template library, then querying the pre-set web page title matching library, obtaining the matching rule corresponding to the web page URL information to be truncated, and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title includes:
D11, extracting the naming rule of the web page title mapped by the web page URL information to be truncated, and matching the extracted naming rule against the pre-set classification strategy to obtain the category to which that web page title belongs;
In this step, the category to which a web page title belongs can be distinguished by analyzing its naming rule. Classifying web page titles is a well-known technique, and a detailed description is omitted here.
As an alternative embodiment, if the web page title mapped by the web page URL information is invalid, that is, the web page title cannot reflect the web page content at all (for example, it is empty, contains no content, or contains only symbols), the domain name of the web page URL information can be returned as the truncated web page title.
D12, querying the web page title template library to obtain the regular-expression rule corresponding to the category to which the web page title mapped by the web page URL information to be truncated belongs;
In this step, if, after classification, the web page title mapped by the web page URL information to be truncated belongs to the social category, the regular-expression rule set for social web page titles is read from the web page title template library.
D13, performing regular-expression processing on the web page title mapped by the web page URL information to be truncated using the obtained regular-expression rule, to obtain the truncated web page title.
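The fallback for invalid web page titles mentioned in this embodiment can be sketched as follows. The validity test used here (no letter, digit or CJK character anywhere in the title) is an assumed reading of "empty or containing only symbols", and both function names are hypothetical.

```python
from urllib.parse import urlparse


def is_invalid_title(title):
    """Treat a title as invalid when it is empty or contains only symbols."""
    return not title or not any(ch.isalnum() for ch in title)


def title_or_domain(title, url):
    """Return the domain name of the URL when the title is invalid."""
    if is_invalid_title(title):
        return urlparse(url).hostname
    return title
```

For a valid title the function returns the title unchanged, so the classification and regular-expression steps can run afterwards.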
If the web page title matching library includes the web page title prefix/suffix identification library, then querying the pre-set web page title matching library, obtaining the matching rule corresponding to the web page URL information to be truncated, and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title includes:
E11, obtaining the web page title mapped by the web page URL information to be truncated, and splitting the obtained web page title according to a pre-set splitting strategy to obtain one or more web page subtitles;
In this step, the component parts of a collected web page title have certain characteristics. A web page title generally contains a prefix (or descriptive expression), a title body, and one or more suffixes, and analysis of these component parts shows that they can be distinguished by certain specific punctuation marks. In addition, the title body is the information useful to the user and can be provided to the user as a whole.
Thus, in the embodiment of the present invention, the splitting strategy can be to split according to pre-set punctuation marks contained in the web page title. For example, the pre-set punctuation marks can be _ ,-,-,+, &, # ...:,, |:, ┊,‖,;,,.,, s ,-,-, etc. If the web page title contains any of the above pre-set symbols, the web page title is split at that symbol.
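A splitting strategy of this kind can be sketched with a single character class. The delimiter set below is an assumed subset chosen for the example; it is not the complete pre-set symbol list of the embodiment.

```python
import re

# Assumed delimiter set, including the vertical marks U+2016 and U+250A.
DELIMITERS = "_-|+&#:;\u2016\u250a"
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]")


def split_title(title):
    """Split a title at every delimiter and drop empty or blank pieces."""
    parts = (piece.strip() for piece in _SPLIT_RE.split(title))
    return [piece for piece in parts if piece]
```

Note that a plain hyphen inside a word is also treated as a delimiter here, which is one reason the later prefix/suffix judgement works on subtitles rather than trusting the split alone.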
E12, for each web page subtitle, calculating the TF-IDF value of the web page subtitle using the TF-IDF calculation strategy set in the web page title prefix/suffix identification library, in combination with the web page titles mapped by web page URL information stored in that library;
In this step, as an alternative embodiment, before the TF-IDF values of the web page subtitles are calculated, the method further includes:
combining the obtained web page subtitles and, for each combined web page subtitle, calculating its TF-IDF value using the web page titles mapped by web page URL information stored in the web page title prefix/suffix identification library and the TF-IDF calculation strategy; and, if none of the combined web page subtitles is a prefix or suffix, performing the calculation of the TF-IDF value of each individual web page subtitle.
In this step, the web page subtitles can be combined as follows. Suppose, for example, that splitting a web page title yields three web page subtitles in order, namely A, B and C. Combining them yields two combined web page subtitles, AB and BC. The prefix/suffix judgement is first performed on AB; if AB is a prefix or suffix, C is taken as the truncated web page title. If AB is not a prefix or suffix, the judgement is performed on BC; if BC is a prefix or suffix, A is taken as the truncated web page title. If BC is not a prefix or suffix either, the judgement is performed on A, B and C separately.
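The order of judgement in the A, B, C example can be sketched as follows. The predicate `is_affix` is a hypothetical stand-in for the TF-IDF based prefix/suffix judgement; it receives a tuple of one or more adjacent subtitles.

```python
def truncate_with_combinations(subtitles, is_affix):
    """Judge AB first, then BC, and fall back to judging each part alone."""
    if len(subtitles) == 3:
        a, b, c = subtitles
        if is_affix((a, b)):
            return [c]
        if is_affix((b, c)):
            return [a]
    # No combination is a prefix or suffix: judge every subtitle separately.
    return [s for s in subtitles if not is_affix((s,))]
```

Testing the combinations first avoids keeping a fragment such as an author name that only looks distinctive when judged on its own.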
In the embodiment of the present invention, the formula for calculating the TF-IDF value of a web page subtitle can be as follows:
In the formula,
TF is the term frequency of the web page subtitle;
IDF is the inverse document frequency of the web page subtitle;
N' is the number of times the web page subtitle appears in the sample set;
N is the total number of web page subtitles in the sample set;
D is the total number of documents in the sample set that contain the web page subtitle;
D' is the total number of documents contained in the sample set;
+1 is a smoothing term.
It should be noted that the method for calculating the TF-IDF value of a combined web page subtitle is similar to the method for calculating the TF-IDF value of a web page subtitle, and a detailed description is omitted here.
E13, judging whether the calculated TF-IDF value is greater than a pre-set prefix/suffix threshold; if so, determining that the web page subtitle is a prefix or suffix, filtering it out of the web page title, and taking the web page title with prefixes and suffixes filtered out as the truncated web page title.
In this step, if the TF-IDF value of the web page subtitle calculated in step E12 is greater than the pre-set prefix/suffix threshold, the web page subtitle as a whole is a prefix or suffix, and the web page subtitle is deleted.
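The threshold comparison of step E13 can be sketched as a simple filter. The scoring callable `tfidf_value` and the numeric values in the usage note are placeholders for illustration; the actual values would come from the calculation strategy stored in the identification library.

```python
def drop_affixes(subtitles, tfidf_value, affix_threshold):
    """Keep only the subtitles whose TF-IDF value does not exceed the threshold."""
    return [s for s in subtitles if tfidf_value(s) <= affix_threshold]
```

For instance, with assumed values `{"Title": 0.001, "sina blog": 0.08}` and a threshold of 0.05, only "Title" is kept.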
Further, the web page titles written by each website editor have their own style or template. In practical applications, therefore, after the above prefix/suffix judgement has been performed, that is, after step E13 has been executed, performing a further round of prefix/suffix filtering on each web page subtitle contained in the truncated web page title can further improve the validity of the output. Thus, the method can further include:
E14, according to the web page URL information to be truncated, extracting the web page titles mapped by the web page URL information to be truncated from the web page titles mapped by web page URL information stored in the web page title prefix/suffix identification library;
E15, for each web page subtitle corresponding to the truncated web page title, calculating the TF-IDF value of the web page subtitle using the TF-IDF calculation strategy set in the web page title prefix/suffix identification library, in combination with the extracted web page titles;
In this step, for each web page subtitle remaining after the prefixes and suffixes have been filtered out, the TF-IDF value is calculated in combination with the web page titles, in the web page title prefix/suffix identification library, of the website to which the extracted web page title belongs.
E16, judging whether the calculated TF-IDF value is greater than the pre-set prefix/suffix threshold; if so, determining that the web page subtitle is a prefix or suffix, filtering it out of the truncated web page title, and updating the truncated web page title.
In the embodiment of the present invention, steps E14 to E16 are the detailed process of identifying the prefixes and suffixes of a web page title using the web page title prefix/suffix list and/or the prefix/suffix identification rule stored in the web page title prefix/suffix identification library.
In the embodiment of the present invention, in steps E14 to E16, the web page titles of all sites are used as the sample library, and the prefix/suffix judgement is then performed on each web page title within that sample library.
As another alternative embodiment, the web page titles can first be classified according to site information, for example into Sohu, Sina, 163, Netease, and so on. The web page titles mapped by the web page URL information of the corresponding category are then extracted from the web page title prefix/suffix identification library, the TF-IDF values are calculated using the TF-IDF calculation strategy set in that library, and the prefix/suffix judgement is performed, so as to remove the prefixes and suffixes. Compared with the aforementioned case in which the web page titles of all sites are used as the sample library, this embodiment uses the web page titles of each classified site as a separate sample library; the category to which the web page URL information to be truncated belongs is then determined, and the prefix/suffix judgement is performed on the corresponding web page title within the sample library of that category.
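Building one sample library per site can be sketched as follows. Using the host name of the URL as the site key is an assumption made for the example; the embodiment only states that titles are classified according to site information.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_titles_by_site(url_to_title):
    """Build one sample library per site, keyed by the URL's host name."""
    libraries = defaultdict(list)
    for url, title in url_to_title.items():
        libraries[urlparse(url).hostname].append(title)
    return dict(libraries)


def sample_library_for(url, libraries):
    """Select the sample library of the site the URL to be truncated belongs to."""
    return libraries.get(urlparse(url).hostname, [])
```

The TF-IDF values of the subtitles would then be computed against the selected library only, so that a site-specific suffix stands out within its own site.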
As another alternative embodiment, the prefixes and suffixes mined by the TF-IDF method can also be stored in a web page title prefix/suffix library. In the subsequent process, the web page title is first split, and a preliminary prefix/suffix match is performed against the web page title prefix/suffix library to filter out of the web page title the prefixes and suffixes that match the library. The prefix/suffix judgement by the TF-IDF method is then performed on the filtered web page title, and the prefixes and suffixes so identified are added incrementally to the pre-stored web page title prefix/suffix library.
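The two-stage filtering with incremental updates can be sketched as follows. The library is modelled as a plain set of strings and `is_affix_by_tfidf` is a hypothetical predicate standing in for the TF-IDF judgement.

```python
def truncate_incrementally(subtitles, affix_library, is_affix_by_tfidf):
    """Filter known affixes first, judge the rest, and grow the library in place."""
    remaining = [s for s in subtitles if s not in affix_library]
    kept = []
    for subtitle in remaining:
        if is_affix_by_tfidf(subtitle):
            affix_library.add(subtitle)  # incremental update of the library
        else:
            kept.append(subtitle)
    return kept
```

Subtitles already in the library never reach the more expensive TF-IDF judgement, which is the point of the preliminary match.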
If the web page title matching library includes the web page white list library, the web page title template library and the web page title prefix/suffix identification library, then querying the pre-set web page title matching library, obtaining the matching rule corresponding to the web page URL information to be truncated, and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title includes:
F11, querying the web page white list library; if the web page title mapped by the web page URL information to be truncated is obtained, taking the obtained web page title as the truncated web page title; otherwise, executing step F12;
F12, extracting the naming rule of the web page title mapped by the web page URL information to be truncated, and matching the extracted naming rule against the pre-set classification strategy to obtain the category to which that web page title belongs;
F13, querying the web page title template library; if the regular-expression rule corresponding to the category to which the web page title mapped by the web page URL information to be truncated belongs is obtained, performing regular-expression processing on that web page title using the obtained rule to obtain the truncated web page title; otherwise, executing step F14;
F14, obtaining the web page title mapped by the web page URL information to be truncated, and splitting the obtained web page title according to the pre-set splitting strategy to obtain one or more web page subtitles;
F15, for each web page subtitle, calculating the TF-IDF value of the web page subtitle using the TF-IDF calculation strategy set in the web page title prefix/suffix identification library, in combination with the web page titles mapped by web page URL information stored in that library;
F16, judging whether the calculated TF-IDF value is greater than the pre-set prefix/suffix threshold; if so, determining that the web page subtitle is a prefix or suffix, filtering it out of the web page title, and taking the web page title with prefixes and suffixes filtered out as the truncated web page title.
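Steps F11 to F16 form a cascade, which can be sketched as follows. All collaborators (`whitelist`, `rule_for`, `split`, `is_affix`) are hypothetical parameters introduced for the example. Joining the kept subtitles with a space, and falling through to F14 when a rule exists but does not match, are assumptions.

```python
import re


def truncate_title(url, title, whitelist, rule_for, split, is_affix):
    """Cascade: white list, then template rule, then prefix/suffix filtering."""
    if url in whitelist:                              # F11
        return whitelist[url]
    rule = rule_for(title)                            # F12-F13: pattern string or None
    if rule is not None:
        match = re.search(rule, title)
        if match:
            return match.group(1)
    parts = split(title)                              # F14
    kept = [p for p in parts if not is_affix(p)]      # F15-F16
    return " ".join(kept)
```

The cheapest lookup runs first, and the statistical judgement is only reached when neither the white list nor a template rule applies.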
As an alternative embodiment, the method can further include:
Step 103, delivering the obtained truncated web page title to the user favorites for storage.
In this step, the server can also deliver the obtained truncated web page title to the user for display; after the user chooses whether to modify it, it is stored in the favorites according to the user's choice.
As another alternative embodiment, the method can further include:
truncating the web page title to be truncated using pre-set general truncation rules. Truncation using the general truncation rules is described in detail later.
Fig. 2 shows a schematic structural diagram of the device for obtaining a truncated web page title according to the embodiment of the present invention. Referring to Fig. 2, the device includes a truncation request processing module and a truncated web page title obtaining module, wherein
the truncation request processing module is configured to obtain, from a received request to truncate a web page title, the web page URL information to be truncated and the web page title mapped by that web page URL information;
the truncated web page title obtaining module is configured to query a pre-set web page title matching library, obtain the matching rule corresponding to the web page URL information to be truncated, and process the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title. The web page title matching library includes a web page white list library, and/or a web page title template library, and/or a web page title prefix/suffix identification library, wherein
the web page white list library stores the truncated web page titles corresponding to web page URL information;
the web page title template library stores the regular-expression truncation rules corresponding to web page URL information;
the web page title prefix/suffix identification library stores a web page title prefix/suffix list and/or a prefix/suffix identification rule.
Wherein,
the truncation request processing module includes a receiving unit and a parsing unit (not shown in the figure), wherein
the receiving unit is configured to receive the request to truncate a web page title;
the parsing unit is configured to parse the request to truncate a web page title and obtain the web page title to be truncated and the web page URL information to be truncated.
As an alternative embodiment, the truncated web page title obtaining module includes a web page white list library generation unit and a truncated web page title query unit (not shown in the figure), wherein
the web page white list library generation unit is configured to extract each piece of web page URL information contained in the user favorites and the web page titles mapped by that web page URL information; for each piece of web page URL information, obtain all web page titles mapped by it and count the number of users corresponding to each of those web page titles; apply the number of users corresponding to a web page title, together with the web page title, to a pre-set web page white list calculation strategy to obtain the weight of the web page title; and, within the same web page URL information, select the web page title with the largest weight, and place the web page URL information and the selected web page title, as the web page title mapped by that web page URL information, into the web page white list library;
the truncated web page title query unit is configured to query the web page white list library generation unit, obtain the web page title mapped by the web page URL information to be truncated, and take the obtained web page title as the truncated web page title.
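White list generation can be sketched as follows. The pre-set web page white list calculation strategy is not specified in this passage, so the sketch assumes the simplest possible weighting, namely that the weight of a title equals the number of distinct users who saved it; the record layout `(user_id, url, title)` is also an assumption.

```python
from collections import defaultdict


def build_whitelist(favorites):
    """favorites: iterable of (user_id, url, title) records from user favorites."""
    users_per_title = defaultdict(lambda: defaultdict(set))
    for user_id, url, title in favorites:
        users_per_title[url][title].add(user_id)
    whitelist = {}
    for url, titles in users_per_title.items():
        # Assumed weighting: number of distinct users who saved this title.
        whitelist[url] = max(titles, key=lambda t: len(titles[t]))
    return whitelist
```

A real weighting strategy could also take the title text into account, for example by favoring shorter titles among those with similar user counts.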
In the embodiment of the present invention, preferably, the truncated web page title obtaining module can further include:
a web page title updating unit, configured to obtain web page navigation data and extract the web page URL information contained in the web page navigation data and the web page titles mapped by that web page URL information; traverse each piece of extracted web page URL information and query whether the web page URL information exists in the web page white list library generation unit; if it does not exist, write the web page URL information and the web page title mapped by it into the web page white list library generation unit; if it exists, obtain the web page title mapped by the web page URL information from the extracted web page titles and from the web page white list library generation unit respectively, compare the two, and then determine whether to update the web page title mapped by the web page URL information in the web page white list library generation unit.
As another alternative embodiment, the truncated web page title obtaining module includes a web page title template library generation unit and a truncated web page title acquiring unit, wherein
the web page title template library generation unit is configured to set, in advance, a classification strategy for the web page titles mapped by web page URL information, and to set a corresponding regular-expression rule for the web page titles of each category;
the truncated web page title acquiring unit is configured to extract the naming rule of the web page title mapped by the web page URL information to be truncated, and match the extracted naming rule against the pre-set classification strategy to obtain the category to which that web page title belongs; query the web page title template library generation unit to obtain the regular-expression rule corresponding to that category; and perform regular-expression processing on the web page title mapped by the web page URL information to be truncated using the obtained rule, to obtain the truncated web page title.
As yet another alternative embodiment, the truncated web page title obtaining module includes a web page title prefix/suffix identification library generation unit and a truncated web page title processing unit, wherein
the web page title prefix/suffix identification library generation unit is configured to obtain and store the web page titles mapped by web page URL information in user favorites, and to set the TF-IDF calculation strategy for identifying the prefixes and suffixes of web page titles;
the truncated web page title processing unit is configured to obtain the web page title mapped by the web page URL information to be truncated, and split the obtained web page title according to the pre-set splitting strategy to obtain one or more web page subtitles; for each web page subtitle, calculate the TF-IDF value of the web page subtitle using the TF-IDF calculation strategy set in the web page title prefix/suffix identification library, in combination with the web page titles mapped by web page URL information stored in that library; and judge whether the calculated TF-IDF value is greater than the pre-set prefix/suffix threshold and, if so, determine that the web page subtitle is a prefix or suffix, filter it out of the web page title, and take the web page title with prefixes and suffixes filtered out as the truncated web page title.
As yet another alternative embodiment, the truncated web page title obtaining module includes a web page white list library generation unit, a web page title template library generation unit, a web page title prefix/suffix identification library generation unit, a truncated web page title query unit, a truncated web page title acquiring unit and a truncated web page title processing unit, wherein
the web page white list library generation unit is configured to extract each piece of web page URL information contained in the user favorites and the web page titles mapped by that web page URL information; for each piece of web page URL information, obtain all web page titles mapped by it and count the number of users corresponding to each of those web page titles; apply the number of users corresponding to a web page title, together with the web page title, to the pre-set web page white list calculation strategy to obtain the weight of the web page title; and, within the same web page URL information, select the web page title with the largest weight, and place the web page URL information and the selected web page title, as the web page title mapped by that web page URL information, into the web page white list library;
the web page title template library generation unit is configured to set, in advance, a classification strategy for the web page titles mapped by web page URL information, and to set a corresponding regular-expression rule for the web page titles of each category;
the web page title prefix/suffix identification library generation unit is configured to obtain and store the web page titles mapped by web page URL information in user favorites, and to set the TF-IDF calculation strategy for identifying the prefixes and suffixes of web page titles;
the truncated web page title query unit is configured to query the web page white list library generation unit according to the web page URL information to be truncated; if the web page title mapped by the web page URL information to be truncated is obtained, take the obtained web page title as the truncated web page title; otherwise, notify the truncated web page title acquiring unit;
the truncated web page title acquiring unit is configured to extract the naming rule of the web page title mapped by the web page URL information to be truncated, and match the extracted naming rule against the pre-set classification strategy to obtain the category to which that web page title belongs; query the web page title template library generation unit and, if the regular-expression rule corresponding to that category is obtained, perform regular-expression processing on the web page title mapped by the web page URL information to be truncated using the obtained rule to obtain the truncated web page title; otherwise, notify the truncated web page title processing unit;
the truncated web page title processing unit is configured to obtain the web page title mapped by the web page URL information to be truncated, and split the obtained web page title according to the pre-set splitting strategy to obtain one or more web page subtitles; for each web page subtitle, calculate the TF-IDF value of the web page subtitle using the TF-IDF calculation strategy set in the web page title prefix/suffix identification library, in combination with the web page titles mapped by web page URL information stored in that library; and judge whether the calculated TF-IDF value is greater than the pre-set prefix/suffix threshold and, if so, determine that the web page subtitle is a prefix or suffix, filter it out of the web page title, and take the web page title with prefixes and suffixes filtered out as the truncated web page title.
A specific embodiment is given below to illustrate the method for obtaining a truncated web page title.
Fig. 3 shows a schematic flow chart of the method for obtaining a truncated web page title according to the embodiment of the present invention. Referring to Fig. 3, the flow includes:
Step 301, inputting the web page title to be truncated and the web page URL information mapped by the web page title to be truncated;
In this step, the user may be saving a web page title to the favorites while browsing a web page, or may be optimizing, that is, truncating, a web page title already stored in the web page favorites. For example, after the user clicks the web page title to be optimized, the web browser is triggered to input to the server the web page title to be truncated and the web page URL information mapped by it.
Step 302, querying the web page white list library according to the web page URL information; if the web page URL information is stored in the web page white list library, executing step 303; otherwise, executing step 304;
Step 303, reading the web page title mapped by the web page URL information in the web page white list library, outputting it as the truncated web page title, and ending the flow;
Step 304, judging whether the input web page title to be truncated is valid; if it is invalid, executing step 305; otherwise, executing step 306;
In steps 303 to 304, the input web page URL information is looked up in the web page white list library. If the input web page URL information is stored in the web page white list library, the web page white list is hit, and the web page title mapped by the web page URL information is returned directly, output as the truncated web page title, and the flow ends; otherwise, the validity of the input web page title to be truncated needs to be judged. For example, if the input web page title to be truncated is "using Baidu.com, you are known that" and the mapped web page URL information is http://www.baidu.com/, then, by querying and matching the web page white list library, the web page title "Baidu" stored in the web page white list library is returned as the truncated web page title.
As an alternative embodiment, the web page white list library can also be loaded into a cache in advance and the web page URL information matched in the cache. This improves the efficiency of obtaining the truncated web page title and shortens the processing time.
In this step, an invalid web page title means that the input web page title cannot reflect the web page content at all, for example, it is empty or does not contain any text (for example, it contains only symbols).
Step 305, returning the domain name corresponding to the web page URL information as the truncated web page title, and ending the flow;
Step 306, according to the webpage URL information query webpage title template library of input, if in web page title template library
There are the webpage URL informations of the input, execute step 307, otherwise, execute step 308;
Step 307, the corresponding regularity of webpage URL information of input described in web page title template library is read, is utilized
The regularity of reading treats truncated web page title and carries out canonical processing, obtains truncated web page title and terminates process;
In this step, it is queried whether the input webpage URL information hits the web page title template library. For example, suppose the input web page title to be truncated is "Russian girl outdoor bathing place get sun very sexy _ Liu Xingyun _ sina blog" and the webpage URL information is http://blog.sina.com.cn/s/blog_49b0d2b50102eyxt.html?tj=1. If the web page title template library stores http://blog.sina.com.cn together with its corresponding regular-expression rule, the web page title template library is hit, and the stored rule is used to extract "Russian girl outdoor bathing place get sun very sexy _ Liu Xingyun" from the input web page title as the truncated web page title.
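A template library hit of this kind can be sketched as a mapping from a site prefix to a regular expression. The sketch below is illustrative only: `TEMPLATE_LIBRARY`, the pattern itself, and the choice to return `None` when the pattern does not match are assumptions, since the text does not give the stored rules.

```python
import re
from urllib.parse import urlsplit

# Hypothetical template library: site prefix -> pattern whose "title" group
# captures the part of the title to keep.
TEMPLATE_LIBRARY = {
    "http://blog.sina.com.cn": re.compile(r"^(?P<title>.+)_\s*sina blog$"),
}


def apply_template(url, title):
    """Steps 306-307: return the short title on a hit, or None on a miss."""
    parts = urlsplit(url)
    site = f"{parts.scheme}://{parts.hostname}"
    rule = TEMPLATE_LIBRARY.get(site)
    if rule is None:
        return None                    # miss: processing continues at step 308
    match = rule.match(title)
    # Returning None on a non-matching title is an assumption of this sketch.
    return match.group("title").strip() if match else None
```

With the example title above, the assumed pattern strips the trailing "_ sina blog" and keeps everything before it.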
Step 308, split the input web page title to be truncated according to a preset splitting strategy to obtain one or more webpage subtitles;
Step 309, for each webpage subtitle, calculate its term frequency-inverse document frequency (TF-IDF) value, using the web page titles mapped to the webpage URL information that are stored in the web page title prefix and suffix identification library together with the TF-IDF calculation strategy set in that library;
Step 310, judge whether the calculated TF-IDF value is greater than a preset prefix/suffix threshold; if so, execute step 311; otherwise, return to step 309 to calculate the TF-IDF value of the next webpage subtitle;
Step 311, determine that the webpage subtitle whose value is greater than the preset prefix/suffix threshold is a prefix or suffix, and filter that prefix or suffix out of the web page title to be truncated;
Step 312, judge whether the length of the web page title after the prefixes and suffixes are filtered out is greater than a preset web page title length threshold; if not, execute step 313; otherwise, execute step 314;
Step 313, take the filtered result as the truncated web page title, and end the process;
In steps 308 to 313, the input web page title to be truncated is split by matching against preset punctuation marks. Prefixes and suffixes are then identified on the maximum-matching principle, using the web page titles mapped to the webpage URL information that are stored in the web page title prefix and suffix identification library and the TF-IDF calculation strategy set in that library. For example, if the web page title contains only one prefix or suffix (of the web page title), the identified prefix or suffix is simply removed. However, a web page title may contain several prefixes and suffixes, for example website hierarchy information in addition to the page information, so splitting can produce multiple candidate prefixes and suffixes. To remove them accurately, the embodiment of the present invention uses permutation and combination, for example maximum forward matching, maximum reverse matching, or both at once, so that all candidate prefixes and suffixes are included in the statistics. The prefixes and suffixes are then extracted and filtered out of the web page title. If the length of the filtered web page title satisfies the preset web page title length threshold, the filtered web page title is returned as the truncated web page title.
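The split-and-filter flow of steps 308 to 311 can be sketched as below. The text does not spell out the TF-IDF calculation strategy in this passage, so the sketch swaps in a simpler score, namely the share of stored titles from the same site that contain a segment; that score, the separator set, the threshold of 0.5, and all function names are assumptions made for illustration.

```python
import re

# Assumed separator set; note that the plain hyphen also splits hyphenated words.
SEPARATORS = re.compile(r"[_\-|:]")


def split_title(title):
    """Step 308: split a title into non-empty, stripped subtitles."""
    return [p.strip() for p in SEPARATORS.split(title) if p.strip()]


def affix_score(segment, stored_titles):
    """Stand-in score (not TF-IDF): share of stored titles containing the segment."""
    if not stored_titles:
        return 0.0
    hits = sum(1 for t in stored_titles if segment in split_title(t))
    return hits / len(stored_titles)


def strip_affixes(title, stored_titles, threshold=0.5):
    """Steps 309-311: drop every subtitle whose score exceeds the threshold."""
    kept = [s for s in split_title(title)
            if affix_score(s, stored_titles) <= threshold]
    return " ".join(kept)              # original separators are not preserved
```

A segment such as a site name recurs in most titles stored for that site, so it scores high and is filtered, while the page-specific headline scores low and is kept.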
In the embodiment of the present invention, for example, suppose the input web page title to be truncated is "HD: Wuhan reporter investigates undercover for half a month and exposes the kidney-trade shady deal _ news _ www.qq.com" and the webpage URL information is http://news.qq.com/a/20130820/003196.htm#p=3. Based on the web page titles mapped to the webpage URL information stored in the web page title prefix and suffix identification library, the splitting strategy, the maximum-matching principle, and the TF-IDF calculation strategy described above, the result after filtering out the prefixes and suffixes is "Wuhan reporter investigates undercover for half a month and exposes the kidney-trade shady deal", and this result is taken as the truncated web page title.
Step 314, judge whether the web page title to be truncated contains enclosed content; if so, execute step 315; otherwise, execute step 316;
In this step, enclosed content refers to content enclosed in symbols such as title marks (《》) and brackets.
Step 315, take the enclosed content as the truncated title content, and end the process;
Step 316, cut the web page title to be truncated using a preset first group of punctuation marks;
Step 317, judge whether there is any cut fragment whose length is not greater than a preset fragment threshold; if so, execute step 318; otherwise, execute step 321;
Step 318, for each cut fragment whose length is not greater than the preset fragment threshold, remove the common phrases from the fragment, and judge whether the length of the fragment after removal is not greater than the preset web page title length threshold; if so, execute step 319; otherwise, execute step 320;
Step 319, return the fragment with the common phrases removed as the truncated web page title, and end the process;
Step 320, cut the fragment with the common phrases removed using a preset second group of punctuation marks, and return to step 317;
Step 321, starting from the initial position of the web page title to be truncated, intercept a character string whose length equals the web page title length threshold as the truncated web page title.
In the embodiment of the present invention, steps 314 to 321 truncate the web page title to be truncated using a preset general truncation rule. For example, for the web page title to be truncated "mini games, 4399 mini games, mini games collection, two-player mini games collection - www.4399.com, China's largest game site", with webpage URL information "http://www.4399.com/ sogou", applying the general truncation rule above yields the truncated web page title "4399 mini games".
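The general truncation rule of steps 314 to 321 can be sketched as follows. The two punctuation groups, the list of common phrases, both thresholds, and the choice of the first qualifying fragment are assumptions, because the text does not list them here; the sketch also caps the loop at one first-group pass and one second-group pass so that it always terminates.

```python
import re

ENCLOSED = re.compile(r"《([^》]+)》|\(([^)]+)\)|\[([^\]]+)\]")
FIRST_GROUP = re.compile(r"[,，;；|_]")           # assumed first punctuation group
SECOND_GROUP = re.compile(r"[\s\-:：]")           # assumed second punctuation group
COMMON_PHRASES = ("official site", "home page")  # assumed common phrases


def general_truncate(title, segment_max=20, title_max=12):
    match = ENCLOSED.search(title)                # steps 314-315
    if match:
        return next(g for g in match.groups() if g)
    fragments = FIRST_GROUP.split(title)          # step 316
    for _ in range(2):                            # first pass, then one second-group pass
        short = [f.strip() for f in fragments
                 if 0 < len(f.strip()) <= segment_max]   # step 317
        if not short:
            break
        cleaned = []
        for frag in short:                        # step 318: remove common phrases
            for phrase in COMMON_PHRASES:
                frag = frag.replace(phrase, "")
            if frag.strip():
                cleaned.append(frag.strip())
        for frag in cleaned:
            if len(frag) <= title_max:
                return frag                       # step 319
        fragments = [p for frag in cleaned
                     for p in SECOND_GROUP.split(frag)]  # step 320
    return title[:title_max]                      # step 321
```

Which fragment wins when several qualify is a design choice; this sketch returns the first one in title order, whereas a real rule set might rank fragments differently.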
From the foregoing, it can be seen that the embodiment of the present invention addresses, for the first time, the technical problem that web page titles in favorites are too long, which degrades the display and leaves little useful information shown. It proposes truncating the web page title to be truncated by combining multiple strategies. Specifically, using favorites data, the titles that users give to web pages are statistically analyzed and extracted to generate the webpage white list library; the regular-expression truncation rules corresponding to webpage URL information are stored in advance in the web page title template library; and, based on a large amount of webpage URL information and the web page titles mapped to it, a web page title prefix/suffix list and/or prefix/suffix recognition rules are set in the web page title prefix and suffix identification library. The prefixes, suffixes, and descriptive expressions contained in a web page title are thereby removed effectively, giving a good de-redundancy effect, so that the truncated web page title meets the requirements of the browser display area. Further, the truncation method combining multiple strategies has higher accuracy, so the truncated web page title gives the user more useful information, which improves the user experience.
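The white list generation summarized above, in its simplest form where a title's weight is the number of users who chose it, can be sketched as follows. The input layout of `(user_id, url, title)` tuples and the function name are assumptions for illustration; ties are resolved in favor of the title seen first.

```python
from collections import defaultdict


def build_white_list(favorites):
    """favorites: iterable of (user_id, url, user_given_title) tuples."""
    users_per_title = defaultdict(lambda: defaultdict(set))
    for user_id, url, title in favorites:
        users_per_title[url][title].add(user_id)  # count each user once per title
    white_list = {}
    for url, titles in users_per_title.items():
        # Weight of a title = number of distinct users who gave the page that title.
        white_list[url] = max(titles, key=lambda t: len(titles[t]))
    return white_list
```

Using a set per title keeps a user who saved the same page twice from being counted twice.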
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such a system is apparent. In addition, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in various programming languages, and that the description given above of a specific language is intended to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided here. It is to be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment may be changed adaptively and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will appreciate that, although some embodiments described herein include certain features that are included in other embodiments but not other features, combinations of features of different embodiments are meant to fall within the scope of the invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for obtaining a truncated web page title according to an embodiment of the present invention. The invention may also be implemented as a device or device program (for example, a computer program or computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.
Claims (23)
1. A method for obtaining a truncated web page title, comprising:
obtaining webpage uniform resource locator (URL) information and the web page title to be truncated that is mapped to the webpage URL information;
processing the web page title to be truncated so that only the part that reflects the web page content is retained;
wherein the processing of the web page title to be truncated comprises one of the following methods or any combination of them: performing word segmentation on the title and removing meaningless words; querying a preset web page title matching library to obtain the matching rule corresponding to the webpage URL information to be truncated, and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title; and truncating the title using a general rule;
wherein the web page title matching library comprises a webpage white list library, and/or a web page title template library, and/or a web page title prefix and suffix identification library; the webpage white list library stores the truncated web page title corresponding to the webpage URL information, and the web page title template library stores the regular-expression truncation rule corresponding to the webpage URL information.
2. The method according to claim 1, wherein the webpage white list library stores the truncated web page title corresponding to the webpage URL information;
the web page title template library stores the regular-expression truncation rule corresponding to the webpage URL information; and
the web page title prefix and suffix identification library stores a web page title prefix/suffix list and/or prefix/suffix recognition rules.
3. The method according to claim 2, wherein processing the web page title to be truncated according to the obtained matching rule comprises:
processing the web page title to be truncated using, in order, the webpage white list library, the web page title template library, and the web page title prefix and suffix identification library.
4. The method according to claim 1, wherein obtaining the webpage URL information and the web page title to be truncated that is mapped to the webpage URL information comprises:
receiving a request to truncate a web page title;
parsing the request to truncate the web page title to obtain the web page title to be truncated and the webpage URL information to be truncated.
5. The method according to claim 1, wherein generating the webpage white list library comprises:
extracting each item of webpage URL information contained in the favorites data of multiple users and the web page titles mapped to that webpage URL information;
for each item of webpage URL information, obtaining all web page titles mapped to that webpage URL information, and counting the number of users corresponding to each web page title mapped to that webpage URL information;
applying the web page title and its corresponding number of users to a preset webpage white list calculation strategy to obtain the weight value of the web page title;
for the same webpage URL information, selecting the web page title with the largest web page title weight value, and placing the webpage URL information and the selected web page title, as the web page title mapped to the webpage URL information, in the webpage white list library.
6. The method according to claim 5, wherein the webpage white list calculation strategy is a calculation strategy based on the number of users, and the web page title weight value is the number of users.
7. The method according to claim 5, wherein the webpage white list calculation strategy is a calculation strategy based on preset per-user field weights, and obtaining the web page title weight value comprises:
extracting the feature words contained in the web page titles mapped to the webpage URL information, matching them against a preset feature dictionary for each field, and determining the field to which the webpage URL information belongs;
according to the field weights set in advance for each user, obtaining, for each user of each web page title mapped to the webpage URL information, that user's field weight in the determined field of the webpage URL information;
applying the number of users of the web page title and the users' field weights in the determined field of the webpage URL information to a preset weight calculation formula to obtain the web page title weight value;
wherein the weight calculation formula is specifically:
in the formula,
X_i is the weight value of the i-th web page title, where i is a natural number;
U_{i,j} is the j-th user corresponding to the i-th web page title;
ξ_{i,j} is the field weight of the j-th user corresponding to the i-th web page title in the field to which that web page title belongs;
K is the total number of users corresponding to the i-th web page title, and K is a natural number.
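The formula itself does not appear in this text. One form that would be consistent with the symbol definitions above, offered only as an assumed reconstruction and not as the claimed formula, sums each user's field weight over the K users of the i-th title, with U_{i,j} read as a per-user count or indicator:

```latex
X_i = \sum_{j=1}^{K} U_{i,j}\,\xi_{i,j}
```

Under this reading, setting every field weight to 1 reduces the weight value to the plain user count of claim 6.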
8. The method according to claim 6 or 7, further comprising:
obtaining web page navigation data, and extracting the webpage URL information contained in the web page navigation data and the web page titles mapped to that webpage URL information;
traversing each extracted item of webpage URL information and querying whether it exists in the webpage white list library; if it does not exist, writing the webpage URL information and the web page title mapped to it into the white list library; if it exists, obtaining the web page title mapped to the webpage URL information from the extracted web page titles and from the webpage white list library respectively, and determining after comparison whether to update the web page title mapped to the webpage URL information in the webpage white list library.
9. The method according to claim 8, wherein the web page title matching library comprises the webpage white list library, and querying the preset web page title matching library to obtain the matching rule corresponding to the webpage URL information to be truncated and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title comprises:
querying the webpage white list library to obtain the web page title mapped to the webpage URL information to be truncated, and taking the obtained web page title as the truncated web page title.
10. The method according to claim 1, wherein generating the web page title template library comprises:
setting in advance a classification strategy for the web page titles mapped to webpage URL information, and setting a corresponding regular-expression rule for the web page titles of each category.
11. The method according to claim 10, wherein the web page title matching library comprises the web page title template library, and querying the preset web page title matching library to obtain the matching rule corresponding to the webpage URL information to be truncated and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title comprises:
extracting the naming rule of the web page title mapped to the webpage URL information to be truncated, matching the extracted naming rule against the preset classification strategy, and obtaining the category to which the web page title mapped to the webpage URL information to be truncated belongs;
querying the web page title template library to obtain the regular-expression rule corresponding to the category to which the web page title mapped to the webpage URL information to be truncated belongs;
applying the obtained regular-expression rule to the web page title mapped to the webpage URL information to be truncated to obtain the truncated web page title.
12. The method according to claim 1, wherein generating the web page title prefix and suffix identification library comprises:
obtaining and storing the web page titles mapped to the webpage URL information to be truncated;
setting a term frequency-inverse document frequency (TF-IDF) calculation strategy for identifying the prefixes and suffixes of web page titles, so as to form a web page title prefix/suffix list and/or prefix/suffix recognition rules.
13. The method according to claim 12, wherein the web page title matching library comprises the web page title prefix and suffix identification library, and querying the preset web page title matching library to obtain the matching rule corresponding to the webpage URL information to be truncated and processing the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title comprises:
obtaining the web page title mapped to the webpage URL information to be truncated, and splitting the obtained web page title according to a preset splitting strategy to obtain one or more webpage subtitles;
for each webpage subtitle, calculating its TF-IDF value using the web page titles mapped to the webpage URL information that are stored in the web page title prefix and suffix identification library together with the TF-IDF calculation strategy set in that library;
judging whether the calculated TF-IDF value is greater than a preset prefix/suffix threshold; if so, determining that the webpage subtitle is a prefix or suffix, filtering it out of the web page title, taking the web page title with the prefixes and suffixes filtered out as the truncated web page title, and storing the determined prefix or suffix in the web page title prefix and suffix library.
14. The method according to claim 13, wherein, before calculating the TF-IDF value of each webpage subtitle, the method further comprises:
combining the obtained webpage subtitles and, for each combined webpage subtitle, calculating its TF-IDF value according to the web page titles mapped to the webpage URL information stored in the web page title prefix and suffix identification library and the TF-IDF calculation strategy; and, if none of the combined webpage subtitles is a prefix or suffix, performing the calculation of the TF-IDF value of each webpage subtitle.
15. The method according to claim 13, wherein, after the prefix or suffix is filtered out of the web page title and before the web page title with the prefixes and suffixes filtered out is taken as the truncated web page title, the method further comprises:
judging whether the length of the web page title with the prefixes and suffixes filtered out is greater than a preset web page title length threshold, and taking a filtered web page title whose length is not greater than the preset web page title length threshold as the truncated web page title.
16. The method according to claim 13, wherein the splitting strategy splits the web page title at the preset punctuation marks it contains, the preset punctuation marks comprising: _ ,-,-,+, &, # ...:,, |:, ┊,‖,;,,.,, s ,-,-and?.
17. The method according to claim 1, wherein truncating the web page title to be truncated according to a preset general truncation rule comprises:
G1, judging whether the web page title to be truncated contains enclosed content, wherein the enclosed content is content enclosed in symbols; if so, executing step G2; otherwise, executing step G3;
G2, taking the enclosed content as the truncated title content, and ending the process;
G3, cutting the web page title to be truncated using a preset first group of punctuation marks;
G4, judging whether there is any cut fragment whose length is not greater than a preset fragment threshold; if so, executing step G5; otherwise, executing step G8;
G5, for each cut fragment whose length is not greater than the preset fragment threshold, removing the common phrases from the fragment, and judging whether the length of the fragment after removal is not greater than a preset web page title length threshold; if so, executing step G6; otherwise, executing step G7;
G6, returning the fragment with the common phrases removed as the truncated web page title, and ending the process;
G7, cutting the fragment with the common phrases removed using a preset second group of punctuation marks, and returning to step G4;
G8, starting from the initial position of the web page title to be truncated, intercepting a character string whose length equals the web page title length threshold as the truncated web page title.
18. A device for obtaining a truncated web page title, comprising a truncation request processing module and a truncated web page title obtaining module, wherein
the truncation request processing module is configured to obtain, from a received request to truncate a web page title, the webpage URL information to be truncated and the web page title mapped to that webpage URL information;
the truncated web page title obtaining module is configured to query a preset web page title matching library to obtain the matching rule corresponding to the webpage URL information to be truncated, and to process the web page title to be truncated according to the obtained matching rule to obtain the truncated web page title; the web page title matching library comprises a webpage white list library, and/or a web page title template library, and/or a web page title prefix and suffix identification library; the webpage white list library stores the truncated web page title corresponding to the webpage URL information, and the web page title template library stores the regular-expression truncation rule corresponding to the webpage URL information.
19. The device according to claim 18, wherein the truncation request processing module comprises a receiving unit and a parsing unit, wherein
the receiving unit is configured to receive a request to truncate a web page title;
the parsing unit is configured to parse the request to truncate the web page title to obtain the web page title to be truncated and the webpage URL information to be truncated.
20. The device according to claim 19, wherein the truncated web page title obtaining module comprises a webpage white list library generation unit and a truncated web page title query unit, wherein
the webpage white list library generation unit is configured to: extract each item of webpage URL information contained in users' favorites and the web page titles mapped to that webpage URL information; for each item of webpage URL information, obtain all web page titles mapped to that webpage URL information and count the number of users corresponding to each of those web page titles; apply the web page title and its corresponding number of users to a preset webpage white list calculation strategy to obtain the weight value of the web page title; and, for the same webpage URL information, select the web page title with the largest web page title weight value and place the webpage URL information and the selected web page title, as the web page title mapped to the webpage URL information, in the webpage white list library;
the truncated web page title query unit is configured to query the webpage white list library generation unit to obtain the web page title mapped to the webpage URL information to be truncated, and to take the obtained web page title as the truncated web page title.
21. The device according to claim 20, wherein the truncated web page title obtaining module further comprises:
a web page title updating unit, configured to: obtain web page navigation data, and extract the webpage URL information contained in the web page navigation data and the web page titles mapped to that webpage URL information; traverse each extracted item of webpage URL information and query whether it exists in the webpage white list library generation unit; if it does not exist, write the webpage URL information and the web page title mapped to it into the webpage white list library generation unit; if it exists, obtain the web page title mapped to the webpage URL information from the extracted web page titles and from the webpage white list library generation unit respectively, and determine after comparison whether to update the web page title mapped to the webpage URL information in the webpage white list library generation unit.
22. The device according to claim 18, wherein the truncated web page title obtaining module comprises a web page title template library generation unit and a truncated web page title acquiring unit, wherein
the web page title template library generation unit is configured to set in advance a classification strategy for the web page titles mapped to webpage URL information, and to set a corresponding regular-expression rule for the web page titles of each category;
the truncated web page title acquiring unit is configured to: extract the naming rule of the web page title mapped to the webpage URL information to be truncated, match the extracted naming rule against the preset classification strategy, and obtain the category to which that web page title belongs; query the web page title template library generation unit to obtain the regular-expression rule corresponding to that category; and apply the obtained regular-expression rule to the web page title mapped to the webpage URL information to be truncated to obtain the truncated web page title.
23. The device as claimed in claim 18, wherein the truncated web page title obtaining module comprises
a web page title prefix/suffix identification library generation unit and a truncated web page title
processing unit, wherein
the web page title prefix/suffix identification library generation unit is configured to obtain and
store the web page titles mapped by webpage URL information in user favorites, and to set a term
frequency-inverse document frequency (TF-IDF) calculation strategy for identifying the prefixes and
suffixes of web page titles;
the truncated web page title processing unit is configured to obtain the web page title mapped by the
webpage URL information to be truncated; split the obtained web page title according to a preset
splitting strategy to obtain one or more web page subtitles; for each web page subtitle, calculate its
TF-IDF value using the TF-IDF calculation strategy set in the prefix/suffix identification library, in
combination with the web page titles stored in that library; judge whether the calculated TF-IDF value
is greater than a preset prefix/suffix threshold; and if so, determine that the web page subtitle is a
prefix or suffix, filter it out of the web page title, and take the web page title with the prefixes
and suffixes filtered out as the truncated web page title.
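The prefix/suffix filtering of claim 23 can be sketched as follows. The claim does not give the scoring formula, so this sketch substitutes a plain document-frequency ratio (the share of stored titles that contain a segment) for the claimed TF-IDF calculation strategy; the separator pattern, the function names, and the 0.5 threshold are likewise assumptions made only for illustration.

```python
import re

# Hypothetical splitting strategy: break a title on "-", "_" or "|" when the
# separator is surrounded by whitespace, so hyphenated words stay intact.
SEPARATORS = re.compile(r"\s+[-_|]\s+")


def split_title(title):
    """Split a title into sub-titles, dropping empty pieces."""
    return [part for part in SEPARATORS.split(title) if part]


def build_library(stored_titles):
    """Count, for each sub-title, how many stored titles contain it."""
    counts = {}
    for title in stored_titles:
        for segment in set(split_title(title)):
            counts[segment] = counts.get(segment, 0) + 1
    return counts, len(stored_titles)


def truncate_title(title, library, threshold=0.5):
    """Drop sub-titles whose score exceeds the threshold (assumed affixes)."""
    counts, total = library
    kept = []
    for segment in split_title(title):
        # Stand-in score: fraction of stored titles containing this segment.
        score = counts.get(segment, 0) / total if total else 0.0
        if score > threshold:
            continue  # treated as a prefix/suffix and filtered out
        kept.append(segment)
    # If every segment was filtered, fall back to the original title.
    return " ".join(kept) if kept else title
```

With a library built from titles such as "Weather today - Example News" and "Sports results - Example News", the recurring segment "Example News" scores above the threshold and is removed from a new title that carries it, while segments unique to one page are kept.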
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410158987.XA CN105095175B (en) | 2014-04-18 | 2014-04-18 | Obtain the method and device of truncated web page title |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105095175A (en) | 2015-11-25 |
CN105095175B (en) | 2019-04-30 |
Family
ID=54575649
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410158987.XA Active CN105095175B (en) | 2014-04-18 | 2014-04-18 | Obtain the method and device of truncated web page title |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105095175B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105574175A (en) * | 2015-12-21 | 2016-05-11 | 北京奇虎科技有限公司 | Processing method and device for optimizing search result title |
CN105630909A (en) * | 2015-12-21 | 2016-06-01 | 北京奇虎科技有限公司 | Method and device for displaying normalized header information |
CN107045529B (en) * | 2017-01-16 | 2021-01-22 | 阿里巴巴(中国)有限公司 | Network content acquisition method and device and service terminal |
CN106959945B (en) * | 2017-03-23 | 2021-01-05 | 北京百度网讯科技有限公司 | Method and device for generating short titles for news based on artificial intelligence |
CN110852097B (en) * | 2019-10-15 | 2022-02-01 | 平安科技(深圳)有限公司 | Feature word extraction method, text similarity calculation method, device and equipment |
CN111460307B (en) * | 2020-04-03 | 2020-11-06 | 渭南双盈未来科技有限公司 | Mobile terminal accurate searching method and device |
CN111680482B (en) * | 2020-05-07 | 2024-04-12 | 车智互联(北京)科技有限公司 | Title image-text generation method and computing device |
CN112437356B (en) * | 2020-11-13 | 2021-09-28 | 珠海大横琴科技发展有限公司 | Streaming media data processing method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102831199A (en) * | 2012-08-07 | 2012-12-19 | 北京奇虎科技有限公司 | Method and device for establishing interest model |
CN102831248A (en) * | 2012-09-18 | 2012-12-19 | 北京奇虎科技有限公司 | Network hotspot mining method and network hotspot mining device |
CN103324665A (en) * | 2013-05-14 | 2013-09-25 | 亿赞普(北京)科技有限公司 | Hot spot information extraction method and device based on micro-blog |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130262430A1 (en) * | 2012-03-29 | 2013-10-03 | Microsoft Corporation | Dominant image determination for search results |
US8799278B2 (en) * | 2012-10-01 | 2014-08-05 | DISCERN, Inc. | Data augmentation based on second-phase metadata |
Non-Patent Citations (1)
Title |
---|
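Research and Design of a Personalized Information Recommendation System Based on an Interest Model; Xie Chuangfeng (谢创丰); China Master's Theses Full-text Database, Information Science and Technology; 2010-10-15 (No. 10); full text |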
Also Published As
Publication number | Publication date |
---|---|
CN105095175A (en) | 2015-11-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105095175B (en) | Obtain the method and device of truncated web page title | |
Adar et al. | The web changes everything: understanding the dynamics of web content | |
CN102831199B (en) | Method and device for establishing interest model | |
CN105843965B (en) | A kind of Deep Web Crawler form filling method and apparatus based on URL subject classification | |
Peters et al. | Content extraction using diverse feature sets | |
CN108920434A (en) | A kind of general Web page subject method for extracting content and system | |
JP2009151749A (en) | Method and system for filtering subject related web page based on navigation path information | |
CN104391978B (en) | Web page storage processing method and processing device for browser | |
CN104978408A (en) | Berkeley DB database based topic crawler system | |
US20160103913A1 (en) | Method and system for calculating a degree of linkage for webpages | |
TW202001620A (en) | Automatic website data collection method using a complex semantic computing model to form a seed vocabulary data set | |
JP4875911B2 (en) | Content identification method and apparatus | |
US20150302093A1 (en) | Method and system for filtering of a website | |
CN103116635A (en) | Field-oriented method and system for collecting invisible web resources | |
CN104899215A (en) | Data processing method, recommendation source information organization, information recommendation method and information recommendation device | |
Mehta et al. | DOM tree based approach for web content extraction | |
CN106776640A (en) | A kind of stock information information displaying method and device | |
KR20090120843A (en) | A system and method generating multi-concept networks based on user's web usage data | |
Sluban et al. | URL Tree: Efficient unsupervised content extraction from streams of web documents | |
CN108470046B (en) | News event sequencing method and system based on news event search sentence | |
CN104462613B (en) | Hot spot polymerization and device | |
CN105787032B (en) | The generation method and device of snapshots of web pages | |
Peng et al. | Tunneling enhanced by web page content block partition for focused crawling | |
Saberi et al. | What does the future of search engine optimization hold? |
Blanco et al. | Efficiently Locating Collections of Web Pages to Wrap. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||