CN103970755A - Novel catalog entry identification method, device and system

Publication number: CN103970755A
Authority: CN (China)
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN201310031915.4A
Other languages: Chinese (zh)
Other versions: CN103970755B (en)
Inventor: 黄钰
Current Assignee: Tencent Technology Shenzhen Co Ltd (the listed assignees may be inaccurate)
Original Assignee: Tencent Technology Shenzhen Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/957: Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577: Optimising the visualization of content, e.g. distillation of HTML documents

Abstract

The embodiments of the invention disclose a novel catalog entry identification method, device and system. The embodiments first determine whether a World Wide Web (WWW) page has a novel catalog entry feature, which gives a preliminary identification of novel catalog pages among WWW pages. For a WWW page without that feature, the page is visually partitioned, a first Document Object Model (DOM) tree is built from the partitioned page, the features of suspected catalog blocks are obtained from the first DOM tree, and those features are used to decide whether the page is a novel catalog page. Novel catalog entries are thereby identified, which facilitates subsequent display on a mobile terminal, improves the display effect, and improves the user's browsing quality.

Description

Novel catalog entry identification method, device and system
Technical field
The present invention relates to the field of communication technology, and in particular to a novel catalog entry identification method, device and system.
Background
With the development of mobile Internet technology and mobile terminals, people increasingly read information on the Internet through mobile terminals, including the various novels published online. However, a large share of the novels currently on the Internet exist in the form of World Wide Web (WWW) pages. A WWW page generally refers to a webpage designed for a personal computer (PC), as distinct from a Wireless Application Protocol (WAP) page, which generally refers to a webpage designed for a mobile terminal.
In researching and practicing the prior art, the inventor found that novels on WWW pages are relatively complex in both structure and content, and displaying them on a mobile terminal is subject to limitations. The display effect is therefore often poor and can even degrade the user's browsing quality.
Summary of the invention
The embodiments of the present invention provide a novel catalog entry identification method, device and system that can identify novel catalog entries, which facilitates subsequent display on a mobile terminal, improves the display effect, and improves the user's browsing quality.
A novel catalog entry identification method, comprising:
Determining whether a WWW page has a novel catalog entry feature;
If so, determining that the WWW page is a novel catalog page;
If not, visually partitioning the WWW page to obtain a partitioned webpage, building a first Document Object Model (DOM) tree from the partitioned webpage, obtaining the features of suspected catalog blocks from the first DOM tree, and, when the features of the suspected catalog blocks indicate that a novel catalog page exists, determining that the partitioned webpage is a novel catalog page.
Optionally, determining whether the WWW page has a novel catalog entry feature comprises:
Determining whether a novel catalog entry feature exists according to the link of the WWW page, its title, and the text links with textual features in the full page. For example, specifically as follows:
Determining, according to the link of the WWW page, whether the WWW page is a homepage or a secondary homepage;
If so, determining that no novel catalog entry feature exists;
If not, building a second DOM tree from the WWW page and using the second DOM tree to obtain the title of the WWW page and the text links with textual features in the full page. A novel catalog entry feature is determined to exist when all of the following hold: the title contains a preset novel title feature keyword; among the text links with textual features in the full page, the number of text links containing a preset novel body-text feature keyword is greater than or equal to a preset first threshold; and the ratio of similar body-text links to all text links in the full page is greater than or equal to a preset second threshold. Otherwise, no novel catalog entry feature is determined to exist.
A novel catalog entry identification device, comprising:
A first determining unit, configured to determine whether a WWW page has a novel catalog entry feature and, if so, to determine that the WWW page is a novel catalog page;
A partitioning unit, configured to visually partition the WWW page to obtain a partitioned webpage when the first determining unit determines that the WWW page has no novel catalog entry feature;
A model building unit, configured to build a first DOM tree from the partitioned webpage;
An obtaining unit, configured to obtain the features of suspected catalog blocks from the first DOM tree;
A second determining unit, configured to determine that the partitioned webpage is a novel catalog page when the features of the suspected catalog blocks indicate that a novel catalog page exists.
Optionally, the first determining unit may specifically be configured to determine whether a novel catalog entry feature exists according to the link of the WWW page, its title, and the text links with textual features in the full page. For example, specifically as follows:
The first determining unit is specifically configured to determine, according to the link of the WWW page, whether the WWW page is a homepage or a secondary homepage. If so, it determines that no novel catalog entry feature exists. If not, it builds a second DOM tree from the WWW page and uses the second DOM tree to obtain the title of the WWW page and the text links with textual features in the full page. It determines that a novel catalog entry feature exists when all of the following hold: the title contains a preset novel title feature keyword; among the text links with textual features in the full page, the number of text links containing a preset novel body-text feature keyword is greater than or equal to a preset first threshold; and the ratio of similar body-text links to all text links in the full page is greater than or equal to a preset second threshold. Otherwise, it determines that no novel catalog entry feature exists.
A communication system, comprising any of the novel catalog entry identification devices provided by the embodiments of the present invention.
The embodiments of the present invention determine whether a WWW page has a novel catalog entry feature, which gives a preliminary identification of novel catalog pages among WWW pages. For a WWW page that has no novel catalog entry feature, the page is further visually partitioned and a first DOM tree is built; the first DOM tree is used to obtain the features of suspected catalog blocks, and those features determine whether the page is a novel catalog page. Novel catalog entries are thereby identified, which facilitates subsequent display on a mobile terminal, improves the display effect, and improves the user's browsing quality.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the novel catalog entry identification method provided by an embodiment of the present invention;
Fig. 2 is another flowchart of the novel catalog entry identification method provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of the novel catalog entry identification device provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The embodiments of the present invention provide a novel catalog entry identification method, device and system, each of which is described in detail below.
Embodiment 1
This embodiment is described from the perspective of the novel catalog entry identification device, which may specifically be integrated in a mobile terminal such as a mobile phone or tablet computer.
A novel catalog entry identification method, comprising: determining whether a WWW page has a novel catalog entry feature; if the feature exists, determining that the WWW page is a novel catalog page; if the feature does not exist, visually partitioning the WWW page to obtain a partitioned webpage, building a first DOM tree from the partitioned webpage, obtaining the features of suspected catalog blocks from the first DOM tree, and, when the features of the suspected catalog blocks indicate that a novel catalog page exists, determining that the partitioned webpage is a novel catalog page.
As shown in Fig. 1, the specific flow may be as follows:
101. Determine whether the WWW page has a novel catalog entry feature; if so, perform step 102; if not, perform step 103.
For example, whether a novel catalog entry feature exists may be determined according to the link of the WWW page, its title, and the text links with textual features in the full page. This may specifically comprise:
Determine, according to the link of the WWW page, whether the WWW page is a homepage or a secondary homepage. If it is a homepage or a secondary homepage, determine that no novel catalog entry feature exists. If it is neither, build a DOM tree from the WWW page (for ease of description, this DOM tree is called the second DOM tree) and use the second DOM tree to obtain the title of the WWW page and the text links with textual features in the full page. A novel catalog entry feature is determined to exist when all of the following hold: the title contains a preset novel title feature keyword; among the text links with textual features in the full page, the number of text links containing a preset novel body-text feature keyword is greater than or equal to a preset first threshold; and the ratio of similar body-text links to all text links in the full page is greater than or equal to a preset second threshold. Otherwise, no novel catalog entry feature is determined to exist.
Here, similar body-text links in the embodiments of the present invention are different text links that point to body pages and share the same body-page link, such as the links of different "chapters" under the same ordinal prefix ("No."), the text links of different "sections" under the same "chapter", or the text links of different "episodes" under the same "section", and so on.
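The source does not say how similarity between body-text links is tested. The following is a minimal sketch under one possible assumption: two links count as similar when their URLs differ only in runs of digits. The functions link_pattern and similar_link_ratio, and the digit-collapsing rule itself, are hypothetical and not taken from the patent.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def link_pattern(url: str) -> str:
    """Collapse every run of digits in the URL path into '#'.

    Assumption (not from the source): links whose paths differ only in
    numbers, e.g. /book/12/1.html and /book/12/2.html, are "similar".
    """
    parsed = urlparse(url)
    return parsed.netloc + re.sub(r"\d+", "#", parsed.path)


def similar_link_ratio(urls: list[str]) -> float:
    """Share of links that fall into the largest group of same-pattern links."""
    if not urls:
        return 0.0
    counts = Counter(link_pattern(u) for u in urls)
    largest_group = counts.most_common(1)[0][1]
    return largest_group / len(urls)
```

The returned ratio is the kind of value that would be compared against the preset second threshold; any other similarity rule could be substituted in link_pattern without changing the ratio computation.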
The link of the WWW page may specifically be a Uniform Resource Locator (URL). The path of the page's URL can be examined for keywords of the form "index" + ".html/jsp/asp/php/shtml" or "default" + ".html/jsp/asp/php/shtml" in order to judge whether the WWW page is a homepage or a secondary homepage.
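The URL check described above can be sketched as follows. This is an illustration only: the function name looks_like_homepage is hypothetical, and matching the keyword against the last path segment exactly (rather than anywhere in the path) is an assumption, since the source only says the path is checked for such keywords.

```python
import re
from urllib.parse import urlparse

# File names the source lists as homepage markers: "index" or "default"
# followed by one of .html / .jsp / .asp / .php / .shtml.
_HOME_FILE = re.compile(r"^(index|default)\.(html|jsp|asp|php|shtml)$", re.IGNORECASE)


def looks_like_homepage(url: str) -> bool:
    """Return True if the last path segment is an index/default page.

    Assumption: only the final path segment is inspected; the query
    string is ignored because urlparse keeps it out of .path.
    """
    last_segment = urlparse(url).path.rsplit("/", 1)[-1]
    return bool(_HOME_FILE.match(last_segment))
```

A page for which this returns True would be treated as having no novel catalog entry feature and would go straight to the block-based analysis.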
In addition, the novel title feature keywords may include words such as "catalog" and/or "title"; the novel body-text feature keywords may include words such as the ordinal prefix ("No."), "chapter", "section", "episode" and/or "volume". The first threshold and the second threshold can be set according to the needs of the practical application and are not detailed here.
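The three checks that make up the novel catalog entry feature can be combined as in the sketch below. The function name, its parameters, and the keyword and threshold values used when calling it are all illustrative assumptions; the source leaves the keyword lists and both thresholds to the practical application.

```python
def has_catalog_entry_feature(
    title: str,
    link_texts: list[str],
    similar_ratio: float,
    title_keywords: tuple[str, ...],
    body_keywords: tuple[str, ...],
    first_threshold: int,
    second_threshold: float,
) -> bool:
    """All three checks must pass for the feature to exist.

    similar_ratio is assumed to be computed elsewhere as
    (similar body-text links) / (all text links in the full page).
    """
    # Check 1: the title contains a novel title feature keyword.
    title_ok = any(k in title for k in title_keywords)
    # Check 2: enough text links contain a body-text feature keyword.
    keyword_links = sum(
        1 for text in link_texts if any(k in text for k in body_keywords)
    )
    # Check 3: the similar-link ratio reaches the second threshold.
    return (
        title_ok
        and keyword_links >= first_threshold
        and similar_ratio >= second_threshold
    )
```

Passing the keywords and thresholds as arguments keeps the sketch neutral about their actual values, which the source does not fix for this stage.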
102. When the WWW page is determined to have a novel catalog entry feature, determine that the WWW page is a novel catalog page.
103. When the WWW page is determined to have no novel catalog entry feature, visually partition the WWW page to obtain a partitioned webpage.
104. Build a DOM tree from the partitioned webpage; for ease of description, this DOM tree is called the first DOM tree in the embodiments of the present invention.
105. Obtain the features of suspected catalog blocks from the first DOM tree and, when those features indicate that a novel catalog page exists, determine that the partitioned webpage is a novel catalog page.
Obtaining the features of suspected catalog blocks from the first DOM tree may specifically comprise:
Obtaining, from the first DOM tree, the suspected catalog block position, the suspected catalog block links, and the text links with textual features in the suspected catalog block.
In that case, the step "when the features of the suspected catalog blocks indicate that a novel catalog page exists, determine that the partitioned webpage is a novel catalog page" may specifically be: when the suspected catalog block position, the suspected catalog block links, and the text links with textual features in the suspected catalog block indicate that a novel catalog page exists, determine that the partitioned webpage is a novel catalog page. Specifically:
A novel catalog page is determined to exist when the suspected catalog block position, the suspected catalog block links, and the text links with textual features in the suspected catalog block satisfy both a first condition and a second condition.
(1) The first condition comprises:
Among the text links with textual features in the suspected catalog block, the number of text links containing a preset novel body-text feature keyword is greater than or equal to a preset third threshold, and the ratio of similar body-text links in the suspected catalog block to all text links in that block is greater than or equal to a preset fourth threshold.
(2) The second condition comprises:
On the premise that a catalog block is determined to exist in the partitioned webpage according to the suspected catalog block position, the suspected catalog block links, and the text links with textual features in the suspected catalog block, any one of the following cases holds:
(1) Among the text links with textual features in the representative catalog block, the number of text links containing a preset novel body-text feature keyword is greater than or equal to a preset fifth threshold;
(2) The number of similar body-text links in the representative catalog block is greater than or equal to a preset sixth threshold, and the ratio of similar body-text links in the representative catalog block to all text links in that suspected catalog block is greater than or equal to a preset seventh threshold;
(3) The number of similar body-text links in all suspected catalog blocks of the partitioned webpage is greater than or equal to a preset eighth threshold, and the ratio of those similar body-text links to all text links in the partitioned webpage is greater than or equal to a preset ninth threshold.
The representative catalog block can be obtained as follows:
For each suspected catalog block, count the novel body-page link features that appear in the block, and count the text links with textual features in the block that contain a preset novel body-text feature keyword. The suspected catalog block for which the link-feature count and the keyword-link count are largest is taken as the representative catalog block.
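Selecting the representative catalog block can be sketched as below. The source does not say how the two counts are combined when they disagree, so ranking blocks by the sum of the two counts is an assumption made for this illustration; BlockStats and pick_representative are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockStats:
    name: str                # hypothetical identifier for a suspected catalog block
    body_link_features: int  # links matching the preset novel body-page link feature
    keyword_links: int       # text links containing a body-text feature keyword


def pick_representative(blocks: list[BlockStats]) -> Optional[BlockStats]:
    """Return the block with the largest combined count, or None if there are no blocks.

    Assumption: the two counts are simply added to rank the blocks.
    """
    if not blocks:
        return None
    return max(blocks, key=lambda b: b.body_link_features + b.keyword_links)
```

A lexicographic comparison of the two counts would be an equally plausible reading; only the key function would change.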
The step "determining, according to the suspected catalog block links and the text links with textual features in the suspected catalog block, that a catalog block exists in the partitioned webpage" may specifically comprise:
Determining that the suspected catalog block position meets a precondition (which can be set according to the needs of the practical application), that the suspected catalog block links contain a preset novel body-page link feature, and that the text links with textual features in the suspected catalog block contain a preset novel body-text feature keyword; when all of these hold, a catalog block is determined to exist in the partitioned webpage.
The third, fourth, fifth, sixth, seventh, eighth and ninth thresholds can be set according to the needs of the practical application.
Note that, in the embodiments of the present invention, a block of the partitioned webpage that meets a precondition is called a suspected catalog block; the precondition can be set according to the needs of the practical application.
In addition, if the features of the suspected catalog blocks indicate that no novel catalog page exists, the partitioned webpage can be determined not to be a novel catalog page.
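The overall control flow of steps 101 to 105 can be summarized in a short sketch. The three callables are placeholders for the page-level feature check, the visual partitioning, and the block-level decision; their names and signatures are hypothetical, and representing a page as a string is an assumption made only to keep the example self-contained.

```python
from typing import Callable


def identify_catalog_page(
    page: str,
    has_entry_feature: Callable[[str], bool],
    partition: Callable[[str], str],
    blocks_indicate_catalog: Callable[[str], bool],
) -> bool:
    """Two-stage decision: page-level check first, block-level check as fallback."""
    # Steps 101-102: a page with the catalog entry feature is a catalog page.
    if has_entry_feature(page):
        return True
    # Step 103: otherwise partition the page visually.
    partitioned = partition(page)
    # Steps 104-105: decide from the suspected catalog blocks of the partitioned page.
    return blocks_indicate_catalog(partitioned)
```

The point of the structure is that partitioning, the more expensive step, runs only when the cheaper page-level check fails.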
As can be seen from the above, this embodiment determines whether a WWW page has a novel catalog entry feature, which gives a preliminary identification of novel catalog pages among WWW pages. For a WWW page that has no novel catalog entry feature, the page is further visually partitioned and a first DOM tree is built; the first DOM tree is used to obtain the features of suspected catalog blocks, and those features determine whether the page is a novel catalog page. Novel catalog entries are thereby identified, which facilitates subsequent display on a mobile terminal, improves the display effect, and improves the user's browsing quality.
Embodiment 2
The method described in Embodiment 1 is illustrated in further detail below by way of example.
In this embodiment, the novel catalog entry identification device is integrated in a mobile terminal, and the link of the WWW page is a URL.
A novel catalog entry identification method is shown in Fig. 2; the specific flow may be as follows:
201. The mobile terminal obtains a WWW page.
202. The mobile terminal determines from the URL of the WWW page whether the page is a homepage or a secondary homepage. If it is a homepage or a secondary homepage, it can be directly determined that no novel catalog entry feature exists, and step 204 is performed; if it is neither, step 203 is performed.
For example, the path of the page's URL can be examined for keywords of the form "index" + ".html/jsp/asp/php/shtml" or "default" + ".html/jsp/asp/php/shtml" in order to judge whether the WWW page is a homepage or a secondary homepage.
203. Build a second DOM tree from the WWW page, use the second DOM tree to obtain the title of the WWW page and the text links with textual features in the full page, and determine whether a novel catalog entry feature exists according to the link of the WWW page, its title, and those text links. Specifically:
Determine whether the title contains a preset novel title feature keyword; determine whether the text links with textual features in the full page contain a preset novel body-text feature keyword, and count the text links that do. In addition, determine the share of similar body-text links in the full page, that is, the ratio of similar body-text links to all text links in the full page.
If the title contains a preset novel title feature keyword, and among the text links with textual features in the full page the number of text links containing a preset novel body-text feature keyword is greater than or equal to a preset first threshold, and the ratio of similar body-text links to all text links in the full page is greater than or equal to a preset second threshold, a novel catalog entry feature can be determined to exist. The WWW page is therefore determined to be a novel catalog page, and the flow ends.
Otherwise, if the title contains no preset novel title feature keyword, or the number of text links containing a preset novel body-text feature keyword is less than the preset first threshold, or the ratio of similar body-text links to all text links in the full page is less than the preset second threshold, no novel catalog entry feature is determined to exist, and step 204 is performed.
The novel title feature keywords may include words such as "catalog" and/or "title"; the novel body-text feature keywords may include words such as the ordinal prefix ("No."), "chapter", "section", "episode" and/or "volume"; the first threshold and the second threshold can be set according to the needs of the practical application. For example, specifically as follows:
Build the second DOM tree from the WWW page, use it to obtain the title under the page's <title> tag, and determine whether the title contains keywords such as "catalog" and/or "title".
Traverse the second DOM tree and determine whether the text links with textual features in the full page contain catalog-related keywords such as "No.", "chapter", "section", "episode" and/or "volume". If so, count the text links that contain these keywords. In addition, calculate the ratio of similar body-text links to all text links in the full page.
If the title contains keywords such as "catalog" and/or "title", and the number of text links containing catalog-related keywords such as "No.", "chapter", "section", "episode" and/or "volume" is greater than or equal to the first threshold, and the ratio of similar body-text links to all text links in the full page is greater than or equal to the preset second threshold, the WWW page can be determined to have a novel catalog entry feature. The WWW page is therefore determined to be a novel catalog page, and the flow ends.
If the title contains no keywords such as "catalog" and/or "title", or the number of text links containing catalog-related keywords such as "No.", "chapter", "section", "episode" and/or "volume" is less than the first threshold, or the ratio of similar body-text links to all text links in the full page is less than the preset second threshold, the WWW page can be determined to have no novel catalog entry feature, and step 204 can be performed.
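Extracting the title and the text links from a page can be sketched with Python's standard html.parser module. Note the substitution: this is an event-driven parser standing in for the DOM tree traversal described above, not a DOM implementation, and the class and function names are hypothetical. Treating every <a> element with non-empty text as a "text link with textual features" is also an assumption.

```python
from html.parser import HTMLParser


class TitleAndLinkCollector(HTMLParser):
    """Collects the <title> text and the (href, text) pair of each <a> element."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.links = []          # list of (href, visible text) tuples
        self._in_title = False
        self._href = None        # href of the <a> currently open, if any
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            text = "".join(self._text_parts).strip()
            if text:  # keep only links that carry visible text
                self.links.append((self._href, text))
            self._href = None


def extract_title_and_links(html: str):
    """Return (title, [(href, text), ...]) for the given HTML string."""
    collector = TitleAndLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.title.strip(), collector.links
```

The returned title can then be tested for title keywords and the link texts for body-text keywords; image-only links are dropped because they carry no text to match.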
204. When the mobile terminal determines that the WWW page has no novel catalog entry feature, it visually partitions the WWW page to obtain a partitioned webpage.
205. The mobile terminal builds the first DOM tree from the partitioned webpage and obtains, from the first DOM tree, the suspected catalog block position, the suspected catalog block links, and the text links with textual features in the suspected catalog block.
The suspected catalog block position can be expressed by the block's position coordinates, width, height and so on. For example, in a coordinate system in which the x axis runs from left to right across the WWW page and the y axis runs from top to bottom, x denotes the block's x coordinate, y denotes the block's y coordinate, width denotes the block's width, and height denotes the block's height.
206. The mobile terminal determines whether a novel catalog page exists according to the suspected catalog block position, the suspected catalog block links, and the text links with textual features in the suspected catalog block. If so, it determines that the partitioned webpage is a novel catalog page; if not, it can determine that the partitioned webpage is not a novel catalog page (that is, a non-catalog page).
For example, the mobile terminal may determine whether a first condition and a second condition are met according to the suspected catalog block position, the suspected catalog block links, and the text links with textual features in the suspected catalog block. If both conditions are met, a novel catalog page is determined to exist in the partitioned webpage; if they are not both met, no novel catalog page is determined to exist in the partitioned webpage.
The first condition and the second condition may specifically be as follows:
(1) The first condition comprises:
Among the text links with textual features in the suspected catalog block, the number of text links containing a preset novel body-text feature keyword is greater than or equal to a preset third threshold, and the ratio of similar body-text links in the suspected catalog block to all text links in that block is greater than or equal to a preset fourth threshold.
For example, the novel body-text feature keywords may be catalog-related keywords such as "No.", "chapter", "section", "episode" and/or "volume", the third threshold may be set to 15, and the fourth threshold may be set to 0.8. Specifically:
The link texts with textual features in the suspected catalog block contain catalog-related keywords such as "No.", "chapter", "section", "episode" and/or "volume", the number of such link texts is greater than or equal to 15, and the ratio of similar body-text links in the suspected catalog block to all text links in that block is greater than or equal to 0.8.
(2) The second condition comprises:
On the premise that a catalog block is determined to exist in the partitioned webpage according to the suspected catalog block links and the text links with textual features in the suspected catalog block, any one of the following cases holds:
(1) Among the text links with textual features in the representative catalog block, the number of text links containing a preset novel body-text feature keyword is greater than or equal to a preset fifth threshold.
For example, the novel body-text feature keywords may be catalog-related keywords such as "No.", "chapter", "section", "episode" and/or "volume", and the fifth threshold may be set to 10. Specifically:
The link texts with textual features in the representative catalog block contain catalog-related keywords such as "No.", "chapter", "section", "episode" and/or "volume", and the number of such link texts is greater than or equal to 10.
(2) The number of similar body-text links in the representative catalog block is greater than or equal to a preset sixth threshold, and the ratio of similar body-text links in the representative catalog block to all text links in that suspected catalog block is greater than or equal to a preset seventh threshold.
For example, the sixth threshold may be set to 20 and the seventh threshold may be set to 0.9. Specifically:
The number of similar body-text links in the representative catalog block is greater than or equal to 20, and the ratio of similar body-text links in the representative catalog block to all text links in that suspected catalog block is greater than or equal to 0.9.
(3) The number of similar body-text links in all suspected catalog blocks of the partitioned webpage is greater than or equal to a preset eighth threshold, and the ratio of those similar body-text links to all text links in the partitioned webpage is greater than or equal to a preset ninth threshold.
For example, the eighth threshold may be set to 100 and the ninth threshold may be set to 0.8. Specifically:
The number of similar body-text links in all suspected catalog blocks of the partitioned webpage is greater than or equal to 100, and the ratio of those similar body-text links to all text links in the partitioned webpage is greater than or equal to 0.8.
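The first and second conditions, with the example threshold values given above (15, 0.8, 10, 20, 0.9, 100, 0.8), can be sketched as follows. The Block class and all function names are hypothetical. Two points are assumptions rather than statements of the source: the first condition is taken to hold if at least one suspected catalog block satisfies it, and the flag catalog_block_exists is assumed to be computed separately.

```python
from dataclasses import dataclass


@dataclass
class Block:
    keyword_links: int  # text links containing a body-text feature keyword
    similar_links: int  # similar body-text links in the block
    total_links: int    # all text links in the block

    @property
    def similar_ratio(self) -> float:
        return self.similar_links / self.total_links if self.total_links else 0.0


def first_condition(block: Block) -> bool:
    # Third threshold 15 and fourth threshold 0.8 (example values from the text).
    return block.keyword_links >= 15 and block.similar_ratio >= 0.8


def second_condition(rep: Block, suspected: list[Block],
                     page_total_links: int, catalog_block_exists: bool) -> bool:
    if not catalog_block_exists:  # premise of the second condition
        return False
    case1 = rep.keyword_links >= 10                              # fifth threshold
    case2 = rep.similar_links >= 20 and rep.similar_ratio >= 0.9  # sixth, seventh
    all_similar = sum(b.similar_links for b in suspected)
    case3 = (all_similar >= 100 and page_total_links > 0          # eighth, ninth
             and all_similar / page_total_links >= 0.8)
    return case1 or case2 or case3


def is_catalog_page(rep: Block, suspected: list[Block],
                    page_total_links: int, catalog_block_exists: bool) -> bool:
    # Assumption: one suspected block meeting the first condition is enough.
    return (any(first_condition(b) for b in suspected)
            and second_condition(rep, suspected, page_total_links, catalog_block_exists))
```

Because the threshold values are only examples, a real implementation would more likely pass them in as configuration than hard-code them as done here.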
Wherein, can adopt and obtain with the following method representing directory block, as follows:
Add up the quantity of the chain feature of the novel text page occurring in doubtful directory block, and add up the quantity that has the text link of preset novel text characteristic keyword in the text link in doubtful directory block with text feature, determine that doubtful directory block that the quantity of the quantity of this chain feature and the text link of novel text characteristic keyword is maximum is for representing directory block.
Wherein, the step of "determining, according to the links of the doubtful directory block and the text links with a text feature in the doubtful directory block, that a directory block exists in the webpage after piecemeal" may specifically comprise:
Determine that the position of the doubtful directory block meets a precondition, for example "y>=100; Width>300; Height>100"; determine that the links of the doubtful directory block exhibit the link feature of a preset novel text page; and determine that a preset novel text characteristic keyword exists in the text links with a text feature in the doubtful directory block. When all of these hold, it can be determined that a directory block exists in the webpage after piecemeal.
It should be noted that the threshold values above are only examples. The specific values of the first, second, third, fourth, fifth, sixth, seventh, eighth and ninth thresholds can be set according to the demands of the practical application.
As can be seen from the above, this embodiment first determines whether a listing of novel item feature exists in the web presence, which preliminarily identifies listing of novel pages. For a web presence without a listing of novel item feature, it further performs vision piecemeal, sets up the first DOM tree, and uses the first DOM tree to obtain the features of the doubtful directory blocks, such as the doubtful directory block position, the links of the doubtful directory block, and the text links with a text feature in the doubtful directory block; it then determines from these features whether the page is a listing of novel page. The listing of novel item is thereby identified, so that subsequent extraction from listing of novel pages can be more targeted and achieve a better extraction effect, which facilitates subsequent display on a mobile terminal, improves the display effect, and improves the user's browsing quality.
Embodiment three,
In order to better implement the above method, the embodiment of the present invention also provides a recognition device for a listing of novel item. As shown in Figure 3, the recognition device comprises a first determining unit 301, a blocking unit 302, a model establishing unit 303, an acquiring unit 304 and a second determining unit 305;
The first determining unit 301 is configured to determine whether a listing of novel item feature exists in the web presence and, if so, to determine that the web presence is a listing of novel page;
The blocking unit 302 is configured to perform vision piecemeal on the web presence when the first determining unit 301 determines that no listing of novel item feature exists in the web presence, obtaining the webpage after piecemeal;
The model establishing unit 303 is configured to set up the first DOM tree according to the webpage after piecemeal obtained by the blocking unit 302;
The acquiring unit 304 is configured to obtain the features of the doubtful directory block according to the first DOM tree set up by the model establishing unit 303;
The second determining unit 305 is configured to determine that the webpage after piecemeal is a listing of novel page when it is determined, according to the features of the doubtful directory block, that a listing of novel page exists.
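The cooperation of the five units can be sketched as a short pipeline. This is only an outline under stated assumptions: each unit is represented by a caller-supplied function, and all function and parameter names are hypothetical; the sketch shows only the order of the calls and the early exit when the first determining unit already finds a listing of novel item feature.

```python
from typing import Any, Callable


def identify_listing_page(
    page: Any,
    has_listing_feature: Callable[[Any], bool],        # first determining unit 301
    visual_block: Callable[[Any], Any],                # blocking unit 302
    build_dom_tree: Callable[[Any], Any],              # model establishing unit 303
    get_block_features: Callable[[Any], Any],          # acquiring unit 304
    features_indicate_listing: Callable[[Any], bool],  # second determining unit 305
) -> bool:
    # Preliminary identification on the unblocked page.
    if has_listing_feature(page):
        return True
    # Fallback: vision piecemeal, first DOM tree, doubtful-block features.
    blocked_page = visual_block(page)
    dom_tree = build_dom_tree(blocked_page)
    features = get_block_features(dom_tree)
    return features_indicate_listing(features)
```

Passing the units in as functions keeps the sketch independent of any particular blocking algorithm or DOM library.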
Wherein, the first determining unit 301 may specifically be configured to determine whether a listing of novel item feature exists according to the link of the web presence, its title, and the text links with a text feature in the full text. For example, specifically as follows:
The first determining unit 301 may specifically be configured to determine, according to the link of the web presence, whether the web presence is a homepage or a secondary homepage. If so, it determines that no listing of novel item feature exists. If not, it sets up a second DOM tree according to the web presence and uses the second DOM tree to obtain the title of the web presence and the text links with a text feature in the full text. When the title contains a preset novel title characteristic keyword, the number of text links containing a preset novel text characteristic keyword among the text links with a text feature in the full text is greater than or equal to a preset first threshold, and the ratio of similar text links in the full text to all text links in the full text is greater than or equal to a preset second threshold, it determines that a listing of novel item feature exists; otherwise, it determines that no listing of novel item feature exists.
Wherein, the similar text links of the embodiment of the present invention refer to different text links that point to text pages of the same kind, that is, different text links whose text page links are alike, such as the links of different "章" (chapter) entries under the same "第" (ordinal prefix) entry, or the text links of different "节" (section) entries under the same "章" (chapter), or the text links of different "回" (episode) entries under the same "节" (section), and so on.
Wherein, the link of the web presence may specifically be a URL. The path of the URL of the web presence can be examined to detect whether it contains keywords such as "index" + ".html/jsp/asp/php/shtml" or "default" + ".html/jsp/asp/php/shtml", thereby judging whether the web presence is a homepage or a secondary homepage.
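The URL test described above can be sketched with the Python standard library. The function name and the decision to match only the last path segment are assumptions; the text only says that the path is checked for "index" or "default" followed by one of the listed extensions.

```python
import re
from urllib.parse import urlparse

# "index" or "default" followed by one of the extensions named in the text.
_HOME_FILE = re.compile(r"(?:index|default)\.(?:html|jsp|asp|php|shtml)",
                        re.IGNORECASE)


def looks_like_homepage(url: str) -> bool:
    """Return True if the last path segment is e.g. index.html or default.asp.

    Assumption: only the file name at the end of the path is examined.
    """
    path = urlparse(url).path
    last_segment = path.rsplit("/", 1)[-1]
    return _HOME_FILE.fullmatch(last_segment) is not None
```

A bare path such as "/" is not treated as a homepage here, because the text does not mention that case; an application may want to handle it separately.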
In addition, the novel title characteristic keywords may include words such as "catalogue" and/or "title"; the novel text characteristic keywords include words such as "第" (ordinal prefix), "章" (chapter), "节" (section), "回" (episode) and/or "卷" (volume). The first threshold and the second threshold can be set according to the demands of the practical application and are not described again here.
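The preliminary check performed by the first determining unit can be sketched as follows. All names are hypothetical, the keyword lists and both thresholds are passed in as parameters because the text leaves their values to the application, and the count of similar text links is assumed to have been computed elsewhere.

```python
from typing import Sequence


def has_listing_feature(
    is_homepage: bool,
    title: str,
    feature_link_texts: Sequence[str],  # texts of links with a text feature
    similar_link_count: int,            # similar text links in the full text
    total_link_count: int,              # all text links in the full text
    title_keywords: Sequence[str],
    text_keywords: Sequence[str],
    first_threshold: int,
    second_threshold: float,
) -> bool:
    # Homepages and secondary homepages never carry the listing feature.
    if is_homepage:
        return False
    # The title must contain a novel title characteristic keyword.
    if not any(k in title for k in title_keywords):
        return False
    # Count the links whose text contains a novel text characteristic keyword.
    keyword_links = sum(
        1 for text in feature_link_texts
        if any(k in text for k in text_keywords)
    )
    if keyword_links < first_threshold:
        return False
    # Ratio of similar text links to all text links in the full text.
    if total_link_count == 0:
        return False
    return similar_link_count / total_link_count >= second_threshold
```

The checks are ordered from cheapest to most expensive so that a homepage or a page with an unrelated title is rejected before any links are counted.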
Wherein, the features of the doubtful directory block may include the doubtful directory block position, the links of the doubtful directory block, the text links with a text feature in the doubtful directory block, and so on, that is:
The acquiring unit 304 may specifically be configured to obtain, according to the first DOM tree, the doubtful directory block position, the links of the doubtful directory block, and the text links with a text feature in the doubtful directory block;
In this case, the second determining unit 305 may specifically be configured to determine that the webpage after piecemeal is a listing of novel page when it is determined, according to the doubtful directory block position, the links of the doubtful directory block and the text links with a text feature in the doubtful directory block, that a listing of novel page exists. For example, specifically as follows:
The second determining unit 305 may specifically be configured to determine that a listing of novel page exists when it is determined, according to the doubtful directory block position, the links of the doubtful directory block and the text links with a text feature in the doubtful directory block, that the first condition and the second condition are met;
Wherein, the first condition and the second condition may specifically be as follows:
(1) The first condition comprises:
Among the text links with a text feature in the doubtful directory block, the number of text links containing a preset novel text characteristic keyword is greater than or equal to a preset third threshold, and the ratio of similar text links in the doubtful directory block to all text links in that doubtful directory block is greater than or equal to a preset fourth threshold.
For example, the novel text characteristic keywords may specifically be catalogue-related keywords such as "第" (ordinal prefix), "章" (chapter), "节" (section), "回" (episode) and/or "卷" (volume), the third threshold may be set to "15", and the fourth threshold may be set to "0.8", specifically as follows:
The link texts with a text feature in the doubtful directory block contain catalogue-related keywords such as "第" (ordinal prefix), "章" (chapter), "节" (section), "回" (episode) and/or "卷" (volume), the number of such link texts is greater than or equal to 15, and the ratio of similar text links in the doubtful directory block to all text links in that doubtful directory block is greater than or equal to 0.8.
(2) The second condition comprises:
On the premise that it is determined, according to the links of the doubtful directory block and the text links with a text feature in the doubtful directory block, that a directory block exists in the webpage after piecemeal, any one of the following cases is met:
(1) Among the text links with a text feature in the representative directory block, the number of text links containing a preset novel text characteristic keyword is greater than or equal to a preset fifth threshold;
For example, the novel text characteristic keywords may specifically be catalogue-related keywords such as "第" (ordinal prefix), "章" (chapter), "节" (section), "回" (episode) and/or "卷" (volume), and the fifth threshold may be set to "10", specifically as follows:
The link texts with a text feature in the representative directory block contain catalogue-related keywords such as "第" (ordinal prefix), "章" (chapter), "节" (section), "回" (episode) and/or "卷" (volume), and the number of such link texts is greater than or equal to 10.
(2) The number of similar text links in the representative directory block is greater than or equal to a preset sixth threshold, and the ratio of similar text links in the representative directory block to all text links in that doubtful directory block is greater than or equal to a preset seventh threshold;
For example, the sixth threshold may be set to "20" and the seventh threshold may be set to "0.9", specifically as follows:
The number of similar text links in the representative directory block is greater than or equal to 20, and the ratio of similar text links in the representative directory block to all text links in that doubtful directory block is greater than or equal to 0.9.
(3) The number of similar text links in all doubtful directory blocks of the webpage after piecemeal is greater than or equal to a preset eighth threshold, and the ratio of the similar text links in all doubtful directory blocks of the webpage after piecemeal to all text links in the webpage after piecemeal is greater than or equal to a preset ninth threshold.
For example, the eighth threshold may be set to "100" and the ninth threshold may be set to "0.8", specifically as follows:
The number of similar text links in all doubtful directory blocks of the webpage after piecemeal is greater than or equal to 100, and the ratio of the similar text links in all doubtful directory blocks of the webpage after piecemeal to all text links in the webpage after piecemeal is greater than or equal to 0.8.
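The first condition, and the way it combines with the second condition, can be sketched as follows. The function names are hypothetical, the defaults are the example values 15 and 0.8 from the text, and the result of the second condition is passed in as a boolean so that the sketch stays short.

```python
def first_condition_met(keyword_links: int,
                        similar_links: int,
                        total_links: int,
                        third_threshold: int = 15,
                        fourth_threshold: float = 0.8) -> bool:
    """Check the first condition for one doubtful directory block."""
    if total_links == 0:
        return False  # no text links, so no meaningful ratio
    return (keyword_links >= third_threshold
            and similar_links / total_links >= fourth_threshold)


def listing_page_exists(first_ok: bool, second_ok: bool) -> bool:
    """A listing of novel page is found only when both conditions hold."""
    return first_ok and second_ok
```

For instance, a block with 15 keyword-bearing links and 18 similar text links out of 20 text links satisfies the first condition with the example thresholds.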
Wherein, the representative directory block can be obtained by the following method:
Count the number of link features of novel text pages that appear in each doubtful directory block, and count the number of text links that contain a preset novel text characteristic keyword among the text links with a text feature in that doubtful directory block; the doubtful directory block for which the number of link features and the number of text links containing a novel text characteristic keyword are the largest is determined to be the representative directory block. That is:
The second determining unit 305 may specifically be configured to count the number of link features of novel text pages that appear in each doubtful directory block, to count the number of text links that contain a preset novel text characteristic keyword among the text links with a text feature in that doubtful directory block, and to determine that the doubtful directory block for which the number of link features and the number of text links containing a novel text characteristic keyword are the largest is the representative directory block.
In addition, whether a directory block exists in the webpage after piecemeal can specifically be determined by the following method:
Determine that the position of the doubtful directory block meets a precondition, for example "y>=100; Width>300; Height>100"; determine that the links of the doubtful directory block exhibit the link feature of a preset novel text page; and determine that a preset novel text characteristic keyword exists in the text links with a text feature in the doubtful directory block. When all of these hold, it can be determined that a directory block exists in the webpage after piecemeal; otherwise, it is determined that no directory block exists in the webpage after piecemeal. That is:
The second determining unit 305 may specifically be configured to determine that a directory block exists in the webpage after piecemeal when it is determined that the position of the doubtful directory block meets the precondition, that the links of the doubtful directory block exhibit the link feature of a preset novel text page, and that a preset novel text characteristic keyword exists in the text links with a text feature in the doubtful directory block.
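The directory block test can be sketched as three conjunctive checks per doubtful directory block. The `DoubtfulBlock` structure and its field names are hypothetical, the position limits are the example precondition "y>=100; Width>300; Height>100" (the text does not state the unit, which is presumably pixels), and the two boolean fields are assumed to have been computed from the block's links beforehand.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DoubtfulBlock:
    y: int                         # vertical offset of the block
    width: int
    height: int
    has_text_page_link_feature: bool  # links match a preset novel text page link feature
    has_keyword_in_links: bool        # a novel text characteristic keyword appears in link text


def is_directory_block(block: DoubtfulBlock) -> bool:
    # Example precondition from the text: y>=100; Width>300; Height>100.
    position_ok = block.y >= 100 and block.width > 300 and block.height > 100
    return (position_ok
            and block.has_text_page_link_feature
            and block.has_keyword_in_links)


def page_has_directory_block(blocks: Iterable[DoubtfulBlock]) -> bool:
    """True if at least one doubtful block passes all three checks."""
    return any(is_directory_block(b) for b in blocks)
```

A page with no doubtful directory blocks, or with blocks that each fail at least one check, is reported as having no directory block.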
It should be noted that the threshold values above are only examples. The specific values of the first, second, third, fourth, fifth, sixth, seventh, eighth and ninth thresholds can be set according to the demands of the practical application.
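Since all nine thresholds are configurable, one way to carry them around is a small immutable settings object. This is only an illustrative sketch: the class and field names are hypothetical, the defaults for the third to ninth thresholds are the example values given in the text, and the first and second thresholds have no defaults because no example values are given for them here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    first: int             # min keyword-bearing links in the full text (no example given)
    second: float          # min similar-link ratio in the full text (no example given)
    third: int = 15        # min keyword-bearing links in a doubtful block
    fourth: float = 0.8    # min similar-link ratio in a doubtful block
    fifth: int = 10        # min keyword-bearing links in the representative block
    sixth: int = 20        # min similar links in the representative block
    seventh: float = 0.9   # min similar-link ratio in the representative block
    eighth: int = 100      # min similar links over all doubtful blocks
    ninth: float = 0.8     # min similar-link ratio over the whole blocked page


# Example: keep the documented defaults but relax the eighth threshold.
custom = replace(Thresholds(first=5, second=0.5), eighth=50)
```

Freezing the dataclass prevents accidental modification of the thresholds while a page is being evaluated.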
In addition, if the mobile terminal determines, according to the doubtful directory block position, the links of the doubtful directory block and the text links with a text feature in the doubtful directory block, that no listing of novel page exists, the second determining unit 305 can determine that the webpage after piecemeal is not a listing of novel page.
The recognition device of the listing of novel item may specifically be integrated in a mobile terminal, such as a mobile phone or a tablet computer.
In a specific implementation, each of the above units may be implemented as an independent entity, or the units may be combined arbitrarily and implemented as one or several entities. For the specific implementation of the above units, refer to the preceding method embodiments; details are not repeated here.
As can be seen from the above, the recognition device of the listing of novel item of this embodiment uses the first determining unit 301 to determine whether a listing of novel item feature exists in the web presence, which preliminarily identifies listing of novel pages. For a web presence without a listing of novel item feature, the blocking unit 302 further performs vision piecemeal, the model establishing unit 303 sets up the first DOM tree, the acquiring unit 304 uses the first DOM tree to obtain the features of the doubtful directory blocks, and the second determining unit 305 then determines from these features whether the page is a listing of novel page. The listing of novel item is thereby identified, so that subsequent extraction from listing of novel pages can be more targeted and achieve a better extraction effect, which facilitates subsequent display on a mobile terminal, improves the display effect, and improves the user's browsing quality.
Embodiment four,
Accordingly, the embodiment of the present invention also provides a communication system, which comprises any recognition device for a listing of novel item provided by the embodiments of the present invention. For example, specifically as follows:
The recognition device of the listing of novel item is configured to determine whether a listing of novel item feature exists in the web presence. If a listing of novel item feature exists, it determines that the web presence is a listing of novel page. If no listing of novel item feature exists, it performs vision piecemeal on the web presence to obtain the webpage after piecemeal, sets up the first DOM tree according to the webpage after piecemeal, obtains the features of the doubtful directory block according to the first DOM tree, and, when it is determined according to the features of the doubtful directory block that a listing of novel page exists, determines that the webpage after piecemeal is a listing of novel page.
Optionally, the recognition device of the listing of novel item may specifically be configured to determine whether a listing of novel item feature exists according to the link of the web presence, its title, and the text links with a text feature in the full text.
For example, the recognition device of the listing of novel item may specifically be configured to determine, according to the link of the web presence, whether the web presence is a homepage or a secondary homepage. If it is a homepage or a secondary homepage, it determines that no listing of novel item feature exists. If it is neither a homepage nor a secondary homepage, it sets up a second DOM tree according to the web presence and uses the second DOM tree to obtain the title of the web presence and the text links with a text feature in the full text. When the title contains a preset novel title characteristic keyword, the number of text links containing a preset novel text characteristic keyword among the text links with a text feature in the full text is greater than or equal to a preset first threshold, and the ratio of similar text links in the full text to all text links in the full text is greater than or equal to a preset second threshold, it determines that a listing of novel item feature exists; otherwise, it determines that no listing of novel item feature exists.
Wherein, similar text links refer to different text links that point to text pages of the same kind, that is, different text links whose text page links are alike, such as the links of different "章" (chapter) entries under the same "第" (ordinal prefix) entry, or the text links of different "节" (section) entries under the same "章" (chapter), or the text links of different "回" (episode) entries under the same "节" (section), and so on.
Wherein, the link of the web presence may specifically be a URL. The path of the URL of the web presence can be examined to detect whether it contains keywords such as "index" + ".html/jsp/asp/php/shtml" or "default" + ".html/jsp/asp/php/shtml", thereby judging whether the web presence is a homepage or a secondary homepage.
In addition, the novel title characteristic keywords may include words such as "catalogue" and/or "title"; the novel text characteristic keywords include words such as "第" (ordinal prefix), "章" (chapter), "节" (section), "回" (episode) and/or "卷" (volume). The first threshold and the second threshold can be set according to the demands of the practical application and are not described again here.
Wherein, the recognition device of the listing of novel item may specifically be configured to obtain, according to the first DOM tree, the doubtful directory block position, the links of the doubtful directory block, and the text links with a text feature in the doubtful directory block, and then, when it is determined according to the doubtful directory block position, the links of the doubtful directory block and the text links with a text feature in the doubtful directory block that a listing of novel page exists, to determine that the webpage after piecemeal is a listing of novel page. For example, specifically as follows:
When it is determined, according to the doubtful directory block position, the links of the doubtful directory block and the text links with a text feature in the doubtful directory block, that the first condition and the second condition are met, it is determined that a listing of novel page exists. Wherein, the first condition and the second condition may specifically be as follows:
(1) The first condition comprises:
Among the text links with a text feature in the doubtful directory block, the number of text links containing a preset novel text characteristic keyword is greater than or equal to a preset third threshold, and the ratio of similar text links in the doubtful directory block to all text links in that doubtful directory block is greater than or equal to a preset fourth threshold;
(2) The second condition comprises:
On the premise that it is determined, according to the doubtful directory block position, the links of the doubtful directory block and the text links with a text feature in the doubtful directory block, that a directory block exists in the webpage after piecemeal, any one of the following cases is met:
(1) Among the text links with a text feature in the representative directory block, the number of text links containing a preset novel text characteristic keyword is greater than or equal to a preset fifth threshold;
(2) The number of similar text links in the representative directory block is greater than or equal to a preset sixth threshold, and the ratio of similar text links in the representative directory block to all text links in that doubtful directory block is greater than or equal to a preset seventh threshold;
(3) The number of similar text links in all doubtful directory blocks of the webpage after piecemeal is greater than or equal to a preset eighth threshold, and the ratio of the similar text links in all doubtful directory blocks of the webpage after piecemeal to all text links in the webpage after piecemeal is greater than or equal to a preset ninth threshold.
Wherein, the representative directory block can be obtained by the following method:
Count the number of link features of novel text pages that appear in each doubtful directory block, and count the number of text links that contain a preset novel text characteristic keyword among the text links with a text feature in that doubtful directory block; the doubtful directory block for which the number of link features and the number of text links containing a novel text characteristic keyword are the largest is determined to be the representative directory block.
Wherein, the step of "determining, according to the links of the doubtful directory block and the text links with a text feature in the doubtful directory block, that a directory block exists in the webpage after piecemeal" may specifically comprise:
Determine that the position of the doubtful directory block meets a precondition (which can be set according to the demands of the practical application); determine that the links of the doubtful directory block exhibit the link feature of a preset novel text page; and determine that a preset novel text characteristic keyword exists in the text links with a text feature in the doubtful directory block. When all of these hold, it is determined that a directory block exists in the webpage after piecemeal.
Wherein, the third, fourth, fifth, sixth, seventh, eighth and ninth thresholds can be set according to the demands of the practical application.
Wherein, the recognition device of the listing of novel item may specifically be integrated in a mobile terminal, such as a mobile phone or a tablet computer.
In addition, the communication system may further comprise a network device configured to provide the web presence to the recognition device of the listing of novel item.
The network device may specifically be a device such as a server, and is not described again here.
As can be seen from the above, the recognition device of the listing of novel item in the communication system of this embodiment first determines whether a listing of novel item feature exists in the web presence, which preliminarily identifies listing of novel pages. For a web presence without a listing of novel item feature, it further performs vision piecemeal, sets up the first DOM tree, uses the first DOM tree to obtain the features of the doubtful directory blocks, and then determines from these features whether the page is a listing of novel page. The listing of novel item is thereby identified, so that subsequent extraction from listing of novel pages can be more targeted and achieve a better extraction effect, which facilitates subsequent display on a mobile terminal, improves the display effect, and improves the user's browsing quality.
Those of ordinary skill in the art will appreciate that all or some of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The recognition method, device and system for a listing of novel item provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. Meanwhile, those skilled in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.

Claims (15)

1. A recognition method for a listing of novel item, characterized by comprising:
Determining whether a listing of novel item feature exists in a web presence;
If so, determining that the web presence is a listing of novel page;
If not, performing vision piecemeal on the web presence to obtain a webpage after piecemeal, setting up a first document object model tree according to the webpage after piecemeal, obtaining the features of a doubtful directory block according to the first document object model tree, and, when it is determined according to the features of the doubtful directory block that a listing of novel page exists, determining that the webpage after piecemeal is a listing of novel page.
2. The method according to claim 1, characterized in that said determining whether a listing of novel item feature exists in a web presence comprises:
Determining whether a listing of novel item feature exists according to the link of the web presence, its title, and the text links with a text feature in the full text.
3. The method according to claim 2, characterized in that said determining whether a listing of novel item feature exists according to the link of the web presence, its title, and the text links with a text feature in the full text comprises:
Determining, according to the link of the web presence, whether the web presence is a homepage or a secondary homepage;
If so, determining that no listing of novel item feature exists;
If not, setting up a second document object model tree according to the web presence, and using the second document object model tree to obtain the title of the web presence and the text links with a text feature in the full text; when it is determined that the title contains a preset novel title characteristic keyword, that the number of text links containing a preset novel text characteristic keyword among the text links with a text feature in the full text is greater than or equal to a preset first threshold, and that the ratio of similar text links in the full text to all text links in the full text is greater than or equal to a preset second threshold, determining that a listing of novel item feature exists; otherwise, determining that no listing of novel item feature exists.
4. The method according to claim 3, characterized in that said obtaining the features of a doubtful directory block according to the first document object model tree comprises:
Obtaining, according to the first document object model tree, the doubtful directory block position, the links of the doubtful directory block, and the text links with a text feature in the doubtful directory block;
Said determining that the webpage after piecemeal is a listing of novel page when it is determined according to the features of the doubtful directory block that a listing of novel page exists is specifically: when it is determined, according to the doubtful directory block position, the links of the doubtful directory block and the text links with a text feature in the doubtful directory block, that a listing of novel page exists, determining that the webpage after piecemeal is a listing of novel page.
5. The method according to claim 4, characterized in that said determining, according to the doubtful directory block position, the links of the doubtful directory block and the text links with a text feature in the doubtful directory block, that a listing of novel page exists comprises:
When it is determined, according to the links of the doubtful directory block and the text links with a text feature in the doubtful directory block, that a first condition and a second condition are met, determining that a listing of novel page exists;
The first condition comprises: among the text links with a text feature in the doubtful directory block, the number of text links containing a preset novel text characteristic keyword is greater than or equal to a preset third threshold, and the ratio of similar text links in the doubtful directory block to all text links in that doubtful directory block is greater than or equal to a preset fourth threshold;
The second condition comprises: on the premise that it is determined, according to the doubtful directory block position, the links of the doubtful directory block and the text links with a text feature in the doubtful directory block, that a directory block exists in the webpage after piecemeal, any one of the following cases is met: among the text links with a text feature in the representative directory block, the number of text links containing a preset novel text characteristic keyword is greater than or equal to a preset fifth threshold; or, the number of similar text links in the representative directory block is greater than or equal to a preset sixth threshold, and the ratio of similar text links in the representative directory block to all text links in that doubtful directory block is greater than or equal to a preset seventh threshold; or, the number of similar text links in all doubtful directory blocks of the webpage after piecemeal is greater than or equal to a preset eighth threshold, and the ratio of the similar text links in all doubtful directory blocks of the webpage after piecemeal to all text links in the webpage after piecemeal is greater than or equal to a preset ninth threshold;
The representative directory block is obtained by: counting the number of link features of novel text pages that appear in each doubtful directory block, and counting the number of text links that contain a preset novel text characteristic keyword among the text links with a text feature in that doubtful directory block; and determining that the doubtful directory block for which the number of link features and the number of text links containing a novel text characteristic keyword are the largest is the representative directory block.
6. The method according to claim 5, characterized in that said determining, according to the doubtful directory block position, the links of the doubtful directory block and the text links with a text feature in the doubtful directory block, that a directory block exists in the webpage after piecemeal comprises:
When it is determined that the position of the doubtful directory block meets a precondition, that the links of the doubtful directory block exhibit the link feature of a preset novel text page, and that a preset novel text characteristic keyword exists in the text links with a text feature in the doubtful directory block, determining that a directory block exists in the webpage after piecemeal.
7. The method according to any one of claims 3 to 6, characterized in that:
The novel title characteristic keyword comprises: "catalogue" and/or "title";
The novel text characteristic keyword comprises: "第" (ordinal prefix), "章" (chapter), "节" (section), "回" (episode) and/or "卷" (volume).
8. A novel catalog entry identification device, characterized by comprising:
A first determining unit, configured to determine whether a novel catalog entry feature exists in a web page and, if so, to determine that the web page is a novel catalog page;
A blocking unit, configured to perform visual blocking on the web page when the first determining unit determines that no novel catalog entry feature exists in the web page, to obtain a blocked web page;
A model building unit, configured to build a first document object model tree from the blocked web page;
An acquiring unit, configured to acquire features of a suspected directory block from the first document object model tree;
A second determining unit, configured to determine that the blocked web page is a novel catalog page when a novel catalog page is determined to exist according to the features of the suspected directory block.
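The cooperation of the five units can be pictured as a two-stage pipeline: a cheap page-level check first, and the block-level analysis only as a fallback. In the sketch below every unit is a caller-supplied callable with a hypothetical name; nothing here prescribes how visual blocking or DOM construction is actually done.

```python
def identify_catalog_page(page, has_page_feature, visual_block,
                          build_dom, extract_block_features, judge_blocks):
    """Return True if the page is judged to be a novel catalog page.

    has_page_feature       -- first determining unit
    visual_block           -- blocking unit
    build_dom              -- model building unit
    extract_block_features -- acquiring unit
    judge_blocks           -- second determining unit
    """
    # Stage 1: page-level catalog entry features settle the question early.
    if has_page_feature(page):
        return True
    # Stage 2: fall back to block-level analysis of the blocked page.
    blocked = visual_block(page)
    dom = build_dom(blocked)
    features = extract_block_features(dom)
    return bool(judge_blocks(features))
```

Injecting the units as parameters keeps the control flow of the claim visible and lets each stage be replaced or stubbed independently.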
9. The novel catalog entry identification device according to claim 8, characterized in that
The first determining unit is specifically configured to determine whether a novel catalog entry feature exists according to the link of the web page, the title of the web page, and the text links having a text feature in the full text of the web page.
10. The novel catalog entry identification device according to claim 9, characterized in that
The first determining unit is specifically configured to: determine, according to the link of the web page, whether the web page is a home page or a secondary home page; if so, determine that no novel catalog entry feature exists; if not, build a second document object model tree from the web page, and use the second document object model tree to obtain the title of the web page and the text links having a text feature in the full text; determine that a novel catalog entry feature exists when the title contains a preset novel title feature keyword, the number of text links containing a preset novel body-text feature keyword among the text links having a text feature in the full text is greater than or equal to a preset first threshold, and the ratio of similar text links in the full text to all text links in the full text is greater than or equal to a preset second threshold; otherwise, determine that no novel catalog entry feature exists.
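The page-level decision of claim 10 can be sketched as a chain of early exits. The home-page heuristic below (empty path or a bare index file) is purely an assumption, since the claim does not define how a home page or secondary home page is recognized from the link; the function and parameter names are likewise hypothetical, and `t1` and `t2` stand for the first and second preset thresholds.

```python
from urllib.parse import urlparse


def looks_like_home_page(url):
    """Assumed heuristic: an empty path or a bare index file is a home page."""
    path = urlparse(url).path.strip("/")
    return path == "" or path.lower() in {"index.html", "index.htm"}


def has_catalog_entry_feature(url, title, feature_link_texts,
                              similar_link_count, total_link_count,
                              title_keywords, body_keywords, t1, t2):
    """Page-level test: title keyword, keyword-link count, similar-link ratio."""
    if looks_like_home_page(url):
        return False
    if total_link_count == 0:
        return False
    if not any(k in title for k in title_keywords):
        return False
    keyword_links = sum(
        1 for text in feature_link_texts
        if any(k in text for k in body_keywords)
    )
    if keyword_links < t1:
        return False
    return similar_link_count / total_link_count >= t2
```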
11. The novel catalog entry identification device according to claim 10, characterized in that the acquiring unit is specifically configured to acquire, from the first document object model tree, the position of the suspected directory block, the links of the suspected directory block, and the text links having a text feature in the suspected directory block;
The second determining unit is specifically configured to determine that the blocked web page is a novel catalog page when a novel catalog page is determined to exist according to the position of the suspected directory block, the links of the suspected directory block, and the text links having a text feature in the suspected directory block.
12. The novel catalog entry identification device according to claim 11, characterized in that
The second determining unit is specifically configured to determine that a novel catalog page exists when a first condition and a second condition are determined to be satisfied according to the position of the suspected directory block, the links of the suspected directory block, and the text links having a text feature in the suspected directory block;
The first condition comprises: among the text links having a text feature in the suspected directory block, the number of text links containing a preset novel body-text feature keyword is greater than or equal to a preset third threshold, and the ratio of similar text links in the suspected directory block to all text links in that suspected directory block is greater than or equal to a preset fourth threshold;
The second condition comprises: on the premise that a directory block is determined to exist in the blocked web page according to the position of the suspected directory block, the links of the suspected directory block, and the text links having a text feature in the suspected directory block, any one of the following is satisfied: among the text links having a text feature in the representative directory block, the number of text links containing a preset novel body-text feature keyword is greater than or equal to a preset fifth threshold; or, the number of similar text links in the representative directory block is greater than or equal to a preset sixth threshold, and the ratio of the similar text links in the representative directory block to all text links in that suspected directory block is greater than or equal to a preset seventh threshold; or, the number of similar text links in all suspected directory blocks of the blocked web page is greater than or equal to a preset eighth threshold, and the ratio of the similar text links in all suspected directory blocks of the blocked web page to all text links in the blocked web page is greater than or equal to a preset ninth threshold;
The representative directory block is determined as follows: counting the number of link features of novel body-text pages that occur in each suspected directory block, and counting, among the text links having a text feature in that suspected directory block, the number of text links containing a preset novel body-text feature keyword; and determining the suspected directory block for which the number of said link features and the number of text links containing a novel body-text feature keyword are the largest to be the representative directory block.
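The first condition needs a count of "similar text links", which the claims leave undefined. The sketch below fills that gap with an assumed heuristic: numerals in the first token of each anchor text are masked, and the links sharing the most common masked shape are treated as similar. The heuristic, the helper names, and the thresholds `t3` and `t4` (third and fourth preset thresholds) are illustrative only.

```python
import re
from collections import Counter

# Arabic digits and common Chinese numerals, masked to compare link shapes.
_NUMERALS = re.compile(r"[0-9零一二三四五六七八九十百千两]+")


def link_shape(text):
    """Mask numerals in the first whitespace-separated token of the text."""
    tokens = text.split()
    return _NUMERALS.sub("#", tokens[0]) if tokens else ""


def count_similar_links(link_texts):
    """Size of the largest group of links sharing one masked shape."""
    shapes = Counter(link_shape(t) for t in link_texts)
    return shapes.most_common(1)[0][1] if shapes else 0


def first_condition(link_texts, body_keywords, t3, t4):
    """Keyword-link count >= t3 and similar-link ratio >= t4 for one block."""
    if not link_texts:
        return False
    keyword_links = sum(
        1 for t in link_texts if any(k in t for k in body_keywords)
    )
    ratio = count_similar_links(link_texts) / len(link_texts)
    return keyword_links >= t3 and ratio >= t4
```

With this masking, anchor texts such as "第一章 ..." and "第十章 ..." share one shape, so a regular chapter list yields a high similar-link ratio while navigation links do not.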
13. The novel catalog entry identification device according to claim 12, characterized in that
The second determining unit is specifically configured to determine that a directory block exists in the blocked web page when the position of the suspected directory block satisfies a precondition, a link of the suspected directory block exhibits a preset link feature of a novel body-text page, and a preset novel body-text feature keyword is present in the text links having a text feature in the suspected directory block.
14. The novel catalog entry identification device according to any one of claims 10 to 13, characterized in that
The novel title feature keyword comprises: "catalogue" and/or "title";
The novel body-text feature keyword comprises: the ordinal prefix (第), chapter (章), section (节), episode (回) and/or volume (卷).
15. A communication system, characterized by comprising the novel catalog entry identification device according to any one of claims 8 to 14.
CN201310031915.4A 2013-01-28 2013-01-28 Novel catalog entry identification method, device and system Active CN103970755B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310031915.4A CN103970755B (en) 2013-01-28 2013-01-28 Novel catalog entry identification method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310031915.4A CN103970755B (en) 2013-01-28 2013-01-28 Novel catalog entry identification method, device and system

Publications (2)

Publication Number Publication Date
CN103970755A CN103970755A (en) 2014-08-06
CN103970755B CN103970755B (en) 2018-12-11

Family

ID=51240269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310031915.4A Active CN103970755B (en) 2013-01-28 2013-01-28 Novel catalog entry identification method, device and system

Country Status (1)

Country Link
CN (1) CN103970755B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101944104A (en) * 2010-08-19 2011-01-12 百度在线网络技术(北京)有限公司 Evaluation method and equipment for importance of webpage sub-blocks
US20110276562A1 (en) * 2009-01-16 2011-11-10 Beckett Madden-Woods Visualizing site structure and enabling site navigation for a search result or linked page
CN102346748A (en) * 2010-08-05 2012-02-08 盛乐信息技术(上海)有限公司 Automatic identification method for network literature directory type web pages
CN102541874A (en) * 2010-12-16 2012-07-04 中国移动通信集团公司 Webpage text content extracting method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
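Chen Hao (陈浩): "Research and Application of Custom-Topic Information Extraction" (自定义主题信息抽取的研究与应用), China Master's Theses Full-text Database, Information Science and Technology series *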

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106802933A (en) * 2016-12-28 2017-06-06 东软集团股份有限公司 Method and device for determining a news list region
CN112749528A (en) * 2019-10-31 2021-05-04 腾讯科技(深圳)有限公司 Text processing method and device, electronic equipment and computer readable storage medium
CN111144069A (en) * 2019-12-30 2020-05-12 北大方正集团有限公司 Table-based directory typesetting method and device and storage medium
CN113221031A (en) * 2020-12-30 2021-08-06 江苏省未来网络创新研究院 Method for automatically identifying website directory page
WO2022143192A1 (en) * 2020-12-30 2022-07-07 江苏省未来网络创新研究院 Method for automatically recognizing contents page of website

Also Published As

Publication number Publication date
CN103970755B (en) 2018-12-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by SIPO to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant