CN104317903B - Method and device for identifying the chapter completeness of chapter-based text - Google Patents
- Publication number
- CN104317903B (application CN201410578534.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method and a device for identifying the chapter completeness of chapter-based text. The method includes: identifying a catalogue page and multiple content pages of the chapter-based text from each of multiple websites, where each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages; determining, according to the content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text across the different websites; and analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the set according to the analysis result. The technical solution provided by the invention can identify the chapter completeness of chapter-based text flexibly and quickly, and the identification result is accurate and objective.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and a device for identifying the chapter completeness of chapter-based text.
Background
With the growing popularity of computers and computer networks, the Internet has reached into every area of people's work, study, and daily life, and has become an important channel for publishing and obtaining information.
Chapter-based text is now abundant on the Internet, and a single text may be reprinted by many different websites. Because of various objective factors during reprinting, the content of the text on some websites may be incomplete or may even contain false content. Take novels as an example. Reading novels is a strong demand among Internet users, and it accounts for a particularly large share of demand on mobile devices. Novel websites exist in large numbers and vary widely in quality. The same online novel may be reprinted by many different websites, but owing to various objective factors its content on some websites may be incomplete (for example, missing chapters) or even false (padded with fabricated chapters). When a search engine indexes these novel websites, it needs to judge the chapter completeness of the novel so that, as far as possible, it presents users with websites whose content is complete, which improves the quality of the novel content users obtain and the user experience.
In the related art, chapter completeness is judged by manually configuring templates for different novel websites. Although this approach is highly accurate, its shortcomings are obvious: the number of websites that manual effort can cover is limited, the approach is not intelligent enough, and it does not respond promptly to changes in website format. How to identify the chapter completeness of chapter-based text flexibly, quickly, and accurately has therefore become a technical problem in urgent need of a solution.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method, and a corresponding device, for identifying the chapter completeness of chapter-based text that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for identifying the chapter completeness of chapter-based text is provided, including: identifying a catalogue page and multiple content pages of the chapter-based text from each of multiple websites, where each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages; determining, according to the content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text across the different websites; and analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the set according to the analysis result.
Optionally, determining the catalogue page set of the chapter-based text across different websites according to the content pages corresponding to each catalogue page includes: computing the intersection between the content pages corresponding to every two catalogue pages, and taking it as the intersection of those two catalogue pages; and determining the catalogue page set of the chapter-based text across different websites according to the intersection of every two catalogue pages.
Optionally, computing the intersection between the content pages corresponding to every two catalogue pages includes: extracting a text feature vector for each of the content pages corresponding to the multiple catalogue pages; clustering the content pages that have the same text feature vector to generate multiple content page groups; and computing the intersection between the content pages corresponding to every two catalogue pages according to the mapping between the content page groups and the content pages corresponding to each catalogue page.
Optionally, determining the catalogue page set of the chapter-based text across different websites according to the intersection of every two catalogue pages includes: merging every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold, to obtain a merge result; and taking the merge result as the catalogue page set of the chapter-based text across different websites.
Optionally, analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the set according to the analysis result, includes: computing the average number of elements in the intersections of every two catalogue pages in the catalogue page set; and, if the numbers of elements in the intersections of a certain catalogue page with multiple other catalogue pages in the set are all less than the average, determining that the chapters corresponding to that catalogue page are incomplete.
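The averaging rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_incomplete` and the sample data are hypothetical, each content page is represented by an identifier of its content group, and "multiple other catalogue pages" is read here as "all other catalogue pages in the set".

```python
from itertools import combinations


def find_incomplete(contents):
    """contents maps a catalogue page name to the set of content groups it links to.

    Returns the average pairwise intersection size and the catalogue pages whose
    intersections with every other page are all below that average.
    (Assumption: "multiple other pages" is interpreted as "all other pages".)
    """
    pairs = combinations(sorted(contents), 2)
    sizes = {frozenset(p): len(contents[p[0]] & contents[p[1]]) for p in pairs}
    average = sum(sizes.values()) / len(sizes)
    incomplete = []
    for page in sorted(contents):
        others = [o for o in contents if o != page]
        if all(sizes[frozenset((page, o))] < average for o in others):
            incomplete.append(page)
    return average, incomplete


# Illustrative data: three catalogue pages of the same text on three websites.
pages = {"A": {"a", "b", "c"}, "B": {"a", "b'"}, "C": {"a", "b", "c", "d"}}
average, incomplete = find_incomplete(pages)
```

With this data the pairwise intersection sizes are 1, 3, and 1, so only catalogue page B falls below the average against every other page.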
Optionally, analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the set according to the analysis result, includes: if the content pages corresponding to a certain catalogue page include the content pages corresponding to multiple other catalogue pages in the set, and the content pages corresponding to that catalogue page also contain other content pages, determining that those other content pages are the content pages of the newest chapters and that the catalogue page has the ability to continuously contribute new chapters.
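A minimal sketch of this superset check follows. The function name `newest_chapters`, the parameter `min_others`, and the sample data are hypothetical; the text says only "multiple" other catalogue pages, so the value 2 is an assumption.

```python
def newest_chapters(page, contents, min_others=2):
    """Return the content groups of `page` that no covered catalogue page has.

    `page` must fully contain the content of at least `min_others` other
    catalogue pages (assumed reading of "multiple"); otherwise an empty set
    is returned and the page is not credited with newer chapters.
    """
    own = contents[page]
    covered = [o for o in contents if o != page and contents[o] <= own]
    if len(covered) < min_others:
        return set()
    extra = set(own)
    for other in covered:
        extra -= contents[other]
    return extra


# Illustrative data: C contains everything A and B have, plus chapter "d".
pages = {"A": {"a", "b", "c"}, "B": {"a", "b"}, "C": {"a", "b", "c", "d"}}
extra_c = newest_chapters("C", pages)
extra_a = newest_chapters("A", pages)
can_update_c = bool(extra_c)  # C is treated as able to keep contributing new chapters
```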
Optionally, analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the set according to the analysis result, includes: if a content page corresponding to a certain catalogue page is not present among the content pages corresponding to the other catalogue pages in the set, and the length of that content page falls outside the interval range corresponding to the average length of the content pages of that catalogue page, determining that the content page is a false content page.
Optionally, analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the set according to the analysis result, includes: if a content page corresponding to a certain catalogue page is not present among the content pages corresponding to the other catalogue pages in the set, the length of that content page falls within the interval range corresponding to the average length of the content pages of that catalogue page, and the catalogue page does not have the ability to continuously contribute new chapters, determining that the content page is a false content page.
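The two false-page rules can be combined into one small predicate, sketched below. The text does not define the "interval range corresponding to the average length", so the symmetric tolerance of 50% around the mean is an assumption, as are the function name `is_fake` and its parameters.

```python
def is_fake(length, seen_elsewhere, page_lengths, can_update, tolerance=0.5):
    """Decide whether a content page should be flagged as false.

    length         -- length of the content page under test
    seen_elsewhere -- True if another catalogue page in the set has the same page
    page_lengths   -- lengths of all content pages of the same catalogue page
    can_update     -- True if the catalogue page can keep contributing new chapters
    tolerance      -- assumed half-width of the accepted range, relative to the mean
    """
    if seen_elsewhere:
        return False  # neither rule applies to pages shared with other sites
    mean = sum(page_lengths) / len(page_lengths)
    in_range = (1 - tolerance) * mean <= length <= (1 + tolerance) * mean
    if not in_range:
        return True  # first rule: unique page with an abnormal length
    return not can_update  # second rule: normal length, but no update ability


lengths = [3000, 3200, 2800, 200]
short_unique = is_fake(200, False, lengths, can_update=True)
normal_unique_static = is_fake(3000, False, lengths[:3], can_update=False)
normal_unique_updating = is_fake(3000, False, lengths[:3], can_update=True)
```

A unique page of normal length on a catalogue page that does keep updating is left unflagged here, since neither rule in the text marks it as false.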
Optionally, identifying the catalogue page and multiple content pages of the chapter-based text from each of multiple websites includes: searching multiple websites for web pages related to the chapter-based text; and identifying the catalogue page and multiple content pages of the chapter-based text from the web pages found.
Optionally, identifying the catalogue page and multiple content pages of the chapter-based text from the web pages found includes: parsing the web pages found into a Document Object Model (DOM) tree structure; classifying each node in the DOM tree structure to determine the structural blocks of the web page; and extracting the catalogue page and multiple content pages of the chapter-based text according to the structural blocks.
According to another aspect of the present invention, a device for identifying the chapter completeness of chapter-based text is also provided, including:
an acquisition module, adapted to identify a catalogue page and multiple content pages of the chapter-based text from each of multiple websites, where each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
a determining module, adapted to determine, according to the content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text across the different websites; and
an identification module, adapted to analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and to identify the chapter completeness of each catalogue page in the set according to the analysis result.
Optionally, the determining module is further adapted to: compute the intersection between the content pages corresponding to every two catalogue pages, and take it as the intersection of those two catalogue pages; and determine the catalogue page set of the chapter-based text across different websites according to the intersection of every two catalogue pages.
Optionally, the determining module is further adapted to: extract a text feature vector for each of the content pages corresponding to the multiple catalogue pages; cluster the content pages that have the same text feature vector to generate multiple content page groups; and compute the intersection between the content pages corresponding to every two catalogue pages according to the mapping between the content page groups and the content pages corresponding to each catalogue page.
Optionally, the determining module is further adapted to: merge every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold, to obtain a merge result; and take the merge result as the catalogue page set of the chapter-based text across different websites.
Optionally, the identification module is further adapted to: compute the average number of elements in the intersections of every two catalogue pages in the catalogue page set; and, if the numbers of elements in the intersections of a certain catalogue page with multiple other catalogue pages in the set are all less than the average, determine that the chapters corresponding to that catalogue page are incomplete.
Optionally, the identification module is further adapted to: if the content pages corresponding to a certain catalogue page include the content pages corresponding to multiple other catalogue pages in the set, and the content pages corresponding to that catalogue page also contain other content pages, determine that those other content pages are the content pages of the newest chapters and that the catalogue page has the ability to continuously contribute new chapters.
Optionally, the identification module is further adapted to: if a content page corresponding to a certain catalogue page is not present among the content pages corresponding to the other catalogue pages in the set, and the length of that content page falls outside the interval range corresponding to the average length of the content pages of that catalogue page, determine that the content page is a false content page.
Optionally, the identification module is further adapted to: if a content page corresponding to a certain catalogue page is not present among the content pages corresponding to the other catalogue pages in the set, the length of that content page falls within the interval range corresponding to the average length of the content pages of that catalogue page, and the catalogue page does not have the ability to continuously contribute new chapters, determine that the content page is a false content page.
Optionally, the acquisition module is further adapted to: search multiple websites for web pages related to the chapter-based text; and identify the catalogue page and multiple content pages of the chapter-based text from the web pages found.
Optionally, the acquisition module is further adapted to: parse the web pages found into a Document Object Model (DOM) tree structure; classify each node in the DOM tree structure to determine the structural blocks of the web page; and extract the catalogue page and multiple content pages of the chapter-based text according to the structural blocks.
According to the technical solution provided by the present invention, a catalogue page and multiple content pages of the chapter-based text are identified from each of multiple websites, and the catalogue page set of the chapter-based text across the different websites is then determined according to the content pages corresponding to each catalogue page. Each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page are then analyzed, and the chapter completeness of each catalogue page in the set is identified according to the analysis result. The present invention thus automates three stages: acquiring the data source (the catalogue pages and content pages of the chapter-based text on multiple websites), determining the catalogue page set, and analyzing the catalogue page set. This solves the low efficiency that results in the related art from judging chapter completeness with manually configured templates. Moreover, the present invention can acquire the data source flexibly and then determine and analyze the catalogue page set, which solves the related art's slow response to changes in website format. In addition, catalogue pages and content pages accurately and objectively reflect the chapter completeness of chapter-based text. The present invention specifically analyzes each catalogue page in the catalogue page set of the chapter-based text across different websites and/or the content pages corresponding to each catalogue page, and identifies the chapter completeness of each catalogue page in the set according to the analysis result, which makes the identification result more accurate. In summary, the technical solution provided by the present invention can identify the chapter completeness of chapter-based text flexibly and quickly, and the identification result is accurate and objective.
The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the present invention are more readily apparent, specific embodiments of the present invention are set out below.
From the following detailed description of specific embodiments of the present invention in conjunction with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages, and features of the present invention.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flow chart of a method for identifying the chapter completeness of chapter-based text according to an embodiment of the present invention; and
Fig. 2 shows a schematic structural diagram of a device for identifying the chapter completeness of chapter-based text according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and so that its scope can be conveyed completely to those skilled in the art.
To solve the above technical problem, an embodiment of the present invention provides a method for identifying the chapter completeness of chapter-based text. Fig. 1 shows a flow chart of the method according to an embodiment of the present invention. As shown in Fig. 1, the method includes at least the following steps S102 to S106.
Step S102: identify a catalogue page and multiple content pages of the chapter-based text from each of multiple websites, where each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages.
Step S104: determine, according to the content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text across the different websites.
Step S106: analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identify the chapter completeness of each catalogue page in the set according to the analysis result.
The chapter-based text mentioned in step S102 above refers to text made up of a number of chapters, such as a novel or a paper. A catalogue page is the table of contents of the chapter-based text; for example, when a user searches for a novel, what the user is usually looking for is the novel's catalogue page. A content page holds the specific content of one chapter of the chapter-based text. The present invention provides a preferred scheme for identifying the catalogue page and multiple content pages of the chapter-based text from multiple websites: web pages related to the chapter-based text are searched for on multiple websites, and the catalogue page and multiple content pages of the chapter-based text are then identified from the web pages found.
Further, the catalogue page and multiple content pages of the chapter-based text can be identified and extracted from the web pages found by applying manually written rules. Alternatively, templates can be annotated in advance; for each extraction, the best-matching template is found in the template library and then used to identify and extract the catalogue page and content pages. In addition, to improve identification efficiency, the present invention can also parse the web pages found into a Document Object Model (DOM) tree structure, classify each node in the DOM tree structure to determine the structural blocks of the web page, and then extract the catalogue page and multiple content pages of the chapter-based text according to the structural blocks. A preferred scheme for classifying the nodes in the DOM tree structure to determine the structural blocks of the web page is provided here. In this scheme, the DOM tree structure can be traversed to obtain the content of each node, the content of each node is then input into a decision tree according to preset rules, and the decision tree classifies each node. Alternatively, the DOM tree structure can be traversed to obtain the dimensional features of each node, the dimensional features of each node are then input into the decision tree according to preset rules, and the decision tree classifies each node.
The decision tree is built from statistics of the dimensional features of known block types; once trained, it derives the block type of each node from that node's dimensional features. The scheme in which a decision tree classifies each node in the DOM tree structure of a web page to determine the structural blocks of the web page is described in detail below.
First, the dimensional features used for blocking are determined. In the embodiments of the present invention, as many as 105 dimensional features can be used, which mainly cover the following: text length, number of hyperlinks, hyperlink text length, highlighted text length (including enlarged and bold text), Chinese character length, English character length, numeric character length, specific keywords, specific punctuation marks, and so on. That is, a given type of block can be determined by one or more of the 105 dimensional features taking specific values. It should be noted that the dimensional features determined according to actual conditions are not limited to 105 and can be extended later.
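To make a few of these dimensional features concrete, the following sketch computes six of them for an HTML fragment using only Python's standard `html.parser` module. It is an illustration, not the patent's extractor: the class name `FeatureExtractor`, the feature names, and the choice to ignore whitespace are assumptions, and the Chinese-character test covers only the basic CJK Unified Ideographs range.

```python
from html.parser import HTMLParser


class FeatureExtractor(HTMLParser):
    """Accumulates simple per-block features while parsing an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0  # > 0 while inside an <a> element
        self.features = {
            "text_length": 0,
            "link_count": 0,
            "link_text_length": 0,
            "chinese_chars": 0,
            "english_chars": 0,
            "digit_chars": 0,
        }

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
            self.features["link_count"] += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        text = "".join(data.split())  # assumption: whitespace is not counted
        f = self.features
        f["text_length"] += len(text)
        if self.link_depth:
            f["link_text_length"] += len(text)
        for ch in text:
            if "\u4e00" <= ch <= "\u9fff":
                f["chinese_chars"] += 1
            elif ch.isascii() and ch.isalpha():
                f["english_chars"] += 1
            elif ch.isascii() and ch.isdigit():
                f["digit_chars"] += 1


def extract_features(html):
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.features


sample = '<ul><li><a href="/c1">Chapter 1</a></li><li><a href="/c2">Chapter 2</a></li></ul>'
features = extract_features(sample)
```

A block whose text is almost entirely hyperlink text, as in this sample, is the kind of pattern a trained classifier could associate with a catalogue page.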
Second, the dimensional features determined for blocking are input into the decision tree and used to train and build it.
Third, the content of each node in the DOM tree structure of the web page is input into the decision tree according to preset rules. The decision tree analyzes the content of each node to obtain its dimensional features, and then classifies each node according to those features.
The above describes in detail several implementations of acquiring the data source (the catalogue pages and content pages of the chapter-based text on multiple websites) in step S102. One or more implementations of determining the catalogue page set are described below.
In step S104 above, the catalogue page set of the chapter-based text across different websites is determined according to the content pages corresponding to each catalogue page. The present invention provides a preferred scheme for this: the intersection between the content pages corresponding to every two catalogue pages is computed and taken as the intersection of those two catalogue pages, and the catalogue page set of the chapter-based text across different websites is then determined according to the intersection of every two catalogue pages.
Further, in the preferred scheme of the present invention, the intersection between the content pages corresponding to every two catalogue pages is computed using the idea of clustering. A text feature vector can be extracted for each of the content pages corresponding to the multiple catalogue pages; the content pages that have the same text feature vector are then clustered to generate multiple content page groups; and the intersection between the content pages corresponding to every two catalogue pages is computed according to the mapping between the content page groups and the content pages corresponding to each catalogue page. For example, suppose the websites are websites A, B, and C, and the corresponding catalogue pages of the chapter-based text are catalogue pages A, B, and C. The content pages corresponding to catalogue page A are A1, A2, and A3; those corresponding to catalogue page B are B1 and B2; and those corresponding to catalogue page C are C1, C2, C3, and C4. The text feature vectors extracted for content pages A1, A2, A3, B1, B2, C1, C2, C3, and C4 are a, b, c, a, b', a, b, c, and d respectively. Clustering the content pages that have the same text feature vector generates the content page groups {a, a, a}, {b, b}, {b'}, {c, c}, and {d}. The intersection between the content pages corresponding to every two catalogue pages is then computed according to the mapping between the content page groups and the content pages corresponding to each catalogue page: the intersection between the content pages of catalogue page A and catalogue page B is {a}, the intersection between the content pages of catalogue page A and catalogue page C is {a, b, c}, and the intersection between the content pages of catalogue page B and catalogue page C is {a}.
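The example above can be reproduced with a short Python sketch. It rests on assumptions: the function name `pairwise_intersections` is hypothetical, and plain strings stand in for real text feature vectors, which would need to be hashable (for instance tuples) for exact-match grouping to work this way.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_intersections(page_vectors):
    """page_vectors maps catalogue page -> {content page id: feature vector}.

    Content pages with an identical feature vector fall into the same group;
    each catalogue page is then described by the set of groups it maps to, and
    the intersection of two catalogue pages is the overlap of those sets.
    """
    groups = defaultdict(list)      # feature vector -> content page ids
    page_groups = defaultdict(set)  # catalogue page -> feature vectors
    for catalogue, contents in page_vectors.items():
        for content_id, vector in contents.items():
            groups[vector].append(content_id)
            page_groups[catalogue].add(vector)
    intersections = {
        (x, y): page_groups[x] & page_groups[y]
        for x, y in combinations(sorted(page_groups), 2)
    }
    return dict(groups), intersections


# Data from the example: strings are placeholders for text feature vectors.
page_vectors = {
    "A": {"A1": "a", "A2": "b", "A3": "c"},
    "B": {"B1": "a", "B2": "b'"},
    "C": {"C1": "a", "C2": "b", "C3": "c", "C4": "d"},
}
groups, intersections = pairwise_intersections(page_vectors)
```

Grouping by exact equality mirrors the "same text feature vector" wording; a similarity threshold would be a different design and is not shown here.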
The catalogue page set of the chapter-based text on the different websites is then determined according to the intersection of every two catalogue pages. This may be done by merging every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold, and taking the merged result as the catalogue page set of the chapter-based text on the different websites. Continuing the above example, the intersection between the content pages corresponding to every two catalogue pages is taken as the intersection of those two catalogue pages: the intersection of catalogue pages A and B is {a}, the intersection of catalogue pages A and C is {a, b, c}, and the intersection of catalogue pages B and C is {a}. With a preset threshold of 1, every two catalogue pages whose intersection contains at least 1 element are merged, the merged result is catalogue pages A, B and C, and the catalogue page set of the chapter-based text on the different websites is therefore catalogue pages A, B and C.
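The merging step can be sketched as follows. The text only states that every two catalogue pages whose intersection reaches the threshold are merged; treating the merge as transitive and implementing it with a union-find structure is an assumption of this sketch, as is the function name merge_catalogue_pages.

```python
def merge_catalogue_pages(intersections, threshold=1):
    """Merge catalogue pages whose pairwise intersection size >= threshold."""
    parent = {}

    def find(page):
        # Register unseen pages as their own root, then follow parent links.
        parent.setdefault(page, page)
        while parent[page] != page:
            parent[page] = parent[parent[page]]  # path halving
            page = parent[page]
        return page

    for (first, second), common in intersections.items():
        root_first, root_second = find(first), find(second)
        if len(common) >= threshold:
            parent[root_first] = root_second

    merged = {}
    for page in list(parent):
        merged.setdefault(find(page), set()).add(page)
    return list(merged.values())

# Intersections from the example above.
intersections = {
    ("A", "B"): {"a"},
    ("A", "C"): {"a", "b", "c"},
    ("B", "C"): {"a"},
}
catalogue_page_sets = merge_catalogue_pages(intersections, threshold=1)
```

With a threshold of 1 all three catalogue pages end up in one set; raising the threshold to 2 would leave catalogue page B on its own.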
After step S104 determines, according to the multiple content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text on the different websites, step S106 analyzes each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifies the chapter completeness of each catalogue page in the catalogue page set according to the analysis result. The present invention provides several analysis methods, which are described in detail below.
In the first method, the average number of elements in the intersections of every two catalogue pages in the catalogue page set is calculated. If the number of elements in the intersection of a given catalogue page with each of the other catalogue pages in the set is less than this average, the chapters corresponding to that catalogue page are determined to be incomplete. In the above example, the catalogue page set consists of catalogue pages A, B and C. The intersection of catalogue pages A and B has 1 element, the intersection of catalogue pages A and C has 3 elements, and the intersection of catalogue pages B and C has 1 element, so the average is (1 + 3 + 1)/3 = 5/3. The intersections of catalogue page B with catalogue page A and with catalogue page C each have 1 element, which is less than the average of 5/3, so the chapters corresponding to catalogue page B are determined to be incomplete.
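The arithmetic of the first method can be checked with a short Python sketch. The function name incomplete_catalogue_pages and the use of exact fractions are choices made for this illustration only, and the sketch assumes at least one pair of catalogue pages.

```python
from fractions import Fraction

def incomplete_catalogue_pages(intersection_sizes):
    """Return the average intersection size and the pages flagged as incomplete."""
    average = Fraction(sum(intersection_sizes.values()), len(intersection_sizes))
    pages = sorted({page for pair in intersection_sizes for page in pair})
    incomplete = [
        page
        for page in pages
        # Flag a page only if all of its intersections are below the average.
        if all(size < average
               for pair, size in intersection_sizes.items() if page in pair)
    ]
    return average, incomplete

# Number of elements in each pairwise intersection from the example.
sizes = {("A", "B"): 1, ("A", "C"): 3, ("B", "C"): 1}
average, incomplete = incomplete_catalogue_pages(sizes)
```

For the example data the average is 5/3 and only catalogue page B is flagged, because catalogue pages A and C each have one intersection of size 3.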
In the second method, if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, those further content pages are determined to be the content pages of the latest chapters, and the catalogue page is determined to be capable of continuously contributing new chapters. Here, the latest chapter is the most recently published chapter of the chapter-based text, for example the most recently published chapter of a serialized novel. Novel readers usually follow a book as it is serialized: as soon as the author publishes the latest chapter, readers want to read it immediately, so novel sites that publish the latest chapters faster are more popular with users. In the above example, catalogue page A corresponds to content pages A1, A2 and A3 (with text feature vectors a, b and c respectively), catalogue page B corresponds to content pages B1 and B2 (with text feature vectors a and b' respectively), and catalogue page C corresponds to content pages C1, C2, C3 and C4 (with text feature vectors a, b, c and d respectively). It can be seen that the content pages corresponding to catalogue page C include the content pages corresponding to catalogue pages A and B, and that catalogue page C also has a further content page (content page C4). Content page C4 is therefore determined to be the content page of the latest chapter, and catalogue page C is determined to be capable of continuously contributing new chapters.
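One possible reading of the second method is sketched below. The text does not define "includes" precisely, so the sketch assumes that a catalogue page qualifies when it covers every feature vector found on at least two catalogue pages and also has at least one feature vector found nowhere else. This interpretation and the name find_latest_chapter_contributors are assumptions of the illustration.

```python
from collections import Counter

def find_latest_chapter_contributors(features_by_catalogue):
    """Map each qualifying catalogue page to its presumed latest-chapter features."""
    feature_sets = {name: set(f) for name, f in features_by_catalogue.items()}
    counts = Counter(f for feats in feature_sets.values() for f in feats)
    # Features seen on two or more catalogue pages (assumed "confirmed" chapters).
    shared = {f for f, n in counts.items() if n >= 2}
    contributors = {}
    for name, own in feature_sets.items():
        extra = {f for f in own if counts[f] == 1}  # found on this page only
        if shared and shared <= own and extra:
            contributors[name] = extra
    return contributors

features_by_catalogue = {
    "A": ["a", "b", "c"],
    "B": ["a", "b'"],
    "C": ["a", "b", "c", "d"],
}
contributors = find_latest_chapter_contributors(features_by_catalogue)
```

Under this reading only catalogue page C qualifies, with d (content page C4) as the latest chapter. Catalogue page B also has a page found nowhere else (b'), but it does not cover the shared chapters b and c, so it is not treated as a contributor.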
In the third method, if a content page corresponding to a given catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page falls outside the interval range corresponding to the average length of the content pages of that catalogue page, the content page is determined to be a fake content page. In the above example, catalogue page A corresponds to content pages A1, A2 and A3 (with text feature vectors a, b and c respectively), catalogue page B corresponds to content pages B1 and B2 (with text feature vectors a and b' respectively), and catalogue page C corresponds to content pages C1, C2, C3 and C4 (with text feature vectors a, b, c and d respectively). It can be seen that content page B2 of catalogue page B does not exist among the content pages corresponding to catalogue pages A and C. If the length of content page B2 falls outside the interval range corresponding to the average length of the content pages of catalogue page B, content page B2 is determined to be a fake content page.
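The third method can be sketched as follows. The text does not say how the interval range around the average length is chosen, so the symmetric tolerance of 50% used here is purely hypothetical, as are the page lengths and the name fake_content_pages.

```python
# Each content page is given as (text feature vector, length in characters).
# The lengths are invented for illustration only.
pages = {
    "A": [("a", 3000), ("b", 3200), ("c", 2900)],
    "B": [("a", 3000), ("b'", 400)],
    "C": [("a", 3000), ("b", 3200), ("c", 2900), ("d", 3100)],
}

def fake_content_pages(pages, catalogue, tolerance=0.5):
    """Return features of pages unique to `catalogue` with an out-of-range length."""
    own = pages[catalogue]
    elsewhere = {
        feature
        for name, items in pages.items() if name != catalogue
        for feature, _ in items
    }
    mean = sum(length for _, length in own) / len(own)
    low, high = mean * (1 - tolerance), mean * (1 + tolerance)
    return [
        feature
        for feature, length in own
        if feature not in elsewhere and not (low <= length <= high)
    ]

fake_on_b = fake_content_pages(pages, "B")
fake_on_c = fake_content_pages(pages, "C")
```

With these hypothetical lengths, b' (content page B2) is flagged on catalogue page B, while d (content page C4) on catalogue page C is not flagged because its length lies inside the assumed interval.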
In the fourth method, if a content page corresponding to a given catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval range corresponding to the average length of the content pages of that catalogue page, and the catalogue page is not capable of continuously contributing new chapters, the content page is determined to be a fake content page. In the above example, catalogue page A corresponds to content pages A1, A2 and A3 (with text feature vectors a, b and c respectively), catalogue page B corresponds to content pages B1 and B2 (with text feature vectors a and b' respectively), and catalogue page C corresponds to content pages C1, C2, C3 and C4 (with text feature vectors a, b, c and d respectively). It can be seen that content page B2 of catalogue page B does not exist among the content pages corresponding to catalogue pages A and C, and suppose that the length of content page B2 falls within the interval range corresponding to the average length of the content pages of catalogue page B. If catalogue page B is not capable of continuously contributing new chapters, content page B2 is determined to be a fake content page.
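The third and fourth methods differ only in the length condition, so they can be chained into a single decision, sketched below. The 50% tolerance, the parameter names and the function name is_fake_content_page are assumptions of this illustration. A result of False only means that neither rule marks the page as fake.

```python
def is_fake_content_page(exists_elsewhere, length, mean_length,
                         contributes_new_chapters, tolerance=0.5):
    """Apply the third and fourth rules to one content page."""
    if exists_elsewhere:
        # Both rules only apply to pages absent from the other catalogue pages.
        return False
    low = mean_length * (1 - tolerance)
    high = mean_length * (1 + tolerance)
    if not (low <= length <= high):
        return True  # third rule: unique page with an out-of-range length
    # Fourth rule: unique page of normal length on a catalogue page that
    # is not capable of continuously contributing new chapters.
    return not contributes_new_chapters

# Content page B2: unique to catalogue page B, normal length, B is no contributor.
b2_is_fake = is_fake_content_page(False, 2900, 3000, False)
# Content page C4: unique to catalogue page C, normal length, C is a contributor.
c4_is_fake = is_fake_content_page(False, 3100, 3050, True)
```

This ordering reflects that a unique page of normal length may simply be a latest chapter, which is why the fourth rule also checks whether the catalogue page is capable of continuously contributing new chapters.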
It should be noted that each of the four analysis methods above can be used on its own to analyze chapter completeness, and that any two or more of them can also be combined. For example, the first method may determine that the chapters corresponding to catalogue page B are incomplete, and the third or fourth method may then be used for further analysis to determine that content page B2 of catalogue page B is a fake content page, which makes the recognition result more accurate and objective. In addition, the above example (in which the websites are websites A, B and C, the corresponding catalogue pages of the chapter-based text are catalogue pages A, B and C, catalogue page A corresponds to content pages A1, A2 and A3, catalogue page B corresponds to content pages B1 and B2, and catalogue page C corresponds to content pages C1, C2, C3 and C4) is only illustrative and does not limit the present invention.
Based on the same inventive concept, an embodiment of the present invention further provides a device for identifying the chapter completeness of a chapter-based text, which implements the above method for identifying the chapter completeness of a chapter-based text.
Fig. 2 shows a schematic structural diagram of a device for identifying the chapter completeness of a chapter-based text according to an embodiment of the present invention. Referring to Fig. 2, the device includes at least an acquisition module 210, a determining module 220 and an identification module 230.
The functions of the components of the device of this embodiment, and the connections between them, are now described:
The acquisition module 210 is adapted to identify a catalogue page and multiple content pages of the chapter-based text from each of multiple websites, wherein each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages.
The determining module 220 is coupled to the acquisition module 210 and is adapted to determine, according to the multiple content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text on the different websites.
The identification module 230 is coupled to the determining module 220 and is adapted to analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and to identify the chapter completeness of each catalogue page in the catalogue page set according to the analysis result.
In one embodiment of the present invention, the determining module 220 is further adapted to: calculate the intersection between the content pages corresponding to every two catalogue pages and take it as the intersection of those two catalogue pages; and determine, according to the intersection of every two catalogue pages, the catalogue page set of the chapter-based text on the different websites.
In one embodiment of the present invention, the determining module 220 is further adapted to: extract the text feature vector of each content page among the content pages corresponding to the multiple catalogue pages; cluster the content pages having the same text feature vector to generate multiple content page groups; and calculate the intersection between the content pages corresponding to every two catalogue pages according to the multiple content page groups and the mapping between each catalogue page and its content pages.
In one embodiment of the present invention, the determining module 220 is further adapted to: merge every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold to obtain a merged result; and take the merged result as the catalogue page set of the chapter-based text on the different websites.
In one embodiment of the present invention, the identification module 230 is further adapted to: calculate the average number of elements in the intersections of every two catalogue pages in the catalogue page set; and, if the number of elements in the intersection of a given catalogue page with each of the other catalogue pages in the catalogue page set is less than the average, determine that the chapters corresponding to that catalogue page are incomplete.
In one embodiment of the present invention, the identification module 230 is further adapted to: if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, determine that the further content pages are the content pages of the latest chapters and that the catalogue page is capable of continuously contributing new chapters.
In one embodiment of the present invention, the identification module 230 is further adapted to: if a content page corresponding to a given catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page falls outside the interval range corresponding to the average length of the content pages of that catalogue page, determine that the content page is a fake content page.
In one embodiment of the present invention, the identification module 230 is further adapted to: if a content page corresponding to a given catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval range corresponding to the average length of the content pages of that catalogue page, and the catalogue page is not capable of continuously contributing new chapters, determine that the content page is a fake content page.
In one embodiment of the present invention, the acquisition module 210 is further adapted to: search multiple websites for web pages related to the chapter-based text; and identify the catalogue page and the multiple content pages of the chapter-based text from the web pages found.
In one embodiment of the present invention, the acquisition module 210 is further adapted to: parse the web pages found into a document object model (DOM) tree structure; classify each node in the DOM tree structure to determine the structural blocks of the web page; and extract the catalogue page and the multiple content pages of the chapter-based text according to the structural blocks.
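The parsing and node classification described for the acquisition module can be illustrated with the Python standard library. The sketch below builds a simplified DOM tree with html.parser and labels a node as a chapter list when it contains several links whose text dominates the node. This link-density heuristic, its thresholds and all class and function names are assumptions made for illustration; they stand in for whatever node classifier an implementation actually uses.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class Node:
    """One node of a simplified DOM tree."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""

class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML markup."""
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements cannot contain children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # ignore end tags with no matching start
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def text_length(node, links_only=False, inside_link=False):
    """Total text length under a node, optionally counting link text only."""
    inside_link = inside_link or node.tag == "a"
    own = len(node.text) if (inside_link or not links_only) else 0
    return own + sum(text_length(c, links_only, inside_link) for c in node.children)

def classify_node(node, min_links=3, min_link_ratio=0.5):
    """Label a node as a chapter list, a text block, or empty (heuristic)."""
    links = sum(1 for n in walk(node) if n.tag == "a")
    total = text_length(node)
    if total and links >= min_links and \
            text_length(node, links_only=True) / total >= min_link_ratio:
        return "chapter_list"
    return "text" if total else "empty"

def classify_page(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    labels = {classify_node(node) for node in walk(builder.root)}
    return "catalogue" if "chapter_list" in labels else "content"

catalogue_html = ("<html><body><div><a href='1'>Chapter 1</a>"
                  "<a href='2'>Chapter 2</a><a href='3'>Chapter 3</a></div></body></html>")
content_html = ("<html><body><div><p>The text of one chapter goes here.</p>"
                "<a href='next'>Next</a></div></body></html>")
```

In this sketch a page is treated as a catalogue page if any node is labelled as a chapter list, and as a content page otherwise.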
According to any one of the above preferred embodiments or a combination of several of them, the embodiments of the present invention can achieve the following beneficial effects:
According to the technical solution provided by the present invention, a catalogue page and multiple content pages of a chapter-based text are identified from each of multiple websites, and the catalogue page set of the chapter-based text on the different websites is then determined according to the multiple content pages corresponding to each catalogue page. Each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page are subsequently analyzed, and the chapter completeness of each catalogue page in the set is identified according to the analysis result. It can be seen that the present invention automates three stages: acquiring the data source (the catalogue pages and content pages of the chapter-based text on the multiple websites), determining the catalogue page set, and analyzing the catalogue page set. This solves the problem in the related art that judging chapter completeness with manually configured templates is inefficient. Moreover, the present invention can acquire the data source flexibly and then determine and analyze the catalogue page set, which solves the problem in the related art of responding too slowly to changes in website format. In addition, catalogue pages and content pages accurately and objectively reflect the chapter completeness of a chapter-based text. The present invention analyzes, in a targeted manner, each catalogue page in the catalogue page set of the chapter-based text on the different websites and/or the corresponding content pages, and identifies the chapter completeness of each catalogue page according to the analysis result, which makes the recognition result more accurate. In summary, the technical solution provided by the present invention can identify the chapter completeness of a chapter-based text flexibly and quickly, and the recognition result is accurate and objective.
The present invention also discloses:
A1. A method for identifying the chapter completeness of a chapter-based text, comprising:
identifying a catalogue page and multiple content pages of the chapter-based text from each of multiple websites, wherein each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
determining, according to the multiple content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text on the different websites;
analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the catalogue page set according to the analysis result.
A2. The method according to A1, wherein determining, according to the multiple content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text on the different websites comprises:
calculating the intersection between the content pages corresponding to every two catalogue pages, and taking it as the intersection of those two catalogue pages;
determining, according to the intersection of every two catalogue pages, the catalogue page set of the chapter-based text on the different websites.
A3. The method according to A1 or A2, wherein calculating the intersection between the content pages corresponding to every two catalogue pages comprises:
extracting the text feature vector of each content page among the content pages corresponding to the multiple catalogue pages;
clustering the content pages having the same text feature vector to generate multiple content page groups;
calculating the intersection between the content pages corresponding to every two catalogue pages according to the multiple content page groups and the mapping between each catalogue page and its content pages.
A4. The method according to any one of A1 to A3, wherein determining, according to the intersection of every two catalogue pages, the catalogue page set of the chapter-based text on the different websites comprises:
merging every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold to obtain a merged result;
taking the merged result as the catalogue page set of the chapter-based text on the different websites.
A5. The method according to any one of A1 to A4, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the catalogue page set according to the analysis result, comprises:
calculating the average number of elements in the intersections of every two catalogue pages in the catalogue page set;
if the number of elements in the intersection of a given catalogue page with each of the other catalogue pages in the catalogue page set is less than the average, determining that the chapters corresponding to that catalogue page are incomplete.
A6. The method according to any one of A1 to A5, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the catalogue page set according to the analysis result, comprises:
if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, determining that the further content pages are the content pages of the latest chapters and that the catalogue page is capable of continuously contributing new chapters.
A7. The method according to any one of A1 to A6, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the catalogue page set according to the analysis result, comprises:
if a content page corresponding to a given catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page falls outside the interval range corresponding to the average length of the content pages of that catalogue page, determining that the content page is a fake content page.
A8. The method according to any one of A1 to A7, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the catalogue page set according to the analysis result, comprises:
if a content page corresponding to a given catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval range corresponding to the average length of the content pages of that catalogue page, and the catalogue page is not capable of continuously contributing new chapters, determining that the content page is a fake content page.
A9. The method according to any one of A1 to A8, wherein identifying a catalogue page and multiple content pages of the chapter-based text from each of multiple websites comprises:
searching multiple websites for web pages related to the chapter-based text;
identifying the catalogue page and the multiple content pages of the chapter-based text from the web pages found.
A10. The method according to any one of A1 to A9, wherein identifying the catalogue page and the multiple content pages of the chapter-based text from the web pages found comprises:
parsing the web pages found into a document object model (DOM) tree structure;
classifying each node in the DOM tree structure to determine the structural blocks of the web page;
extracting the catalogue page and the multiple content pages of the chapter-based text according to the structural blocks.
B11. A device for identifying the chapter completeness of a chapter-based text, comprising:
an acquisition module adapted to identify a catalogue page and multiple content pages of the chapter-based text from each of multiple websites, wherein each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
a determining module adapted to determine, according to the multiple content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text on the different websites;
an identification module adapted to analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and to identify the chapter completeness of each catalogue page in the catalogue page set according to the analysis result.
B12. The device according to B11, wherein the determining module is further adapted to:
calculate the intersection between the content pages corresponding to every two catalogue pages, and take it as the intersection of those two catalogue pages;
determine, according to the intersection of every two catalogue pages, the catalogue page set of the chapter-based text on the different websites.
B13. The device according to B11 or B12, wherein the determining module is further adapted to:
extract the text feature vector of each content page among the content pages corresponding to the multiple catalogue pages;
cluster the content pages having the same text feature vector to generate multiple content page groups;
calculate the intersection between the content pages corresponding to every two catalogue pages according to the multiple content page groups and the mapping between each catalogue page and its content pages.
B14. The device according to any one of B11 to B13, wherein the determining module is further adapted to:
merge every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold to obtain a merged result;
take the merged result as the catalogue page set of the chapter-based text on the different websites.
B15. The device according to any one of B11 to B14, wherein the identification module is further adapted to:
calculate the average number of elements in the intersections of every two catalogue pages in the catalogue page set;
if the number of elements in the intersection of a given catalogue page with each of the other catalogue pages in the catalogue page set is less than the average, determine that the chapters corresponding to that catalogue page are incomplete.
B16. The device according to any one of B11 to B15, wherein the identification module is further adapted to:
if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, determine that the further content pages are the content pages of the latest chapters and that the catalogue page is capable of continuously contributing new chapters.
B17. The device according to any one of B11 to B16, wherein the identification module is further adapted to:
if a content page corresponding to a given catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page falls outside the interval range corresponding to the average length of the content pages of that catalogue page, determine that the content page is a fake content page.
B18. The device according to any one of B11 to B17, wherein the identification module is further adapted to:
if a content page corresponding to a given catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval range corresponding to the average length of the content pages of that catalogue page, and the catalogue page is not capable of continuously contributing new chapters, determine that the content page is a fake content page.
B19. The device according to any one of B11 to B18, wherein the acquisition module is further adapted to:
search multiple websites for web pages related to the chapter-based text;
identify the catalogue page and the multiple content pages of the chapter-based text from the web pages found.
B20. The device according to any one of B11 to B19, wherein the acquisition module is further adapted to:
parse the web pages found into a document object model (DOM) tree structure;
classify each node in the DOM tree structure to determine the structural blocks of the web page;
extract the catalogue page and the multiple content pages of the chapter-based text according to the structural blocks.
In the description provided herein, numerous specific details are set forth. It will be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure or description thereof in the above description of exemplary embodiments of the invention. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of the embodiments may be combined into one module, unit or component, and may furthermore be divided into multiple sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
Furthermore, those skilled in the art will appreciate that although some embodiments described herein include some features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for identifying the chapter completeness of a chapter-based text according to embodiments of the present invention. The present invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order. These words may be interpreted as names.
Thus, those skilled in the art will recognize that, although multiple exemplary embodiments of the present invention have been shown and described in detail herein, many other variations or modifications consistent with the principles of the invention can be directly determined or derived from the present disclosure without departing from the spirit and scope of the invention. Accordingly, the scope of the present invention should be understood and deemed to cover all such other variations or modifications.
Claims (18)
1. A method for identifying the chapter completeness of chapter-based text, comprising:
identifying, from each of multiple websites, a catalogue page and multiple content pages of the chapter-based text, wherein each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
determining, according to the multiple content pages corresponding to each catalogue page, a catalogue page set of the chapter-based text on different websites;
analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying, according to the analysis result, the chapter completeness of each catalogue page in the catalogue page set;
wherein identifying, from each of multiple websites, the catalogue page and multiple content pages of the chapter-based text comprises: searching the multiple websites for webpages related to the chapter-based text; and identifying the catalogue page and multiple content pages of the chapter-based text from the webpages found.
2. The method according to claim 1, wherein determining, according to the multiple content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text on different websites comprises:
computing the intersection between the content pages corresponding to every two catalogue pages, and taking it as the intersection of those two catalogue pages;
determining, according to the intersection of every two catalogue pages, the catalogue page set of the chapter-based text on different websites.
3. The method according to claim 2, wherein computing the intersection between the content pages corresponding to every two catalogue pages comprises:
extracting a text feature vector for each content page among the content pages corresponding to the multiple catalogue pages;
clustering the content pages that have the same text feature vector to generate multiple content page groups;
computing, according to the mapping between the multiple content page groups and the content pages corresponding to each catalogue page, the intersection between the content pages corresponding to every two catalogue pages.
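The grouping and intersection steps of claim 3 can be illustrated with a minimal sketch. It assumes, purely for illustration, that the "text feature vector" is approximated by a hash of the whitespace-normalised page text; the names `text_feature` and `pairwise_intersections` are hypothetical and do not come from the patent.

```python
import hashlib
from itertools import combinations


def text_feature(text: str) -> str:
    """Hypothetical stand-in for a text feature vector: hash of normalised text."""
    normalised = " ".join(text.split())
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


def pairwise_intersections(catalogues: dict[str, list[str]]) -> dict[tuple[str, str], set[str]]:
    """catalogues maps a catalogue page URL to the texts of its content pages.

    Content pages with the same feature fall into the same group, so each
    catalogue page is represented by the set of group keys it covers.
    """
    groups_by_catalogue = {
        url: {text_feature(text) for text in texts}
        for url, texts in catalogues.items()
    }
    return {
        (a, b): groups_by_catalogue[a] & groups_by_catalogue[b]
        for a, b in combinations(sorted(groups_by_catalogue), 2)
    }


catalogues = {
    "site-a/toc": ["Chapter 1  text", "Chapter 2 text"],
    "site-b/toc": ["Chapter 1 text", "Chapter 3 text"],
}
shared = pairwise_intersections(catalogues)
print(len(shared[("site-a/toc", "site-b/toc")]))
```

A real feature vector would normally tolerate small textual differences between sites; exact hashing is used here only to keep the sketch short.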
4. The method according to claim 2, wherein determining, according to the intersection of every two catalogue pages, the catalogue page set of the chapter-based text on different websites comprises:
merging every two catalogue pages whose intersection contains a number of elements greater than or equal to a predetermined threshold, to obtain a merge result;
taking the merge result as the catalogue page set of the chapter-based text on different websites.
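The merging step of claim 4 can be sketched as follows. The sketch assumes that merging is transitive (if A merges with B and B with C, all three end up in one set) and implements this with a union-find structure; that choice, the function name `merge_catalogues` and the sample threshold are assumptions, not details stated in the claim.

```python
def merge_catalogues(intersection_sizes: dict[tuple[str, str], int], threshold: int) -> list[set[str]]:
    """Group catalogue pages whose pairwise intersection size reaches the threshold."""
    parent: dict[str, str] = {}

    def find(page: str) -> str:
        parent.setdefault(page, page)
        while parent[page] != page:
            parent[page] = parent[parent[page]]  # path compression
            page = parent[page]
        return page

    for (a, b), size in intersection_sizes.items():
        root_a, root_b = find(a), find(b)
        if size >= threshold:
            parent[root_a] = root_b  # merge the two groups

    clusters: dict[str, set[str]] = {}
    for page in list(parent):
        clusters.setdefault(find(page), set()).add(page)
    return list(clusters.values())


sizes = {("a", "b"): 5, ("b", "c"): 4, ("a", "c"): 0, ("a", "d"): 1}
print(sorted(sorted(group) for group in merge_catalogues(sizes, threshold=3)))
```

Each returned set is a candidate catalogue page set for one text; catalogue pages that never reach the threshold stay in singleton sets.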
5. The method according to any one of claims 1 to 4, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying, according to the analysis result, the chapter completeness of each catalogue page in the catalogue page set, comprises:
computing the average number of elements in the intersections of every two catalogue pages in the catalogue page set;
if the numbers of elements in the intersections of a certain catalogue page with multiple other catalogue pages in the catalogue page set are all less than the average, determining that the chapters corresponding to that catalogue page are incomplete.
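As an illustration of the average-based test in claim 5, the sketch below flags a catalogue page when every one of its pairwise intersection sizes is below the mean over all pairs. The function name `incomplete_catalogues` and the sample numbers are hypothetical.

```python
from statistics import mean


def incomplete_catalogues(intersection_sizes: dict[tuple[str, str], int]) -> set[str]:
    """Return catalogue pages whose intersections with all other pages are below average."""
    if not intersection_sizes:
        return set()
    average = mean(list(intersection_sizes.values()))

    sizes_per_page: dict[str, list[int]] = {}
    for (a, b), size in intersection_sizes.items():
        sizes_per_page.setdefault(a, []).append(size)
        sizes_per_page.setdefault(b, []).append(size)

    return {
        page
        for page, sizes in sizes_per_page.items()
        if all(size < average for size in sizes)
    }


sizes = {("a", "b"): 100, ("a", "c"): 40, ("b", "c"): 35}
print(incomplete_catalogues(sizes))
```

In this sample the mean is about 58.3, and only page "c" shares fewer than that many content pages with every other catalogue page, so it alone is flagged.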
6. The method according to any one of claims 1 to 4, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying, according to the analysis result, the chapter completeness of each catalogue page in the catalogue page set, comprises:
if the content pages corresponding to a certain catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and other content pages also exist among the content pages corresponding to that catalogue page, determining that the other content pages are content pages of the newest chapters, and that the catalogue page has the ability to continuously contribute new chapters.
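The superset test of claim 6 can be sketched with plain set operations. Content pages are represented by group identifiers, and "multiple other catalogue pages" is read as at least two; both the `min_others` parameter and the function name `newest_chapters` are assumptions made for this example.

```python
def newest_chapters(page: str, content: dict[str, set[str]], min_others: int = 2) -> set[str]:
    """Return content pages of `page` that look like newly published chapters.

    `content` maps each catalogue page to the set of content page identifiers
    it links to. An empty result means the condition of the claim is not met.
    """
    own = content[page]
    covered = [
        other for other, pages in content.items()
        if other != page and pages <= own  # other page is fully contained in this one
    ]
    if len(covered) < min_others:
        return set()
    seen_elsewhere = set().union(*(content[other] for other in covered))
    return own - seen_elsewhere


content = {
    "a": {"c1", "c2", "c3", "c4"},
    "b": {"c1", "c2", "c3"},
    "c": {"c1", "c2"},
}
print(newest_chapters("a", content))
```

A non-empty result also marks the catalogue page as one that keeps contributing new chapters, which is the property the rule for false content pages relies on.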
7. The method according to any one of claims 1 to 4, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying, according to the analysis result, the chapter completeness of each catalogue page in the catalogue page set, comprises:
if a content page corresponding to a certain catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of the content page does not fall within the interval range corresponding to the average length of the content pages corresponding to that catalogue page, determining that the content page is a false content page.
8. The method according to any one of claims 1 to 4, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying, according to the analysis result, the chapter completeness of each catalogue page in the catalogue page set, comprises:
if a content page corresponding to a certain catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of the content page falls within the interval range corresponding to the average length of the content pages corresponding to that catalogue page, and the catalogue page does not have the ability to continuously contribute new chapters, determining that the content page is a false content page.
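The two false-content-page rules of claims 7 and 8 can be combined into one hedged sketch. The claims do not say how the "interval range corresponding to the average length" is derived, so the sketch assumes a symmetric band of plus or minus a `tolerance` fraction around the mean; that parameter and the function name `is_false_content_page` are hypothetical.

```python
from statistics import mean


def is_false_content_page(
    page_id: str,
    lengths: dict[str, int],
    seen_elsewhere: set[str],
    contributes_new_chapters: bool,
    tolerance: float = 0.5,
) -> bool:
    """Apply the rules of claims 7 and 8 to one content page of a catalogue page.

    lengths: text length of every content page of this catalogue page.
    seen_elsewhere: content pages also found under other catalogue pages in the set.
    """
    if page_id in seen_elsewhere:
        return False  # corroborated by another site

    average = mean(list(lengths.values()))
    low, high = average * (1 - tolerance), average * (1 + tolerance)
    if not low <= lengths[page_id] <= high:
        return True  # claim 7: unique page with atypical length
    # claim 8: typical length, but the catalogue page is not a source of new chapters
    return not contributes_new_chapters


lengths = {"c1": 3000, "c2": 3200, "c3": 2800, "ad": 200}
print(is_false_content_page("ad", lengths, {"c1", "c2"}, True))
print(is_false_content_page("c3", lengths, {"c1", "c2"}, True))
```

With these sample lengths the mean is 2300, so the 200-character page falls outside the band and is flagged, while "c3" is accepted as a plausible new chapter because the catalogue page is assumed to contribute new chapters.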
9. The method according to any one of claims 1 to 4, wherein identifying the catalogue page and multiple content pages of the chapter-based text from the webpages found comprises:
parsing the webpages found into a Document Object Model (DOM) tree structure;
classifying each node in the DOM tree structure to determine the structural blocks of the webpage;
extracting the catalogue page and multiple content pages of the chapter-based text according to the structural blocks.
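The parse-and-classify idea of claim 9 can be sketched with Python's standard `html.parser`. The classifier below is a swapped-in heuristic, not the patent's classifier: it labels a block by the share of its text that sits inside links, on the assumption that catalogue pages are dominated by chapter links and content pages by body text. The `link_ratio` threshold and all class and function names are hypothetical.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        self.text_len = 0  # characters of text inside this subtree
        self.link_len = 0  # characters of that text lying inside <a> elements


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree; assumes well-nested HTML."""

    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.root = self.current = Node("root")
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == self.current.tag and self.current.parent is not None:
            if tag == "a":
                self.link_depth -= 1
            self.current = self.current.parent

    def handle_data(self, data):
        size = len(data.strip())
        node = self.current
        while node is not None:  # propagate counts up to the root
            node.text_len += size
            if self.link_depth:
                node.link_len += size
            node = node.parent


def classify(node, link_ratio=0.6):
    """Label a node as a link-dominated block, a text block, or empty."""
    if node.text_len == 0:
        return "empty"
    return "link_block" if node.link_len / node.text_len >= link_ratio else "text_block"


def page_blocks(html):
    """Parse the page and classify each top-level node."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [(child.tag, classify(child)) for child in builder.root.children]


toc = '<div><a href="/1">Chapter 1</a><a href="/2">Chapter 2</a></div>'
body = "<div><p>" + "word " * 50 + '</p><a href="/next">Next</a></div>'
print(page_blocks(toc), page_blocks(body))
```

A page whose main block is a `link_block` would be treated as a catalogue page candidate and one whose main block is a `text_block` as a content page candidate; a trained classifier over richer node features could replace `classify` without changing the surrounding structure.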
10. A device for identifying the chapter completeness of chapter-based text, comprising:
an acquisition module, adapted to identify, from each of multiple websites, a catalogue page and multiple content pages of the chapter-based text, wherein each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
a determining module, adapted to determine, according to the multiple content pages corresponding to each catalogue page, a catalogue page set of the chapter-based text on different websites;
an identification module, adapted to analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and to identify, according to the analysis result, the chapter completeness of each catalogue page in the catalogue page set;
wherein the acquisition module is further adapted to: search the multiple websites for webpages related to the chapter-based text; and identify the catalogue page and multiple content pages of the chapter-based text from the webpages found.
11. The device according to claim 10, wherein the determining module is further adapted to:
compute the intersection between the content pages corresponding to every two catalogue pages, and take it as the intersection of those two catalogue pages;
determine, according to the intersection of every two catalogue pages, the catalogue page set of the chapter-based text on different websites.
12. The device according to claim 11, wherein the determining module is further adapted to:
extract a text feature vector for each content page among the content pages corresponding to the multiple catalogue pages;
cluster the content pages that have the same text feature vector to generate multiple content page groups;
compute, according to the mapping between the multiple content page groups and the content pages corresponding to each catalogue page, the intersection between the content pages corresponding to every two catalogue pages.
13. The device according to claim 11, wherein the determining module is further adapted to:
merge every two catalogue pages whose intersection contains a number of elements greater than or equal to a predetermined threshold, to obtain a merge result;
take the merge result as the catalogue page set of the chapter-based text on different websites.
14. The device according to any one of claims 10 to 13, wherein the identification module is further adapted to:
compute the average number of elements in the intersections of every two catalogue pages in the catalogue page set;
if the numbers of elements in the intersections of a certain catalogue page with multiple other catalogue pages in the catalogue page set are all less than the average, determine that the chapters corresponding to that catalogue page are incomplete.
15. The device according to any one of claims 10 to 13, wherein the identification module is further adapted to:
if the content pages corresponding to a certain catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and other content pages also exist among the content pages corresponding to that catalogue page, determine that the other content pages are content pages of the newest chapters, and that the catalogue page has the ability to continuously contribute new chapters.
16. The device according to any one of claims 10 to 13, wherein the identification module is further adapted to:
if a content page corresponding to a certain catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of the content page does not fall within the interval range corresponding to the average length of the content pages corresponding to that catalogue page, determine that the content page is a false content page.
17. The device according to any one of claims 10 to 13, wherein the identification module is further adapted to:
if a content page corresponding to a certain catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of the content page falls within the interval range corresponding to the average length of the content pages corresponding to that catalogue page, and the catalogue page does not have the ability to continuously contribute new chapters, determine that the content page is a false content page.
18. The device according to any one of claims 10 to 13, wherein the acquisition module is further adapted to:
parse the webpages found into a Document Object Model (DOM) tree structure;
classify each node in the DOM tree structure to determine the structural blocks of the webpage;
extract the catalogue page and multiple content pages of the chapter-based text according to the structural blocks.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410578534.2A CN104317903B (en) | 2014-10-24 | 2014-10-24 | Method and device for identifying the chapter completeness of chapter-based text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104317903A (en) | 2015-01-28 |
CN104317903B (en) | 2017-10-13 |
Family
ID=52373135
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410578534.2A (Active) CN104317903B (en) | 2014-10-24 | 2014-10-24 | Method and device for identifying the chapter completeness of chapter-based text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104317903B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106033405B (en) * | 2015-03-10 | 2020-06-05 | 腾讯科技(深圳)有限公司 | Network book catalog integrity detection method and device |
CN105447130B (en) * | 2015-11-18 | 2018-12-25 | 北京奇虎科技有限公司 | The acquisition methods and device of the new chapters and sections of the network novel |
CN113407889B (en) * | 2021-07-15 | 2023-10-20 | 北京百度网讯科技有限公司 | Novel transcoding method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7158962B2 (en) * | 2002-11-27 | 2007-01-02 | International Business Machines Corporation | System and method for automatically linking items with multiple attributes to multiple levels of folders within a content management system |
CN103123640A (en) * | 2012-02-22 | 2013-05-29 | 深圳市谷古科技有限公司 | Method and device for searching novel |
CN103310160A (en) * | 2013-06-20 | 2013-09-18 | 北京神州绿盟信息安全科技股份有限公司 | Method, system and device for preventing webpage from being tampered with |
CN103365877A (en) * | 2012-03-29 | 2013-10-23 | 百度在线网络技术(北京)有限公司 | Method and server for making directory after webpage is transcoded |
US8631029B1 (en) * | 2010-03-26 | 2014-01-14 | A9.Com, Inc. | Evolutionary content determination and management |
CN103577566A (en) * | 2013-10-25 | 2014-02-12 | 北京奇虎科技有限公司 | Web reading content loading method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 20220727; Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015; Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.; Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park); Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.; Patentee before: Qizhi software (Beijing) Co.,Ltd. |