CN104317903B - Method and device for identifying the chapter completeness of chapter-based text - Google Patents

Method and device for identifying the chapter completeness of chapter-based text

Info

Publication number
CN104317903B
Authority
CN
China
Prior art keywords
catalogue page
catalogue
content pages
chapters
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410578534.2A
Other languages
Chinese (zh)
Other versions
CN104317903A (en)
Inventor
魏少俊
郑燕琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201410578534.2A
Publication of CN104317903A
Application granted
Publication of CN104317903B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for identifying the chapter completeness of chapter-based text. The method includes: identifying a catalogue page and multiple content pages of the chapter-based text from each of multiple sites, where each site corresponds to one catalogue page and each catalogue page corresponds to multiple content pages; determining, according to the content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text across the different sites; and analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the set according to the analysis result. The technical solution provided by the invention can identify the chapter completeness of chapter-based text flexibly and quickly, and the identification result is accurate and objective.

Description

Method and device for identifying the chapter completeness of chapter-based text
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and a device for identifying the chapter completeness of chapter-based text.
Background art
With the growing popularity of computers and computer networks, the Internet has reached every area of people's work, study, and daily life, and has become an important channel for publishing and obtaining information.
At present, chapter-based text is abundant on the Internet, and a single text may be reprinted by many different websites. Because of various objective factors during reprinting, the content of the text on some sites may be incomplete, or may even be false. Take novels as an example. Reading novels is a strong demand among Internet users, and it accounts for an especially large share of demand on mobile devices. Novel websites exist in large numbers and vary greatly in quality. The same web novel may be reprinted by many different sites, but due to objective factors its content on some sites may be incomplete (for example, chapters are missing) or even false (false chapters are pieced together). When a search engine indexes these novel websites, it needs to judge the chapter completeness of the novel so that it can, as far as possible, present sites with complete content to users, improving the quality of the novel content users obtain and improving the user experience.
In the related art, chapter completeness is judged by manually configuring templates for different novel websites. Although this approach is highly accurate, its drawbacks are obvious: the number of sites that manual effort can cover is limited, the approach is not intelligent enough, and it does not respond in time to changes in site format. How to identify the chapter completeness of chapter-based text flexibly, quickly, and accurately has therefore become a technical problem that urgently needs to be solved.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method for identifying the chapter completeness of chapter-based text, and a corresponding device, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for identifying the chapter completeness of chapter-based text is provided, including: identifying a catalogue page and multiple content pages of the chapter-based text from each of multiple sites, where each site corresponds to one catalogue page and each catalogue page corresponds to multiple content pages; determining, according to the content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text across the different sites; and analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the set according to the analysis result.
Optionally, determining the catalogue page set of the chapter-based text across the different sites according to the content pages corresponding to each catalogue page includes: calculating the intersection between the content pages corresponding to every two catalogue pages, and taking it as the intersection of those two catalogue pages; and determining the catalogue page set of the chapter-based text across the different sites according to the intersection of every two catalogue pages.
Optionally, calculating the intersection between the content pages corresponding to every two catalogue pages includes: extracting the text feature vector of each content page among the content pages corresponding to the multiple catalogue pages; clustering the content pages that have the same text feature vector to generate multiple content page groups; and calculating the intersection between the content pages corresponding to every two catalogue pages according to the multiple content page groups and the mapping between each catalogue page and its content pages.
Optionally, determining the catalogue page set of the chapter-based text across the different sites according to the intersection of every two catalogue pages includes: merging every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold, to obtain a merged result; and taking the merged result as the catalogue page set of the chapter-based text across the different sites.
Optionally, analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the set according to the analysis result, includes: calculating the average number of elements in the intersections of every two catalogue pages in the catalogue page set; and, if the numbers of elements in the intersections of a given catalogue page with multiple other catalogue pages in the set are all smaller than the average, determining that the chapters corresponding to that catalogue page are incomplete.
Optionally, analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the set according to the analysis result, includes: if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the set, and other content pages also exist among the content pages corresponding to that catalogue page, determining that those other content pages are content pages of the latest chapters and that the catalogue page has the ability to continuously contribute new chapters.
Optionally, analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the set according to the analysis result, includes: if a content page corresponding to a given catalogue page does not exist among the content pages corresponding to the other catalogue pages in the set, and the length of that content page falls outside the interval range corresponding to the average length of the content pages of that catalogue page, determining that the content page is a false content page.
Optionally, analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the set according to the analysis result, includes: if a content page corresponding to a given catalogue page does not exist among the content pages corresponding to the other catalogue pages in the set, the length of that content page falls within the interval range corresponding to the average length of the content pages of that catalogue page, and the catalogue page does not have the ability to continuously contribute new chapters, determining that the content page is a false content page.
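The two false-content-page rules above can be combined into a single check. The following is a minimal sketch under stated assumptions, not the patented implementation: the function name, the input shapes, and the `tolerance` of 50% around the mean (standing in for the "interval range corresponding to the average length", which the text does not define) are all hypothetical.

```python
from statistics import mean


def false_content_pages(lengths, shared, can_publish_new, tolerance=0.5):
    """Flag content pages of one catalogue page that look false.

    lengths: {content page id: text length} for this catalogue page.
    shared: ids of content pages that also appear under other catalogue pages.
    can_publish_new: whether this catalogue page is judged able to
        continuously contribute new chapters.
    tolerance: ASSUMED width of the allowed range around the mean length.
    """
    average = mean(lengths.values())
    low, high = average * (1 - tolerance), average * (1 + tolerance)
    flagged = set()
    for page, length in lengths.items():
        if page in shared:
            continue  # also present elsewhere, so not suspicious
        in_range = low <= length <= high
        # Rule 1: unique page with unusual length.
        # Rule 2: unique page with normal length on a catalogue page
        #         that cannot contribute new chapters.
        if not in_range or not can_publish_new:
            flagged.add(page)
    return flagged


# Hypothetical lengths; the mean is 2300, so the range is [1150, 3450].
lengths = {"P1": 3000, "P2": 3200, "P3": 2800, "P4": 200}
shared = {"P1", "P2"}
```

With this data, only P4 is flagged when the catalogue page can contribute new chapters, while both P3 and P4 are flagged when it cannot.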
Optionally, identifying the catalogue page and multiple content pages of the chapter-based text from each of multiple sites includes: retrieving web pages related to the chapter-based text from the multiple sites; and identifying the catalogue page and multiple content pages of the chapter-based text from the retrieved web pages.
Optionally, identifying the catalogue page and multiple content pages of the chapter-based text from the retrieved web pages includes: parsing the retrieved web pages into a document object model (DOM) tree structure; classifying each node in the DOM tree structure to determine the structural blocks of the web page; and extracting the catalogue page and multiple content pages of the chapter-based text according to the structural blocks.
According to another aspect of the present invention, a device for identifying the chapter completeness of chapter-based text is also provided, including:
an acquisition module, adapted to identify a catalogue page and multiple content pages of the chapter-based text from each of multiple sites, where each site corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
a determining module, adapted to determine, according to the content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text across the different sites; and
an identification module, adapted to analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and to identify the chapter completeness of each catalogue page in the set according to the analysis result.
Optionally, the determining module is further adapted to: calculate the intersection between the content pages corresponding to every two catalogue pages, and take it as the intersection of those two catalogue pages; and determine the catalogue page set of the chapter-based text across the different sites according to the intersection of every two catalogue pages.
Optionally, the determining module is further adapted to: extract the text feature vector of each content page among the content pages corresponding to the multiple catalogue pages; cluster the content pages that have the same text feature vector to generate multiple content page groups; and calculate the intersection between the content pages corresponding to every two catalogue pages according to the multiple content page groups and the mapping between each catalogue page and its content pages.
Optionally, the determining module is further adapted to: merge every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold, to obtain a merged result; and take the merged result as the catalogue page set of the chapter-based text across the different sites.
Optionally, the identification module is further adapted to: calculate the average number of elements in the intersections of every two catalogue pages in the catalogue page set; and, if the numbers of elements in the intersections of a given catalogue page with multiple other catalogue pages in the set are all smaller than the average, determine that the chapters corresponding to that catalogue page are incomplete.
Optionally, the identification module is further adapted to: if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the set, and other content pages also exist among the content pages corresponding to that catalogue page, determine that those other content pages are content pages of the latest chapters and that the catalogue page has the ability to continuously contribute new chapters.
Optionally, the identification module is further adapted to: if a content page corresponding to a given catalogue page does not exist among the content pages corresponding to the other catalogue pages in the set, and the length of that content page falls outside the interval range corresponding to the average length of the content pages of that catalogue page, determine that the content page is a false content page.
Optionally, the identification module is further adapted to: if a content page corresponding to a given catalogue page does not exist among the content pages corresponding to the other catalogue pages in the set, the length of that content page falls within the interval range corresponding to the average length of the content pages of that catalogue page, and the catalogue page does not have the ability to continuously contribute new chapters, determine that the content page is a false content page.
Optionally, the acquisition module is further adapted to: retrieve web pages related to the chapter-based text from multiple sites; and identify the catalogue page and multiple content pages of the chapter-based text from the retrieved web pages.
Optionally, the acquisition module is further adapted to: parse the retrieved web pages into a document object model (DOM) tree structure; classify each node in the DOM tree structure to determine the structural blocks of the web page; and extract the catalogue page and multiple content pages of the chapter-based text according to the structural blocks.
According to the technical solution provided by the present invention, a catalogue page and multiple content pages of the chapter-based text are identified from each of multiple sites, and the catalogue page set of the chapter-based text across the different sites is then determined according to the content pages corresponding to each catalogue page. Each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page are subsequently analyzed, and the chapter completeness of each catalogue page in the set is identified according to the analysis result. The present invention thus automates three things: acquiring the data source (the catalogue pages and content pages of the chapter-based text on multiple sites), determining the catalogue page set, and analyzing the catalogue page set. This solves the low efficiency caused in the related art by judging chapter completeness with manually configured templates. Moreover, the present invention can obtain the data source flexibly, then determine and analyze the catalogue page set, which solves the related art's failure to respond in time to changes in site format. In addition, catalogue pages and content pages reflect the chapter completeness of chapter-based text accurately and objectively. The present invention analyzes, in a targeted way, each catalogue page in the catalogue page set of the chapter-based text across different sites and/or the content pages corresponding to each catalogue page, and identifies the chapter completeness of each catalogue page in the set according to the analysis result, which makes the identification result more accurate. In summary, the technical solution provided by the present invention can identify the chapter completeness of chapter-based text flexibly and quickly, and the identification result is accurate and objective.
The above description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the content of the specification, and in order that the above and other objects, features, and advantages of the present invention may be more readily apparent, specific embodiments of the present invention are set out below.
From the following detailed description of specific embodiments of the present invention in conjunction with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages, and features of the present invention.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for identifying the chapter completeness of chapter-based text according to an embodiment of the present invention; and
Fig. 2 shows a schematic structural diagram of a device for identifying the chapter completeness of chapter-based text according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
To solve the above technical problem, an embodiment of the present invention provides a method for identifying the chapter completeness of chapter-based text. Fig. 1 shows a flowchart of the method according to an embodiment of the present invention. As shown in Fig. 1, the method includes at least the following steps S102 to S106.
Step S102: identify a catalogue page and multiple content pages of the chapter-based text from each of multiple sites, where each site corresponds to one catalogue page and each catalogue page corresponds to multiple content pages.
Step S104: determine, according to the content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text across the different sites.
Step S106: analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identify the chapter completeness of each catalogue page in the set according to the analysis result.
According to the technical solution provided by the present invention, a catalogue page and multiple content pages of the chapter-based text are identified from each of multiple sites, and the catalogue page set of the chapter-based text across the different sites is then determined according to the content pages corresponding to each catalogue page. Each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page are subsequently analyzed, and the chapter completeness of each catalogue page in the set is identified according to the analysis result. The present invention thus automates three things: acquiring the data source (the catalogue pages and content pages of the chapter-based text on multiple sites), determining the catalogue page set, and analyzing the catalogue page set. This solves the low efficiency caused in the related art by judging chapter completeness with manually configured templates. Moreover, the present invention can obtain the data source flexibly, then determine and analyze the catalogue page set, which solves the related art's failure to respond in time to changes in site format. In addition, catalogue pages and content pages reflect the chapter completeness of chapter-based text accurately and objectively. The present invention analyzes, in a targeted way, each catalogue page in the catalogue page set of the chapter-based text across different sites and/or the content pages corresponding to each catalogue page, and identifies the chapter completeness of each catalogue page in the set according to the analysis result, which makes the identification result more accurate. In summary, the technical solution provided by the present invention can identify the chapter completeness of chapter-based text flexibly and quickly, and the identification result is accurate and objective.
The chapter-based text mentioned in step S102 above is text composed of a number of chapters, such as a novel or a paper. A catalogue page is the table of contents of the chapter-based text; for example, when a user searches for a novel, what the user usually looks for is the novel's catalogue page. A content page is the specific content of one chapter of the chapter-based text. The present invention provides a preferred scheme for identifying the catalogue page and multiple content pages of chapter-based text from multiple sites: web pages related to the chapter-based text are retrieved from the multiple sites, and the catalogue page and multiple content pages of the chapter-based text are then identified from the retrieved web pages.
Further, the catalogue page and content pages of the chapter-based text may be identified and extracted from the retrieved web pages using manually written rules. Alternatively, identification may be based on annotated templates: for each extraction, the best-matching template is found in a template library, and that template is then used to identify and extract the catalogue page and content pages. In addition, to improve identification efficiency, the present invention may parse the retrieved web pages into a document object model (DOM) tree structure and classify each node in the DOM tree structure to determine the structural blocks of the web page, and then extract the catalogue page and content pages of the chapter-based text according to the structural blocks. A preferred scheme for classifying the nodes of the DOM tree structure to determine the structural blocks of the web page is provided here. In this scheme, the DOM tree structure may be traversed to obtain the content of each node, the content of each node is then input into a decision tree according to preset rules, and the decision tree classifies each node. Alternatively, the DOM tree structure may be traversed to obtain the dimensional features of each node, the dimensional features of each node are then input into the decision tree according to preset rules, and the decision tree classifies each node.
The decision tree is trained on statistics of the various dimensional features of blocks of known types, so that it can derive the block type of each node from that node's dimensional features. The scheme in which the decision tree classifies each node in the DOM tree structure of a web page to determine the structural blocks of the web page is described in detail below.
First, the dimensional features used for blocking are determined. In the embodiments of the present invention, up to 105 dimensional features may be used, mainly covering the following: text length, number of hyperlinks, hyperlink text length, highlighted text length (including enlarged and bold text), Chinese character length, English character length, numeric character length, specific keywords, specific punctuation marks, and so on. That is, a block of a given type can be determined by one or more of the 105 dimensional features taking specific values. It should be noted that the dimensional features determined according to the actual situation are not limited to 105 and can be expanded later.
Second, the dimensional features determined for blocking are input into the decision tree and used to train and build it.
Third, the content of each node in the DOM tree structure of the web page is input into the decision tree according to preset rules. The decision tree analyzes the content of each node to obtain its dimensional features, and then classifies each node according to those features.
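To make the per-node features concrete, the following sketch walks a page with Python's standard `html.parser` and collects four of the features named above (text length, number of hyperlinks, hyperlink text length, numeric character length). It is an illustration only: the class name, the hand-written `classify` rule that stands in for a trained decision tree, its thresholds, and the block labels are all assumptions, and the parser assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser


class NodeFeatureParser(HTMLParser):
    """Collect a few per-element features while walking the HTML."""

    def __init__(self):
        super().__init__()
        self.stack = []  # elements that are currently open
        self.nodes = []  # feature dicts of elements already closed

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "text_len": 0, "links": 0,
                "link_text_len": 0, "digits": 0}
        if tag == "a":
            node["links"] = 1
            for ancestor in self.stack:
                ancestor["links"] += 1
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag not in [node["tag"] for node in self.stack]:
            return  # stray closing tag, ignore it
        while self.stack:
            node = self.stack.pop()
            self.nodes.append(node)
            if node["tag"] == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        inside_link = any(node["tag"] == "a" for node in self.stack)
        for node in self.stack:  # text counts for every open ancestor
            node["text_len"] += len(text)
            node["digits"] += sum(ch.isdigit() for ch in text)
            if inside_link:
                node["link_text_len"] += len(text)


def classify(node):
    """Toy stand-in for the trained decision tree (thresholds are made up)."""
    if node["links"] >= 3 and node["link_text_len"] >= 0.8 * max(node["text_len"], 1):
        return "chapter-list"
    if node["text_len"] >= 50 and node["links"] == 0:
        return "chapter-body"
    return "other"


html = ('<div><a href="/1">Chapter 1</a><a href="/2">Chapter 2</a>'
        '<a href="/3">Chapter 3</a></div>'
        '<p>' + "word " * 20 + '</p>')
parser = NodeFeatureParser()
parser.feed(html)
parser.close()
labels = [(node["tag"], classify(node)) for node in parser.nodes]
```

In this sample the link-heavy `div` is labelled as a chapter list and the long, link-free `p` as a chapter body. A real system would replace `classify` with a decision tree trained on labelled blocks, as the text describes.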
The above describes in detail several implementations of obtaining the data source (the catalogue pages and content pages of the chapter-based text on multiple sites) in step S102. One or more implementations of determining the catalogue page set are described below.
For step S104 above, in which the catalogue page set of the chapter-based text across the different sites is determined according to the content pages corresponding to each catalogue page, the present invention provides a preferred scheme. In this scheme, the intersection between the content pages corresponding to every two catalogue pages is calculated and taken as the intersection of those two catalogue pages, and the catalogue page set of the chapter-based text across the different sites is then determined according to the intersection of every two catalogue pages.
Further, in the preferred scheme of the present invention, the intersection between the content pages corresponding to every two catalogue pages is calculated using the idea of clustering. The text feature vector of each content page among the content pages corresponding to the multiple catalogue pages may be extracted, the content pages that have the same text feature vector are then clustered to generate multiple content page groups, and the intersection between the content pages corresponding to every two catalogue pages is calculated according to the content page groups and the mapping between each catalogue page and its content pages. For example, the multiple sites are sites A, B, and C, and the corresponding catalogue pages of the chapter-based text are catalogue pages A, B, and C respectively. The content pages corresponding to catalogue page A are content pages A1, A2, A3; the content pages corresponding to catalogue page B are content pages B1, B2; and the content pages corresponding to catalogue page C are content pages C1, C2, C3, C4. The text feature vectors extracted for content pages A1, A2, A3, B1, B2, C1, C2, C3, C4 are a, b, c, a, b', a, b, c, d respectively. Clustering the content pages that have the same text feature vector generates the content page groups {a, a, a}, {b, b}, {b'}, {c, c}, {d}. Then, according to the content page groups and the mapping between each catalogue page and its content pages, the intersection between the content pages corresponding to every two catalogue pages is calculated: the intersection between the content pages of catalogue page A and catalogue page B is {a}, the intersection between the content pages of catalogue page A and catalogue page C is {a, b, c}, and the intersection between the content pages of catalogue page B and catalogue page C is {a}.
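The worked example can be reproduced with a short sketch. It assumes, purely for illustration, that each text feature vector is already reduced to a hashable label (`"a"`, `"b"`, `"b'"`, and so on), so that "same feature vector" becomes plain equality; how the vectors are actually extracted and compared is not specified here, and the function names are hypothetical.

```python
from itertools import combinations

# Catalogue page -> {content page: feature label}, taken from the example.
features = {
    "A": {"A1": "a", "A2": "b", "A3": "c"},
    "B": {"B1": "a", "B2": "b'"},
    "C": {"C1": "a", "C2": "b", "C3": "c", "C4": "d"},
}


def group_content_pages(features):
    """Cluster content pages that carry the same feature label."""
    groups = {}
    for catalogue, pages in features.items():
        for page, label in pages.items():
            groups.setdefault(label, []).append((catalogue, page))
    return groups


def pairwise_intersections(features):
    """Map each pair of catalogue pages to the groups they share."""
    labels = {cat: set(pages.values()) for cat, pages in features.items()}
    return {
        (x, y): labels[x] & labels[y]
        for x, y in combinations(sorted(labels), 2)
    }


groups = group_content_pages(features)
intersections = pairwise_intersections(features)
```

Here `groups` has five entries (for a, b, b', c, d), and `intersections` maps the pair (A, C) to the three shared groups a, b, c, matching the figures in the example.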
The catalogue page set of the chapter-based text across the different sites is then determined according to the intersection of every two catalogue pages. This may be done by merging every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold, and taking the merged result as the catalogue page set of the chapter-based text across the different sites. Continuing the example above, the intersection between the content pages corresponding to every two catalogue pages is taken as the intersection of those two catalogue pages: the intersection of catalogue page A and catalogue page B is {a}, the intersection of catalogue page A and catalogue page C is {a, b, c}, and the intersection of catalogue page B and catalogue page C is {a}. With a preset threshold of 1, every two catalogue pages whose intersection contains at least 1 element are merged, giving the merged result catalogue pages A, B, C. The catalogue page set of the chapter-based text across the different sites is therefore catalogue pages A, B, C.
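One way to sketch the merging step is with a small union-find structure. Treating the merge as transitive (if A merges with C and C with B, all three land in one set) is an assumption of this sketch, as are the function name and the input shape; the text only states that qualifying pairs are merged.

```python
def merge_catalogue_pages(intersections, threshold=1):
    """Group catalogue pages whose pairwise intersection size >= threshold."""
    parent = {}

    def find(page):
        parent.setdefault(page, page)
        while parent[page] != page:
            parent[page] = parent[parent[page]]  # path halving
            page = parent[page]
        return page

    for (x, y), common in intersections.items():
        root_x, root_y = find(x), find(y)
        if len(common) >= threshold:
            parent[root_x] = root_y

    merged = {}
    for page in list(parent):
        merged.setdefault(find(page), set()).add(page)
    return list(merged.values())


# Pairwise intersections from the example above.
intersections = {
    ("A", "B"): {"a"},
    ("A", "C"): {"a", "b", "c"},
    ("B", "C"): {"a"},
}
catalogue_sets = merge_catalogue_pages(intersections, threshold=1)
```

With a threshold of 1 all three catalogue pages end up in one set, as in the example. Raising the threshold to 2 would leave catalogue page B in a set of its own, which shows how sensitive the grouping is to this parameter.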
After step S104 above determines the catalogue page set of the chaptered text on different sites according to the multiple content pages corresponding to each catalogue page, step S106 analyzes each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifies the chapter completeness of each catalogue page in the catalogue page set according to the result of the analysis. The invention provides several analysis methods, which are described in detail below.
First method: calculate the average number of elements in the intersections of every two catalogue pages in the catalogue page set; if the number of elements in the intersection of a certain catalogue page with each of the other catalogue pages in the set is less than this average, the chapters corresponding to that catalogue page are determined to be incomplete. Taking the above example, the catalogue page set is catalogue pages A, B, C. The intersection of catalogue page A and catalogue page B has 1 element, the intersection of catalogue page A and catalogue page C has 3 elements, and the intersection of catalogue page B and catalogue page C has 1 element, so the average is 5/3. The intersections of catalogue page B with catalogue page A and with catalogue page C each have 1 element, which is less than the average of 5/3, so the chapters corresponding to catalogue page B are determined to be incomplete.
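The first method reduces to a small calculation, sketched here on the example's intersection sizes. The function name `incomplete_catalogue_pages` is hypothetical, and `Fraction` is used only so that the average 5/3 is represented exactly.

```python
from fractions import Fraction

def incomplete_catalogue_pages(intersection_sizes):
    """Return (average, pages whose every pairwise intersection is below average)."""
    if not intersection_sizes:
        return Fraction(0), []
    average = Fraction(sum(intersection_sizes.values()), len(intersection_sizes))
    pages = sorted({p for pair in intersection_sizes for p in pair})
    incomplete = []
    for page in pages:
        # Sizes of the intersections this page has with every other page.
        sizes = [n for pair, n in intersection_sizes.items() if page in pair]
        if all(n < average for n in sizes):
            incomplete.append(page)
    return average, incomplete

# Intersection sizes from the example above.
sizes = {("A", "B"): 1, ("A", "C"): 3, ("B", "C"): 1}
average, incomplete = incomplete_catalogue_pages(sizes)
```

Catalogue pages A and C each have one intersection of size 3, which is above the average, so only catalogue page B is reported.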
Second method: if the content pages corresponding to a certain catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, those further content pages are determined to be the content pages of the newest chapters, and the catalogue page is determined to have the ability to continuously contribute new chapters. Here, the newest chapters are the most recently published chapters of the chaptered text, for example the latest published chapter of a serialized novel. Novel readers typically follow a book as it is serialized: as soon as the author publishes a new chapter, they want to read it immediately, so a novel site that publishes the newest chapters faster is more popular with users. Taking the above example, catalogue page A corresponds to content pages A1, A2, A3 (with text feature vectors a, b, c respectively), catalogue page B corresponds to content pages B1, B2 (with text feature vectors a, b' respectively), and catalogue page C corresponds to content pages C1, C2, C3, C4 (with text feature vectors a, b, c, d respectively). It can be seen that the content pages corresponding to catalogue page C include the content pages corresponding to catalogue page A and catalogue page B, and that catalogue page C also has a further content page (content page C4); content page C4 is therefore determined to be the content page of the newest chapter, and catalogue page C is determined to have the ability to continuously contribute new chapters.
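A minimal sketch of the second method follows, under stated assumptions. In the example data, content page B2 has the vector b', which no other catalogue page shares, so a strict superset test against catalogue page B would fail; the sketch therefore requires coverage of at least `min_covered` other catalogue pages, a hypothetical parameter that the source does not define. The function name `newest_chapters` is also hypothetical.

```python
def newest_chapters(features, min_covered=1):
    """features: catalogue page -> set of content page feature vectors.

    Returns {catalogue page: vectors of its candidate newest chapters} for
    pages that cover at least `min_covered` other pages (an assumed rule)
    and also hold content seen on no other page.
    """
    result = {}
    for page, own in features.items():
        others = {p: f for p, f in features.items() if p != page}
        # Other catalogue pages whose content is fully contained in this one.
        covered = [p for p, f in others.items() if f <= own]
        seen_elsewhere = set().union(*others.values())
        extra = own - seen_elsewhere
        if len(covered) >= min_covered and extra:
            result[page] = extra
    return result

# Feature vectors from the example above.
features = {
    "A": {"a", "b", "c"},
    "B": {"a", "b'"},
    "C": {"a", "b", "c", "d"},
}
newest = newest_chapters(features)
```

Under this rule catalogue page C is the only page flagged, with d (content page C4) as the newest chapter. Catalogue page B also holds unique content (b'), but it covers no other page, so it is not treated as a source of new chapters.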
Third method: if a content page corresponding to a certain catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page does not fall within the interval corresponding to the average length of the content pages of that catalogue page, the content page is determined to be a false content page. Taking the above example, catalogue page A corresponds to content pages A1, A2, A3 (with text feature vectors a, b, c respectively), catalogue page B corresponds to content pages B1, B2 (with text feature vectors a, b' respectively), and catalogue page C corresponds to content pages C1, C2, C3, C4 (with text feature vectors a, b, c, d respectively). It can be seen that content page B2 of catalogue page B does not exist among the content pages corresponding to catalogue pages A and C; if the length of content page B2 does not fall within the interval corresponding to the average length of the content pages of catalogue page B, content page B2 is determined to be a false content page.
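The length test of the third method can be sketched as follows. The source does not say how the interval around the average length is built, so the symmetric band `mean * (1 ± tolerance)` is an assumption, and the page lengths below are invented purely for illustration.

```python
def length_in_range(length, lengths, tolerance=0.5):
    """Assumed interval: within +/- tolerance of the mean of `lengths`."""
    mean = sum(lengths) / len(lengths)
    return mean * (1 - tolerance) <= length <= mean * (1 + tolerance)

def false_pages_by_length(page, features, lengths, tolerance=0.5):
    """features: catalogue page -> {content id: feature vector};
    lengths: content id -> page length in characters (hypothetical data)."""
    elsewhere = {
        f for p, contents in features.items() if p != page
        for f in contents.values()
    }
    own_lengths = [lengths[cid] for cid in features[page]]
    return [
        cid for cid, f in features[page].items()
        if f not in elsewhere
        and not length_in_range(lengths[cid], own_lengths, tolerance)
    ]

features = {
    "A": {"A1": "a", "A2": "b", "A3": "c"},
    "B": {"B1": "a", "B2": "b'"},
    "C": {"C1": "a", "C2": "b", "C3": "c", "C4": "d"},
}
# Illustrative lengths only: B2 is much shorter than a normal chapter.
lengths = {"A1": 3000, "A2": 3100, "A3": 2900, "B1": 3000, "B2": 200,
           "C1": 3000, "C2": 3100, "C3": 2900, "C4": 3000}
false_b = false_pages_by_length("B", features, lengths)
false_c = false_pages_by_length("C", features, lengths)
```

With these made-up lengths, content page B2 is both unique to catalogue page B and far outside the assumed interval, so it is flagged. Content page C4 is also unique but has a normal length, so this method leaves it alone.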
Fourth method: if a content page corresponding to a certain catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval corresponding to the average length of the content pages of that catalogue page, and the catalogue page does not have the ability to continuously contribute new chapters, the content page is determined to be a false content page. Taking the above example, catalogue page A corresponds to content pages A1, A2, A3 (with text feature vectors a, b, c respectively), catalogue page B corresponds to content pages B1, B2 (with text feature vectors a, b' respectively), and catalogue page C corresponds to content pages C1, C2, C3, C4 (with text feature vectors a, b, c, d respectively). It can be seen that content page B2 of catalogue page B does not exist among the content pages corresponding to catalogue pages A and C. If the length of content page B2 falls within the interval corresponding to the average length of the content pages of catalogue page B, and catalogue page B does not have the ability to continuously contribute new chapters, content page B2 is determined to be a false content page.
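Taken together, the third and fourth methods form a small decision procedure, sketched below. The three boolean inputs are assumed to have been computed beforehand; the `Verdict` enum and `judge_content_page` are hypothetical names, and the `UNDETERMINED` outcome simply marks the case that neither method decides.

```python
from enum import Enum

class Verdict(Enum):
    SHARED = "also present under other catalogue pages"
    FALSE = "false content page"
    UNDETERMINED = "not decided by methods three and four"

def judge_content_page(exists_elsewhere, length_in_range, contributes_new_chapters):
    """Apply the third and fourth methods to one content page."""
    if exists_elsewhere:
        return Verdict.SHARED        # neither method applies
    if not length_in_range:
        return Verdict.FALSE         # third method: abnormal length
    if not contributes_new_chapters:
        return Verdict.FALSE         # fourth method: page cannot be a new chapter
    return Verdict.UNDETERMINED      # e.g. a candidate newest chapter

# Content page B2: unique, normal length, catalogue page B does not contribute new chapters.
b2 = judge_content_page(False, True, False)
# Content page C4: unique, normal length, catalogue page C does contribute new chapters.
c4 = judge_content_page(False, True, True)
```

The ordering of the checks mirrors the text: uniqueness is tested first, then length, and the ability to contribute new chapters is consulted only for pages of normal length.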
It should be noted that each of the four analysis methods above can be used on its own to analyze chapter completeness, and that any two or more of them can also be combined for that purpose. For example, the first method may be used to determine that the chapters corresponding to catalogue page B are incomplete, after which the third or fourth method is used for further analysis to determine that content page B2 of catalogue page B is a false content page, making the recognition result more accurate and objective. In addition, the example above (sites A, B and C, with catalogue pages A, B and C of the chaptered text respectively; catalogue page A corresponding to content pages A1, A2, A3, catalogue page B to content pages B1, B2, and catalogue page C to content pages C1, C2, C3, C4) is only illustrative and is not intended to limit the present invention.
Based on the same inventive concept, an embodiment of the present invention further provides a device for recognizing the chapter completeness of a chaptered text, which implements the above method for recognizing the chapter completeness of a chaptered text.
Fig. 2 shows a schematic structural diagram of a device for recognizing the chapter completeness of a chaptered text according to an embodiment of the invention. Referring to Fig. 2, the device includes at least an acquisition module 210, a determining module 220 and an identification module 230.
The function of each component of the device for recognizing the chapter completeness of a chaptered text according to the embodiment of the present invention, and the connections between the components, are now introduced:
Acquisition module 210, adapted to identify the catalogue page and multiple content pages of the chaptered text from each of multiple sites, wherein each site corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
Determining module 220, coupled with the acquisition module 210, adapted to determine the catalogue page set of the chaptered text on different sites according to the multiple content pages corresponding to each catalogue page;
Identification module 230, coupled with the determining module 220, adapted to analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and to identify the chapter completeness of each catalogue page in the catalogue page set according to the result of the analysis.
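The coupling of the three modules can be pictured as a simple pipeline. The sketch below is only an illustration of that data flow: the class `RecognitionDevice`, the `Pages` type and the placeholder lambdas are hypothetical and stand in for the real acquisition, determination and identification logic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Pages = Dict[str, List[str]]  # catalogue page -> its content pages

@dataclass
class RecognitionDevice:
    acquire: Callable[[List[str]], Pages]                    # acquisition module 210
    determine: Callable[[Pages], List[str]]                  # determining module 220
    identify: Callable[[List[str], Pages], Dict[str, bool]]  # identification module 230

    def run(self, sites: List[str]) -> Dict[str, bool]:
        pages = self.acquire(sites)            # 210 feeds 220
        catalogue_set = self.determine(pages)  # 220 feeds 230
        return self.identify(catalogue_set, pages)

# Placeholder behaviours, for illustration only.
device = RecognitionDevice(
    acquire=lambda sites: {s: [f"{s}{i}" for i in range(1, 3)] for s in sites},
    determine=lambda pages: sorted(pages),
    identify=lambda cats, pages: {c: len(pages[c]) >= 2 for c in cats},
)
result = device.run(["A", "B"])
```

Passing the three behaviours in as callables keeps each module replaceable, which matches the description of the modules as separately adaptable components.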
In one embodiment of the invention, the determining module 220 is further adapted to: calculate the intersection between the content pages corresponding to every two catalogue pages and take it as the intersection of those two catalogue pages; and determine the catalogue page set of the chaptered text on different sites according to the intersection of every two catalogue pages.
In one embodiment of the invention, the determining module 220 is further adapted to: extract the text feature vector of each content page among the content pages corresponding to the multiple catalogue pages; cluster the content pages having the same text feature vector to generate multiple content page groups; and calculate the intersection between the content pages corresponding to every two catalogue pages according to the mapping between the content page groups and the content pages corresponding to each catalogue page.
In one embodiment of the invention, the determining module 220 is further adapted to: merge every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold to obtain a merge result; and take the merge result as the catalogue page set of the chaptered text on different sites.
In one embodiment of the invention, the identification module 230 is further adapted to: calculate the average number of elements in the intersections of every two catalogue pages in the catalogue page set; and, if the number of elements in the intersection of a certain catalogue page with each of the other catalogue pages in the catalogue page set is less than the average, determine that the chapters corresponding to that catalogue page are incomplete.
In one embodiment of the invention, the identification module 230 is further adapted to: if the content pages corresponding to a certain catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, determine that the further content pages are the content pages of the newest chapters and that the catalogue page has the ability to continuously contribute new chapters.
In one embodiment of the invention, the identification module 230 is further adapted to: if a content page corresponding to a certain catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page does not fall within the interval corresponding to the average length of the content pages of that catalogue page, determine that the content page is a false content page.
In one embodiment of the invention, the identification module 230 is further adapted to: if a content page corresponding to a certain catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval corresponding to the average length of the content pages of that catalogue page, and the catalogue page does not have the ability to continuously contribute new chapters, determine that the content page is a false content page.
In one embodiment of the invention, the acquisition module 210 is further adapted to: search multiple sites for web pages related to the chaptered text; and identify the catalogue page and multiple content pages of the chaptered text from the web pages found.
In one embodiment of the invention, the acquisition module 210 is further adapted to: parse the web pages found into a Document Object Model (DOM) tree structure; classify each node in the DOM tree structure to determine the structural blocks of the web page; and extract the catalogue page and multiple content pages of the chaptered text according to the structural blocks.
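The parse-then-classify idea can be sketched with Python's standard `html.parser`. The source does not say how nodes are classified, so the link-text ratio used here is purely an assumed heuristic (catalogue blocks are mostly links, content blocks are mostly plain text), and `Node`, `TreeBuilder`, `classify_block` and `link_ratio` are hypothetical names.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""

class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree from HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then step out of it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def text_lengths(node, inside_link=False):
    """Return (total text length, text length inside <a> elements) of a subtree."""
    inside_link = inside_link or node.tag == "a"
    total = len(node.text)
    linked = total if inside_link else 0
    for child in node.children:
        child_total, child_linked = text_lengths(child, inside_link)
        total += child_total
        linked += child_linked
    return total, linked

def classify_block(node, link_ratio=0.6):
    """Assumed heuristic: mostly link text means a catalogue block."""
    total, linked = text_lengths(node)
    if total == 0:
        return "empty"
    return "catalogue" if linked / total >= link_ratio else "content"

toc = parse('<ul><li><a href="/1">Chapter 1</a></li><li><a href="/2">Chapter 2</a></li></ul>')
body = parse('<div><p>Long chapter text goes here.</p><a href="/next">Next</a></div>')
```

Calling `classify_block(toc)` labels the list of chapter links as a catalogue block, while `classify_block(body)` labels the paragraph-heavy page as a content block. A production classifier would need far more features than this single ratio.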
According to any one of the above preferred embodiments or a combination of several of them, the embodiments of the present invention can achieve the following beneficial effects:
According to the technical solution provided by the present invention, the catalogue page and multiple content pages of a chaptered text are identified from each of multiple sites, and the catalogue page set of the chaptered text on different sites is then determined according to the multiple content pages corresponding to each catalogue page. Each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page are subsequently analyzed, and the chapter completeness of each catalogue page in the catalogue page set is identified according to the result of the analysis. It can be seen that the present invention automates three things: the acquisition of the data source (the catalogue pages and multiple content pages of the chaptered text on multiple sites), the determination of the catalogue page set, and the analysis of the catalogue page set. This solves the problem in the related art that judging chapter completeness with manually configured templates is inefficient. Moreover, the present invention can flexibly acquire the data source, then determine the catalogue page set and analyze it, which solves the problem in the related art of responding too slowly to changes in site format. In addition, catalogue pages and content pages accurately and objectively reflect the chapter completeness of a chaptered text; the present invention analyzes, in a targeted manner, each catalogue page in the catalogue page set of the chaptered text on different sites and/or the content pages corresponding to each catalogue page, and identifies the chapter completeness of each catalogue page in the catalogue page set according to the result of the analysis, so that the recognition result is more accurate. In summary, the technical solution provided by the present invention can recognize the chapter completeness of a chaptered text flexibly and quickly, and the recognition result is accurate and objective.
The invention also discloses:
A1. A method for recognizing the chapter completeness of a chaptered text, including:
identifying the catalogue page and multiple content pages of the chaptered text from each of multiple sites, wherein each site corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
determining the catalogue page set of the chaptered text on different sites according to the multiple content pages corresponding to each catalogue page;
analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the catalogue page set according to the result of the analysis.
A2. The method according to A1, wherein determining the catalogue page set of the chaptered text on different sites according to the multiple content pages corresponding to each catalogue page includes:
calculating the intersection between the content pages corresponding to every two catalogue pages and taking it as the intersection of those two catalogue pages;
determining the catalogue page set of the chaptered text on different sites according to the intersection of every two catalogue pages.
A3. The method according to A1 or A2, wherein calculating the intersection between the content pages corresponding to every two catalogue pages includes:
extracting the text feature vector of each content page among the content pages corresponding to the multiple catalogue pages;
clustering the content pages having the same text feature vector to generate multiple content page groups;
calculating the intersection between the content pages corresponding to every two catalogue pages according to the mapping between the content page groups and the content pages corresponding to each catalogue page.
A4. The method according to any one of A1 to A3, wherein determining the catalogue page set of the chaptered text on different sites according to the intersection of every two catalogue pages includes:
merging every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold to obtain a merge result;
taking the merge result as the catalogue page set of the chaptered text on different sites.
A5. The method according to any one of A1 to A4, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the catalogue page set according to the result of the analysis, includes:
calculating the average number of elements in the intersections of every two catalogue pages in the catalogue page set;
if the number of elements in the intersection of a certain catalogue page with each of the other catalogue pages in the catalogue page set is less than the average, determining that the chapters corresponding to that catalogue page are incomplete.
A6. The method according to any one of A1 to A5, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the catalogue page set according to the result of the analysis, includes:
if the content pages corresponding to a certain catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, determining that the further content pages are the content pages of the newest chapters and that the catalogue page has the ability to continuously contribute new chapters.
A7. The method according to any one of A1 to A6, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the catalogue page set according to the result of the analysis, includes:
if a content page corresponding to a certain catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page does not fall within the interval corresponding to the average length of the content pages of that catalogue page, determining that the content page is a false content page.
A8. The method according to any one of A1 to A7, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter completeness of each catalogue page in the catalogue page set according to the result of the analysis, includes:
if a content page corresponding to a certain catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval corresponding to the average length of the content pages of that catalogue page, and the catalogue page does not have the ability to continuously contribute new chapters, determining that the content page is a false content page.
A9. The method according to any one of A1 to A8, wherein identifying the catalogue page and multiple content pages of the chaptered text from each of multiple sites includes:
searching multiple sites for web pages related to the chaptered text;
identifying the catalogue page and multiple content pages of the chaptered text from the web pages found.
A10. The method according to any one of A1 to A9, wherein identifying the catalogue page and multiple content pages of the chaptered text from the web pages found includes:
parsing the web pages found into a Document Object Model (DOM) tree structure;
classifying each node in the DOM tree structure to determine the structural blocks of the web page;
extracting the catalogue page and multiple content pages of the chaptered text according to the structural blocks.
B11. A device for recognizing the chapter completeness of a chaptered text, including:
an acquisition module, adapted to identify the catalogue page and multiple content pages of the chaptered text from each of multiple sites, wherein each site corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
a determining module, adapted to determine the catalogue page set of the chaptered text on different sites according to the multiple content pages corresponding to each catalogue page;
an identification module, adapted to analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and to identify the chapter completeness of each catalogue page in the catalogue page set according to the result of the analysis.
B12. The device according to B11, wherein the determining module is further adapted to:
calculate the intersection between the content pages corresponding to every two catalogue pages and take it as the intersection of those two catalogue pages;
determine the catalogue page set of the chaptered text on different sites according to the intersection of every two catalogue pages.
B13. The device according to B11 or B12, wherein the determining module is further adapted to:
extract the text feature vector of each content page among the content pages corresponding to the multiple catalogue pages;
cluster the content pages having the same text feature vector to generate multiple content page groups;
calculate the intersection between the content pages corresponding to every two catalogue pages according to the mapping between the content page groups and the content pages corresponding to each catalogue page.
B14. The device according to any one of B11 to B13, wherein the determining module is further adapted to:
merge every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold to obtain a merge result;
take the merge result as the catalogue page set of the chaptered text on different sites.
B15. The device according to any one of B11 to B14, wherein the identification module is further adapted to:
calculate the average number of elements in the intersections of every two catalogue pages in the catalogue page set;
if the number of elements in the intersection of a certain catalogue page with each of the other catalogue pages in the catalogue page set is less than the average, determine that the chapters corresponding to that catalogue page are incomplete.
B16. The device according to any one of B11 to B15, wherein the identification module is further adapted to:
if the content pages corresponding to a certain catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, determine that the further content pages are the content pages of the newest chapters and that the catalogue page has the ability to continuously contribute new chapters.
B17. The device according to any one of B11 to B16, wherein the identification module is further adapted to:
if a content page corresponding to a certain catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page does not fall within the interval corresponding to the average length of the content pages of that catalogue page, determine that the content page is a false content page.
B18. The device according to any one of B11 to B17, wherein the identification module is further adapted to:
if a content page corresponding to a certain catalogue page does not exist among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval corresponding to the average length of the content pages of that catalogue page, and the catalogue page does not have the ability to continuously contribute new chapters, determine that the content page is a false content page.
B19. The device according to any one of B11 to B18, wherein the acquisition module is further adapted to:
search multiple sites for web pages related to the chaptered text;
identify the catalogue page and multiple content pages of the chaptered text from the web pages found.
B20. The device according to any one of B11 to B19, wherein the acquisition module is further adapted to:
parse the web pages found into a Document Object Model (DOM) tree structure;
classify each node in the DOM tree structure to determine the structural blocks of the web page;
extract the catalogue page and multiple content pages of the chaptered text according to the structural blocks.
Numerous specific details are set forth in the specification provided here. It is to be understood, however, that embodiments of the present invention can be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components in an embodiment can be combined into one module or unit or component, and can furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include some features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for recognizing the chapter completeness of a chaptered text according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an Internet site, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order; these words may be interpreted as names.
So far, those skilled in the art will recognize that, although multiple exemplary embodiments of the present invention have been shown and described in detail herein, many other variations or modifications consistent with the principles of the invention can still be determined or derived directly from the present disclosure without departing from the spirit and scope of the invention. Therefore, the scope of the present invention should be understood and deemed to cover all such other variations or modifications.

Claims (18)

1. A method for recognizing the chapter completeness of chapter-based text, comprising:
identifying, from each of multiple websites, a catalogue page and multiple content pages of the chapter-based text, wherein each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
determining, according to the multiple content pages corresponding to each catalogue page, a catalogue page set of the chapter-based text across different websites;
analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying, according to the result of the analysis, the chapter completeness of each catalogue page in the catalogue page set;
wherein identifying the catalogue page and the multiple content pages of the chapter-based text from each of the multiple websites comprises: searching the multiple websites for web pages related to the chapter-based text; and identifying the catalogue page and the multiple content pages of the chapter-based text from the web pages found.
2. The method according to claim 1, wherein determining, according to the multiple content pages corresponding to each catalogue page, the catalogue page set of the chapter-based text across different websites comprises:
computing the intersection between the content pages corresponding to every two catalogue pages, and taking it as the intersection of those two catalogue pages;
determining, according to the intersection of every two catalogue pages, the catalogue page set of the chapter-based text across different websites.
3. The method according to claim 2, wherein computing the intersection between the content pages corresponding to every two catalogue pages comprises:
extracting a text feature vector of each content page among the content pages corresponding to the multiple catalogue pages;
clustering the content pages that have the same text feature vector to generate multiple content page groups;
computing the intersection between the content pages corresponding to every two catalogue pages according to the mapping between the multiple content page groups and the content pages corresponding to each catalogue page.
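As an illustration only, the grouping and intersection steps of claim 3 might be sketched in Python as follows. The patent does not specify how the text feature vector is computed; the hash of whitespace-normalized text used here is an assumption standing in for it, and the names `text_feature`, `group_ids_per_catalogue`, and `pairwise_intersections` are hypothetical.

```python
import hashlib
from itertools import combinations


def text_feature(text):
    """Stand-in for the text feature vector (assumption): a hash of the
    whitespace-normalized text, so identical chapters get the same value."""
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def group_ids_per_catalogue(catalogues):
    """Map each catalogue page to the set of content page groups it covers.

    `catalogues` maps a catalogue page identifier to the list of content
    page texts it links to. Pages with the same feature share a group,
    and the feature value itself serves as the group identifier.
    """
    return {
        name: {text_feature(text) for text in pages}
        for name, pages in catalogues.items()
    }


def pairwise_intersections(groups):
    """Intersection of content page groups for every two catalogue pages."""
    return {
        (a, b): groups[a] & groups[b]
        for a, b in combinations(sorted(groups), 2)
    }
```

Because the intersections are computed over group identifiers rather than raw page text, two sites that host the same chapter under different URLs still count as sharing that chapter.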
4. The method according to claim 2, wherein determining, according to the intersection of every two catalogue pages, the catalogue page set of the chapter-based text across different websites comprises:
merging every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold, to obtain a merge result;
taking the merge result as the catalogue page set of the chapter-based text across different websites.
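The merging step of claim 4 might be sketched as below. This is a minimal sketch under stated assumptions: catalogue pages are represented by sets of content page group identifiers, and merging is treated as transitive (if A merges with B and B with C, all three land in one set), which the claim does not state explicitly. The function name `merge_catalogues` and the union-find bookkeeping are illustrative choices, not the patent's implementation.

```python
from itertools import combinations


def merge_catalogues(groups, threshold):
    """Merge catalogue pages whose pairwise intersection has at least
    `threshold` elements; return the resulting catalogue page sets."""
    parent = {name: name for name in groups}

    def find(name):
        # Follow parent links to the representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(groups, 2):
        if len(groups[a] & groups[b]) >= threshold:
            parent[find(a)] = find(b)

    merged = {}
    for name in groups:
        merged.setdefault(find(name), set()).add(name)
    return list(merged.values())
```

Each returned set is a candidate catalogue page set for one chapter-based text; catalogue pages that share too few content pages with any other remain in singleton sets.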
5. The method according to any one of claims 1 to 4, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying, according to the result of the analysis, the chapter completeness of each catalogue page in the catalogue page set comprises:
computing the average number of elements in the intersections of every two catalogue pages in the catalogue page set;
if the number of elements in the intersection of a given catalogue page with each of multiple other catalogue pages in the catalogue page set is less than the average, determining that the chapters corresponding to that catalogue page are incomplete.
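The completeness test of claim 5 can be illustrated with a short sketch. It assumes each catalogue page is represented by a set of content page group identifiers and reads "multiple other catalogue pages" as all other catalogue pages in the set; both are assumptions, and `incomplete_catalogues` is a hypothetical name.

```python
from itertools import combinations


def incomplete_catalogues(groups):
    """Return catalogue pages whose intersections with every other
    catalogue page are all smaller than the average pairwise size."""
    names = sorted(groups)
    sizes = {
        frozenset((a, b)): len(groups[a] & groups[b])
        for a, b in combinations(names, 2)
    }
    if not sizes:
        return set()  # fewer than two catalogue pages: nothing to compare
    average = sum(sizes.values()) / len(sizes)
    return {
        name
        for name in names
        if all(sizes[frozenset((name, other))] < average
               for other in names if other != name)
    }
```

For example, with two catalogue pages that share ten chapters and a third that shares only three chapters with each of them, the average intersection size is 16/3, so only the third catalogue page is reported as incomplete.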
6. The method according to any one of claims 1 to 4, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying, according to the result of the analysis, the chapter completeness of each catalogue page in the catalogue page set comprises:
if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, determining that the further content pages are content pages of the newest chapters, and that the catalogue page is capable of continuously contributing new chapters.
7. The method according to any one of claims 1 to 4, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying, according to the result of the analysis, the chapter completeness of each catalogue page in the catalogue page set comprises:
if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page falls outside the interval corresponding to the average length of the content pages corresponding to that catalogue page, determining that the content page is a false content page.
8. The method according to any one of claims 1 to 4, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying, according to the result of the analysis, the chapter completeness of each catalogue page in the catalogue page set comprises:
if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval corresponding to the average length of the content pages corresponding to that catalogue page, and the catalogue page is not capable of continuously contributing new chapters, determining that the content page is a false content page.
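The rules of claims 6 to 8 can be read together as one decision procedure, sketched below for illustration. Several details are assumptions not fixed by the claims: "multiple" is taken as at least two (`min_covered`), the "interval corresponding to the average length" is modelled as the average plus or minus a fraction of it (`tolerance`), and content pages are identified by group identifiers. The names `new_chapter_pages` and `is_false_page` are hypothetical.

```python
from statistics import mean


def new_chapter_pages(name, groups, min_covered=2):
    """Claim 6 sketch: if catalogue page `name` fully contains the content
    pages of at least `min_covered` other catalogue pages, return its
    remaining pages (treated as newest chapters); otherwise an empty set."""
    own = groups[name]
    covered = [o for o in groups if o != name and groups[o] <= own]
    if len(covered) < min_covered:
        return set()
    seen_elsewhere = set().union(*(groups[o] for o in covered))
    return own - seen_elsewhere


def is_false_page(page, name, groups, lengths, tolerance=0.5):
    """Claims 7 and 8 sketch: decide whether `page`, listed by catalogue
    page `name`, is a false content page. `lengths` maps page -> length."""
    others = set().union(*(groups[o] for o in groups if o != name))
    if page in others:
        return False  # also listed elsewhere, so not suspicious
    average = mean(lengths[p] for p in groups[name])
    in_range = abs(lengths[page] - average) <= tolerance * average
    if not in_range:
        return True  # claim 7: unique page with an atypical length
    # Claim 8: unique page of typical length is false only if the
    # catalogue page is not a source of new chapters.
    return not new_chapter_pages(name, groups)
```

Under this reading, a unique page of ordinary length on a catalogue page that already covers several other sites is accepted as a new chapter, while the same page on a catalogue page with no such track record is rejected.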
9. The method according to any one of claims 1 to 4, wherein identifying the catalogue page and the multiple content pages of the chapter-based text from the web pages found comprises:
parsing the web pages found into a Document Object Model (DOM) tree structure;
classifying each node in the DOM tree structure to determine the structural blocks of the web page;
extracting the catalogue page and the multiple content pages of the chapter-based text according to the structural blocks.
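To make the three steps of claim 9 concrete, the sketch below parses HTML into a simple tree with Python's standard `html.parser`, labels each node, and decides the page type. The claims do not say how nodes are classified; the link-density heuristic used here (a node whose text is mostly inside `<a>` elements is a link block) is an assumption swapped in for illustration, as are the names `Node`, `TreeBuilder`, `classify`, `structural_blocks`, and `page_type` and the 0.5 cut-off.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-like tree of Node objects."""
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:  # void elements never hold children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def text_len(node):
    return len(node.text) + sum(text_len(c) for c in node.children)


def link_len(node):
    if node.tag == "a":
        return text_len(node)
    return sum(link_len(c) for c in node.children)


def classify(node):
    """Label a node by link density (heuristic assumption)."""
    total = text_len(node)
    if total == 0:
        return "empty"
    return "link_block" if link_len(node) / total > 0.5 else "text_block"


def structural_blocks(node):
    """Yield (tag, label) for every non-empty node in the tree."""
    label = classify(node)
    if label != "empty":
        yield node.tag, label
    for child in node.children:
        yield from structural_blocks(child)


def page_type(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return "catalogue" if classify(builder.root) == "link_block" else "content"
```

A page consisting mainly of chapter links is labelled a catalogue page, and a page dominated by running text is labelled a content page; a production classifier would likely use richer node features than link density alone.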
10. A device for recognizing the chapter completeness of chapter-based text, comprising:
an acquisition module, adapted to identify, from each of multiple websites, a catalogue page and multiple content pages of the chapter-based text, wherein each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
a determining module, adapted to determine, according to the multiple content pages corresponding to each catalogue page, a catalogue page set of the chapter-based text across different websites;
an identification module, adapted to analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and to identify, according to the result of the analysis, the chapter completeness of each catalogue page in the catalogue page set;
wherein the acquisition module is further adapted to: search the multiple websites for web pages related to the chapter-based text; and identify the catalogue page and the multiple content pages of the chapter-based text from the web pages found.
11. The device according to claim 10, wherein the determining module is further adapted to:
compute the intersection between the content pages corresponding to every two catalogue pages, and take it as the intersection of those two catalogue pages;
determine, according to the intersection of every two catalogue pages, the catalogue page set of the chapter-based text across different websites.
12. The device according to claim 11, wherein the determining module is further adapted to:
extract a text feature vector of each content page among the content pages corresponding to the multiple catalogue pages;
cluster the content pages that have the same text feature vector to generate multiple content page groups;
compute the intersection between the content pages corresponding to every two catalogue pages according to the mapping between the multiple content page groups and the content pages corresponding to each catalogue page.
13. The device according to claim 11, wherein the determining module is further adapted to:
merge every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold, to obtain a merge result;
take the merge result as the catalogue page set of the chapter-based text across different websites.
14. The device according to any one of claims 10 to 13, wherein the identification module is further adapted to:
compute the average number of elements in the intersections of every two catalogue pages in the catalogue page set;
if the number of elements in the intersection of a given catalogue page with each of multiple other catalogue pages in the catalogue page set is less than the average, determine that the chapters corresponding to that catalogue page are incomplete.
15. The device according to any one of claims 10 to 13, wherein the identification module is further adapted to:
if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, determine that the further content pages are content pages of the newest chapters, and that the catalogue page is capable of continuously contributing new chapters.
16. The device according to any one of claims 10 to 13, wherein the identification module is further adapted to:
if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page falls outside the interval corresponding to the average length of the content pages corresponding to that catalogue page, determine that the content page is a false content page.
17. The device according to any one of claims 10 to 13, wherein the identification module is further adapted to:
if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval corresponding to the average length of the content pages corresponding to that catalogue page, and the catalogue page is not capable of continuously contributing new chapters, determine that the content page is a false content page.
18. The device according to any one of claims 10 to 13, wherein the acquisition module is further adapted to:
parse the web pages found into a Document Object Model (DOM) tree structure;
classify each node in the DOM tree structure to determine the structural blocks of the web page;
extract the catalogue page and the multiple content pages of the chapter-based text according to the structural blocks.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410578534.2A CN104317903B (en) 2014-10-24 2014-10-24 Method and device for recognizing the chapter completeness of chapter-based text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410578534.2A CN104317903B (en) 2014-10-24 2014-10-24 Method and device for recognizing the chapter completeness of chapter-based text

Publications (2)

Publication Number Publication Date
CN104317903A CN104317903A (en) 2015-01-28
CN104317903B CN104317903B (en) 2017-10-13

Family

ID=52373135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410578534.2A Active CN104317903B (en) 2014-10-24 2014-10-24 Method and device for recognizing the chapter completeness of chapter-based text

Country Status (1)

Country Link
CN (1) CN104317903B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106033405B (en) * 2015-03-10 2020-06-05 腾讯科技(深圳)有限公司 Network book catalog integrity detection method and device
CN105447130B (en) * 2015-11-18 2018-12-25 北京奇虎科技有限公司 The acquisition methods and device of the new chapters and sections of the network novel
CN113407889B (en) * 2021-07-15 2023-10-20 北京百度网讯科技有限公司 Novel transcoding method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7158962B2 (en) * 2002-11-27 2007-01-02 International Business Machines Corporation System and method for automatically linking items with multiple attributes to multiple levels of folders within a content management system
CN103123640A (en) * 2012-02-22 2013-05-29 深圳市谷古科技有限公司 Method and device for searching novel
CN103310160A (en) * 2013-06-20 2013-09-18 北京神州绿盟信息安全科技股份有限公司 Method, system and device for preventing webpage from being tampered with
CN103365877A (en) * 2012-03-29 2013-10-23 百度在线网络技术(北京)有限公司 Method and server for making directory after webpage is transcoded
US8631029B1 (en) * 2010-03-26 2014-01-14 A9.Com, Inc. Evolutionary content determination and management
CN103577566A (en) * 2013-10-25 2014-02-12 北京奇虎科技有限公司 Web reading content loading method and device

Also Published As

Publication number Publication date
CN104317903A (en) 2015-01-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220727

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.