CN104317903A - Chapter type text chapter integrity identification method and device - Google Patents
- Publication number
- CN104317903A (application number CN201410578534.2A)
- Authority
- CN
- China
- Prior art keywords
- catalogue page
- catalogue
- content pages
- chapters
- pages
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method and device for identifying the chapter integrity of chapter-type text. The method includes: identifying a catalogue page and multiple content pages of the chapter-type text from each of multiple websites, wherein each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages; determining, according to the content pages corresponding to each catalogue page, the catalogue page set of the chapter-type text on the different websites; and analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page in the catalogue page set according to the result of the analysis. The chapter integrity of chapter-type text can thus be identified flexibly and quickly, and the identification results are accurate and objective.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for identifying the chapter integrity of chapter-type text.
Background art
With the growing popularity of computers and computer networks, the Internet has reached into every area of people's work, study and daily life, and has become an important channel for publishing and obtaining information.
At present, chapter-type text exists in large quantities on the Internet, and the same text may be reprinted by many different websites. Because reprinting is affected by various objective factors, the text on some websites may be incomplete, or may even contain fake content. Take novels as an example: reading novels is a strong demand among Internet users, and it accounts for a considerable share of demand on mobile devices in particular. Novel websites are numerous and vary widely in quality. The same online novel may be reprinted by many different websites, but, owing to various objective factors, the copy on some websites may be incomplete (for example, missing chapters) or may even contain fake content (for example, fake chapters pieced together). When indexing these novel websites, a search engine needs to judge the chapter integrity of the novel so that it can, as far as possible, present users with websites whose content is complete, thereby improving the quality of the novel content users obtain and improving the user experience.
In the related art, chapter integrity is judged by manually configuring templates for individual novel websites. Although this approach is highly accurate, its shortcomings are also obvious: the number of websites that can be covered manually is limited, the approach is not intelligent, and it responds slowly to changes in website format. How to identify the chapter integrity of chapter-type text flexibly, quickly and accurately has therefore become a technical problem to be solved urgently.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method for identifying the chapter integrity of chapter-type text, and a corresponding device, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for identifying the chapter integrity of chapter-type text is provided, comprising: identifying a catalogue page and multiple content pages of the chapter-type text from each of multiple websites, wherein each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages; determining, according to the content pages corresponding to each catalogue page, the catalogue page set of the chapter-type text on the different websites; and analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page in the catalogue page set according to the result of the analysis.
Optionally, determining, according to the content pages corresponding to each catalogue page, the catalogue page set of the chapter-type text on the different websites comprises: calculating the intersection between the content pages corresponding to every two catalogue pages, and taking it as the intersection of those two catalogue pages; and determining the catalogue page set of the chapter-type text on the different websites according to the intersection of every two catalogue pages.
Optionally, calculating the intersection between the content pages corresponding to every two catalogue pages comprises: extracting the text feature vector of each content page among the content pages corresponding to the multiple catalogue pages; clustering the content pages that have the same text feature vector to generate multiple content page groups; and calculating the intersection between the content pages corresponding to every two catalogue pages according to the multiple content page groups and the mapping between each catalogue page and its corresponding content pages.
Optionally, determining the catalogue page set of the chapter-type text on the different websites according to the intersection of every two catalogue pages comprises: merging every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold, to obtain a merge result; and taking the merge result as the catalogue page set of the chapter-type text on the different websites.
Optionally, analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page according to the result of the analysis, comprises: calculating the mean of the numbers of elements in the intersections of every two catalogue pages in the catalogue page set; and, if the number of elements in the intersection of a given catalogue page with each of multiple other catalogue pages in the catalogue page set is less than the mean, determining that the chapters corresponding to that catalogue page are incomplete.
Optionally, analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page according to the result of the analysis, comprises: if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, determining that the further content pages are content pages of the latest chapters and that the catalogue page is able to keep supplying new chapters.
Optionally, analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page according to the result of the analysis, comprises: if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page falls outside the interval corresponding to the average length of the content pages corresponding to that catalogue page, determining that the content page is a fake content page.
Optionally, analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page according to the result of the analysis, comprises: if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval corresponding to the average length of the content pages corresponding to that catalogue page, and the catalogue page is not able to keep supplying new chapters, determining that the content page is a fake content page.
Optionally, identifying a catalogue page and multiple content pages of the chapter-type text from each of multiple websites comprises: searching the multiple websites for web pages related to the chapter-type text; and identifying the catalogue page and the multiple content pages of the chapter-type text from the web pages found.
Optionally, identifying the catalogue page and the multiple content pages of the chapter-type text from the web pages found comprises: parsing the web pages found into a document object model (DOM) tree; classifying each node in the DOM tree to determine the structural blocks of the web page; and extracting the catalogue page and the multiple content pages of the chapter-type text according to the structural blocks.
According to another aspect of the present invention, a device for identifying the chapter integrity of chapter-type text is also provided, comprising:
An acquisition module, adapted to identify a catalogue page and multiple content pages of the chapter-type text from each of multiple websites, wherein each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
A determination module, adapted to determine, according to the content pages corresponding to each catalogue page, the catalogue page set of the chapter-type text on the different websites;
An identification module, adapted to analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and to identify the chapter integrity of each catalogue page in the catalogue page set according to the result of the analysis.
Optionally, the determination module is further adapted to: calculate the intersection between the content pages corresponding to every two catalogue pages, and take it as the intersection of those two catalogue pages; and determine the catalogue page set of the chapter-type text on the different websites according to the intersection of every two catalogue pages.
Optionally, the determination module is further adapted to: extract the text feature vector of each content page among the content pages corresponding to the multiple catalogue pages; cluster the content pages that have the same text feature vector to generate multiple content page groups; and calculate the intersection between the content pages corresponding to every two catalogue pages according to the multiple content page groups and the mapping between each catalogue page and its corresponding content pages.
Optionally, the determination module is further adapted to: merge every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold, to obtain a merge result; and take the merge result as the catalogue page set of the chapter-type text on the different websites.
Optionally, the identification module is further adapted to: calculate the mean of the numbers of elements in the intersections of every two catalogue pages in the catalogue page set; and, if the number of elements in the intersection of a given catalogue page with each of multiple other catalogue pages in the catalogue page set is less than the mean, determine that the chapters corresponding to that catalogue page are incomplete.
Optionally, the identification module is further adapted to: if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, determine that the further content pages are content pages of the latest chapters and that the catalogue page is able to keep supplying new chapters.
Optionally, the identification module is further adapted to: if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page falls outside the interval corresponding to the average length of the content pages corresponding to that catalogue page, determine that the content page is a fake content page.
Optionally, the identification module is further adapted to: if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval corresponding to the average length of the content pages corresponding to that catalogue page, and the catalogue page is not able to keep supplying new chapters, determine that the content page is a fake content page.
Optionally, the acquisition module is further adapted to: search the multiple websites for web pages related to the chapter-type text; and identify the catalogue page and the multiple content pages of the chapter-type text from the web pages found.
Optionally, the acquisition module is further adapted to: parse the web pages found into a document object model (DOM) tree; classify each node in the DOM tree to determine the structural blocks of the web page; and extract the catalogue page and the multiple content pages of the chapter-type text according to the structural blocks.
According to the technical solution provided by the present invention, a catalogue page and multiple content pages of the chapter-type text are identified from each of multiple websites, and the catalogue page set of the chapter-type text on the different websites is then determined according to the content pages corresponding to each catalogue page. Each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page are then analyzed, and the chapter integrity of each catalogue page in the catalogue page set is identified according to the result of the analysis. The present invention thus automates three processes: acquiring the data source (the catalogue pages and content pages of the chapter-type text on multiple websites), determining the catalogue page set, and analyzing the catalogue page set. This solves the problem of low efficiency caused in the related art by judging chapter integrity with manually configured templates. Furthermore, the present invention can acquire the data source flexibly and then determine and analyze the catalogue page set, which solves the problem in the related art of slow response to changes in website format. In addition, catalogue pages and content pages reflect the chapter integrity of chapter-type text accurately and objectively; the present invention analyzes, in a targeted manner, each catalogue page in the catalogue page set of the chapter-type text on the different websites and/or the content pages corresponding to each catalogue page, and identifies the chapter integrity of each catalogue page according to the result, which makes the identification result more accurate. In summary, the technical solution provided by the present invention can identify the chapter integrity of chapter-type text flexibly and quickly, and the identification result is accurate and objective.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other objects, features and advantages of the present invention become more apparent, specific embodiments of the present invention are set out below.
From the following detailed description of specific embodiments of the present invention, taken in conjunction with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages and features of the present invention.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
Fig. 1 shows a flowchart of a method for identifying the chapter integrity of chapter-type text according to an embodiment of the present invention; and
Fig. 2 shows a schematic structural diagram of a device for identifying the chapter integrity of chapter-type text according to an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be conveyed fully to those skilled in the art.
To solve the above technical problem, an embodiment of the present invention provides a method for identifying the chapter integrity of chapter-type text. Fig. 1 shows a flowchart of the method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises at least the following steps S102 to S106.
Step S102: identify a catalogue page and multiple content pages of the chapter-type text from each of multiple websites, wherein each website corresponds to one catalogue page and each catalogue page corresponds to multiple content pages.
Step S104: determine, according to the content pages corresponding to each catalogue page, the catalogue page set of the chapter-type text on the different websites.
Step S106: analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identify the chapter integrity of each catalogue page in the catalogue page set according to the result of the analysis.
The chapter-type text mentioned in step S102 above is text made up of a number of chapters, such as a novel or a paper. A catalogue page is the table of contents of the chapter-type text; for example, when a user searches for a novel, what the user is usually looking for is the novel's catalogue page. A content page holds the specific content of one chapter of the chapter-type text. The present invention provides a preferred scheme for identifying a catalogue page and multiple content pages of the chapter-type text from each of multiple websites: in this scheme, the multiple websites are searched for web pages related to the chapter-type text, and the catalogue page and the multiple content pages are then identified from the web pages found.
Further, the catalogue page and content pages may be identified from the web pages found by applying manually written extraction rules. Alternatively, identification may be based on labeled templates: for each page, the best-matching template is looked up in a template library and then used to identify and extract the catalogue page and content pages. In addition, to improve identification efficiency, the present invention may parse the web pages found into a document object model (DOM) tree, classify each node in the DOM tree to determine the structural blocks of the web page, and then extract the catalogue page and content pages according to the structural blocks. A preferred scheme is provided here for classifying each node in the DOM tree to determine the structural blocks of the web page. In this scheme, the DOM tree is traversed to obtain the content of each node, the content of each node is then input into a decision tree according to preset rules, and the decision tree classifies each node. Alternatively, the DOM tree is traversed to obtain the dimensional features of each node, the dimensional features of each node are then input into a decision tree according to preset rules, and the decision tree classifies each node.
The decision tree is trained on statistics of the dimensional features observed in blocks of known types; once trained, it uses the dimensional features of each node to derive the block type corresponding to that node. The scheme of classifying each node in the DOM tree of a web page with a decision tree, in order to determine the structural blocks of the web page, is described in detail below.
First, the dimensional features of the blocks are determined. In this embodiment of the present invention, up to 105 dimensional features may be used, mainly covering: text length, number of hyperlinks, hyperlink text length, highlighted text length (including emphasized and bold text), Chinese character length, English character length, numeric character length, specific keywords, specific punctuation marks, and so on. A block of a given type can thus be determined by one or more of these 105 dimensional features taking particular values. Note that the number of dimensional features is not limited to 105; it may be determined according to actual conditions and extended later.
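The feature dimensions listed above can be illustrated with a minimal Python sketch. It computes only a handful of the dimensions named in the text from a node's visible text and its hyperlink texts. The function name `node_features`, the dictionary keys, and the keyword list are assumptions made for illustration; the patent does not specify how the 105 features are encoded.

```python
import re

# Matches characters in the CJK Unified Ideographs block.
CJK = re.compile(r"[\u4e00-\u9fff]")
# Hypothetical keyword list; the patent does not enumerate its keywords.
KEYWORDS = ("chapter", "章")


def node_features(text: str, link_texts: list[str]) -> dict[str, int]:
    """Compute a few illustrative feature dimensions for one DOM node."""
    return {
        "text_length": len(text),
        "link_count": len(link_texts),
        "link_text_length": sum(len(t) for t in link_texts),
        "chinese_length": len(CJK.findall(text)),
        "english_length": sum(c.isascii() and c.isalpha() for c in text),
        "digit_length": sum(c.isascii() and c.isdigit() for c in text),
        "keyword_hits": sum(text.lower().count(k) for k in KEYWORDS),
    }
```

A node containing many short hyperlinks and repeated keyword hits would be a plausible candidate for a chapter list, while a node with long text and few links would look more like a chapter body; the actual decision is left to the trained decision tree.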
Second, the dimensional features determined for blocking are input into the decision tree and used to train and build it.
Third, the content of each node in the DOM tree of the web page is input into the decision tree according to preset rules; the decision tree analyzes the content of each node to obtain its dimensional features, and then classifies each node according to those features.
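The traverse-then-classify flow described above can be sketched with Python's standard `html.parser` module. This is only an illustration under strong assumptions: each `<div>` is treated as one block, only two features are collected, and `classify` is a hand-written rule with made-up thresholds that stands in for the trained decision tree. None of these choices come from the patent.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collect text length and link count for each <div> block of a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # features of the <div> elements currently open
        self.blocks = []  # features of closed <div> elements, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append({"text_length": 0, "link_count": 0})
        elif tag == "a" and self.stack:
            self.stack[-1]["link_count"] += 1

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text_length"] += len(data.strip())

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.blocks.append(self.stack.pop())


def classify(block):
    """Hand-written stand-in for a trained decision tree (thresholds assumed)."""
    if block["link_count"] >= 3:
        return "chapter_list"
    if block["text_length"] >= 200:
        return "chapter_body"
    return "other"
```

In a real system the thresholds would be learned from labeled blocks rather than fixed by hand, and the feature set would be far larger than the two dimensions used here.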
The above describes in detail several ways of acquiring the data source in step S102 (the catalogue pages and content pages of the chapter-type text on multiple websites). One or more ways of determining the catalogue page set are described below.
For step S104 above, in which the catalogue page set of the chapter-type text on the different websites is determined according to the content pages corresponding to each catalogue page, the present invention provides a preferred scheme: the intersection between the content pages corresponding to every two catalogue pages is calculated and taken as the intersection of those two catalogue pages, and the catalogue page set of the chapter-type text on the different websites is then determined according to the intersection of every two catalogue pages.
Further, in the preferred scheme of the present invention, clustering is used to calculate the intersection between the content pages corresponding to every two catalogue pages. The text feature vector of each content page among the content pages corresponding to the multiple catalogue pages is extracted; the content pages that have the same text feature vector are then clustered to generate multiple content page groups; and the intersection between the content pages corresponding to every two catalogue pages is calculated according to the multiple content page groups and the mapping between each catalogue page and its corresponding content pages. For example, suppose the websites are websites A, B and C, and the corresponding catalogue pages of the chapter-type text are catalogue pages A, B and C. The content pages corresponding to catalogue page A are A1, A2 and A3; those corresponding to catalogue page B are B1 and B2; and those corresponding to catalogue page C are C1, C2, C3 and C4. The text feature vectors extracted from content pages A1, A2, A3, B1, B2, C1, C2, C3 and C4 are a, b, c, a, b', a, b, c and d respectively. Clustering the content pages that have the same text feature vector yields the content page groups {a, a, a}, {b, b}, {b'}, {c, c} and {d}. Then, according to the content page groups and the mapping between each catalogue page and its corresponding content pages, the intersection between the content pages corresponding to every two catalogue pages is calculated: the intersection for catalogue pages A and B is {a}, the intersection for catalogue pages A and C is {a, b, c}, and the intersection for catalogue pages B and C is {a}.
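The worked example above can be reproduced with a short Python sketch. The labels `"a"`, `"b"`, `"b'"` and so on stand in for real text feature vectors, and the function names `cluster_by_feature` and `pairwise_intersections` are hypothetical; how the feature vectors themselves are extracted is not shown.

```python
from collections import defaultdict
from itertools import combinations

# Feature label of each content page, keyed by catalogue page (example data).
pages = {
    "A": {"A1": "a", "A2": "b", "A3": "c"},
    "B": {"B1": "a", "B2": "b'"},
    "C": {"C1": "a", "C2": "b", "C3": "c", "C4": "d"},
}


def cluster_by_feature(pages):
    """Group content pages that share the same text feature vector."""
    groups = defaultdict(list)
    for catalogue, contents in pages.items():
        for content_page, feature in contents.items():
            groups[feature].append((catalogue, content_page))
    return dict(groups)


def pairwise_intersections(pages):
    """Derive the intersection of every two catalogue pages from the groups."""
    result = {pair: set() for pair in combinations(sorted(pages), 2)}
    for feature, members in cluster_by_feature(pages).items():
        catalogues = sorted({catalogue for catalogue, _ in members})
        for pair in combinations(catalogues, 2):
            result[pair].add(feature)
    return result
```

Calling `pairwise_intersections(pages)` on the example data gives `{a}` for A and B, `{a, b, c}` for A and C, and `{a}` for B and C, matching the text. Exact equality of feature labels is used here because the text clusters pages with the same feature vector; a tolerance-based comparison would be a different design.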
To determine the catalogue page set of the chapter-type text on the different websites according to the intersection of every two catalogue pages, every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold are merged, and the merge result is taken as the catalogue page set of the chapter-type text on the different websites. Continuing the example above, the intersection between the content pages corresponding to every two catalogue pages is taken as the intersection of those two catalogue pages: the intersection of catalogue pages A and B is {a}, the intersection of catalogue pages A and C is {a, b, c}, and the intersection of catalogue pages B and C is {a}. With a preset threshold of 1, every two catalogue pages whose intersection contains at least 1 element are merged, giving the merge result catalogue pages A, B and C; the catalogue page set of this chapter-type text on the different websites is therefore catalogue pages A, B and C.
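The merge step can be sketched in Python as follows. The sketch assumes that merging is transitive (if A merges with C and C merges with B, all three end up in one set) and implements this with a small union-find structure; the patent does not state how chains of merges are handled, so this is one possible reading. The name `merge_catalogue_pages` is hypothetical.

```python
def merge_catalogue_pages(intersections, threshold):
    """Merge catalogue pages whose pairwise intersection size >= threshold.

    intersections maps a pair of catalogue pages to the set of shared groups.
    Returns a sorted list of merged sets, each given as a sorted list.
    """
    parent = {}

    def find(x):
        # Find the representative of x, registering x on first sight.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (x, y), shared in intersections.items():
        root_x, root_y = find(x), find(y)
        if len(shared) >= threshold:
            parent[root_x] = root_y

    merged = {}
    for page in list(parent):
        merged.setdefault(find(page), set()).add(page)
    return sorted(sorted(s) for s in merged.values())
```

With the example intersections and a threshold of 1, the result is a single set containing A, B and C, as in the text. Raising the threshold to 2 leaves B on its own, which shows how the threshold controls which websites are treated as carrying the same text.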
After step S104 has determined the catalogue page set of the chapter-type text on the different websites according to the content pages corresponding to each catalogue page, step S106 analyzes each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifies the chapter integrity of each catalogue page in the catalogue page set according to the result of the analysis. The present invention provides several methods of analysis, which are described in detail below.
In the first method, the mean of the numbers of elements in the intersections of every two catalogue pages in the catalogue page set is calculated; if the number of elements in the intersection of a given catalogue page with each of multiple other catalogue pages in the catalogue page set is less than this mean, the chapters corresponding to that catalogue page are determined to be incomplete. In the example above, the catalogue page set is catalogue pages A, B and C. The intersection of catalogue pages A and B has 1 element, the intersection of catalogue pages A and C has 3 elements, and the intersection of catalogue pages B and C has 1 element, so the mean is 5/3. The intersections of catalogue page B with catalogue page A and with catalogue page C each have 1 element, which is less than the mean of 5/3, so the chapters corresponding to catalogue page B are determined to be incomplete.
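The first method can be checked with a brief Python sketch. It takes the intersection sizes as input and uses `fractions.Fraction` so that the mean 5/3 is represented exactly. The function name `incomplete_catalogue_pages` is hypothetical, and the sketch reads "multiple other catalogue pages" as all other catalogue pages in the set, which is an assumption.

```python
from fractions import Fraction


def incomplete_catalogue_pages(sizes):
    """Flag catalogue pages whose every pairwise intersection is below the mean.

    sizes maps a pair of catalogue pages to the size of their intersection.
    Returns the exact mean and the sorted list of flagged catalogue pages.
    """
    mean = Fraction(sum(sizes.values()), len(sizes))
    catalogue_pages = sorted({page for pair in sizes for page in pair})
    flagged = []
    for page in catalogue_pages:
        own = [n for pair, n in sizes.items() if page in pair]
        if own and all(n < mean for n in own):
            flagged.append(page)
    return mean, flagged
```

For the example sizes (A and B: 1, A and C: 3, B and C: 1), the mean is 5/3 and only catalogue page B is flagged, since catalogue pages A and C each have one intersection of size 3.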
Second method: if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, those further content pages are determined to be content pages of the latest chapters, and the catalogue page is determined to have the ability to keep contributing new chapters. Here, the latest chapters are the most recently published chapters of the chapter-type text, for example the most recently published chapters of a serialized novel. Novel readers usually follow a book as it is serialized: as soon as the author publishes a new chapter, the reader wants to read it at once, so a novel site that publishes the latest chapters faster is more popular with users. In the above example, catalogue page A corresponds to content pages A1, A2 and A3 (with text feature vectors a, b and c respectively), catalogue page B corresponds to content pages B1 and B2 (with text feature vectors a and b' respectively), and catalogue page C corresponds to content pages C1, C2, C3 and C4 (with text feature vectors a, b, c and d respectively). It can be seen that the content pages corresponding to catalogue page C include the content pages corresponding to catalogue page A and catalogue page B, and that catalogue page C also has a further content page (namely content page C4); content page C4 is therefore determined to be a content page of the latest chapter, and catalogue page C has the ability to keep contributing new chapters.
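One way to express this second method in code is sketched below. It rests on assumptions that the text does not settle: "includes" is read as a strict set-containment test on feature vectors, and the hypothetical parameter `min_covered` sets how many other catalogue pages must be fully covered. Under this strict reading, catalogue page C fully covers only catalogue page A in the running example, because b' occurs only under catalogue page B, so the example is run with `min_covered=1`.

```python
# Text feature vectors of the content pages under each catalogue page.
pages = {
    "A": {"a", "b", "c"},
    "B": {"a", "b'"},
    "C": {"a", "b", "c", "d"},
}

def latest_chapter_pages(pages, min_covered=1):
    """Map each catalogue page that leads the others to its extra feature vectors.

    A catalogue page qualifies when it fully contains the feature vectors of at
    least `min_covered` other catalogue pages (assumed reading of "includes")
    and also holds feature vectors found under no other catalogue page.
    """
    result = {}
    for p, own in pages.items():
        others = [feats for q, feats in pages.items() if q != p]
        covered = sum(1 for feats in others if feats <= own)
        extra = own.difference(*others)
        if covered >= min_covered and extra:
            result[p] = extra
    return result

leaders = latest_chapter_pages(pages, min_covered=1)
```

On the example data this yields catalogue page C with the extra feature vector d, which corresponds to content page C4.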
Third method: if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page falls outside the interval corresponding to the average length of the content pages of that catalogue page, the content page is determined to be a false content page. In the above example, catalogue page A corresponds to content pages A1, A2 and A3 (with text feature vectors a, b and c respectively), catalogue page B corresponds to content pages B1 and B2 (with text feature vectors a and b' respectively), and catalogue page C corresponds to content pages C1, C2, C3 and C4 (with text feature vectors a, b, c and d respectively). It can be seen that content page B2 of catalogue page B is not present among the content pages of catalogue pages A and C; if the length of content page B2 falls outside the interval corresponding to the average length of the content pages of catalogue page B, content page B2 is determined to be a false content page.
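The third method can be sketched as below. The page lengths are invented purely for illustration, and the interval is assumed to be a symmetric band of ±50% around the average length; the text does not specify how the interval is derived, so `tolerance` is a hypothetical parameter.

```python
from statistics import mean

# Hypothetical lengths (in characters) keyed by catalogue page, then by
# the text feature vector of each content page.
lengths = {
    "A": {"a": 3000, "b": 3100, "c": 2900},
    "B": {"a": 3000, "b'": 300},
    "C": {"a": 3000, "b": 3100, "c": 2900, "d": 3000},
}

def unique_pages(lengths, p):
    """Feature vectors found under catalogue page p and under no other catalogue page."""
    others = set()
    for q, feats in lengths.items():
        if q != p:
            others |= set(feats)
    return set(lengths[p]) - others

def in_length_range(lengths, p, feature, tolerance=0.5):
    """Check the page length against an assumed band around the catalogue page's average."""
    avg = mean(lengths[p].values())
    return (1 - tolerance) * avg <= lengths[p][feature] <= (1 + tolerance) * avg

def false_pages_by_length(lengths, tolerance=0.5):
    """Return (catalogue page, feature vector) pairs flagged by the third method."""
    return {
        (p, f)
        for p in lengths
        for f in unique_pages(lengths, p)
        if not in_length_range(lengths, p, f, tolerance)
    }

flagged = false_pages_by_length(lengths)
```

With these assumed lengths, the average for catalogue page B is 1650, the band is 825 to 2475, and the page with feature vector b' (content page B2) is flagged; the page with feature vector d under catalogue page C is unique but of ordinary length, so this method does not flag it.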
Fourth method: if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval corresponding to the average length of the content pages of that catalogue page, and the catalogue page does not have the ability to keep contributing new chapters, the content page is determined to be a false content page. In the above example, catalogue page A corresponds to content pages A1, A2 and A3 (with text feature vectors a, b and c respectively), catalogue page B corresponds to content pages B1 and B2 (with text feature vectors a and b' respectively), and catalogue page C corresponds to content pages C1, C2, C3 and C4 (with text feature vectors a, b, c and d respectively). It can be seen that content page B2 of catalogue page B is not present among the content pages of catalogue pages A and C, and the length of content page B2 falls within the interval corresponding to the average length of the content pages of catalogue page B; if catalogue page B does not have the ability to keep contributing new chapters, content page B2 is determined to be a false content page.
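The third and fourth methods differ only in the length condition, so they can be read together as one decision rule. The sketch below assumes the three conditions have already been evaluated elsewhere and are passed in as booleans; the function name `is_false_page` and its parameters are hypothetical.

```python
def is_false_page(is_unique, length_in_range, can_contribute_new):
    """Decide whether a content page is a false content page.

    is_unique          -- the page appears under no other catalogue page in the set
    length_in_range    -- the page length lies within the interval around the
                          average length for its catalogue page
    can_contribute_new -- the catalogue page has been found able to keep
                          contributing new chapters (second method)
    """
    if not is_unique:
        # A page shared with other catalogue pages is not flagged by these rules.
        return False
    if not length_in_range:
        # Third method: unique and of abnormal length.
        return True
    # Fourth method: unique, ordinary length, but the catalogue page is not
    # known to publish new chapters ahead of the other sites.
    return not can_contribute_new
```

For content page B2 of the example (unique, length in range, catalogue page B not able to contribute new chapters) the rule returns `True`; for content page C4 (unique, length in range, catalogue page C able to contribute new chapters) it returns `False`.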
It should be noted that each of the above four analysis methods may be used on its own to analyze chapter integrity, and that any two or more of them may also be combined for that purpose. For example, the first method may be used to determine that the chapters corresponding to catalogue page B are incomplete, after which the third or fourth method is used for further analysis and determines that content page B2 of catalogue page B is a false content page, making the identification result more accurate and objective. In addition, the above example (in which the multiple sites are sites A, B and C, and the corresponding catalogue pages of the chapter-type text are catalogue pages A, B and C respectively; catalogue page A corresponds to content pages A1, A2 and A3, catalogue page B corresponds to content pages B1 and B2, and catalogue page C corresponds to content pages C1, C2, C3 and C4) is merely illustrative and does not limit the invention.
Based on the same inventive concept, an embodiment of the invention further provides a device for identifying the chapter integrity of chapter-type text, which implements the above method for identifying the chapter integrity of chapter-type text.
Fig. 2 is a schematic structural diagram of a device for identifying the chapter integrity of chapter-type text according to an embodiment of the invention. Referring to Fig. 2, the device comprises at least an acquisition module 210, a determination module 220 and an identification module 230.
The functions of the components of the identification device of this embodiment and the connections between them are now described:
The acquisition module 210 is adapted to identify, from multiple sites respectively, the catalogue page and multiple content pages of the chapter-type text, wherein each site corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
The determination module 220 is coupled to the acquisition module 210 and is adapted to determine the catalogue page set of the chapter-type text on different sites according to the content pages corresponding to each catalogue page;
The identification module 230 is coupled to the determination module 220 and is adapted to analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and to identify the chapter integrity of each catalogue page in the catalogue page set according to the analysis result.
In one embodiment of the invention, the determination module 220 is further adapted to: calculate the intersection between the content pages corresponding to every two catalogue pages and take it as the intersection of those two catalogue pages; and determine the catalogue page set of the chapter-type text on different sites according to the intersection of every two catalogue pages.
In one embodiment of the invention, the determination module 220 is further adapted to: extract the text feature vector of each content page among the content pages corresponding to the multiple catalogue pages; cluster the content pages having the same text feature vector to generate multiple content page groups; and calculate the intersection between the content pages corresponding to every two catalogue pages according to the multiple content page groups and the mapping between each catalogue page and its corresponding content pages.
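The grouping and mapping described in this embodiment can be illustrated with the sketch below. It assumes that a text feature vector has already been extracted for each content page and is represented here by a hashable label; the tuple layout and the function names are hypothetical.

```python
from collections import defaultdict

# (catalogue page, content page, text feature vector label) for the running example.
content_pages = [
    ("A", "A1", "a"), ("A", "A2", "b"), ("A", "A3", "c"),
    ("B", "B1", "a"), ("B", "B2", "b'"),
    ("C", "C1", "a"), ("C", "C2", "b"), ("C", "C3", "c"), ("C", "C4", "d"),
]

def group_by_feature(content_pages):
    """Cluster content pages that share the same text feature vector."""
    groups = defaultdict(set)
    for _catalogue, page_id, feature in content_pages:
        groups[feature].add(page_id)
    return dict(groups)

def catalogue_to_groups(content_pages):
    """Map each catalogue page to the content page groups its pages fall into."""
    mapping = defaultdict(set)
    for catalogue, _page_id, feature in content_pages:
        mapping[catalogue].add(feature)
    return dict(mapping)

def intersection(mapping, p, q):
    """Content page groups shared by catalogue pages p and q."""
    return mapping[p] & mapping[q]

groups = group_by_feature(content_pages)
mapping = catalogue_to_groups(content_pages)
```

Working with group labels rather than individual pages means two pages from different sites count as the same chapter whenever they fall into the same group.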
In one embodiment of the invention, the determination module 220 is further adapted to: merge every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold to obtain a merged result; and take the merged result as the catalogue page set of the chapter-type text on different sites.
In one embodiment of the invention, the identification module 230 is further adapted to: calculate the mean number of elements in the intersections of every two catalogue pages in the catalogue page set; and, if the number of elements in the intersection of a given catalogue page with each of the other catalogue pages in the catalogue page set is less than the mean, determine that the chapters corresponding to that catalogue page are incomplete.
In one embodiment of the invention, the identification module 230 is further adapted to: if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, determine that the further content pages are content pages of the latest chapters and that the catalogue page has the ability to keep contributing new chapters.
In one embodiment of the invention, the identification module 230 is further adapted to: if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page falls outside the interval corresponding to the average length of the content pages of that catalogue page, determine that the content page is a false content page.
In one embodiment of the invention, the identification module 230 is further adapted to: if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval corresponding to the average length of the content pages of that catalogue page, and the catalogue page does not have the ability to keep contributing new chapters, determine that the content page is a false content page.
In one embodiment of the invention, the acquisition module 210 is further adapted to: search multiple sites for web pages related to the chapter-type text; and identify the catalogue page and multiple content pages of the chapter-type text from the web pages found.
In one embodiment of the invention, the acquisition module 210 is further adapted to: parse each web page found into a document object model (DOM) tree structure; classify each node in the DOM tree structure to determine the structural blocks of the web page; and extract the catalogue page and multiple content pages of the chapter-type text according to the structural blocks.
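The text does not say how nodes are classified, so the following is only a rough sketch under assumed heuristics: each `<div>` is treated as a structural block, a block with several links is taken to indicate a catalogue page, and a block with a long run of non-link text is taken to indicate a content page. The class `BlockStats`, the function `classify_page` and the thresholds `min_links` and `min_text` are all hypothetical, and the code walks the markup with Python's standard `html.parser` instead of materializing a full DOM tree.

```python
from html.parser import HTMLParser

class BlockStats(HTMLParser):
    """Collect, per <div> block, the number of links and the amount of non-link text."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open blocks, innermost last
        self.blocks = []   # blocks whose closing tag has been seen
        self.in_link = 0   # nesting depth of <a> elements

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append({"links": 0, "text": 0})
        elif tag == "a":
            self.in_link += 1
            if self.stack:
                self.stack[-1]["links"] += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.blocks.append(self.stack.pop())
        elif tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        # Only text outside links counts towards the body text of a block.
        if self.stack and not self.in_link:
            self.stack[-1]["text"] += len(data.strip())

def classify_page(html, min_links=3, min_text=200):
    """Label a page as 'catalogue', 'content' or 'other' using the assumed heuristics."""
    parser = BlockStats()
    parser.feed(html)
    parser.close()
    if any(b["links"] >= min_links for b in parser.blocks):
        return "catalogue"
    if any(b["text"] >= min_text for b in parser.blocks):
        return "content"
    return "other"
```

A production classifier would likely use richer node features than link counts and text length; the point here is only the shape of the pipeline: parse, classify blocks, then label the page.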
According to any one of the above preferred embodiments, or a combination of several of them, embodiments of the invention can achieve the following beneficial effects:
According to the technical solution provided by the invention, the catalogue page and multiple content pages of the chapter-type text are identified from multiple sites respectively, and the catalogue page set of the chapter-type text on different sites is then determined according to the content pages corresponding to each catalogue page. Each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page are then analyzed, and the chapter integrity of each catalogue page in the catalogue page set is identified according to the analysis result. The invention thus automates all three stages, namely acquiring the data source (the catalogue pages and content pages of the chapter-type text on multiple sites), determining the catalogue page set, and analyzing the catalogue page set, which solves the inefficiency in the related art caused by judging chapter integrity through manually configured templates. Furthermore, the invention can acquire the data source flexibly and then determine and analyze the catalogue page set, which solves the related art's slow response to changes in site layout. In addition, catalogue pages and content pages reflect the chapter integrity of chapter-type text accurately and objectively; the invention analyzes, in a targeted manner, each catalogue page in the catalogue page set of the chapter-type text on different sites and/or the content pages corresponding to each catalogue page, and identifies the chapter integrity of each catalogue page in the set according to the analysis result, making the identification result more accurate. In summary, the technical solution provided by the invention can identify the chapter integrity of chapter-type text flexibly and quickly, and the identification result is accurate and objective.
The invention also discloses:
A1. A method for identifying the chapter integrity of chapter-type text, comprising:
identifying, from multiple sites respectively, the catalogue page and multiple content pages of the chapter-type text, wherein each site corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
determining the catalogue page set of the chapter-type text on different sites according to the content pages corresponding to each catalogue page;
analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page in the catalogue page set according to the analysis result.
A2. The method according to A1, wherein determining the catalogue page set of the chapter-type text on different sites according to the content pages corresponding to each catalogue page comprises:
calculating the intersection between the content pages corresponding to every two catalogue pages, and taking it as the intersection of those two catalogue pages;
determining the catalogue page set of the chapter-type text on different sites according to the intersection of every two catalogue pages.
A3. The method according to A1 or A2, wherein calculating the intersection between the content pages corresponding to every two catalogue pages comprises:
extracting the text feature vector of each content page among the content pages corresponding to the multiple catalogue pages;
clustering the content pages having the same text feature vector to generate multiple content page groups;
calculating the intersection between the content pages corresponding to every two catalogue pages according to the multiple content page groups and the mapping between each catalogue page and its corresponding content pages.
A4. The method according to any one of A1 to A3, wherein determining the catalogue page set of the chapter-type text on different sites according to the intersection of every two catalogue pages comprises:
merging every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold to obtain a merged result;
taking the merged result as the catalogue page set of the chapter-type text on different sites.
A5. The method according to any one of A1 to A4, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page in the catalogue page set according to the analysis result, comprises:
calculating the mean number of elements in the intersections of every two catalogue pages in the catalogue page set;
if the number of elements in the intersection of a given catalogue page with each of the other catalogue pages in the catalogue page set is less than the mean, determining that the chapters corresponding to that catalogue page are incomplete.
A6. The method according to any one of A1 to A5, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page in the catalogue page set according to the analysis result, comprises:
if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, determining that the further content pages are content pages of the latest chapters and that the catalogue page has the ability to keep contributing new chapters.
A7. The method according to any one of A1 to A6, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page in the catalogue page set according to the analysis result, comprises:
if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page falls outside the interval corresponding to the average length of the content pages of that catalogue page, determining that the content page is a false content page.
A8. The method according to any one of A1 to A7, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page in the catalogue page set according to the analysis result, comprises:
if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval corresponding to the average length of the content pages of that catalogue page, and the catalogue page does not have the ability to keep contributing new chapters, determining that the content page is a false content page.
A9. The method according to any one of A1 to A8, wherein identifying, from multiple sites respectively, the catalogue page and multiple content pages of the chapter-type text comprises:
searching multiple sites for web pages related to the chapter-type text;
identifying the catalogue page and multiple content pages of the chapter-type text from the web pages found.
A10. The method according to any one of A1 to A9, wherein identifying the catalogue page and multiple content pages of the chapter-type text from the web pages found comprises:
parsing each web page found into a document object model (DOM) tree structure;
classifying each node in the DOM tree structure to determine the structural blocks of the web page;
extracting the catalogue page and multiple content pages of the chapter-type text according to the structural blocks.
B11. A device for identifying the chapter integrity of chapter-type text, comprising:
an acquisition module, adapted to identify, from multiple sites respectively, the catalogue page and multiple content pages of the chapter-type text, wherein each site corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
a determination module, adapted to determine the catalogue page set of the chapter-type text on different sites according to the content pages corresponding to each catalogue page;
an identification module, adapted to analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and to identify the chapter integrity of each catalogue page in the catalogue page set according to the analysis result.
B12. The device according to B11, wherein the determination module is further adapted to:
calculate the intersection between the content pages corresponding to every two catalogue pages, and take it as the intersection of those two catalogue pages;
determine the catalogue page set of the chapter-type text on different sites according to the intersection of every two catalogue pages.
B13. The device according to B11 or B12, wherein the determination module is further adapted to:
extract the text feature vector of each content page among the content pages corresponding to the multiple catalogue pages;
cluster the content pages having the same text feature vector to generate multiple content page groups;
calculate the intersection between the content pages corresponding to every two catalogue pages according to the multiple content page groups and the mapping between each catalogue page and its corresponding content pages.
B14. The device according to any one of B11 to B13, wherein the determination module is further adapted to:
merge every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold to obtain a merged result;
take the merged result as the catalogue page set of the chapter-type text on different sites.
B15. The device according to any one of B11 to B14, wherein the identification module is further adapted to:
calculate the mean number of elements in the intersections of every two catalogue pages in the catalogue page set;
if the number of elements in the intersection of a given catalogue page with each of the other catalogue pages in the catalogue page set is less than the mean, determine that the chapters corresponding to that catalogue page are incomplete.
B16. The device according to any one of B11 to B15, wherein the identification module is further adapted to:
if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and the content pages corresponding to that catalogue page also contain further content pages, determine that the further content pages are content pages of the latest chapters and that the catalogue page has the ability to keep contributing new chapters.
B17. The device according to any one of B11 to B16, wherein the identification module is further adapted to:
if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page falls outside the interval corresponding to the average length of the content pages of that catalogue page, determine that the content page is a false content page.
B18. The device according to any one of B11 to B17, wherein the identification module is further adapted to:
if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval corresponding to the average length of the content pages of that catalogue page, and the catalogue page does not have the ability to keep contributing new chapters, determine that the content page is a false content page.
B19. The device according to any one of B11 to B18, wherein the acquisition module is further adapted to:
search multiple sites for web pages related to the chapter-type text;
identify the catalogue page and multiple content pages of the chapter-type text from the web pages found.
B20. The device according to any one of B11 to B19, wherein the acquisition module is further adapted to:
parse each web page found into a document object model (DOM) tree structure;
classify each node in the DOM tree structure to determine the structural blocks of the web page;
extract the catalogue page and multiple content pages of the chapter-type text according to the structural blocks.
Numerous specific details are set forth in the description provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of an embodiment may be combined into one module or unit or component, and may furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
Furthermore, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to fall within the scope of the invention and to form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for identifying the chapter integrity of chapter-type text according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.
Thus, those skilled in the art will recognize that, although several exemplary embodiments of the invention have been shown and described in detail herein, many other variations or modifications consistent with the principles of the invention may be directly determined or derived from the disclosure without departing from the spirit and scope of the invention. The scope of the invention should therefore be understood and deemed to cover all such other variations or modifications.
Claims (10)
1. A method for identifying the chapter integrity of chapter-type text, comprising:
identifying, from multiple sites respectively, the catalogue page and multiple content pages of the chapter-type text, wherein each site corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
determining the catalogue page set of the chapter-type text on different sites according to the content pages corresponding to each catalogue page;
analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page in the catalogue page set according to the analysis result.
2. The method according to claim 1, wherein determining, according to the multiple content pages corresponding to each catalogue page, the catalogue page set of the chapter-type text on different sites comprises:
calculating the intersection between the content pages corresponding to every two catalogue pages, and taking it as the intersection of those two catalogue pages;
determining, according to the intersection of every two catalogue pages, the catalogue page set of the chapter-type text on different sites.
3. method according to claim 1 and 2, wherein, the common factor between the content pages that every two catalogue pages of described calculating are corresponding, comprising:
Extract the Text eigenvector of each content pages in content pages corresponding to multiple catalogue page;
The content pages possessing same text proper vector is carried out cluster, generates the grouping of multiple content pages;
According to the mapping relations of the content pages that described multiple content pages is divided into groups and each catalogue page is corresponding with it, calculate the common factor between content pages corresponding to every two catalogue pages.
4. The method according to any one of claims 1 to 3, wherein determining, according to the intersection of every two catalogue pages, the catalogue page set of the chapter-type text on different sites comprises:
merging every two catalogue pages whose intersection contains a number of elements greater than or equal to a preset threshold, to obtain a merge result;
taking the merge result as the catalogue page set of the chapter-type text on different sites.
5. The method according to any one of claims 1 to 4, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page in the catalogue page set according to the result of the analysis, comprises:
calculating the mean of the number of elements in the intersection of every two catalogue pages in the catalogue page set;
if the number of elements in the intersection of a given catalogue page with each of multiple other catalogue pages in the catalogue page set is less than the mean, determining that the chapters corresponding to that catalogue page are incomplete.
6. The method according to any one of claims 1 to 5, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page in the catalogue page set according to the result of the analysis, comprises:
if the content pages corresponding to a given catalogue page include the content pages corresponding to multiple other catalogue pages in the catalogue page set, and other content pages also exist among the content pages corresponding to that catalogue page, determining that those other content pages are the content pages of the latest chapters and that the catalogue page is capable of continuing to contribute new chapters.
7. The method according to any one of claims 1 to 6, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page in the catalogue page set according to the result of the analysis, comprises:
if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, and the length of that content page falls outside the interval range corresponding to the average length of the content pages corresponding to that catalogue page, determining that the content page is a false content page.
8. The method according to any one of claims 1 to 7, wherein analyzing each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and identifying the chapter integrity of each catalogue page in the catalogue page set according to the result of the analysis, comprises:
if a content page corresponding to a given catalogue page is not present among the content pages corresponding to the other catalogue pages in the catalogue page set, the length of that content page falls within the interval range corresponding to the average length of the content pages corresponding to that catalogue page, and the catalogue page is not capable of continuing to contribute new chapters, determining that the content page is a false content page.
9. The method according to any one of claims 1 to 8, wherein identifying, from multiple sites, the catalogue page and the multiple content pages of the chapter-type text comprises:
searching the multiple sites for web pages related to the chapter-type text;
identifying the catalogue page and the multiple content pages of the chapter-type text from the web pages found.
10. A device for identifying the chapter integrity of chapter-type text, comprising:
an acquisition module, adapted to identify, from multiple sites, a catalogue page and multiple content pages of the chapter-type text, wherein each site corresponds to one catalogue page and each catalogue page corresponds to multiple content pages;
a determination module, adapted to determine, according to the multiple content pages corresponding to each catalogue page, a catalogue page set of the chapter-type text on different sites;
an identification module, adapted to analyze each catalogue page in the catalogue page set and/or the content pages corresponding to each catalogue page, and to identify the chapter integrity of each catalogue page in the catalogue page set according to the result of the analysis.
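As an illustration only, the grouping, intersection, and merging steps of claims 2 to 4 might be sketched in Python as follows. The function names, the `fingerprint` callable (a stand-in for the text feature vector of claim 3), and the choice to merge transitively with a union-find structure are all assumptions made for this sketch; the claims do not prescribe them.

```python
from collections import defaultdict
from itertools import combinations


def group_ids(pages_by_catalogue, fingerprint):
    """Map each catalogue page to the set of content-group ids it links to.

    Content pages with equal fingerprints share a group id, which stands in
    for the clustering of pages with the same text feature vector (claim 3).
    """
    groups = {}   # fingerprint -> group id (hypothetical representation)
    result = {}
    for catalogue, pages in pages_by_catalogue.items():
        result[catalogue] = set()
        for text in pages:
            gid = groups.setdefault(fingerprint(text), len(groups))
            result[catalogue].add(gid)
    return result


def pairwise_intersections(groups_by_catalogue):
    """Intersection of content groups for every two catalogue pages (claim 2)."""
    return {
        (a, b): groups_by_catalogue[a] & groups_by_catalogue[b]
        for a, b in combinations(sorted(groups_by_catalogue), 2)
    }


def merge_catalogues(groups_by_catalogue, threshold):
    """Merge catalogue pages whose intersection has >= threshold elements (claim 4).

    Assumption: merging is transitive, so a union-find structure is used.
    """
    parent = {c: c for c in groups_by_catalogue}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for (a, b), common in pairwise_intersections(groups_by_catalogue).items():
        if len(common) >= threshold:
            parent[find(a)] = find(b)

    merged = defaultdict(set)
    for c in groups_by_catalogue:
        merged[find(c)].add(c)
    return sorted(sorted(members) for members in merged.values())
```

With a whitespace-insensitive fingerprint such as `lambda t: " ".join(t.split()).lower()`, two sites that carry the same chapters end up in one catalogue page set, while a site with unrelated pages stays in a set of its own.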
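The completeness test of claim 5 and the latest-chapter test of claim 6 can be illustrated with a short Python sketch. It assumes each catalogue page has already been reduced to a set of content-group ids, reads "multiple other catalogue pages" as "all other catalogue pages in the set", and uses hypothetical function names; none of these choices is stated in the claims.

```python
from itertools import combinations
from statistics import mean


def incomplete_catalogues(chapters):
    """Claim 5 sketch: flag catalogue pages whose intersection with every
    other catalogue page is smaller than the mean pairwise intersection size.

    chapters: dict mapping catalogue page name -> set of content-group ids.
    """
    names = sorted(chapters)
    if len(names) < 2:
        return []
    sizes = {
        frozenset((a, b)): len(chapters[a] & chapters[b])
        for a, b in combinations(names, 2)
    }
    avg = mean(sizes.values())
    return [
        c for c in names
        if all(sizes[frozenset((c, o))] < avg for o in names if o != c)
    ]


def newest_chapters(chapters, catalogue):
    """Claim 6 sketch: if `catalogue` covers the content pages of all other
    catalogue pages, return its extra pages (treated as the latest chapters);
    otherwise return an empty set.
    """
    others = [pages for name, pages in chapters.items() if name != catalogue]
    if not others:
        return set()
    covered = set().union(*others)
    own = chapters[catalogue]
    return own - covered if covered <= own else set()
```

A non-empty result from `newest_chapters` corresponds to the situation in which claim 6 treats the catalogue page as capable of continuing to contribute new chapters.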
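The false-content-page conditions of claims 7 and 8 can be combined into one check, sketched below in Python. The claims do not define the "interval range corresponding to the average length", so the `tolerance` parameter (a symmetric band around the mean) is purely an assumption, as are the function and argument names.

```python
from statistics import mean


def false_pages(lengths, shared, can_contribute, tolerance=0.5):
    """Flag suspected false content pages for one catalogue page.

    lengths:        dict mapping content page id -> page length
    shared:         ids of pages also found under other catalogue pages
    can_contribute: whether this catalogue page is judged capable of
                    contributing new chapters (see claim 6)
    tolerance:      hypothetical half-width of the accepted length band,
                    expressed as a fraction of the average length
    """
    avg = mean(lengths.values())
    low, high = avg * (1 - tolerance), avg * (1 + tolerance)
    flagged = set()
    for page, length in lengths.items():
        if page in shared:
            continue  # corroborated by another catalogue page
        in_range = low <= length <= high
        # Claim 7: unique page with an atypical length.
        # Claim 8: unique page with a typical length, but the catalogue
        # page is not a known source of new chapters.
        if not in_range or not can_contribute:
            flagged.add(page)
    return flagged
```

Under this reading, a unique page of typical length is accepted only when its catalogue page has already shown that it contributes new chapters; every other unique page is flagged.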
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410578534.2A CN104317903B (en) | 2014-10-24 | 2014-10-24 | Chapter type text chapter integrity identification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410578534.2A CN104317903B (en) | 2014-10-24 | 2014-10-24 | Chapter type text chapter integrity identification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104317903A (en) | 2015-01-28 |
CN104317903B (en) | 2017-10-13 |
Family
ID=52373135
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410578534.2A Active CN104317903B (en) | 2014-10-24 | 2014-10-24 | Chapter type text chapter integrity identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104317903B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7158962B2 (en) * | 2002-11-27 | 2007-01-02 | International Business Machines Corporation | System and method for automatically linking items with multiple attributes to multiple levels of folders within a content management system |
CN103123640A (en) * | 2012-02-22 | 2013-05-29 | 深圳市谷古科技有限公司 | Method and device for searching novel |
CN103310160A (en) * | 2013-06-20 | 2013-09-18 | 北京神州绿盟信息安全科技股份有限公司 | Method, system and device for preventing webpage from being tampered with |
CN103365877A (en) * | 2012-03-29 | 2013-10-23 | 百度在线网络技术(北京)有限公司 | Method and server for making directory after webpage is transcoded |
US8631029B1 (en) * | 2010-03-26 | 2014-01-14 | A9.Com, Inc. | Evolutionary content determination and management |
CN103577566A (en) * | 2013-10-25 | 2014-02-12 | 北京奇虎科技有限公司 | Web reading content loading method and device |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106033405A (en) * | 2015-03-10 | 2016-10-19 | 腾讯科技(深圳)有限公司 | A network book contents integrity detection method and device |
CN106033405B (en) * | 2015-03-10 | 2020-06-05 | 腾讯科技(深圳)有限公司 | Network book catalog integrity detection method and device |
CN105447130A (en) * | 2015-11-18 | 2016-03-30 | 北京奇虎科技有限公司 | Method and device for acquiring new chapter of network novel |
CN105447130B (en) * | 2015-11-18 | 2018-12-25 | 北京奇虎科技有限公司 | The acquisition methods and device of the new chapters and sections of the network novel |
CN113407889A (en) * | 2021-07-15 | 2021-09-17 | 北京百度网讯科技有限公司 | Novel transcoding method, device, equipment and storage medium |
CN113407889B (en) * | 2021-07-15 | 2023-10-20 | 北京百度网讯科技有限公司 | Novel transcoding method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN104317903B (en) | 2017-10-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108280114B (en) | Deep learning-based user literature reading interest analysis method | |
KR101723862B1 (en) | Apparatus and method for classifying and analyzing documents including text | |
CN106202380B (en) | Method and system for constructing classified corpus and server with system | |
CN105373546B (en) | A kind of information processing method and system for knowledge services | |
CN104486461A (en) | Domain name classification method and device and domain name recognition method and system | |
CN106970912A (en) | Chinese sentence similarity calculating method, computing device and computer-readable storage medium | |
CN105718585B (en) | Document and label word justice correlating method and its device | |
CN105528422A (en) | Focused crawler processing method and apparatus | |
CN103559313B (en) | Searching method and device | |
CN104133868B (en) | A kind of strategy integrated for the classification of vertical reptile data | |
CN105912645A (en) | Intelligent question and answer method and apparatus | |
CN105653547A (en) | Method and device for extracting keywords of text | |
CN103873318A (en) | Website automated testing method and automated testing system | |
CN106598949A (en) | Method and device for confirming contribution degree of words to text | |
CN104462512A (en) | Chinese information search method and device based on knowledge graph | |
CN105550169A (en) | Method and device for identifying point of interest names based on character length | |
CN103106211B (en) | Emotion recognition method and emotion recognition device for customer consultation texts | |
CN103150331A (en) | Method and device for providing search engine tags | |
CN107301167A (en) | A kind of work(performance description information recognition methods and device | |
CN104331438A (en) | Method and device for selectively extracting content of novel webpage | |
CN104317903A (en) | Chapter type text chapter integrity identification method and device | |
CN106202349A (en) | Web page classifying dictionary creation method and device | |
CN105159885A (en) | Point-of-interest name identification method and device | |
CN103076894A (en) | Method and equipment for building input entries for object identity information according to object identity information | |
CN104408036B (en) | It is associated with recognition methods and the device of topic |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 20220727; Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015; Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.; Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park); Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.; Patentee before: Qizhi software (Beijing) Co.,Ltd. |