CN109033041A

CN109033041A - Method and apparatus for processing document similarity

Info

Publication number: CN109033041A
Application number: CN201710432891.1A
Authority: CN
Inventors: 陈飞
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2017-06-09
Filing date: 2017-06-09
Publication date: 2018-12-18

Abstract

The invention discloses a method and apparatus for processing document similarity. The method comprises: obtaining a focus document library and a target document, wherein the focus document library includes the focus sequence corresponding to each document in a document library, and a focus sequence includes identifiers that characterize the focuses in the corresponding document; obtaining the focus sequence of the target document; and determining, according to the focus sequence of the target document and each focus sequence in the focus document library, a similarity ranking of each document in the document library relative to the target document. The invention solves the technical problem in the prior art that manually searching for documents similar to a target document leads to low accuracy.

Description

Method and apparatus for processing document similarity

Technical field

The present invention relates to the field of data processing, and in particular to a method and apparatus for processing document similarity.

Background art

At present, similar documents are identified mainly in two ways. Method one: documents are read and labeled manually, and documents with similar labels are then found by querying the document library. Method two: documents are vectorized (based on word frequency), and the distance between the document vectors is then calculated.

However, method one relies purely on manual effort. It is laborious and time-consuming, and it places high demands on the business staff: because professional knowledge and experience differ, the labels that different people assign to the same document may differ considerably. Preparing a batch of labels in advance for the staff to choose from increases the cost of the work and still cannot fully eliminate the differences in experience. Method two identifies similar documents through document vectorization, but a method based on word frequency largely fails to find the key points of a document, so its accuracy is not high.

For the problem in the prior art that manually searching for documents similar to a target document leads to low accuracy, no effective solution has yet been proposed.

Summary of the invention

Embodiments of the present invention provide a method and apparatus for processing document similarity, so as to at least solve the technical problem in the prior art that manually searching for documents similar to a target document leads to low accuracy.

According to one aspect of the embodiments of the present invention, a method for processing document similarity is provided, comprising: obtaining a focus document library and a target document, wherein the focus document library includes the focus sequence corresponding to each document in a document library, and the focus sequence includes identifiers that characterize the focuses in the corresponding document; obtaining the focus sequence of the target document; and determining, according to the focus sequence of the target document and each focus sequence in the focus document library, a similarity ranking of each document in the document library relative to the target document.

According to another aspect of the embodiments of the present invention, an apparatus for processing document similarity is also provided, comprising: a first obtaining module, configured to obtain a focus document library and a target document, wherein the focus document library includes the focus sequence corresponding to each document in a document library, and the focus sequence includes identifiers that characterize the focuses in the corresponding document; a second obtaining module, configured to obtain the focus sequence of the target document; and a first determining module, configured to determine, according to the focus sequence of the target document and each focus sequence in the focus document library, a similarity ranking of each document in the document library relative to the target document.

In the embodiments of the present invention, a focus document library and a target document are obtained, the focus sequence of the target document is obtained, and a similarity ranking of each document in the document library relative to the target document is determined according to the focus sequence of the target document and each focus sequence in the focus document library. By obtaining the focus sequence of the target document and the focus sequences of all documents in the document library, the above scheme obtains the similarity ranking of the documents in the document library relative to the target document, and thereby the documents in the document library that are most similar to the target document. This solves the problem in the prior art that manually searching for documents similar to a target document leads to low accuracy.

Brief description of the drawings

The drawings described here are provided for a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of the method for processing document similarity according to an embodiment of the present invention; and

Fig. 2 is a schematic diagram of the apparatus for processing document similarity according to an embodiment of the present invention.

Detailed description of the embodiments

To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units that are clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product or device.

The terms that appear in the embodiments are explained below to aid understanding of the embodiments.

Judgment document: a document that records the process and result of a court's hearing of a case. It is the carrier of the result of litigation and the sole evidence by which the people's court determines and allocates the rights and obligations of the parties.

Focus: the focus of dispute, that is, the core of the dispute and the point at which the conflict comes to a head; it is the issue on which the two parties to a case disagree. In form, it is the point of contention summarized by the judge and confirmed by the parties. It is the main thread and hinge that guides the trial of the case and the resolution of the dispute, it reflects how well the judge knows the facts of the case, and it is a prominent sign of the judge's ability to connect the law with the case.

Embodiment 1

According to an embodiment of the present invention, an embodiment of a method for processing document similarity is provided. It should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one given here.

Fig. 1 is a flowchart of the method for processing document similarity according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:

Step S102: obtain a focus document library and a target document, wherein the focus document library includes the focus sequence corresponding to each document in a document library, and a focus sequence includes identifiers that characterize the focuses in the corresponding document.

Specifically, the document may be a judgment document. The identifier corresponding to a focus may be obtained by matching the focus according to the class and the feature to which it belongs. For a given document, the set formed by the identifiers of all the focuses it contains is the focus sequence of that document; that is, once the focus sequence of a document has been obtained, the focuses of that document are known.

Step S104: obtain the focus sequence of the target document.

In this step, the target document may be input to a preset focus rule engine, which then outputs the focus sequence of the document. In an optional embodiment, the focus rule engine may learn from documents whose focuses have already been determined, obtaining the attribute information of the focuses and the positions at which the focuses appear in the documents, so that it can find the focuses in a new document on the basis of what it has learned.

Step S106: determine, according to the focus sequence of the target document and each focus sequence in the focus document library, a similarity ranking of each document in the document library relative to the target document.

As can be seen from the above, the above embodiment of the present application obtains a focus document library and a target document, obtains the focus sequence of the target document, and determines a similarity ranking of each document in the document library relative to the target document according to the focus sequence of the target document and each focus sequence in the focus document library. By obtaining the focus sequence of the target document and the focus sequences of all documents in the document library, the above scheme obtains the similarity ranking of the documents in the document library relative to the target document, and thereby the documents in the document library that are most similar to the target document. This solves the problem in the prior art that manually searching for documents similar to a target document leads to low accuracy.

Optionally, according to the above embodiment of the present application, obtaining the focus document library includes:

Step S1021: extract the focuses of each document in the document library to obtain the focus sequence corresponding to each document.

In an optional embodiment, taking the field of trademark infringement as an example, the focuses are divided into five major focus classes: A: trademark type; B: type of act; C: grounds of defense; D: liability for infringement; E: other focuses.

Focus class A can be further divided by feature into the following subclasses: A1 well-known trademark, A2 goods trademark, A3 service trademark, A4 certification trademark, A4 collective trademark, A5 stereoscopic trademark, A6 sound trademark, A7 three-dimensional symbol trademark and A8 geographical indication;

Focus class B can be further divided by feature into the following subclasses: B1 trademark counterfeiting/passing off, B2 selling goods that infringe the exclusive right to use a registered trademark, B3 contributory trademark infringement, B4 infringement of another party's registered trademark by an enterprise trade name, B5 infringement of another party's registered trademark by a domain name, B6 infringement of a well-known trademark and B7 other infringing acts;

Focus class C can be further divided by feature into the following subclasses: C1 prior right and fair use, C2 whether there is a legitimate source/legitimate channel, C3 the registered trademark has not actually been used and C4 co-owner of the trademark/jointly owned trademark;

Focus class D can be further divided by feature into the following subclasses: D1 cessation of infringement/cessation of sale/destruction, D2 compensation for losses and D3 elimination of adverse effects;

Focus class E can be further divided by feature into the following subclasses: E1 whether the party is a qualified subject, E2 whether the limitation period has expired, E3 whether the use constitutes trademark use, E4 already subject to an administrative penalty/already subject to a criminal judgment and E5 whether unfair competition is constituted.

In this embodiment, therefore, the focuses can be classified, and the identifier corresponding to the class to which a focus belongs is the identifier of that focus.
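The identifier scheme described above (a class letter followed by a subclass number, joined with "+" into a focus sequence) can be sketched as follows. This is only an illustration under stated assumptions: the names `FOCUS_CLASSES`, `focus_class` and `parse_sequence` are hypothetical, and the text does not specify how identifiers are stored or validated.

```python
from typing import Dict, List

# Hypothetical table of the five focus classes named in the example.
FOCUS_CLASSES: Dict[str, str] = {
    "A": "trademark type",
    "B": "type of act",
    "C": "grounds of defense",
    "D": "liability for infringement",
    "E": "other focuses",
}


def focus_class(identifier: str) -> str:
    """Return the class letter of a focus identifier such as 'B2'."""
    if len(identifier) < 2:
        raise ValueError(f"not a focus identifier: {identifier!r}")
    letter, number = identifier[0], identifier[1:]
    if letter not in FOCUS_CLASSES or not number.isdigit():
        raise ValueError(f"not a focus identifier: {identifier!r}")
    return letter


def parse_sequence(text: str) -> List[str]:
    """Split a string such as 'A1+B1+B2' into validated identifiers."""
    identifiers = [part.strip() for part in text.split("+")]
    for identifier in identifiers:
        focus_class(identifier)  # raises ValueError on a malformed identifier
    return identifiers
```

Keeping the class letter recoverable from the identifier is what later allows focuses to be compared class by class.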

Step S1023: form the focus document library from the focus sequences of the documents.

Specifically, the focus document library may store not only the focus sequences but also the correspondence between focus sequences and documents, so that the corresponding document can be found from a focus sequence.

As can be seen from the above, the above scheme of the present application extracts the focuses of each document in the document library to obtain the focus sequence corresponding to each document, and forms the focus document library from the focus sequences of the documents. Because the focus document library is determined from the focuses of each document in the document library, the full text of the documents does not need to be compared during comparison, which reduces the time the comparison takes and improves its efficiency.
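One way to hold both the focus sequences and the correspondence between sequences and documents is a pair of dictionaries. The sketch below is an assumption about the storage layout, which the text leaves open; `build_focus_library` and the document ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

FocusSequence = Tuple[str, ...]


def build_focus_library(
    doc_sequences: Dict[str, List[str]],
) -> Tuple[Dict[str, FocusSequence], Dict[FocusSequence, List[str]]]:
    """Build a focus document library from already extracted focus sequences.

    Returns two views of the same data:
      by_doc      : document id -> focus sequence
      by_sequence : focus sequence -> ids of the documents that have it
    """
    by_doc = {doc_id: tuple(seq) for doc_id, seq in doc_sequences.items()}
    by_sequence: Dict[FocusSequence, List[str]] = defaultdict(list)
    for doc_id, seq in by_doc.items():
        by_sequence[seq].append(doc_id)
    return by_doc, dict(by_sequence)
```

The second dictionary is what makes it possible to go from a matching focus sequence back to the documents it came from.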

Optionally, according to the above embodiment of the present application, determining the similarity ranking of each document in the document library relative to the target document according to the focus sequence of the target document and each focus sequence in the focus document library includes:

Step S1061: if a first focus sequence in the focus document library is identical to the focus sequence of the target document, determine that the document corresponding to the first focus sequence has the highest similarity to the target document.

In an optional embodiment, taking a target document whose focus sequence is A1+B1+B2+C3+D2 as an example, if the focus sequence of some document is also A1+B1+B2+C3+D2, it is determined that this document has the highest similarity to the target document.

It should be noted that when a focus sequence is formed from the focuses of a document, the focuses are generally arranged according to the preset priority of the focus classes. For example, in A1+B1+B2+C3+D2 the priority is A > B > C > D, and B1 > B2 > B3. In this case the focus sequences are compared element by element in order. However, the focuses in a focus sequence may also be arranged in other ways, for example in the order in which they appear in the document. In that case, if the focus sequence of the target document is A1+B1+B2+C3+D2, a document is determined to be identical to the target document as long as it includes A1, B1, B2, C3 and D2 and includes only these focuses.
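The two notions of "identical" described above can be sketched as one function with a flag. This is a minimal sketch; the function name and the `ordered` parameter are hypothetical, and treating the unordered case as set equality is an interpretation of "includes these focuses and only these".

```python
from typing import Sequence


def sequences_identical(
    candidate: Sequence[str], target: Sequence[str], ordered: bool = True
) -> bool:
    """Check whether two focus sequences count as identical.

    ordered=True  : sequences are arranged by class priority, so they are
                    compared element by element in order.
    ordered=False : sequences use some other arrangement (for example order
                    of appearance), so only the set of focuses matters.
    """
    if ordered:
        return list(candidate) == list(target)
    return set(candidate) == set(target)
```

For example, `["B1", "A1"]` and `["A1", "B1"]` are identical only when `ordered=False`.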

Step S1063: if second focus sequences in the focus document library are not identical to the focus sequence of the target document, determine the similarity ranking of the documents corresponding to the second focus sequences relative to the target document by comparing the focuses in each second focus sequence with the focuses in the target document.

Optionally, according to the above embodiment of the present application, a focus includes a focus class and a focus feature, and each focus class corresponds to a preset priority. Determining the similarity ranking of the documents corresponding to the second focus sequences relative to the target document by comparing the focuses in each second focus sequence with the focuses in the target document includes:

Step S1065: in order of focus class priority, compare the focus features of each focus class in the second focus sequence with the focus features of the same focus class in the target document, obtaining a comparison result for each focus class.

In an optional embodiment, the focus sequence of the target document is A1+B1+B2+C3 and the priority of the focus classes is B > A > D > C; the focus sequence of document x is A1+B1+B2+D3, that of document y is B1+B2+B4+C3 and that of document z is A1+B1+B2+B4+C3. The comparison shows that, for documents x, y and z respectively, the comparison results with the target document are: identical, different and different for the first focus class; identical, different and identical for the second focus class; different, identical and identical for the third focus class; and different, identical and identical for the fourth focus class.
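The per-class comparison in this example can be reproduced with a short sketch. It assumes that the class of a focus is the first character of its identifier and that a class absent from both documents counts as identical, which is what the results for class D in the example imply; the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def features_by_class(sequence: Sequence[str]) -> Dict[str, Set[str]]:
    """Group focus identifiers by class letter, e.g. {'B': {'B1', 'B2'}}."""
    grouped: Dict[str, Set[str]] = {}
    for identifier in sequence:
        grouped.setdefault(identifier[0], set()).add(identifier)
    return grouped


def compare_by_class(
    candidate: Sequence[str], target: Sequence[str], priority: Sequence[str]
) -> List[bool]:
    """Return one result per focus class, highest priority first.

    True means the candidate and the target have exactly the same focus
    features in that class (including the case where both have none).
    """
    cand, targ = features_by_class(candidate), features_by_class(target)
    return [cand.get(c, set()) == targ.get(c, set()) for c in priority]
```

With the target A1+B1+B2+C3 and priority B > A > D > C, document x (A1+B1+B2+D3) gives `[True, True, False, False]`, matching "identical, identical, different, different" above.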

Step S1067: for second focus sequences whose comparison results for the first focus class differ, determine that a document whose second focus sequence has the comparison result "identical" for the first focus class has a higher similarity than a document whose second focus sequence has the comparison result "different" for the first focus class, wherein the priority of the first focus class is higher than the priority of the second focus class.

In an optional embodiment, again taking as an example a target document whose focus sequence is A1+B1+B2+C3 and a focus class priority of B > A > D > C, the focus sequence of document x is A1+B1+B2+D3, that of document y is B1+B2+B4+C3 and that of document z is A1+B1+B2+B4+C3. From the comparison results in the above embodiment, the comparison result of the first focus class differs between document x and document y, and also between document x and document z: the result for document x is identical, whereas the results for document y and document z are both different. The similarity of document x is therefore higher than that of document y and document z.

Step S1069: for second focus sequences whose comparison results for the first focus class are the same, determine the similarity ranking of the documents corresponding to the second focus sequences according to the comparison results of the second focus class.

In an optional embodiment, again taking as an example a target document whose focus sequence is A1+B1+B2+C3 and a focus class priority of B > A > D > C, the focus sequence of document x is A1+B1+B2+D3, that of document y is B1+B2+B4+C3 and that of document z is A1+B1+B2+B4+C3. From the comparison results in the above embodiment, document y and document z have the same comparison result for the first focus class; for the second focus class, the comparison result of document y is different and that of document z is identical. The similarity of document z is therefore higher than that of document y.
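Taken together, steps S1061 to S1069 behave like a lexicographic comparison of the per-class results in priority order. The sketch below makes that reading explicit; it is an interpretation, not the patent's stated implementation, and `rank_by_similarity` and its parameters are hypothetical names. Classes are again taken from the first character of each identifier.

```python
from typing import Dict, List, Sequence, Set, Tuple


def _by_class(sequence: Sequence[str]) -> Dict[str, Set[str]]:
    grouped: Dict[str, Set[str]] = {}
    for identifier in sequence:
        grouped.setdefault(identifier[0], set()).add(identifier)
    return grouped


def rank_by_similarity(
    library: Dict[str, Sequence[str]],
    target: Sequence[str],
    priority: Sequence[str],
) -> List[str]:
    """Return document ids ordered from most to least similar to the target.

    Each document gets a tuple of booleans, one per focus class in priority
    order. Python compares tuples lexicographically and True > False, so a
    match in a higher-priority class outranks any lower-priority matches,
    and a fully identical sequence (all True) sorts first.
    """
    targ = _by_class(target)

    def key(doc_id: str) -> Tuple[bool, ...]:
        cand = _by_class(library[doc_id])
        return tuple(cand.get(c, set()) == targ.get(c, set()) for c in priority)

    # sorted() is stable, so documents with equal keys keep library order.
    return sorted(library, key=key, reverse=True)
```

For the example above, the keys are x: (True, True, False, False), y: (False, False, True, True) and z: (False, True, True, True), so the ranking is x, then z, then y, as the text concludes.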

Optionally, according to the above embodiment of the present application, before the focus document library is obtained, the method further includes: determining the number of documents to return. After the similarity ranking of each document in the document library relative to the target document has been determined according to the focus sequence of the target document and each focus sequence in the focus document library, the method further includes: screening out, from high to low according to the similarity ranking of the documents in the document library, the documents corresponding to the number of documents to return.

In an optional embodiment, taking as an example a task whose goal is to find the 5 documents in the document library that are most similar to the target document, the number of documents to return is determined to be 5 before the similarity ranking is carried out. After the similarity ranking of the focus sequences in the focus document library relative to the target document has been obtained, the first 5 focus sequences are taken from high to low, and the documents corresponding to those 5 focus sequences are returned.

As can be seen from the above, the number of documents returned can be set as required, so that the number of documents obtained matches the requirement and there is no need to screen manually after the ranking results of all documents have been returned.
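Returning only the requested number of documents is a truncation of the ranked list. A minimal sketch, assuming the ranking is already available as an iterable of document ids ordered from most to least similar; `top_documents` is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, List


def top_documents(ranked_doc_ids: Iterable[str], count: int) -> List[str]:
    """Return at most `count` document ids from the front of the ranking."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return list(islice(ranked_doc_ids, count))
```

If the library holds fewer documents than requested, all of them are returned.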

Embodiment 2

According to an embodiment of the present invention, an embodiment of an apparatus for processing document similarity is provided. Fig. 2 is a schematic diagram of the apparatus for processing document similarity according to an embodiment of the present invention. As shown in Fig. 2, the apparatus includes:

A first obtaining module 20, configured to obtain a focus document library and a target document, wherein the focus document library includes the focus sequence corresponding to each document in a document library, and a focus sequence includes identifiers that characterize the focuses in the corresponding document.

A second obtaining module 22, configured to obtain the focus sequence of the target document.

In the above apparatus, the target document may be input to a preset focus rule engine, which then outputs the focus sequence of the document. In an optional embodiment, the focus rule engine may learn from documents whose focuses have already been determined, obtaining the attribute information of the focuses and the positions at which the focuses appear in the documents, so that it can find the focuses in a new document on the basis of what it has learned.

A first determining module 24, configured to determine, according to the focus sequence of the target document and each focus sequence in the focus document library, a similarity ranking of each document in the document library relative to the target document.

As can be seen from the above, in the above embodiment of the present application the first obtaining module obtains the focus document library and the target document, the second obtaining module obtains the focus sequence of the target document, and the first determining module determines the similarity ranking of each document in the document library relative to the target document according to the focus sequence of the target document and each focus sequence in the focus document library. By obtaining the focus sequence of the target document and the focus sequences of all documents in the document library, the above scheme obtains the similarity ranking of the documents in the document library relative to the target document, and thereby the documents in the document library that are most similar to the target document. This solves the problem in the prior art that manually searching for documents similar to a target document leads to low accuracy.

Optionally, according to the above embodiment of the present application, the first obtaining module includes:

An extraction submodule, configured to extract the focuses of each document in the document library to obtain the focus sequence corresponding to each document.

A construction submodule, configured to form the focus document library from the focus sequences of the documents.

Optionally, according to the above embodiment of the present application, the first determining module includes:

A first determining submodule, configured to determine, if a first focus sequence in the focus document library is identical to the focus sequence of the target document, that the document corresponding to the first focus sequence has the highest similarity to the target document.

A second determining submodule, configured to determine, if second focus sequences in the focus document library are not identical to the focus sequence of the target document, the similarity ranking of the documents corresponding to the second focus sequences relative to the target document by comparing the focuses in each second focus sequence with the focuses in the target document.

Optionally, according to the above embodiment of the present application, a focus includes a focus class and a focus feature corresponding to the focus class, each focus class also corresponds to a preset priority, and the second determining submodule includes:

A comparing unit, configured to compare, in order of focus class priority, the focus features of each focus class in the second focus sequence with the focus features of the same focus class in the target document, obtaining a comparison result for each focus class.

A first determining unit, configured to determine, for second focus sequences whose comparison results for the first focus class differ, that a document whose second focus sequence has the comparison result "identical" for the first focus class has a higher similarity than a document whose second focus sequence has the comparison result "different" for the first focus class.

A second determining unit, configured to determine, for second focus sequences whose comparison results for the first focus class are the same, the similarity ranking of the documents corresponding to the second focus sequences according to the comparison results of the second focus class, wherein the priority of the first focus class is higher than the priority of the second focus class.

Optionally, according to the above embodiment of the present application, the apparatus further includes:

A second determining module, configured to determine the number of documents to return before the focus document library is obtained.

A screening module, configured to screen out, after the similarity ranking of each document in the document library relative to the target document has been determined according to the focus sequence of the target document and each focus sequence in the focus document library, the documents corresponding to the number of documents to return, from high to low according to the similarity ranking of the documents in the document library.

The apparatus for processing document similarity includes a processor and a memory. The first obtaining module, the second obtaining module, the first determining module and so on are stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.

The processor includes a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the technical problem in the prior art that manually searching for documents similar to a target document leads to low accuracy is solved by adjusting the kernel parameters.

The memory may take the form of volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.

An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the method for processing document similarity when executed by a processor.

An embodiment of the present invention provides a processor configured to run a program, wherein the program performs the method for processing document similarity when it runs.

An embodiment of the present invention provides a device comprising a processor, a memory and a program that is stored in the memory and can run on the processor, wherein the processor implements the following steps when executing the program: (the steps of the method claims, independent and dependent). The device here may be a server, a PC, a PAD, a mobile phone or the like.

The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: obtaining a focus document library and a target document, wherein the focus document library includes the focus sequence corresponding to each document in a document library, and a focus sequence includes identifiers that characterize the focuses in the corresponding document; obtaining the focus sequence of the target document; and determining, according to the focus sequence of the target document and each focus sequence in the focus document library, a similarity ranking of each document in the document library relative to the target document.

When executed on a data processing device, the computer program is further adapted to execute a program initialized with the following method steps: extracting the focuses of each document in the document library to obtain the focus sequence corresponding to each document; and forming the focus document library from the focus sequences of the documents.

When executed on a data processing device, the computer program is further adapted to execute a program initialized with the following method steps: if a first focus sequence in the focus document library is identical to the focus sequence of the target document, determining that the document corresponding to the first focus sequence has the highest similarity to the target document; and if second focus sequences in the focus document library are not identical to the focus sequence of the target document, determining the similarity ranking of the documents corresponding to the second focus sequences relative to the target document by comparing the focuses in each second focus sequence with the focuses in the target document.

When executed on a data processing device, the computer program is further adapted to execute a program initialized with the following method steps: comparing, in order of focus class priority, the focus features of each focus class in the second focus sequence with the focus features of the same focus class in the target document, obtaining a comparison result for each focus class; for second focus sequences whose comparison results for the first focus class differ, determining that a document whose second focus sequence has the comparison result "identical" for the first focus class has a higher similarity than a document whose second focus sequence has the comparison result "different" for the first focus class; and for second focus sequences whose comparison results for the first focus class are the same, determining the similarity ranking of the documents corresponding to the second focus sequences according to the comparison results of the second focus class, wherein the priority of the first focus class is higher than the priority of the second focus class.

When executed on a data processing device, the computer program is further adapted to execute a program initialized with the following method steps: before the focus document library is obtained, determining the number of documents to return; and after the similarity ranking of each document in the document library relative to the target document has been determined according to the focus sequence of the target document and each focus sequence in the focus document library, screening out, from high to low according to the similarity ranking of the documents in the document library, the documents corresponding to the number of documents to return.

The serial number of the above embodiments of the invention is only for description, does not represent the advantages or disadvantages of the embodiments.

In the above embodiments of the invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between units or modules through some interfaces, and may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for processing document similarity, characterized by comprising:

Obtaining a focus document library and a target document, wherein the focus document library includes the focus sequences corresponding to the documents in a document library, and each focus sequence includes identifiers characterizing the focuses in the corresponding document;

Obtaining the focus sequence of the target document;

Determining, according to the focus sequence of the target document and each focus sequence in the focus document library, the similarity ranking of each document in the document library relative to the target document.

2. The method according to claim 1, characterized in that obtaining the focus document library comprises:

Extracting the focuses of each document in the document library to obtain the focus sequence corresponding to each document;

Constructing the focus document library from the focus sequences of the documents.

3. The method according to claim 1, characterized in that determining, according to the focus sequence of the target document and each focus sequence in the focus document library, the similarity ranking of each document in the document library relative to the target document comprises:

If a first focus sequence in the focus document library is identical to the focus sequence of the target document, determining that the document corresponding to the first focus sequence has the highest similarity to the target document;

If a second focus sequence in the focus document library is not identical to the focus sequence of the target document, comparing the focuses in each second focus sequence with the focuses in the target document to determine the similarity ranking of the documents corresponding to the second focus sequences relative to the target document.

4. The method according to claim 3, characterized in that the focus comprises a focus class and a focal feature corresponding to the focus class, the focus class further corresponds to a preset priority, and comparing the focuses in each second focus sequence with the focuses in the target document to determine the similarity ranking of the documents corresponding to the second focus sequences relative to the target document comprises:

Comparing, successively in order of the priority of the focus classes, the focal feature corresponding to each focus class in the second focus sequence with the focal feature corresponding to that focus class in the target document, to obtain a comparison result for each focus class;

For second focus sequences whose comparison results for a first focus class differ, determining that the similarity of the documents corresponding to the second focus sequences whose comparison result for the first focus class is identical is higher than the similarity of the documents corresponding to the second focus sequences whose comparison result for the first focus class is different;

For second focus sequences whose comparison results for the first focus class are the same, determining the similarity ranking of the documents corresponding to the second focus sequences according to the comparison result for a second focus class, wherein the priority corresponding to the first focus class is higher than the priority corresponding to the second focus class.

5. The method according to any one of claims 1 to 4, characterized in that:

Before obtaining the focus document library, the method further comprises: determining the number of documents to return;

After determining, according to the focus sequence of the target document and each focus sequence in the focus document library, the similarity ranking of each document in the document library relative to the target document, the method further comprises: selecting, from highest to lowest similarity according to the similarity ranking of the documents in the document library, the documents corresponding to the number of documents to return.

6. An apparatus for processing document similarity, characterized by comprising:

A first obtaining module, configured to obtain a focus document library and a target document, wherein the focus document library includes the focus sequences corresponding to the documents in a document library, and each focus sequence includes identifiers characterizing the focuses in the corresponding document;

A second obtaining module, configured to obtain the focus sequence of the target document;

A first determining module, configured to determine, according to the focus sequence of the target document and each focus sequence in the focus document library, the similarity ranking of each document in the document library relative to the target document.

7. The apparatus according to claim 6, characterized in that the first obtaining module comprises:

An extracting submodule, configured to extract the focuses of each document in the document library to obtain the focus sequence corresponding to each document;

A constructing submodule, configured to construct the focus document library from the focus sequences of the documents.

8. The apparatus according to claim 6, characterized in that the first determining module comprises:

A first determining submodule, configured to determine, if a first focus sequence in the focus document library is identical to the focus sequence of the target document, that the document corresponding to the first focus sequence has the highest similarity to the target document;

A second determining submodule, configured to compare, if a second focus sequence in the focus document library is not identical to the focus sequence of the target document, the focuses in each second focus sequence with the focuses in the target document, to determine the similarity ranking of the documents corresponding to the second focus sequences relative to the target document.

9. The apparatus according to claim 8, characterized in that the focus comprises a focus class and a focal feature corresponding to the focus class, the focus class further corresponds to a preset priority, and the second determining submodule comprises:

A comparing unit, configured to compare, successively in order of the priority of the focus classes, the focal feature corresponding to each focus class in the second focus sequence with the focal feature corresponding to that focus class in the target document, to obtain a comparison result for each focus class;

A first determining unit, configured to determine, for second focus sequences whose comparison results for a first focus class differ, that the similarity of the documents corresponding to the second focus sequences whose comparison result for the first focus class is identical is higher than the similarity of the documents corresponding to the second focus sequences whose comparison result for the first focus class is different;

A second determining unit, configured to determine, for second focus sequences whose comparison results for the first focus class are the same, the similarity ranking of the documents corresponding to the second focus sequences according to the comparison result for a second focus class, wherein the priority corresponding to the first focus class is higher than the priority corresponding to the second focus class.

10. The apparatus according to any one of claims 6 to 9, characterized in that the apparatus further comprises:

A second determining module, configured to determine the number of documents to return before the focus document library is obtained;

A screening module, configured to select, after the similarity ranking of each document in the document library relative to the target document has been determined according to the focus sequence of the target document and each focus sequence in the focus document library, the documents corresponding to the number of documents to return, from highest to lowest similarity according to the similarity ranking of the documents in the document library.