CN109033041A - Method and apparatus for processing document similarity
- Publication number: CN109033041A (CN201710432891.1A)
- Authority: CN (China)
- Prior art keywords: focus, document, sequence, class, library
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
Abstract
The invention discloses a method and apparatus for processing document similarity. The method comprises: obtaining a focus document library and a target document, wherein the focus document library comprises the focus sequences corresponding to the documents in a document library, and each focus sequence comprises identifiers characterizing the focuses in the corresponding document; obtaining the focus sequence of the target document; and determining, according to the focus sequence of the target document and each focus sequence in the focus document library, a similarity ranking of each document in the document library relative to the target document. The invention solves the technical problem in the prior art that manually searching for documents similar to a target document results in low accuracy.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a method and apparatus for processing document similarity.
Background art
There are currently two main approaches to identifying similar documents. Method one: documents are read and labeled manually, and documents containing similar labels are found by querying the document library. Method two: documents are vectorized (based on word frequency), and the distance between the document vectors is then computed.
However, method one relies purely on manual effort: the workload is heavy and time-consuming, and it places high demands on business personnel. Because professional knowledge and experience differ, the labels that different people assign to the same document may differ considerably. Preparing a batch of labels in advance for business personnel to choose from increases labor cost and still cannot fully resolve the differences in experience. Method two identifies similar documents through document vectorization, but a word-frequency-based method largely fails to capture the key points of a document, so its accuracy is not high.
No effective solution has yet been proposed for the problem in the prior art that manually searching for documents similar to a target document results in low accuracy.
Summary of the invention
Embodiments of the present invention provide a method and apparatus for processing document similarity, so as to at least solve the technical problem in the prior art that manually searching for documents similar to a target document results in low accuracy.
According to one aspect of the embodiments of the present invention, a method for processing document similarity is provided, comprising: obtaining a focus document library and a target document, wherein the focus document library comprises the focus sequences corresponding to the documents in a document library, and each focus sequence comprises identifiers characterizing the focuses in the corresponding document; obtaining the focus sequence of the target document; and determining, according to the focus sequence of the target document and each focus sequence in the focus document library, a similarity ranking of each document in the document library relative to the target document.
According to another aspect of the embodiments of the present invention, an apparatus for processing document similarity is further provided, comprising: a first obtaining module, configured to obtain a focus document library and a target document, wherein the focus document library comprises the focus sequences corresponding to the documents in a document library, and each focus sequence comprises identifiers characterizing the focuses in the corresponding document; a second obtaining module, configured to obtain the focus sequence of the target document; and a first determining module, configured to determine, according to the focus sequence of the target document and each focus sequence in the focus document library, a similarity ranking of each document in the document library relative to the target document.
In the embodiments of the present invention, a focus document library and a target document are obtained, the focus sequence of the target document is obtained, and a similarity ranking of each document in the document library relative to the target document is determined according to the focus sequence of the target document and each focus sequence in the focus document library. By obtaining the focus sequence of the target document and the focus sequences of all documents in the document library, the above scheme yields the similarity ranking of the documents in the document library relative to the target document and thus the documents most similar to the target document, thereby solving the problem in the prior art that manually searching for documents similar to a target document results in low accuracy.
Brief description of the drawings
The drawings described herein are provided for a further understanding of the present invention and constitute a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flowchart of a method for processing document similarity according to an embodiment of the present invention; and
Fig. 2 is a schematic diagram of an apparatus for processing document similarity according to an embodiment of the present invention.
Detailed description of the embodiments
In order that those skilled in the art may better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
The terms used in the embodiments are explained below to facilitate understanding of the embodiments.
Judgment document: a document that records the process and result of a court's hearing of a case. It is the carrier of the result of the litigation and the sole evidence by which the people's court determines and allocates the rights and obligations of the parties.
Focus: the focus of dispute. Put simply, it is the core of the dispute, the point at which the conflict comes to a head, and the issue contested by the two parties to the case. In form, it is the point of dispute summarized by the judge and confirmed by the parties. It is the main thread and pivot that guides the trial of the case and the resolution of the dispute, and it is a prominent indicator of how well the judge knows the facts of the case and how well the judge relates the law to the case.
Embodiment 1
According to an embodiment of the present invention, an embodiment of a method for processing document similarity is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
Fig. 1 is a flowchart of a method for processing document similarity according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S102: obtain a focus document library and a target document, wherein the focus document library comprises the focus sequences corresponding to the documents in a document library, and each focus sequence comprises identifiers characterizing the focuses in the corresponding document.
Specifically, the document may be a judgment document. The identifier corresponding to a focus may be obtained by matching according to the class and features of the focus. For a given document, the set formed by the identifiers of all the focuses it contains becomes the focus sequence corresponding to that document. That is, once the focus sequence of a document has been obtained, the focuses of that document can be obtained.
Step S104: obtain the focus sequence of the target document.
In the above step, the target document may be input to a preset focus rule engine, which can then output the focus sequence of the document. In an optional embodiment, the focus rule engine may learn from documents whose focuses have already been determined, obtaining the attribute information of the focuses and the position information of where the focuses appear in the documents, so that it can find the focuses in a new document based on what it has learned.
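As a rough illustration of a rule engine that maps a document to a focus sequence, the following minimal sketch uses hand-written keyword rules. This is a simplifying assumption: the description says the engine may learn attribute and position information from already-labeled documents, which this sketch does not do. The identifiers A1, B1 and D2 follow the trademark-infringement example of this embodiment, while `FOCUS_RULES`, the patterns and `extract_focus_sequence` are hypothetical names invented for illustration.

```python
import re

# Hypothetical rules: focus identifier -> regular expression.
# The patterns are invented for illustration and are not taken from the patent.
FOCUS_RULES = {
    "A1": re.compile(r"well-known trademark", re.IGNORECASE),
    "B1": re.compile(r"counterfeit|passing off", re.IGNORECASE),
    "D2": re.compile(r"compensat\w* for (?:the )?loss", re.IGNORECASE),
}


def extract_focus_sequence(text: str) -> list[str]:
    """Return the identifiers of all rules that fire on the text, sorted."""
    return sorted(fid for fid, pattern in FOCUS_RULES.items() if pattern.search(text))


sample = ("The plaintiff claims the defendant sold counterfeit goods bearing "
          "its well-known trademark and seeks compensation for losses.")
print("+".join(extract_focus_sequence(sample)))  # A1+B1+D2
```

A learned engine would replace the fixed patterns with rules derived from labeled documents; the interface (text in, focus sequence out) could stay the same.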
Step S106: determine, according to the focus sequence of the target document and each focus sequence in the focus document library, a similarity ranking of each document in the document library relative to the target document.
As can be seen from the above, the above embodiment of the present application obtains a focus document library and a target document, obtains the focus sequence of the target document, and determines a similarity ranking of each document in the document library relative to the target document according to the focus sequence of the target document and each focus sequence in the focus document library. By obtaining the focus sequence of the target document and the focus sequences of all documents in the document library, the above scheme yields the similarity ranking of the documents in the document library relative to the target document and thus the documents most similar to the target document, thereby solving the problem in the prior art that manually searching for documents similar to a target document results in low accuracy.
Optionally, according to the above embodiment of the present application, obtaining the focus document library comprises:
Step S1021: extract the focuses of each document in the document library to obtain the focus sequence corresponding to each document.
In an optional embodiment, taking the field of trademark infringement as an example, the focuses are divided into five major focus classes: A: trademark type; B: type of conduct; C: grounds of defense; D: tort liability; E: other focuses.
Focus class A can be further divided according to features into the following subclasses: A1 well-known trademark, A2 goods trademark, A3 service trademark, A4 certification trademark, A4 collective trademark, A5 stereoscopic trademark, A6 sound trademark, A7 three-dimensional symbol trademark and A8 geographical indication;
Focus class B can be further divided according to features into the following subclasses: B1 trademark counterfeiting/imitation, B2 selling goods that infringe the exclusive right to use a registered trademark, B3 contributory trademark infringement, B4 infringement of another party's registered trademark by an enterprise trade name, B5 infringement of another party's registered trademark by a domain name, B6 infringement of a well-known trademark and B7 other infringing acts;
Focus class C can be further divided according to features into the following subclasses: C1 prior rights and fair use, C2 whether there is a legitimate source/legitimate channel, C3 the registered trademark has not actually been used and C4 trademark co-owners/jointly owned trademark;
Focus class D can be further divided according to features into the following subclasses: D1 cessation of infringement/cessation of sale/destruction, D2 compensation for losses and D3 elimination of adverse effects;
Focus class E can be further divided according to features into the following subclasses: E1 whether the party is a qualified subject, E2 whether the limitation period has expired, E3 whether trademark use is constituted, E4 already subject to administrative penalty/criminal judgment and E5 whether unfair competition is constituted.
Thus, in this embodiment, the focuses can be classified, and the identifier corresponding to the class to which a focus belongs is the identifier of that focus.
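The relationship between a focus identifier and its focus class can be sketched as follows. The sketch assumes, for illustration only, that an identifier is a single class letter followed by a subclass number (as in A1 or B2) and that a focus sequence is written with "+" separators; `group_by_class` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_class(sequence: str) -> dict[str, set[str]]:
    """Group the identifiers of a focus sequence by their focus class letter."""
    groups: dict[str, set[str]] = defaultdict(set)
    for focus_id in sequence.split("+"):
        # Assumption: the first character of an identifier is its focus class.
        groups[focus_id[0]].add(focus_id)
    return dict(groups)


print(group_by_class("A1+B1+B2+C3+D2"))
```

Grouping by class in this way is what later allows the sequences to be compared one focus class at a time.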
Step S1023: construct the focus document library from the focus sequences of the documents.
Specifically, the focus document library may store not only the focus sequences but also the correspondence between focus sequences and documents, so that the corresponding document can be found from a focus sequence.
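One possible way to hold both the focus sequences and their correspondence to documents is an index keyed by focus sequence, sketched below. The use of a sorted tuple as a canonical key, the document names and the function `build_focus_library` are assumptions made for illustration; the description does not prescribe a storage format.

```python
from collections import defaultdict


def build_focus_library(doc_focuses: dict[str, list[str]]) -> dict[tuple[str, ...], list[str]]:
    """Map each canonical focus sequence to the documents that have it."""
    library: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for doc_id, focuses in doc_focuses.items():
        key = tuple(sorted(focuses))  # assumed canonical order of identifiers
        library[key].append(doc_id)
    return dict(library)


library = build_focus_library({
    "doc1": ["B1", "A1"],
    "doc2": ["A1", "B1"],
    "doc3": ["C3"],
})
print(library[("A1", "B1")])  # ['doc1', 'doc2']
```

Because only the short focus sequences are indexed, later comparisons do not need to touch the full text of any document.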
As can be seen from the above, the above scheme of the present application extracts the focuses of each document in the document library, obtains the focus sequence corresponding to each document, and constructs the focus document library from the focus sequences of the documents. Because the focus document library is determined from the focuses of each document in the document library, the full text of the documents need not be compared during comparison, which reduces the comparison time and improves the efficiency of document comparison.
Optionally, according to the above embodiment of the present application, determining the similarity ranking of each document in the document library relative to the target document according to the focus sequence of the target document and each focus sequence in the focus document library comprises:
Step S1061: if a first focus sequence in the focus document library is identical to the focus sequence of the target document, determine that the document corresponding to the first focus sequence has the highest similarity to the target document.
In an optional embodiment, taking the case where the focus sequence of the target document is A1+B1+B2+C3+D2 as an example, if the focus sequence of some document is also A1+B1+B2+C3+D2, that document is determined to have the highest similarity to the target document.
It should be noted here that, when a focus sequence is formed from the focuses of a document, the focuses are generally arranged according to the preset priority of the focus classes. For example, in A1+B1+B2+C3+D2 the priorities are A > B > C > D and B1 > B2 > B3; in this case the focus sequences are compared element by element in order. However, the focuses in a focus sequence may also be arranged in other ways, for example in the order in which the focuses appear in the document. In that case, if the focus sequence of the target document is A1+B1+B2+C3+D2, a document is determined to be identical to the target document as long as it contains A1, B1, B2, C3 and D2 and contains only these focuses, regardless of their order.
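The order-independent identity check described above can be sketched as a set comparison. The "+" separated string format and the function name `same_focuses` are assumptions for illustration.

```python
def same_focuses(seq_a: str, seq_b: str) -> bool:
    """True if both sequences contain exactly the same focus identifiers."""
    return set(seq_a.split("+")) == set(seq_b.split("+"))


target = "A1+B1+B2+C3+D2"
print(same_focuses(target, "B1+A1+D2+C3+B2"))  # True: same focuses, other order
print(same_focuses(target, "A1+B1+B2+C3"))     # False: D2 is missing
```

When the sequences are already stored in a fixed priority order, a plain string or tuple equality test would serve the same purpose.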
Step S1063: if the second focus sequences in the focus document library are not identical to the focus sequence of the target document, determine the similarity ranking of the documents corresponding to the second focus sequences relative to the target document by comparing the focuses in each second focus sequence with the focuses in the target document.
Optionally, according to the above embodiment of the present application, a focus comprises a focus class and focus features, and each focus class corresponds to a preset priority. Determining the similarity ranking of the documents corresponding to the second focus sequences relative to the target document by comparing the focuses in each second focus sequence with the focuses in the target document comprises:
Step S1065: according to the priority of the focus classes, successively compare the focus features corresponding to each focus class in the second focus sequence with the focus features corresponding to that focus class in the target document, to obtain a comparison result for each focus class.
In an optional embodiment, the focus sequence of the target document is A1+B1+B2+C3, the priority of the focus classes is B > A > D > C, the focus sequence of document x is A1+B1+B2+D3, the focus sequence of document y is B1+B2+B4+C3, and the focus sequence of document z is A1+B1+B2+B4+C3. Comparison shows that, for the first focus class, the comparison results of documents x, y and z against the target document are identical, different and different, respectively; for the second focus class they are identical, different and identical; for the third focus class they are different, identical and identical; and for the fourth focus class they are different, identical and identical.
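The class-by-class comparison in this example can be reproduced with a short sketch. It assumes that the "focus features" of a class are the set of identifiers in that class, that an identifier starts with its class letter, and that two empty sets count as identical (which is what the third and fourth results for documents y and z imply). The names `PRIORITY`, `class_features` and `compare_by_class` are hypothetical.

```python
PRIORITY = ["B", "A", "D", "C"]  # focus classes, highest priority first


def class_features(sequence: str, focus_class: str) -> set[str]:
    """Identifiers in the sequence that belong to the given focus class."""
    return {f for f in sequence.split("+") if f.startswith(focus_class)}


def compare_by_class(candidate: str, target: str) -> list[bool]:
    """One result per focus class in priority order: True means identical."""
    return [class_features(candidate, c) == class_features(target, c)
            for c in PRIORITY]


target = "A1+B1+B2+C3"
docs = {"x": "A1+B1+B2+D3", "y": "B1+B2+B4+C3", "z": "A1+B1+B2+B4+C3"}
for name, seq in docs.items():
    print(name, compare_by_class(seq, target))
# x [True, True, False, False]
# y [False, False, True, True]
# z [False, True, True, True]
```

Reading the printed lists column by column gives the four per-class results stated in the example.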
Step S1067: for second focus sequences whose comparison results for a first focus class differ from one another, determine that the documents corresponding to the second focus sequences whose comparison result for the first focus class is "identical" have a higher similarity than the documents corresponding to the second focus sequences whose comparison result for the first focus class is "different", wherein the priority of the first focus class is higher than the priority of a second focus class.
In an optional embodiment, again taking the case where the focus sequence of the target document is A1+B1+B2+C3 and the priority of the focus classes is B > A > D > C, the focus sequence of document x is A1+B1+B2+D3, the focus sequence of document y is B1+B2+B4+C3, and the focus sequence of document z is A1+B1+B2+B4+C3. As can be seen from the comparison results in the above embodiment, the comparison results of documents x and y for the first focus class differ from each other, as do those of documents x and z: the comparison result of document x for the first focus class is "identical", whereas the comparison results of documents y and z for the first focus class are both "different". The similarity of document x is therefore higher than that of documents y and z.
Step S1069: for second focus sequences whose comparison results for the first focus class are the same, determine the similarity ranking of the corresponding documents according to the comparison results for the second focus class.
In an optional embodiment, again taking the case where the focus sequence of the target document is A1+B1+B2+C3 and the priority of the focus classes is B > A > D > C, the focus sequence of document x is A1+B1+B2+D3, the focus sequence of document y is B1+B2+B4+C3, and the focus sequence of document z is A1+B1+B2+B4+C3. As can be seen from the comparison results in the above embodiment, documents y and z have the same comparison result for the first focus class, while the comparison result of document y for the second focus class is "different" and that of document z is "identical". The similarity of document z is therefore higher than that of document y.
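Steps S1065 to S1069 together amount to a lexicographic comparison of the per-class results in priority order, which can be sketched as a sort. The sketch rests on the same assumptions as the example (identifiers start with their class letter, two empty classes count as identical) and only considers the four classes named in the priority list; `match_vector` and `rank_by_similarity` are hypothetical names.

```python
PRIORITY = ["B", "A", "D", "C"]  # focus classes, highest priority first


def match_vector(candidate: str, target: str) -> tuple[bool, ...]:
    """One flag per focus class, in priority order: True if the class matches."""
    def features(seq: str, cls: str) -> set[str]:
        return {f for f in seq.split("+") if f.startswith(cls)}
    return tuple(features(candidate, c) == features(target, c) for c in PRIORITY)


def rank_by_similarity(library: dict[str, str], target: str) -> list[str]:
    """Sort document names from most to least similar to the target."""
    # Tuples compare element by element, so a match in a higher-priority
    # class always outranks any combination of lower-priority matches.
    return sorted(library, key=lambda name: match_vector(library[name], target),
                  reverse=True)


library = {"x": "A1+B1+B2+D3", "y": "B1+B2+B4+C3", "z": "A1+B1+B2+B4+C3"}
print(rank_by_similarity(library, "A1+B1+B2+C3"))  # ['x', 'z', 'y']
```

A document whose focus sequence is identical to the target's gets an all-True vector and therefore sorts first, which is consistent with step S1061.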
Optionally, according to the above embodiment of the present application, before the focus document library is obtained, the method further comprises: determining the number of documents to return. After the similarity ranking of each document in the document library relative to the target document is determined according to the focus sequence of the target document and each focus sequence in the focus document library, the method further comprises: selecting, in descending order of the similarity ranking of the documents in the document library, the documents corresponding to the number of documents to return.
In an optional embodiment, taking the case where the goal of the current task is to find the 5 documents in the document library most similar to the target document, the number of documents to return is determined to be 5 before the similarity ranking is performed. After the similarity ranking of the focus sequences in the focus document library relative to the focus sequence of the target document is obtained, the top 5 focus sequences are taken in descending order, and the documents corresponding to these 5 focus sequences are returned.
As can be seen from the above, with the above scheme the number of documents returned can be set as required, so that the number of documents obtained matches the demand, and there is no need to return the ranking of all documents and then screen them manually.
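Returning only the requested number of documents can be sketched with a top-n selection. The input here is assumed to be a mapping from document name to its per-class comparison results (True for "identical", listed in priority order), as in the earlier example; `top_n` is a hypothetical helper.

```python
import heapq


def top_n(match_vectors: dict[str, tuple[bool, ...]], n: int) -> list[str]:
    """Return the n most similar document names, most similar first."""
    return heapq.nlargest(n, match_vectors, key=match_vectors.__getitem__)


# Per-class results for documents x, y and z from the example above.
vectors = {
    "x": (True, True, False, False),
    "y": (False, False, True, True),
    "z": (False, True, True, True),
}
print(top_n(vectors, 2))  # ['x', 'z']
```

For a large library, selecting the top n this way avoids fully sorting every document when only a few results are needed.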
Embodiment 2
According to an embodiment of the present invention, an embodiment of an apparatus for processing document similarity is provided. Fig. 2 is a schematic diagram of an apparatus for processing document similarity according to an embodiment of the present invention. As shown in Fig. 2, the apparatus comprises:
A first obtaining module 20, configured to obtain a focus document library and a target document, wherein the focus document library comprises the focus sequences corresponding to the documents in a document library, and each focus sequence comprises identifiers characterizing the focuses in the corresponding document.
Specifically, the document may be a judgment document. The identifier corresponding to a focus may be obtained by matching according to the class and features of the focus. For a given document, the set formed by the identifiers of all the focuses it contains becomes the focus sequence corresponding to that document. That is, once the focus sequence of a document has been obtained, the focuses of that document can be obtained.
A second obtaining module 22, configured to obtain the focus sequence of the target document.
In the above apparatus, the target document may be input to a preset focus rule engine, which can then output the focus sequence of the document. In an optional embodiment, the focus rule engine may learn from documents whose focuses have already been determined, obtaining the attribute information of the focuses and the position information of where the focuses appear in the documents, so that it can find the focuses in a new document based on what it has learned.
A first determining module 24, configured to determine, according to the focus sequence of the target document and each focus sequence in the focus document library, a similarity ranking of each document in the document library relative to the target document.
As can be seen from the above, in the above embodiment of the present application the first obtaining module obtains the focus document library and the target document, the second obtaining module obtains the focus sequence of the target document, and the first determining module determines the similarity ranking of each document in the document library relative to the target document according to the focus sequence of the target document and each focus sequence in the focus document library. By obtaining the focus sequence of the target document and the focus sequences of all documents in the document library, the above scheme yields the similarity ranking of the documents in the document library relative to the target document and thus the documents most similar to the target document, thereby solving the problem in the prior art that manually searching for documents similar to a target document results in low accuracy.
Optionally, according to the above embodiment of the present application, the first obtaining module comprises:
An extracting submodule, configured to extract the focuses of each document in the document library to obtain the focus sequence corresponding to each document.
A constructing submodule, configured to construct the focus document library from the focus sequences of the documents.
Optionally, according to the above embodiment of the present application, the first determining module comprises:
A first determining submodule, configured to determine, if a first focus sequence in the focus document library is identical to the focus sequence of the target document, that the document corresponding to the first focus sequence has the highest similarity to the target document.
A second determining submodule, configured to determine, if the second focus sequences in the focus document library are not identical to the focus sequence of the target document, the similarity ranking of the documents corresponding to the second focus sequences relative to the target document by comparing the focuses in each second focus sequence with the focuses in the target document.
Optionally, according to the above embodiment of the present application, a focus comprises a focus class and the focus features corresponding to the focus class, each focus class also corresponds to a preset priority, and the second determining submodule comprises:
A comparing unit, configured to successively compare, according to the priority of the focus classes, the focus features corresponding to each focus class in the second focus sequence with the focus features corresponding to that focus class in the target document, to obtain a comparison result for each focus class.
A first determining unit, configured to determine, for second focus sequences whose comparison results for a first focus class differ from one another, that the documents corresponding to the second focus sequences whose comparison result for the first focus class is "identical" have a higher similarity than the documents corresponding to the second focus sequences whose comparison result for the first focus class is "different".
A second determining unit, configured to determine, for second focus sequences whose comparison results for the first focus class are the same, the similarity ranking of the corresponding documents according to the comparison results for a second focus class, wherein the priority of the first focus class is higher than the priority of the second focus class.
Optionally, according to the above embodiment of the present application, the apparatus further comprises:
A second determining module, configured to determine the number of documents to return before the focus document library is obtained.
A screening module, configured to select, after the similarity ranking of each document in the document library relative to the target document is determined according to the focus sequence of the target document and each focus sequence in the focus document library, the documents corresponding to the number of documents to return, in descending order of the similarity ranking of the documents in the document library.
The apparatus for processing document similarity comprises a processor and a memory. The first obtaining module, the second obtaining module, the first determining module and the other modules described above are stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the technical problem in the prior art that manually searching for documents similar to a target document results in low accuracy is solved by adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the method for processing document similarity when executed by a processor.
An embodiment of the present invention provides a processor configured to run a program, wherein the program, when run, executes the method for processing document similarity.
An embodiment of the present invention provides a device comprising a processor, a memory, and a program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the program: (the steps of the method claims, independent and dependent claims). The device herein may be a server, a PC, a PAD, a mobile phone or the like.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: obtaining a focus document library and a target document, wherein the focus document library comprises the focus sequences corresponding to the documents in a document library, and each focus sequence comprises identifiers characterizing the focuses in the corresponding document; obtaining the focus sequence of the target document; and determining, according to the focus sequence of the target document and each focus sequence in the focus document library, a similarity ranking of each document in the document library relative to the target document.
When executing on data processing equipment, above-mentioned computer program is further adapted for executing initialization there are as below methods step
Program: extract document library in each document focus, obtain the corresponding focus sequence of each document;According to the coke of each document
Point sequence constitutes focus document library.
When executing on data processing equipment, above-mentioned computer program is further adapted for executing initialization there are as below methods step
Program: if the first focus sequence in focus document library is identical as the focus sequence of target document, it is determined that first is burnt
The corresponding document of point sequence and target document similarity highest;If the second focus sequence and target text in focus document library
The focus sequence of book is not identical, then by comparing the focus in each second focus sequence and the focus in target document, determines
The sequencing of similarity of second focus sequence corresponding document and target document.
When executed on a data processing device, the above computer program is further adapted to execute a program initialized with the following method steps: comparing, in order of focus-class priority, the focus feature corresponding to each focus class in the second focus sequences with the focus feature corresponding to the same focus class in the target document, to obtain a comparison result for each focus class; for second focus sequences whose comparison results for a first focus class differ, determining that the documents whose second focus sequences match the target document on the first focus class have a higher similarity than the documents whose second focus sequences do not match; and for second focus sequences whose comparison results for the first focus class are the same, determining the similarity ranking of the corresponding documents according to the comparison results for a second focus class, wherein the priority of the first focus class is higher than the priority of the second focus class.
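The priority rule above amounts to a lexicographic comparison over per-class match results. The sketch below assumes each focus sequence can be represented as a mapping from focus class to focus feature; the class names, feature values, and function names are hypothetical.

```python
from typing import Dict, List, Tuple


def comparison_key(
    candidate: Dict[str, str],
    target: Dict[str, str],
    priority: List[str],
) -> Tuple[bool, ...]:
    """One match flag per focus class, highest-priority class first.

    Assumption: a class missing from both sequences counts as a match.
    """
    return tuple(candidate.get(cls) == target.get(cls) for cls in priority)


def rank_by_priority(
    target: Dict[str, str],
    library: Dict[str, Dict[str, str]],
    priority: List[str],
) -> List[str]:
    """Sort document ids so that matches on higher-priority classes come first."""
    # Tuples compare element by element, so a match on the first class
    # outranks any combination of matches on lower-priority classes.
    return sorted(
        library,
        key=lambda d: comparison_key(library[d], target, priority),
        reverse=True,
    )


priority = ["liability", "damages", "costs"]  # hypothetical classes, highest first
target = {"liability": "contested", "damages": "claimed", "costs": "claimed"}
library = {
    "d1": {"liability": "admitted", "damages": "claimed", "costs": "claimed"},
    "d2": {"liability": "contested", "damages": "none", "costs": "claimed"},
    "d3": {"liability": "contested", "damages": "claimed", "costs": "none"},
}
ranking = rank_by_priority(target, library, priority)
```

Both `d2` and `d3` match the target on the first class and so outrank `d1`; the second class then places `d3` ahead of `d2`.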
When executed on a data processing device, the above computer program is further adapted to execute a program initialized with the following method steps: before obtaining the focus document library, the method further includes determining the number of documents to return; and after determining, according to the focus sequence of the target document and each focus sequence in the focus document library, the similarity ranking between each document in the document library and the target document, the method further includes selecting, in descending order of similarity, the documents corresponding to the number of documents to return.
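The final screening step reduces to truncating the ranked list. A minimal sketch, assuming the ranking is already a list of document ids ordered from most to least similar; `top_documents` and `return_count` are names chosen for this example.

```python
from typing import List, Sequence


def top_documents(ranked_ids: Sequence[str], return_count: int) -> List[str]:
    """Keep the first `return_count` ids of a ranking sorted by descending similarity."""
    if return_count < 0:
        raise ValueError("return_count must be non-negative")
    # Slicing past the end is safe: fewer documents than requested returns them all.
    return list(ranked_ids[:return_count])


ranked = ["d3", "d2", "d1"]
selected = top_documents(ranked, 2)
```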
The serial numbers of the above embodiments of the invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the invention, the description of each embodiment has its own emphasis; for any part not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A method for processing document similarity, characterized by comprising:
obtaining a focus document library and a target document, wherein the focus document library includes the focus sequences corresponding to the documents in a document library, and each focus sequence includes identifiers characterizing the focuses in the corresponding document;
obtaining the focus sequence of the target document; and
determining, according to the focus sequence of the target document and each focus sequence in the focus document library, the similarity ranking between each document in the document library and the target document.
2. The method according to claim 1, characterized in that obtaining the focus document library comprises:
extracting the focuses of each document in the document library to obtain the focus sequence corresponding to each document; and
constructing the focus document library from the focus sequences of the documents.
3. The method according to claim 1, characterized in that determining, according to the focus sequence of the target document and each focus sequence in the focus document library, the similarity ranking between each document in the document library and the target document comprises:
if a first focus sequence in the focus document library is identical to the focus sequence of the target document, determining that the document corresponding to the first focus sequence has the highest similarity to the target document; and
if second focus sequences in the focus document library are not identical to the focus sequence of the target document, determining the similarity ranking between the documents corresponding to the second focus sequences and the target document by comparing the focuses in each second focus sequence with the focuses in the target document.
4. The method according to claim 3, characterized in that each focus includes a focus class and the focus feature corresponding to the focus class, each focus class also corresponds to a preset priority, and determining the similarity ranking between the documents corresponding to the second focus sequences and the target document by comparing the focuses in each second focus sequence with the focuses in the target document comprises:
comparing, in order of focus-class priority, the focus feature corresponding to each focus class in the second focus sequences with the focus feature corresponding to the same focus class in the target document, to obtain a comparison result for each focus class;
for second focus sequences whose comparison results for a first focus class differ, determining that the documents whose second focus sequences match the target document on the first focus class have a higher similarity than the documents whose second focus sequences do not match; and
for second focus sequences whose comparison results for the first focus class are the same, determining the similarity ranking of the documents corresponding to the second focus sequences according to the comparison results for a second focus class, wherein the priority of the first focus class is higher than the priority of the second focus class.
5. The method according to any one of claims 1 to 4, characterized in that:
before obtaining the focus document library, the method further comprises: determining the number of documents to return; and
after determining, according to the focus sequence of the target document and each focus sequence in the focus document library, the similarity ranking between each document in the document library and the target document, the method further comprises: selecting, in descending order of the similarity ranking of the documents in the document library, the documents corresponding to the number of documents to return.
6. An apparatus for processing document similarity, characterized by comprising:
a first obtaining module, configured to obtain a focus document library and a target document, wherein the focus document library includes the focus sequences corresponding to the documents in a document library, and each focus sequence includes identifiers characterizing the focuses in the corresponding document;
a second obtaining module, configured to obtain the focus sequence of the target document; and
a first determining module, configured to determine, according to the focus sequence of the target document and each focus sequence in the focus document library, the similarity ranking between each document in the document library and the target document.
7. The apparatus according to claim 6, characterized in that the first obtaining module comprises:
an extraction submodule, configured to extract the focuses of each document in the document library to obtain the focus sequence corresponding to each document; and
a construction submodule, configured to construct the focus document library from the focus sequences of the documents.
8. The apparatus according to claim 6, characterized in that the first determining module comprises:
a first determining submodule, configured to determine, if a first focus sequence in the focus document library is identical to the focus sequence of the target document, that the document corresponding to the first focus sequence has the highest similarity to the target document; and
a second determining submodule, configured to determine, if second focus sequences in the focus document library are not identical to the focus sequence of the target document, the similarity ranking between the documents corresponding to the second focus sequences and the target document by comparing the focuses in each second focus sequence with the focuses in the target document.
9. The apparatus according to claim 8, characterized in that each focus includes a focus class and the focus feature corresponding to the focus class, each focus class also corresponds to a preset priority, and the second determining submodule comprises:
a comparing unit, configured to compare, in order of focus-class priority, the focus feature corresponding to each focus class in the second focus sequences with the focus feature corresponding to the same focus class in the target document, to obtain a comparison result for each focus class;
a first determining unit, configured to determine, for second focus sequences whose comparison results for a first focus class differ, that the documents whose second focus sequences match the target document on the first focus class have a higher similarity than the documents whose second focus sequences do not match; and
a second determining unit, configured to determine, for second focus sequences whose comparison results for the first focus class are the same, the similarity ranking of the documents corresponding to the second focus sequences according to the comparison results for a second focus class, wherein the priority of the first focus class is higher than the priority of the second focus class.
10. The apparatus according to any one of claims 6 to 9, characterized in that the apparatus further comprises:
a second determining module, configured to determine the number of documents to return before the focus document library is obtained; and
a screening module, configured to select, after the similarity ranking between each document in the document library and the target document has been determined according to the focus sequence of the target document and each focus sequence in the focus document library, the documents corresponding to the number of documents to return, in descending order of the similarity ranking of the documents in the document library.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710432891.1A CN109033041A (en) | 2017-06-09 | 2017-06-09 | The treating method and apparatus of document similarity |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109033041A true CN109033041A (en) | 2018-12-18 |
Family
ID=64629758
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710432891.1A Pending CN109033041A (en) | 2017-06-09 | 2017-06-09 | The treating method and apparatus of document similarity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033041A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109992664A (en) * | 2019-03-12 | 2019-07-09 | 平安科技(深圳)有限公司 | Mark classification method, device, computer equipment and the storage medium of central issue |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080005094A1 (en) * | 2006-07-01 | 2008-01-03 | Kevin Cunnane | Method and system for finding the focus of a document |
CN104331394A (en) * | 2014-08-29 | 2015-02-04 | 南通大学 | Text classification method based on viewpoint |
CN104572849A (en) * | 2014-12-17 | 2015-04-29 | 西安美林数据技术股份有限公司 | Automatic standardized filing method based on text semantic mining |
US20160103823A1 (en) * | 2014-10-10 | 2016-04-14 | The Trustees Of Columbia University In The City Of New York | Machine Learning Extraction of Free-Form Textual Rules and Provisions From Legal Documents |
CN105930470A (en) * | 2016-04-25 | 2016-09-07 | 安徽富驰信息技术有限公司 | File retrieval method based on feature weight analysis technology |
CN105930473A (en) * | 2016-04-25 | 2016-09-07 | 安徽富驰信息技术有限公司 | Random forest technology-based similar file retrieval method |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109992664A (en) * | 2019-03-12 | 2019-07-09 | 平安科技(深圳)有限公司 | Mark classification method, device, computer equipment and the storage medium of central issue |
CN109992664B (en) * | 2019-03-12 | 2023-04-18 | 平安科技(深圳)有限公司 | Dispute focus label classification method and device, computer equipment and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing; Applicant after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.; Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing; Applicant before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181218 |