CN105930470B

CN105930470B - A document retrieval method based on feature weight analysis technology

Info

Publication number: CN105930470B
Application number: CN201610259097.7A
Authority: CN
Inventors: 张静川; 周宇; 贾真
Original assignee: Anhui Fu Chi Information Technology Co Ltd
Current assignee: Anhui Fu Chi Information Technology Co Ltd
Priority date: 2016-04-25
Filing date: 2016-04-25
Publication date: 2019-03-26
Anticipated expiration: 2036-04-25
Also published as: CN105930470A

Abstract

The present invention relates to a document retrieval method based on feature weight analysis technology, which overcomes the inability of the prior art to retrieve effectively within a specific domain. The method comprises the following steps: organizing judgment documents hierarchically by cause of action; constructing a case feature tree by selecting, for a specified cause of action, its public features and private features and organizing them into a tree structure according to the logical relations between features; performing weight training on the case feature tree, training with a decision tree method for different targets and calculating the comprehensive weight of each case feature; acquiring retrieval information, in which the filter conditions and query conditions are entered by selecting conditions, by supplying text that contains the conditions, or by supplying an entire judgment document; calculating a case similarity matrix; and outputting the retrieval results. The invention is based on case feature trees carefully constructed under the guidance of industry characteristics and, through semantic analysis and knowledge reasoning, greatly improves retrieval accuracy and coverage.

Description

A document retrieval method based on feature weight analysis technology

Technical field

The present invention relates to the technical field of data retrieval, and specifically to a document retrieval method based on feature weight analysis technology.

Background technique

Document retrieval technology is widely used in daily life and greatly facilitates everyday information acquisition. In specialized fields such as the discussion of judicial cases, and particularly when studying difficult cases, professionals often need, in addition to their own professional knowledge and experience, to retrieve existing similar cases in order to handle the matter at hand. Existing general retrieval techniques (approaches) include general-purpose search engines, industry-specific search, and guiding cases. They have the following problems:

(1) General-purpose search engines, such as Baidu and Yahoo: they are not customized for the judicial field at all, so retrieval accuracy and coverage are very low;

(2) Industry-specific search, such as the judgment documents website and the Wusong ("no dispute") website: compared with general-purpose search engines, retrieval accuracy and coverage are noticeably better and multiple filters are allowed; however, retrieval relies mainly on keyword matching and stays at the surface, so accuracy is still low, and the filter conditions are preset and inflexible;

(3) Guiding cases: these are issued by the Supreme People's Court and are authoritative and targeted; however, they are few in number, lag seriously behind practice, and are isolated from one another, so retrieval coverage is very low. The regional adaptability of this top-down guidance model also needs to be considered.

In addition, the above retrieval techniques do not support semantic retrieval, do not allow filter and query conditions to be combined freely, cannot continue retrieval from a result, and do not provide statistics on or an intuitive presentation of the retrieval results. How to design a more specialized retrieval method has therefore become an urgent technical problem.

Summary of the invention

The purpose of the present invention is to overcome the inability of the prior art to retrieve effectively within a specific domain by providing a document retrieval method based on feature weight analysis technology.

To achieve the above purpose, the technical solution of the present invention is as follows:

A document retrieval method based on feature weight analysis technology, comprising the following steps:

Organizing the judgment documents: the judgment documents are organized hierarchically by cause of action;

Constructing a case feature tree: for a specified cause of action, its public features and private features are selected and organized into a tree structure according to the logical relations between features;

Performing weight training on the case feature tree: training is performed with a decision tree method for different targets, and the comprehensive weight of each case feature is calculated;

Acquiring the retrieval information: the filter conditions and query conditions of the retrieval information are entered by selecting conditions, by supplying text that contains the conditions, or by supplying an entire judgment document;

Calculating the case similarity matrix: effective feature trees are selected from the set of feature trees according to the filter conditions of the retrieval information; according to the query conditions of the retrieval information and using the weight tree, the pairwise similarities within the set of effective feature trees are calculated with the weighted Manhattan distance method to form a similarity matrix, and the results are normalized;

Outputting the retrieval results: similar cases are obtained from the case similarity matrix by finding the n cases most similar to the query conditions, or the cases whose similarity is greater than s; statistics are compiled on this information and visualized.

Constructing the case feature tree comprises the following steps:

Defining public features: public features are the general attributes of a case;

Defining private features: private features are the attributes specific to a case;

Organizing the public features and private features into a tree structure according to the logical relations between features, forming the case feature tree.

Calculating the case similarity matrix comprises the following steps:

Generating the matrix of pairwise case similarities from the case feature tree, the feature weight tree, and the query conditions;

Obtaining the effective cases through the filter conditions, obtaining the corresponding feature values and weights according to the query conditions, and calculating the similarity between the query conditions and each case and between each pair of cases.

Beneficial effect

Compared with the prior art, the document retrieval method based on feature weight analysis technology of the present invention is based on case feature trees carefully constructed under the guidance of industry characteristics and, through semantic analysis and knowledge reasoning, greatly improves retrieval accuracy and coverage. Taking the retrieval information as the guiding framework, filter and query conditions can be combined freely; constructing a case similarity matrix enables continued (cascade) retrieval based on a case; and statistical analysis of the retrieval results allows the relevant information to be displayed intuitively.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

Specific embodiment

To give a better understanding of the structural features of the present invention and the effects it achieves, preferred embodiments are described in detail below with reference to the drawings:

As shown in Fig. 1, the document retrieval method based on feature weight analysis technology of the present invention comprises the following steps:

Step 1: organize the judgment documents. The judgment documents are organized hierarchically by cause of action. The distinctive point of this application is that feature trees are constructed according to the industry characteristics of different fields and industries, and these characteristics differ from field to field. For convenience in explaining the technical solution, the characteristics of judicial cases are used here to illustrate the classification and design; judgment documents are therefore organized hierarchically according to their cause of action.
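The hierarchical organization in this step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `build_cause_hierarchy`, the document identifiers, the slash-separated path format, and the "custody dispute" cause are all hypothetical.

```python
def build_cause_hierarchy(documents):
    """Group document ids under nested nodes keyed by cause-of-action level.

    `documents` is an iterable of (doc_id, cause_path) pairs, where
    cause_path is a slash-separated path (an assumed encoding).
    """
    root = {"children": {}, "documents": []}
    for doc_id, cause_path in documents:
        node = root
        for level in cause_path.split("/"):
            # Create the node for this level on first use, then descend.
            node = node["children"].setdefault(
                level, {"children": {}, "documents": []}
            )
        node["documents"].append(doc_id)
    return root


# Hypothetical sample documents.
docs = [
    ("doc-001", "civil/marriage and family/divorce dispute"),
    ("doc-002", "civil/marriage and family/divorce dispute"),
    ("doc-003", "civil/marriage and family/custody dispute"),
]
hierarchy = build_cause_hierarchy(docs)
family = hierarchy["children"]["civil"]["children"]["marriage and family"]
print(sorted(family["children"]))
```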

Step 2: construct the case feature tree. For a specified cause of action, its public features and private features are selected and organized into a tree structure according to the logical relations between features. Case feature trees correspond one-to-one with causes of action. Because causes of action themselves form a hierarchy (such as civil/marriage and family/divorce dispute), mounting each feature tree on the corresponding node of the cause-of-action hierarchy organizes all feature trees into one large tree, which is convenient to maintain and browse. In this technical solution, case features are extracted from the structured database and from the text of judgment documents, which involves semantic analysis and knowledge reasoning; compared with prior-art similar-case retrieval systems, accuracy and coverage are substantially improved. The step specifically comprises:

(1) Define public features. Public features are general attributes of a case, such as the time and region of the case and information about the entities involved, and are common to different causes of action. Public features are generally recorded in the structured database of the court business system and can be obtained directly.

(2) Define private features. Private features are attributes specific to a case, such as the reason for divorce, information about children, and community property in a divorce dispute case, and are particular to each cause of action. Private features are generally recorded in the text of judgment documents. In general, the private features of a case include the key adjudication points of guiding cases and other points of contention, and serve as the points of comparison for case similarity.

(3) Organize the public features and private features into a tree structure according to the logical relations between features, forming the case feature tree.
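One possible in-memory shape for such a feature tree is sketched below. The class `FeatureNode`, the helper `build_case_tree`, the two-branch "public"/"private" layout, and the sample feature names and values are assumptions made for illustration; the source does not prescribe a concrete data structure.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional, Tuple


@dataclass
class FeatureNode:
    name: str
    value: Optional[Any] = None  # only leaf nodes carry a feature value
    children: List["FeatureNode"] = field(default_factory=list)

    def leaves(self, prefix: str = "") -> Iterator[Tuple[str, Any]]:
        """Yield (path, value) for every leaf below this node."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        if not self.children:
            yield path, self.value
        for child in self.children:
            yield from child.leaves(path)


def build_case_tree(cause: str, public: Dict[str, Any],
                    private: Dict[str, Any]) -> FeatureNode:
    """Hang public and private features under one root per cause of action."""
    return FeatureNode(cause, children=[
        FeatureNode("public", children=[FeatureNode(k, v) for k, v in public.items()]),
        FeatureNode("private", children=[FeatureNode(k, v) for k, v in private.items()]),
    ])


# Hypothetical feature values for a single divorce dispute case.
tree = build_case_tree(
    "divorce dispute",
    public={"year": 2015, "region": "Anhui"},
    private={"reason for divorce": "separation", "children": 1},
)
flat = dict(tree.leaves())
```

Flattening the tree into path/value pairs, as `leaves` does, is one convenient way to line up features of two cases for later comparison.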

Step 3: perform weight training on the case feature tree. Case feature weight values are calculated on the basis of domain knowledge and information-theoretic principles; training is performed with a decision tree method for different targets, and the comprehensive weight of each case feature is calculated.
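The source does not state which decision tree algorithm is used or how per-target results are combined. As a hedged sketch, the example below uses information gain (the split criterion of ID3-style decision trees) as the per-target importance of a feature and averages the normalized gains over all targets to obtain a comprehensive weight. The averaging rule, the function names, and the sample rows are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy reduction of `target` obtained by splitting on `feature`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def comprehensive_weights(rows, features, targets):
    """Average the per-target normalized gains (an assumed combination rule)."""
    totals = {f: 0.0 for f in features}
    for t in targets:
        gains = {f: information_gain(rows, f, t) for f in features}
        norm = sum(gains.values()) or 1.0  # avoid division by zero
        for f in features:
            totals[f] += gains[f] / norm / len(targets)
    return totals


# Hypothetical training rows: two features and two training targets.
rows = [
    {"fault": "yes", "has_children": "yes", "granted": "yes", "support": "yes"},
    {"fault": "yes", "has_children": "no",  "granted": "yes", "support": "no"},
    {"fault": "no",  "has_children": "yes", "granted": "no",  "support": "yes"},
    {"fault": "no",  "has_children": "no",  "granted": "no",  "support": "no"},
]
weights = comprehensive_weights(rows, ["fault", "has_children"],
                                ["granted", "support"])
```

In this toy data each feature fully determines one target and says nothing about the other, so both features end up with the same comprehensive weight.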

A case feature weight tree is a data structure that describes the relative weights between case features. Unlike existing similar-case retrieval systems, the information in the retrieval conditions carries weights, which are used to calculate the similarity between the retrieval conditions and a case and between one case and another. Introducing information weights makes the following possible:

(1) when the retrieval conditions cannot all be met, cases that meet the higher-weighted conditions are ranked higher;

(2) when the retrieval conditions can all be met, cases can be ranked by the weighted values of their other features.

Case feature weights can be determined in several ways, for example on the basis of domain knowledge or of information-theoretic principles. Because this solution organizes case features into a tree structure, the corresponding feature weights also form a tree structure and satisfy certain constraints, for example that the weight of a parent node equals the sum of the weights of its child nodes.
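The parent-equals-sum-of-children constraint can be checked recursively. The sketch below assumes a weight tree stored as nested dictionaries with `weight` and `children` keys; that representation, the function name `check_weight_tree`, and the sample weights are hypothetical.

```python
def check_weight_tree(node, tol=1e-9):
    """Return True if every parent weight equals the sum of its children."""
    children = node.get("children", [])
    if not children:
        return True  # a leaf has nothing to check
    if abs(node["weight"] - sum(c["weight"] for c in children)) > tol:
        return False
    return all(check_weight_tree(c, tol) for c in children)


# Hypothetical weight tree for one cause of action.
valid_tree = {
    "weight": 1.0,
    "children": [
        {"weight": 0.6, "children": [{"weight": 0.4}, {"weight": 0.2}]},
        {"weight": 0.4},
    ],
}
broken_tree = {"weight": 1.0, "children": [{"weight": 0.6}, {"weight": 0.3}]}

print(check_weight_tree(valid_tree), check_weight_tree(broken_tree))
```

A tolerance is used because sums of floating-point weights rarely match exactly.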

Step 4: acquire the retrieval information. The filter conditions and query conditions of the retrieval information are entered by selecting conditions, by supplying text that contains the conditions, or by supplying an entire judgment document.

The filter conditions act as a filter: they restrict the case time, region, and so on, are usually public features of a case, and do not take part in the case similarity calculation. The query conditions act as the query proper: they specify the retrieval dimensions, are usually private features of a case, and form the dimensions of the case similarity calculation. The fundamental difference between the two is that filter conditions must be satisfied, whereas query conditions need not be. Dividing the user's retrieval conditions into filter and query conditions helps improve the controllability and flexibility of the retrieval system.
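The split between mandatory filter conditions and optional query conditions might be represented as below. The class `RetrievalInfo`, the function `effective_cases`, the flat dictionary encoding of a case, and the sample values are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RetrievalInfo:
    # Filter conditions: must all be satisfied; excluded from similarity.
    filters: Dict[str, Any] = field(default_factory=dict)
    # Query conditions: need not be satisfied; they form the similarity dimensions.
    query: Dict[str, Any] = field(default_factory=dict)


def effective_cases(cases: List[Dict[str, Any]],
                    info: RetrievalInfo) -> List[Dict[str, Any]]:
    """Keep only the cases that satisfy every filter condition exactly."""
    return [c for c in cases
            if all(c.get(k) == v for k, v in info.filters.items())]


# Hypothetical cases, flattened to feature-name/value pairs.
cases = [
    {"id": 1, "year": 2015, "region": "Anhui", "fault": 1.0},
    {"id": 2, "year": 2014, "region": "Anhui", "fault": 1.0},
    {"id": 3, "year": 2015, "region": "Anhui", "fault": 0.0},
]
info = RetrievalInfo(filters={"year": 2015, "region": "Anhui"},
                     query={"fault": 1.0})
kept = effective_cases(cases, info)
```

Case 3 survives the filter even though it does not match the query condition; it is simply expected to score lower in the similarity step.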

Step 5: calculate the case similarity matrix. Effective feature trees are selected from the set of feature trees according to the filter conditions of the retrieval information; according to the query conditions of the retrieval information and using the weight tree, the pairwise similarities within the set of effective feature trees are calculated with the weighted Manhattan distance method to form a similarity matrix, and the results are normalized. The step specifically comprises:

(1) The matrix describing pairwise case similarity is calculated from the case feature tree, the feature weight tree, and the query conditions, and changes dynamically with the query conditions.

(2) After the user enters a set of retrieval information, the effective cases are obtained through the filter conditions; the corresponding feature values and weights are then obtained according to the query conditions, and the similarity between the query conditions and each case, and between each pair of cases, is calculated. Case similarity can be calculated by defining a suitable distance and combining it with the weight information. If the number of effective cases is N, the case similarity matrix has dimension (N+1) × (N+1). Calculating case-to-case similarity under the query conditions makes cascade retrieval based on a case possible.
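A minimal sketch of the (N+1) × (N+1) matrix follows. It assumes that feature values have already been encoded as numbers in [0, 1] and that normalization means dividing the weighted Manhattan distance by the total weight, so that similarity = 1 - (sum of w_k * |a_k - b_k|) / (sum of w_k). Neither the encoding nor the normalization rule is specified in the source; the function names and sample values are hypothetical.

```python
def weighted_manhattan(a, b, weights):
    """Weighted Manhattan distance over the query dimensions in `weights`."""
    return sum(w * abs(a[k] - b[k]) for k, w in weights.items())


def similarity_matrix(query, cases, weights):
    """Build the (N+1) x (N+1) matrix; index 0 is the query itself."""
    items = [query] + list(cases)
    total = sum(weights.values()) or 1.0  # assumed normalization constant
    return [[1.0 - weighted_manhattan(a, b, weights) / total for b in items]
            for a in items]


# Hypothetical weights for the query dimensions and encoded feature values.
weights = {"fault": 0.5, "children": 0.3, "property": 0.2}
query = {"fault": 1.0, "children": 1.0, "property": 0.0}
cases = [
    {"fault": 1.0, "children": 1.0, "property": 1.0},  # differs on a low-weight feature
    {"fault": 0.0, "children": 1.0, "property": 0.0},  # differs on the highest-weight feature
]
matrix = similarity_matrix(query, cases, weights)
```

With these values the first case scores higher against the query than the second, because it misses only the lower-weighted condition. This matches the ranking behavior that information weights are meant to produce.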

Step 6: output the retrieval results. Similar cases are obtained from the case similarity matrix by finding the n cases most similar to the query conditions, or the cases whose similarity is greater than s; statistics are compiled on this information and visualized. At this point, any case in the results can be selected as a new condition, and the cascade retrieval results are obtained from the similarity matrix.
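Selecting results and cascading from a chosen case can be sketched as a row lookup in the similarity matrix. The layout (index 0 is the query, indices 1..N are cases), the function `select_similar`, and the matrix values are assumptions for illustration.

```python
def select_similar(matrix, row, n=None, s=None):
    """Rank the cases in `row`, keeping the top n and/or those with similarity > s.

    Index 0 (the query) and the row's own index are never returned.
    """
    scored = [(j, sim) for j, sim in enumerate(matrix[row])
              if j != row and j != 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if s is not None:
        scored = [(j, sim) for j, sim in scored if sim > s]
    if n is not None:
        scored = scored[:n]
    return scored


# Hypothetical normalized similarity matrix for a query and three cases.
matrix = [
    [1.0, 0.8, 0.5, 0.9],
    [0.8, 1.0, 0.3, 0.7],
    [0.5, 0.3, 1.0, 0.4],
    [0.9, 0.7, 0.4, 1.0],
]
top_two = select_similar(matrix, row=0, n=2)        # the two cases closest to the query
above = select_similar(matrix, row=0, s=0.6)        # cases with similarity above 0.6
cascade = select_similar(matrix, row=3, n=1)        # cascade from case 3
```

Because case-to-case similarities are already in the matrix, a cascade retrieval needs no new distance computation, only a different row.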

The basic principles, main features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the embodiments and the description only illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.

Claims

1. A document retrieval method based on feature weight analysis technology, characterized by comprising the following steps:

11) organizing the judgment documents: the judgment documents are organized hierarchically by cause of action;

12) constructing a case feature tree: for a specified cause of action, its public features and private features are selected and organized into a tree structure according to the logical relations between features;

13) performing weight training on the case feature tree: training is performed with a decision tree method for different targets, and the comprehensive weight of each case feature is calculated;

14) acquiring the retrieval information: the filter conditions and query conditions of the retrieval information are entered by selecting conditions, by supplying text that contains the conditions, or by supplying an entire judgment document;

15) calculating the case similarity matrix: effective feature trees are selected from the set of feature trees according to the filter conditions of the retrieval information; according to the query conditions of the retrieval information and using the weight tree, the pairwise similarities within the set of effective feature trees are calculated with the weighted Manhattan distance method to form a similarity matrix, and the results are normalized;

16) outputting the retrieval results: similar cases are obtained from the case similarity matrix by finding the n cases most similar to the query conditions, or the cases whose similarity is greater than s; statistics are compiled on this information and visualized.

2. The document retrieval method based on feature weight analysis technology according to claim 1, characterized in that constructing the case feature tree comprises the following steps:

21) defining public features: public features are the general attributes of a case;

22) defining private features: private features are the attributes specific to a case;

23) organizing the public features and private features into a tree structure according to the logical relations between features, forming the case feature tree.

3. The document retrieval method based on feature weight analysis technology according to claim 1, characterized in that calculating the case similarity matrix comprises the following steps:

31) generating the matrix of pairwise case similarities from the case feature tree, the feature weight tree, and the query conditions;

32) obtaining the effective cases through the filter conditions, obtaining the corresponding feature values and weights according to the query conditions, and calculating the similarity between the query conditions and each case and between each pair of cases.