CN105930470B - A kind of document retrieval method based on feature weight analytical technology - Google Patents
A kind of document retrieval method based on feature weight analytical technology
- Publication number
- CN105930470B (application number CN201610259097.7A)
- Authority
- CN
- China
- Prior art keywords
- case
- feature
- tree
- condition
- weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
Abstract
The invention relates to a document retrieval method based on feature weight analysis that addresses the inability of the prior art to retrieve effectively within a specific domain. The method comprises the following steps: organizing judgment documents hierarchically by cause of action; constructing a case feature tree by selecting, for a given cause of action, its public and private features and arranging them into a tree according to the logical relationships among them; training weights for the case feature tree, using a decision tree method trained for different targets to compute a combined weight for each case feature; acquiring the retrieval information, namely the filter conditions and query conditions, which may be entered by selecting conditions, by supplying text that contains the conditions, or by supplying an entire judgment document; computing the case similarity matrix; and outputting the search results. Built on case feature trees that are carefully constructed under the guidance of domain characteristics, and drawing on semantic analysis and knowledge reasoning, the method substantially improves retrieval accuracy and coverage.
Description
Technical field
The invention relates to the field of data retrieval technology, and specifically to a document retrieval method based on feature weight analysis.
Background
Document retrieval technology is widely used in daily life and makes everyday information gathering far more convenient. In specialized domains such as the discussion of judicial cases, and particularly when studying difficult cases, professionals rely not only on their own professional knowledge and experience but also, frequently, on retrieving existing similar cases in order to decide how the matter at hand should be handled. Existing general retrieval approaches include general-purpose search engines, industry-specific websites, and guiding cases. They have the following problems:
(1) General-purpose search engines, such as Baidu and Yahoo: they are not customized for the judicial domain at all, so retrieval accuracy and coverage are very low.
(2) Industry-specific websites, such as the judgment documents website and the "No Dispute" website: compared with general-purpose search engines, retrieval accuracy and coverage are markedly better and multiple filters are allowed. However, retrieval relies mainly on keyword matching and stays at the surface, so accuracy is still low; the filter conditions are preset and inflexible.
(3) Guiding cases: these are issued by the Supreme Court and are authoritative and targeted. However, there are very few of them, they lag seriously behind practice, and they are isolated from one another, so retrieval coverage is very low. The regional applicability of this top-down guidance model also needs to be considered.
In addition, none of these retrieval techniques supports semantic retrieval; filter and query conditions cannot be freely combined; retrieval cannot be continued from a previous result; and search results are neither summarized statistically nor displayed intuitively. How to design a more specialized retrieval method has therefore become a technical problem in urgent need of a solution.
Summary of the invention
The purpose of the invention is to overcome the inability of the prior art to retrieve effectively within a specific domain by providing a document retrieval method based on feature weight analysis.
To achieve this purpose, the technical solution of the invention is as follows:
A document retrieval method based on feature weight analysis, comprising the following steps:
Organizing the judgment documents: the judgment documents are organized hierarchically by cause of action.
Constructing the case feature tree: for a given cause of action, its public and private features are selected and arranged into a tree according to the logical relationships among the features.
Training weights for the case feature tree: a decision tree method is trained for different targets, and the combined weight of each case feature is computed.
Acquiring the retrieval information: the filter conditions and query conditions of the retrieval information are entered; the input may be a selection of conditions, a text containing the conditions, or an entire judgment document.
Computing the case similarity matrix: the effective feature trees are selected from the set of feature trees according to the filter conditions of the retrieval information; then, according to the query conditions and using the weight tree, the pairwise similarities within the set of effective feature trees are computed with the weighted Manhattan distance to form the similarity matrix, and the result is normalized.
Outputting the search results: similar cases are obtained from the case similarity matrix, namely the n cases most similar to the query conditions or the cases whose similarity exceeds s; statistics are computed over this information and visualized.
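The weighted Manhattan distance and the normalization named in the similarity step can be written out as follows. This is only one possible reading, under assumptions the source does not state: each case c is encoded as numeric feature values x_{c,i} over the set Q of features selected by the query conditions, w_i is the trained weight of feature i, and "normalization" is taken to mean rescaling by the largest pairwise distance (which must be nonzero).

```latex
d(a, b) = \sum_{i \in Q} w_i \,\lvert x_{a,i} - x_{b,i} \rvert,
\qquad
\mathrm{sim}(a, b) = 1 - \frac{d(a, b)}{\max_{u, v} d(u, v)}
```

Under this reading, identical feature values give a similarity of 1 and the most distant pair gives 0.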
Constructing the case feature tree comprises the following steps:
Defining the public features: public features are the general attributes of a case.
Defining the private features: private features are the attributes specific to a case.
Organizing the public and private features into a tree according to the logical relationships among the features, thereby forming the case feature tree.
Computing the case similarity matrix comprises the following steps:
Computing the matrix of pairwise case similarities from the case feature trees, the feature weight tree, and the query conditions.
Obtaining the effective cases from the filter conditions, obtaining the corresponding feature values and weights according to the query conditions, and computing the similarity between the query conditions and each case and between each pair of cases.
Beneficial effect
Compared with the prior art, the document retrieval method based on feature weight analysis of the invention is built on case feature trees that are carefully constructed under the guidance of domain characteristics and, through semantic analysis and knowledge reasoning, substantially improves retrieval accuracy and coverage. Because retrieval is organized around the retrieval information, filter and query conditions can be freely combined; constructing the case similarity matrix enables continued retrieval that starts from a case; and statistical analysis of the search results presents the relevant information intuitively.
Description of the drawings
Fig. 1 is a flow chart of the method of the invention.
Detailed description of the embodiments
To give a better understanding of the structural features of the invention and the effects it achieves, a preferred embodiment is described in detail below with reference to the drawing.
As shown in Fig. 1, the document retrieval method based on feature weight analysis of the invention comprises the following steps:
Step 1: organize the judgment documents hierarchically by cause of action. The distinctive point of this application is that the feature tree is constructed according to the domain characteristics of the particular field or industry, and these characteristics differ from field to field. To make the technical solution easier to explain, the characteristics of judicial cases are used here to illustrate the classification and design; judgment documents are therefore organized hierarchically by cause of action.
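The hierarchical organization in this step can be sketched as a nested tree keyed by the levels of the cause of action. This is a minimal illustration, not the patent's implementation: the slash-separated path format, the document identifiers, the "custody dispute" cause of action, and the function names are all assumptions made for the example.

```python
from collections import defaultdict


def make_node():
    # Each node holds child causes of action and the documents filed at this level.
    return {"children": defaultdict(make_node), "documents": []}


def organize_by_cause(documents):
    """Place each (doc_id, cause_path) pair under its cause-of-action path."""
    root = make_node()
    for doc_id, cause_path in documents:
        node = root
        for level in cause_path.split("/"):
            node = node["children"][level]
        node["documents"].append(doc_id)
    return root


def documents_under(node):
    """Collect every document filed at this node or anywhere below it."""
    found = list(node["documents"])
    for child in node["children"].values():
        found.extend(documents_under(child))
    return found


# Hypothetical sample data.
docs = [
    ("doc-001", "civil/marriage and family/divorce dispute"),
    ("doc-002", "civil/marriage and family/divorce dispute"),
    ("doc-003", "civil/marriage and family/custody dispute"),
]
tree = organize_by_cause(docs)
family = tree["children"]["civil"]["children"]["marriage and family"]
print(documents_under(family))
```

Browsing at a higher level of the hierarchy then returns the documents of every cause of action beneath it.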
Step 2: construct the case feature tree. For a given cause of action, its public and private features are selected and arranged into a tree according to the logical relationships among the features. Case feature trees correspond one-to-one with causes of action. The reason is that causes of action themselves form a hierarchy (for example civil / marriage and family / divorce dispute); if each feature tree is attached to its cause of action within that hierarchy, all the feature trees together form one large tree that is easy to maintain and browse. In this technical solution, case features are extracted from the structured database and from the text of the judgment documents, which involves semantic analysis and knowledge reasoning; compared with prior-art similar-case retrieval systems, accuracy and coverage are substantially improved. The step specifically comprises the following:
(1) Define the public features. Public features are the general attributes of a case, such as the time, the region, and information about the entities involved in the case, and are common to different causes of action. Public features are generally recorded in the structured database of the court business system and can be read directly.
(2) Define the private features. Private features are the attributes specific to a case, such as the reason for divorce, information about the children, and community property in a divorce dispute, and are particular to a given cause of action. Private features are generally recorded in the text of the judgment document. In general, the private features of a case include the key adjudication points of guiding cases and other points in dispute, and they are the points on which case similarity is compared.
(3) According to the logical relationships among the features, organize the public and private features into a tree to form the case feature tree.
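A case feature tree of this kind might be represented as below. The sketch assumes a simple two-branch layout (one subtree for public features, one for private features); the class name `FeatureNode`, the helper `build_divorce_case_tree`, and all feature values are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class FeatureNode:
    name: str
    value: Any = None
    children: List["FeatureNode"] = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child

    def leaves(self, prefix=""):
        """Flatten the tree into {path: value} for its leaf features."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        if not self.children:
            return {path: self.value}
        result = {}
        for child in self.children:
            result.update(child.leaves(path))
        return result


def build_divorce_case_tree(public, private):
    # Assumed layout: public features (from the structured database) and
    # private features (extracted from the judgment text) sit in sibling subtrees.
    root = FeatureNode("divorce dispute")
    public_branch = root.add(FeatureNode("public"))
    for name, value in public.items():
        public_branch.add(FeatureNode(name, value))
    private_branch = root.add(FeatureNode("private"))
    for name, value in private.items():
        private_branch.add(FeatureNode(name, value))
    return root


# Hypothetical feature values.
tree = build_divorce_case_tree(
    public={"time": 2015, "region": "region-A"},
    private={"reason for divorce": "separation", "children": 1, "community property": True},
)
print(tree.leaves())
```

Flattening to path-keyed leaves gives each feature a stable identifier that a weight tree of the same shape can share.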
Step 3: train weights for the case feature tree. The weight values of the case features are computed from domain knowledge and by information-theoretic principles: a decision tree method is trained for different targets, and the combined weight of each case feature is computed.
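The source does not spell out how the decision tree training yields weights. One plausible reading, sketched here purely as an assumption, is to score each feature by its information gain (the split criterion of ID3-style decision trees) with respect to each training target and to sum the gains across targets into a combined weight. The feature names, the target `divorce_granted`, and the sample rows are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, feature, target):
    """Reduction in target entropy obtained by splitting the rows on one feature."""
    base = entropy([row[target] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def combined_weights(rows, features, targets):
    """Sum each feature's gain over all targets, then normalize to total 1."""
    totals = {f: 0.0 for f in features}
    for target in targets:
        for f in features:
            totals[f] += information_gain(rows, f, target)
    norm = sum(totals.values())
    if norm == 0:
        return {f: 1.0 / len(features) for f in features}
    return {f: v / norm for f, v in totals.items()}


# Hypothetical training rows.
rows = [
    {"reason": "separation", "has_children": True, "divorce_granted": True},
    {"reason": "separation", "has_children": False, "divorce_granted": True},
    {"reason": "other", "has_children": True, "divorce_granted": False},
    {"reason": "other", "has_children": False, "divorce_granted": False},
]
weights = combined_weights(rows, ["reason", "has_children"], ["divorce_granted"])
print(weights)
```

In this toy data the reason for divorce fully determines the outcome, so it receives all of the weight; real data would spread the weight across features.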
A case feature weight tree is a data structure that describes the relative weights among case features. Unlike existing similar-case retrieval systems, the information in the search conditions here carries weights, which are used to compute the similarity between the search conditions and a case and between one case and another. Introducing these weights makes the following possible:
(1) when the search conditions cannot all be satisfied, cases that satisfy the higher-weighted conditions are ranked first;
(2) when the search conditions can all be satisfied, the cases can be ranked by the weights of their other features.
Case feature weights can be determined in several ways, for example from domain knowledge or from information-theoretic principles. Because this solution organizes the case features into a tree, the corresponding feature weights also form a tree and satisfy certain constraints, for example that the weight of a parent node equals the sum of the weights of its child nodes.
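The parent-equals-sum-of-children constraint can be enforced by assigning weights only to leaf features and deriving every inner node bottom-up. The sketch below assumes the tree is given as nested dictionaries with empty dictionaries as leaves; the sub-features "number" and "custody", the raw leaf weights, and the rescaling so that the root weighs 1 are illustrative assumptions.

```python
def subtree_weights(name, subtree, leaf_weights, prefix=""):
    """Return {path: weight}; each inner node's weight is the sum of its children."""
    path = f"{prefix}/{name}" if prefix else name
    if not subtree:  # leaf: the weight comes from training or domain knowledge
        return {path: leaf_weights[path]}
    weights = {}
    total = 0.0
    for child_name, child_subtree in subtree.items():
        child = subtree_weights(child_name, child_subtree, leaf_weights, path)
        weights.update(child)
        total += child[f"{path}/{child_name}"]
    weights[path] = total
    return weights


def normalize(weights, root):
    """Rescale so that the root weighs 1; the sum constraint is preserved."""
    scale = weights[root]
    return {path: w / scale for path, w in weights.items()}


# Hypothetical private-feature structure and raw leaf weights.
structure = {
    "reason for divorce": {},
    "children": {"number": {}, "custody": {}},
    "community property": {},
}
leaf_weights = {
    "divorce dispute/reason for divorce": 2.0,
    "divorce dispute/children/number": 1.0,
    "divorce dispute/children/custody": 1.0,
    "divorce dispute/community property": 4.0,
}
raw = subtree_weights("divorce dispute", structure, leaf_weights)
weight_tree = normalize(raw, "divorce dispute")
print(weight_tree)
```

Because inner weights are derived rather than stored independently, the constraint cannot drift when a leaf weight is retrained.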
Step 4: acquire the retrieval information. The filter conditions and query conditions of the retrieval information are entered; the input may be a selection of conditions, a text containing the conditions, or an entire judgment document.
Here, the filter conditions act as a filter that restricts, for example, the time and region of the cases; they are usually public features of the case and take no part in the similarity computation. The query conditions specify the retrieval dimensions; they are usually private features of the case and make up the dimensions of the case similarity computation. The fundamental difference between the two is that filter conditions must be satisfied, whereas query conditions need not be. Splitting the user's search conditions into filters and queries improves the controllability and flexibility of the retrieval system.
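The split between mandatory filters and optional query conditions might look like this in code. The `Retrieval` class, the representation of a filter as a set of allowed values, and the sample cases are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Retrieval:
    # Filter conditions: must be satisfied; they take no part in similarity.
    filters: dict = field(default_factory=dict)
    # Query conditions: define the similarity dimensions; a case may miss some.
    query: dict = field(default_factory=dict)


def passes_filters(case, filters):
    """A case is effective only if every filtered feature has an allowed value."""
    return all(case.get(name) in allowed for name, allowed in filters.items())


def effective_cases(cases, retrieval):
    return [case for case in cases if passes_filters(case, retrieval.filters)]


# Hypothetical cases and retrieval information.
cases = [
    {"id": "c1", "year": 2015, "region": "region-A", "reason": "separation"},
    {"id": "c2", "year": 2012, "region": "region-A", "reason": "separation"},
    {"id": "c3", "year": 2015, "region": "region-B", "reason": "other"},
]
retrieval = Retrieval(
    filters={"year": {2014, 2015}, "region": {"region-A"}},
    query={"reason": "separation"},
)
print([case["id"] for case in effective_cases(cases, retrieval)])
```

Only the cases that survive the filters go on to the similarity computation, where the query conditions are scored rather than enforced.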
Step 5: compute the case similarity matrix. The effective feature trees are selected from the set of feature trees according to the filter conditions of the retrieval information; then, according to the query conditions and using the weight tree, the pairwise similarities within the set of effective feature trees are computed with the weighted Manhattan distance to form the similarity matrix, and the result is normalized. The step specifically comprises the following:
(1) The matrix describing the pairwise similarity of cases is computed from the case feature trees, the feature weight tree, and the query conditions, and it changes dynamically with the query conditions.
(2) After the user enters a set of retrieval information, the effective cases are obtained from the filter conditions; the corresponding feature values and weights are then obtained according to the query conditions, and the similarity between the query conditions and each case and between each pair of cases is computed. Case similarity can be computed by defining a suitable distance and combining it with the weight information. If the number of effective cases is N, the case similarity matrix has dimension (N+1) × (N+1). Computing the similarity between cases under the query conditions makes cascaded retrieval based on a case possible.
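The (N+1) × (N+1) matrix can be sketched by placing the query conditions in row 0 and the N effective cases in rows 1 to N. The sketch assumes that feature values have already been encoded as numbers in [0, 1], that only the features named in the query contribute, and that normalization divides by the largest pairwise distance; none of these details is fixed by the source, and all names and values are hypothetical.

```python
def weighted_manhattan(a, b, weights):
    """Sum of weight * absolute difference over the weighted features."""
    return sum(w * abs(a[f] - b[f]) for f, w in weights.items())


def similarity_matrix(query, cases, weights):
    """Row and column 0 are the query; rows and columns 1..N are the cases."""
    active = {f: weights[f] for f in query}  # only the queried dimensions count
    items = [query] + cases
    size = len(items)
    dist = [
        [weighted_manhattan(items[i], items[j], active) for j in range(size)]
        for i in range(size)
    ]
    largest = max(max(row) for row in dist)
    if largest == 0:  # everything is identical on the queried features
        return [[1.0] * size for _ in range(size)]
    return [[1.0 - d / largest for d in row] for row in dist]


# Hypothetical weights and numerically encoded feature values.
weights = {"reason": 0.75, "children": 0.25, "property": 0.5}
query = {"reason": 1.0, "children": 0.0}
cases = [
    {"reason": 1.0, "children": 0.0, "property": 1.0},
    {"reason": 0.0, "children": 1.0, "property": 0.0},
    {"reason": 1.0, "children": 1.0, "property": 0.0},
]
matrix = similarity_matrix(query, cases, weights)
for row in matrix:
    print(row)
```

Because the case-to-case entries are computed under the same query dimensions, any row of the matrix can serve as the starting point for a follow-up retrieval.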
Step 6: output the search results. Similar cases are obtained from the case similarity matrix, namely the n cases most similar to the query conditions or the cases whose similarity exceeds s; statistics are computed over this information and visualized. At this point, any case in the results can be selected as the new condition, and cascaded search results are obtained from the similarity matrix.
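Reading results out of the similarity matrix, including the cascaded lookup from a selected case, can be sketched as follows. The convention that row 0 is the query and rows 1 to N are cases, the function name `most_similar`, and the matrix values are assumptions made for this example.

```python
def most_similar(matrix, index, n=None, s=None):
    """Rank the cases against row `index` (0 = the query, k >= 1 = case k).

    Pass n to keep the top n cases, s to keep cases with similarity above s.
    """
    ranked = sorted(
        ((j, sim) for j, sim in enumerate(matrix[index]) if j != index and j != 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if s is not None:
        ranked = [(j, sim) for j, sim in ranked if sim > s]
    if n is not None:
        ranked = ranked[:n]
    return ranked


# Hypothetical 4 x 4 similarity matrix: the query plus three effective cases.
matrix = [
    [1.0, 1.0, 0.0, 0.75],
    [1.0, 1.0, 0.0, 0.75],
    [0.0, 0.0, 1.0, 0.25],
    [0.75, 0.75, 0.25, 1.0],
]
top_two = most_similar(matrix, 0, n=2)        # the two cases closest to the query
above_half = most_similar(matrix, 0, s=0.5)   # cases with similarity above 0.5
cascade = most_similar(matrix, 3, n=1)        # continue the search from case 3
print(top_two, above_half, cascade)
```

The cascaded lookup reuses the matrix that was already computed, so selecting a result as the new condition needs no further distance calculations.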
The basic principles, main features, and advantages of the invention have been shown and described above. Those skilled in the art should understand that the invention is not limited to the embodiment above; the embodiment and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.
Claims (3)
1. A document retrieval method based on feature weight analysis, characterized by comprising the following steps:
11) organizing the judgment documents: the judgment documents are organized hierarchically by cause of action;
12) constructing the case feature tree: for a given cause of action, its public and private features are selected and arranged into a tree according to the logical relationships among the features;
13) training weights for the case feature tree: a decision tree method is trained for different targets, and the combined weight of each case feature is computed;
14) acquiring the retrieval information: the filter conditions and query conditions of the retrieval information are entered, the input being a selection of conditions, a text containing the conditions, or an entire judgment document;
15) computing the case similarity matrix: the effective feature trees are selected from the set of feature trees according to the filter conditions of the retrieval information; according to the query conditions of the retrieval information and using the weight tree, the pairwise similarities within the set of effective feature trees are computed with the weighted Manhattan distance to form the similarity matrix, and the result is normalized;
16) outputting the search results: similar cases are obtained from the case similarity matrix, namely the n cases most similar to the query conditions or the cases whose similarity exceeds s; statistics are computed over this information and visualized.
2. The document retrieval method based on feature weight analysis according to claim 1, characterized in that constructing the case feature tree comprises the following steps:
21) defining the public features, the public features being the general attributes of a case;
22) defining the private features, the private features being the attributes specific to a case;
23) organizing the public and private features into a tree according to the logical relationships among the features, thereby forming the case feature tree.
3. The document retrieval method based on feature weight analysis according to claim 1, characterized in that computing the case similarity matrix comprises the following steps:
31) computing the matrix of pairwise case similarities from the case feature trees, the feature weight tree, and the query conditions;
32) obtaining the effective cases from the filter conditions, obtaining the corresponding feature values and weights according to the query conditions, and computing the similarity between the query conditions and each case and between each pair of cases.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610259097.7A CN105930470B (en) | 2016-04-25 | 2016-04-25 | A kind of document retrieval method based on feature weight analytical technology |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610259097.7A CN105930470B (en) | 2016-04-25 | 2016-04-25 | A kind of document retrieval method based on feature weight analytical technology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105930470A (en) | 2016-09-07 |
CN105930470B (en) | 2019-03-26 |
Family
ID=56837041
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610259097.7A Active CN105930470B (en) | 2016-04-25 | 2016-04-25 | A kind of document retrieval method based on feature weight analytical technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105930470B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107066468A (en) * | 2016-11-18 | 2017-08-18 | 北京市高级人民法院 | A kind of case search method based on genetic algorithm and nearest neighbor algorithm |
CN108241621B (en) * | 2016-12-23 | 2019-12-10 | 北京国双科技有限公司 | legal knowledge retrieval method and device |
CN108694178B (en) * | 2017-04-06 | 2020-11-27 | 北京国双科技有限公司 | Method and device for recommending judicial knowledge |
CN109033041A (en) * | 2017-06-09 | 2018-12-18 | 北京国双科技有限公司 | The treating method and apparatus of document similarity |
CN110019655A (en) * | 2017-07-21 | 2019-07-16 | 北京国双科技有限公司 | Precedent case acquisition methods and device |
CN110032721B (en) * | 2018-01-11 | 2023-11-03 | 北京国双科技有限公司 | Judge document pushing method and device |
CN108595548A (en) * | 2018-04-09 | 2018-09-28 | 南京网感至察信息科技有限公司 | A kind of case judge's prediction of result method based on Markov Logic Network |
CN109947897B (en) * | 2019-03-15 | 2020-12-15 | 南京邮电大学 | Judicial case event tree construction method |
CN112561744A (en) * | 2019-09-25 | 2021-03-26 | 北京国双科技有限公司 | Method and device for generating similar case retrieval report |
CN113160000A (en) * | 2021-04-22 | 2021-07-23 | 广州广电运通信息科技有限公司 | Legal information analysis method, system, device and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105320699A (en) * | 2014-08-04 | 2016-02-10 | 中国科学院深圳先进技术研究院 | Old age preferential treatment certificate service pushing method and system |
CN105354282A (en) * | 2015-10-30 | 2016-02-24 | 青岛海尔智能家电科技有限公司 | XML file retrieval method and apparatus |
CN105447198A (en) * | 2015-12-30 | 2016-03-30 | 深圳市瑞铭无限科技有限公司 | Convenient page script importing method and device |
CN105512339A (en) * | 2015-12-31 | 2016-04-20 | 深圳市朗科科技股份有限公司 | File searcher and searching method |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6847959B1 (en) * | 2000-01-05 | 2005-01-25 | Apple Computer, Inc. | Universal interface for retrieval of information in a computer system |
US8983908B2 (en) * | 2013-02-15 | 2015-03-17 | Red Hat, Inc. | File link migration for decommisioning a storage server |
US9292525B2 (en) * | 2013-06-19 | 2016-03-22 | BlackBerry Limited; 2236008 Ontario Inc. | Searching data using pre-prepared search data |
Also Published As
Publication number | Publication date |
---|---|
CN105930470A (en) | 2016-09-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |