CN107220295A

CN107220295A - Method for people's mediation case retrieval and mediation strategy recommendation

Info

Publication number: CN107220295A
Application number: CN201710285854.2A
Authority: CN
Inventors: 王开红; 李建元; 陈涛; 蒋伶华; 范鸿俊; 温晓岳
Original assignee: Enjoyor Co Ltd
Current assignee: Yinjiang Technology Co.,Ltd.
Priority date: 2017-04-27
Filing date: 2017-04-27
Publication date: 2017-09-29
Anticipated expiration: 2037-04-27
Also published as: CN107220295B

Abstract

A method for people's mediation case retrieval and mediation strategy recommendation, comprising the following steps. Step 1: data collection and preprocessing. Step 2: word segmentation and vector representation. Step 3: TF_CDF feature clustering. Step 4: automatic generation of a desensitized typical case set, obtaining cluster categories and category keywords. Step 5: generation of mediation strategy prompts. Step 6: index creation and relevance computation: the core of a full-text search engine comprises index creation and relevance computation; the typical case data from step 4, the cluster categories obtained, and the mediation strategy prompts from step 5 are synchronized to elasticsearch to create the index. Step 7: search results and interface display: the user enters a query and obtains similar typical cases, case categories and category label information, and mediation strategy recommendations, and a similar-case analysis report is generated automatically. The invention provides a people's mediation case retrieval and mediation strategy recommendation method with higher accuracy and higher speed.

Description

Method for people's mediation case retrieval and mediation strategy recommendation

Technical field

The present invention relates to the fields of search technology and strategy recommendation, and in particular to a people's mediation case retrieval and mediation strategy recommendation method for the judicial domain.

Background technology

Internet technology has developed rapidly and information resources have grown explosively, profoundly changing how people work and live. Today the internet is becoming the main place for obtaining resources and exchanging information; users can find a great deal of information through internet search, which brings great convenience to work and life. However, as internet resources grow, the results returned by a search engine are not necessarily what the user wants, and the user often has to sift through the results, which lowers search efficiency. Moreover, existing Web-based search engines cannot provide search services for specific domains. How to improve the efficiency with which users obtain target results, improve the user experience, and build search engines for specific domains is therefore a challenging problem in current search technology.

A great deal of work has been done on improving search quality, on search engines for professional domains, and on personalized needs. Patent CN201010559233 addresses the fact that multiple search engines return different results: it first ranks the results from multiple search engines according to the weight of each engine and the weight of each result's rank position within its engine, then adjusts this base ranking using information such as co-occurrence, making the ranking criteria more reasonable and improving result quality. Patent CN201210548858.2 provides a search-category list and a search-engine list in the browser window, so that for the same search input the user can obtain results from different search categories and search engines. It takes into account the influence of different browsers and categories on the results, but the user must keep switching browser and search-category combinations to find information that meets the need. Patent CN201310226576.5 provides a first search result for the user's search term and receives the user's behavior data on that result; it then generates recommended search terms from visual features, user behavior data and the search term, which can accurately mine the user's search intent, make results more targeted, meet personalized search needs, and improve the search experience. Patent CN102567326B discloses an information search ranking device and method comprising a determining unit, a predicting unit and a ranking unit, which improves result accuracy by combining search log information and organizational structure information.

A people's mediation search service is a professional search engine for the judicial domain. Web-based search engine models and services cannot meet user needs, so a domain-specific search service must be customized. Mediation cases serve as the data source for case retrieval: the data volume is large, there is no clear standard for filling in records, and entries vary widely, all of which affect search service quality. For example: most cases are short texts, on which traditional algorithms such as TF_IDF give low search accuracy; finding representative cases for study from a large number of cases takes considerable time and effort; and cases contain a large amount of private information, so manual desensitization is laborious.

Summary of the invention

To overcome the low accuracy and long time consumption of existing people's mediation search methods, the present invention provides a people's mediation case retrieval and mediation strategy recommendation method with higher accuracy and shorter time consumption.

The technical solution adopted by the present invention to solve this technical problem is:

A method for people's mediation case retrieval and mediation strategy recommendation, comprising the following steps:

Step 1: Data collection and preprocessing

Collect people's mediation case information and store it in a database. The required fields are: dispute details, mediation result, mediation details, mediation time, end time, mediator, region, mediation member and evaluation. Dispute details, mediation details and evaluation are text data; the other fields are structured data.

Preprocess the collected data: ensure that the mediation result and mediation details fields are not empty, and delete duplicate records;

Step 2: Word segmentation and vector representation

Create a professional-domain dictionary for dispute mediation, mediate.txt. Words that are easily segmented incorrectly, especially dispute-mediation domain vocabulary, and words that cannot be segmented correctly according to the mediation case data are added to mediate.txt. In addition, Chinese contains some words that carry no meaning; these meaningless, weakly discriminating words are added to the stop-word dictionary stopword.txt, and stop words are removed during segmentation and excluded from analysis;

Segment the text fields using the dictionary mediate.txt and the stop-word dictionary stopword.txt, and represent the text data as vectors;

Step 3: TF_CDF feature clustering

Because mediation cases carry no detailed category information, text word weights are computed with TF_CDF, and TF_CDF feature clustering is performed to obtain the detailed case categories and category keywords; the TF_CDF values of words are obtained from the clustering result;

Step 4: Automatically desensitize and score the cases, and automatically generate the desensitized typical case set;

Step 5: Generate mediation strategy prompts

Using the typical cases with category labels as analysis data, the mediation strategy for a given category is generated as follows:

(5.1) obtain the typical case set with category labels and extract the mediation strategy field;

(5.2) mediation strategies carry numbered clause markers (first, second, third, and so on); split each mediation strategy at these markers to form mediation clauses;

(5.3) perform TF_CDF clustering on the mediation clauses and extract the keywords of the mediation clauses;

(5.4) score each category of mediation clauses. The scoring criteria include the number of mediation clauses in the category and the proportion of the category taken up by clauses sharing the same keyword;

(5.5) score each mediation clause. The scoring criteria include the number of category keywords that appear in the clause, how often they appear, and the quality of the text;

(5.6) sort the clause categories by score in descending order, take the higher-scoring categories, extract the high-scoring mediation clauses from those categories as the mediation strategy prompt information, and store them in the database;
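As an illustration only, steps (5.2), (5.5) and (5.6) can be sketched as follows. The sketch assumes clauses are numbered in the form "1、", "2、", and it scores a clause simply as the number of distinct category keywords it contains plus their total occurrences; the clustering of step (5.3), the category score of step (5.4) and the text-quality term are omitted. The function names and the scoring formula are hypothetical and are not specified in the source.

```python
import re


def split_clauses(strategy):
    """Split a mediation strategy at numbered markers such as '1、' (assumed marker form)."""
    parts = re.split(r"\s*\d+、\s*", strategy)
    return [p.strip() for p in parts if p.strip()]


def clause_score(clause, keywords):
    """Hypothetical score: distinct category keywords present plus total occurrences."""
    distinct = sum(1 for k in keywords if k in clause)
    occurrences = sum(clause.count(k) for k in keywords)
    return distinct + occurrences


def top_clauses(strategies, keywords, top_n=2):
    """Return the highest-scoring clauses across all strategies of one category."""
    clauses = [c for s in strategies for c in split_clauses(s)]
    return sorted(clauses, key=lambda c: clause_score(c, keywords), reverse=True)[:top_n]
```

The clauses returned by `top_clauses` play the role of the mediation strategy prompt information that step (5.6) stores in the database.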

Step 6: Create the index and compute relevance

The core of a full-text search engine comprises index creation and relevance computation. The typical case data from step 4, the cluster categories obtained, and the mediation strategy prompts from step 5 are synchronized to elasticsearch to create the index;

Step 7: Search results and interface display

The user enters a query and obtains similar typical cases, case categories and category label information, and mediation strategy recommendations; a similar-case analysis report is generated automatically.

Further, in step 7, the search procedure is as follows:

(7.1) similar cases: search results are output in descending order of relevance by default; the user can manually filter and sort the results, for example showing only cases of a specified type and time period, or sorting by time in ascending or descending order;

(7.2) case categories and category labels: these can be used as filter conditions for retrieval, and each case shows its case category and category keywords;

(7.3) mediation strategy recommendation: mediation strategies are recommended automatically according to the related cases found by the search;

(7.4) retrieval result analysis: when needed, the user clicks the analysis button to obtain a retrieval result analysis report. The report is divided into time-series analysis, regional analysis, mediation-member analysis, mediator analysis, mediation-result analysis and mediation-duration analysis; a secondary search within the result set can be carried out according to the analysis results.

Further, in step 3, feature clustering of the "case details" field of the mediation cases proceeds as follows:

(3.1) determine the initial values

Suppose the "case details" of people's mediation cases can be clustered into k classes and there are n cases in total, forming the corpus D = {d₁, d₂, ..., d_n}. Here the corpus is the set of "case details" field values across all cases, and d is a single "case details" text in the corpus. The corpus text is segmented, and the distinct words obtained are {t₁, t₂, ..., t_N};

(3.2) assign each "case details" text to the nearest class according to cosine similarity

Cosine similarity is used as the clustering metric, as shown in formula (1):

where the quantity defined by formula (1) is the minimum cosine distance from case d_i to the cluster centers, i.e. case d_i belongs to class j, and c^j is the j-th cluster center;
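Formula (1) is not reproduced in this text. As an illustration only, one form consistent with the surrounding definitions is given below. The names sim and j* are assumptions; d_i is the weighted vector of case i and c^j is the j-th cluster center.

```latex
\operatorname{sim}(d_i, c^j) = \frac{d_i \cdot c^j}{\lVert d_i \rVert \, \lVert c^j \rVert},
\qquad
j^{*}(d_i) = \arg\max_{1 \le j \le k} \operatorname{sim}(d_i, c^j)
           = \arg\min_{1 \le j \le k} \bigl(1 - \operatorname{sim}(d_i, c^j)\bigr)
```

Maximizing the cosine similarity is equivalent to minimizing the cosine distance 1 − sim, which matches the description of d_i being assigned to the class whose center is at the minimum cosine distance.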

(3.3) update the TF_CDF model

Compute the within-cluster variance E of the clustering. If E is less than half of the initial within-cluster variance, E₀/2, update TF_CDF; if E is greater than E₀/2, skip step (3.3). The category entropy of a word is computed according to formula (2):

where the per-class ratio in formula (2) is the proportion of documents in class j that contain word w_p, that is, the number of documents in class j containing w_p divided by cw^j, the total number of documents in class j; H(w_p) is the category entropy of word w_p over the k classes. The larger the category entropy, the more evenly the word is distributed and the lower its weight. Traditional entropy-based feature extraction simply selects low-entropy words as text keywords; it can pick words that are peculiar to one class but not representative, and miss words that are shared by several classes yet fairly representative. For example, in "A and B are neighbors ... dispute ... water leaking between upstairs and downstairs", the word "neighbor" occurs frequently in both water-leak disputes and noise disputes, yet it is precisely what marks them as belonging to the broad class of neighbor disputes, whereas "upstairs and downstairs" occurs almost only in water-leak disputes. With entropy-based feature extraction, "upstairs and downstairs" receives a very large weight, while "neighbor" may not be extracted at all.
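Formula (2) is not reproduced in this text. As an illustration only, one form consistent with the definitions above is given below. The symbol cw_p^j for the number of documents in class j containing w_p, and the normalization of the per-class ratios so that they sum to 1, are assumptions made here so that H(w_p) is largest when the word is spread evenly over the classes; the convention 0 ln 0 = 0 is used.

```latex
p_p^j = \frac{cw_p^j}{cw^j},
\qquad
\tilde{p}_p^j = \frac{p_p^j}{\sum_{l=1}^{k} p_p^l},
\qquad
H(w_p) = -\sum_{j=1}^{k} \tilde{p}_p^j \,\ln \tilde{p}_p^j
```

Under this form a word that occurs in only one class has H(w_p) = 0, and a word that occurs in the same proportion of documents in every class has the maximum value ln k.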

The present invention computes the TF_CDF value of a word w_p using formula (3):

where TF_p is the frequency of word w_p in text i; DF_p, the document frequency, is the number of documents in the corpus that contain the word; H(w_p) in the denominator is the entropy of the word; ln() is the natural logarithm, which reduces the weight of the document frequency; and ε is a small value that prevents an error when H(w_p) is 0. The formula takes word frequency, document frequency and category entropy into account together and gives higher accuracy on short texts.
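Formula (3) is not reproduced in this text. As an illustration only, one form consistent with the description (word frequency and log-damped document frequency in the numerator, entropy plus ε in the denominator) is given below. The exact arrangement, and in particular the offset of 1 inside the logarithm, are assumptions and may differ from the original formula.

```latex
\mathrm{TF\_CDF}(w_p, d_i) = \frac{TF_p \cdot \ln\!\left(1 + DF_p\right)}{H(w_p) + \varepsilon}
```

Under this form a word that is frequent in the text, common in the corpus, and concentrated in few classes receives a high weight.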

(3.4) update the cluster centers: take the mean of the text vectors in each class as the new cluster center;

(3.5) repeat steps (3.2) to (3.4) until the cluster centers no longer change, at which point the TF_CDF values no longer change either, yielding k categories and the TF_CDF model;

(3.6) class label extraction: after clustering is complete, extract the several words with the highest TF_CDF values in each category as the class labels.

In step (3.1), the initial values are determined as follows:

1. determine the initial TF_CDF values

The initial TF_CDF value of a word is set to its word frequency. After segmentation, a case i is represented as d_i = {w₁, w₂, ..., w_N}, where j = 1, 2, ..., N, N is the number of distinct words in the corpus after segmentation, and w_j is the number of times word t_j occurs in case i;

2. determine the initial cluster centers

Compute k initial cluster centers that are far apart from one another: C = {C₁, C₂, ..., C_k}, c^j = {c₁, c₂, ..., c_N}, where C is the set of k cluster centers and c is a single cluster center in vector form; for ease of notation, class labels are always written as superscripts;

3. compute the within-cluster variance

Compute the sum of the distances between each case and its initial cluster center to determine the initial within-cluster variance E₀.

In step (3.3), the TF_CDF values may be updated once every several iterations, or a threshold on the change in the cluster centers may be set and the update performed when the change exceeds this threshold.

In step 4, the desensitized typical case set is generated automatically as follows:

(4.1) automatic desensitization

Natural language processing techniques are used to automatically recognize personal names, identity-document information and addresses, and the recognized information is replaced in the original text with the placeholder "so-and-so";

(4.2) automatic generation of the typical case set

Through case scoring, a high-quality typical case set is generated automatically by machine. Step 3 divides the cases into different categories; cases are scored within each category, and the higher-scoring cases in each category are taken as the typical cases of that category.

In step (4.2), the case scoring model is created as follows:

1. analyze the case template and divide it into the modules case summary, mediation process, mediation result description and mediation insights;

2. analyze the text quality of each case module using natural language processing methods:

Q_t = aQ_l + bQ_p + cQ_s (4)

where a, b and c, with a + b + c = 1, are the weights of the individual quality scores. Q_l is the length quality: if the length of the unsegmented text exceeds a threshold T_l, Q_l is 1; otherwise Q_l decays exponentially. Q_p is the word-repetition quality, based on the ratio of the count of the most frequent word in the text to the total word count of the text: Q_p is 1 when this ratio is below a threshold T_p and decays exponentially when the ratio exceeds it. Q_s is the ratio of the text length after segmentation to the text length before segmentation;

3. assign each mediator a scoring weight: mediators who frequently handle difficult cases and have a higher mediation success rate receive a higher mediator scoring weight Q_h ∈ [0,1];

4. perform sentiment analysis on the mediation feedback to obtain the feedback weight Q_e; positive and negative feedback are assigned scoring weights, with positive feedback weighted higher;

5. combining the aspects above, create the case scoring model Q = (αQ_h + βQ_e)Q_t, where α is the weight of Q_h in the case score, β is the weight of Q_e, and α + β = 1. Cases are scored according to the case scoring model, and the typical case set with category labels is generated automatically.

In step 6, the index is created and relevance is computed as follows:

(6.1) create the index

Elasticsearch (ES) is used to create full-text indexes on "dispute details", "mediation result description" and "mediation strategy recommendation". "Dispute details" and "mediation result description" are fields contained in the original data, and "mediation strategy recommendation" is the field computed in step 5;
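As an illustration only, the request body for such an index could be assembled as below. The English field names, the extra `category` and `mediation_time` fields, and the typeless mapping layout used by recent Elasticsearch versions are all assumptions; the source does not give a mapping, an analyzer or an Elasticsearch version. The sketch only builds the dictionary and does not call any client library.

```python
def build_index_body(
    text_fields=(
        "dispute_details",
        "mediation_result_description",
        "mediation_strategy_recommendation",
    ),
):
    """Build a hypothetical index-creation body with full-text fields."""
    # Full-text fields are analyzed and searchable by relevance.
    properties = {name: {"type": "text"} for name in text_fields}
    # Structured fields (assumed names) support exact filtering and sorting.
    properties["category"] = {"type": "keyword"}
    properties["mediation_time"] = {"type": "date"}
    return {"mappings": {"properties": properties}}
```

A body of this shape would be sent with the index-creation request; filtering by category and sorting by time, as described in step 7, rely on the structured fields.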

(6.2) relevance computation

Clustering in step 3 yields the categories, class labels, cluster centers and TF_CDF values, and this patent represents text vectors with TF_CDF weights. The input search content Query is segmented into pn words and likewise represented as a TF_CDF vector, and the text similarity is computed. If a word appears among the class labels, the text relevance is raised accordingly.
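As an illustration only, the relevance computation can be sketched as a cosine similarity between sparse TF_CDF vectors with a boost for query words that are class labels. The multiplicative boost and its size (`boost=0.1` per matching word) are assumptions; the source states only that relevance is raised when a word appears among the class labels.

```python
import math


def relevance(query_weights, doc_weights, class_labels, boost=0.1):
    """Cosine similarity of two {word: TF_CDF weight} dicts, boosted by label hits."""
    shared = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[w] * doc_weights[w] for w in shared)
    norm = math.sqrt(sum(x * x for x in query_weights.values())) * math.sqrt(
        sum(x * x for x in doc_weights.values())
    )
    score = 0.0 if norm == 0 else dot / norm
    # Hypothetical boost: each query word that is also a class label raises the score.
    hits = sum(1 for w in query_weights if w in class_labels)
    return score * (1.0 + boost * hits)
```

Sorting candidate cases by this value in descending order gives the default output order described in step (7.1).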

The technical concept of the present invention is to build a judicial-domain case retrieval service that is suited to short-text search and has high accuracy. Using natural language processing and advanced machine learning techniques, it automatically scores the quality of mediation records and builds a high-quality mediation case library, and on this basis mines the mediation strategies of similar cases, promoting "same case, same mediation" and "fair mediation" in people's mediation. The invention obtains case categories and class labels through TF_CDF feature clustering of the text; it uses elasticsearch to create a distributed, fault-tolerant index, computes text relevance from TF_CDF values and cluster labels, and retrieves similar cases, improving the accuracy of full-text search and category search.

The beneficial effects of the present invention are mainly as follows:

1) Subdividing the types of mediation cases not only uncovers category information hidden in the cases but also makes it easier for users to search by category, improving the user experience.

2) In view of the characteristics of the case retrieval domain, case quality is scored and a high-quality typical case set is generated with automatic desensitization.

3) Mediation strategy prompts are produced for cases of the same category, helping users understand how similar cases were handled and promoting "same case, same mediation" and "fair mediation" in people's mediation.

4) In view of the short-text nature of mediation cases, a TF_CDF model is obtained that combines word frequency with category document frequency and performs better than the commonly used TF_IDF model.

5) Feature clustering is performed with TF_CDF, using cosine similarity as the clustering metric.

6) Search output is ranked by similarity by default, and text similarity is computed from TF_CDF values and class labels, improving the accuracy of retrieval results.

Brief description of the drawings

Fig. 1 is a flow chart of the mediation case retrieval engine.

Fig. 2 is a schematic diagram of part of the mediation dictionary mediate.txt.

Fig. 3 is a schematic diagram of part of the stop-word dictionary stopword.txt.

Embodiment

The invention will be further described below in conjunction with the accompanying drawings.

Referring to Fig. 1 to Fig. 3, a method for people's mediation case retrieval and mediation strategy recommendation comprises the following steps:

Step 1: Data collection and preprocessing

Collect people's mediation case information and store it in a database. The required fields include: dispute details, mediation result, mediation details, mediation time, end time, mediator, region, mediation member, evaluation and so on. Dispute details, mediation details and evaluation are text data (unstructured data); the other fields are structured data.

Preprocess the collected data: ensure that the mediation result and mediation details fields are not empty and that the mediation case is not overly simple, and delete duplicate records.
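As an illustration only, the preprocessing rules above can be sketched as follows. The field names, the `min_chars` threshold used to reject overly simple cases, and the choice of duplicate key are assumptions; the source does not define them.

```python
def preprocess(cases, min_chars=10):
    """Drop incomplete or overly short cases and remove exact duplicates."""
    seen = set()
    cleaned = []
    for case in cases:
        result = (case.get("mediation_result") or "").strip()
        details = (case.get("mediation_details") or "").strip()
        if not result or not details:
            continue  # required text fields must not be empty
        if len(details) < min_chars:
            continue  # hypothetical rule for "overly simple" cases
        key = (case.get("dispute_details"), result, details)
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        cleaned.append(case)
    return cleaned
```

Records that pass this filter go on to word segmentation in step 2.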

Step 2: Word segmentation and vector representation

Text data is unstructured and cannot be analyzed directly by a computer; the text must be split into words and represented as structured data so that the computer can process it further. English can be segmented on spaces, but Chinese has no explicit word separator, so segmentation is harder; in general, an auxiliary dictionary is used during segmentation to improve accuracy.

The present invention creates a professional-domain dictionary for dispute mediation, mediate.txt, for words that are easily segmented incorrectly, especially dispute-mediation domain vocabulary. For example, "XX villagers' committee" may be split as "XX village / committee" when the correct split is "XX / villagers' committee". Based on the mediation case data, words such as "villagers' committee" that cannot be segmented correctly are added to mediate.txt. In addition, Chinese contains some words that carry no meaning, and the dispute-mediation domain also has terms such as Party A and Party B; these words not only carry no information but also interfere with subsequent analysis. Such meaningless, weakly discriminating words are added to the stop-word dictionary stopword.txt, and stop words are removed during segmentation and excluded from analysis.

Segment the text fields using the dictionary mediate.txt and the stop-word dictionary stopword.txt, and represent the text data as vectors for convenient analysis and processing.
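As an illustration only, dictionary-assisted segmentation with stop-word removal and a term-count vector can be sketched as below. Forward maximum matching is a stand-in chosen for brevity; the source does not name the segmentation algorithm. The in-memory sets stand for the contents of mediate.txt and stopword.txt, and the sample words are hypothetical.

```python
from collections import Counter


def segment(text, dictionary, stopwords, max_len=6):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        i += len(match)
        if match not in stopwords and not match.isspace():
            tokens.append(match)
    return tokens


def to_vector(tokens, vocabulary):
    """Term-count vector over a fixed vocabulary of distinct words."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]
```

Because "村委会" is in the dictionary it is kept as one token instead of being split, and stop words never reach the vector.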

Step 3: TF_CDF feature clustering

The "case details" in dispute mediation are divided into only a few broad classes, with no detailed class distinction; all the data are mixed together, which hampers analysis and search, and there is no training data, so class labels can only be extracted by clustering. The "case details" were segmented and represented as vectors in step 2, but the text data has a high dimensionality and its features are not distinct, so feature extraction is still needed; finally a machine learning algorithm is used for clustering to obtain the class labels.

The present invention computes text word weights with TF_CDF, which considers not only the word frequency and document frequency of a word but also its word-category frequency, so the result is more reliable than the commonly used TF_IDF feature extraction algorithm. Traditional text feature extraction based on information entropy, mutual information and the like is usually applied to classification or label extraction when the text class labels are known or some training data is available. In the present invention, the mediation cases have no detailed category information, and the TF_CDF values of words are obtained from the clustering result.

Feature clustering of the "case details" field of the mediation cases proceeds as follows:

(3.1) Determine the initial values

Suppose the "case details" of people's mediation cases can be clustered into k classes and there are n cases in total, forming the corpus D = {d₁, d₂, ..., d_n}. Here the corpus is the set of "case details" field values across all cases, and d is a single "case details" text in the corpus. The corpus text is segmented, and the distinct words obtained are {t₁, t₂, ..., t_N}.

1. Determine the initial TF_CDF values

TF_CDF values are computed iteratively. Following the principle of simple, fast computation, the initial TF_CDF value of a word is set to its word frequency. After segmentation, a case i is represented as d_i = {w₁, w₂, ..., w_N}, where j = 1, 2, ..., N, N is the number of distinct words in the corpus after segmentation, and w_j is the number of times word t_j occurs in case i.

2. Determine the initial cluster centers

The choice of initial cluster centers has a large influence on the clustering result. Following the principle that the initial cluster centers should be as far apart as possible, compute k initial cluster centers that are far apart from one another: C = {C₁, C₂, ..., C_k}, c^j = {c₁, c₂, ..., c_N}, where C is the set of k cluster centers and c is a single cluster center in vector form; for ease of notation, class labels are always written as superscripts.

3. Compute the within-cluster variance

Compute the sum of the errors between each case and its initial cluster center to determine the initial within-cluster variance E₀.
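As an illustration only, the initialization can be sketched with a farthest-first selection of centers and a sum of cosine distances for E₀. Starting from the first vector, and using cosine distance as the error measure, are assumptions; the source states only that the initial centers should be far apart.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v (1.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def farthest_first_centers(vectors, k):
    """Pick k mutually distant vectors: each new center maximizes its distance to the chosen ones."""
    centers = [vectors[0]]
    while len(centers) < k:
        best = max(vectors, key=lambda v: min(cosine_distance(v, c) for c in centers))
        centers.append(best)
    return centers


def within_cluster_variance(vectors, centers):
    """Sum over all cases of the distance to the nearest center (E0 for the initial centers)."""
    return sum(min(cosine_distance(v, c) for c in centers) for v in vectors)
```

The value returned for the initial centers serves as E₀, against which later values of E are compared in step (3.3).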

(3.2) Assign each "case details" text to the nearest class

Cosine similarity measures the similarity between two vectors by the cosine of the angle between them; it reflects mainly the difference in direction and is insensitive to absolute magnitude. After vectorization, text data becomes multi-dimensional vectors, and cosine similarity is used as the clustering metric, as shown in formula (1):

where the quantity defined by formula (1) is the minimum cosine distance from case d_i to the cluster centers, i.e. case d_i belongs to class j, and c^j is the j-th cluster center.

(3.3) Update the TF_CDF model

Compute the within-cluster variance E of the clustering. If E is less than half of the initial within-cluster variance, E₀/2, update TF_CDF; if E is greater than E₀/2, skip step (3.3). After classification, if a word occurs many times in one class and rarely in the other classes, the word is important for that class and its weight should be increased accordingly. If a word occurs in roughly the same proportion in every class, its discriminating power is low and its weight should be reduced. The entropy of a word's distribution over the classes is computed according to formula (2):

where the per-class ratio in formula (2) is the proportion of documents in class j that contain word w_p, that is, the number of documents in class j containing w_p divided by cw^j, the total number of documents in class j; H(w_p) is the entropy of word w_p over the k classes. The more evenly a word is distributed over the classes, the larger its entropy and the lower its discriminating power.
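As an illustration only, the entropy of a word over the classes can be sketched as below. Normalizing the per-class ratios so that they sum to 1 is an assumption made here so that an evenly distributed word gets the largest entropy, as the text describes; the original formula (2) is not reproduced in this text and may differ.

```python
import math


def category_entropy(docs_with_word, docs_in_class):
    """Entropy of a word's distribution over classes.

    docs_with_word[j]: number of documents in class j that contain the word.
    docs_in_class[j]:  total number of documents in class j.
    """
    ratios = [cw_p / cw for cw_p, cw in zip(docs_with_word, docs_in_class) if cw > 0]
    total = sum(ratios)
    if total == 0:
        return 0.0  # the word occurs in no class
    probs = [r / total for r in ratios if r > 0]  # assumed normalization
    return -sum(p * math.log(p) for p in probs)
```

A word found in half the documents of each of two classes gets entropy ln 2, while a word found in only one class gets 0.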

For short text data that can be divided into a few broad classes, once stop words are removed, a large share of the frequently occurring words are the more meaningful ones. The TF_CDF value of a word w_p is computed as shown in formula (3):

where TF_p is the frequency of the p-th word in text i; DF_p, the document frequency, is the number of documents in the corpus that contain the word; H(w_p) in the denominator is the entropy of the word; ln() is the natural logarithm, which reduces the weight of the document frequency; and ε is a small value that prevents an error when H(w_p) is 0. The formula takes word frequency, document frequency and category entropy into account together and gives higher accuracy on short texts.
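As an illustration only, a weight with the ingredients described above can be sketched as follows. The exact arrangement of formula (3) is not reproduced in this text; the form TF · ln(1 + DF) / (H + ε), including the offset of 1 inside the logarithm and the default value of `eps`, is an assumption.

```python
import math


def tf_cdf(tf, df, entropy, eps=1e-3):
    """Hypothetical TF_CDF weight: term frequency times log-damped document
    frequency, divided by the category entropy plus a small constant."""
    return tf * math.log(1 + df) / (entropy + eps)
```

With this form, lowering a word's entropy (it concentrates in fewer classes) raises its weight, and `eps` keeps the weight finite when the entropy is 0.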

Updating the TF_CDF values in every iteration is relatively expensive. The TF_CDF values can instead be updated once every several iterations, or a threshold on the change in the cluster centers can be set and the update performed when the change exceeds this threshold.

(3.4) Update the cluster centers

Take the mean of the text vectors in each class as the new cluster center.

(3.5) Repeat steps (3.2) to (3.4) until the cluster centers no longer change, at which point the TF_CDF values no longer change either, yielding k classes and the TF_CDF model.
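As an illustration only, the loop over steps (3.2) and (3.4) can be sketched as below. The sketch keeps the vectors fixed and omits the TF_CDF re-weighting of step (3.3); keeping the old center for an empty class and the `max_iter` cap are assumptions.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.0 if norm == 0 else dot / norm


def mean_vector(group):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(group) for column in zip(*group)]


def cluster(vectors, centers, max_iter=100):
    """Alternate assignment (3.2) and center update (3.4) until the centers stop changing."""
    labels = []
    for _ in range(max_iter):
        # (3.2) assign each vector to the most similar center
        labels = [
            max(range(len(centers)), key=lambda j: cosine_similarity(v, centers[j]))
            for v in vectors
        ]
        # (3.4) recompute each center as the mean of its members
        new_centers = []
        for j, old in enumerate(centers):
            members = [v for v, label in zip(vectors, labels) if label == j]
            new_centers.append(mean_vector(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

In the full method the vectors themselves would be re-weighted with updated TF_CDF values between iterations whenever the condition on E in step (3.3) is met.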

Step 4: Automatically generate the desensitized typical case set

(4.1) Automatic desensitization

To protect the private information in the cases, the large amount of private information they contain, such as names and addresses, must be specially processed. Manual processing takes a great deal of time and effort, so the present invention performs desensitization automatically. Natural language processing techniques are used to automatically recognize personal names, identity-document information and addresses, and the recognized information is replaced in the original text with the placeholder "so-and-so".
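As an illustration only, the replacement stage can be sketched as below. The recognition stage is not shown: `entities` stands for the names and addresses that a named-entity recognizer would return, which the source does not specify. The 18-character ID-number pattern, the asterisk masking and the placeholder "某某" ("so-and-so") are assumptions.

```python
import re

# Assumed pattern: 17 digits followed by a digit or X, not embedded in a longer number.
ID_NUMBER = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")


def desensitize(text, entities, placeholder="某某"):
    """Replace recognized names/addresses with a placeholder and mask ID numbers."""
    # Replace longer entities first so that a short name inside a longer one is not hit first.
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, placeholder)
    return ID_NUMBER.sub("*" * 18, text)
```

Sums of money and other short numbers are left untouched because the pattern only matches 18-character ID numbers.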

(4.2) typical case collection is automatically generated

Reconciling case registration, there is no fixed standard in, and the quality of case has larger difference, in pacification worker's short time It is difficult to find the excellent case for being available for just examining, therefore set up high-quality typical case collection and have great significance.But, manually from big The typical case extracted in the case of amount for study reference not only expends the time, is also easily influenceed by individual factor.The present invention is logical Case scoring is crossed, high-quality typical case collection is automatically generated by machine, objective and fair saves the time.Case is divided into by step 3 It is different classes of, case scoring is carried out to each classification respectively, higher case that case in each classification is scored is used as this class Typical case, case Rating Model creation method is as follows：

1. case template is analyzed, several comprising modules are divided into, such as merit brief introduction, mediation process, result is mediated Illustrate, reconcile the modules such as gains in depth of comprehension, each classification there may be different demarcation method, and other cases are entered according to corresponding comprising modules Row is divided；

2. the text quality of each case comprising modules is analyzed using natural language processing method, for example：Case constitutes mould The word length of block, the implication of text, if be made up of etc. that (other indexs that can react text quality are wrapped repetitor or sentence It is contained within the scope of this patent), the high case scoring of text quality is higher.

Q_t=aQ_l+bQ_p+cQ_s (4)

Here a, b and c are the weights of the individual quality scores, with a+b+c=1. Q_l is the word-count quality: if the length of the unsegmented text exceeds a threshold T_l, Q_l is 1; otherwise Q_l decays exponentially. Q_p is the word-repetition quality, based on the repetition ratio, i.e. the share of the text's total word count taken by its most frequent word: Q_p is 1 while this ratio is below a threshold T_p and decays exponentially once the ratio exceeds it. Q_s is the ratio of the text length after segmentation to the text length before segmentation. Because segmentation removes single-character words and stop words, Q_s measures whether the text contains too many meaningless words; the larger Q_s is, the better the text quality.
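
A minimal sketch of formula (4) is given below. The patent only states that Q_l and Q_p decay exponentially; the concrete decay expressions, the default weights and the default thresholds used here are assumptions chosen for illustration.

```python
import math
from collections import Counter


def length_quality(raw_len, t_l):
    """Q_l: 1 above the length threshold T_l, exponential decay below it."""
    if raw_len > t_l:
        return 1.0
    return math.exp(-(t_l - raw_len) / t_l)  # assumed decay form


def repetition_quality(words, t_p):
    """Q_p: 1 while the most frequent word's share is below T_p, decaying above it."""
    if not words:
        return 0.0
    share = Counter(words).most_common(1)[0][1] / len(words)
    if share < t_p:
        return 1.0
    return math.exp(-(share - t_p) / t_p)  # assumed decay form


def segmentation_quality(raw_len, words):
    """Q_s: text length after segmentation divided by the length before it."""
    if raw_len == 0:
        return 0.0
    return sum(len(w) for w in words) / raw_len


def text_quality(raw_text, words, a=0.4, b=0.3, c=0.3, t_l=50, t_p=0.2):
    """Formula (4): Q_t = a*Q_l + b*Q_p + c*Q_s, with a + b + c = 1."""
    assert abs(a + b + c - 1.0) < 1e-9
    q_l = length_quality(len(raw_text), t_l)
    q_p = repetition_quality(words, t_p)
    q_s = segmentation_quality(len(raw_text), words)
    return a * q_l + b * q_p + c * q_s
```

Here `words` is the segmented text with single-character words and stop words already removed, so that Q_s falls as more of the original text is discarded.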

3. Case mediators include chief mediators, ordinary mediators, part-time staff and so on. Analyze each mediator's mediation experience, identify the fields in which the mediator is skilled, and assign each mediator a scoring weight. For a mediator who frequently handles disputes and whose cases have a high mediation success rate, the mediator scoring weight Q_h∈[0,1] is raised accordingly.

5. Taking the above aspects together, create the case scoring model Q=(αQ_h+βQ_e)Q_t, where α is the weight of Q_h in the case score, β is the weight of Q_e, and α+β=1. Score the case information with the case scoring model and automatically generate the typical case set with class labels.
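
The scoring model and the per-class selection of typical cases could be sketched as follows. The dictionary keys (`id`, `label`, `q_h`, `q_e`, `q_t`), the default α and the number of cases kept per class are hypothetical choices for the example, not values given in the patent.

```python
def case_score(q_h, q_e, q_t, alpha=0.5):
    """Q = (alpha*Q_h + beta*Q_e) * Q_t with alpha + beta = 1."""
    beta = 1.0 - alpha
    return (alpha * q_h + beta * q_e) * q_t


def typical_cases(cases, top_n=1, alpha=0.5):
    """Keep the top_n highest-scoring cases in each class.

    Each case is a dict with hypothetical keys: id, label, q_h, q_e, q_t.
    """
    by_label = {}
    for case in cases:
        score = case_score(case["q_h"], case["q_e"], case["q_t"], alpha)
        by_label.setdefault(case["label"], []).append((score, case["id"]))
    return {
        label: [cid for _, cid in sorted(scored, reverse=True)[:top_n]]
        for label, scored in by_label.items()
    }
```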

Step 5：Generate mediation strategy prompting

The mediation strategy of a case is what users care about most. It can serve as the basis for mediating similar cases, realizing "same case, same mediation" and promoting the fairness of people's mediation work. However, checking the mediation strategies of cases one by one is time-consuming and laborious, whereas automatically displaying the mediation strategies of related cases saves users a great deal of time. The present invention generates mediation strategies automatically, using the typical cases with class labels from step 4 as analysis data. For a given class, the mediation strategy is generated by the following steps:

(5.1) Obtain the typical case set with class labels from step 4 and extract the mediation strategy field.

(5.2) A mediation strategy carries numbered item markers (one, two, three and so on); split the mediation strategy at these markers to form individual mediation regulations.

(5.3) Perform TF_CDF cluster analysis on the mediation regulations, and extract the keywords of the mediation regulations with the keyword extraction method of step 3.

(5.4) Score the classes of mediation regulations. The scoring criteria include the number of mediation regulations in the class, the proportion of the class made up of mediation regulations sharing the same keyword, and so on.

(5.5) Score the mediation regulations. The scoring criteria include the frequency with which keywords appear in the regulation, the quality of the text, and so on;

(5.6) Sort the classes of mediation regulations in descending order of score, select the higher-scoring classes, extract the high-scoring mediation regulations from these classes as mediation strategy prompt information, and store them in the database.
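
Steps (5.2) and (5.4) to (5.6) can be sketched as below. Several points are assumptions made only for illustration: the marker pattern (items written as "1." or "(2)"), a class score equal to the class's share of all regulations, and a regulation score equal to the number of keyword occurrences. The cluster assignments and keywords are taken as given, since they come from the clustering of step (5.3).

```python
import re
from collections import Counter

# Assumption: regulations are introduced by markers such as "1." or "(2)".
MARKER = re.compile(r"\s*(?:\(\d+\)|\d+[.)])\s*")


def split_regulations(strategy):
    """Step (5.2): break a mediation strategy into regulations at its markers."""
    return [part.strip() for part in MARKER.split(strategy) if part.strip()]


def pick_prompts(regulations, labels, keywords, top_classes=1, top_items=1):
    """Steps (5.4)-(5.6): rank classes, then regulations inside the best classes.

    labels[i] is the cluster of regulations[i]; keywords[label] lists that
    cluster's keywords. Both are assumed to come from the clustering step.
    """
    sizes = Counter(labels)
    # Class score (assumed form): share of all regulations falling in the class.
    class_rank = sorted(sizes, key=lambda c: sizes[c] / len(labels), reverse=True)
    prompts = []
    for label in class_rank[:top_classes]:
        members = [r for r, lab in zip(regulations, labels) if lab == label]
        # Regulation score (assumed form): occurrences of the class keywords.
        members.sort(key=lambda r: sum(r.count(k) for k in keywords[label]), reverse=True)
        prompts.extend(members[:top_items])
    return prompts
```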

Step 6：Create index and calculate the degree of correlation

The core of a full-text search engine consists of index creation and relevance computation. The typical case data and cluster classes obtained in step 4 and the mediation strategy prompts from step 5 are synchronized to Elasticsearch, where the index is created.

(6.1) Create the index

Elasticsearch (ES) is built on Lucene and is currently one of the most widely used open-source search engines. The present invention uses ES to create full-text indexes on the "dispute details", "mediation result explanation" and "mediation strategy recommendation" fields. "Dispute details" and "mediation result explanation" are fields contained in the raw data, while "mediation strategy recommendation" is the field computed in step 5.

(6.2) Relevance computation

Relevance computation calculates the degree of correlation between the search input Query and the indexed text; results are output in descending order of correlation by default. Relevance computation determines what the interface displays and directly affects the user experience; accurate and effective relevance computation makes searching easier for users. The clustering in step 3 yields the classes, class labels, cluster centers and TF_CDF values; this patent represents text vectors with TF_CDF weights and computes text similarity. In addition, the input search content Query is segmented into pn words; if a word appears in a class label, the relevance of the corresponding text is raised accordingly.
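
The relevance computation could be sketched as cosine similarity over TF_CDF-weighted vectors plus a bonus for query words that appear in the class labels. The patent does not say how much the relevance is raised, so the additive `boost` per matching word is an assumption; for brevity the sketch also looks each word's weight up in a single `tf_cdf` dictionary rather than recomputing it per document.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {word: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def relevance(query_words, doc_words, tf_cdf, class_labels, boost=0.1):
    """Similarity on TF_CDF weights, raised for each query word found in the class labels.

    The additive `boost` per matching word is an assumed form.
    """
    q_vec = {w: tf_cdf.get(w, 0.0) for w in query_words}
    d_vec = {w: tf_cdf.get(w, 0.0) for w in doc_words}
    hits = sum(1 for w in set(query_words) if w in class_labels)
    return cosine(q_vec, d_vec) + boost * hits
```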

Step 7：Search result and showing interface

The server performs a series of data mining computations whose final purpose is to present results to the user in the client. When the user enters query content, the system returns similar typical cases, case class and class label information, and mediation strategy recommendations. To help the user understand the related cases fully from every angle, a similar-case analysis report can be generated automatically. The details are as follows:

(7.1) Similar cases. Search results are output in descending order of correlation by default. The user can manually filter and sort the retrieval results, for example by showing only cases of a specified type and time period or by sorting by time.

(7.2) Case class and class labels. Each case shows its case class and class labels, which can also be used as filter conditions for retrieval.

(7.3) Mediation strategy recommendation. The mediation strategy is generated automatically from the related cases returned by the search.

(7.4) Retrieval result analysis. Case analysis is not shown in the main interface by default. When needed, the user clicks the analysis button to obtain a retrieval result analysis report, which is divided into time series analysis, regional analysis, mediator analysis, mediation result analysis and so on. A secondary search can be carried out within the result set according to the analysis results.

The embodiment is validated on people's mediation data from Shanghai. The process is as follows:

Step 1：Data Collection, pretreatment

Collect people's mediation case information and store it in the database; the fields are shown in Table 1.

Table 1: Dispute mediation field information

The collected data are preprocessed: records whose "MEDIATE_CIRCS" field is empty or overly brief are removed; records with duplicate dispute details are deleted; and private information in the cases, such as names and home addresses, is replaced with "so-and-so". Full-text retrieval is applied to the MEDIATE_CIRCS, MEDIATE_EXPLAIN and RESULT_RECOMMEND fields, and exact-value retrieval to the remaining fields.
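
The split between full-text and exact-value fields maps naturally onto Elasticsearch's `text` and `keyword` field types. The sketch below only builds the index body as a Python dictionary, in the typeless mapping format of recent Elasticsearch versions; the names in `EXACT_FIELDS` are hypothetical placeholders, since the actual structured fields are those listed in Table 1.

```python
FULL_TEXT_FIELDS = ["MEDIATE_CIRCS", "MEDIATE_EXPLAIN", "RESULT_RECOMMEND"]
# Hypothetical names for the structured fields; the real names come from Table 1.
EXACT_FIELDS = ["MEDIATOR", "AREA", "START_TIME"]


def build_index_body(full_text_fields, exact_fields):
    """Build an index body: `text` fields are analyzed for full-text search,
    `keyword` fields are stored as-is for exact-value matching."""
    properties = {name: {"type": "text"} for name in full_text_fields}
    properties.update({name: {"type": "keyword"} for name in exact_fields})
    return {"mappings": {"properties": properties}}


index_body = build_index_body(FULL_TEXT_FIELDS, EXACT_FIELDS)
```

For Chinese text, an analyzer that applies the domain dictionary would normally be configured on the `text` fields as well; that configuration depends on the installed analysis plugin and is left out here.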

Step 2：Participle and vector representation

Text data is unstructured and cannot be analyzed directly; the text must first be segmented into words. A dispute mediation domain dictionary, mediate.txt, and a stop-word dictionary, stopword.txt, are created. Using these two dictionaries, the "dispute details" and "mediation result explanation" texts are segmented, and the text data is represented as vectors for convenient computer processing.

(2.1) The "dispute details" of a dispute case are as follows:

Party A and Party B are upstairs and downstairs neighbors. On November 29, 2009, the floor drain in Party B's bathroom became blocked and leaked water into Party A's home, damaging Party A's ceiling, walls and wall-cabinet doors. Party A demanded that Party B compensate for the loss, and the two sides fell into a dispute over the compensation.

(2.2) The dictionary contains the word, its frequency and its part of speech (which may be omitted), one word per line, separated by spaces. Part of the dispute mediation dictionary mediate.txt is shown in Figure 2.

(2.3) Part of the dispute mediation stop-word dictionary stopword.txt is shown in Figure 3, one word per line.

(2.4) Segment and process the "dispute details"

upstairs and downstairs / neighbor relationship / bathroom / floor drain / leak / cause / ceiling / wall / wall-cabinet door / damaged / demand / compensate for loss / compensation issue / arise / disagreement / lead to dispute
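
The dictionary format and the removal of stop words can be sketched as follows. The patent does not name the segmentation algorithm; forward maximum matching is used here purely as a simple stand-in, and the dictionary entries and frequencies in the usage note are invented for the example.

```python
def load_dictionary(lines):
    """Parse 'word frequency [part-of-speech]' lines into {word: frequency}."""
    vocab = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            vocab[parts[0]] = int(parts[1])
    return vocab


def segment(text, vocab, stopwords, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each position.

    Single characters and stop words are dropped from the result.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                break
        i += size
        if piece not in stopwords and len(piece) > 1:
            words.append(piece)
    return words
```

For example, with the invented entries `卫生间 120 n`, `地漏 50 n` and `漏水 80 v` and the stop word `的`, the text `卫生间的地漏漏水` is segmented into `卫生间`, `地漏` and `漏水`.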

Step 3：TF_CDF is calculated and TF_CDF feature clusterings

People's disputes can be divided into 4 major categories: compensation disputes, neighborhood disputes, labor disputes and contract disputes. Each major category can be further divided into several subgroups. Below, the contract dispute category is divided into subgroups; the feature clustering steps are as follows:

(3.1) Determine the initial values

The contract dispute category is clustered into 6 subgroups. There are 2122 dispute records in total. The text data are segmented, words with a frequency below 3 are removed, and N-dimensional word vectors are formed.

1. Determine the initial TF_CDF values

The initial TF_CDF value of a word is set to its word frequency, which is simple and fast to compute. A record i is represented as d_i={w₁,w₂,...,w_N}, where N is the data dimension and w is the word frequency.

2. Determine the initial cluster centers

Compute k mutually distant initial cluster centers C={C₁,C₂,...,C_k}, c_j={c₁,c₂,...,c_N}.
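
The initialization can be illustrated as below. The patent does not state how the k distant centers are computed; the farthest-first selection shown here is one common choice and is an assumption, as is starting from the first vector.

```python
import math
from collections import Counter


def tf_vector(words, vocabulary):
    """Initial TF_CDF representation: the raw frequency of each vocabulary word."""
    counts = Counter(words)
    return [counts[t] for t in vocabulary]


def cosine_distance(u, v):
    """1 minus the cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def distant_centers(vectors, k):
    """Farthest-first selection (an assumed method): start from the first vector,
    then repeatedly add the vector farthest from all centers chosen so far."""
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(max(vectors, key=lambda v: min(cosine_distance(v, c) for c in centers)))
    return centers
```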

(3.2) Assign each case to the nearest class according to cosine similarity

After vectorization, the text data are multi-dimensional vectors. Cosine similarity is used as the clustering metric: the distance between each case and each class center is computed according to formula (1), which gives the class of each case.
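
A minimal sketch of this assignment step, assuming dense vectors stored as Python lists, is:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign(vectors, centers):
    """Give each case the index of the center with the highest cosine similarity,
    i.e. the smallest cosine distance."""
    return [
        max(range(len(centers)), key=lambda j: cosine_similarity(v, centers[j]))
        for v in vectors
    ]
```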

(3.3) Update TF_CDF

After classification, if a word occurs many times in one class and rarely in the other classes, the word is important to that class and its weight should be increased accordingly. If a word occurs with essentially the same frequency in every class, its discriminating power is low and its weight should be reduced. The entropy of a word's distribution over the classes is computed according to formula (2). The more evenly a word is distributed over the classes, the larger its entropy and the lower its discriminating power.

For short text data that can be divided into several major categories, once stop words have been removed, a large share of the frequently occurring words carry important meaning. The present invention computes the TF_CDF value of a word according to formula (3).

Updating the TF_CDF values in every iteration is computationally expensive. The TF_CDF update can instead be performed once every several iterations, or a threshold on the change in the cluster centers can be set and the update performed when this threshold is exceeded.
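
Formulas (2) and (3) can be sketched as follows. Formula (2) both defines the per-class ratio and requires the ratios to sum to 1; the sketch reads this as a normalization of the ratios, which is an interpretation rather than something the text states. Formula (3) is implemented exactly as printed, and the default `eps` is an arbitrary small value.

```python
import math


def class_entropy(doc_counts, class_sizes):
    """Formula (2): entropy of a word's distribution over the k classes.

    doc_counts[j] is the number of class-j documents containing the word and
    class_sizes[j] the number of documents in class j. The ratios are
    normalized to sum to 1 (an interpretation of the constraint in formula (2)).
    """
    ratios = [c / n for c, n in zip(doc_counts, class_sizes) if n]
    total = sum(ratios)
    probs = [r / total for r in ratios if r > 0]
    return -sum(p * math.log2(p) for p in probs)


def tf_cdf(tf, df, entropy, eps=1e-3):
    """Formula (3) as printed: TF*ln(DF) / ((H + eps) * sqrt((TF + ln(DF))**2))."""
    log_df = math.log(df)
    denom = (entropy + eps) * math.sqrt((tf + log_df) ** 2)
    return tf * log_df / denom if denom else 0.0
```

A word spread evenly over two classes has entropy 1 and a word confined to one class has entropy 0, so for the same TF and DF the confined word receives the larger weight.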

(3.4) Update the cluster centers

After TF_CDF is updated, the mean of the text vectors in each class is taken as the new cluster center.

(3.5) Repeat steps (3.2) to (3.4) until the cluster centers no longer change; the word entropies then no longer change either, i.e. the TF_CDF model no longer changes. This yields k clusters and the TF_CDF model.
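
The iteration of steps (3.2) and (3.4) can be sketched as below. To keep the example short, the TF_CDF re-weighting of step (3.3) is omitted, so the vectors stay fixed between iterations; the `max_iter` guard and the handling of empty classes are assumptions added for safety.

```python
import math


def _cos(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster(vectors, centers, max_iter=100):
    """Alternate assignment and center update until the centers stop changing."""
    for _ in range(max_iter):
        # Step (3.2): assign each vector to the most similar center.
        labels = [max(range(len(centers)), key=lambda j: _cos(v, centers[j])) for v in vectors]
        # Step (3.4): the new center is the mean of the vectors in the class.
        new_centers = []
        for j, old in enumerate(centers):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            # Keep the old center if a class ends up empty.
            new_centers.append([sum(col) / len(members) for col in zip(*members)] if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```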

(3.6) Class label extraction

After clustering is complete, the 5 words with the highest TF_CDF values in each class are extracted as the class labels. The classes and class labels obtained by clustering the contract disputes are shown in Table 2.
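
Label extraction amounts to a top-N selection per class. In the sketch below, the input layout (a dictionary from class to `{word: TF_CDF value}`) is a hypothetical representation chosen for the example.

```python
def class_labels(tf_cdf_by_class, top_n=5):
    """Return the top_n words with the highest TF_CDF value in each class."""
    return {
        label: sorted(weights, key=weights.get, reverse=True)[:top_n]
        for label, weights in tf_cdf_by_class.items()
    }
```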

Table 2: Contract dispute clustering results

Step 4：Desensitization typical case collection is automatically generated, process is as follows：

(4.1) Automatic desensitization

To protect the private information in cases, natural language processing is used to identify names, identity-document and address information and desensitize the text automatically; the identified information is replaced in the original text with "so-and-so", as in the following example:

"Shi so-and-so and Zhang so-and-so are in an employment relationship. In early September 2009, Shi so-and-so was introduced by a so-and-so nanny agency to work as a nanny in Zhang so-and-so's home, with an agreed monthly wage of 1300 yuan plus board and lodging in Zhang so-and-so's home ...";

(4.2) Automatic generation of the typical case set

2. Analyze the text quality of each case module with natural language processing methods, for example the word count of the module, the meaning of the text, and whether it consists of repeated words or sentences (other indicators that reflect text quality also fall within the scope of this patent). The higher the text quality, the higher the case score and the higher Q_t.

3. Case mediators include chief mediators, ordinary mediators, part-time staff and so on. Analyze each mediator's mediation experience, identify the fields in which the mediator is skilled, and assign each mediator a scoring weight to obtain the score Q_h.

4. Perform sentiment analysis on the mediation result feedback and assign scoring weights to positive and negative feedback, with positive feedback weighted higher, to obtain Q_e;

5. Taking the above aspects together, create the case scoring model. Score the case information with the case scoring model and automatically generate the typical case set with class labels.

Step 5：Generate mediation strategy prompting

(5.3) Perform TF_CDF cluster analysis on the mediation regulations, and extract the keywords of the mediation regulations according to step 3.

(5.5) Score the mediation regulations. The scoring criteria include the frequency with which keywords appear in the regulation, the quality of the text, and so on.

Step 6：Create index and calculate the degree of correlation

(6.1) Create the index

(6.2) Relevance computation

Relevance computation calculates the degree of correlation between the search input Query and the indexed text; results are output in descending order of correlation. Relevance computation determines what the interface displays and directly affects the user experience; accurate and effective relevance computation makes searching easier for users. The clustering in step 3 yields the classes, class labels and TF_CDF values, from which this patent computes the relevance.

Step 7：Search result set showing interface

Claims

1. A people's dispute mediation case retrieval and mediation strategy recommendation method, characterized in that it comprises the following steps:

Step 1：Data Collection, pretreatment

Collect people's mediation case information and store it in a database. The required fields include: dispute details, mediation result, mediation details, mediation time, end time, mediator, region, mediation committee and evaluation, wherein dispute details, mediation details and evaluation are text data and the other fields are structured data;

Preprocess the collected data: ensure that the mediation result and mediation details fields are not empty, and delete duplicate data;

Step 2：Participle and vector representation

Create the dispute mediation domain dictionary mediate.txt: words that are easily mis-segmented, especially dispute mediation domain terms and words found from the mediation case data to be segmented incorrectly, are added to mediate.txt. In addition, Chinese contains some words without substantive meaning; these meaningless, weakly discriminating words are added to the stop-word dictionary stopword.txt, and stop words are removed directly during segmentation and not analyzed;

Segment the text fields according to the dictionary mediate.txt and the stop-word dictionary stopword.txt, and represent the text data as vectors;

Step 3：TF_CDF feature clusterings

Because dispute mediation cases carry no detailed classification information, TF_CDF is used to compute the weights of the words in the text, and TF_CDF feature clustering is performed to obtain the detailed classes of the cases and the class keywords; the TF_CDF values of the words are obtained from the clustering result at the same time;

Step 5：Generate mediation strategy prompting

Using the typical cases with class labels as analysis data, the mediation strategy for a given class is generated by the following procedure:

(5.4) Score the classes of mediation regulations; the scoring criteria include the number of mediation regulations in the class and the proportion of the class made up of mediation regulations sharing the same keyword;

(5.5) Score the mediation regulations; the scoring criteria include the number of class keywords that appear in the regulation, the number of times they appear, and the quality of the text;

(5.6) Sort the classes of mediation regulations in descending order of score, select the higher-scoring classes, extract the high-scoring mediation regulations from these classes as mediation strategy prompt information, and store them in the database;

Step 6：Create index and calculate the degree of correlation

The core of a full-text search engine consists of index creation and relevance computation; the typical case data and cluster classes obtained in step 4 and the mediation strategy prompts from step 5 are synchronized to Elasticsearch, where the index is created;

Step 7：Search result and showing interface

The user enters query content and obtains similar typical cases, case class and class label information, and mediation strategy recommendations, and a similar-case analysis report is generated automatically.

2. The people's dispute mediation case retrieval and mediation strategy recommendation method as claimed in claim 1, characterized in that in step 7 the search procedure is as follows:

(7.1) Similar cases: search results are output in descending order of correlation by default, and the user can manually filter and sort the retrieval results, showing cases of a specified type and time period or sorting by time;

(7.2) Case class and class labels: these can be used as filter conditions for retrieval, and each case shows its case class and class keywords;

(7.3) Mediation strategy recommendation: the mediation strategy is generated automatically from the related cases returned by the search;

(7.4) Retrieval result analysis: when needed, the user clicks the analysis button to obtain a retrieval result analysis report, which is divided into time series analysis, regional analysis, mediation committee analysis, mediator analysis, mediation result analysis and mediation duration analysis; a secondary search is carried out within the result set according to the analysis results.

3. The people's dispute mediation case retrieval and mediation strategy recommendation method as claimed in claim 1 or 2, characterized in that in step 3 the feature clustering of the "case details" field of the dispute mediation cases proceeds as follows:

(3.1) Determine the initial values

The "case details" of people's dispute mediation can be clustered into k classes. There are n dispute cases in total, forming the corpus D={d₁,d₂,...,d_n}; here the corpus is the set of "case details" field values of all cases, and d is a single "case details" entry of the corpus. The texts in the corpus are segmented, and the distinct words obtained are {t₁,t₂,...,t_N};

$$s_i^j = \min_j\left(\frac{\vec{d}_i \cdot \vec{c^j}}{|\vec{d}_i| \times |\vec{c^j}|}\right),\quad j = (1, 2, \ldots, k) \tag{1}$$

where s_i^j is the minimum cosine distance from case d_i to the cluster centers, i.e. case d_i belongs to class j, and c^j is the j-th cluster center;

(3.3) TF_CDF models are updated

Compute the within-cluster variance E of the clustering. If E is less than half the initial within-cluster variance, E₀/2, update TF_CDF; if E is greater than E₀/2, skip step (3.3). The entropy of a word's distribution over the classes is computed according to formula (2):

$$\left\{\begin{array}{l} H(w_p) = \sum_{j=1}^{k} -pw_p^j \times \log_2\left(pw_p^j\right) \\ pw_p^j = \dfrac{cw_p^j}{cw^j},\quad \sum_{j=1}^{k} pw_p^j = 1 \end{array}\right. \tag{2}$$

where pw_p^j is the proportion of the documents in class j that contain word w_p, cw_p^j is the number of documents in class j that contain word w_p, cw^j is the total number of documents in class j, and H(w_p) is the entropy of word w_p over the k classes;

The TF_CDF value of a word w_p is computed as shown in formula (3):

$$TF\_CDF_p = \frac{TF_p \times \ln(DF_p)}{\left(H(w_p) + \varepsilon\right)\sqrt{\left(TF_p + \ln(DF_p)\right)^2}} \tag{3}$$

where TF_p is the frequency of the p-th word in document i, DF_p is the number of documents in the corpus that contain this word, q is the number of words contained in document i, H(w_p) in the denominator is the entropy of the word, ln() is the natural logarithm, and ε is a small value;

(3.5) Repeat steps (3.2) to (3.4) until the cluster centers no longer change, at which point the TF_CDF values no longer change either, obtaining k clusters and the TF_CDF model;

(3.6) Class label extraction: after clustering is complete, the several words with the highest TF_CDF values in each class are extracted as the class labels.

4. The people's dispute mediation case retrieval and mediation strategy recommendation method as claimed in claim 3, characterized in that in step (3.1) the initial values are determined as follows:

1. Determine the initial TF_CDF values

The initial TF_CDF value of a word is set to its word frequency. After segmentation, a case i is represented as d_i={w₁,w₂,...,w_j}, j=1,2,...,N, where N is the number of distinct words in the corpus after segmentation and w_j is the number of times word t_j occurs in case i;

2. Determine the initial cluster centers

Compute k mutually distant initial cluster centers: C={C₁,C₂,...,C_k}, c^j={c₁,c₂,...,c_N}, where C is the set of k cluster centers and c is the vector form of a single cluster center; for convenience of expression, class labels are all written as superscripts;

3. Compute the within-cluster variance of the clustering

5. The people's dispute mediation case retrieval and mediation strategy recommendation method as claimed in claim 3, characterized in that in step (3.3) the TF_CDF values are updated once every several iterations, or a threshold on the change in the cluster centers is set and the update is performed when this threshold is exceeded.

6. The people's dispute mediation case retrieval and mediation strategy recommendation method as claimed in claim 1 or 2, characterized in that in step (4) the desensitized typical case set is generated automatically as follows:

(4.1) Automatic desensitization

Natural language processing is used to automatically identify names, identity-document and address information, and the identified information is replaced in the original text with "so-and-so";

(4.2) Automatic generation of the typical case set

The cases are scored and a high-quality typical case set is generated automatically by machine: step 3 divides the cases into different classes, the cases in each class are scored separately, and the higher-scoring cases in each class are taken as the typical cases of that class.

7. The people's dispute mediation case retrieval and mediation strategy recommendation method as claimed in claim 6, characterized in that in step (4.2) the case scoring model is created as follows:

Q_t=aQ_l+bQ_p+cQ_s (4)

where a, b and c are the weights of the individual quality scores, with a+b+c=1; Q_l is the word-count quality: if the length of the unsegmented text exceeds a threshold T_l, Q_l is 1, otherwise Q_l decays exponentially; Q_p is the word-repetition quality, based on the repetition ratio, i.e. the share of the text's total word count taken by its most frequent word: Q_p is 1 while this ratio is below a threshold T_p and decays exponentially once the ratio exceeds it; Q_s is the ratio of the text length after segmentation to the text length before segmentation;

3. Assign each mediator a scoring weight; for a mediator who frequently handles disputes and whose cases have a high mediation success rate, the mediator scoring weight Q_h∈[0,1] is raised;

4. Perform sentiment analysis on the mediation feedback to obtain the mediation feedback weight Q_e; assign scoring weights to positive and negative feedback, with positive feedback weighted higher;

5. Taking the above aspects together, create the case scoring model Q=(αQ_h+βQ_e)Q_t, where α is the weight of Q_h in the case score, β is the weight of Q_e, and α+β=1; score the case information with the case scoring model and automatically generate the typical case set with class labels.

8. The people's dispute mediation case retrieval and mediation strategy recommendation method as claimed in claim 1 or 2, characterized in that in step 6 the process of creating the index and computing the relevance is as follows:

(6.1) Create the index

ES is used to create full-text indexes on the "dispute details", "mediation result explanation" and "mediation strategy recommendation" fields, wherein "dispute details" and "mediation result explanation" are fields contained in the raw data and "mediation strategy recommendation" is the field computed in step 5;

(6.2) Relevance computation

The clustering in step 3 yields the classes, class labels, cluster centers and TF_CDF values. This patent represents text vectors with TF_CDF weights: the input search content Query is segmented into pn words, the text weights are represented with TF_CDF, and the text similarity is computed; if a word appears in a class label, the relevance of the corresponding text is raised accordingly.