CN107220295A - People's dispute mediation case retrieval and mediation strategy recommendation method - Google Patents

People's dispute mediation case retrieval and mediation strategy recommendation method

Info

Publication number
CN107220295A
Authority
CN
China
Prior art keywords
case
word
classification
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710285854.2A
Other languages
Chinese (zh)
Other versions
CN107220295B (en)
Inventor
王开红
李建元
陈涛
蒋伶华
范鸿俊
温晓岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yinjiang Technology Co.,Ltd.
Original Assignee
Enjoyor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Enjoyor Co Ltd
Priority to CN201710285854.2A
Publication of CN107220295A
Application granted
Publication of CN107220295B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

A people's dispute mediation case retrieval and mediation strategy recommendation method comprises the following steps. Step 1: data collection and preprocessing. Step 2: word segmentation and vector representation. Step 3: TF_CDF feature clustering. Step 4: automatic generation of a desensitized typical case set, obtaining cluster categories and category keywords. Step 5: generation of mediation strategy prompts. Step 6: index creation and relevance calculation: the core of a full-text search engine comprises index creation and relevance calculation; the typical case data from step 4, the resulting cluster categories, and the mediation strategy prompts from step 5 are synchronized to elasticsearch to create an index. Step 7: search results and interface display: the user enters query content and obtains similar typical cases, case categories and category label information, and mediation strategy recommendations, and a similar-case analysis report is generated automatically. The invention provides a people's dispute mediation case retrieval and mediation strategy recommendation method with higher accuracy and faster speed.

Description

A people's dispute mediation case retrieval and mediation strategy recommendation method
Technical field
The present invention relates to the field of search technology and strategy recommendation, and in particular to a method for retrieving people's dispute mediation cases and recommending mediation strategies in the judicial domain.
Background
Internet technology is developing rapidly and information resources are growing explosively, profoundly changing how people work and live. Today the Internet has become the main place for obtaining resources and exchanging information; users can find large amounts of information through Internet search, which brings great convenience to work and life. However, as Internet resources grow, the results returned by a search engine are not necessarily the information the user wants, and the user often has to sift through the search results, which reduces search efficiency. Moreover, existing Web-based search engines cannot provide search services for specific fields. How to improve the efficiency with which users obtain the target results, improve user experience, and build search engines for specific fields is therefore a challenging problem in current search technology.
A great deal of work has been done on improving search quality, on search engines for professional domains, and on personalized needs. Patent CN201010559233 addresses the fact that different search engines return different results: it produces a base ranking of the results from multiple search engines according to the weight of each engine and the weight of each result's rank position within its engine, and then adjusts the base ranking according to co-occurrence information and similar factors, so that the ranking basis is more reasonable and the quality of the search results improves. Patent CN201210548858.2 provides a search category list and a search engine list in the browser window, so that for the same search input the user can produce results with different search categories and search engines. It takes into account the influence of different browsers and categories on the search results, but the user must keep switching combinations of browser and search category to find information that meets the need. Patent CN201310226576.5 provides a first search result according to the user's search term and receives the user's behavior data on that first result. Recommended search terms are generated from visual features, user behavior data and the search term, so the user's search intent can be mined accurately; the results are more targeted, meet the user's personalized search needs, and improve the search experience. Patent CN102567326B discloses an information search and ranking device and method, comprising a determining unit, a predicting unit and a ranking unit, which improves the accuracy of search results by combining search log information and organizational structure information.
A people's mediation search service is a professional search engine for the judicial domain. Web-based search engine models and services cannot meet users' needs, so a search service customized for the professional domain is required. Mediation cases are the data source of the case retrieval service; the data volume is large, there is no clear standard for filling in records, and records are filled in inconsistently, all of which can affect the quality of the search service. For example: most cases are short texts, for which traditional algorithms such as TF_IDF give low search accuracy; finding representative cases for study in a large case base is time-consuming and laborious; and cases contain a large amount of private information, so manual desensitization involves a heavy workload.
Summary of the invention
In order to overcome the low accuracy and long time consumption of existing people's mediation search methods, the present invention provides a people's dispute mediation case retrieval and mediation strategy recommendation method with higher accuracy and shorter time consumption.
The technical solution adopted by the present invention to solve the technical problem is as follows:
A people's dispute mediation case retrieval and mediation strategy recommendation method, comprising the following steps:
Step 1: Data collection and preprocessing
People's mediation case information is collected and stored in a database. The required fields include: dispute details, mediation result, mediation details, mediation time, end time, mediator, region, mediation staff member and evaluation. Dispute details, mediation details and evaluation are text data; the other fields are structured data.
The collected data are preprocessed to ensure that the mediation result and mediation details fields are not empty, and duplicate records are deleted;
Step 2: Word segmentation and vector representation
A dispute mediation domain dictionary, mediate.txt, is created. Words that are easily segmented incorrectly, especially dispute mediation domain vocabulary that cannot be cut correctly in the mediation case data, are added to mediate.txt. Chinese also contains words that carry no meaning; these meaningless, poorly discriminating words are added to the stop-word dictionary stopword.txt, and stop words are removed during segmentation and excluded from analysis;
The text fields are segmented using the dictionary mediate.txt and the stop-word dictionary stopword.txt, and the text data are represented as vectors;
Step 3: TF_CDF feature clustering
Because dispute mediation cases have no detailed category information, word weights are calculated using TF_CDF, and TF_CDF feature clustering is performed to obtain detailed case categories and category keywords; the word TF_CDF values are obtained from the clustering result at the same time;
Step 4: Automatic desensitization and automatic case scoring to generate a desensitized typical case set;
Step 5: Generating mediation strategy prompts
Using the typical cases with category labels as the analysis data, the mediation strategy for a given category is generated as follows:
(5.1) obtain the typical case set with category labels and extract the mediation strategy field;
(5.2) the mediation strategy text carries second- and third-level clause markers; split the mediation strategy at these markers to form mediation clauses;
(5.3) perform TF_CDF cluster analysis on the mediation clauses and extract the keywords of the clauses;
(5.4) score each clause category. The scoring basis includes the number of mediation clauses in the category and the proportion of the category made up of clauses that share the same keywords;
(5.5) score each mediation clause. The scoring basis includes the number of category keywords that appear in the clause, how often they appear, and the quality of the text;
(5.6) sort the clause categories by score in descending order, take the higher-scoring categories, extract the high-scoring mediation clauses from these categories as mediation strategy prompt information, and store them in the database;
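Steps (5.2) and (5.5) can be sketched as below. This is only an illustration: the marker styles matched by `MARKER`, the sample strategy text, the keywords, and the scoring rule (distinct keywords plus total occurrences, with the text-quality term left out) are all assumptions, since the text does not specify them.

```python
import re

# Assumed marker styles: "1." or "1、" for one level and "(1)" or "（1）" for the next.
MARKER = re.compile(r"(?:\d+[.、]|[(（]\d+[)）])\s*")

def split_clauses(strategy_text):
    """Step (5.2): cut a mediation strategy into clauses at the numbering markers."""
    return [part.strip() for part in MARKER.split(strategy_text) if part.strip()]

def clause_score(clause, keywords):
    """Step (5.5), simplified: distinct category keywords present plus total occurrences."""
    hits = [clause.count(word) for word in keywords]
    return sum(1 for h in hits if h) + sum(hits)

strategy = ("1. Listen to both parties separately. 2. Verify the facts on site. "
            "(1) Photograph the damage. (2) Ask the neighbours.")
keywords = ["facts", "damage"]  # hypothetical category keywords

clauses = split_clauses(strategy)
ranked = sorted(clauses, key=lambda c: clause_score(c, keywords), reverse=True)
```

A marker pattern this loose would also split on decimals such as "3.5", so real strategy text would need a stricter pattern matched to the numbering actually used in the case records.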
Step 6: Creating the index and calculating relevance
The core of a full-text search engine comprises index creation and relevance calculation. The typical case data from step 4, the resulting cluster categories, and the mediation strategy prompts from step 5 are synchronized to elasticsearch to create the index;
Step 7: Search results and interface display
The user enters query content and obtains similar typical cases, case categories and category label information, and mediation strategy recommendations; a similar-case analysis report is generated automatically.
Further, in step 7, the search process is as follows:
(7.1) Similar cases. Search results are output in descending order of relevance by default. The user can manually filter and sort the results, for example showing only cases of a specified type and time period, or sorting by time in ascending or descending order;
(7.2) Case categories and category labels. These can serve as filter conditions for retrieval, and each case displays its case category and category keywords;
(7.3) Mediation strategy recommendation: mediation strategies are recommended automatically according to the related cases returned by the search;
(7.4) Retrieval result analysis: when needed, the user clicks the analysis button to obtain a retrieval result analysis report. The report is divided into time series analysis, regional analysis, mediation staff member analysis, mediator analysis, mediation result analysis and mediation duration analysis, and a secondary search can be performed within the result set according to the analysis results.
Further, in step 3, feature clustering of the "case details" field of the dispute mediation cases proceeds as follows:
(3.1) Determining initial values
The "case details" of people's dispute mediation cases can be clustered into k classes. There are n dispute cases in total, forming the corpus D = {d_1, d_2, ..., d_n}. Here the corpus is the set of "case details" field values of all cases, and d is a single "case details" entry in the corpus. The texts in the corpus are segmented, and the distinct words obtained are {t_1, t_2, ..., t_N};
(3.2) Assigning each "case details" entry to the nearest class according to cosine similarity
Cosine similarity is used as the clustering metric. Formula (1) gives the minimum cosine distance from case d_i to the cluster centers; case d_i belongs to the class j whose center c_j attains that minimum;
(3.3) Updating the TF_CDF model
The within-cluster dispersion E of the clustering is calculated. If E is less than half the initial within-cluster dispersion, E_0/2, TF_CDF is updated; if E is greater than E_0/2, step (3.3) is skipped. The class entropy of a word is calculated according to formula (2), which is built from the proportion of class-j documents in which word w_p occurs, that is, the number of documents in class j containing w_p divided by cw_j, the total number of documents in class j. H(w_p) is the class entropy of word w_p over the k classes: the larger the class entropy, the more evenly the word is distributed and the lower its weight. Traditional entropy-based feature extraction simply selects low-entropy words as text keywords. It tends to pick words that are peculiar to one class but not representative, and to miss words that are shared by several classes yet quite representative. For example, in "A and B are neighbours ... dispute ... water leaking between the upper and lower floors", the word "neighbour" occurs frequently in both leak disputes and noise disputes and is representative of the broader class of neighbourhood disputes, whereas "upper and lower floors" occurs almost only in leak disputes. With entropy-based feature extraction, "upper and lower floors" receives a very large weight and "neighbour" may not be extracted at all.
The present invention calculates the TF_CDF value of a word w_p using formula (3), in which TF_p is the frequency of word w_p in text i; DF_p, the document frequency of the word, is the number of documents in the corpus that contain it; H(w_p) in the denominator is the entropy of the word; ln() is the natural logarithm, which reduces the share of the document frequency; and ε is a small value that prevents an error when H(w_p) is 0. Because word frequency, document frequency and class entropy are all taken into account, accuracy on short text is higher.
(3.4) Updating the cluster centers: the mean of the text vectors in each class is taken as the new cluster center;
(3.5) Steps (3.2) to (3.4) are repeated until the cluster centers no longer change, at which point the TF_CDF values no longer change either, yielding k classes and the TF_CDF model;
(3.6) Class label extraction: after clustering is complete, the several words with the highest TF_CDF in each class are extracted as the class labels.
In step (3.1), the initial values are determined as follows:
1. Determining the initial TF_CDF values
The initial TF_CDF value of a word is its term frequency. After segmentation, a case i is represented as d_i = {w_1, w_2, ..., w_j}, j = 1, 2, ..., N, where N is the number of distinct words in the corpus after segmentation and w_j is the number of times word t_j occurs in case i;
2. Determining the initial cluster centers
k initial cluster centers that are far apart are calculated: C = {C_1, C_2, ..., C_k}, c_j = {c_1, c_2, ..., c_N}, where C is the set of k cluster centers and c is the vector form of a single cluster center; for ease of notation, class labels are always written as subscripts;
3. Calculating the within-cluster dispersion
The sum of the distances between each case and the initial cluster centers is calculated to determine the initial within-cluster dispersion E_0;
In step (3.3), the TF_CDF values may be updated once every several iterations, or a threshold on the change in the cluster centers may be set and the update performed only when the change exceeds this threshold.
In step 4, the desensitized typical case set is generated automatically as follows:
(4.1) Automatic desensitization
Natural language processing techniques are used to automatically identify names, identity document and address information, and the identified information is replaced in the original text with "so-and-so";
(4.2) Automatic generation of the typical case set
Through case scoring, a high-quality typical case set is generated automatically by machine. Step 3 has divided the cases into different classes; the cases in each class are scored separately, and the higher-scoring cases in each class are taken as the typical cases of that class.
In step (4.2), the case scoring model is created as follows:
1. The case template is analyzed and divided into the following modules: case summary, mediation process, mediation result statement, and mediation reflections;
2. The text quality of each module of a case is analyzed using natural language processing methods:
Q_t = a·Q_l + b·Q_p + c·Q_s    (4)
where a, b and c, with a + b + c = 1, are the weights of the individual quality scores. Q_l is the length quality: if the length of the unsegmented text exceeds a threshold T_l, Q_l is 1; otherwise Q_l decays exponentially. Q_p is the word repetition ratio, i.e. the share of the total word count taken by the most frequent word in the text; Q_p is 1 when the ratio is below a threshold T_p and decays exponentially when it exceeds the threshold. Q_s is the ratio of the text length after segmentation to the text length before segmentation;
3. Each mediator is assigned a scoring weight; for mediators who handle cases frequently and have a high mediation success rate, the mediator scoring weight Q_h ∈ [0,1] is raised;
4. Sentiment analysis is performed on the mediation feedback to obtain the feedback weight Q_e; scoring weights are assigned to positive and negative feedback, with positive feedback weighted higher;
5. Combining the above, the case scoring model Q = (α·Q_h + β·Q_e)·Q_t is created, where α is the share of Q_h in the case score, β is the share of Q_e, and α + β = 1. Case information is scored according to this model, and the typical case set with class labels is generated automatically.
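The scoring model above can be sketched in a few lines. Formula (4) and Q = (α·Q_h + β·Q_e)·Q_t follow the text; the exact shape of the exponential decay, the default weights a, b, c and α, and all sample numbers are assumptions made for illustration.

```python
import math

def length_quality(n_chars, threshold):
    """Q_l: 1 at or above the threshold, exponential decay below it (decay shape assumed)."""
    if n_chars >= threshold:
        return 1.0
    return math.exp(-(threshold - n_chars) / threshold)

def repetition_quality(top_word_ratio, threshold):
    """Q_p: 1 at or below the threshold, exponential decay above it (decay shape assumed)."""
    if top_word_ratio <= threshold:
        return 1.0
    return math.exp(-(top_word_ratio - threshold) / threshold)

def text_quality(q_l, q_p, q_s, a=0.4, b=0.3, c=0.3):
    """Formula (4): Q_t = a*Q_l + b*Q_p + c*Q_s with a + b + c = 1."""
    if not math.isclose(a + b + c, 1.0):
        raise ValueError("a + b + c must equal 1")
    return a * q_l + b * q_p + c * q_s

def case_score(q_h, q_e, q_t, alpha=0.5):
    """Q = (alpha*Q_h + beta*Q_e) * Q_t with alpha + beta = 1."""
    beta = 1.0 - alpha
    return (alpha * q_h + beta * q_e) * q_t

# Made-up inputs: a 120-character module against a 200-character threshold, and so on.
q_t = text_quality(length_quality(120, 200), repetition_quality(0.05, 0.1), q_s=0.8)
score = case_score(q_h=0.9, q_e=0.7, q_t=q_t)
```

Because every factor lies in [0, 1], the resulting score also lies in [0, 1], which makes scores comparable across classes when the top cases of each class are picked.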
In step 6, the index is created and relevance is calculated as follows:
(6.1) Creating the index
ES (elasticsearch) is used to create a full-text index on "dispute details", "mediation result statement" and "mediation strategy recommendation". "Dispute details" and "mediation result statement" are fields contained in the original data; "mediation strategy recommendation" is the field computed in step 5;
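As a rough illustration of step (6.1), the sketch below builds an index definition and a bulk-load payload as plain Python data, without calling any client library. The index name, the field names and the choice of `text`, `keyword` and `date` field types are assumptions; a Chinese-language deployment would also need a suitable analyzer, which is not shown.

```python
import json

INDEX_NAME = "mediation_cases"  # hypothetical index name

# Index body in the typeless mapping style of Elasticsearch 7 and later.
index_body = {
    "mappings": {
        "properties": {
            "dispute_details": {"type": "text"},
            "mediation_result": {"type": "text"},
            "mediation_strategy": {"type": "text"},
            "category": {"type": "keyword"},
            "category_labels": {"type": "keyword"},
            "mediation_time": {"type": "date"},
        }
    }
}

def bulk_lines(cases):
    """Serialise cases as newline-delimited JSON: one action line, then one document line."""
    lines = []
    for case in cases:
        lines.append(json.dumps({"index": {"_index": INDEX_NAME, "_id": case["id"]}}))
        lines.append(json.dumps(case, ensure_ascii=False))
    return "\n".join(lines) + "\n"

payload = bulk_lines([
    {"id": 1, "dispute_details": "Water leaking from the flat upstairs", "category": "leak"},
])
```

The payload has the shape expected by the Elasticsearch bulk endpoint, and `index_body` would be sent when the index is created; how both are submitted depends on the client in use.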
(6.2) Relevance calculation
Clustering in step 3 yields the classes, class labels, cluster centers and TF_CDF values. This patent represents text vectors using TF_CDF weights. The search input Query is segmented into pn words and likewise represented as a TF_CDF vector, and the text similarity is calculated. If a word appears in the class labels, the text relevance is raised accordingly.
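One way to read step (6.2) is sketched below: the query and each case become sparse TF_CDF-weighted vectors, cosine similarity gives the base relevance, and a boost is applied for query words found among the class labels. The multiplicative form of the boost, its size, and the sample weights are assumptions; the text only says that relevance is raised.

```python
import math

def weighted_vector(tokens, weights):
    """Sparse vector: each occurrence of a word adds that word's TF_CDF weight."""
    vector = {}
    for token in tokens:
        vector[token] = vector.get(token, 0.0) + weights.get(token, 0.0)
    return vector

def cosine(u, v):
    dot = sum(value * v.get(token, 0.0) for token, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def relevance(query_tokens, case_tokens, weights, class_labels, boost=0.1):
    """Cosine similarity, raised by `boost` for every distinct query word that is a class label."""
    base = cosine(weighted_vector(query_tokens, weights), weighted_vector(case_tokens, weights))
    label_hits = sum(1 for token in set(query_tokens) if token in class_labels)
    return base * (1.0 + boost * label_hits)

weights = {"leak": 3.0, "upstairs": 2.0, "wages": 4.0}  # made-up TF_CDF weights
query = ["leak", "upstairs"]
score_leak_case = relevance(query, ["leak", "upstairs", "leak"], weights, {"leak"})
score_wage_case = relevance(query, ["wages"], weights, {"leak"})
```

With a boost applied, the score can exceed 1, so it is useful for ranking results rather than as a normalised similarity.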
The technical concept of the present invention is to build a judicial-domain case retrieval service that is suited to short-text search and has high accuracy. Using natural language processing and advanced machine learning techniques, it automatically scores the quality of mediation documents and builds a high-quality mediation case library, and on that basis mines the mediation strategies of similar cases, promoting "similar cases, similar mediation" and "fair mediation" in people's mediation. The invention obtains case categories and class labels through TF_CDF feature clustering of the text, creates a distributed, fault-tolerant index using elasticsearch, and calculates text relevance from the TF_CDF values and cluster labels to obtain similar cases, improving the accuracy of full-text search and category search.
The beneficial effects of the present invention are mainly as follows:
1) Dispute mediation case types are subdivided, which not only uncovers the category information hidden in the cases but also makes it easier for users to search by category, improving user experience.
2) In view of the characteristics of the case retrieval domain, case quality is scored and a high-quality typical case set is generated with automatic desensitization.
3) Mediation strategy prompts are produced for cases of the same category, helping users understand how similar cases were handled and promoting "similar cases, similar mediation" and "fair mediation" in people's mediation.
4) In view of the short-text nature of dispute mediation cases, a TF_CDF model is obtained that combines word frequency and class document frequency and performs better than the commonly used TF_IDF model.
5) TF_CDF is used for feature clustering, with cosine similarity as the clustering metric.
6) Search output is sorted by similarity by default; text similarity is calculated using TF_CDF values and class labels, improving the accuracy of retrieval results.
Brief description of the drawings
Fig. 1 is a flow chart of the dispute mediation case retrieval engine.
Fig. 2 is a schematic diagram of part of the dispute mediation dictionary mediate.txt.
Fig. 3 is a schematic diagram of part of the dispute mediation stop-word dictionary stopword.txt.
Detailed description
The invention is further described below with reference to the accompanying drawings.
Referring to Figs. 1 to 3, a people's dispute mediation case retrieval and mediation strategy recommendation method comprises the following steps:
Step 1: Data collection and preprocessing
People's mediation case information is collected and stored in a database. The required fields include dispute details, mediation result, mediation details, mediation time, end time, mediator, region, mediation staff member, evaluation and so on. Dispute details, mediation details and evaluation are text data (unstructured data); the other fields are structured data.
The collected data are preprocessed to ensure that the mediation result and mediation details fields are not empty and that the mediation case is not overly simple; duplicate records are deleted.
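The preprocessing rules of step 1 could look roughly like this. The dictionary keys, the `min_chars` cut-off used to decide that a record is "overly simple", and the sample records are assumptions introduced for the sketch.

```python
# Field names are hypothetical; the text names the fields but not their keys.
REQUIRED_TEXT_FIELDS = ("mediation_result", "mediation_details")

def preprocess(cases, min_chars=10):
    """Drop cases whose required fields are empty or very short, then drop exact duplicates."""
    seen, kept = set(), []
    for case in cases:
        texts = tuple((case.get(field) or "").strip() for field in REQUIRED_TEXT_FIELDS)
        if any(len(text) < min_chars for text in texts):
            continue  # empty or overly simple record
        key = ((case.get("dispute_details") or "").strip(),) + texts
        if key in seen:
            continue  # duplicate record
        seen.add(key)
        kept.append(case)
    return kept

leak = {
    "dispute_details": "Water leaking from the flat upstairs",
    "mediation_result": "Agreement reached on repair costs",
    "mediation_details": "Both households met twice with the mediator",
}
noise = {
    "dispute_details": "Night-time noise",
    "mediation_result": "",
    "mediation_details": "Visited both households",
}
clean = preprocess([leak, dict(leak), noise])
```

Here the duplicate copy of `leak` and the record with an empty mediation result are both dropped, leaving a single case.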
Step 2: Word segmentation and vector representation
Text data is unstructured and cannot be analyzed directly by a computer; the text must be cut into words and represented as structured data for subsequent processing. English can be segmented on spaces, but Chinese has no explicit word separator, so segmentation is harder; segmentation accuracy is usually improved with an auxiliary dictionary.
The present invention creates a dispute mediation domain dictionary, mediate.txt, for words that are easily segmented incorrectly, especially dispute mediation domain vocabulary. For example, "XX villagers' committee" may be split as "XX village / committee" when the correct split is "XX / villagers' committee". Words such as "villagers' committee" that cannot be cut correctly in the mediation case data are added to mediate.txt. Chinese also contains words that carry no meaning, and the dispute mediation domain additionally has terms such as Party A and Party B; these words not only carry no information but also interfere with subsequent analysis. These meaningless, poorly discriminating words are added to the stop-word dictionary stopword.txt, and stop words are removed during segmentation and excluded from analysis.
The text fields are segmented using the dictionary mediate.txt and the stop-word dictionary stopword.txt, and the text data are represented as vectors for convenient analysis and processing.
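The effect of a domain dictionary and a stop-word list can be shown with a small stand-in segmenter. The text does not name its segmentation tool, so forward maximum matching is used here purely as an illustration, and the in-line dictionary and stop-word entries are made-up stand-ins for the contents of mediate.txt and stopword.txt.

```python
from collections import Counter

def segment(text, dictionary, stopwords, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary word,
    falling back to a single character, and drop stop words."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                break
        i += size
        if piece not in stopwords:
            tokens.append(piece)
    return tokens

def vectorize(tokens, vocabulary):
    """Term-count vector over a fixed vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

dictionary = {"村委会", "发生", "纠纷"}  # villagers' committee, occur, dispute
stopwords = {"与"}                       # "and/with"
tokens = segment("王某与村委会发生纠纷", dictionary, stopwords)
vocabulary = sorted(set(tokens))
vector = vectorize(tokens, vocabulary)
```

With "村委会" (villagers' committee) in the dictionary it is kept as one token instead of being cut into single characters, which is the kind of correction the domain dictionary is meant to provide.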
Step 3: TF_CDF feature clustering
The "case details" in dispute mediation are divided into only a few broad classes, with no detailed class distinctions; all the data are mixed together, which hampers analysis and search. There is also no training data, so class labels can only be extracted by clustering. In step 2 the "case details" were segmented and represented as vectors, but the text data are high-dimensional and their features are not distinct, so feature extraction is still needed; finally a machine learning algorithm is used to cluster the data and obtain class labels.
The present invention calculates word weights using TF_CDF, which considers not only the word frequency and document frequency of a word but also its frequency across classes, so the result is more reliable than the commonly used TF_IDF feature extraction algorithm. Traditional text feature extraction based on information entropy, mutual information and the like is typically used for classification or label extraction when the text class labels are known or some training data are available. In the present invention the dispute mediation cases have no detailed category information, and the word TF_CDF values are obtained from the clustering result.
Feature clustering of the "case details" field of the dispute mediation cases proceeds as follows:
(3.1) Determining initial values
The "case details" of people's dispute mediation cases can be clustered into k classes. There are n dispute cases in total, forming the corpus D = {d_1, d_2, ..., d_n}. Here the corpus is the set of "case details" field values of all cases, and d is a single "case details" entry in the corpus. The texts in the corpus are segmented, and the distinct words obtained are {t_1, t_2, ..., t_N}.
1. Determining the initial TF_CDF values
The TF_CDF values are computed iteratively. To keep the computation simple and fast, the initial TF_CDF value of a word is set to its term frequency. After segmentation, a case i is represented as d_i = {w_1, w_2, ..., w_j}, j = 1, 2, ..., N, where N is the number of distinct words in the corpus after segmentation and w_j is the number of times word t_j occurs in case i.
2. Determining the initial cluster centers
The choice of initial cluster centers has a large influence on the clustering result. Following the principle that the initial cluster centers should be as far apart as possible, k initial cluster centers that are far apart are calculated: C = {C_1, C_2, ..., C_k}, c_j = {c_1, c_2, ..., c_N}, where C is the set of k cluster centers and c is the vector form of a single cluster center; for ease of notation, class labels are always written as subscripts.
3. Calculating the within-cluster dispersion
The sum of the errors between each case and the initial cluster centers is calculated to determine the initial within-cluster dispersion E_0.
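The initialisation in step (3.1) might be sketched as follows. The text does not say how the far-apart centers are found, so farthest-first traversal is swapped in here as one plausible choice; starting from the first vector and measuring the dispersion as a sum of cosine distances are likewise assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pick_initial_centers(vectors, k):
    """Farthest-first traversal: each new center is the vector least similar
    to the centers chosen so far."""
    centers = [vectors[0]]
    while len(centers) < k:
        farthest = min(vectors, key=lambda v: max(cosine(v, c) for c in centers))
        centers.append(farthest)
    return centers

def dispersion(vectors, centers):
    """Sum over all cases of the cosine distance (1 - similarity) to the nearest center."""
    return sum(1.0 - max(cosine(v, c) for c in centers) for v in vectors)

# Term-count vectors for four made-up cases over a three-word vocabulary.
vectors = [[3, 0, 0], [2, 1, 0], [0, 0, 4], [0, 1, 3]]
centers = pick_initial_centers(vectors, k=2)
e0 = dispersion(vectors, centers)
```

The value `e0` plays the role of E_0: later iterations compare the current dispersion E against E_0/2 to decide whether the TF_CDF weights are refreshed.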
(3.2) Assigning each "case details" entry to the nearest class
Cosine similarity measures the similarity of two vectors by the cosine of the angle between them; it captures differences in direction and is insensitive to absolute magnitude. After vectorization the text data are multi-dimensional vectors, so cosine similarity is used as the clustering metric. Formula (1) gives the minimum cosine distance from case d_i to the cluster centers; case d_i belongs to the class j whose center c_j attains that minimum.
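Read together with the definitions above, the assignment rule of step (3.2) can be written out as below. This is a reconstruction from the surrounding description rather than the patent's own typeset formula; the names dist and class and the arg min notation are assumed here.

```latex
\operatorname{dist}(d_i, c_j) = 1 - \frac{d_i \cdot c_j}{\lVert d_i \rVert \, \lVert c_j \rVert},
\qquad
\operatorname{class}(d_i) = \arg\min_{1 \le j \le k} \operatorname{dist}(d_i, c_j)
```

Minimising the cosine distance is the same as maximising the cosine similarity, so either form selects the same class.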
(3.3) Updating the TF_CDF model
The within-cluster dispersion E of the clustering is calculated. If E is less than half the initial within-cluster dispersion, E_0/2, TF_CDF is updated. If E is greater than E_0/2, step (3.3) is skipped. After classification, if a word occurs many times in one class and rarely in the others, the word is important in that class and its weight should be increased accordingly. If a word occurs in roughly the same proportion in every class, its discriminating power is low and its weight should be reduced. The entropy of a word's distribution over the classes is calculated according to formula (2), which is built from the proportion of class-j documents in which word w_p occurs, that is, the number of documents in class j containing w_p divided by cw_j, the total number of documents in class j. H(w_p) is the entropy of word w_p over the k classes: the more evenly a word is distributed over the classes, the larger the entropy and the lower the word's discriminating power.
For short text data that can be divided into a few broad classes, once stop words are removed, a large share of the frequently occurring words are the more important ones. The TF_CDF of a word w_p is calculated by formula (3), in which TF_p is the frequency of the p-th word in text i; DF_p, the document frequency of the word, is the number of documents in the corpus that contain it; H(w_p) in the denominator is the entropy of the word; ln() is the natural logarithm, which reduces the share of the document frequency; and ε is a small value that prevents an error when H(w_p) is 0. Because word frequency, document frequency and class entropy are all taken into account, accuracy on short text is higher.
Updating the TF_CDF values in every iteration is computationally expensive. The values may instead be updated once every several iterations, or a threshold on the change in the cluster centers may be set and the update performed only when the change exceeds it.
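The quantities described in step (3.3) can be sketched as follows. Formulas (2) and (3) themselves are not reproduced in this text, so the forms used here are assumptions that merely respect the description: the per-class proportions are normalised before the entropy is taken, the document frequency enters as ln(1 + DF), and the entropy plus a small eps sits in the denominator.

```python
import math

def class_entropy(docs_with_word, docs_in_class):
    """Entropy of a word over the classes.

    Assumption: the per-class proportions (documents containing the word divided by
    documents in the class) are normalised to sum to 1 before the entropy is taken.
    """
    ratios = [w / n for w, n in zip(docs_with_word, docs_in_class) if n]
    total = sum(ratios)
    if total == 0:
        return 0.0
    return -sum((r / total) * math.log(r / total) for r in ratios if r > 0)

def tf_cdf(tf, df, entropy, eps=0.01):
    """Assumed form of the weight: TF * ln(1 + DF) / (H + eps)."""
    return tf * math.log(1 + df) / (entropy + eps)

# Two classes of 10 documents each (made-up counts).
h_specific = class_entropy([9, 0], [10, 10])  # word found in one class only
h_shared = class_entropy([5, 5], [10, 10])    # word spread evenly over both classes
```

A word confined to one class gets zero entropy and therefore a large weight, while the ln(1 + DF) factor still rewards words that occur in many documents; how strongly the two effects trade off depends on eps, which the text only describes as small.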
(3.4) cluster centre is updated
It regard the average of each class Chinese version vector as new cluster centre.
(3.5) repeat step (3.2)~(3.4), until cluster centre no longer changes, then TF_CDF values no longer change, and obtain To k class and TF_CDF models.
(3.6) after the completion of class tag extraction, cluster, several words higher word TF_CDF in each classification is extracted and are used as class Distinguishing label.
Step 4: Automatically generate the desensitized typical case collection
(4.1) Automatic desensitization
To protect the private information in a case, the large amount of private information it contains, such as names and addresses, must be specially processed. Manual processing takes a great deal of time and effort, so the present invention desensitizes the cases automatically: natural language processing techniques identify names, ID numbers, and addresses, and the identified information is replaced with "so-and-so" in the original text.
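The replacement half of the desensitization step can be sketched as follows. The source does not say which natural language processing technique recognizes the names, so this sketch simply assumes the recognized names are passed in as a list; the regular expression for 18-character ID numbers and the function name are likewise assumptions made for illustration.

```python
import re

# 17 digits followed by a digit or X, not embedded in a longer digit run.
ID_NUMBER = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")

def desensitize(text, names, placeholder="so-and-so"):
    """Mask recognized names and ID numbers in a case text."""
    # `names` stands in for the output of a name-recognition step (assumed).
    for name in sorted(names, key=len, reverse=True):  # longest names first
        text = text.replace(name, placeholder)
    return ID_NUMBER.sub(placeholder, text)
```

Replacing longer names first avoids leaving fragments behind when one recognized name is a prefix of another.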
(4.2) Automatic generation of the typical case collection
There is no fixed standard for registering mediation cases, and case quality varies widely, so a mediator can hardly find good reference cases in a short time. Building a high-quality typical case collection is therefore of great value. However, manually extracting typical cases for study and reference from a large number of cases is time-consuming and easily influenced by personal factors. The present invention scores the cases and lets the machine generate the high-quality typical case collection automatically, which is objective, fair, and saves time. Step 3 divides the cases into classes; the cases in each class are scored separately, and the higher-scoring cases in each class are taken as the typical cases of that class. The case scoring model is created as follows:
1. Analyze the case template and divide it into several component modules, such as case summary, mediation process, mediation result description, and mediation insights. Each class may be divided differently, and the other cases are divided according to the corresponding component modules;
2. Analyze the text quality of each component module with natural language processing methods, for example the text length of the module, the meaning of the text, and whether it consists of repeated words or sentences (other indicators that reflect text quality also fall within the scope of this patent). Cases with higher text quality receive higher scores.
Q_t = aQ_l + bQ_p + cQ_s (4)
Where a, b, and c, with a + b + c = 1, are the weights of the individual quality scores. Q_l is the text length quality: if the length of the unsegmented text exceeds a threshold T_l, Q_l is 1; otherwise Q_l decays exponentially. Q_p is the word repetition quality, based on the share of the total word count taken by the most frequent word in the text: Q_p is 1 when this share is below a threshold T_p and decays exponentially when it exceeds the threshold. Q_s is the ratio of the text length after segmentation to the text length before segmentation. Because segmentation removes single-character words and stop words, Q_s measures whether the text contains too many meaningless words; the larger Q_s is, the better the text quality.
3. Case mediators include chief mediators, ordinary mediators, part-time staff, and so on. Analyze each mediator's mediation experience, mine the fields in which the mediator excels, and assign each mediator a scoring weight; the scoring weight Q_h ∈ [0,1] is raised appropriately for mediators who often handle difficult cases and whose cases have a high mediation success rate.
4. Perform sentiment analysis on the mediation feedback to obtain the mediation feedback weight Q_e; scoring weights are assigned to positive and negative feedback, with positive feedback weighted higher;
5. Combining the aspects above, create the case scoring model Q = (αQ_h + βQ_e)Q_t, where α is the weight of Q_h in the case score, β is the weight of Q_e, and α + β = 1. The case information is scored according to the case scoring model, and the typical case collection with class labels is generated automatically.
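The scoring model of items 2 to 5 can be sketched as below. The source fixes only the structure of formula (4) and of Q = (αQ_h + βQ_e)Q_t; the weights, the thresholds, and the exact shape of the exponential decay used here are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

def text_quality(raw_text, tokens, a=0.4, b=0.3, c=0.3, t_len=50, t_rep=0.2):
    """Q_t = a*Q_l + b*Q_p + c*Q_s (formula (4)); a + b + c must equal 1."""
    if not raw_text or not tokens:
        return 0.0
    n = len(raw_text)
    # Q_l: 1 above the length threshold, assumed exponential decay below it.
    q_l = 1.0 if n > t_len else math.exp(-(t_len - n) / t_len)
    # Q_p: based on the share of the most frequent word; assumed decay above T_p.
    top_share = Counter(tokens).most_common(1)[0][1] / len(tokens)
    q_p = 1.0 if top_share <= t_rep else math.exp(-(top_share - t_rep) / t_rep)
    # Q_s: text length kept after segmentation relative to the raw length.
    q_s = min(1.0, sum(len(t) for t in tokens) / n)
    return a * q_l + b * q_p + c * q_s

def case_score(q_h, q_e, q_t, alpha=0.5):
    """Q = (alpha*Q_h + beta*Q_e) * Q_t with alpha + beta = 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (alpha * q_h + (1.0 - alpha) * q_e) * q_t
```

Because Q_t multiplies the weighted sum, a case with poor text quality scores low even when the mediator weight and the feedback weight are both high.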
Step 5: Generate mediation strategy prompts
The mediation strategy of a case is of considerable interest to users. It can serve as the basis for mediating similar cases, achieving "same case, same mediation" and promoting the fairness of people's mediation work. Checking the mediation strategy case by case is time-consuming and laborious, whereas automatically displaying the mediation strategies of related cases saves users a great deal of time. The present invention generates mediation strategies automatically, using the typical cases with class labels from step 4 as the analysis data. For a given class, the mediation strategy is generated by the following steps:
(5.1) Obtain the typical case collection with class labels from step 4 and extract the mediation strategy field.
(5.2) A mediation strategy carries enumerated clause markers (first, second, third, and so on); split the mediation strategy at these markers to form mediation clauses.
(5.3) Perform TF_CDF cluster analysis on the mediation clauses, and extract the keywords of the mediation clauses with the keyword extraction method of step 3.
(5.4) Score the classes of mediation clauses. The scoring basis includes the number of mediation clauses in the class, the proportion of the class taken by mediation clauses sharing the same keyword, and so on.
(5.5) Score the mediation clauses. The scoring basis includes the frequency with which keywords appear in the clause, the quality of the text, and so on.
(5.6) Sort the mediation clause classes by score in descending order, select the higher-scoring classes, extract the high-scoring mediation clauses from those classes as the mediation strategy prompt information, and store them in the database.
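Step (5.6) amounts to a two-level ranking, sketched below. The scores are taken as given, since the source does not define how steps (5.4) and (5.5) combine their scoring bases; the dictionary keys, the function name, and the cut-off parameters are all hypothetical.

```python
def strategy_prompts(clauses, top_classes=2, top_clauses=2):
    """Pick high-scoring clauses from the highest-scoring clause classes.

    Each clause is a dict with assumed keys: 'text', 'cluster',
    'cluster_score' (from step 5.4) and 'score' (from step 5.5).
    """
    cluster_scores = {c["cluster"]: c["cluster_score"] for c in clauses}
    best = sorted(cluster_scores, key=cluster_scores.get, reverse=True)[:top_classes]
    prompts = []
    for cluster in best:
        members = [c for c in clauses if c["cluster"] == cluster]
        members.sort(key=lambda c: c["score"], reverse=True)
        prompts.extend(c["text"] for c in members[:top_clauses])
    return prompts
```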
Step 6: Create the index and calculate relevance
The core of a full-text search engine consists of index creation and relevance calculation. The typical case data and cluster classes obtained in step 4 and the mediation strategy prompts from step 5 are synchronized to Elasticsearch to create the index.
(6.1) Create the index
Elasticsearch is built on Lucene and is one of the most widely used open-source search engines today. The present invention uses Elasticsearch (ES) to create full-text indexes on "dispute details", "mediation result description", and "mediation strategy recommendation". "Dispute details" and "mediation result description" are fields contained in the original data, and "mediation strategy recommendation" is the field computed in step 5.
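A possible index body for this step is sketched below as a plain Python dictionary, without calling any client library. The three full-text field names are the ones used in the embodiment (MEDIATE_CIRCS, MEDIATE_EXPLAIN, RESULT_RECOMMEND), and their correspondence to the three quoted fields is inferred; CASE_CLASS is a hypothetical exact-value field. The `text` and `keyword` field types assume a recent Elasticsearch version, which the source does not specify.

```python
import json

FULL_TEXT_FIELDS = ["MEDIATE_CIRCS", "MEDIATE_EXPLAIN", "RESULT_RECOMMEND"]

def build_index_body(text_fields, exact_fields):
    """Build a mapping: analyzed text for full-text search, keyword for exact match."""
    properties = {name: {"type": "text"} for name in text_fields}
    properties.update({name: {"type": "keyword"} for name in exact_fields})
    return {"mappings": {"properties": properties}}

# CASE_CLASS is a hypothetical exact-value field used only for illustration.
body = build_index_body(FULL_TEXT_FIELDS, ["CASE_CLASS"])
print(json.dumps(body, indent=2))
```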
(6.2) Relevance calculation
Relevance calculation computes the relevance between the search input Query and the indexed text, and results are output in descending order of relevance by default. Relevance calculation determines what the interface displays and directly affects the user experience; accurate and effective relevance calculation makes searching easier for users. The clustering in step 3 yields the classes, class labels, cluster centers, and TF_CDF values. This patent represents text vectors with TF_CDF weights and calculates text similarity. In addition, the input search content Query is segmented into pn words, and if a word appears among the class labels, the text relevance is raised accordingly.
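The relevance calculation described above can be sketched as cosine similarity over TF_CDF-weighted vectors plus a bonus for query words that are class labels. The source does not say how much the relevance is raised, so the additive boost and its size are assumptions, as are the function names and the sparse-dictionary representation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {word: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def relevance(query_vec, doc_vec, class_labels, boost=0.1):
    """Similarity plus an assumed additive boost per query word that is a class label."""
    hits = sum(1 for word in query_vec if word in class_labels)
    return cosine(query_vec, doc_vec) + boost * hits
```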
Step 7: Search results and interface display
The server performs a series of data mining calculations whose ultimate purpose is to present results to the user on the client. By entering a query, the user obtains similar typical cases, case classes and class label information, and mediation strategy recommendations. To help the user understand the related cases from every angle, a similar-case analysis report can be generated automatically. The details are as follows:
(7.1) Similar cases. Search results are output in descending order of relevance by default. The user can manually filter and sort the retrieval results, for example by displaying cases of a specified type and period or sorting by time.
(7.2) Case class and class labels. Each case displays its case class and class labels, which can also be used as filter conditions for retrieval.
(7.3) Mediation strategy recommendation. The mediation strategy is generated automatically from the related cases found by the search.
(7.4) Retrieval result analysis. The case analysis is not displayed in the main interface by default. When needed, the user can click the analysis button to obtain the retrieval result analysis report, which is divided into time series analysis, regional analysis, mediator analysis, mediation result analysis, and so on. A secondary search can be carried out within the result set according to the analysis results.
The verification data for the present embodiment is people's mediation data from Shanghai. The process is as follows:
Step 1: Data collection and preprocessing
Collect the people's mediation case information and store it in the database. The fields are shown in Table 1.
Table 1 Dispute mediation field information
Preprocess the collected data: remove records whose "MEDIATE_CIRCS" field is empty or too brief; delete records with duplicate dispute details; replace private information in the case, such as names and addresses, with "so-and-so". Full-text retrieval is performed on the MEDIATE_CIRCS, MEDIATE_EXPLAIN, and RESULT_RECOMMEND fields, and exact-value retrieval on the other fields.
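The filtering and de-duplication part of this preprocessing can be sketched as follows. The minimum length that counts as "too brief" is not given in the source, so `min_length` is an assumed parameter, and the function name is hypothetical.

```python
def preprocess(records, min_length=10):
    """Drop records with empty or too-brief dispute details, then de-duplicate."""
    seen, kept = set(), []
    for rec in records:
        details = (rec.get("MEDIATE_CIRCS") or "").strip()
        # Skip empty or very short descriptions and repeated dispute details.
        if len(details) < min_length or details in seen:
            continue
        seen.add(details)
        kept.append(rec)
    return kept
```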
Step 2: Word segmentation and vector representation
Text data is unstructured and cannot be analyzed directly; the text must first be segmented into words. Create the dispute mediation domain dictionary mediate.txt and the stop-word dictionary stopword.txt. Segment the "dispute details" and "mediation result description" texts according to the dictionary and the stop-word dictionary, and represent the text data as vectors for convenient computer processing.
(2.1) The "dispute details" of one dispute case read as follows:
Party A and Party B are upstairs and downstairs neighbors. On November 29, 2009, the blocked floor drain in Party B's bathroom leaked water into Party A's home, damaging the ceiling, walls, and wall-cabinet doors of Party A's home. Party A demanded that Party B compensate for the loss, and the two parties disagreed over the compensation, which led to the dispute.
(2.2) The dictionary contains each word, its frequency, and its part of speech (which may be omitted), one word per line, separated by spaces. Part of the dispute mediation dictionary mediate.txt is shown in Figure 2.
(2.3) Part of the dispute mediation stop-word dictionary stopword.txt is shown in Figure 3, one word per line.
(2.4) Segmenting and processing the "dispute details" gives:
upstairs and downstairs / neighbor relationship / bathroom / floor drain / leak / cause / ceiling / wall / wall-cabinet door / damaged / demand / compensate for loss / compensation issue / arise / disagreement / lead to dispute
Step 3: TF_CDF calculation and TF_CDF feature clustering
People's disputes can be divided into 4 major classes: compensation disputes, neighborhood disputes, work disputes, and contract disputes. Each major class can be further divided into several subclasses. The contract dispute major class is divided into subclasses below; the feature clustering steps are as follows:
(3.1) Determine the initial values
The contract dispute major class, which contains 2,122 dispute records, is clustered into 6 subclasses. The text data is segmented, words with a frequency below 3 are removed, and N-dimensional word vectors are formed.
1. Determine the initial TF_CDF values
The initial TF_CDF value of a word is set to its word frequency, which is simple and fast to compute. A record i can be represented as d_i = {w_1, w_2, ..., w_N}, where N is the data dimension and w is the word frequency.
2. Determine the initial cluster centers
Compute k initial cluster centers that are far apart from one another: C = {C_1, C_2, ..., C_k}, c_j = {c_1, c_2, ..., c_N}.
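The two initialization steps of (3.1) can be sketched as follows. Building the word-frequency vectors follows the text directly. The source does not say how the k distant centers are computed, so the sketch substitutes farthest-first traversal under cosine distance, which is an assumption, as are all function names.

```python
import math
from collections import Counter

def build_vocabulary(tokenized_docs, min_count=3):
    """Keep words whose corpus frequency is at least min_count, in sorted order."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return sorted(word for word, n in counts.items() if n >= min_count)

def tf_vector(tokens, vocabulary):
    """d_i = {w_1, ..., w_N}: count of each vocabulary word in one record."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def farthest_first_centers(vectors, k):
    """Assumed method: start from the first vector, then repeatedly add the
    vector whose nearest already-chosen center is farthest away."""
    centers = [vectors[0]]
    while len(centers) < k:
        nxt = max(vectors, key=lambda v: min(cosine_distance(v, c) for c in centers))
        centers.append(nxt)
    return centers
```

In the embodiment this would be called with k = 6 on the 2,122 contract dispute vectors.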
(3.2) Assign each case to the nearest class according to cosine similarity
After vectorization, the text data are multi-dimensional vectors. Using cosine similarity as the clustering metric, compute the distance between each case and each class center according to formula (1) to obtain the class of the case.
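The assignment step can be sketched as below. The sketch treats the nearest class as the center with the highest cosine similarity, which is the same as the smallest cosine distance; the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_to_nearest(vectors, centers):
    """Return, for each case vector, the index of the most similar cluster center."""
    return [
        max(range(len(centers)), key=lambda j: cosine_similarity(v, centers[j]))
        for v in vectors
    ]
```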
(3.3) Update TF_CDF
After classification, if a word occurs many times in one class and rarely in the other classes, the word is important to that class and its weight should be increased accordingly. If a word occurs with roughly the same frequency in every class, the word discriminates poorly between classes and its weight should be reduced. The entropy of a word's distribution over the classes is calculated according to formula (2). The more evenly a word is distributed across the classes, the larger its entropy and the lower its discriminating power.
For short text data that can be divided into several major classes, once stop words are removed, a large share of the frequently occurring words carry important meaning. The present invention calculates the TF_CDF value of each word according to formula (3).
Updating the TF_CDF values in every iteration is computationally expensive. The TF_CDF values can instead be updated once every several iterations, or a threshold can be set on the change in the cluster centers so that the update is performed only when the change exceeds that threshold.
(3.4) Update the cluster centers
After TF_CDF is updated, take the mean of the text vectors in each class as the new cluster center.
(3.5) Repeat steps (3.2) to (3.4) until the cluster centers no longer change, at which point the word entropies, and hence the TF_CDF model, no longer change either, yielding k clusters and the TF_CDF model.
(3.6) Class label extraction
After clustering is complete, the 5 words with the highest TF_CDF values in each class are extracted as its class labels. The classes and class labels obtained after clustering the contract disputes are shown in Table 2.
Table 2 Contract dispute clustering results
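Extracting the class labels of step (3.6) is a top-N selection per class, sketched below. The input layout, a dictionary of per-class word weights, and the alphabetical tie-break are assumptions made for illustration.

```python
def class_labels(weights_by_class, top_n=5):
    """Return the top_n words with the highest TF_CDF weight for each class."""
    return {
        label: [word for word, _ in
                sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
        for label, weights in weights_by_class.items()
    }
```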
Step 4: Automatically generate the desensitized typical case collection. The process is as follows:
(4.1) Automatic desensitization
To protect the private information in the cases, natural language processing techniques identify names, ID numbers, and addresses, and desensitization is carried out automatically: the identified information is replaced with "so-and-so" in the original text, as follows:
"Shi so-and-so and Zhang so-and-so are in an employment relationship. In early September 2009, Shi so-and-so was introduced by so-and-so nanny agency to work as a nanny in Zhang so-and-so's home, at an agreed monthly wage of RMB 1,300, with meals and lodging in Zhang so-and-so's home ...";
(4.2) Automatic generation of the typical case collection
There is no fixed standard for registering mediation cases, and case quality varies widely, so a mediator can hardly find good reference cases in a short time. Building a high-quality typical case collection is therefore of great value. However, manually extracting typical cases for study and reference from a large number of cases is time-consuming and easily influenced by personal factors. The present invention scores the cases and lets the machine generate the high-quality typical case collection automatically, which is objective, fair, and saves time. Step 3 divides the cases into classes; the cases in each class are scored separately, and the higher-scoring cases in each class are taken as the typical cases of that class. The case scoring model is created as follows:
1. Analyze the case template and divide it into several component modules, such as case summary, mediation process, mediation result description, and mediation insights. Each class may be divided differently, and the other cases are divided according to the corresponding component modules;
2. Analyze the text quality of each component module with natural language processing methods, for example the text length of the module, the meaning of the text, and whether it consists of repeated words or sentences (other indicators that reflect text quality also fall within the scope of this patent). The higher the text quality, the higher the text quality score Q_t.
3. Case mediators include chief mediators, ordinary mediators, part-time staff, and so on. Analyze each mediator's mediation experience, mine the fields in which the mediator excels, and assign each mediator a scoring weight to obtain the score Q_h.
4. Perform sentiment analysis on the mediation result feedback and assign scoring weights to positive and negative feedback, with positive feedback weighted higher, to obtain Q_e.
5. Combining the aspects above, create the case scoring model. The case information is scored according to the case scoring model, and the typical case collection with class labels is generated automatically.
Step 5: Generate mediation strategy prompts
The mediation strategy of a case is of considerable interest to users. It can serve as the basis for mediating similar cases, achieving "same case, same mediation" and promoting the fairness of people's mediation work. Checking the mediation strategy case by case is time-consuming and laborious, whereas automatically displaying the mediation strategies of related cases saves users a great deal of time. The present invention generates mediation strategies automatically, using the typical cases with class labels from step 4 as the analysis data. For a given class, the mediation strategy is generated by the following steps:
(5.1) Obtain the typical case collection with class labels from step 4 and extract the mediation strategy field.
(5.2) A mediation strategy carries enumerated clause markers (first, second, third, and so on); split the mediation strategy at these markers to form mediation clauses.
(5.3) Perform TF_CDF cluster analysis on the mediation clauses, and extract the keywords of the mediation clauses as in step 3.
(5.4) Score the classes of mediation clauses. The scoring basis includes the number of mediation clauses in the class, the proportion of the class taken by mediation clauses sharing the same keyword, and so on.
(5.5) Score the mediation clauses. The scoring basis includes the frequency with which keywords appear in the clause, the quality of the text, and so on.
(5.6) Sort the mediation clause classes by score in descending order, select the higher-scoring classes, extract the high-scoring mediation clauses from those classes as the mediation strategy prompt information, and store them in the database.
Step 6: Create the index and calculate relevance
The core of a full-text search engine consists of index creation and relevance calculation. The typical case data and cluster classes obtained in step 4 and the mediation strategy prompts from step 5 are synchronized to Elasticsearch to create the index.
(6.1) Create the index
Elasticsearch is built on Lucene and is one of the most widely used open-source search engines today. The present invention uses Elasticsearch (ES) to create full-text indexes on "dispute details", "mediation result description", and "mediation strategy recommendation". "Dispute details" and "mediation result description" are fields contained in the original data, and "mediation strategy recommendation" is the field computed in step 5.
(6.2) Relevance calculation
Relevance calculation computes the relevance between the search input Query and the indexed text, and results are output in descending order of relevance. Relevance calculation determines what the interface displays and directly affects the user experience; accurate and effective relevance calculation makes searching easier for users. The clustering in step 3 yields the classes, class labels, and TF_CDF values, which this patent uses to calculate the relevance.
Step 7: Search results and interface display
The server performs a series of data mining calculations whose ultimate purpose is to present results to the user on the client. By entering a query, the user obtains similar typical cases, case classes and class label information, and mediation strategy recommendations. To help the user understand the related cases from every angle, a similar-case analysis report can be generated automatically. The details are as follows:
(7.1) Similar cases. Search results are output in descending order of relevance by default. The user can manually filter and sort the retrieval results, for example by displaying cases of a specified type and period or sorting by time.
(7.2) Case class and class labels. Each case displays its case class and class labels, which can also be used as filter conditions for retrieval.
(7.3) Mediation strategy recommendation. The mediation strategy is generated automatically from the related cases found by the search.
(7.4) Retrieval result analysis. The case analysis is not displayed in the main interface by default. When needed, the user can click the analysis button to obtain the retrieval result analysis report, which is divided into time series analysis, regional analysis, mediator analysis, mediation result analysis, and so on. A secondary search can be carried out within the result set according to the analysis results.

Claims (8)

1. A people's dispute mediation case retrieval and mediation strategy recommendation method, characterized by comprising the following steps:
Step 1: Data collection and preprocessing
Collect people's mediation case information and store it in a database. The required fields include: dispute details, mediation result, mediation details, mediation time, end time, mediator, region, mediation committee, and evaluation, wherein dispute details, mediation details, and evaluation are text data and the other fields are structured data;
Preprocess the collected data, ensuring that the mediation result and mediation details fields are not empty, and delete duplicate data;
Step 2: Word segmentation and vector representation
Create the dispute mediation domain dictionary mediate.txt: words that are easily segmented incorrectly, especially dispute mediation domain terms, and words found in the mediation case data that cannot be segmented correctly, are added to the dispute mediation domain dictionary mediate.txt; in addition, Chinese contains some meaningless words, and these meaningless, poorly discriminating words are added to the stop-word dictionary stopword.txt, so that stop words are removed directly during segmentation and excluded from analysis;
Segment the text fields according to the dictionary mediate.txt and the stop-word dictionary stopword.txt, and represent the text data as vectors;
Step 3: TF_CDF feature clustering
Because dispute mediation cases carry no detailed class information, the text word weights are calculated with TF_CDF, and TF_CDF feature clustering is performed to obtain the detailed classes of the cases and the class keywords, while the word TF_CDF values are obtained from the clustering result;
Step 4: Desensitize automatically, score the cases automatically, and generate the desensitized typical case collection;
Step 5: Generate mediation strategy prompts
Using the typical cases with class labels as the analysis data, the mediation strategy for a given class is generated by the following process:
(5.1) Obtain the typical case collection with class labels and extract the mediation strategy field;
(5.2) A mediation strategy carries enumerated clause markers (first, second, third, and so on); split the mediation strategy at these markers to form mediation clauses;
(5.3) Perform TF_CDF cluster analysis on the mediation clauses and extract the keywords of the mediation clauses;
(5.4) Score the classes of mediation clauses, the scoring basis including the number of mediation clauses in the class and the proportion of the class taken by mediation clauses sharing the same keyword;
(5.5) Score the mediation clauses, the scoring basis including the number of class keywords appearing in the clause, the number of times they appear, and the text quality;
(5.6) Sort the mediation clause classes by score in descending order, select the higher-scoring classes, extract the high-scoring mediation clauses from those classes as the mediation strategy prompt information, and store them in the database;
Step 6: Create the index and calculate relevance
The core of a full-text search engine consists of index creation and relevance calculation; the typical case data and cluster classes obtained in step 4 and the mediation strategy prompts from step 5 are synchronized to Elasticsearch to create the index;
Step 7: Search results and interface display
The user enters a query and obtains similar typical cases, case classes and class label information, and mediation strategy recommendations, and a similar-case analysis report is generated automatically.
2. The people's dispute mediation case retrieval and mediation strategy recommendation method as claimed in claim 1, characterized in that in step 7 the search process is as follows:
(7.1) Similar cases: search results are output in descending order of relevance by default, and the user can manually filter and sort the retrieval results, by displaying cases of a specified type and period or sorting by time;
(7.2) Case class and class labels: these can be used as filter conditions for retrieval, and each case displays its case class and class keywords;
(7.3) Mediation strategy recommendation: the mediation strategy is generated automatically from the related cases found by the search;
(7.4) Retrieval result analysis: when needed, the user clicks the analysis button to obtain the retrieval result analysis report, which is divided into time series analysis, regional analysis, mediation committee analysis, mediator analysis, mediation result analysis, and mediation duration analysis; a secondary search is carried out within the result set according to the analysis results.
3. The people's dispute mediation case retrieval and mediation strategy recommendation method as claimed in claim 1 or 2, characterized in that in step 3 the feature clustering of the "case details" field of the mediated disputes proceeds as follows:
(3.1) Determine the initial values
The "case details" of people's dispute mediation can be clustered into k classes. The n dispute cases form the corpus D = {d_1, d_2, ..., d_n}, where the corpus is the set of "case details" field values of all cases and d is a single "case details" entry of the corpus. The texts in the corpus are segmented, and the distinct words obtained are {t_1, t_2, ..., t_N};
(3.2) Assign the "case details" to the nearest class according to cosine similarity
Cosine similarity is used as the clustering metric, as shown in formula (1):
$$s_i^j = \min_j\left(\frac{\vec{d}_i \cdot \vec{c^j}}{\left|\vec{d}_i\right| \times \left|\vec{c^j}\right|}\right),\quad j = (1, 2, \ldots, k) \qquad (1)$$
Where s_i^j is the minimum cosine distance from case d_i to the cluster centers, that is, case d_i belongs to class j, and c^j is the j-th cluster center;
(3.3) Update the TF_CDF model
Compute the within-cluster variance E of the clustering. If E is less than half of the initial within-cluster variance, E_0/2, update TF_CDF; if the clustering error E is greater than E_0/2, skip step (3.3). The entropy of a word's distribution over the classes is calculated according to formula (2):
$$\begin{cases} H(w_p) = \sum_{j=1}^{k} -pw_p^j \times \log_2\left(pw_p^j\right) \\ pw_p^j = \dfrac{cw_p^j}{cw^j},\quad \sum_{j=1}^{k} pw_p^j = 1 \end{cases} \qquad (2)$$
Where pw_p^j is the proportion of class-j documents that contain word w_p, cw_p^j is the number of documents in class j that contain word w_p, cw^j is the total number of documents in class j, and H(w_p) is the entropy of word w_p over the k classes;
Some word wpTF_CDF calculate as shown in formula (3):
<mrow> <mi>T</mi> <mi>F</mi> <mo>_</mo> <msub> <mi>CDF</mi> <mi>p</mi> </msub> <mo>=</mo> <mfrac> <mrow> <msub> <mi>TF</mi> <mi>p</mi> </msub> <mo>&amp;times;</mo> <mi>ln</mi> <mrow> <mo>(</mo> <msub> <mi>DF</mi> <mi>p</mi> </msub> <mo>)</mo> </mrow> </mrow> <mrow> <mo>(</mo> <mi>H</mi> <mo>(</mo> <msub> <mi>w</mi> <mi>p</mi> </msub> <mo>)</mo> <mo>+</mo> <mi>&amp;epsiv;</mi> <mo>)</mo> <msqrt> <msup> <mrow> <mo>(</mo> <msub> <mi>TF</mi> <mi>p</mi> </msub> <mo>+</mo> <mi>ln</mi> <mo>(</mo> <mrow> <msub> <mi>DF</mi> <mi>p</mi> </msub> </mrow> <mo>)</mo> <mo>)</mo> </mrow> <mn>2</mn> </msup> </msqrt> </mrow> </mfrac> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>3</mn> <mo>)</mo> </mrow> </mrow>
where $TF_p$ is the term frequency of the p-th word in document i, $DF_p$ is the number of documents in the corpus that contain this word, q is the number of words contained in document i, $H(w_p)$ in the denominator is the entropy of the word, ln(·) is the natural logarithm function, and ε is a small value;
(3.4) Update the cluster centers: take the mean of the text vectors in each class as the new cluster center;
(3.5) Repeat steps (3.2) to (3.4) until the cluster centers no longer change, at which point the TF_CDF values no longer change either, yielding k clusters and the TF_CDF model;
(3.6) Class label extraction: after clustering is complete, extract the several words with the highest TF_CDF values in each class as the class labels.
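Formulas (2) and (3) and the label extraction of step (3.6) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `word_entropy`, `tf_cdf`, and `top_labels` are hypothetical, the default value of ε is an assumption, and the rescaling of the per-class ratios so that they sum to 1 is one possible reading of the constraint in formula (2).

```python
import math
from typing import Dict, List, Sequence


def word_entropy(docs_with_word: Sequence[int], docs_in_class: Sequence[int]) -> float:
    """Entropy H(w_p) of one word over k classes, following formula (2).

    docs_with_word[j] plays the role of cw_p^j, docs_in_class[j] of cw^j.
    """
    ratios = [cw_pj / cw_j
              for cw_pj, cw_j in zip(docs_with_word, docs_in_class)
              if cw_j > 0]
    total = sum(ratios)
    if total == 0:
        return 0.0
    # Assumption: rescale the ratios so that they sum to 1, as formula (2) requires.
    probs = [r / total for r in ratios]
    return -sum(p * math.log2(p) for p in probs if p > 0)


def tf_cdf(tf: float, df: int, entropy: float, eps: float = 1e-6) -> float:
    """TF_CDF weight of one word, transcribed term by term from formula (3)."""
    if df < 1:
        return 0.0  # the word does not occur in the corpus
    log_df = math.log(df)
    norm = math.sqrt((tf + log_df) ** 2)
    if norm == 0:
        return 0.0
    return tf * log_df / ((entropy + eps) * norm)


def top_labels(weights: Dict[str, float], n: int) -> List[str]:
    """Step (3.6): the n words with the highest TF_CDF values in one class."""
    return sorted(weights, key=weights.get, reverse=True)[:n]
```

A word spread evenly over the classes has high entropy and therefore a low weight, while a word concentrated in one class has entropy near zero and a high weight, which is why the top-weighted words are usable as class labels.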
4. The people's dispute mediation case retrieval and mediation strategy recommendation method as claimed in claim 3, characterized in that: in step (3.1), the initial values are determined as follows:
1. Determine the initial TF_CDF values
The initial TF_CDF value of each word is set to its term frequency. After word segmentation, a case i is expressed as $d_i = \{w_1, w_2, ..., w_j\}$, j = 1, 2, ..., N, where N is the number of distinct words in the corpus after word segmentation and $w_j$ is the number of times word $t_j$ occurs in case i;
2. Determine the initial cluster centers
Compute k initial cluster centers that are far apart from one another: $C = \{C_1, C_2, ..., C_k\}$, $c_j = \{c_1, c_2, ..., c_N\}$, where C is the set of k cluster centers and c is the vector form of a single cluster center; for ease of notation, class labels are all written as subscripts;
3. Compute the within-cluster variance
Compute the sum of the distances between each case and its initial cluster center to determine the initial within-cluster variance $E_0$.
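The three initialization steps can be sketched as follows. The claim only states that the k initial centers should be far apart; the farthest-point heuristic, the Euclidean distance, and every function name below are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def term_frequency_vectors(cases: Sequence[Sequence[str]]) -> Tuple[List[str], List[List[float]]]:
    """Step 1: represent each segmented case by its raw term frequencies."""
    vocab = sorted({word for case in cases for word in case})
    vectors = []
    for case in cases:
        counts = Counter(case)
        vectors.append([float(counts[word]) for word in vocab])
    return vocab, vectors


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def far_apart_centers(vectors: List[List[float]], k: int) -> List[List[float]]:
    """Step 2: pick k mutually distant centers (assumed farthest-point heuristic)."""
    centers = [vectors[0]]  # assumption: start from the first case
    while len(centers) < k:
        # Choose the case whose nearest existing center is farthest away.
        farthest = max(vectors, key=lambda v: min(euclidean(v, c) for c in centers))
        centers.append(farthest)
    return centers


def initial_dispersion(vectors: List[List[float]], centers: List[List[float]]) -> float:
    """Step 3: E_0 as the sum of distances from each case to its nearest center."""
    return sum(min(euclidean(v, c) for c in centers) for v in vectors)
```

For example, with the segmented cases `[["rent", "rent", "deposit"], ["rent", "deposit"], ["wage", "wage", "boss"]]` and k = 2, the two rent cases share one center and the wage case becomes the second center.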
5. The people's dispute mediation case retrieval and mediation strategy recommendation method as claimed in claim 3, characterized in that: in step (3.3), the TF_CDF values are updated once every several iterations, or a threshold on the change in the cluster centers is set and the update is performed when the change exceeds this threshold.
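The two alternative update schedules can be expressed as a small decision function. This is a sketch only: the name `should_update_tf_cdf`, its parameters, and the fallback of updating on every iteration when neither option is configured are assumptions.

```python
from typing import Optional


def should_update_tf_cdf(iteration: int,
                         center_shift: float,
                         every_n: Optional[int] = None,
                         shift_threshold: Optional[float] = None) -> bool:
    """Decide whether to recompute TF_CDF values in the current iteration."""
    if every_n is not None:
        # Option 1: update once every `every_n` iterations.
        return iteration % every_n == 0
    if shift_threshold is not None:
        # Option 2: update only when the cluster centers moved more than the threshold.
        return center_shift > shift_threshold
    return True  # assumed default: update on every iteration
```

Either option reduces how often the entropy and TF_CDF weights are recomputed, which is the expensive part of each clustering pass.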
6. The people's dispute mediation case retrieval and mediation strategy recommendation method as claimed in claim 1 or 2, characterized in that: in step (4), the desensitized typical case set is generated automatically as follows:
(4.1) Automatic desensitization
Use natural language processing techniques to automatically identify personal names, identity-document information, and address information, and replace the identified information in the original text with "so-and-so";
(4.2) Automatic generation of the typical case set
Generate a high-quality typical case set automatically by machine through case scoring: step 3 divides the cases into different classes, case scoring is carried out separately for each class, and the higher-scoring cases in each class are taken as the typical cases of that class.
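Steps (4.1) and (4.2) can be sketched as below. The claim calls for natural language processing to recognize sensitive entities; this sketch swaps in a much simpler stand-in (a regular expression for 18-character ID numbers plus a caller-supplied list of names), so it illustrates only the replacement and selection logic. The field names `label` and `score` and both function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumption: an ID number is 17 digits followed by a digit or X, not embedded in a longer number.
ID_PATTERN = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")


def desensitize(text: str, known_names: Iterable[str], placeholder: str = "so-and-so") -> str:
    """Replace ID numbers and known personal names with a placeholder."""
    text = ID_PATTERN.sub(placeholder, text)
    # Replace longer names first so that a short name never splits a longer one.
    for name in sorted(known_names, key=len, reverse=True):
        text = text.replace(name, placeholder)
    return text


def typical_cases(cases: List[dict], top_n: int = 1) -> Dict[str, List[dict]]:
    """Keep the top_n highest-scoring cases of every class as its typical cases."""
    by_class = defaultdict(list)
    for case in cases:
        by_class[case["label"]].append(case)
    return {
        label: sorted(group, key=lambda c: c["score"], reverse=True)[:top_n]
        for label, group in by_class.items()
    }
```

In a fuller pipeline the name list would come from a named-entity recognizer rather than being passed in by hand.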
7. The people's dispute mediation case retrieval and mediation strategy recommendation method as claimed in claim 6, characterized in that: in step (4.2), the case scoring model is created as follows:
1. Analyze the case template and divide it into the following modules: case summary, mediation process, mediation result description, and mediation reflections;
2. Analyze the text quality of each case module using natural language processing methods:
$$Q_t = aQ_l + bQ_p + cQ_s \tag{4}$$
where a, b, and c, with a + b + c = 1, are the weights of the individual quality scores; $Q_l$ is the word-count quality: if the length of the unsegmented text exceeds a threshold $T_l$, $Q_l$ is 1, otherwise $Q_l$ decays exponentially; $Q_p$ is the word repetition quality, based on the share of the text's total word count taken up by its most frequent word: $Q_p$ is 1 when this share is below a threshold $T_p$ and decays exponentially when the share exceeds the threshold; $Q_s$ is the ratio of the text length after word segmentation to the text length before word segmentation;
3. Assign each mediator a scoring weight: mediators who handle disputes frequently and whose cases have a high mediation success rate receive a higher mediator scoring weight $Q_h \in [0, 1]$;
4. Perform sentiment analysis on the mediation feedback to obtain the mediation feedback weight $Q_e$; assign scoring weights to positive and negative feedback, with positive feedback weighted higher;
5. Combining the above aspects, create the case scoring model $Q = (\alpha Q_h + \beta Q_e) Q_t$, where α is the weight of $Q_h$ in the case score, β is the weight of $Q_e$ in the case score, and α + β = 1; score the case information according to the case scoring model and automatically generate the typical case set with class labels.
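The scoring model can be sketched as follows. The claim does not give the exact exponential decay functions, thresholds, or weight values, so the decay forms, the defaults for `t_l`, `t_p`, `a`, `b`, `c`, and `alpha`, and all function names are assumptions chosen only to make the example concrete.

```python
import math
from collections import Counter
from typing import Sequence


def length_quality(raw_text: str, t_l: int = 200) -> float:
    """Q_l: 1 above the length threshold, assumed exponential decay below it."""
    n = len(raw_text)
    return 1.0 if n > t_l else math.exp(-(t_l - n) / t_l)


def repetition_quality(tokens: Sequence[str], t_p: float = 0.2) -> float:
    """Q_p: 1 while the most frequent word's share stays below T_p, then decays."""
    if not tokens:
        return 0.0
    share = max(Counter(tokens).values()) / len(tokens)
    return 1.0 if share < t_p else math.exp(-(share - t_p) / t_p)


def segmentation_ratio(raw_text: str, tokens: Sequence[str]) -> float:
    """Q_s: length of the text kept after segmentation over the original length."""
    if not raw_text:
        return 0.0
    return sum(len(t) for t in tokens) / len(raw_text)


def text_quality(raw_text: str, tokens: Sequence[str],
                 a: float = 0.4, b: float = 0.3, c: float = 0.3) -> float:
    """Formula (4): Q_t = a*Q_l + b*Q_p + c*Q_s with a + b + c = 1."""
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("a + b + c must equal 1")
    return (a * length_quality(raw_text)
            + b * repetition_quality(tokens)
            + c * segmentation_ratio(raw_text, tokens))


def case_score(q_h: float, q_e: float, q_t: float, alpha: float = 0.5) -> float:
    """Q = (alpha*Q_h + beta*Q_e) * Q_t with beta = 1 - alpha."""
    beta = 1.0 - alpha
    return (alpha * q_h + beta * q_e) * q_t
```

Because $Q_t$ multiplies the weighted sum, a case with poor text quality scores low even when the mediator weight and feedback weight are both high.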
8. The people's dispute mediation case retrieval and mediation strategy recommendation method as claimed in claim 1 or 2, characterized in that: in step (6), the index is created and the relevance is computed as follows:
(6.1) Index creation
Use ES to create a full-text index on the "dispute details", "mediation result description", and "mediation strategy recommendation" fields, where "dispute details" and "mediation result description" are fields contained in the original data and "mediation strategy recommendation" is the field computed in step 5;
(6.2) Relevance computation
Clustering in step 3 yields the classes, class labels, cluster centers, and TF_CDF values. This patent represents text vectors using TF_CDF weights: the search content Query is input and segmented into pn words, the text weights are represented with TF_CDF, and the text similarity is computed; if a word appears in a class label, the text relevance is raised accordingly.
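The relevance computation of step (6.2) can be sketched as below. The claim does not name the similarity measure or say by how much a class-label match raises the relevance, so the cosine similarity, the multiplicative `boost` factor, and the function names are assumptions for illustration.

```python
import math
from typing import Dict, Sequence, Set


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance(query_tokens: Sequence[str],
              doc_weights: Dict[str, float],
              tf_cdf_weights: Dict[str, float],
              class_labels: Set[str],
              boost: float = 0.1) -> float:
    """Similarity of a segmented query to one case, raised for class-label hits."""
    # Represent the query with the TF_CDF weights learned during clustering.
    query_vec: Dict[str, float] = {}
    for token in query_tokens:
        query_vec[token] = query_vec.get(token, 0.0) + tf_cdf_weights.get(token, 0.0)
    score = cosine(query_vec, doc_weights)
    # Assumption: each distinct query word found among the class labels adds `boost`.
    hits = sum(1 for token in set(query_tokens) if token in class_labels)
    return score * (1.0 + boost * hits)
```

In a deployment such a score could be combined with the full-text match returned by the index, although the claim does not specify how the two are merged.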
CN201710285854.2A 2017-04-27 2017-04-27 Searching and mediating strategy recommendation method for human-human contradiction mediating case Active CN107220295B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710285854.2A CN107220295B (en) 2017-04-27 2017-04-27 Searching and mediating strategy recommendation method for human-human contradiction mediating case

Publications (2)

Publication Number Publication Date
CN107220295A true CN107220295A (en) 2017-09-29
CN107220295B CN107220295B (en) 2020-02-07

Family

ID=59944639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710285854.2A Active CN107220295B (en) 2017-04-27 2017-04-27 Searching and mediating strategy recommendation method for human-human contradiction mediating case

Country Status (1)

Country Link
CN (1) CN107220295B (en)

Also Published As

Publication number Publication date
CN107220295B (en) 2020-02-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 310012 1st floor, building 1, 223 Yile Road, Hangzhou City, Zhejiang Province

Patentee after: Yinjiang Technology Co.,Ltd.

Address before: 310012 1st floor, building 1, 223 Yile Road, Hangzhou City, Zhejiang Province

Patentee before: ENJOYOR Co.,Ltd.