CN109726287A - People's mediation case classification system and method based on transfer learning and deep learning - Google Patents
- Publication number
- CN109726287A (application CN201811590341.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- people
- mediation
- auxiliary
- training
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The present invention relates to a people's mediation case classification system and method based on transfer learning and deep learning. The system comprises a data acquisition module, a feature extraction module, a feature transfer module and a network training module; its structure is simple and it is widely applicable. The method comprises: constructing a character vector matrix; vectorizing the auxiliary data; vectorizing the people's mediation data; feeding the vectorized auxiliary data into a neural network to extract auxiliary data features; transferring the extracted auxiliary data features to the vectorized people's mediation data; and training a classification model. The method can convert all texts effectively without ignoring low-frequency words, reduces dimensionality markedly, trains quickly, and is convenient for subsequent online iterative optimization. It also addresses the differences between the people's mediation domain and the auxiliary domain and meets the individual needs of a specific domain.
Description
Technical field
The present invention relates to the technical field of data processing and classification, and in particular to a people's mediation case classification system and method based on transfer learning and deep learning.
Background
At present, more than 9 million disputes are mediated in China every year, yet only some 20 dispute types are defined. As the economy and society develop, the number of cases grows and case types diversify. How to classify cases quickly and accurately, add new case types in time, and improve the efficiency of mediation work is a serious problem facing people's mediation. The current people's mediation case types have the following shortcomings: 1. the existing case types are few and cannot cover all disputes; 2. newly emerging dispute types cannot be distinguished from existing dispute types in time; 3. the existing dispute types are not refined into subcategories and cannot accurately reflect the key points of a dispute.
People's mediation case types have many subdivisions, and text classification technology can help extract type features accurately from massive data and classify cases automatically. Existing people's mediation data consists mainly of short texts, which are sparse, real-time, massive and non-standard. These characteristics cause the following difficulties for text classification: 1. Short texts contain few feature words. Representing them with the traditional term-based vector space model makes the vector space sparse; information such as word frequency and word co-occurrence frequency cannot be fully used, and latent semantic associations between words are lost. 2. Because short texts are non-standard, they contain irregular feature words and out-of-vocabulary words that the word segmentation dictionary cannot recognize, so traditional text preprocessing and text representation methods are not accurate enough. 3. Short-text data sets are huge, so the choice of classification algorithm tends toward eager (non-lazy) learning methods, because lazy learning methods incur excessive time complexity.
With the mass generation of short-text data, much exploration and practice has gone into short-text classification, but its application to the people's mediation field (highly specialized short texts) is still a blank. Patent application CN201710686945.7 proposes a short-text classification method that combines a composite dimensionality-reduction algorithm with a weighted under-sampling SVM algorithm. It addresses high-dimensional sparsity and class imbalance in text classification, but its multi-class classification accuracy is poor. Patent application CN201510271672.0 discloses a short-text classification method based on convolutional neural networks: pre-trained word vectors are used to expand the short text semantically, and a convolutional neural network extracts fixed-length semantic feature vectors, which strengthens the vectorized semantic representation and ultimately improves classification performance. However, in a vertical domain it is difficult for this method to expand the corpus with external auxiliary data. Because data in the people's mediation field is highly specialized and short, its features are hard to extract, and dispute types keep evolving, the present invention provides a text classification method based on transfer learning and deep learning.
Summary of the invention
To overcome the above shortcomings, the object of the present invention is to provide a people's mediation case classification system and method based on transfer learning and deep learning. The system comprises a data acquisition module, a feature extraction module, a feature transfer module and a network training module; its structure is simple and it is widely applicable. The method comprises constructing a character vector matrix, vectorizing the auxiliary data, vectorizing the people's mediation data, feeding the vectorized auxiliary data into a neural network to extract auxiliary data features, transferring the extracted auxiliary data features to the vectorized people's mediation data, and training a classification model. The method can convert all texts effectively without ignoring low-frequency words, reduces dimensionality markedly, trains quickly, and is convenient for subsequent online iterative optimization. It also addresses the differences between the people's mediation domain and the auxiliary domain and meets the individual needs of a specific domain.
The present invention achieves the above object through the following technical solution: a people's mediation case classification system based on transfer learning and deep learning, comprising a data acquisition module, a feature extraction module, a feature transfer module and a network training module. The data acquisition module collects people's mediation data and auxiliary data, and applies data cleaning and deduplication preprocessing to the collected data to form the auxiliary data set and the people's mediation data set. The feature extraction module uses a convolutional neural network to extract auxiliary data features and people's mediation data features. The feature transfer module transfers the generic features of the auxiliary data to a new neural network and applies them to people's mediation case classification. The network training module trains the convolutional neural network to obtain the final trained model.
A people's mediation case classification method based on transfer learning and deep learning comprises the following steps:
(1) Collect people's mediation data and auxiliary data, and preprocess both to obtain auxiliary data set A and people's mediation data set B.
(2) Construct a character vector matrix, vectorize the auxiliary data, feed the vectorized auxiliary data into a convolutional neural network and extract auxiliary data features. At the same time, retrain the convolutional neural network to obtain the auxiliary domain model, save the network structure of the auxiliary domain model as a .meta file, and save the network parameters as a .checkpoint file.
(3) Use transfer learning to transfer the extracted auxiliary data features to a new neural network, which is rebuilt from the network of the auxiliary domain model. Vectorize the people's mediation data, feed it into the resulting convolutional neural network, train the classifier, and obtain and save the final people's mediation classification model. Classify people's mediation cases with this model.
Preferably, step (1) is specifically as follows:
(1.1) Collect auxiliary data: collect long-text data related to the domain as auxiliary domain data.
(1.2) Collect people's mediation data: collect people's mediation data from recent years and assign category labels to it according to expert experience.
(1.3) Data cleaning: clean the collected auxiliary data by deleting interfering characters from the text and deleting texts that are too short; clean the collected people's mediation data by deleting low-quality and too-short records and deleting interfering characters from the text.
(1.4) Data deduplication: on the cleaned data, delete duplicate and near-duplicate records using any one or more of cosine similarity, Euclidean distance, Jaccard similarity, longest common substring and edit distance.
(1.5) Store the cleaned and deduplicated data in the data warehouse to obtain auxiliary data set A and people's mediation data set B.
Preferably, step (2) is specifically as follows:
(2.1) Construct the character vector matrix: split the texts of auxiliary data set A and people's mediation data set B into single characters and store them in a .txt file, one character per line. Let C be the character set used in the data, and construct the character vector matrix $Q \in \mathbb{R}^{|C| \times |C|}$.
(2.2) Text embedding: let the character sequence of a text be $[s_1, s_2, s_3, \dots, s_n]$, where $s_n$ is the n-th character of the text. The text vector $S \in \mathbb{R}^{n \times |C|}$ is constructed from the character sequence and the character vector matrix. Embedding the texts of auxiliary data set A therefore yields the text vector space $I \in \mathbb{R}^{(L \cdot n) \times |C|}$, where L is the number of texts in auxiliary data set A.
(2.3) Feed the text vector space I into the convolution layer and convolve the text matrix with filters. Let the filter size be h × n, where h is the number of characters in the convolution window. The feature $t_i$ output by the convolution is:
$$t_i = f(W \cdot S_{i:i+h-1} + b)$$
where $b \in \mathbb{R}$ is the bias term, $W \in \mathbb{R}^{h \times n}$ is the weight matrix of the convolution kernel, and f is the convolution kernel function. Applying the filter to the windows $\{S_{1:h}, S_{2:h+1}, \dots, S_{n-h+1:n}\}$ of a text gives the feature map T:
$$T = [t_1, t_2, t_3, t_4, \dots, t_{n-h+1}]$$
where $T \in \mathbb{R}^{n-h+1}$. The feature map is down-sampled by max-pooling, which keeps the most important feature $\hat{t} = \max\{T\}$. The feature vector V of the fully connected layer is then:
$$V = [\hat{t}_1, \hat{t}_2, \dots, \hat{t}_k]$$
where k is the number of convolution kernels. The output is normalized by a Softmax layer.
(2.4) Retrain the convolutional neural network on auxiliary data set A to obtain the auxiliary domain model, save the network structure of the auxiliary domain model as a .meta file, and save the network parameters as a .checkpoint file.
Preferably, the character vector matrix Q uses one-hot encoding: the diagonal elements are set to 1 and all other elements to 0, and each row vector of Q represents one character.
Preferably, training in step (2.4) is based on a cross-entropy objective, that is, the training objective function minimizes the cross entropy between the target probability distribution and the actual probability distribution. The training objective function J(θ) is defined as:
$$J(\theta) = -\frac{1}{l}\sum_{i=1}^{l}\log p(z_i^{*} \mid x_i, \theta) + \alpha\lVert\theta\rVert_2^2$$
where l is the number of training samples, α is the regularization factor, and $z_i^{*}$ is the correct class of sample $x_i$. Based on this objective function, the error of the samples is computed by gradient descent, and the hyperparameter set θ of the network structure is updated by back propagation. The update formula is:
$$\theta \leftarrow \theta - \lambda\frac{\partial J(\theta)}{\partial \theta}$$
where λ is the learning rate.
Preferably, the auxiliary domain model is obtained by training as follows:
(i) Divide auxiliary data set A into P equal parts. In turn, take several parts as the training set and the remaining parts as the validation set, and perform cross-validation. Take the average as the accuracy on auxiliary data set A, and save the training run with the highest accuracy as model M1.
(ii) Use the confusion matrix and the misclassification matrix to record the data that model M1 confuses when predicting the classes of auxiliary data set A and the number of misclassifications per class. If the analysis reveals data quality problems, clean the data further by semi-manual means and use the cleaned data as data set D. Each column of the confusion matrix represents a predicted class and each row represents an actual class.
(iii) Retrain the convolutional neural network on data set D and output the auxiliary domain model with better classification results.
Preferably, step (3) is specifically as follows:
(3.1) Construct the network graph: rebuild the neural network from the saved .meta file; its structure is identical to that of the auxiliary domain model.
(3.2) Feature transfer: according to the saved .checkpoint file, transfer the auxiliary domain model parameters to the neural network rebuilt in step (3.1). Vectorize people's mediation data set B with the character vector matrix, feed the vectorized people's mediation data into this convolutional neural network, and train the classifier.
(3.3) Train the network iteratively until the loss no longer decreases. Obtain and save the final people's mediation classification model, which also serves as the auxiliary domain model for the next round of transfer learning. Finally, classify people's mediation cases with the people's mediation classification model.
Preferably, step (3.2) is specifically as follows:
(3.2.1) Before transfer, check whether auxiliary data set A and people's mediation data set B have the same number of classes. If they do, execute step (3.2.2); if they do not, execute step (3.2.3).
(3.2.2) According to the saved .checkpoint file, restore and transfer all parameters of the auxiliary domain model. Vectorize people's mediation data set B with the character vector matrix, feed the vectorized people's mediation data into this convolutional neural network, and train the classifier.
(3.2.3) According to the saved .checkpoint file, restore and transfer only the convolution kernel weight matrices of the auxiliary domain model. Vectorize people's mediation data set B with the character vector matrix, feed the vectorized people's mediation data into this convolutional neural network, update the softmax parameters, and train the classifier.
Preferably, the auxiliary data is court judgment document data.
The beneficial effects of the present invention are: (1) the invention uses a character-level convolutional neural network text classification method, which can convert all texts effectively without ignoring low-frequency words, reduces dimensionality markedly, trains quickly, and is convenient for subsequent online iterative optimization; (2) using transfer learning, the invention transfers the features of the auxiliary domain data to the people's mediation data features, which addresses the difficulty of extracting features from short texts and improves the generalization ability of the model; (3) the technical solution is flexible for the people's mediation field: people's mediation disputes keep evolving, and for new disputes that appear later the invention can be transferred and applied quickly.
Brief description of the drawings
Fig. 1 is a flow diagram of the method of the present invention;
Fig. 2 is an example of the confusion matrix used by the present invention;
Fig. 3 is a flow block diagram of the transfer learning of the present invention.
Specific embodiment
The present invention is further described below with reference to a specific embodiment, but the scope of protection of the present invention is not limited to it:
Embodiment: a people's mediation case classification system based on transfer learning and deep learning comprises a data acquisition module, a feature extraction module, a feature transfer module and a network training module. The data acquisition module collects people's mediation data and auxiliary data, and applies data cleaning and deduplication preprocessing to the collected data to form the auxiliary data set and the people's mediation data set. The feature extraction module uses a convolutional neural network to extract auxiliary data features and people's mediation data features. The feature transfer module transfers the generic features of the auxiliary data to a new neural network and applies them to people's mediation case classification. The network training module trains the convolutional neural network to obtain the final trained model.
As shown in Fig. 1, a people's mediation case classification method based on transfer learning and deep learning comprises the following steps:
(1) Preprocessing of people's mediation data and auxiliary data
(1.1) Collect auxiliary data: collect data (long texts) related to the domain as auxiliary domain data. This embodiment collects nearly 100,000 court judgment documents as auxiliary data, covering 20 document types.
(1.2) Collect people's mediation data: this embodiment collects more than 60,000 people's mediation cases from the past three years and assigns category labels to them according to expert experience, 88 classes in total.
(1.3) Data cleaning: clean the collected auxiliary domain data by deleting interfering characters from the text and deleting texts that are too short; clean the collected people's mediation data by deleting low-quality and too-short records and deleting interfering characters from the text. This embodiment uses regular expressions to delete interfering characters such as times, dates, digits and special characters (e.g. \n, *) from the court judgment documents, and deletes judgment documents with fewer than 30 characters of content. Experts identify and delete people's mediation records whose case type is unclear; regular expressions delete interfering content such as times, dates, ID card numbers, addresses, telephone numbers and bank card numbers from the people's mediation data; and people's mediation records with fewer than 15 characters of content are deleted.
(1.4) Data deduplication: on the data cleaned in step (1.3), duplicate and near-duplicate records can be deleted with methods such as cosine similarity, Euclidean distance, Jaccard similarity, longest common substring and edit distance. This embodiment uses the Jaccard similarity algorithm to delete court judgment documents with a similarity coefficient greater than 0.8 and people's mediation cases with a similarity coefficient greater than 0.9.
(1.5) Store the cleaned and deduplicated data in the data warehouse to obtain court judgment document data set A and people's mediation data set B.
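The cleaning and deduplication of steps (1.3) and (1.4) can be sketched in Python as follows. This is a minimal illustration, not the embodiment's implementation: the regular expression `NOISE`, the choice to compare character sets, and the function names `clean`, `jaccard` and `dedupe` are all assumptions, since the text only names the kinds of noise removed and the similarity thresholds (0.8 and 0.9).

```python
import re

# Hypothetical noise pattern: the embodiment lists the kinds of noise
# (dates, digits, special characters) but not the actual expressions.
NOISE = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日|\d+|[*\n\r\t]")


def clean(text, min_len):
    """Strip noise; return None when the remaining text is shorter than min_len."""
    cleaned = NOISE.sub("", text).strip()
    return cleaned if len(cleaned) >= min_len else None


def jaccard(a, b):
    """Jaccard similarity of the character sets of two strings (an assumed tokenization)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def dedupe(texts, threshold):
    """Keep a text only if its similarity to every text kept so far is <= threshold."""
    kept = []
    for t in texts:
        if all(jaccard(t, k) <= threshold for k in kept):
            kept.append(t)
    return kept
```

Under these assumptions, `dedupe(judgment_texts, 0.8)` and `dedupe(mediation_texts, 0.9)` would mirror the thresholds of the embodiment. The pairwise comparison is quadratic in the number of texts, so a corpus of this size would likely need blocking or hashing in practice.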
(2) Extract auxiliary domain features with a convolutional neural network:
(2.1) Construct the character vector matrix: split the sentences of court judgment document data set A and people's mediation data set B into single characters, remove duplicate characters, and store them in the file vocab.txt, one character per line. In this embodiment the character set C used in the data has |C| = 5000 characters. Construct the character vector matrix $Q \in \mathbb{R}^{|C| \times |C|}$. Q uses one-hot encoding: the diagonal elements are set to 1 and all other elements to 0, and each row vector of Q represents one character.
(2.2) Text embedding: in this embodiment the fixed length of every record is set to 600. Records longer than 600 characters are truncated, and records shorter than 600 characters are padded with the uniform character <pad>. Let the character sequence of a text be $[s_1, s_2, s_3, \dots, s_n]$ (0 ≤ n ≤ 600), where $s_n$ is the n-th character of the text. The text vector $S \in \mathbb{R}^{600 \times |C|}$ is constructed from the character sequence and the character vector matrix. In the same way, embedding the texts of court judgment document data set A yields the text vector space $I \in \mathbb{R}^{(600 \cdot L) \times |C|}$, where L is the number of texts in court judgment document data set A.
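The one-hot embedding of steps (2.1) and (2.2) can be sketched as follows. The names `build_vocab`, `one_hot` and `embed` are illustrative, and placing `<pad>` at index 0 is an assumption; the text only states that short records are padded with a uniform character.

```python
PAD = "<pad>"


def build_vocab(texts):
    """Map every distinct character to a row index of Q; index 0 is the padding symbol."""
    chars = sorted({ch for t in texts for ch in t})
    return {ch: i for i, ch in enumerate([PAD] + chars)}


def one_hot(index, size):
    """Row `index` of the identity matrix Q (diagonal 1, all other elements 0)."""
    row = [0] * size
    row[index] = 1
    return row


def embed(text, vocab, length=600):
    """Return the text matrix S with `length` rows and |C| columns.

    Texts longer than `length` are truncated and shorter ones are padded.
    Every character is assumed to be present in `vocab`.
    """
    ids = [vocab[ch] for ch in text[:length]]
    ids += [vocab[PAD]] * (length - len(ids))
    return [one_hot(i, len(vocab)) for i in ids]
```

With |C| = 5000 and a length of 600, each text becomes a 600 × 5000 matrix with a single 1 per row, so a sparse or index-based representation is the usual practical choice; the dense lists here only make the construction explicit.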
(2.3) The network structure used in the present invention is shown in Table 1 below:
Table 1
The text vector space I output by step (2.2) is convolved by the convolution layer. Let the filter size be h × n, where h is the number of characters in the convolution window. The feature $t_i$ output by the convolution is:
$$t_i = f(W \cdot S_{i:i+h-1} + b)$$
where $b \in \mathbb{R}$ is the bias term, $W \in \mathbb{R}^{h \times n}$ is the weight matrix of the convolution kernel, and f is the convolution kernel function. Applying the filter to the windows $\{S_{1:h}, S_{2:h+1}, \dots, S_{n-h+1:n}\}$ of a text gives the feature map T:
$$T = [t_1, t_2, t_3, t_4, \dots, t_{n-h+1}]$$
where $T \in \mathbb{R}^{n-h+1}$. The feature map is down-sampled by max-pooling, which keeps the most important feature $\hat{t} = \max\{T\}$. The feature vector V of the fully connected layer is then:
$$V = [\hat{t}_1, \hat{t}_2, \dots, \hat{t}_k]$$
where k is the number of convolution kernels. The output is normalized by a Softmax layer, whose function has the form:
$$p(z_j \mid x_i, \theta) = \frac{\exp(\phi(x_i, z_j))}{\sum_{z \in Z}\exp(\phi(x_i, z))}$$
where $x_i$ is the input short text, $z_j$ is the j-th class, θ is the set of hyperparameters to be estimated in the convolutional neural network, Z is the predefined class set of the training samples, and $\phi(x_i, z_j)$ is the score the network assigns to sample $x_i$ on class $z_j$. That is, a multi-class logistic regression (softmax) classifier maps the scores to a probability distribution vector over all predefined classes, whose dimension equals the size of the predefined class set.
After many rounds of testing in this embodiment, the effect is best when the number of characters in the convolution window is h = 3, which generates the feature map T:
$$T = [t_1, t_2, t_3, t_4, \dots, t_{600}]$$
where $T \in \mathbb{R}^{600}$. The max-pooling layer takes the maximum value of each feature map; the maximum represents the most important signal. This pooling mode also handles input sentences of variable length, and the final output of the pooling layer is the maximum value in the convolution layer. To prevent vanishing gradients, this embodiment introduces the ReLU activation function in the first fully connected layer. In testing, SGD with ReLU converges much faster than with sigmoid/tanh. Its mathematical expression is:
$$f(x) = \mathbb{1}(x < 0)\,(ax) + \mathbb{1}(x \ge 0)\,(x)$$
where a is a very small constant. This both corrects the data distribution and retains some negative values, so that negative-axis information is not entirely lost. To prevent overfitting, this embodiment also introduces Dropout. Cross-validation shows that a hidden-node dropout rate of 0.5 works best, because at 0.5 dropout randomly generates the largest number of network structures. The second fully connected layer is normalized with Softmax and outputs the probability distribution of a court judgment document over the 20 classes.
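A minimal pure-Python sketch of the forward pass described in step (2.3) follows: a filter sliding over the text matrix, max-pooling over each feature map, the activation with a small negative slope, and softmax normalization. It is illustrative only. Here each filter W is assumed to have h rows and as many columns as the text matrix S (that is, |C|), the default kernel function `math.tanh` is an assumption, and no padding is applied, so each feature map has n − h + 1 entries.

```python
import math


def conv_features(S, W, b, f):
    """Feature map T with t_i = f(W · S[i:i+h] + b) for every window of h rows."""
    h = len(W)
    feats = []
    for i in range(len(S) - h + 1):
        z = sum(W[r][c] * S[i + r][c] for r in range(h) for c in range(len(W[r])))
        feats.append(f(z + b))
    return feats


def leaky_relu(x, a=0.01):
    """f(x) = a*x for x < 0 and x for x >= 0, with a small constant a."""
    return x if x >= 0 else a * x


def softmax(scores):
    """Map class scores to a probability distribution (shifted for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feature_vector(S, filters, f=math.tanh):
    """Max-pool each feature map to one value, giving V with k entries for k filters."""
    return [max(conv_features(S, W, b, f)) for W, b in filters]
```

Because max-pooling reduces every feature map to a single number, the length of V depends only on the number of filters k and not on the text length, which is why this pooling mode copes with inputs of variable length.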
(2.4) During iterative training, the training objective function used in this embodiment minimizes the cross entropy between the target probability distribution and the actual probability distribution. The training objective function J(θ) is defined as:
$$J(\theta) = -\frac{1}{l}\sum_{i=1}^{l}\log p(z_i^{*} \mid x_i, \theta) + \alpha\lVert\theta\rVert_2^2$$
where l is the number of training samples, α is the regularization factor, and $z_i^{*}$ is the correct class of sample $x_i$. Based on this objective function, the error of each batch of samples is computed by gradient descent, and the hyperparameter set θ of the network structure is updated by back propagation (BP). The update formula is:
$$\theta \leftarrow \theta - \lambda\frac{\partial J(\theta)}{\partial \theta}$$
where λ is the learning rate. In this embodiment, testing shows the best effect with α = 0.3 and λ = 1 × 10⁻³.
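The objective and update of step (2.4) can be illustrated on a single softmax layer. This is a sketch under stated assumptions, not the embodiment's training code: the regularization term is taken to be a squared L2 penalty, the only parameters are the softmax weights W, and the names `loss_and_grad` and `sgd_step` are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(W, samples, alpha):
    """Return (J, dJ/dW) for J = -(1/l) * sum(log p(y_i | v_i)) + alpha * ||W||^2.

    W holds one weight row per class; samples is a list of
    (feature_vector, correct_class_index) pairs.
    """
    l = len(samples)
    grad = [[2 * alpha * w for w in row] for row in W]   # gradient of the penalty
    J = alpha * sum(w * w for row in W for w in row)
    for v, y in samples:
        p = softmax([sum(w * x for w, x in zip(row, v)) for row in W])
        J -= math.log(p[y]) / l
        for j, row in enumerate(grad):
            coeff = (p[j] - (1.0 if j == y else 0.0)) / l
            for d, x in enumerate(v):
                row[d] += coeff * x
    return J, grad


def sgd_step(W, grad, lr):
    """One gradient descent update: W <- W - lr * dJ/dW."""
    return [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W, grad)]
```

In a full network the same gradient is propagated back through the fully connected and convolution layers; frameworks compute this automatically, so the explicit gradient here only serves to show what one update does.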
(2.5) Divide court judgment document data set A into 10 equal parts. In turn, take 9 parts as the training set and 1 part as the validation set, and perform cross-validation. Take the average as the accuracy on court judgment document data set A, and save the training run with the highest accuracy as model M1.
(2.6) Use the confusion matrix (each column represents a predicted class and each row an actual class) and the misclassification matrix to record the data that model M1 confuses when predicting the classes of court judgment document data set A and the number of misclassifications per class. If the analysis reveals data quality problems (for example, mislabelled or inaccurately classified judgment documents), clean the data further by semi-manual means to obtain court judgment document data set D. The confusion matrix is shown in Fig. 2.
(2.7) Retrain the convolutional neural network of step (2) on data set D and output a court judgment document model with better classification results (accuracy greater than 90%). This model serves as auxiliary domain model M2.
(2.8) Save the network structure of model M2 as my_model.meta and the network parameters as my_model.checkpoint.
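The 10-fold split of step (2.5) and the confusion matrix of step (2.6) can be sketched as follows. The function names are illustrative, and the interleaved assignment of samples to folds is an assumption; the text only requires equal parts.

```python
def k_fold_indices(n, k=10):
    """Yield (train, validation) index lists; each fold is the validation set once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def confusion_matrix(actual, predicted, num_classes):
    """Rows are actual classes and columns are predicted classes."""
    m = [[0] * num_classes for _ in range(num_classes)]
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m


def misclassified_per_class(matrix):
    """Off-diagonal row sums: how many samples of each actual class were misclassified."""
    return [sum(row) - row[i] for i, row in enumerate(matrix)]
```

Classes with large off-diagonal counts are natural candidates for the semi-manual review that produces data set D, since frequent confusion between two classes often points to labelling problems rather than model weakness.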
(3) Use transfer learning to apply the auxiliary data features to people's mediation case classification: vectorize the people's mediation data, feed it into the resulting convolutional neural network, train the classifier, and obtain and save the final people's mediation classification model. Classify people's mediation cases with this model. The flow block diagram of the transfer learning is shown in Fig. 3:
(3.1) Construct the network graph: rebuild the neural network from the my_model.meta file saved in step (2.8); its structure is identical to that of the court judgment document neural network.
(3.2) Feature transfer: according to the saved .checkpoint file, transfer the auxiliary domain model parameters to the neural network rebuilt in step (3.1). Vectorize people's mediation data set B with the character vector matrix, feed the vectorized people's mediation data into this convolutional neural network, and train the classifier.
(3.2.1) Check whether court judgment document data set A and people's mediation data set B have the same number of classes. If they do, execute step (3.2.2); if they do not, execute step (3.2.3).
(3.2.2) According to the .checkpoint file saved in step (2.8), restore all parameters of model M2. Vectorize people's mediation data set B with the character vector matrix output by step (2.1), feed the vectorized people's mediation data into this convolutional neural network, and train the classifier.
(3.2.3) According to the .checkpoint file saved in step (2.8), restore only the convolution kernel weight matrices of model M2. Vectorize people's mediation data set B with the character vector matrix output by step (2.1), feed the vectorized people's mediation data into this convolutional neural network, update the softmax parameters, and train the classification model.
(3.3) Train the network iteratively until the loss no longer decreases, and save the people's mediation classification model, which also serves as the auxiliary domain model for the next round of transfer learning.
Because the number of court judgment document types and the number of people's mediation types differ in this embodiment, the convolution kernel weight matrices of model M2 are restored, people's mediation data set B is vectorized with the character vector matrix output by step (2.1), the vectorized people's mediation data is fed into this convolutional neural network, the softmax parameters are updated (number of fine-grained people's mediation types: class = 88), the classification model is trained, and the people's mediation classification model M3 is saved.
As people's mediation informatization is promoted and applied, two situations can arise:
1. The amount of people's mediation data keeps growing, while the dispute types do not change in the short term. In this case all parameters of model M3 are transferred to the new people's mediation data to improve classification accuracy.
2. As the informatization of people's mediation matures, the amount of data keeps growing and new dispute types may appear. In this case the convolution kernel weights of model M3 are transferred to the new people's mediation data and the softmax parameters are updated (for the new number of people's mediation types), which avoids training from scratch.
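The choice between the two transfer modes (all parameters when the class counts match, convolution weights only when they differ) can be sketched as follows. The parameter layout, a dictionary with `"conv"` and `"softmax"` entries, and the small random initialization of the new softmax layer are assumptions made for illustration; the embodiment restores parameters from the saved .checkpoint file instead.

```python
import random


def transfer(source_params, num_target_classes, seed=0):
    """Build the parameters of the new model from a trained source model.

    source_params is a hypothetical dict with 'conv' (filter weight rows)
    and 'softmax' (one weight row per source class).
    """
    # Convolution kernel weights are always carried over.
    target = {"conv": [row[:] for row in source_params["conv"]]}
    if len(source_params["softmax"]) == num_target_classes:
        # Same number of classes: transfer all parameters.
        target["softmax"] = [row[:] for row in source_params["softmax"]]
    else:
        # Different number of classes: start a fresh softmax layer of the new size.
        rng = random.Random(seed)
        width = len(source_params["softmax"][0])
        target["softmax"] = [[rng.gauss(0.0, 0.01) for _ in range(width)]
                             for _ in range(num_target_classes)]
    return target
```

For the embodiment's numbers, transferring from the 20-class judgment document model to the 88-class people's mediation task would take the second branch, and a later retraining of M3 on more data with unchanged dispute types would take the first.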
The above describes a specific embodiment of the present invention and the technical principles it uses. Any change made according to the conception of the present invention, where the resulting function does not go beyond the spirit covered by the specification and drawings, shall fall within the scope of protection of the present invention.
Claims (10)
1. A people's mediation case classification method based on transfer learning and deep learning, characterized by comprising the following steps:
(1) collecting people's mediation data and auxiliary data, and preprocessing both to obtain an auxiliary data set A and a people's mediation data set B;
(2) constructing a character vector matrix, vectorizing the auxiliary data, and inputting the vectorized auxiliary data into a convolutional neural network to extract auxiliary data features; at the same time, retraining the convolutional neural network to obtain an auxiliary-domain model, saving the network structure of the auxiliary-domain model as a .meta file and the network parameters as a .checkpoint file;
(3) transferring the extracted auxiliary data features into a new neural network using transfer learning, the new neural network being rebuilt from the network of the auxiliary-domain model; vectorizing the people's mediation data and inputting them into the resulting convolutional neural network, training the classifier model, and obtaining and saving the final people's mediation classification model; and classifying people's mediation cases with this people's mediation classification model.
2. The people's mediation case classification method based on transfer learning and deep learning according to claim 1, characterized in that step (1) is specifically as follows:
(1.1) collecting auxiliary data: collecting long-text data relevant to the domain as auxiliary-domain data;
(1.2) collecting people's mediation data: collecting people's mediation data from recent years and labelling them with category labels according to expert knowledge;
(1.3) data cleaning: cleaning the collected auxiliary data by deleting interfering characters in the text and deleting overly short records; cleaning the collected people's mediation data by deleting low-quality and overly short records and deleting interfering characters in the text;
(1.4) data deduplication: on the cleaned data, deleting duplicate and near-duplicate records using any one or more of cosine similarity, Euclidean distance, Jaccard similarity, longest common substring, and edit distance;
(1.5) storing the cleaned and deduplicated data in a data warehouse to obtain the auxiliary data set A and the people's mediation data set B.
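The cleaning and deduplication steps (1.3) and (1.4) can be sketched roughly as follows. This is a minimal illustration using only Jaccard similarity over character bigrams; the function names, the similarity threshold of 0.8, and the minimum length of 5 characters are assumptions made for the example and are not specified by the claims.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of a text (hypothetical helper)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts, threshold=0.8, min_length=5):
    """Drop overly short texts and near-duplicates of an earlier kept text.

    threshold and min_length are illustrative values, not taken from the patent.
    """
    kept, kept_grams = [], []
    for text in texts:
        if len(text) < min_length:
            continue  # cleaning step: discard overly short records
        grams = char_ngrams(text)
        if any(jaccard(grams, g) >= threshold for g in kept_grams):
            continue  # near-duplicate of a record that was already kept
        kept.append(text)
        kept_grams.append(grams)
    return kept
```

The pairwise comparison is quadratic in the number of records, which is acceptable for a sketch; a large corpus would likely need an approximate method instead.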
3. The people's mediation case classification method based on transfer learning and deep learning according to claim 1, characterized in that step (2) is specifically as follows:
(2.1) constructing the character vector matrix: splitting the text of auxiliary data set A and people's mediation data set B into single characters and storing them in a .txt file, one character per line; letting C be the character set used in the data, constructing the character vector matrix Q ∈ R^{|C|×|C|};
(2.2) text embedding: letting the character sequence of a text be [s_1, s_2, s_3, …, s_n], where s_n is the n-th character of the text, constructing the text vector S ∈ R^{n×|C|} from the character sequence and the character vector matrix; embedding the texts of auxiliary data set A thus yields the final text vector space I ∈ R^{|L*n|×|C|}, where L is the total number of texts in auxiliary data set A;
(2.3) inputting the output text vector space I into the convolution layer and convolving the text matrix with filters; if the filter size is h × n, where h is the number of characters in the convolution kernel window, the feature t_i output by the convolution operation is:
t_i = f(W · S_{i:i+h-1} + b)
where b ∈ R is the bias term, W ∈ R^{h×n} is the weight matrix of the convolution kernel, and f is the convolution kernel function; applying the filter to the windows {S_{1:h}, S_{2:h+1}, …, S_{n-h+1:n}} of one text yields the feature T:
T = [t_1, t_2, t_3, t_4, …, t_{n-h+1}]
where T ∈ R^{n-h+1}; the feature is down-sampled by the max-pooling algorithm, retaining the most important feature \hat{t} = max(T); the feature vector V of the fully connected layer is then:
V = [\hat{t}_1, \hat{t}_2, …, \hat{t}_k]
where k is the number of convolution kernels; the output is normalized by the softmax layer;
(2.4) retraining the convolutional neural network on auxiliary data set A to obtain the auxiliary-domain model, saving the network structure of the auxiliary-domain model as a .meta file and the network parameters as a .checkpoint file.
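The convolution and max-pooling of step (2.3) can be illustrated with a small NumPy sketch. It assumes that each kernel spans the full width d of the text matrix (so W has shape h × d), that f is a ReLU, and that the function names are hypothetical; the claim does not fix any of these choices.

```python
import numpy as np


def conv_max_pool(S, kernels, biases):
    """Apply k convolution kernels to a text matrix and max-pool each feature map.

    S:       (n, d) text matrix, one row per character.
    kernels: list of k arrays of shape (h, d); h may differ per kernel.
    biases:  list of k scalars.
    Returns the pooled feature vector V of length k.
    """
    n = S.shape[0]
    pooled = []
    for W, b in zip(kernels, biases):
        h = W.shape[0]
        # t_i = f(W . S[i:i+h] + b), with f assumed to be ReLU
        T = np.array([max(0.0, float(np.sum(W * S[i:i + h]) + b))
                      for i in range(n - h + 1)])
        pooled.append(T.max())  # max-pooling keeps the strongest response
    return np.array(pooled)


def softmax(z):
    """Numerically stable softmax used to normalize the output layer."""
    e = np.exp(z - z.max())
    return e / e.sum()
```

In a full model, V would first pass through the fully connected layer's weights before the softmax; that step is omitted here to keep the sketch short.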
4. The people's mediation case classification method based on transfer learning and deep learning according to claim 3, characterized in that the character vector matrix Q uses one-hot encoding: the diagonal elements are set to 1 and all other elements to 0, so that each row vector of matrix Q represents one character.
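As a rough illustration of the one-hot character matrix Q and the text embedding it induces, the following sketch builds Q as an identity matrix over the character set and looks up one row per character. The function names and the sorted character ordering are assumptions for the example.

```python
import numpy as np


def build_char_matrix(texts):
    """Build a character index and the one-hot matrix Q of shape (|C|, |C|)."""
    chars = sorted(set("".join(texts)))          # character set C
    index = {ch: i for i, ch in enumerate(chars)}
    Q = np.eye(len(chars), dtype=np.float32)     # diagonal 1, elsewhere 0
    return index, Q


def embed_text(text, index, Q):
    """Return the text matrix S of shape (n, |C|): one row of Q per character."""
    return Q[[index[ch] for ch in text]]
```

For example, `embed_text("abca", *build_char_matrix(["abca", "cd"]))` gives a 4 × 4 matrix whose first and last rows are identical, because both encode the character `a`.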
5. The people's mediation case classification method based on transfer learning and deep learning according to claim 3, characterized in that the training in step (2.4) is based on a cross-entropy training objective function, that is, the objective is to minimize the cross entropy between the target probability distribution and the actual probability distribution; in the training objective function J(θ), l is the number of training samples, α is the regularization factor, and each sample x_i is paired with its correct class; based on this training objective function, the error of the samples is computed by gradient descent, and the parameter set θ of the network is updated by backpropagation according to:
θ ← θ - λ · ∂J(θ)/∂θ
where λ is the learning rate.
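A regularized cross-entropy objective and one gradient-descent update can be sketched as below. The sketch assumes a plain softmax classifier on fixed feature vectors and an L2 penalty α‖θ‖²; the exact form of the regularization term is an assumption, since the text only names a regularization factor α.

```python
import numpy as np


def objective_and_grad(theta, X, y, alpha):
    """Mean cross-entropy plus an (assumed) L2 penalty, with its gradient.

    theta: (d, c) weights, X: (l, d) features, y: (l,) correct class indices.
    """
    l = X.shape[0]
    logits = X @ theta
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(l), y]).mean() + alpha * np.sum(theta ** 2)
    probs[np.arange(l), y] -= 1.0                 # dJ/dlogits = p - onehot(y)
    grad = X.T @ probs / l + 2.0 * alpha * theta
    return loss, grad


def gradient_step(theta, X, y, alpha=1e-3, lam=0.1):
    """One update theta <- theta - lam * dJ/dtheta; lam is the learning rate."""
    loss, grad = objective_and_grad(theta, X, y, alpha)
    return theta - lam * grad, loss
```

In the convolutional network the same update is applied to every trainable tensor, with the gradients obtained by backpropagation rather than the closed form used here.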
6. The people's mediation case classification method based on transfer learning and deep learning according to claim 1, characterized in that the auxiliary-domain model is obtained by training as follows:
(i) dividing auxiliary data set A into P equal parts, successively taking several parts as the training set and the remaining parts as the validation set for cross-validation, using the average as the accuracy on auxiliary data set A, and saving the model from the most accurate training run as model M1;
(ii) using a confusion matrix and a misclassification matrix to record the records whose class model M1 confuses when predicting on auxiliary data set A and the number of misclassifications per class; if analysis reveals data quality problems, performing further semi-manual data cleaning, the cleaned data forming data set D; each column of the confusion matrix represents a predicted class and each row represents an actual class;
(iii) retraining the convolutional neural network on data set D and outputting the auxiliary-domain model with the better classification results.
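Steps (i) and (ii) can be sketched as a P-fold split plus a confusion matrix. The sketch holds out exactly one part per round, which is one possible reading of "several parts"; the function names and the fixed shuffling seed are assumptions.

```python
import numpy as np


def k_fold_indices(num_samples, p, seed=0):
    """Yield (train_idx, val_idx) pairs, holding out one of p parts each round."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(num_samples), p)
    for i in range(p):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(p) if j != i])
        yield train, val


def confusion_matrix(actual, predicted, num_classes):
    """Rows are actual classes, columns are predicted classes."""
    M = np.zeros((num_classes, num_classes), dtype=int)
    for a, pr in zip(actual, predicted):
        M[a, pr] += 1
    return M


def misclassified_counts(M):
    """Number of misclassified samples per actual class (row sum minus diagonal)."""
    return M.sum(axis=1) - np.diag(M)
```

Large off-diagonal entries point at class pairs the model confuses, which is where the semi-manual cleaning of step (ii) would start.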
7. The people's mediation case classification method based on transfer learning and deep learning according to claim 1, characterized in that step (3) is specifically as follows:
(3.1) constructing the network graph: rebuilding the neural network from the saved .meta file, the network structure being identical to that of the auxiliary-domain model;
(3.2) feature transfer: according to the saved .checkpoint file, transferring the auxiliary-domain model parameters into the neural network rebuilt in step (3.1); vectorizing people's mediation data set B according to the character vector matrix, inputting the vectorized people's mediation data into this convolutional neural network, and training the classifier model;
(3.3) iteratively training the network until the loss value no longer decreases, obtaining and saving the final people's mediation classification model, which serves as the auxiliary-domain model for the next transfer learning; finally, classifying people's mediation cases with this people's mediation classification model.
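The stopping rule in step (3.3), training until the loss no longer decreases, could be implemented along these lines. The `step` callable, the patience of 3 iterations, and the minimum improvement of 1e-4 are hypothetical choices; the claim only states the stopping condition itself.

```python
def train_until_plateau(step, max_iters=1000, patience=3, min_delta=1e-4):
    """Call step() repeatedly and stop once the loss has stopped improving.

    step:      callable that runs one training iteration and returns the loss.
    patience:  number of consecutive non-improving iterations tolerated.
    min_delta: smallest decrease that still counts as an improvement.
    Returns the best loss seen and the full loss history.
    """
    best = float("inf")
    stale = 0
    history = []
    for _ in range(max_iters):
        loss = step()
        history.append(loss)
        if best - loss > min_delta:
            best, stale = loss, 0   # loss improved: reset the counter
        else:
            stale += 1
            if stale >= patience:
                break               # loss has plateaued
    return best, history
```

A small patience guards against stopping on a single noisy iteration; the model saved at the best loss would then serve as the auxiliary-domain model for the next round of transfer.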
8. The people's mediation case classification method based on transfer learning and deep learning according to claim 7, characterized in that step (3.2) is specifically as follows:
(3.2.1) before the transfer, determining whether auxiliary data set A and people's mediation data set B have the same number of classes: if the numbers of classes are the same, executing step (3.2.2); if they differ, executing step (3.2.3);
(3.2.2) according to the saved .checkpoint file, restoring and transferring all parameters of the auxiliary-domain model; vectorizing people's mediation data set B according to the character vector matrix, inputting the vectorized people's mediation data into this convolutional neural network, and training the classifier model;
(3.2.3) according to the saved .checkpoint file, restoring and transferring the weight matrices of the convolution kernels of the auxiliary-domain model; vectorizing people's mediation data set B according to the character vector matrix, inputting the vectorized people's mediation data into this convolutional neural network, updating the softmax parameters, and training the classifier model.
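The branching in steps (3.2.1) to (3.2.3) amounts to reusing all parameters when the class counts match and reusing only the convolution kernels otherwise. The sketch below uses a plain dictionary of NumPy arrays as a stand-in for the restored checkpoint; the key names, the initialization scale of 0.01, and the function name are assumptions, not the patent's file format.

```python
import numpy as np


def transfer_parameters(source, num_target_classes, seed=0):
    """Build target-model parameters from an auxiliary-domain model.

    source: dict with 'conv_kernels' (list of arrays), 'softmax_W' of shape
            (features, classes) and 'softmax_b' of shape (classes,).
    """
    # Convolution kernel weights are transferred in both cases.
    target = {"conv_kernels": [W.copy() for W in source["conv_kernels"]]}
    num_features, num_source_classes = source["softmax_W"].shape
    if num_source_classes == num_target_classes:
        # Same number of classes: transfer every parameter.
        target["softmax_W"] = source["softmax_W"].copy()
        target["softmax_b"] = source["softmax_b"].copy()
    else:
        # Different number of classes: re-initialize the softmax layer only.
        rng = np.random.default_rng(seed)
        target["softmax_W"] = rng.normal(
            0.0, 0.01, (num_features, num_target_classes))
        target["softmax_b"] = np.zeros(num_target_classes)
    return target
```

Either way, training then continues on the vectorized people's mediation data, so the transferred kernels are fine-tuned rather than frozen.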
9. The people's mediation case classification method based on transfer learning and deep learning according to any one of claims 1 to 8, characterized in that the auxiliary data are court judgment document data.
10. A people's mediation case classification system based on transfer learning and deep learning, characterized by comprising: a data acquisition module, a feature extraction module, a feature transfer module, and a network training module; the data acquisition module is used to collect people's mediation data and auxiliary data and to preprocess the collected data by cleaning and deduplication, forming an auxiliary data set and a people's mediation data set; the feature extraction module uses a convolutional neural network to extract auxiliary data features and people's mediation data features; the feature transfer module is used to transfer the general features of the auxiliary data into a new neural network and apply them to people's mediation case classification; the network training module is used to train the convolutional neural network to obtain the final trained model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811590341.3A CN109726287A (en) | 2018-12-25 | 2018-12-25 | A kind of people's mediation case classification system and method based on transfer learning and deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109726287A true CN109726287A (en) | 2019-05-07 |
Family
ID=66297111
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811590341.3A Pending CN109726287A (en) | 2018-12-25 | 2018-12-25 | A kind of people's mediation case classification system and method based on transfer learning and deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109726287A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104834747A (en) * | 2015-05-25 | 2015-08-12 | 中国科学院自动化研究所 | Short text classification method based on convolutional neural network |
Non-Patent Citations (4)
Title |
---|
OKI SAPUTRA JAYA et al.: "Analysis of Convolution Neural Network for Transfer Learning of Sentiment Analysis in Indonesian Tweets", 《DSIT '18: PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON DATA SCIENCE AND INFORMATION TECHNOLOGY》 * |
SUIN SEO et al.: "Offensive Sentence Classification Using Character-Level CNN and Transfer Learning with Fake Sentences", 《INTERNATIONAL CONFERENCE ON NEURAL INFORMATION PROCESSING》 * |
JIN JIAJIA: "Research and Application of Short Text Classification Algorithms Based on Deep Learning", 《China Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology》 * |
CHEN ZHAO: "Research on Sentiment Analysis Methods for Chinese Text", 《Wanfang Data Knowledge Service Platform》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110196911A (en) * | 2019-06-06 | 2019-09-03 | 申林森 | A kind of people's livelihood data automatic classification management system |
CN110196911B (en) * | 2019-06-06 | 2022-04-22 | 申林森 | Automatic classification management system for civil data |
CN110825872A (en) * | 2019-09-11 | 2020-02-21 | 成都数之联科技有限公司 | Method and system for extracting and classifying litigation request information |
CN111046177A (en) * | 2019-11-26 | 2020-04-21 | 方正璞华软件(武汉)股份有限公司 | Automatic arbitration case prejudging method and device |
CN111601418A (en) * | 2020-05-25 | 2020-08-28 | 博彦集智科技有限公司 | Color temperature adjusting method and device, storage medium and processor |
CN113901781A (en) * | 2021-09-15 | 2022-01-07 | 昆明理工大学 | Similar case matching method for fusing segmented coding and affine mechanism |
CN113901781B (en) * | 2021-09-15 | 2024-04-26 | 昆明理工大学 | Similar case matching method integrating segment coding and affine mechanism |
CN116843162A (en) * | 2023-08-28 | 2023-10-03 | 之江实验室 | Contradiction reconciliation scheme recommendation and scoring system and method |
CN116843162B (en) * | 2023-08-28 | 2024-02-09 | 之江实验室 | Contradiction reconciliation scheme recommendation and scoring system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190507 |