CN110110087A

CN110110087A - A feature engineering method for legal text classification based on binary classifiers

Info

Publication number: CN110110087A
Application number: CN201910401645.9A
Authority: CN
Inventors: 段强; 李锐; 尹青山
Original assignee: Jinan Inspur Hi Tech Investment and Development Co Ltd
Current assignee: Jinan Inspur Hi Tech Investment and Development Co Ltd
Priority date: 2019-05-15
Filing date: 2019-05-15
Publication date: 2019-08-09

Abstract

The present invention provides a feature engineering method for legal text classification based on binary classifiers, and belongs to the technical field of natural language processing. The invention uses existing text vectorization methods to build binary classifiers that extract key information, and then constructs the feature engineering required for machine learning from the results extracted by a series of such binary classifiers. This turns the key information in the text that influences a judgment into explicit features. Each piece of key information is judged by a binary classifier, which offers good accuracy. Combining the different features into a feature engineering vector then yields an accurate and clear vector description of the legal text. Finally, an existing multi-class classifier can be used for classification, and the results can be used to assist legal decision-making, legal research, and so on.

Description

A feature engineering method for legal text classification based on binary classifiers

Technical field

The present invention relates to natural language processing technology, and in particular to a feature engineering method for legal text classification based on binary classifiers.

Background art

Common text vectorization methods in natural language processing currently include One-Hot, TFIDF and Word2vec. Converting text into a vector representation helps computer algorithms understand and learn text features in order to complete text classification. In general, the clearer the meaning of the vector representation, the better the classification result. The One-Hot and TFIDF models are both based on the bag-of-words (BOW) model and its derivatives. They focus mainly on whether a word occurs and how often, and downplay the connections within the text context, so they are better suited to retrieval than to natural language understanding and classification. Word2vec is an algorithm that converts each word into vector form through a neural network. Compared with traditional word vector methods, Word2vec first of all does not represent vectors with a sparse matrix, so it does not suffer from the curse of dimensionality on large-scale text; in addition, it uses the CBOW (continuous bag-of-words) and Skip-gram models, which take the context around a word into account. In legal text classification, however, because of the bottleneck of Chinese word segmentation and the strong logical relationships within the context of legal documents, vectorization with Word2vec alone cannot extract the text information well. In multi-class tasks in particular, incomplete extraction of the effective information leads to a marked drop in classification accuracy.

The rise of big data applications has opened up new ideas and provided new methods for all industries. In legal research and legal practice, the accumulated data is large in volume, quickly updated, and unbalanced across samples. Using artificial intelligence and machine learning to process large amounts of legal text, and thereby to assist judgments and improve the efficiency of legal decision-making, is becoming increasingly valuable.

Traditional natural language processing methods segment the text and then encode it with methods such as One-Hot or TFIDF. In essence, these methods are based on the bag-of-words model and compute summary statistics of word occurrence counts and frequencies; they cannot directly give machine learning the ability to understand legal text and extract key information to assist a judgment. The Word2vec algorithm converts a word into a fixed-length vector through a neural network. Each vector can be regarded as a point in a high-dimensional space, so the method can capture the relatedness between words, and it is currently one of the better-performing text vectorization algorithms. In the practice of legal document classification, however, document formats are fairly standardized and the documents are highly similar, the task is multi-class, and the sample distribution is quite unbalanced. As a result, Word2vec cannot bring out well the key information that helps the judgment.

Feature engineering refers to using a series of engineering methods to select better data features from raw data in order to improve the training effect of a model; it generally includes feature construction and feature selection. In natural language processing, feature engineering also includes text segmentation, stop-word removal and vectorization. Common text vectorization methods are the bag-of-words-based One-Hot and TFIDF methods and the Word2vec method, which uses a distributed representation. Because Chinese word segmentation introduces errors, and legal texts contain a large amount of colloquial description of the course of a case, the final text analysis performance of machine learning is affected. Some very fine-grained details are hard to extract distinctly, yet the classification and characterization of a case often depend on exactly these key details. Accuracy in multi-class legal text classification is therefore significantly affected.

The existing technology has the following disadvantages:

1. Legal documents often describe the details of a case in colloquial language, and the key information is hidden fairly deep. The ambiguity of Chinese word segmentation therefore affects the machine learning performance of text classification.

2. Legal documents tend to follow a relatively fixed descriptive pattern with a limited vocabulary, so the vectors that common algorithms produce for different texts are quite similar, which makes it hard for machine learning to distinguish their classes.

Summary of the invention

In view of the strengths and weaknesses of existing text vectorization methods, and taking into account the characteristics of the legal text classification task, the present invention proposes a feature engineering method for legal text classification based on binary classifiers. Existing text vectorization methods are used to build binary classifiers that extract key information, and the results extracted by a series of binary classifiers are then used to construct the feature engineering required for machine learning. This turns the key information in the text that influences a judgment into explicit features. Each piece of key information is judged by a binary classifier, which offers good accuracy. Combining the different features into a feature engineering vector then yields an accurate and clear vector description of the legal text. Finally, an existing multi-class classifier can be used for classification, and the results can be used to assist legal decision-making, legal research, and so on.

The proposed feature engineering method for legal text classification based on binary classifiers does not simply feed the result of text vectorization into the training of a multi-class model. Instead, the vectorization result is first used to train simpler binary classification models. A series of binary classification results indicates whether a given criminal act or piece of key crime information is present, and a feature vector is then constructed from these results and used in the multi-class classification task for legal text. Practice shows that this method reduces the influence of inaccurate Chinese word segmentation on model training, increases the key information carried by the vector, and improves model accuracy.

The operating steps are as follows:

Step 1: obtain the training data set. Legal judgment documents can be collected manually and annotated according to the judgment result to obtain the corresponding charge labels, or labeled legal texts can be obtained in another equivalent way.

Step 2: perform Chinese word segmentation and remove stop words. An open-source library such as Jieba can be used to cut the whole legal text into words, and the predefined stop words are then removed by checking against a stop-word dictionary.

Step 3: vectorize the text. Existing techniques such as One-Hot, TFIDF or Word2vec are used to vectorize the segmented vocabulary and to produce a feature vector for each text.

Step 4: define the behavior labels and train the corresponding binary classifiers. The behavior features to be extracted are defined and annotated, and the annotations are used to train the binary classifiers.

Step 5: train the multi-class classifier. The binary-classifier labels corresponding to each text in the previous step are used as input to train the multi-class classifier; the labels for this step are the charge labels obtained in step 1.
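The five training steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: the toy English corpus, the stop-word list, the behavior templates, the perceptron used as binary classifier and the nearest-centroid rule used as multi-class classifier are all stand-ins chosen so that the example runs with the standard library alone. A whitespace split takes the place of a Chinese segmenter such as Jieba.

```python
from collections import Counter

# Step 1: toy labelled corpus (hypothetical stand-in for judgment documents).
CORPUS = [
    ("official accepted bribe money", "bribery"),
    ("official took bribe cash", "bribery"),
    ("defendant stole car at night", "theft"),
    ("defendant stole phone in shop", "theft"),
]
STOP_WORDS = {"at", "in"}  # hypothetical stop-word dictionary

# Behavior labels and one fixed 0/1 template per charge (illustrative values).
BEHAVIORS = ["involving state organs", "offering or accepting bribes", "illegal encroachment"]
TEMPLATES = {"bribery": [1, 1, 0], "theft": [0, 0, 1]}


def preprocess(text):
    """Step 2: segment (whitespace split as a stand-in) and drop stop words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def vectorize(tokens, vocab):
    """Step 3: bag-of-words counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def predict_binary(model, x):
    """Return 1 if the behavior is judged present, else 0."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_perceptron(X, y, epochs=20):
    """Step 4: train one binary classifier; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            delta = target - predict_binary((w, b), x)
            if delta:
                w = [wi + delta * xi for wi, xi in zip(w, x)]
                b += delta
    return w, b


def train_centroids(features, charges):
    """Step 5: multi-class stand-in; the mean behavior vector of each charge."""
    groups = {}
    for f, charge in zip(features, charges):
        groups.setdefault(charge, []).append(f)
    return {c: [sum(col) / len(rows) for col in zip(*rows)] for c, rows in groups.items()}


def predict_charge(centroids, f):
    """Pick the charge whose centroid is closest to behavior vector f."""
    def dist(c):
        return sum((a - v) ** 2 for a, v in zip(centroids[c], f))
    return min(centroids, key=dist)


token_lists = [preprocess(text) for text, _ in CORPUS]
charges = [charge for _, charge in CORPUS]
vocab = sorted({tok for toks in token_lists for tok in toks})
X = [vectorize(toks, vocab) for toks in token_lists]

# One binary classifier per behavior label, trained on the same text vectors.
binary_models = [
    train_perceptron(X, [TEMPLATES[c][i] for c in charges])
    for i in range(len(BEHAVIORS))
]
# The sub-classifier outputs form the behavior feature vector of each text.
behavior_vectors = [[predict_binary(m, x) for m in binary_models] for x in X]
centroids = train_centroids(behavior_vectors, charges)
```

In a real setting the stand-ins would be replaced by a proper segmenter and by library classifiers. The part taken from the description is the structure: one binary classifier per behavior label, followed by a multi-class classifier trained on the resulting behavior vectors with the charge labels from step 1.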

At this point, the multi-class classifier for legal text has been built. If a new text needs to be predicted and classified, the corresponding steps are as follows:

Step 1: perform Chinese word segmentation and remove stop words. An open-source library such as Jieba can be used to cut the whole legal text into words, and the predefined stop words are then removed by checking against a stop-word dictionary.

Step 2: vectorize the text. The vectorization model used during training is applied to the segmented vocabulary to produce a feature vector for each text.

Step 3: apply each of the binary classifiers in turn to the vector from step 2 to obtain the behavior feature vector.

Step 4: use the behavior feature vector corresponding to each text from the previous step as the input of the multi-class classifier to predict the charge.
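The four prediction steps can be written as one short function. The sketch below assumes that the trained components are passed in as plain callables; the function name `classify_new_text`, its parameters and the toy stand-ins are hypothetical and serve only to make the flow concrete.

```python
def classify_new_text(text, preprocess, vectorize, binary_classifiers, multi_classifier):
    """Run the four prediction steps on one new legal text.

    The four components are assumed to come from training; their names and
    call signatures are hypothetical.
    """
    tokens = preprocess(text)                                      # step 1
    vector = vectorize(tokens)                                     # step 2
    behavior_vector = [clf(vector) for clf in binary_classifiers]  # step 3
    charge = multi_classifier(behavior_vector)                     # step 4
    return charge, behavior_vector


# Toy stand-ins so that the sketch runs end to end.
VOCAB = ["bribe", "official", "stole"]


def toy_preprocess(text):
    return text.split()


def toy_vectorize(tokens):
    return [tokens.count(word) for word in VOCAB]


def has_word(index):
    """Build a trivial 'binary classifier' that fires when one vocabulary word occurs."""
    return lambda vector: 1 if vector[index] > 0 else 0


# Order: "involving state organs", "offering or accepting bribes", "illegal encroachment".
toy_binary = [has_word(1), has_word(0), has_word(2)]


def toy_multi(behavior_vector):
    return "bribery" if behavior_vector[1] == 1 else "theft"


charge, behavior_vector = classify_new_text(
    "official accepted bribe", toy_preprocess, toy_vectorize, toy_binary, toy_multi
)
```

The important constraint from the description is that step 2 reuses the vectorization model fitted during training, so the new vector has the same dimension and meaning as the training vectors.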

The beneficial effects of the invention are as follows:

A. The process of extracting features for one multi-class task is converted into extracting features for several binary classification tasks. The detail information is specified manually and then searched for in the legal text, so the extraction is not affected by segmentation ambiguity and key information is not overlooked.

B. In the binary classification tasks, the vectorization results of different texts differ relatively strongly, which helps the algorithm discriminate between them and improves judgment accuracy.

Brief description of the drawings

Fig. 1 is a schematic diagram of the workflow of the invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the protection scope of the present invention.

Existing text vectorization methods are used to build binary classifiers that extract key information; the key information can be described as behavior labels. The behavior information extracted by a series of binary classifiers is then used to construct the feature engineering required for multi-class machine learning.

The invention proposes a new feature engineering method for legal text. It does not simply feed the result of text vectorization into the training of a multi-class model. Instead, the vectorization result is first used to train simpler binary classification models. A series of binary classification results indicates whether a given criminal act or piece of key crime information is present, and a feature vector is then constructed from these results and used in the multi-class classification task for legal text. Practice shows that this method reduces the influence of inaccurate Chinese word segmentation on model training, increases the key information carried by the vector, and improves model accuracy.

Specifically, after the legal documents and the corresponding charge labels are obtained, the text is first split into individual words with an existing open-source Python package for Chinese word segmentation, Jieba (or Thulac). Unimportant stop words are then removed using an open-source or self-built stop-word dictionary. In these two steps, expertise from the legal field can be applied: phrases that are likely to be segmented ambiguously can be added to the segmentation dictionary so that they are kept intact, and incorrect segmentations that would affect model training can be added to the stop-word dictionary so that they are removed. After each legal text has been segmented, techniques such as One-Hot, TFIDF, or Word2vec with an LSTM can be used to vectorize the text, converting it into vectors of uniform dimension for subsequent training.
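As a rough illustration of this preprocessing stage, the sketch below removes stop words and computes TFIDF vectors of uniform dimension using only the Python standard library. Several things are assumptions made for the example: the texts are English and already space-separated, the `segment` function is a stand-in for a real segmenter (with Jieba, a call such as `jieba.lcut(text)` would typically take its place), the stop-word list is invented, and the smoothed IDF formula is one common variant rather than anything the description prescribes.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and"}  # hypothetical stop-word dictionary


def segment(text):
    """Stand-in for a Chinese segmenter: the toy texts are already space-separated."""
    return text.split()


def preprocess(text):
    """Segment the text and remove stop words."""
    return [tok for tok in segment(text) if tok not in STOP_WORDS]


def fit_tfidf(token_lists):
    """Learn the vocabulary and a smoothed inverse document frequency per word."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    n_docs = len(token_lists)
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}
    return vocab, idf


def transform(tokens, vocab, idf):
    """Return a fixed-length TFIDF vector; words outside the vocabulary are ignored."""
    counts = Counter(tokens)
    total = sum(counts[tok] for tok in vocab) or 1
    return [counts[tok] / total * idf[tok] for tok in vocab]


texts = ["the defendant accepted a bribe", "the defendant stole a car"]
token_lists = [preprocess(t) for t in texts]
vocab, idf = fit_tfidf(token_lists)
vectors = [transform(toks, vocab, idf) for toks in token_lists]
```

Every vector has the length of the vocabulary, which is the "uniform dimension" the later classifiers rely on. Words that occur in every document, such as "defendant" here, receive a lower weight than words that distinguish one document from another.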

Next, the behavior labels used to extract key information from the text need to be defined. In practice, the following can be used: "unjust enrichment", "purchase and sale", "causing death", "violence", "involving state organs", "illegal encroachment", "bodily injury", "subjective intent", "public place", "offering or accepting bribes". These labels are only a general-purpose set and have many omissions and shortcomings; they can be refined and extended using legal expertise to suit specific needs.

Sub-classifiers are then trained with machine learning algorithms suitable for binary classification, for example those in sklearn such as logistic regression, SVM, Bayes classifiers and decision trees; Xgboost or the LightGBM algorithm published by Microsoft can also be used. These algorithms are not guaranteed to outperform neural network methods such as deep learning, but for binary classification tasks they are easy to use and the performance gap is small. There are two feasible ways to annotate the training set:

1. For each legal judgment document, judge specifically which of the above behaviors are present; the corresponding position is set to 1 if the behavior is present and to 0 if it is not. This produces a fixed-length binary vector, and whichever sub-classifier is to be trained takes the corresponding bit as its label. Although this approach gives the labels better specificity and precision, it requires a great deal of manual labor to annotate large volumes of judicial data and is subject to some human error, so it is not recommended.

2. For each charge, specify a fixed set of behavior labels that are generally present or especially critical. For example, the crime of accepting bribes usually involves "involving state organs", "offering or accepting bribes" and "subjective intent", so the sub-classifier labels of a bribery text can be annotated as [0,0,0,0,1,0,0,1,0,1], that is, the positions corresponding to the three behaviors present are marked 1. All bribery cases are annotated with the same sub-classifier labels.
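The second annotation method amounts to a lookup table from charge to a fixed behavior vector. The sketch below shows one way to express it; the English label names follow the list given earlier in the description, while the table name `CHARGE_TEMPLATES`, the helper functions and the charge name used as key are hypothetical. Only the bribery row is taken from the text.

```python
# Behavior labels in the order listed in the description (English renderings).
BEHAVIOR_LABELS = [
    "unjust enrichment",
    "purchase and sale",
    "causing death",
    "violence",
    "involving state organs",
    "illegal encroachment",
    "bodily injury",
    "subjective intent",
    "public place",
    "offering or accepting bribes",
]

# Hypothetical charge-to-behavior table; further charges would be added by legal experts.
CHARGE_TEMPLATES = {
    "acceptance of bribes": {
        "involving state organs",
        "subjective intent",
        "offering or accepting bribes",
    },
}


def sub_classifier_labels(charge):
    """Return the fixed 0/1 behavior vector assigned to every text with this charge."""
    present = CHARGE_TEMPLATES[charge]
    return [1 if name in present else 0 for name in BEHAVIOR_LABELS]


def labels_for_behavior(charges, behavior):
    """Return the 0/1 training labels of one sub-classifier for a list of charges."""
    position = BEHAVIOR_LABELS.index(behavior)
    return [sub_classifier_labels(charge)[position] for charge in charges]
```

For the bribery example, `sub_classifier_labels("acceptance of bribes")` yields the vector quoted above, and `labels_for_behavior` extracts the single bit that one sub-classifier is trained on.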

For every behavior label, the corresponding sub-classifier is trained on the full data set. The outputs of the sub-classifiers serve as the input of the meta-classifier, and the labels of the meta-classifier are simply the original charge labels of the training set.

At this point, the feature vector of a legal text has been constructed. Its length is the number of defined behavior labels, and each entry is a binary 0 or 1 indicating whether the corresponding behavior is present. In an actual legal text classification task, these behavior labels can be fed into the meta-classifier as features for training and prediction.

This scheme still leaves room for further improvement. There are two main points that can be improved:

1. Use the confidence of the binary classifier instead of 0 or 1. The confidence represents how certain the classifier is of its judgment and takes values in the range (0,1). The behavior labels annotated for each charge are fixed, but not every legal text necessarily contains all of those behaviors, so using confidences makes the description more accurate.

2. Refine the label division further by adding -1 to indicate that no relevant description exists. Previously, 1 meant that the behavior is clearly present and 0 meant that it is clearly absent, but there is a third case: no relevant description exists, so it cannot be determined whether the behavior is present. Adding this class makes the feature construction more fine-grained, but it may also introduce errors.
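Both refinements change only how a single entry of the behavior feature vector is encoded. The sketch below is one possible reading, under assumptions the description does not settle: the confidence is obtained by applying the logistic function to a raw decision score, and whether a relevant description exists is supplied as a separate flag named `described`. Both names and both choices are hypothetical.

```python
import math


def confidence(score):
    """Map a raw decision score to a confidence between 0 and 1 (stable logistic)."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def behavior_feature(score, described=True, use_confidence=False):
    """Encode one behavior label for one text.

    -1            : no relevant description in the text (refinement 2)
    0 or 1        : hard decision of the binary classifier (base method)
    value in (0,1): classifier confidence (refinement 1)
    """
    if not described:
        return -1
    if use_confidence:
        return confidence(score)
    return 1 if score > 0 else 0
```

Many library classifiers expose a probability estimate directly, which could be used in place of the logistic mapping assumed here. Whether the two refinements are combined or applied separately is a design choice left open by the text.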

The present invention makes flexible use of the prior art and of currently common open-source frameworks and languages, playing to their strengths and avoiding their weaknesses. It improves the feature engineering method for legal text so that key information is captured better and training accuracy improves, which in turn assists legal judgments, legal research, and so on.

The above are only preferred embodiments of the present invention. They serve only to illustrate the technical solution of the invention and are not intended to limit its protection scope. Any modification, equivalent substitution, improvement or the like made within the spirit and principles of the present invention falls within the protection scope of the present invention.

Claims

1. A feature engineering method for legal text classification based on binary classifiers, characterized in that

the result of text vectorization is first used to train binary classification models; the binary classification results indicate whether a criminal act or key crime information is present; a feature vector is then constructed and used in the multi-class classification task for legal text.

2. The method according to claim 1, characterized in that

the implementation steps are as follows:

Step 1: obtain the training data set;

Step 2: perform Chinese word segmentation and remove stop words;

Step 3: vectorize the text;

Step 4: define the behavior labels and train the corresponding binary classifiers;

Step 5: train the multi-class classifier, using the binary-classifier labels corresponding to each text in the previous step as input, the labels for this step being the labels obtained in step 1;

at this point, the multi-class classifier for legal text has been built.

3. The method according to claim 2, characterized in that

in step 1, legal judgment documents are collected manually and annotated according to the judgment result to obtain the corresponding charge labels.

4. The method according to claim 2, characterized in that

in step 2, the open-source Jieba library is used to cut the whole legal text into words, and the predefined stop words are then removed by checking against a stop-word dictionary.

5. The method according to claim 2, characterized in that

in step 3, the existing techniques One-Hot, TFIDF or Word2vec are used to vectorize the segmented vocabulary and to produce a feature vector for each text.

6. The method according to claim 2, characterized in that

in step 4, the behavior features to be extracted are defined and annotated, and the annotations are used to train the binary classifiers.

7. The method according to claim 2, characterized in that

if a new text needs to be predicted and classified, the corresponding steps are as follows:

Step 1: perform Chinese word segmentation and remove stop words;

Step 2: vectorize the text;

Step 4: use the behavior feature vector corresponding to each text from the previous step as the input of the multi-class classifier to predict the charge.

8. The method according to claim 7, characterized in that

in step 1, an open-source library such as Jieba is used to cut the whole legal text into words, and the predefined stop words are then removed by checking against a stop-word dictionary.

9. The method according to claim 7, characterized in that

in step 2, the vectorization model used during training is applied to the segmented vocabulary to produce a feature vector for each text.