CN110110087A - Feature engineering method for legal text classification based on binary classifiers - Google Patents

Feature engineering method for legal text classification based on binary classifiers

Info

Publication number
CN110110087A
Authority
CN
China
Prior art keywords
text
vector
classifiers
training
law
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910401645.9A
Other languages
Chinese (zh)
Inventor
段强
李锐
尹青山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Inspur Hi Tech Investment and Development Co Ltd
Original Assignee
Jinan Inspur Hi Tech Investment and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan Inspur Hi Tech Investment and Development Co Ltd
Priority to CN201910401645.9A
Publication of CN110110087A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services; Handling legal documents

Abstract

The present invention provides a feature engineering method for legal text classification based on binary classifiers and belongs to the technical field of natural language processing. Existing text vectorization methods are used to build binary classifiers that extract key information, and the results produced by a series of binary classifiers are then used to construct the features required for machine learning. This makes the key information that influences the judgment in a text explicit as features. Each piece of key information is judged by a binary classifier, which offers better accuracy. A feature vector is then built by combining the different features, giving an accurate and clear vector description of the legal text. Finally, an existing multi-class classifier can be used for classification, and the results can assist legal decision-making, legal research, and similar work.

Description

Feature engineering method for legal text classification based on binary classifiers
Technical field
The present invention relates to natural language processing technology, and in particular to a feature engineering method for legal text classification based on binary classifiers.
Background
Common text vectorization methods in natural language processing currently include One-Hot, TF-IDF, and Word2Vec. Converting text into a vector representation helps algorithms understand and learn text features and thereby classify text. In general, the clearer the meaning carried by the vector, the better the classification. One-Hot and TF-IDF are both based on the bag-of-words (BOW) model and its derivatives. They focus on whether a word is present and how often it occurs and downplay the connections between words in context, so they are better suited to retrieval than to natural language understanding and classification. Word2Vec uses a neural network to convert each word into a vector. Compared with traditional word-vector methods, Word2Vec does not represent vectors with a sparse matrix, so it avoids the curse of dimensionality on large corpora. It also uses the CBOW (continuous bag-of-words) and Skip-gram models, which take the context around a word into account. In legal text classification, however, the bottleneck of Chinese word segmentation and the strong logical relationships within legal documents mean that vectorization with Word2Vec alone does not extract the text information well. In multi-class tasks in particular, incomplete extraction of useful information causes classification accuracy to drop significantly.
The rise of big data applications has opened up new ideas and methods for every industry. In legal research and legal practice, the accumulated data is large in volume, updated quickly, and unbalanced across classes. Using artificial intelligence and machine learning to process large volumes of legal text, and thereby assist judgment and improve the efficiency of legal decision-making, is becoming increasingly valuable.
Traditional natural language processing segments the text and then encodes it with methods such as One-Hot or TF-IDF. In essence these methods summarize word occurrence counts and frequencies under the bag-of-words model, and they cannot directly give a machine learning model the ability to understand legal text and extract the key information that supports a judgment. The Word2Vec algorithm uses a neural network to convert a word into a fixed-length vector. Each vector can be regarded as a point in a high-dimensional space, so the method can express relatedness between words and is currently one of the better-performing text vectorization algorithms. In practical legal document classification, however, documents follow a fairly standard format and are highly similar to one another, the task is multi-class, and the sample distribution is quite unbalanced. As a result, Word2Vec cannot bring out well the key information that helps the judgment.
Feature engineering refers to using a series of engineering techniques to select better data features from raw data in order to improve model training, and it usually includes feature construction and feature selection. In natural language processing, feature engineering also includes word segmentation, stop-word removal, and vectorization. Common text vectorization methods are the bag-of-words-based One-Hot and TF-IDF methods and the Word2Vec method, which uses distributed representations. Chinese word segmentation introduces errors, and legal texts contain many colloquial descriptions of the facts of a case, both of which degrade the final text analysis. Some very fine-grained details are hard to extract clearly, yet the categorization and characterization of a case often depend on exactly these key details. Accuracy in multi-class legal text classification is therefore significantly affected.
The existing technology has the following disadvantages:
1. Legal documents often describe the facts of a case in colloquial language, so key information is deeply hidden. The ambiguity of Chinese word segmentation therefore affects how well a machine learning model classifies the text.
2. Legal documents tend to follow a fixed descriptive pattern with a limited vocabulary, so common algorithms produce similar vectors for different texts, which makes it hard for a machine learning model to tell their classes apart.
Summary of the invention
In view of the strengths and weaknesses of existing text vectorization methods and the particular characteristics of legal text classification, the present invention proposes a feature engineering method for legal text classification based on binary classifiers. Existing text vectorization methods are used to build binary classifiers that extract key information, and the results produced by a series of binary classifiers are then used to construct the features required for machine learning. This makes the key information that influences the judgment in a text explicit as features. Each piece of key information is judged by a binary classifier, which offers better accuracy. A feature vector is then built by combining the different features, giving an accurate and clear vector description of the legal text. Finally, an existing multi-class classifier can be used for classification, and the results can assist legal decision-making, legal research, and similar work.
The proposed method does not simply feed the text vectorization results into the training of a multi-class model. Instead, the vectorization results are first used to train simpler binary classification models. A series of binary classification results indicates whether a given criminal act or piece of key criminal information is present, and these results are assembled into a feature vector that is then used in the multi-class legal text classification task. Practice shows that this method reduces the influence of inaccurate Chinese word segmentation on model training, increases the key information carried by the vector, and improves model accuracy.
The operating steps are as follows:
Step 1: obtain the training data set. Legal judgment documents can be collected manually and labeled according to the judgment result to obtain the corresponding charge label, or labeled legal text can be obtained in another equivalent way.
Step 2: perform Chinese word segmentation and remove stop words. An open-source library such as Jieba can be used to split the whole legal text into words, and predefined stop words are then removed by checking against a stop-word dictionary.
Step 3: vectorize the text. An existing technique such as One-Hot, TF-IDF, or Word2Vec is used to vectorize the segmented words and produce a feature vector for each text.
Step 4: define behavior labels and train the corresponding binary classifiers. The behavioral features to be extracted are defined and annotated, and the annotations are used to train the binary classifiers.
Step 5: train the multi-class classifier. The binary-classifier labels of each text from the previous step are used as the input for training the multi-class classifier, and the labels for this step are the charge labels obtained in Step 1.
At this point the multi-class classifier for legal text has been built. When a new text needs to be predicted and classified, the steps are as follows:
Step 1: perform Chinese word segmentation and remove stop words. An open-source library such as Jieba can be used to split the whole legal text into words, and predefined stop words are then removed by checking against a stop-word dictionary.
Step 2: vectorize the text. The vectorization model used during training is applied to the segmented words to produce a feature vector for each text.
Step 3: run each of the binary classifiers on the vector from Step 2 to obtain the behavior feature vector.
Step 4: use the behavior feature vector of each text from the previous step as the input to the multi-class classifier to predict the charge.
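The four prediction steps can be pictured as a simple data flow: tokens, then a text vector, then a behavior feature vector, then a charge. The sketch below is only an illustration of that flow and is not taken from the patent. The class `ChargePredictor`, the toy vocabulary, and the stand-in classifiers are all hypothetical, and the input is assumed to be already segmented into tokens.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class ChargePredictor:
    """Prediction-time flow: tokens -> text vector -> behavior vector -> charge.

    Every component is injected, so any vectorizer or classifier can be
    plugged in. All names here are hypothetical.
    """
    stop_words: set
    vectorize: Callable[[Sequence[str]], Vector]
    binary_classifiers: List[Callable[[Vector], int]]
    multi_classifier: Callable[[List[int]], str]

    def predict(self, tokens: Sequence[str]) -> str:
        # Step 1: stop-word removal (segmentation is assumed already done).
        kept = [t for t in tokens if t not in self.stop_words]
        # Step 2: vectorize with the same model that was used in training.
        text_vec = self.vectorize(kept)
        # Step 3: one binary prediction per behavior label.
        behavior_vec = [clf(text_vec) for clf in self.binary_classifiers]
        # Step 4: the behavior vector is the multi-class classifier's input.
        return self.multi_classifier(behavior_vec)


VOCAB = ["bribe", "hit", "take"]  # made-up vocabulary


def count_vector(tokens):
    """Toy bag-of-words vectorizer over VOCAB."""
    return [float(list(tokens).count(w)) for w in VOCAB]


# Stand-ins for trained models: each "binary classifier" checks one dimension.
binary_classifiers = [lambda v, i=i: int(v[i] > 0) for i in range(len(VOCAB))]

# Stand-in for a trained multi-class classifier: a lookup table.
CHARGE_TABLE = {(1, 0, 0): "accepting bribes", (0, 1, 1): "robbery"}

predictor = ChargePredictor(
    stop_words={"the", "a"},
    vectorize=count_vector,
    binary_classifiers=binary_classifiers,
    multi_classifier=lambda b: CHARGE_TABLE.get(tuple(b), "unknown"),
)
result = predictor.predict(
    ["the", "defendant", "hit", "a", "victim", "take", "wallet"]
)
```

Keeping each stage as an injected callable mirrors the text's point that existing vectorizers and classifiers can be reused. Only the behavior vector in the middle is specific to this method.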
The beneficial effects of the invention are:
A. The feature extraction process for one multi-class task is converted into feature extraction for multiple binary classification tasks. The details to look for in the legal text are specified manually, so the method is not affected by segmentation ambiguity and key information is not overlooked.
B. In the binary classification tasks, the vectorization results of different texts differ more from one another, which helps the algorithm distinguish them and improves the accuracy of the judgment.
Brief description of the drawings
Fig. 1 is a schematic diagram of the workflow of the invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
Existing text vectorization methods are used to build binary classifiers that extract key information, and this key information can be described as behavior labels. The behavior information extracted by a series of binary classifiers is then used to construct the features required for multi-class machine learning.
The invention proposes a new feature engineering method for legal text. It does not simply feed the text vectorization results into the training of a multi-class model. Instead, the vectorization results are first used to train simpler binary classification models. A series of binary classification results indicates whether a given criminal act or piece of key criminal information is present, and these results are assembled into a feature vector that is then used in the multi-class legal text classification task. Practice shows that this method reduces the influence of inaccurate Chinese word segmentation on model training, increases the key information carried by the vector, and improves model accuracy.
Specifically, after the legal documents and the corresponding charge labels have been obtained, an existing open-source Python package for Chinese word segmentation, such as Jieba (or Thulac), is first used to split the text into individual words. Unimportant stop words are then removed using an open-source or self-built stop-word dictionary. Legal domain knowledge can be applied in these two steps: words that are likely to be ambiguous can be added to the segmentation dictionary so that they are preserved, and incorrectly segmented words that would harm model training can be added to the stop-word dictionary so that they are removed. After each legal text has been segmented, techniques such as One-Hot, TF-IDF, or Word2Vec with an LSTM can be used to vectorize the text, converting it into vectors of uniform dimension for subsequent training.
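The preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation. The texts are assumed to be already segmented (for example by Jieba), and the stop-word list, the sample tokens, and the helper names `remove_stop_words` and `tfidf_vectors` are hypothetical. The TF-IDF weighting shown, tf × ln(N/df), is one common variant among several.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real one would come from an open-source
# or self-built stop-word dictionary.
STOP_WORDS = {"的", "了", "在"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


def tfidf_vectors(docs):
    """Turn tokenized documents into equal-length TF-IDF vectors.

    Uses tf = count / document length and idf = ln(N / df), which is one
    common variant; libraries may smooth or normalize differently.
    """
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        vectors.append(
            [(counts[t] / length) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Toy, already-segmented texts (assumed output of a segmenter such as Jieba).
segmented = [
    ["被告人", "在", "公共场所", "殴打", "了", "被害人"],
    ["被告人", "收受", "了", "贿赂"],
]
cleaned = [remove_stop_words(doc) for doc in segmented]
vocab, vectors = tfidf_vectors(cleaned)
```

Every vector has the length of the shared vocabulary, which is what gives the texts the uniform dimension needed for training. A term that occurs in every document, such as "被告人" here, receives weight zero under this variant.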
Next, the behavior labels used to extract key information from the text must be defined. In practice, the following can be used: "unjust enrichment", "buying and selling", "causing death", "violence", "involving state organs", "unlawful infringement", "bodily injury", "subjective intent", "public place", and "offering or accepting bribes". These labels are only a rough general-purpose set with many omissions and shortcomings, and they can be refined and extended using legal domain knowledge to suit specific needs.
Sub-classifiers are then trained with machine learning algorithms suitable for binary classification, using sklearn or compatible libraries, such as logistic regression, SVM, Bayes classifiers, XGBoost, and decision trees. The LightGBM algorithm published by Microsoft can also be used. Although these algorithms are not guaranteed to outperform neural network methods such as deep learning, they are easy to use on binary classification tasks and the performance gap is small. There are two feasible ways to label the training set:
1. For each legal judgment document, judge specifically which of the above behaviors are present; set the corresponding position to 1 if a behavior is present and to 0 otherwise. This produces a fixed-length binary vector, and when a given sub-classifier is trained, the corresponding bit is used as its label. Although this approach gives well-targeted and precise labels, it requires a great deal of manual effort to annotate large volumes of judicial data and is subject to human error, so it is not recommended.
2. For each charge, specify a fixed set of behavior labels that are generally present or particularly important. For example, the crime of accepting bribes typically involves "involving state organs", "offering or accepting bribes", and "subjective intent", so the sub-classifier labels for texts on this charge can be set to [0,0,0,0,1,0,0,1,0,1], that is, the positions of the three behaviors present are marked 1. All texts for the crime of accepting bribes receive the same sub-classifier labels.
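The second labeling approach amounts to a lookup from charge to a fixed 0/1 vector. The sketch below shows one way to express it and is only illustrative. The label strings are English renderings of the ten behavior labels in the order given in the description, the helper names are hypothetical, and only the bribery row follows the example in the text. The "intentional injury" row is made up for illustration.

```python
# Behavior labels, in the order listed in the description.
BEHAVIOR_LABELS = [
    "unjust enrichment",
    "buying and selling",
    "causing death",
    "violence",
    "involving state organs",
    "unlawful infringement",
    "bodily injury",
    "subjective intent",
    "public place",
    "offering or accepting bribes",
]

# Hypothetical charge-to-behavior table. Only the first row follows the
# example in the text; the second row is an invented illustration.
CHARGE_BEHAVIORS = {
    "accepting bribes": {
        "involving state organs",
        "subjective intent",
        "offering or accepting bribes",
    },
    "intentional injury": {"violence", "bodily injury", "subjective intent"},
}


def behavior_vector(charge):
    """Return the fixed 0/1 sub-classifier label vector for a charge."""
    present = CHARGE_BEHAVIORS[charge]
    return [1 if label in present else 0 for label in BEHAVIOR_LABELS]


def labels_for_subclassifier(charges, behavior):
    """Take one position of each vector as the target of one sub-classifier."""
    idx = BEHAVIOR_LABELS.index(behavior)
    return [behavior_vector(c)[idx] for c in charges]
```

For example, `behavior_vector("accepting bribes")` reproduces the vector given in the text, and `labels_for_subclassifier` yields the training targets for a single sub-classifier from a list of charge labels.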
A sub-classifier is trained on the full data set for every behavior label. The outputs of the sub-classifiers serve as the input to the meta-classifier, and the meta-classifier's labels are the original charge labels of the training set.
At this point the feature vector of the legal text has been constructed. Its length equals the number of defined behavior labels, and each element is a binary 0 or 1 indicating whether the corresponding behavior is present. In an actual legal text classification task, these behavior labels can be fed to the meta-classifier as features for training and prediction.
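The two-stage training described above can be sketched end to end on toy data. This is a simplified illustration, not the patent's implementation. The vocabulary, the three behavior labels, and the charge table are invented. The sub-classifiers are small logistic-regression models fitted with plain stochastic gradient descent, standing in for any sklearn-style binary classifier. The meta-classifier is a nearest-neighbor lookup by Hamming distance, standing in for any existing multi-class classifier.

```python
import math

BEHAVIORS = ["bribery", "violence", "taking property"]  # hypothetical subset

# Hypothetical fixed behavior vector per charge (second labeling approach).
CHARGE_BEHAVIORS = {
    "accepting bribes": [1, 0, 0],
    "intentional injury": [0, 1, 0],
    "robbery": [0, 1, 1],
    "theft": [0, 0, 1],
}

# Toy text vectors over the made-up vocabulary [bribe, hit, take, official].
X = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 1],
     [0, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 0], [0, 0, 1, 1]]
charges = (["accepting bribes"] * 2 + ["intentional injury"] * 2
           + ["robbery"] * 2 + ["theft"] * 2)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_binary(X, y, epochs=500, lr=0.5):
    """Fit a small logistic-regression sub-classifier with plain SGD."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict_binary(model, x):
    """Hard 0/1 decision: 1 when the linear score is positive."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


# One sub-classifier per behavior label, each trained on the full data set.
sub_models = []
for k in range(len(BEHAVIORS)):
    y_k = [CHARGE_BEHAVIORS[c][k] for c in charges]
    sub_models.append(train_binary(X, y_k))


def behavior_features(x):
    """Run every sub-classifier on one text vector."""
    return [predict_binary(m, x) for m in sub_models]


# Meta-classifier stand-in: nearest training example by Hamming distance,
# trained on sub-classifier outputs with the original charge labels.
meta_train = [(behavior_features(x), c) for x, c in zip(X, charges)]


def predict_charge(x):
    feats = behavior_features(x)
    _, charge = min(
        meta_train,
        key=lambda item: sum(a != b for a, b in zip(item[0], feats)),
    )
    return charge
```

The meta-classifier never sees the raw text vector, only the short behavior vector, whose length equals the number of behavior labels. Any multi-class model could take the place of the nearest-neighbor lookup.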
There is still room to improve this scheme. The two main points are:
1. Use the confidence of the binary classifiers instead of 0 or 1. The confidence represents how certain a classifier is about its result and takes values in the range (0, 1). The behavior labels assigned to each charge are fixed, but not every legal text necessarily contains all of those behaviors, so using the confidence makes the description more accurate.
2. Refine the label values by adding -1 to indicate that no relevant description exists. Previously 1 meant that the behavior is clearly present and 0 meant that it is clearly absent, but there is a third case in which there is no relevant description and it cannot be determined whether the behavior is present. Adding this category makes the feature construction more fine-grained, although it may also introduce errors.
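Both refinements change only how each feature value is encoded. The sketch below illustrates them under stated assumptions. The sub-classifiers are taken to be linear models given as hypothetical `(weights, bias)` pairs with made-up numbers, the confidence is the sigmoid of the linear score, and the annotation strings used for the three-valued labels are invented for this example.

```python
import math


def confidence(weights, bias, x):
    """Sigmoid of the linear score: a value strictly between 0 and 1."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def confidence_features(models, x):
    """Replace each hard 0/1 output with the sub-classifier's confidence."""
    return [confidence(w, b, x) for (w, b) in models]


def three_valued(annotation):
    """Encode an annotation as 1 (present), 0 (absent), -1 (not described)."""
    mapping = {"present": 1, "absent": 0, "not described": -1}
    return mapping[annotation]


# Hypothetical, already-trained sub-classifiers as (weights, bias) pairs.
models = [([2.0, -1.0], -0.5), ([-1.5, 3.0], 0.0)]
features = confidence_features(models, [1.0, 0.0])
labels = [three_valued(a) for a in ["present", "not described", "absent"]]
```

With confidences, two texts on the same charge can receive different feature vectors even though their training labels were identical, which is the gain the first point describes. With three-valued labels, a sub-classifier would need to handle three classes rather than two, so that refinement changes the sub-classifier task itself.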
The present invention makes flexible use of existing technology and of currently common open-source frameworks and languages, playing to their strengths and avoiding their weaknesses. It improves the feature engineering method for legal text so that key information is captured better and training accuracy improves, and thereby assists legal judgment, legal research, and similar work.
The above are merely preferred embodiments of the present invention, intended only to illustrate its technical solutions and not to limit its protection scope. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention falls within its protection scope.

Claims (9)

1. A feature engineering method for legal text classification based on binary classifiers, characterized in that
the text vectorization results are first used to train binary classification models, the binary classification results indicate whether a criminal act or key criminal information is present, and a feature vector is then constructed from these results and used in the multi-class legal text classification task.
2. The method according to claim 1, characterized in that
the implementation steps are as follows:
Step 1: obtain the training data set;
Step 2: perform Chinese word segmentation and remove stop words;
Step 3: vectorize the text;
Step 4: define behavior labels and train the corresponding binary classifiers;
Step 5: train the multi-class classifier, using the binary-classifier labels of each text from the previous step as the input, the labels for this step being the labels obtained in Step 1;
at this point the multi-class classifier for legal text has been built.
3. The method according to claim 2, characterized in that
in Step 1, legal judgment documents are collected manually and labeled according to the judgment result to obtain the corresponding charge labels.
4. The method according to claim 2, characterized in that
in Step 2, the open-source Jieba library is used to split the whole legal text into words, and predefined stop words are then removed by checking against a stop-word dictionary.
5. The method according to claim 2, characterized in that
in Step 3, the existing One-Hot, TF-IDF, or Word2Vec technique is used to vectorize the segmented words and produce a feature vector for each text.
6. The method according to claim 2, characterized in that
in Step 4, the behavioral features to be extracted are defined and annotated for training the binary classifiers.
7. The method according to claim 2, characterized in that
when a new text needs to be predicted and classified, the steps are as follows:
Step 1: perform Chinese word segmentation and remove stop words;
Step 2: vectorize the text;
Step 3: run each of the binary classifiers on the vector from Step 2 to obtain the behavior feature vector;
Step 4: use the behavior feature vector of each text from the previous step as the input to the multi-class classifier to predict the charge.
8. The method according to claim 7, characterized in that
in Step 1, an open-source library such as Jieba is used to split the whole legal text into words, and predefined stop words are then removed by checking against a stop-word dictionary.
9. The method according to claim 7, characterized in that
in Step 2, the vectorization model used during training is applied to the segmented words to produce a feature vector for each text.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910401645.9A CN110110087A (en) 2019-05-15 2019-05-15 Feature engineering method for legal text classification based on binary classifiers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910401645.9A CN110110087A (en) 2019-05-15 2019-05-15 Feature engineering method for legal text classification based on binary classifiers

Publications (1)

Publication Number Publication Date
CN110110087A true CN110110087A (en) 2019-08-09

Family

ID=67490149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910401645.9A Pending CN110110087A (en) 2019-05-15 2019-05-15 Feature engineering method for legal text classification based on binary classifiers

Country Status (1)

Country Link
CN (1) CN110110087A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1310825A (en) * 1998-06-23 2001-08-29 微软公司 Methods and apparatus for classifying text and for building a text classifier
US20100186091A1 (en) * 2008-05-13 2010-07-22 James Luke Turner Methods to dynamically establish overall national security or sensitivity classification for information contained in electronic documents; to provide control for electronic document/information access and cross domain document movement; to establish virtual security perimeters within or among computer networks for electronic documents/information; to enforce physical security perimeters for electronic documents between or among networks by means of a perimeter breach alert system
CN108228687A (en) * 2017-06-20 2018-06-29 上海吉贝克信息技术有限公司 Big data knowledge excavation and accurate tracking and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邓文超: "基于深度学习的司法智能研究", 《中国优秀硕士学位论文全文数据库 社会科学Ⅰ辑》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242237A (en) * 2020-01-20 2020-06-05 北京明略软件系统有限公司 Bert model training and classifying method, system, medium and computer equipment
CN111949794A (en) * 2020-08-14 2020-11-17 扬州大学 Online active machine learning method for text multi-classification task
CN112114795A (en) * 2020-09-18 2020-12-22 北京航空航天大学 Method and device for predicting deactivation of auxiliary tool in open source community
CN112488551A (en) * 2020-12-11 2021-03-12 浪潮云信息技术股份公司 XGboost algorithm-based hot line intelligent order dispatching method
CN112488551B (en) * 2020-12-11 2023-04-07 浪潮云信息技术股份公司 Hot line intelligent order dispatching method based on XGboost algorithm

Similar Documents

Publication Publication Date Title
CN106919673B (en) Text mood analysis system based on deep learning
CN109189901B (en) Method for automatically discovering new classification and corresponding corpus in intelligent customer service system
CN110110087A (en) Feature engineering method for legal text classification based on binary classifiers
CN107944480A (en) A kind of enterprises ' industry sorting technique
CN106776538A (en) The information extracting method of enterprise's noncanonical format document
CN108563638B (en) Microblog emotion analysis method based on topic identification and integrated learning
CN111104510B (en) Text classification training sample expansion method based on word embedding
CN110413783A (en) A kind of judicial style classification method and system based on attention mechanism
CN110276054A (en) A kind of insurance text structure implementation method
CN109657058A (en) A kind of abstracting method of notice information
CN109271527A (en) A kind of appellative function point intelligent identification Method
CN109446423B (en) System and method for judging sentiment of news and texts
CN109492105A (en) A kind of text sentiment classification method based on multiple features integrated study
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics
CN110910175A (en) Tourist ticket product portrait generation method
CN109783636A (en) A kind of car review subject distillation method based on classifier chains
Safrin et al. Sentiment analysis on online product review
CN112860889A (en) BERT-based multi-label classification method
CN107357895A (en) A kind of processing method of the text representation based on bag of words
CN111462752A (en) Client intention identification method based on attention mechanism, feature embedding and BI-L STM
CN110909542A (en) Intelligent semantic series-parallel analysis method and system
Bengio et al. Using web co-occurrence statistics for improving image categorization
WO2021128704A1 (en) Open set classification method based on classification utility
CN110110326B (en) Text cutting method based on subject information
CN110245234A (en) A kind of multi-source data sample correlating method based on ontology and semantic similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190809