CN110110087A - A feature engineering method for legal text classification based on binary classifiers - Google Patents
- Publication number: CN110110087A
- Application number: CN201910401645.9A
- Authority: CN (China)
- Prior art keywords: text, vector, classifiers, training, law
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/18—Legal services; Handling legal documents
Abstract
The present invention provides a feature engineering method for legal text classification based on binary classifiers, and belongs to the technical field of natural language processing. Existing text vectorization methods are used to build binary classifiers that extract key information, and the results extracted by a series of such binary classifiers are then used to construct the features required for machine learning. In this way, the key information in a text that influences the judgment is made explicit as features, and each piece of key information is judged by a binary classifier, which has comparatively good accuracy. Combining the different features into a feature vector then yields an accurate and clear vector description of the legal text. Finally, an existing multi-class classifier can perform the classification, and the results can be used to assist legal decisions, legal research and the like.
Description
Technical field
The present invention relates to natural language processing technology, and in particular to a feature engineering method for legal text classification based on binary classifiers.
Background art
Text vectorization methods currently common in natural language processing include One-Hot, TFIDF and Word2Vec. Converting text into a vector representation helps algorithms understand and learn the features of the text and thereby complete text classification. In general, the clearer the meaning expressed by the vector, the better the classification result. One-Hot and TFIDF are both based on the bag-of-words (BOW) model and its derivatives. They focus on whether a word occurs and how often, and play down the connections within the context, so they are better suited to retrieval than to natural language understanding and classification. Word2vec is an algorithm that converts each word into a vector by means of a neural network. Compared with traditional word-vector methods, Word2vec does not represent vectors with a sparse matrix, so it does not suffer from the curse of dimensionality on large-scale text; it also uses the CBOW (continuous bag-of-words) and Skip-gram models, so it can take the context near a word into account. In legal text classification, however, vectorization with Word2vec alone cannot extract the text information well, because of the bottleneck of Chinese word segmentation and the strong logical relations within the context of legal documents. In multi-class tasks in particular, incomplete extraction of the useful information causes classification accuracy to drop significantly.
The rise of big data applications has opened up new ideas and provided new methods for all trades and professions. In legal research and legal practice, the accumulated data is large in quantity, quickly updated and imbalanced across classes. Using artificial intelligence and machine learning to process large volumes of legal text, and thereby to assist judgments and improve the efficiency of legal decisions, is becoming more and more valuable.
Traditional natural language processing methods segment the text and then encode it with methods such as One-Hot or TFIDF. In essence they are based on the bag-of-words model and compute summary statistics on the number and frequency of word occurrences, so they cannot directly give machine learning the ability to understand legal text and extract key information in order to assist judgments. The Word2vec algorithm converts a word into a fixed-length vector through a neural network. Each vector can be regarded as a point in a high-dimensional space, so the method can reflect the relatedness between words, and it is currently one of the better-performing text vectorization algorithms. In the practice of legal document classification, however, the document format is highly standardized, the documents are very similar to one another, the task is multi-class, and the sample distribution is quite imbalanced. As a result, Word2vec cannot bring out well the key information that helps the judgment.
Feature engineering refers to using a series of engineering techniques to select better data features from the raw data in order to improve the training results of a model; it generally includes feature construction and feature selection. In natural language processing, feature engineering also includes word segmentation, stop-word removal and vectorization. Common text vectorization methods are the bag-of-words-based One-Hot and TFIDF methods and the Word2vec method, which uses distributed representations. Chinese word segmentation introduces errors, and legal texts contain a large amount of colloquial description of the facts of the case, so the final text analysis by machine learning suffers. Some very fine-grained details are hard to extract explicitly, yet the categorization and characterization of a case often depend on exactly these key details. Accuracy in multi-class legal text classification is therefore significantly affected.
The existing technology has the following disadvantages:
1. Legal documents often describe the details of the facts of a case in colloquial language, so the key information is deeply hidden. The ambiguity of Chinese word segmentation therefore affects the machine learning performance of text classification.
2. Legal documents tend to follow a fairly fixed pattern of description with a limited vocabulary, so common algorithms produce rather similar vectors for different texts, which makes it hard for machine learning to distinguish their classes.
Summary of the invention
In view of the strengths and weaknesses of existing text vectorization methods, and taking into account the characteristics of the legal text classification task, the invention proposes a feature engineering method for legal text classification based on binary classifiers. Existing text vectorization methods are used to build binary classifiers that extract key information, and the results extracted by a series of such binary classifiers are then used to construct the features required for machine learning. In this way, the key information in a text that influences the judgment is made explicit as features, and each piece of key information is judged by a binary classifier, which has comparatively good accuracy. Combining the different features into a feature vector then yields an accurate and clear vector description of the legal text. Finally, an existing multi-class classifier can perform the classification, and the results can be used to assist legal decisions, legal research and the like.
In the feature engineering method for legal text classification based on binary classifiers proposed by the invention, the result of text vectorization is not simply fed into the training of a multi-class model. Instead, it is first used to train simpler binary classification models. The results of a series of binary classifications indicate whether certain criminal acts or key information about the crime are present; these results are assembled into a feature vector, which is then used in the multi-class classification task on legal text. Practice shows that this method reduces the influence of inaccurate Chinese word segmentation on model training, increases the key information contained in the vector, and improves model accuracy.
The operating steps are as follows:
Step 1: obtain the training dataset. Legal judgment documents can be collected manually and labeled according to the verdict to obtain the corresponding charge labels; labeled legal text can also be obtained in other equivalent ways.
Step 2: perform Chinese word segmentation and remove stop words. An open-source library such as Jieba can be used to cut the whole legal text into words, and the predefined stop words are then removed by checking against a stop-word dictionary.
Step 3: vectorize the text. The segmented words are vectorized with existing techniques such as One-Hot, TFIDF or Word2vec, giving a feature vector for each text.
Step 4: define the behavior labels and train the corresponding binary classifiers. The behavior features to be extracted are defined and annotated, and the annotations are used to train the binary classifiers.
Step 5: train the multi-class classifier. The binary-classifier labels corresponding to each text from the previous step are used as the input for training the multi-class classifier; the target labels in this step are the charge labels obtained in Step 1.
At this point, construction of the multi-class classifier for legal text is complete. When new text needs to be predicted and classified, the corresponding steps are as follows:
Step 1: perform Chinese word segmentation and remove stop words. An open-source library such as Jieba can be used to cut the whole legal text into words, and the predefined stop words are then removed by checking against a stop-word dictionary.
Step 2: vectorize the text. The segmented words are vectorized with the same vectorization model that was used during training, giving a feature vector for each text.
Step 3: apply each of the binary classifiers in turn to the vector from Step 2 to obtain the behavior feature vector.
Step 4: use the behavior feature vector of each text from the previous step as the input of the multi-class classifier to predict the charge.
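The training and prediction procedures above can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: the four-term vocabulary, the toy vectors, the two behavior labels and their assignment to charges are made up for the example; a small perceptron stands in for whichever binary classifier is chosen; and a nearest-neighbor lookup stands in for the multi-class classifier. Segmentation and vectorization are assumed to have been done already.

```python
def score(model, x):
    """Linear score of one text vector under a (weights, bias) model."""
    weights, bias = model
    return sum(w * v for w, v in zip(weights, x)) + bias


def train_binary(vectors, labels, epochs=20):
    """Train a tiny perceptron (stand-in for any binary classifier)."""
    model = ([0.0] * len(vectors[0]), 0.0)
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            error = y - (1 if score(model, x) > 0 else 0)  # -1, 0 or +1
            if error:
                weights, bias = model
                model = ([w + error * v for w, v in zip(weights, x)],
                         bias + error)
    return model


def behavior_vector(binary_models, x):
    """Run every binary classifier on one text vector."""
    return [1 if score(m, x) > 0 else 0 for m in binary_models]


def train_multi(behavior_vectors, charges):
    """Stand-in multi-class classifier: memorize the training pairs."""
    return list(zip(behavior_vectors, charges))


def predict_charge(binary_models, multi_model, x):
    """Behavior features first, then the closest memorized charge."""
    features = behavior_vector(binary_models, x)

    def distance(pair):
        return sum(a != b for a, b in zip(pair[0], features))

    return min(multi_model, key=distance)[1]


# Toy text vectors over a hypothetical vocabulary: [bribe, official, hit, injury]
train_vectors = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
charges = ["accepting bribes", "accepting bribes",
           "intentional injury", "intentional injury"]
# One label column per behavior: "bribery" and "bodily injury" (illustrative).
behavior_labels = [[1, 1, 0, 0], [0, 0, 1, 1]]

# Training Step 4: one binary classifier per behavior label.
binary_models = [train_binary(train_vectors, col) for col in behavior_labels]
# Training Step 5: the multi-class classifier sees only behavior features.
multi_model = train_multi(
    [behavior_vector(binary_models, x) for x in train_vectors], charges)

# Prediction Steps 3 and 4 for two unseen text vectors.
print(predict_charge(binary_models, multi_model, [0, 1, 0, 0]))
print(predict_charge(binary_models, multi_model, [0, 0, 0, 1]))
```

The point of the structure is that the multi-class classifier never sees the raw text vector, only the short behavior vector. In a real setting the stand-ins would presumably be replaced by library models, for example scikit-learn classifiers.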
The beneficial effects of the invention are:
A. The feature extraction for one multi-class task is converted into feature extraction for multiple binary classification tasks. The detailed information is specified manually and then searched for in the legal text, so the method is not affected by segmentation ambiguity and key information is not overlooked.
B. In the binary classification tasks the vectorization results of different texts differ comparatively widely, which helps the algorithm discriminate between them and improves the accuracy of its judgments.
Brief description of the drawings
Fig. 1 is a schematic diagram of the workflow of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Existing text vectorization methods are used to build binary classifiers that extract key information; the key information can be described as behavior labels. The behavior information extracted by a series of binary classifiers is then used to construct the features required for multi-class machine learning.
The invention proposes a new feature engineering method for legal text. The result of text vectorization is not simply fed into the training of a multi-class model. Instead, it is first used to train simpler binary classification models. The results of a series of binary classifications indicate whether certain criminal acts or key information about the crime are present; these results are assembled into a feature vector, which is then used in the multi-class classification task on legal text. Practice shows that this method reduces the influence of inaccurate Chinese word segmentation on model training, increases the key information contained in the vector, and improves model accuracy.
Specifically, after the legal documents and the corresponding charge labels have been obtained, an existing open-source Python package for Chinese word segmentation, Jieba (or Thulac), is first used to split the text into individual words. Unimportant stop words are then removed with an open-source or self-built stop-word dictionary. In both steps, professional knowledge of the legal field can be applied: words that are prone to ambiguous segmentation can be added to the segmentation dictionary so that they are kept intact, and wrongly segmented tokens that would harm model training can be added to the stop-word dictionary so that they are removed. After each legal text has been segmented, techniques such as One-Hot, TFIDF, or Word2vec with an LSTM can be used to vectorize the text into vectors of uniform dimension for subsequent training.
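The preprocessing described above can be illustrated with a small dependency-free Python sketch. It assumes the text has already been segmented (a segmenter such as Jieba would supply the token lists), and it uses a plain count vector as the simplest bag-of-words vectorization. The stop-word set, the sample tokens and all function names are made up for illustration.

```python
from collections import Counter

# Hypothetical stop-word list; a real one would come from an open-source
# or self-built stop-word dictionary.
STOP_WORDS = {"的", "了", "在", "，", "。"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words and empty tokens from an already segmented text."""
    return [t for t in tokens if t.strip() and t not in stop_words]


def build_vocabulary(documents):
    """Map every distinct token to a column index, in first-seen order."""
    vocab = {}
    for tokens in documents:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(tokens, vocab):
    """Return a fixed-length count vector; unknown tokens are ignored."""
    counts = Counter(t for t in tokens if t in vocab)
    return [counts.get(token, 0) for token in vocab]


# Token lists written out by hand so the sketch needs no segmenter;
# in practice they would be the output of the segmentation step.
segmented = [
    ["被告人", "在", "公共场所", "殴打", "被害人", "。"],
    ["被告人", "收受", "了", "贿赂", "。"],
]
cleaned = [remove_stop_words(doc) for doc in segmented]
vocab = build_vocabulary(cleaned)
vectors = [bag_of_words(doc, vocab) for doc in cleaned]
print(vectors)
```

Every text ends up with a vector of the same length (the vocabulary size), which is the "uniform dimension" the later training steps rely on. TFIDF or Word2vec would replace `bag_of_words` without changing the rest of the flow.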
Next, the behavior labels used to extract the key information of the text need to be defined. In practice, labels such as "unjust enrichment", "buying and selling", "causing death", "violence", "involving state organs", "unlawful appropriation", "bodily injury", "subjective intent", "public place" and "bribery" can be used. These labels are only a general-purpose set with many omissions and shortcomings; they can be refined and extended according to legal expertise to suit specific needs.
The sub-classifiers are then trained with machine learning algorithms in sklearn that can be used for binary classification, such as Logistic Regression, SVM, Bayes classifiers and decision trees; Xgboost, or the LightGBM algorithm published by Microsoft, can also be used. Although these algorithms are not guaranteed to outperform neural-network methods such as deep learning, they are easy to use and the performance gap on binary classification tasks is small. There are two feasible ways to annotate the training set:
1. For each legal judgment document, judge specifically which of the above behaviors are present, setting the corresponding position to 1 if a behavior is present and to 0 if it is not. This yields a fixed-length binary vector; whichever sub-classifier is to be trained takes the corresponding bit as its label. Although this approach gives well-targeted and precise labels, it requires a great deal of manual labor to annotate large volumes of judicial data and is subject to human error, so it is not recommended.
2. Assign each charge a fixed set of behavior labels that are generally present or particularly critical for it. For example, the crime of accepting bribes usually includes "involving state organs", "bribery" and "subjective intent", so the sub-classifier labels of a bribery text can be marked as [0,0,0,0,1,0,0,1,0,1], that is, the positions corresponding to the three behaviors present are marked 1. All texts for the crime of accepting bribes are given the same sub-classifier labels.
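The second annotation method amounts to a lookup table from charge to behavior labels. The Python sketch below reproduces the bribery example from the text; the English label names are this sketch's own renderings of the ten labels listed earlier, and the table and function names are hypothetical. Only the bribery row comes from the text.

```python
# The ten behavior labels, in the order implied by the bribery example.
BEHAVIOR_LABELS = [
    "unjust enrichment",
    "buying and selling",
    "causing death",
    "violence",
    "involving state organs",
    "unlawful appropriation",
    "bodily injury",
    "subjective intent",
    "public place",
    "bribery",
]

# Hypothetical charge-to-behavior table; further charges would be added
# according to legal expertise.
CHARGE_BEHAVIORS = {
    "accepting bribes": {
        "involving state organs", "subjective intent", "bribery",
    },
}


def behavior_vector(charge):
    """Fixed binary vector shared by every text with this charge."""
    present = CHARGE_BEHAVIORS[charge]
    return [1 if label in present else 0 for label in BEHAVIOR_LABELS]


def labels_for_classifier(charges, behavior):
    """Training labels for the sub-classifier of a single behavior."""
    index = BEHAVIOR_LABELS.index(behavior)
    return [behavior_vector(charge)[index] for charge in charges]


print(behavior_vector("accepting bribes"))
```

Each sub-classifier then trains on one column of these vectors, obtained here with `labels_for_classifier`, so no per-document manual annotation is needed.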
A sub-classifier is trained on the full dataset for every behavior label. The outputs of the sub-classifiers serve as the input of the meta-classifier, whose labels are simply the original charge labels of the training set.
At this point the feature vector of the legal text has been constructed. Its length is the number of behavior labels defined, and each element is a binary 0 or 1 indicating whether the corresponding behavior is present. In an actual legal text classification task, the behavior labels can be fed to the meta-classifier as features for training and prediction.
This scheme leaves room for further improvement, mainly in two respects:
1. Use the confidence of each binary classifier in place of 0 or 1. The confidence represents how certain the classifier is of its judgment and takes values in the range (0, 1). The behavior labels annotated for each charge are fixed, but not every legal text necessarily contains all of those behaviors, so supplying the confidence makes the description more accurate.
2. Divide the labels more finely by adding -1 to indicate that no relevant description exists. So far 1 has meant that the behavior is clearly present and 0 that it is clearly absent, but there is a third case: the text contains no relevant description, so it cannot be determined whether the behavior is present. Adding this class makes the feature construction finer-grained, but it may also introduce errors.
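The three feature encodings discussed (hard 0/1, confidence, and the extended -1/0/1 scheme) can be compared in a short Python sketch. The 0.5 threshold, the sample confidences and the `described` flags are assumptions made for illustration; the text does not specify how the "no relevant description" case would be detected, so the flag is simply given as input here.

```python
def binary_feature(confidence, threshold=0.5):
    """Hard 0/1 feature; the 0.5 threshold is an assumed default."""
    return 1 if confidence >= threshold else 0


def ternary_feature(confidence, described, threshold=0.5):
    """1 = present, 0 = clearly absent, -1 = no relevant description."""
    if not described:
        return -1
    return 1 if confidence >= threshold else 0


# Illustrative confidences from three binary classifiers for one text,
# plus a made-up flag saying whether the text describes the behavior at all.
confidences = [0.92, 0.08, 0.55]
described = [True, True, False]

hard = [binary_feature(c) for c in confidences]
soft = list(confidences)  # improvement 1: pass the confidence through
ternary = [ternary_feature(c, d) for c, d in zip(confidences, described)]

print(hard, soft, ternary)
```

The third classifier shows the difference: a borderline confidence of 0.55 becomes a full 1 in the hard encoding, stays 0.55 in the soft encoding, and becomes -1 when the behavior is not described at all.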
The invention makes flexible use of the prior art, drawing on the strengths of currently common open-source frameworks and languages while avoiding their weaknesses, to improve the feature engineering method for legal text so that key information is captured better and training accuracy improves. It can thereby assist legal judgments, legal research and the like.
The foregoing is merely a preferred embodiment of the invention, intended only to illustrate its technical solution and not to limit its scope of protection. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the invention falls within the scope of protection of the invention.
Claims (9)
1. A feature engineering method for legal text classification based on binary classifiers, characterized in that
the result of text vectorization is first used to train binary classification models, the binary classification results indicate whether a criminal act or key information about the crime is present, a feature vector is then constructed from these results, and the feature vector is used in the multi-class classification task on legal text.
2. The method according to claim 1, characterized in that
the implementation steps are as follows:
Step 1: obtain the training dataset;
Step 2: perform Chinese word segmentation and remove stop words;
Step 3: vectorize the text;
Step 4: define the behavior labels and train the corresponding binary classifiers;
Step 5: train the multi-class classifier, using the binary-classifier labels corresponding to each text from the previous step as input, the labels in this step being the labels obtained in Step 1;
at this point, construction of the multi-class classifier for legal text is complete.
3. The method according to claim 2, characterized in that
in Step 1, legal judgment documents are collected manually and labeled according to the verdict to obtain the corresponding charge labels.
4. The method according to claim 2, characterized in that
in Step 2, the open-source Jieba library is used to cut the whole legal text into words, and the predefined stop words are then removed by checking against a stop-word dictionary.
5. The method according to claim 2, characterized in that
in Step 3, the segmented words are vectorized with the existing One-Hot, TFIDF or Word2vec technique, giving a feature vector for each text.
6. The method according to claim 2, characterized in that
in Step 4, the behavior features to be extracted are defined and annotated for training the binary classifiers.
7. The method according to claim 2, characterized in that
when new text needs to be predicted and classified, the corresponding steps are as follows:
Step 1: perform Chinese word segmentation and remove stop words;
Step 2: vectorize the text;
Step 3: apply each of the binary classifiers in turn to the vector from Step 2 to obtain the behavior feature vector;
Step 4: use the behavior feature vector of each text from the previous step as the input of the multi-class classifier to predict the charge.
8. The method according to claim 7, characterized in that
in Step 1, an open-source library such as Jieba is used to cut the whole legal text into words, and the predefined stop words are then removed by checking against a stop-word dictionary.
9. The method according to claim 7, characterized in that
in Step 2, the segmented words are vectorized with the vectorization model that was used during training, giving a feature vector for each text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910401645.9A CN110110087A (en) | 2019-05-15 | 2019-05-15 | A kind of Feature Engineering method for Law Text classification based on two classifiers |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910401645.9A CN110110087A (en) | 2019-05-15 | 2019-05-15 | A kind of Feature Engineering method for Law Text classification based on two classifiers |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110110087A true CN110110087A (en) | 2019-08-09 |
Family
ID=67490149
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910401645.9A Pending CN110110087A (en) | 2019-05-15 | 2019-05-15 | A kind of Feature Engineering method for Law Text classification based on two classifiers |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110110087A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1310825A (en) * | 1998-06-23 | 2001-08-29 | 微软公司 | Methods and apparatus for classifying text and for building a text classifier |
US20100186091A1 (en) * | 2008-05-13 | 2010-07-22 | James Luke Turner | Methods to dynamically establish overall national security or sensitivity classification for information contained in electronic documents; to provide control for electronic document/information access and cross domain document movement; to establish virtual security perimeters within or among computer networks for electronic documents/information; to enforce physical security perimeters for electronic documents between or among networks by means of a perimeter breach alert system |
CN108228687A (en) * | 2017-06-20 | 2018-06-29 | 上海吉贝克信息技术有限公司 | Big data knowledge excavation and accurate tracking and system |
Non-Patent Citations (1)
Title |
---|
邓文超: "基于深度学习的司法智能研究", 《中国优秀硕士学位论文全文数据库 社会科学Ⅰ辑》 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111242237A (en) * | 2020-01-20 | 2020-06-05 | 北京明略软件系统有限公司 | Bert model training and classifying method, system, medium and computer equipment |
CN111949794A (en) * | 2020-08-14 | 2020-11-17 | 扬州大学 | Online active machine learning method for text multi-classification task |
CN112114795A (en) * | 2020-09-18 | 2020-12-22 | 北京航空航天大学 | Method and device for predicting deactivation of auxiliary tool in open source community |
CN112488551A (en) * | 2020-12-11 | 2021-03-12 | 浪潮云信息技术股份公司 | XGboost algorithm-based hot line intelligent order dispatching method |
CN112488551B (en) * | 2020-12-11 | 2023-04-07 | 浪潮云信息技术股份公司 | Hot line intelligent order dispatching method based on XGboost algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106919673B (en) | Text mood analysis system based on deep learning | |
CN109189901B (en) | Method for automatically discovering new classification and corresponding corpus in intelligent customer service system | |
CN110110087A (en) | A kind of Feature Engineering method for Law Text classification based on two classifiers | |
CN107944480A (en) | A kind of enterprises ' industry sorting technique | |
CN106776538A (en) | The information extracting method of enterprise's noncanonical format document | |
CN108563638B (en) | Microblog emotion analysis method based on topic identification and integrated learning | |
CN111104510B (en) | Text classification training sample expansion method based on word embedding | |
CN110413783A (en) | A kind of judicial style classification method and system based on attention mechanism | |
CN110276054A (en) | A kind of insurance text structure implementation method | |
CN109657058A (en) | A kind of abstracting method of notice information | |
CN109271527A (en) | A kind of appellative function point intelligent identification Method | |
CN109446423B (en) | System and method for judging sentiment of news and texts | |
CN109492105A (en) | A kind of text sentiment classification method based on multiple features integrated study | |
CN105912525A (en) | Sentiment classification method for semi-supervised learning based on theme characteristics | |
CN110910175A (en) | Tourist ticket product portrait generation method | |
CN109783636A (en) | A kind of car review subject distillation method based on classifier chains | |
Safrin et al. | Sentiment analysis on online product review | |
CN112860889A (en) | BERT-based multi-label classification method | |
CN107357895A (en) | A kind of processing method of the text representation based on bag of words | |
CN111462752A (en) | Client intention identification method based on attention mechanism, feature embedding and BI-L STM | |
CN110909542A (en) | Intelligent semantic series-parallel analysis method and system | |
Bengio et al. | Using web co-occurrence statistics for improving image categorization | |
WO2021128704A1 (en) | Open set classification method based on classification utility | |
CN110110326B (en) | Text cutting method based on subject information | |
CN110245234A (en) | A kind of multi-source data sample correlating method based on ontology and semantic similarity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190809 |