CN109635254A - Paper duplicate checking method based on naive Bayesian, decision tree and SVM mixed model - Google Patents

Paper duplicate checking method based on naive Bayesian, decision tree and SVM mixed model

Info

Publication number
CN109635254A
CN109635254A
Authority
CN
China
Prior art keywords
decision tree
classification
class
keyword
svm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811467956.7A
Other languages
Chinese (zh)
Inventor
廖勇
张笑颜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN201811467956.7A priority Critical patent/CN109635254A/en
Publication of CN109635254A publication Critical patent/CN109635254A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/194: Calculation of difference between files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/216: Parsing using statistical methods

Abstract

The present invention proposes a paper duplicate checking method based on a naive Bayes, decision tree and SVM mixed model. First, a keyword database is established from the occurrence frequency of searched keywords. Second, the keywords are classified. Then, a fusion of the decision tree and naive Bayes performs a first coarse screening to determine the plagiarism type of the article. Finally, when the decision tree classification cannot specify a classification criterion, an SVM is used for learning, forming a fine screening stage. The invention aims to improve current paper duplicate checking systems and to increase their accuracy in paper duplicate checking.

Description

Paper duplicate checking method based on naive Bayesian, decision tree and SVM mixed model
Technical field:
The present invention relates to text duplicate checking methods, and in particular to a paper duplicate checking method based on a naive Bayes, decision tree and SVM mixed model.
Background technique:
The Internet is now highly developed, and many researchers upload their research results online. Many positions require papers: teachers and doctors, for example, must complete academic papers to compete for professional titles, and graduating students must finish their theses. However, some people violate basic academic ethics and plagiarize other people's research results for personal gain. To combat academic fraud and academic misconduct, paper duplicate checking software has emerged. This technology is not yet mature, however, and the possibility of misjudgment is high. Current paper duplicate checking systems still have the following problems: (1) duplicate checking of the literal text of an article is very strict, but plagiarism of the central idea of an article is difficult to recognize; (2) articles inevitably contain formulas or descriptions of common knowledge that should not count as plagiarism, yet many current systems flag them as plagiarism; (3) the distinction between plagiarism types is unclear, so the severity of an author's plagiarism cannot be judged. These problems need to be solved by those skilled in the art.
Summary of the invention:
In view of the above problems, the present invention proposes a paper duplicate checking method, which is specified as follows:
1. A paper duplicate checking method based on a naive Bayes, decision tree and SVM mixed model, characterized by comprising the following four steps:
S1: establish a keyword database from the occurrence frequency of searched keywords;
S2: classify the keywords;
S3: perform a preliminary coarse screening using a fusion of the decision tree and naive Bayes;
S4: when the decision tree classification cannot specify a classification criterion, learn with an SVM to form a fine screening stage.
2. The paper duplicate checking method based on a naive Bayes, decision tree and SVM mixed model according to claim 1, characterized in that step S2 comprises the following sub-steps:
S21: classify the keywords into an innovation class and a knowledge class;
S22: the tolerated repetition rate for knowledge-class keywords is 40%, while the tolerated rate for innovation-class keywords is lower, at 5%; this prevents misjudgments during duplicate checking that are caused by the use of common knowledge in an article.
3. The paper duplicate checking method based on a naive Bayes, decision tree and SVM mixed model according to claim 1, characterized in that step S3 comprises the following sub-steps:
S31: extract key indicators by detecting charts, data, keywords and the central idea;
S32: use the Spearman rank correlation coefficient to determine the pairwise correlation of the indicators, reduce the dimensionality of the strongly correlated indicators that are screened out using principal component analysis, and recombine them into a new group of mutually independent generalized variables;
S33: choose six parts of the article, namely the opening paragraph, four middle paragraphs and the concluding paragraph; analyze their weights with the analytic hierarchy process (AHP) and obtain an integrated value for the six parts after weighted synthesis; the middle paragraphs are extracted as follows: if the body of the article contains more than four core views, take the paragraph with the most words for each core view, sort these paragraphs by word count in descending order, and choose the top four; if there are exactly four core views, directly choose the paragraph with the most words for each of the four views; if there are fewer than four, choose the four paragraphs of the body with the most words after sorting all paragraphs by word count;
S34: express the set of plagiarism types as the dependent variable and the set of criterion attributes as the independent variables; with the six positional integrated values of the paragraph criterion attributes and their corresponding plagiarism types as training samples, build a CART decision tree by recursive partitioning of the training samples;
S35: count the numbers of training samples correctly classified by the CART decision tree and by the Bayesian model during training; dividing by the total number of training samples gives the classification accuracies of the two algorithms, $A_{CART}$ and $A_{NB}$; then compute the training accuracy of the decision tree model for each plagiarism type, $b(k)$, $k = 1, 2, \ldots, m$, where $m$ is the total number of plagiarism types; define the posterior probability of the decision tree model for each plagiarism class when its output type is $Y_t$, combine it with the posterior probability $P(Y_k|X)_{NB}$ output by the Bayesian model by weighted synthesis, and take the plagiarism type with the largest resulting probability as the final classification output.
4. The paper duplicate checking method based on a naive Bayes, decision tree and SVM mixed model according to claim 1, characterized in that step S4 comprises the following sub-steps:
S41: generate training sample sets by actively selecting training samples; that is, from the various training articles, delineate training article sets I1, I2, ..., IC for C categories, sample each of I1, I2, ..., IC with uniform sampling to generate training sample sets I'1, I'2, ..., I'C containing equal numbers of articles, and use the plagiarism probability of each article as the sample vector;
S42: the class splitting scheme of a node classifier is as follows:
assume that the positive and negative example class sets divided by the node classifier are S1 and S2, N1 and N2 are the numbers of classes in S1 and S2 respectively, C = N1 + N2 is the total number of classes the node must divide, $X_j$ denotes the j-th class sample set, $j = 1, 2, \ldots, C$, the number of samples of $X_j$ is $n_j$, and the sample vector is x;
1) calculate the center of each class;
2) let i be the index of a class splitting scheme; for every splitting scheme, compute the quantities in steps 3) and 4);
3) calculate the centers of the positive and negative example class sets S1 and S2, and calculate the Euclidean distance between the centers of S1 and S2:
$d^i_{S1S2} = \|e_1^i - e_2^i\|$
4) calculate the average distance from the centers of the classes in S1 to the center of S1, and the average distance from the centers of the classes in S2 to the center of S2;
5) calculate $d^i$ according to the following formula; the scheme that maximizes it is the required splitting scheme:
$d^i = d^i_{S1S2} + d^i_{S1} + d^i_{S2}$
According to the node classifier class division method above, design the class splitting scheme of each node classifier top-down and finally build the complete decision tree;
S43: use the training sample sets I'1, I'2, ..., I'C to train each node classifier, finally forming the complete SVM decision tree classifier;
S44: take all pixels of the image to be classified as the test sample set, perform test classification with the SVM decision tree classifier, and map the classification results back to the image to realize image classification.
The beneficial effects of the present invention are as follows: some of the problems in current paper duplicate checking systems are solved and the specific circumstances of plagiarism are refined. Keyword classification and keyword repetition-rate queries reduce misjudgments in duplicate checking that may be caused by the repetition of knowledge-type content; a fusion of naive Bayes and decision tree algorithms builds a CART decision tree to judge the plagiarism type of a paper; and for plagiarism types that cannot be clearly classified, a fusion of SVM and decision tree algorithms builds an SVM decision tree classifier to classify them and further analyze the degree of plagiarism.
Detailed description of the invention
Additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is the overall flow chart of the present invention.
Specific embodiment
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numbers throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the invention, and are not to be construed as limiting the invention.
In the description of the present invention, it should be understood that terms indicating orientation or positional relationships, such as "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer", are based on the orientations or positional relationships shown in the drawings, are used merely for convenience and simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation or be constructed and operated in a particular orientation; they should therefore not be understood as limiting the invention.
The present invention proposes a paper duplicate checking method based on a naive Bayes, decision tree and SVM mixed model. A fusion of the naive Bayes and decision tree algorithms performs a coarse screening of the plagiarism situation of a paper to determine the plagiarism type, and a fusion of the decision tree and SVM algorithms then further classifies the plagiarism types that could not be classified.
The present invention is described in detail below with reference to Fig. 1 and mainly comprises the following steps:
Step 1: start.
Step 2: extract keywords and detect the keyword repetition rate.
A keyword database is established from the occurrence frequency of searched keywords, and the keywords are classified into an innovation class and a knowledge class. The tolerated repetition rate for knowledge-class keywords is 40%, while the tolerance for innovation-class keywords is lower, at 5%. This prevents misjudgments during duplicate checking that are caused by the use of common knowledge in an article.
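As a concrete illustration of this step (not part of the original disclosure), the following Python sketch applies the two tolerated repetition rates to the two keyword classes; the helper names and the simple token-frequency metric are assumptions made for illustration only.

```python
# Minimal sketch of step 2: per-class repetition-rate tolerances (assumed helper names).
from collections import Counter

# Tolerated repetition rates from the description: knowledge class 40%, innovation class 5%.
TOLERATED_RATE = {"knowledge": 0.40, "innovation": 0.05}

def keyword_repetition_rate(keyword: str, article_tokens: list[str]) -> float:
    """Fraction of article tokens matching the keyword (a simple stand-in metric)."""
    counts = Counter(article_tokens)
    return counts[keyword] / max(len(article_tokens), 1)

def flag_keywords(keywords: dict[str, str], article_tokens: list[str]) -> list[str]:
    """Return keywords whose repetition rate exceeds the tolerance of their class.

    `keywords` maps each keyword to its class label, "knowledge" or "innovation".
    """
    flagged = []
    for kw, kw_class in keywords.items():
        if keyword_repetition_rate(kw, article_tokens) > TOLERATED_RATE[kw_class]:
            flagged.append(kw)
    return flagged

# Example usage with toy data: "svm" (innovation class) exceeds 5%, "knowledge" stays under 40%.
tokens = "svm decision tree svm bayes knowledge knowledge knowledge".split()
print(flag_keywords({"svm": "innovation", "knowledge": "knowledge"}, tokens))
```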
Step 3: build the CART decision tree.
Key indicators are extracted by detecting charts, data, keywords and the central idea. The Spearman rank correlation coefficient is used to determine the pairwise correlation of the indicators; the strongly correlated indicators that are screened out are reduced in dimensionality with principal component analysis and recombined into a new group of mutually independent generalized variables. Six parts of the article are chosen, namely the opening paragraph, four middle paragraphs and the concluding paragraph; their weights are analyzed with the analytic hierarchy process (AHP) and an integrated value for the six parts is obtained after weighted synthesis. The middle paragraphs are extracted as follows: if the body of the article contains more than four core views, the paragraph with the most words is taken for each core view, these paragraphs are sorted by word count in descending order, and the top four are chosen; if there are exactly four core views, the paragraph with the most words is chosen directly for each of the four views; if there are fewer than four, the four paragraphs of the body with the most words after sorting all paragraphs by word count are chosen. The set of plagiarism types is expressed as the dependent variable and the set of criterion attributes as the independent variables; with the six positional integrated values of the paragraph criterion attributes and their corresponding plagiarism types as training samples, a CART decision tree is built by recursive partitioning of the training samples.
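A minimal sketch of how this step could be realized with standard Python tooling follows; the feature-matrix layout, the correlation threshold of 0.8 and the number of principal components are illustrative assumptions rather than values specified by the invention.

```python
# Sketch of step 3: Spearman screening, PCA recombination and a CART tree (assumed parameters).
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def fit_cart_on_reduced_features(X: np.ndarray, y: np.ndarray,
                                 corr_threshold: float = 0.8,
                                 n_components: int = 6) -> DecisionTreeClassifier:
    """X: articles x key indicators (at least three columns, e.g. the six positional
    integrated values); y: plagiarism-type labels."""
    # Pairwise Spearman rank correlation between the indicator columns.
    rho, _ = spearmanr(X)

    # Screen out indicators that are strongly correlated with at least one other indicator.
    off_diagonal = np.abs(rho) - np.eye(rho.shape[0])
    strongly_correlated = np.any(off_diagonal > corr_threshold, axis=1)
    X_strong = X[:, strongly_correlated] if strongly_correlated.any() else X

    # Recombine them into mutually independent generalized variables with PCA.
    pca = PCA(n_components=min(n_components, X_strong.shape[1]))
    X_reduced = pca.fit_transform(X_strong)

    # scikit-learn's DecisionTreeClassifier grows a CART-style tree by recursive partitioning.
    cart = DecisionTreeClassifier(criterion="gini", random_state=0)
    cart.fit(X_reduced, y)
    return cart
```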
Step 4: judge whether the plagiarism type can be determined.
The numbers of training samples correctly classified by the CART decision tree and by the Bayesian model during training are counted separately; dividing by the total number of training samples gives the classification accuracies of the two algorithms, $A_{CART}$ and $A_{NB}$. The training accuracy of the decision tree model for each plagiarism type, $b(k)$, $k = 1, 2, \ldots, m$, is then computed, where $m$ is the total number of plagiarism types. The posterior probability of the decision tree model for each plagiarism class when its output type is $Y_t$ is defined and combined with the posterior probability $P(Y_k|X)_{NB}$ output by the Bayesian model by weighted synthesis; the plagiarism type with the largest resulting probability is the final classification output.
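The fusion formula itself is not reproduced in this record; the sketch below shows one plausible accuracy-weighted combination of the two posteriors, in which the decision tree posterior places its per-class training accuracy b(k) on the predicted type and spreads the remainder evenly over the other types. This weighting is an illustrative assumption, not the formula of the invention.

```python
# Sketch of step 4: accuracy-weighted fusion of CART and naive Bayes posteriors (assumed weighting).
import numpy as np

def cart_posterior(output_type: int, b: np.ndarray) -> np.ndarray:
    """Turn a hard CART prediction into a posterior: the output class receives its
    per-class training accuracy b(k); the remaining mass is spread evenly."""
    m = len(b)
    p = np.full(m, (1.0 - b[output_type]) / max(m - 1, 1))
    p[output_type] = b[output_type]
    return p

def fuse_posteriors(p_cart: np.ndarray, p_nb: np.ndarray,
                    acc_cart: float, acc_nb: float) -> int:
    """p_cart, p_nb: posterior vectors over the m plagiarism types from the CART tree
    and the naive Bayes model; acc_cart, acc_nb: their overall accuracies A_CART, A_NB."""
    # Weight each model's posterior by its training accuracy and renormalize.
    fused = acc_cart * p_cart + acc_nb * p_nb
    fused = fused / fused.sum()
    # The plagiarism type with the largest fused probability is the final output.
    return int(np.argmax(fused))

# Example usage with toy numbers.
b = np.array([0.9, 0.8, 0.7])          # per-class training accuracies b(k)
p_cart = cart_posterior(0, b)           # CART predicted type 0
p_nb = np.array([0.5, 0.3, 0.2])        # naive Bayes posterior
print(fuse_posteriors(p_cart, p_nb, acc_cart=0.85, acc_nb=0.80))
```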
Step 5: form the SVM decision tree classifier.
Training sample sets are generated and training samples are actively selected; that is, from the various training articles, training article sets I1, I2, ..., IC are delineated for C categories, each of I1, I2, ..., IC is sampled with uniform sampling to generate training sample sets I'1, I'2, ..., I'C containing equal numbers of articles, and the plagiarism probability of each article is used as the sample vector. The class splitting scheme of a node classifier is as follows:
assume that the positive and negative example class sets divided by the node classifier are S1 and S2, N1 and N2 are the numbers of classes in S1 and S2 respectively, C = N1 + N2 is the total number of classes the node must divide, $X_j$ denotes the j-th class sample set, $j = 1, 2, \ldots, C$, the number of samples of $X_j$ is $n_j$, and the sample vector is x;
1) calculate the center of each class;
2) let i be the index of a class splitting scheme; for every splitting scheme, compute the quantities in steps 3) and 4);
3) calculate the centers of the positive and negative example class sets S1 and S2, and calculate the Euclidean distance between the centers of S1 and S2:
$d^i_{S1S2} = \|e_1^i - e_2^i\|$
4) calculate the average distance from the centers of the classes in S1 to the center of S1, and the average distance from the centers of the classes in S2 to the center of S2;
5) calculate $d^i$ according to the following formula; the scheme that maximizes it is the required splitting scheme (see the sketch after this step):
$d^i = d^i_{S1S2} + d^i_{S1} + d^i_{S2}$
According to the node classifier class division method above, the class splitting scheme of each node classifier is designed top-down and the complete decision tree is finally built.
Each node classifier is trained with the training sample sets I'1, I'2, ..., I'C, finally forming the complete SVM decision tree classifier. All pixels of the image to be classified are taken as the test sample set, test classification is performed with the SVM decision tree classifier, and the classification results are mapped back to the image to realize image classification.
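To make the splitting criterion of step 5 concrete, the following sketch scores a candidate positive/negative split of the classes at one node by $d^i = d^i_{S1S2} + d^i_{S1} + d^i_{S2}$ and trains the node's binary SVM; the exhaustive enumeration of candidate splits and the use of scikit-learn's SVC as the node classifier are illustrative assumptions, not requirements of the invention.

```python
# Sketch of step 5: scoring class splits at a node of the SVM decision tree (assumed enumeration).
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def split_score(centers: np.ndarray, s1: list[int], s2: list[int]) -> float:
    """centers: one center vector per class; s1, s2: class indices in the positive
    and negative groups. Returns d^i = d^i_S1S2 + d^i_S1 + d^i_S2."""
    e1, e2 = centers[s1].mean(axis=0), centers[s2].mean(axis=0)
    d_s1s2 = np.linalg.norm(e1 - e2)                                   # distance between group centers
    d_s1 = np.mean([np.linalg.norm(centers[j] - e1) for j in s1])      # average spread of S1
    d_s2 = np.mean([np.linalg.norm(centers[j] - e2) for j in s2])      # average spread of S2
    return d_s1s2 + d_s1 + d_s2

def best_split(centers: np.ndarray) -> tuple[list[int], list[int]]:
    """Enumerate candidate splits of all classes into two non-empty groups and
    return the split that maximizes the score."""
    classes = list(range(len(centers)))
    best, best_d = None, -np.inf
    for k in range(1, len(classes)):
        for s1 in combinations(classes, k):
            s2 = [c for c in classes if c not in s1]
            d = split_score(centers, list(s1), s2)
            if d > best_d:
                best, best_d = (list(s1), s2), d
    return best

def train_node_classifier(X: np.ndarray, y: np.ndarray, s1: list[int]) -> SVC:
    """Train a binary SVM separating the S1 classes (label 1) from the rest (label 0)."""
    binary_y = np.isin(y, s1).astype(int)
    return SVC(kernel="rbf").fit(X, binary_y)
```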
Step 6: end.
Although embodiments of the present invention have been shown and described, those skilled in the art will understand that various changes, modifications, replacements and variants can be made to these embodiments without departing from the principle and purpose of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims (4)

1. A paper duplicate checking method based on a naive Bayes, decision tree and SVM mixed model, characterized by comprising the following four steps:
S1: establish a keyword database from the occurrence frequency of searched keywords;
S2: classify the keywords;
S3: perform a preliminary coarse screening using a fusion of the decision tree and naive Bayes;
S4: when the decision tree classification cannot specify a classification criterion, learn with an SVM to form a fine screening stage.
2. The paper duplicate checking method based on a naive Bayes, decision tree and SVM mixed model according to claim 1, characterized in that step S2 comprises the following sub-steps:
S21: classify the keywords into an innovation class and a knowledge class;
S22: the tolerated repetition rate for knowledge-class keywords is 40%, while the tolerated rate for innovation-class keywords is lower, at 5%; this prevents misjudgments during duplicate checking that are caused by the use of common knowledge in an article.
3. The paper duplicate checking method based on a naive Bayes, decision tree and SVM mixed model according to claim 1, characterized in that step S3 comprises the following sub-steps:
S31: extract key indicators by detecting charts, data, keywords and the central idea;
S32: use the Spearman rank correlation coefficient to determine the pairwise correlation of the indicators, reduce the dimensionality of the strongly correlated indicators that are screened out using principal component analysis, and recombine them into a new group of mutually independent generalized variables;
S33: choose six parts of the article, namely the opening paragraph, four middle paragraphs and the concluding paragraph; analyze their weights with the analytic hierarchy process (AHP) and obtain an integrated value for the six parts after weighted synthesis; the middle paragraphs are extracted as follows: if the body of the article contains more than four core views, take the paragraph with the most words for each core view, sort these paragraphs by word count in descending order, and choose the top four; if there are exactly four core views, directly choose the paragraph with the most words for each of the four views; if there are fewer than four, choose the four paragraphs of the body with the most words after sorting all paragraphs by word count;
S34: express the set of plagiarism types as the dependent variable and the set of criterion attributes as the independent variables; with the six positional integrated values of the paragraph criterion attributes and their corresponding plagiarism types as training samples, build a CART decision tree by recursive partitioning of the training samples;
S35: count the numbers of training samples correctly classified by the CART decision tree and by the Bayesian model during training; dividing by the total number of training samples gives the classification accuracies of the two algorithms, $A_{CART}$ and $A_{NB}$; then compute the training accuracy of the decision tree model for each plagiarism type, $b(k)$, $k = 1, 2, \ldots, m$, where $m$ is the total number of plagiarism types; define the posterior probability of the decision tree model for each plagiarism class when its output type is $Y_t$, combine it with the posterior probability $P(Y_k|X)_{NB}$ output by the Bayesian model by weighted synthesis, and take the plagiarism type with the largest resulting probability as the final classification output.
4. The paper duplicate checking method based on a naive Bayes, decision tree and SVM mixed model according to claim 1, characterized in that step S4 comprises the following sub-steps:
S41: generate training sample sets by actively selecting training samples; that is, from the various training articles, delineate training article sets I1, I2, ..., IC for C categories, sample each of I1, I2, ..., IC with uniform sampling to generate training sample sets I'1, I'2, ..., I'C containing equal numbers of articles, and use the plagiarism probability of each article as the sample vector;
S42: the class splitting scheme of a node classifier is as follows:
assume that the positive and negative example class sets divided by the node classifier are S1 and S2, N1 and N2 are the numbers of classes in S1 and S2 respectively, C = N1 + N2 is the total number of classes the node must divide, $X_j$ denotes the j-th class sample set, $j = 1, 2, \ldots, C$, the number of samples of $X_j$ is $n_j$, and the sample vector is x;
1) calculate the center of each class;
2) let i be the index of a class splitting scheme; for every splitting scheme, compute the quantities in steps 3) and 4);
3) calculate the centers of the positive and negative example class sets S1 and S2, and calculate the Euclidean distance between the centers of S1 and S2:
$d^i_{S1S2} = \|e_1^i - e_2^i\|$
4) calculate the average distance from the centers of the classes in S1 to the center of S1, and the average distance from the centers of the classes in S2 to the center of S2;
5) calculate $d^i$ according to the following formula; the scheme that maximizes it is the required splitting scheme:
$d^i = d^i_{S1S2} + d^i_{S1} + d^i_{S2}$
According to the node classifier class division method above, design the class splitting scheme of each node classifier top-down and finally build the complete decision tree;
S43: use the training sample sets I'1, I'2, ..., I'C to train each node classifier, finally forming the complete SVM decision tree classifier;
S44: take all pixels of the image to be classified as the test sample set, perform test classification with the SVM decision tree classifier, and map the classification results back to the image to realize image classification.
CN201811467956.7A 2018-12-03 2018-12-03 Paper duplicate checking method based on naive Bayesian, decision tree and SVM mixed model Pending CN109635254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811467956.7A CN109635254A (en) 2018-12-03 2018-12-03 Paper duplicate checking method based on naive Bayesian, decision tree and SVM mixed model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811467956.7A CN109635254A (en) 2018-12-03 2018-12-03 Paper duplicate checking method based on naive Bayesian, decision tree and SVM mixed model

Publications (1)

Publication Number Publication Date
CN109635254A true CN109635254A (en) 2019-04-16

Family

ID=66070663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811467956.7A Pending CN109635254A (en) 2018-12-03 2018-12-03 Paper duplicate checking method based on naive Bayesian, decision tree and SVM mixed model

Country Status (1)

Country Link
CN (1) CN109635254A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111367874A (en) * 2020-02-28 2020-07-03 北京神州绿盟信息安全科技股份有限公司 Log processing method, device, medium and equipment
CN111723208A (en) * 2020-06-28 2020-09-29 西南财经大学 Conditional classification tree-based legal decision document multi-classification method and device and terminal

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1804829A (en) * 2006-01-10 2006-07-19 西安交通大学 Semantic classification method for Chinese question
US20080195577A1 (en) * 2007-02-09 2008-08-14 Wei Fan Automatically and adaptively determining execution plans for queries with parameter markers
CN101441620A (en) * 2008-11-27 2009-05-27 温州大学 Electronic text document plagiarism recognition method based on similar string matching distance
CN101819601A (en) * 2010-05-11 2010-09-01 同方知网(北京)技术有限公司 Method for automatically classifying academic documents
CN101826263A (en) * 2009-03-04 2010-09-08 中国科学院自动化研究所 Objective standard based automatic oral evaluation system
CN103514170A (en) * 2012-06-20 2014-01-15 中国移动通信集团安徽有限公司 Speech-recognition text classification method and device
CN103544326A (en) * 2013-11-14 2014-01-29 上海交通大学 Chinese and English cross-language plagiarism recognition method based on characteristics and content of translations
US20140223284A1 (en) * 2013-02-01 2014-08-07 Brokersavant, Inc. Machine learning data annotation apparatuses, methods and systems
CN105045825A (en) * 2015-06-29 2015-11-11 中国地质大学(武汉) Structure extended polynomial naive Bayes text classification method
CN105447505A (en) * 2015-11-09 2016-03-30 成都数之联科技有限公司 Multilevel important email detection method
CN105468713A (en) * 2015-11-19 2016-04-06 西安交通大学 Multi-model fused short text classification method
CN105956382A (en) * 2016-04-26 2016-09-21 北京工商大学 Traditional Chinese medicine constitution optimized classification method based on improved CART decision-making tree and fuzzy naive Bayes combined model
CN107145514A (en) * 2017-04-01 2017-09-08 华南理工大学 Chinese sentence pattern sorting technique based on decision tree and SVM mixed models
CN107391772A (en) * 2017-09-15 2017-11-24 国网四川省电力公司眉山供电公司 A kind of file classification method based on naive Bayesian
CN107908715A (en) * 2017-11-10 2018-04-13 中国民航大学 Microblog emotional polarity discriminating method based on Adaboost and grader Weighted Fusion
CN107977670A (en) * 2017-10-09 2018-05-01 中国电子科技集团公司第二十八研究所 Accident classification stage division, the apparatus and system of decision tree and bayesian algorithm
US20180173847A1 (en) * 2016-12-16 2018-06-21 Jang-Jih Lu Establishing a machine learning model for cancer anticipation and a method of detecting cancer by using multiple tumor markers in the machine learning model for cancer anticipation
CN108763486A (en) * 2018-05-30 2018-11-06 湖南写邦科技有限公司 Paper duplicate checking method, terminal and storage medium based on terminal

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1804829A (en) * 2006-01-10 2006-07-19 西安交通大学 Semantic classification method for Chinese question
US20080195577A1 (en) * 2007-02-09 2008-08-14 Wei Fan Automatically and adaptively determining execution plans for queries with parameter markers
CN101441620A (en) * 2008-11-27 2009-05-27 温州大学 Electronic text document plagiarism recognition method based on similar string matching distance
CN101826263A (en) * 2009-03-04 2010-09-08 中国科学院自动化研究所 Objective standard based automatic oral evaluation system
CN101819601A (en) * 2010-05-11 2010-09-01 同方知网(北京)技术有限公司 Method for automatically classifying academic documents
CN103514170A (en) * 2012-06-20 2014-01-15 中国移动通信集团安徽有限公司 Speech-recognition text classification method and device
US20140223284A1 (en) * 2013-02-01 2014-08-07 Brokersavant, Inc. Machine learning data annotation apparatuses, methods and systems
CN103544326A (en) * 2013-11-14 2014-01-29 上海交通大学 Chinese and English cross-language plagiarism recognition method based on characteristics and content of translations
CN105045825A (en) * 2015-06-29 2015-11-11 中国地质大学(武汉) Structure extended polynomial naive Bayes text classification method
CN105447505A (en) * 2015-11-09 2016-03-30 成都数之联科技有限公司 Multilevel important email detection method
CN105468713A (en) * 2015-11-19 2016-04-06 西安交通大学 Multi-model fused short text classification method
CN105956382A (en) * 2016-04-26 2016-09-21 北京工商大学 Traditional Chinese medicine constitution optimized classification method based on improved CART decision-making tree and fuzzy naive Bayes combined model
US20180173847A1 (en) * 2016-12-16 2018-06-21 Jang-Jih Lu Establishing a machine learning model for cancer anticipation and a method of detecting cancer by using multiple tumor markers in the machine learning model for cancer anticipation
CN107145514A (en) * 2017-04-01 2017-09-08 华南理工大学 Chinese sentence pattern sorting technique based on decision tree and SVM mixed models
CN107391772A (en) * 2017-09-15 2017-11-24 国网四川省电力公司眉山供电公司 A kind of file classification method based on naive Bayesian
CN107977670A (en) * 2017-10-09 2018-05-01 中国电子科技集团公司第二十八研究所 Accident classification stage division, the apparatus and system of decision tree and bayesian algorithm
CN107908715A (en) * 2017-11-10 2018-04-13 中国民航大学 Microblog emotional polarity discriminating method based on Adaboost and grader Weighted Fusion
CN108763486A (en) * 2018-05-30 2018-11-06 湖南写邦科技有限公司 Paper duplicate checking method, terminal and storage medium based on terminal

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHANCHANA SORNSOONTORN ET AL: "Using Document Classification to Improve the Performance of a Plagiarism Checker: a Case for Thai Language Documents", 2017 21st International Computer Science and Engineering Conference (ICSEC) *
HADJ AHMED BOUARARA: "Multi-Agents Machine Learning (MML) System for Plagiarism Detection" *
PATIL SANGITA B ET AL: "Use of Support Vector Machine, Decision Tree and Naive Bayesian Techniques for Wind Speed Classification", 2011 International Conference on Power and Energy Systems *
WANG SUHONG: "Research on Plagiarism Detection Based on SVM", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111367874A (en) * 2020-02-28 2020-07-03 北京神州绿盟信息安全科技股份有限公司 Log processing method, device, medium and equipment
CN111367874B (en) * 2020-02-28 2023-11-14 绿盟科技集团股份有限公司 Log processing method, device, medium and equipment
CN111723208A (en) * 2020-06-28 2020-09-29 西南财经大学 Conditional classification tree-based legal decision document multi-classification method and device and terminal
CN111723208B (en) * 2020-06-28 2023-04-18 西南财经大学 Conditional classification tree-based legal decision document multi-classification method and device and terminal

Similar Documents

Publication Publication Date Title
CN107577785B (en) Hierarchical multi-label classification method suitable for legal identification
Styawati et al. Sentiment analysis on online transportation reviews using Word2Vec text embedding model feature extraction and support vector machine (SVM) algorithm
Kuhkan A method to improve the accuracy of k-nearest neighbor algorithm
CN107798033B (en) Case text classification method in public security field
CN110222744A (en) A kind of Naive Bayes Classification Model improved method based on attribute weight
CN107861951A (en) Session subject identifying method in intelligent customer service
CN107220365A (en) Accurate commending system and method based on collaborative filtering and correlation rule parallel processing
CN107766418A (en) A kind of credit estimation method based on Fusion Model, electronic equipment and storage medium
CN107391772A (en) A kind of file classification method based on naive Bayesian
CN105975992A (en) Unbalanced data classification method based on adaptive upsampling
CN112100512A (en) Collaborative filtering recommendation method based on user clustering and project association analysis
CN106570076A (en) Computer text classification system
CN109344227A (en) Worksheet method, system and electronic equipment
CN110390816A (en) A kind of condition discrimination method based on multi-model fusion
CN103778206A (en) Method for providing network service resources
CN109635254A (en) Paper duplicate checking method based on naive Bayesian, decision tree and SVM mixed model
Arbel et al. Classifier evaluation under limited resources
CN112417082B (en) Scientific research achievement data disambiguation filing storage method
Řehůřek et al. Automated classification and categorization of mathematical knowledge
CN105160358B (en) A kind of image classification method and system
CN106775694A (en) A kind of hierarchy classification method of software merit rating code product
CN108268458A (en) A kind of semi-structured data sorting technique and device based on KNN algorithms
CN110309864B (en) Collaborative filtering recommendation method fusing local similarity and global similarity
CN114548104A (en) Few-sample entity identification method and model based on feature and category intervention
Zhang et al. Unbalanced data classification based on oversampling and integrated learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190416