CN113837240A

CN113837240A - Classification system and classification method for education department

Info

Publication number: CN113837240A
Application number: CN202111030674.2A
Authority: CN
Inventors: 张静鹏
Original assignee: Nanjing Insect Software Co ltd
Current assignee: Nanjing Insect Software Co ltd
Priority date: 2021-09-03
Filing date: 2021-09-03
Publication date: 2021-12-24

Abstract

A classification system and a classification method for education department disciplines, comprising: step 1: establishing an annotation data set; step 2: transcoding the annotation data set; step 3: establishing a training set and a test set for a given discipline; step 4: building a model based on a convolutional neural network; step 5: training the model; step 6: classifying by discipline. The method addresses the following shortcomings of the prior art: no highly accurate discipline classification system based on the education department disciplines is available on the market; traditional classification methods have low accuracy and are difficult to apply; and traditional article classification does not cover the education department disciplines.

Description

Classification system and classification method for education department

Technical Field

The embodiments of the invention relate to the technical field of classification, specifically subject (discipline) classification, and in particular to a classification system and a classification method for education department disciplines, namely a periodical classification system and method based on a convolutional neural network and the education department disciplines.

Background

At present, vendors have carried out many studies on discipline classification, but they rely only on conventional word frequency analysis and keyword clustering to distinguish the disciplines of periodicals, and no highly accurate discipline classification system based on the education department disciplines is available on the market.

Problem 1: the traditional classification method has low classification accuracy and high classification difficulty.

Most traditional classification methods rely on word frequency association: if a certain keyword occurs frequently in an article, the article is linked to the discipline associated with that keyword. As article content has grown richer over time, this approach no longer keeps up. For example, such a method would link the keyword "scalpel" to medicine, but an article describing the manufacturing process of a scalpel clearly has little to do with medicine. Moreover, collecting every keyword related to medicine is prohibitively difficult, so maintaining such a system consumes a large amount of manpower and material resources.

Problem 2: the traditional article classification does not relate to the education department.

Websites that host academic papers have no uniform standard for article classification; each essentially follows its own scheme. At present, no data vendor in China has carried out research on a classification system based on the education department disciplines.

Disclosure of Invention

In order to solve the above problems, embodiments of the present invention provide a classification system and a classification method for education department disciplines. They address the shortcomings of the prior art: no highly accurate discipline classification system based on the education department disciplines is available on the market, traditional classification methods have low accuracy and are difficult to apply, and traditional article classification does not cover the education department disciplines.

To overcome these shortcomings, the embodiments of the invention provide the following solution:

A classification method of a classification system for education department disciplines, comprising:

Step 1: establishing an annotation data set;

The method for establishing the annotation data set comprises: establishing the annotation data set according to the correspondence between academic papers and Chinese Library Classification numbers and the correspondence between Chinese Library Classification numbers and education department disciplines.

Step 2: transcoding the annotation data set;

The method for transcoding the annotation data set comprises: collecting all the words that appear in the acquired articles and building an English dictionary of length 601408; every article is then converted into a 1 x 200 matrix according to the English dictionary.

Step 3: establishing a training set and a test set for a given discipline;

The method for establishing a training set and a test set for a given discipline comprises: marking all articles of the discipline as positive results, extracting 80% of them as the positive result training set, and taking the remaining 20% as the positive result test set; marking all articles not of the discipline as negative results, extracting 80% of them as the negative result training set, and taking the remaining 20% as the negative result test set;

During training, 64 samples are drawn each time from each of the positive result training set and the negative result training set and used as model training samples, and 64 samples are drawn each time from each of the positive result test set and the negative result test set and used as model test samples.

Step 4: building a model based on a convolutional neural network;

Step 5: training the model;

The model training method comprises: training a total of 13 discipline category ("gate") models and 110 first-level discipline models, where the evaluation index of each model is above 90%.

Step 6: classifying by discipline;

The discipline classification method comprises: for an article to be classified under an education department first-level discipline, it must first satisfy the category model of the category to which that discipline belongs, and then satisfy the first-level discipline model;

for a journal to be classified under an education department first-level discipline, at least 60% of its articles must fall under that first-level discipline.
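The two rules of step 6 can be sketched as follows. This is only an illustrative sketch: the names `article_in_discipline` and `journal_in_discipline` are hypothetical, each model is assumed to be a callable that returns True when it accepts an article, and the keyword-based stand-in models in the usage lines are toy placeholders, not the trained convolutional models.

```python
from typing import Callable, Iterable

# Hypothetical interface: a model takes an article and returns True if it accepts it.
Model = Callable[[str], bool]


def article_in_discipline(article: str, category_model: Model, discipline_model: Model) -> bool:
    """An article must pass the category ("gate") model first, then the first-level discipline model."""
    return category_model(article) and discipline_model(article)


def journal_in_discipline(articles: Iterable[str], category_model: Model,
                          discipline_model: Model, min_share: float = 0.60) -> bool:
    """A journal belongs to a discipline when at least 60% of its articles do."""
    articles = list(articles)
    if not articles:
        return False  # assumption: an empty journal is never classified
    hits = sum(article_in_discipline(a, category_model, discipline_model) for a in articles)
    return hits / len(articles) >= min_share


# Toy stand-in models, for illustration only.
law_category_model = lambda text: "law" in text
public_security_model = lambda text: "police" in text

journal_a = ["law police 1", "law police 2", "law police 3", "law police 4", "law only"]
journal_b = ["law police 1", "law police 2", "law only", "other", "other"]
result_a = journal_in_discipline(journal_a, law_category_model, public_security_model)
result_b = journal_in_discipline(journal_b, law_category_model, public_security_model)
```

Here four of the five articles in `journal_a` pass both models, so the journal meets the 60% threshold, while `journal_b` (two of five) does not.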

A classification system for education department disciplines, comprising:

an establishing module, used for establishing an annotation data set;

a transcoding module, used for transcoding the annotation data set;

a training module, used for establishing a training set and a test set for a given discipline;

a building module, used for building a model based on a convolutional neural network;

a model module, used for model training;

and a classification module, used for discipline classification.

The establishing module is further used for establishing the annotation data set according to the correspondence between academic papers and Chinese Library Classification numbers and the correspondence between Chinese Library Classification numbers and education department disciplines.

The transcoding module is further used for collecting all the words that appear in the acquired articles and building an English dictionary of length 601408; every article is then converted into a 1 x 200 matrix according to the English dictionary.

The training module is further used for marking all articles of the discipline as positive results, extracting 80% of them as the positive result training set, and taking the remaining 20% as the positive result test set; and for marking all articles not of the discipline as negative results, extracting 80% of them as the negative result training set, and taking the remaining 20% as the negative result test set.

The classification module is further used for requiring that an article to be classified under an education department first-level discipline first satisfy the category ("gate") model of the category to which that discipline belongs, and then satisfy the first-level discipline model.

The embodiment of the invention has the beneficial effects that:

The method of the invention provides a highly accurate discipline classification system based on the education department disciplines, with which classification is easy to carry out. It addresses the shortcomings of the prior art: no highly accurate discipline classification system based on the education department disciplines is available on the market, traditional classification methods have low accuracy and are difficult to apply, and traditional article classification does not cover the education department disciplines.

Drawings

Fig. 1 is an overall flowchart of a classification method of a classification system for education department according to the present invention.

Detailed Description

The embodiments of the present invention will be further described with reference to the drawings and the embodiments.

As shown in Fig. 1, the classification method of the classification system for education department disciplines includes the following steps:

Step 1: establishing an annotation data set;

The method for establishing the annotation data set comprises: establishing the annotation data set according to the correspondence between academic papers and Chinese Library Classification numbers and the correspondence between Chinese Library Classification numbers and education department disciplines (13 discipline categories and 110 first-level disciplines). In the final result of this step, a fairly accurate discriminant data set was obtained: 10 articles (repeatable) for each of the 13 categories, and 2 articles (repeatable) for each of the 110 first-level disciplines.

For example: the Chinese Library Classification numbers corresponding to the first-level discipline Marxist philosophy are A1, A8, A84, B0; the Chinese Library Classification numbers corresponding to the category philosophy are B01, B02, B03, B08, B0.

Advantages: in this step, machine annotation replaces traditional manual annotation, which greatly reduces the manpower and material resources required; the articles obtained through the two correspondences are labeled with high accuracy.
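The chaining of the two correspondences (paper to classification number, classification number to discipline) can be sketched as below. This is a minimal sketch under stated assumptions: only the two mapping entries are taken from the example above; the function names, the paper identifiers, and the use of exact-match lookup on classification numbers (abbreviated CLC here) are hypothetical and not confirmed by the source.

```python
from typing import Dict, List, Set

# The two entries below come from the example in the text; nothing else is from the source.
DISCIPLINE_TO_CLC: Dict[str, Set[str]] = {
    "Marxist philosophy": {"A1", "A8", "A84", "B0"},      # first-level discipline
    "philosophy": {"B01", "B02", "B03", "B08", "B0"},     # discipline category
}


def invert_mapping(discipline_to_clc: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Build the reverse lookup: classification number -> set of disciplines."""
    clc_to_disciplines: Dict[str, Set[str]] = {}
    for discipline, numbers in discipline_to_clc.items():
        for number in numbers:
            clc_to_disciplines.setdefault(number, set()).add(discipline)
    return clc_to_disciplines


def label_papers(paper_to_clc: Dict[str, List[str]],
                 clc_to_disciplines: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Chain paper -> classification number -> discipline (exact match is an assumption)."""
    labels: Dict[str, Set[str]] = {}
    for paper_id, numbers in paper_to_clc.items():
        found: Set[str] = set()
        for number in numbers:
            found |= clc_to_disciplines.get(number, set())
        labels[paper_id] = found
    return labels


# Hypothetical papers with made-up identifiers, for illustration only.
papers = {"paper-001": ["B0"], "paper-002": ["A84"], "paper-003": ["TP18"]}
labels = label_papers(papers, invert_mapping(DISCIPLINE_TO_CLC))
```

A paper whose classification number appears under several disciplines (here B0) receives every matching label, and a paper with no matching number is left unlabeled.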

Step 2: transcoding the annotation data set;

For example, an excerpt of the English dictionary is shown below:

dict
minimizingweighted
municipai
as200mw
recovery51
about9years
andmangiferin
hypsochromically
pp2c21
wakening
couldlower
educationenvironment
enogenousely
betterfamily
incomplementary
acmotor
lc50were1
saionji
controllled
progresson
enhancedgreatly
ionx
bacillary
refracive
in1890
crystalsbased
energyand
forguangzhou
libertins

Part of the implementation code of the transcoding method of the annotation data set is as follows:

for example: a method for creating a software program for creating, the software code of the computer program product may include a code of code.

After conversion, the article is encoded as:

[1129794 1238442 1142221 138159 1381583 571579 1335737 617718 1326063 618069 286557 1315384 776902 1259783 90889 1165424 512814 839423 547653 1391312 237506 963132 546716 1067425 113548 354942 132381 1335737 900013 214897 1143905 964454 1315933 624879 214897 1136531 1314985 51201 445480 304242 1312112 1216493 1058571 1167438 1049619 1067425 383474 1335737 900013 214897 90889 790745 1238442 1356034 1326063 237506 306144 279336 138159 428031 299002 814090 484760 776902 1259783 90889 811154 1067425 383474 1335737 900013 214897 138159 1054269 1356034 1239053 1216493 776902 1113755 654817 912278 286557 1315384 1314985 796005 1238442 618069 1381583 237506 138159 89707 1335737 682687 218181 878963 1330000 622842 153527 571579 906748 776902 700796 90889 412721 1054269 1129940 1237833 852873 1067425 878963 586549 90889 646562 214897 1352935 1314985 618069 814090 484760 183524 811154 1067425 383474 1335737 900013 214897 214897 383474 183524 394974 181300 951935 493621 1233765 1152098 214897 1010930 1314985 714988 445480 304242 618069 814090 484760 183524 571769 618069 376024 1335737 214897 383474 98387 181300 493621 1314985 214897 1010930 831280 260478 618069 376024 1335737 214897 383474 98387 445480 304242 489797 1138128 729142 877022 275706 1211368 878963 1330000 1260399 1166217 1174398 878963 385770 4958 618069 237506 236913 637641 215509 1134332 138159 1381583 1238442 380444 776902 1259783 618069 376024 1335737 214897 383474 776902 868183]
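The transcoding step can be illustrated with the following sketch, which builds a word dictionary and turns an article into a fixed-length row of 200 integer indices. It is not the original implementation: whitespace tokenization, lowercasing, reserving index 0 for padding and unknown words, and the names `build_dictionary` and `encode` are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List

SEQ_LENGTH = 200  # from the text: each article becomes a 1 x 200 matrix
PAD_ID = 0        # assumption: index 0 is reserved for padding and unknown words


def build_dictionary(articles: Iterable[str]) -> Dict[str, int]:
    """Assign an integer index to every distinct word, in order of first appearance."""
    word_to_id: Dict[str, int] = {}
    for article in articles:
        for word in article.lower().split():
            if word not in word_to_id:
                word_to_id[word] = len(word_to_id) + 1  # index 0 is kept for PAD_ID
    return word_to_id


def encode(article: str, word_to_id: Dict[str, int], seq_length: int = SEQ_LENGTH) -> List[int]:
    """Map words to indices, then truncate or pad to exactly seq_length entries."""
    ids = [word_to_id.get(word, PAD_ID) for word in article.lower().split()]
    ids = ids[:seq_length]                            # truncate long articles
    return ids + [PAD_ID] * (seq_length - len(ids))   # pad short ones


corpus = ["Convolutional networks classify text", "Text classification of journals"]
vocabulary = build_dictionary(corpus)
encoded = encode(corpus[1], vocabulary)
```

With this toy corpus the second article starts with the indices of its four words and is padded with zeros up to length 200.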

Step 3: establishing a training set and a test set for a given discipline;

For example:

Three abstracts, A, B, and C, belong to Marxism.

Three abstracts, D, E, and F, belong to philosophy.

Three abstracts, H, I, and J, belong to law.

For Marxism, the positive results are A, B, C and the negative results are D, E, F, H, I, J.

Part of the code for establishing the training set and the test set of a given discipline is as follows:

import random
import numpy as np

# train_X_true, train_Y_true, etc. are NumPy arrays prepared in the earlier steps
# Training batch: 64 samples from each of the positive result training set and the negative result training set
selected_index = random.sample(list(range(len(train_Y_true))), k=64)
batch_X_1 = train_X_true[selected_index]
batch_Y_1 = train_Y_true[selected_index]
selected_index = random.sample(list(range(len(train_Y_false))), k=64)
batch_X_2 = train_X_false[selected_index]
batch_Y_2 = train_Y_false[selected_index]
batch_X = np.vstack((batch_X_2, batch_X_1))
batch_Y = np.vstack((batch_Y_2, batch_Y_1))

# Test batch: 64 samples from each of the positive result test set and the negative result test set
selected_index = random.sample(list(range(len(test_Y_true))), k=64)
batch_X_1 = test_X_true[selected_index]
batch_Y_1 = test_Y_true[selected_index]
selected_index = random.sample(list(range(len(test_Y_false))), k=64)
batch_X_2 = test_X_false[selected_index]
batch_Y_2 = test_Y_false[selected_index]
test_X = np.vstack((batch_X_2, batch_X_1))
test_Y = np.vstack((batch_Y_2, batch_Y_1))

Advantages: each discipline has its own model, rather than one model serving several disciplines, which optimizes the classification accuracy for each discipline; positive and negative results are drawn in equal proportion, which prevents the accuracy figure from losing practical meaning when the proportion of negative results is too small.
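The 80%/20% split and the balanced draw of 64 positive and 64 negative samples can be sketched in plain Python as follows. This is a simplified illustration, not the original code: the helper names `split_80_20` and `balanced_batch`, the shuffling before the split, the fixed seeds, and the integer stand-ins for articles are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_80_20(items: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Shuffle, then keep 80% for training and the remaining 20% for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def balanced_batch(positives: List, negatives: List, k: int = 64, seed=None) -> List[Tuple[object, int]]:
    """Draw k positive and k negative samples so both classes are equally represented."""
    rng = random.Random(seed)
    batch = [(x, 1) for x in rng.sample(positives, k)]
    batch += [(x, 0) for x in rng.sample(negatives, k)]
    return batch


# Integer stand-ins for articles: 100 of the discipline, 900 of all other disciplines.
positive_articles = list(range(100))
negative_articles = list(range(100, 1000))
pos_train, pos_test = split_80_20(positive_articles)
neg_train, neg_test = split_80_20(negative_articles)
batch = balanced_batch(pos_train, neg_train, k=64, seed=1)
```

Even though negatives outnumber positives nine to one in this toy data, every batch contains the two classes in equal numbers, which is the balance described in the paragraph above.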

Step 4: building a model based on a convolutional neural network;

Part of the code for building the model based on a convolutional neural network is as follows:

# Import the related libraries
import tensorflow.compat.v1 as tf
tf.reset_default_graph()
tf.disable_v2_behavior()
from tensorflow import keras as kr
from sklearn import metrics

# Hyperparameters (seq_length, num_classes, embedding_dim, num_filters, kernel_size,
# hidden_dim, dropout_keep_prob, learning_rate) are not shown in this excerpt
# Create placeholders for the inputs x and the labels y
X_holder = tf.placeholder(tf.int32, [None, seq_length])
Y_holder = tf.placeholder(tf.float32, [None, num_classes])
# Look up the word vectors to turn each encoded article into a sequence of embeddings
embedding = tf.get_variable('embedding', [601408, embedding_dim])
embedding_inputs = tf.nn.embedding_lookup(embedding, X_holder)
# tf.layers.conv1d: one-dimensional convolution
conv = tf.layers.conv1d(embedding_inputs, num_filters, kernel_size)
# Max pooling over the sequence axis
max_pooling = tf.reduce_max(conv, reduction_indices=[1])
# Fully connected layer
full_connect = tf.layers.dense(max_pooling, hidden_dim)
# Dropout: randomly discard part of the activations
full_connect_dropout = tf.nn.dropout(full_connect, keep_prob=dropout_keep_prob)
# Activation function
full_connect_activate = tf.nn.relu(full_connect_dropout)
# Fully connected output layer
softmax_before = tf.layers.dense(full_connect_activate, num_classes)
predict_Y = tf.nn.softmax(softmax_before)
# Loss and optimizer
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=Y_holder, logits=softmax_before)
loss = tf.reduce_mean(cross_entropy)
optimizer = tf.train.AdamOptimizer(learning_rate)
# Training operation
train = optimizer.minimize(loss)
# Output results
true_result = tf.argmax(Y_holder, 1)
predict_result = tf.argmax(predict_Y, 1)
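To make the data flow through the network above easier to follow, the sketch below traces the tensor shape after each layer without running TensorFlow. It assumes the defaults of `tf.layers.conv1d` (stride 1, 'valid' padding); only seq_length = 200 comes from the text, and the other hyperparameter values in the usage lines are hypothetical placeholders, since the source does not state them.

```python
from typing import List, Optional, Tuple

Shape = Tuple[Optional[int], ...]  # None stands for the batch dimension


def layer_shapes(seq_length: int, embedding_dim: int, num_filters: int,
                 kernel_size: int, hidden_dim: int, num_classes: int) -> List[Tuple[str, Shape]]:
    """Return (tensor name, shape) pairs in the order the layers are applied."""
    # With 'valid' padding and stride 1, the convolution shortens the sequence.
    conv_steps = seq_length - kernel_size + 1
    return [
        ("X_holder", (None, seq_length)),
        ("embedding_inputs", (None, seq_length, embedding_dim)),
        ("conv", (None, conv_steps, num_filters)),
        ("max_pooling", (None, num_filters)),      # max over the sequence axis
        ("full_connect", (None, hidden_dim)),
        ("softmax_before", (None, num_classes)),
    ]


# seq_length=200 is from the text; the remaining values are assumed for illustration.
shapes = dict(layer_shapes(seq_length=200, embedding_dim=64, num_filters=256,
                           kernel_size=5, hidden_dim=128, num_classes=2))
```

The max pooling step collapses the whole sequence axis, so the fully connected layers see one value per filter regardless of article length.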

Step 5: training the model;
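The source states only that the evaluation index of each trained model exceeds 90% and does not name the index. The sketch below therefore assumes accuracy, precision, and recall computed from true and predicted labels (such as the values of true_result and predict_result on a test batch); the function names and the sample labels are hypothetical.

```python
from typing import Dict, Sequence


def binary_metrics(true_labels: Sequence[int], predicted_labels: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision and recall for labels encoded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def passes_threshold(metrics: Dict[str, float], threshold: float = 0.90) -> bool:
    """Assumed acceptance rule: every metric must be above the threshold."""
    return all(value > threshold for value in metrics.values())


# Made-up labels: 10 positives and 10 negatives, with one negative misclassified.
true_labels = [1] * 10 + [0] * 10
predicted_labels = [1] * 10 + [0] * 9 + [1]
metrics = binary_metrics(true_labels, predicted_labels)
accepted = passes_threshold(metrics)
```

In this made-up batch, 19 of 20 predictions are correct and the single false positive lowers precision to 10/11, so all three metrics remain above 90%.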

Step 6: classifying subjects;

For example: the first-level discipline of public security falls under the category of law. If an article is accepted by both the law category model and the public security first-level discipline model, the article is considered to belong to that first-level discipline.

A classification system for educational departments, comprising:

an establishing module, used for establishing an annotation data set;

a transcoding module, used for transcoding the annotation data set;

a model module, used for model training;

and a classification module, used for discipline classification.

The establishing module is further used for establishing the annotation data set according to the correspondence between academic papers and Chinese Library Classification numbers and the correspondence between Chinese Library Classification numbers and education department disciplines (13 discipline categories and 110 first-level disciplines). In the final result of this step, a fairly accurate discriminant data set was obtained: 10 articles (repeatable) for each of the 13 categories, and 2 articles (repeatable) for each of the 110 first-level disciplines.

The range of journals covered by Scival is far smaller than that covered by Insect. In terms of disciplines, Scival distinguishes only 97 first-level disciplines, whereas the invention distinguishes all 110.

Taking foreign language and literature as an example, Scival covers 151 periodicals in total, far fewer than the number that actually exist. The embodiment of the invention covers 2651 periodicals in total, which can be said to include most foreign language and literature periodicals. Of the 151 journals covered by Scival, only 2 are not recognized by the embodiment of the invention as foreign language and literature; on verification, these 2 journals indeed should not be classified under foreign language and literature. From the 2502 periodicals covered by the embodiment of the invention but not by Scival, 100 were randomly selected and judged manually, and all 100 were confirmed to belong to foreign language and literature, a result that Scival cannot achieve.
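The figures in this comparison are consistent with one another, as the short calculation below shows. The variable names are illustrative, and the calculation assumes that the 149 Scival journals recognized by the embodiment are all among the 2651 journals it covers.

```python
ours_total = 2651      # journals covered by the embodiment (from the text)
scival_total = 151     # journals covered by Scival (from the text)
scival_rejected = 2    # Scival journals not recognized by the embodiment (from the text)

overlap = scival_total - scival_rejected   # journals recognized by both
only_ours = ours_total - overlap           # covered by the embodiment but not by Scival
agreement = overlap / scival_total         # share of Scival's list that the embodiment confirms

print(overlap, only_ours, round(agreement, 4))
```

The result for `only_ours` matches the 2502 periodicals from which the 100 manually checked samples were drawn.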

While the embodiments of the present invention have been described above in terms of procedures illustrated by the embodiments, it will be understood by those skilled in the art that the present disclosure is not limited to the embodiments described above, and that various changes, modifications, and substitutions can be made without departing from the scope of the embodiments of the present invention.

Claims

1. A classification method of a classification system for education department disciplines, comprising:

step 1: establishing an annotation data set;

step 2: transcoding the annotation data set;

step 3: establishing a training set and a test set for a given discipline;

step 4: building a model based on a convolutional neural network;

step 5: training the model;

step 6: classifying by discipline.

2. The classification method of a classification system for education department disciplines according to claim 1, wherein the method of establishing the annotation data set comprises: establishing the annotation data set according to the correspondence between academic papers and Chinese Library Classification numbers and the correspondence between Chinese Library Classification numbers and education department disciplines.

3. The classification method of a classification system for education department disciplines according to claim 1, wherein the method of transcoding the annotation data set comprises: collecting all the words that appear in the acquired articles and building an English dictionary of length 601408; every article is then converted into a 1 x 200 matrix according to the English dictionary.

4. The classification method of a classification system for education department disciplines according to claim 1, wherein the method of establishing a training set and a test set for a given discipline comprises: marking all articles of the discipline as positive results, extracting 80% of them as the positive result training set, and taking the remaining 20% as the positive result test set; marking all articles not of the discipline as negative results, extracting 80% of them as the negative result training set, and taking the remaining 20% as the negative result test set.

5. The classification method of a classification system for education department disciplines according to claim 1, wherein the method of model training comprises: training a total of 13 discipline category ("gate") models and 110 first-level discipline models, where the evaluation index of each model is above 90%.

6. The classification method of a classification system for education sections according to claim 1,

7. A classification system for education department disciplines, comprising:

an establishing module, used for establishing an annotation data set;

a transcoding module, used for transcoding the annotation data set;

a model module, used for model training;

and a classification module, used for discipline classification.

8. The system according to claim 7, wherein the establishing module is further configured to establish the annotation data set according to the correspondence between academic papers and Chinese Library Classification numbers and the correspondence between Chinese Library Classification numbers and education department disciplines.

9. The system according to claim 7, wherein the transcoding module is further configured to collect all the words that appear in the acquired articles and build an English dictionary of length 601408; every article is then converted into a 1 x 200 matrix according to the English dictionary.

10. The system according to claim 7, wherein the training module is further configured to mark all articles of the discipline as positive results, extract 80% of them as the positive result training set, and take the remaining 20% as the positive result test set; and to mark all articles not of the discipline as negative results, extract 80% of them as the negative result training set, and take the remaining 20% as the negative result test set;

during training, 64 samples are drawn each time from each of the positive result training set and the negative result training set for model training, and 64 samples are drawn each time from each of the positive result test set and the negative result test set for model testing.