CN109284382A - A kind of file classification method and computing device - Google Patents
- Publication number: CN109284382A
- Application number: CN201811158905.6A
- Authority: CN (China)
- Prior art keywords: text information, feature, color value, game, areas
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present application disclose a text classification method and a computing device, which address the problems of sample imbalance between classes and of feature screening, and can markedly improve the text classification effect of a model. The method of the embodiments comprises: acquiring text information of N color value areas and text information of M game partitions in a current scene, where N and M are integers greater than 0 and the absolute value of the difference between N and M is less than a preset threshold; selecting A pieces of text information from the text information of the N color value areas and the text information of the M game partitions; selecting at least two features from the first feature, the second feature and the third feature as candidate features; and, according to the candidate features and a feature selection formula, selecting the feature with the largest information gain to split the nodes of a decision tree, thereby generating a random forest model.
Description
Technical Field
The present application relates to the field of big data, and in particular, to a text classification method and a computing device.
Background
In machine learning, a random forest is a classifier that contains multiple decision trees, and its output class is the mode of the classes output by the individual trees. A random forest is in effect a special bagging method that uses decision trees as the base model. First, m training sets are generated by the bootstrap method, and a decision tree is built for each training set. When searching for a feature on which to split a node, the method does not search all features for the one that maximizes an index (such as information gain); instead, it randomly draws a subset of the features, finds the optimal one within that subset, and uses it to split the node. Because of bagging, that is, the ensemble idea, the random forest method effectively samples both the samples and the features.
However, two common problems arise when text classification is performed with a random forest algorithm: 1. sample imbalance between classes biases the classification result toward the class with more samples; 2. the choice of features determines the execution speed and the final effect of the algorithm.
Disclosure of Invention
The embodiment of the application provides a text classification method and a calculation device, which are used for solving the problems of sample imbalance among different classes and feature screening, and can remarkably improve the text classification effect of a model.
In view of the above, a first aspect of the present application provides a text classification method, which may include:
acquiring text information of N color value areas and text information of M game areas in a current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value;
selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in a sentence, and the third feature comprises a maximum word frequency value of words in the sentence;
selecting at least two features from the first feature, the second feature and the third feature as candidate features;
and selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model.
Optionally, in some embodiments of the present application, before obtaining the text information of the N color value regions and the text information of the M game partitions in the current scene, the method further includes:
acquiring original text information of X1 color value areas;
when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas;
calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Optionally, in some embodiments of the present application, the calculating new text information of X3 color value regions according to the text information of X2 color value regions and a sample sampling formula includes:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas;
and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Optionally, in some embodiments of the present application, before obtaining the text information of the N color value regions and the text information of the M game partitions in the current scene, the method further includes:
acquiring original text information of Y1 game partitions;
when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions;
calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Optionally, in some embodiments of the present application, the calculating new text information of Y3 game partitions according to the text information of Y2 game partitions and a sample sampling formula includes:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions;
and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
Optionally, in some embodiments of the present application, the feature selection formula is:
where G(A) denotes the information gain of attribute A, Split(A) denotes the split information of attribute A, T(F) denotes the degree of association between attribute A and the attributes other than A, F denotes the set of attributes other than A, and the adjustment coefficient takes a value in (0, 1).
Optionally, in some embodiments of the present application, the sample sampling formula is:
$$s_i = x_i + \tau \cdot \max(0.1,\ |x_{ij} - x_i|),$$
where $s_i$ denotes the i-th new sample, $x_i$ denotes any sample of the minority class, $x_{ij}$ denotes the j-th neighboring sample of $x_i$, $0 \le j \le N$, N denotes the number of randomly selected samples, and the adjustment coefficient $\tau$ takes a value in (0, 1).
A second aspect of the present application provides a computing device, which may include:
the first acquisition module is used for acquiring text information of N color value areas and text information of M game subareas in a current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value;
a first selection module, configured to select A pieces of text information from the text information of the N color value regions and the text information of the M game partitions, where each piece of text information in the A pieces of text information includes a first feature, a second feature, and a third feature, the first feature includes a sentence length, the second feature includes a maximum inverse text frequency index value of a word in a sentence, and the third feature includes a maximum word frequency value of a word in a sentence;
a second selection module for selecting at least two features from the first feature, the second feature and the third feature as candidate features;
and the generation module is used for selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula so as to generate a random forest model.
Optionally, in some embodiments of the present application, the computing apparatus may further include:
the second acquisition module is used for acquiring the original text information of the X1 color value areas;
a third selecting module, configured to select text information of X2 color value regions from the original text information of the X1 color value regions when an absolute value of a difference between X1 and M is greater than the preset threshold;
the calculation module is used for calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
a determining module, configured to determine that a sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Optionally, in some embodiments of the present application,
the calculation module is specifically configured to determine, according to the text information and euclidean distances of the X2 color value regions, neighboring text information of X3 color value regions; and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Optionally, in some embodiments of the present application,
the second acquisition module is used for acquiring the original text information of Y1 game partitions;
a third selecting module, configured to select text information of Y2 game partitions from the original text information of the Y1 game partitions when an absolute value of a difference between Y1 and M is greater than the preset threshold;
the calculation module is used for calculating new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
a determining module, configured to determine that a sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Optionally, in some embodiments of the present application,
the calculation module is specifically configured to determine neighboring text information of Y3 game partitions according to the text information and euclidean distances of the Y2 game partitions; and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
Optionally, in some embodiments of the present application, the feature selection formula is:
where G(A) denotes the information gain of attribute A, Split(A) denotes the split information of attribute A, T(F) denotes the degree of association between attribute A and the attributes other than A, F denotes the set of attributes other than A, and the adjustment coefficient takes a value in (0, 1).
Optionally, in some embodiments of the present application, the sample sampling formula is:
$$s_i = x_i + \tau \cdot \max(0.1,\ |x_{ij} - x_i|),$$
where $s_i$ denotes the i-th new sample, $x_i$ denotes any sample of the minority class, $x_{ij}$ denotes the j-th neighboring sample of $x_i$, $0 \le j \le N$, N denotes the number of randomly selected samples, and the adjustment coefficient $\tau$ takes a value in (0, 1).
In a third aspect, an embodiment of the present invention provides a computing apparatus, including a memory, and a processor, where the processor is configured to implement the steps of the text classification method described in the foregoing first aspect when executing a computer program stored in the memory.
In a fourth aspect, an embodiment of the present invention provides a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the text classification method as described in the foregoing first aspect embodiment.
According to the technical scheme, the embodiment of the application has the following advantages:
in the embodiment of the application, the text information of N color value areas and the text information of M game subareas in the current scene are obtained, wherein N and M are integers larger than 0, and the absolute value of the difference value between N and M is smaller than a preset threshold value; selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in a sentence, and the third feature comprises a maximum word frequency value of words in the sentence; selecting at least two features from the first feature, the second feature and the third feature as candidate features; and selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model. The method is used for solving the problem of sample imbalance among different classes and the problem of feature screening, and can remarkably improve the text classification effect of the model.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments and of the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and other drawings can be obtained from them.
FIG. 1 is a schematic diagram of an embodiment of a text classification method in an embodiment of the present application;
FIG. 2 is a schematic diagram of an embodiment of a computing device in an embodiment of the present application;
FIG. 3 is a schematic diagram of another embodiment of a computing device in an embodiment of the present application;
FIG. 4 is a schematic diagram of another embodiment of a computing device in an embodiment of the present application;
FIG. 5 is a schematic diagram of another embodiment of a computer-readable storage medium in an embodiment of the present application.
Detailed Description
The embodiment of the application provides a text classification method and a calculation device, which are used for solving the problems of sample imbalance among different classes and feature screening, and can remarkably improve the text classification effect of a model.
For a person skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. The embodiments in the present application shall fall within the protection scope of the present application.
In the following, a brief description of the terms referred to in the present application is given, as follows:
Random Forest (RF) algorithm: in machine learning, a random forest is a classifier that contains multiple decision trees, and its output class is the mode of the classes output by the individual trees.
Each tree is built according to the following algorithm:
(1) Let N denote the number of training cases (samples) and M the number of features.
(2) Input the number of features m used to determine the decision at a node of the decision tree, where m should be much smaller than M.
(3) Sample N times with replacement from the N training cases (samples) to form a training set (i.e. bootstrap sampling), and use the cases (samples) that were not drawn to make predictions and estimate the error.
(4) For each node, randomly select m features; the decision at each node of the decision tree is determined from these features, and the optimal split is computed from these m features.
(5) Each tree grows fully without pruning (pruning may be employed after a normal tree classifier is built).
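The tree-building steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the helper names `bootstrap_indices`, `random_feature_subset` and `majority_vote` are hypothetical, and the fixed seed is used only so that the run is repeatable. The sketch covers step (3) (sampling with replacement and keeping the undrawn cases for error estimation), the random choice of m features in step (4), and the mode-based vote of the forest.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    # Cases never drawn are kept aside to estimate the error of this tree.
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag

def random_feature_subset(n_features, m, rng):
    """Pick m distinct feature indices out of n_features (m <= n_features)."""
    return sorted(rng.sample(range(n_features), m))

def majority_vote(predictions):
    """Return the mode of the per-tree class predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # hypothetical seed, for repeatability only
in_bag, oob = bootstrap_indices(10, rng)
features = random_feature_subset(n_features=3, m=2, rng=rng)
label = majority_vote(["game", "color value", "game"])
```

Each tree would be trained on the `in_bag` cases using only the features in `features` at a given node, and the forest would label a new text with `majority_vote` over the trees.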
When a text classification task based on a random forest algorithm is performed, 2 common problems exist: 1. the imbalance of samples among the classes can lead the classification result to be biased to the class with more samples; 2. the selection of the features determines the execution speed and the final effect of the algorithm.
The present application therefore addresses these two problems. The technical solution of the present application is further described below by way of an embodiment. FIG. 1 is a schematic diagram of an embodiment of the text classification method in the embodiments of the present application, which may include:
101. acquiring text information of N color value areas and text information of M game areas in the current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value.
In this embodiment of the application, before the obtaining of the text information of the N color value regions and the text information of the M game partitions in the current scene, the method may further include:
(1) acquiring original text information of X1 color value areas; when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas; calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula; determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Or,
(2) acquiring original text information of Y1 game partitions; when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions; calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula; determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Optionally, the calculating to obtain new text information of X3 color value regions according to the text information of the X2 color value regions and the sample sampling formula may include:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas; and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Optionally, the calculating new text information of Y3 game partitions according to the text information of Y2 game partitions and the sample sampling formula may include:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions; and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
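The balancing check described above can be sketched as a small bookkeeping function. This is only an illustration under assumptions: the function name `plan_oversampling`, its parameter names and the threshold of 50,000 are hypothetical, and the sketch only counts samples; it does not generate them.

```python
def plan_oversampling(minority_count, majority_count, threshold,
                      seed_count, neighbors_per_seed):
    """Return (needs_balancing, new_count, balanced_total) for the smaller class."""
    needs_balancing = abs(minority_count - majority_count) > threshold
    if not needs_balancing:
        return False, 0, minority_count
    # Each selected seed sample yields one new sample per neighbor.
    new_count = seed_count * neighbors_per_seed
    return True, new_count, minority_count + new_count

# Illustrative figures: 20,000 texts in the smaller class, 100,000 in the larger one.
needed, new_count, total = plan_oversampling(
    minority_count=20000, majority_count=100000,
    threshold=50000, seed_count=10000, neighbors_per_seed=5)
```

With these figures the gap of 80,000 exceeds the assumed threshold, 50,000 new samples are planned, and the smaller class grows to 70,000, which leaves a gap of 30,000 below the assumed threshold.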
For example, the computing device may extract text information (which may also be referred to as corpus) of the color value area and of the game partition from the bullet-comment library of the live-streaming page, for example: color value area: 100,000 entries; game partition: 20,000 entries.
First, all corpus entries are segmented into words with a word segmentation tool, stop words are filtered out, and the result is mapped to a 4-dimensional word2vec space. Then, to supplement the text information of the game partition, 10,000 entries are randomly taken as original samples. For each original sample, its 5 neighboring samples in the TFIDF vector space are found using the Euclidean distance. The 5 neighboring samples are then transformed with the sample sampling formula to generate 5 new samples.
Wherein the sample sampling formula is:
$$s_i = x_i + \tau \cdot \max(0.1,\ |x_{ij} - x_i|) \quad \text{(formula one)},$$
where $s_i$ denotes the i-th new sample, $x_i$ denotes any sample of the minority class, $x_{ij}$ denotes the j-th neighboring sample of $x_i$, $0 \le j \le N$, N denotes the number of randomly selected samples, and the adjustment coefficient $\tau$ takes a value in (0, 1). Note that formula one is the sampling formula for the class with fewer samples; its purpose is to add N samples so that the samples of the classes are balanced.
Assume s1: "I like watching Miss" = [0.212, 0.356, 0.254, 0.684]; the 5 neighboring samples of s1 can then be found:
s11 = [0.102, 0.254, 0.102, 0.631], …, s15;
then a new sample is generated from s11 using formula one:
s' = s1 + 0.6 * |s11 - s1|
= [0.212, 0.356, 0.254, 0.684] + 0.6 * [0.110, 0.102, 0.152, 0.053]
= [0.278, 0.4172, 0.3452, 0.7158]
The new sample s' is then mapped to new text through word2vec: "Miss looks nice."
Similarly, the computing device can obtain the new samples for the other neighboring samples, so that the number of samples of the game partition is expanded to 70,000, i.e. 50,000 new samples plus the 20,000 original samples.
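The neighbor search and the sampling formula can be sketched in plain Python as follows. This is a hedged illustration, not the implementation of the application: the function names are hypothetical, and applying the `max` of formula one to each vector component separately is an assumption, since the text does not say how the maximum is taken for vectors. The parameter `floor` stands for the constant 0.1; setting it to 0 gives the plain step s1 + 0.6 * |s11 - s1| used in the worked example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def nearest_neighbors(samples, index, k):
    """Indices of the k samples closest to samples[index], excluding itself."""
    others = [i for i in range(len(samples)) if i != index]
    others.sort(key=lambda i: euclidean(samples[index], samples[i]))
    return others[:k]

def synthesize(x_i, x_ij, tau, floor=0.1):
    """s_i = x_i + tau * max(floor, |x_ij - x_i|), per component (assumption)."""
    return [p + tau * max(floor, abs(q - p)) for p, q in zip(x_i, x_ij)]

s1 = [0.212, 0.356, 0.254, 0.684]
s11 = [0.102, 0.254, 0.102, 0.631]
plain = synthesize(s1, s11, tau=0.6, floor=0.0)  # no floor, as in the worked example
floored = synthesize(s1, s11, tau=0.6)           # with the 0.1 floor of formula one
```

Up to floating-point rounding, `plain` equals [0.278, 0.4172, 0.3452, 0.7158]. With the floor applied per component, only the last component changes (0.744 instead of 0.7158), because 0.053 is the only difference below 0.1.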
102. Selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in the sentence, and the third feature comprises a maximum word frequency value of words in the sentence.
Illustratively, for each decision tree in the random forest, 20,000 corpus entries are drawn at random with replacement from the whole corpus as the training set of that decision tree.
Each sample has 3 features:
Feature A: whether the sentence length is greater than 5;
Feature B: whether the maximum inverse text frequency index (Inverse Document Frequency, IDF) value of the words in the sentence is greater than 200;
Feature C: whether the maximum term frequency (Term Frequency, TF) value of the words in the sentence is greater than 30.
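The three threshold features listed above could be computed as in the following sketch. It rests on assumptions that the text does not settle: sentence length is counted in tokens, the IDF and TF values come from precomputed lookup tables (the dictionaries `idf` and `tf` below are hypothetical toy data), and a word missing from a table counts as 0.

```python
def extract_features(tokens, idf, tf, len_limit=5, idf_limit=200, tf_limit=30):
    """Return the three boolean features for one tokenized sentence."""
    max_idf = max((idf.get(w, 0.0) for w in tokens), default=0.0)
    max_tf = max((tf.get(w, 0) for w in tokens), default=0)
    return (
        len(tokens) > len_limit,  # sentence length greater than 5
        max_idf > idf_limit,      # maximum IDF value greater than 200
        max_tf > tf_limit,        # maximum TF value greater than 30
    )

# Hypothetical toy lookup tables, for illustration only.
idf = {"streamer": 250.0, "like": 3.0}
tf = {"like": 12, "streamer": 2}
features = extract_features(
    ["i", "like", "to", "watch", "this", "streamer"], idf, tf)
```

Here the sentence has 6 tokens, its largest IDF value is 250.0 and its largest TF value is 12, so `features` is `(True, True, False)`.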
103. Selecting at least two features from the first feature, the second feature, and the third feature as candidate features.
Illustratively, t features (t < 3) are selected as the candidate features of the decision tree.
104. And selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model.
Wherein the feature selection formula is:
where G(A) denotes the information gain of attribute A, Split(A) denotes the split information of attribute A, T(F) denotes the degree of association between attribute A and the attributes other than A, F denotes the set of attributes other than A, and the adjustment coefficient takes a value in (0, 1). Formula two is the criterion by which the decision tree selects a feature as a node: the feature with the largest information gain ratio becomes the node of the current round.
Illustratively, the computing device selects the feature with the largest information gain from the candidate features by using a formula II to split the nodes of the decision tree, and the t value is kept unchanged in the growth process of the random forest.
Let t = 2 and let the features selected first be A and B. The total number of samples is N = 20000, of which the game partition has Ng = 8000 and the color value area has Nf = 12000:
A+ = 9000, of which 6000 belong to the color value area and 3000 belong to the game partition;
A- = 11000, of which 6000 belong to the color value area and 5000 belong to the game partition;
B+ = 5000, of which 3000 belong to the color value area and 2000 belong to the game partition;
B- = 15000, of which 9000 belong to the color value area and 6000 belong to the game partition.
The computing device can therefore find the information gain according to formula three.
Information gain G(A): $G(A) = E(S) - E(S \mid A)$ (formula three),
where E(S) denotes the entropy of the set S (see the entropy formula of decision trees), and E(S|A) denotes the entropy after partitioning by feature A (see the conditional entropy formula of decision trees). Formula three is the information gain formula used in random forests, and supplements the description of formula two.
Illustratively, by formula three: G(A) = E(S) - E(S|A);
therefore, the information gain is G(A) = 0.292 - 0.286 = 0.006.
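Formula three can be evaluated directly from the sample counts stated above, as in the following sketch. The function names are hypothetical, and the base-10 logarithm is an assumption: it is chosen because it gives E(S) ≈ 0.292 for the 12000 / 8000 class split, while the text does not name the base. Values computed this way from the stated counts need not coincide with the other rounded figures quoted in the worked example.

```python
import math

def entropy(counts, base=10):
    """Entropy of a class-count vector; empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts if c > 0)

def conditional_entropy(partitions, base=10):
    """Size-weighted average entropy of the partitions induced by a feature."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p, base) for p in partitions)

def information_gain(class_counts, partitions, base=10):
    """G(A) = E(S) - E(S|A)."""
    return entropy(class_counts, base) - conditional_entropy(partitions, base)

classes = [12000, 8000]                 # color value area, game partition
split_a = [[6000, 3000], [6000, 5000]]  # A+ and A- as (color value, game)
split_b = [[3000, 2000], [9000, 6000]]  # B+ and B- as (color value, game)
e_s = entropy(classes)
g_a = information_gain(classes, split_a)
g_b = information_gain(classes, split_b)
```

Both partitions of feature B keep the same 3:2 class proportion as the full set, so `g_b` is zero up to rounding under this plain definition, while `g_a` is small but positive.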
The computing device can find the split information according to formula four.
Split information Split(A):
where n is the total number of samples partitioned by feature A, and $a_j$ is the number of samples in the j-th category when partitioned by feature A. Formula four is the split information formula used in random forests, and supplements the description of formula two.
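Formula four itself is not reproduced in this text. As a point of reference only, the sketch below uses the standard split information of C4.5-style decision trees, Split(A) = -Σ_j (a_j / n) · log(a_j / n), which matches the symbols n and a_j defined above; treating it as the formula of the application is an assumption, as are the base-10 logarithm and the function names. The helper `gain_ratio` is the plain C4.5 gain ratio and leaves out the association degree T(F) and the adjustment coefficient of formula two.

```python
import math

def split_information(partition_sizes, base=10):
    """Standard split information: -sum((a_j / n) * log(a_j / n))."""
    n = sum(partition_sizes)
    return -sum((a / n) * math.log(a / n, base) for a in partition_sizes if a > 0)

def gain_ratio(gain, split_info):
    """Plain C4.5 gain ratio; returns 0.0 when the split information is zero."""
    return gain / split_info if split_info > 0 else 0.0

split_a = split_information([9000, 11000])   # sizes of A+ and A-
balanced = split_information([10, 10], base=2)
```

A perfectly even two-way split gives `balanced` equal to 1 in base 2, which is the largest value the split information can take for two partitions.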
Illustratively, by formula four, the split information Split(A):
The computing device can then find the value of the attribute association degree according to formula five.
Attribute association degree formula T(F):
where n is the total number of attributes other than attribute A, and $H(F_i)$ denotes the entropy of the i-th attribute. Formula five expresses the degree of association between attributes: the smaller the degree of association between attribute A and the other attributes, the larger the information gain ratio of attribute A.
Illustratively, since H(A) = E(S|A) and H(B) = E(S|B) = 0.203,
then it is calculated according to equation two:
Gen(B)=0.107。
Similarly, Gen(B) = 0.107 is obtained, so Gen(B) > Gen(A), and feature B should be chosen to split the node in this decision tree.
It should be noted that steps 102 to 104 are executed in a loop so that each decision tree in the random forest is split to the maximum extent, without pruning, and the random forest model is finally generated.
In the embodiment of the application, the text information of N color value areas and the text information of M game subareas in the current scene are obtained, wherein N and M are integers larger than 0, and the absolute value of the difference value between N and M is smaller than a preset threshold value; selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in a sentence, and the third feature comprises a maximum word frequency value of words in the sentence; selecting at least two features from the first feature, the second feature and the third feature as candidate features; and selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model. The method is used for solving the problem of sample imbalance among different classes and the problem of feature screening, and can remarkably improve the text classification effect of the model.
FIG. 2 is a schematic diagram of an embodiment of a computing device in an embodiment of the present application. As shown in FIG. 2, the computing device may include:
a first obtaining module 201, configured to obtain text information of N color value regions and text information of M game partitions in a current scene, where N and M are integers greater than 0, and an absolute value of a difference between N and M is smaller than a preset threshold;
a first selection module 202, configured to select A pieces of text information from the text information of the N color value regions and the text information of the M game partitions, where each of the A pieces of text information includes a first feature, a second feature, and a third feature, the first feature includes a sentence length, the second feature includes the maximum inverse document frequency (IDF) value of the words in the sentence, and the third feature includes the maximum term frequency (TF) value of the words in the sentence;
a second selection module 203, configured to select at least two features from the first feature, the second feature, and the third feature as candidate features;
and the generating module 204 is configured to select a feature with the largest information gain to split nodes of the decision tree according to the candidate feature and the feature selection formula, and generate a random forest model.
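The interplay of the second selection module and the generating module can be sketched as below. It is only an illustration: the feature names, the helper functions and the gain values are hypothetical, and the gain lookup stands in for whatever value the feature selection formula would yield at a given node.

```python
import random

FEATURES = ["sentence_length", "max_idf", "max_tf"]

def pick_candidates(rng, features=FEATURES, minimum=2):
    """Randomly choose at least `minimum` of the available features."""
    k = rng.randint(minimum, len(features))
    return rng.sample(features, k)

def best_split_feature(candidates, gain):
    """Return the candidate whose gain (a callable) is largest."""
    return max(candidates, key=gain)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
candidates = pick_candidates(rng)

# Hypothetical gain values for one node; real values would come from the data.
hypothetical_gain = {"sentence_length": 0.05, "max_idf": 0.21, "max_tf": 0.11}
chosen = best_split_feature(candidates, hypothetical_gain.get)
```

Repeating this per node and per tree gives each tree its own random view of the features, which is the usual source of diversity in a random forest.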
Optionally, in some embodiments of the present application, as shown in FIG. 3, which is a schematic diagram of an embodiment of a computing device in an embodiment of the present application, the computing device may further include:
a second obtaining module 205, configured to obtain original text information of X1 color value regions;
a third selecting module 206, configured to select text information of X2 color value regions from the original text information of X1 color value regions when an absolute value of a difference between X1 and M is greater than a preset threshold;
the calculating module 207 is configured to calculate new text information of X3 color value regions according to the text information of the X2 color value regions and the sample sampling formula;
and the determining module 208 is used for determining that the sum of the new text information of the X3 color value areas and the original text information of the X1 color value areas is the text information of the N color value areas.
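The balance check performed before oversampling can be sketched as follows. The function name and the policy of closing the whole gap (so that the class sizes become equal) are assumptions made for illustration; the application only requires that the absolute difference between the final counts be smaller than the preset threshold.

```python
def samples_to_generate(minority_count, majority_count, threshold):
    """Number of new samples (X3) to synthesize for the smaller class.

    Returns 0 when the classes are already within `threshold` of each other.
    Otherwise returns the full gap, so that minority_count + X3 equals
    majority_count (an assumed policy, one of many that satisfy the constraint).
    """
    gap = majority_count - minority_count
    if gap <= threshold:
        return 0
    return gap

# Hypothetical counts: 100 color value samples versus 1000 game partition samples.
x3 = samples_to_generate(100, 1000, threshold=50)
```

With these hypothetical counts, N would be the 100 original samples plus the `x3` newly generated ones.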
Optionally, in some embodiments of the present application,
the calculating module 207 is specifically configured to determine neighboring text information of X3 color value regions according to the text information of the X2 color value regions and the Euclidean distance, and to calculate new text information of X3 color value regions according to the neighboring text information of the X3 color value regions and the sample sampling formula.
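Determining neighboring text information by Euclidean distance can be sketched as a nearest-neighbor lookup over feature vectors. The function name, the value of k and the sample vectors are hypothetical; each vector is assumed to hold the three features (sentence length, maximum IDF, maximum TF).

```python
import math

def nearest_neighbors(samples, index, k):
    """Indices of the k samples closest to samples[index], excluding itself."""
    target = samples[index]
    others = [i for i in range(len(samples)) if i != index]
    others.sort(key=lambda i: math.dist(target, samples[i]))  # Euclidean distance
    return others[:k]

# Hypothetical feature vectors: (sentence length, max IDF, max TF).
vectors = [
    (3.0, 0.7, 0.5),
    (4.0, 0.7, 0.4),
    (12.0, 1.9, 0.2),
    (3.0, 0.9, 0.5),
]
neighbors = nearest_neighbors(vectors, 0, 2)
```

Because sentence length has a much larger numeric range than the other two features, it dominates the raw distance; scaling the features first may be worth considering, although the application does not state this.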
Optionally, in some embodiments of the present application,
a second obtaining module 205, configured to obtain original text information of Y1 game partitions;
a third selecting module 206, configured to select text information of Y2 game partitions from the original text information of Y1 game partitions when the absolute value of the difference between Y1 and M is greater than a preset threshold;
the calculating module 207 is used for calculating new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
and the determining module 208 is used for determining the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions as the text information of the M game partitions.
Optionally, in some embodiments of the present application,
the calculating module 207 is specifically configured to determine neighboring text information of Y3 game partitions according to the text information of the Y2 game partitions and the Euclidean distance, and to calculate new text information of Y3 game partitions according to the neighboring text information of the Y3 game partitions and the sample sampling formula.
Optionally, in some embodiments of the present application, the feature selection formula is:
where G(A) denotes the information gain of attribute A, Split(A) denotes the split information of attribute A, T(F) denotes the degree of association between attribute A and the attributes other than A, F denotes the set of attributes other than A, and the adjustment coefficient takes a value in (0, 1).
Optionally, in some embodiments of the present application, the sample sampling formula is:
$s_i = x_i + \tau \cdot \max(0.1, |x_{ij} - x_i|)$,
where $s_i$ denotes the $i$-th new sample, $x_i$ denotes any one of the minority-class samples, $x_{ij}$ denotes the $j$-th neighboring sample of $x_i$, $0 \le j \le N$, $N$ denotes the number of randomly selected samples, and the adjustment coefficient $\tau$ takes a value in $(0, 1)$.
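The sample sampling formula can be sketched in a few lines. This is an illustration under an assumption: the absolute difference between the neighbor and the sample is taken per feature component, which the text does not state explicitly. The function name `synthesize` and the example vectors are hypothetical.

```python
def synthesize(x_i, x_ij, tau):
    """Apply s_i = x_i + tau * max(0.1, |x_ij - x_i|) to each feature component."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie in the open interval (0, 1)")
    return [a + tau * max(0.1, abs(b - a)) for a, b in zip(x_i, x_ij)]

x_i = [3.0, 0.7, 0.5]   # a minority-class sample (hypothetical values)
x_ij = [4.0, 0.7, 0.4]  # one of its neighbors (hypothetical values)
new_sample = synthesize(x_i, x_ij, tau=0.5)
```

The floor of 0.1 keeps the new sample from coinciding with the original one when a neighbor is identical in some component. As written, the offset is never negative, so every component of the new sample is larger than that of the original.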
As shown in FIG. 4, an embodiment of the present application provides a computing device, which includes a memory 410, a processor 420, and a computer program 411 stored in the memory 410 and executable on the processor 420. When the processor 420 executes the computer program 411, the following steps may be implemented:
acquiring text information of N color value areas and text information of M game partitions in a current scene, wherein N and M are integers greater than 0, and the absolute value of the difference between N and M is smaller than a preset threshold;
selecting A pieces of text information from the text information of the N color value areas and the text information of the M game partitions, wherein each of the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises the maximum inverse document frequency (IDF) value of the words in the sentence, and the third feature comprises the maximum term frequency (TF) value of the words in the sentence;
selecting at least two features from the first feature, the second feature and the third feature as candidate features;
and, according to the candidate features and a feature selection formula, selecting the feature with the largest information gain to split the nodes of a decision tree, so as to generate a random forest model.
Optionally, in some embodiments of the present application, the processor 420, when executing the computer program 411, may further implement the following steps:
acquiring original text information of X1 color value areas;
when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas;
calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Optionally, in some embodiments of the present application, the processor 420, when executing the computer program 411, may further implement the following steps:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas;
and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Optionally, in some embodiments of the present application, the processor 420, when executing the computer program 411, may further implement the following steps:
acquiring original text information of Y1 game partitions;
when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions;
calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Optionally, in some embodiments of the present application, the processor 420, when executing the computer program 411, may further implement the following steps:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions;
and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
Optionally, in some embodiments of the present application, the feature selection formula is:
where G(A) denotes the information gain of attribute A, Split(A) denotes the split information of attribute A, T(F) denotes the degree of association between attribute A and the attributes other than A, F denotes the set of attributes other than A, and the adjustment coefficient takes a value in (0, 1).
Optionally, in some embodiments of the present application, the sample sampling formula is:
$s_i = x_i + \tau \cdot \max(0.1, |x_{ij} - x_i|)$,
where $s_i$ denotes the $i$-th new sample, $x_i$ denotes any one of the minority-class samples, $x_{ij}$ denotes the $j$-th neighboring sample of $x_i$, $0 \le j \le N$, $N$ denotes the number of randomly selected samples, and the adjustment coefficient $\tau$ takes a value in $(0, 1)$.
Referring to fig. 5, fig. 5 is a schematic diagram illustrating an embodiment of a computer-readable storage medium according to the present invention.
As shown in fig. 5, the present embodiment provides a computer-readable storage medium, on which a computer program 511 is stored, and the computer program 511, when executed by a processor, can implement the following steps:
acquiring text information of N color value areas and text information of M game partitions in a current scene, wherein N and M are integers greater than 0, and the absolute value of the difference between N and M is smaller than a preset threshold;
selecting A pieces of text information from the text information of the N color value areas and the text information of the M game partitions, wherein each of the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises the maximum inverse document frequency (IDF) value of the words in the sentence, and the third feature comprises the maximum term frequency (TF) value of the words in the sentence;
selecting at least two features from the first feature, the second feature and the third feature as candidate features;
and, according to the candidate features and a feature selection formula, selecting the feature with the largest information gain to split the nodes of a decision tree, so as to generate a random forest model.
Optionally, in some embodiments of the present application, the computer program 511, when executed by the processor, may further implement the following steps:
acquiring original text information of X1 color value areas;
when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas;
calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Optionally, in some embodiments of the present application, the computer program 511, when executed by the processor, may further implement the following steps:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas;
and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Optionally, in some embodiments of the present application, the computer program 511, when executed by the processor, may further implement the following steps:
acquiring original text information of Y1 game partitions;
when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions;
calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Optionally, in some embodiments of the present application, the computer program 511, when executed by the processor, may further implement the following steps:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions;
and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in practice; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (10)
1. A method of text classification, comprising:
acquiring text information of N color value areas and text information of M game partitions in a current scene, wherein N and M are integers greater than 0, and the absolute value of the difference between N and M is smaller than a preset threshold;
selecting A pieces of text information from the text information of the N color value areas and the text information of the M game partitions, wherein each of the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises the maximum inverse document frequency (IDF) value of the words in the sentence, and the third feature comprises the maximum term frequency (TF) value of the words in the sentence;
selecting at least two features from the first feature, the second feature and the third feature as candidate features;
and, according to the candidate features and a feature selection formula, selecting the feature with the largest information gain to split the nodes of a decision tree, so as to generate a random forest model.
2. The method of claim 1, wherein before obtaining the text information of the N color value regions and the text information of the M game partitions in the current scene, the method further comprises:
acquiring original text information of X1 color value areas;
when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas;
calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
3. The method according to claim 2, wherein calculating new text information of X3 color value regions according to the text information of the X2 color value regions and a sample sampling formula comprises:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas;
and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
4. The method of claim 1, wherein before obtaining the text information of the N color value regions and the text information of the M game partitions in the current scene, the method further comprises:
acquiring original text information of Y1 game partitions;
when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions;
calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
5. The method of claim 4, wherein calculating new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula comprises:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions;
and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
6. The method according to any of claims 1-5, wherein the feature selection formula is:
wherein G(A) denotes the information gain of attribute A, Split(A) denotes the split information of attribute A, T(F) denotes the degree of association between attribute A and the attributes other than A, F denotes the set of attributes other than A, and the adjustment coefficient takes a value in (0, 1).
7. The method according to any of claims 2-5, wherein the sample sampling formula is:
$s_i = x_i + \tau \cdot \max(0.1, |x_{ij} - x_i|)$,
wherein $s_i$ denotes the $i$-th new sample, $x_i$ denotes any one of the minority-class samples, $x_{ij}$ denotes the $j$-th neighboring sample of $x_i$, $0 \le j \le N$, $N$ denotes the number of randomly selected samples, and the adjustment coefficient $\tau$ takes a value in $(0, 1)$.
8. A computing device, comprising:
a first obtaining module, configured to obtain text information of N color value areas and text information of M game partitions in a current scene, wherein N and M are integers greater than 0, and the absolute value of the difference between N and M is smaller than a preset threshold;
a first selection module, configured to select A pieces of text information from the text information of the N color value areas and the text information of the M game partitions, wherein each of the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises the maximum inverse document frequency (IDF) value of the words in the sentence, and the third feature comprises the maximum term frequency (TF) value of the words in the sentence;
a second selection module, configured to select at least two features from the first feature, the second feature and the third feature as candidate features;
and a generating module, configured to select, according to the candidate features and a feature selection formula, the feature with the largest information gain to split the nodes of a decision tree, so as to generate a random forest model.
9. A computing device comprising a processor for implementing the steps of the text classification method according to any one of claims 1 to 7 when executing a computer program stored in a memory.
10. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text classification method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811158905.6A CN109284382B (en) | 2018-09-30 | 2018-09-30 | Text classification method and computing device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109284382A true CN109284382A (en) | 2019-01-29 |
CN109284382B CN109284382B (en) | 2021-05-28 |
Family
ID=65182189
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811158905.6A Active CN109284382B (en) | 2018-09-30 | 2018-09-30 | Text classification method and computing device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109284382B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120288207A1 (en) * | 2010-02-02 | 2012-11-15 | Alibaba Group Holding Limited | Method and System for Text Classification |
CN103473231A (en) * | 2012-06-06 | 2013-12-25 | 深圳先进技术研究院 | Classifier building method and system |
CN107292186A (en) * | 2016-03-31 | 2017-10-24 | 阿里巴巴集团控股有限公司 | A kind of model training method and device based on random forest |
CN107357895A (en) * | 2017-01-05 | 2017-11-17 | 大连理工大学 | A kind of processing method of the text representation based on bag of words |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110390400A (en) * | 2019-07-02 | 2019-10-29 | 北京三快在线科技有限公司 | Feature generation method, device, electronic equipment and the storage medium of computation model |
CN110390400B (en) * | 2019-07-02 | 2023-07-14 | 北京三快在线科技有限公司 | Feature generation method and device of computing model, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109284382B (en) | 2021-05-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||