CN114564575A - Method for identifying bad texts in large amount of texts based on inclined random forest processing - Google Patents

Method for identifying bad texts in large amount of texts based on inclined random forest processing

Info

Publication number
CN114564575A
CN114564575A CN202210058001.6A CN202210058001A
Authority
CN
China
Prior art keywords
text
random forest
vector
objectionable
texts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210058001.6A
Other languages
Chinese (zh)
Inventor
张攀峰
阚学达
汪玉坤
杜慧
敬超
陶小梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Technology
Original Assignee
Guilin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Technology filed Critical Guilin University of Technology
Priority to CN202210058001.6A priority Critical patent/CN114564575A/en
Publication of CN114564575A publication Critical patent/CN114564575A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a method for identifying objectionable texts in a large number of texts based on oblique random forest processing. The method comprises: reading text data; preprocessing the text data to obtain text vectors; establishing an objectionable-text dictionary to judge the objectionable rate of each text vector, defining vectors that fail the judgment as objectionable text and, for vectors that pass, obtaining a text set and executing the next step; constructing an oblique random forest classification model from the text set using a random forest; and classifying the text vectors with the oblique random forest classification model to obtain a text classification result. The classification results of the oblique random forest model are highly accurate, which solves the problem of the low classification accuracy of the traditional random forest algorithm.

Description

Method for identifying bad texts in large amount of texts based on inclined random forest processing
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a method for identifying objectionable texts in a large number of texts based on oblique random forest processing.
Background
With the rapid development of the internet, event detection in massive data has become a research hotspot. However, existing social network event detection methods rarely consider filtering the data in short texts. Effectively detecting sensitive information in short texts, limiting its uncontrolled spread, and preventing objectionable information from polluting network resources therefore benefits the development of the internet.
Random Forest (RF) is one of the important classification techniques and has also been applied to text filtering. However, when the random forest algorithm classifies an unbalanced data set, its accuracy on the minority classes is too low, which reduces overall classification accuracy. Network filtering must pick sensitive words out of massive amounts of data, so the traditional random forest algorithm limits both the efficiency and the accuracy of event detection.
Disclosure of Invention
The invention aims to provide a method for identifying objectionable texts in a large number of texts based on oblique random forest processing, so as to solve the problem of the low classification accuracy of the traditional random forest algorithm.
To achieve this purpose, the invention provides a method for identifying objectionable texts in a large number of texts based on an oblique random forest, which comprises the following steps:
s1, reading the text data;
s2, preprocessing the text data to obtain a text vector;
s3, establishing an objectionable-text dictionary to judge the objectionable rate of the text vector; if the text vector is judged unqualified, defining it as objectionable text; if it is judged qualified, obtaining a text set and executing step S4;
s4, constructing an oblique random forest classification model from the text set using random forests;
s5, classifying the text vectors with the oblique random forest classification model to obtain a text classification result.
The specific way of acquiring the text data is as follows:
the text data is read with a web crawler and by calling an API.
The specific way of preprocessing the text data to obtain the text vector is as follows:
s21, performing word segmentation selection on the text data to obtain a selected text;
s22, performing feature selection on the selected text to obtain a feature text;
s23, distinguishing the long text and the short text of the characteristic text by using a convolutional neural network to obtain a text vector.
The specific way of establishing the objectionable-text dictionary to judge the objectionable rate of the text vector (defining text vectors that fail the judgment as objectionable text, obtaining a text set from text vectors that pass, and then executing step S4) is as follows:
s31, establishing an address dictionary and a keyword dictionary;
s32, the address dictionary judges the proportion of objectionable content at the address of the text vector; if the proportion of objectionable content at the address is greater than or equal to 50%, the text vector is defined as objectionable text; if it is less than 50%, step S33 is executed;
s33, the keyword dictionary judges text vectors containing sensitive words to be objectionable text and filters them out to obtain the text set, and step S4 is executed.
The specific method for constructing the oblique random forest classification model from the text set using the random forest is as follows:
s41, dividing the text set with cross-validation through the random forest to generate sample subsets;
s42, constructing a plurality of decision tree classification models based on the sample subsets;
s43, tallying the prediction result of each decision tree classification model using the random forest bagging idea, and taking the result with the most votes as the prediction result of the oblique random forest classification model.
The invention is a method for identifying objectionable texts in a large number of texts based on oblique random forest processing: read text data; preprocess the text data to obtain text vectors; establish an objectionable-text dictionary to judge the objectionable rate of each text vector, defining vectors that fail as objectionable text and obtaining a text set from vectors that pass; construct an oblique random forest classification model from the text set using a random forest; and classify the text vectors with the oblique random forest classification model to obtain a text classification result. The classification results of the oblique random forest model are highly accurate, which solves the problem of the low classification accuracy of the traditional random forest algorithm.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the method for identifying objectionable texts in a large amount of texts based on an oblique random forest.
Fig. 2 is a flowchart of preprocessing the text data to obtain a text vector.
Fig. 3 is a flowchart of establishing the objectionable-text dictionary to judge the objectionable rate of the text vector, defining text vectors that fail the judgment as objectionable text, obtaining a text set from text vectors that pass, and executing step S4.
FIG. 4 is a flowchart of constructing the oblique random forest classification model from the text set using random forests.
Fig. 5 is a flowchart of reading text data.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Referring to fig. 1 to 5, the present invention provides a method for identifying objectionable texts in a large amount of texts based on an oblique random forest, comprising the following steps:
s1 reading the text data;
specifically, the text data is read by adopting a web crawler and calling an API.
API stands for Application Programming Interface. An API is a set of predefined functions intended to give applications and developers the ability to access a group of routines in certain software or hardware without accessing source code or understanding the details of the internal workings. An API is also the call interface an operating system exposes to applications: an application makes the operating system execute its commands (actions) by calling the operating system's API. In Windows, the system API is provided in the form of function calls.
A web crawler, also known as a web spider, is widely used by internet search engines and similar websites to download web pages, data, or text from the internet for local processing. A traditional web crawler puts the URLs to be fetched into a fetch queue, which can be called a URL seed repository; each URL in it carries fetch-state information (newly submitted, fetching, fetched successfully, fetch failed, and so on). On each fetch, a newly submitted URL is selected from the queue, DNS is resolved to obtain the host IP, the page is fetched from that IP address, and the URL's state in the seed repository is updated. New URLs discovered in the fetched page, called link extensions, are submitted back into the seed repository, and fetching stops when some condition is met, such as a page-expansion depth limit.
For acquiring the base data for objectionable-text filtering, the required crawler differs from a traditional crawler program. As shown in fig. 5, a batch of web page URLs is placed as seed URLs into the queue to be fetched, i.e. the seed repository; these URLs share the same domain name and category and differ mainly in the ID under the category, each corresponding to a page of a given category on the site. Second, the crawler does not need to parse new URLs out of fetched pages: it uses a breadth-first strategy with a crawl depth of 1, because only the text content of the pages is needed and external links generally have no bearing on the study of that text. Finally, the fetched text content is placed at a designated location on a server or in a database as the text data, awaiting further analysis and filtering.
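By way of illustration, the following is a minimal sketch of such a depth-1, breadth-first crawler. The seed-URL pattern, the example.com domain, and the use of the requests library are illustrative assumptions, not part of the patent:

```python
import requests
from collections import deque

# Hypothetical seed URLs: same domain and category, differing only in the ID.
SEED_URLS = [f"https://example.com/category/{i}" for i in range(1, 101)]

def crawl_seed_repository(seed_urls, timeout=10):
    """Depth-1, breadth-first crawl: fetch each seed URL once, expand no links."""
    queue = deque(seed_urls)   # the queue to be fetched, i.e. the seed repository
    status = {}                # URL -> fetch state (fetching / success / failed)
    texts = []
    while queue:
        url = queue.popleft()
        status[url] = "fetching"
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            texts.append(resp.text)   # keep the page text for later filtering
            status[url] = "success"
        except requests.RequestException:
            status[url] = "failed"
    return texts, status
```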
S2, preprocessing the text data to obtain a text vector;
the concrete mode is as follows:
s21, performing word segmentation selection on the text data to obtain a selected text;
Specifically, because the fetched texts arrive as articles or paragraphs, the text data is first subjected to Chinese word segmentation and denoising. Chinese sentences are separated by punctuation, but the words within a sentence are not usually separated: they run together, without the advantage English has of spaces between words. Since most algorithms in natural language processing take words as the basic processing unit, word segmentation is an important stage of text preprocessing and directly affects subsequent experimental results.
The development language adopted by the invention is Python. NLPIR must be invoked from Python through a C++ library, whereas jieba segmentation has good native support for Python. Considering the system instability that calling the C++ library may cause and the time the calls consume, jieba is judged superior in use. Combining these factors, the invention selects jieba as the word segmentation algorithm. Because of the particularity of objectionable texts, Chinese word segmentation by a segmentation algorithm alone is incomplete, so an N-Gram method is added to supplement the segmentation.
N-Gram is a language model commonly used in large-vocabulary continuous speech recognition. The model assumes that the probability of the current word depends only on the previous N-1 words. If the probability of the current word depends only on the previous 1 word, this is the commonly used binary Bi-Gram; if it depends on the previous 2 words, the ternary Tri-Gram. The invention finally selects Bi-Gram.
A gram takes the single character as its basic unit, i.e. N adjacent characters are treated as one token. Because this division follows characters, original words can end up cut into two character features. The invention combines the advantages of word segmentation and the N-Gram method: the text is first segmented, and the segmentation result is then used as the input of the N-Gram, i.e. N adjacent words form one feature word. Splicing adjacent words in this way can be regarded as a simple supplement of context, making up for shortcomings in semantic understanding.
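To make this combination concrete, the sketch below segments a text with the jieba package (the tool the invention names) and supplements the words with character-level and word-level Bi-Grams; the helper functions themselves are illustrative:

```python
import jieba

def bigrams(tokens):
    """Adjacent pairs: a sequence of length L yields L - 1 Bi-Gram features."""
    return [tokens[i] + tokens[i + 1] for i in range(len(tokens) - 1)]

def short_text_features(text):
    """Combine jieba words with word- and character-level Bi-Grams."""
    words = list(jieba.cut(text))       # M segmentation results
    return (words                       # M basic feature words
            + bigrams(words)            # M - 1 word-level Bi-Grams
            + bigrams(list(text)))      # N - 1 character-level Bi-Grams
```

For a text of N characters that jieba splits into M words, this yields (N - 1) + M + (M - 1) = N + 2M - 2 features, matching the count derived below.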
S22, performing feature selection on the selected text to obtain a feature text;
Specifically, a computer can process data in structured form but cannot effectively recognize raw text input, so the form of the selected text must be converted first. After segmentation, each document can be regarded as a set of words and thus converted into a multidimensional vector whose features are words, words being able to represent the content of the text to some extent. Feature words serve as an intermediate representation of the document: they should represent the target text effectively while appearing rarely in other texts, i.e. representing other texts poorly. To simplify the computer's calculations and improve text processing efficiency, the feature words should be compact and strongly generalizing, representing as much of the text's information as possible with as few feature words as possible.
TF-IDF is a commonly used statistics-based method for computing term weights; the higher a term's weight, the more important the term. Its main idea is that a distinguishing, representative term should occur frequently in the target text but rarely in other texts, which makes it suitable for classification. TF-IDF is simply TF × IDF, as shown in formula 1.1.
tf-idf_{i,j} = tf_{i,j} × idf_i (formula 1.1)
TF (term frequency) is the frequency with which a keyword occurs in a document; put simply, it counts the keyword's occurrences in the document. Since articles vary in length, computing TF from raw occurrence counts would make the TF of a keyword relatively high in long texts, biasing the classification result by text length. TF is therefore usually normalized and expressed as the relative frequency of the term, which reduces the weight of high-frequency terms in long texts:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j} (formula 1.2)
In formula 1.2, the numerator n_{i,j} is the number of occurrences of word i in document j, and the denominator is the sum of the occurrence counts of all words in document j.
Wherein, idf (inverse Document frequency) is the inverse file frequency, and represents the general importance of the words. If the documents containing a certain entry are fewer, the IDF is larger, and the category distinguishing capability of the entry is stronger. The IDF of a term is calculated as shown in formula 1.3, wherein the numerator is the number of documents, and can be obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of the obtained quotient:
idf_i = log( |D| / (1 + |{j : t_i ∈ d_j}|) ) (formula 1.3)
where the numerator |D| is the total number of documents and the denominator is the number of documents containing the term t_i; 1 is usually added to the denominator to keep it from being 0 when no document contains the term.
Considering both the complexity of computing feature weights and the effectiveness with which they represent information, the invention adopts the TF-IDF method to weight the feature terms.
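A minimal sketch of the TF-IDF weighting of formulas 1.1 to 1.3, assuming the documents have already been tokenized into lists of terms:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf-idf weight} map per doc."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)                             # all term occurrences in doc j
        w = {}
        for term, n_ij in counts.items():
            tf = n_ij / total                        # formula 1.2
            idf = math.log(n_docs / (1 + df[term]))  # formula 1.3 (+1 avoids /0)
            w[term] = tf * idf                       # formula 1.1
        weights.append(w)
    return weights
```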
S23, distinguishing the long text and the short text of the characteristic text by using a convolutional neural network to obtain a text vector.
Specifically, the text data in the internet are various and different in length, so that the crawled data conforms to the distribution proportion in the internet as much as possible.
In the training data of the invention, long articles and short texts differ somewhat in expression. Text is passed to the model in the form of feature vectors, so the features should be enriched as much as possible to support correct training of the model.
The invention extracts the text's feature words by combining jieba segmentation with Bi-Gram, in order to compensate for the incompleteness of segmentation tools when handling out-of-vocabulary words in Chinese text.
In a short text, the result of the segmentation tool is used as the basic feature words and the N-Gram result as a supplement; this combined method genuinely and effectively improves classification accuracy. Suppose a text is N characters long: Bi-Gram over the characters extracts N-1 features. Suppose jieba outputs M words: these give M basic features, and Bi-Gram over the jieba output gives another M-1, so the text yields (N-1) + M + (M-1) = N + 2M - 2 features in total.
In long texts, this feature extraction method would greatly inflate the number of feature words, making feature computation more complicated. Because a long text is not length-constrained, its expression is already sufficient and jieba's segmentation result is accurate, so the Bi-Gram features are no longer used.
S3, establishing an objectionable-text dictionary to judge the objectionable rate of the text vector; if the text vector is judged unqualified, defining it as objectionable text; if it is judged qualified, obtaining a text set and executing step S4;
the concrete method is as follows:
s31, establishing an address dictionary and a keyword dictionary;
Specifically, the address dictionary and the keyword dictionary are populated manually when first established; feedback gathered while tallying objectionable texts is then used to update the dictionary contents.
S32, the address dictionary judges the proportion of objectionable content at the address of the text vector; if the proportion of objectionable content at the address is greater than or equal to 50%, the text vector is defined as objectionable text; if it is less than 50%, step S33 is executed;
Specifically, when the crawled text content is saved, the URL of the crawled page is saved with it, so the URL address is processed first. All text vectors under the same site form a set and can be regarded as different documents under that site; the proportion of objectionable text across all of the site's documents is then calculated. The invention specifies that when the proportion of a site's objectionable text reaches 50%, the site is added to an address blacklist. In subsequent objectionable-text filtering, a site is first checked against the address blacklist, and text from blacklisted sites is directly defined as objectionable text without further judgment, which improves filtering efficiency.
S33, the keyword dictionary judges text vectors containing sensitive words to be objectionable text and filters them out to obtain the text set, and step S4 is executed.
Specifically, the keyword dictionary judges the text vector as an inscription word, and judges the document containing the sensitive word as an objectionable text. And the words which have obvious objectionable text characteristics and other texts with high distinctiveness are added into the keyword dictionary, and the keywords of the texts are fed back to the keyword dictionary for updating the keyword dictionary for the contents judged as objectionable texts. The URL-based filtering and the keyword-based filtering can filter out text contents in bad websites, and some obvious text contents containing sensitive words can also be filtered out.
S4, constructing an oblique random forest classification model from the text set using random forests;
the concrete mode is as follows:
s41, dividing the text set with cross-validation through the random forest to generate sample subsets;
Specifically, a random forest is an ensemble classifier, and for each base classifier a sample subset must be generated as its input variable. There are several ways to divide the sample set with model evaluation in mind; in the invention the data set is divided by cross-validation: the texts to be trained are divided into k sub-datasets (k being any natural number greater than zero) according to their differing word counts, one sub-dataset serving as the test set and the others as training sets in each round of training, rotated over k rounds.
Specifically, determining an out-of-bag estimation matrix, which is a classification data set D of given n tuples with m attributes; for the given classification dataset of n tuples and m attributes, the decision tree extracts n tuples from the training dataset D, and labels the extracted tuples, per single classifier.
The classification dataset out-of-bag estimation matrix D includes: and the given n tuples with m attributes jointly form a classification dataset matrix.
For a given classification dataset D of n tuples with m attributes, assuming that a random forest has k single classifiers, the basic idea of bagging is that, for a cycle i (i ═ 1, 2.. multidata., k), the i-th decision tree extracts n tuples from the training dataset D that are put back, and labels the extracted tuples.
Determining an out-of-bag estimation matrix, which is a classification dataset D of m attributes of given n tuples; wherein the ith decision tree extracts n tuples from the training data set D and labels the extracted tuples to generate the sample subset.
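A minimal sketch of the two sampling schemes just described, cross-validation by rotation and bootstrap (bagging) sampling with its out-of-bag remainder, under the assumption that the dataset is a plain Python list:

```python
import random

def k_fold_split(text_set, k):
    """Cross-validation: k sub-datasets by rotation, one test fold per round."""
    folds = [text_set[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

def bootstrap_sample(dataset, rng=random):
    """Draw n tuples with replacement from D; the unsampled rest is out-of-bag."""
    n = len(dataset)
    picked = [rng.randrange(n) for _ in range(n)]
    sample = [dataset[i] for i in picked]
    oob = [dataset[i] for i in range(n) if i not in set(picked)]
    return sample, oob
```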
S42 constructing a plurality of decision tree classification models based on the sample subsets;
Specifically, in a random forest each base classifier is an independent decision tree. During construction of a decision tree, a splitting rule is used to seek the optimal feature for dividing the samples, so as to improve the accuracy of the final classification. The decision trees of a random forest are built in essentially the same way as ordinary decision trees; the difference is that the features considered when a random forest's decision tree splits are not searched over the complete feature set: k features (k being any natural number greater than zero) are randomly selected for the division. In the invention, each text vector serves as the root of a decision tree, the features of the text vector's label obtained with the convolutional neural network serve as child nodes of the decision tree, and the lower nodes are features re-extracted in turn, so that each decision tree is trained.
A splitting rule is the specific rule applied when a decision tree splits, e.g. which feature is selected and what the splitting condition is, as well as when splitting terminates. Since decision tree generation is otherwise fairly arbitrary, the tree is adjusted through the splitting rule to perform better.
Specifically, the skewed decision trees of estimators (parameters of random forest) obtained from the optimal splitting criterion of each node include: solving the minimum value of the kini index of each node to obtain the optimal splitting principle of each node, and finally obtaining the classification model sequence of the decision trees of the estimators;
To ensure that computational performance improves while the oblique splitting hyperplane is constructed, the invention uses the following formula as the cost function and takes the solved hyperplane parameters as the oblique splitting hyperplane:
J(θ) = -(1/n) Σ_{i=1}^{n} [ y_i · log h_θ(x_i) + (1 - y_i) · log(1 - h_θ(x_i)) ] (formula 7)
In this scheme, the oblique random forest classification model is constructed from the feature information of a large number of extracted texts. To obtain the optimal splitting criterion, the traditional random forest algorithm usually targets the Gini index of the node; the splitting criterion it solves for is a split point at a constant b on an attribute a_i (i = 1, 2, ..., m), i.e. a hyperplane perpendicular to one dimension of the data space. Such a split does not capture the geometric structure of the data space well. To ensure that performance improves while an oblique splitting hyperplane is constructed instead, the method uses formula 7 as the cost function and takes the solved hyperplane parameters as the oblique splitting hyperplane.
That is, the Gini index is used as the splitting criterion measuring the impurity of the data partition D;
with the splitting attribute A and the split point determined, the rules divide the dataset D into D1 and D2, and the Gini index of the splitting criterion is calculated;
the Gini index is often used as a splitting criterion for measuring the impurity degree of the data partition D in the random forest construction decision tree. The formula for defining and calculating the Gini index is as follows:
Gini(D) = 1 - Σ_i p_i²
where p_i is the probability that a tuple in D belongs to class C_i. With the splitting attribute A and the split point determined, the rules divide the dataset D into D1 and D2.
The Gini index of the splitting criterion is calculated by the following equation:
Gini_A(D) = (|D1| / |D|) · Gini(D1) + (|D2| / |D|) · Gini(D2)
The optimal splitting criterion of each node is obtained by minimizing Gini_A(D) at that node; finally the classification model sequence of k decision trees, {h1(X), h2(X), ..., hk(X)}, is obtained.
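A minimal sketch of the two Gini computations above, assuming the class labels are given as plain lists:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum of p_i^2 over the classes C_i present in D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Gini_A(D) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) \
         + (len(right_labels) / n) * gini(right_labels)
```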
The oblique splitting hyperplane differs from the splitting rule of the traditional random forest algorithm; the traditional splitting rule is a special case of the oblique splitting hyperplane.
The cost function expresses how badly the fitted model classifies samples with attributes X; taking the parameters as the argument and solving for the minimum of the cost function gives the values at which the classification error is smallest. The specific calculation is as follows:
Logistic regression is a common algorithm for solving classification problems. For a given sample attribute vector X, the hypothesis is as follows:
h_θ(X) = g(θ^T X) (formula 2.1)
where θ is the weight parameter vector;
the function g is calculated as follows:
g(z) = 1 / (1 + e^(-z)) (formula 2.2)
From formulas 2.1 and 2.2, when z > 0 in the function g(z), g(z) > 0.5 and the sample class y is predicted to be 1; conversely, when z < 0, g(z) < 0.5 and the sample class y is predicted to be 0.
For a trained logistic regression model h_θ(X) = g(θ^T X), a sample X_i is substituted into the decision-boundary expression θ^T X, and whether the expression is greater than 0 determines the class the sample belongs to. θ^T X = 0 is referred to as the decision boundary of the model.
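A minimal sketch of formulas 2.1 and 2.2 and of the decision boundary θ^T X = 0, assuming NumPy arrays with any bias term already folded into θ:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)) (formula 2.2)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(theta, X):
    """h_theta(X) = g(theta^T X) (formula 2.1): class 1 iff theta^T X > 0."""
    return (X @ theta > 0).astype(int)   # theta^T X = 0 is the decision boundary
```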
After the value minimizing the cost function is obtained, the oblique decision tree is established, with the oblique splitting hyperplane as the classification algorithm for the node-splitting criterion of the oblique decision tree. The specific calculation process is as follows:
An oblique decision tree is a decision tree in which decision boundaries are the splitting criterion of each node in the tree. Assume a class ObliqueDecisionTree; for each instance object Node of the class, the variable classLabel represents the class label of the current node, the variable dataset represents the data partition of the current node, the variables leftChildTree and rightChildTree point to the left and right subtrees respectively, and the variable obliqueSplitHyperplane represents the value θ of the current node's splitting hyperplane. The oblique splitting hyperplane of each oblique decision tree node in the OFDB algorithm is: θ^T · X + θ_0 = 0;
The splitting criterion of the traditional random forest algorithm can be written as x = b, where x represents some attribute and b some constant. This formula can be expressed by the oblique-splitting-hyperplane formula, i.e. the traditional splitting criterion is a special case of the oblique splitting hyperplane, and the oblique splitting hyperplane considers a wider range of splits.
Note that a single hyperplane at a single node can only solve a two-class problem. To solve multi-class problems, the algorithm uses a one-vs-rest strategy, performing a binary split at each node between the largest class and all the other classes, which makes the algorithm suitable for multi-class problems.
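A hedged sketch of the node structure and routing just described: the field names mirror the variables named above (classLabel, dataset, obliqueSplitHyperplane, and the child pointers), while the function name and the convention of storing θ_0 as the last component of θ are assumptions:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class ObliqueNode:
    class_label: Optional[int] = None       # classLabel (set only on leaf nodes)
    dataset: Optional[np.ndarray] = None    # data partition of the current node
    theta: Optional[np.ndarray] = None      # obliqueSplitHyperplane: [theta..., theta_0]
    left: Optional["ObliqueNode"] = None    # leftChildTree
    right: Optional["ObliqueNode"] = None   # rightChildTree

def route(node: ObliqueNode, x: np.ndarray) -> int:
    """Follow the splits theta^T . x + theta_0 = 0 down to a leaf label."""
    while node.class_label is None:
        side = node.theta @ np.append(x, 1.0)   # append 1 for the theta_0 term
        node = node.left if side <= 0 else node.right
    return node.class_label
```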
Considering the effect that different ratios of positive- and negative-class samples have on the classification algorithm, the invention computes the class label at leaf nodes as shown in the following formula:
classLabel = argmax_c ( W_c · Σ_i I(y_i = c) )
where argmax selects the class with the maximum value, W_c is the weight of class c, and I(y_i = c) counts, by traversal, the tuples of each class on the current leaf node. The leaf-node labelling method is thus improved according to the proportions of the classes in the sample.
Then the combination weight coefficient of each base classifier in the random forest classification model is adjusted with a loss function to generate a new oblique random forest classification model, and the text to be identified is identified with the new model. The final loss of the oblique random forest classification model is minimized, which greatly improves the model's accuracy in identifying and classifying objectionable texts.
In the invention the cost function is considered convex, i.e. a local minimum of the function is the global minimum, and the minimum of the cost function is solved by the gradient descent method. The update rule for the parameter θ is:
θ_j := θ_j - α · ∂J(θ)/∂θ_j
where ∂/∂θ_j denotes the partial derivative with respect to θ_j and α is the learning-rate step size.
To ensure rapid convergence of the gradient descent method, before training the invention normalizes the dataset with
x' = (x - x_min) / (x_max - x_min)
mapping the values of the dataset into the interval [0, 1].
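A minimal sketch of the normalization and gradient descent steps, using the standard gradient of the logistic cost of formula 7; the learning rate and iteration count are illustrative choices:

```python
import numpy as np

def min_max_normalize(X):
    """Map each feature column of X into [0, 1] before training."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """theta_j := theta_j - alpha * dJ/dtheta_j for the convex logistic cost."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(X), formula 2.1
        theta -= alpha * (X.T @ (h - y)) / len(y)
    return theta
```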
S43, tallying the prediction result of each decision tree classification model using the random forest bagging idea, and taking the result with the most votes as the prediction result of the oblique random forest classification model.
Specifically, the classification result of the random forest is obtained by voting by each base classifier, i.e., decision tree. And the random forest looks at the base classifier identically, each decision tree obtains a classification result, voting results of all the decision trees are collected and accumulated, and the result with the highest vote number is the final result. Accordingly, according to the score of each child node (label) of each decision tree (text vector needing label classification), if the score of the label exceeds the threshold t set by the application, the label is considered to be capable of interpreting the text vector, so that all labels of the text vector are obtained. The threshold t is determined by accumulating the voting results of all classifiers of the decision tree by 0.3.
Random forest prediction uses a majority voting method, calculated with the following formula:
H(X) = argmax_Y Σ_{i=1}^{k} I(h_i(X) = Y)
y indicates which class the final possible output result is. Y for class two may take the value 0, 1.
Where H (X) represents a combined classification model and hi is a single decision tree classification model. And obtaining the overall classification result of the random forest by a majority voting method in the prediction. Out-of-bag Estimation (OOBE) is used to estimate the classification capability of a random forest, which is tested using the non-sampled tuples in each training subset of decision trees.
S5, classifying the text vectors with the oblique random forest classification model to obtain a text classification result.
Selecting N extracted texts (the text vectors) and obtaining the values of the N extracted texts' characteristic variables, wherein the N extracted text types include confirmed good texts and confirmed objectionable texts, and each extracted text corresponds to a sentence characteristic variable;
taking the N extracted texts as an original sample set and constructing the oblique random forest classification model based on the original sample set, wherein the input of the oblique random forest classification model is the value of an extracted text's characteristic variable and the output is the probability, judged by all the base classifiers in the classification model, that the text is objectionable text;
constructing a loss function of the oblique random forest classification model, wherein the independent variables of the loss function are the weight coefficients of all the base classifiers;
solving for the optimal independent variables at which the dependent variable of the loss function is minimal, updating the weight coefficients of all the base classifiers according to the optimal solution, and generating a new oblique random forest classification model;
inputting the value of the extracted text characteristic variable to be identified into the new random forest classification model to obtain an output result;
and determining, according to the output result, whether the extracted text to be identified is good text or objectionable text.
Unbalanced datasets are generally those whose class sizes differ greatly: the minority class has few samples, or few relative to the majority class. With the explosive growth of data in recent years, unbalanced data appears across industries: in e-commerce commodity recommendation, purchased recommended items are a small fraction; in bank credit card fraud detection, problem cards are a small fraction of all credit cards; in network security attack identification, attacks occur far less often than normal network connections.
An ordinary random forest resamples the dataset to generate multiple decision trees for ensemble learning. Under equal sampling probability, the minority class is even rarer in each resampled set, making the trained model more unbalanced and seriously biasing the accuracy of its predictions.
The method selects the oblique random forest to classify the samples, randomly divides positive and negative samples of each class into a training set and a validation set in their actual proportions as input data, and uses the predicted real value output in place of a binary classification value to express the likelihood that a sample is judged to belong to a class. Because a class with fewer samples carries less information, its data distribution is hard to determine, and its samples may be assigned to the class with more samples, lowering the accuracy of class determination.
Random forests embody two kinds of randomness during generation, randomness in sample selection and randomness in feature selection, also called row sampling and column sampling; the advantage is that overfitting is not easily produced. Randomness of sample selection means that when generating a single decision tree, the tree's training set is drawn with replacement from the total training sample set, so the sample set may contain repeated samples. Randomness of feature selection means that a feature subset is randomly chosen from the full extracted feature set to train the base learner. Assuming there are M feature attributes, m of them are selected and a single decision tree is built over those m attributes; in general, the tree grows until some leaf node can no longer split or the samples falling on it all belong to one class. The two random sampling processes guarantee randomness, so pruning is unnecessary. Random forests are simple to implement, cheap to compute, and well suited to cases with many feature dimensions.
The decision tree algorithm has low training time complexity and fast prediction, but a single decision tree overfits easily; pruning reduces this but cannot eliminate it. Generating many trees and combining them into one model greatly reduces the weaknesses of a single decision tree. Roughly, if each decision tree judges different features and its judgments are accurate, then multiple decision trees dividing the work give a better result than the judgment of any one tree; this is the idea of ensemble learning. The base learners are mutually independent: each learner's training is unrelated to the others, so they can be generated in parallel, needing only to satisfy their own optimization conditions. Training the random forest model aims to reduce the model's variance, i.e. the difference between the output values of models trained on different training sets; the model keeps that difference minimal across repeated training.
A decision tree is a model that classifies samples by their feature values, its attribute judgments branching in a tree structure. Decision tree training is supervised: for the given training data, the most discriminative feature, also called the optimal feature, is selected each time as the judgment condition of the current node. The training data is divided according to the different outputs under the current condition, and judgment continues at the next node. When every resulting subset can be classified correctly, or a stopping condition is met, the subsets become leaf nodes. When new data arrives, it is judged along the established tree structure, passing through each node's attribute in turn from the root until a leaf node is reached; the class defined by that leaf node is then taken as the data's class. The idea of error-rate-reduction pruning is to work upward from the leaf nodes toward the root, deleting the subtree rooted at a node and comparing the tree's loss function values before and after deletion: if the loss does not grow, the node is indeed deleted; if it grows, the node is kept.
While the foregoing describes only a preferred embodiment of the method for identifying objectionable texts in a large amount of text based on an oblique random forest, it should be understood that the scope of the invention is not limited to it: equivalent implementations or changes that those of ordinary skill in the art can make from all or part of the process described above remain within the scope of the invention as defined by the appended claims.

Claims (5)

1. A method for identifying objectionable texts in a large number of texts based on oblique random forest processing, characterized by comprising the following steps:
s1, reading the text data;
s2, preprocessing the text data to obtain a text vector;
s3, establishing an objectionable-text dictionary to judge the objectionable rate of the text vector; if the text vector is judged unqualified, defining it as objectionable text; if it is judged qualified, obtaining a text set and executing step S4;
s4, constructing an oblique random forest classification model from the text set using random forests;
s5, classifying the text vectors with the oblique random forest classification model to obtain a text classification result.
2. The method for identifying objectionable texts in a large number of texts based on oblique random forest processing as recited in claim 1, characterized in that
the specific way of acquiring the text data is as follows:
reading the text data with a web crawler and by calling an API.
3. The method for identifying objectionable texts in a large number of texts based on oblique random forest processing as recited in claim 1, characterized in that
the specific way of preprocessing the text data to obtain the text vector is as follows:
s21, performing word segmentation selection on the text data to obtain a selected text;
s22, performing feature selection on the selected text to obtain a feature text;
s23, distinguishing the long text and the short text of the characteristic text by using a convolutional neural network to obtain a text vector.
4. The method for identifying objectionable texts in a large number of texts based on oblique random forest processing as recited in claim 1, characterized in that
the specific way of establishing the objectionable-text dictionary to judge the objectionable rate of the text vector, defining text vectors that fail the judgment as objectionable text, obtaining a text set from text vectors that pass, and executing step S4 is as follows:
s31, establishing an address dictionary and a keyword dictionary;
s32, the address dictionary judges the proportion of objectionable content at the address of the text vector; if the proportion of objectionable content at the address is greater than or equal to 50%, the text vector is defined as objectionable text; if it is less than 50%, step S33 is executed;
s33, the keyword dictionary judges text vectors containing sensitive words to be objectionable text and filters them out to obtain the text set, and step S4 is executed.
5. The method for identifying objectionable texts in a large number of texts based on oblique random forest processing as recited in claim 1, characterized in that
the specific method for constructing the oblique random forest classification model from the text set using the random forest is as follows:
s41, dividing the text set with cross-validation through the random forest to generate sample subsets;
s42, constructing a plurality of decision tree classification models based on the sample subsets;
s43, tallying the prediction result of each decision tree classification model using the random forest bagging idea, and taking the result with the most votes as the prediction result of the oblique random forest classification model.
CN202210058001.6A 2022-01-19 2022-01-19 Method for identifying bad texts in large amount of texts based on inclined random forest processing Pending CN114564575A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210058001.6A CN114564575A (en) 2022-01-19 2022-01-19 Method for identifying bad texts in large amount of texts based on inclined random forest processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210058001.6A CN114564575A (en) 2022-01-19 2022-01-19 Method for identifying bad texts in large amount of texts based on inclined random forest processing

Publications (1)

Publication Number Publication Date
CN114564575A true CN114564575A (en) 2022-05-31

Family

ID=81711102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210058001.6A Pending CN114564575A (en) 2022-01-19 2022-01-19 Method for identifying bad texts in large amount of texts based on inclined random forest processing

Country Status (1)

Country Link
CN (1) CN114564575A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination