WO2017190527A1 - Text data classification method and server - Google Patents

Text data classification method and server

Info

Publication number
WO2017190527A1
Authority
WO
WIPO (PCT)
Prior art keywords
support vector
training set
target
classification
feature word
Prior art date
Application number
PCT/CN2017/070464
Other languages
English (en)
French (fr)
Inventor
马洪芹
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2017190527A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/19 - Recognition using electronic means
    • G06V30/191 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification

Definitions

  • The present invention relates to the field of computer technologies, and in particular to a text data classification method and server.
  • A Support Vector Machine (English: Support Vector Machine, SVM for short) is a supervised learning model commonly used for pattern recognition, classification, and regression analysis.
  • FIG. 1 is a schematic flowchart of data classification based on an SVM algorithm in the prior art, which specifically includes the following steps:
  • The classification server acquires classified text data and extracts the feature words in the classified text data by using a preset word segmentation algorithm.
  • The weight of each feature word is calculated, and the weights of the feature words are represented by vectors.
  • One part of the obtained vectors is used as a training set, and another part is used as a test set.
  • The vectors in the training set are analyzed by the SVM training system to obtain a model file, and the vectors in the test set are classified by the model file.
  • With reference to the pre-classified results, it is judged whether the error rate of the results classified by the model file is within a preset range. If the error rate is not within the preset range, the training set is re-acquired and a model file is calculated based on the newly acquired training set; if the error rate is within the preset range, the model file is used as the model for classifying text data.
  • Unclassified data is then obtained and its feature words are extracted by the preset word segmentation algorithm; the weight of each feature word is calculated and represented by a vector; and the model file whose classification error rate falls within the preset range classifies the vector and outputs the classification result.
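For concreteness, the prior-art flow of FIG. 1 might look like the following sketch; scikit-learn, the toy documents, and the 5% range are illustrative assumptions, not details given by the patent.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Pre-classified text data and its preset classes (toy data).
documents = ["red wine tasting notes", "grape spirits review",
             "healthy diet recipes", "low fat diet plan"]
labels = [1, 1, -1, -1]

# Word segmentation + feature-word weighting, represented as vectors.
vectors = TfidfVectorizer().fit_transform(documents)

# One part of the vectors is used as the training set, another as the test set.
train_x, test_x, train_y, test_y = train_test_split(
    vectors, labels, test_size=0.5, stratify=labels, random_state=0)

model = SVC(kernel="linear").fit(train_x, train_y)  # the "model file"
error_rate = 1.0 - model.score(test_x, test_y)      # compare with preset classes
if error_rate > 0.05:  # error rate outside the preset range:
    pass               # the prior art simply re-draws the training set
```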
  • A disadvantage of the prior art is that when the error rate of the classification model's results exceeds the preset range, the re-acquired training set is obtained by chance and will not necessarily reduce the error rate of the model file's classification.
  • Embodiments of the present invention disclose a text data classification method and server, which can reduce the error rate of classification by the classification model.
  • In a first aspect, an embodiment of the present invention provides a text data classification method, where the method includes:
  • The server analyzes a first training set by using a support vector machine SVM algorithm, and performs a classification test on a first test set according to a first classification model obtained from the analysis, where the first training set and the first test set each include multiple support vectors, each support vector includes K weighting factors corresponding to K feature words, each weighting factor corresponds to one feature word, the numerical value of a weighting factor is positively correlated with the number of occurrences of its corresponding feature word in the text data described by the support vector, and K is a positive integer greater than 1;
  • the server calculates the relative weight of each of the K feature words in a target support vector according to the weighting factors of the target support vector and the parameters in the first classification model, where the target support vector is a support vector in the first test set whose classification test result obtained using the first classification model does not match its preset classification;
  • the server analyzes a second training set by using the SVM algorithm, and performs a classification test on a second test set according to a second classification model obtained from the analysis, where the support vectors in the second training set and the second test set each include the weighting factors corresponding to the K feature words other than a target feature word, and the target feature word is a feature word whose relative weight in the target support vector is less than a first preset threshold; if the classification error rate obtained from the classification test of the second classification model is not higher than a target preset threshold, it is confirmed that the second classification model is used to classify the text data to be classified.
  • Optionally, the weight of a feature word in the target support vector is positively correlated with the number of occurrences of that feature word in the text data corresponding to the target support vector, and the weight may be represented by a weighting factor; the weight of a feature word in the first training set specifically refers to a weight obtained by averaging the weights of that feature word in the respective support vectors of the first training set.
  • For example, if the first training set includes support vectors X1, X2, X3, and X4, the weight of feature word 1 in X1, the weight of feature word 1 in X2, the weight of feature word 1 in X3, and the weight of feature word 1 in X4 are added, and the value obtained by dividing the sum by 4 is the relative weight of feature word 1 in the first training set.
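The averaging in this example can be written out directly; the weighting factors below are invented toy values rather than values from the patent.

```python
# Relative weight of feature word 1 in a first training set containing the
# support vectors X1..X4 (toy weighting factors for 3 feature words).
X = {
    "X1": [3, 0, 1],
    "X2": [2, 1, 0],
    "X3": [4, 0, 2],
    "X4": [1, 2, 1],
}
weights_of_word1 = [vec[0] for vec in X.values()]       # weights of feature word 1
relative_weight_word1 = sum(weights_of_word1) / len(X)  # (3 + 2 + 4 + 1) / 4 = 2.5
print(relative_weight_word1)
```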
  • By performing the above steps, the server calculates the relative weight of each feature word in the target support vector based on the parameters in the first classification model and the weighting factors in the target support vector. Because a target feature word with a relatively small relative weight does not describe the features of the text data represented by the target support vector well, the weighting factors of the target feature word are deleted from the first training set and the first test set to obtain the second training set and the second test set, respectively, which are used to recalculate the classification model. This avoids the negative influence of the target feature word's weighting factors on the calculation of the classification model and can reduce the error rate when the classification model classifies.
  • With reference to the first aspect, in a first possible implementation of the first aspect, before the server analyzes the second training set by using the SVM algorithm and performs the classification test on the second test set according to the second classification model obtained from the analysis, the method further includes:
  • the server acquires a target feature word, where the target feature word refers to a feature word whose relative weight in the target support vector is less than the first preset threshold; and
  • the server deletes the weighting factor of the target feature word from each support vector in the first training set to obtain the second training set, and deletes the weighting factor of the target feature word from each support vector in the first test set to obtain the second test set.
  • With reference to the first aspect, in a second possible implementation of the first aspect, before the server analyzes the second training set by using the SVM algorithm and performs the classification test on the second test set according to the second classification model obtained from the analysis, the method further includes:
  • the server calculates, according to the parameters in the first classification model, the relative weight of each of the K feature words in the first training set;
  • the server acquires a target feature word, where the target feature word refers to a feature word whose relative weight in the first training set is less than a second preset threshold and whose relative weight in the target support vector is less than the first preset threshold; and
  • the server deletes the factor corresponding to the target feature word from each support vector in the first training set to obtain the second training set, and deletes the factor corresponding to the target feature word from each support vector in the first test set to obtain the second test set.
  • With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the parameters in the first classification model include the Lagrangian coefficient of each support vector in the first training set, and the first training set includes N support vectors; the calculating, by the server according to the parameters in the first classification model, of the relative weight of each of the K feature words in the first training set includes: calculating the relative weight T(i) of the i-th feature word in the first training set by the formula T(i) = θ*(a1*x1_i + a2*x2_i + … + aN*xN_i), taking i over the positive integers from 1 to K, where aN is the Lagrangian coefficient of the N-th of the N support vectors and xN_i is the weighting factor of the i-th feature word in the N-th support vector.
  • With reference to the first aspect, or any one of its first to third possible implementations, in a fourth possible implementation of the first aspect, the first training set includes N support vectors, and the parameters in the first classification model include the Lagrangian coefficients of the respective support vectors in the first training set; the calculating, from the weighting factors of the target support vector and the parameters in the first classification model, of the relative weight of each of the K feature words in the target support vector includes: calculating the relative weight f(i) of the i-th feature word in the target support vector by the formula f(i) = β*(a1*x1_i + a2*x2_i + … + aN*xN_i)*y1_i, taking i over the positive integers from 1 to K, where y1_i is the weighting factor of the i-th feature word in the target support vector.
  • With reference to the first aspect, or any one of its first to third possible implementations, in a fifth possible implementation of the first aspect, before the server analyzes the second training set by using the SVM algorithm and performs the classification test on the second test set according to the second classification model, the method further includes: the server judges whether the classification error rate obtained from the classification test of the first classification model is higher than the target preset threshold; and, if it is higher, the step of analyzing the second training set by the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained from the analysis is performed.
  • Specifically, some features are deleted from the first training set and the first test set only when the calculated error rate of the first classification model is higher than the target preset threshold, rather than every time the first classification model is calculated, which reduces the overhead of the server.
  • In a second aspect, an embodiment of the present invention provides a server, where the server includes a processor and a memory, where:
  • the memory is configured to store instructions and data; and
  • the processor is configured to read the instructions and data stored in the memory and perform the following operations:
  • analyzing the first training set by using the support vector machine SVM algorithm, and performing a classification test on the first test set according to the first classification model obtained from the analysis, where the first training set and the first test set each include multiple support vectors, each support vector includes K weighting factors corresponding to K feature words, each weighting factor corresponds to one feature word, the numerical value of a weighting factor is positively correlated with the number of occurrences of its corresponding feature word in the text data described by the support vector, and K is a positive integer greater than 1;
  • calculating, according to the weighting factors of a target support vector and the parameters in the first classification model, the relative weight of each of the K feature words in the target support vector, where the target support vector is a support vector in the first test set whose classification test result obtained using the first classification model does not match its preset classification; and
  • analyzing the second training set by using the SVM algorithm, and performing a classification test on the second test set according to the second classification model obtained from the analysis, where, when the classification error rate obtained from the classification test of the second classification model is lower than the target preset threshold, the second classification model is used to classify text data; the support vectors in the second training set and the second test set each include the weighting factors corresponding to the K feature words other than the target feature word, the target feature word being a feature word whose relative weight in the target support vector is less than the first preset threshold.
  • By performing the above operations, the server calculates the relative weight of each feature word in the target support vector based on the parameters in the first classification model and the weighting factors in the target support vector. Because a target feature word with a relatively small relative weight does not describe the features of the text data represented by the target support vector well, the weighting factors of the target feature word are deleted from the first training set and the first test set to obtain the second training set and the second test set, respectively, which are used to recalculate the classification model. This avoids the negative influence of the target feature word's weighting factors on the calculation of the classification model and can reduce the error rate when the classification model classifies.
  • With reference to the second aspect, in a first possible implementation of the second aspect, before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained from the analysis, the processor is further configured to: acquire a target feature word, where the target feature word refers to a feature word whose relative weight in the target support vector is less than the first preset threshold; and delete the weighting factor of the target feature word from each support vector in the first training set to obtain the second training set, and delete the weighting factor of the target feature word from each support vector in the first test set to obtain the second test set.
  • With reference to the second aspect, in a second possible implementation of the second aspect, before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained from the analysis, the processor is further configured to: calculate, according to the parameters in the first classification model, the relative weight of each of the K feature words in the first training set; acquire a target feature word, where the target feature word refers to a feature word whose relative weight in the first training set is less than the second preset threshold and whose relative weight in the target support vector is less than the first preset threshold; and delete the factor corresponding to the target feature word from each support vector in the first training set to obtain the second training set, and delete the factor corresponding to the target feature word from each support vector in the first test set to obtain the second test set.
  • With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the first training set includes N support vectors, and the parameters in the first classification model include the Lagrangian coefficients of the respective support vectors in the first training set; the processor calculating, according to the parameters in the first classification model, the relative weight of each of the K feature words in the first training set is specifically:
  • calculating the relative weight T(i) of the i-th feature word in the first training set by the formula T(i) = θ*(a1*x1_i + a2*x2_i + … + aN*xN_i), taking i over the positive integers from 1 to K to obtain the relative weight of each feature word in the first training set, where aN is the Lagrangian coefficient of the N-th of the N support vectors and xN_i is the weighting factor of the i-th feature word in the N-th support vector.
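A minimal sketch of this formula, assuming toy values for the Lagrangian coefficients a1..aN, the weighting factors, and the coefficient θ (the patent does not fix θ):

```python
import numpy as np

theta = 1.0                             # assumed value of the coefficient θ
a = np.array([0.7, 0.0, 0.3])           # Lagrangian coefficients a1..aN (N = 3)
X = np.array([[3, 0, 1],                # support vectors X1..XN, K = 3 feature words
              [2, 1, 0],
              [4, 0, 2]])

# T[i-1] = θ*(a1*x1_i + a2*x2_i + ... + aN*xN_i) for i = 1..K
T = theta * (a @ X)
print(T)                                # one relative weight per feature word
```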
  • With reference to the second aspect, or any one of its first to third possible implementations, in a fourth possible implementation of the second aspect, the first training set includes N support vectors, and the parameters in the first classification model include the Lagrangian coefficients of the respective support vectors in the first training set; the processor calculating, according to the weighting factors of the target support vector and the parameters in the first classification model, the relative weight of each of the K feature words in the target support vector is specifically:
  • calculating the relative weight f(i) of the i-th feature word in the target support vector by the formula f(i) = β*(a1*x1_i + a2*x2_i + … + aN*xN_i)*y1_i, taking i over the positive integers from 1 to K, where y1_i is the weighting factor of the i-th feature word in the target support vector.
  • With reference to the second aspect, or any one of its first to third possible implementations, in a fifth possible implementation of the second aspect, before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained from the analysis, the processor is further configured to: judge whether the classification error rate obtained from the classification test of the first classification model is higher than the target preset threshold; and, if it is higher, perform the operation of analyzing the second training set by the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained from the analysis.
  • In a third aspect, an embodiment of the present invention provides a server, where the server includes functional units for performing some or all of the steps of any implementation of the first aspect of the embodiments of the present invention.
  • In a fourth aspect, the present invention provides a computer-readable storage medium storing one or more computer programs; by running the one or more computer programs, a server performs the text data classification method of the first aspect.
  • In summary, the server calculates the relative weight of each feature word in the target support vector based on the parameters in the first classification model and the weighting factors in the target support vector. Because a target feature word with a relatively small relative weight cannot describe the features of the text data represented by the target support vector well, the weighting factors of the target feature word are deleted from the first training set and the first test set to obtain the second training set and the second test set, respectively, which are used to recalculate the classification model; this avoids the negative influence of the target feature word's weighting factors on the calculation of the classification model and can reduce the error rate when the classification model classifies.
  • FIG. 1 is a schematic flow chart of data classification based on an SVM algorithm in the prior art
  • FIG. 2 is a schematic diagram of a scenario of web page classification according to an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of obtaining a feature vector according to an embodiment of the present invention.
  • FIG. 4A is a schematic flowchart of a text data classification method according to an embodiment of the present invention.
  • FIG. 4B is a schematic diagram of a scenario of web page data classification according to an embodiment of the present invention.
  • FIG. 4C is a schematic diagram of another scenario of webpage data classification according to an embodiment of the present invention.
  • FIG. 4D is a schematic diagram of another scenario of webpage data classification according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a server according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of still another server according to an embodiment of the present invention.
  • The embodiments of the present invention can be applied to various text data classification scenarios. Regardless of the scenario, the text data to be classified first needs to be quantized into feature vectors according to its features; the feature vectors, either directly or after being purified, are then used as the sample set of the SVM, and a classification model is established based on the sample set.
  • FIG. 2 is a schematic diagram of a scenario of webpage classification according to an embodiment of the present invention. The scenario of webpage classification is an optional application scenario of the embodiments of the present invention, and the scenario includes the following steps:
  • Step 1: The classification server obtains a large number of HyperText Markup Language (English: HyperText Markup Language, HTML for short) pages through a crawler server.
  • Step 2: The classification server performs parsing, word segmentation, feature extraction, feature weight calculation, and the like on the text content of the large number of HTML pages. For example, it parses the text content in the Title field, Keyword field, Description field, and anchor-text field of each HTML page and splits it into a number of words by a word segmentation algorithm to form a word set. Some words in the word set are feature words describing the characteristics of the web page, while others are connective words linking different words, so the feature words need to be extracted to form a feature-word set, from which the feature words making up the feature set are then selected. The importance of each feature may differ, so the weight of each feature needs to be calculated, for example according to the term frequency-inverse document frequency (English: term frequency-inverse document frequency, TF-IDF for short) algorithm.
  • The TF-IDF algorithm measures the weight of each feature by the number of occurrences of that feature. After the feature weights are calculated, the weights of the feature words are represented as a vector, forming the feature vector of these features. After a large number of HTML pages are processed in this way, the feature vector of each HTML page is obtained; the flow of obtaining the feature vectors is as shown in FIG. 3.
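A sketch of the TF-IDF weighting step, using the standard TF-IDF definition (the patent names the algorithm but not an exact formula); the segmented pages and feature set are toy data:

```python
import math

docs = [["wine", "grape", "wine"], ["diet", "grape"]]  # segmented pages (toy data)

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                    # term frequency in this page
    df = sum(1 for d in corpus if term in d)           # pages containing the term
    return tf * math.log(len(corpus) / df) if df else 0.0

# Feature vector of the first page over an assumed feature set.
features = ["wine", "grape", "diet"]
vector = [tf_idf(t, docs[0], docs) for t in features]
print(vector)
```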
  • Step 3: The classification server purifies the large number of feature vectors obtained, eliminating feature vectors that contribute little to subsequent classification. Optionally, the K-means algorithm is used to purify the large number of feature vectors, and the purified feature vectors can be used as the sample set input into the SVM.
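One way the purification step could be sketched: cluster the feature vectors with K-means and drop the vectors farthest from their cluster centre. The patent names K-means but not the elimination rule, so the distance cut-off below is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

vectors = np.random.rand(100, 8)                  # toy feature vectors
km = KMeans(n_clusters=2, n_init=10).fit(vectors)

# Distance of each vector to its own cluster centre; far vectors are assumed
# to have "little effect on subsequent classification" and are eliminated.
dist = np.linalg.norm(vectors - km.cluster_centers_[km.labels_], axis=1)
sample_set = vectors[dist < np.percentile(dist, 90)]  # keep the closest 90%
```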
  • Step 4: The classification server trains and tests on the sample set through the SVM to obtain a classification model.
  • Step 5: The classification server classifies subsequently acquired HTML pages by using the classification model, and sends the classification result of each HTML page, together with the HTML page's uniform resource locator (English: Uniform Resource Locator, URL for short), to a URL library.
  • Step 6: When a terminal user accesses the Internet through a browser or a web proxy server, the gateway device receives the packets sent to the external network, first identifies the HTTP Get packet, and parses the HTTP Get packet to obtain the host (HOST) and URL fields; it then queries the URL library for the classification associated with the URL and performs the operation policy corresponding to that classification, for example blocking, redirecting, or pushing an alarm page.
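Step 6 might be sketched as follows; the URL library contents, the policy table, and the function name are hypothetical:

```python
# URL -> category mapping filled by step 5, and per-category operation policies.
url_library = {"http://example.com/wine": "alcohol"}
policies = {"alcohol": "block", "diet": "allow"}

def handle_http_get(host: str, url: str) -> str:
    """Look up the category of the parsed URL and return the matching policy
    (e.g. block, redirect, push an alarm page)."""
    category = url_library.get(url)
    return policies.get(category, "allow")

print(handle_http_get("example.com", "http://example.com/wine"))  # -> block
```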
  • FIG. 4A is a schematic flowchart of a text data classification method according to an embodiment of the present invention; a sample set obtained in the webpage classification scenario, or in other scenarios, may be classified by this process. The process includes the following steps:
  • Step S401: The server analyzes the first training set by using a support vector machine SVM algorithm, and performs a classification test on the first test set according to the first classification model obtained from the analysis.
  • The server here refers to the classification server. It first selects one part of the support vectors from the input sample set as a training set and another part as a test set. The training set selected in this step may be referred to as the first training set, and the selected test set may be referred to as the first test set.
  • Assume that there are N support vectors in the first training set and M support vectors in the first test set, and that each of the N support vectors and each of the M support vectors consists of the weighting factors corresponding to K feature words, where M, N, and K are positive integers greater than 1. The N support vectors in the first training set are as shown in Table 1, where each support vector contains one weighting factor for each of feature word 1 through feature word K:

  Table 1

  | Support vector | Feature word 1 | Feature word 2 | Feature word 3 | … | Feature word K |
  | --- | --- | --- | --- | --- | --- |
  | X1 | x1_1 | x1_2 | x1_3 | … | x1_K |
  | X2 | x2_1 | x2_2 | x2_3 | … | x2_K |
  | … | … | … | … | … | … |
  | XN | xN_1 | xN_2 | xN_3 | … | xN_K |

  • Optionally, X1, X2, …, XN are the feature vectors quantized from the first, second, …, and N-th web pages, respectively, and a weighting factor specifically refers to the number of occurrences of a feature word in the text data: the weighting factor x1_1 is the number of occurrences of feature word 1 in the first web page, x1_2 is the number of occurrences of feature word 2 in the first web page, x1_3 is the number of occurrences of feature word 3 in the first web page, …, and x1_K is the number of occurrences of feature word K in the first web page; further, x2_1 is the number of occurrences of feature word 1 in the second web page, x2_2 is the number of occurrences of feature word 2 in the second web page, and the remaining entries of Table 1 can be deduced by analogy.
  • The above exemplifies the weighting factors of the K feature words in the support vectors of the first training set; each support vector of the first test set contains the weighting factors of the K feature words in the same way as the support vectors of the first training set, and details are not described here again.
  • The support vectors in the first training set and the first test set are all pre-classified; the classification may be done by manual marking, or by some device through a clustering algorithm. For example, a large number of previously obtained "alcohol" and "diet" web pages are manually classified: the support vectors of the "alcohol" web pages are marked as 1 and the support vectors of the "diet" web pages are marked as -1, so that the "alcohol" and "diet" pages are categorized.
  • The server performs iterative calculation on the first training set by the SVM algorithm as in the prior art; the iterative calculation is a process of summarizing the commonality of support vectors of the same class and the differences between support vectors of different classes. The iterative calculation yields a first classification model file, and the first classification model can reflect the commonality of same-class support vectors and the differences between heterogeneous support vectors.
  • Optionally, the first classification model includes vector coefficients for characterizing the weights of the respective support vectors in the first training set. In an optional solution, the vector coefficients of the respective support vectors in the first training set may specifically be the Lagrangian coefficients of the respective support vectors. Assuming that the Lagrangian coefficients of the support vectors X1, X2, X3, …, XN are a1, a2, a3, …, aN, then a1 represents the weight of the support vector X1 among all the support vectors of the first training set, a2 represents the weight of the support vector X2 among all the support vectors of the first training set, and the rest can be deduced by analogy.
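If the model were trained with scikit-learn, the per-support-vector Lagrangian coefficients a1, …, aN could be read from the fitted model, since dual_coef_ stores a_n * y_n for each support vector; the toy page vectors below are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]])  # toy page vectors
y = np.array([1, 1, -1, -1])                  # e.g. "alcohol" vs "diet" pages

model = SVC(kernel="linear").fit(X, y)
alphas = np.abs(model.dual_coef_).ravel()     # Lagrangian coefficients of the SVs
print(model.support_, alphas)                 # which vectors are X1..XN, and a1..aN
```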
  • The server tests the support vectors in the first test set based on the obtained first classification model file. The specific process includes classifying the support vectors in the first test set by using the first classification model to obtain the classification result of each support vector in the first test set, comparing these classification results with the pre-classified results of the respective support vectors, and finding the support vectors whose classification results produced by the first classification model differ from their pre-classified results. For convenience of subsequent description, such an inconsistent support vector may be referred to as a target support vector.
  • Step S402: The server calculates the relative weight of each of the K feature words in the target support vector according to the weighting factors of the target support vector and the parameters in the first classification model.
  • The embodiment of the present invention considers not only the weight of feature word 1, feature word 2, feature word 3, …, feature word K in the target support vector, but also the weight of feature word 1, feature word 2, feature word 3, …, feature word K in the first training set.
  • Optionally, the weight of a feature word in the target support vector is positively correlated with the number of occurrences of that feature word in the text data described by the target support vector, and the weighting factor indicates this weight. The relative weight of a feature word in the first training set specifically refers to the weight obtained by averaging the weights of that feature word in the respective support vectors of the first training set. For example, if the first training set includes support vectors X1, X2, X3, and X4, the weight of feature word 1 in X1, the weight of feature word 1 in X2, the weight of feature word 1 in X3, and the weight of feature word 1 in X4 are added, and the sum is divided by 4 to obtain the relative weight of feature word 1 in the first training set.
  • The weight calculated by combining the weight of a feature word in the target support vector with the relative weight of that feature word in the first training set is the relative weight of the feature word in the target support vector.
  • For example, suppose the target support vector is Y1 (y1_1, y1_2, y1_3, …, y1_K), where the weighting factors y1_1, y1_2, y1_3, …, y1_K in turn characterize the weights of feature word 1, feature word 2, feature word 3, …, feature word K in the target support vector Y1.
  • The relative weight can be calculated by Equation 1-1, which is as follows:

  f(i) = β*(a1*x1_i + a2*x2_i + … + aN*xN_i)*y1_i    (1-1)

  where f(i) is the relative weight of the i-th feature word in the target support vector, the relative weight of each feature word in the target support vector is calculated by taking i over the positive integers from 1 to K, and y1_i is the weighting factor of feature word i in the target support vector.
  • The term (a1*x1_i + a2*x2_i + … + aN*xN_i) in the formula is equivalent to weighting feature word i across the respective support vectors of the first training set and can reflect the relative weight of feature word i in the first training set; therefore β*(a1*x1_i + a2*x2_i + … + aN*xN_i)*y1_i can characterize the relative weight, described in the embodiment of the invention, of the i-th feature word in the target support vector.
  • The coefficient β is obtained from Equations 1-2 and 1-3: writing sum(i) = a1*x1_i + a2*x2_i + … + aN*xN_i and taking i over the positive integers from 1 to K gives sum(1), sum(2), …, sum(K); the maximum of these values is the MAX-sum in Equation 1-2, and the minimum is the MIN-sum in Equation 1-2. Equation 1-4 can then be calculated from Equations 1-2 and 1-3. [The bodies of Equations 1-2 through 1-4 appeared as images in the original and are not reproduced here.]
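A minimal sketch of Equation 1-1, taking β as a min-max normalizer over sum(1), …, sum(K); because Equations 1-2 through 1-4 appeared only as images, this exact form of β is an assumption, and all values are toy data.

```python
import numpy as np

a = np.array([0.7, 0.0, 0.3])                    # Lagrangian coefficients of X1..XN
X = np.array([[3, 0, 1], [2, 1, 0], [4, 0, 2]])  # first training set, K = 3
y1 = np.array([1, 2, 0])                         # target support vector Y1

s = a @ X                           # sum(i) = a1*x1_i + ... + aN*xN_i for i = 1..K
beta = 1.0 / (s.max() - s.min())    # assumed normalizer from MAX-sum and MIN-sum
f = beta * s * y1                   # Equation 1-1: relative weight of each word in Y1
print(f)
```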
  • After the server calculates the relative weight of each feature word in the target support vector according to the weighting factors in the target support vector and the parameters in the first classification model, the weighting factor of any feature word whose calculated relative weight is smaller than the first preset threshold is deleted from the first training set and the first test set. The first preset threshold may be a preset fixed value or a function; for example, the threshold may be the fifth-largest of the calculated relative weights when they are ranked from largest to smallest.
  • Assuming that the relative weight of feature word 1 is smaller than the first preset threshold, x1_1 in the support vector X1, x2_1 in the support vector X2, …, and xN_1 in the support vector XN are deleted, and the new support vectors obtained are X1 (x1_2, x1_3, …, x1_K), X2 (x2_2, x2_3, …, x2_K), …, XN (xN_2, xN_3, …, xN_K); for convenience of subsequent description, the set of the new support vectors X1, X2, …, XN may be referred to as the second training set.
  • The weighting factor describing feature word 1 in each support vector of the first test set is likewise deleted, and the set of support vectors after this deletion is the second test set.
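Deleting the target feature word's weighting factor from every support vector amounts to dropping one column from the training and test sets, for example:

```python
import numpy as np

first_training_set = np.array([[3, 0, 1], [2, 1, 0], [4, 0, 2]])  # toy data
first_test_set = np.array([[1, 2, 0], [0, 1, 3]])

target_column = 0  # column of feature word 1, assumed to be the target feature word
second_training_set = np.delete(first_training_set, target_column, axis=1)
second_test_set = np.delete(first_test_set, target_column, axis=1)
```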
  • Optionally, the relative weight of each feature word in the first training set may be calculated by using Equation 1-6, which is as follows:

  T(i) = θ*(a1*x1_i + a2*x2_i + … + aN*xN_i)    (1-6)

  where i can take any positive integer between 1 and K to calculate the relative weight of the corresponding feature word in the first training set; for example, i = 1 calculates the relative weight of feature word 1 in the first training set, i = 2 calculates the relative weight of feature word 2 in the first training set, and the rest can be deduced by analogy.
  • Optionally, a second preset threshold is configured, which may be a preset fixed value or a function. When the relative weight of a certain feature word in the target support vector is less than the first preset threshold, the relative weight of that feature word in the first training set is calculated by Equation 1-6 and compared with the second preset threshold; if the relative weight of the feature word in the first training set is also less than the second preset threshold, the weighting factor of the feature word is deleted from the first training set to obtain the second training set, and the factor of the feature word is deleted from the first test set to obtain the second test set.
  • Optionally, i is taken over the positive integers between 1 and K in Equation 1-6 in turn, and the relative weights of the respective feature words in the first training set are calculated and then sorted. If the relative weight of a certain feature word in the target support vector is less than the first preset threshold, and the position of that feature word's relative weight in the first training set within the sorted sequence falls into a preset sequence-number interval (for example, below the fifth position), the weighting factor of the feature word is deleted from the first training set to obtain the second training set, and from the first test set to obtain the second test set.
  • Optionally, when the server tests the support vectors in the first test set by using the calculated first classification model, step S402 is performed only if the error rate obtained by comparing the results classified by the first classification model with the pre-classified results is higher than the target preset threshold; for example, the target preset threshold may be set to 5%.
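The gate can be expressed as a small check; the 5% threshold follows the example in the text, and the sample labels are invented.

```python
TARGET_PRESET_THRESHOLD = 0.05  # e.g. 5%, per the example above

def needs_feature_pruning(model_results, preset_results):
    """Step S402 runs only when the compared error rate exceeds the threshold."""
    errors = sum(1 for m, p in zip(model_results, preset_results) if m != p)
    return errors / len(preset_results) > TARGET_PRESET_THRESHOLD

print(needs_feature_pruning([1, -1, 1, 1], [1, -1, -1, 1]))  # 25% > 5% -> True
```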
  • Step S403: The server analyzes the second training set by using the SVM algorithm, and performs a classification test on the second test set according to the second classification model obtained from the analysis.
  • Specifically, the support vectors of the second training set are analyzed by the SVM algorithm again to obtain a new classification model; the new classification model may be referred to as the second classification model, and the support vectors in the second test set are then tested based on the second classification model. If the error rate of this test is still higher than the target preset threshold, weighting factors are again removed from the second training set and the second test set according to the principle of step S402, until the error rate is not higher than the target preset threshold.
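Putting steps S401 through S403 together, the retrain-and-prune loop might be sketched as below. scikit-learn is an assumed implementation, and the pruning rule here drops the feature word with the smallest aggregate relative weight T(i), a simplification of the per-target-vector test described above.

```python
import numpy as np
from sklearn.svm import SVC

def train_and_prune(train_x, train_y, test_x, test_y, threshold=0.05):
    cols = list(range(train_x.shape[1]))     # surviving feature-word columns
    while True:
        model = SVC(kernel="linear").fit(train_x[:, cols], train_y)
        error_rate = 1.0 - model.score(test_x[:, cols], test_y)
        if error_rate <= threshold or len(cols) == 1:
            return model, cols, error_rate   # second (third, ...) classification model
        alphas = np.abs(model.dual_coef_).ravel()           # Lagrangian coefficients
        T = alphas @ train_x[np.ix_(model.support_, cols)]  # T(i) per surviving word
        cols.pop(int(np.argmin(T)))          # delete the weakest feature word
```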
  • For example, as shown in FIG. 4B, 360 vectors describing dietary webpage data and 1903 vectors describing alcohol webpage data are obtained and preprocessed, and the set of the preprocessed vectors is the sample set. Each vector in the sample set corresponds to a category identifier: the category identifier 411 equal to 1 identifies dietary webpage data, and the category identifier 412 equal to -1 identifies alcohol webpage data. Each preprocessed vector also has a number of feature numbers 413, each feature number 413 corresponding to a weighting factor 414; in FIG. 4B, each feature number 413 is separated from its corresponding weighting factor by a colon, and different features are separated by spaces or aligners. One part of the vectors in the sample set is taken as the training set, and another part as the test set.
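The sample-set lines of FIG. 4B follow a "label feature_number:weighting_factor" layout, which can be parsed as below; the two lines are invented examples.

```python
lines = ["1 3:0.52 17:0.11 42:0.30",   # 1 = diet webpage data
         "-1 5:0.44 17:0.08"]          # -1 = alcohol webpage data

def parse(line: str):
    label, *pairs = line.split()
    features = {int(n): float(w) for n, w in (p.split(":") for p in pairs)}
    return int(label), features

samples = [parse(line) for line in lines]
print(samples[0])  # (1, {3: 0.52, 17: 0.11, 42: 0.3})
```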
  • The training set is substituted into the SVM to generate a classification model file, and the classification model file includes the Lagrangian coefficient 421 of each vector. The relative weight of each feature word in the training set is calculated, and the relative weights of the feature words are sorted; FIG. 4C shows the ranking of some dietary feature words and their relative weights in the training set, and of some alcohol feature words and their relative weights in the training set.
  • The vectors in the test set are substituted into the classification model file for testing, the vectors classified with errors in the test set are obtained, and a vector with errors may be referred to as a target support vector; the relative weight of each feature word in the target support vector is then calculated.
  • If the relative weight of a certain feature word in the target support vector is less than the first preset threshold, and the relative weight of that feature word in the training set is less than the second preset threshold, the feature word is deleted from the training set and the test set.
  • A new classification model is calculated based on the new training set formed after the feature word is deleted, and the vectors of the new test set formed after the deletion are substituted into the new classification model for testing, until the classification error rate of the resulting classification model file is lower than the target preset threshold.
  • For example, the relative weights of the feature word "fragrance" are relatively large in both the diet training set and the alcohol training set, so the word "fragrance" cannot be used to reflect the difference between diet and alcohol; the factors corresponding to the feature word "fragrance" can therefore be deleted from the test set and the training set.
  • The server calculates the relative weight of each feature word in the target support vector based on the parameters in the first classification model and the weighting factors in the target support vector. Since target feature words with relatively small relative weights cannot describe the characteristics of the text data represented by the target support vector well, the weighting factors of the target feature words are deleted from the first training set and the first test set, and the second training set and the second test set are obtained, respectively, for recalculating the classification model; this avoids the negative influence of the target feature words' weighting factors on the calculation of the classification model and can reduce the error rate when the classification model classifies.
  • FIG. 5 shows a server 50 according to an embodiment of the present invention. The server 50 includes a processor 501 and a memory 502, and the processor 501 and the memory 502 are connected to each other through a bus.
  • The memory 502 includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), or a portable read-only memory (CD-ROM), and the memory 502 is used for storing related instructions and data. The memory 502 is further configured to store the first training set and the first test set, as well as the second training set and the second test set obtained by the processor 501.
  • The processor 501 may be one or more central processing units (English: Central Processing Unit, CPU for short). In the case that the processor 501 is a CPU, the CPU may be a single-core CPU or a multi-core CPU.
  • The processor 501 in the server 50 is configured to read the instructions and data stored in the memory 502 and perform the following operations:
  • analyzing the first training set by the support vector machine SVM algorithm, and performing a classification test on the first test set according to the first classification model obtained from the analysis, where the first training set and the first test set each include multiple support vectors, each support vector includes K weighting factors corresponding to K feature words, each weighting factor corresponds to one feature word, the numerical value of a weighting factor is positively correlated with the number of occurrences of its corresponding feature word in the text data described by the support vector, and K is a positive integer greater than 1;
  • calculating, according to the weighting factors of a target support vector and the parameters in the first classification model, the relative weight of each of the K feature words in the target support vector, where the target support vector is a support vector in the first test set whose classification test result obtained using the first classification model does not match its preset classification;
  • analyzing the second training set by the SVM algorithm, and performing a classification test on the second test set according to the second classification model obtained from the analysis, where the support vectors of the second training set and the second test set each include the weighting factors corresponding to the K feature words other than the target feature word, the target feature word being a feature word whose relative weight in the target support vector is less than the first preset threshold; and
  • if the classification error rate obtained from the classification test of the second classification model is not higher than the target preset threshold, confirming that the second classification model is used to classify the text data to be classified.
  • By performing the above operations, the server 50 calculates the relative weight of each feature word in the target support vector based on the parameters in the first classification model and the weighting factors in the target support vector. Because a target feature word with a relatively small relative weight does not describe the features of the text data represented by the target support vector well, the weighting factors of the target feature word are deleted from the first training set and the first test set to obtain the second training set and the second test set, respectively, which are used to recalculate the classification model; this avoids the negative influence of the target feature word's weighting factors on the calculation of the classification model and can reduce the error rate when the classification model classifies.
  • Optionally, before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained from the analysis, the processor 501 is further configured to: acquire a target feature word, where the target feature word refers to a feature word whose relative weight in the target support vector is less than the first preset threshold; and delete the weighting factor of the target feature word from each support vector in the first training set to obtain the second training set, and delete the weighting factor of the target feature word from each support vector in the first test set to obtain the second test set.
  • Optionally, before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained from the analysis, the processor 501 is further configured to: calculate, according to the parameters in the first classification model, the relative weight of each of the K feature words in the first training set; acquire a target feature word, where the target feature word refers to a feature word whose relative weight in the first training set is less than the second preset threshold and whose relative weight in the target support vector is less than the first preset threshold; and delete the factor corresponding to the target feature word from each support vector in the first training set to obtain the second training set, and delete the factor corresponding to the target feature word from each support vector in the first test set to obtain the second test set.
  • Optionally, the first training set includes N support vectors, and the parameters in the first classification model include the Lagrangian coefficients of the respective support vectors in the first training set; the processor 501 calculating, according to the parameters in the first classification model, the relative weight of each of the K feature words in the first training set is specifically: calculating the relative weight T(i) of the i-th feature word in the first training set by the formula T(i) = θ*(a1*x1_i + a2*x2_i + … + aN*xN_i), taking i over the positive integers from 1 to K, where aN is the Lagrangian coefficient of the N-th of the N support vectors and xN_i is the weighting factor of the i-th feature word in the N-th support vector.
  • Optionally, the first training set includes N support vectors, and the parameters in the first classification model include the Lagrangian coefficients of the respective support vectors in the first training set; the processor 501 calculating, according to the weighting factors of the target support vector and the parameters in the first classification model, the relative weight of each of the K feature words in the target support vector is specifically: calculating the relative weight f(i) of the i-th feature word in the target support vector by the formula f(i) = β*(a1*x1_i + a2*x2_i + … + aN*xN_i)*y1_i, taking i over the positive integers from 1 to K, where y1_i is the weighting factor of the i-th feature word in the target support vector.
  • Optionally, before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained from the analysis, the processor 501 is further configured to: judge whether the classification error rate obtained from the classification test of the first classification model is higher than the target preset threshold; and, if it is higher, perform the operation of analyzing the second training set by the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained from the analysis.
  • The specific implementation of the server 50 in the embodiment of the present invention may also refer to the corresponding description of the method embodiment shown in FIG. 4A.
  • In summary, the server 50 calculates the relative weight of each feature word in the target support vector based on the parameters in the first classification model and the weighting factors in the target support vector. Since a target feature word with a relatively small weight does not describe the features of the text data represented by the target support vector well, the weighting factor of the target feature word is deleted from the first training set and the first test set to obtain the second training set and the second test set, respectively, for recalculating the classification model; this avoids the negative impact of the target feature word's weighting factor on calculating the classification model and can reduce the error rate when the classification model classifies.
  • FIG. 6 is a schematic structural diagram of still another server 60 according to an embodiment of the present invention. The server 60 may include an analyzing unit 601 and a calculating unit 602, which are described in detail as follows.
  • The analyzing unit 601 is configured to analyze the first training set by using a support vector machine SVM algorithm, and perform a classification test on the first test set according to the first classification model obtained from the analysis, where the first training set and the first test set each include multiple support vectors, each support vector includes K weighting factors corresponding to K feature words, each weighting factor corresponds to one feature word, the numerical value of a weighting factor is positively correlated with the number of occurrences of its corresponding feature word in the text data described by the support vector, and K is a positive integer greater than 1.
  • The calculating unit 602 is configured to calculate the relative weight of each of the K feature words in the target support vector according to the weighting factors of the target support vector and the parameters in the first classification model, where the target support vector is a support vector in the first test set whose classification test result obtained using the first classification model does not match its preset classification.
  • The analyzing unit 601 is further configured to analyze the second training set by using the SVM algorithm, and perform a classification test on the second test set according to the second classification model obtained from the analysis, where the support vectors of the second training set and the second test set each include the weighting factors corresponding to the K feature words other than the target feature word, the target feature word being a feature word whose relative weight in the target support vector is less than the first preset threshold.
  • If the classification error rate obtained from the classification test of the second classification model is not higher than the target preset threshold, it is confirmed that the second classification model is used to classify the text data to be classified.
  • The server 60 calculates the relative weight of each feature word in the target support vector based on the parameters in the first classification model and the weighting factors in the target support vector. Because a target feature word with a relatively small relative weight does not describe the features of the text data represented by the target support vector well, the weighting factors of the target feature word are deleted from the first training set and the first test set to obtain the second training set and the second test set, respectively, which are used to recalculate the classification model; this avoids the negative influence of the target feature word's weighting factors on the calculation of the classification model and can reduce the error rate when the classification model classifies.
  • Optionally, the server 60 further includes an obtaining unit and a deleting unit;
  • the obtaining unit is configured to obtain a target feature word before the analyzing unit 601 analyzes the second training set by using the SVM algorithm and performs the classification test on the second test set according to the second classification model obtained from the analysis, where the target feature word refers to a feature word whose relative weight in the target support vector is less than the first preset threshold; and
  • the deleting unit is configured to delete the weighting factor of the target feature word from each support vector in the first training set to obtain the second training set, and delete the weighting factor of the target feature word from each support vector in the first test set to obtain the second test set.
  • Optionally, the server 60 further includes an acquiring unit and a deleting unit;
  • the calculating unit 602 is further configured to calculate, according to the parameters in the first classification model, the relative weight of each of the K feature words in the first training set before the analyzing unit 601 analyzes the second training set by using the SVM algorithm and performs the classification test on the second test set according to the second classification model obtained from the analysis;
  • the acquiring unit is configured to acquire a target feature word, where the target feature word refers to a feature word whose relative weight in the first training set is less than the second preset threshold and whose relative weight in the target support vector is less than the first preset threshold; and
  • the deleting unit is configured to delete the factor corresponding to the target feature word from each support vector in the first training set to obtain the second training set, and delete the factor corresponding to the target feature word from each support vector in the first test set to obtain the second test set.
  • Optionally, the first training set includes N support vectors, and the parameters in the first classification model include the Lagrangian coefficients of the respective support vectors in the first training set; the calculating unit 602 calculating, according to the parameters in the first classification model, the relative weight of each of the K feature words in the first training set is specifically: calculating the relative weight T(i) of the i-th feature word in the first training set by the formula T(i) = θ*(a1*x1_i + a2*x2_i + … + aN*xN_i), taking i over the positive integers from 1 to K, where aN is the Lagrangian coefficient of the N-th of the N support vectors and xN_i is the weighting factor of the i-th feature word in the N-th support vector.
  • Optionally, the first training set includes N support vectors, and the parameters in the first classification model include the Lagrangian coefficients of the respective support vectors in the first training set; the calculating unit 602 calculating, according to the weighting factors of the target support vector and the parameters in the first classification model, the relative weight of each of the K feature words in the target support vector is specifically: calculating the relative weight f(i) of the i-th feature word in the target support vector by the formula f(i) = β*(a1*x1_i + a2*x2_i + … + aN*xN_i)*y1_i, taking i over the positive integers from 1 to K, where y1_i is the weighting factor of the i-th feature word in the target support vector.
  • Optionally, the server 60 further includes a determining unit, where the determining unit is configured to determine, before the analyzing unit 601 analyzes the second training set by the SVM algorithm and performs the classification test on the second test set according to the second classification model obtained from the analysis, whether the classification error rate obtained from the classification test of the first classification model is higher than the target preset threshold; if it is higher, the determining unit triggers the analyzing unit 601 to perform the operation of analyzing the second training set by the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained from the analysis.
  • The specific implementation of the server 60 in the embodiment of the present invention may also refer to the corresponding description of the method embodiment shown in FIG. 4A.
  • The server 60 calculates the relative weight of each feature word in the target support vector based on the parameters in the first classification model and the weighting factors in the target support vector. Because a target feature word with a relatively small weight does not describe the features of the text data represented by the target support vector well, the weighting factors of the target feature word are deleted from the first training set and the first test set to obtain the second training set and the second test set, respectively, which are used to recalculate the classification model; this avoids the negative influence of the target feature word's weighting factors on the calculation of the classification model and can reduce the error rate when the classification model classifies.
  • In summary, the server calculates the relative weight of each feature word in the target support vector based on the parameters in the first classification model and the weighting factors in the target support vector. Because a target feature word with a relatively small relative weight cannot describe the features of the text data represented by the target support vector well, the weighting factors of the target feature word are deleted from the first training set and the first test set to obtain the second training set and the second test set, respectively, which are used to recalculate the classification model; this avoids the negative influence of the target feature word's weighting factors on the calculation of the classification model and can reduce the error rate when the classification model classifies.
  • The foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.


Abstract

A text data classification method and server. The method includes: a server analyzes a first training set by using a support vector machine (SVM) algorithm and performs a classification test on a first test set according to a first classification model obtained from the analysis, where the first training set and the first test set each include multiple support vectors, each support vector includes K weighting factors corresponding to K feature words, and each weighting factor corresponds to one feature word (S401); the relative weight of each of the K feature words in a target support vector is calculated according to the weighting factors of the target support vector and the parameters in the first classification model (S402); the server analyzes a second training set by using the SVM algorithm and performs a classification test on a second test set according to a second classification model obtained from the analysis, where the support vectors in the second training set and the second test set each include the weighting factors corresponding to the K feature words other than a target feature word (S403). By using the method and server, the error rate of classification by the classification model can be reduced.

Description

一种文本数据分类方法及服务器
本申请要求于2016年5月6日提交中国专利局、申请号为201610296812.4、发明名称为“一种文本数据分类方法及服务器”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明涉及计算机技术领域,尤其涉及一种文本数据分类方法及服务器。
背景技术
支持向量机(英文:Support Vector Machine,简称:SVM)是一个有监督的学习模型,通常用来进行模式识别、分类、以及回归分析等。图1是现有技术中基于SVM算法进行数据分类的流程示意图,具体包括:
分类服务器获取已分类的文本数据并通过预设的分词算法提取该已分类的文本数据中的特征词。计算各个特征词的权重并通过向量分别表示每个特征词的权重。将得到的向量中的一部分向量作为训练集,以及将得到的向量中的另一部分向量作为测试集。通过SVM训练系统对该训练集中的向量进行分析以得到模型文件,通过该模型文件对该测试集中的向量分类。参照预先分类的结果判断通过该分类模型分类的结果的错误率是否在预设范围内,若错误率不在预设范围内,则重新获取训练集并基于获取的新训练集计算模型文件,若错误率在预设范围内,则将该模型文件作为对文本数据进行分类的模型。然后,获取未分类数据并通过预设的分词算法提取该未分类数据中的特征词;计算各个特征词的权重并通过向量表示特征词的权重;通过分类的错误率落入预设范围的模型文件对该向量分类并输出分类结果。
The defect of the prior art is that, when the error rate of the classification results of the classification model exceeds the preset range, the newly acquired training set is a matter of chance and does not necessarily reduce the classification error rate of the model file.

Summary of the Invention

Embodiments of the present invention disclose a text data classification method and a server, which can reduce the error rate of classification performed by a classification model.
According to a first aspect, an embodiment of the present invention provides a text data classification method, where the method includes:

analyzing, by a server, a first training set by using a support vector machine (SVM) algorithm, and performing a classification test on a first test set according to a first classification model obtained through the analysis, where the first training set and the first test set each include multiple support vectors, each support vector includes K weight factors corresponding to K feature words, each weight factor corresponds to one feature word, the value of a weight factor is positively correlated with the number of times the feature word corresponding to the weight factor appears in the text data described by the support vector, and K is a positive integer greater than 1;

calculating, by the server according to the weight factors of a target support vector and parameters in the first classification model, a relative weight of each of the K feature words in the target support vector, where the target support vector is a support vector in the first test set whose classification test result obtained by using the first classification model does not match a preset classification; and

analyzing, by the server, a second training set by using the SVM algorithm, and performing a classification test on a second test set according to a second classification model obtained through the analysis, where the support vectors in the second training set and the second test set each include the weight factors corresponding to the feature words among the K feature words other than a target feature word, and the target feature word is a feature word whose relative weight in the target support vector is less than a first preset threshold; and if a classification error rate obtained through the classification test performed by using the second classification model is not higher than a target preset threshold, confirming that the second classification model is to be used to classify to-be-classified text data. Optionally, the weight of a feature word in the target support vector is positively correlated with the number of times the feature word appears in the text data corresponding to the target support vector, and this weight may be represented by a weight factor. The weight of the feature word in the first training set specifically refers to the weight obtained by weighting the weights of the feature word in the individual support vectors of the first training set. For example, if the first training set includes support vectors X1, X2, X3, and X4, the weights of feature word 1 in X1, X2, X3, and X4 are added together, and the sum divided by 4 is the relative weight of feature word 1 in the first training set.

By performing the foregoing steps, the server calculates, based on the parameters in the first classification model and the weight factors in the target support vector, the relative weight of each feature word in the target support vector. Because a target feature word with a relatively small relative weight does not describe well the features of the text data represented by the target support vector, the weight factor of the target feature word is deleted from the first training set and the first test set to obtain a second training set and a second test set, respectively, which are used to recalculate the classification model. This avoids the negative influence of the weight factor of the target feature word when the classification model is calculated, and can reduce the error rate of classification performed by the classification model.
With reference to the first aspect, in a first possible implementation of the first aspect, before the server analyzes the second training set by using the SVM algorithm and performs the classification test on the second test set according to the second classification model obtained through the analysis, the method further includes:

acquiring, by the server, the target feature word, where the target feature word is a feature word whose relative weight in the target support vector is less than the first preset threshold; and

deleting, by the server, the weight factor of the target feature word from each support vector in the first training set to obtain the second training set, and deleting the weight factor of the target feature word from each support vector in the first test set to obtain the second test set.

With reference to the first aspect, in a second possible implementation of the first aspect, before the server analyzes the second training set by using the SVM algorithm and performs the classification test on the second test set according to the second classification model obtained through the analysis, the method further includes:

calculating, by the server according to the parameters in the first classification model, a relative weight of each of the K feature words in the first training set;

acquiring, by the server, the target feature word, where the target feature word is a feature word whose relative weight in the first training set is less than a second preset threshold and whose relative weight in the target support vector is less than the first preset threshold; and

deleting, by the server, the factor corresponding to the target feature word from each support vector in the first training set to obtain the second training set, and deleting the factor corresponding to the target feature word from each support vector in the first test set to obtain the second test set.

With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the parameters in the first classification model include the Lagrange coefficients of the individual support vectors in the first training set, and the first training set includes N support vectors; and the calculating, by the server according to the parameters in the first classification model, a relative weight of each of the K feature words in the first training set includes:

calculating, by the server, the relative weight T(i) of the i-th feature word in the first training set by using the formula T(i)=θ*(a1*x1i+a2*x2i+…+aN*xNi), where the relative weight of each feature word in the first training set is calculated by letting i take the positive integers from 1 to K, aN is the Lagrange coefficient of the N-th support vector among the N support vectors, and xNi is the weight factor of the i-th feature word in the N-th support vector.
With reference to the first aspect, or the first possible implementation of the first aspect, or the second possible implementation of the first aspect, or the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the first training set includes N support vectors, and the parameters in the first classification model include the Lagrange coefficients of the individual support vectors in the first training set; and the calculating, by the server according to the weight factors of the target support vector and the parameters in the first classification model, a relative weight of each of the K feature words in the target support vector includes:

calculating, by the server, the relative weight f(i) of the i-th feature word in the target support vector by using the formula f(i)=β*(a1*x1i+a2*x2i+…+aN*xNi)*y1i, where the relative weight of each feature word in the target support vector is calculated by letting i take the positive integers from 1 to K, aN is the Lagrange coefficient of the N-th support vector, xNi is the weight factor of the i-th feature word in the N-th support vector, and y1i is the weight factor of the i-th feature word in the target support vector.

With reference to the first aspect, or the first possible implementation of the first aspect, or the second possible implementation of the first aspect, or the third possible implementation of the first aspect, in a fifth possible implementation of the first aspect, before the server analyzes the second training set by using the SVM algorithm and performs the classification test on the second test set according to the second classification model obtained through the analysis, the method further includes:

determining, by the server, whether the classification error rate obtained through the classification test performed by using the first classification model is higher than the target preset threshold; and

if the classification error rate is higher than the target preset threshold, performing the step of analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis.

Specifically, some features are deleted from the first training set and the first test set only when the calculated error rate of classification by the first classification model is higher than the target preset threshold, rather than every time a first classification model is calculated, which reduces the overhead of the server.
According to a second aspect, an embodiment of the present invention provides a server, where the server includes a processor and a memory, where:

the memory is configured to store instructions and data; and

the processor is configured to read the instructions and the data stored in the memory, and perform the following operations:

analyzing a first training set by using a support vector machine (SVM) algorithm, and performing a classification test on a first test set according to a first classification model obtained through the analysis, where the first training set and the first test set each include multiple support vectors, each support vector includes K weight factors corresponding to K feature words, each weight factor corresponds to one feature word, the value of a weight factor is positively correlated with the number of times the feature word corresponding to the weight factor appears in the text data described by the support vector, and K is a positive integer greater than 1;

calculating, according to the weight factors of a target support vector and parameters in the first classification model, a relative weight of each of the K feature words in the target support vector, where the target support vector is a support vector in the first test set whose classification test result obtained by using the first classification model does not match a preset classification; and

analyzing a second training set by using the SVM algorithm, and performing a classification test on a second test set according to a second classification model obtained through the analysis, where when the classification error rate obtained through the classification test performed by using the second classification model is lower than a target preset threshold, the second classification model is used to classify text data; the support vectors in the second training set and the second test set each include the weight factors corresponding to the feature words among the K feature words other than a target feature word; and the target feature word is a feature word whose relative weight in the target support vector is less than a first preset threshold.

By performing the foregoing operations, the server calculates, based on the parameters in the first classification model and the weight factors in the target support vector, the relative weight of each feature word in the target support vector. Because a target feature word with a relatively small relative weight does not describe well the features of the text data represented by the target support vector, the weight factor of the target feature word is deleted from the first training set and the first test set to obtain a second training set and a second test set, respectively, which are used to recalculate the classification model. This avoids the negative influence of the weight factor of the target feature word when the classification model is calculated, and can reduce the error rate of classification performed by the classification model.
With reference to the second aspect, in a first possible implementation of the second aspect, before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis, the processor is further configured to:

acquire the target feature word, where the target feature word is a feature word whose relative weight in the target support vector is less than the first preset threshold; and

delete the weight factor of the target feature word from each support vector in the first training set to obtain the second training set, and delete the weight factor of the target feature word from each support vector in the first test set to obtain the second test set.

With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis, the processor is further configured to:

calculate, according to the parameters in the first classification model, a relative weight of each of the K feature words in the first training set;

acquire the target feature word, where the target feature word is a feature word whose relative weight in the first training set is less than a second preset threshold and whose relative weight in the target support vector is less than the first preset threshold; and

delete the factor corresponding to the target feature word from each support vector in the first training set to obtain the second training set, and delete the factor corresponding to the target feature word from each support vector in the first test set to obtain the second test set.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the first training set includes N support vectors, and the parameters in the first classification model include the Lagrange coefficients of the individual support vectors in the first training set; and the calculating, by the processor according to the parameters in the first classification model, a relative weight of each of the K feature words in the first training set is specifically:

calculating the relative weight T(i) of the i-th feature word in the first training set by using the formula T(i)=θ*(a1*x1i+a2*x2i+…+aN*xNi), where the relative weight of each feature word in the first training set is calculated by letting i take the positive integers from 1 to K, aN is the Lagrange coefficient of the N-th support vector among the N support vectors, and xNi is the weight factor of the i-th feature word in the N-th support vector.

With reference to the second aspect, or the first possible implementation of the second aspect, or the second possible implementation of the second aspect, or the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the first training set includes N support vectors, and the parameters in the first classification model include the Lagrange coefficients of the individual support vectors in the first training set; and the calculating, by the processor according to the weight factors of the target support vector and the parameters in the first classification model, a relative weight of each of the K feature words in the target support vector is specifically:

calculating the relative weight f(i) of the i-th feature word in the target support vector by using the formula f(i)=β*(a1*x1i+a2*x2i+…+aN*xNi)*y1i, where the relative weight of each feature word in the target support vector is calculated by letting i take the positive integers from 1 to K, aN is the Lagrange coefficient of the N-th support vector, xNi is the weight factor of the i-th feature word in the N-th support vector, and y1i is the weight factor of the i-th feature word in the target support vector.

With reference to the second aspect, or the first possible implementation of the second aspect, or the second possible implementation of the second aspect, or the third possible implementation of the second aspect, or the fourth possible implementation of the second aspect, in a fifth possible implementation of the second aspect, before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis, the processor is further configured to:

determine whether the classification error rate obtained through the classification test performed by using the first classification model is higher than the target preset threshold; and

if the classification error rate is higher than the target preset threshold, perform the operation of analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis.

Specifically, some features are deleted from the first training set and the first test set only when the calculated error rate of classification by the first classification model is higher than the target preset threshold, rather than every time a first classification model is calculated, which reduces the overhead of the server.
According to a third aspect, an embodiment of the present invention provides a server, where the server includes functional units configured to perform some or all of the steps of any implementation of the first aspect of the embodiments of the present invention.

According to a fourth aspect, the present invention provides a computer-readable storage medium, where the computer-readable storage medium stores one or more computer programs, and a server performs the data classification method of the first aspect by running the one or more computer programs.

By implementing the embodiments of the present invention, the server calculates, based on the parameters in the first classification model and the weight factors in the target support vector, the relative weight of each feature word in the target support vector. Because a target feature word with a relatively small relative weight does not describe well the features of the text data represented by the target support vector, the weight factor of the target feature word is deleted from the first training set and the first test set to obtain a second training set and a second test set, respectively, which are used to recalculate the classification model. This avoids the negative influence of the weight factor of the target feature word when the classification model is calculated, and can reduce the error rate of classification performed by the classification model.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art.

FIG. 1 is a schematic flowchart of data classification based on an SVM algorithm in the prior art;

FIG. 2 is a schematic diagram of a web page classification scenario according to an embodiment of the present invention;

FIG. 3 is a schematic flowchart of obtaining feature vectors according to an embodiment of the present invention;

FIG. 4A is a schematic flowchart of a text data classification method according to an embodiment of the present invention;

FIG. 4B is a schematic diagram of a web page data classification scenario according to an embodiment of the present invention;

FIG. 4C is a schematic diagram of another web page data classification scenario according to an embodiment of the present invention;

FIG. 4D is a schematic diagram of yet another web page data classification scenario according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a server according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of another server according to an embodiment of the present invention.
Detailed Description

The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings. The embodiments of the present invention can be applied to various text data classification scenarios. In any scenario, the text data to be classified first needs to be quantized into feature vectors according to its features; the feature vectors, either directly or after purification, are then used as the sample set of the SVM, and a classification model is established based on the sample set.

Referring to FIG. 2, FIG. 2 is a schematic diagram of a web page classification scenario according to an embodiment of the present invention. The web page classification scenario is an optional application scenario of the embodiments of the present invention and includes the following steps:

Step 1: A classification server acquires a large number of HyperText Markup Language (HTML) pages through a crawler server.

Step 2: The classification server performs processing such as parsing, word segmentation, feature extraction, and feature weight calculation on the text content of the large number of HTML pages. For example, the server parses out the text content in the Title field, the Keyword field, the Description field, and the anchor text field of an HTML page, and splits the text content into multiple words by using a word segmentation algorithm to form a word set. Some words in the word set are feature words that describe the features of the web page, and some are connective words that join different terms; therefore, the feature words are first extracted to form a feature word set, and feature words are then selected from the feature word set to form a feature set. Because the importance of the individual features may differ, the weight of each feature needs to be calculated, for example, according to the term frequency-inverse document frequency (TF-IDF) algorithm, which measures the weight of each feature by the number of times the feature appears. After the weights of the features are calculated, the weights of the feature words are quantized by vectors to form the feature vectors of these features. After the large number of HTML pages are processed, the feature vector of each HTML page is obtained; the flow of obtaining the feature vectors is shown in FIG. 3.
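As an illustration of the weight calculation in this step (a minimal sketch with hypothetical pages; the patent only requires that a weight grow with occurrence counts, and the tf(t, d) * log(N / df(t)) definition used here is the common one, not a formula from the patent):

    # Hypothetical TF-IDF weighting of feature words across parsed pages.
    import math
    from collections import Counter

    pages = [
        ["wine", "grape", "wine", "cellar"],       # tokens of page 1
        ["recipe", "grape", "dessert", "recipe"],  # tokens of page 2
    ]

    vocab = sorted({w for page in pages for w in page})
    df = Counter(w for page in pages for w in set(page))  # document frequency

    def tfidf_vector(tokens):
        tf = Counter(tokens)
        return [tf[w] * math.log(len(pages) / df[w]) for w in vocab]

    feature_vectors = [tfidf_vector(p) for p in pages]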
Step 3: The classification server purifies the large number of obtained feature vectors and removes some feature vectors that contribute little to subsequent classification, for example, by using the K-means algorithm. The purified feature vectors can be used as the sample set input into the SVM.
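The patent names K-means but not a concrete purification rule; one possible reading (an assumption of this rewrite, with synthetic stand-in data) is to drop the vectors farthest from their cluster centers:

    # Hypothetical purification: cluster the feature vectors with K-means
    # and discard the 10% of vectors farthest from their cluster center.
    import numpy as np
    from sklearn.cluster import KMeans

    vectors = np.random.rand(100, 20)         # stand-in feature vectors
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
    dist = np.linalg.norm(vectors - km.cluster_centers_[km.labels_], axis=1)
    sample_set = vectors[dist <= np.quantile(dist, 0.9)]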
Step 4: The classification server trains and tests the sample set through the SVM to obtain a classification model.

Step 5: The classification server classifies subsequently acquired HTML pages by using the classification model, and sends the classification result of each HTML page, in association with the Uniform Resource Locator (URL) of the HTML page, to a URL library.

Step 6: A gateway device receives the packets that an end user, surfing the Internet through a browser or a web proxy server, sends to the external network; it first identifies HTTP Get packets and parses them to obtain the host (HOST) and URL fields, queries the URL library for the classification associated with the URL, and then executes the operation policy corresponding to the classification, for example, blocking, redirection, or pushing an alarm page.
Referring to FIG. 4A, FIG. 4A is a schematic flowchart of a text data classification method according to an embodiment of the present invention. Both the sample set obtained in the web page classification scenario and the sample sets obtained in other scenarios can be classified through this flow, which includes the following steps:

Step S401: A server analyzes a first training set by using a support vector machine (SVM) algorithm, and performs a classification test on a first test set according to a first classification model obtained through the analysis.

Specifically, the server is the classification server. The server first selects some support vectors from the input sample set as a training set, and selects other support vectors as a test set. To distinguish them from the training set and the test set described later, the training set selected in this step may be referred to as the first training set, and the selected test set may be referred to as the first test set. There are N support vectors in the first training set and M support vectors in the first test set, and each of the N support vectors and each of the M support vectors consists of the weight factors corresponding to K feature words, where M, N, and K are all positive integers greater than 1. The N support vectors in the first training set are shown in Table 1:
    Support vector   Feature word 1   Feature word 2   Feature word 3   …   Feature word K
    X1               x11              x12              x13              …   x1K
    X2               x21              x22              x23              …   x2K
    …                …                …                …                …   …
    XN               xN1              xN2              xN3              …   xNK

Table 1
Table 1 shows support vector X1 (x11, x12, x13, ..., x1K), support vector X2 (x21, x22, x23, ..., x2K), and support vector XN (xN1, xN2, xN3, ..., xNK); each support vector includes the weight factors of feature word 1 to feature word K. For example, X1, X2, and XN are the quantized feature vectors of a first web page, a second web page, and an N-th web page, respectively, and a weight factor specifically refers to the number of times a feature word appears in the text data. Then weight factor x11 is the number of times "feature word 1" appears in the first web page, x12 is the number of times "feature word 2" appears in the first web page, x13 is the number of times "feature word 3" appears in the first web page, and x1K is the number of times "feature word K" appears in the first web page. Further, x21 is the number of times "feature word 1" appears in the second web page, x22 is the number of times "feature word 2" appears in the second web page, x23 is the number of times "feature word 3" appears in the second web page, and xNK is the number of times "feature word K" appears in the N-th web page. The remaining parameters in Table 1 can be deduced by analogy.

The above has illustrated by example how the support vectors in the first training set include the weight factors of the K feature words; the way in which each support vector in the first test set includes the weight factors of the K feature words is the same as in the first training set, and the details are not described here again.
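To make Table 1 concrete, the following minimal sketch (hypothetical data, not from the patent) builds such occurrence-count weight factors for two pages:

    # Hypothetical construction of the weight factors in Table 1: entry i
    # of row n is the number of times feature word i appears in page n.
    from collections import Counter

    feature_words = ["wine", "grape", "recipe"]   # feature words 1..K (K=3)
    pages = [
        "wine wine grape cellar".split(),         # text data described by X1
        "recipe grape dessert recipe".split(),    # text data described by X2
    ]

    support_vectors = [[Counter(p)[w] for w in feature_words] for p in pages]
    # support_vectors == [[2, 1, 0], [0, 1, 2]]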
In this embodiment of the present invention, the support vectors in the first training set and the first test set are all classified in advance. The classification may be done by manual labeling, or by some devices through a clustering algorithm. For example, in the web page classification scenario, a large number of acquired "alcohol" and "food" web pages may first be categorized manually, and the "alcohol" web pages and the "food" web pages are classified by labeling the support vectors of the "alcohol" web pages as 1 and the support vectors of the "food" web pages as -1.

Like the prior art, the server performs iterative calculation on the first training set by using the SVM algorithm; the iterative calculation is a process of summarizing the commonalities of support vectors of the same class and the differences between support vectors of different classes. The iterative calculation produces a first classification model file, and the first classification model can reflect the commonalities of support vectors of the same class and the differences between support vectors of different classes. The first classification model includes vector coefficients used to characterize the weight of each support vector in the first training set. In an optional solution, the vector coefficient of each support vector in the first training set may specifically be the Lagrange coefficient of the support vector. Assuming that the Lagrange coefficients of support vectors X1, X2, X3, ..., XN are a1, a2, a3, ..., aN in sequence, a1 is used to characterize the weight of support vector X1 among all the support vectors of the first training set, a2 is used to characterize the weight of support vector X2 among all the support vectors of the first training set, and the other parameters of the same kind can be deduced by analogy.
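The patent does not name a particular SVM implementation; as one possible illustration (an assumption of this rewrite), scikit-learn's linear SVC exposes per-support-vector coefficients from which the Lagrange coefficients a1, ..., aN can be read:

    # Hypothetical training of the first classification model. For a linear
    # two-class SVC, clf.dual_coef_[0][j] equals y_j * alpha_j, the signed
    # Lagrange coefficient of the j-th support vector.
    import numpy as np
    from sklearn.svm import SVC

    X_train = np.random.rand(40, 5)                 # first training set
    y_train = np.where(X_train[:, 0] > 0.5, 1, -1)  # preset classes 1 / -1

    clf = SVC(kernel="linear").fit(X_train, y_train)
    support_vecs = clf.support_vectors_             # the retained X_n
    lagrange = clf.dual_coef_[0]                    # signed coefficients a_n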
The server tests the support vectors in the first test set based on the obtained first classification model file. The specific process includes classifying the support vectors in the first test set by using the first classification model to obtain the classification result of each support vector in the first test set, comparing these classification results with the pre-classified results of the support vectors in the first test set, and finding the support vectors whose classification results obtained by the first classification model are inconsistent with the pre-classified results. For ease of subsequent description, such an inconsistent support vector may be referred to as a target support vector.
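A sketch of finding the target support vectors (hypothetical data; the same assumed scikit-learn setup as above):

    # Hypothetical identification of target support vectors: test-set vectors
    # whose first-model prediction disagrees with the preset classification.
    import numpy as np
    from sklearn.svm import SVC

    X_tr = np.random.rand(40, 5); y_tr = np.where(X_tr[:, 0] > 0.5, 1, -1)
    X_te = np.random.rand(10, 5); y_te = np.where(X_te[:, 0] > 0.5, 1, -1)

    clf = SVC(kernel="linear").fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    target_support_vectors = X_te[pred != y_te]   # may be empty if all match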
Step S402: The server calculates, according to the weight factors of the target support vector and the parameters in the first classification model, the relative weight of each of the K feature words in the target support vector.

Specifically, this embodiment of the present invention considers not only the weights of feature word 1, feature word 2, feature word 3, ..., feature word K in the target support vector, but also the weights of feature word 1, feature word 2, feature word 3, ..., feature word K in the first training set. Optionally, the weight of a feature word in the target support vector is positively correlated with the number of times the feature word appears in the text data described by the target support vector, and the weight factor described above represents exactly this weight. The relative weight of a feature word in the first training set specifically refers to the weight obtained by taking the weighted average of the weights of the feature word in the individual support vectors of the first training set. For example, if the first training set includes support vectors X1, X2, X3, and X4, the weights of feature word 1 in X1, X2, X3, and X4 are added together, and the sum divided by 4 is the relative weight of feature word 1 in the first training set. The weight calculated by combining the weight of a feature word in the target support vector with the relative weight of the feature word in the first training set is the relative weight of the feature word in the target support vector. Assume that the target support vector is Y1 (y11, y12, y13, ..., y1K), where the weight factors y11, y12, y13, ..., y1K respectively characterize the weights of feature word 1, feature word 2, feature word 3, ..., feature word K in the target support vector Y1.

In an optional solution, the relative weight may be calculated by formula 1-1, which is as follows:

f(i)=β*(a1*x1i+a2*x2i+…+aN*xNi)*y1i       1-1

f(i) is the relative weight, in the target support vector, of the i-th feature word of the target support vector, and the relative weight of each feature word in the target support vector is calculated by letting i take the positive integers from 1 to K; y1i is the weight factor of feature word i in the target support vector. In this formula, (a1*x1i+a2*x2i+…+aN*xNi) is equivalent to weighting feature word i over each support vector in the first training set, and can reflect the relative weight of feature word i in the first training set; therefore, β*(a1*x1i+a2*x2i+…+aN*xNi)*y1i can characterize the relative weight, described in this embodiment of the present invention, of the i-th feature word in the target support vector. Further, β in the formula may be a preset fixed value or function; if β is not configured, β=1 by default.
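Formula 1-1 maps directly onto a few lines of array code; the sketch below (hypothetical values, with β left at its default of 1) computes f(i) for all i at once:

    # Hypothetical computation of formula 1-1 with beta = 1 (the default).
    # a = (a1..aN) Lagrange coefficients, X = first training set (N x K),
    # y1 = target support vector (y11..y1K).
    import numpy as np

    a = np.array([0.5, 0.2, 0.3])
    X = np.array([[2, 0, 1],
                  [1, 1, 0],
                  [0, 3, 1]])
    y1 = np.array([1, 0, 2])

    f = 1.0 * (a @ X) * y1   # f(i) = beta * (a1*x1i + ... + aN*xNi) * y1i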
In an optional solution, β may be calculated from formula 1-2 and formula 1-3, where formula 1-2 is as follows:

[Formula 1-2 is rendered as an image in the original document (PCTCN2017070464-appb-000002); it defines β in terms of MAX-sum and MIN-sum, which are introduced below.]
sum(i) = (a1*x11+a2*x21+…+aN*xN1)*y11 + (a1*x12+a2*x22+…+aN*xN2)*y12 + … + (a1*x1i+a2*x2i+…+aN*xNi)*y1i       1-3

In formula 1-3, i takes the positive integers from 1 to K in sequence to calculate sum(1), sum(2), ..., sum(K); the maximum value among sum(1), sum(2), ..., sum(K) is MAX-sum in formula 1-2, and the minimum value among sum(1), sum(2), ..., sum(K) is MIN-sum in formula 1-2.
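Formula 1-3 is a cumulative sum, so MAX-sum and MIN-sum can be computed as follows (hypothetical values matching the sketch above; the exact expression for β in formula 1-2 appears only as an image in the original and is not reproduced here):

    # Hypothetical computation of formula 1-3. sums[i-1] corresponds to
    # sum(i); its maximum and minimum are MAX-sum and MIN-sum of formula 1-2.
    import numpy as np

    a = np.array([0.5, 0.2, 0.3])                    # Lagrange coefficients
    X = np.array([[2, 0, 1], [1, 1, 0], [0, 3, 1]])  # first training set
    y1 = np.array([1, 0, 2])                         # target support vector

    per_feature = (a @ X) * y1        # (a1*x1j + ... + aN*xNj) * y1j
    sums = np.cumsum(per_feature)     # sums[i-1] == sum(i)
    max_sum, min_sum = sums.max(), sums.min()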
In another optional solution, β may be calculated from formulas 1-4 and 1-3, where formula 1-4 is as follows:

[Formula 1-4 is rendered as an image in the original document (PCTCN2017070464-appb-000003).]

In another optional solution, β may be calculated from formulas 1-5 and 1-3, where formula 1-5 is as follows:

[Formula 1-5 is rendered as an image in the original document (PCTCN2017070464-appb-000004).]
In another optional solution, when some value of i in formula 1-1 makes f(i) negative, the corresponding f(i) is set to 0.

In another optional solution, when some value of i in formula 1-1 makes f(i) positive, the corresponding f(i) is set to 0.
After calculating the relative weight of each feature word in the target support vector according to the weight factors in the target support vector and the parameters in the first classification model, the server deletes, from the first training set and the first test set, the weight factors of the feature words whose calculated relative weights are less than a first preset threshold. The first preset threshold may be a preset fixed value or function; for example, the preset threshold is the relative weight ranked fifth from the bottom when the calculated relative weights are sorted in descending order.

For example, when the calculated relative weight of feature word 1 in the target support vector is less than the first preset threshold, x11 in support vector X1, x21 in support vector X2, ..., and xN1 in support vector XN are deleted, and the new support vectors obtained are X1 (x12, x13, ..., x1K), X2 (x22, x23, ..., x2K), ..., XN (xN2, xN3, ..., xNK) in sequence. For ease of subsequent description, the set consisting of the new support vectors X1, X2, ..., XN may be referred to as the second training set. Likewise, the weight factors used to describe feature word 1 in the first test set are also deleted, and the set consisting of the support vectors after this deletion is the second test set.
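Deleting a feature word's weight factor from every support vector amounts to removing one column; a minimal sketch (hypothetical data; feature word 1, i.e. column index 0, is assumed to be the target feature word):

    # Hypothetical construction of the second training and test sets: the
    # column of the target feature word (feature word 1, index 0) is removed.
    import numpy as np

    first_training_set = np.array([[2, 0, 1], [1, 1, 0], [0, 3, 1]])
    first_test_set = np.array([[1, 2, 0], [0, 1, 1]])

    second_training_set = np.delete(first_training_set, 0, axis=1)
    second_test_set = np.delete(first_test_set, 0, axis=1)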
In an optional solution, before deleting the weight factor of a feature word from the first training set to obtain the second training set and deleting the weight factor of the feature word from the first test set to obtain the second test set, the server further determines whether the relative weight of the feature word in the first training set is less than a second preset threshold. In an optional solution, the relative weight of each feature word in the first training set may be calculated by formula 1-6, which is as follows:

T(i)=θ*(a1*x1i+a2*x2i+…+aN*xNi)     1-6

In formula 1-6, i may take any positive integer from 1 to K to calculate the relative weight of any feature word in the first training set; for example, taking i=1 calculates the relative weight of feature word 1 in the first training set, taking i=2 calculates the relative weight of feature word 2 in the first training set, and the rest can be deduced by analogy. The second preset threshold may be a preset fixed value or function. When the relative weight of a feature word in the target support vector is less than the first preset threshold, the relative weight of the feature word in the first training set is calculated by formula 1-6 and then compared with the second preset threshold. Only when the relative weight of the feature word in the first training set is also less than the second preset threshold is the weight factor of the feature word deleted from the first training set to obtain the second training set, and the factor of the feature word deleted from the first test set to obtain the second test set.
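Formula 1-6 can likewise be computed for all feature words in one step; in the sketch below (hypothetical values), θ is left at 1 and the second preset threshold is an assumed number:

    # Hypothetical computation of formula 1-6 with theta = 1.
    import numpy as np

    a = np.array([0.5, 0.2, 0.3])                    # Lagrange coefficients
    X = np.array([[2, 0, 1], [1, 1, 0], [0, 3, 1]])  # first training set

    T = 1.0 * (a @ X)                # T(i) = theta * (a1*x1i + ... + aN*xNi)
    second_preset_threshold = 0.5    # assumed value
    deletable = T < second_preset_threshold   # candidates for deletion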
In another optional solution, i takes the positive integers from 1 to K in sequence and is substituted into formula 1-6 to calculate the relative weight of each feature word in the first training set, and the relative weights of the feature words in the first training set are then sorted. If the relative weight of a feature word in the target support vector is less than the first preset threshold, and the rank of the relative weight of the feature word in the first training set falls within a preset rank interval, for example, within the last five, the weight factor of the feature word is deleted from the first training set to obtain the second training set, and the weight factor of the feature word is deleted from the first test set to obtain the second test set.

In an optional solution, there may be multiple target support vectors as described in this embodiment of the present invention. When there are multiple target support vectors, the target feature words need to be calculated separately according to the multiple target support vectors, the weight factors of the calculated target feature words are then deleted from the first training set to obtain the second training set, and the weight factors of the calculated target feature words are deleted from the first test set to obtain the second test set.

In another optional solution, when the server tests the support vectors in the first test set by using the calculated first classification model, step S402 is performed only if the error rate of the results classified by the first classification model, compared with the pre-classified results, is higher than a target preset threshold, for example, a target preset threshold set to 5%.
Step S403: The server analyzes a second training set by using the SVM algorithm, and performs a classification test on a second test set according to a second classification model obtained through the analysis.

Specifically, after the second training set and the second test set are obtained, the support vectors in the second training set are analyzed again by using the SVM algorithm to obtain a new classification model, which, for ease of subsequent description, may be referred to as the second classification model; the support vectors in the second test set are then tested based on the second classification model. In an optional solution, if the error rate of the test is still higher than the target preset threshold, weight factors are deleted again from the second training set and the second test set according to the principle of step S402, until the error rate is not higher than the target preset threshold.
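Putting steps S401 to S403 together, the iteration can be sketched as below (an illustrative reading of the flow; scikit-learn, the synthetic data, and the pruning rule "drop the single lowest-relative-weight feature word per round" are all assumptions of this rewrite, not the patent's definitive procedure):

    # Hypothetical end-to-end loop: retrain and prune until the test error
    # rate is not higher than the target preset threshold.
    import numpy as np
    from sklearn.svm import SVC

    def classify_until_acceptable(X_tr, y_tr, X_te, y_te, target_error=0.05):
        while X_tr.shape[1] > 1:
            clf = SVC(kernel="linear").fit(X_tr, y_tr)
            pred = clf.predict(X_te)
            if np.mean(pred != y_te) <= target_error:
                return clf                         # acceptable model
            a = clf.dual_coef_[0]                  # Lagrange coefficients
            sv = clf.support_vectors_
            target = X_te[pred != y_te][0]         # one target support vector
            f = (a @ sv) * target                  # formula 1-1, beta = 1
            worst = int(np.argmin(f))              # lowest relative weight,
            # treated here as "below the first preset threshold"
            X_tr = np.delete(X_tr, worst, axis=1)  # second training set
            X_te = np.delete(X_te, worst, axis=1)  # second test set
        return None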
The following describes, with reference to the examples shown in FIG. 4B to FIG. 4D, how web page data is classified based on the data classification method shown in FIG. 4A.

Referring to FIG. 4B, 360 vectors describing food web page data and 1903 vectors describing alcohol web page data are first acquired, and the acquired vectors are preprocessed; the set of preprocessed vectors is the sample set. Each vector in the sample set corresponds to a class label, where a class label 411 equal to 1 identifies food web page data and a class label 412 equal to -1 identifies alcohol web page data. Each preprocessed vector also corresponds to multiple feature numbers 413, and each feature number 413 corresponds to one weight factor 414; in FIG. 4B, each feature number 413 is separated from its corresponding weight factor by a colon, and different features are separated by spaces or alignment characters. One part of the sample set is taken as the training set, and another part of the sample set is taken as the test set.
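The "class label feature_number:weight_factor" layout described above matches the widely used LIBSVM/SVM-light text format (an observation of this rewrite, not a statement from the patent). A food-class line and its parsing might look like:

    # Hypothetical parsing of one sample-set line in label feat:weight form.
    line = "1 3:0.41 17:0.05 102:0.12"     # class label 1 = food web page
    label, *feats = line.split()
    vector = {int(k): float(v) for k, v in (f.split(":") for f in feats)}
    # label == "1", vector == {3: 0.41, 17: 0.05, 102: 0.12}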
Referring to FIG. 4C, the training set is fed into the SVM for training to generate a classification model file, and the classification model file contains the Lagrange coefficient 421 of each vector. Optionally, the relative weight of each feature word in the training set is calculated separately, and the weights of the feature words are sorted; FIG. 4C shows some food feature words and the ranking of their relative weights in the training set, as well as some alcohol feature words and the ranking of their relative weights in the training set.

Referring to FIG. 4D, the vectors in the test set are substituted into the classification model file for testing. When the test result indicates that the classification error rate of the classification model file is higher than the target preset threshold, the misclassified vectors in the test set are acquired; such a misclassified vector may be referred to as a target support vector. The relative weight of each feature word in the target support vector is calculated. In an optional solution, when the relative weight of a feature in the target support vector is less than the first preset threshold, the feature word is deleted from the training set and the test set. In another optional solution, when the relative weight of a feature in the target support vector is less than the first preset threshold and the relative weight of the feature in the training set is less than the second preset threshold, the feature word is deleted from the training set and the test set. A new classification model is then calculated based on the new training set formed after the feature word is deleted, and the vectors in the new test set formed after the feature word is deleted are substituted into the new classification model for testing, until the classification error rate of the finally obtained classification model file is lower than the target preset threshold. Optionally, in addition to deleting the feature words whose relative weights in the target support vector are small, the features whose weights are large in every class may also be deleted. For example, the feature word "香醇" (mellow) has a large relative weight in both the food class and the alcohol class of the training set, and the difference between the food class and the alcohol class cannot be reflected by the word "香醇"; therefore, the feature factor corresponding to the feature word "香醇" may be deleted from the test set and the training set.

In the method described in FIG. 4A, the server calculates the relative weight of each feature word in the target support vector based on the parameters in the first classification model and the weight factors in the target support vector. Because a target feature word with a relatively small relative weight does not describe well the features of the text data represented by the target support vector, the weight factor of the target feature word is deleted from the first training set and the first test set to obtain a second training set and a second test set, respectively, which are used to recalculate the classification model. This avoids the negative influence of the weight factor of the target feature word when the classification model is calculated, and can reduce the error rate of classification performed by the classification model.
The foregoing describes the method of the embodiments of the present invention in detail. To facilitate better implementation of the foregoing solutions of the embodiments of the present invention, the apparatuses of the embodiments of the present invention are correspondingly provided below.

Referring to FIG. 5, FIG. 5 shows a server 50 according to an embodiment of the present invention. The server 50 includes a processor 501 and a memory 502, and the processor 501 and the memory 502 are connected to each other through a bus.

The memory 502 includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), or a portable read-only memory (CD-ROM), and the memory 502 is used for related instructions and data. The memory 502 is further configured to store the first training set and the first test set, as well as the second training set and the second test set obtained by the processor 501.

The processor 501 may be one or more central processing units (CPUs). When the processor 501 is one CPU, the CPU may be a single-core CPU or a multi-core CPU.

The processor 501 in the server 50 is configured to read the program code stored in the memory 502, and then perform the following operations:
analyzing a first training set by using a support vector machine (SVM) algorithm, and performing a classification test on a first test set according to a first classification model obtained through the analysis, where the first training set and the first test set each include multiple support vectors, each support vector includes K weight factors corresponding to K feature words, each weight factor corresponds to one feature word, the value of a weight factor is positively correlated with the number of times the feature word corresponding to the weight factor appears in the text data described by the support vector, and K is a positive integer greater than 1;

calculating, according to the weight factors of a target support vector and the parameters in the first classification model, the relative weight of each of the K feature words in the target support vector, where the target support vector is a support vector in the first test set whose classification test result obtained by using the first classification model does not match a preset classification;

analyzing a second training set by using the SVM algorithm, and performing a classification test on a second test set according to a second classification model obtained through the analysis, where the support vectors in the second training set and the second test set each include the weight factors corresponding to the feature words among the K feature words other than a target feature word, and the target feature word is a feature word whose relative weight in the target support vector is less than a first preset threshold; and

if the classification error rate obtained through the classification test performed by using the second classification model is not higher than a target preset threshold, confirming that the second classification model is to be used to classify to-be-classified text data.

By performing the foregoing operations, the server 50 calculates, based on the parameters in the first classification model and the weight factors in the target support vector, the relative weight of each feature word in the target support vector. Because a target feature word with a relatively small relative weight does not describe well the features of the text data represented by the target support vector, the weight factor of the target feature word is deleted from the first training set and the first test set to obtain a second training set and a second test set, respectively, which are used to recalculate the classification model. This avoids the negative influence of the weight factor of the target feature word when the classification model is calculated, and can reduce the error rate of classification performed by the classification model.
In an optional solution, before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis, the processor 501 is further configured to:

acquire the target feature word, where the target feature word is a feature word whose relative weight in the target support vector is less than the first preset threshold; and

delete the weight factor of the target feature word from each support vector in the first training set to obtain the second training set, and delete the weight factor of the target feature word from each support vector in the first test set to obtain the second test set.

In another optional solution, before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis, the processor 501 is further configured to:

calculate, according to the parameters in the first classification model, the relative weight of each of the K feature words in the first training set;

acquire the target feature word, where the target feature word is a feature word whose relative weight in the first training set is less than a second preset threshold and whose relative weight in the target support vector is less than the first preset threshold; and

delete the factor corresponding to the target feature word from each support vector in the first training set to obtain the second training set, and delete the factor corresponding to the target feature word from each support vector in the first test set to obtain the second test set.
In another optional solution, the first training set includes N support vectors, and the parameters in the first classification model include the Lagrange coefficients of the individual support vectors in the first training set; the calculating, by the processor 501 according to the parameters in the first classification model, the relative weight of each of the K feature words in the first training set is specifically:

calculating the relative weight T(i) of the i-th feature word in the first training set by using the formula T(i)=θ*(a1*x1i+a2*x2i+…+aN*xNi), where the relative weight of each feature word in the first training set is calculated by letting i take the positive integers from 1 to K, aN is the Lagrange coefficient of the N-th support vector among the N support vectors, and xNi is the weight factor of the i-th feature word in the N-th support vector.

In another optional solution, the first training set includes N support vectors, and the parameters in the first classification model include the Lagrange coefficients of the individual support vectors in the first training set; the calculating, by the processor 501 according to the weight factors of the target support vector and the parameters in the first classification model, the relative weight of each of the K feature words in the target support vector is specifically:

calculating the relative weight f(i) of the i-th feature word in the target support vector by using the formula f(i)=β*(a1*x1i+a2*x2i+…+aN*xNi)*y1i, where the relative weight of each feature word in the target support vector is calculated by letting i take the positive integers from 1 to K, aN is the Lagrange coefficient of the N-th support vector, xNi is the weight factor of the i-th feature word in the N-th support vector, and y1i is the weight factor of the i-th feature word in the target support vector.
In another optional solution, before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis, the processor 501 is further configured to:

determine whether the classification error rate obtained through the classification test performed by using the first classification model is higher than the target preset threshold; and

if the classification error rate is higher than the target preset threshold, perform the operation of analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis.

Specifically, some features are deleted from the first training set and the first test set only when the calculated error rate of classification by the first classification model is higher than the target preset threshold, rather than every time a first classification model is calculated, which reduces the overhead of the server 50.

The specific implementation of the server 50 in this embodiment of the present invention may also refer to the corresponding description of the method embodiment shown in FIG. 4A.

In the server 50 described in FIG. 5, the server 50 calculates, based on the parameters in the first classification model and the weight factors in the target support vector, the relative weight of each feature word in the target support vector. Because a target feature word with a relatively small relative weight does not describe well the features of the text data represented by the target support vector, the weight factor of the target feature word is deleted from the first training set and the first test set to obtain a second training set and a second test set, respectively, which are used to recalculate the classification model. This avoids the negative influence of the weight factor of the target feature word when the classification model is calculated, and can reduce the error rate of classification performed by the classification model.
Referring to FIG. 6, FIG. 6 is a schematic structural diagram of another server 60 according to an embodiment of the present invention. The server 60 may include an analyzing unit 601 and a calculating unit 602, which are described in detail as follows.

The analyzing unit 601 is configured to analyze a first training set by using a support vector machine (SVM) algorithm, and perform a classification test on a first test set according to a first classification model obtained through the analysis, where the first training set and the first test set each include multiple support vectors, each support vector includes K weight factors corresponding to K feature words, each weight factor corresponds to one feature word, the value of a weight factor is positively correlated with the number of times the feature word corresponding to the weight factor appears in the text data described by the support vector, and K is a positive integer greater than 1.

The calculating unit 602 is configured to calculate, according to the weight factors of a target support vector and the parameters in the first classification model, the relative weight of each of the K feature words in the target support vector, where the target support vector is a support vector in the first test set whose classification test result obtained by using the first classification model does not match a preset classification.

The analyzing unit 601 is further configured to analyze a second training set by using the SVM algorithm, and perform a classification test on a second test set according to a second classification model obtained through the analysis, where the support vectors in the second training set and the second test set each include the weight factors corresponding to the feature words among the K feature words other than a target feature word, and the target feature word is a feature word whose relative weight in the target support vector is less than a first preset threshold.

If the classification error rate obtained through the classification test performed by using the second classification model is not higher than a target preset threshold, it is confirmed that the second classification model is to be used to classify to-be-classified text data.

By running the foregoing units, the server 60 calculates, based on the parameters in the first classification model and the weight factors in the target support vector, the relative weight of each feature word in the target support vector. Because a target feature word with a relatively small relative weight does not describe well the features of the text data represented by the target support vector, the weight factor of the target feature word is deleted from the first training set and the first test set to obtain a second training set and a second test set, respectively, which are used to recalculate the classification model. This avoids the negative influence of the weight factor of the target feature word when the classification model is calculated, and can reduce the error rate of classification performed by the classification model.
In an optional solution, the server 60 further includes an acquiring unit and a deleting unit.

The acquiring unit is configured to: before the analyzing unit 601 analyzes the second training set by using the SVM algorithm and performs the classification test on the second test set according to the second classification model obtained through the analysis, acquire the target feature word, where the target feature word is a feature word whose relative weight in the target support vector is less than the first preset threshold.

The deleting unit is configured to delete the weight factor of the target feature word from each support vector in the first training set to obtain the second training set, and delete the weight factor of the target feature word from each support vector in the first test set to obtain the second test set.

In another optional solution, the server 60 further includes an acquiring unit and a deleting unit.

The calculating unit 602 is further configured to: before the analyzing unit 601 analyzes the second training set by using the SVM algorithm and performs the classification test on the second test set according to the second classification model obtained through the analysis, calculate, according to the parameters in the first classification model, the relative weight of each of the K feature words in the first training set.

The acquiring unit is configured to acquire the target feature word, where the target feature word is a feature word whose relative weight in the first training set is less than a second preset threshold and whose relative weight in the target support vector is less than the first preset threshold.

The deleting unit is configured to delete the factor corresponding to the target feature word from each support vector in the first training set to obtain the second training set, and delete the factor corresponding to the target feature word from each support vector in the first test set to obtain the second test set.
In another optional solution, the first training set includes N support vectors, and the parameters in the first classification model include the Lagrange coefficients of the individual support vectors in the first training set; the calculating, by the calculating unit 602 according to the parameters in the first classification model, the relative weight of each of the K feature words in the first training set is specifically:

calculating the relative weight T(i) of the i-th feature word in the first training set by using the formula T(i)=θ*(a1*x1i+a2*x2i+…+aN*xNi), where the relative weight of each feature word in the first training set is calculated by letting i take the positive integers from 1 to K, aN is the Lagrange coefficient of the N-th support vector among the N support vectors, and xNi is the weight factor of the i-th feature word in the N-th support vector.

In another optional solution, the first training set includes N support vectors, and the parameters in the first classification model include the Lagrange coefficients of the individual support vectors in the first training set; the calculating, by the calculating unit 602 according to the weight factors of the target support vector and the parameters in the first classification model, the relative weight of each of the K feature words in the target support vector is specifically:

calculating the relative weight f(i) of the i-th feature word in the target support vector by using the formula f(i)=β*(a1*x1i+a2*x2i+…+aN*xNi)*y1i, where the relative weight of each feature word in the target support vector is calculated by letting i take the positive integers from 1 to K, aN is the Lagrange coefficient of the N-th support vector, xNi is the weight factor of the i-th feature word in the N-th support vector, and y1i is the weight factor of the i-th feature word in the target support vector.
In another optional solution, the server 60 further includes a judging unit, where the judging unit is configured to: before the analyzing unit 601 analyzes the second training set by using the SVM algorithm and performs the classification test on the second test set according to the second classification model obtained through the analysis, determine whether the classification error rate obtained through the classification test performed by using the first classification model is higher than the target preset threshold; and

if the classification error rate is higher than the target preset threshold, trigger the analyzing unit 601 to perform the operation of analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis.

Specifically, some features are deleted from the first training set and the first test set only when the calculated error rate of classification by the first classification model is higher than the target preset threshold, rather than every time a first classification model is calculated, which reduces the overhead of the server 60.

The specific implementation of the server 60 in this embodiment of the present invention may also refer to the corresponding description of the method embodiment shown in FIG. 4A.

In the server 60 described in FIG. 6, the server 60 calculates, based on the parameters in the first classification model and the weight factors in the target support vector, the relative weight of each feature word in the target support vector. Because a target feature word with a relatively small relative weight does not describe well the features of the text data represented by the target support vector, the weight factor of the target feature word is deleted from the first training set and the first test set to obtain a second training set and a second test set, respectively, which are used to recalculate the classification model. This avoids the negative influence of the weight factor of the target feature word when the classification model is calculated, and can reduce the error rate of classification performed by the classification model.
In summary, by implementing the embodiments of the present invention, the server calculates, based on the parameters in the first classification model and the weight factors in the target support vector, the relative weight of each feature word in the target support vector. Because a target feature word with a relatively small relative weight does not describe well the features of the text data represented by the target support vector, the weight factor of the target feature word is deleted from the first training set and the first test set to obtain a second training set and a second test set, respectively, which are used to recalculate the classification model. This avoids the negative influence of the weight factor of the target feature word when the classification model is calculated, and can reduce the error rate of classification performed by the classification model.

A person of ordinary skill in the art may understand that all or some of the procedures of the methods in the foregoing embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and when the program is executed, the procedures of the foregoing method embodiments may be included. The foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

The foregoing embodiments disclose merely preferred embodiments of the present invention and shall not be construed as limiting the scope of rights of the present invention. A person of ordinary skill in the art may understand all or some of the procedures for implementing the foregoing embodiments, and equivalent variations made according to the claims of the present invention shall still fall within the scope of the invention.

Claims (12)

  1. A text data classification method, comprising:
    analyzing, by a server, a first training set by using a support vector machine (SVM) algorithm, and performing a classification test on a first test set according to a first classification model obtained through the analysis, wherein the first training set and the first test set each comprise multiple support vectors, each support vector comprises K weight factors corresponding to K feature words, each weight factor corresponds to one feature word, the value of a weight factor is positively correlated with the number of times the feature word corresponding to the weight factor appears in the text data described by the support vector, and K is a positive integer greater than 1;
    calculating, by the server according to the weight factors of a target support vector and parameters in the first classification model, a relative weight of each of the K feature words in the target support vector, wherein the target support vector is a support vector in the first test set whose classification test result obtained by using the first classification model does not match a preset classification;
    analyzing, by the server, a second training set by using the SVM algorithm, and performing a classification test on a second test set according to a second classification model obtained through the analysis, wherein the support vectors in the second training set and the second test set each comprise the weight factors corresponding to the feature words among the K feature words other than a target feature word, and the target feature word is a feature word whose relative weight in the target support vector is less than a first preset threshold; and
    if a classification error rate obtained through the classification test performed by using the second classification model is not higher than a target preset threshold, confirming that the second classification model is to be used to classify to-be-classified text data.
  2. The method according to claim 1, wherein before the server analyzes the second training set by using the SVM algorithm and performs the classification test on the second test set according to the second classification model obtained through the analysis, the method further comprises:
    acquiring, by the server, the target feature word, wherein the target feature word is a feature word whose relative weight in the target support vector is less than the first preset threshold; and
    deleting, by the server, the weight factor of the target feature word from each support vector in the first training set to obtain the second training set, and deleting the weight factor of the target feature word from each support vector in the first test set to obtain the second test set.
  3. The method according to claim 1, wherein before the server analyzes the second training set by using the SVM algorithm and performs the classification test on the second test set according to the second classification model obtained through the analysis, the method further comprises:
    calculating, by the server according to the parameters in the first classification model, a relative weight of each of the K feature words in the first training set;
    acquiring, by the server, the target feature word, wherein the target feature word is a feature word whose relative weight in the first training set is less than a second preset threshold and whose relative weight in the target support vector is less than the first preset threshold; and
    deleting, by the server, the factor corresponding to the target feature word from each support vector in the first training set to obtain the second training set, and deleting the factor corresponding to the target feature word from each support vector in the first test set to obtain the second test set.
  4. The method according to claim 3, wherein the first training set comprises N support vectors, and the parameters in the first classification model comprise Lagrange coefficients of the individual support vectors in the first training set; and the calculating, by the server according to the parameters in the first classification model, a relative weight of each of the K feature words in the first training set comprises:
    calculating, by the server, the relative weight T(i) of the i-th feature word in the first training set by using the formula T(i)=θ*(a1*x1i+a2*x2i+…+aN*xNi), wherein the relative weight of each feature word in the first training set is calculated by letting i take the positive integers from 1 to K, aN is the Lagrange coefficient of the N-th support vector among the N support vectors, and xNi is the weight factor of the i-th feature word in the N-th support vector.
  5. The method according to any one of claims 1 to 4, wherein the first training set comprises N support vectors, and the parameters in the first classification model comprise Lagrange coefficients of the individual support vectors in the first training set; and the calculating, by the server according to the weight factors of the target support vector and the parameters in the first classification model, a relative weight of each of the K feature words in the target support vector comprises:
    calculating, by the server, the relative weight f(i) of the i-th feature word in the target support vector by using the formula f(i)=β*(a1*x1i+a2*x2i+…+aN*xNi)*y1i, wherein the relative weight of each feature word in the target support vector is calculated by letting i take the positive integers from 1 to K, aN is the Lagrange coefficient of the N-th support vector, xNi is the weight factor of the i-th feature word in the N-th support vector, and y1i is the weight factor of the i-th feature word in the target support vector.
  6. The method according to any one of claims 1 to 5, wherein before the server analyzes the second training set by using the SVM algorithm and performs the classification test on the second test set according to the second classification model obtained through the analysis, the method further comprises:
    determining, by the server, whether the classification error rate obtained through the classification test performed by using the first classification model is higher than the target preset threshold; and
    if the classification error rate is higher than the target preset threshold, performing the step of analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis.
  7. A server, comprising a processor and a memory, wherein:
    the memory is configured to store instructions and data; and
    the processor is configured to read the instructions and the data stored in the memory, and perform the following operations:
    analyzing a first training set by using a support vector machine (SVM) algorithm, and performing a classification test on a first test set according to a first classification model obtained through the analysis, wherein the first training set and the first test set each comprise multiple support vectors, each support vector comprises K weight factors corresponding to K feature words, each weight factor corresponds to one feature word, the value of a weight factor is positively correlated with the number of times the feature word corresponding to the weight factor appears in the text data described by the support vector, and K is a positive integer greater than 1;
    calculating, according to the weight factors of a target support vector and parameters in the first classification model, a relative weight of each of the K feature words in the target support vector, wherein the target support vector is a support vector in the first test set whose classification test result obtained by using the first classification model does not match a preset classification;
    analyzing a second training set by using the SVM algorithm, and performing a classification test on a second test set according to a second classification model obtained through the analysis, wherein the support vectors in the second training set and the second test set each comprise the weight factors corresponding to the feature words among the K feature words other than a target feature word, and the target feature word is a feature word whose relative weight in the target support vector is less than a first preset threshold; and
    if a classification error rate obtained through the classification test performed by using the second classification model is not higher than a target preset threshold, confirming that the second classification model is to be used to classify to-be-classified text data.
  8. The server according to claim 7, wherein before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis, the processor is further configured to:
    acquire the target feature word, wherein the target feature word is a feature word whose relative weight in the target support vector is less than the first preset threshold; and
    delete the weight factor of the target feature word from each support vector in the first training set to obtain the second training set, and delete the weight factor of the target feature word from each support vector in the first test set to obtain the second test set.
  9. The server according to claim 7, wherein before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis, the processor is further configured to:
    calculate, according to the parameters in the first classification model, a relative weight of each of the K feature words in the first training set;
    acquire the target feature word, wherein the target feature word is a feature word whose relative weight in the first training set is less than a second preset threshold and whose relative weight in the target support vector is less than the first preset threshold; and
    delete the factor corresponding to the target feature word from each support vector in the first training set to obtain the second training set, and delete the factor corresponding to the target feature word from each support vector in the first test set to obtain the second test set.
  10. The server according to claim 9, wherein the first training set comprises N support vectors, and the parameters in the first classification model comprise Lagrange coefficients of the individual support vectors in the first training set; and the calculating, by the processor according to the parameters in the first classification model, a relative weight of each of the K feature words in the first training set is specifically:
    calculating the relative weight T(i) of the i-th feature word in the first training set by using the formula T(i)=θ*(a1*x1i+a2*x2i+…+aN*xNi), wherein the relative weight of each feature word in the first training set is calculated by letting i take the positive integers from 1 to K, aN is the Lagrange coefficient of the N-th support vector among the N support vectors, and xNi is the weight factor of the i-th feature word in the N-th support vector.
  11. The server according to any one of claims 7 to 10, wherein the first training set comprises N support vectors, and the parameters in the first classification model comprise Lagrange coefficients of the individual support vectors in the first training set; and the calculating, by the processor according to the weight factors of the target support vector and the parameters in the first classification model, a relative weight of each of the K feature words in the target support vector is specifically:
    calculating the relative weight f(i) of the i-th feature word in the target support vector by using the formula f(i)=β*(a1*x1i+a2*x2i+…+aN*xNi)*y1i, wherein the relative weight of each feature word in the target support vector is calculated by letting i take the positive integers from 1 to K, aN is the Lagrange coefficient of the N-th support vector, xNi is the weight factor of the i-th feature word in the N-th support vector, and y1i is the weight factor of the i-th feature word in the target support vector.
  12. The server according to any one of claims 7 to 11, wherein before analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis, the processor is further configured to:
    determine whether the classification error rate obtained through the classification test performed by using the first classification model is higher than the target preset threshold; and
    if the classification error rate is higher than the target preset threshold, perform the operation of analyzing the second training set by using the SVM algorithm and performing the classification test on the second test set according to the second classification model obtained through the analysis.
PCT/CN2017/070464 2016-05-06 2017-01-06 Text data classification method and server WO2017190527A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610296812.4 2016-05-06
CN201610296812.4A 2016-05-06 2016-05-06 Text data classification method and server

Publications (1)

Publication Number Publication Date
WO2017190527A1 true WO2017190527A1 (zh) 2017-11-09

Family

ID=60202712

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/070464 WO2017190527A1 (zh) 2016-05-06 2017-01-06 Text data classification method and server

Country Status (2)

Country Link
CN (1) CN107346433B (zh)
WO (1) WO2017190527A1 (zh)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800139A (zh) * 2018-12-18 2019-05-24 东软集团股份有限公司 Server health analysis method and apparatus, storage medium, and electronic device
CN110555431B (zh) * 2019-09-10 2022-12-13 杭州橙鹰数据技术有限公司 Image recognition method and apparatus
CN111625645B (zh) * 2020-05-14 2023-05-23 北京字节跳动网络技术有限公司 Training method and apparatus for a text generation model, and electronic device
CN113743425A (zh) * 2020-05-27 2021-12-03 北京沃东天骏信息技术有限公司 Method and apparatus for generating a classification model
CN111708888B (zh) * 2020-06-16 2023-10-24 腾讯科技(深圳)有限公司 Artificial intelligence-based classification method and apparatus, terminal, and storage medium
CN112037911B (zh) * 2020-08-28 2024-03-05 北京万灵盘古科技有限公司 Machine learning-based mental assessment screening system and training method therefor


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101902523B (zh) * 2010-07-09 2014-07-16 中兴通讯股份有限公司 Mobile terminal and short message filtering method thereof
CN103699523B (zh) * 2013-12-16 2016-06-29 深圳先进技术研究院 Product classification method and apparatus
CN104239900B (zh) * 2014-09-11 2017-03-29 西安电子科技大学 Polarimetric SAR image classification method based on K-means and deep SVM
CN104866869B (zh) * 2015-05-29 2018-12-14 武汉大学 Time-series SAR image classification method based on distribution difference and incremental learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1746900A (zh) * 2005-09-23 2006-03-15 上海交通大学 Fast face recognition method based on iterative feature selection
US20070239638A1 (en) * 2006-03-20 2007-10-11 Microsoft Corporation Text classification by weighted proximal support vector machine
CN104834940A (zh) * 2015-05-12 2015-08-12 杭州电子科技大学 Disease classification method for medical imaging examination based on support vector machine
CN104951809A (zh) * 2015-07-14 2015-09-30 西安电子科技大学 Imbalanced data classification method based on imbalanced classification metrics and ensemble learning
CN105184316A (zh) * 2015-08-28 2015-12-23 国网智能电网研究院 Support vector machine power grid service classification method based on feature weight learning

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107908774A (zh) * 2017-11-30 2018-04-13 云易天成(北京)安全科技开发有限公司 File classification method, storage medium, and device
CN108053251A (zh) * 2017-12-18 2018-05-18 北京小度信息科技有限公司 Information processing method and apparatus, electronic device, and computer-readable storage medium
CN108053251B (zh) * 2017-12-18 2021-03-02 北京小度信息科技有限公司 Information processing method and apparatus, electronic device, and computer-readable storage medium
CN109284285A (zh) * 2018-09-07 2019-01-29 平安科技(深圳)有限公司 Data processing method and apparatus, computer device, and computer-readable storage medium
CN109284285B (zh) * 2018-09-07 2024-05-28 平安科技(深圳)有限公司 Data processing method and apparatus, computer device, and computer-readable storage medium
WO2020057413A1 (zh) * 2018-09-17 2020-03-26 阿里巴巴集团控股有限公司 Spam text identification method and apparatus, computing device, and readable storage medium
CN111611353B (zh) * 2019-02-25 2023-08-18 北京嘀嘀无限科技发展有限公司 Screening method and apparatus, electronic device, and computer-readable storage medium
CN111611353A (zh) * 2019-02-25 2020-09-01 北京嘀嘀无限科技发展有限公司 Screening method and apparatus, electronic device, and computer-readable storage medium
CN110377727A (zh) * 2019-06-06 2019-10-25 深思考人工智能机器人科技(北京)有限公司 Multi-label text classification method and apparatus based on multi-task learning
CN110377727B (zh) * 2019-06-06 2022-06-17 深思考人工智能机器人科技(北京)有限公司 Multi-label text classification method and apparatus based on multi-task learning
CN112632971B (zh) * 2020-12-18 2023-08-25 上海明略人工智能(集团)有限公司 Word vector training method and system for entity matching
CN112632971A (zh) * 2020-12-18 2021-04-09 上海明略人工智能(集团)有限公司 Word vector training method and system for entity matching
CN112989761A (zh) * 2021-05-20 2021-06-18 腾讯科技(深圳)有限公司 Text classification method and apparatus
CN113378950A (zh) * 2021-06-22 2021-09-10 深圳市查策网络信息技术有限公司 Unsupervised classification method for long texts

Also Published As

Publication number Publication date
CN107346433B (zh) 2020-09-18
CN107346433A (zh) 2017-11-14

Similar Documents

Publication Publication Date Title
WO2017190527A1 (zh) 2017-11-09 Text data classification method and server
WO2018086470A1 (zh) 2018-05-17 Keyword extraction method, apparatus, and server
WO2023060795A1 (zh) 2023-04-20 Automatic keyword extraction method and apparatus, device, and storage medium
US11288573B2 (en) 2022-03-29 Method and system for training and neural network models for large number of discrete features for information rertieval
WO2019153551A1 (zh) 2019-08-15 Article classification method and apparatus, computer device, and storage medium
US20200097709A1 (en) 2020-03-26 Classification model training method, server, and storage medium
CN109471944B (zh) 2021-07-09 Training method and apparatus for a text classification model, and readable storage medium
US10637826B1 (en) 2020-04-28 Policy compliance verification using semantic distance and nearest neighbor search of labeled content
KR101715432B1 (ko) 2017-03-22 Word pair acquisition apparatus, word pair acquisition method, and recording medium
US20170330054A1 (en) 2017-11-16 Method And Apparatus Of Establishing Image Search Relevance Prediction Model, And Image Search Method And Apparatus
JP2019185716A (ja) 2019-10-24 Entity recommendation method and apparatus
CN107862022B (zh) 2021-11-05 Cultural resource recommendation system
WO2017097231A1 (zh) 2017-06-15 Topic processing method and apparatus
WO2016180270A1 (zh) 2016-11-17 Web page classification method and apparatus, computing device, and machine-readable storage medium
WO2015135452A1 (en) 2015-09-17 Text information processing method and apparatus
CN105528422B (zh) 2019-03-22 Topic crawler processing method and apparatus
CN107180084B (zh) 2020-10-27 Lexicon updating method and apparatus
CN109145299A (zh) 2019-01-04 Text similarity determining method and apparatus, device, and storage medium
WO2012075884A1 (zh) 2012-06-14 Method and server for intelligent bookmark classification
CN110287409B (zh) 2023-05-12 Web page type identification method and apparatus
EP4258610A1 (en) 2023-10-11 Malicious traffic identification method and related apparatus
US20210004670A1 (en) 2021-01-07 Training digital content classification models utilizing batchwise weighted loss functions and scaled padding based on source density
WO2022121163A1 (zh) 2022-06-16 User behavior tendency identification method and apparatus, device, and storage medium
WO2021159655A1 (zh) 2021-08-20 Data attribute filling method and apparatus, device, and computer-readable storage medium
WO2023159756A1 (zh) 2023-08-31 Price data processing method and apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17792357

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17792357

Country of ref document: EP

Kind code of ref document: A1