WO2022166830A1 - Feature extraction method and apparatus for text classification - Google Patents

Feature extraction method and apparatus for text classification

Info

Publication number
WO2022166830A1
Authority
WO
WIPO (PCT)
Prior art keywords
optimal
classification accuracy
feature extraction
feature
value
Prior art date
Application number
PCT/CN2022/074714
Other languages
English (en)
French (fr)
Inventor
霍小倩
Original Assignee
北京紫光展锐通信技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京紫光展锐通信技术有限公司
Publication of WO2022166830A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Definitions

  • The invention relates to the field of computers, and in particular to a feature extraction method and apparatus for text classification.
  • Natural language processing (NLP) and information mining are currently key technologies for data management, and text classification is the operational basis of both.
  • Data and features determine the upper limit of machine learning; models and algorithms only approach that limit. Feature extraction from text is therefore a crucial step that directly affects the classification effect.
  • Commonly used methods include the chi-square test (chi2), the mutual information (MI) method, the information gain (IG) method, the gradient boosting decision tree (GBDT) method, supervised feature extraction methods and manual methods.
  • Different feature extraction methods yield different classification effects at different feature item dimensions.
  • Traditional feature extraction methods have large generalization errors and cannot meet the high accuracy requirements of text classification. How to improve the accuracy of text classification is therefore an urgent problem.
  • The present application provides a feature extraction method and apparatus for text classification, which help improve the accuracy of text classification.
  • In a first aspect, the present application provides a feature extraction method for text classification.
  • The method includes: performing feature extraction on text data using multiple feature extraction methods to obtain feature items; determining, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy rates of the multiple feature extraction methods under different feature item dimension values, where the feature item dimension value represents the number of feature items; determining, based on those classification accuracy rates, the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method; and determining, based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data.
  • By jointly considering the optimal classification accuracy rates and the optimal feature item dimension values of multiple feature extraction methods, the method determines the feature extraction method with the best classification effect and its corresponding feature item dimension value, which helps improve the accuracy of text classification.
  • The optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value at which that method attains its optimal classification accuracy rate; each optimal classification accuracy rate corresponds to one optimal feature item dimension value.
  • Determining the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, includes: comparing the optimal classification accuracy rates corresponding to the feature extraction methods to determine the optimal classification accuracy rate with the largest value, which may comprise one or more optimal classification accuracy rates; and, if it comprises a single optimal classification accuracy rate, taking the corresponding feature extraction method as the target feature extraction method and the corresponding optimal feature item dimension value as the target feature item dimension value. This helps improve the accuracy of text classification.
  • The method further includes: if the optimal classification accuracy rate with the largest value comprises multiple optimal classification accuracy rates, determining the smallest optimal feature item dimension value among the multiple optimal feature item dimension values corresponding to them; and taking the feature extraction method corresponding to that smallest value as the target feature extraction method and that smallest value as the target feature item dimension value. This helps improve the accuracy of text classification.
  • Alternatively, determining the target feature extraction method and the target feature item dimension value includes: comparing the optimal classification accuracy rates corresponding to the feature extraction methods to determine the optimal classification accuracy rate with the largest value; determining a first optimal classification accuracy rate, whose difference from the optimal classification accuracy rate with the largest value is within a preset range; comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy rate with the optimal feature item dimension value corresponding to the optimal classification accuracy rate with the largest value to determine the smallest optimal feature item dimension value; and taking the feature extraction method corresponding to that smallest value as the target feature extraction method and that smallest value as the target feature item dimension value; wherein the first optimal classification accuracy rate comprises one or more optimal classification accuracy rates.
  • Alternatively, determining the target feature extraction method and the target feature item dimension value includes: comparing the optimal classification accuracy rates corresponding to the feature extraction methods to determine the optimal classification accuracy rates whose values are greater than a preset threshold, which may comprise one or more optimal classification accuracy rates; comparing the optimal feature item dimension values corresponding to those optimal classification accuracy rates to determine the smallest optimal feature item dimension value; and taking the feature extraction method corresponding to that smallest value as the target feature extraction method and that smallest value as the target feature item dimension value.
  • The present application provides a processing apparatus that includes a processing unit and a determination unit, which are configured to execute the method in the first aspect or any possible implementation thereof.
  • The present application provides a chip that includes a processor and a communication interface, the processor being configured to cause the chip to perform the method in the first aspect or any possible implementation thereof.
  • The present application provides a module device that includes a communication module, a power module, a storage module and a chip, wherein: the power module provides electrical energy for the module device; the storage module stores data and instructions; the communication module is used for internal communication of the module device, or for communication between the module device and external devices; and the chip is used to perform the method in the first aspect or any possible implementation thereof.
  • An embodiment of the present invention discloses an electronic device that includes a memory and a processor, the memory being used to store a computer program that includes program instructions, and the processor being configured to call the program instructions to execute the method in the first aspect or any possible implementation thereof.
  • The present application provides a computer-readable storage medium storing computer-readable instructions that, when executed on a communication device, cause the communication device to perform the method in the first aspect or any possible implementation thereof.
  • The present application provides a computer program or computer program product comprising code or instructions that, when run on a computer, cause the computer to execute the method in the first aspect or any possible implementation thereof.
  • FIG. 1 is a flowchart of a feature extraction method for text classification provided by an embodiment of the present application;
  • FIG. 2 is a line graph, provided by an embodiment of the present application, showing how the classification effect of different feature extraction methods varies with the feature item dimension value;
  • FIG. 3 is a bar graph comparing the optimal classification effects of different feature extraction methods provided by an embodiment of the present application.
  • FIG. 4 is a flowchart of another feature extraction method for text classification provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of another feature extraction method for text classification provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a processing device provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a modular device provided by an embodiment of the present application.
  • The method proposed in this application may be executed by an electronic device. The electronic device may be a terminal device, also called a terminal: a device with a wireless transceiver function that can be deployed on land (indoors or outdoors, handheld or vehicle-mounted), on water (for example on ships), or in the air (for example on aircraft, balloons and satellites).
  • The electronic device may be a user equipment (UE), where the UE includes a handheld device, a vehicle-mounted device, a wearable device, or a computing device with a wireless communication function.
  • For example, the UE may be a mobile phone, a tablet computer, or a computer with a wireless transceiver function.
  • The electronic device may also be a virtual reality (VR) electronic device, an augmented reality (AR) electronic device, a wireless terminal in industrial control, a wireless terminal in unmanned driving, a wireless terminal in telemedicine, a wireless terminal in a smart grid, a wireless terminal in a smart city, a wireless terminal in a smart home, and so on.
  • The apparatus that implements the function of the electronic device may be a terminal, or may be an apparatus capable of supporting the electronic device in realizing that function, such as a chip system, which may be installed in the electronic device.
  • The chip system may be composed of chips, or may include chips and other discrete devices.
  • The embodiments of the present application provide a feature extraction method and apparatus for text classification.
  • The feature extraction method for text classification is described in detail below.
  • FIG. 1 is a flowchart of a feature extraction method for text classification provided by an embodiment of the present application.
  • The feature extraction method for text classification includes steps 101 to 104.
  • The method shown in FIG. 1 may be executed by an electronic device or by a chip in the electronic device.
  • The following description of FIG. 1 takes an electronic device as the executing entity, wherein:
  • Step 101: The electronic device performs feature extraction on the text data using multiple feature extraction methods to obtain feature items.
  • Commonly used feature extraction methods include the chi-square test, the mutual information method, the information gain method, the gradient boosting decision tree method, and supervised feature extraction methods. Other feature extraction methods may also be used to perform feature extraction on the text data, which is not limited in this embodiment of the present application. This makes it convenient to subsequently process and analyze the feature items obtained by the different feature extraction methods.
  • The following description mainly takes as an example the case in which the electronic device selects the chi-square test, the mutual information method, the information gain method and the gradient boosting decision tree method to perform feature extraction on the text data and obtains the feature items extracted by each method.
  • The text data may be one complete piece of text data, or one of the parts into which a piece of text data has been divided.
  • For example, a piece of text data is divided into three parts, and feature items are obtained by applying multiple feature extraction methods to one of the parts. This makes it convenient to subsequently determine the optimal feature extraction method for each part of the text data, so as to improve the classification accuracy for the text data as a whole.
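  • As an illustration of one of the scoring methods named in step 101, the following is a minimal sketch of chi-square term scoring using the standard 2x2 contingency-table statistic. The source does not give a formula or an implementation; the function names `chi2_score` and `top_k_terms` and all counts are hypothetical.

```python
def chi2_score(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or the class is constant
    return n * (a * d - b * c) ** 2 / denom


def top_k_terms(tables: dict[str, tuple[int, int, int, int]], k: int) -> list[str]:
    """Return the k terms with the highest score (ties broken alphabetically)."""
    ranked = sorted(tables, key=lambda t: (-chi2_score(*tables[t]), t))
    return ranked[:k]


# Hypothetical counts for three terms in a 100-document corpus (50 per class).
tables = {
    "refund": (40, 5, 10, 45),
    "hello": (25, 25, 25, 25),
    "invoice": (30, 10, 20, 40),
}
print(top_k_terms(tables, 2))
```

  • Here `k` plays the role of the feature item dimension value: it is the number of feature items kept.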
  • Step 102: The electronic device determines the classification accuracy rates of the multiple feature extraction methods under different feature item dimension values.
  • The feature item dimension value represents the number of feature items, and each feature extraction method has a corresponding classification accuracy rate at each feature item dimension value.
  • A classifier can be used to evaluate and optimize the feature items. In this way, the classification accuracy rates of the feature extraction methods under different feature item dimension values can be determined.
  • Abbreviations: chi2, chi-square test; MI, mutual information method; IG, information gain method; GBDT, gradient boosting decision tree method; RF, random forest.
  • The range of feature item dimension values may be set arbitrarily, the selected feature extraction methods may be any feature extraction methods, and the step size by which the feature item dimension value changes may be any value; none of these is limited here.
  • The classifier used to test the classification effect may be any classifier, which is not limited here.
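  • The sweep in step 102 can be sketched as a grid over methods and dimension values. This is an illustrative outline only: `accuracy_grid` is a hypothetical helper, and `fake_evaluate` is a stand-in for "keep the top-k feature items, train a classifier, measure accuracy", with invented numbers.

```python
from typing import Callable


def accuracy_grid(
    methods: list[str],
    dims: range,
    evaluate: Callable[[str, int], float],
) -> dict[str, dict[int, float]]:
    """Record the classification accuracy of every method at every dimension value."""
    return {m: {k: evaluate(m, k) for k in dims} for m in methods}


def fake_evaluate(method: str, k: int) -> float:
    """Placeholder scorer; a real one would train and test a classifier."""
    peak = {"chi2": 25, "MI": 55}[method]  # invented peak positions
    return round(0.9 - abs(k - peak) / 1000, 3)


# Dimension values 5, 10, ..., 100; the range and step size are free choices.
grid = accuracy_grid(["chi2", "MI"], range(5, 101, 5), fake_evaluate)
print(grid["chi2"][25], grid["MI"][55])
```

  • Passing the scorer in as a callable keeps the sweep independent of the classifier, in line with the statement that any classifier may be used.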
  • Step 103: The electronic device determines, based on the classification accuracy rates of the multiple feature extraction methods, the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method.
  • The optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value at which that method attains its optimal classification accuracy rate; each optimal classification accuracy rate corresponds to one optimal feature item dimension value.
  • For example, the electronic device selects the chi-square test (chi2), the mutual information method (MI), the information gain method (IG) and the gradient boosting decision tree method (GBDT) to perform feature extraction on the text data, and uses a random forest as the classifier to test the classification effect.
  • The optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method can be determined from FIG. 2 and are shown in FIG. 3.
  • The optimal classification accuracy rate of the chi-square test is 92%, and its optimal feature item dimension value is 25;
  • the optimal classification accuracy rate of the mutual information method is 90%, and its optimal feature item dimension value is 55;
  • the optimal classification accuracy rate of the information gain method is 92%, and its optimal feature item dimension value is 55;
  • the optimal classification accuracy rate of the gradient boosting decision tree method is 92%, and its optimal feature item dimension value is 50.
  • FIG. 3 also includes the manual extraction method (byman), whose optimal classification accuracy rate is 84% and whose optimal feature item dimension value is 80. The optimal classification accuracy rates of the chi-square test, the mutual information method, the information gain method and the gradient boosting decision tree method are all higher than that of the manual extraction method.
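  • Step 103 reduces each method's accuracy curve to a single (optimal accuracy, optimal dimension value) pair. A minimal sketch follows, assuming the curve is available as a mapping from dimension value to accuracy. `best_point` is a hypothetical name, and the curve points other than the quoted peaks (92% at 25 for chi2, 90% at 55 for MI) are invented.

```python
def best_point(curve: dict[int, float]) -> tuple[float, int]:
    """Return (optimal accuracy, optimal dimension value) for one method's curve.

    Assumption: if several dimension values reach the same top accuracy,
    the smallest one is kept. The source does not say how to break such ties.
    """
    best_acc = max(curve.values())
    best_dim = min(k for k, acc in curve.items() if acc == best_acc)
    return best_acc, best_dim


# Invented curves; only the peak values are taken from the example above.
curves = {
    "chi2": {15: 0.89, 25: 0.92, 35: 0.91, 55: 0.90},
    "MI": {15: 0.80, 25: 0.85, 35: 0.88, 55: 0.90},
}
optima = {name: best_point(curve) for name, curve in curves.items()}
print(optima)
```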
  • Step 104: The electronic device determines the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method.
  • The target feature extraction method and the target feature item dimension value may be determined as follows: the electronic device compares the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, takes the feature extraction method corresponding to the optimal classification accuracy rate with the largest value as the target feature extraction method for the text data, and takes the optimal feature item dimension value corresponding to that accuracy rate as the target feature item dimension value.
  • Alternatively, the electronic device compares the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, determines the optimal classification accuracy rates greater than a preset threshold, and compares the optimal feature item dimension values corresponding to them; it then takes the feature extraction method corresponding to the smallest of those optimal feature item dimension values as the target feature extraction method for the text data, and that smallest value as the target feature item dimension value.
  • Other methods may also be used to determine the target feature extraction method and the target feature item dimension value corresponding to the text data, which is not limited herein.
  • By jointly considering the optimal classification accuracy rates and the optimal feature item dimension values of multiple feature extraction methods, the electronic device determines, for the text data, the feature extraction method with the best classification effect and the feature item dimension value corresponding to that method. The method described in FIG. 1 therefore helps improve the accuracy of text classification.
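  • The threshold-based variant of step 104 can be sketched as follows. The per-method optima are those quoted for FIG. 3 (the 92%/55 pair is read here as the information gain method); the threshold of 91 percentage points and the name `select_by_threshold` are assumptions made for illustration, since the source leaves the threshold unspecified.

```python
from typing import Optional


def select_by_threshold(
    optima: dict[str, tuple[int, int]], threshold: int
) -> Optional[tuple[str, int]]:
    """Among methods whose optimal accuracy exceeds the threshold,
    pick the one with the smallest optimal dimension value."""
    candidates = {m: dim for m, (acc, dim) in optima.items() if acc > threshold}
    if not candidates:
        return None  # no method clears the threshold
    method = min(candidates, key=candidates.get)
    return method, candidates[method]


# (optimal accuracy in percent, optimal feature item dimension value)
fig3_optima = {
    "chi2": (92, 25),
    "MI": (90, 55),
    "IG": (92, 55),  # row inferred from context
    "GBDT": (92, 50),
    "byman": (84, 80),
}
print(select_by_threshold(fig3_optima, 91))  # hypothetical threshold
```

  • Accuracies are kept as integer percentages so that the comparison against the threshold is exact.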
  • FIG. 4 is a schematic flowchart of another feature extraction method for text classification provided by an embodiment of the present application.
  • The feature extraction method for text classification includes steps 401 to 407.
  • Steps 404 to 407 are a specific implementation of step 104 above.
  • The method shown in FIG. 4 may be executed by an electronic device or by a chip in the electronic device.
  • The following description of FIG. 4 takes an electronic device as the executing entity, wherein:
  • Step 401: The electronic device performs feature extraction on the text data using multiple feature extraction methods to obtain feature items.
  • Step 402: The electronic device determines the classification accuracy rates of the multiple feature extraction methods under different feature item dimension values.
  • Step 403: The electronic device determines, based on the classification accuracy rates of the multiple feature extraction methods, the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method.
  • The specific implementations of steps 401 to 403 are the same as those of steps 101 to 103 above and are not repeated here.
  • Step 404: The electronic device determines the optimal classification accuracy rate with the largest value by comparing the optimal classification accuracy rates corresponding to the feature extraction methods.
  • The optimal classification accuracy rate with the largest value may comprise one or more optimal classification accuracy rates. If it comprises a single optimal classification accuracy rate, step 405 is executed. If it comprises multiple optimal classification accuracy rates (for example, as in the embodiment shown in FIG. 3), steps 406 and 407 are executed.
  • For example, the electronic device selects the chi-square test (chi2), the mutual information method (MI), the information gain method (IG) and the gradient boosting decision tree method (GBDT) to perform feature extraction on the text data. As shown in FIG. 4, by comparing the optimal classification accuracy rates corresponding to the feature extraction methods, the electronic device determines three optimal classification accuracy rates with the largest value, all of which are 92%.
  • Step 405: The electronic device takes the feature extraction method corresponding to the optimal classification accuracy rate with the largest value as the target feature extraction method, and takes the optimal feature item dimension value corresponding to that accuracy rate as the target feature item dimension value.
  • For example, the electronic device selects the chi-square test (chi2), the mutual information method (MI), the information gain method (IG) and the gradient boosting decision tree method (GBDT) to perform feature extraction on the text data. The optimal classification accuracy rate of the chi-square test is 92%, and its optimal feature item dimension value is 25; the optimal classification accuracy rate of the mutual information method is 90%, and its optimal feature item dimension value is 55; the optimal classification accuracy rate of the information gain method is 91%, and the optimal feature item dimension value is 50.
  • The feature extraction method corresponding to the optimal classification accuracy rate with the largest value is the chi-square test, and the corresponding optimal feature item dimension value is 25, so the electronic device takes the chi-square test as the target feature extraction method and 25 as the target feature item dimension value.
  • Step 406: The electronic device determines, among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracy rates, the optimal feature item dimension value with the smallest value.
  • Step 407: The electronic device takes the feature extraction method corresponding to the optimal feature item dimension value with the smallest value as the target feature extraction method, and takes that smallest value as the target feature item dimension value.
  • The smallest optimal feature item dimension value is chosen because a lower dimension is more convenient for computation and visualization, and favors extracting and synthesizing effective information while discarding useless information. This helps improve the accuracy of text classification.
  • For example, the electronic device selects the chi-square test (chi2), the mutual information method (MI), the information gain method (IG) and the gradient boosting decision tree method (GBDT) to perform feature extraction on the text data. The optimal classification accuracy rate of the chi-square test is 92%, and its optimal feature item dimension value is 25; the optimal classification accuracy rate of the mutual information method is 90%, and its optimal feature item dimension value is 55; the optimal classification accuracy rate of the information gain method is 92%, and the optimal feature item dimension value is 50.
  • The two optimal classification accuracy rates of 92% correspond to optimal feature item dimension values of 25 and 50 respectively, so the optimal feature item dimension value with the smallest value is 25.
  • The feature extraction method corresponding to the optimal feature item dimension value with the smallest value is the chi-square test, so the electronic device takes the chi-square test as the target feature extraction method and 25 as the target feature item dimension value.
  • By considering the optimal classification accuracy rates and the optimal feature item dimension values of the various feature extraction methods, the electronic device determines the optimal classification accuracy rate with the largest value for the text data, and thereby determines the target feature extraction method and the target feature item dimension value.
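  • Steps 404 to 407 amount to "largest optimal accuracy first, smallest optimal dimension value as the tie-breaker". A minimal sketch under that reading follows; `select_target` is a hypothetical name, the data are the FIG. 3 optima quoted earlier (the 92%/55 pair read as the information gain method), and the behaviour when tied methods also share the same dimension value is not specified in the source (this sketch keeps the first one encountered).

```python
def select_target(optima: dict[str, tuple[int, int]]) -> tuple[str, int]:
    """Pick the method with the largest optimal accuracy; if several methods
    share it, pick the one with the smallest optimal dimension value."""
    top_acc = max(acc for acc, _ in optima.values())                   # step 404
    tied = [m for m, (acc, _) in optima.items() if acc == top_acc]
    # A single entry in `tied` is step 405; several entries are steps 406-407.
    method = min(tied, key=lambda m: optima[m][1])
    return method, optima[method][1]


# (optimal accuracy in percent, optimal feature item dimension value)
fig3_optima = {
    "chi2": (92, 25),
    "MI": (90, 55),
    "IG": (92, 55),  # row inferred from context
    "GBDT": (92, 50),
}
print(select_target(fig3_optima))
```

  • With three methods tied at 92%, the sketch returns the chi-square test with dimension value 25, matching the outcome described for the chi-square test above.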
  • FIG. 5 is a schematic flowchart of another feature extraction method for text classification provided by an embodiment of the present application.
  • The feature extraction method for text classification includes steps 501 to 507.
  • Steps 504 to 507 are a specific implementation of step 104 above.
  • The method shown in FIG. 5 may be executed by an electronic device or by a chip in the electronic device.
  • The following description of FIG. 5 takes an electronic device as the executing entity, wherein:
  • Step 501: The electronic device performs feature extraction on the text data using multiple feature extraction methods to obtain feature items.
  • Step 502: The electronic device determines the classification accuracy rates of the multiple feature extraction methods under different feature item dimension values.
  • Step 503: The electronic device determines, based on the classification accuracy rates of the multiple feature extraction methods, the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method.
  • The specific implementations of steps 501 to 503 are the same as those of steps 401 to 403 above and are not repeated here.
  • Step 504: The electronic device determines the optimal classification accuracy rate with the largest value by comparing the optimal classification accuracy rates corresponding to the feature extraction methods.
  • Step 505: The electronic device determines the first optimal classification accuracy rate.
  • The difference between the first optimal classification accuracy rate and the optimal classification accuracy rate with the largest value is within a preset range.
  • The first optimal classification accuracy rate may comprise one or more optimal classification accuracy rates.
  • The preset range may be any range, which is not limited here.
  • Step 506: The electronic device determines the optimal feature item dimension value with the smallest value by comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy rate with the optimal feature item dimension value corresponding to the optimal classification accuracy rate with the largest value.
  • Step 507: The electronic device takes the feature extraction method corresponding to the optimal feature item dimension value with the smallest value as the target feature extraction method, and takes that smallest value as the target feature item dimension value.
  • For example, the electronic device selects the chi-square test (chi2), the mutual information method (MI), the information gain method (IG), and the gradient boosting decision tree method (GBDT) for feature extraction on the text data. The optimal classification accuracy rate of the chi-square test is 92%, and its optimal feature item dimension value is 25;
  • the optimal classification accuracy rate of the mutual information method is 86%, and its optimal feature item dimension value is 55;
  • the optimal classification accuracy rate of the information gain method is 90%, and its optimal feature item dimension value is 15.
  • The electronic device determines that the optimal classification accuracy rate with the largest value is 92%. Only one other optimal classification accuracy rate, 90%, differs from it by an amount within the preset range, so the electronic device determines that the first optimal classification accuracy rate consists of a single rate, 90%.
  • The optimal feature item dimension value corresponding to the first optimal classification accuracy rate is 15, and the optimal feature item dimension value corresponding to the optimal classification accuracy rate with the largest value is 25, so the optimal feature item dimension value with the smallest value is determined to be 15.
  • The feature extraction method corresponding to the optimal feature item dimension value with the smallest value is the information gain method, so the electronic device uses the information gain method as the target feature extraction method and takes 15 as the target feature item dimension value.
  • For another example, the electronic device selects the chi-square test (chi2), the mutual information method (MI), the information gain method (IG), and the gradient boosting decision tree method (GBDT) for feature extraction on the text data. The optimal classification accuracy rate of the chi-square test is 92%, and its optimal feature item dimension value is 25;
  • the optimal classification accuracy rate of the mutual information method is 90%, and its optimal feature item dimension value is 55;
  • the optimal classification accuracy rate of the information gain method is 90%, and its optimal feature item dimension value is 15.
  • The electronic device determines that the optimal classification accuracy rate with the largest value is 92%. The optimal classification accuracy rates of the mutual information method and the information gain method, both 90%, differ from it by an amount within the preset range, so the first optimal classification accuracy rate includes these two rates.
  • The optimal feature item dimension values corresponding to the first optimal classification accuracy rate are 55 and 15 respectively, and the optimal feature item dimension value corresponding to the optimal classification accuracy rate with the largest value is 25, so the optimal feature item dimension value with the smallest value is 15.
  • The feature extraction method corresponding to the optimal feature item dimension value with the smallest value is the information gain method, so the electronic device uses the information gain method as the target feature extraction method and takes 15 as the target feature item dimension value.
  • By comprehensively considering the optimal classification accuracy rate and the optimal feature item dimension value of the multiple feature extraction methods, the electronic device determines, for the text data, the feature extraction method with the best classification effect and the feature item dimension value corresponding to that feature extraction method. Therefore, the method described in FIG. 5 helps improve the accuracy of text classification.
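  • Steps 504 to 507 can be sketched in a few lines of Python and checked against the two worked examples. This is only an illustration under stated assumptions: the function name `select_within_range` is hypothetical, accuracy rates are held as integer percentages, and the preset range is taken to be 2 percentage points, a value the description does not give (it only says the range may be any range). GBDT is omitted because the examples state no figures for it.

```python
from typing import Dict, Tuple


def select_within_range(
    optima: Dict[str, Tuple[int, int]], preset_range: int
) -> Tuple[str, int]:
    """optima maps method -> (optimal accuracy in %, optimal dimension value)."""
    # Step 504: the optimal classification accuracy rate with the largest value.
    largest = max(acc for acc, _ in optima.values())
    # Step 505, together with the largest rate itself: keep every method whose
    # optimal accuracy differs from the largest by no more than the preset range.
    candidates = {
        method: dim
        for method, (acc, dim) in optima.items()
        if largest - acc <= preset_range
    }
    # Steps 506 and 507: the smallest optimal dimension value among the
    # candidates wins. If two candidates share it, the first one inserted is
    # returned (assumption; the description does not cover this case).
    method = min(candidates, key=candidates.get)
    return method, candidates[method]


# Figures taken from the two examples in the description.
example_1 = {"chi2": (92, 25), "MI": (86, 55), "IG": (90, 15)}
example_2 = {"chi2": (92, 25), "MI": (90, 55), "IG": (90, 15)}

print(select_within_range(example_1, preset_range=2))  # ('IG', 15)
print(select_within_range(example_2, preset_range=2))  # ('IG', 15)
```

  • With a preset range of 0 the same function falls back to picking the single most accurate method, which for the first example is the chi-square test with a dimension value of 25.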
  • FIG. 6 is a schematic structural diagram of a processing apparatus provided by an embodiment of the present invention.
  • The processing apparatus may be an electronic device or a device (such as a chip) having the function of an electronic device.
  • The processing apparatus 60 may include:
  • a processing unit 601, configured to perform feature extraction on the text data by using multiple feature extraction methods to obtain feature items;
  • a determining unit 602, configured to determine, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy rates of the multiple feature extraction methods under different feature item dimension values, where the feature item dimension value represents the number of feature items;
  • the determining unit 602 is further configured to determine the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the multiple feature extraction methods;
  • the determining unit 602 is further configured to determine a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy rate corresponding to each feature extraction method and the optimal feature item dimension value.
  • the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy rate of each feature extraction method, and an optimal classification accuracy rate corresponds to an optimal feature item dimension value.
  • In one implementation, the determining unit 602 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, specifically as follows: by comparing the optimal classification accuracy rates corresponding to the feature extraction methods, determine the optimal classification accuracy rate with the largest value, where the optimal classification accuracy rate with the largest value includes one or more optimal classification accuracy rates; if the optimal classification accuracy rate with the largest value includes one optimal classification accuracy rate, use the feature extraction method corresponding to the optimal classification accuracy rate with the largest value as the target feature extraction method, and use the optimal feature item dimension value corresponding to the optimal classification accuracy rate with the largest value as the target feature item dimension value.
  • The determining unit 602 is further configured to: if the optimal classification accuracy rate with the largest value includes multiple optimal classification accuracy rates, determine the optimal feature item dimension value with the smallest value among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracy rates; use the feature extraction method corresponding to the optimal feature item dimension value with the smallest value as the target feature extraction method, and use the optimal feature item dimension value with the smallest value as the target feature item dimension value.
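  • The rule above (take the largest optimal accuracy, and break ties by the smallest optimal dimension value) can be sketched as follows. The function name `select_by_max_accuracy` and the figures in `tied` are hypothetical and serve only as an illustration; accuracy rates are held as integer percentages so that the equality test for ties is exact, and breaking a remaining tie on dimension value by method name is an assumption.

```python
from typing import Dict, Tuple


def select_by_max_accuracy(optima: Dict[str, Tuple[int, int]]) -> Tuple[str, int]:
    """optima maps method -> (optimal accuracy in %, optimal dimension value)."""
    largest = max(acc for acc, _ in optima.values())
    # All methods that reach the largest optimal accuracy (one or more).
    tied = [(dim, method) for method, (acc, dim) in optima.items() if acc == largest]
    # With a single entry this simply returns it; with several, the smallest
    # dimension value wins, and equal dimension values fall back to the
    # alphabetical order of the method names (assumption).
    dim, method = min(tied)
    return method, dim


# Hypothetical figures: two methods share the largest accuracy of 92%.
tied = {"chi2": (92, 25), "MI": (92, 20), "IG": (90, 15)}
# A single method holds the largest accuracy.
single = {"chi2": (92, 25), "MI": (86, 55), "IG": (90, 15)}

print(select_by_max_accuracy(tied))
print(select_by_max_accuracy(single))
```

  • Unlike the preset-range variant, this rule never trades accuracy for a smaller dimension value: the information gain method's dimension value of 15 is ignored in both inputs because its accuracy is below the maximum.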
  • In one implementation, the determining unit 602 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, specifically as follows: by comparing the optimal classification accuracy rates corresponding to the feature extraction methods, determine the optimal classification accuracy rate with the largest value; determine a first optimal classification accuracy rate, where the difference between the first optimal classification accuracy rate and the optimal classification accuracy rate with the largest value is within a preset range; by comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy rate with the optimal feature item dimension value corresponding to the optimal classification accuracy rate with the largest value, determine the optimal feature item dimension value with the smallest value; use the feature extraction method corresponding to the optimal feature item dimension value with the smallest value as the target feature extraction method, and use the optimal feature item dimension value with the smallest value as the target feature item dimension value.
  • The first optimal classification accuracy rate includes one or more optimal classification accuracy rates.
  • In one implementation, the determining unit 602 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, specifically as follows: by comparing the optimal classification accuracy rates corresponding to the feature extraction methods, determine the optimal classification accuracy rates whose values are greater than a preset threshold, where the optimal classification accuracy rates whose values are greater than the preset threshold include one or more optimal classification accuracy rates; by comparing the optimal feature item dimension values corresponding to the optimal classification accuracy rates whose values are greater than the preset threshold, determine the optimal feature item dimension value with the smallest value; use the feature extraction method corresponding to the optimal feature item dimension value with the smallest value as the target feature extraction method, and use the optimal feature item dimension value with the smallest value as the target feature item dimension value.
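  • The preset-threshold variant can be sketched in the same style. The function name `select_above_threshold` and the threshold values 88 and 95 are hypothetical; the accuracy and dimension figures reuse the first worked example. Returning `None` when no method exceeds the threshold is an assumption, since the description does not say what happens in that case.

```python
from typing import Dict, Optional, Tuple


def select_above_threshold(
    optima: Dict[str, Tuple[int, int]], threshold: int
) -> Optional[Tuple[str, int]]:
    """optima maps method -> (optimal accuracy in %, optimal dimension value)."""
    # Keep only methods whose optimal accuracy is greater than the threshold.
    qualifying = {
        method: dim for method, (acc, dim) in optima.items() if acc > threshold
    }
    if not qualifying:
        return None  # assumption: the description does not cover this case
    # Among the qualifying methods, the smallest optimal dimension value wins.
    method = min(qualifying, key=qualifying.get)
    return method, qualifying[method]


optima = {"chi2": (92, 25), "MI": (86, 55), "IG": (90, 15)}

print(select_above_threshold(optima, threshold=88))
print(select_above_threshold(optima, threshold=95))
```

  • Here the threshold is absolute, whereas the preset-range variant measures each rate against the best rate observed, so the two can select different methods on the same data.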
  • This apparatus embodiment and the method embodiments shown in FIG. 1, FIG. 4 and FIG. 5 are based on the same concept and bring the same technical effects.
  • For the specific principles, refer to the descriptions of the embodiments shown in FIG. 1, FIG. 4 and FIG. 5; details are not repeated here.
  • FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
  • the electronic device 70 may include a memory 701, a processor 702 and a communication interface 703, the memory 701, the processor 702 and the communication interface 703 being connected by one or more communication buses.
  • the communication interface 703 is controlled by the processor 702 for sending and receiving information.
  • The memory 701 may include read-only memory and random access memory, and provides instructions and data to the processor 702.
  • A portion of the memory 701 may also include non-volatile random access memory.
  • The processor 702 may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • A general-purpose processor may be a microprocessor; alternatively, the processor 702 may be any conventional processor or the like. Specifically:
  • the memory 701 is used to store program instructions.
  • the processor 702 is used for calling the program instructions stored in the memory 701 .
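  • The processor 702 invokes the program instructions stored in the memory 701 to make the electronic device 70 perform the following operations:
  • performing feature extraction on text data by using multiple feature extraction methods to obtain feature items;
  • determining, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy rates of the multiple feature extraction methods under different feature item dimension values, where the feature item dimension value represents the number of feature items;
  • determining, based on the classification accuracy rates of the multiple feature extraction methods, the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method;
  • determining, based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data.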
  • The optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy rate of that feature extraction method, and one optimal classification accuracy rate corresponds to one optimal feature item dimension value.
  • In one implementation, the processor 702 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, specifically as follows: by comparing the optimal classification accuracy rates corresponding to the feature extraction methods, determine the optimal classification accuracy rate with the largest value, where the optimal classification accuracy rate with the largest value includes one or more optimal classification accuracy rates; if the optimal classification accuracy rate with the largest value includes one optimal classification accuracy rate, use the feature extraction method corresponding to the optimal classification accuracy rate with the largest value as the target feature extraction method, and use the optimal feature item dimension value corresponding to the optimal classification accuracy rate with the largest value as the target feature item dimension value.
  • The processor 702 is further configured to: if the optimal classification accuracy rate with the largest value includes multiple optimal classification accuracy rates, determine the optimal feature item dimension value with the smallest value among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracy rates; use the feature extraction method corresponding to the optimal feature item dimension value with the smallest value as the target feature extraction method, and use the optimal feature item dimension value with the smallest value as the target feature item dimension value. This approach helps improve the accuracy of text classification.
  • In one implementation, the processor 702 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, specifically as follows: by comparing the optimal classification accuracy rates corresponding to the feature extraction methods, determine the optimal classification accuracy rate with the largest value; determine a first optimal classification accuracy rate, where the difference between the first optimal classification accuracy rate and the optimal classification accuracy rate with the largest value is within a preset range; by comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy rate with the optimal feature item dimension value corresponding to the optimal classification accuracy rate with the largest value, determine the optimal feature item dimension value with the smallest value; use the feature extraction method corresponding to the optimal feature item dimension value with the smallest value as the target feature extraction method, and use the optimal feature item dimension value with the smallest value as the target feature item dimension value; where the first optimal classification accuracy rate includes one or more optimal classification accuracy rates.
  • In one implementation, the processor 702 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, specifically as follows: by comparing the optimal classification accuracy rates corresponding to the feature extraction methods, determine the optimal classification accuracy rates whose values are greater than a preset threshold, where the optimal classification accuracy rates whose values are greater than the preset threshold include one or more optimal classification accuracy rates; by comparing the optimal feature item dimension values corresponding to the optimal classification accuracy rates whose values are greater than the preset threshold, determine the optimal feature item dimension value with the smallest value; use the feature extraction method corresponding to the optimal feature item dimension value with the smallest value as the target feature extraction method, and use the optimal feature item dimension value with the smallest value as the target feature item dimension value.
  • the embodiments of the present application further provide a chip, and the chip can execute the relevant steps of the electronic device in the foregoing method embodiments.
  • The chip includes a processor and a communication interface, and the processor is configured to make the chip perform the following operations: performing feature extraction on text data by using multiple feature extraction methods to obtain feature items; determining, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy rates of the multiple feature extraction methods under different feature item dimension values, where the feature item dimension value represents the number of feature items; determining, based on the classification accuracy rates of the multiple feature extraction methods, the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method; and determining, based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data.
  • the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy rate of each feature extraction method, and an optimal classification accuracy rate corresponds to an optimal feature item dimension value.
  • In one implementation, when determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, the processor is configured to make the chip specifically perform the following operations: by comparing the optimal classification accuracy rates corresponding to the feature extraction methods, determine the optimal classification accuracy rate with the largest value, where the optimal classification accuracy rate with the largest value includes one or more optimal classification accuracy rates; if the optimal classification accuracy rate with the largest value includes one optimal classification accuracy rate, use the feature extraction method corresponding to the optimal classification accuracy rate with the largest value as the target feature extraction method, and use the optimal feature item dimension value corresponding to the optimal classification accuracy rate with the largest value as the target feature item dimension value.
  • The processor is further configured to make the chip perform the following operations: if the optimal classification accuracy rate with the largest value includes multiple optimal classification accuracy rates, determine the optimal feature item dimension value with the smallest value among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracy rates; use the feature extraction method corresponding to the optimal feature item dimension value with the smallest value as the target feature extraction method, and use the optimal feature item dimension value with the smallest value as the target feature item dimension value.
  • In one implementation, when determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, the processor is configured to make the chip specifically perform the following operations: by comparing the optimal classification accuracy rates corresponding to the feature extraction methods, determine the optimal classification accuracy rate with the largest value; determine a first optimal classification accuracy rate, where the difference between the first optimal classification accuracy rate and the optimal classification accuracy rate with the largest value is within a preset range; by comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy rate with the optimal feature item dimension value corresponding to the optimal classification accuracy rate with the largest value, determine the optimal feature item dimension value with the smallest value; use the feature extraction method corresponding to the optimal feature item dimension value with the smallest value as the target feature extraction method, and use the optimal feature item dimension value with the smallest value as the target feature item dimension value; where the first optimal classification accuracy rate includes one or more optimal classification accuracy rates.
  • In one implementation, when determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, the processor is configured to make the chip specifically perform the following operations: by comparing the optimal classification accuracy rates corresponding to the feature extraction methods, determine the optimal classification accuracy rates whose values are greater than a preset threshold, where the optimal classification accuracy rates whose values are greater than the preset threshold include one or more optimal classification accuracy rates; by comparing the optimal feature item dimension values corresponding to the optimal classification accuracy rates whose values are greater than the preset threshold, determine the optimal feature item dimension value with the smallest value; use the feature extraction method corresponding to the optimal feature item dimension value with the smallest value as the target feature extraction method, and use the optimal feature item dimension value with the smallest value as the target feature item dimension value.
  • The chip includes at least one processor, at least one first memory, and at least one second memory, where the at least one first memory and the at least one processor are interconnected through a line, and instructions are stored in the first memory; the at least one second memory and the at least one processor are interconnected through a line, and the data that needs to be stored in the foregoing method embodiments is stored in the second memory.
  • For each device or product applied to or integrated in the chip, each module contained therein may be implemented by hardware such as circuits; alternatively, at least some of the modules may be implemented by a software program that runs on a processor integrated inside the chip, and the remaining (if any) modules may be implemented by hardware such as circuits.
  • FIG. 8 is a schematic structural diagram of a module device provided by an embodiment of the present application.
  • The module device 80 can perform the relevant steps of the electronic device in the foregoing method embodiments, and the module device 80 includes a communication module 801, a power module 802, a storage module 803 and a chip 804.
  • The power module 802 is used to provide power for the module device; the storage module 803 is used to store data and instructions; the communication module 801 is used for internal communication within the module device, or for communication between the module device and an external device; the chip 804 is configured to perform the following operations: performing feature extraction on text data by using multiple feature extraction methods to obtain feature items; determining, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy rates of the multiple feature extraction methods under different feature item dimension values, where the feature item dimension value represents the number of feature items; determining, based on the classification accuracy rates of the multiple feature extraction methods, the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method; and determining, based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data.
  • the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy rate of each feature extraction method, and an optimal classification accuracy rate corresponds to an optimal feature item dimension value.
  • In one implementation, the chip 804 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, specifically as follows: by comparing the optimal classification accuracy rates corresponding to the feature extraction methods, determine the optimal classification accuracy rate with the largest value, where the optimal classification accuracy rate with the largest value includes one or more optimal classification accuracy rates; if the optimal classification accuracy rate with the largest value includes one optimal classification accuracy rate, use the feature extraction method corresponding to the optimal classification accuracy rate with the largest value as the target feature extraction method, and use the optimal feature item dimension value corresponding to the optimal classification accuracy rate with the largest value as the target feature item dimension value.
  • The chip 804 is further configured to: if the optimal classification accuracy rate with the largest value includes multiple optimal classification accuracy rates, determine the optimal feature item dimension value with the smallest value among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracy rates; use the feature extraction method corresponding to the optimal feature item dimension value with the smallest value as the target feature extraction method, and use the optimal feature item dimension value with the smallest value as the target feature item dimension value. This approach helps improve the accuracy of text classification.
  • In one implementation, the chip 804 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, specifically as follows: by comparing the optimal classification accuracy rates corresponding to the feature extraction methods, determine the optimal classification accuracy rate with the largest value; determine a first optimal classification accuracy rate, where the difference between the first optimal classification accuracy rate and the optimal classification accuracy rate with the largest value is within a preset range; by comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy rate with the optimal feature item dimension value corresponding to the optimal classification accuracy rate with the largest value, determine the optimal feature item dimension value with the smallest value; use the feature extraction method corresponding to the optimal feature item dimension value with the smallest value as the target feature extraction method, and use the optimal feature item dimension value with the smallest value as the target feature item dimension value; where the first optimal classification accuracy rate includes one or more optimal classification accuracy rates.
  • In one implementation, the chip 804 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy rate and the optimal feature item dimension value corresponding to each feature extraction method, specifically as follows: by comparing the optimal classification accuracy rates corresponding to the feature extraction methods, determine the optimal classification accuracy rates whose values are greater than a preset threshold, where the optimal classification accuracy rates whose values are greater than the preset threshold include one or more optimal classification accuracy rates; by comparing the optimal feature item dimension values corresponding to the optimal classification accuracy rates whose values are greater than the preset threshold, determine the optimal feature item dimension value with the smallest value; use the feature extraction method corresponding to the optimal feature item dimension value with the smallest value as the target feature extraction method, and use the optimal feature item dimension value with the smallest value as the target feature item dimension value.
  • For each device or product applied to or integrated in the module device, each module contained therein may be implemented by hardware such as circuits, and different modules may be located in the same component of the module device (for example, a chip or a circuit module) or in different components; alternatively, at least some of the modules may be implemented by a software program that runs on a processor integrated inside the module device, and the remaining (if any) modules may be implemented by hardware such as circuits.
  • Embodiments of the present application further provide a computer-readable storage medium in which instructions are stored; when the instructions are run on a processor, the method flow of the foregoing method embodiments is implemented.
  • Embodiments of the present application further provide a computer program product; when the computer program product runs on a processor, the method flow of the foregoing method embodiments is implemented.
  • each module/unit included in the apparatuses and products described in the foregoing embodiments may be a software module/unit, a hardware module/unit, or partly a software module/unit and partly a hardware module/unit.
  • for example, for each apparatus and product applied to or integrated in a chip, all of the modules/units it contains may be implemented in hardware such as circuits, or at least some of the modules/units may be implemented as a software program that runs on a processor integrated inside the chip, and the remaining modules/units (if any) may be implemented in hardware such as circuits; for each apparatus and product applied to or integrated in a chip module, all of the modules/units it contains may be implemented in hardware such as circuits, and different modules/units may be located in the same component of the chip module (for example, a chip or a circuit module) or in different components, or at least some of the modules/units may be implemented as a software program that runs on a processor integrated inside the chip module, and the remaining modules/units (if any) may be implemented in hardware such as circuits; for each apparatus and product applied to or integrated in a terminal, all of the modules/units it contains may be implemented in hardware such as circuits, and different modules/units may be located in the same component of the terminal (for example, a chip or a circuit module) or in different components, or at least some of the modules/units may be implemented as a software program that runs on a processor integrated inside the terminal, and the remaining modules/units (if any) may be implemented in hardware such as circuits.

Abstract

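A feature extraction method and apparatus for text classification are provided. The method includes: performing feature extraction on text data using multiple feature extraction methods to obtain feature items (101); determining, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy of the multiple feature extraction methods under different feature item dimension values, where the feature item dimension value indicates the number of feature items (102); determining, based on the classification accuracy of the multiple feature extraction methods, the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method (103); and determining, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data (104). The provided method helps improve the accuracy of text classification.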

Description

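Feature Extraction Method and Apparatus for Text Classification
Technical Field
The present invention relates to the computer field, and in particular to a feature extraction method and apparatus for text classification.
Background
Natural language processing (NLP) and information mining are key technologies in current data management, and text classification is the operational basis of these technologies. In text classification, the data and features determine the upper bound of what machine learning can achieve, while models and algorithms only approximate that bound. Feature extraction from text is therefore a crucial step that directly affects the classification result.
Commonly used traditional feature extraction methods for text classification include the chi-square test (chi2), mutual information (MI), information gain (IG), gradient boosting decision tree (GBDT) methods, supervised manual feature extraction methods, and manual methods. Different feature extraction methods produce different classification results under different feature item dimensions. However, traditional feature extraction methods can suffer from large generalization error and fail to meet the high accuracy requirements of text classification. How to improve the accuracy of text classification is therefore a problem that urgently needs to be solved.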
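Summary
This application provides a feature extraction method and apparatus for text classification, which help improve the accuracy of text classification.
In a first aspect, this application provides a feature extraction method for text classification. The method includes: performing feature extraction on text data using multiple feature extraction methods to obtain feature items; determining, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy of the multiple feature extraction methods under different feature item dimension values, where the feature item dimension value indicates the number of feature items; determining, based on the classification accuracy of the multiple feature extraction methods, the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method; and determining, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data.
With the method of the first aspect, the optimal classification accuracies and optimal feature item dimension values of multiple feature extraction methods are considered together to determine, for the text data, the feature extraction method with the best classification effect and the feature item dimension value corresponding to that method, which helps improve the accuracy of text classification.
With reference to the first aspect, in a possible implementation, the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy of that feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
With reference to the first aspect, in a possible implementation, determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method includes: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy, where the largest optimal classification accuracy includes one or more optimal classification accuracies; and if the largest optimal classification accuracy includes one optimal classification accuracy, using the feature extraction method corresponding to the largest optimal classification accuracy as the target feature extraction method, and using the optimal feature item dimension value corresponding to the largest optimal classification accuracy as the target feature item dimension value. This approach helps improve the accuracy of text classification.
With reference to the first aspect, in a possible implementation, the method further includes: if the largest optimal classification accuracy includes multiple optimal classification accuracies, determining the smallest optimal feature item dimension value among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracies; and using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value. This approach helps improve the accuracy of text classification.
With reference to the first aspect, in a possible implementation, determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method includes: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy; determining a first optimal classification accuracy, where the difference between the first optimal classification accuracy and the largest optimal classification accuracy is within a preset range; comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the largest optimal classification accuracy to determine the smallest optimal feature item dimension value; and using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value; where the first optimal classification accuracy includes one or more optimal classification accuracies. This approach helps improve the accuracy of text classification.
With reference to the first aspect, in a possible implementation, determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method includes: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the optimal classification accuracies greater than a preset threshold, where the optimal classification accuracies greater than the preset threshold include one or more optimal classification accuracies; comparing the optimal feature item dimension values respectively corresponding to the optimal classification accuracies greater than the preset threshold to determine the smallest optimal feature item dimension value; and using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value.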
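In a second aspect, this application provides a processing apparatus. The apparatus includes a processing unit and a determining unit, which are configured to perform the method of the first aspect or any possible implementation thereof.
In a third aspect, this application provides a chip. The chip includes a processor and a communication interface, and the processor is configured to cause the chip to perform the method of the first aspect or any possible implementation thereof.
In a fourth aspect, this application provides a module device. The module device includes a communication module, a power module, a storage module, and a chip, where: the power module is configured to supply power to the module device; the storage module is configured to store data and instructions; the communication module is configured for communication inside the module device, or for communication between the module device and an external device; and the chip is configured to perform the method of the first aspect or any possible implementation thereof.
In a fifth aspect, an embodiment of the present invention discloses an electronic device. The electronic device includes a memory and a processor; the memory is configured to store a computer program that includes program instructions, and the processor is configured to invoke the program instructions to perform the method of the first aspect or any possible implementation thereof.
In a sixth aspect, this application provides a computer-readable storage medium. Computer-readable instructions are stored in the computer-readable storage medium, and when the computer-readable instructions are run on a communication apparatus, the communication apparatus is caused to perform the method of the first aspect or any possible implementation thereof.
In a seventh aspect, this application provides a computer program or computer program product that includes code or instructions; when the code or instructions are run on a computer, the computer is caused to perform the method of the first aspect or any possible implementation thereof.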
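Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a feature extraction method for text classification according to an embodiment of this application;
Fig. 2 is a line chart showing how the classification effect of different feature extraction methods varies with the feature item dimension value according to an embodiment of this application;
Fig. 3 is a bar chart comparing the optimal classification effects of different feature extraction methods according to an embodiment of this application;
Fig. 4 is a flowchart of another feature extraction method for text classification according to an embodiment of this application;
Fig. 5 is a flowchart of yet another feature extraction method for text classification according to an embodiment of this application;
Fig. 6 is a schematic structural diagram of a processing apparatus according to an embodiment of this application;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of this application;
Fig. 8 is a schematic structural diagram of a module device according to an embodiment of this application.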
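Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The terms used in the following embodiments of this application are only for the purpose of describing particular embodiments and are not intended to limit this application. As used in the specification and the appended claims of this application, the singular forms "a", "an", "said", "the above", "the", and "this" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in this application refers to and includes any and all possible combinations of one or more of the listed items.
It should be noted that the terms "first", "second", "third", and so on in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of this application described here can be implemented in orders other than those illustrated or described here. In addition, the term "include" and any variants thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or server that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
The method proposed in this application may be performed by an electronic device. The electronic device may be a terminal device, also called a terminal. It may be a device with wireless transceiver functions that can be deployed on land, including indoors or outdoors, handheld or vehicle-mounted; on water (for example, on a ship); or in the air (for example, on an aircraft, a balloon, or a satellite). The electronic device may be user equipment (UE), where UE includes handheld devices, vehicle-mounted devices, wearable devices, or computing devices with wireless communication functions. For example, the UE may be a mobile phone, a tablet computer, or a computer with wireless transceiver functions. The electronic device may also be a virtual reality (VR) electronic device, an augmented reality (AR) electronic device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in telemedicine, a wireless terminal in a smart grid, a wireless terminal in a smart city, a wireless terminal in a smart home, and so on. In the embodiments of this application, the apparatus that implements the functions of the electronic device may be a terminal, or may be an apparatus that can support the electronic device in implementing those functions, such as a chip system, and that apparatus may be installed in the electronic device. In the embodiments of this application, a chip system may consist of a chip, or may include a chip and other discrete components.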
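It should be noted that natural language processing (NLP) and information mining are key technologies in current data management, and text classification is the operational basis of these technologies. In text classification, the data and features determine the upper bound of what machine learning can achieve, while models and algorithms only approximate that bound. Feature extraction from text is therefore a crucial step that directly affects the classification result.
Commonly used traditional feature extraction methods for text classification include the chi-square test (chi2), mutual information (MI), information gain (IG), gradient boosting decision tree (GBDT) methods, supervised manual feature extraction methods, and manual methods. Different feature extraction methods produce different classification results under different feature item dimensions. However, traditional feature extraction methods can suffer from large generalization error and fail to meet the high accuracy requirements of text classification. How to improve the accuracy of text classification is therefore a problem that urgently needs to be solved.
To improve the accuracy of text classification, the embodiments of this application provide a feature extraction method and apparatus for text classification. For a better understanding, the feature extraction method for text classification is described in detail below.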
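Refer to Fig. 1, which is a flowchart of a feature extraction method for text classification according to an embodiment of this application. The method includes steps 101 to 104. The method shown in Fig. 1 may be performed by an electronic device or by a chip in the electronic device. The following description uses an electronic device as the example:
101. The electronic device performs feature extraction on text data using multiple feature extraction methods to obtain feature items.
In this embodiment, commonly used feature extraction methods include the chi-square test, mutual information, information gain, gradient boosting decision tree methods, supervised manual feature extraction methods, and so on. Other feature extraction methods may also be used to extract features from the text data, which is not limited in this embodiment. This makes it easier to subsequently process and analyze the feature items obtained by the different feature extraction methods.
As an example, the following description mainly assumes that the electronic device selects the chi-square test, mutual information, information gain, and gradient boosting decision tree methods to extract features from the text data and obtains the feature items extracted by each method.
Optionally, the text data may be a complete piece of text data, or one part of a piece of text data that has been divided into multiple parts. For example, a piece of text data is divided into three parts, and feature extraction is performed on one of the parts using multiple feature extraction methods to obtain feature items. This makes it possible to subsequently determine the optimal feature extraction method for each part and thereby improve the classification accuracy for the text data as a whole.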
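As an illustration of one of the feature extraction methods named in step 101, the sketch below scores terms with the standard chi-square statistic for a 2x2 term/class contingency table and keeps the highest-scoring terms as feature items. The function names, the set-of-tokens document representation, and the one-vs-rest scoring against a single target class are assumptions made for this example; the source does not specify how the chi-square test is applied.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one class (2x2 contingency table).

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or class is degenerate; treat as uninformative
    return n * (a * d - b * c) ** 2 / denom


def top_k_terms(docs, labels, target, k):
    """Rank terms by chi-square score against `target` and keep the k best.

    docs is a list of token sets (hypothetical representation);
    k plays the role of the feature item dimension value.
    """
    vocab = sorted({term for doc in docs for term in doc})
    scores = {}
    for term in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == target)
        b = sum(1 for doc, y in zip(docs, labels) if term in doc and y != target)
        c = sum(1 for doc, y in zip(docs, labels) if term not in doc and y == target)
        d = sum(1 for doc, y in zip(docs, labels) if term not in doc and y != target)
        scores[term] = chi_square(a, b, c, d)
    # Highest score first; alphabetical order breaks ties deterministically.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]
```

Mutual information or information gain could be substituted by replacing `chi_square` with another scoring function while keeping the same ranking step.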
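102. Based on the feature items corresponding to the multiple feature extraction methods, the electronic device determines the classification accuracy of the multiple feature extraction methods under different feature item dimension values.
In this embodiment, the feature item dimension value indicates the number of feature items, and each feature extraction method has a corresponding classification accuracy under each feature item dimension value. A classifier may be used to evaluate and optimize the feature items. In this way, the classification accuracy of the multiple feature extraction methods under different feature item dimension values can be determined.
For example, the electronic device selects the chi-square test (chi2), mutual information (MI), information gain (IG), and gradient boosting decision tree (GBDT) methods to extract features from the text data and uses a random forest (RF) as the classifier for checking the classification effect. The feature item dimension value varies over a range in steps of 5, and each feature extraction method has a corresponding classification accuracy under each feature item dimension value, as shown in Fig. 2. For example, when the dimension value is 25, the classification accuracy is 92% for the chi-square test, 76% for mutual information, 86% for information gain, and 84% for the gradient boosting decision tree method.
Optionally, the feature item dimension values may be set within any range, the selected feature extraction methods may be any feature extraction methods, and the step by which the feature item dimension value changes may be any value, which is not limited here.
Optionally, the classifier used to check the classification effect may be any classifier, which is not limited here.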
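The sweep in step 102 can be sketched as a loop over feature item dimension values. This is a minimal sketch under stated assumptions: `ranked_features` is the output of one feature extraction method ordered from most to least useful, and `evaluate` is a hypothetical callable that trains and scores a classifier (a random forest in the source example) on the selected features and returns its accuracy. The bounds used in the demonstration are placeholders, since the source does not give the range.

```python
def sweep_accuracy(ranked_features, evaluate, start, stop, step=5):
    """Return {dimension_value: accuracy} for one feature extraction method.

    For each dimension value k, the top-k features are passed to `evaluate`,
    which stands in for training and testing a classifier on those features.
    """
    curve = {}
    for k in range(start, stop + 1, step):
        curve[k] = evaluate(ranked_features[:k])
    return curve


# Toy demonstration with a made-up scoring rule (not real classifier output):
# accuracy rises until 10 features are used, then stays flat.
toy_ranked = [f"term{i}" for i in range(20)]
toy_curve = sweep_accuracy(toy_ranked, lambda feats: min(len(feats), 10) / 10, 5, 20)
```

Running the same sweep once per feature extraction method gives one accuracy curve per method, which corresponds to the kind of data plotted in Fig. 2.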
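103. Based on the classification accuracy of the multiple feature extraction methods, the electronic device determines the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
In a possible implementation, the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy of that feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
For example, the electronic device selects the chi-square test (chi2), mutual information (MI), information gain (IG), and gradient boosting decision tree (GBDT) methods to extract features from the text data and uses a random forest as the classifier for checking the classification effect. The optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method can be determined from Fig. 2, as shown in Fig. 3. The chi-square test has an optimal classification accuracy of 92% at an optimal feature item dimension value of 25; mutual information has an optimal classification accuracy of 90% at an optimal feature item dimension value of 55; information gain has an optimal classification accuracy of 92% at an optimal feature item dimension value of 55; and the gradient boosting decision tree method has an optimal classification accuracy of 92% at an optimal feature item dimension value of 50. Fig. 3 also includes the manual extraction method (byman), which has an optimal classification accuracy of 84% at an optimal feature item dimension value of 80. The optimal classification accuracies of the chi-square test, mutual information, information gain, and gradient boosting decision tree methods are therefore all higher than that of the manual extraction method.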
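Step 103 reduces each method's accuracy curve to a single (optimal accuracy, optimal dimension value) pair. The sketch below assumes each curve is a dictionary mapping dimension value to accuracy in whole percent. When a method reaches its best accuracy at several dimension values, the sketch keeps the smallest one; that tie rule is an assumption, since the source only states that one optimal accuracy corresponds to one optimal dimension value. Only the chi2 point at dimension 25 (92%) comes from the source; the other curve points are invented for illustration.

```python
def best_point(curve):
    """Return (optimal_accuracy, optimal_dimension) for one accuracy curve.

    curve maps feature item dimension value -> classification accuracy.
    Ties on accuracy are resolved toward the smallest dimension (assumption).
    """
    best_acc = max(curve.values())
    best_dim = min(k for k, acc in curve.items() if acc == best_acc)
    return best_acc, best_dim


# Illustrative curves in whole percent; mostly hypothetical values.
curves = {
    "chi2": {20: 88, 25: 92, 30: 92, 35: 90},
    "MI": {45: 84, 50: 88, 55: 90, 60: 89},
}
optima = {method: best_point(curve) for method, curve in curves.items()}
```

The resulting `optima` mapping has the same shape as the per-method summary shown in Fig. 3.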
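104. Based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the electronic device determines the target feature extraction method and the target feature item dimension value corresponding to the text data.
In this embodiment, the target feature extraction method and the target feature item dimension value corresponding to the text data may be determined as follows: the electronic device compares the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, uses the feature extraction method corresponding to the largest optimal classification accuracy as the target feature extraction method corresponding to the text data, and uses the optimal feature item dimension value corresponding to the largest optimal classification accuracy as the target feature item dimension value.
Optionally, the target feature extraction method and the target feature item dimension value corresponding to the text data may also be determined as follows: the electronic device compares the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, determines the optimal classification accuracies greater than a preset threshold, compares the optimal feature item dimension values respectively corresponding to those optimal classification accuracies, uses the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method corresponding to the text data, and uses the smallest optimal feature item dimension value as the target feature item dimension value.
Optionally, the target feature extraction method and the target feature item dimension value corresponding to the text data may also be determined in other ways, which is not limited here.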
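The threshold-based variant of step 104 can be sketched as a filter followed by a minimum. The per-method values below are the ones the source reports for Fig. 3, written in whole percent; the 85% threshold, the function name, and returning `None` when no method passes are assumptions for this example.

```python
def select_above_threshold(optima, threshold):
    """Pick the method with the smallest optimal dimension among the methods
    whose optimal accuracy is strictly greater than `threshold`.

    optima maps method name -> (optimal_accuracy, optimal_dimension).
    Returns (method, dimension), or None if no method passes (assumption).
    """
    passing = {m: dim for m, (acc, dim) in optima.items() if acc > threshold}
    if not passing:
        return None
    method = min(passing, key=passing.get)  # smallest dimension value wins
    return method, passing[method]


# Optimal (accuracy %, dimension value) pairs as reported for Fig. 3.
fig3_optima = {
    "chi2": (92, 25),
    "MI": (90, 55),
    "IG": (92, 55),
    "GBDT": (92, 50),
    "byman": (84, 80),
}
target = select_above_threshold(fig3_optima, threshold=85)  # hypothetical threshold
```

With these inputs the manual method is filtered out and chi2 is selected because its dimension value of 25 is the smallest among the remaining methods. If two methods share the smallest dimension value, `min` returns whichever appears first in the dictionary.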
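In the method described in Fig. 1, the electronic device considers the optimal classification accuracies and optimal feature item dimension values of multiple feature extraction methods together and determines, for the text data, the feature extraction method with the best classification effect and the feature item dimension value corresponding to that method. The method described in Fig. 1 therefore helps improve the accuracy of text classification.
Refer to Fig. 4, which is a schematic flowchart of another feature extraction method for text classification according to an embodiment of this application. The method includes steps 401 to 407. Steps 404 to 407 are a specific implementation of step 104 above. The method shown in Fig. 4 may be performed by an electronic device or by a chip in the electronic device. The following description uses an electronic device as the example:
401. The electronic device performs feature extraction on text data using multiple feature extraction methods to obtain feature items.
402. Based on the feature items corresponding to the multiple feature extraction methods, the electronic device determines the classification accuracy of the multiple feature extraction methods under different feature item dimension values.
403. Based on the classification accuracy of the multiple feature extraction methods, the electronic device determines the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
Steps 401 to 403 are implemented in the same way as steps 101 to 103 above, and details are not repeated here.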
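404. The electronic device compares the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy.
The largest optimal classification accuracy includes one or more optimal classification accuracies. If the largest optimal classification accuracy includes one optimal classification accuracy, step 405 is performed. If the largest optimal classification accuracy includes multiple optimal classification accuracies (for example, see the embodiment shown in Fig. 3), steps 406 and 407 are performed.
For example, the electronic device selects the chi-square test (chi2), mutual information (MI), information gain (IG), and gradient boosting decision tree (GBDT) methods to extract features from the text data. As shown in Fig. 3, by comparing the optimal classification accuracies corresponding to the feature extraction methods, the electronic device finds three largest optimal classification accuracies, all equal to 92%.
405. The electronic device uses the feature extraction method corresponding to the largest optimal classification accuracy as the target feature extraction method, and uses the optimal feature item dimension value corresponding to the largest optimal classification accuracy as the target feature item dimension value.
For example, the electronic device selects the chi-square test (chi2), mutual information (MI), information gain (IG), and gradient boosting decision tree (GBDT) methods to extract features from the text data, where the chi-square test has an optimal classification accuracy of 92% at an optimal feature item dimension value of 25, mutual information has an optimal classification accuracy of 90% at an optimal feature item dimension value of 55, and information gain has an optimal classification accuracy of 91% at an optimal feature item dimension value of 50. By comparing the optimal classification accuracies corresponding to the feature extraction methods, the electronic device finds a single largest optimal classification accuracy, 92%. The feature extraction method corresponding to it is the chi-square test, and the corresponding optimal feature item dimension value is 25. The electronic device therefore uses the chi-square test as the target feature extraction method and 25 as the target feature item dimension value.
406. The electronic device determines the smallest optimal feature item dimension value among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracies.
407. The electronic device uses the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and uses the smallest optimal feature item dimension value as the target feature item dimension value.
In this embodiment, the smallest optimal feature item dimension value is chosen because a lower dimension is easier to compute and visualize and helps extract and combine useful information while discarding useless information. This approach helps improve the accuracy of text classification.
For example, the electronic device selects the chi-square test (chi2), mutual information (MI), information gain (IG), and gradient boosting decision tree (GBDT) methods to extract features from the text data, where the chi-square test has an optimal classification accuracy of 92% at an optimal feature item dimension value of 25, mutual information has an optimal classification accuracy of 90% at an optimal feature item dimension value of 55, and information gain has an optimal classification accuracy of 92% at an optimal feature item dimension value of 50. By comparing the optimal classification accuracies corresponding to the feature extraction methods, the electronic device finds two largest optimal classification accuracies, both equal to 92%. Their corresponding optimal feature item dimension values are 25 and 50, so the smallest optimal feature item dimension value is 25. The feature extraction method corresponding to it is the chi-square test, so the electronic device uses the chi-square test as the target feature extraction method and 25 as the target feature item dimension value.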
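Steps 404 to 407 amount to taking the maximum accuracy and breaking ties by the smallest dimension value. The sketch below reproduces both worked examples from the text; accuracies are written in whole percent so that equality comparison is exact, and the function name is an assumption.

```python
def select_by_max_accuracy(optima):
    """Steps 404-407: choose the method with the largest optimal accuracy;
    if several methods tie, choose the one with the smallest optimal dimension.

    optima maps method name -> (optimal_accuracy, optimal_dimension).
    """
    top = max(acc for acc, _ in optima.values())                        # step 404
    tied = {m: dim for m, (acc, dim) in optima.items() if acc == top}
    method = min(tied, key=tied.get)                                    # steps 405/406
    return method, tied[method]                                         # step 407


# Single maximum (example for step 405).
single = select_by_max_accuracy({"chi2": (92, 25), "MI": (90, 55), "IG": (91, 50)})

# Two methods tied at 92% (example for steps 406 and 407).
tie = select_by_max_accuracy({"chi2": (92, 25), "MI": (90, 55), "IG": (92, 50)})
```

Both calls select the chi-square test with a dimension value of 25, matching the examples. With floating-point accuracies, an exact `==` tie test may be too strict and a small tolerance could be used instead.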
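In the method described in Fig. 4, the electronic device considers the optimal classification accuracies and optimal feature item dimension values of multiple feature extraction methods and determines, for the text data, the largest optimal classification accuracy, and thereby the feature extraction method with the best classification effect and the feature item dimension value corresponding to that method. The method described in Fig. 4 therefore helps improve the accuracy of text classification.
Refer to Fig. 5, which is a schematic flowchart of yet another feature extraction method for text classification according to an embodiment of this application. The method includes steps 501 to 507. Steps 504 to 507 are a specific implementation of step 104 above. The method shown in Fig. 5 may be performed by an electronic device or by a chip in the electronic device. The following description uses an electronic device as the example:
501. The electronic device performs feature extraction on text data using multiple feature extraction methods to obtain feature items.
502. Based on the feature items corresponding to the multiple feature extraction methods, the electronic device determines the classification accuracy of the multiple feature extraction methods under different feature item dimension values.
503. Based on the classification accuracy of the multiple feature extraction methods, the electronic device determines the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
Steps 501 to 503 are implemented in the same way as steps 401 to 403 above, and details are not repeated here.
504. The electronic device compares the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy.
505. The electronic device determines a first optimal classification accuracy.
In this embodiment, the difference between the first optimal classification accuracy and the largest optimal classification accuracy is within a preset range. The first optimal classification accuracy includes one or more optimal classification accuracies.
Optionally, the preset range may be any range, which is not limited here.
506. The electronic device compares the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the largest optimal classification accuracy to determine the smallest optimal feature item dimension value.
507. The electronic device uses the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and uses the smallest optimal feature item dimension value as the target feature item dimension value.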
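For example, assume that the preset range for the difference between the first optimal classification accuracy and the largest optimal classification accuracy is 0% to 5%. The electronic device selects the chi-square test (chi2), mutual information (MI), information gain (IG), and gradient boosting decision tree (GBDT) methods to extract features from the text data, where the chi-square test has an optimal classification accuracy of 92% at an optimal feature item dimension value of 25, mutual information has an optimal classification accuracy of 86% at an optimal feature item dimension value of 55, and information gain has an optimal classification accuracy of 90% at an optimal feature item dimension value of 15. By comparing the optimal classification accuracies corresponding to the feature extraction methods, the electronic device determines that the largest optimal classification accuracy is 92% and that one optimal classification accuracy, 90%, differs from it by an amount within the preset range. The electronic device therefore determines that there is one first optimal classification accuracy, 90%. The optimal feature item dimension value corresponding to the first optimal classification accuracy is 15, and the optimal feature item dimension value corresponding to the largest optimal classification accuracy is 25; comparing them gives a smallest optimal feature item dimension value of 15. The feature extraction method corresponding to it is information gain, so the electronic device uses information gain as the target feature extraction method and 15 as the target feature item dimension value.
As another example, assume that the preset range for the difference between the first optimal classification accuracy and the largest optimal classification accuracy is 0% to 5%. The electronic device selects the chi-square test (chi2), mutual information (MI), information gain (IG), and gradient boosting decision tree (GBDT) methods to extract features from the text data, where the chi-square test has an optimal classification accuracy of 92% at an optimal feature item dimension value of 25, mutual information has an optimal classification accuracy of 90% at an optimal feature item dimension value of 55, and information gain has an optimal classification accuracy of 90% at an optimal feature item dimension value of 15. By comparing the optimal classification accuracies corresponding to the feature extraction methods, the electronic device determines that the largest optimal classification accuracy is 92% and that two optimal classification accuracies, both 90%, differ from it by an amount within the preset range. The electronic device therefore determines that there are two first optimal classification accuracies, both 90%. The optimal feature item dimension values corresponding to the first optimal classification accuracies are 55 and 15, and the optimal feature item dimension value corresponding to the largest optimal classification accuracy is 25; comparing them gives a smallest optimal feature item dimension value of 15. The feature extraction method corresponding to it is information gain, so the electronic device uses information gain as the target feature extraction method and 15 as the target feature item dimension value.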
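Steps 504 to 507 can be sketched as keeping every method whose optimal accuracy lies within a preset gap of the maximum and then taking the smallest dimension value. The sketch reproduces both worked examples; accuracies and the gap are in whole percentage points, and treating the 0% to 5% range as inclusive at both ends is an assumption, as is the function name.

```python
def select_within_range(optima, max_gap):
    """Steps 504-507: among methods whose optimal accuracy is within `max_gap`
    of the largest optimal accuracy (the largest itself included), pick the
    one with the smallest optimal dimension value.

    optima maps method name -> (optimal_accuracy, optimal_dimension).
    """
    top = max(acc for acc, _ in optima.values())                              # step 504
    near = {m: dim for m, (acc, dim) in optima.items() if top - acc <= max_gap}  # step 505
    method = min(near, key=near.get)                                          # step 506
    return method, near[method]                                               # step 507


# First example: MI at 86% is 6 points below 92% and falls outside the range.
first = select_within_range({"chi2": (92, 25), "MI": (86, 55), "IG": (90, 15)}, max_gap=5)

# Second example: MI and IG are both at 90% and both fall inside the range.
second = select_within_range({"chi2": (92, 25), "MI": (90, 55), "IG": (90, 15)}, max_gap=5)
```

Both calls select information gain with a dimension value of 15, as in the examples: a two-point loss in accuracy is accepted in exchange for a smaller feature set. Setting `max_gap=0` reduces this rule to the tie-breaking behavior of Fig. 4.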
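In the method described in Fig. 5, the electronic device considers the optimal classification accuracies and optimal feature item dimension values of multiple feature extraction methods together and determines, for the text data, the feature extraction method with the best classification effect and the feature item dimension value corresponding to that method. The method described in Fig. 5 therefore helps improve the accuracy of text classification.
Refer to Fig. 6, which is a schematic structural diagram of a processing apparatus according to an embodiment of the present invention. The processing apparatus may be an electronic device or an apparatus having the functions of an electronic device (for example, a chip). Specifically, as shown in Fig. 6, the processing apparatus 60 may include:
a processing unit 601, configured to perform feature extraction on text data using multiple feature extraction methods to obtain feature items;
a determining unit 602, configured to determine, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy of the multiple feature extraction methods under different feature item dimension values, where the feature item dimension value indicates the number of feature items;
the determining unit 602 being further configured to determine, based on the classification accuracy of the multiple feature extraction methods, the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method;
the determining unit 602 being further configured to determine, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data.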
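Optionally, the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy of that feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
Optionally, the determining unit 602 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, as follows: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy, where the largest optimal classification accuracy includes one or more optimal classification accuracies; and if the largest optimal classification accuracy includes one optimal classification accuracy, using the feature extraction method corresponding to the largest optimal classification accuracy as the target feature extraction method, and using the optimal feature item dimension value corresponding to the largest optimal classification accuracy as the target feature item dimension value.
Optionally, the determining unit 602 is further configured to: if the largest optimal classification accuracy includes multiple optimal classification accuracies, determine the smallest optimal feature item dimension value among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracies; and use the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and use the smallest optimal feature item dimension value as the target feature item dimension value.
Optionally, the determining unit 602 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, as follows: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy; determining a first optimal classification accuracy, where the difference between the first optimal classification accuracy and the largest optimal classification accuracy is within a preset range; comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the largest optimal classification accuracy to determine the smallest optimal feature item dimension value; and using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value; where the first optimal classification accuracy includes one or more optimal classification accuracies.
Optionally, the determining unit 602 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, as follows: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the optimal classification accuracies greater than a preset threshold, where the optimal classification accuracies greater than the preset threshold include one or more optimal classification accuracies; comparing the optimal feature item dimension values respectively corresponding to the optimal classification accuracies greater than the preset threshold to determine the smallest optimal feature item dimension value; and using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value.
This embodiment of the present invention is based on the same concept as the method embodiments shown in Fig. 1, Fig. 4, and Fig. 5 and brings the same technical effects. For the specific principles, refer to the descriptions of the embodiments shown in Fig. 1, Fig. 4, and Fig. 5; details are not repeated here.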
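Refer to Fig. 7, which is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device 70 may include a memory 701, a processor 702, and a communication interface 703, which are connected through one or more communication buses. The communication interface 703 is controlled by the processor 702 and is used to send and receive information.
The memory 701 may include read-only memory and random access memory, and provides instructions and data to the processor 702. Part of the memory 701 may also include non-volatile random access memory.
The processor 702 may be a central processing unit (CPU). The processor 702 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor; optionally, the processor 702 may also be any conventional processor. Here:
The memory 701 is configured to store program instructions.
The processor 702 is configured to invoke the program instructions stored in the memory 701.
The processor 702 invokes the program instructions stored in the memory 701 to cause the electronic device 70 to perform the following operations:
performing feature extraction on text data using multiple feature extraction methods to obtain feature items;
determining, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy of the multiple feature extraction methods under different feature item dimension values, where the feature item dimension value indicates the number of feature items;
determining, based on the classification accuracy of the multiple feature extraction methods, the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method;
determining, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data.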
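In one implementation, the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy of that feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
In one implementation, the operation in which the processor 702 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, is specifically: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy, where the largest optimal classification accuracy includes one or more optimal classification accuracies; and if the largest optimal classification accuracy includes one optimal classification accuracy, using the feature extraction method corresponding to the largest optimal classification accuracy as the target feature extraction method, and using the optimal feature item dimension value corresponding to the largest optimal classification accuracy as the target feature item dimension value.
In one implementation, the processor 702 is further configured to: if the largest optimal classification accuracy includes multiple optimal classification accuracies, determine the smallest optimal feature item dimension value among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracies; and use the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and use the smallest optimal feature item dimension value as the target feature item dimension value. This approach helps improve the accuracy of text classification.
In one implementation, the operation in which the processor 702 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, is specifically: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy; determining a first optimal classification accuracy, where the difference between the first optimal classification accuracy and the largest optimal classification accuracy is within a preset range; comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the largest optimal classification accuracy to determine the smallest optimal feature item dimension value; and using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value; where the first optimal classification accuracy includes one or more optimal classification accuracies.
In one implementation, the processor 702 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, as follows: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the optimal classification accuracies greater than a preset threshold, where the optimal classification accuracies greater than the preset threshold include one or more optimal classification accuracies; comparing the optimal feature item dimension values respectively corresponding to the optimal classification accuracies greater than the preset threshold to determine the smallest optimal feature item dimension value; and using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value.
It should be noted that, for content not mentioned in the embodiment corresponding to Fig. 7 and for the specific implementation of each step, refer to the embodiments shown in Fig. 1, Fig. 4, and Fig. 5 and the foregoing description; details are not repeated here.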
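An embodiment of this application further provides a chip that can perform the steps related to the electronic device in the foregoing method embodiments. The chip includes a processor and a communication interface, and the processor is configured to cause the chip to perform the following operations: performing feature extraction on text data using multiple feature extraction methods to obtain feature items; determining, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy of the multiple feature extraction methods under different feature item dimension values, where the feature item dimension value indicates the number of feature items; determining, based on the classification accuracy of the multiple feature extraction methods, the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method; and determining, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data.
Optionally, the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy of that feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
Optionally, when determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the processor is configured to cause the chip to specifically perform the following operations: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy, where the largest optimal classification accuracy includes one or more optimal classification accuracies; and if the largest optimal classification accuracy includes one optimal classification accuracy, using the feature extraction method corresponding to the largest optimal classification accuracy as the target feature extraction method, and using the optimal feature item dimension value corresponding to the largest optimal classification accuracy as the target feature item dimension value.
Optionally, the processor is further configured to cause the chip to perform the following operations: if the largest optimal classification accuracy includes multiple optimal classification accuracies, determining the smallest optimal feature item dimension value among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracies; and using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value.
Optionally, when determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the processor is configured to cause the chip to specifically perform the following operations: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy; determining a first optimal classification accuracy, where the difference between the first optimal classification accuracy and the largest optimal classification accuracy is within a preset range; comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the largest optimal classification accuracy to determine the smallest optimal feature item dimension value; and using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value; where the first optimal classification accuracy includes one or more optimal classification accuracies.
Optionally, when determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the processor is configured to cause the chip to specifically perform the following operations: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the optimal classification accuracies greater than a preset threshold, where the optimal classification accuracies greater than the preset threshold include one or more optimal classification accuracies; comparing the optimal feature item dimension values respectively corresponding to the optimal classification accuracies greater than the preset threshold to determine the smallest optimal feature item dimension value; and using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value.
In a possible implementation, the chip includes at least one processor, at least one first memory, and at least one second memory. The at least one first memory and the at least one processor are interconnected by lines, and instructions are stored in the first memory; the at least one second memory and the at least one processor are interconnected by lines, and the second memory stores the data that needs to be stored in the foregoing method embodiments.
For each apparatus and product applied to or integrated in the chip, all of the modules it contains may be implemented in hardware such as circuits, or at least some of the modules may be implemented as a software program that runs on a processor integrated inside the chip, and the remaining modules (if any) may be implemented in hardware such as circuits.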
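Fig. 8 is a schematic structural diagram of a module device according to an embodiment of this application. The module device 80 can perform the steps related to the electronic device in the foregoing method embodiments, and includes a communication module 801, a power module 802, a storage module 803, and a chip 804.
The power module 802 is configured to supply power to the module device; the storage module 803 is configured to store data and instructions; the communication module 801 is configured for communication inside the module device, or for communication between the module device and an external device; and the chip 804 is configured to perform the following operations: performing feature extraction on text data using multiple feature extraction methods to obtain feature items; determining, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy of the multiple feature extraction methods under different feature item dimension values, where the feature item dimension value indicates the number of feature items; determining, based on the classification accuracy of the multiple feature extraction methods, the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method; and determining, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data.
Optionally, the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy of that feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
Optionally, the chip 804 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, as follows: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy, where the largest optimal classification accuracy includes one or more optimal classification accuracies; and if the largest optimal classification accuracy includes one optimal classification accuracy, using the feature extraction method corresponding to the largest optimal classification accuracy as the target feature extraction method, and using the optimal feature item dimension value corresponding to the largest optimal classification accuracy as the target feature item dimension value.
Optionally, the chip 804 is further configured to: if the largest optimal classification accuracy includes multiple optimal classification accuracies, determine the smallest optimal feature item dimension value among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracies; and use the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and use the smallest optimal feature item dimension value as the target feature item dimension value. This approach helps improve the accuracy of text classification.
Optionally, the chip 804 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, as follows: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy; determining a first optimal classification accuracy, where the difference between the first optimal classification accuracy and the largest optimal classification accuracy is within a preset range; comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the largest optimal classification accuracy to determine the smallest optimal feature item dimension value; and using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value; where the first optimal classification accuracy includes one or more optimal classification accuracies.
Optionally, the chip 804 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, as follows: comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the optimal classification accuracies greater than a preset threshold, where the optimal classification accuracies greater than the preset threshold include one or more optimal classification accuracies; comparing the optimal feature item dimension values respectively corresponding to the optimal classification accuracies greater than the preset threshold to determine the smallest optimal feature item dimension value; and using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value.
For each apparatus and product applied to or integrated in the module device, all of the modules it contains may be implemented in hardware such as circuits, and different modules may be located in the same component of the module device (for example, a chip or a circuit module) or in different components; alternatively, at least some of the modules may be implemented as a software program that runs on a processor integrated inside the module device, and the remaining modules (if any) may be implemented in hardware such as circuits.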
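An embodiment of this application further provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a processor, the method flow of the foregoing method embodiments is implemented.
An embodiment of this application further provides a computer program product; when the computer program product runs on a processor, the method flow of the foregoing method embodiments is implemented.
Each module/unit included in the apparatuses and products described in the foregoing embodiments may be a software module/unit, a hardware module/unit, or partly a software module/unit and partly a hardware module/unit. For example, for each apparatus and product applied to or integrated in a chip, all of the modules/units it contains may be implemented in hardware such as circuits, or at least some of the modules/units may be implemented as a software program that runs on a processor integrated inside the chip, and the remaining modules/units (if any) may be implemented in hardware such as circuits; for each apparatus and product applied to or integrated in a chip module, all of the modules/units it contains may be implemented in hardware such as circuits, and different modules/units may be located in the same component of the chip module (for example, a chip or a circuit module) or in different components, or at least some of the modules/units may be implemented as a software program that runs on a processor integrated inside the chip module, and the remaining modules/units (if any) may be implemented in hardware such as circuits; for each apparatus and product applied to or integrated in a terminal, all of the modules/units it contains may be implemented in hardware such as circuits, and different modules/units may be located in the same component of the terminal (for example, a chip or a circuit module) or in different components, or at least some of the modules/units may be implemented as a software program that runs on a processor integrated inside the terminal, and the remaining modules/units (if any) may be implemented in hardware such as circuits.
It should be noted that, for simplicity of description, the foregoing method embodiments are each expressed as a series of combined actions. However, a person skilled in the art should understand that this application is not limited by the described order of actions, because according to this application some operations may be performed in other orders or simultaneously. A person skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by this application.
The descriptions of the embodiments provided in this application may be cross-referenced. Each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of other embodiments. For convenience and brevity, for example, for the functions and operations of the apparatuses and devices provided in the embodiments of this application, refer to the related descriptions of the method embodiments of this application; the method embodiments and the apparatus embodiments may also be cross-referenced, combined, or cited among themselves.
Finally, it should be noted that the foregoing embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of this application.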

Claims (21)

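  1. A feature extraction method for text classification, wherein the method comprises:
    performing feature extraction on text data using multiple feature extraction methods to obtain feature items;
    determining, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy of the multiple feature extraction methods under different feature item dimension values, wherein the feature item dimension value indicates the number of the feature items;
    determining, based on the classification accuracy of the multiple feature extraction methods, the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method;
    determining, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data.
  2. The method according to claim 1, wherein the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy of that feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
  3. The method according to claim 2, wherein determining, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data comprises:
    comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy, wherein the largest optimal classification accuracy includes one or more optimal classification accuracies;
    if the largest optimal classification accuracy includes one optimal classification accuracy, using the feature extraction method corresponding to the largest optimal classification accuracy as the target feature extraction method, and using the optimal feature item dimension value corresponding to the largest optimal classification accuracy as the target feature item dimension value.
  4. The method according to claim 3, wherein the method further comprises:
    if the largest optimal classification accuracy includes multiple optimal classification accuracies, determining the smallest optimal feature item dimension value among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracies;
    using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value.
  5. The method according to claim 2, wherein determining, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data comprises:
    comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy;
    determining a first optimal classification accuracy, wherein the difference between the first optimal classification accuracy and the largest optimal classification accuracy is within a preset range;
    comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the largest optimal classification accuracy to determine the smallest optimal feature item dimension value;
    using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value;
    wherein the first optimal classification accuracy includes one or more optimal classification accuracies.
  6. The method according to claim 2, wherein determining, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data comprises:
    comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the optimal classification accuracies greater than a preset threshold, wherein the optimal classification accuracies greater than the preset threshold include one or more optimal classification accuracies;
    comparing the optimal feature item dimension values respectively corresponding to the optimal classification accuracies greater than the preset threshold to determine the smallest optimal feature item dimension value;
    using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value.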
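  7. A processing apparatus, wherein the apparatus comprises:
    a processing unit, configured to perform feature extraction on text data using multiple feature extraction methods to obtain feature items;
    a determining unit, configured to determine, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy of the multiple feature extraction methods under different feature item dimension values, wherein the feature item dimension value indicates the number of the feature items;
    the determining unit being further configured to determine, based on the classification accuracy of the multiple feature extraction methods, the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method;
    the determining unit being further configured to determine, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data.
  8. The apparatus according to claim 7, wherein the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy of that feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
  9. The apparatus according to claim 8, wherein the determining unit determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, as follows:
    comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy, wherein the largest optimal classification accuracy includes one or more optimal classification accuracies;
    if the largest optimal classification accuracy includes one optimal classification accuracy, using the feature extraction method corresponding to the largest optimal classification accuracy as the target feature extraction method, and using the optimal feature item dimension value corresponding to the largest optimal classification accuracy as the target feature item dimension value.
  10. The apparatus according to claim 9, wherein the determining unit is further configured to:
    if the largest optimal classification accuracy includes multiple optimal classification accuracies, determine the smallest optimal feature item dimension value among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracies;
    use the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and use the smallest optimal feature item dimension value as the target feature item dimension value.
  11. The apparatus according to claim 8, wherein the determining unit determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, as follows:
    comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy;
    determining a first optimal classification accuracy, wherein the difference between the first optimal classification accuracy and the largest optimal classification accuracy is within a preset range;
    comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the largest optimal classification accuracy to determine the smallest optimal feature item dimension value;
    using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value;
    wherein the first optimal classification accuracy includes one or more optimal classification accuracies.
  12. The apparatus according to claim 8, wherein the determining unit determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, as follows:
    comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the optimal classification accuracies greater than a preset threshold, wherein the optimal classification accuracies greater than the preset threshold include one or more optimal classification accuracies;
    comparing the optimal feature item dimension values respectively corresponding to the optimal classification accuracies greater than the preset threshold to determine the smallest optimal feature item dimension value;
    using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value.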
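  13. A chip, comprising a processor and a communication interface, wherein the processor is configured to cause the chip to perform the following operations:
    performing feature extraction on text data using multiple feature extraction methods to obtain feature items;
    determining, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy of the multiple feature extraction methods under different feature item dimension values, wherein the feature item dimension value indicates the number of the feature items;
    determining, based on the classification accuracy of the multiple feature extraction methods, the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method;
    determining, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the target feature extraction method and the target feature item dimension value corresponding to the text data.
  14. The chip according to claim 13, wherein the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy of that feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
  15. The chip according to claim 14, wherein, when determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the processor is configured to cause the chip to specifically perform the following operations:
    comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy, wherein the largest optimal classification accuracy includes one or more optimal classification accuracies;
    if the largest optimal classification accuracy includes one optimal classification accuracy, using the feature extraction method corresponding to the largest optimal classification accuracy as the target feature extraction method, and using the optimal feature item dimension value corresponding to the largest optimal classification accuracy as the target feature item dimension value.
  16. The chip according to claim 15, wherein the processor is further configured to cause the chip to perform the following operations:
    if the largest optimal classification accuracy includes multiple optimal classification accuracies, determining the smallest optimal feature item dimension value among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracies;
    using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value.
  17. The chip according to claim 14, wherein, when determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the processor is configured to cause the chip to specifically perform the following operations:
    comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the largest optimal classification accuracy;
    determining a first optimal classification accuracy, wherein the difference between the first optimal classification accuracy and the largest optimal classification accuracy is within a preset range;
    comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the largest optimal classification accuracy to determine the smallest optimal feature item dimension value;
    using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value;
    wherein the first optimal classification accuracy includes one or more optimal classification accuracies.
  18. The chip according to claim 14, wherein, when determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the processor is configured to cause the chip to specifically perform the following operations:
    comparing the optimal classification accuracies corresponding to the feature extraction methods to determine the optimal classification accuracies greater than a preset threshold, wherein the optimal classification accuracies greater than the preset threshold include one or more optimal classification accuracies;
    comparing the optimal feature item dimension values respectively corresponding to the optimal classification accuracies greater than the preset threshold to determine the smallest optimal feature item dimension value;
    using the feature extraction method corresponding to the smallest optimal feature item dimension value as the target feature extraction method, and using the smallest optimal feature item dimension value as the target feature item dimension value.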
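  19. A module device, wherein the module device comprises a communication module, a power module, a storage module, and a chip, wherein:
    the power module is configured to supply power to the module device;
    the storage module is configured to store data and instructions;
    the communication module is configured for communication inside the module device, or for communication between the module device and an external device;
    the chip is configured to perform the method according to any one of claims 1 to 6.
  20. An electronic device, comprising a memory and a processor, wherein the memory is configured to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 1 to 6.
  21. A computer-readable storage medium, wherein computer-readable instructions are stored in the computer-readable storage medium, and when the computer-readable instructions are run on a communication apparatus, the communication apparatus is caused to perform the method according to any one of claims 1 to 6.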
PCT/CN2022/074714 2021-02-05 2022-01-28 一种文本分类的特征提取方法及装置 WO2022166830A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110163603.3 2021-02-05
CN202110163603.3A CN112989036A (zh) 2021-02-05 2021-02-05 一种文本分类的特征提取方法及装置

Publications (1)

Publication Number Publication Date
WO2022166830A1 true WO2022166830A1 (zh) 2022-08-11

Family

ID=76348342

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/074714 WO2022166830A1 (zh) 2021-02-05 2022-01-28 一种文本分类的特征提取方法及装置

Country Status (2)

Country Link
CN (1) CN112989036A (zh)
WO (1) WO2022166830A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112989036A (zh) * 2021-02-05 2021-06-18 北京紫光展锐通信技术有限公司 一种文本分类的特征提取方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766792A (zh) * 2017-06-23 2018-03-06 北京理工大学 一种遥感图像舰船目标识别方法
US20190377823A1 (en) * 2018-06-07 2019-12-12 Element Ai Inc. Unsupervised classification of documents using a labeled data set of other documents
CN110597878A (zh) * 2019-09-16 2019-12-20 广东工业大学 一种多模态数据的跨模态检索方法、装置、设备及介质
CN111666748A (zh) * 2020-05-12 2020-09-15 武汉大学 一种自动化分类器的构造方法以及从软件开发文本类制品中识别决策的方法
US20200394557A1 (en) * 2019-06-15 2020-12-17 Terrance Boult Systems and methods for machine classification and learning that is robust to unknown inputs
CN112989036A (zh) * 2021-02-05 2021-06-18 北京紫光展锐通信技术有限公司 一种文本分类的特征提取方法及装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045503B (zh) * 2016-02-05 2019-03-05 华为技术有限公司 一种特征集确定的方法及装置
CN107908715A (zh) * 2017-11-10 2018-04-13 中国民航大学 基于Adaboost和分类器加权融合的微博情感极性判别方法
CN109684476B (zh) * 2018-12-07 2023-10-17 中科恒运股份有限公司 一种文本分类方法、文本分类装置及终端设备

Also Published As

Publication number Publication date
CN112989036A (zh) 2021-06-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22749109

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 161123)