CN112989036A - Feature extraction method and device for text classification - Google Patents

Feature extraction method and device for text classification

Info

Publication number
CN112989036A
Authority
CN
China
Prior art keywords
optimal
feature
feature extraction
classification accuracy
item dimension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110163603.3A
Other languages
Chinese (zh)
Inventor
霍小倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ziguang Zhanrui Communication Technology Co Ltd
Original Assignee
Beijing Ziguang Zhanrui Communication Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ziguang Zhanrui Communication Technology Co Ltd filed Critical Beijing Ziguang Zhanrui Communication Technology Co Ltd
Priority to CN202110163603.3A priority Critical patent/CN112989036A/en
Publication of CN112989036A publication Critical patent/CN112989036A/en
Priority to PCT/CN2022/074714 priority patent/WO2022166830A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text classification feature extraction method and a text classification feature extraction device, wherein the method comprises the following steps: carrying out feature extraction on the text data by adopting a plurality of feature extraction methods to obtain feature items; determining the classification accuracy of the multiple feature extraction methods under different feature item dimension values based on the feature items corresponding to the multiple feature extraction methods, wherein the feature item dimension values are used for representing the number of the feature items; determining the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods; and determining a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method. By adopting the method provided by the application, the accuracy of text classification can be improved.

Description

Feature extraction method and device for text classification
Technical Field
The invention relates to the field of computers, in particular to a text classification feature extraction method and device.
Background
Natural Language Processing (NLP) technology and information mining are currently key technologies for data management, and text classification is the operation basis of these technologies. In the text classification field, data and features determine the upper limit of machine learning, and a model and an algorithm only approximate the upper limit, so that feature extraction on a text is a crucial step for directly influencing the classification effect.
At present, conventional feature extraction methods for text classification commonly include the chi-square test (chi2), the Mutual Information method (MI), the Information Gain method (IG), the Gradient Boosting Decision Tree (GBDT) method, supervised feature extraction methods, manual methods, and the like. Each feature extraction method has a corresponding classification effect under each feature item dimension. However, conventional feature extraction methods suffer from a large generalization error and cannot meet the requirement for high text classification accuracy. Therefore, how to improve the accuracy of text classification is an urgent problem to be solved.
Disclosure of Invention
The application provides a feature extraction method and device for text classification, which help to improve the accuracy of text classification.
In a first aspect, the present application provides a method for feature extraction for text classification, including: carrying out feature extraction on the text data by adopting a plurality of feature extraction methods to obtain feature items; determining the classification accuracy of the multiple feature extraction methods under different feature item dimension values based on the feature items corresponding to the multiple feature extraction methods, wherein the feature item dimension values are used for representing the number of the feature items; determining the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods; and determining a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
Based on the method described in the first aspect, the optimal classification accuracy and the optimal feature item dimension value of the multiple feature extraction methods are comprehensively considered, and the feature extraction method with the optimal classification effect and the feature item dimension value corresponding to the feature extraction method are determined for text data, so that the accuracy of text classification is improved.
With reference to the first aspect, in a possible implementation manner, the optimal feature item dimension value corresponding to each feature extraction method is a feature item dimension value corresponding to the optimal classification accuracy of each feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
With reference to the first aspect, in a possible implementation manner, the determining a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method includes: determining the optimal classification accuracy with the maximum value by comparing the optimal classification accuracy corresponding to each feature extraction method, wherein the optimal classification accuracy with the maximum value comprises one or more optimal classification accuracies; if the optimal classification accuracy with the maximum value comprises only one optimal classification accuracy, taking the feature extraction method corresponding to the optimal classification accuracy with the maximum value as the target feature extraction method, and taking the optimal feature item dimension value corresponding to the optimal classification accuracy with the maximum value as the target feature item dimension value. Based on the method, the accuracy of text classification is improved.
With reference to the first aspect, in a possible implementation manner, the method further includes: if the optimal classification accuracy with the maximum value comprises a plurality of optimal classification accuracies, determining the optimal feature item dimension value with the minimum value from the plurality of optimal feature item dimension values corresponding to those optimal classification accuracies; and taking the feature extraction method corresponding to the optimal feature item dimension value with the minimum value as the target feature extraction method, and taking the optimal feature item dimension value with the minimum value as the target feature item dimension value. Based on the method, the accuracy of text classification is improved.
With reference to the first aspect, in a possible implementation manner, the determining a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method includes: determining the optimal classification accuracy with the maximum value by comparing the optimal classification accuracy corresponding to each feature extraction method; determining a first optimal classification accuracy, wherein the difference value between the first optimal classification accuracy and the optimal classification accuracy with the maximum value is within a preset range; determining the optimal feature item dimension value with the minimum numerical value by comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the optimal classification accuracy with the maximum numerical value; taking the feature extraction method corresponding to the optimal feature item dimension value with the minimum numerical value as the target feature extraction method, and taking the optimal feature item dimension value with the minimum numerical value as the target feature item dimension value; wherein the first optimal classification accuracy comprises one or more optimal classification accuracy. Based on the method, the accuracy of text classification is improved.
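The preset-range implementation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `select_within_range`, the dictionary layout, the sample values, and the tolerance of 0.02 are all hypothetical, and the text does not say how to choose between two candidates that share the same minimum dimension value (the sketch keeps the first one encountered).

```python
from typing import Dict, Tuple


def select_within_range(
    optima: Dict[str, Tuple[float, int]], tolerance: float
) -> Tuple[str, int]:
    """optima maps a method name to (optimal accuracy, optimal dimension value)."""
    best_accuracy = max(acc for acc, _ in optima.values())
    # Keep the maximum itself plus every optimum whose difference from the
    # maximum lies within the preset range (the "first optimal accuracy").
    candidates = {
        name: dim
        for name, (acc, dim) in optima.items()
        if best_accuracy - acc <= tolerance
    }
    # Among the candidates, pick the smallest optimal feature item dimension value.
    target_method = min(candidates, key=candidates.get)
    return target_method, candidates[target_method]


# Hypothetical optima for three unnamed methods.
optima = {
    "method_a": (0.93, 120),
    "method_b": (0.92, 40),
    "method_c": (0.85, 20),
}
target = select_within_range(optima, tolerance=0.02)
```

Here `method_b` is selected: its optimum is within the tolerance of the maximum and needs far fewer feature items, whereas `method_c` has the smallest dimension value but falls outside the preset range.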
In a second aspect, the present application provides a feature extraction apparatus for text classification, which includes a processing unit and a determining unit, where the processing unit and the determining unit are configured to execute the method of the first aspect.
In a third aspect, the present application provides a chip, where the chip is configured to perform feature extraction on text data by using multiple feature extraction methods to obtain feature items; determining the classification accuracy of the multiple feature extraction methods under different feature item dimension values based on the feature items corresponding to the multiple feature extraction methods, wherein the feature item dimension values are used for representing the number of the feature items; determining the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods; and determining a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
In a fourth aspect, the present application provides a module device, which includes a communication module, a power module, a storage module, and a chip module, wherein: the power module is used for providing electric energy for the module equipment; the storage module is used for storing data and instructions; the communication module is used for carrying out internal communication of the module equipment or is used for carrying out communication between the module equipment and external equipment; this chip module is used for: carrying out feature extraction on the text data by adopting a plurality of feature extraction methods to obtain feature items; determining the classification accuracy of the multiple feature extraction methods under different feature item dimension values based on the feature items corresponding to the multiple feature extraction methods, wherein the feature item dimension values are used for representing the number of the feature items; determining the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods; and determining a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
In a fifth aspect, an embodiment of the present invention discloses an electronic device, which includes a memory and a processor, where the memory is used for storing a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method according to the first aspect.
In a sixth aspect, the present application provides a computer-readable storage medium having stored thereon computer-readable instructions that, when run on a communication device, cause the communication device to perform the method of the first aspect and any of its possible implementations.
In a seventh aspect, the present application provides a computer program or computer program product comprising code or instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a feature extraction method for text classification according to an embodiment of the present application;
Fig. 2 is a line graph, provided in an embodiment of the present application, showing how the classification effect of different feature extraction methods varies with the feature item dimension value;
Fig. 3 is a bar graph comparing optimal classification effects of different feature extraction methods provided in the embodiment of the present application;
Fig. 4 is a flowchart of a feature extraction method for text classification according to an embodiment of the present application;
Fig. 5 is a flowchart of a feature extraction method for text classification according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a processing apparatus according to an embodiment of the present disclosure;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a module apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the following embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in the specification of the present application and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the listed items.
It should be noted that the terms "first," "second," "third," and the like in the description and claims of the present application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than described or illustrated herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, system, article, or apparatus.
The method provided by the application may be executed by an electronic device. The electronic device may be a terminal device, also called a terminal: a device with a wireless transceiving function that can be deployed on land (indoors or outdoors, handheld or vehicle-mounted), on the water surface (for example, on a ship), or in the air (for example, on an airplane, a balloon, or a satellite). The electronic device may be a User Equipment (UE), wherein the UE includes a handheld device, an in-vehicle device, a wearable device, or a computing device having wireless communication capabilities. Illustratively, the UE may be a mobile phone, a tablet computer, or a computer with a wireless transceiving function. The electronic device may also be a Virtual Reality (VR) electronic device, an Augmented Reality (AR) electronic device, a wireless terminal in industrial control, a wireless terminal in unmanned driving, a wireless terminal in telemedicine, a wireless terminal in smart grid, a wireless terminal in smart city, a wireless terminal in smart home, etc. In the embodiment of the present application, the apparatus for implementing the function of the electronic device may be a terminal, or may be an apparatus capable of supporting the electronic device in implementing the function, such as a chip system, and this apparatus may be installed in the electronic device. In the embodiment of the present application, the chip system may consist of a chip, or may include a chip and other discrete devices.
It should be noted that Natural Language Processing (NLP) technology and information mining are currently key technologies for data management, and text classification is the operation basis of these technologies. In the text classification field, data and features determine the upper limit of machine learning, and a model and an algorithm only approximate the upper limit, so that feature extraction on a text is a crucial step for directly influencing the classification effect.
At present, conventional feature extraction methods for text classification commonly include the chi-square test (chi2), the Mutual Information method (MI), the Information Gain method (IG), the Gradient Boosting Decision Tree (GBDT) method, supervised feature extraction methods, manual methods, and the like. Each feature extraction method has a corresponding classification effect under each feature item dimension. However, conventional feature extraction methods suffer from a large generalization error and cannot meet the requirement for high text classification accuracy. Therefore, how to improve the accuracy of text classification is an urgent problem to be solved.
In order to improve the accuracy of text classification, the embodiment of the application provides a method and a device for extracting features of text classification. In order to better understand the feature extraction method for text classification provided in the embodiment of the present application, the following describes the feature extraction method for text classification in detail.
Referring to fig. 1, fig. 1 is a flowchart of a feature extraction method for text classification according to an embodiment of the present application, where the feature extraction method for text classification includes steps 101 to 104. The method execution body shown in fig. 1 may be an electronic device, or the body may be a chip in the electronic device. The method shown in fig. 1 is executed by an electronic device as an example. Wherein:
101. The electronic device performs feature extraction on the text data by adopting a plurality of feature extraction methods to obtain feature items.
In the embodiment of the application, common feature extraction methods include the chi-square test, the mutual information method, the information gain method, the gradient boosting decision tree method, supervised feature extraction methods and the like. It should be noted that other feature extraction methods may also be used to perform feature extraction on the text data, and the embodiment of the present application is not limited in this respect. In this way, the feature items obtained by the different feature extraction methods can conveniently be processed and analyzed in subsequent steps.
For example, the following description mainly takes the case in which the electronic device selects the chi-square test, the mutual information method, the information gain method, and the gradient boosting decision tree method to extract features from the text data, and obtains the feature items extracted by each feature extraction method.
Optionally, the text data may be a complete piece of text data, or may be one of the parts obtained by dividing a piece of text data into a plurality of parts. For example, a piece of text data is divided into 3 parts, and feature extraction is performed on one of the parts by using a plurality of feature extraction methods to obtain feature items. In this way, the optimal feature extraction method for each part of the text data can subsequently be determined, which improves the classification accuracy for the text data as a whole.
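As an illustration of step 101 for one of the listed methods, the chi-square test can be sketched in plain Python. This is a simplified sketch and not the application's implementation: it assumes a two-class problem, treats each document as a set of terms (presence or absence only), and uses the standard 2x2 contingency-table statistic; the names `chi_square_scores` and `top_k_terms` and the toy documents are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set


def chi_square_scores(
    docs: Sequence[Set[str]], labels: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Score each term by the chi-square statistic of term presence vs. class."""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == positive)
    with_term: Counter = Counter()      # documents containing the term
    with_term_pos: Counter = Counter()  # positive documents containing the term
    for tokens, y in zip(docs, labels):
        for term in tokens:
            with_term[term] += 1
            if y == positive:
                with_term_pos[term] += 1
    scores = {}
    for term, total in with_term.items():
        a = with_term_pos[term]  # term present, positive class
        b = total - a            # term present, other class
        c = n_pos - a            # term absent, positive class
        d = n - n_pos - b        # term absent, other class
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_k_terms(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus: two documents per class.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "team"}]
labels = [1, 1, 0, 0]
scores = chi_square_scores(docs, labels)
feature_items = top_k_terms(scores, k=2)
```

The ranked list plays the role of the feature items of one method; taking its first k entries corresponds to choosing a feature item dimension value of k. Other methods (MI, IG, GBDT) would produce their own rankings in the same shape.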
102. The electronic device determines, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy of the multiple feature extraction methods under different feature item dimension values.
In the embodiment of the application, the feature item dimension value is used for representing the number of feature items, and each feature extraction method has a corresponding classification accuracy under each feature item dimension value. A classifier can be used to evaluate and optimize the feature items. In this way, the classification accuracy of each adopted feature extraction method under different feature item dimension values can be determined.
For example, the electronic device selects the chi-square test (chi2), the mutual information method (MI), the information gain method (IG), and the gradient boosting decision tree method (GBDT) to perform feature extraction on text data, and takes a Random Forest (RF) as the classifier for checking the classification effect. The feature item dimension value k varies within [10, 200] with a step size of 5, and each feature extraction method has a corresponding classification accuracy under each feature item dimension value, as shown in fig. 2. For example, when the k value is 25, the classification accuracy of the chi-square test is 92%, the classification accuracy of the mutual information method is 76%, the classification accuracy of the information gain method is 86%, and the classification accuracy of the gradient boosting decision tree method is 84%.
Optionally, the dimension values of different feature items may be set in any range, the selected multiple feature extraction methods may be any feature extraction method, and the step length of the change in the dimension value of a feature item may be any numerical value, which is not limited herein.
Alternatively, the classifier for checking the classification effect may be any classifier, and is not limited herein.
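The sweep in step 102 can be sketched as a loop over methods and dimension values. The sketch below is an assumption-laden outline: `accuracy_curves` is a hypothetical name, the classifier is abstracted behind an `evaluate` callable (the example in the text uses a random forest, which is not implemented here), `toy_evaluate` is only a stand-in that returns a made-up score, and the range is assumed to include the end point 200.

```python
from typing import Callable, Dict, Iterable, List


def accuracy_curves(
    rankings: Dict[str, List[str]],
    evaluate: Callable[[List[str]], float],
    k_values: Iterable[int],
) -> Dict[str, Dict[int, float]]:
    """For every method, evaluate its top-k feature items for each k."""
    k_list = list(k_values)
    curves: Dict[str, Dict[int, float]] = {}
    for method, ranked_terms in rankings.items():
        curves[method] = {
            k: evaluate(ranked_terms[:k])
            for k in k_list
            if k <= len(ranked_terms)  # skip k larger than the available items
        }
    return curves


# Stand-in for "train a classifier on these features and return its accuracy".
informative = {f"t{i}" for i in range(20)}


def toy_evaluate(selected: List[str]) -> float:
    return len(informative & set(selected)) / len(informative)


# Hypothetical rankings produced by two feature extraction methods.
rankings = {
    "chi2": [f"t{i}" for i in range(200)],
    "MI": [f"t{i}" for i in reversed(range(200))],
}
curves = accuracy_curves(rankings, toy_evaluate, range(10, 201, 5))
```

Each entry of `curves` corresponds to one line of the kind plotted in fig. 2: a mapping from feature item dimension value to classification accuracy for one method.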
103. The electronic device determines the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods.
In a possible implementation manner, the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy of each feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
For example, the electronic device selects the chi-square test (chi2), the mutual information method (MI), the information gain method (IG), and the gradient boosting decision tree method (GBDT) to perform feature extraction on text data, and uses a random forest as the classifier for checking the classification effect, so as to determine from fig. 2 the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, as shown in fig. 3. The optimal classification accuracy of the chi-square test is 92%, and its optimal feature item dimension value is 25; the optimal classification accuracy of the mutual information method is 90%, and its optimal feature item dimension value is 55; the optimal classification accuracy of the information gain method is 92%, and its optimal feature item dimension value is 55; the optimal classification accuracy of the gradient boosting decision tree method is 92%, and its optimal feature item dimension value is 50. In addition, fig. 3 also shows that the optimal classification accuracy of the manual extraction method (byman) is 84%, and its optimal feature item dimension value is 80. Therefore, the optimal classification accuracy of each of the chi-square test, the mutual information method, the information gain method and the gradient boosting decision tree method is higher than that of the manual extraction method.
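Step 103 reduces each accuracy curve to a single (optimal accuracy, optimal dimension value) pair. A minimal sketch follows; the name `best_point` and the sample curve are hypothetical, and the rule for the case where several dimension values reach the same peak accuracy within one method (take the smallest) is an assumption, since the text only states that one optimal accuracy corresponds to one optimal dimension value.

```python
from typing import Dict, Tuple


def best_point(curve: Dict[int, float]) -> Tuple[float, int]:
    """Return (optimal accuracy, optimal feature item dimension value)."""
    best_accuracy = max(curve.values())
    # Assumption: if the peak is reached at several k, keep the smallest k.
    best_k = min(k for k, acc in curve.items() if acc == best_accuracy)
    return best_accuracy, best_k


# Hypothetical accuracy curve for one method (dimension value -> accuracy).
curve = {10: 0.80, 15: 0.88, 20: 0.92, 25: 0.92, 30: 0.90}
optimum = best_point(curve)
```

Applying `best_point` to the curve of every method yields the per-method pairs of the kind compared in fig. 3.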
104. The electronic device determines a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
In the embodiment of the present application, the manner of determining the target feature extraction method and the target feature item dimension value corresponding to the text data may be: the electronic device compares the optimal classification accuracies and the optimal feature item dimension values corresponding to the feature extraction methods, takes the feature extraction method corresponding to the optimal classification accuracy with the largest value as the target feature extraction method corresponding to the text data, and takes the optimal feature item dimension value corresponding to the optimal classification accuracy with the largest value as the target feature item dimension value.
Optionally, the manner of determining the target feature extraction method and the target feature item dimension value corresponding to the text data may also be: by comparing the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, the electronic device determines the optimal classification accuracies that are greater than a preset value; then, by comparing the optimal feature item dimension values corresponding to those optimal classification accuracies, the electronic device takes the feature extraction method corresponding to the optimal feature item dimension value with the minimum value as the target feature extraction method corresponding to the text data, and takes the optimal feature item dimension value with the minimum value as the target feature item dimension value.
Optionally, the method for determining the target feature extraction method and the target feature item dimension value corresponding to the text data may be other methods, which are not limited herein.
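The preset-value variant of step 104 can be sketched as a filter followed by a minimum. The optima below are the values given for fig. 3; everything else is an assumption made for illustration: the name `select_above_preset`, the preset value of 0.91, and returning `None` when no method exceeds the preset value (the text does not cover that case).

```python
from typing import Dict, Optional, Tuple


def select_above_preset(
    optima: Dict[str, Tuple[float, int]], preset: float
) -> Optional[Tuple[str, int]]:
    """Among methods whose optimal accuracy exceeds preset, pick the smallest dimension."""
    eligible = {
        name: dim for name, (acc, dim) in optima.items() if acc > preset
    }
    if not eligible:
        return None  # assumption: the text does not define this case
    target_method = min(eligible, key=eligible.get)
    return target_method, eligible[target_method]


# (optimal accuracy, optimal feature item dimension value) per method, from fig. 3.
optima = {
    "chi2": (0.92, 25),
    "MI": (0.90, 55),
    "IG": (0.92, 55),
    "GBDT": (0.92, 50),
    "manual": (0.84, 80),
}
target = select_above_preset(optima, preset=0.91)
```

With this hypothetical preset value, chi2, IG and GBDT are eligible, and chi2 is selected because its optimal feature item dimension value (25) is the smallest of the three.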
In the method described in fig. 1, the electronic device determines, for text data, a feature extraction method with an optimal classification effect and a feature item dimension value corresponding to the feature extraction method by comprehensively considering the optimal classification accuracy and the optimal feature item dimension value of a plurality of feature extraction methods. Therefore, based on the method described in fig. 1, it is beneficial to improve the accuracy of text classification.
Referring to fig. 4, fig. 4 is a schematic flowchart of another feature extraction method for text classification according to an embodiment of the present application. The feature extraction method for text classification comprises steps 401 to 407. Step 404 to step 407 are a specific implementation manner of step 104. The method execution body shown in fig. 4 may be an electronic device, or the body may be a chip in the electronic device. The method shown in fig. 4 is executed by an electronic device as an example. Wherein:
401. The electronic device performs feature extraction on the text data by adopting a plurality of feature extraction methods to obtain feature items.
402. The electronic device determines, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy of the multiple feature extraction methods under different feature item dimension values.
403. The electronic device determines the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods.
The specific implementation manners of steps 401 to 403 are the same as those of steps 101 to 103, and are not described herein again.
404. The electronic device determines the optimal classification accuracy with the maximum value by comparing the optimal classification accuracy corresponding to each feature extraction method.
The optimal classification accuracy with the maximum value may comprise one or more optimal classification accuracies, because several feature extraction methods can reach the same maximum. If it comprises one optimal classification accuracy, step 405 is executed. If it comprises multiple optimal classification accuracies (for example, refer to the embodiment shown in fig. 3), steps 406 and 407 are executed.
For example, the electronic device selects the chi-square test (chi2), the mutual information method (MI), the information gain method (IG), and the gradient boosting decision tree method (GBDT) to perform feature extraction on the text data. As shown in fig. 4, by comparing the optimal classification accuracy corresponding to each feature extraction method, the electronic device determines that there are 3 optimal classification accuracies with the maximum value, all of which are 92%.
405. The electronic equipment takes the feature extraction method corresponding to the optimal classification accuracy with the maximum numerical value as the target feature extraction method, and takes the optimal feature item dimension value corresponding to the optimal classification accuracy with the maximum numerical value as the target feature item dimension value.
For example, the electronic device selects the chi-square test (chi2), the mutual information method (MI), the information gain method (IG), and the gradient boosting decision tree method (GBDT) to perform feature extraction on the text data, where the optimal classification accuracy of the chi-square test is 92% and its optimal feature item dimension value is 25; the optimal classification accuracy of the mutual information method is 90% and its optimal feature item dimension value is 55; and the optimal classification accuracy of the information gain method is 91% and its optimal feature item dimension value is 50. By comparing the optimal classification accuracy corresponding to each feature extraction method, the electronic device determines that there is one optimal classification accuracy with the maximum value, namely 92%. The corresponding feature extraction method is the chi-square test and the corresponding optimal feature item dimension value is 25, so the electronic device takes the chi-square test as the target feature extraction method and 25 as the target feature item dimension value.
406. The electronic device determines the optimal feature item dimension value with the minimum value from the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracies.
407. The electronic equipment takes the feature extraction method corresponding to the optimal feature item dimension value with the minimum numerical value as the target feature extraction method, and takes the optimal feature item dimension value with the minimum numerical value as the target feature item dimension value.
In the embodiment of the present application, the optimal feature item dimension value with the minimum value is chosen because a lower dimension simplifies calculation and visualization, and helps to extract and combine effective information while discarding useless information. Based on this method, it is beneficial to improve the accuracy of text classification.
For example, the electronic device selects the chi-square test (chi2), the mutual information method (MI), the information gain method (IG), and the gradient boosting decision tree method (GBDT) to perform feature extraction on the text data, where the optimal classification accuracy of the chi-square test is 92% and its optimal feature item dimension value is 25; the optimal classification accuracy of the mutual information method is 90% and its optimal feature item dimension value is 55; and the optimal classification accuracy of the information gain method is 92% and its optimal feature item dimension value is 50. By comparing the optimal classification accuracy corresponding to each feature extraction method, the electronic device determines that there are 2 optimal classification accuracies with the maximum value, both 92%. Their optimal feature item dimension values are 25 and 50 respectively, so the optimal feature item dimension value with the minimum value is 25. The corresponding feature extraction method is the chi-square test, so the electronic device takes the chi-square test as the target feature extraction method and 25 as the target feature item dimension value.
In the method described in fig. 4, the electronic device determines the optimal classification accuracy with the largest value for the text data by considering the optimal classification accuracy and the optimal feature item dimension values of the plurality of feature extraction methods, so as to determine the feature extraction method with the optimal classification effect and the feature item dimension value corresponding to the feature extraction method. Therefore, based on the method described in fig. 4, it is beneficial to improve the accuracy of text classification.
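Steps 404 to 407 can be sketched as below. The function name `select_max_accuracy` and the dictionary layout are assumptions made for illustration, and accuracies are given as integer percentages so that ties can be tested with exact equality. GBDT is left out of the sample data because the examples above do not state its values.

```python
from typing import Dict, Tuple

# Hypothetical layout: method name -> (optimal accuracy in %, optimal feature item dimension value)
Optima = Dict[str, Tuple[int, int]]


def select_max_accuracy(optima: Optima) -> Tuple[str, int]:
    # Step 404: find the maximum optimal accuracy and every method that reaches it.
    best_acc = max(acc for acc, _ in optima.values())
    tied = [m for m, (acc, _) in optima.items() if acc == best_acc]
    if len(tied) == 1:
        # Step 405: a single maximum, so take that method and its dimension value.
        method = tied[0]
    else:
        # Steps 406-407: several maxima, so take the one with the smallest dimension value.
        method = min(tied, key=lambda m: optima[m][1])
    return method, optima[method][1]


# Values from the two worked examples above.
single_max = {"chi2": (92, 25), "MI": (90, 55), "IG": (91, 50)}
tied_max = {"chi2": (92, 25), "MI": (90, 55), "IG": (92, 50)}
```

With both sample inputs the function returns the chi-square test with a dimension value of 25, matching the results stated in the examples.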
Referring to fig. 5, fig. 5 is a schematic flowchart of another feature extraction method for text classification according to an embodiment of the present application. The method comprises steps 501 to 507, where steps 504 to 507 are a specific implementation of step 104. The method shown in fig. 5 may be executed by an electronic device or by a chip in the electronic device; the following description takes execution by an electronic device as an example. Wherein:
501. The electronic device performs feature extraction on the text data by adopting a plurality of feature extraction methods to obtain feature items.
502. The electronic equipment determines the classification accuracy of the multiple feature extraction methods based on the feature items corresponding to the multiple feature extraction methods under different feature item dimension values.
503. The electronic equipment determines the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods.
The specific implementation manners of steps 501 to 503 are the same as those of steps 401 to 403, and are not described herein again.
504. The electronic device determines the optimal classification accuracy with the maximum value by comparing the optimal classification accuracy corresponding to each feature extraction method.
505. The electronic device determines a first optimal classification accuracy.
In the embodiment of the present application, the difference between the first optimal classification accuracy and the optimal classification accuracy with the maximum value is within a preset range. The first optimal classification accuracy comprises one or more optimal classification accuracies.
Optionally, the preset range may be any range, and is not limited herein.
506. The electronic device determines the optimal feature item dimension value with the minimum value by comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the optimal classification accuracy with the maximum value.
507. The electronic equipment takes the feature extraction method corresponding to the optimal feature item dimension value with the minimum numerical value as the target feature extraction method, and takes the optimal feature item dimension value with the minimum numerical value as the target feature item dimension value.
For example, assume that the preset range for the difference between the first optimal classification accuracy and the optimal classification accuracy with the maximum value is 0% to 5%. The electronic device selects the chi-square test (chi2), the mutual information method (MI), the information gain method (IG), and the gradient boosting decision tree method (GBDT) to perform feature extraction on the text data, where the optimal classification accuracy of the chi-square test is 92% and its optimal feature item dimension value is 25; the optimal classification accuracy of the mutual information method is 86% and its optimal feature item dimension value is 55; and the optimal classification accuracy of the information gain method is 90% and its optimal feature item dimension value is 15. By comparing the optimal classification accuracy corresponding to each feature extraction method, the electronic device determines that the optimal classification accuracy with the maximum value is 92%. Only one optimal classification accuracy, 90%, differs from this maximum by an amount within the preset range (86% differs by 6% and is excluded), so the electronic device determines that there is one first optimal classification accuracy, namely 90%. The optimal feature item dimension value corresponding to the first optimal classification accuracy is 15, and the optimal feature item dimension value corresponding to the optimal classification accuracy with the maximum value is 25; by comparison, the optimal feature item dimension value with the minimum value is 15. The corresponding feature extraction method is the information gain method, so the electronic device takes the information gain method as the target feature extraction method and 15 as the target feature item dimension value.
For another example, assume that the preset range for the difference between the first optimal classification accuracy and the optimal classification accuracy with the maximum value is 0% to 5%. The electronic device selects the chi-square test (chi2), the mutual information method (MI), the information gain method (IG), and the gradient boosting decision tree method (GBDT) to perform feature extraction on the text data, where the optimal classification accuracy of the chi-square test is 92% and its optimal feature item dimension value is 25; the optimal classification accuracy of the mutual information method is 90% and its optimal feature item dimension value is 55; and the optimal classification accuracy of the information gain method is 90% and its optimal feature item dimension value is 15. By comparing the optimal classification accuracy corresponding to each feature extraction method, the electronic device determines that the optimal classification accuracy with the maximum value is 92%. Two optimal classification accuracies, both 90%, differ from this maximum by an amount within the preset range, so the electronic device determines that there are two first optimal classification accuracies, both 90%. The optimal feature item dimension values corresponding to the first optimal classification accuracies are 55 and 15 respectively, and the optimal feature item dimension value corresponding to the optimal classification accuracy with the maximum value is 25; by comparison, the optimal feature item dimension value with the minimum value is 15. The corresponding feature extraction method is the information gain method, so the electronic device takes the information gain method as the target feature extraction method and 15 as the target feature item dimension value.
In the method described in fig. 5, the electronic device determines, for text data, a feature extraction method with an optimal classification effect and a feature item dimension value corresponding to the feature extraction method by comprehensively considering the optimal classification accuracy and the optimal feature item dimension value of a plurality of feature extraction methods. Therefore, based on the method described in fig. 5, it is beneficial to improve the accuracy of text classification.
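Steps 504 to 507 can be sketched as below. The name `select_within_range`, the parameter `max_gap`, and the inclusive treatment of the range bounds are assumptions for illustration; the source only states that the difference lies within a preset range such as 0% to 5%. Accuracies are integer percentages, and GBDT is omitted because the examples do not give its values.

```python
from typing import Dict, Tuple

# Hypothetical layout: method name -> (optimal accuracy in %, optimal feature item dimension value)
Optima = Dict[str, Tuple[int, int]]


def select_within_range(optima: Optima, max_gap: int) -> Tuple[str, int]:
    # Step 504: the optimal accuracy with the maximum value.
    best_acc = max(acc for acc, _ in optima.values())
    # Step 505: keep the maximum itself plus every "first optimal classification accuracy",
    # i.e. every accuracy whose gap to the maximum is at most max_gap (bounds assumed inclusive).
    candidates = [m for m, (acc, _) in optima.items() if best_acc - acc <= max_gap]
    # Steps 506-507: among these candidates, pick the smallest optimal dimension value.
    method = min(candidates, key=lambda m: optima[m][1])
    return method, optima[method][1]


# Values from the two worked examples above, with a 0% to 5% preset range.
first_example = {"chi2": (92, 25), "MI": (86, 55), "IG": (90, 15)}
second_example = {"chi2": (92, 25), "MI": (90, 55), "IG": (90, 15)}
```

For both sample inputs the information gain method with a dimension value of 15 is returned: it gives up 2 percentage points of accuracy relative to the chi-square test in exchange for 10 fewer feature items.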
Referring to fig. 6, fig. 6 is a schematic structural diagram of a processing apparatus according to an embodiment of the present invention, where the processing apparatus may be an electronic device or an apparatus (e.g., a chip) having a function of the electronic device. Specifically, as shown in fig. 6, the processing apparatus 60 may include:
the processing unit 601 is configured to perform feature extraction on the text data by using multiple feature extraction methods to obtain feature items;
a determining unit 602, configured to determine, based on the feature items corresponding to the multiple feature extraction methods, the classification accuracy of the multiple feature extraction methods under different feature item dimension values, where the feature item dimension value represents the number of feature items;
the determining unit 602 is further configured to determine an optimal classification accuracy and an optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods;
the determining unit 602 is further configured to determine a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
Optionally, the optimal feature item dimension value corresponding to each feature extraction method is a feature item dimension value corresponding to the optimal classification accuracy of each feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
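The relationship between one method's accuracy measurements and its optimal pair can be sketched as follows. The accuracy curve is hypothetical data, and choosing the smallest dimension value when several dimension values reach the same best accuracy is an assumption made here so that one optimal classification accuracy maps to exactly one optimal feature item dimension value; the source does not specify this tie-break.

```python
from typing import Dict, Tuple


def per_method_optimum(curve: Dict[int, int]) -> Tuple[int, int]:
    """Map {feature item dimension value: accuracy in %} to (optimal accuracy, optimal dimension)."""
    best_acc = max(curve.values())
    # Assumption: if several dimension values reach best_acc, keep the smallest one.
    best_dim = min(dim for dim, acc in curve.items() if acc == best_acc)
    return best_acc, best_dim


# Hypothetical accuracy measurements for one feature extraction method.
chi2_curve = {10: 85, 25: 92, 50: 91, 100: 92}
chi2_optimum = per_method_optimum(chi2_curve)
```

Applying this to every feature extraction method yields the per-method pairs that the determining unit then compares.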
Optionally, the determining unit 602 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, in the following manner: determining the optimal classification accuracy with the maximum value by comparing the optimal classification accuracy corresponding to each feature extraction method, wherein the optimal classification accuracy with the maximum value comprises one or more optimal classification accuracies; and if the optimal classification accuracy with the maximum value comprises one optimal classification accuracy, taking the feature extraction method corresponding to that optimal classification accuracy as the target feature extraction method, and taking the optimal feature item dimension value corresponding to that optimal classification accuracy as the target feature item dimension value.
Optionally, the determining unit 602 is further configured to: if the optimal classification accuracy with the maximum value comprises a plurality of optimal classification accuracies, determine the optimal feature item dimension value with the minimum value from the plurality of optimal feature item dimension values corresponding to the plurality of optimal classification accuracies; and take the feature extraction method corresponding to the optimal feature item dimension value with the minimum value as the target feature extraction method, and take the optimal feature item dimension value with the minimum value as the target feature item dimension value.
Optionally, the determining unit 602 determines, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, that the target feature extraction method and the target feature item dimension value corresponding to the text data are specifically implemented in the following manner: determining the optimal classification accuracy with the maximum value by comparing the optimal classification accuracy corresponding to each feature extraction method; determining a first optimal classification accuracy, wherein the difference value between the first optimal classification accuracy and the optimal classification accuracy with the maximum value is within a preset range; determining the optimal feature item dimension value with the minimum numerical value by comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the optimal classification accuracy with the maximum numerical value; taking the feature extraction method corresponding to the optimal feature item dimension value with the minimum numerical value as the target feature extraction method, and taking the optimal feature item dimension value with the minimum numerical value as the target feature item dimension value; wherein the first optimal classification accuracy comprises one or more optimal classification accuracy.
The embodiment of the present invention and the embodiments of the methods shown in fig. 1, fig. 4, and fig. 5 are based on the same concept, and the technical effects thereof are also the same, and for the specific principle, reference is made to the description of the embodiments shown in fig. 1, fig. 4, and fig. 5, which is not repeated herein.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device 70 may comprise a memory 701, a processor 702, and a communication interface 703, which are connected by one or more communication buses. The communication interface 703 is controlled by the processor 702 and is used for transmitting and receiving information.
Memory 701 may include both read-only memory and random access memory and provides instructions and data to processor 702. A portion of memory 701 may also include non-volatile random access memory.
The processor 702 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor 702 may be any conventional processor or the like. Wherein:
a memory 701 for storing program instructions.
A processor 702 for calling program instructions stored in the memory 701.
The processor 702 calls the program instructions stored in the memory 701 to cause the electronic device 70 to perform the following operations:
carrying out feature extraction on the text data by adopting a plurality of feature extraction methods to obtain feature items;
determining the classification accuracy of the multiple feature extraction methods under different feature item dimension values based on the feature items corresponding to the multiple feature extraction methods, wherein the feature item dimension values are used for representing the number of the feature items;
determining the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods;
and determining a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
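The four operations can be tied together roughly as follows. Everything concrete in this sketch is an assumption: `choose_target`, the `Ranker` and `Scorer` callables, taking the top-k ranked feature items as the features at dimension value k, and using the maximum-accuracy rule with a smallest-dimension tie-break for the final operation (other selection rules described in this application could be substituted).

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interfaces: a ranker returns feature items ordered best first;
# a scorer returns the classification accuracy (%) obtained with the given feature items.
Ranker = Callable[[Sequence[str], Sequence[str]], List[str]]
Scorer = Callable[[List[str]], int]


def choose_target(texts: Sequence[str], labels: Sequence[str],
                  rankers: Dict[str, Ranker], dims: Sequence[int],
                  score: Scorer) -> Tuple[str, int]:
    optima: Dict[str, Tuple[int, int]] = {}
    for name, rank in rankers.items():
        ranked = rank(texts, labels)                   # operation 1: extract feature items
        curve = {k: score(ranked[:k]) for k in dims}   # operation 2: accuracy per dimension value
        best_acc = max(curve.values())                 # operation 3: per-method optimum
        best_dim = min(k for k, a in curve.items() if a == best_acc)
        optima[name] = (best_acc, best_dim)
    # Operation 4: highest optimal accuracy first, then fewest feature items.
    target = min(optima, key=lambda m: (-optima[m][0], optima[m][1]))
    return target, optima[target][1]


# Toy stand-ins for real extractors and a real classifier evaluation.
toy_rankers = {"a": lambda t, l: ["x", "y", "z"], "b": lambda t, l: ["z", "y", "x"]}
toy_score = lambda feats: 90 if "x" in feats else 50
toy_result = choose_target([], [], toy_rankers, [1, 2, 3], toy_score)
```

In the toy run both methods reach 90%, but method "a" does so with a single feature item, so it is chosen with a target dimension value of 1.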
In one implementation, the optimal feature item dimension value corresponding to each feature extraction method is a feature item dimension value corresponding to the optimal classification accuracy of each feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
In one implementation, the operation in which the processor 702 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, specifically includes: determining the optimal classification accuracy with the maximum value by comparing the optimal classification accuracy corresponding to each feature extraction method, wherein the optimal classification accuracy with the maximum value comprises one or more optimal classification accuracies; and if the optimal classification accuracy with the maximum value comprises one optimal classification accuracy, taking the feature extraction method corresponding to that optimal classification accuracy as the target feature extraction method, and taking the optimal feature item dimension value corresponding to that optimal classification accuracy as the target feature item dimension value.
In one implementation, the processor 702 is further configured to: if the optimal classification accuracy with the maximum value comprises a plurality of optimal classification accuracies, determine the optimal feature item dimension value with the minimum value from the plurality of optimal feature item dimension values corresponding to the plurality of optimal classification accuracies; and take the feature extraction method corresponding to the optimal feature item dimension value with the minimum value as the target feature extraction method, and take the optimal feature item dimension value with the minimum value as the target feature item dimension value. Based on this method, it is beneficial to improve the accuracy of text classification.
In an implementation manner, the operation of determining, by the processor 702, the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method specifically includes: determining the optimal classification accuracy with the maximum value by comparing the optimal classification accuracy corresponding to each feature extraction method; determining a first optimal classification accuracy, wherein the difference value between the first optimal classification accuracy and the optimal classification accuracy with the maximum value is within a preset range; determining the optimal feature item dimension value with the minimum numerical value by comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the optimal classification accuracy with the maximum numerical value; taking the feature extraction method corresponding to the optimal feature item dimension value with the minimum numerical value as the target feature extraction method, and taking the optimal feature item dimension value with the minimum numerical value as the target feature item dimension value; wherein the first optimal classification accuracy comprises one or more optimal classification accuracy.
It should be noted that details that are not mentioned in the embodiment corresponding to fig. 7 and specific implementation manners of each step may refer to the embodiments shown in fig. 1, fig. 4, and fig. 5 and the foregoing contents, and are not described again here.
The embodiment of the application also provides a chip, and the chip can execute the relevant steps of the electronic equipment in the embodiment of the method. The chip is used for extracting features of the text data by adopting a plurality of feature extraction methods to obtain feature items; determining the classification accuracy of the multiple feature extraction methods under different feature item dimension values based on the feature items corresponding to the multiple feature extraction methods, wherein the feature item dimension values are used for representing the number of the feature items; determining the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods; and determining a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
Optionally, the optimal feature item dimension value corresponding to each feature extraction method is a feature item dimension value corresponding to the optimal classification accuracy of each feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
Optionally, the chip determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, in the following manner: determining the optimal classification accuracy with the maximum value by comparing the optimal classification accuracy corresponding to each feature extraction method, wherein the optimal classification accuracy with the maximum value comprises one or more optimal classification accuracies; and if the optimal classification accuracy with the maximum value comprises one optimal classification accuracy, taking the feature extraction method corresponding to that optimal classification accuracy as the target feature extraction method, and taking the optimal feature item dimension value corresponding to that optimal classification accuracy as the target feature item dimension value.
Optionally, the chip is further configured to: if the optimal classification accuracy with the maximum value comprises a plurality of optimal classification accuracies, determine the optimal feature item dimension value with the minimum value from the plurality of optimal feature item dimension values corresponding to the plurality of optimal classification accuracies; and take the feature extraction method corresponding to the optimal feature item dimension value with the minimum value as the target feature extraction method, and take the optimal feature item dimension value with the minimum value as the target feature item dimension value.
Optionally, the specific implementation manner of the chip determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method is as follows: determining the optimal classification accuracy with the maximum value by comparing the optimal classification accuracy corresponding to each feature extraction method; determining a first optimal classification accuracy, wherein the difference value between the first optimal classification accuracy and the optimal classification accuracy with the maximum value is within a preset range; determining the optimal feature item dimension value with the minimum numerical value by comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the optimal classification accuracy with the maximum numerical value; taking the feature extraction method corresponding to the optimal feature item dimension value with the minimum numerical value as the target feature extraction method, and taking the optimal feature item dimension value with the minimum numerical value as the target feature item dimension value; wherein the first optimal classification accuracy comprises one or more optimal classification accuracy.
In a possible implementation, the chip includes at least one processor, at least one first memory, and at least one second memory; the at least one first memory and the at least one processor are interconnected through a line, and instructions are stored in the first memory; the at least one second memory and the at least one processor are interconnected through a line, and the second memory stores the data required to be stored in the method embodiment.
For each device or product applied to or integrated in the chip, each module included in the device or product may be implemented by hardware such as a circuit, or at least a part of the modules may be implemented by a software program running on a processor integrated in the chip, and the rest (if any) part of the modules may be implemented by hardware such as a circuit.
As shown in fig. 8, fig. 8 is a schematic structural diagram of a module device according to an embodiment of the present disclosure. The module device 80 can perform the steps related to the electronic device in the foregoing method embodiments, and the module device 80 includes: a communication module 801, a power module 802, a storage module 803, and a chip module 804.
The power module 802 is configured to provide power for the module device; the storage module 803 is used for storing data and instructions; the communication module 801 is used for performing module device internal communication or for performing communication between the module device and an external device; the chip module 804 is configured to: carrying out feature extraction on the text data by adopting a plurality of feature extraction methods to obtain feature items; determining the classification accuracy of the multiple feature extraction methods under different feature item dimension values based on the feature items corresponding to the multiple feature extraction methods, wherein the feature item dimension values are used for representing the number of the feature items; determining the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods; and determining a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
Optionally, the optimal feature item dimension value corresponding to each feature extraction method is a feature item dimension value corresponding to the optimal classification accuracy of each feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
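The per-method step described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the application: the table layout, the function name `best_per_method`, and the method names `chi_square` and `information_gain` are hypothetical placeholders, and the rule of keeping the smallest dimension value when several dimension values of one method reach the same accuracy is an assumption added so that one optimal classification accuracy maps to exactly one optimal feature item dimension value.

```python
from typing import Dict, Tuple

# Hypothetical input layout: for each feature extraction method, the
# classification accuracy measured at each feature item dimension value.
AccuracyTable = Dict[str, Dict[int, float]]


def best_per_method(table: AccuracyTable) -> Dict[str, Tuple[float, int]]:
    """Return {method: (optimal_accuracy, optimal_dimension_value)}."""
    result: Dict[str, Tuple[float, int]] = {}
    for method, by_dim in table.items():
        # Highest accuracy wins; on equal accuracy the smaller dimension
        # value is kept (an assumption, not stated in the text above).
        dim = min(by_dim, key=lambda d: (-by_dim[d], d))
        result[method] = (by_dim[dim], dim)
    return result


# Illustrative numbers only, not measured results.
table = {
    "chi_square": {100: 0.81, 200: 0.88, 300: 0.88},
    "information_gain": {100: 0.79, 200: 0.85, 300: 0.90},
}
optima = best_per_method(table)
```

Here `optima` maps each method to a single (accuracy, dimension value) pair, which is the input that the later selection step compares across methods.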
Optionally, the chip module 804 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, as follows: determining the optimal classification accuracy with the maximum value by comparing the optimal classification accuracies corresponding to the feature extraction methods, wherein the optimal classification accuracy with the maximum value comprises one or more optimal classification accuracies; and if the optimal classification accuracy with the maximum value comprises one optimal classification accuracy, taking the feature extraction method corresponding to that optimal classification accuracy as the target feature extraction method, and taking the optimal feature item dimension value corresponding to that optimal classification accuracy as the target feature item dimension value.
Optionally, the chip module 804 is further configured to: if the optimal classification accuracy with the maximum value comprises multiple optimal classification accuracies, determine the optimal feature item dimension value with the minimum value among the multiple optimal feature item dimension values corresponding to the multiple optimal classification accuracies; and take the feature extraction method corresponding to the optimal feature item dimension value with the minimum value as the target feature extraction method, and take the optimal feature item dimension value with the minimum value as the target feature item dimension value. Based on this method, the accuracy of text classification is improved.
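The two cases above (a single maximum, or several methods sharing the maximum) can be expressed in one small routine. The following is a minimal sketch under assumptions: the function name `select_target` and the input layout are hypothetical, accuracies are compared with exact equality, and the text does not say what happens when tied methods also share the same dimension value, so the sketch simply keeps the first one encountered.

```python
from typing import Dict, Tuple


def select_target(optima: Dict[str, Tuple[float, int]]) -> Tuple[str, int]:
    """optima maps method -> (optimal_accuracy, optimal_dimension_value)."""
    # Largest optimal classification accuracy across all methods.
    best_acc = max(acc for acc, _ in optima.values())
    # Methods reaching that maximum: one entry in the single-maximum case,
    # several entries when methods tie (exact equality is an assumption).
    tied = {m: dim for m, (acc, dim) in optima.items() if acc == best_acc}
    # Among tied methods, prefer the smallest optimal dimension value.
    method = min(tied, key=tied.get)
    return method, tied[method]


# Illustrative numbers only; method names "a", "b", "c" are placeholders.
single = select_target({"a": (0.90, 300), "c": (0.85, 100)})
tie = select_target({"a": (0.90, 300), "b": (0.90, 200), "c": (0.85, 100)})
```

With one maximum the routine returns that method and its dimension value; with a tie it returns the tied method that needs the fewest feature items.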
Optionally, the chip module 804 determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, as follows: determining the optimal classification accuracy with the maximum value by comparing the optimal classification accuracies corresponding to the feature extraction methods; determining a first optimal classification accuracy, wherein the difference between the first optimal classification accuracy and the optimal classification accuracy with the maximum value is within a preset range; determining the optimal feature item dimension value with the minimum value by comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the optimal classification accuracy with the maximum value; and taking the feature extraction method corresponding to the optimal feature item dimension value with the minimum value as the target feature extraction method, and taking the optimal feature item dimension value with the minimum value as the target feature item dimension value; wherein the first optimal classification accuracy comprises one or more optimal classification accuracies.
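The preset-range variant above trades a small loss in accuracy for fewer feature items. A minimal sketch follows, assuming the "preset range" is a non-negative threshold `tolerance` on the gap to the maximum accuracy; the function name, the parameter name, and the input layout are hypothetical and are not taken from the application.

```python
from typing import Dict, Tuple


def select_target_with_tolerance(
    optima: Dict[str, Tuple[float, int]], tolerance: float
) -> Tuple[str, int]:
    """optima maps method -> (optimal_accuracy, optimal_dimension_value)."""
    # Largest optimal classification accuracy across all methods.
    best_acc = max(acc for acc, _ in optima.values())
    # Candidates: the maximum itself plus every "first" optimal accuracy
    # whose gap to the maximum is within the preset range (assumed here
    # to mean gap <= tolerance).
    candidates = {
        m: dim for m, (acc, dim) in optima.items() if best_acc - acc <= tolerance
    }
    # Pick the candidate with the smallest optimal dimension value.
    method = min(candidates, key=candidates.get)
    return method, candidates[method]


# Illustrative numbers only; method names "a", "b", "c" are placeholders.
optima = {"a": (0.90, 300), "b": (0.89, 150), "c": (0.80, 50)}
relaxed = select_target_with_tolerance(optima, tolerance=0.02)
strict = select_target_with_tolerance(optima, tolerance=0.0)
```

In this example a tolerance of 0.02 admits method "b", which is slightly less accurate than "a" but needs half as many feature items; a tolerance of 0 reduces to picking the maximum.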
For each device or product applied to or integrated in the chip module, each module included in the device or product may be implemented by hardware such as a circuit, and different modules may be located in the same component (e.g., a chip, a circuit module, etc.) or in different components of the chip module; alternatively, at least some of the modules may be implemented by a software program running on a processor integrated in the chip module, and the remaining modules (if any) may be implemented by hardware such as a circuit.
Embodiments of the present application further provide a computer-readable storage medium in which instructions are stored; when the instructions are executed on a processor, the method flow of the above method embodiments is implemented.
Embodiments of the present application further provide a computer program product, where when the computer program product runs on a processor, the method flow of the above method embodiments is implemented.
It is noted that, for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present application is not limited by the order of acts, as some acts may, in accordance with the present application, occur in other orders and/or concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
The descriptions of the embodiments provided in the present application may be referred to one another. Each description has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. For convenience and brevity, for the functions and operations performed by the devices and apparatuses provided in the embodiments of the present application, reference may be made to the related descriptions of the method embodiments; the method embodiments and the device embodiments may also be referred to, combined with, or cited by one another.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (14)

1. A feature extraction method for text classification, the method comprising:
carrying out feature extraction on the text data by adopting a plurality of feature extraction methods to obtain feature items;
determining the classification accuracy of the multiple feature extraction methods under different feature item dimension values based on the feature items corresponding to the multiple feature extraction methods, wherein the feature item dimension values are used for representing the number of the feature items;
determining the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods;
and determining a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
2. The method according to claim 1, wherein the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy of that feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
3. The method according to claim 2, wherein the determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method comprises:
determining the optimal classification accuracy with the maximum numerical value by comparing the optimal classification accuracies corresponding to the feature extraction methods, wherein the optimal classification accuracy with the maximum numerical value comprises one or more optimal classification accuracies;
and if the optimal classification accuracy with the maximum numerical value comprises one optimal classification accuracy, taking the feature extraction method corresponding to the optimal classification accuracy with the maximum numerical value as the target feature extraction method, and taking the optimal feature item dimension value corresponding to the optimal classification accuracy with the maximum numerical value as the target feature item dimension value.
4. The method of claim 3, further comprising:
if the optimal classification accuracy with the maximum numerical value comprises a plurality of optimal classification accuracies, determining the optimal feature item dimension value with the minimum numerical value from the plurality of optimal feature item dimension values corresponding to the plurality of optimal classification accuracies;
and taking the feature extraction method corresponding to the optimal feature item dimension value with the minimum numerical value as the target feature extraction method, and taking the optimal feature item dimension value with the minimum numerical value as the target feature item dimension value.
5. The method according to claim 2, wherein the determining the target feature extraction method and the target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method comprises:
determining the optimal classification accuracy with the maximum value by comparing the optimal classification accuracy corresponding to each feature extraction method;
determining a first optimal classification accuracy, wherein the difference between the first optimal classification accuracy and the optimal classification accuracy with the maximum numerical value is within a preset range;
determining the optimal feature item dimension value with the minimum numerical value by comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the optimal classification accuracy with the maximum numerical value;
taking the feature extraction method corresponding to the optimal feature item dimension value with the minimum numerical value as the target feature extraction method, and taking the optimal feature item dimension value with the minimum numerical value as the target feature item dimension value;
wherein the first optimal classification accuracy comprises one or more optimal classification accuracies.
6. A processing apparatus, characterized in that the apparatus comprises:
the processing unit is used for extracting the features of the text data by adopting a plurality of feature extraction methods to obtain feature items;
the determining unit is used for determining the classification accuracy of the multiple feature extraction methods under different feature item dimension values based on the feature items corresponding to the multiple feature extraction methods, wherein the feature item dimension values are used for expressing the number of the feature items;
the determining unit is further configured to determine an optimal classification accuracy and an optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods;
the determining unit is further configured to determine a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
7. The apparatus according to claim 6, wherein the optimal feature item dimension value corresponding to each feature extraction method is the feature item dimension value corresponding to the optimal classification accuracy of that feature extraction method, and one optimal classification accuracy corresponds to one optimal feature item dimension value.
8. The apparatus according to claim 7, wherein the determining unit determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, by:
determining the optimal classification accuracy with the maximum numerical value by comparing the optimal classification accuracies corresponding to the feature extraction methods, wherein the optimal classification accuracy with the maximum numerical value comprises one or more optimal classification accuracies;
and if the optimal classification accuracy with the maximum numerical value comprises one optimal classification accuracy, taking the feature extraction method corresponding to the optimal classification accuracy with the maximum numerical value as the target feature extraction method, and taking the optimal feature item dimension value corresponding to the optimal classification accuracy with the maximum numerical value as the target feature item dimension value.
9. The apparatus of claim 8, wherein the determining unit is further configured to:
if the optimal classification accuracy with the maximum numerical value comprises a plurality of optimal classification accuracies, determining the optimal feature item dimension value with the minimum numerical value from the plurality of optimal feature item dimension values corresponding to the plurality of optimal classification accuracies;
and taking the feature extraction method corresponding to the optimal feature item dimension value with the minimum numerical value as the target feature extraction method, and taking the optimal feature item dimension value with the minimum numerical value as the target feature item dimension value.
10. The apparatus according to claim 7, wherein the determining unit determines the target feature extraction method and the target feature item dimension value corresponding to the text data, based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method, by:
determining the optimal classification accuracy with the maximum value by comparing the optimal classification accuracy corresponding to each feature extraction method;
determining a first optimal classification accuracy, wherein the difference between the first optimal classification accuracy and the optimal classification accuracy with the maximum numerical value is within a preset range;
determining the optimal feature item dimension value with the minimum numerical value by comparing the optimal feature item dimension value corresponding to the first optimal classification accuracy with the optimal feature item dimension value corresponding to the optimal classification accuracy with the maximum numerical value;
taking the feature extraction method corresponding to the optimal feature item dimension value with the minimum numerical value as the target feature extraction method, and taking the optimal feature item dimension value with the minimum numerical value as the target feature item dimension value;
wherein the first optimal classification accuracy comprises one or more optimal classification accuracies.
11. A chip, characterized in that,
the chip is used for extracting features of the text data by adopting a plurality of feature extraction methods to obtain feature items;
the chip is further used for determining the classification accuracy of the multiple feature extraction methods under different feature item dimension values based on the feature items corresponding to the multiple feature extraction methods, wherein the feature item dimension values are used for representing the number of the feature items;
the chip is also used for determining the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods;
the chip is further used for determining a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
12. A module device, characterized in that the module device comprises a communication module, a power module, a storage module and a chip module, wherein:
the power module is used for providing electric energy for the module device;
the storage module is used for storing data and instructions;
the communication module is used for carrying out internal communication of the module device or for carrying out communication between the module device and an external device;
the chip module is used for:
carrying out feature extraction on the text data by adopting a plurality of feature extraction methods to obtain feature items;
determining the classification accuracy of the multiple feature extraction methods under different feature item dimension values based on the feature items corresponding to the multiple feature extraction methods, wherein the feature item dimension values are used for representing the number of the feature items;
determining the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method based on the classification accuracy of the plurality of feature extraction methods;
and determining a target feature extraction method and a target feature item dimension value corresponding to the text data based on the optimal classification accuracy and the optimal feature item dimension value corresponding to each feature extraction method.
13. An electronic device comprising a memory for storing a computer program comprising program instructions and a processor configured to invoke the program instructions to perform the method of any of claims 1 to 5.
14. A computer readable storage medium having computer readable instructions stored thereon which, when run on a communication device, cause the communication device to perform the method of any of claims 1-5.
CN202110163603.3A 2021-02-05 2021-02-05 Feature extraction method and device for text classification Pending CN112989036A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110163603.3A CN112989036A (en) 2021-02-05 2021-02-05 Feature extraction method and device for text classification
PCT/CN2022/074714 WO2022166830A1 (en) 2021-02-05 2022-01-28 Feature extraction method and apparatus for text classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110163603.3A CN112989036A (en) 2021-02-05 2021-02-05 Feature extraction method and device for text classification

Publications (1)

Publication Number Publication Date
CN112989036A true CN112989036A (en) 2021-06-18

Family

ID=76348342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110163603.3A Pending CN112989036A (en) 2021-02-05 2021-02-05 Feature extraction method and device for text classification

Country Status (2)

Country Link
CN (1) CN112989036A (en)
WO (1) WO2022166830A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022166830A1 (en) * 2021-02-05 2022-08-11 北京紫光展锐通信技术有限公司 Feature extraction method and apparatus for text classification

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045503A (en) * 2016-02-05 2017-08-15 华为技术有限公司 The method and device that a kind of feature set is determined
CN107908715A (en) * 2017-11-10 2018-04-13 中国民航大学 Microblog emotional polarity discriminating method based on Adaboost and grader Weighted Fusion
CN109684476A (en) * 2018-12-07 2019-04-26 中科恒运股份有限公司 A kind of file classification method, document sorting apparatus and terminal device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766792B (en) * 2017-06-23 2021-02-26 北京理工大学 Remote sensing image ship target identification method
US20190377823A1 (en) * 2018-06-07 2019-12-12 Element Ai Inc. Unsupervised classification of documents using a labeled data set of other documents
US11295240B2 (en) * 2019-06-15 2022-04-05 Boult Terrance E Systems and methods for machine classification and learning that is robust to unknown inputs
CN110597878B (en) * 2019-09-16 2023-09-15 广东工业大学 Cross-modal retrieval method, device, equipment and medium for multi-modal data
CN111666748B (en) * 2020-05-12 2022-09-13 武汉大学 Construction method of automatic classifier and decision recognition method
CN112989036A (en) * 2021-02-05 2021-06-18 北京紫光展锐通信技术有限公司 Feature extraction method and device for text classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐芳: "对虚拟人的文本情感语义分析", 《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑》 *

Also Published As

Publication number Publication date
WO2022166830A1 (en) 2022-08-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210618