CN110781292A - Text data multi-level classification method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN110781292A
CN110781292A (publication) · CN201810828188.7A (application)
Authority
CN
China
Prior art keywords
classification, level, text data, sub, component
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810828188.7A
Other languages
Chinese (zh)
Inventor
叶君健
田绍伟
薛璐影
Current Assignee
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810828188.7A
Publication of CN110781292A
Legal status: Pending


Abstract

The application provides a text data multi-level classification method and device, an electronic device, and a storage medium. The device comprises a data layer and multiple levels of sub-classification components. The data layer performs vector encoding on text data to generate corresponding word vectors. Each level of sub-classification component performs feature extraction and classification on the word vectors generated by the data layer, together with the classification result produced by the previous-level sub-classification component, to determine the category to which the text data belongs at that level. By using the classification result of the upper-level sub-classification component as the classification basis of the lower-level sub-classification component, the device classifies the text data level by level along the hierarchical parent-child relationship, improving the accuracy of the hierarchical classification result.

Description

Text data multi-level classification method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for multi-level classification of text data, an electronic device, and a storage medium.
Background
With the development of internet technology, more and more documents such as articles and books are available on the network, and these documents are generally classified hierarchically so that users can find them easily. For example, a "primary-school language" document belongs to a three-level category: education -> primary education -> language.
In the related art, a plurality of independent Support Vector Machine (SVM) classifiers are mainly used to classify documents. Because these SVM classifiers are independent of one another, their classification accuracy on hierarchical classification tasks is poor.
Disclosure of Invention
The application provides a text data multi-level classification method, a text data multi-level classification device, electronic equipment and a storage medium, and aims to solve the problem that an SVM classifier in the related art is poor in accuracy of classification results for hierarchical classification tasks.
An embodiment of an aspect of the present application provides a text data multi-level classification device, including: a data layer and a multi-level sub-classification component;
the data layer is used for carrying out vector coding processing on the text data to generate word vectors corresponding to the text data;
and each level of sub-classification component is used for carrying out feature extraction and classification processing on the word vectors generated by the data layer and the classification results generated by the previous level of sub-classification component so as to determine the category of the text data at the level.
The text data multi-level classification device comprises a data layer and multiple levels of sub-classification components. The data layer performs vector encoding on the text data to generate corresponding word vectors. Each level of sub-classification component performs feature extraction and classification on the word vectors generated by the data layer, together with the classification result generated by the previous-level sub-classification component, to determine the category to which the text data belongs at that level. Because the classification result of the upper-level sub-classification component serves as the classification basis of the lower-level sub-classification component, the text data is classified level by level along the hierarchical parent-child relationship, improving the accuracy of the hierarchical classification result.
In another aspect, an embodiment of the present application provides a method for multi-level classification of text data, including:
carrying out vector coding processing on text data to be processed to generate word vectors corresponding to the text data;
performing feature extraction and classification processing on the word vectors by using a first-stage sub-classification component to determine a first-stage classification result corresponding to the text data;
determining a second-level target sub-classification component corresponding to the text data according to the first-level classification result;
performing feature extraction and classification processing on the word vectors and the first-stage classification result by using the second-stage target sub-classification component to determine a second-stage classification result corresponding to the text data;
and if the second-level target sub-classification component does not comprise a third-level sub-classification component, determining the category of the text data in each level of classification according to the second-level classification result.
In the method for multi-level classification of text data, vector encoding is first performed on the text data to be processed to generate corresponding word vectors. A first-level sub-classification component then performs feature extraction and classification on the word vectors to determine a first-level classification result for the text data, and a second-level target sub-classification component corresponding to the text data is determined according to the first-level classification result. The second-level target sub-classification component next performs feature extraction and classification on the word vectors and the first-level classification result to determine a second-level classification result, and if the second-level target sub-classification component contains no third-level sub-classification component, the category of the text data at each level of classification is determined from the second-level classification result. Because the first-level classification result is input to the second-level target sub-classification component, which is itself selected according to that result, and serves as its classification basis, the text data is classified level by level along the hierarchical parent-child relationship, improving the accuracy of the hierarchical classification result.
Another embodiment of the present application provides a text data multi-level classification device, including:
the encoding module is used for carrying out vector encoding processing on the text data to be processed so as to generate word vectors corresponding to the text data;
the first determining module is used for utilizing a first-level sub-classification component to perform feature extraction and classification processing on the word vectors so as to determine a first-level classification result corresponding to the text data;
the second determining module is used for determining a second-level target sub-classification component corresponding to the text data according to the first-level classification result;
the third determining module is used for utilizing the second-level target sub-classification component to perform feature extraction and classification processing on the word vectors and the first-level classification results so as to determine second-level classification results corresponding to the text data;
and the fourth determining module is used for determining the category of the text data in each level of classification according to the second level classification result when the second level target sub-classification component does not contain the third level sub-classification component.
The multi-level text data classification device provided by this embodiment performs vector encoding on the text data to be processed to generate corresponding word vectors, uses a first-level sub-classification component to perform feature extraction and classification on the word vectors to determine a first-level classification result, determines a second-level target sub-classification component corresponding to the text data according to the first-level classification result, uses the second-level target sub-classification component to perform feature extraction and classification on the word vectors and the first-level classification result to determine a second-level classification result, and, when the second-level target sub-classification component contains no third-level sub-classification component, determines the category of the text data at each level of classification according to the second-level classification result. In this way the first-level classification result is input to the second-level target sub-classification component, selected according to that result, as its classification basis, so that the text data is classified level by level along the hierarchical parent-child relationship, improving the accuracy of the hierarchical classification result.
Another embodiment of the present application provides an electronic device, including a processor and a memory;
the processor reads the executable program code stored in the memory to run a program corresponding to the executable program code, so as to implement the method for multi-level classification of text data according to the embodiment of the other aspect.
Another embodiment of the present application provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method for multi-level classification of text data described in the foregoing embodiment.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic structural diagram of a text data multi-level classification apparatus according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an input of a classification layer of a next-level sub-classification component according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a text data multi-level classification method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another text data multi-level classification method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a text data multi-level classification apparatus according to an embodiment of the present application;
FIG. 6 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
A text data multilevel classification method, apparatus, electronic device, and storage medium according to embodiments of the present application are described below with reference to the accompanying drawings.
The embodiment of the application provides a text data multi-level classification device aiming at the problem that an SVM classifier in the related art is poor in accuracy of classification results of hierarchical classification tasks.
The multi-level text data classification device comprises a data layer and a multi-level sub-classification component, and the multi-level text data classification device classifies text data step by utilizing a hierarchical parent-child relationship through a classification result of a previous-level sub-classification component as a classification basis of a next-level sub-classification component, so that the accuracy of a hierarchical classification result is improved.
Taking the example that the text data multi-level classification device includes three levels of sub-classification components, fig. 1 is a schematic structural diagram of a text data multi-level classification device provided in the embodiment of the present application.
As shown in fig. 1, the text data multi-level classification apparatus includes: a data layer 100, a first-level sub-classification component 200, and n second-level sub-classification components 210, 220, 230, …, 2n0.
The second-level sub-classification component 210 comprises n1 third-level sub-classification components 211, …, 21n1; the second-level sub-classification component 220 comprises n2 third-level sub-classification components 221, …, 22n2; the second-level sub-classification component 230 comprises n3 third-level sub-classification components 231, …, 23n3; …; and the second-level sub-classification component 2n0 comprises nn third-level sub-classification components 2n1, …, 2nnn.
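The nested component structure above can be pictured as a small tree. The sketch below is a hypothetical illustration only (the component numbers mimic the figure's labelling convention; the function name and child counts are assumptions, not from the patent):

```python
# Hypothetical sketch of the component hierarchy in fig. 1: each key names a
# second-level sub-classification component; its value lists the third-level
# components it contains. Numbering mimics the figure (210 contains 211, 212, ...).
def build_hierarchy(num_second=3, children_per_second=(2, 3, 2)):
    """Build a toy three-level component tree (names are illustrative only)."""
    tree = {"level1": {}}
    for j in range(1, num_second + 1):
        second = "2%d0" % j                       # e.g. components 210, 220, 230
        kids = ["2%d%d" % (j, k)                  # e.g. 211, 212 under 210
                for k in range(1, children_per_second[j - 1] + 1)]
        tree["level1"][second] = kids
    return tree

hierarchy = build_hierarchy()
```

In the patent, n second-level components hang off the single first-level component, and each second-level component may hold a different number of third-level children, which this sketch mirrors with `children_per_second`.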
In this embodiment, the data layer 100 is configured to perform vector encoding processing on the text data to generate a word vector corresponding to the text data. The text data may be header data of the text to be classified or a keyword set of the text to be classified.
For example, when an article entitled "Research on time synchronization algorithms for wireless sensor networks" is classified, the title "Research on time synchronization algorithms for wireless sensor networks" is the text data.
In practical application, when the text data is a sentence, such as an article title, a book name, and the like, Word segmentation processing may be performed on the text data to obtain a Word sequence of the text data, and then vector encoding processing may be performed on the Word sequence by using a Word2vec model to generate a corresponding Word vector.
For example, if there are N words in a sentence and each word is represented by a K-column vector, the word vector corresponding to the text data is an N × K matrix.
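The N × K shape can be made concrete with a toy lookup table. All embedding values below are made up for illustration; a real Word2vec model would learn them from a corpus:

```python
# Toy illustration of the N x K word-vector matrix: a sentence of N=4 words,
# each mapped to a K=3-dimensional vector (all values are invented).
embedding = {                        # hypothetical embedding lookup table
    "wireless": [0.1, 0.2, 0.3],
    "sensor":   [0.4, 0.1, 0.0],
    "network":  [0.2, 0.5, 0.1],
    "time":     [0.0, 0.3, 0.4],
}
sentence = ["wireless", "sensor", "network", "time"]   # word sequence after segmentation
matrix = [embedding[w] for w in sentence]              # N x K matrix, here 4 x 3
```

Each row of `matrix` is one word's K-dimensional vector, so the whole sentence becomes the N × K input that the convolution layers consume.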
When the text data is a keyword set, such as a plurality of keywords of an article, vector coding can be performed on a word sequence formed by all the keywords to generate a word vector of the keyword set.
Because the Word2vec model trains word vectors in an unsupervised manner, a large amount of text data that requires no manual labeling can be used to train word vectors carrying semantic information, which gives stronger generalization capability.
In this embodiment, each sub-classification component is configured to perform feature extraction and classification processing on the word vectors generated by the data layer and the classification results generated by the preceding sub-classification component, so as to determine the category to which the text data belongs at the level.
Since the first-level sub-classification component 200 in fig. 1 has no previous-level sub-classification component, it performs feature extraction and classification according to the word vectors alone to determine the category to which the text data belongs in the first-level classification.
After the first-level sub-classification component 200 finishes classification, a second-level target sub-classification component corresponding to the text data can be determined from the n second-level sub-classification components 210, 220, 230, …, 2n0 according to the classification result, and the second-level target sub-classification component can perform feature extraction and classification processing according to the word vectors and the classification result of the first-level sub-classification component 200 to determine a second-level classification result of the text data in the second-level classification.
If the second-level target sub-classification component is the second-level sub-classification component 220, according to the second-level classification result of the second-level sub-classification component 220, the third-level target sub-classification component corresponding to the text data is determined from the n2 third-level sub-classification components 221, …, and 22n2 included in the second-level sub-classification component 220.
And the third-stage target sub-classification component performs feature extraction and classification processing according to the word vectors and the second-stage classification result to obtain a third-stage classification result, and further determines the category of the text data in each stage of classification according to the third-stage classification result.
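The level-by-level routing described above can be sketched as a small dispatch loop. The classifier functions below are stand-ins for the trained sub-classification components (their names, signatures, and the toy labels are assumptions for illustration, not the patent's implementation):

```python
# Minimal sketch of the cascade in fig. 1: each level's component receives the
# word vector plus the parent's classification result and returns a label and a
# result to pass downward. All names and the toy components are illustrative.
def classify_cascade(word_vec, level1, level2, level3):
    """Route text through three levels, feeding each parent result downward."""
    l1_label, l1_result = level1(word_vec, None)              # first level: no parent
    l2_label, l2_result = level2[l1_label](word_vec, l1_result)
    l3_label, _ = level3[(l1_label, l2_label)](word_vec, l2_result)
    return [l1_label, l2_label, l3_label]

# Toy components that always emit a fixed label (real ones would classify).
def toy(label):
    return lambda vec, parent: (label, vec)

labels = classify_cascade(
    [0.9, 0.1],
    toy("education"),
    {"education": toy("primary")},
    {("education", "primary"): toy("language")},
)
```

Note how only the sub-tree selected by the parent's label is ever invoked, which is the routing behavior the apparatus relies on.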
It should be noted that fig. 1 shows only three levels of sub-classification components; this should not be taken as a limitation of the present application, and the number of levels of sub-classification components may be adjusted according to actual needs. When no next-level sub-classification component exists, the category to which the text data belongs at each level of classification can be determined, and the classification of the text data ends. The number of next-level sub-classification components corresponding to each sub-classification component may be the same or different.
According to the multi-level text data classification device, the classification result of the sub-classification component at the previous level is used as the basis of the classification at the next level, the class of the text data at each level is determined step by step according to the parent-child relationship among the levels, and the accuracy of the level classification result is greatly improved.
Further, each level of sub-classification components in the above embodiments includes a plurality of convolution layers, a max-pooling layer, and a classification layer. Fig. 2 is an input schematic diagram of a classification layer of a next-level sub-classification component according to an embodiment of the present application.
As shown in fig. 2, the input of the classification layer in the j-th sub-classification component at level i includes the output of the max-pooling layer of the k-th sub-classification component at level i-1 and the output of the max-pooling layer of the j-th sub-classification component at level i itself. Likewise, the input of the classification layer in the sub-classification component at level i-1 may include the output of the max-pooling layer of its parent component at level i-2 and the output of its own max-pooling layer, where the component at level i-2 is the previous-level component containing the component at level i-1.
Here i, j and k are natural numbers, and i is greater than or equal to 2.
It is to be understood that when i is 2, the previous level is the first level; since the first-level sub-classification component has no previous-level sub-classification component, the input of its classification layer includes only the output of its own max-pooling layer.
And the kth sub-classification component is a parent component of the jth sub-classification component. For example, a first-level sub-classification component is a parent component to a second-level sub-classification component, each of which is in turn a parent component to the respective third-level sub-classification components it includes.
Therefore, using the output of the parent component's max-pooling layer as part of the input to the classification layer of the next-level sub-classification component can improve the accuracy of the next-level classification result. In this embodiment, the output of the max-pooling layer in each level of sub-classification component is obtained after the word vectors are processed by the plurality of convolution layers and the max-pooling layer. Specifically, the convolution layers extract convolution features from the word vectors; the extracted features are then input to the max-pooling layer, which pools the features output by the convolution layers, combines them into one feature vector, and passes that vector to the classification layer.
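The convolution-then-max-pool path just described can be sketched in a few lines. This is a pure-Python stand-in over a 1-D sequence of scalar word scores; the kernel values and sizes are invented for illustration and are not the patent's trained filters:

```python
# Hedged sketch of one component's feature path: valid 1-D convolution over the
# word sequence, then max-over-time pooling, yielding one value per filter so
# the pooled feature vector has a fixed length regardless of sentence length.
def conv1d(rows, kernel):
    """Valid 1-D convolution of a kernel over a list of scalars."""
    k = len(kernel)
    return [sum(rows[i + t] * kernel[t] for t in range(k))
            for i in range(len(rows) - k + 1)]

def conv_maxpool(word_scores, kernels):
    """Apply each kernel, then max-pool each feature map to a single value."""
    return [max(conv1d(word_scores, kern)) for kern in kernels]

pooled = conv_maxpool([1.0, 2.0, 3.0, 2.0], [[1.0, 1.0], [1.0, -1.0]])
```

Because max pooling collapses each feature map to one scalar, the pooled vector's dimension equals the number of filters, which is what lets every level emit pooled outputs of the same size.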
It should be noted that, for the first-level sub-classification component, the input of the classification layer is simply the output of the max-pooling layer in that component.
For ease of processing, the output of the max-pooling layer has the same dimension in every level of sub-classification component. Because the pooled output of the upper-level sub-classification component therefore matches the dimension of the pooled output of the current-level sub-classification component, the classification layer of the current-level component can operate on both.
In the embodiment of the application, the sub-classification component of each level combines the feature vector extracted by the parent component thereof to determine the category of the text data at the level, so that the accuracy of the classification result is improved.
In order to implement the foregoing embodiment, the embodiment of the present application further provides a text data multi-level classification method. Fig. 3 is a flowchart illustrating a text data multi-level classification method according to an embodiment of the present application.
The text data multi-level classification method of the embodiment of the application can be executed by another text data multi-level classification device provided by the embodiment of the application, and the device can be configured in electronic equipment such as computers, mobile phones and other equipment with operating systems.
As shown in fig. 3, the method for multi-level classification of text data includes:
step 301, performing vector encoding processing on the text data to be processed to generate a word vector corresponding to the text data.
The text data to be processed may be title data of the text to be classified or a keyword set of the text to be classified.
For example, when an article entitled "research on time synchronization algorithm of wireless sensor network" is classified, the article entitled "research on time synchronization algorithm of wireless sensor network" is text data to be processed.
In practical application, when the text data to be processed is a sentence, such as an article title, a book name, and the like, Word segmentation processing may be performed on the text data to obtain a Word sequence of the text data, and then vector encoding processing may be performed on the Word sequence by using a Word2vec model to generate a corresponding Word vector.
For example, if there are N words in a sentence and each word is represented by a K-column vector, the word vector corresponding to the text data is an N × K matrix.
When the text data is a keyword set, such as a plurality of keywords of an article, all the keywords in the keyword set can be combined into a word sequence corresponding to the keyword set, and then the word sequence is subjected to vector coding to generate a word vector of the keyword set.
Because the Word2vec model trains word vectors in an unsupervised manner, a large amount of text data that requires no manual labeling can be used to train word vectors carrying semantic information, which gives stronger generalization capability.
Step 302, using the first-level sub-classification component to perform feature extraction and classification processing on the word vectors so as to determine a first-level classification result corresponding to the text data.
After generating word vectors of text data to be processed, inputting the word vectors into a first-stage sub-classification component, and performing feature extraction and classification processing on the word vectors by the first-stage sub-classification component to determine a first-stage classification result of the text data at the first stage.
Specifically, the convolution layers in the first-level sub-classification component extract convolution features from the word vectors; the extracted features are input to the max-pooling layer, which pools the features output by the convolution layers and combines them into one feature vector; this vector is then input to the classification layer, which processes the convolution-pooling result to obtain the first-level classification result.
Step 303, determining a second-level target sub-classification component corresponding to the text data according to the first-level classification result.
As a possible implementation manner, assuming that there are N second-level sub-classification components, the first-level classification result may be a vector including N elements, where each element is used to represent a probability that the text data belongs to the corresponding second-level sub-classification component, and then the second-level sub-classification component with the highest probability may be used as the second-level target sub-classification component.
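The first routing strategy above is an argmax over the first-level probability vector. A minimal sketch, assuming the component names are purely illustrative:

```python
# Sketch of target selection: the first-level result is a probability vector
# over the N second-level components; the highest-probability component becomes
# the second-level target. Component names below are hypothetical.
def pick_target(prob_vector, components):
    """Return the component whose index has the highest probability."""
    best = max(range(len(prob_vector)), key=lambda i: prob_vector[i])
    return components[best]

target = pick_target([0.1, 0.7, 0.2], ["sub_210", "sub_220", "sub_230"])
```

With the probabilities shown, the second component wins, so text routed this way would next be classified by the hypothetical `sub_220` component.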
As another possible implementation manner, the first-stage classification result may also be a category label to which the text data to be classified belongs in the first-stage classification, and then a second-stage sub-classification component corresponding to the category label may be used as a second-stage target sub-classification component according to the category label.
And 304, utilizing a second-level target sub-classification component to perform feature extraction and classification processing on the word vectors and the first-level classification results so as to determine second-level classification results corresponding to the text data.
In this embodiment, the word vectors of the text data to be processed and the first-stage classification result are input to the second-stage target sub-classification component, and the second-stage target sub-classification component performs feature extraction and classification processing on the word vectors and the first-stage classification result.
As one possible implementation, the plurality of convolution layers and the max-pooling layer of the second-level target sub-classification component perform feature extraction on the word vectors, and the classification layer then classifies the feature result output by the max-pooling layer together with the classification result of the first-level sub-classification component, obtaining the second-level classification result corresponding to the text data.
As another possible implementation, feature extraction and classification may be performed on the word vectors together with the feature vectors extracted by the first-level sub-classification component. Specifically, the word vectors are processed by the plurality of convolution layers and the max-pooling layer in the second-level target sub-classification component; the output of this max-pooling layer and the output of the first-level sub-classification component's max-pooling layer are both input to the classification layer, which performs classification to generate the second-level classification result.
For ease of processing, the output of the max-pooling layer has the same dimension in every level of sub-classification component, so the pooled output of the upper-level sub-classification component matches the dimension of the pooled output of the current-level sub-classification component and the classification layer of the current-level component can operate on both.
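The second implementation above amounts to concatenating the parent's pooled feature vector with the current component's pooled vector before the classification layer. A hedged sketch, where the softmax layer and its weights are stand-ins rather than the patent's trained classifier:

```python
# Illustrative sketch: concatenate the parent component's pooled features with
# the current component's pooled features, then apply a toy softmax layer.
# The weight matrix below is invented; a real model would learn it.
import math

def classify_with_parent(parent_pooled, own_pooled, weights):
    """Concatenate parent and own pooled features, then apply a softmax layer."""
    features = parent_pooled + own_pooled             # same dimension at each level
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]                  # class probabilities

probs = classify_with_parent([1.0, 0.0], [0.0, 1.0],
                             [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]])
```

Because the two pooled vectors have equal dimension, the concatenated feature length is fixed, so one classification-layer weight matrix per component suffices regardless of which parent produced the upstream features.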
And 305, if the second-level target sub-classification component does not comprise a third-level sub-classification component, determining the category of the text data in each level of classification according to the second-level classification result.
In this embodiment, if the second-level target sub-classification component does not include the third-level sub-classification component, that is, the second-level sub-classification component corresponds to the last-level classification, the category to which the text data belongs in the first level and the second level is determined according to the second-level classification result.
When the first-level sub-classification component outputs the first-level classification result, the class of the text data to be processed in the first-level classification can be determined according to the first-level classification result. When the text data to be processed belongs to the category in the second-level classification, the category to which the text data belongs in the second-level classification can be determined according to the second-level classification result.
Further, in this embodiment, the second-level target sub-classification component may also include a plurality of third-level sub-classification components, and a third-level target sub-classification component may then determine, according to the second-level classification result, the category to which the text data to be processed belongs in the third-level classification. Fig. 4 is a flowchart illustrating another text data multi-level classification method according to an embodiment of the present application.
After determining the second-level classification result corresponding to the text data, as shown in fig. 4, the method for multi-level classification of text data may further include:
step 401, if the second-level target sub-classification component includes a plurality of third-level sub-classification components, determining a third-level target sub-classification component according to the second-level classification result.
In this embodiment, when the second-level target sub-classification component includes a plurality of third-level sub-classification components, the third-level target sub-classification component corresponding to the text data to be processed may be determined according to the second-level classification result. For the specific determination process, refer to the method, described in the above embodiment, of determining the second-level target sub-classification component according to the first-level classification result.
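The routing step, picking the child component that corresponds to the winning class of the current level, can be sketched as follows. The class-to-component table and its labels are hypothetical; the patent does not name concrete categories.

```python
# Hypothetical table: the arg-max class of the first-level result selects
# which second-level sub-classification component processes the text.
SECOND_LEVEL_COMPONENTS = {
    0: "news-subclassifier",
    1: "sports-subclassifier",
    2: "finance-subclassifier",
}

def pick_target_component(level_result, components):
    """Return the child component registered under the winning class of
    the current level's classification result."""
    winning_class = max(range(len(level_result)), key=level_result.__getitem__)
    return components[winning_class]
```

The same selection logic applies at every level: a second-level result would select among third-level sub-classification components in exactly the same way.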
And step 402, utilizing a third-level target sub-classification component to perform feature extraction and classification processing on the word vectors and the second-level classification results so as to determine third-level classification results corresponding to the text data until the categories of the text data in all levels of classifications are determined.
After the third-level target sub-classification component is determined, the word vectors and the second-level classification results of the text data to be processed can be input into the third-level target sub-classification component, and feature extraction and classification processing are carried out on the word vectors and the second-level classification results by the third-level target sub-classification component to obtain third-level classification results.
Then, it is determined whether the third-level target sub-classification component contains a fourth-level sub-classification component. If it does, a fourth-level classification result is further determined; otherwise, the category of the text data in each level of classification is determined. These steps are repeated down to the last level of sub-classification component, at which point the category of the text data in each level of classification is determined.
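The repeat-until-last-level logic above amounts to a recursion over a tree of components. The dict-based component representation here is an assumption for illustration: each component exposes a `classify` callable and an optional `children` map from class index to child component.

```python
def classify_all_levels(word_vectors, component, parent_result=None):
    """Classify at this level, then recurse into the child component of
    the winning class (if any), passing this level's result down as the
    classification basis; stops at the last level."""
    result = component["classify"](word_vectors, parent_result)
    winner = max(range(len(result)), key=result.__getitem__)
    child = component.get("children", {}).get(winner)
    if child is None:  # no further sub-classification components: last level
        return [winner]
    return [winner] + classify_all_levels(word_vectors, child, result)
```

Calling this on the first-level component returns the class index chosen at every level, top to bottom, which is exactly the per-level category assignment the method produces.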
According to the text data multi-level classification method, when the second-level target sub-classification component includes third-level sub-classification components, the third-level classification result is further determined according to the second-level classification result. Each next-level classification result is thus determined on the basis of the classification result of its upper-level parent component, which greatly improves classification accuracy.
In order to implement the above embodiments, the present application further provides a text data multi-level classification device. Fig. 5 is a schematic structural diagram of a text data multi-level classification device according to an embodiment of the present application.
As shown in fig. 5, the text data multi-level classification apparatus includes: an encoding module 510, a first determining module 520, a second determining module 530, a third determining module 540, and a fourth determining module 550.
The encoding module 510 is configured to perform vector encoding processing on text data to be processed to generate a word vector corresponding to the text data;
the first determining module 520 is configured to perform feature extraction and classification processing on the word vectors by using the first-level sub-classification component to determine a first-level classification result corresponding to the text data.
The second determining module 530 is configured to determine a second-level target sub-classification component corresponding to the text data according to the first-level classification result.
The third determining module 540 is configured to perform feature extraction and classification processing on the word vectors and the first-stage classification results by using the second-stage target sub-classification component, so as to determine second-stage classification results corresponding to the text data.
The fourth determining module 550 is configured to determine, according to the second-stage classification result, a category to which the text data belongs in each stage of classification when the second-stage target sub-classification component does not include the third-stage sub-classification component.
In a possible implementation manner of the embodiment of the present application, the third determining module 540 is further configured to:
and carrying out feature extraction and classification processing on the word vectors and the feature vectors extracted by the first-level sub-classification component.
In a possible implementation manner of the embodiment of the present application, the text data to be processed is header data of the text to be classified, or is a keyword set corresponding to the text to be classified.
In a possible implementation manner of the embodiment of the application, the second determining module 530 is further configured to, after determining the second-level classification result corresponding to the text data, determine, according to the second-level classification result, a third-level target sub-classification component when the second-level target sub-classification component includes a plurality of third-level sub-classification components;
the third determining module 540 is further configured to perform feature extraction and classification processing on the word vectors and the second-level classification results by using a third-level target sub-classification component to determine third-level classification results corresponding to the text data until determining the category of the text data in each level of classification.
In a possible implementation manner of the embodiment of the application, if the text data to be processed is a sentence; the encoding module 510 is further configured to:
performing word segmentation on the text data to be processed, and determining a word sequence corresponding to the text data to be processed;
and carrying out vector coding processing on the Word sequence by using a Word2vec model.
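A toy version of this data layer is sketched below. The whitespace split stands in for a real word segmenter (Chinese text would need a proper segmentation tool), and the deterministic per-word random vectors stand in for trained Word2vec embeddings; both are illustrative assumptions, not the patent's components.

```python
import random

def encode(text, dim=8, seed=42):
    """Data-layer stand-in: segment text into a word sequence, then map
    each word to a fixed dim-dimensional vector.  Seeding a fresh RNG
    per word keeps the mapping deterministic, like an embedding lookup."""
    words = text.split()  # toy segmentation; not suitable for Chinese text
    def vec(word):
        wrng = random.Random(f"{seed}:{word}")
        return [wrng.uniform(-1.0, 1.0) for _ in range(dim)]
    return [vec(w) for w in words]

word_vectors = encode("text data multi level classification")
```

The resulting list of word vectors is what every sub-classification component receives as input, alongside the parent component's classification result.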
It should be noted that the explanation of the embodiment of the text data multi-level classification method is also applicable to the text data multi-level classification device, and therefore is not repeated herein.
The multi-level text data classification device provided by the embodiment of the application generates word vectors corresponding to text data by performing vector coding processing on the text data to be processed, performs feature extraction and classification processing on the word vectors by using a first-level sub-classification component to determine a first-level classification result corresponding to the text data, determines a second-level target sub-classification component corresponding to the text data according to the first-level classification result, performs feature extraction and classification processing on the word vectors and the first-level classification result by using the second-level target sub-classification component to determine a second-level classification result corresponding to the text data, and determines the category of the text data in each level of classification according to the second-level classification result when the second-level target sub-classification component does not contain a third-level sub-classification component. The classification result of the first-level sub-classification component is thus input into the second-level target sub-classification component, which is itself selected according to that result, and serves as its classification basis, so that the text data is classified level by level using the hierarchical parent-child relationship and the accuracy of the hierarchical classification result is improved.
In order to implement the foregoing embodiments, an electronic device is further provided in an embodiment of the present application, including a processor and a memory; the processor reads the executable program code stored in the memory to run a program corresponding to the executable program code, so as to implement the multilevel text data classification method according to the embodiment.
FIG. 6 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present application. The electronic device 12 shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in FIG. 6, electronic device 12 is embodied in the form of a general purpose computing device. The components of electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus, to name a few.
Electronic device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 30 and/or cache Memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, and commonly referred to as a "hard drive"). Although not shown in FIG. 6, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only memory (CD-ROM), a Digital versatile disk Read Only memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
Electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with electronic device 12, and/or with any devices (e.g., network card, modem, etc.) that enable electronic device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public Network such as the Internet) via the Network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 via the bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing, for example, implementing the methods mentioned in the foregoing embodiments, by executing programs stored in the system memory 28.
In order to implement the foregoing embodiments, the present application further proposes a non-transitory computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the non-transitory computer-readable storage medium implements the text data multi-level classification method according to the foregoing embodiments.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (12)

1. A multilevel text data classification device, comprising: a data layer and a multi-level sub-classification component;
the data layer is used for carrying out vector coding processing on the text data to generate word vectors corresponding to the text data;
and each level of sub-classification component is used for carrying out feature extraction and classification processing on the word vectors generated by the data layer and the classification results generated by the previous level of sub-classification component so as to determine the category of the text data at the level.
2. The classification apparatus of claim 1, wherein each level of sub-classification components includes a plurality of convolution layers, a max-pooling layer, and a classification layer;
the input of the classification layer in the ith-level jth sub-classification component comprises an output result of a maximum pooling layer of the ith-1-level kth sub-classification component and an output result of a maximum pooling layer in the ith-level jth sub-classification component, wherein the kth sub-classification component is a parent component of the jth sub-classification component, and i, j and k are natural numbers respectively.
3. The classification apparatus of claim 2, wherein the dimension of the maximum pooling layer output result in each level of sub-classification components is the same.
4. The classification apparatus according to claim 1, wherein the data layer is specifically configured to perform vector encoding processing on the text data using a Word2vec model.
5. A method for multi-level classification of text data, comprising:
carrying out vector coding processing on text data to be processed to generate word vectors corresponding to the text data;
performing feature extraction and classification processing on the word vectors by using a first-stage sub-classification component to determine a first-stage classification result corresponding to the text data;
determining a second-level target sub-classification component corresponding to the text data according to the first-level classification result;
performing feature extraction and classification processing on the word vectors and the first-stage classification result by using the second-stage target sub-classification component to determine a second-stage classification result corresponding to the text data;
and if the second-level target sub-classification component does not comprise a third-level sub-classification component, determining the category of the text data in each level of classification according to the second-level classification result.
6. The method of claim 5, wherein said performing feature extraction and classification on said word vectors and said first stage classification results comprises:
and carrying out feature extraction and classification processing on the word vectors and the feature vectors extracted by the first-level sub-classification component.
7. The method as claimed in claim 5, wherein the text data to be processed is header data of the text to be classified or a keyword set corresponding to the text to be classified.
8. The method of any of claims 5-7, wherein after determining the second level classification result corresponding to the text data, further comprising:
if the second-level target sub-classification component comprises a plurality of third-level sub-classification components, determining a third-level target sub-classification component according to the second-level classification result;
and performing feature extraction and classification processing on the word vectors and the second-stage classification results by using the third-stage target sub-classification component to determine third-stage classification results corresponding to the text data until the categories of the text data in all stages of classifications are determined.
9. The method according to any one of claims 5-7, wherein if the text data to be processed is a sentence;
the vector encoding processing is carried out on the text data to be processed, and the vector encoding processing comprises the following steps:
performing word segmentation on the text data to be processed, and determining a word sequence corresponding to the text data to be processed;
and carrying out vector coding processing on the Word sequence by using a Word2vec model.
10. A multilevel text data classification device, comprising:
the encoding module is used for carrying out vector encoding processing on the text data to be processed so as to generate word vectors corresponding to the text data;
the first determining module is used for utilizing a first-level sub-classification component to perform feature extraction and classification processing on the word vectors so as to determine a first-level classification result corresponding to the text data;
the second determining module is used for determining a second-level target sub-classification component corresponding to the text data according to the first-level classification result;
the third determining module is used for utilizing the second-level target sub-classification component to perform feature extraction and classification processing on the word vectors and the first-level classification results so as to determine second-level classification results corresponding to the text data;
and the fourth determining module is used for determining the category of the text data in each level of classification according to the second level classification result when the second level target sub-classification component does not contain the third level sub-classification component.
11. An electronic device comprising a processor and a memory;
wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the text data multi-level classification method according to any one of claims 5 to 9.
12. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for multi-level classification of text data according to any one of claims 5 to 9.
CN201810828188.7A 2018-07-25 2018-07-25 Text data multi-level classification method and device, electronic equipment and storage medium Pending CN110781292A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810828188.7A CN110781292A (en) 2018-07-25 2018-07-25 Text data multi-level classification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110781292A true CN110781292A (en) 2020-02-11

Family

ID=69377258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810828188.7A Pending CN110781292A (en) 2018-07-25 2018-07-25 Text data multi-level classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110781292A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737465A (en) * 2020-06-15 2020-10-02 上海理想信息产业(集团)有限公司 Method and device for realizing multi-level and multi-class Chinese text classification
WO2022057786A1 (en) * 2020-09-15 2022-03-24 智慧芽(中国)科技有限公司 Multi-type text-based automatic classification method and apparatus, device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978328A (en) * 2014-04-03 2015-10-14 北京奇虎科技有限公司 Hierarchical classifier obtaining method, text classification method, hierarchical classifier obtaining device and text classification device
CN106909654A (en) * 2017-02-24 2017-06-30 北京时间股份有限公司 A kind of multiclass classification system and method based on newsletter archive information
CN107169049A (en) * 2017-04-25 2017-09-15 腾讯科技(深圳)有限公司 The label information generation method and device of application

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
赵其鲁, 李宗民: "基于深度多任务学习的层次分类", 《计算机辅助设计与图形学学报》 *
郭利敏: "基于卷积神经网络的文献自动分类研究", 《图书与情报》 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination