CN111159404B - Text classification method and device - Google Patents

Text classification method and device

Info

Publication number
CN111159404B
CN111159404B
Authority
CN
China
Prior art keywords
feature
text
weight
specified
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911383131.1A
Other languages
Chinese (zh)
Other versions
CN111159404A (en)
Inventor
韩俊明
马志芳
张文剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haier Uplus Intelligent Technology Beijing Co Ltd
Original Assignee
Haier Uplus Intelligent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haier Uplus Intelligent Technology Beijing Co Ltd filed Critical Haier Uplus Intelligent Technology Beijing Co Ltd
Priority to CN201911383131.1A priority Critical patent/CN111159404B/en
Publication of CN111159404A publication Critical patent/CN111159404A/en
Application granted granted Critical
Publication of CN111159404B publication Critical patent/CN111159404B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The application provides a text classification method and device. The method includes: extracting features of a text, wherein the features include text features and part-of-speech features; determining feature weights of the features according to the features; and analyzing the feature weights using a first model to determine the text category of the text, wherein the first model is obtained by training a deep neural network with multiple groups of data, and each group of data includes a plurality of sample texts and the text categories of the plurality of sample texts. The application solves the problem of low classification accuracy in the field of short text classification and achieves the effect of improving classification accuracy.

Description

Text classification method and device
Technical Field
The application relates to the field of communication, in particular to a text classification method and device.
Background
In text classification, it is generally necessary to convert text into numerical data before feeding it to a model or algorithm. tf_idf (term frequency-inverse document frequency) is a common weighting technique in data mining. It is a statistical method for evaluating the importance of a word to a document in a document set or corpus. tf_idf is the product of tf and idf: tf is the term frequency, which describes the word's ability to reflect the content of a document and may refer to the frequency with which the given word appears in that document; idf is the inverse document frequency, which measures the word's ability to distinguish between different documents and may be computed as the logarithm of the total number of documents divided by the number of documents containing the word. A tf_idf value computed this way favors words that appear frequently within a text but appear in few documents overall.
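Illustratively, a minimal Python sketch of this tf_idf computation follows; the toy corpus and tokenization are assumptions for demonstration only, not data from the embodiment:

```python
import math

# Toy corpus of pre-tokenized documents (an illustrative assumption).
docs = [
    ["turn", "on", "the", "air", "conditioner"],
    ["air", "conditioner", "temperature", "too", "high"],
    ["what", "is", "the", "weather", "today"],
]

def tf(word, doc):
    # Term frequency: occurrences of the word in the document,
    # normalized by the document length.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Inverse document frequency: log of (total documents /
    # documents containing the word).
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf_idf("air", docs[0], docs))      # in 2 of 3 documents -> lower idf
print(tf_idf("weather", docs[2], docs))  # in 1 of 3 documents -> higher idf
```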
However, in some scenarios, such as short text classification, the texts are short and the features limited: the frequency of a word within a short text is generally 1, so the weighting described above cannot correctly reflect each word's contribution to classification, leading to large deviations in the classification results and low accuracy. Taking intelligent voice text classification as an example, "air conditioner" is a noun that generally belongs to the home appliance category and appears in many documents, whereas a word such as "general" is an adjective that tends to appear uniformly across the classes or in few documents. In short texts where almost every term frequency is 1, the weight of "air conditioner" is therefore necessarily smaller than that of "general", even though, as a domain term, "air conditioner" is the more important word for the home appliance category. Classification methods in the related art thus struggle to reflect the outstanding contribution of "air conditioner" to the home appliance category.
For the problem of low classification accuracy in the field of short text classification in the related art, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the application provides a text classification method and device, which at least solve the problem of low classification accuracy in the short text classification field in the related technology.
According to an embodiment of the present application, there is provided a text classification method including: extracting characteristics of the text, wherein the characteristics comprise text characteristics and part-of-speech characteristics; determining feature weights of the features according to the features; analyzing the feature weight by using a first model to determine a text category of the text, wherein the first model is obtained by training a deep neural network by using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: a plurality of sample texts, and a text category of the plurality of sample texts.
According to another embodiment of the present application, there is provided a text classification apparatus including:
the extraction module is used for extracting the characteristics of the text, wherein the characteristics comprise text characteristics and part-of-speech characteristics;
the determining module is used for determining the feature weight of the feature according to the feature;
the analysis module is used for analyzing the feature weight by using a first model and determining the text category of the text, wherein the first model is obtained by training a deep neural network by using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: a plurality of sample texts, and a text category of the plurality of sample texts.
According to a further embodiment of the application, there is also provided a computer-readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the application, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the application, features of a text are extracted, wherein the features include text features and part-of-speech features; feature weights of the features are determined according to the features; and the feature weights are analyzed using a first model to determine the text category of the text, wherein the first model is obtained by training a deep neural network with multiple groups of data, and each group of data includes a plurality of sample texts and the text categories of the plurality of sample texts. The problem of low classification accuracy in the field of short text classification can thus be solved, achieving the effect of improving classification accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
fig. 1 is a block diagram of the hardware configuration of a computing device for a text classification method according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of classifying text according to an embodiment of the application;
FIG. 3 shows the result of determining feature weights from tf_idf alone;
FIG. 4 shows the result of determining feature weights from tf_idf_var according to an alternative embodiment of the present application;
FIG. 5 is a flow chart of an svm classifier training process in accordance with an alternative embodiment of the present application;
fig. 6 is a block diagram of a text classification apparatus according to an embodiment of the present application.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Example 1
The method according to the first embodiment of the present application may be executed on a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a computing device as an example, fig. 1 is a block diagram of the hardware configuration of a computing device for a text classification method according to an embodiment of the present application. As shown in fig. 1, the computing device 10 may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a microprocessor (MCU), a programmable logic device (FPGA), or the like) and a memory 104 for storing data, and optionally a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will appreciate that the configuration shown in fig. 1 is merely illustrative and does not limit the configuration of the computing device; for example, the computing device 10 may include more or fewer components than shown in fig. 1, or have a different configuration.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a text classification method in an embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, to implement the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the computing device 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means 106 is arranged to receive or transmit data via a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computing device 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
In this embodiment, a text classification method running on the above computing device is provided. Fig. 2 is a flowchart of a text classification method according to an embodiment of the present application; as shown in fig. 2, the flow includes the following steps:
step S202, extracting characteristics of the text, wherein the characteristics comprise text characteristics and part-of-speech characteristics;
step S204, determining feature weights of the features according to the features;
step S206, analyzing the feature weights by using a first model, and determining a text category of the text, where the first model is obtained by training a deep neural network by using multiple sets of data, and each set of data in the multiple sets of data includes: a plurality of sample texts, and a text category of the plurality of sample texts.
Through the above steps, since the extracted features include both text features and part-of-speech features, feature weights of the features are determined and then analyzed by the model to determine the text category of the text, which solves the problem of low classification accuracy in the short text classification field and achieves the effect of improving classification accuracy.
The following is further described, by way of example, in connection with specific scenarios:
Classification problems of short text are typically involved in the field of intelligent question-answering, where they serve to judge the user's intent. The role of feature engineering is to extract useful data from the short text and compute corresponding numerical values for the model to use. Optionally, after feature extraction is performed on the cleaned and sampled short text data set through word segmentation and part-of-speech tagging, n-grams, and other methods, feature selection may be performed on the short text, for example using chi-square testing, information gain, mutual information, and other methods. The feature weight may be calculated by tf_idf, the product of tf and idf, where tf refers to the frequency with which a given word appears in the document, and idf refers to the logarithm of the total number of documents divided by the number of documents containing the word. The feature vector of a short text is obtained by calculating the tf_idf value of each feature in the short text, so that the short text can be fed to a model in vector form and participate in training to obtain a classifier. Optionally, the models and algorithms used for classification are mainly svm, naive Bayes, decision trees, and the like.
In the overall intelligent question-answering system, the user's question is received, segmented into words, and tagged with parts of speech to obtain the original features; the tf_idf value and var value of the question text are calculated from the tf values computed over the existing training samples to obtain the final features; the resulting numerical features are input to the svm, which calculates the probability that the user input belongs to each category, and the category with the highest probability is selected and output as the final result. The core idea of tf_idf is that words occurring frequently within the same text but rarely across different texts should be given a higher weight: the frequency of a word in a text (tf) describes the word's ability to reflect the document's content, the inverse document frequency (idf) measures its ability to distinguish between different documents, and tf_idf is the product of the two. A tf_idf value computed this way favors words that appear frequently in a text but appear in few documents overall. This embodiment can be applied to svm classifiers: the W parameter of an svm (support vector machine) is computed from the support vectors (selected training data), and the computation weights complete support vectors rather than individual features, so if the feature-value distribution of the support vectors in the training data is unreasonable, the performance of the svm classification model suffers. Words such as "air conditioner" should be given greater weight during classification to correctly reflect the contribution of different words to classification. Optionally, this embodiment adds part-of-speech feature extraction, thereby solving the problem of low classification accuracy in the field of short text classification and achieving the effect of improving classification accuracy.
Optionally, performing category recognition on the text by using the trained model comprises the following steps:
S1: acquiring data to obtain the text to be classified;
S2: extracting text features and part-of-speech features of the text to be classified. Optionally, two or more language models are applied to the segmented content and the parts of speech of the text, respectively, to obtain its text features and part-of-speech features. For example, the text is segmented into words and tagged with parts of speech, then features are extracted with a unigram language model and a bigram language model, so that one word or two consecutive words form one feature. Take the sentence "I am Chinese" (我是中国人): the segmentation result is I (r), am (v), China (ns), people (n). The extracted word unigram feature set is: I, am, China, people. The extracted word bigram feature set is: "I am", "am China", "China people". The extracted part-of-speech unigram set is: r, v, ns, n. The extracted part-of-speech bigram set is: rv, vns, nsn. The sets are then fused into the final feature set: I, am, China, people, "I am", "am China", "China people", r, v, ns, n, rv, vns, nsn, which serves as the overall features (a sketch of this fusion follows these steps);
S3: optionally, before determining the feature weight of the feature from the extracted features, the features are screened according to their text-category characterization capability to obtain screened features whose characterization capability is higher than a preset first threshold. For example, the chi-square feature test can be used, where the deviation between actual and theoretical values is measured by the chi-square statistic

χ² = Σ_i (X_i - E)² / E

wherein the X_i represent the actual values and E represents the theoretical value, i.e., the expectation calculated over all X. For each feature x_1, x_2, …, x_n, the chi-square value of the feature with respect to the category it belongs to is calculated; the larger the chi-square value, the better the feature characterizes the category. Features with better characterization capability can be selected according to a preset threshold, removing features relatively independent of the categories and reducing noise features during classification (see the chi-square sketch after these steps);
S4: since texts in different fields can be seriously unbalanced, the weights of the same word in different fields are completely different. For each feature obtained after the screening above, the feature weight of the feature is determined from the extracted features;
S5: inputting the obtained feature weights into the model obtained by training (corresponding to the first model in this embodiment) for analysis;
S6: calculating the probability that the user input belongs to each category, and outputting the category with the highest probability as the final result.
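Illustratively, the feature fusion of step S2 can be sketched in Python as follows; the segmented (word, part-of-speech) pairs are assumed to come from an upstream tokenizer and tagger, and are hard-coded here for demonstration:

```python
# Fuse word unigrams/bigrams with part-of-speech unigrams/bigrams.
pairs = [("I", "r"), ("am", "v"), ("China", "ns"), ("people", "n")]

words = [w for w, _ in pairs]
tags = [t for _, t in pairs]

def ngrams(tokens, n, sep=""):
    # Contiguous n-grams, each joined into a single feature string.
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

features = (ngrams(words, 1) + ngrams(words, 2, sep=" ")
            + ngrams(tags, 1) + ngrams(tags, 2))
print(features)
# ['I', 'am', 'China', 'people', 'I am', 'am China', 'China people',
#  'r', 'v', 'ns', 'n', 'rv', 'vns', 'nsn']
```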
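Illustratively, the chi-square screening of step S3 can be sketched with scikit-learn; the feature matrix, labels, and threshold value below are toy assumptions for demonstration:

```python
import numpy as np
from sklearn.feature_selection import chi2

X = np.array([[1, 0, 1],    # rows: texts, columns: candidate features
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]])
y = np.array([0, 0, 1, 1])  # text category labels

scores, p_values = chi2(X, y)
threshold = 1.0             # stand-in for the preset first threshold
kept = [i for i, s in enumerate(scores) if s > threshold]
print(scores, kept)  # a larger score means the feature better characterizes its class
```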
Optionally, determining the feature weight of the feature according to the feature includes: determining a first feature weight and a second feature weight of the specified feature, wherein the first feature weight is used for representing the importance degree of the feature to the specified text in the plurality of sample texts, and the second feature weight is used for representing the importance degree of the specified feature to the specified text class; feature weights for the specified features are determined based on the first feature weights and the second feature weights.
The present embodiment is not limited to the order of determining the first feature weight of the specified feature and determining the second feature weight of the specified feature, and may be any order. The specified feature in the present embodiment may be each of all or part of the extracted features. The above specified feature may be each feature of the features obtained after the screening.
Optionally, the feature weight is determined as follows: a var value (corresponding to the second feature weight in the above embodiment) may be calculated based on the text imbalance and combined with the original tf_idf feature (corresponding to the first feature weight in the above embodiment) to obtain the feature weight, thereby correcting features affected by the text imbalance;
Illustratively, in the present embodiment the feature weight may be calculated by adding var (equivalent to the second feature weight in the above embodiment) on the basis of tf_idf (equivalent to the first feature weight in the above embodiment), so as to emphasize the differences of the extracted features across categories and increase the weight of category keywords, thereby improving the accuracy of text classification.
Optionally, determining the second feature weight of the specified feature (e.g., var_i below) comprises: determining the second feature weight of the specified feature according to the total number of text categories of the plurality of sample texts, the number of sample texts that contain the specified feature and belong to a specified text category, and the number of sample texts belonging to the specified text category among the plurality of sample texts. For example, the first feature weight tf_idf_{i,j} is:

tf_idf_{i,j} = tf_{i,j} * idf_i, with tf_{i,j} = n_{i,j} / Σ_k n_{k,j} and idf_i = log( |D| / |{d ∈ D : t_i ∈ d}| )

wherein:

n_{i,j} represents the number of times word i appears in document j;

Σ_k n_{k,j} represents the total number of occurrences of all valid words in document j;

|D| represents the total number of documents;

|{d ∈ D : t_i ∈ d}| represents the total number of documents containing word i.

The second feature weight var_i is:

var_i = S(x_j, j = 1, 2, …, n_class), with x_j = |{d_j ∈ D : t_i ∈ d_j}| / |{d_j ∈ D}|

wherein:

|{d_j ∈ D : t_i ∈ d_j}| represents the number of documents belonging to class j in which word i appears;

|{d_j ∈ D}| represents the total number of class-j documents;

S(x_j, j = 1, 2, …, n) represents the variance of x_1, x_2, …, x_n;

n_class represents the total number of document classes.
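Illustratively, var_i as defined above can be sketched in Python as follows; the labeled toy corpus is an assumption for demonstration:

```python
import numpy as np

docs = [
    ("please turn on the air conditioner", "appliance"),
    ("please set an alarm clock", "planning"),
    ("please check the weather", "weather"),
]

def var(word, docs):
    # x_j: fraction of class-j documents containing the word;
    # var_i is the variance of the x_j over all classes.
    classes = sorted({c for _, c in docs})
    ratios = []
    for c in classes:
        class_docs = [text for text, cc in docs if cc == c]
        containing = sum(1 for text in class_docs if word in text.split())
        ratios.append(containing / len(class_docs))
    return float(np.var(ratios))

print(var("air", docs))     # concentrated in one class -> larger var
print(var("please", docs))  # evenly distributed across classes -> var = 0.0
```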
Optionally, determining the feature weight of the specified feature according to the first feature weight and the second feature weight further comprises: normalizing the second feature weight to obtain a processed second feature weight; and determining the feature weight of the specified feature according to the first feature weight and the processed second feature weight. It should be noted that normalization makes the data distribution denser and more reasonable. For example, the feature weight may also be calculated as:

tf_idf_var_{i,j} = tf_idf_{i,j} * f(var_i)

wherein tf_idf_{i,j} corresponds to the first feature weight in the above embodiment and f(var_i) corresponds to the processed second feature weight; f is a function that adjusts the distribution of each word's var value to be more reasonable and depends on the particular scenario.
Illustratively, in the intelligent voice text classification problem, observation and experiment lead to the conclusion that an appropriately chosen f makes the data distribution denser and more reasonable.
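As a demonstration, the combination tf_idf_var_{i,j} = tf_idf_{i,j} * f(var_i) can be sketched in Python as follows; since the specific f adopted in the embodiment is not reproduced here, the f below (a min-max rescaling of var into the interval [1, 2]) is purely a placeholder assumption used to show the computation flow:

```python
import numpy as np

def f(var_values):
    # Placeholder normalization (an assumption): rescale var into [1, 2]
    # so that the multiplication boosts unevenly distributed words
    # without zeroing out evenly distributed ones.
    v = np.asarray(var_values, dtype=float)
    if v.max() == v.min():
        return np.ones_like(v)
    return 1.0 + (v - v.min()) / (v.max() - v.min())

tf_idf_vals = np.array([0.12, 0.05, 0.30])  # first feature weights (toy values)
var_vals = np.array([0.22, 0.00, 0.05])     # second feature weights (toy values)

tf_idf_var = tf_idf_vals * f(var_vals)
print(tf_idf_var)  # words with uneven class distribution are boosted the most
```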
the characteristic weights are obtained by using tf_idf and var, so that the weights of words important for classification are increased. var may represent a measure of word imbalance across each category. The more unevenly the words are distributed in each category, the larger the var value, and the greater the contribution of the words to the category, thereby achieving the purpose of increasing the weight of the category keywords. Therefore, the defect that the classification is carried out by calculating the weight by only tf_idf is overcome, and words with higher document frequency in a certain part of specific categories are also obtained with higher weight (such as 'air conditioner'), so that the characteristic engineering part of the whole classifier is optimized, and the model accuracy is greatly improved.
It should be noted that feature weights calculated from tf_idf alone cannot correctly reflect the contribution of words to classification. After var is added, words distributed unevenly among the classes obtain larger weights, and because of the added part-of-speech features the weights of important words are strengthened, while words distributed evenly or appearing in few documents have their weights relatively reduced. For example, in the intelligent voice classification problem the word "skin-decocting" appears only twice in its category, yet the tf_idf calculation gives it almost the largest weight; after var is included in the weight calculation, "alarm clock" obtains the largest weight instead. The word "alarm clock" appears thousands of times in the training set, only in the planning category; such words clearly have a critical effect on classification and should be given a higher weight. The embodiment of the application adds part-of-speech features to short text classification and, optionally, also uses tf_idf and var to calculate feature weights, taking into account both the influence of part of speech on classification and the imbalance of each word's distribution among the classes, which markedly improves the model's classification effect.
The accompanying drawings show the word weights, sorted from largest to smallest, under the two calculation schemes (since the texts are short, the comparison can be performed on the assumption that each word occurs once per document). Fig. 3 shows the result of determining feature weights from tf_idf alone; fig. 4 shows the result of determining feature weights from tf_idf_var according to the embodiment of the present application. In fig. 3, the three black dots from left to right represent the feature words "early morning", "alarm clock", and "air conditioner", which play an important role in intelligent voice classification but whose tf_idf weights are too low. After the var term is added, fig. 4 is obtained, in which the three black dots from left to right represent "alarm clock", "early morning", and "air conditioner"; these feature words are given high weights, correctly reflecting the contribution of features that are critical to classification. After the method of the embodiment of the application is applied in the field of intelligent question-answering text classification, the classification effect is markedly improved, the problem of low initial weights for important feature words is solved, and with reasonable feature values, model training converges faster and the classifier is more accurate.
Optionally, before analyzing the feature weights using the first model to determine the text category of the text, the method further comprises: training the first model using the multiple groups of data; and stopping the training of the first model when the analysis accuracy of the first model is higher than a preset second threshold.
Optionally, each of the plurality of sets of data used for training the first model includes: sample feature weights of a plurality of sample texts and text categories corresponding to the sample feature weights, wherein the sample feature weights are determined according to sample features of the sample texts, and the sample features of the sample texts comprise sample text features and sample part-of-speech features.
Illustratively, the process of training the first model described above is as follows:
in an alternative embodiment, building a complete classifier may include a feature engineering portion and an algorithm model portion. The embodiment of the application is further described by taking a svm text classification scheme applied to a voice intelligent question-answering system as an example, and fig. 5 is a flowchart of a svm classifier training process according to an alternative embodiment of the application, as shown in fig. 5:
in an alternative implementation manner, the whole classifier is generally as follows, where the feature extraction module and the feature calculation module are important parts in the embodiment of the present application:
S1: data acquisition, cleaning, and sampling to obtain a training set, a validation set, and a test set;
S2: segmenting the text into words and tagging parts of speech, then extracting features with a unigram language model and a bigram language model, so that one word or two consecutive words form one feature. Take the sentence "I am Chinese" (我是中国人): the segmentation result is I (r), am (v), China (ns), people (n). The extracted word unigram feature set is: I, am, China, people. The extracted word bigram feature set is: "I am", "am China", "China people". The extracted part-of-speech unigram set is: r, v, ns, n. The extracted part-of-speech bigram set is: rv, vns, nsn. The sets are then fused into the final feature set: I, am, China, people, "I am", "am China", "China people", r, v, ns, n, rv, vns, nsn, which serves as the overall features;
S3: using the chi-square feature test, the deviation between actual and theoretical values is measured by the chi-square statistic

χ² = Σ_i (X_i - E)² / E

wherein the X_i represent the actual values and E represents the theoretical value, i.e., the expectation calculated over all X. For each feature x_1, x_2, …, x_n, the chi-square value of the feature with respect to the category it belongs to is calculated; the larger the chi-square value, the better the feature characterizes the category. Features with better characterization capability can be selected according to a preset threshold, removing features relatively independent of the categories and reducing noise features during classification;
S4: in the intelligent question-answering scenario, home appliance control is an independent vertical field with many functions, while other fields such as weather, news, and recipes have relatively little data; the dominant home appliance field has a large corpus, causing serious text imbalance, and the weights of the same word differ completely across fields. The var value is calculated from this text imbalance and multiplied pointwise with the original tf_idf features, thereby correcting features affected by the imbalance;
S5: training with the text-vectorized data as input features of the svm classifier, searching for optimal parameter values within a specified threshold range through 5-fold cross-validation, and determining the kernel function, penalty coefficient, and other parameter values that give the svm algorithm the highest accuracy (see the sketch after these steps);
S6: obtaining the model and completing classification, yielding a model file.
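Illustratively, the 5-fold cross-validated parameter search of step S5 can be sketched with scikit-learn; the training vectors, labels, and parameter grid are toy assumptions for demonstration:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy tf_idf_var feature vectors and labels; five samples per class so
# that 5-fold stratified cross-validation splits are possible.
X_train = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1], [0.9, 0.3], [0.8, 0.0],
           [0.1, 0.9], [0.2, 0.8], [0.1, 0.7], [0.3, 0.9], [0.0, 0.8]]
y_train = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Search the kernel function and penalty coefficient C by 5-fold cross-validation.
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(probability=True), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```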
Optionally, analyzing the feature weights using the first model to determine the text category of the text includes: analyzing the feature weights with the first model to obtain, for each specified text category, the probability that the text corresponding to the feature weights belongs to that category; and determining the text category of the text according to the probabilities. For example, the category with the highest probability may be determined as the category to which the text belongs.
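Illustratively, this decision rule can be sketched as follows, reusing the fitted `search` object from the training sketch above; the query vector is an assumption for demonstration:

```python
import numpy as np

# Per-class probabilities for one query's feature-weight vector.
probs = search.predict_proba([[0.85, 0.15]])[0]
best = int(np.argmax(probs))
print(probs, "-> class", best)  # the highest-probability category is output
```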
This embodiment also provides a text classification device, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 6 is a block diagram of a text classification apparatus according to an embodiment of the present application, as shown in fig. 6, including:
an extracting module 31, configured to extract features of the text, where the features include text features and part-of-speech features;
a determining module 33, configured to determine a feature weight of the feature according to the feature;
an analysis module 35, configured to analyze the feature weights by using a first model, and determine a text category of the text, where the first model is obtained by training a deep neural network by using multiple sets of data, and each set of data in the multiple sets of data includes: a plurality of sample texts, and a text category of the plurality of sample texts.
Through the above modules, since the extracted features include both text features and part-of-speech features, feature weights of the features are determined and then analyzed by the model to determine the text category of the text, which solves the problem of low classification accuracy in the short text classification field and achieves the effect of improving classification accuracy.
Optionally, each set of data in the plurality of sets of data includes: sample feature weights of a plurality of sample texts and text categories corresponding to the sample feature weights, wherein the sample feature weights are determined according to sample features of the sample texts, and the sample features of the sample texts comprise sample text features and sample part-of-speech features.
Optionally, the determining module includes: a first determining unit, configured to determine a first feature weight and a second feature weight of a specified feature, where the first feature weight is used to characterize a degree of importance of the feature to a specified text in the plurality of sample texts, and the second feature weight is used to characterize a degree of importance of the specified feature to a specified text category; and a second determining unit configured to determine the feature weight of the specified feature according to the first feature weight and the second feature weight. The present embodiment is not limited to the order of determining the first feature weight of the specified feature and determining the second feature weight of the specified feature, and may be any order. The specified feature in the present embodiment may be each of all or part of the extracted features.
Optionally, the first determining unit includes: a first determining subunit, configured to determine, according to a total number of text categories of the plurality of sample texts, and a number of sample texts belonging to a specified text category in a specified sample text including the specified feature in the plurality of sample texts, and a number of sample texts belonging to the specified text category in the plurality of sample texts, the second feature weight of the specified feature.
Optionally, the second determining unit further includes: the processing unit is used for carrying out normalization processing on the second characteristic weight to obtain the processed second characteristic weight; and a second determining subunit, configured to determine the feature weight of the specified feature according to the first feature weight and the processed second feature weight. It should be noted that, performing the normalization process may make the data distribution denser and reasonable.
Optionally, the embodiment of the present application further includes a screening unit, configured to screen the feature according to a text category characterizing capability of the feature before determining a feature weight of the feature according to the feature, so as to obtain a screened feature, where the text category characterizing capability of the feature after screening is higher than a preset first threshold. The above specified feature may be each feature of the features obtained after the screening.
Optionally, the extraction module includes: and the extraction unit is used for respectively using more than two language models for word segmentation on the content and the part of speech of the text to obtain the characteristics of the text.
Optionally, the embodiment of the present application further includes a training module, configured to train the first model using the multiple groups of data before the feature weights are analyzed using the first model to determine the text category of the text; and to stop training the first model when the analysis accuracy of the first model is higher than a preset second threshold.
Optionally, the analysis module of the embodiment of the present application includes: the analysis unit is used for analyzing the characteristic weights by using a first model to respectively obtain the probability that the text corresponding to the characteristic weights belongs to a specified text category; and the third determining unit is used for determining the text category of the text according to the probability. For example, the text category with the highest probability may be determined as the text category to which the text belongs.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; alternatively, the above modules may be located in different processors in any combination.
From the description of the above embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, though in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and including instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods of the embodiments of the present application.
Example 2
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for executing the steps of:
s1, extracting characteristics of the text, wherein the characteristics comprise text characteristics and part-of-speech characteristics;
s2, determining feature weights of the features according to the features;
s3, analyzing the feature weight by using a first model, and determining the text category of the text, wherein the first model is obtained by training a deep neural network by using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: a plurality of sample texts, and a text category of the plurality of sample texts.
Through the above steps, since the extracted features include both text features and part-of-speech features, feature weights of the features are determined and then analyzed by the model to determine the text category of the text, which solves the problem of low classification accuracy in the short text classification field and achieves the effect of improving classification accuracy.
Alternatively, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementations, which are not repeated here.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a usb disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing a computer program.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, extracting characteristics of the text, wherein the characteristics comprise text characteristics and part-of-speech characteristics;
s2, determining feature weights of the features according to the features;
s3, analyzing the feature weight by using a first model, and determining the text category of the text, wherein the first model is obtained by training a deep neural network by using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: a plurality of sample texts, and a text category of the plurality of sample texts.
Through the above steps, since the extracted features include both text features and part-of-speech features, feature weights of the features are determined and then analyzed by the model to determine the text category of the text, which solves the problem of low classification accuracy in the short text classification field and achieves the effect of improving classification accuracy.
Alternatively, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementations, which are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices; optionally, they may be implemented by program code executable by computing devices, so that they may be stored in a storage device and executed by the computing devices, and in some cases the steps shown or described may be performed in a different order than here; alternatively, they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for classifying text, comprising:
extracting characteristics of the text, wherein the characteristics comprise text characteristics and part-of-speech characteristics;
determining feature weights of the features according to the features;
analyzing the feature weight by using a first model to determine a text category of the text, wherein the first model is obtained by training a deep neural network by using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: a plurality of sample texts, and text categories of the plurality of sample texts;
wherein determining feature weights for the features from the features comprises: determining a first feature weight and a second feature weight of a specified feature, wherein the first feature weight is used for representing the importance degree of the feature to a specified text in the plurality of sample texts, and the second feature weight is used for representing the importance degree of the specified feature to a specified text category; determining the feature weights of the specified features according to the first feature weights and the second feature weights;
calculating the feature weight tf_idf_var_{i,j} from the first feature weight and the second feature weight by the following formula:

tf_idf_var_{i,j} = tf_idf_{i,j} * f(var_i)

wherein tf_idf_{i,j} is the first feature weight, var_i is the second feature weight, and f() is a function for adjusting the distribution of the var values; the second feature weight is

var_i = S(x_j, j = 1, 2, …, n_class), with x_j = |{d_j ∈ D : t_i ∈ d_j}| / |{d_j ∈ D}|

wherein |{d_j ∈ D : t_i ∈ d_j}| represents the number of documents belonging to class j in which word i appears; |{d_j ∈ D}| represents the total number of class-j documents; S(x_j, j = 1, 2, …, n) represents the variance of x_1, x_2, …, x_n; and n_class represents the total number of document classes.
2. The method of claim 1, wherein determining the second feature weight for the specified feature comprises:
determining the second feature weight of the specified feature according to the total number of text categories of the plurality of sample texts, the number of sample texts belonging to a specified text category in the specified sample texts containing the specified feature, and the number of sample texts belonging to the specified text category in the plurality of sample texts.
3. The method of claim 1, wherein determining the feature weights for the specified features from the first feature weights and second feature weights further comprises:
normalizing the second characteristic weight to obtain the processed second characteristic weight;
and determining the feature weight of the designated feature according to the first feature weight and the processed second feature weight.
4. The method according to any one of claims 1 to 2, wherein prior to determining the feature weight of the feature from the feature, the method further comprises:
and screening the features according to the text category characterization capability of the features to obtain the screened features, wherein the text category characterization capability of the features after screening is higher than a preset first threshold.
5. The method of claim 1, wherein extracting features of the text comprises:
and respectively using more than two language models for word segmentation on the content and the part of speech of the text to obtain the characteristics of the text.
6. The method of claim 1, wherein prior to analyzing the feature weights using the first model to determine the text category of the text, the method further comprises:
training the first model using the plurality of groups of data;
and stopping training the first model under the condition that the analysis accuracy of the first model is higher than a preset second threshold value.
7. The method of claim 1, wherein analyzing the feature weights using a first model to determine a text category of the text comprises:
analyzing the feature weights by using a first model to respectively obtain the probability that the text corresponding to the feature weights belongs to a specified text category;
and determining the text category of the text according to the probability.
8. A text classification apparatus, comprising:
the extraction module is used for extracting the characteristics of the text, wherein the characteristics comprise text characteristics and part-of-speech characteristics;
the determining module is used for determining the feature weight of the feature according to the feature;
the analysis module is used for analyzing the feature weight by using a first model and determining the text category of the text, wherein the first model is obtained by training a deep neural network by using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: a plurality of sample texts, and text categories of the plurality of sample texts;
wherein, the determining module includes: a first determining unit, configured to determine a first feature weight and a second feature weight of a specified feature, where the first feature weight is used to characterize a degree of importance of the feature to a specified text in the plurality of sample texts, and the second feature weight is used to characterize a degree of importance of the specified feature to a specified text category; a second determining unit configured to determine the feature weight of the specified feature according to the first feature weight and a second feature weight, the second feature weight being calculated based on text imbalance;
the first special is calculated by the following formulaCalculating the feature weight and the second feature weight to obtain the feature weight tf_idf_var i,j
tf_idf_var i,j =tf_idf i,j *f(var i )
Wherein tf_idf i,j For the first feature weight, var i For the second feature weight, f () is the objective functionThe second characteristic weight is-> |{d j ∈D:t i ∈d j The number of documents belonging to class j in which word i appears; i { d } j E, D } | represents the total number of j types of documents; s (x) j J=1, 2, …, n) represents x 1 ,x 2 ,…,x n Is a variance of (2); n_class represents the total document classification number.
9. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, wherein the computer program is arranged to execute the method of any of the claims 1 to 7 when run.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method of any of the claims 1 to 7.
CN201911383131.1A 2019-12-27 2019-12-27 Text classification method and device Active CN111159404B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911383131.1A CN111159404B (en) 2019-12-27 2019-12-27 Text classification method and device


Publications (2)

Publication Number Publication Date
CN111159404A CN111159404A (en) 2020-05-15
CN111159404B true CN111159404B (en) 2023-09-19

Family

ID=70558818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911383131.1A Active CN111159404B (en) 2019-12-27 2019-12-27 Text classification method and device

Country Status (1)

Country Link
CN (1) CN111159404B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539653A (en) * 2020-05-27 2020-08-14 山西东易园智能家居科技有限公司 Intelligent filling construction progress management method
CN111797229A (en) * 2020-06-10 2020-10-20 南京擎盾信息科技有限公司 Text representation method and device and text classification method
CN111859862B (en) * 2020-07-22 2024-03-22 海尔优家智能科技(北京)有限公司 Text data labeling method and device, storage medium and electronic device
CN113704398A (en) * 2021-08-05 2021-11-26 上海万物新生环保科技集团有限公司 Keyword extraction method and device
CN115883912B (en) * 2023-03-08 2023-05-16 山东水浒文化传媒有限公司 Interaction method and system for internet communication demonstration

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104991891A (en) * 2015-07-28 2015-10-21 北京大学 Short text feature extraction method
CN106709370A (en) * 2016-12-31 2017-05-24 北京明朝万达科技股份有限公司 Long word identification method and system based on text contents
CN107368611A (en) * 2017-08-11 2017-11-21 同济大学 A kind of short text classification method
CN108959453A (en) * 2018-06-14 2018-12-07 中南民族大学 Information extracting method, device and readable storage medium storing program for executing based on text cluster
CN110019792A (en) * 2017-10-30 2019-07-16 阿里巴巴集团控股有限公司 File classification method and device and sorter model training method
CN110069627A (en) * 2017-11-20 2019-07-30 中国移动通信集团上海有限公司 Classification method, device, electronic equipment and the storage medium of short text
CN110413956A (en) * 2018-04-28 2019-11-05 南京云问网络技术有限公司 A kind of Text similarity computing method based on bootstrapping


Also Published As

Publication number Publication date
CN111159404A (en) 2020-05-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant