CN111177371A - Classification method and related device - Google Patents

Classification method and related device

Publication number
CN111177371A
CN111177371A (application number CN201911235058.3A; granted publication CN111177371B)
Authority
CN
China
Prior art keywords
classification
sequence
category
words
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911235058.3A
Other languages
Chinese (zh)
Other versions
CN111177371B (en)
Inventor
刘志煌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911235058.3A priority Critical patent/CN111177371B/en
Publication of CN111177371A publication Critical patent/CN111177371A/en
Application granted granted Critical
Publication of CN111177371B publication Critical patent/CN111177371B/en
Legal status: Active

Classifications

    • G06F16/35: Information retrieval; database structures therefor; unstructured textual data; Clustering, Classification (G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F16/00 > G06F16/30)
    • G06F16/3344: Querying; Query processing; Query execution using natural language analysis (G06F16/33 > G06F16/3331 > G06F16/334)

Abstract

The embodiments of this application disclose a classification method. Before a corpus to be classified is classified, if the corpus contains several first-category feature words (which represent the objects to be classified) and second-category feature words (which relate to the classification requirement), a category feature word sequence embodying the classification association relation is determined from the part-of-speech sequence corresponding to the corpus; the classification association relation expresses which first-category feature words each second-category feature word is associated with. The classification feature vector constructed from this sequence therefore carries the association information between category feature words. From this information the classification model can directly determine which object to be classified each category feature word relates to and classify each object accordingly, which reduces both the requirements on the classification model and the difficulty of training it. Moreover, because the category feature word sequence is ordered and that order follows the rules of language expression, the association relation it yields is accurate.

Description

Classification method and related device
Technical Field
The present application relates to the field of data processing, and in particular, to a classification method and related apparatus.
Background
Classification based on text corpora is an important information-processing technology. As user demands grow, chapter-level or sentence-level classification is no longer sufficient; classifying the multiple objects contained within a single text corpus has become an urgent requirement in application scenarios such as sentiment analysis and spam SMS classification, across fields including e-commerce platforms, news recommendation, and social platforms.
In some related technologies, attribute words representing the objects to be classified are first identified in the text corpus and then input into a neural network model with an attention mechanism, which classifies the different objects according to its learned attention.
However, this approach is prone to classification errors. Improving its accuracy requires a more capable neural network model, which in turn makes the model harder to train.
Disclosure of Invention
To solve this technical problem, this application provides a classification method that reduces the requirements on the classification model and the difficulty of training it. In addition, the category feature word sequence is ordered, and that order follows the rules of language expression, which ensures that the classification association information obtained is accurate and that subsequent classification can be performed correctly.
The embodiment of the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application provides a classification method, where the method includes:
determining, according to a part-of-speech sequence corresponding to a corpus to be classified, a category feature word sequence embodying a classification association relation, wherein the corpus to be classified comprises a plurality of first-category feature words representing objects to be classified and second-category feature words related to a classification requirement, and the classification association relation embodies the association between the second-category feature words and the first-category feature words;
constructing a classification feature vector according to the category feature word sequence, wherein the classification feature vector embodies the characteristics of corpora of different categories;
and classifying the corpus to be classified through a classification model according to the classification feature vector, wherein the classification model is a non-deep-learning model.
In a second aspect, an embodiment of the present application provides a classification apparatus, where the apparatus includes a determining unit, a constructing unit, and a classifying unit:
the determining unit is configured to determine, according to a part-of-speech sequence corresponding to a corpus to be classified, a category feature word sequence embodying a classification association relation, wherein the corpus to be classified comprises a plurality of first-category feature words representing objects to be classified and second-category feature words related to a classification requirement, and the classification association relation embodies the association between the second-category feature words and the first-category feature words;
the constructing unit is configured to construct a classification feature vector according to the category feature word sequence, the classification feature vector embodying the characteristics of corpora of different categories;
and the classifying unit is configured to classify the corpus to be classified through a classification model according to the classification feature vector, wherein the classification model is a non-deep-learning model.
In a third aspect, an embodiment of the present application provides an apparatus for classification, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of the first aspect according to instructions in the program code.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium for storing program code for executing the method of the first aspect.
According to the above technical solution, to reduce the requirements on the classification model, before the corpus to be classified is classified, if the corpus contains several first-category feature words representing objects to be classified and second-category feature words related to the classification requirement, the method determines, from the part-of-speech sequence corresponding to the corpus, a category feature word sequence embodying the classification association relation. This relation expresses which objects to be classified each second-category feature word is, or is not, associated with. The classification feature vector constructed from the sequence therefore carries this association information. Because the vector fed into the classification model both carries this information and embodies the characteristics of corpora of different categories, the model can directly determine, even when the corpus contains multiple objects to be classified, which object each category feature word relates to, and classify each object accordingly. This reduces the requirements on the classification model and the difficulty of training it.
At the same time, the category feature word sequence is ordered, and that order follows the rules of language expression, which ensures that the classification association relation obtained is accurate and that subsequent classification is performed correctly.
Drawings
To explain the embodiments of this application or the prior art more clearly, the drawings required for describing them are briefly introduced below. The drawings described below show only some embodiments of this application; those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic view of an application scenario of a classification method according to an embodiment of the present application;
fig. 2 is a flowchart of a classification method according to an embodiment of the present application;
fig. 3 is a flowchart for extracting category feature words according to a category sequence rule according to an embodiment of the present application;
fig. 4 is a process diagram for obtaining hybrid coding by concatenating word vectors according to the embodiment of the present application;
fig. 5 is an exemplary diagram of a concatenation result of word vectors in a category feature word sequence according to the embodiment of the present application;
fig. 6 is an exemplary diagram for constructing a classification feature vector based on context feature encoding according to an embodiment of the present application;
fig. 7 is an exemplary diagram for constructing a classification feature vector based on part-of-speech sequence features according to an embodiment of the present application;
fig. 8 is a flowchart of a classification method provided by an embodiment of the present application;
fig. 9 is a structural diagram of a classification apparatus according to an embodiment of the present application;
fig. 10 is a structural diagram of a terminal device according to an embodiment of the present application;
fig. 11 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
In some related technologies, when a text corpus containing multiple objects to be classified is processed, attribute words representing the objects are first identified in the corpus and then input into a neural network model with an attention mechanism. The attention mechanism learns which category feature words each attribute word is associated with, so that each object to be classified can then be classified according to its associated category feature words.
For example, on an e-commerce platform a user posts the comment "the room is very comfortable, the service is very good, and the price is not cheap". To better gauge the user's preferences, the user's sentiment toward each product attribute can be mined. The comment contains several product attributes, "room", "service", and "price", together with other category feature words relevant to sentiment classification, such as the sentiment words "comfortable", "good", and "cheap", the degree adverb "very", and the negation word "not". To classify the sentiment for each attribute accurately, the neural network model must keep learning the associations between category feature words until it knows which of them bear on the sentiment of "room", which on the sentiment of "service", and which on the sentiment of "price".
Under these circumstances the method is prone to classification errors, and improving its accuracy requires a more capable neural network model, which increases the difficulty of training it.
For this reason, the present application provides a classification method that may be applied to a data processing device. The device may be a terminal device, for example an intelligent terminal, a computer, a Personal Digital Assistant (PDA), or a tablet computer.
The data processing device may also be a server, either an independent server or a cluster of servers. The server can send classification results to a terminal device for display.
The classification method provided by the embodiments of this application can be applied to fields such as sentiment analysis, spam message identification and classification, and hacker attack identification. Sentiment analysis is used as the main running example below. It applies to many scenarios, such as e-commerce, news and information feeds, and microblog forums, and suits tasks such as public opinion analysis, recommendation, and user profiling. For example, on e-commerce platforms, mining the user's sentiment toward product attributes helps gauge the user's preferences and supports key decisions in merchant analysis, cross-marketing, and similar applications. In news and information scenarios such as stock watchlists, and on social platforms such as microblog forums, public opinion analysis of evaluated or followed objects can surface deeper information, which is valuable for analyzing why an individual stock rises or falls, understanding trending social topics, and exploring directions for future improvement.
In order to facilitate understanding of the technical solution of the present application, the following describes a classification method provided in the embodiments of the present application by taking a server as an example in combination with an actual application scenario.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of the classification method provided by an embodiment of this application. The scenario includes a terminal device 101 and a server 102. The server 102 obtains the corpus to be classified from the terminal device 101; the corpus may be of different types, such as comment information, SMS messages, or news. Different corpora call for different classifications: sentiment classification for comments, spam classification for SMS messages (for example, gambling or pornographic spam), and public opinion analysis for news.
The corpus to be classified may contain several first-category feature words representing objects to be classified and second-category feature words related to the classification requirement. Unlike classifying the corpus as a whole, the embodiments of this application can classify each of the multiple objects contained in the corpus separately.
After obtaining the corpus, the server 102 determines, from its corresponding part-of-speech sequence, a category feature word sequence embodying the classification association relation. This relation expresses which object to be classified each second-category feature word is associated with, and which category feature words are associated with none.
The server 102 then constructs a classification feature vector from the category feature word sequence, so the vector carries the association information between category feature words. When the server 102 feeds the vector into the classification model, the model can directly determine from this information which object each category feature word relates to, classify each object accordingly, and thereby reduce both the requirements on the model and the difficulty of training it. Because the classification feature vector already embodies the characteristics of corpora of different categories, the classification model used in these embodiments is a non-deep-learning model: it does not need to extract features through deep learning.
At the same time, the category feature word sequence is ordered, and that order follows the rules of language expression, ensuring that the association information obtained is accurate and that subsequent classification is performed correctly.
After classification is complete, the server 102 can send the result to the terminal device 101, which can then analyze, process, or make decisions based on it.
Next, a classification method provided by an embodiment of the present application will be described in detail with reference to the drawings, taking the data processing device as a server as an example.
Referring to fig. 2, fig. 2 shows a flowchart of a classification method comprising:
s201, determining a category characteristic word sequence reflecting a classification association relation according to the part-of-speech sequence corresponding to the corpus to be classified.
Each corpus to be classified has a corresponding part-of-speech sequence. If the corpus contains several first-category feature words representing objects to be classified and second-category feature words related to the classification requirement, then, to learn which first-category feature words each second-category feature word is associated with, the server determines from the part-of-speech sequence a category feature word sequence embodying the classification association relation, i.e., the association between the second-category and first-category feature words.
It should be noted that the part-of-speech sequence may be determined as follows: the server splits the corpus to be classified into sentences, then segments and tags each sentence. The part-of-speech sequence consists of the tag assigned to each segmented word; the segmentation result forms a word sequence, and the elements of the two sequences (tags and words) stay in one-to-one correspondence by position index.
Tagging includes part-of-speech tagging (noun, adjective, adverb, and so on) and category-feature-word labeling. The types of category feature words differ across classification scenarios. In sentiment classification, for example, they include attribute words, sentiment words, degree adverbs, and negation words: the first-category feature words are the attribute words, and the second-category feature words are any combination of sentiment words, degree adverbs, and negation words.
For example, in a sentiment classification scenario, take the comment "the room is very comfortable, the service is very good, and the price is not cheap" as the corpus to be classified. The text is first split into clauses, segmented, and part-of-speech tagged, giving: "room/n, very/d, comfortable/a, |, service/n, very/d, good/a, |, price/n, not/d, cheap/a", where "|" marks a clause boundary and each token is a segmented word followed by its part-of-speech tag after "/". Category feature words are then labeled: "room", "service", and "price" are attribute words (first-category feature words representing the objects to be classified, i.e., the evaluated objects in the comment), labeled "#"; "comfortable", "good", and "cheap" are sentiment words, labeled "+"; "very" is a degree adverb, labeled "&"; "not" is a negation word, labeled "!". The resulting part-of-speech sequence is: "#/n, &/d, +/a, |, #/n, &/d, +/a, |, #/n, !/d, +/a", where the first label "#/n" corresponds to the word "room".
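As a concrete illustration, the segmentation and labeling step above can be sketched in Python. This is an assumption for illustration, not the patent's implementation: the word lists and function names are invented, and a real system would use a proper tokenizer, part-of-speech tagger, and curated lexicons.

```python
# Illustrative sketch of clause splitting, segmentation, and labeling.
# The category lexicons are assumptions based on the running example.

ATTRIBUTE_WORDS = {"room", "service", "price"}      # first-category words -> '#'
SENTIMENT_WORDS = {"comfortable", "good", "cheap"}  # sentiment words      -> '+'
DEGREE_ADVERBS = {"very"}                           # degree adverbs       -> '&'
NEGATION_WORDS = {"not"}                            # negation words       -> '!'


def label_token(word):
    """Map a segmented word to its category label, or None if it is not a
    category feature word."""
    if word in ATTRIBUTE_WORDS:
        return "#"
    if word in SENTIMENT_WORDS:
        return "+"
    if word in DEGREE_ADVERBS:
        return "&"
    if word in NEGATION_WORDS:
        return "!"
    return None


def build_pos_sequence(clauses):
    """clauses: list of clauses, each a list of (word, pos_tag) pairs.
    Returns the labeled part-of-speech sequence and the word sequence,
    which stay aligned by position index ('|' marks clause boundaries)."""
    pos_seq, word_seq = [], []
    for i, clause in enumerate(clauses):
        if i > 0:
            pos_seq.append("|")
            word_seq.append("|")
        for word, pos in clause:
            label = label_token(word) or word
            pos_seq.append(f"{label}/{pos}")
            word_seq.append(word)
    return pos_seq, word_seq


clauses = [
    [("room", "n"), ("very", "d"), ("comfortable", "a")],
    [("service", "n"), ("very", "d"), ("good", "a")],
    [("price", "n"), ("not", "d"), ("cheap", "a")],
]
pos_seq, word_seq = build_pos_sequence(clauses)
# pos_seq == ["#/n", "&/d", "+/a", "|", "#/n", "&/d", "+/a", "|", "#/n", "!/d", "+/a"]
```

Note how the two returned sequences share the same length and position indices, which is what lets later steps map each label back to its word.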
The server can determine the category feature word sequence from the part-of-speech sequence, and that sequence embodies the association between the second-category and first-category feature words. For example, from the part-of-speech sequence "#/n, &/d, +/a, |, #/n, &/d, +/a, |, #/n, !/d, +/a" the category feature word sequences "room, very, comfortable", "service, very, good", and "price, not, cheap" can be determined. That is, "very" and "comfortable" evaluate "room", so these second-category feature words have a classification association with the first-category feature word "room"; "very" and "good" evaluate "service", so they have a classification association with "service"; and "not" and "cheap" evaluate "price", so they have a classification association with "price". When classifying the sentiment of "room", only "very" and "comfortable" are considered, not "not" and "cheap"; likewise, "very" and "good" are considered for "service", and "not" and "cheap" for "price".
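The grouping just described can be sketched as follows. Grouping by clause boundary is an assumption consistent with the running example (the patent also describes rule-based extraction via category sequence rules), and the function name is illustrative.

```python
# Illustrative sketch: recover the category feature word sequences from the
# aligned label/word sequences, pairing each attribute word with the degree
# adverbs, negations, and sentiment words in the same clause.

CATEGORY_LABELS = {"#", "&", "!", "+"}


def extract_category_sequences(pos_seq, word_seq):
    """Group category feature words clause by clause."""
    sequences, current = [], []
    for tag, word in zip(pos_seq, word_seq):
        if tag == "|":  # clause boundary: close the current group
            if current:
                sequences.append(current)
            current = []
        elif tag.split("/")[0] in CATEGORY_LABELS:
            current.append(word)
    if current:
        sequences.append(current)
    return sequences


pos_seq = ["#/n", "&/d", "+/a", "|", "#/n", "&/d", "+/a", "|", "#/n", "!/d", "+/a"]
word_seq = ["room", "very", "comfortable", "|", "service", "very", "good", "|",
            "price", "not", "cheap"]
groups = extract_category_sequences(pos_seq, word_seq)
# groups == [["room", "very", "comfortable"],
#            ["service", "very", "good"],
#            ["price", "not", "cheap"]]
```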
S202: construct a classification feature vector according to the category feature word sequence.
After the server obtains the category feature word sequence, it can construct a classification feature vector from it. Because the sequence embodies the classification association relation, the resulting vector also carries the association information.
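One way this construction could look, as a minimal sketch: concatenate a vector for each word of the category feature word sequence in order, so the vector preserves the sequence's association information. The 3-dimensional toy embeddings below are invented for illustration; the patent's figures describe concatenating word vectors into a hybrid encoding, for which pretrained embeddings would normally be used.

```python
# Minimal sketch of building a classification feature vector by
# concatenating per-word vectors in sequence order. Toy embeddings only.

EMBEDDING_DIM = 3
TOY_EMBEDDINGS = {
    "price": [0.1, 0.9, 0.0],
    "not": [0.0, 0.0, 1.0],
    "cheap": [0.7, 0.1, 0.2],
}


def build_feature_vector(sequence, max_len=3):
    """Concatenate word vectors in sequence order, zero-padding to a fixed
    length so every corpus yields a vector of the same dimensionality."""
    vector = []
    for i in range(max_len):
        if i < len(sequence):
            vector.extend(TOY_EMBEDDINGS.get(sequence[i], [0.0] * EMBEDDING_DIM))
        else:
            vector.extend([0.0] * EMBEDDING_DIM)
    return vector


vec = build_feature_vector(["price", "not", "cheap"])
# vec == [0.1, 0.9, 0.0, 0.0, 0.0, 1.0, 0.7, 0.1, 0.2]
```

Fixing the length with zero-padding keeps every input the same dimensionality, which is what a downstream classifier such as an SVM requires.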
S203: classify the corpus to be classified through the classification model according to the classification feature vector.
The server inputs the classification feature vector into the classification model. Because the vector carries the association information between category feature words and embodies the characteristics of corpora of different categories, the model can directly determine which object to be classified each category feature word relates to, classify each object accordingly, and thereby lower the requirements on the model.
Because the classification feature vector fed into the model already carries the association information between category feature words, the classification model used in these embodiments is a non-deep-learning model: it only needs a classification capability to classify from the vector, and does not need to extract the features (the classification feature vector) through deep learning. The model may therefore be a Support Vector Machine (SVM) classifier, a softmax layer, a boosted tree model (XGBoost), or a Logistic Regression (LR) classifier.
For example, in a sentiment classification scenario, sentiment is usually divided into positive, neutral, and negative, with labels 1, 0, and -1 respectively. From the category feature word sequences of "the room is very comfortable, the service is very good, and the price is not cheap", the model knows that "room" has a classification association with "very" and "comfortable", considers those words when classifying "room", and determines that the sentiment for the attribute word (first-category feature word) "room" is positive (label 1). Similarly, it determines that the sentiment for "service" is positive (label 1) and the sentiment for "price" is negative (label -1).
Specifically, the classification model can compute the probability of each category from the classification feature vector and assign the object represented by the first-category feature word to the category with the highest probability.
Taking an SVM classifier in a sentiment classification scenario as an example: with sentiment categories positive, neutral, and negative, the SVM maps the classification feature vector into a high-dimensional space and outputs, for the sentiment words associated with each attribute word (first-category feature word), a probability for each sentiment category. If the probabilities are 0.8 (positive), 0.6 (neutral), and 0.1 (negative), the largest is "positive", so the sentiment expressed by the sentiment words associated with the attribute word is positive, and the object represented by that attribute word is assigned to the "positive" category.
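The final decision step in this example can be sketched as follows. The classifier itself is omitted; `pick_category` is an illustrative helper, and the probabilities are the ones used in the example above.

```python
# Sketch of the decision step: given per-category probabilities from the
# classifier (e.g. an SVM with probability estimates), pick the category
# with the highest probability.

def pick_category(probabilities):
    """probabilities: dict mapping category label -> probability.
    Returns the label with the highest probability."""
    return max(probabilities, key=probabilities.get)


# Sentiment labels from the text: 1 = positive, 0 = neutral, -1 = negative.
probs = {1: 0.8, 0: 0.6, -1: 0.1}
sentiment = pick_category(probs)
# sentiment == 1, i.e. the object is assigned to the "positive" category
```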
According to the technical scheme, in order to reduce the requirement on the classification model, before the corpus to be classified is classified, if the corpus to be classified comprises a plurality of first class feature words for representing the objects to be classified and second class feature words related to the classification requirement, the classification method provided by the application can determine the class feature word sequence for representing the classification association relationship according to the part-of-speech sequence corresponding to the corpus to be classified, wherein the classification association relationship is used for representing the association relationship between the second class feature words and the first class feature words, namely, the classification feature words are respectively associated with which objects to be classified and have no association relationship when the classification is represented. Therefore, the classification characteristic vector constructed according to the class characteristic word sequence carries the classification incidence relation information among the class characteristic words. Because the classification characteristic vectors input into the classification model carry the classification incidence relation information among the classification characteristic words, and the classification characteristic vectors can embody the characteristics of the corpora of different classes, even if the corpora to be classified comprise a plurality of objects to be classified, the classification model can also directly determine which object to be classified the classification characteristic words are respectively related to according to the existing classification incidence relation information, thereby classifying the objects to be classified, reducing the requirements on the classification model and reducing the training difficulty of the classification model. 
Meanwhile, the category feature word sequence is ordered, and this order accords with the rules of language expression, which ensures that the obtained classification association relationship is accurate and that subsequent classification can be performed accurately.
In addition, the whole process is efficient and fully automatic. Compared with current deep learning models, which require a complicated and time-consuming training stage, the method provided by the embodiment of the application has higher practical value and reference significance in industrial applications.
It can be understood that, because the method provided by the embodiment of the present application can determine the classification association relationship between the second-category feature words and the first-category feature words of the corpus to be classified, so that the constructed classification feature vector carries this association information, the server can classify different objects to be classified (e.g., room, service, price) at one time through the classification model. Of course, classification can also be performed on one object at a time, realizing classification of the different objects over multiple passes: for example, first performing emotion classification on "room" according to the classification association between "very", "comfortable" and "room"; then performing emotion classification on "service" according to the classification association between "very", "good" and "service"; and finally performing emotion classification on "price" according to the classification association between its corresponding degree adverb and emotion word and "price".
It should be noted that there are multiple ways to determine, in S201, the category feature word sequence representing the classification association relationship. Since a class sequence rule can, to a certain extent, identify the classification association between the second-category feature words and the first-category feature words of a part-of-speech sequence covering that rule, in some possible implementations the server may determine the category feature word sequence according to a target class sequence rule.
The target class sequence rule is a class sequence rule that the part-of-speech sequence corresponding to the corpus to be classified accords with. A Class Sequence Rule (CSR) is a rule composed of a class label and a sequence; it embodies a mapping relationship between the sequence and the class label and is expressed as X → Y, described specifically as follows:
X is a sequence, expressed as <s1 x1 s2 x2 … si xi>, where S refers to a sequence database, a set of tuples <sid, s> as shown in Table 1, sid being the number of a sequence and s the sequence itself, and xi indicates category information, i.e., the possible categories of the sequence:
TABLE 1 sequence database example

  Sequence number    Sequence
  1                  <a b d C1 g h>
  2                  <a b e g h k>
  3                  <C2 k e a>
  4                  <d C2 k b>
  5                  <a b C1 f g h>
Y is another sequence, expressed as <s1 c1 s2 c2 … si cr>, where cr ∈ C and 1 ≤ i ≤ r, S is defined as above, cr indicates category information (a particular class label), and C = {c1, c2, …, cr} is the set of class labels. Thus, CSR requires that sequences carry the specified category information.
After the category information is specified, CSR mines the sequences satisfying the requirements as rules. Taking Table 1 as an example, the sequence database contains 5 sequences with category information; according to the above definition, a class sequence rule that can be mined from the sequence database shown in Table 1 is <ab*gh> → <ab c1 gh>, where * marks the position of the class label.
It should be noted that, in the embodiment of the present application, the target class sequence rule used for classifying the corpus to be classified may be obtained in different ways. In some cases the corpus to be classified is close in expression to the historical corpus, for example when both belong to the same field, such as comment information. In that case, when classifying the corpus, the category feature word sequence can be determined directly according to an existing target class sequence rule, avoiding mining class sequence rules again, reducing the amount of calculation and improving classification efficiency.
It can be understood that, according to the definition of class sequence rule mining introduced above, the process of extracting category feature words according to class sequence rules is shown in FIG. 3: CSR first determines the classes (S301) and then mines the rules according to the classes. In a class sequence rule, the left side is the sequence and the right side is the corresponding class label; the sequence and the category information identified by the class label are bound together through the mapping relationship. The goal of CSR mining is to find sequences that are highly correlated with category information, i.e., to mine rules describing the correspondence between sequences and class labels. It follows that class sequence rule mining is supervised and relies on pre-given class information.
In this embodiment of the present application, a target class sequence rule may be mined according to a support threshold and a confidence threshold. Specifically, the support threshold is set (S302), the frequent sequences satisfying the support threshold are determined from the plurality of part-of-speech sequences corresponding to the historical corpus (S303), and if the confidence of a frequent sequence satisfies the confidence threshold (S304), that frequent sequence is determined to be a target class sequence rule (S305).
Taking Table 1 above as an example, the sequences numbered 1 and 5 contain the class sequence rule <ab*gh> → <ab c1 gh> and both carry the class label c1, while the sequences numbered 1, 2 and 5 cover the sequence part of this rule, the sequence numbered 2 carrying no such class label. Therefore, among the 5 sequence tuples, the support of the rule is 2/5 and its confidence is 2/3. These two indexes are used as the measuring standard for mining the target class sequence rule: the sequences satisfying the minimum support threshold and the confidence threshold are extracted as the class sequence rules of the sequence database.
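The support/confidence computation for this example can be sketched as follows. The sketch assumes the subsequence-containment semantics described above; the database is Table 1, and the helper names are illustrative.

```python
# Support and confidence of the class sequence rule <ab*gh> -> <ab c1 gh>
# over the Table 1 sequence database.

def contains_subsequence(sequence, pattern):
    """True if `pattern` occurs in `sequence` as a (not necessarily
    contiguous) subsequence; `in` advances the shared iterator."""
    it = iter(sequence)
    return all(item in it for item in pattern)

database = [  # (sid, sequence), as in Table 1
    (1, ["a", "b", "d", "C1", "g", "h"]),
    (2, ["a", "b", "e", "g", "h", "k"]),
    (3, ["C2", "k", "e", "a"]),
    (4, ["d", "C2", "k", "b"]),
    (5, ["a", "b", "C1", "f", "g", "h"]),
]

rule_with_label = ["a", "b", "C1", "g", "h"]  # right-hand side <ab c1 gh>
rule_sequence = ["a", "b", "g", "h"]          # left-hand side <ab*gh>

matches_rule = sum(contains_subsequence(s, rule_with_label) for _, s in database)
covers_seq = sum(contains_subsequence(s, rule_sequence) for _, s in database)

support = matches_rule / len(database)  # 2/5
confidence = matches_rule / covers_seq  # 2/3
print(support, round(confidence, 2))
```

Sequences 1 and 5 match the labeled rule, while sequences 1, 2 and 5 cover its sequence part, reproducing the support of 2/5 and confidence of 2/3 computed in the text.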
It should be noted that there are many algorithms for mining CSR, such as the Generalized Sequential Pattern (GSP) algorithm and the PrefixSpan sequential pattern mining algorithm. Here, the frequent sequences satisfying the minimum support are mined through the PrefixSpan algorithm, which is based on frequent pattern mining. Meanwhile, considering that the lengths of the sequences differ greatly, a single fixed minimum support is not suitable for class sequence rule mining: to mine a low-frequency sequence, the support threshold would have to be lowered, which would introduce a large number of rules generated by high-frequency words and thus introduce noise. Therefore, the embodiment of the application uses a multiple minimum support strategy: the minimum support min_sup is obtained by multiplying the minimum support rate a by the sequence length n, as shown in the following formula:
min_sup=a×n
where a is the minimum support rate, preset, for example, to a value between 0.01 and 0.1, and n is the sequence length, i.e., the number of part-of-speech sequences obtained from the historical corpus. The higher the support threshold, the higher the accuracy of the mined target class sequence rules.
In addition, the confidence represents the credibility of the determined class sequence rule. Since a class sequence rule includes a sequence and class labels, the more class labels it carries, the more credible it is. Therefore, in this embodiment, the ratio of the number of class labels in the frequent sequence to the preset number of class labels can be used as the confidence of the frequent sequence, so as to determine whether it satisfies the confidence threshold. The confidence threshold is preset.
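The two acceptance tests above, the multiple-minimum-support check min_sup = a × n and the label-count confidence ratio, can be sketched together as follows. The function name, default thresholds and the example numbers are illustrative assumptions.

```python
# Sketch of the acceptance test for a candidate frequent sequence, combining
# the multiple-minimum-support strategy (min_sup = a * n) with the confidence
# measure defined as (#class labels in the sequence) / (preset #labels).

def is_target_class_sequence_rule(occurrences, n_sequences, labels_in_sequence,
                                  preset_label_count=4, min_support_rate=0.05,
                                  confidence_threshold=0.1):
    # n: number of part-of-speech sequences obtained from the corpus
    min_sup = min_support_rate * n_sequences          # multiple minimum support
    confidence = labels_in_sequence / preset_label_count
    return occurrences >= min_sup and confidence >= confidence_threshold

# A candidate observed 3 times among 20 sequences, carrying 1 of 4 preset labels:
print(is_target_class_sequence_rule(occurrences=3, n_sequences=20,
                                    labels_in_sequence=1))  # True
```

With a = 0.05 and n = 20, min_sup is 1.0, and one label out of four gives a confidence of 0.25, so the candidate passes both thresholds.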
Assume that the part-of-speech sequences corresponding to the historical corpus include sequences such as "#/n, &/d, */a"; then "#/n, &/d, */a" can be determined as the target class sequence rule in the manner described above. In the emotion classification scene, the class labels "#", "&" and "*" respectively represent the categories to which the category feature words in those positions belong, namely attribute words, degree adverbs and emotion words.
In some cases, because the labeling in the feature word dictionary may not be comprehensive enough, and in order to reduce the labeling workload, unlabeled category feature words may exist in the historical corpus. Therefore, after the target class sequence rule is obtained by mining, it can be matched against the unlabeled text to mine category feature words (S306): the categories of the unlabeled feature words are determined according to the class labels in the target class sequence rule, and all category feature words are obtained by mining. For example, the category feature words may comprise attribute words, emotion words, degree adverbs and negative words. The mining result is added into the feature word dictionary as new category feature words, and the labels for the next round of labeling are updated, so that multiple rounds of iterative mining are performed.
A relatively high support can be set for each round of mining to guarantee the accuracy of the mined rules; the target class sequence rules are mined iteratively by labeling new class labels over multiple rounds, and all category feature words are mined according to the target class sequence rules.
For example, for a historical corpus "the location of the hotel is very close, the air is particularly good, and the room is quite comfortable", word segmentation and part-of-speech tagging are likewise performed, and category feature words are labeled according to the existing class labels. Assume the existing class labels are: attribute word: room; degree word: very; emotion word: good. The obtained part-of-speech sequence is "/r, /n, /u, /n, &/d, /a, |, /n, /d, */a, |, #/n, /d, /a", and "#/n, &/d, */a" mined as described above is the target class sequence rule. If the confidence threshold is set to 0.1 and, taking the emotion classification scene as an example, there are 4 class labels in total (i.e., the preset number of class labels is 4), then as long as one or more class labels appear in a frequent sequence, its confidence reaches 0.25 or more and satisfies the requirement for a target class sequence rule. Since "/n, &/d, /a", "/n, /d, */a" and "#/n, /d, /a" in the part-of-speech sequence all meet the support threshold and confidence threshold requirements of the target class sequence rule "#/n, &/d, */a", the words in the corresponding positions can be extracted as new category feature words; that is, the newly added class labels are: attribute words: location, air; degree words: particularly, quite; emotion words: close, comfortable. All the obtained category feature words are then: attribute words: location, air, room; degree words: very, particularly, quite; emotion words: close, good, comfortable.
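One iteration of this feature-word mining can be sketched as follows. The clause representation, the English glosses of the Chinese words, and the simplification that a clause matches the rule when its part-of-speech pattern equals "n, d, a" are all assumptions made for the sketch.

```python
# Sketch of one mining iteration: clauses whose part-of-speech pattern matches
# the target class sequence rule contribute their words, per rule position, to
# the attribute/degree/emotion dictionaries.

RULE_POS = ["n", "d", "a"]                        # POS pattern of the rule
RULE_LABELS = ["attribute", "degree", "emotion"]  # class label per position

clauses = [  # (word, part of speech) per clause of the historical corpus
    [("location", "n"), ("very", "d"), ("close", "a")],
    [("air", "n"), ("particularly", "d"), ("good", "a")],
    [("room", "n"), ("quite", "d"), ("comfortable", "a")],
]

# Existing (partially labeled) feature word dictionary.
lexicon = {"attribute": {"room"}, "degree": {"very"}, "emotion": {"good"}}

for clause in clauses:
    if [pos for _, pos in clause] == RULE_POS:    # clause covers the rule
        for (word, _), label in zip(clause, RULE_LABELS):
            lexicon[label].add(word)              # newly mined feature words

print(sorted(lexicon["attribute"]))  # ['air', 'location', 'room']
```

After the pass, the dictionary contains exactly the feature words listed in the text: location/air/room, very/particularly/quite, close/good/comfortable.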
It can be understood that, in some cases, the expression of the corpus to be classified may differ greatly from that of the historical corpus, for example when the corpus to be classified is comment information and the historical corpus is news information. In such a case, the class sequence rules mined from the historical corpus may not be applicable to the corpus to be classified; then, if the corpus to be classified includes a plurality of corpora, the target class sequence rule may be mined directly from those corpora. In addition, even when the corpus to be classified is relatively close in expression to the historical corpus, the target class sequence rule can be re-mined from a plurality of corpora to be classified in order to improve the accuracy of the rule according to which classification is performed.
The mining method of the target class sequence rule may refer to the mining method described above, and similarly, if there is an unlabeled class feature word in the part-of-speech sequence, after the target class sequence rule is mined, the class to which the unlabeled class feature word in the part-of-speech sequence belongs may be determined according to the class label in the class sequence rule, and all class feature words corresponding to the part-of-speech sequence are mined to obtain a class feature word sequence.
It should be noted that, in this embodiment, the classification feature vector may simply be the vector of each category feature word in the category feature word sequence. Of course, since related information of the category feature words in the sequence may also affect the classification result, in order to improve classification accuracy, the classification feature vector may also be constructed by combining this related information, further improving the accuracy of the constructed vector; a more accurate classification feature vector in turn improves classification accuracy.
The ways of constructing the classification feature vector in combination with related information of the category feature words in the category feature word sequence may include multiple modes, for example: splicing word vectors of the category feature words (the related information being the word vectors); constructing the vector according to the position features of the characters of the category feature words in the corpus to be classified (the position features); according to the context features of the category feature words in the corpus to be classified (the context features); according to the part-of-speech sequence features of the category feature words in the category feature word sequence (the part-of-speech sequence features); and according to the dependency syntactic relation features of the category feature words in the category feature word sequence (the dependency syntactic relation features). These construction methods are described in sequence below.
Splicing word vectors of the category feature words to construct classification feature vectors:
In some cases, the vectors corresponding to different category feature words are different, yet the words may have similar meanings because they contain the same or similar characters, such as two words both meaning "above" that differ only in one character. If the classification feature vector is constructed only from the word-level vectors, the two look unrelated; but if the influence of the vector of each character in the category feature word is also considered, the obtained classification feature vector reflects the features of the category feature word more accurately.
Therefore, the server can splice character vectors into the word vectors of the category feature words to construct the classification feature vector. Specifically, each category feature word is taken both as a whole word and split into individual characters; the vector of each word and each character is obtained through a vector-generation model such as word2vec, and the word and character vectors are then spliced. In order to obtain character vectors aligned with the word vector, the word vector needs to be repeated once per character, the number of repetitions being the number of characters composing the word. The process of obtaining the hybrid encoding by splicing is shown in FIG. 4, which takes the category feature word "position" in the category feature word sequence "the position is very close" as an example: the word vector of "position" is spliced with the character vectors of its two constituent characters respectively.
In the same way, the splicing results of the other words in the category feature word sequence can be obtained. As shown in FIG. 5, the word-character splicing result of "position" is obtained from "position", that of "very" from "very", and that of "close" from "close"; the corresponding hybrid word-character vector sequence is thus obtained for the category feature word sequence "the position is very close" matching the class sequence rule.
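The word/character hybrid encoding can be sketched as follows. Toy 3-dimensional vectors stand in for word2vec output; in the described scheme the vectors would be learned, and the dictionary contents here are assumptions.

```python
# Sketch of the hybrid encoding of FIG. 4-5: the word vector is repeated once
# per character and concatenated with each character vector, so the word-level
# and character-level representations stay aligned position by position.

word_vec = {"位置": [0.1, 0.2, 0.3]}  # word-level vector ("position")
char_vec = {"位": [0.4, 0.5, 0.6],    # character-level vectors
            "置": [0.7, 0.8, 0.9]}

def hybrid_encoding(word):
    """Concatenate the word vector with the vector of each of its characters."""
    return [word_vec[word] + char_vec[ch] for ch in word]

encoded = hybrid_encoding("位置")
print(len(encoded), len(encoded[0]))  # 2 positions, 6 dimensions each
```

Each character position yields one concatenated vector, twice the base dimensionality, which is what keeps the character vectors aligned with the repeated word vector.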
Constructing a classification feature vector according to the position features of the category feature words in the category feature word sequence:
Because characters may occupy different positions in the category feature word sequence, and words composed of the same characters in different orders can express different meanings, the meaning of a category feature word depends on character position. Therefore, in order to improve the accuracy of the classification feature vector, the server may combine the position features of the characters of the category feature words in the category feature word sequence when constructing the vector.
After sentence segmentation, each character in a sentence is given a position number, and the position numbered w is mapped into a position vector of fixed dimension, such as 200 dimensions, so that the position vector of each character (embodying the position feature) is obtained by calculation. The ith element value of the vector is PE_i(w), calculated as:

PE_2i(w) = sin(w / 10000^(2i/200))
PE_2i+1(w) = cos(w / 10000^(2i/200))
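These are the sinusoidal position-encoding formulas; a direct sketch of the 200-dimensional position vector for a character at position w follows. The function name is illustrative.

```python
import math

# Sketch of the position vector: for position w, dimension 2i holds
# sin(w / 10000^(2i/dim)) and dimension 2i+1 holds cos(w / 10000^(2i/dim)).

def position_vector(w, dim=200):
    pe = []
    for i in range(dim // 2):
        angle = w / (10000 ** (2 * i / dim))
        pe.append(math.sin(angle))  # PE_2i(w)
        pe.append(math.cos(angle))  # PE_2i+1(w)
    return pe

vec = position_vector(w=3)
print(len(vec))  # 200
```

Every character thus receives a fixed-dimension vector determined solely by its position number, which can be spliced with (or added to) its word-character hybrid vector.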
constructing a classification feature vector according to the context features of the category feature words in the corpus to be classified:
In some cases, the language expression in the corpus to be classified may not be fully captured. For example, in "the house of the hotel is spacious and bright", "bright" is also an evaluation of "house", but when the category feature word sequence is determined, the category feature words are extracted from the corpus, and "bright", which has a certain effect on the subsequent classification, may be ignored. In order to avoid ignoring important information in the corpus and to improve the accuracy of the constructed classification feature vector, in this embodiment the server may construct the classification feature vector in combination with the context features of the category feature words.
The context features of a category feature word may be determined by selecting the n words before and after it as window words for encoding. For example, if the corpus to be classified is "the house of the hotel is spacious and bright" and the determined category feature word sequence is "the house is spacious", selecting 2 words before and after the sequence as window words (i.e., n = 2) gives the window words "the hotel" and "bright"; the context features of "the house is spacious" are thus encoded to construct the classification feature vector. The encoding may combine position vectors with word-character hybrid vectors: as shown in FIG. 6, the hybrid vector of each window word is spliced with its position vector, thereby constructing the classification feature vector based on the context features.
In addition to splicing the word-character hybrid vector and the corresponding position vector, the two vectors can instead be added dimension by dimension; both are methods of constructing a classification feature vector based on context features that can be adopted in this embodiment.
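The window-word selection step can be sketched as follows. The English tokenization of the example sentence is an assumption made for illustration; only the before/after window extraction is shown, not the subsequent vector encoding.

```python
# Sketch of selecting n window words on each side of a category feature word
# sequence, as in the "house is spacious" example with n = 2.

def window_words(tokens, start, end, n=2):
    """Return up to n tokens before tokens[start:end] and n tokens after."""
    before = tokens[max(0, start - n):start]
    after = tokens[end:end + n]
    return before, after

tokens = ["the hotel", "'s", "house", "is", "spacious", "and", "bright"]
# The category feature word sequence "house is spacious" occupies tokens[2:5].
before, after = window_words(tokens, 2, 5, n=2)
print(before, after)
```

The returned window words ("the hotel" before, "bright" after, among others) are the tokens whose hybrid and position vectors would then be spliced or added to encode the context.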
Constructing a classification feature vector according to the part-of-speech sequence features of the category feature words in the category feature word sequence:
The part of speech of a category feature word can further reflect its role in classification. For example, in an emotion classification scene, if the part of speech of a category feature word is noun (n), the word is generally an object to be classified (attribute word); if it is adjective (a), the word is generally an emotion word. Therefore, in order to further improve the accuracy of the classification feature vector, the server may construct it according to the part-of-speech sequence features of the category feature words in the category feature word sequence. The part-of-speech sequence features can be encoded by constructing a part-of-speech dictionary and encoding in dictionary dimensions with a 6-bit binary code. The part-of-speech dictionary dimension table is shown in Table 2:
TABLE 2 part-of-speech dictionary dimension Table
(Table 2 is reproduced only as an image in the source publication; it maps each part of speech in the part-of-speech dictionary to its 6-bit binary code.)
For the category feature word sequence "the room is very large", the corresponding part-of-speech sequence is "/n, /d, /a". Table 2 is looked up for each element of the part-of-speech sequence and the codes are then concatenated; the concatenation result (the obtained part-of-speech sequence feature), shown in FIG. 7, is the part-of-speech code of "n", followed by that of "d", followed by that of "a".
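The look-up-and-concatenate step can be sketched as follows. Since Table 2 survives only as an image, the 6-bit codes below are illustrative assumptions, not the patent's actual dictionary.

```python
# Sketch of encoding the part-of-speech sequence "n, d, a": look up the 6-bit
# code of each part of speech in the (assumed) dictionary and concatenate.

POS_CODES = {      # assumed 6-bit binary codes, one per dictionary entry
    "n": "000001",
    "d": "000010",
    "a": "000100",
}

def encode_pos_sequence(pos_sequence):
    """Concatenate the 6-bit code of each part of speech in the sequence."""
    return "".join(POS_CODES[p] for p in pos_sequence)

print(encode_pos_sequence(["n", "d", "a"]))  # 000001000010000100
```

A three-element part-of-speech sequence thus yields an 18-bit feature, the analogue of the FIG. 7 concatenation.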
Constructing a classification feature vector according to the dependency syntactic relation features of the category feature words in the category feature word sequence:
The dependency syntactic relations of the category feature words in the category feature word sequence can also reflect their role in classification, so in order to further improve the accuracy of the classification feature vector, the server may construct it according to the dependency syntactic relation features of the category feature words in the sequence. The dependency syntactic relations can be encoded by constructing a dependency-relation dictionary and encoding in dictionary dimensions with a 4-bit binary code. The dependency syntactic relation type dictionary dimension table is shown in Table 3:
TABLE 3 dependency syntactic relationship type dictionary dimension Table
(Table 3 is likewise reproduced only as an image in the source publication; it maps each dependency syntactic relation type to its 4-bit binary code.)
The corresponding classification feature vector is then constructed according to the dependency syntactic relation features of the category feature words in the category feature word sequence.
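The dependency-relation encoding mirrors the part-of-speech encoding and can be sketched as follows. Because Table 3 survives only as an image, both the relation tags and their 4-bit codes below are illustrative assumptions.

```python
# Sketch of encoding dependency syntactic relations: look up the 4-bit code of
# each word's relation type in the (assumed) dictionary and concatenate.

DEP_CODES = {       # assumed 4-bit binary codes for dependency relation types
    "SBV": "0001",  # subject-verb
    "ADV": "0010",  # adverbial
    "HED": "0100",  # head of the sentence
}

def encode_dependency(relations):
    """Concatenate the 4-bit code of each dependency relation."""
    return "".join(DEP_CODES[r] for r in relations)

# e.g. a parse assigning "room"(SBV), "quite"(ADV), "comfortable"(HED):
print(encode_dependency(["SBV", "ADV", "HED"]))  # 000100100100
```

A three-word category feature word sequence thus yields a 12-bit dependency feature that can be used alone or fused with the other features.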
It should be noted that the classification feature vectors constructed above may be used independently, or feature fusion may be performed on several of them, with the fused classification feature vector then input into the classification model.
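The two fusion options that appear in this embodiment, splicing vectors end to end or adding them dimension by dimension, can be sketched with toy vectors as follows; the function names are illustrative.

```python
# Sketch of feature fusion: concatenation preserves every dimension, while
# dimension-wise addition requires the fused vectors to share a dimension.

def fuse_concat(vectors):
    """Splice the feature vectors end to end."""
    return [x for v in vectors for x in v]

def fuse_add(vectors):
    """Add the feature vectors dimension by dimension."""
    return [sum(dims) for dims in zip(*vectors)]

a, b = [1.0, 2.0], [0.5, 0.5]
print(fuse_concat([a, b]))  # [1.0, 2.0, 0.5, 0.5]
print(fuse_add([a, b]))     # [1.5, 2.5]
```

Concatenation grows the input dimension of the classification model with each fused feature; addition keeps it fixed, which is the trade-off between the two choices.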
In this embodiment, an accurate classification feature vector is constructed on the basis of the category feature word sequence by combining one or more of the above ways, improving the accuracy of the classification feature vector's characterization. This accurate characterization capability greatly improves the classification effect and reduces the requirement on the classification model during classification.
Next, the classification method provided in the embodiment of the present application is described in detail with reference to an actual application scenario: an emotion classification scenario for comment information. In this scenario, the corpus to be classified is "the location of the hotel is very close, the air is particularly good, and the room is quite comfortable", and the existing target class sequence rule is "#/n, &/d, */a". Referring to FIG. 8, the classification method includes:
s801, preprocessing (sentence segmentation and word segmentation) and labeling processing are carried out on the corpus to be classified.
Assume that the existing class labels are: attribute word: room; degree word: very; emotion word: good. Then the part-of-speech sequence of the corpus to be classified is "/r, /n, /u, /n, &/d, /a, |, /n, /d, */a, |, #/n, /d, /a".
S802, determining a sequence which accords with the target class sequence rule in the part of speech sequence.
And S803, mining the category characteristic words according to the target category sequence rule to obtain a category characteristic word sequence.
The part-of-speech sequence contains unlabeled category feature words; the unlabeled feature words in the sequences matching the rule can be mined according to the target class sequence rule to obtain the category feature word sequences.
For example, "/n, &/d, /a", "/n, /d, */a" and "#/n, /d, /a" all conform to the target class sequence rule "#/n, &/d, */a", so the words in the corresponding positions may be extracted as new category feature words; that is, the newly added class labels are: attribute words: location, air; degree words: particularly, quite; emotion words: close, comfortable. All the obtained category feature words are then: attribute words: location, air, room; degree words: very, particularly, quite; emotion words: close, good, comfortable. The obtained category feature word sequences are "the location is very close", "the air is particularly good" and "the room is quite comfortable".
And S804, constructing a classification feature vector according to the classification feature word sequence.
And S805, inputting the classification feature vector into the SVM for emotion classification.
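The S801-S805 pipeline can be sketched end to end under heavy simplifying assumptions: the clauses are already segmented and POS-tagged, a clause matches the target rule when its pattern equals "n, d, a", the mined (attribute, degree, emotion) triple stands in for the classification feature vector, and a lexicon look-up stands in for the trained SVM.

```python
# End-to-end sketch of S801-S805 (illustrative stand-ins, not the patent's
# actual implementation).

TARGET_RULE = ["n", "d", "a"]                              # "#/n, &/d, */a"
POSITIVE_EMOTION_WORDS = {"close", "good", "comfortable"}  # assumed lexicon

def classify_corpus(clauses):
    results = {}
    for clause in clauses:  # S802/S803: match the rule, mine feature words
        if [pos for _, pos in clause] == TARGET_RULE:
            attribute, degree, emotion = (w for w, _ in clause)
            # S804/S805: here a classification feature vector would be built
            # and fed to the SVM; a lexicon look-up stands in for it.
            results[attribute] = ("positive"
                                  if emotion in POSITIVE_EMOTION_WORDS
                                  else "negative")
    return results

clauses = [  # S801: preprocessed and POS-tagged corpus to be classified
    [("location", "n"), ("very", "d"), ("close", "a")],
    [("air", "n"), ("particularly", "d"), ("good", "a")],
    [("room", "n"), ("quite", "d"), ("comfortable", "a")],
]
print(classify_corpus(clauses))
```

All three attribute words (location, air, room) come out "positive", matching the emotion the example corpus expresses toward each object to be classified.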
Based on the classification method provided by the foregoing embodiment, an embodiment of the present application provides a classification apparatus, as shown in fig. 9, the apparatus includes a determining unit 901, a constructing unit 902, and a classifying unit 903:
the determining unit 901 is configured to determine a category feature word sequence representing a classification association relationship according to a part-of-speech sequence corresponding to a corpus to be classified, where the corpus to be classified includes a plurality of first category feature words representing objects to be classified and second category feature words related to a classification requirement, and the classification association relationship is used to represent an association relationship between the second category feature words and the first category feature words;
the constructing unit 902 is configured to construct a classification feature vector according to the category feature word sequence, where the classification feature vector embodies the corpus features of different categories;
the classifying unit 903 is configured to classify the corpus to be classified through a classification model according to the classification feature vector, where the classification model is a non-deep learning model.
In an implementation manner, the determining unit 901 is configured to:
determining the category characteristic word sequence according to a target category sequence rule; and the target class sequence rule identifies the classification association relation between the second class characteristic words and the first class characteristic words.
In one implementation, the target class sequence rule is obtained by mining according to historical corpus.
In one implementation, the mining method of the target class sequence rule is as follows:
determining a frequent sequence meeting a support degree threshold value from a plurality of part-of-speech sequences corresponding to the historical corpus;
and if the confidence of the frequent sequence meets a confidence threshold, determining that the frequent sequence meets a target class sequence rule.
In one implementation, the confidence of the frequent sequence is a ratio of the number of class tags in the frequent sequence to a preset number of class tags.
In an implementation manner, if the corpus to be classified includes a plurality of corpora, the target class sequence rule is obtained by mining the corpus to be classified.
In an implementation manner, if there is an unlabeled category feature word in the part of speech sequence, the determining unit 901 is further configured to:
and determining the category to which the unlabeled category characteristic words in the part of speech sequence belong according to the category label in the target category sequence rule, and mining to obtain all category characteristic words corresponding to the part of speech sequence.
In one implementation, the constructing unit 902 is configured to:
and constructing the classification feature vector according to the related information of the class feature words in the class feature word sequence.
In one implementation, the constructing unit 902 constructs the classification feature vector by one or more of the following methods:
splicing word vectors of the category feature words to construct the classification feature vectors; the related information is the word vector;
constructing the classification feature vector according to the position features of the characters of the class feature words in the corpus to be classified; the related information is the position feature;
constructing the classification feature vector according to the context features of the category feature words in the corpus to be classified; the related information is the context feature;
constructing the classification feature vector according to the part-of-speech sequence features of the category feature words in the category feature word sequence; the related information is the part-of-speech sequence features;
constructing the classification feature vector according to the dependency syntactic relation features of the category feature words in the category feature word sequence; the related information is the dependency syntactic relation features.
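Two of the construction modes listed above (splicing word vectors, and appending character-position features) can be sketched together; the function name, vector dimensionality, and normalization are illustrative assumptions, not specified by the patent:

```python
def build_classification_vector(word_vectors, char_positions, corpus_length):
    """Concatenate the word vectors of the category feature words, then append
    each word's character position normalized by the corpus length."""
    vec = []
    for wv in word_vectors:
        vec.extend(wv)  # mode 1: splice word vectors
    for p in char_positions:
        vec.append(p / corpus_length)  # mode 2: normalized position feature
    return vec
```

The other modes (context, part-of-speech sequence, and dependency syntactic relation features) would append further components in the same way.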
In one implementation, the classification is an emotion classification, the first category feature words include attribute words, and the second category feature words include a combination of one or more of emotion words, degree adverbs, and negative words.
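For the emotion-classification case, the interplay of sentiment words, degree adverbs, and negation words around an attribute word can be illustrated with a toy polarity score. The word lists and weights below are invented for illustration and are not part of the patent:

```python
NEGATION = {"not", "never"}          # negation words (assumed)
DEGREE = {"very": 1.5, "slightly": 0.5}  # degree adverbs with weights (assumed)
SENTIMENT = {"clear": 1.0, "blurry": -1.0}  # sentiment words (assumed)

def phrase_polarity(words):
    """Accumulate sentiment for an attribute phrase: degree adverbs scale the
    next sentiment word, negation words flip its sign."""
    score, weight, flip = 0.0, 1.0, 1
    for w in words:
        if w in NEGATION:
            flip *= -1
        elif w in DEGREE:
            weight *= DEGREE[w]
        elif w in SENTIMENT:
            score += flip * weight * SENTIMENT[w]
            weight, flip = 1.0, 1  # reset modifiers after each sentiment word
    return score
```

For example, "screen very clear" scores higher than "screen clear", while "screen not clear" flips to a negative score.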
The embodiment of the application also provides a device for classification, which is described below with reference to the accompanying drawings. Referring to fig. 10, an apparatus 1000 for classification is provided in the embodiment of the present application. The apparatus 1000 may be a terminal device, and the terminal device may be any intelligent terminal such as a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a Point of Sales (POS) terminal, or a vehicle-mounted computer. The following takes the terminal device being a mobile phone as an example:
fig. 10 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 10, the cellular phone includes: radio Frequency (RF) circuit 1010, memory 1020, input unit 1030, display unit 1040, sensor 1050, audio circuit 1060, wireless fidelity (WiFi) module 1070, processor 1080, and power source 1090. Those skilled in the art will appreciate that the handset configuration shown in fig. 10 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 10:
The RF circuit 1010 may be used for receiving and transmitting signals during information transmission and reception or during a call; in particular, after receiving downlink information from a base station, the RF circuit 1010 forwards it to the processor 1080 for processing, and transmits uplink data to the base station. In general, the RF circuit 1010 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1010 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 1020 may be used to store software programs and modules, and the processor 1080 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1020. The memory 1020 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data (such as audio data or a phonebook) created according to the use of the mobile phone. Further, the memory 1020 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 1030 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1030 may include a touch panel 1031 and other input devices 1032. The touch panel 1031, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations performed on or near the touch panel 1031 using a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection devices according to a preset program. Optionally, the touch panel 1031 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 1080; it can also receive and execute commands sent by the processor 1080. In addition, the touch panel 1031 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 1030 may include other input devices 1032 in addition to the touch panel 1031. In particular, the other input devices 1032 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick.
The display unit 1040 may be used to display information input by a user or information provided to the user and various menus of the cellular phone. The Display unit 1040 may include a Display panel 1041, and optionally, the Display panel 1041 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 1031 can cover the display panel 1041, and when the touch panel 1031 detects a touch operation on or near the touch panel 1031, the touch operation is transmitted to the processor 1080 to determine the type of the touch event, and then the processor 1080 provides a corresponding visual output on the display panel 1041 according to the type of the touch event. Although in fig. 10, the touch panel 1031 and the display panel 1041 are two separate components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 1031 and the display panel 1041 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 1050, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 1041 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1041 and/or the backlight when the mobile phone moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
The audio circuit 1060, speaker 1061, and microphone 1062 may provide an audio interface between the user and the mobile phone. The audio circuit 1060 may transmit the electrical signal converted from received audio data to the speaker 1061, where it is converted into a sound signal and output; on the other hand, the microphone 1062 converts a collected sound signal into an electrical signal, which is received by the audio circuit 1060 and converted into audio data; the audio data is then output to the processor 1080 for processing and sent, for example, to another mobile phone via the RF circuit 1010, or output to the memory 1020 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1070, the mobile phone can help the user send and receive e-mail, browse web pages, access streaming media, and so on; it provides the user with wireless broadband internet access. Although fig. 10 shows the WiFi module 1070, it is understood that it is not an essential component of the mobile phone and may be omitted as needed within a scope that does not change the essence of the invention.
The processor 1080 is the control center of the mobile phone. It connects the various parts of the whole mobile phone using various interfaces and lines, and performs the various functions of the mobile phone and processes data by running or executing software programs and/or modules stored in the memory 1020 and calling data stored in the memory 1020, thereby monitoring the mobile phone as a whole. Optionally, the processor 1080 may include one or more processing units; preferably, the processor 1080 may integrate an application processor, which mainly handles the operating system, user interfaces, applications, and the like, and a modem processor, which mainly handles wireless communication. It is to be appreciated that the modem processor may also not be integrated into the processor 1080.
The mobile phone further includes a power source 1090 (e.g., a battery) for powering the various components. Preferably, the power source may be logically coupled to the processor 1080 via a power management system, so that charging, discharging, and power-consumption management are implemented through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In this embodiment, the processor 1080 included in the terminal device further has the following functions:
determining a category characteristic word sequence which embodies a classification association relation according to a part-of-speech sequence corresponding to a corpus to be classified, wherein the corpus to be classified comprises a plurality of first category characteristic words which represent objects to be classified and second category characteristic words which are related to a classification requirement, and the classification association relation is used for embodying the association relation between the second category characteristic words and the first category characteristic words;
constructing a classification feature vector according to the class feature word sequence, wherein the classification feature vector embodies the characteristics of the corpora of different classes;
and classifying the corpus to be classified through a classification model according to the classification feature vector, wherein the classification model is a non-deep learning model.
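The three steps above (sequence determination, vector construction, non-deep-learning classification) end with a conventional classifier. The patent does not fix a particular model, so the nearest-centroid classifier below is only one illustrative choice of "non-deep learning model"; all names are hypothetical:

```python
class NearestCentroidClassifier:
    """A simple non-deep-learning model: store per-class mean vectors at fit
    time, predict the class whose centroid is nearest to the input vector."""

    def fit(self, vectors, labels):
        sums, counts = {}, {}
        for v, y in zip(vectors, labels):
            s = sums.setdefault(y, [0.0] * len(v))
            for i, x in enumerate(v):
                s[i] += x  # accumulate per-dimension sums for class y
            counts[y] = counts.get(y, 0) + 1
        # Centroid of each class = mean of its classification feature vectors.
        self.centroids = {y: [x / counts[y] for x in s] for y, s in sums.items()}
        return self

    def predict(self, v):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(v, c))
        return min(self.centroids, key=lambda y: sq_dist(self.centroids[y]))
```

Any comparable classifier (e.g., a support vector machine or gradient-boosted trees) trained on the classification feature vectors would fill the same role.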
Referring to fig. 11, fig. 11 is a structural diagram of a server 1100 provided in this embodiment. The server 1100 may vary considerably in configuration or performance, and may include one or more Central Processing Units (CPUs) 1122 (e.g., one or more processors), a memory 1132, and one or more storage media 1130 (e.g., one or more mass storage devices) storing an application program 1142 or data 1144. The memory 1132 and the storage medium 1130 may provide transient or persistent storage. The program stored on the storage medium 1130 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processing unit 1122 may be configured to communicate with the storage medium 1130 to execute, on the server 1100, the series of instruction operations in the storage medium 1130.
The server 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input/output interfaces 1158, and/or one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 11.
The embodiment of the present application further provides a computer-readable storage medium, which is used for storing a program code, where the program code is used for executing the classification method described in the foregoing embodiments.
The embodiments of the present application also provide a computer program product including instructions, which when run on a computer, cause the computer to execute the classification method described in the foregoing embodiments.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of a single item or plural items. For example, "at least one of a, b, or c" may represent: a; b; c; "a and b"; "a and c"; "b and c"; or "a and b and c", where a, b, and c may be singular or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A method of classification, the method comprising:
determining a category characteristic word sequence which embodies a classification association relation according to a part-of-speech sequence corresponding to a corpus to be classified, wherein the corpus to be classified comprises a plurality of first category characteristic words which represent objects to be classified and second category characteristic words which are related to a classification requirement, and the classification association relation is used for embodying the association relation between the second category characteristic words and the first category characteristic words;
constructing a classification feature vector according to the class feature word sequence, wherein the classification feature vector embodies the characteristics of the corpora of different classes;
and classifying the corpus to be classified through a classification model according to the classification feature vector, wherein the classification model is a non-deep learning model.
2. The method according to claim 1, wherein the determining a category characteristic word sequence which embodies a classification association relation according to a part-of-speech sequence corresponding to a corpus to be classified comprises:
determining the category characteristic word sequence according to a target category sequence rule; and the target class sequence rule identifies the classification association relation between the second class characteristic words and the first class characteristic words.
3. The method according to claim 2, wherein the target class sequence rule is derived from historical corpus mining.
4. The method of claim 3, wherein the mining manner of the target class sequence rule is:
determining a frequent sequence meeting a support degree threshold value from a plurality of part-of-speech sequences corresponding to the historical corpus;
and if the confidence of the frequent sequence meets a confidence threshold, determining the frequent sequence as a target class sequence rule.
5. The method of claim 4, wherein the confidence of the frequent sequence is the ratio of the number of category labels contained in the frequent sequence to a preset number of category labels.
6. The method according to claim 2, wherein if the corpus to be classified includes a plurality of corpora, the target class sequence rule is mined from the plurality of corpora to be classified.
7. The method according to claim 6, wherein if there is an unlabeled class feature word in the part-of-speech sequence, before determining the class feature word sequence according to the target class sequence rule, the method further comprises:
and determining the category to which the unlabeled category feature words in the part-of-speech sequence belong according to the category labels in the target class sequence rule, and mining all the category feature words corresponding to the part-of-speech sequence.
8. The method according to any one of claims 1 to 7, wherein said constructing a classification feature vector from said class feature word sequence comprises:
and constructing the classification feature vector according to the related information of the class feature words in the class feature word sequence.
9. The method according to claim 8, wherein the constructing the classification feature vector according to the related information of the class feature words in the class feature word sequence comprises one or more of the following ways:
splicing word vectors of the category feature words to construct the classification feature vector; the related information is the word vector;
constructing the classification feature vector according to the position features of the characters of the class feature words in the corpus to be classified; the related information is the position feature;
constructing the classification feature vector according to the context features of the category feature words in the corpus to be classified; the related information is the context feature;
constructing the classification feature vector according to the part-of-speech sequence features of the category feature words in the category feature word sequence; the related information is the part-of-speech sequence features;
constructing the classification feature vector according to the dependency syntactic relation features of the category feature words in the category feature word sequence; the related information is the dependency syntactic relation features.
10. The method according to any one of claims 1 to 7, wherein the classification is an emotion classification, the first category feature words comprise attribute words, and the second category feature words comprise a combination of one or more of emotion words, degree adverbs, and negative words.
11. A classification apparatus, characterized in that the apparatus comprises a determination unit, a construction unit and a classification unit:
the determining unit is used for determining a category feature word sequence which embodies a classification association relationship according to a part-of-speech sequence corresponding to a corpus to be classified, wherein the corpus to be classified comprises a plurality of first category feature words which represent objects to be classified and second category feature words which are related to a classification requirement, and the classification association relationship is used for embodying the association relationship between the second category feature words and the first category feature words;
the construction unit is used for constructing a classification feature vector according to the category feature word sequence, and the classification feature vector embodies the corpus characteristics of different categories;
and the classification unit is used for classifying the corpus to be classified through a classification model according to the classification feature vector, wherein the classification model is a non-deep learning model.
12. The apparatus of claim 11, wherein the determining unit is configured to:
determining the category characteristic word sequence according to a target category sequence rule; and the target class sequence rule identifies the classification association relation between the second class characteristic words and the first class characteristic words.
13. The apparatus of claim 12, wherein the target class sequence rule is derived from historical corpus mining.
14. An apparatus, comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-10 according to instructions in the program code.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program code for performing the method of any of claims 1-10.
CN201911235058.3A 2019-12-05 2019-12-05 Classification method and related device Active CN111177371B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911235058.3A CN111177371B (en) 2019-12-05 2019-12-05 Classification method and related device


Publications (2)

Publication Number Publication Date
CN111177371A true CN111177371A (en) 2020-05-19
CN111177371B CN111177371B (en) 2023-03-21

Family

ID=70653826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911235058.3A Active CN111177371B (en) 2019-12-05 2019-12-05 Classification method and related device

Country Status (1)

Country Link
CN (1) CN111177371B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104471568A (en) * 2012-07-02 2015-03-25 微软公司 Learning-based processing of natural language questions
CN104516874A (en) * 2014-12-29 2015-04-15 北京牡丹电子集团有限责任公司数字电视技术中心 Method and system for parsing dependency of noun phrases
US20160357851A1 (en) * 2015-06-05 2016-12-08 Mr. Buzz, Inc. dba WeOtta Natural Language Search With Semantic Mapping And Classification
CN108763402A (en) * 2018-05-22 2018-11-06 广西师范大学 Class center vector Text Categorization Method based on dependence, part of speech and semantic dictionary
CN110019792A (en) * 2017-10-30 2019-07-16 阿里巴巴集团控股有限公司 File classification method and device and sorter model training method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chang Caoyu: "Research on sentiment classification techniques for Chinese microblogs based on machine learning" *
Li Weiqing; Wang Weijun: "Research on a method for constructing product feature lexicons based on large-scale review data" *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353303A (en) * 2020-05-25 2020-06-30 腾讯科技(深圳)有限公司 Word vector construction method and device, electronic equipment and storage medium
CN111353303B (en) * 2020-05-25 2020-08-25 腾讯科技(深圳)有限公司 Word vector construction method and device, electronic equipment and storage medium
CN111611801A (en) * 2020-06-02 2020-09-01 腾讯科技(深圳)有限公司 Method, device, server and storage medium for identifying text region attribute
CN111400432A (en) * 2020-06-04 2020-07-10 腾讯科技(深圳)有限公司 Event type information processing method, event type identification method and device
CN111400432B (en) * 2020-06-04 2020-09-25 腾讯科技(深圳)有限公司 Event type information processing method, event type identification method and device
CN111737476A (en) * 2020-08-05 2020-10-02 腾讯科技(深圳)有限公司 Text processing method and device, computer readable storage medium and electronic equipment
CN112148841A (en) * 2020-09-30 2020-12-29 北京金堤征信服务有限公司 Object classification and classification model construction method and device
CN112148841B (en) * 2020-09-30 2024-04-19 北京金堤征信服务有限公司 Object classification and classification model construction method and device
CN112445897A (en) * 2021-01-28 2021-03-05 京华信息科技股份有限公司 Method, system, device and storage medium for large-scale classification and labeling of text data
CN115171048A (en) * 2022-07-21 2022-10-11 北京天防安全科技有限公司 Asset classification method, system, terminal and storage medium based on image recognition

Also Published As

Publication number Publication date
CN111177371B (en) 2023-03-21

Similar Documents

Publication Publication Date Title
CN111177371B (en) Classification method and related device
CN109145303B (en) Named entity recognition method, device, medium and equipment
CN109241431B (en) Resource recommendation method and device
CN111553162B (en) Intention recognition method and related device
CN109033156B (en) Information processing method and device and terminal
CN110704661B (en) Image classification method and device
CN110069769B (en) Application label generation method and device and storage device
CN109543014B (en) Man-machine conversation method, device, terminal and server
CN111597804B (en) Method and related device for training entity recognition model
CN112214605A (en) Text classification method and related device
CN113821589A (en) Text label determination method and device, computer equipment and storage medium
CN110276010A (en) A kind of weight model training method and relevant apparatus
CN114328906A (en) Multistage category determination method, model training method and related device
CN108549681B (en) Data processing method and device, electronic equipment and computer readable storage medium
CN114328908A (en) Question and answer sentence quality inspection method and device and related products
CN112907255A (en) User analysis method and related device
CN112749252A (en) Text matching method based on artificial intelligence and related device
CN111553163A (en) Text relevance determining method and device, storage medium and electronic equipment
CN110929882A (en) Feature vector calculation method based on artificial intelligence and related device
CN112036135B (en) Text processing method and related device
CN111611369B (en) Interaction method and related device based on artificial intelligence
CN113821609A (en) Answer text acquisition method and device, computer equipment and storage medium
CN113569043A (en) Text category determination method and related device
CN115080840A (en) Content pushing method and device and storage medium
CN112328783A (en) Abstract determining method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant