CN112069312A - Text classification method based on entity recognition and electronic device

Info

Publication number: CN112069312A
Application number: CN202010806716.6A
Authority: CN
Inventors: 王树鹏; 孙立远; 赵忠华; 张磊; 王博; 王勇; 付培国; 王泽辰; 王禄恒; 万欣欣; 李欣
Original assignee: Institute of Information Engineering of CAS; National Computer Network and Information Security Management Center
Current assignee: Institute of Information Engineering of CAS; National Computer Network and Information Security Management Center
Priority date: 2020-08-12
Filing date: 2020-08-12
Publication date: 2020-12-11
Anticipated expiration: 2040-08-12
Also published as: CN112069312B

Abstract

The invention provides a text classification method based on entity recognition, comprising the following steps: performing word segmentation on a text to be detected to obtain emotion words and entity words, and determining the emotion category of each entity word by means of an entity and emotion category labeled data set; splitting the text to be detected into sentences, and obtaining the emotion category of each sentence from the part of speech of the emotion words and of the entity words labeled with an emotion category in that sentence, together with any negation words and punctuation; and obtaining the emotion category of the text to be detected from the emotion categories of its sentences. The method determines a directional entity set by combining semi-supervised learning, co-training with active learning, and learning plus emotion rules; it judges tendency by identifying entities of a specified direction and combining them with emotion words; and it generates an entity set of a specified category and combines it with the emotion rules to achieve a deeper analysis of the text.

Description

Text classification method based on entity recognition and electronic device

Technical Field

The invention relates to the field of natural language processing, in particular to a text classification method based on entity recognition and an electronic device.

Background

With the rapid development of the internet, the volume and variety of text are growing rapidly, and massive amounts of text often need to be classified according to different requirements. For example, Chinese patent application CN107491554B discloses a method and a device for constructing a text classifier and a text classification method, and Chinese patent application CN105224695B discloses a method and a device for quantizing text features based on information entropy and a method and a device for classifying texts.

On social platforms, microblogs have strong influence and reach. The number of microblog users keeps increasing; users establish relationships on the platform, acquire information, and produce a large amount of content. Microblog text contains content with obvious emotional coloring. By mining the viewpoints and tendencies of the users behind these texts, popular trends and hot topics can be identified, enterprises can analyze the purchasing tendencies of consumers and carry out precise marketing, and the government can respond promptly to changes in public opinion among netizens. Tendency analysis is therefore performed on microblog texts in order to judge the stance of each blog post.

Stance judgment is an important branch of natural language processing that aims to process a given text and judge the emotional stance it expresses, and it has recently received much attention. Extracting emotion words is the most direct way to analyze the emotion expressed by a microblog. Microblog content is short, varied in form, strongly opinionated, colloquial in expression, and generally lacking in context. The prior art analyzes microblogs with the help of an emotion dictionary and summarizes a microblog simply by classifying the emotion words it contains. However, the information expressed by a microblog is not determined by its emotion words alone: even if two microblogs contain identical emotion words, their viewpoints and tendencies differ markedly when the emotion words point at different objects. Moreover, as online language develops, new words for referring to entities keep emerging; users often refer to entities by brand abbreviations, homophones, slang and the like, so that identifying entities relies on manual labeling and often requires a large amount of work. How to perform tendency analysis on massive microblog data so as to determine stance is therefore a problem that urgently needs to be solved.

Disclosure of Invention

To solve these problems, the invention discloses a text classification method based on entity recognition and an electronic device. A deep learning method is combined with emotion rules: the stance of a blog post is judged on the basis of syntactic rules for which corresponding emotion rules are specified, and co-training is combined with active learning, so that the emotional stance that a blog post expresses toward entities of a specified type can be judged accurately while relying on only a small amount of labeled data. This addresses the microblog tendency analysis problem; the emotional stance of a blog post toward a specific type of entity can then serve as auxiliary information for applications of blog tendency analysis such as judging popular trends, precise marketing and public opinion monitoring.

In order to achieve the purpose, the technical scheme of the invention is as follows:

a text classification method based on entity recognition comprises the following steps:

1) performing word segmentation on a text to be detected to obtain emotion words and entity words, and determining the emotion category of each entity word by means of an entity and emotion category labeled data set;

2) splitting the text to be detected into sentences, and obtaining the emotion category of each sentence from the part of speech of the emotion words and of the entity words labeled with an emotion category in that sentence, together with any negation words and punctuation;

3) obtaining the emotion category of the text to be detected from the emotion categories of its sentences.
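Step 2) relies on splitting the text into sentences while keeping each sentence's closing punctuation, since a question mark later affects the sentence category. The following is only a minimal sketch under stated assumptions: the helper name `split_sentences` and the set of sentence-ending marks are illustrative choices, not taken from the disclosure.

```python
import re
from typing import List

# Assumed set of sentence-ending marks (Chinese and ASCII); not specified in the source.
_SENTENCE_END = re.compile(r"([。！？!?；;…]+)")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the trailing punctuation of each one."""
    # With a capturing group, re.split alternates: body, delimiter, body, ...
    parts = _SENTENCE_END.split(text)
    sentences = []
    for i in range(0, len(parts), 2):
        body = parts[i].strip()
        end = parts[i + 1] if i + 1 < len(parts) else ""
        if body:
            sentences.append(body + end)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("苹果手机真好！你喜欢吗？"):
        print(sentence)
```

Keeping the delimiter attached to the sentence lets a later rule check for a question mark without re-scanning the original text.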

Further, the text to be detected is preprocessed before the emotion words and entity words are extracted from it; the preprocessing comprises converting traditional Chinese characters to simplified characters and removing stop words.

Further, the method for obtaining stop words comprises the jieba Chinese word segmentation method.
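The preprocessing described above can be sketched as follows. This is an illustration under assumptions, not the disclosed implementation: the conversion table `T2S` and the stop-word set are tiny hypothetical samples, and the segmenter is passed in as a parameter (in practice a word segmenter such as jieba's `jieba.lcut` could be supplied; here a plain whitespace split keeps the sketch self-contained).

```python
from typing import Callable, Iterable, List, Set

# Tiny illustrative traditional-to-simplified table (assumption);
# a real system would use a complete conversion table.
T2S = str.maketrans({"蘋": "苹", "機": "机", "語": "语", "沒": "没"})

# Hypothetical sample stop-word list.
STOP_WORDS: Set[str] = {"的", "了", "是"}


def preprocess(
    text: str,
    segment: Callable[[str], Iterable[str]],
    stop_words: Set[str] = STOP_WORDS,
) -> List[str]:
    """Convert to simplified characters, segment, then drop stop words."""
    simplified = text.translate(T2S)
    return [tok for tok in segment(simplified) if tok.strip() and tok not in stop_words]


if __name__ == "__main__":
    # Whitespace splitting stands in for a real Chinese word segmenter.
    print(preprocess("蘋果 手機 是 好 的", str.split))
```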

Further, the emotion words are obtained from the emotion vocabulary ontology (the DUTIR emotion dictionary) of Dalian University of Technology.

Further, obtaining the entity and emotion category labeled data set through the following steps:

1) collecting a plurality of sample texts labeled with an emotion category, obtaining the entity words of each labeled sample text, and labeling each entity word according to the emotion category of the sample text it comes from, to obtain a first entity and emotion category labeled data set;

2) performing word segmentation on each labeled sample text, and inputting the first segmentation result into a cascaded hidden Markov model and a conditional random field entity recognition learning model respectively for training, to obtain a cascaded hidden Markov entity classification model and a conditional random field entity classification model;

3) collecting a plurality of sample texts not labeled with an emotion category, performing word segmentation on each unlabeled sample text, and inputting the second segmentation result into the cascaded hidden Markov entity classification model and the conditional random field entity classification model respectively, to obtain a first entity word set labeled with emotion categories and a second entity word set labeled with emotion categories;

4) if an entity word has the same emotion category label in the first entity word set and in the second entity word set, adding it to a second entity and emotion category labeled data set; if an entity word from the same unlabeled sample text has different emotion categories in the two sets, having a domain expert judge its emotion category, to obtain a third entity and emotion category labeled data set; and, for entity words that appear in only one of the first and second entity word sets, having a domain expert manually label the entity words and judge their emotion categories, to obtain a fourth entity and emotion category labeled data set;

5) combining the first, second, third and fourth entity and emotion category labeled data sets to obtain the entity and emotion category labeled data set.
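The routing in step 4) can be illustrated with a small sketch. It assumes, purely for illustration, that each model's output for one unlabeled text is a dictionary mapping entity words to emotion categories; the function name `triage` and the example entities are hypothetical.

```python
from typing import Dict, List, Tuple


def triage(
    chmm: Dict[str, str], crf: Dict[str, str]
) -> Tuple[Dict[str, str], List[str], List[str]]:
    """Route the two models' entity labels for one unlabeled text.

    Returns (agreed, conflict, unmatched):
      agreed    - entity words both models label identically (added automatically)
      conflict  - entity words found by both models but with different emotion
                  categories (a domain expert judges the category)
      unmatched - entity words found by only one model (a domain expert labels
                  the entity word and judges its category)
    """
    shared = chmm.keys() & crf.keys()
    agreed = {e: chmm[e] for e in shared if chmm[e] == crf[e]}
    conflict = sorted(e for e in shared if chmm[e] != crf[e])
    unmatched = sorted(chmm.keys() ^ crf.keys())
    return agreed, conflict, unmatched


if __name__ == "__main__":
    chmm_out = {"apple phone": "negative", "brand x": "positive", "tablet": "positive"}
    crf_out = {"apple phone": "negative", "brand x": "negative", "laptop": "positive"}
    print(triage(chmm_out, crf_out))
```

Only the first group is merged without human effort; the other two groups are what the active learning step sends to the domain expert.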

Further, the parts of speech, which here denote sentence components, include subjects, negation words, predicates, objects and attributives.

Further, the emotion categories of a sentence comprise a positive category, a negative category and an undeterminable category; the emotion category of each sentence is obtained through the following strategies:

1) when the predicate is an emotion word and the subject is an entity word:

a) when the emotion word is commendatory and the entity word is a positive entity word, the sentence is in the positive category;

b) when the emotion word is commendatory and the entity word is a negative entity word, the sentence is in the negative category;

c) when the emotion word is derogatory and the entity word is a positive entity word, the sentence is in the negative category;

d) when the emotion word is derogatory and the entity word is a negative entity word, the sentence is in the positive category;

e) if a negation word or a question mark is present, the category of the sentence is reversed;

2) when the attributive is an emotion word and the subject or predicate is an entity word:

e) if a negation word or a question mark is present, the category of the sentence is reversed;

3) when the object is an emotion word and the subject or predicate is an entity word:

e) if a negation word or a question mark is present, the category of the sentence is reversed;

4) when the emotion word serves only as a verb and the object is an entity word:

e) if a negation word or a question mark is present, the category of the sentence is reversed;

5) when the emotion word and the entity word stand in any other relationship, the tendency of the sentence cannot be judged.
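The combination table a) to e) can be read as a sign product with an optional reversal. The sketch below is one possible reading under stated assumptions: polarities are encoded as +1 (commendatory word, positive entity) and -1 (derogatory word, negative entity); the same table is assumed to apply to every matched structure 1) to 4), as the worked example in the description suggests for rule four; and a single reversal is applied when a negation word, a question mark, or both are present. The function and parameter names are hypothetical.

```python
POSITIVE, NEGATIVE, UNDETERMINED = "positive", "negative", "undetermined"


def sentence_category(
    word_sign: int,
    entity_sign: int,
    structure_matched: bool = True,
    negated: bool = False,
    question: bool = False,
) -> str:
    """Combine emotion-word and entity polarity into a sentence category.

    word_sign / entity_sign: +1 or -1 (assumed encoding).
    structure_matched: True if the sentence fits one of structures 1) to 4).
    """
    if not structure_matched:
        return UNDETERMINED          # case 5): any other relationship
    sign = word_sign * entity_sign   # rows a) to d) of the table
    if negated or question:          # row e): reverse once (assumption)
        sign = -sign
    return POSITIVE if sign > 0 else NEGATIVE


if __name__ == "__main__":
    # Derogatory emotion word aimed at a positive entity word.
    print(sentence_category(word_sign=-1, entity_sign=+1))
```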

Further, obtaining the emotion type of the text to be detected through the following strategies:

1) if any sentence in the text to be detected is judged to be in the negative category, the text to be detected is in the negative category;

2) if one sentence in the text to be detected is judged to be in the positive category, and every other sentence is either judged to be positive or contains no identifiable entity, the text to be detected is in the positive category;

3) if one sentence in the text to be detected is a sentence whose tendency cannot be judged, and every other sentence is judged to be positive, contains no identifiable entity, or is likewise a sentence whose tendency cannot be judged, the tendency of the text to be detected cannot be judged.
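The three text-level strategies amount to a priority order over the sentence results. A minimal sketch, assuming four hypothetical sentence labels (`"positive"`, `"negative"`, `"undetermined"`, `"no_entity"`); the case in which no sentence contains an identifiable entity is not covered by the strategies above and is treated here as undetermined, which is an assumption.

```python
from typing import Iterable


def text_category(sentence_labels: Iterable[str]) -> str:
    """Aggregate per-sentence categories into a category for the whole text."""
    labels = list(sentence_labels)
    if "negative" in labels:        # strategy 1): any negative sentence decides
        return "negative"
    if "undetermined" in labels:    # strategy 3): otherwise any unjudgeable sentence
        return "undetermined"
    if "positive" in labels:        # strategy 2): only positive / no-entity sentences remain
        return "positive"
    return "undetermined"           # assumption: nothing to judge


if __name__ == "__main__":
    print(text_category(["positive", "no_entity", "negative"]))
```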

A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above-mentioned method when executed.

An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method as described above.

Compared with the prior art, the invention has the advantages that:

(1) a directional entity set is determined by combining semi-supervised learning, co-training with active learning, and learning plus emotion rules;

(2) an emotion rule based on the analysis of principal sentence components is provided, and the tendency of a blog post is judged by identifying entities of a specified direction and combining them with emotion words. This includes extracting the principal components of sentences, eliminating noise information, and normalizing colloquial blog posts into a specified format; on the basis of the processed text, tendency analysis is performed on blog posts containing entities of the specified direction, and the stance of a blog post is judged using the positive or negative polarity of the entity, the commendatory or derogatory sense of the emotion word, the sentence component that the emotion word occupies, and other information;

(3) an entity set of a specified category is generated, and the stance of a blog post is judged in combination with the emotion rules, achieving a deeper analysis of the blog post.

Drawings

FIG. 1 is a flow diagram of a text classification method of the present invention.

FIG. 2 is a flowchart of emotion object entity set extraction.

FIG. 3 is a schematic diagram of emotion object entity set extraction.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more clearly understood, the following describes in detail a microblog tendency analysis method and steps based on emotional object recognition and emotional rules according to the present invention with reference to the accompanying drawings.

As shown in FIG. 1, the text classification method based on entity recognition has two aspects. On the one hand, a semi-supervised learning method is adopted: an entity recognition model is trained through co-training and active learning, and entities of a specified type contained in microblogs are extracted by combining learning with emotion rules, so as to determine a directional entity set. On the other hand, an emotion rule based on the analysis of principal sentence components is constructed: the principal components of sentences are extracted, noise information is eliminated, and colloquial blog posts are normalized into a specified format. The tendency of a blog post is then judged from the positive or negative polarity of the directional entity, the commendatory or derogatory sense of the emotion word, and the sentence component that the emotion word occupies, thereby realizing stance judgment.

According to the first aspect of the invention, a semi-supervised learning method is adopted in which a small labeled microblog text data set serves as the initial input and two different entity recognition models are co-trained. To compare the training effect of the models during co-training, two classifiers are constructed on the basis of the two entity recognition models, each comprising an entity set, an emotion dictionary and emotion rules. During co-training, a certain amount of data is first drawn from the data set that has not been labeled with entities or stance, and entity recognition is performed on this unlabeled corpus using the two trained entity recognition models. The two models each extract entities of the specified direction by judging the types of the entities in the microblog texts, and each forms an entity set. Two classifiers are then constructed on the basis of the two entity sets, and each classifier judges the microblog stance in combination with the emotion dictionary and the emotion rules, so that tendency analysis is performed twice on the same blog post. The tendency analysis results of the two classifiers are compared to judge the confidence of each sample. Samples with high confidence (that is, samples for which the outputs of the two classifiers are exactly the same) are given entity labels and tendency labels of the specified category and merged into the labeled blog data. For samples with low confidence, active learning is adopted: samples on which the classifiers diverge (that is, the two classifiers give different labeling results) are selected and added to the labeled data set.

The updated and expanded labeled data set is input into the co-training model again, the two entity recognition deep learning models are retrained, and the process is iterated until the labeled blog data set reaches a sufficient scale, thereby obtaining the largest possible entity set of the specified direction.

According to the second aspect of the invention, for microblog data whose stance is to be judged, whether a microblog relates to an entity of the specified direction is determined on the basis of the learned entity set of the specified direction. If the microblog text contains any entity in that entity set, tendency analysis is performed on the microblog according to the previously written emotion rules and the emotion dictionary, thereby judging the stance of the microblog text. The invention accordingly provides a microblog tendency analysis algorithm based on emotional object recognition and emotion rules (OASOSR), which not only identifies the emotion words in a blog post but also analyzes whether the entity at which the emotion words point is a target entity.

FIG. 2 shows a flowchart of emotion object entity set extraction. The method adopts semi-supervised learning: two entity recognition learning models are trained through co-training and active learning, and entities of a specified type contained in microblogs are extracted by combining learning with emotion rules, so as to determine a directional entity set. An emotion rule based on the analysis of principal sentence components is constructed: the principal components of sentences are extracted, noise information is eliminated, and colloquial blog posts are normalized into a specified format. The tendency of a blog post is then judged from the positive or negative polarity of the directional entity, the commendatory or derogatory sense of the emotion word, and the sentence component that the emotion word occupies, thereby realizing stance judgment.

FIG. 3 shows a schematic diagram of emotion object entity set extraction. As shown in FIG. 3, the entity and emotion category labeled data set initially contains a small number of labeled microblog texts, each labeled with its entities (entity 1, entity 2, etc.) and with the emotion category (positive/negative) of the blog post. The labeled microblog texts are segmented into words and input into a cascaded hidden Markov model (CHMM) (Finn R.D. et al. (2011) HMMER web server: interactive sequence similarity searching. Nucleic Acids Res., 39, W29-W37) and a conditional random field (CRF) (Lafferty, J., McCallum, A., Pereira, F. (2001) "Conditional random fields: Probabilistic models for segmenting and labeling sequence data". Proc. 18th International Conf. on Machine Learning. Morgan Kaufmann. pp. 282-289), giving two preliminarily trained entity recognition learning models. During co-training of the CHMM and CRF entity recognition learning models, part of the data (microblog texts a, b and c) is first selected from the data set not yet labeled with entities and emotion categories and input into the preliminarily trained CHMM and CRF models; the entities of the specified direction in the blog posts are extracted, giving entity set I and entity set II respectively. Dependency syntax analysis is then performed on the same unlabeled data on the basis of each of the two entity sets, and stance judgment is carried out with the given emotion dictionary and emotion rules to determine the emotion category of each microblog text. Each microblog thus has two labeling results, one based on entity set I and one based on entity set II, each comprising the entity (entity n) contained in the microblog and the tendency of the microblog (positive/negative/cannot be judged).

The labeling results based on entity set I and entity set II are then compared. If, for the same microblog text, the labeled entities and emotion categories in the results of the two classifiers are exactly the same, the result is regarded as positive/negative category data with higher confidence, and the microblog text with its entity and emotion category labels is added directly to the entity and emotion category labeled data set. If the labeled entities or emotion categories in the two results are not identical, the result is regarded as positive/negative category data with lower confidence and is handled by active learning. For such data, if the two classifiers extract exactly the same entities of the specified direction from the same microblog text but assign opposite emotion categories, the sample is considered to be one with large divergence and is submitted to a domain expert for emotion category judgment. If the entities of the specified direction extracted by the two classifiers differ, or one classifier extracts no entity of the specified direction, the sample is considered to be one with high uncertainty and requires manual entity word labeling. The microblog texts obtained through active learning, with their entity and emotion category labels, are likewise added to the entity and emotion category labeled data set. This completes the first round of co-training.

This process is iterated, continuously expanding the entity and emotion category labeled data set until the amount of labeled data reaches a manually set number. The entity set consisting of all labeled entities in the data set is then obtained.
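The iteration described above can be outlined as a loop that retrains, labels a batch, and stops at a preset size. The sketch below is only a skeleton under assumptions: `fit` stands in for training the two entity recognition models and returns two prediction functions, `ask_expert` stands in for the domain expert consulted during active learning, and all names and the batch mechanism are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

Label = Hashable  # e.g. (entities, tendency); the concrete shape is an assumption
Predictor = Callable[[str], Label]


def expand_labeled_set(
    labeled: Dict[str, Label],
    unlabeled: Iterable[str],
    target_size: int,
    batch_size: int,
    fit: Callable[[Dict[str, Label]], Tuple[Predictor, Predictor]],
    ask_expert: Callable[[str, Label, Label], Label],
) -> Dict[str, Label]:
    """Grow the labeled set round by round until it reaches target_size."""
    labeled = dict(labeled)
    pool = list(unlabeled)
    while len(labeled) < target_size and pool:
        predict_a, predict_b = fit(labeled)  # retrain both models on current labels
        batch, pool = pool[:batch_size], pool[batch_size:]
        for text in batch:
            a, b = predict_a(text), predict_b(text)
            # Identical outputs are accepted as high confidence;
            # any disagreement is resolved by the expert.
            labeled[text] = a if a == b else ask_expert(text, a, b)
    return labeled
```

Passing the models and the expert in as callables keeps the loop independent of any particular entity recognition library.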

The flow of the OASOSR algorithm is given below. The algorithm comprises five steps: document preprocessing, condition judgment, entity information judgment, syntactic analysis and rule judgment. Its output is the tendency category of the input blog post.

First, the original blog text is cleaned: after operations such as converting traditional Chinese characters to simplified characters, valid data is obtained, the text is segmented with the jieba word segmenter, and stop words are removed from the segmentation result. Emotion words are then identified in the processed microblog text on the basis of the emotion dictionary. The method uses the emotion vocabulary ontology (the DUTIR emotion dictionary) of Dalian University of Technology to judge whether a blog post contains emotion words, and microblog texts containing emotion words are screened out for further processing. Entity information is then judged for the microblog texts that contain emotion words, and the microblogs that contain an entity of the specified direction are screened out; this judgment is made on the basis of the extracted entity set. The entity words in the microblog text are extracted, and if an entity word is present in the entity set of the specified direction, the microblog is retained and dependency syntax analysis is performed on its sentences. By identifying the subject, negation words, predicate, object, attributive, punctuation and other components of each sentence, the emotion rules based on syntactic analysis divide sentences into different types: 1) the predicate is an emotion word and the subject or predicate is an entity word; 2) the attributive is an emotion word and the subject or predicate is an entity word; 3) the object is an emotion word and the subject or predicate is an entity word; 4) the emotion word serves only as a verb and the object is an entity word.
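The two screening stages that precede syntactic analysis (condition judgment on emotion words, then entity information judgment) can be sketched as a simple gate. This is an illustration under assumptions: the lexicon and entity set are toy stand-ins for the emotion dictionary and the learned entity set, the tokens are taken from the segmented example sentence in the description, and the function name `screen` is hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def screen(
    tokens: List[str],
    emotion_lexicon: Dict[str, int],
    entity_set: Set[str],
) -> Optional[Tuple[List[str], List[str]]]:
    """Return (emotion_words, entity_words) if the post passes both filters, else None."""
    emotion_words = [t for t in tokens if t in emotion_lexicon]
    if not emotion_words:
        return None  # no emotion word: the post is not processed further
    entity_words = [t for t in tokens if t in entity_set]
    if not entity_words:
        return None  # no entity of the specified direction: the post is dropped
    return emotion_words, entity_words


if __name__ == "__main__":
    lexicon = {"abandon": -1}            # toy emotion dictionary, -1 = derogatory
    entities = {"apple mobile phone"}    # toy electronic-product entity set
    print(screen(["last year", "abandon", "apple mobile phone"], lexicon, entities))
```

Posts that pass the gate would then go on to dependency parsing and the emotion rules; posts that fail either filter receive no tendency label.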

When emotion words serve as different components in a sentence, the specific classification rule is shown in the OASOSR algorithm. And finally outputting the tendency of the microblog text after the judgment of the corresponding emotion rule.

An application scenario of an embodiment of the present invention is the microblog text: "I abandoned my Apple mobile phone last year! [snow and snow treasures] Apple really leaves me speechless [hum], it keeps freezing, and in winter it often will not even start [black line]...". To judge the tendency of this microblog text toward electronic products, the constructed entity set of electronic products is required. First, word segmentation is performed on the text; the emotion dictionary then shows that the emotion words 'abandon' and 'speechless' are present in the blog post; next, the entity word 'Apple mobile phone' is found in the text, and it is an entity in the electronic product entity set; dependency syntactic analysis is then performed. In the sentence 'last year/abandon/Apple mobile phone', the emotion word serves only as a verb and the object is an entity word, which satisfies the condition of rule four of the emotion judgment rules; the emotion word is derogatory and the entity word is a positive entity word, so the sentence is judged to have a negative tendency, and the microblog is likewise judged to have a negative tendency.

The above embodiments express only some implementations of the present invention; although they are described specifically, they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. The protection scope of the present patent is therefore subject to the appended claims.

Claims

1. A text classification method based on entity recognition comprises the following steps:

2. The method of claim 1, wherein the text to be detected is preprocessed before the emotion words and entity words are extracted from it; the preprocessing comprises converting traditional Chinese characters to simplified characters and removing stop words.

3. The method of claim 2, wherein the method of obtaining stop words comprises the jieba word segmentation method.

4. The method of claim 1, wherein the emotion words are obtained from the emotion vocabulary ontology (the DUTIR emotion dictionary) of Dalian University of Technology.

5. The method of claim 1, wherein the entity and emotion class labeled data sets are obtained by:

6. The method of claim 1, wherein the parts of speech, which here denote sentence components, include subjects, negation words, predicates, objects and attributives.

7. The method of claim 1, wherein the emotion categories of a sentence include a positive category, a negative category and an undeterminable category; the emotion category of each sentence is obtained through the following strategies:

1) when the predicate is an emotion word and the subject is an entity word:

e) if a negation word or a question mark is present, the category of the sentence is reversed;

3) when the object is an emotion word and the subject or predicate is an entity word:

e) if a negation word or a question mark is present, the category of the sentence is reversed;

4) when the emotion word serves only as a verb and the object is an entity word:

e) if a negation word or a question mark is present, the category of the sentence is reversed;

8. The method of claim 1, wherein the emotion classification of the text to be detected is obtained by the following strategies:

9. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when run, perform the method of any of claims 1-8.

10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method according to any of claims 1-8.