CN112069312A - Text classification method based on entity recognition and electronic device - Google Patents

Text classification method based on entity recognition and electronic device

Info

Publication number
CN112069312A
CN112069312A CN202010806716.6A CN202010806716A CN112069312A CN 112069312 A CN112069312 A CN 112069312A CN 202010806716 A CN202010806716 A CN 202010806716A CN 112069312 A CN112069312 A CN 112069312A
Authority
CN
China
Prior art keywords
words
entity
emotion
positive
negative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010806716.6A
Other languages
Chinese (zh)
Other versions
CN112069312B (en
Inventor
王树鹏
孙立远
赵忠华
张磊
王博
王勇
付培国
王泽辰
王禄恒
万欣欣
李欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS, National Computer Network and Information Security Management Center filed Critical Institute of Information Engineering of CAS
Priority to CN202010806716.6A priority Critical patent/CN112069312B/en
Publication of CN112069312A publication Critical patent/CN112069312A/en
Application granted granted Critical
Publication of CN112069312B publication Critical patent/CN112069312B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a text classification method based on entity recognition, comprising the following steps: performing word segmentation on a text to be detected to obtain emotion words and entity words, and determining the emotion category of each entity word by means of an entity and emotion category labeled data set; splitting the text to be detected into sentences, and obtaining the emotion category of each sentence from the sentence components filled by the emotion words and the category-labeled entity words, together with the negation words and punctuation in that sentence; and obtaining the emotion category of the text to be detected from the emotion category of each sentence. The method determines a directional entity set through semi-supervised learning, combining co-training and active learning with emotion rules; it judges tendency by identifying entities of a specified direction and combining them with emotion words; and it generates an entity set of a specified category and combines it with the emotion rules to achieve a deeper analysis of the text.

Description

Text classification method based on entity recognition and electronic device
Technical Field
The invention relates to the field of natural language processing, in particular to a text classification method based on entity recognition and an electronic device.
Background
With the rapid development of the Internet, the volume and variety of text are growing rapidly, and massive amounts of text often need to be classified according to different requirements. For example, Chinese patent CN107491554B discloses a construction method and a construction device for a text classifier together with a text classification method, and Chinese patent CN105224695B discloses a method and a device for quantizing text features based on information entropy together with a method and a device for classifying text.
On social platforms, microblogs have strong influence and reach. The number of microblog users keeps growing; users build relationships on the platform, acquire information and produce a large amount of content. Microblog text contains content with obvious emotional coloring. Mining the viewpoints and tendencies of the users behind these texts makes it possible to identify fashion trends and hot topics, lets enterprises analyze consumers' purchasing tendencies for precision marketing, and lets governments respond promptly to shifts in online public opinion. Tendency analysis is therefore performed on microblog text, and on that basis the stance of each blog post is judged.
Stance judgment, which processes a given text and determines the emotional stance it expresses, is an important branch of natural language processing and has recently received much attention. Extracting emotion words is the most direct way to analyze the emotion expressed by a microblog. Microblog content is short, varied in form, strongly opinionated, colloquial in expression and generally lacking in context. The prior art analyzes microblogs with an emotion dictionary, but classifying emotion words only summarizes which emotion words a microblog contains. The information expressed by a microblog depends on more than its emotion words: even if two microblogs contain identical emotion words, their viewpoints and tendencies differ markedly when the emotion words point at different objects. As network language evolves, new words denoting entities keep emerging; users often refer to entities by brand abbreviations, homophones, slang and the like, and identifying such entities by manual labeling usually requires a great deal of work. How to perform tendency analysis on massive microblog data and thereby judge stance is therefore a problem that urgently needs solving.
Disclosure of Invention
In order to solve the above problems, the invention discloses a text classification method based on entity recognition and an electronic device. A deep learning method is combined with emotion rules: the stance of a blog post is judged on the basis of syntactic rules with correspondingly formulated emotion rules, and co-training is combined with active learning, so that the emotional stance a blog post expresses toward entities of a specified type can be judged accurately while relying on only a small amount of labeled data. This addresses the microblog tendency analysis problem; the emotional stance of a blog post toward a specific type of entity serves as auxiliary information that facilitates applications of blog-post tendency analysis such as fashion trend judgment, precision marketing and public opinion monitoring.
In order to achieve the purpose, the technical scheme of the invention is as follows:
A text classification method based on entity recognition comprises the following steps:
1) performing word segmentation on a text to be detected to obtain emotion words and entity words, and determining the emotion category of each entity word by means of an entity and emotion category labeled data set;
2) splitting the text to be detected into sentences, and obtaining the emotion category of each sentence from the part of speech (sentence component) of the emotion words and of the category-labeled entity words, the negation words and the punctuation in that sentence;
3) obtaining the emotion category of the text to be detected from the emotion category of each sentence.
Further, the text to be detected is preprocessed before the emotion words and entity words are extracted; the preprocessing comprises converting traditional Chinese characters to simplified characters and removing stop words.
Further, the method for obtaining the stop words comprises the jieba Chinese word segmentation method.
Further, the emotion words are obtained from the DUTIR emotion vocabulary ontology (emotion dictionary) of Dalian University of Technology.
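The preprocessing and emotion-word lookup described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the character map `T2S`, the stop-word set and the two-entry lexicon are hypothetical stand-ins for a full traditional-to-simplified converter, a real stop-word list and the DUTIR dictionary, and the input is assumed to be already segmented into words.

```python
# Illustrative resources only; none of these tables come from the patent.
T2S = str.maketrans({"拋": "抛", "棄": "弃", "蘋": "苹", "機": "机"})
STOP_WORDS = {"了", "的"}
EMOTION_LEXICON = {"抛弃": "derogatory", "喜欢": "commendatory"}


def preprocess(tokens):
    """Convert each token to simplified characters, then drop stop words."""
    simplified = [token.translate(T2S) for token in tokens]
    return [token for token in simplified if token not in STOP_WORDS]


def find_emotion_words(tokens):
    """Return (word, polarity) pairs for the tokens found in the lexicon."""
    return [(t, EMOTION_LEXICON[t]) for t in tokens if t in EMOTION_LEXICON]


tokens = preprocess(["去年", "拋棄", "了", "蘋果", "手機"])
print(tokens)
print(find_emotion_words(tokens))
```

A text whose token list yields no emotion words would simply be skipped by the later steps.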
Further, the entity and emotion category labeled data set is obtained through the following steps:
1) acquiring a plurality of sample texts labeled with emotion categories, obtaining the entity words of each labeled sample text, and labeling each entity word according to the emotion category of its sample text, to obtain a first entity and emotion category labeled data set;
2) performing word segmentation on each labeled sample text, and inputting the first segmentation result into a cascaded hidden Markov model and a conditional random field entity recognition learning model respectively for training, to obtain a cascaded hidden Markov entity classification model and a conditional random field entity classification model;
3) collecting a plurality of sample texts without emotion category labels, performing word segmentation on each unlabeled sample text, and inputting the second segmentation result into the cascaded hidden Markov entity classification model and the conditional random field entity classification model respectively, to obtain a first emotion-category-labeled entity word set and a second emotion-category-labeled entity word set;
4) where an entity word has the same emotion category label in the first entity word set and in the second entity word set, obtaining a second entity and emotion category labeled data set; where an entity word in the same unlabeled sample text has different emotion categories in the two sets, having a domain expert judge the emotion category, to obtain a third entity and emotion category labeled data set; for entity words that differ between the first entity word set and the second entity word set, having a domain expert label the entity words manually and judge the emotion category, to obtain a fourth entity and emotion category labeled data set;
5) merging the first, second, third and fourth entity and emotion category labeled data sets to obtain the entity and emotion category labeled data set.
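Steps 4) and 5) amount to routing each unlabeled sample according to whether the two models agree, then merging the resulting data sets. The sketch below assumes each model's output for a sample is a pair `(entities, category)`; the function names, the route labels and the per-sample granularity are illustrative choices, not taken from the patent.

```python
def route_sample(result_a, result_b):
    """Decide how a sample is labeled, given the two models' outputs.

    Each result is a pair (entities, category), where entities is a
    collection of entity words and category is an emotion category string.
    """
    entities_a, category_a = result_a
    entities_b, category_b = result_b
    if set(entities_a) != set(entities_b):
        # Entities differ (or one model found none): manual entity labeling.
        return "expert_entity_labeling"
    if category_a != category_b:
        # Same entities, different categories: expert judges the category.
        return "expert_category_review"
    # Full agreement: accept the label automatically.
    return "auto_label"


def merge_datasets(*datasets):
    """Merge dicts mapping sample text -> (entities, category)."""
    merged = {}
    for dataset in datasets:
        merged.update(dataset)
    return merged
```

Under this reading, `auto_label` samples feed the second data set, `expert_category_review` samples the third, and `expert_entity_labeling` samples the fourth.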
Further, the parts of speech (sentence components) include the subject, negation word, predicate, object and attributive.
Further, the emotion categories of a sentence comprise a positive category, a negative category and an undeterminable category; the emotion category of each sentence is obtained through the following strategies:
1) when the predicate is an emotion word and the subject or predicate is an entity word:
a) when the emotion word is a commendatory word and the entity word is a positive entity word, the sentence is of the positive category;
b) when the emotion word is a commendatory word and the entity word is a negative entity word, the sentence is of the negative category;
c) when the emotion word is a derogatory word and the entity word is a positive entity word, the sentence is of the negative category;
d) when the emotion word is a derogatory word and the entity word is a negative entity word, the sentence is of the positive category;
e) if a negation word or a question mark is present, the sentence category is reversed;
2) when the attributive is an emotion word and the subject or predicate is an entity word:
a) when the emotion word is a commendatory word and the entity word is a positive entity word, the sentence is of the positive category;
b) when the emotion word is a commendatory word and the entity word is a negative entity word, the sentence is of the negative category;
c) when the emotion word is a derogatory word and the entity word is a positive entity word, the sentence is of the negative category;
d) when the emotion word is a derogatory word and the entity word is a negative entity word, the sentence is of the positive category;
e) if a negation word or a question mark is present, the sentence category is reversed;
3) when the object is an emotion word and the subject or predicate is an entity word:
a) when the emotion word is a commendatory word and the entity word is a positive entity word, the sentence is of the positive category;
b) when the emotion word is a commendatory word and the entity word is a negative entity word, the sentence is of the negative category;
c) when the emotion word is a derogatory word and the entity word is a positive entity word, the sentence is of the negative category;
d) when the emotion word is a derogatory word and the entity word is a negative entity word, the sentence is of the positive category;
e) if a negation word or a question mark is present, the sentence category is reversed;
4) when the emotion word is only a verb and the object is an entity word:
a) when the emotion word is a commendatory word and the entity word is a positive entity word, the sentence is of the positive category;
b) when the emotion word is a commendatory word and the entity word is a negative entity word, the sentence is of the negative category;
c) when the emotion word is a derogatory word and the entity word is a positive entity word, the sentence is of the negative category;
d) when the emotion word is a derogatory word and the entity word is a negative entity word, the sentence is of the positive category;
e) if a negation word or a question mark is present, the sentence category is reversed;
5) when the emotion words and the entity words stand in any other relation, the tendency of the sentence cannot be judged.
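Because sub-rules a) to e) are identical under conditions 1) to 4), the sentence-level strategy reduces to a role check followed by a polarity comparison and an optional reversal. The following sketch encodes that reading; the `ParsedSentence` fields and role names are hypothetical, and it assumes the dependency parse has already been done. Reversing once when a negation word or a question mark (or both) is present is also an assumption, since the text does not say how the two combine.

```python
from dataclasses import dataclass

# Entity roles accepted for each emotion-word role, following rules 1) to 4).
RULES = {
    "predicate": {"subject", "predicate"},
    "attributive": {"subject", "predicate"},
    "object": {"subject", "predicate"},
    "verb_only": {"object"},
}


@dataclass
class ParsedSentence:
    emotion_role: str           # sentence component filled by the emotion word
    entity_role: str            # sentence component filled by the entity word
    emotion_commendatory: bool  # True: commendatory, False: derogatory
    entity_positive: bool       # True: positive entity, False: negative entity
    has_negation: bool = False
    is_question: bool = False


def classify_sentence(s):
    """Return 'positive', 'negative' or 'undetermined' for one sentence."""
    if s.entity_role not in RULES.get(s.emotion_role, set()):
        return "undetermined"   # rule 5): any other configuration
    # Rules a) to d): matching polarities are positive, mixed are negative.
    positive = s.emotion_commendatory == s.entity_positive
    # Rule e): a negation word or a question mark reverses the category.
    if s.has_negation or s.is_question:
        positive = not positive
    return "positive" if positive else "negative"
```

For example, a derogatory verb whose object is a positive entity is classified as negative, and the same sentence with a negation word is classified as positive.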
Further, the emotion category of the text to be detected is obtained through the following strategies:
1) if any sentence in the text to be detected is judged to be of the negative category, the text to be detected is of the negative category;
2) if one sentence in the text to be detected is judged to be of the positive category, and every other sentence is judged to be of the positive category or contains no identifiable entity, the text to be detected is of the positive category;
3) if one sentence in the text to be detected is a sentence whose tendency cannot be judged, and every other sentence is judged to be positive, contains no identifiable entity, or also cannot be judged, the tendency of the text to be detected cannot be judged.
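The three text-level strategies can be read as a priority order: any negative sentence wins, then an all-positive (or entity-free) text, and otherwise no judgment. A minimal sketch under that reading follows; the label strings, including `"no_entity"` for a sentence in which no entity was identified, are illustrative names, and treating a text with no positive sentence at all as undetermined is an assumption.

```python
def classify_text(sentence_labels):
    """Aggregate per-sentence labels into a text-level category.

    Each label is 'positive', 'negative', 'undetermined' or 'no_entity'.
    """
    labels = list(sentence_labels)
    # Strategy 1): a single negative sentence makes the text negative.
    if "negative" in labels:
        return "negative"
    # Strategy 2): at least one positive sentence, the rest positive
    # or without an identifiable entity.
    if "positive" in labels and all(l in ("positive", "no_entity") for l in labels):
        return "positive"
    # Strategy 3) and anything not covered above (assumption).
    return "undetermined"
```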
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above-mentioned method when executed.
An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method as described above.
Compared with the prior art, the invention has the advantages that:
(1) a directional entity set is determined through semi-supervised learning, combining co-training and active learning with emotion rules;
(2) an emotion rule based on analysis of the main sentence components is provided, and the tendency of a blog post is judged by identifying entities of a specified direction and combining them with emotion words. This includes extracting the main components of sentences, eliminating noise information, and normalizing colloquial blog posts into a specified format; on the processed text, tendency analysis is performed on blog posts containing entities of the specified direction, and the stance of the blog post is judged from the positive or negative polarity of the entity, the commendatory or derogatory sense of the emotion word, the sentence component the emotion word fills, and other information;
(3) an entity set of a specified category is generated, and the stance of the blog post is judged in combination with the emotion rules, achieving a deeper analysis of the blog post.
Drawings
FIG. 1 is a flow diagram of a text classification method of the present invention.
FIG. 2 is a flowchart of emotion object entity set extraction.
FIG. 3 is a schematic diagram of emotion object entity set extraction.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clearly understood, the microblog tendency analysis method based on emotion object recognition and emotion rules, and its steps, are described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, the text classification method based on entity recognition has two aspects. On the one hand, a semi-supervised learning method is adopted: an entity recognition model is trained through co-training and active learning, and entities of a specified type contained in a microblog are extracted by combining learning with emotion rules, so as to determine a directional entity set. On the other hand, an emotion rule based on analysis of the main sentence components is constructed: the main components of sentences are extracted, noise information is eliminated, and colloquial text is normalized into a specified format. The tendency of a blog post is then judged from the positive or negative polarity of the directional entity, the commendatory or derogatory sense of the emotion word, and the sentence component the emotion word fills, thereby realizing stance judgment.
According to the first aspect of the invention, a semi-supervised learning method is adopted: a small labeled microblog text data set is used as initial input, and two different entity recognition models are trained cooperatively. To compare the training effect of the models during co-training, two classifiers are constructed on the basis of the two entity recognition models; each classifier comprises an entity set, an emotion dictionary and emotion rules. During co-training, a certain amount of data is first drawn from a data set that has received neither entity labeling nor stance judgment, and entity recognition is performed on this unlabeled corpus with the two trained entity recognition models. The two models extract entities of the specified direction by judging the types of the entities in the microblog text, and each forms its own entity set. Two classifiers are then built on the two entity sets; combined with the emotion dictionary and the emotion rules, each classifier judges the stance of the microblog, so that tendency analysis is performed twice on the same blog post. The tendency analysis results of the two classifiers are compared to judge the confidence of each sample. Samples with high confidence (the outputs of the two classifiers are completely identical) are given the entity labels and tendency labels of the specified category and merged into the labeled blog-post data. For samples with low confidence, active learning is used: samples with large divergence (the two classifiers give different labeling results) are selected and, after expert labeling, added to the labeled data set.
The updated and expanded labeled data set is input into the co-training model again, the two entity recognition deep learning models are retrained, and the process is iterated until the labeled blog-post data set reaches a sufficient scale, yielding the largest possible entity set of the specified direction.
According to the second aspect of the invention, for microblog data whose stance is to be judged, whether the microblog relates to an entity of the specified direction is determined on the basis of the learned entity set of the specified direction. If the microblog text contains any entity in that set, tendency analysis is performed on the microblog according to the previously written emotion rules and the emotion dictionary, thereby judging the stance of the microblog text. The method thus provides an emotional Object recognition and emotion rule-based microblog tendency Analysis (OASOSR) algorithm, which not only determines the emotion words in a blog post but also analyzes whether the entity the emotion words point at is a target entity.
FIG. 2 shows a flowchart of emotion object entity set extraction. The method adopts semi-supervised learning, trains two entity recognition learning models through co-training and active learning, extracts the entities of a specified type contained in microblogs by combining learning with emotion rules, and determines a directional entity set. Emotion rules based on analysis of the main sentence components are constructed, the main components of sentences are extracted, noise information is eliminated, and colloquial text is normalized into a specified format. The tendency of a blog post is judged from the positive or negative polarity of the directional entity, the commendatory or derogatory sense of the emotion word, and the sentence component the emotion word fills, thereby realizing stance judgment.
FIG. 3 shows a schematic diagram of emotion object entity set extraction. As shown in FIG. 3, the entity and emotion category labeled data set initially holds a small number of labeled microblog texts, each labeled with its entities (entity 1, entity 2, etc.) and with the emotion category of the blog post (positive/negative). The labeled microblog texts are segmented into words and input into a cascaded Hidden Markov Model (CHMM) (Finn R.D. et al. (2011) HMMER web server: interactive sequence similarity searching. Nucleic Acids Res., 39, W29-W37) and a Conditional Random Field (CRF) (Lafferty, J., McCallum, A., Pereira, F. (2001) "Conditional random fields: Probabilistic models for segmenting and labeling sequence data". Proc. 18th International Conf. on Machine Learning. Morgan Kaufmann. pp. 282-289) to train two preliminary entity recognition learning models. During co-training of the CHMM and CRF entity recognition learning models, part of the data (microblog texts a, b and c) is first selected from the data set without entity and emotion category labels and input into the preliminarily trained CHMM and CRF models; the entities of the specified direction in the blog posts are extracted, yielding entity set I and entity set II respectively. Dependency syntax analysis is then performed on the same unlabeled data on the basis of each entity set, and stance is judged with the given emotion dictionary and emotion rules to determine the emotion category of each microblog text. Each microblog thus has two labeling results, one based on entity set I and one based on entity set II; each result comprises the entities contained in the microblog (entity n) and the tendency of the microblog (positive/negative/cannot be judged).
The labeling results based on entity set I and entity set II are then compared. If, for the same microblog text, the labeled entities and the emotion categories obtained by the two classifiers are completely identical, the result is judged to be positive or negative category data with high confidence, and the microblog text with its entity and emotion category labels is added directly to the entity and emotion category labeled data set. If the entities or the emotion categories in the two results are not identical, the result is judged to be data with low confidence and is handled by active learning. For such data, if the entities of the specified direction extracted by the two classifiers for the same microblog text are completely identical but the emotion categories are opposite, the sample is considered a sample with large divergence, and the result must be submitted to a domain expert for emotion category judgment. If the entities of the specified direction extracted by the two classifiers differ, or one classifier extracts no entity of the specified direction, the sample is considered a sample with high uncertainty, and manual entity word labeling is required. The microblog texts obtained by active learning, with their entity and emotion category labels, are likewise added to the entity and emotion category labeled data set. This completes the first round of co-training.
The above process is iterated, continuously expanding the entity and emotion category labeled data set until the amount of labeled data reaches a manually set number. The entity set is then formed from all the labeled entities in the data set.
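The iteration described above can be outlined as a loop that retrains both models, labels a batch, falls back to an expert on disagreement, and stops at a target size. This is only a schematic sketch: `train`, `ask_expert`, `target_size` and `batch_size` are hypothetical names, the two models are passed in as plain callables rather than real CHMM and CRF implementations, and every disagreement is sent to a single expert callback for brevity.

```python
def co_training_loop(labeled, unlabeled, train, ask_expert,
                     target_size, batch_size=2):
    """Grow a labeled data set until it holds target_size samples.

    labeled:    dict mapping text -> (entities, category)
    unlabeled:  iterable of texts without labels
    train:      callable(labeled) -> (model_a, model_b); each model maps
                a text to (entities, category)
    ask_expert: callable(text, result_a, result_b) -> (entities, category)
    """
    labeled = dict(labeled)
    pool = list(unlabeled)
    while len(labeled) < target_size and pool:
        # Retrain both entity recognition models on the current labels.
        model_a, model_b = train(labeled)
        batch, pool = pool[:batch_size], pool[batch_size:]
        for text in batch:
            result_a, result_b = model_a(text), model_b(text)
            if result_a == result_b:
                labeled[text] = result_a      # high confidence
            else:
                labeled[text] = ask_expert(text, result_a, result_b)
    # The entity set is the union of all labeled entities.
    entity_set = set()
    for entities, _category in labeled.values():
        entity_set.update(entities)
    return labeled, entity_set
```

Because a whole batch is processed before the size check, the final data set may slightly exceed `target_size`.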
The OASOSR algorithm flow is given below. The algorithm comprises five steps: document preprocessing, condition judgment, entity information judgment, syntactic analysis and rule judgment. The output is the tendency category of the input blog post.
First, the original blog text is cleaned; after operations such as converting traditional Chinese characters to simplified characters, valid data is obtained, jieba word segmentation is applied, and stop words are removed from the segmentation result. Emotion words in the processed microblog text are then identified with the emotion dictionary. The method uses the DUTIR emotion vocabulary ontology (emotion dictionary) of Dalian University of Technology to decide whether a blog post contains emotion words; microblog texts containing emotion words are selected for further processing. Entity information is then judged on the microblog texts that contain emotion words, and microblogs containing an entity of the specified direction are selected. Entity information judgment relies on the extracted entity set: the entity words in the microblog text are extracted, and if an entity belongs to the entity set of the specified direction, the microblog is selected and dependency syntax analysis is performed on its sentences. By identifying the subject, negation word, predicate, object, attributive, punctuation and other components of a sentence, the sentence is assigned to one of several types using the emotion rules based on syntactic analysis: 1) the predicate is an emotion word and the subject or predicate is an entity word; 2) the attributive is an emotion word and the subject or predicate is an entity word; 3) the object is an emotion word and the subject or predicate is an entity word; 4) the emotion word is only a verb and the object is an entity word.
The specific classification rules for emotion words filling different sentence components are shown in the OASOSR algorithm. After the corresponding emotion rule has been applied, the tendency of the microblog text is output.
[Figures BDA0002629396930000071 and BDA0002629396930000081: OASOSR algorithm listing]
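The five OASOSR stages described above can be outlined as a short screening pipeline. The sketch below is an assumption-laden illustration rather than the algorithm in the figures: input is taken as pre-segmented sentences, and the syntactic analysis and rule judgment are delegated to two caller-supplied functions, `judge_sentence` and `judge_text`, which are hypothetical names.

```python
def oasosr(sentences, stop_words, emotion_lexicon, entity_set,
           judge_sentence, judge_text):
    """Return the tendency of a blog post, or None if it is screened out.

    sentences: list of token lists (one list per sentence).
    """
    # 1. Document preprocessing: remove stop words.
    cleaned = [[t for t in s if t not in stop_words] for s in sentences]
    words = [t for s in cleaned for t in s]
    # 2. Condition judgment: the post must contain an emotion word.
    if not any(t in emotion_lexicon for t in words):
        return None
    # 3. Entity information judgment: the post must mention a target entity.
    if not any(t in entity_set for t in words):
        return None
    # 4. Syntactic analysis and 5. rule judgment, both delegated.
    return judge_text([judge_sentence(s) for s in cleaned])
```

Posts without emotion words or without a target entity never reach the parser, which keeps the comparatively expensive dependency analysis for the posts that matter.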
An application scenario of an embodiment of the present invention: "Abandoned my Apple phone last year! [snow treasure] Apple really leaves me speechless [hum], it keeps crashing and in winter it often will not even power on [black line]..." To judge the tendency of this microblog text toward electronic products, a constructed entity set of electronic products is required. First, word segmentation is performed on the text. The emotion dictionary then shows that the blog post contains the emotion words "abandon" and "speechless". Next, the entity word "Apple phone" is found in the text, and it is an entity in the electronic product entity set. Dependency syntax analysis is then performed: in the sentence "last year / abandon / Apple phone", the emotion word is only a verb and the object is an entity word, which satisfies the condition of rule 4 of the emotion judgment rules; the emotion word is derogatory and the entity word is a positive entity word, so the sentence is judged to have a negative tendency, and the microblog is likewise judged to have a negative tendency.
The above embodiments merely illustrate implementations of the present invention; their description is specific and detailed, but it is not to be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. Therefore, the protection scope of the present patent is subject to the appended claims.

Claims (10)

1. A text classification method based on entity recognition, comprising the following steps:
1) segmenting a text to be detected into words to obtain emotion words and entity words, and determining the emotion category of each entity word by means of an entity and emotion category labeled data set;
2) splitting the text to be detected into sentences, and obtaining the emotion category of each sentence from the parts of speech of the emotion words and of the entity words labeled with emotion categories in that sentence, together with the negative words and punctuation in that sentence;
3) obtaining the emotion category of the text to be detected according to the emotion category of each sentence.
2. The method of claim 1, wherein the text to be detected is preprocessed before the emotion words and entity words are extracted from it; the preprocessing comprises: converting traditional Chinese characters to simplified Chinese characters and removing stop words.
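By way of a hedged example, the preprocessing named in claim 2 could look like the sketch below. The two-character conversion table and the stop-word set are hypothetical stand-ins chosen for the example; a practical system would use a full conversion table and a full stop-word list. The sketch also assumes the text has already been segmented into tokens.

```python
from typing import List

# Hypothetical two-entry traditional-to-simplified table (illustration only).
T2S = str.maketrans({"蘋": "苹", "機": "机"})

# Hypothetical stop-word set (illustration only).
STOP_WORDS = {"的", "了"}


def preprocess(tokens: List[str]) -> List[str]:
    """Convert traditional characters to simplified ones, then drop stop words."""
    simplified = [t.translate(T2S) for t in tokens]
    return [t for t in simplified if t not in STOP_WORDS]


cleaned = preprocess(["蘋果", "手機", "的"])
```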
3. The method of claim 2, wherein the method of obtaining stop words comprises a method of ending segmentation.
4. The method of claim 1, wherein the emotion words are obtained from the emotion vocabulary ontology library of Dalian University of Technology (the DUTIR emotion dictionary).
5. The method of claim 1, wherein the entity and emotion category labeled data set is obtained by the following steps:
1) collecting a plurality of sample texts labeled with emotion categories, obtaining the entity words of each labeled sample text, and labeling each entity word according to the emotion category of the sample text it comes from, to obtain a first entity and emotion category labeled data set;
2) segmenting each labeled sample text into words, and inputting the first segmentation result into a cascaded hidden Markov model and a conditional random field entity recognition learning model respectively for training, to obtain a cascaded hidden Markov entity classification model and a conditional random field entity classification model;
3) collecting a plurality of sample texts without emotion category labels, segmenting each unlabeled sample text into words, and inputting the second segmentation result into the cascaded hidden Markov entity classification model and the conditional random field entity classification model respectively, to obtain a first entity word set labeled with emotion categories and a second entity word set labeled with emotion categories;
4) for entity words whose emotion category labels are the same in the first entity word set and in the second entity word set, obtaining a second entity and emotion category labeled data set; for an entity word in the same unlabeled sample text whose emotion categories differ between the two sets, having a domain expert judge the emotion category, to obtain a third entity and emotion category labeled data set; for entity words that differ between the first entity word set and the second entity word set, having a domain expert manually label the entity words and judge their emotion categories, to obtain a fourth entity and emotion category labeled data set;
5) combining the first, second, third, and fourth entity and emotion category labeled data sets to obtain the entity and emotion category labeled data set.
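Step 4) of claim 5 can be pictured with the following minimal sketch, which compares the outputs of the two entity classification models and separates agreed labels from those that need expert review. The function name, the dictionary representation, and the sample labels are assumptions made for illustration. In particular, "entity words that differ between the sets" is interpreted here as entity words found by only one of the two models.

```python
from typing import Dict, Set, Tuple

Labels = Dict[str, str]  # entity word -> emotion category


def split_by_agreement(first: Labels, second: Labels) -> Tuple[Labels, Set[str], Set[str]]:
    """Split two model outputs into agreed labels and two expert-review queues."""
    # Same entity word, same category: accepted automatically (second data set).
    agreed = {w: c for w, c in first.items() if second.get(w) == c}
    # Same entity word, different category: expert judges category (third data set).
    conflicting = {w for w in first.keys() & second.keys() if first[w] != second[w]}
    # Entity word found by only one model: expert labels it (fourth data set).
    only_one = set(first.keys() ^ second.keys())
    return agreed, conflicting, only_one


# Hypothetical outputs of the two models on the same unlabeled texts.
hmm_out = {"apple mobile phone": "positive", "virus": "negative", "tablet": "positive"}
crf_out = {"apple mobile phone": "positive", "virus": "positive", "router": "negative"}
agreed, conflicting, only_one = split_by_agreement(hmm_out, crf_out)
```

Only the agreed labels enter the data set without review; the other two groups are queued for the domain expert, which matches the division of labor described in the claim.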
6. The method of claim 1, wherein the parts of speech include subjects, negatives, predicates, objects, and attributives.
7. The method of claim 1, wherein the emotion categories of a sentence include a positive category, a negative category, and an undeterminable category; the emotion category of each sentence is obtained through the following strategies:
1) when the predicate is an emotion word and the subject is an entity word:
a) when the emotion word is a positive word and the entity word is a positive entity word, the sentence is in the positive category;
b) when the emotion word is a positive word and the entity word is a negative entity word, the sentence is in the negative category;
c) when the emotion word is a derogatory word and the entity word is a positive entity word, the sentence is in the negative category;
d) when the emotion word is a derogatory word and the entity word is a negative entity word, the sentence is in the positive category;
e) if a negative word or a question mark exists, the sentence category is reversed;
2) when the attributive is an emotion word and the subject or the predicate is an entity word:
a) when the emotion word is a positive word and the entity word is a positive entity word, the sentence is in the positive category;
b) when the emotion word is a positive word and the entity word is a negative entity word, the sentence is in the negative category;
c) when the emotion word is a derogatory word and the entity word is a positive entity word, the sentence is in the negative category;
d) when the emotion word is a derogatory word and the entity word is a negative entity word, the sentence is in the positive category;
e) if a negative word or a question mark exists, the sentence category is reversed;
3) when the object is an emotion word and the subject or the predicate is an entity word:
a) when the emotion word is a positive word and the entity word is a positive entity word, the sentence is in the positive category;
b) when the emotion word is a positive word and the entity word is a negative entity word, the sentence is in the negative category;
c) when the emotion word is a derogatory word and the entity word is a positive entity word, the sentence is in the negative category;
d) when the emotion word is a derogatory word and the entity word is a negative entity word, the sentence is in the positive category;
e) if a negative word or a question mark exists, the sentence category is reversed;
4) when the emotion word is only a verb and the object is an entity word:
a) when the emotion word is a positive word and the entity word is a positive entity word, the sentence is in the positive category;
b) when the emotion word is a positive word and the entity word is a negative entity word, the sentence is in the negative category;
c) when the emotion word is a derogatory word and the entity word is a positive entity word, the sentence is in the negative category;
d) when the emotion word is a derogatory word and the entity word is a negative entity word, the sentence is in the positive category;
e) if a negative word or a question mark exists, the sentence category is reversed;
5) when the emotion word and the entity word are in any other relationship, the tendency of the sentence cannot be judged.
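Sub-rules a) to e) are identical under each of the four syntactic conditions of claim 7, so the decision can be sketched as a single function. The sketch below is illustrative only: the function name, parameter names, and string labels are assumptions, and the syntactic condition check (dependency parsing) is reduced to a boolean supplied by the caller. A negative word and a question mark appearing together are treated as a single reversal, which the claim does not specify.

```python
def sentence_category(
    emotion_polarity: str,          # "positive" or "derogatory"
    entity_polarity: str,           # "positive" or "negative"
    pattern_matched: bool,          # True if one of conditions 1) to 4) holds
    negation_or_question: bool = False,
) -> str:
    """Return "positive", "negative", or "undeterminable" for one sentence."""
    if not pattern_matched:
        return "undeterminable"     # strategy 5): any other relationship

    # Sub-rules a) to d): matching polarities give a positive sentence
    # (positive/positive or derogatory/negative); mixed polarities give negative.
    emotion_is_positive = emotion_polarity == "positive"
    entity_is_positive = entity_polarity == "positive"
    category = "positive" if emotion_is_positive == entity_is_positive else "negative"

    # Sub-rule e): a negative word or question mark reverses the category.
    if negation_or_question:
        category = "negative" if category == "positive" else "positive"
    return category


# Rule four on the scenario sentence: derogatory verb, positive entity as object.
result = sentence_category("derogatory", "positive", pattern_matched=True)
```

For the scenario sentence 'last year/abandon/apple mobile phone', `result` is the negative category, in agreement with the description.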
8. The method of claim 1, wherein the emotion category of the text to be detected is obtained through the following strategies:
1) if any sentence in the text to be detected is judged to be in the negative category, the text to be detected is in the negative category;
2) if one sentence in the text to be detected is judged to be in the positive category, and every other sentence is either judged to be positive or contains no identifiable entity, the text to be detected is in the positive category;
3) if one sentence in the text to be detected is a sentence whose tendency cannot be judged, and every other sentence is judged to be positive, contains no identifiable entity, or is also a sentence whose tendency cannot be judged, the tendency of the text to be detected cannot be judged.
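The three strategies of claim 8 amount to a priority order over the sentence-level results: negative first, then undeterminable, then positive. A minimal sketch under that reading follows. The label strings, including `"no_entity"` for a sentence in which no entity is identified, are hypothetical. The claim does not say what to return when no sentence contains an identifiable entity; this sketch assumes the result is undeterminable in that case.

```python
from typing import Iterable


def text_category(sentence_results: Iterable[str]) -> str:
    """Combine sentence-level categories into one text-level category.

    Each item is "positive", "negative", "undeterminable", or "no_entity".
    """
    results = list(sentence_results)
    if "negative" in results:        # strategy 1): any negative sentence wins
        return "negative"
    if "undeterminable" in results:  # strategy 3): no negative, some undeterminable
        return "undeterminable"
    if "positive" in results:        # strategy 2): only positive / no-entity left
        return "positive"
    return "undeterminable"          # assumption: nothing usable was found


overall = text_category(["negative", "no_entity", "positive"])
```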
9. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when run, perform the method of any of claims 1-8.
10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method according to any of claims 1-8.
CN202010806716.6A 2020-08-12 2020-08-12 Text classification method based on entity recognition and electronic device Active CN112069312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010806716.6A CN112069312B (en) 2020-08-12 2020-08-12 Text classification method based on entity recognition and electronic device

Publications (2)

Publication Number Publication Date
CN112069312A true CN112069312A (en) 2020-12-11
CN112069312B CN112069312B (en) 2023-06-20

Family

ID=73661289

Country Status (1)

Country Link
CN (1) CN112069312B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant