CN108090040B - Text information classification method and system - Google Patents


Info

Publication number
CN108090040B
Authority
CN
China
Prior art keywords
score
preset
text information
participle
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611044117.5A
Other languages
Chinese (zh)
Other versions
CN108090040A (en)
Inventor
郭秦龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201611044117.5A priority Critical patent/CN108090040B/en
Publication of CN108090040A publication Critical patent/CN108090040A/en
Application granted granted Critical
Publication of CN108090040B publication Critical patent/CN108090040B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The embodiment of the invention discloses a text information classification method and a text information classification system, which are used for improving the accuracy of text emotion classification. The method provided by the embodiment of the invention comprises the following steps: acquiring text information; acquiring a first word segmentation, wherein the first word segmentation is obtained by performing word segmentation processing on the text information according to a first preset rule; placing the first word segmentation into a preset emotion score counter for calculation to obtain a first score; acquiring a second word segmentation, wherein the second word segmentation is obtained by performing word segmentation processing on the text information according to a second preset rule; placing the second word segmentation into a preset training model for calculation to obtain a second score; when the language environment of the text information is determined according to a preset text rule, performing weight distribution on the first score and the second score by using preset comprehensive logic; and obtaining a comprehensive score of the text information according to the weights distributed by the preset comprehensive logic, and obtaining a classification result of the text information according to the comprehensive score.

Description

Text information classification method and system
Technical Field
The present invention relates to the field of text information classification, and in particular, to a text information classification method and system.
Background
Emotion classification is a typical problem in the field of Natural Language Processing (NLP): given a segment of text (which may be a sentence or an article), determine whether the emotion expressed by the text is positive, negative, or neutral.
The sentiment classification problem itself is a topic that has been studied widely and deeply in both academia and industry. Using an emotion dictionary is one approach to solving it: scores are manually assigned to certain emotional words, such as positive emotional words and negative emotional words, and for an input text the emotion class is determined by examining the proportion of positive and negative emotion words it contains.
The classification effect of this prior art depends heavily on the quality of the emotion dictionary. Problems arise if the dictionary quality is not good enough, for example when some words are classified incorrectly or when a word's emotional class is ambiguous. A word such as 'unexpected', when used in the household-appliance domain, generally indicates that the appliance has some problem, whereas in the movie domain it generally indicates that the plot is engaging.
The prior art uses a single emotion classification algorithm, cannot score flexibly according to the specific domain, and therefore has low emotion classification accuracy.
Disclosure of Invention
The embodiment of the invention provides a text information classification method and a text information classification system, which are used for improving the accuracy of text emotion classification.
A first aspect of an embodiment of the present invention provides a text information classification method, which specifically includes:
acquiring text information;
acquiring a first word segmentation, wherein the first word segmentation is obtained by performing word segmentation processing on the text information according to a first preset rule;
placing the first word segmentation into a preset emotion score counter for calculation to obtain a first score;
acquiring a second word segmentation, wherein the second word segmentation is obtained by performing word segmentation processing on the text information according to a second preset rule;
placing the second word segmentation into a preset training model for calculation to obtain a second score;
performing weight distribution on the first score and the second score by using preset comprehensive logic;
obtaining the comprehensive score of the text information according to the weight distributed by the preset comprehensive logic,
and obtaining a classification result of the text information according to the comprehensive score.
A second aspect of the embodiments of the present invention provides a text classification system, which specifically includes:
a first acquisition unit configured to acquire text information;
the second acquisition unit is used for acquiring a first word segmentation, wherein the first word segmentation is obtained by performing word segmentation processing, according to a first preset rule, on the text information acquired by the first acquisition unit;
the first embedding unit is used for placing the first word segmentation acquired by the second acquisition unit into a preset emotion score counter for calculation to obtain a first score;
the third obtaining unit is used for obtaining a second word segmentation, wherein the second word segmentation is obtained by performing word segmentation processing, according to a second preset rule, on the text information acquired by the first acquisition unit;
the second embedding unit is used for embedding the second segmentation into a preset training model to obtain a second score through calculation;
the first distribution unit is used for carrying out weight distribution on the first score and the second score by utilizing preset comprehensive logic;
the calculation unit is used for obtaining the comprehensive score of the text information according to the weight distributed by the comprehensive logic;
and the processing unit is used for obtaining the classification result of the text information according to the comprehensive score obtained by the calculating unit.
A third aspect of the embodiments of the present invention provides a terminal, which specifically includes:
an input device, an output device, a processor, and a memory;
the input device performs the following steps:
acquiring text information;
acquiring a first word segmentation, wherein the first word segmentation is obtained by performing word segmentation processing on the text information according to a first preset rule;
acquiring a second word segmentation, wherein the second word segmentation is obtained by performing word segmentation processing on the text information according to a second preset rule;
the processor is used for executing the following steps by calling the operation instruction stored in the memory:
placing the first word segmentation into a preset emotion score counter for calculation to obtain a first score;
placing the second word segmentation into a preset training model for calculation to obtain a second score;
performing weight distribution on the first score and the second score by using preset comprehensive logic;
obtaining the comprehensive score of the text information according to the weight distributed by the preset comprehensive logic,
and obtaining a classification result of the text information according to the comprehensive score.
According to the technical scheme, the embodiment of the invention has the following advantages:
in the embodiment of the invention, text information is first acquired; word segmentation processing is performed on the text to obtain a first word segmentation; the first word segmentation is placed into a preset emotion score counter for calculation to obtain a first score; word segmentation processing is performed on the text to obtain a second word segmentation; the second word segmentation is placed into a preset training model for calculation to obtain a second score; weight distribution is performed on the first score and the second score by using preset comprehensive logic, the comprehensive score of the text information is obtained according to the weights distributed by the preset comprehensive logic, and the classification result of the text information is obtained according to the comprehensive score. The embodiment of the invention uses a series of emotion classification methods and, in combination with the language environment, distributes weights to the scores obtained by the different algorithms, thereby improving the accuracy of text classification.
Drawings
FIG. 1 is a schematic diagram of a network architecture according to an embodiment of the present invention;
FIG. 2 is a diagram of an embodiment of a text information classification method according to an embodiment of the present invention;
FIG. 3 is a diagram of another embodiment of a text information classification method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an embodiment of a system in accordance with embodiments of the present invention;
FIG. 5 is a schematic diagram of another embodiment of the system in an embodiment of the invention;
fig. 6 is a schematic diagram of another embodiment of the system according to the embodiment of the invention.
Detailed Description
The embodiment of the invention provides a text information classification method and a text information classification system, which are used for improving the accuracy of text emotion classification.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiment of the present invention may be applied to a network architecture as shown in fig. 1, in which a user may use a user device (e.g., a personal computer, a notebook computer, a tablet computer, a mobile phone, etc.) to obtain a text to be classified from a storage device or the like. The text requiring emotion classification is then analyzed by a text classification system on the user device to obtain an analysis result.
In the embodiment of the invention, the text information to be classified is first acquired; a first score of the text information is then obtained by using an emotion dictionary algorithm, and a second score of the text information is obtained by using a machine-learning-based algorithm. When the language environment of the text information is determined according to a preset text rule, weight distribution is performed on the first score and the second score by using comprehensive logic, where the comprehensive logic is obtained according to the language environment, and finally the classification result of the text information is obtained according to the weights distributed by the comprehensive logic. The embodiment of the invention uses a series of emotion classification methods and, in combination with the language environment, distributes weights to the scores obtained by the different algorithms, thereby improving the accuracy of text classification.
Referring to fig. 2, an embodiment of a text information classification method according to an embodiment of the present invention includes:
201. and acquiring text information.
In this embodiment, before the text information needs to be classified, the text information needs to be acquired first.
It should be noted that the system may obtain the text information through the internet, or may obtain the text information from other ways, for example, from a storage device, and the specific obtaining manner is not limited here.
202. And acquiring a first word segmentation.
In this embodiment, when the system acquires text information that needs emotion analysis, the first word segmentation is obtained by performing word segmentation processing on the text information according to a first preset rule, where the first preset rule is a rule for dividing the text into segments by words and/or sentences, and the first word segmentation is the word segmentation set containing all sub-participles of the text information.
It should be noted that the first word segmentation includes both words and sentences.
203. And placing the first score into a preset emotion score counter to calculate to obtain a first score.
In this embodiment, the system stores an emotion score counter; after the system obtains the first word segmentation, the first word segmentation is placed into the preset emotion score counter for calculation, and the first score is obtained.
204. And acquiring a second word segmentation.
In this embodiment, after the system puts the first word segmentation into the preset emotion score counter and calculates the first score, the first word segmentation is screened according to a second preset rule. The second preset rule is: all first sub-participles in the first word segmentation are compared with the words stored in the preset emotion dictionary, the first sub-participles that are stored in the preset emotion dictionary are screened out and removed, and the set of remaining first sub-participles is used as the second word segmentation.
It should be noted that the second word segmentation includes both words and sentences.
205. And placing the second segmentation into a preset training model to calculate to obtain a second score.
In this embodiment, after the system obtains the second word segmentation, the second word segmentation is placed into the preset training model and the second score is obtained through calculation, where the preset training model stores the correspondence between preset score vectors and scores.
206. And performing weight distribution on the first score and the second score by using preset comprehensive logic.
In this embodiment, when the first score and the second score have been obtained, the system performs weight distribution on them by using the comprehensive logic, where the preset comprehensive logic is a rule set according to the language environment, and the language environment is determined from special words in the text.
207. And obtaining the comprehensive score of the text information according to the weight distributed by the preset comprehensive logic.
In this embodiment, after the system has used the comprehensive logic to assign weights to the first score and the second score, the comprehensive score of the text information is obtained according to the weights assigned by the preset comprehensive logic.
The comprehensive score = first score × first weight + second score × second weight, where the first weight is the weight assigned to the first score by the comprehensive logic and the second weight is the weight assigned to the second score by the comprehensive logic, and the sum of the weights is 1. Typically, the first score is weighted higher than the second score.
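As an illustration only, the Python sketch below shows this weighted combination; the default weights 0.7 and 0.3 are example values of the kind the comprehensive logic might assign, not values fixed by the invention.

```python
# Illustrative sketch of the composite-score formula above. The default
# weights are assumptions; in the invention they are chosen by the preset
# comprehensive logic according to the language environment and sum to 1.
def composite_score(first_score, second_score, first_weight=0.7, second_weight=0.3):
    assert abs(first_weight + second_weight - 1.0) < 1e-9, "weights must sum to 1"
    return first_score * first_weight + second_score * second_weight

print(round(composite_score(7, -2), 2))  # 7 * 0.7 + (-2) * 0.3 = 4.3
```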
208. And obtaining a classification result of the text information according to the comprehensive score.
In this embodiment, after the comprehensive score of the text information is obtained according to the weight assigned by the preset comprehensive logic, the classification result of the text information is obtained according to the comprehensive score.
In the embodiment of the invention, text information is first acquired; word segmentation processing is performed on the text to obtain a first word segmentation; the first word segmentation is placed into a preset emotion score counter for calculation to obtain a first score; word segmentation processing is performed on the text to obtain a second word segmentation; the second word segmentation is placed into a preset training model for calculation to obtain a second score. When the language environment of the text information is determined according to the preset text rule, weight distribution is performed on the first score and the second score by using the preset comprehensive logic, the comprehensive score of the text information is obtained according to the weights distributed by the preset comprehensive logic, and the classification result of the text information is obtained according to the comprehensive score. The embodiment of the invention uses a series of emotion classification methods and, in combination with the language environment, distributes weights to the scores obtained by the different algorithms, thereby improving the accuracy of text classification.
Referring to fig. 3, another embodiment of the text information classification method according to the embodiment of the present invention includes:
301. and acquiring text information.
In this embodiment, before the text information needs to be classified, the text information needs to be acquired first.
It should be noted that the system may obtain the text information through the internet, or may obtain the text information from other ways, for example, from a storage device, and the specific obtaining manner is not limited here.
302. And acquiring a first word segmentation.
In this embodiment, when the system acquires text information that needs emotion analysis, the first word segmentation is obtained by performing word segmentation processing on the text information according to a first preset rule, where the first preset rule is a rule for dividing the text into segments by words and/or sentences, and the first word segmentation is the word segmentation set containing all sub-participles of the text information.
It should be noted that the first word segmentation includes both words and sentences.
303. And placing the first score into a preset emotion score counter to calculate to obtain a first score.
In this embodiment, the system stores an emotion score counter, and the preset emotion dictionary is built into the emotion score counter. The preset emotion dictionary stores a score value uniquely corresponding to each of a large number of words and sentences. The system compares each first sub-participle in the acquired first word segmentation with the words and sentences stored in the preset emotion dictionary; when a word or sentence identical to one in the preset emotion dictionary is found, the emotion score counter adds the score uniquely corresponding to that word or sentence, and the scores corresponding to all matched sub-participles are added up to obtain the first score.
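A minimal Python sketch of this dictionary-lookup step is given below; the dictionary contents and score values are illustrative assumptions, not data defined by the patent.

```python
# Sketch of the emotion score counter: sum the dictionary scores of every
# first sub-participle that appears in the preset emotion dictionary.
# Dictionary contents and scores are illustrative assumptions.
PRESET_EMOTION_DICT = {"good": 3, "happy": 4, "boring": -2}

def first_score(first_participle, emotion_dict=PRESET_EMOTION_DICT):
    return sum(emotion_dict[word] for word in first_participle if word in emotion_dict)

# Matches scenario 1 below: 'good' (3) + 'happy' (4) = 7.
print(first_score(["exam", "good", "happy", "beyond description"]))  # 7
```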
304. And acquiring a second word segmentation.
In this embodiment, after the system puts the first word segmentation into the preset emotion score counter and calculates the first score, the first word segmentation is screened according to a second preset rule. The second preset rule is: all first sub-participles in the first word segmentation are compared with the words stored in the preset emotion dictionary, the first sub-participles that are stored in the preset emotion dictionary are screened out and removed, and the set of remaining first sub-participles is used as the second word segmentation.
It should be noted that the second word segmentation includes both words and sentences.
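The screening rule can be sketched as follows, reusing the illustrative dictionary from the previous sketch: sub-participles already covered by the emotion dictionary are removed, and the remainder forms the second word segmentation.

```python
# Sketch of the second preset rule: drop every first sub-participle that
# is stored in the preset emotion dictionary; what remains is the second
# word segmentation.
def second_participle(first_participle, emotion_dict):
    return [word for word in first_participle if word not in emotion_dict]

print(second_participle(["exam", "good", "happy", "beyond description"],
                        {"good": 3, "happy": 4, "boring": -2}))
# ['exam', 'beyond description']
```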
305. And placing the second segmentation into a preset training model to calculate to obtain a second score.
In this embodiment, after the system obtains the second word segmentation, each second sub-participle in the second word segmentation is converted into a numerical vector through the preset training model; the preset score vector closest to the numerical vector corresponding to the second sub-participle is then looked up in the preset score vectorization database of the preset training model, and the score corresponding to that closest preset score vector is used as the score of the second sub-participle. Finally, the scores corresponding to the second sub-participles are added to obtain the second score. The preset score vectorization database stores the correspondence between preset score vectors and scores.
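The following sketch illustrates this nearest-vector scoring; the embedding function and the score-vectorization database are stand-ins (assumptions) for the preset training model, which the patent does not specify in code.

```python
# Sketch of the preset-training-model step: embed each second
# sub-participle as a numerical vector, find the closest preset score
# vector, take its score, and sum over all sub-participles.
# The embedding and the score-vector database are illustrative stand-ins.
import numpy as np

PRESET_SCORE_VECTORS = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
PRESET_SCORES = np.array([1, 0, -2])  # score paired with each preset vector

def embed(word):
    # Placeholder: a real system would use a trained model (e.g. word
    # embeddings) to produce the numerical vector for the sub-participle.
    rng = np.random.default_rng(sum(ord(ch) for ch in word))
    return rng.normal(size=PRESET_SCORE_VECTORS.shape[1])

def second_score(second_participle):
    total = 0
    for word in second_participle:
        distances = np.linalg.norm(PRESET_SCORE_VECTORS - embed(word), axis=1)
        total += PRESET_SCORES[np.argmin(distances)]  # score of the closest vector
    return int(total)

print(second_score(["exam", "beyond description"]))
```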
306. And acquiring a third score of the text information by using an emotion classification method.
In this embodiment, the system supports extension: if a new suitable algorithm is found in the future as the business scenario evolves, the new algorithm can be added as an algorithm sub-module through the algorithm customization function, and the third score of the text information is then obtained by using that emotion classification method.
It should be noted that various emotion classification methods may be added later; this is not limited here.
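One way to realize such an extension point is a simple classifier registry, sketched below; the registry API and the example third classifier are assumptions for illustration, not part of the patented method.

```python
# Sketch of the algorithm-customization function: new emotion
# classification algorithms are registered as sub-modules and each one
# contributes its own score. The registry and the example classifier are
# illustrative assumptions.
EXTRA_CLASSIFIERS = {}

def register(name):
    def decorator(fn):
        EXTRA_CLASSIFIERS[name] = fn
        return fn
    return decorator

@register("exclamation_heuristic")  # hypothetical third algorithm
def third_score(text):
    return 0.5 * text.count("!")

def extra_scores(text):
    return {name: fn(text) for name, fn in EXTRA_CLASSIFIERS.items()}

print(extra_scores("What a great movie!"))  # {'exclamation_heuristic': 0.5}
```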
307. And carrying out weight distribution on the first score, the second score and the third score by utilizing preset comprehensive logic.
In this embodiment, after the first score, the second score, and the third score are obtained, the system performs weight distribution on the first score, the second score, and the third score by using the comprehensive logic, where the preset comprehensive logic is a rule set according to the language environment, and the language environment is determined from special words in the text.
308. And obtaining the comprehensive score of the text information according to the weight distributed by the preset comprehensive logic.
In this embodiment, after the comprehensive logic has assigned weights to the first score, the second score, and the third score, the comprehensive score of the text information is obtained according to the weights assigned by the preset comprehensive logic.
The comprehensive score = first score × first weight + second score × second weight + third score × third weight, where the first weight is the weight assigned to the first score by the comprehensive logic, the second weight is the weight assigned to the second score, and the third weight is the weight assigned to the third score, and the sum of the weights is 1. Typically, the first score is weighted higher than the second score.
309. And obtaining a classification result of the text information according to the comprehensive score.
In the embodiment of the invention, the preset score threshold range in which the comprehensive score falls is determined to obtain a judgment result, and the classification result of the text information is then obtained according to the judgment result.
The preset score threshold range of positive emotion can be adjusted according to the specific situation, and the preset score threshold range of each emotion can also take other values; this is not limited here.
It should be noted that the system may determine the preset score threshold range in which the comprehensive score falls to obtain a judgment result and then obtain the classification result of the text information according to the judgment result; the classification result of the text information may also be obtained by other judgment methods, for example by determining which preset emotion score the comprehensive score is closest to. Which method is used to obtain the classification result is not limited here.
The preset emotion scores may be, for example, 2 points for positive emotion, 0 points for neutral emotion, and -2 points for negative emotion; the specific emotion score values may be adjusted according to the actual application and are not limited here.
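Both judgment methods can be sketched as below; the threshold ranges follow the example values used in the scenarios later in this description, and the preset emotion scores are the example values just given (2, 0, -2).

```python
# Sketch of the two judgment methods: (a) locate the threshold range
# containing the composite score; (b) pick the preset emotion score the
# composite score is closest to. Values are the examples from the text.
PRESET_EMOTION_SCORES = {"positive": 2, "neutral": 0, "negative": -2}

def classify_by_threshold(composite):
    if composite > 1:       # positive range (1, 100)
        return "positive"
    if composite < -1:      # negative range (-100, -1)
        return "negative"
    return "neutral"        # neutral range (-1, 1)

def classify_by_nearest(composite):
    return min(PRESET_EMOTION_SCORES,
               key=lambda label: abs(composite - PRESET_EMOTION_SCORES[label]))

print(classify_by_threshold(4.3), classify_by_nearest(1.2))  # positive positive
```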
It should be noted that, when the language environment of the text information cannot be determined according to the preset text rule, user-defined logic may be used to assign the weights, where the user-defined logic is logic input by the user through the parameter configuration port.
For example, if the emotion texts in a business scenario are mostly positive and only a small number are negative, custom logic can be input so that a text is considered negative only when all classifiers judge it to be negative, which improves the classification effect.
It should be noted that, when the language environment of the text information cannot be determined according to the preset text rule, besides using user-defined logic input by the user to assign weights for the text information, other assignment methods are possible, for example directly configuring the weight assignment as an average assignment. Which assignment method is used is not limited here.
The average assignment means that the first score is assigned a weight of 0.5 and the second score is also assigned a weight of 0.5, so that the final score = first score × 0.5 + second score × 0.5. The sum of the weights is 1; if there are multiple weights, each weight is 1 ÷ (the number of weights).
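A small sketch of this fallback follows; when no language environment is recognized, each score simply receives the weight 1 ÷ (number of scores).

```python
# Sketch of the average-assignment fallback: with n scores, each weight
# is 1 / n, so two scores each get 0.5.
def average_weighted_score(scores):
    weight = 1.0 / len(scores)
    return sum(score * weight for score in scores)

print(average_weighted_score([7, -2]))  # 7 * 0.5 + (-2) * 0.5 = 2.5
```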
In the embodiment of the invention, text information is first acquired; word segmentation processing is performed on the text to obtain a first word segmentation; the first word segmentation is placed into a preset emotion score counter for calculation to obtain a first score; word segmentation processing is performed on the text to obtain a second word segmentation; the second word segmentation is placed into a preset training model for calculation to obtain a second score; and a third score of the text information is obtained by using an emotion classification method. When the language environment of the text information is determined according to the preset text rule, the first score, the second score, and the third score are subjected to weight distribution using the preset comprehensive logic, the comprehensive score of the text information is obtained according to the weights distributed by the preset comprehensive logic, and the classification result of the text information is obtained according to the comprehensive score. The embodiment of the invention uses a series of emotion classification methods and, in combination with the language environment, distributes weights to the scores obtained by the different algorithms, thereby improving the accuracy of text classification.
For ease of understanding, the present embodiment is described below with reference to specific application scenarios:
Scenario 1: the system acquires a piece of text information, "Xiaoli did a good job on her exam; when her father learned the news, his happiness was beyond description."
The system performs word segmentation processing on the text and obtains four sub-words: 'exam', 'good', 'happy', and 'beyond description'. The four sub-words are then put into the emotion score counter for lookup; the two words 'good' and 'happy' are found, with corresponding scores of 3 points and 4 points respectively, so the first score is 3 + 4 = 7. The four sub-words 'exam', 'good', 'happy', and 'beyond description' are then screened, leaving the two words 'exam' and 'beyond description'. These two words are placed into the preset training model and converted into their corresponding numerical vectors; after the preset score vectors closest to the two numerical vectors are found through calculation, the corresponding scores are obtained: the score corresponding to the numerical vector of 'beyond description' is -2 points and the score corresponding to the numerical vector of 'exam' is 0 points, so the second score is calculated to be -2 points. The words 'exam' and 'good' are analyzed against the stored language environment template to determine the language environment of the text, which is found to be narrative text. The logic corresponding to narrative text assigns a weight of 0.7 to the first score and a weight of 0.3 to the second score, so the comprehensive score is 7 × 0.7 + (-2) × 0.3 = 4.3. The preset score threshold range of negative emotion is (-100, -1), the preset score threshold range of neutral emotion is (-1, 1), and the preset score threshold range of positive emotion is (1, 100); since 4.3 points falls within the preset score threshold range of positive emotion, the system classifies the text as a positive-emotion text.
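The arithmetic of scenario 1 can be reproduced directly (values as stated above; the weights are the narrative-text example values):

```python
# Reproducing scenario 1 (illustrative values from the description).
first = 3 + 4                            # 'good' + 'happy' from the dictionary
second = -2 + 0                          # 'beyond description' + 'exam' from the model
composite = first * 0.7 + second * 0.3   # narrative-text weights
print(round(composite, 1))               # 4.3 -> falls in (1, 100) -> positive emotion
```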
Scenario 2: the system acquires a piece of text information, "A few days ago I watched the movie 'Crazy Stone' with a friend. I had thought it would be boring, but the movie was really unexpected for me."
The system performs word segmentation processing on the text and obtains four sub-words: 'friend', 'boring', 'movie', and 'unexpected'. The four sub-words are then put into the emotion score counter for lookup; the word 'boring' is found, with a corresponding score of -2 points, so the first score is -2. The four sub-words 'friend', 'boring', 'movie', and 'unexpected' are then screened, leaving the three words 'friend', 'movie', and 'unexpected'. These three words are placed into the preset training model and converted into their corresponding numerical vectors; after the preset score vectors closest to the three numerical vectors are found through calculation, the corresponding scores are obtained: the score corresponding to the numerical vector of 'friend' is 1 point, the score corresponding to the numerical vector of 'movie' is 0 points, and the score corresponding to the numerical vector of 'unexpected' is 1 point, so the second score is 2 points. The words 'movie' and 'unexpected' are analyzed against the stored language environment template to determine the language environment of the text, which is found to belong to the movie domain. The logic corresponding to the movie domain assigns a weight of 0.2 to the first score and a weight of 0.8 to the second score, so the comprehensive score is -2 × 0.2 + 2 × 0.8 = 1.2. The preset score threshold range of negative emotion is (-100, -1), the preset score threshold range of neutral emotion is (-1, 1), and the preset score threshold range of positive emotion is (1, 100); since 1.2 falls within the preset score threshold range of positive emotion, the system classifies the text as a positive-emotion text.
The text information classification method in the embodiment of the present invention is described above, and a system in the embodiment of the present invention is described below with reference to fig. 4, where the system in the embodiment of the present invention includes:
a first acquisition unit 401 configured to acquire text information;
a second obtaining unit 402, configured to obtain a first word segmentation, where the first word segmentation is obtained by performing word segmentation processing on the text information obtained by the first obtaining unit according to a first preset rule;
a first embedding unit 403, configured to place the first word segmentation obtained by the second obtaining unit into a preset emotion score counter for calculation to obtain a first score;
a third obtaining unit 404, configured to obtain a second participle, where the second participle is obtained by performing participle processing on the text information obtained by the first obtaining unit according to a second preset rule;
a second embedding unit 405, configured to place the second participle into a preset training model for calculation to obtain a second score;
the first distributing unit 406 is configured to, when the language environment of the text information is determined according to the preset text rule, perform weight distribution on the first score and the second score by using a preset comprehensive logic;
the calculation unit 407 is configured to obtain a comprehensive score of the text information according to the weight assigned by the comprehensive logic;
the processing unit 408 is configured to obtain a classification result of the text information according to the comprehensive score obtained by the calculating unit.
In the embodiment of the present invention, the first obtaining unit 401 first obtains text information; the second obtaining unit 402 obtains a first word segmentation by performing word segmentation processing on the text; the first embedding unit 403 places the first word segmentation into a preset emotion score counter for calculation to obtain a first score; the third obtaining unit 404 obtains a second word segmentation by performing word segmentation processing on the text; the second embedding unit 405 places the second word segmentation into a preset training model for calculation to obtain a second score. When the language environment of the text information is determined according to the preset text rule, the first distribution unit 406 performs weight distribution on the first score and the second score by using preset comprehensive logic, and the calculation unit 407 obtains the comprehensive score of the text information according to the weights distributed by the comprehensive logic; the processing unit 408 then obtains the classification result of the text information according to the comprehensive score obtained by the calculation unit. The embodiment of the invention uses a series of emotion classification methods and, in combination with the language environment, distributes weights to the scores obtained by the different algorithms, thereby improving the accuracy of text classification.
Referring to fig. 5, another embodiment of the system according to the embodiment of the present invention includes:
a first obtaining unit 501, configured to obtain text information;
a second obtaining unit 502, configured to obtain a first word segmentation, where the first word segmentation is obtained by performing word segmentation processing on the text information obtained by the first obtaining unit according to a first preset rule;
a first embedding unit 503, configured to place the first word segmentation obtained by the second obtaining unit into a preset emotion score counter for calculation to obtain a first score;
the first embedding unit 503 includes:
a searching subunit 5031, configured to search, in the preset emotion dictionary, whether a first sub-participle exists, where the first sub-participle is included in the first participle;
an extracting subunit 5032, configured to, when the searching subunit finds that the first sub-participle exists, extract the score corresponding to the existing first sub-participle, where the preset emotion dictionary stores the correspondence between first sub-participles and scores;
the first calculating sub-unit 5033 is configured to calculate, according to the preset emotion score counter, a score corresponding to the first sub-participle to obtain a first score.
A third obtaining unit 504, configured to obtain a second participle, where the second participle is obtained by performing participle processing on the text information obtained by the first obtaining unit according to a second preset rule;
a second embedding unit 505, configured to place the second participle into a preset training model for calculation to obtain a second score;
the second embedding unit 505 includes:
a conversion subunit 5051, configured to convert the second sub-participle into a numerical vector according to a preset training model, where the second sub-participle is included in the second participle;
a second calculating subunit 5052, configured to calculate the distance between the numerical vector and a preset score vector;
a first determining subunit 5053, configured to take the score corresponding to the preset score vector closest to the numerical vector as the score of the second sub-participle;
and a third computing subunit 5054, configured to add the scores corresponding to the second sub-participles to obtain a second score.
A fourth obtaining unit 506, configured to obtain the third score of the text information by using an emotion classification method, where the emotion classification method is configured according to the change of the language environment.
A second assigning unit 507 for assigning a weight to the first score, the second score and the third score using a preset integration logic.
A calculating unit 508, configured to obtain a comprehensive score of the text information according to the weight assigned by the comprehensive logic;
a processing unit 509, configured to obtain a classification result of the text information according to the comprehensive score obtained by the calculating unit.
Wherein the processing unit 509 comprises:
a second determining subunit 5091, configured to determine a preset score threshold range where the comprehensive score is located, to obtain a determination result;
and the processing subunit 5092 is configured to obtain a text classification result according to the determination result.
In the embodiment of the present invention, the first obtaining unit 501 first obtains text information; the second obtaining unit 502 obtains a first word segmentation by performing word segmentation processing on the text; the first embedding unit 503 places the first word segmentation into a preset emotion score counter for calculation to obtain a first score; the third obtaining unit 504 obtains a second word segmentation by performing word segmentation processing on the text; the second embedding unit 505 places the second word segmentation into a preset training model for calculation to obtain a second score; and the fourth obtaining unit 506 obtains a third score of the text information by using an emotion classification method. When the language environment of the text information is determined according to the preset text rule, the second allocating unit 507 performs weight distribution on the first score, the second score, and the third score by using preset comprehensive logic, the calculating unit 508 obtains the comprehensive score of the text information according to the weights distributed by the preset comprehensive logic, and the processing unit 509 obtains the classification result of the text information according to the comprehensive score. The embodiment of the invention uses a series of emotion classification methods and, in combination with the language environment, distributes weights to the scores obtained by the different algorithms, thereby improving the accuracy of text classification.
Referring to fig. 6, fig. 6 is a schematic diagram of a server structure according to an embodiment of the present invention. The server 600 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 622 (e.g., one or more processors), a memory 632, and one or more storage media 630 (e.g., one or more mass storage devices) for storing applications 642 or data 644. The memory 632 and the storage medium 630 may be transient or persistent storage. The program stored in the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Furthermore, the central processing unit 622 may be configured to communicate with the storage medium 630 and execute, on the server 600, the series of instruction operations in the storage medium 630.
The server 600 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, and/or one or more operating systems 641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 6.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A text information classification method is characterized by comprising the following steps:
acquiring text information;
obtaining a first word segmentation, wherein the first word segmentation is obtained by performing word segmentation processing on the text information according to a first preset rule;
placing the first word segmentation into a preset emotion score counter for calculation to obtain a first score;
obtaining a second word segmentation, wherein the second word segmentation is obtained by screening the first word segmentation according to a second preset rule;
placing the second word segmentation into a preset training model for calculation to obtain a second score;
performing weight distribution on the first score and the second score by utilizing preset comprehensive logic based on the first score, wherein the preset comprehensive logic is a rule set according to a language environment in a text;
obtaining a comprehensive score of the text information according to the weight distributed by the preset comprehensive logic;
and obtaining a classification result of the text information according to the comprehensive score.
2. The method of claim 1, wherein the step of placing the first word segmentation into a preset emotion score counter for calculation to obtain a first score comprises:
searching whether a first sub-participle exists in a preset emotion dictionary arranged in the preset emotion score counter, wherein the first sub-participle is contained in the first participle, and the preset emotion dictionary stores the corresponding relation between the first sub-participle and the score;
if the first sub-participle exists, extracting a score corresponding to the existing first sub-participle;
and calculating the score corresponding to the first sub-participle according to the preset emotion score counter to obtain the first score.
3. The method for classifying text information according to claim 1, wherein the step of placing the second word segmentation into a preset training model for calculation to obtain a second score comprises:
converting a second sub-participle into a numerical vector according to a preset training model, wherein the second sub-participle is contained in the second participle;
calculating the distance between the numerical vector and a preset score vector;
taking the score corresponding to the preset score vector closest to the numerical vector as the score of the second sub-participle;
and adding the scores corresponding to the second sub-participles to obtain the second score.
4. The method of claim 1, wherein the deriving the classification result of the text message according to the composite score comprises:
judging a preset score threshold range in which the comprehensive score is positioned to obtain a judgment result;
and obtaining the classification result of the text information according to the judgment result.
5. The method according to any one of claims 1 to 4, wherein after the text information is acquired, the method further comprises:
and acquiring a third score of the text information by using an emotion classification method, wherein the emotion classification method is configured according to language environment change.
6. A text classification system, comprising:
a first acquisition unit configured to acquire text information;
the second obtaining unit is used for obtaining a first word segmentation, and the first word segmentation is obtained by performing word segmentation processing on the text information according to a first preset rule;
the first embedding unit is used for embedding the first segmentation into a preset emotion score counter to obtain a first score through calculation;
the third obtaining unit is used for obtaining a second participle, and the second participle is obtained by screening the first participle according to a second preset rule;
the second embedding unit is used for embedding the second segmentation into a preset training model to obtain a second score through calculation;
the first distribution unit is used for carrying out weight distribution on the first score and the second score by utilizing preset comprehensive logic based on the first score, and the preset comprehensive logic is a rule set according to a language environment in a text;
the calculation unit is used for obtaining the comprehensive score of the text information according to the weight distributed by the comprehensive logic;
and the processing unit is used for obtaining the classification result of the text information according to the comprehensive score.
7. The system of claim 6, wherein the first embedding unit comprises:
the searching subunit is used for searching whether a first sub-participle exists in a preset emotion dictionary, and the first sub-participle is contained in the first participle;
the extracting subunit is configured to, when the searching subunit finds that the first sub-participle exists, extract the score corresponding to the first sub-participle, wherein the preset emotion dictionary stores the correspondence between the first sub-participle and the score;
and the first calculating subunit is used for calculating the score corresponding to the first sub-participle according to the preset emotion score counter to obtain the first score.
8. The system of claim 6, wherein the second embedding unit comprises:
the conversion subunit is used for converting a second sub-participle into a numerical vector according to a preset training model, wherein the second sub-participle is contained in the second participle;
the second calculating subunit is used for calculating the distance between the numerical vector and a preset score vector;
the first determining subunit is used for taking the score corresponding to the preset score vector closest to the numerical vector as the score of the second sub-participle;
and the third calculating subunit is used for adding the scores corresponding to the second sub-participles to obtain the second score.
9. The system of claim 6, wherein the processing unit comprises:
the second determining subunit is used for judging the preset score threshold range in which the comprehensive score is positioned to obtain a judgment result;
and the processing subunit is used for obtaining a text classification result according to the judgment result.
10. The system according to any one of claims 6 to 8, further comprising:
and the fourth acquisition unit is used for acquiring a third score of the text information by utilizing an emotion classification method, wherein the emotion classification method is configured according to the language environment change.
CN201611044117.5A 2016-11-23 2016-11-23 Text information classification method and system Active CN108090040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611044117.5A CN108090040B (en) 2016-11-23 2016-11-23 Text information classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611044117.5A CN108090040B (en) 2016-11-23 2016-11-23 Text information classification method and system

Publications (2)

Publication Number Publication Date
CN108090040A CN108090040A (en) 2018-05-29
CN108090040B true CN108090040B (en) 2021-08-17

Family

ID=62170951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611044117.5A Active CN108090040B (en) 2016-11-23 2016-11-23 Text information classification method and system

Country Status (1)

Country Link
CN (1) CN108090040B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460550A (en) * 2018-10-22 2019-03-12 平安科技(深圳)有限公司 Report sentiment analysis method, apparatus and computer equipment are ground using the security of big data
CN110046342A (en) * 2019-02-19 2019-07-23 阿里巴巴集团控股有限公司 A kind of text quality's detection method
CN109829167B (en) * 2019-02-22 2023-11-21 维沃移动通信有限公司 Word segmentation processing method and mobile terminal

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929861A (en) * 2012-10-22 2013-02-13 杭州东信北邮信息技术有限公司 Method and system for calculating text emotion index
CN104008091A (en) * 2014-05-26 2014-08-27 上海大学 Sentiment value based web text sentiment analysis method
CN104392006A (en) * 2014-12-17 2015-03-04 中国农业银行股份有限公司 Event query processing method and device
CN104462409A (en) * 2014-12-12 2015-03-25 重庆理工大学 Cross-language emotional resource data identification method based on AdaBoost
CN104951548A (en) * 2015-06-24 2015-09-30 烟台中科网络技术研究所 Method and system for calculating negative public opinion index

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7231399B1 (en) * 2003-11-14 2007-06-12 Google Inc. Ranking documents based on large data sets
CN102023986B (en) * 2009-09-22 2015-09-30 日电(中国)有限公司 The method and apparatus of text classifier is built with reference to external knowledge
CN103927302B (en) * 2013-01-10 2017-05-31 阿里巴巴集团控股有限公司 A kind of file classification method and system
US20150073774A1 (en) * 2013-09-11 2015-03-12 Avaya Inc. Automatic Domain Sentiment Expansion
CN103544255B (en) * 2013-10-15 2017-01-11 常州大学 Text semantic relativity based network public opinion information analysis method
CN105260356B (en) * 2015-10-10 2018-02-06 西安交通大学 Chinese interaction text emotion and topic detection method based on multi-task learning
CN105653649B (en) * 2015-12-28 2019-05-21 福建亿榕信息技术有限公司 Low accounting information identifying method and device in mass text
CN105740228B (en) * 2016-01-25 2019-06-04 云南大学 A kind of internet public feelings analysis method and system
CN106096623A (en) * 2016-05-25 2016-11-09 中山大学 A kind of crime identifies and Forecasting Methodology

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929861A (en) * 2012-10-22 2013-02-13 杭州东信北邮信息技术有限公司 Method and system for calculating text emotion index
CN104008091A (en) * 2014-05-26 2014-08-27 上海大学 Sentiment value based web text sentiment analysis method
CN104462409A (en) * 2014-12-12 2015-03-25 重庆理工大学 Cross-language emotional resource data identification method based on AdaBoost
CN104392006A (en) * 2014-12-17 2015-03-04 中国农业银行股份有限公司 Event query processing method and device
CN104951548A (en) * 2015-06-24 2015-09-30 烟台中科网络技术研究所 Method and system for calculating negative public opinion index

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
中文文本情感倾向性分类研究 (Research on sentiment orientation classification of Chinese text); 邓时滔 (Deng Shitao); 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Masters' Theses Full-text Database, Information Science and Technology); 2013-03-15 (Issue 03, 2013); I138-1743 *

Also Published As

Publication number Publication date
CN108090040A (en) 2018-05-29

Similar Documents

Publication Publication Date Title
US11645517B2 (en) Information processing method and terminal, and computer storage medium
CN105912716B (en) A kind of short text classification method and device
CN107436875A (en) File classification method and device
CN111898643B (en) Semantic matching method and device
CN108090040B (en) Text information classification method and system
CN106649250B (en) A kind of recognition methods of emotion neologisms and device
JP7198408B2 (en) Trademark information processing device and method, and program
CN110110049A (en) Service consultation method, apparatus, system, service robot and storage medium
CN108519998A (en) The problem of knowledge based collection of illustrative plates bootstrap technique and device
CN110969172A (en) Text classification method and related equipment
CN110032736A (en) A kind of text analyzing method, apparatus and storage medium
CN109902284A (en) A kind of unsupervised argument extracting method excavated based on debate
KR101931624B1 (en) Trend Analyzing Method for Fassion Field and Storage Medium Having the Same
CN111475731A (en) Data processing method, device, storage medium and equipment
CN105512300A (en) Information filtering method and system
JP2013131075A (en) Classification model learning method, device, program, and review document classifying method
CN110162769B (en) Text theme output method and device, storage medium and electronic device
CN104408036B (en) It is associated with recognition methods and the device of topic
CN110019556B (en) Topic news acquisition method, device and equipment thereof
US10191786B2 (en) Application program interface mashup generation
CN106653006A (en) Search method and device based on voice interaction
CN110019832B (en) Method and device for acquiring language model
CN106782516B (en) Corpus classification method and apparatus
CN110413990A (en) The configuration method of term vector, device, storage medium, electronic device
CN110069780B (en) Specific field text-based emotion word recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant