CN111159410A - Text emotion classification method, system and device and storage medium - Google Patents
Text emotion classification method, system and device and storage medium
- Publication number
- CN111159410A (application CN201911410177.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- vector
- vectors
- weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
The invention discloses a text emotion classification method, system, device and storage medium. The method comprises the following steps: preprocessing the text; performing statistical calculation on the preprocessed text to obtain text vectors; performing feature selection on the text vectors using a chi-square statistical method and extracting feature vectors; performing weight calculation on the feature vectors to obtain the weight of each feature vector; and classifying the text with a support vector machine in combination with the weights of the feature vectors. The system comprises a preprocessing module, a statistical module, a feature module, a weight module and a classification module. The device comprises a memory and a processor for executing the text emotion classification method. The method, system, device and storage medium can improve the accuracy of text classification and can be widely applied in the field of text classification.
Description
Technical Field
The invention relates to the field of text classification, in particular to a text emotion classification method, a text emotion classification system, a text emotion classification device and a storage medium.
Background
Emotion classification, also known as tendency analysis, is a task in the field of natural language processing: the process of analyzing, processing, summarizing and reasoning about subjective text that carries emotional coloring. It can identify an author's emotional preference and viewpoint toward a specific subject in a text, and is used for predicting film box office and stock trends, analyzing public opinion, improving services and products, understanding user experience, and so on. At present, the main research methods for text emotion classification are dictionary-based and corpus-based: information is mined from the corpus or dictionary and the emotional tendency of words is recognized, so that statistical data is obtained and the polarity of the words is judged. However, neither method can distinguish the part of speech of new words, and because neither judges at the semantic level, the accuracy of the resulting classification is low.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a method, a system, a device and a storage medium for text emotion classification, which can improve the accuracy of text classification.
The first technical scheme adopted by the invention is as follows: a text emotion classification method comprises the following steps:
preprocessing the text;
carrying out statistic calculation on the preprocessed text to obtain a text vector;
selecting the features of the text vectors by adopting a chi-square statistical method, and extracting the feature vectors;
carrying out weight calculation on the feature vectors to obtain the weight of each feature vector;
and classifying the text based on a support vector machine in combination with the weights of the feature vectors.
Further, the step of preprocessing the text specifically includes:
obtaining a text, filtering illegal characters of the text and performing word segmentation processing on the text;
and removing irrelevant words and counting word frequency to obtain the preprocessed text.
Further, the feature selection of the text vector by using the chi-square statistical method specifically adopts the following formula:
where $t_i$ is a feature item, $C_j$ is a category, $N$ is the total number of texts, $A$ is the number of texts that contain $t_i$ and belong to $C_j$, $B$ is the number of texts that contain $t_i$ but do not belong to $C_j$, $C$ is the number of texts that belong to $C_j$ but do not contain $t_i$, and $D$ is the number of texts that neither belong to $C_j$ nor contain $t_i$.
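As a sketch of the expression these symbols describe, assuming the standard chi-square statistic over the four contingency counts (an assumption, since the expression itself is not shown here):

```latex
\chi^2(t_i, C_j) = \frac{N \, (AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}, \qquad N = A + B + C + D
```

Under this form the statistic is zero when the feature and the category are independent ($AD = BC$) and grows as the feature concentrates in one category.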
Further, the weight calculation of the feature vectors to obtain the weight of each feature vector specifically adopts the following formula:
where $w_{ij}$ denotes the weight, $tf_{ij}$ denotes the number of occurrences of $t_i$ in the text, and $n_i$ denotes the number of texts that contain $t_i$.
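The symbols are consistent with the common TF-IDF weighting. As a hedged sketch, assuming the unsmoothed form with $N$ the total number of texts as defined for the chi-square step (the exact variant, logarithm base and any smoothing term are assumptions):

```latex
w_{ij} = tf_{ij} \times \log\frac{N}{n_i}
```

Under this form a feature that appears in every text ($n_i = N$) receives zero weight, while a feature confined to a few texts is weighted up.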
Further, the performing weight calculation on the feature vectors to obtain the weight of each feature vector further includes performing normalization processing on the weight, specifically using the following formula:
where $M$ denotes the number of vectors.
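One plausible reading, offered only as an assumption, is cosine (Euclidean length) normalization of the TF-IDF weights, with $M$ taken as the number of feature entries summed over for text $j$:

```latex
w_{ij} = \frac{tf_{ij} \times \log\left(N / n_i\right)}{\sqrt{\sum_{k=1}^{M} \left[ tf_{kj} \times \log\left(N / n_k\right) \right]^2}}
```

With this form every text's weight vector has unit length, so a long text and a short text with the same word proportions receive the same weights.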
Further, the step of selecting features of the text vector by using a chi-square statistical method to extract the feature vector specifically includes:
scoring the feature items of the text vector and sequencing the feature items according to the scoring size;
and obtaining text feature items according to a preset quantity, and extracting feature vectors of the text by adopting a chi-square statistical method.
Further, the irrelevant words include stop words, pronouns, quantifiers, auxiliary words, conjunctions, and vocabularies.
The second technical scheme adopted by the invention is as follows: a text sentiment classification system comprising:
the preprocessing module is used for preprocessing the text;
the statistical module is used for carrying out statistical calculation on the preprocessed text to obtain a text vector;
the feature module is used for performing feature selection on the text vectors by a chi-square statistical method and extracting the feature vectors;
the weighting module is used for carrying out weighting calculation on the feature vectors to obtain the weight of each feature vector;
and the classification module is used for classifying the texts based on the support vector machine by combining the weight of each feature vector.
The third technical scheme adopted by the invention is as follows: a text emotion classification apparatus comprising:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the text emotion classification method described above.
The fourth technical scheme adopted by the invention is as follows: a storage medium storing processor-executable instructions which, when executed by a processor, implement the text emotion classification method described above.
The method, system, device and storage medium have the following advantages: the text is expressed in vector form; emotion classification is realized by extracting features of the text and calculating weights for the extracted features; and the vector space model of the text, combined with the feature weights, is input into a support vector machine for classification, so that the accuracy of text emotion classification is improved.
Drawings
FIG. 1 is a flowchart of the steps of a method for classifying text emotion according to the present invention;
FIG. 2 is a block diagram of a text emotion classification system according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
For example, for comments on its products, an enterprise can directly extract the comment texts of all users and apply the method to classify the sentiment of a large number of comment texts, so that the enterprise can quickly learn whether users approve of the products.
As shown in FIG. 1, the invention provides a text emotion classification method, which comprises the following steps:
S101, preprocessing the text;
Specifically, the purpose of text preprocessing is to extract the main content from the text corpus in a standardized way and remove information irrelevant to text emotion classification. The main operations include filtering illegal characters, word segmentation and removing stop words; after word segmentation, the words can be given emotion labels.
S102, carrying out statistic calculation on the preprocessed text to obtain a text vector;
Specifically, text is unstructured data composed of a large number of characters, and a computer cannot directly process character data. The content of ordinary text therefore needs to be converted into a form that the computer can read and process; that is, the text is given a formal representation.
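A minimal sketch of such a formal representation, assuming a bag-of-words vector space model in which each text becomes a vector of word counts over a shared vocabulary. The function names `build_vocabulary` and `text_to_vector` are illustrative and do not come from the patent.

```python
from collections import Counter

def build_vocabulary(tokenized_texts):
    """Map each distinct word to a column index, in sorted order for determinism."""
    words = sorted({w for tokens in tokenized_texts for w in tokens})
    return {w: i for i, w in enumerate(words)}

def text_to_vector(tokens, vocabulary):
    """Return the term-frequency vector of one tokenized text."""
    counts = Counter(tokens)
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector

# Two already-segmented example texts (hypothetical data).
texts = [["good", "phone", "good"], ["bad", "battery"]]
vocab = build_vocabulary(texts)
vectors = [text_to_vector(t, vocab) for t in texts]
```

Each position in a vector corresponds to one word of the vocabulary, which is what lets the later steps score, weight and select individual features.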
S103, selecting features of the text vectors by adopting a chi-square statistical method, and extracting the feature vectors;
S104, carrying out weight calculation on the feature vectors to obtain the weight of each feature vector;
S105, classifying the texts based on a support vector machine by combining the weight of each feature vector.
Specifically, weight calculation assigns each feature item a weight according to its contribution to classification. Classification mainly uses a support vector machine, a binary classification model that seeks a hyperplane separating the samples. The separation principle is margin maximization, which is ultimately converted into a convex quadratic programming problem and solved.
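The margin-maximizing idea can be illustrated with a small linear classifier. Note the substitution: instead of solving the convex quadratic program, this sketch minimizes the L2-regularized hinge loss by subgradient descent, a simpler technique that targets the same kind of maximum-margin hyperplane. The function names, hyperparameters and toy data are all assumptions made for illustration.

```python
def train_linear_svm(samples, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the regularized hinge loss.

    labels must be +1 or -1.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrinking w favors a wider margin.
            w = [wi * (1 - learning_rate * reg) for wi in w]
            if margin < 1:
                # The sample is misclassified or inside the margin: move the hyperplane.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b

def predict(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy normalized feature-weight vectors: first feature signals positive sentiment,
# second feature signals negative sentiment (hypothetical data).
samples = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```

A production system would more likely rely on an established SVM solver; the sketch only shows how weighted feature vectors and binary labels enter the classifier.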
Further, as a preferred embodiment of the method, the step of preprocessing the text specifically includes:
obtaining a text, filtering illegal characters of the text and performing word segmentation processing on the text;
and removing irrelevant words and counting word frequency to obtain the preprocessed text.
Specifically, after illegal characters are filtered out, the text data is segmented: long sentences are split into words, and the words can then be given emotion labels.
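The preprocessing steps can be sketched as follows. Several simplifications are assumed: "illegal characters" are taken to be anything other than word characters and whitespace, whitespace splitting stands in for a real word segmenter (Chinese text would need a dedicated one), and the irrelevant-word list is a tiny hypothetical sample.

```python
import re
from collections import Counter

# Hypothetical irrelevant-word list; a real system would load a full lexicon.
IRRELEVANT_WORDS = {"the", "a", "is", "it", "and", "this"}

def preprocess(text, irrelevant_words=IRRELEVANT_WORDS):
    """Filter characters, segment into words, drop irrelevant words, count frequency."""
    # Assumption: replace everything except word characters and whitespace with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    # Assumption: whitespace splitting stands in for a real word segmenter.
    tokens = cleaned.split()
    kept = [t for t in tokens if t not in irrelevant_words]
    return kept, Counter(kept)

tokens, freq = preprocess("This phone is great!!! Great screen, and the battery is great :)")
```

The function returns both the cleaned token list and the word-frequency counts, matching the two outputs the step describes.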
Further, as a preferred embodiment of the method, the following formula is specifically adopted for feature selection of the text vector by using a chi-square statistical method:
where $t_i$ is a feature item, $C_j$ is a category, $N$ is the total number of texts, $A$ is the number of texts that contain $t_i$ and belong to $C_j$, $B$ is the number of texts that contain $t_i$ but do not belong to $C_j$, $C$ is the number of texts that belong to $C_j$ but do not contain $t_i$, and $D$ is the number of texts that neither belong to $C_j$ nor contain $t_i$.
Specifically, the algorithm uses a chi-square statistical method for feature selection. The chi-square statistic measures the correlation between feature $t_i$ and document class $C_j$: the higher the statistic, the more information the feature carries and the stronger its correlation with the class.
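A minimal sketch of this scoring, assuming the standard chi-square statistic N(AD - BC)^2 / ((A + C)(B + D)(A + B)(C + D)) over the four counts defined above. The function name and the example counts are illustrative.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a feature/class pair from its 2x2 contingency counts.

    a: texts containing the feature and belonging to the class
    b: texts containing the feature but not belonging to the class
    c: texts belonging to the class but not containing the feature
    d: texts neither belonging to the class nor containing the feature
    """
    n = a + b + c + d  # total number of texts
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate case: a row or column of the table is empty
    return n * (a * d - b * c) ** 2 / denominator

# A feature concentrated in one class versus a feature spread evenly over both.
score_strong = chi_square(40, 10, 10, 40)
score_none = chi_square(25, 25, 25, 25)
```

The evenly spread feature scores zero, while the concentrated one scores high, which is the behaviour the ranking step relies on.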
Further, as a preferred embodiment of the method, the weight calculation of the feature vectors is performed to obtain the weight of each feature vector by using the following formula:
where $w_{ij}$ denotes the weight, $tf_{ij}$ denotes the number of occurrences of $t_i$ in the text, and $n_i$ denotes the number of texts that contain $t_i$.
Specifically, the feature selection process picks the feature vectors that best represent the text content, but these features do not all influence classification equally. The selected features therefore need to be weighted: features with strong category-distinguishing ability receive a larger weight and features with weak category-distinguishing ability receive a smaller weight, which effectively suppresses noise.
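As a hedged illustration of such weighting, the sketch below assumes the common unsmoothed TF-IDF form, weight = tf × log(N / n), with the natural logarithm. The exact variant used by the method is not shown in this text, and the function name is hypothetical.

```python
import math

def tfidf_weight(tf_ij, n_total, n_i):
    """Weight of feature t_i in text j: term frequency times log inverse document frequency."""
    if tf_ij == 0 or n_i == 0:
        return 0.0  # the feature is absent from the text or from the whole collection
    return tf_ij * math.log(n_total / n_i)

# Same term frequency, but one feature is rare and the other appears in every text.
rare = tfidf_weight(tf_ij=2, n_total=100, n_i=1)
common = tfidf_weight(tf_ij=2, n_total=100, n_i=100)
```

The feature that occurs in every text gets zero weight because it cannot distinguish categories, while the rare feature is weighted up.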
Further, as a preferred embodiment of the method, the calculating the weight of the feature vector to obtain the weight of each feature vector further includes normalizing the weight, specifically using the following formula:
where $M$ denotes the number of vectors.
Specifically, in order to eliminate the influence of the text length on the feature weight, the weight of the feature is normalized.
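One way to remove the length effect, assumed here for illustration, is to scale each text's weight vector to unit Euclidean length (cosine normalization). The function name is hypothetical.

```python
import math

def normalize_weights(weights):
    """Scale one text's feature weights to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0:
        return list(weights)  # an all-zero vector is returned unchanged
    return [w / norm for w in weights]

# A short text and a text ten times longer with the same word proportions.
short_text = normalize_weights([3.0, 4.0])
long_text = normalize_weights([30.0, 40.0])
```

Both texts end up with the same normalized weights, so length alone no longer changes how strongly a feature counts.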
Further, as a preferred embodiment of the method, the step of selecting the feature of the text vector by using a chi-square statistical method to extract the feature vector specifically includes:
scoring the feature items of the text vector and sequencing the feature items according to the scoring size;
and obtaining text feature items according to a preset quantity, and extracting feature vectors of the text by adopting a chi-square statistical method.
Specifically, the number of features may reach tens of thousands of dimensions, which not only lengthens the running time but also greatly reduces classification accuracy. Feature selection picks a small subset of features from the original high-dimensional feature set as the classification features of the classifier. During feature selection, each feature is scored by a constructed evaluation function, the features are sorted in descending order of score, and finally a certain number of features are selected as the classification feature set.
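The score, sort and select procedure can be sketched in a few lines. The scores are taken as given (for example chi-square values), and the function name and sample scores are illustrative.

```python
def select_top_features(scores, k):
    """Return the k feature items with the highest scores, best first.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [feature for feature, _ in ranked[:k]]

# Hypothetical evaluation-function scores for four candidate features.
scores = {"great": 36.0, "phone": 0.4, "terrible": 29.5, "screen": 1.2}
selected = select_top_features(scores, k=2)
```

Here `k` plays the role of the preset number of features, and only the selected items are kept as dimensions of the classification feature set.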
Further, as a preferred embodiment of the method, the irrelevant words include stop words, pronouns, quantifiers, auxiliary words, conjunctions and vocabularies.
Specifically, the type of the irrelevant word can be set according to needs, and options such as prepositions, pure numbers and the like can be added.
The specific embodiment of the invention is as follows:
Obtain the comment text of a user, filter illegal characters and perform word segmentation on the comment text, and remove irrelevant words to obtain the main text data. Count the number of times each word appears in the text and give the words emotion labels. Combining the preprocessing result, the word frequency information and the emotion labels, perform feature selection on the text using the chi-square statistical method: score the features, sort them in descending order of score, and select features according to a preset number. Calculate weights for the selected features and normalize the weights. Finally, represent the text as a vector space model and, in combination with the normalized feature weight vectors, classify a large batch of texts using a support vector machine classifier.
As shown in fig. 2, a text emotion classification system includes:
the preprocessing module is used for preprocessing the text;
the statistical module is used for carrying out statistical calculation on the preprocessed text to obtain a text vector;
the feature module is used for performing feature selection on the text vectors by a chi-square statistical method and extracting the feature vectors;
the weighting module is used for carrying out weighting calculation on the feature vectors to obtain the weight of each feature vector;
and the classification module is used for classifying the texts based on the support vector machine by combining the weight of each feature vector.
As a further preferred embodiment of the present system, the preprocessing module further includes:
the word segmentation submodule is used for acquiring the text, filtering illegal characters of the text and carrying out word segmentation processing on the text;
the removing submodule is used for removing irrelevant words and counting word frequency to obtain a preprocessed text.
As a further preferred embodiment of the present system, the feature module further comprises:
the sorting submodule is used for grading the feature items of the text vector and sorting the feature items according to the grading size;
and the extraction submodule is used for obtaining the text feature items according to the preset quantity and extracting the feature vector of the text by adopting a chi-square statistical method.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
A text emotion classification device, comprising:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the text emotion classification method described above.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
A storage medium storing processor-executable instructions which, when executed by a processor, implement the text emotion classification method described above.
The contents in the above method embodiments are all applicable to the present storage medium embodiment, the functions specifically implemented by the present storage medium embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present storage medium embodiment are also the same as those achieved by the above method embodiments.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A text emotion classification method is characterized by comprising the following steps:
preprocessing the text;
carrying out statistic calculation on the preprocessed text to obtain a text vector;
selecting the features of the text vectors by adopting a chi-square statistical method, and extracting the feature vectors;
carrying out weight calculation on the feature vectors to obtain the weight of each feature vector;
and classifying the text based on a support vector machine in combination with the weights of the feature vectors.
2. The method for classifying emotion of text according to claim 1, wherein the step of preprocessing the text specifically includes:
obtaining a text, filtering illegal characters of the text and performing word segmentation processing on the text;
and removing irrelevant words and counting word frequency to obtain the preprocessed text.
3. The method for classifying emotion of text according to claim 1, wherein the feature selection of the text vector by using the chi-square statistical method specifically uses the following formula:
where $t_i$ is a feature item, $C_j$ is a category, $N$ is the total number of texts, $A$ is the number of texts that contain $t_i$ and belong to $C_j$, $B$ is the number of texts that contain $t_i$ but do not belong to $C_j$, $C$ is the number of texts that belong to $C_j$ but do not contain $t_i$, and $D$ is the number of texts that neither belong to $C_j$ nor contain $t_i$.
4. The method of claim 3, wherein the weight calculation of the feature vectors is performed to obtain the weight of each feature vector by using the following formula:
where $w_{ij}$ denotes the weight, $tf_{ij}$ denotes the number of occurrences of $t_i$ in the text, and $n_i$ denotes the number of texts that contain $t_i$.
6. The method for classifying emotion of text according to claim 1, wherein said step of extracting feature vectors by selecting features of text vectors using chi-square statistical method specifically comprises:
scoring the feature items of the text vector and sequencing the feature items according to the scoring size;
and obtaining text feature items according to a preset quantity, and extracting feature vectors of the text by adopting a chi-square statistical method.
7. The method for classifying emotion of text according to claim 1, wherein: the irrelevant words comprise stop words, pronouns, quantifiers, auxiliary words, conjunctions and vocabularies.
8. A text sentiment classification system, comprising:
the preprocessing module is used for preprocessing the text;
the statistical module is used for carrying out statistical calculation on the preprocessed text to obtain a text vector;
the feature module is used for performing feature selection on the text vectors by a chi-square statistical method and extracting the feature vectors;
the weighting module is used for carrying out weighting calculation on the feature vectors to obtain the weight of each feature vector;
and the classification module is used for classifying the texts based on the support vector machine by combining the weight of each feature vector.
9. A text emotion classification device, characterized by further comprising:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the text emotion classification method as claimed in any of claims 1 to 7.
10. A storage medium having stored therein instructions executable by a processor, the storage medium comprising: the processor-executable instructions, when executed by a processor, are for implementing a method for textual emotion classification as claimed in any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911410177.8A CN111159410A (en) | 2019-12-31 | 2019-12-31 | Text emotion classification method, system and device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111159410A (en) | 2020-05-15 |
Family
ID=70559884
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911410177.8A Pending CN111159410A (en) | 2019-12-31 | 2019-12-31 | Text emotion classification method, system and device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111159410A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117408652A (en) * | 2023-12-15 | 2024-01-16 | 江西驱动交通科技有限公司 | File data analysis and management method and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103995876A (en) * | 2014-05-26 | 2014-08-20 | 上海大学 | Text classification method based on chi square statistics and SMO algorithm |
CN107590134A (en) * | 2017-10-26 | 2018-01-16 | 福建亿榕信息技术有限公司 | Text sentiment classification method, storage medium and computer |
CN109543037A (en) * | 2018-11-21 | 2019-03-29 | 南京安讯科技有限责任公司 | A kind of article classification method based on improved TF-IDF |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107291723B (en) | Method and device for classifying webpage texts and method and device for identifying webpage texts | |
US7689531B1 (en) | Automatic charset detection using support vector machines with charset grouping | |
US20200311113A1 (en) | Method and device for extracting core word of commodity short text | |
CN108509629B (en) | Text emotion analysis method based on emotion dictionary and support vector machine | |
CN109101478B (en) | Aspect-level emotion analysis method for E-commerce comment text | |
US7711673B1 (en) | Automatic charset detection using SIM algorithm with charset grouping | |
CN103995876A (en) | Text classification method based on chi square statistics and SMO algorithm | |
CN110705286A (en) | Comment information-based data processing method and device | |
Probierz et al. | Rapid detection of fake news based on machine learning methods | |
US8560466B2 (en) | Method and arrangement for automatic charset detection | |
CN112069312B (en) | Text classification method based on entity recognition and electronic device | |
Rasheed et al. | Urdu text classification: a comparative study using machine learning techniques | |
CN111144106A (en) | Two-stage text feature selection method under unbalanced data set | |
CN112527958A (en) | User behavior tendency identification method, device, equipment and storage medium | |
CN110287493B (en) | Risk phrase identification method and device, electronic equipment and storage medium | |
Farhoodi et al. | N-gram based text classification for Persian newspaper corpus | |
CN114722198A (en) | Method, system and related device for determining product classification code | |
Karo et al. | Karonese sentiment analysis: a new dataset and preliminary result | |
CN113626604A (en) | Webpage text classification system based on maximum interval criterion | |
CN111310467B (en) | Topic extraction method and system combining semantic inference in long text | |
CN110888983B (en) | Positive and negative emotion analysis method, terminal equipment and storage medium | |
CN111159410A (en) | Text emotion classification method, system and device and storage medium | |
CN115827867A (en) | Text type detection method and device | |
CN113095073B (en) | Corpus tag generation method and device, computer equipment and storage medium | |
CN114896398A (en) | Text classification system and method based on feature selection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||