CN111159410A - Text emotion classification method, system and device and storage medium - Google Patents


Info

Publication number
CN111159410A
Authority
CN
China
Prior art keywords
text
feature
vector
vectors
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911410177.8A
Other languages
Chinese (zh)
Inventor
寇永娴
占太雄
陈惠芳
黄娇燕
余嘉昇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GRG Banking Equipment Co Ltd
GRG Banking IT Co Ltd
Original Assignee
GRG Banking Equipment Co Ltd
GRG Banking IT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GRG Banking Equipment Co Ltd, GRG Banking IT Co Ltd
Priority to CN201911410177.8A
Publication of CN111159410A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

The invention discloses a text emotion classification method, system, device and storage medium. The method comprises the following steps: preprocessing the text; performing statistical calculation on the preprocessed text to obtain text vectors; performing feature selection on the text vectors using a chi-square statistical method and extracting feature vectors; performing weight calculation on the feature vectors to obtain the weight of each feature vector; and classifying the text with a support vector machine in combination with the weights of the feature vectors. The system comprises a preprocessing module, a statistical module, a feature module, a weight module and a classification module. The device comprises a memory and a processor for executing the text emotion classification method. The invention can improve the accuracy of text classification and can be widely applied in the field of text classification.

Description

Text emotion classification method, system and device and storage medium
Technical Field
The invention relates to the field of text classification, in particular to a text emotion classification method, a text emotion classification system, a text emotion classification device and a storage medium.
Background
Emotion classification, also known as tendency analysis, is a task in the field of natural language processing: the process of analyzing, processing, generalizing and reasoning about subjective text that carries emotional color. It can identify an author's emotional preference and viewpoint on a specific subject in a text, and is used for predicting film box office and stock trends, analyzing public opinion, improving services and products, understanding user experience, and the like. At present, the main research methods for text emotion classification are dictionary-based and corpus-based: information is mined from the corpus or dictionary and the emotional tendency of words is recognized, so that statistical data is obtained and the polarity of the words is judged. However, these two methods cannot distinguish the part of speech of new words, and because they do not judge at the semantic level, the accuracy of the resulting classification is low.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a method, a system, a device and a storage medium for text emotion classification, which can improve the accuracy of text classification.
The first technical scheme adopted by the invention is as follows: a text emotion classification method comprises the following steps:
preprocessing the text;
performing statistical calculation on the preprocessed text to obtain a text vector;
performing feature selection on the text vector using a chi-square statistical method, and extracting feature vectors;
performing weight calculation on the feature vectors to obtain the weight of each feature vector;
and classifying the text based on a support vector machine in combination with the weights of the feature vectors.
Further, the step of preprocessing the text specifically includes:
obtaining a text, filtering illegal characters of the text and performing word segmentation processing on the text;
and removing irrelevant words and counting word frequency to obtain the preprocessed text.
Further, the feature selection of the text vector by using the chi-square statistical method specifically adopts the following formula:
[Formula image not reproduced: chi-square statistic of t_i with respect to C_j]
where t_i is a feature item, C_j is a category, N is the total number of texts, A is the number of texts that contain t_i and belong to C_j, B is the number of texts that contain t_i but do not belong to C_j, C is the number of texts that belong to C_j but do not contain t_i, and D is the number of texts that neither belong to C_j nor contain t_i.
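The formula image itself is not available in this text. As a hedged reconstruction, the textbook chi-square statistic that matches the definitions of N, A, B, C and D given above is shown below; it is assumed, not confirmed, to be the form used in the original figure.

```latex
\chi^2(t_i, C_j) = \frac{N\,(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)},
\qquad N = A + B + C + D
```

Under this form the statistic is zero when the feature item occurs independently of the category and grows as the two become more strongly associated.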
Further, the weight calculation of the feature vectors to obtain the weight of each feature vector specifically adopts the following formula:
[Formula image not reproduced: weight w_ij]
where w_ij represents the weight, tf_ij represents the number of times t_i occurs in the text, and n_i represents the number of texts that contain t_i.
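The weight formula image is likewise not available. The symbols tf_ij and n_i, together with the total number of texts N defined earlier, suggest a TF-IDF style weight. The common form is given below as an assumption only; the original figure may use a smoothed or otherwise modified variant.

```latex
w_{ij} = tf_{ij} \times \log\!\left(\frac{N}{n_i}\right)
```

In this form a feature item that appears in every text (n_i = N) receives a weight of zero, while a feature item concentrated in a few texts receives a larger weight.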
Further, the performing weight calculation on the feature vectors to obtain the weight of each feature vector further includes performing normalization processing on the weight, specifically using the following formula:
[Formula image not reproduced: normalized weight]
where M represents the number of vectors.
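The normalization formula image is not available either. A common choice for TF-IDF weights is Euclidean (cosine) normalization, shown below under two assumptions that the source does not confirm: that the unnormalized weight is tf_ij × log(N / n_i), and that M counts the feature components over which the sum runs.

```latex
w_{ij} = \frac{tf_{ij} \times \log(N / n_i)}
              {\sqrt{\sum_{k=1}^{M} \left[\, tf_{kj} \times \log(N / n_k) \right]^2}}
```

With this normalization every text vector has unit length, so a long text and a short text with the same word proportions yield the same weights.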
Further, the step of selecting features of the text vector by using a chi-square statistical method to extract the feature vector specifically includes:
scoring the feature items of the text vector and sorting the feature items by score;
and obtaining text feature items according to a preset number, and extracting the feature vectors of the text using the chi-square statistical method.
Further, the irrelevant words include stop words, pronouns, quantifiers, auxiliary words, conjunctions, and vocabularies.
The second technical scheme adopted by the invention is as follows: a text sentiment classification system comprising:
the preprocessing module is used for preprocessing the text;
the statistical module is used for carrying out statistical calculation on the preprocessed text to obtain a text vector;
the feature module is used for performing feature selection on the text vectors using a chi-square statistical method and extracting the feature vectors;
the weighting module is used for carrying out weighting calculation on the feature vectors to obtain the weight of each feature vector;
and the classification module is used for classifying the texts based on the support vector machine by combining the weight of each feature vector.
The third technical scheme adopted by the invention is as follows: a text emotion classification apparatus comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the text emotion classification method described above.
The fourth technical scheme adopted by the invention is as follows: a storage medium storing processor-executable instructions which, when executed by a processor, are used to implement the text emotion classification method described above.
The method, system, device and storage medium have the following advantages: the text is expressed in vector form; emotion classification is realized by extracting features of the text and performing weight calculation on the extracted features; and the vector space model of the text, combined with the feature weights, is input into a support vector machine for classification, so that the accuracy of text emotion classification is improved.
Drawings
FIG. 1 is a flowchart of the steps of a method for classifying text emotion according to the present invention;
FIG. 2 is a block diagram of a text emotion classification system according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
For example, for comments on a product, an enterprise can directly extract the comment texts of all users and use the method to perform emotion classification on a large number of comment texts, so that the enterprise can quickly learn whether users approve of the product.
As shown in FIG. 1, the invention provides a text emotion classification method, which comprises the following steps:
S101, preprocessing the text;
Specifically, the purpose of text preprocessing is to extract the main content from the text corpus in a standard manner and to remove information irrelevant to text emotion classification. The main operations include filtering illegal characters, performing word segmentation and removing stop words; after word segmentation, the words can be labeled with emotion.
S102, performing statistical calculation on the preprocessed text to obtain a text vector;
Specifically, text is unstructured data composed of a large number of characters, and a computer cannot directly process character data. The content of ordinary text therefore needs to be converted into a form that the computer can read and process, that is, the text is given a formal representation.
S103, selecting features of the text vectors by adopting a chi-square statistical method, and extracting the feature vectors;
S104, performing weight calculation on the feature vectors to obtain the weight of each feature vector;
S105, classifying the text based on a support vector machine in combination with the weight of each feature vector.
Specifically, weight calculation on the feature vectors means assigning each feature item a weight according to its contribution to classification. Classification mainly uses a support vector machine, which is a binary classification model whose goal is to find a hyperplane that separates the samples. The separation principle is margin maximization, and the problem is finally converted into a convex quadratic programming problem to be solved.
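For reference, the margin-maximization problem mentioned above is usually written as the following convex quadratic program. This is the standard hard-margin formulation; the symbols x_k (weighted feature vector of text k), y_k (its label, +1 or -1), w and b are notation assumed here, and the source does not state whether a soft margin or a kernel is used.

```latex
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_k \,(w \cdot x_k + b) \ge 1, \qquad k = 1, \dots, n
```

The separating hyperplane is w · x + b = 0, and minimizing the norm of w is equivalent to maximizing the margin 2 / ||w|| between the two classes.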
Further, as a preferred embodiment of the method, the step of preprocessing the text specifically includes:
obtaining a text, filtering illegal characters of the text and performing word segmentation processing on the text;
and removing irrelevant words and counting word frequency to obtain the preprocessed text.
Specifically, the text data from which illegal characters have been filtered is segmented: long sentences are split into words, and the words can then be labeled with emotion.
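As an illustration only, the preprocessing step described above might be sketched in Python as follows. The sketch assumes whitespace splitting as a stand-in for a real word segmenter (Chinese text would need a dedicated segmentation tool), and the names `preprocess` and `IRRELEVANT_WORDS`, the character filter and the sample word list are hypothetical, not taken from the patent.

```python
import re
from collections import Counter

# Hypothetical irrelevant-word list; a real system would load stop words,
# pronouns, quantifiers and so on from a curated resource.
IRRELEVANT_WORDS = {"the", "a", "is", "it", "this", "and", "of"}


def preprocess(text, irrelevant_words=IRRELEVANT_WORDS):
    """Filter illegal characters, segment, drop irrelevant words, count word frequency."""
    # Assumed definition of "illegal": anything that is not a letter, digit,
    # CJK character or whitespace. Such characters are replaced by a space.
    cleaned = re.sub(r"[^0-9A-Za-z\u4e00-\u9fff\s]", " ", text)
    # Stand-in segmentation: lowercase and split on whitespace.
    words = cleaned.lower().split()
    kept = [w for w in words if w not in irrelevant_words]
    return Counter(kept)


freq = preprocess("This phone is great!!! Great battery, and the screen is great :)")
```

Here `freq` maps each remaining word to its frequency, which is the input the later statistical steps work from.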
Further, as a preferred embodiment of the method, the following formula is specifically adopted for feature selection of the text vector by using a chi-square statistical method:
[Formula image not reproduced: chi-square statistic of t_i with respect to C_j]
where t_i is a feature item, C_j is a category, N is the total number of texts, A is the number of texts that contain t_i and belong to C_j, B is the number of texts that contain t_i but do not belong to C_j, C is the number of texts that belong to C_j but do not contain t_i, and D is the number of texts that neither belong to C_j nor contain t_i.
Specifically, the algorithm uses the chi-square statistical method for feature selection. The chi-square statistic measures the correlation between feature t_i and document class C_j: the higher the statistic, the more information the feature carries and the greater its correlation with the class.
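A minimal sketch of how the counts A, B, C and D could be gathered and turned into a score is given below. It assumes the standard chi-square form N(AD - BC)^2 / ((A + C)(B + D)(A + B)(C + D)), which is consistent with the definitions above but is not confirmed by the missing formula image; the function name `chi_square` and the toy data are hypothetical.

```python
def chi_square(docs, labels, term, category):
    """Chi-square statistic of `term` with respect to `category`.

    docs   : list of sets of words, one set per text
    labels : list of category labels, one per text
    """
    a = b = c = d = 0
    for words, label in zip(docs, labels):
        has_term = term in words
        in_category = label == category
        if has_term and in_category:
            a += 1      # contains the term and belongs to the category
        elif has_term:
            b += 1      # contains the term but belongs to another category
        elif in_category:
            c += 1      # belongs to the category but lacks the term
        else:
            d += 1      # neither contains the term nor belongs to the category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0      # degenerate table: the term or category never varies
    return n * (a * d - b * c) ** 2 / denominator


docs = [{"great", "phone"}, {"great", "screen"}, {"bad", "battery"}, {"bad", "phone"}]
labels = ["pos", "pos", "neg", "neg"]
score_great = chi_square(docs, labels, "great", "pos")
score_phone = chi_square(docs, labels, "phone", "pos")
```

In this toy data "great" appears only in positive texts and scores 4.0, while "phone" is spread evenly over both classes and scores 0.0.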
Further, as a preferred embodiment of the method, the weight calculation of the feature vectors is performed to obtain the weight of each feature vector by using the following formula:
[Formula image not reproduced: weight w_ij]
where w_ij represents the weight, tf_ij represents the number of times t_i occurs in the text, and n_i represents the number of texts that contain t_i.
Specifically, the feature selection process selects the feature vectors that best represent the text content, but these features influence text classification to different degrees. The selected features therefore need to be weighted: features with strong class-discriminating ability are given larger weights and features with weak class-discriminating ability are given smaller weights, which effectively suppresses noise.
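The weighting idea can be sketched as follows, assuming the common TF-IDF form w_ij = tf_ij × log(N / n_i). The exact formula in the original figure is not available, so this form, the function name `tfidf_weights` and the sample counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_weights(doc_counts, features):
    """Return one weight vector per text, with components ordered like `features`.

    doc_counts : list of word-frequency mappings, one per text
    features   : list of selected feature items
    """
    n_texts = len(doc_counts)
    # n_i: number of texts that contain feature item t_i
    doc_freq = {
        t: sum(1 for counts in doc_counts if counts.get(t, 0) > 0)
        for t in features
    }
    vectors = []
    for counts in doc_counts:
        vec = []
        for t in features:
            if doc_freq[t] == 0:
                vec.append(0.0)  # feature never occurs in the collection
            else:
                vec.append(counts.get(t, 0) * math.log(n_texts / doc_freq[t]))
        vectors.append(vec)
    return vectors


doc_counts = [Counter({"great": 2, "phone": 1}), Counter({"bad": 1, "phone": 1})]
weights = tfidf_weights(doc_counts, ["great", "bad", "phone"])
```

"phone" occurs in both texts and therefore gets weight 0 in each, while "great" and "bad" each occur in only one text and keep a positive weight there.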
Further, as a preferred embodiment of the method, the calculating the weight of the feature vector to obtain the weight of each feature vector further includes normalizing the weight, specifically using the following formula:
[Formula image not reproduced: normalized weight]
where M represents the number of vectors.
Specifically, in order to eliminate the influence of the text length on the feature weight, the weight of the feature is normalized.
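One way to remove the effect of text length is to scale each weight vector to unit Euclidean length. The sketch below assumes this cosine-style normalization, which is common with TF-IDF weights but is not confirmed by the missing formula image; the name `normalize` and the sample vectors are hypothetical.

```python
import math


def normalize(vec):
    """Scale a weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec))
    if norm == 0:
        return list(vec)  # all-zero vector: nothing to scale
    return [w / norm for w in vec]


# Two texts with the same word proportions but a tenfold difference in length.
short_text = normalize([3.0, 4.0])
long_text = normalize([30.0, 40.0])
```

Both vectors normalize to the same direction, so the longer text no longer dominates simply because its raw counts are larger.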
Further, as a preferred embodiment of the method, the step of selecting the feature of the text vector by using a chi-square statistical method to extract the feature vector specifically includes:
scoring the feature items of the text vector and sorting the feature items by score;
and obtaining text feature items according to a preset number, and extracting the feature vectors of the text using the chi-square statistical method.
Specifically, the number of features may reach tens of thousands of dimensions, which not only lengthens the computation time but also greatly reduces classification accuracy. Feature selection picks a small subset of features from the original high-dimensional feature set to serve as the classifier's classification features. During feature selection, each feature is scored by a constructed evaluation function, the features are sorted in descending order of score, and finally a certain number of features are selected as the classification feature set.
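The score, sort and truncate procedure described above can be sketched in a few lines. The function name `select_top_features`, the alphabetical tie-break and the sample scores are assumptions for illustration; the source only states that features are sorted in descending order of score and a preset number is kept.

```python
def select_top_features(scores, k):
    """Return the k feature items with the highest scores, best first.

    scores : mapping from feature item to its evaluation score
    k      : preset number of features to keep
    """
    # Sort by descending score; ties are broken alphabetically (an assumption)
    # so that the result is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


scores = {"great": 4.0, "bad": 4.0, "phone": 0.0, "screen": 1.3}
selected = select_top_features(scores, 3)
```

With these scores the uninformative item "phone" is dropped and the three remaining items form the classification feature set.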
Further, as a preferred embodiment of the method, the irrelevant words include stop words, pronouns, quantifiers, auxiliary words, conjunctions and vocabularies.
Specifically, the type of the irrelevant word can be set according to needs, and options such as prepositions, pure numbers and the like can be added.
The specific embodiment of the invention is as follows:
A comment text of a user is obtained; illegal characters are filtered from it, word segmentation is performed, and irrelevant words are removed to obtain the main text data. The number of times each word appears in the text is counted and the words are labeled with emotion. Combining the preprocessing result, the word frequency information and the emotion labels, feature selection is performed on the text using the chi-square statistical method: the features are scored, sorted in descending order of score, and selected according to a preset number. Weights are calculated for the selected features and normalized. Finally, the text is represented as a vector space model and, combined with the normalized feature weight vectors, a large batch of texts is classified using a support vector machine classifier.
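The final classification step of this embodiment might look like the sketch below. The source does not specify the kernel or the solver, so the sketch swaps in a linear support vector machine trained by subgradient descent on the regularized hinge loss instead of solving the quadratic program directly. The function names, the hyperparameters and the toy weight vectors are hypothetical.

```python
def train_linear_svm(vectors, labels, epochs=200, lr=0.1, reg=0.01):
    """Train a linear SVM by subgradient descent on the regularized hinge loss.

    vectors : list of equal-length feature weight vectors
    labels  : list of class labels, each +1 or -1
    """
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w on every step.
            w = [wi - lr * reg * wi for wi in w]
            if margin < 1:
                # The sample is misclassified or inside the margin:
                # move the hyperplane away from it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy weight vectors: component 0 stands for a positive word,
# component 1 for a negative word (hypothetical values).
train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_y = [1, 1, -1, -1]
w, b = train_linear_svm(train_x, train_y)
predictions = [predict(w, b, x) for x in train_x]
```

In practice the input vectors would be the normalized feature weight vectors produced by the earlier steps, and a library solver would normally replace this hand-written training loop.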
As shown in fig. 2, a text emotion classification system includes:
the preprocessing module is used for preprocessing the text;
the statistical module is used for carrying out statistical calculation on the preprocessed text to obtain a text vector;
the feature module is used for performing feature selection on the text vectors using a chi-square statistical method and extracting the feature vectors;
the weighting module is used for carrying out weighting calculation on the feature vectors to obtain the weight of each feature vector;
and the classification module is used for classifying the texts based on the support vector machine by combining the weight of each feature vector.
As a further preferred embodiment of the present system, the preprocessing module further includes:
the word segmentation submodule is used for acquiring the text, filtering illegal characters of the text and carrying out word segmentation processing on the text;
the removing submodule is used for removing irrelevant words and counting word frequency to obtain a preprocessed text.
As a further preferred embodiment of the present system, the feature module further comprises:
the sorting submodule is used for grading the feature items of the text vector and sorting the feature items according to the grading size;
and the extraction submodule is used for obtaining the text feature items according to the preset quantity and extracting the feature vector of the text by adopting a chi-square statistical method.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
A text emotion classification device, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the text emotion classification method described above.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
A storage medium storing processor-executable instructions which, when executed by a processor, are used to implement the text emotion classification method described above.
The contents in the above method embodiments are all applicable to the present storage medium embodiment, the functions specifically implemented by the present storage medium embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present storage medium embodiment are also the same as those achieved by the above method embodiments.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A text emotion classification method is characterized by comprising the following steps:
preprocessing the text;
performing statistical calculation on the preprocessed text to obtain a text vector;
performing feature selection on the text vector using a chi-square statistical method, and extracting feature vectors;
performing weight calculation on the feature vectors to obtain the weight of each feature vector;
and classifying the text based on a support vector machine in combination with the weights of the feature vectors.
2. The method for classifying emotion of text according to claim 1, wherein the step of preprocessing the text specifically includes:
obtaining a text, filtering illegal characters of the text and performing word segmentation processing on the text;
and removing irrelevant words and counting word frequency to obtain the preprocessed text.
3. The method for classifying emotion of text according to claim 1, wherein the feature selection of the text vector by using the chi-square statistical method specifically uses the following formula:
[Formula image not reproduced: chi-square statistic of t_i with respect to C_j]
where t_i is a feature item, C_j is a category, N is the total number of texts, A is the number of texts that contain t_i and belong to C_j, B is the number of texts that contain t_i but do not belong to C_j, C is the number of texts that belong to C_j but do not contain t_i, and D is the number of texts that neither belong to C_j nor contain t_i.
4. The method of claim 3, wherein the weight calculation of the feature vectors is performed to obtain the weight of each feature vector by using the following formula:
[Formula image not reproduced: weight w_ij]
where w_ij represents the weight, tf_ij represents the number of times t_i occurs in the text, and n_i represents the number of texts that contain t_i.
5. The method of classifying text emotions according to claim 4, wherein the calculating the weights of the feature vectors to obtain the weights of the feature vectors further comprises normalizing the weights, specifically using the following formula:
[Formula image not reproduced: normalized weight]
where M represents the number of vectors.
6. The method for classifying emotion of text according to claim 1, wherein said step of extracting feature vectors by selecting features of text vectors using chi-square statistical method specifically comprises:
scoring the feature items of the text vector and sorting the feature items by score;
and obtaining text feature items according to a preset number, and extracting the feature vectors of the text using the chi-square statistical method.
7. The method for classifying emotion of text according to claim 1, wherein: the irrelevant words comprise stop words, pronouns, quantifiers, auxiliary words, conjunctions and vocabularies.
8. A text sentiment classification system, comprising:
the preprocessing module is used for preprocessing the text;
the statistical module is used for carrying out statistical calculation on the preprocessed text to obtain a text vector;
the feature module is used for performing feature selection on the text vectors using a chi-square statistical method and extracting the feature vectors;
the weighting module is used for carrying out weighting calculation on the feature vectors to obtain the weight of each feature vector;
and the classification module is used for classifying the texts based on the support vector machine by combining the weight of each feature vector.
9. A text emotion classification device, characterized by comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the text emotion classification method as claimed in any one of claims 1 to 7.
10. A storage medium storing processor-executable instructions, characterized in that the processor-executable instructions, when executed by a processor, are used to implement the text emotion classification method as claimed in any one of claims 1 to 7.
CN201911410177.8A 2019-12-31 2019-12-31 Text emotion classification method, system and device and storage medium Pending CN111159410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911410177.8A CN111159410A (en) 2019-12-31 2019-12-31 Text emotion classification method, system and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911410177.8A CN111159410A (en) 2019-12-31 2019-12-31 Text emotion classification method, system and device and storage medium

Publications (1)

Publication Number Publication Date
CN111159410A true CN111159410A (en) 2020-05-15

Family

ID=70559884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911410177.8A Pending CN111159410A (en) 2019-12-31 2019-12-31 Text emotion classification method, system and device and storage medium

Country Status (1)

Country Link
CN (1) CN111159410A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117408652A (en) * 2023-12-15 2024-01-16 江西驱动交通科技有限公司 File data analysis and management method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103995876A (en) * 2014-05-26 2014-08-20 上海大学 Text classification method based on chi square statistics and SMO algorithm
CN107590134A (en) * 2017-10-26 2018-01-16 福建亿榕信息技术有限公司 Text sentiment classification method, storage medium and computer
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF

Similar Documents

Publication Publication Date Title
CN107291723B (en) Method and device for classifying webpage texts and method and device for identifying webpage texts
US7689531B1 (en) Automatic charset detection using support vector machines with charset grouping
US20200311113A1 (en) Method and device for extracting core word of commodity short text
CN108509629B (en) Text emotion analysis method based on emotion dictionary and support vector machine
CN109101478B (en) Aspect-level emotion analysis method for E-commerce comment text
US7711673B1 (en) Automatic charset detection using SIM algorithm with charset grouping
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN110705286A (en) Comment information-based data processing method and device
Probierz et al. Rapid detection of fake news based on machine learning methods
US8560466B2 (en) Method and arrangement for automatic charset detection
CN112069312B (en) Text classification method based on entity recognition and electronic device
Rasheed et al. Urdu text classification: a comparative study using machine learning techniques
CN111144106A (en) Two-stage text feature selection method under unbalanced data set
CN112527958A (en) User behavior tendency identification method, device, equipment and storage medium
CN110287493B (en) Risk phrase identification method and device, electronic equipment and storage medium
Farhoodi et al. N-gram based text classification for Persian newspaper corpus
CN114722198A (en) Method, system and related device for determining product classification code
Karo et al. Karonese sentiment analysis: a new dataset and preliminary result
CN113626604A (en) Webpage text classification system based on maximum interval criterion
CN111310467B (en) Topic extraction method and system combining semantic inference in long text
CN110888983B (en) Positive and negative emotion analysis method, terminal equipment and storage medium
CN111159410A (en) Text emotion classification method, system and device and storage medium
CN115827867A (en) Text type detection method and device
CN113095073B (en) Corpus tag generation method and device, computer equipment and storage medium
CN114896398A (en) Text classification system and method based on feature selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination