CN110287328B - Text classification method, device and equipment and computer readable storage medium - Google Patents

Text classification method, device and equipment and computer readable storage medium

Info

Publication number
CN110287328B
CN110287328B (application CN201910594623.9A)
Authority
CN
China
Prior art keywords
text
training
word
feature
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910594623.9A
Other languages
Chinese (zh)
Other versions
CN110287328A (en)
Inventor
谢宝钢
谢胜利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201910594623.9A priority Critical patent/CN110287328B/en
Publication of CN110287328A publication Critical patent/CN110287328A/en
Application granted granted Critical
Publication of CN110287328B publication Critical patent/CN110287328B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method comprising the following steps: receiving a text to be classified, and mapping it into a target feature vector according to a feature item set obtained by training; the feature item set is obtained by training on a training text data set with a combination of a word segmentation algorithm, a calculation algorithm that computes feature weights as the product of word frequency and inverse document frequency, and an information gain algorithm; calculating the Euclidean distances between the target feature vector and the feature vectors of all texts in the training text data set; selecting neighbor texts of the text to be classified according to the Euclidean distances; calculating, based on the neighbor texts, the weight of the text to be classified for each category of text in the text category set using a K nearest neighbor algorithm; and determining the text category of the text to be classified according to the weights. The invention greatly improves the accuracy of text classification, shortens the classification time and greatly reduces the cost. The invention also discloses a text classification device, equipment and a storage medium with corresponding technical effects.

Description

Text classification method, device and equipment and computer readable storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a text classification method, apparatus, device, and computer-readable storage medium.
Background
With the rapid development of network technologies and of social software such as microblogs, WeChat and QQ, text has become an important form of information, and users increasingly demand to find relevant information rapidly, accurately and comprehensively. Text classification is one of the basic tasks in natural language processing and generally includes text representation, classifier selection and training, and evaluation of and feedback on classification results.
Existing text classification approaches mainly include a text classification method based on multi-dimensional feature selection with an integrated statistical learning method and a deep learning method, and a text classification method based on a fast text classification model and a convolutional neural network model. The first approach selects feature words across multiple dimensions and then classifies with a neural network classifier, which can improve the accuracy and stability of text classification to a certain extent; its drawback is a complex and time-consuming preprocessing stage. In the second approach, word segmentation must be performed manually, so much time is needed to train on the observed data. Moreover, different people understand feature words differently, so manual word segmentation varies from person to person and is easily influenced by subjective factors. As a result, the final classification accuracy is low, the computation cost is too high, and the time consumed is too long.
In summary, how to effectively address the long running time, high labor cost and low classification accuracy of existing text classification methods is a problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a text classification method, which greatly improves the accuracy of text classification, greatly shortens the classification duration and greatly reduces the cost; another object of the present invention is to provide a text classification apparatus, a device and a computer readable storage medium.
In order to solve the technical problems, the invention provides the following technical scheme:
a method of text classification, comprising:
receiving a text to be classified, and mapping the text to be classified into a target feature vector of a target dimension according to a feature item set obtained by pre-training on each text in a training text data set; the feature item set is obtained by training on the training text data set with a combination of a word segmentation algorithm, a calculation algorithm that computes feature weights as the product of word frequency and inverse document frequency, and an information gain algorithm;
calculating the Euclidean distances between the target feature vector and the feature vectors of all texts in the training text data set, and sorting the Euclidean distances by size;
selecting the texts corresponding to the first preset number of Euclidean distances from the small-distance end of the sorted order as the neighbor texts of the text to be classified;
calculating, based on the neighbor texts, the weight of the text to be classified for each category of text in the text category set using a K nearest neighbor algorithm; the text category set is obtained by classifying the texts in the training text data set in advance according to the feature item set;
and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
In a specific embodiment of the present invention, a training process of training the training text data set to obtain the feature item set and the text category set includes:
utilizing a jieba word segmentation algorithm to perform word segmentation on each text in the training text data set respectively to obtain a word segmentation set of each text;
calculating the word frequency and the inverse document frequency of each word in each word segmentation set, and calculating the product of the word frequency and the inverse document frequency corresponding to each word to obtain the feature weight corresponding to each word in each word segmentation set;
sorting all words in the word segmentation set corresponding to each text by feature weight, selecting a second preset number of words from the large-weight end of each text's sorted order as the initially selected feature words of that text, and combining the initially selected feature words to obtain the initially selected feature word set of the training text data set;
calculating the information gain value of each initially selected feature word in the initially selected feature word set by using the information gain algorithm, and sorting the information gain values;
selecting a third preset number of initially selected feature words from the large-gain end of the sorted order as final feature words to obtain a feature item set formed by the final feature words, and classifying all texts in the training text data set according to the final feature words to obtain the text category set.
In a specific embodiment of the present invention, the segmenting each text in the training text data set by using a jieba word segmentation algorithm includes:
and segmenting words of each text in the training text data set by utilizing an accurate mode of a jieba word segmentation algorithm.
In a specific embodiment of the present invention, before performing word segmentation on each text in the training text data set by using a jieba word segmentation algorithm, the method further includes:
and removing illegal format characters of each text in the training text data set.
In an embodiment of the present invention, after obtaining the set of word segments of each text, the method further includes:
and removing stop words in the word segmentation set of each text.
A text classification apparatus comprising:
the feature vector mapping module is used for receiving a text to be classified and mapping it into a target feature vector of a target dimension according to a feature item set obtained by pre-training on each text in a training text data set; the feature item set is obtained by training on the training text data set with a combination of a word segmentation algorithm, a calculation algorithm that computes feature weights as the product of word frequency and inverse document frequency, and an information gain algorithm;
the distance sorting module is used for calculating the Euclidean distances between the target feature vector and the feature vectors of all texts in the training text data set and sorting the Euclidean distances;
the neighbor text obtaining module is used for selecting the texts corresponding to the first preset number of Euclidean distances from the small-distance end of the sorted order as the neighbor texts of the text to be classified;
the weight calculation module is used for calculating, based on the neighbor texts, the weight of the text to be classified for each category of text in the text category set using a K nearest neighbor algorithm; the text category set is obtained by classifying the texts in the training text data set in advance according to the feature item set;
and the text category determining module is used for determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
In one embodiment of the invention, the apparatus comprises a training module comprising:
the word segmentation characteristic word obtaining submodule is used for performing word segmentation on each text in the training text data set by utilizing a jieba word segmentation algorithm to obtain a word segmentation set of each text;
the feature weight obtaining submodule is used for calculating the word frequency and the inverse document frequency of each word in each word segmentation set, and calculating their product to obtain the feature weight corresponding to each word in each word segmentation set;
the feature word set obtaining submodule is used for sorting all words in the word segmentation set corresponding to each text by feature weight, selecting a second preset number of words from the large-weight end of each text's sorted order as the initially selected feature words of that text, and combining the initially selected feature words to obtain the initially selected feature word set of the training text data set;
the gain value sorting submodule is used for calculating the information gain value of each initially selected feature word in the initially selected feature word set by using the information gain algorithm and sorting the information gain values;
and the feature item set and text category set submodule is used for selecting a third preset number of the primarily selected feature words from one end with a large gain value in the gain value sequence as final feature words to obtain a feature item set consisting of the final feature words, and classifying each text in the training text data set according to the final feature words to obtain the text category set.
In one embodiment of the present invention, the word segmentation characteristic word obtaining sub-module includes a word segmentation unit,
the word segmentation unit is specifically a unit for performing word segmentation on each text in the training text data set by using an accurate mode of a jieba word segmentation algorithm.
A text classification apparatus comprising:
a memory for storing a computer program;
a processor for implementing the steps of the text classification method as described above when executing the computer program.
A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the text classification method as described above.
The application provides a text classification method comprising the following steps: receiving a text to be classified, and mapping it into a target feature vector of a target dimension according to a feature item set obtained by pre-training on each text in a training text data set; the feature item set is obtained by training on the training text data set with a combination of a word segmentation algorithm, a calculation algorithm that computes feature weights as the product of word frequency and inverse document frequency, and an information gain algorithm; calculating the Euclidean distances between the target feature vector and the feature vector of each text in the training text data set, and sorting the Euclidean distances; selecting the texts corresponding to the first preset number of Euclidean distances from the small-distance end of the sorted order as the neighbor texts of the text to be classified; calculating, based on the neighbor texts, the weight of the text to be classified for each category of text in the text category set using a K nearest neighbor algorithm; the text category set is obtained by classifying the training text data set in advance according to the feature item set; and determining the text category corresponding to the maximum weight as the text category of the text to be classified.
According to this technical scheme, the training text data set is trained in advance with a combination of a word segmentation algorithm, a calculation algorithm that computes feature weights as the product of word frequency and inverse document frequency, and an information gain algorithm, to obtain a feature item set formed by feature words that carry more classification information. This ensures the effectiveness of the feature item set for classifying text and greatly improves the accuracy of text classification. The text to be classified is mapped into a target feature vector directly with the obtained feature item set, and the target feature vector is processed with a K nearest neighbor algorithm to classify the text. The classification time is greatly shortened, and since the whole process requires no human involvement, the cost is greatly reduced.
Correspondingly, the embodiment of the invention also provides a text classification device, equipment and a computer readable storage medium corresponding to the text classification method, which have the technical effects and are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart illustrating an implementation of a text classification method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating another embodiment of a text classification method according to the present invention;
FIG. 3 is a flowchart illustrating another implementation of a text classification method according to an embodiment of the present invention;
FIG. 4 is a block diagram of a text classification apparatus according to an embodiment of the present invention;
fig. 5 is a block diagram of a text classification device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first embodiment is as follows:
referring to fig. 1, fig. 1 is a flowchart of an implementation of a text classification method according to an embodiment of the present invention, where the method may include the following steps:
s101: receiving a text to be classified, and mapping the text to be classified into a target feature vector of a target dimension according to a feature item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training a training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm.
The training text data set can be preset. Each text in the training text data set is segmented with a preset word segmentation algorithm, the feature weight of each word in each text is computed as the product of word frequency and inverse document frequency, and a feature item set consisting of feature words that carry more classification information is obtained in combination with an information gain algorithm. When a text to be classified is received (unstructured data, such as a post in an online community), it can be mapped into a target feature vector of a target dimension according to the feature item set obtained by pre-training on each text in the training text data set.
The target dimension is calculated from the characteristics of the text to be classified; texts to be classified with different characteristics correspond to different numbers of dimensions.
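As a concrete illustration of this mapping step, the following minimal Python sketch (not the patent's exact implementation; the feature words and tokens are hypothetical) maps a segmented text onto a trained feature item set, producing one weight per feature word:

```python
# Sketch: map a segmented text onto the trained feature item set.
# Here the weight of each feature word is its term frequency in the text;
# the patent uses TF-IDF-derived weights, so this is a simplification.
from collections import Counter

def to_feature_vector(tokens, feature_items):
    """tokens: words of the text to classify; feature_items: trained feature words."""
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    # one component per feature word, 0.0 when the word is absent
    return [counts[w] / total for w in feature_items]

features = ["price", "goal", "election"]           # hypothetical feature item set
vec = to_feature_vector(["goal", "goal", "match"], features)
print(vec)  # [0.0, 0.6666666666666666, 0.0]
```

The vector's dimension equals the size of the feature item set, matching the role of the target dimension above.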
S102: and calculating Euclidean distances between the target characteristic vector and the characteristic vector of each text in the training text data set, and sequencing the Euclidean distances.
After the target feature vector corresponding to the text to be classified is obtained, the Euclidean distance between the target feature vector and the feature vector of each text in the training text data set can be calculated, and the Euclidean distances are sorted, so that a sorting result is obtained. Specifically, the euclidean distances may be sorted from small to large, or the euclidean distances may be sorted from large to small, which is not limited in the embodiment of the present invention.
S103: and selecting the texts corresponding to the first preset number of Euclidean distances at the end with the small Euclidean distance in the sequence as the neighbor texts of the texts to be classified.
After the sorting result is obtained by sorting the Euclidean distances according to the sizes, texts corresponding to the first preset number of Euclidean distances at the end with the smaller Euclidean distance in the sorting can be selected, and the first preset number of texts are used as the neighbor texts of the texts to be classified.
It should be noted that the first preset number may be set and adjusted according to actual situations, which is not limited in the embodiment of the present invention.
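Steps S102 and S103 can be sketched together as follows; the vectors and the choice of K are illustrative only:

```python
# Sketch of S102-S103: Euclidean distances from the target vector to every
# training vector, sorted ascending, then the K closest texts as neighbors.
import math

def k_nearest(target, training_vectors, k):
    dists = [(math.dist(target, v), idx) for idx, v in enumerate(training_vectors)]
    dists.sort()                      # ascending: smallest distance first
    return dists[:k]                  # (distance, training-text index) pairs

train = [[0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
print(k_nearest([0.0, 0.0], train, 2))  # [(1.0, 0), (1.4142135623730951, 1)]
```

Sorting ascending and taking the first k items corresponds to picking texts from the small-distance end of the sorted order.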
S104: and calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text.
The text classification set is obtained by classifying each text in the training text data set in advance according to the feature item set.
After the feature item set is obtained by training the training text data set in advance, the training text data set can be further classified according to the feature item set to obtain a text category set. After obtaining each Neighbor text of the text to be classified, the weights of the text to be classified to each type of text in the text category set can be calculated by using a K-Nearest Neighbor algorithm (KNN) based on each Neighbor text.
Assume the text category set is c = {c1, c2, ..., ck}. The weight of the text d to be classified for each category of text in the text category set can be calculated by the following formula:

w(d, c_i) = Σ_{d_j ∈ KNN(d)} sim(d, d_j) · y(d_j, c_i)

where w(d, c_i) denotes the weight of the text d to be classified for the text category c_i, KNN(d) is the set of the K neighbor texts with the smallest Euclidean distance to d, sim(d, d_j) denotes the similarity, computed from the Euclidean distance, between the text d to be classified and a neighbor text d_j in KNN(d), and y(d_j, c_i) is a Boolean variable whose value is 1 when the neighbor text d_j belongs to the text category c_i and 0 otherwise.
S105: and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
After the weights of the text to be classified for the categories in the text category set have been calculated from the neighbor texts, the text category corresponding to the maximum weight can be determined as the text category of the text to be classified, so that the category is obtained quickly and accurately.
According to this technical scheme, the training text data set is trained in advance with a combination of a word segmentation algorithm, a calculation algorithm that computes feature weights as the product of word frequency and inverse document frequency, and an information gain algorithm, to obtain a feature item set formed by feature words that carry more classification information. This ensures the effectiveness of the feature item set for classifying text and greatly improves the accuracy of text classification. The text to be classified is mapped into a target feature vector directly with the obtained feature item set, and the target feature vector is processed with a K nearest neighbor algorithm to classify the text. The classification time is greatly shortened, and since the whole process requires no human involvement, the cost is greatly reduced.
It should be noted that, based on the first embodiment, the embodiment of the present invention further provides a corresponding improvement scheme. In the following embodiments, steps that are the same as or correspond to those in the first embodiment may be referred to each other, and corresponding advantageous effects may also be referred to each other, which are not described in detail in the following modified embodiments.
Example two:
referring to fig. 2, fig. 2 is a flowchart of another implementation of the text classification method in the embodiment of the present invention, where the method may include the following steps:
s201: and removing illegal format characters of each text in the training text data set.
In the process of training the preset training text data set, the illegal format characters of each text in the training text data set can be removed first, so as to preprocess the training text data set. Since text data acquired from web pages is basically stored in HTML format, an HTML file usually contains many tags representing format information; these tags are collectively called "illegal format characters". Such characters are not limited to tags, and also include emoticons, web addresses and the like. The more thoroughly the "illegal format characters" are removed, the more obvious the classification effect, since their interference with text classification is avoided.
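The preprocessing described above might be sketched as follows; the regular expressions are illustrative stand-ins, not the patent's actual patterns:

```python
# Sketch of S201: strip "illegal format characters" (HTML tags, URLs and
# other non-text noise) with simple regular expressions.
import re

def clean_text(raw):
    raw = re.sub(r"<[^>]+>", " ", raw)            # HTML tags
    raw = re.sub(r"https?://\S+", " ", raw)       # web addresses
    raw = re.sub(r"\s+", " ", raw).strip()        # collapse leftover whitespace
    return raw

print(clean_text("<p>hello</p> see http://example.com now"))  # hello see now
```

A production cleaner would also handle emoticons and HTML entities, but the shape of the step is the same.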
S202: and performing word segmentation on each text in the training text data set by using a jieba word segmentation algorithm to obtain a word segmentation set of each text.
After removing the illegal format characters of each text in the training text data set, the jieba word segmentation algorithm can be used to segment each text in the training text data set to obtain a word segmentation set of each text; that is, each text is segmented into a set of individual words, so that the meaning of the original text can be expressed to the maximum extent.
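A minimal sketch of precise-mode segmentation with the jieba package (a third-party library, `pip install jieba`); the sample sentence is illustrative, and a character-level fallback keeps the sketch runnable when jieba is absent:

```python
# Sketch of S202: segment a text with jieba's precise mode
# (cut_all=False, which is also jieba's default mode).
try:
    import jieba
    tokens = jieba.lcut("文本分类方法", cut_all=False)   # precise mode
except ImportError:
    tokens = list("文本分类方法")                        # fallback: per-character split
print(tokens)
```

Precise mode produces a non-overlapping segmentation, unlike jieba's full mode, which emits every word the dictionary can match.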
S203: and removing stop words in the participle set of each text.
After each text in the training text data set has been segmented with the jieba word segmentation algorithm to obtain its word segmentation set, each text has been split into a set of individual words. From the viewpoint of natural language processing (NLP), however, the main content of a text is carried by verbs, nouns, adjectives and the like, while the adverbs, punctuation marks and similar stop words that remain in the word segmentation set carry little classification information; removing them therefore reduces noise without losing meaning.
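Stop-word removal as described can be sketched as follows; the stop list is a tiny illustrative stand-in for a real stop-word dictionary:

```python
# Sketch of S203: filter out stop words that carry little classification
# information. Real systems load a stop-word dictionary file instead.
STOP_WORDS = {"the", "is", "a", "of", "的", "了", ",", "。"}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "match", "is", "exciting"]))  # ['match', 'exciting']
```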
S204: calculating the word frequency and the inverse document frequency of each word in each word segmentation set, and calculating the product of the word frequency and the inverse document frequency corresponding to each word to obtain the feature weight corresponding to each word in each word segmentation set.
After the stop words in the word segmentation sets have been removed, the word frequency and the inverse document frequency of each word in each word segmentation set can be calculated, and their product gives the feature weight corresponding to each word in each word segmentation set.
The term frequency (TF) of a given word in a given text is the frequency with which the word appears in that text; it is a normalization of the raw word count that prevents a bias toward long texts. The importance of a word to a text can be expressed by its word frequency, given by the formula:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} denotes the number of occurrences of the feature word in the text and Σ_k n_{k,j} denotes the total number of occurrences of all words in the text.
Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a specific word is obtained by dividing the total number of texts by the number of texts containing the feature word and then taking the logarithm of the quotient, as in the following formula:
idf_i = log( N / (N_k + 1) )

where N is the total number of texts in the text data set and N_k is the number of texts containing the feature word; 1 is usually added to the denominator to avoid division by zero when no text contains the feature word.
Thus, for each word, its corresponding feature weight can be represented by the following formula:
w_{i,j} = tf_{i,j} × idf_i
according to the method and the device, the characteristic weight corresponding to each word is calculated through the product of the word frequency corresponding to each word segmentation characteristic word and the reverse file frequency, and the accuracy of text representation is improved.
S205: sorting all words in the word segmentation set corresponding to each text by feature weight, selecting a second preset number of words from the large-weight end of each text's sorted order as the initially selected feature words of that text, and combining the initially selected feature words to obtain the initially selected feature word set of the training text data set.
After the feature weights corresponding to the words in the word segmentation sets are obtained, the words in the word segmentation set corresponding to each text can be sorted by feature weight, and the second preset number of words with the largest weights are selected as the initially selected feature words of each text; the initially selected feature words are then combined to obtain the initially selected feature word set of the training text data set. After the initially selected feature words of each text are obtained, the originally unstructured text data can be represented with a space vector model: each text in the training text data set is converted into an n-dimensional vector in a vector space. Such a vector usually has high dimensionality and is sparse, and is formally expressed as follows:
$$doc_i = (m_1, m_2, m_3, \ldots, m_j, \ldots, m_n);$$
where $doc_i$ represents the $i$-th text in the training text data set, and $m_j$ represents the weight of the $j$-th feature in the vector representation of the $i$-th text.
It should be noted that the second preset number may be set and adjusted according to actual situations, which is not limited in the embodiment of the present invention.
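The per-text weight ranking of step S205 and the space vector model representation can be sketched as follows (a minimal illustration; `k` stands in for the "second preset number", and all names are ours):

```python
def select_initial_features(weights_per_text, k):
    """Take the k highest-weighted words of each text as its initially
    selected feature words, then merge them into one feature word set
    for the whole training text data set."""
    per_text, merged = [], set()
    for weights in weights_per_text:
        top = sorted(weights, key=weights.get, reverse=True)[:k]
        per_text.append(top)
        merged.update(top)
    return per_text, sorted(merged)

def to_vector(weights, feature_words):
    """Space vector model: the j-th component of a text's vector is its
    weight for the j-th feature word, or 0 if the word is absent --
    hence the resulting vectors are high-dimensional and sparse."""
    return [weights.get(w, 0.0) for w in feature_words]
```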
S206: and calculating the information gain value of each initially selected feature word in the initially selected feature word set by using an information gain algorithm, and sorting the information gain values.
After the above steps, in order to further reduce the number of feature words while retaining, as far as possible, the initially selected feature words that carry the most text classification information, an Information Gain (IG) algorithm may be used to calculate the information gain value of each initially selected feature word in the initially selected feature word set, and the information gain values are sorted; this further improves the accuracy and stability of subsequent text classification. The information gain algorithm is a feature selection algorithm based on information entropy: for a given initially selected feature word, the amount of information in the training text data set is calculated both with and without that feature word, and the difference between the two quantities indicates the importance of the feature word to text classification. The larger the difference, the larger the information gain value and the stronger the classification capability of the initially selected feature word; conversely, the smaller the difference, the weaker its classification capability. The information gain value of an initially selected feature word t with respect to the training text data set is calculated by the following formula:
$$IG(t) = -\sum_{i=1}^{m} p(c_i)\log p(c_i) + p(t)\sum_{i=1}^{m} p(c_i \mid t)\log p(c_i \mid t) + p(\bar{t})\sum_{i=1}^{m} p(c_i \mid \bar{t})\log p(c_i \mid \bar{t})$$
where $m$ represents the number of text categories after each text in the training text data set is pre-classified according to the initially selected feature word set, $p(c_i)$ represents the probability of category $c_i$ in the training text data set, $p(t)$ represents the probability that a text in the training text data set contains the initially selected feature word $t$, $p(c_i \mid t)$ represents the conditional probability that a text belongs to category $c_i$ given that $t$ appears in it, and $p(c_i \mid \bar{t})$ represents the conditional probability that a text belongs to category $c_i$ given that $t$ does not appear in it.
The information gain values may be sorted from small to large or from large to small; this is not limited in the embodiment of the present invention.
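Under the assumption that each training text carries a category label and a flag for whether it contains the candidate word t, the information gain formula above can be sketched as (names are illustrative):

```python
import math
from collections import Counter

def information_gain(labels, contains_t):
    """Information gain of one candidate feature word t.
    `labels` holds the category of each training text; `contains_t`
    holds parallel booleans (True if the text contains t).  Implements
    the expanded IG(t) formula above; an empty partition contributes
    nothing (the 0*log(0) = 0 convention)."""
    def plogp_sum(cats):  # sum_i p(c_i) log p(c_i) over one partition
        total = len(cats)
        return sum((c / total) * math.log(c / total)
                   for c in Counter(cats).values())

    n = len(labels)
    p_t = sum(contains_t) / n
    gain = -plogp_sum(labels)          # -sum_i p(c_i) log p(c_i)
    with_t = [l for l, has in zip(labels, contains_t) if has]
    without = [l for l, has in zip(labels, contains_t) if not has]
    if with_t:                          # + p(t) sum_i p(c_i|t) log p(c_i|t)
        gain += p_t * plogp_sum(with_t)
    if without:                         # + p(t̄) sum_i p(c_i|t̄) log p(c_i|t̄)
        gain += (1 - p_t) * plogp_sum(without)
    return gain
```

A word that perfectly separates the categories attains the maximum gain (the full entropy of the label distribution); a word present in every text gains nothing.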
S207: and selecting a third preset number of initially selected feature words with the largest gain values in the ranking as final feature words to obtain a feature item set formed by the final feature words, and classifying the texts in the training text data set according to the final feature words to obtain a text category set.
The information gain values of all the initially selected feature words are obtained in step S206; the larger the value, the more classification information the corresponding feature word carries, i.e. the more strongly it indicates a particular text category in the training text data set. Therefore, after the ranking of the information gain values is obtained, the third preset number of initially selected feature words with the largest gain values are selected as final feature words to obtain the feature item set formed by the final feature words, and each text in the training text data set is classified according to the final feature words to obtain the text category set.
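The final selection of step S207 is then a simple cut of the gain ranking; in this small sketch, `n_final` stands in for the "third preset number" and all names are ours:

```python
def select_final_features(ig_values, n_final):
    """Keep the n_final initially selected feature words with the
    largest information gain values as the final feature item set."""
    ranked = sorted(ig_values, key=ig_values.get, reverse=True)
    return ranked[:n_final]
```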
It should be noted that the third preset number may be set and adjusted according to actual situations, which is not limited in the embodiment of the present invention.
S208: and receiving the texts to be classified, and mapping the texts to be classified into target feature vectors of target dimensions according to a feature item set obtained by pre-training each text in the training text data set.
S209: and calculating Euclidean distances between the target characteristic vector and the characteristic vector of each text in the training text data set, and sequencing the Euclidean distances.
S210: and selecting the texts corresponding to the first preset number of Euclidean distances at the end with the small Euclidean distance in the sequence as the neighbor texts of the texts to be classified.
S211: and calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text.
S212: and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
Referring to fig. 3, fig. 3 is a flowchart of another implementation of the text classification method in the embodiment of the present invention, where the method may include the following steps:
S301: and removing illegal format characters of each text in the training text data set.
S302: and performing word segmentation on each text in the training text data set by using an accurate mode of a jieba word segmentation algorithm to obtain a word segmentation set of each text.
After the illegal format characters of each text in the training text data set are removed, word segmentation is performed on each text in the training text data set by using the precise mode of the jieba word segmentation algorithm to obtain the word segmentation set of each text. The jieba word segmentation algorithm has three modes: (1) the precise mode, which tries to cut the sentence as accurately as possible and is suitable for text analysis; (2) the full mode, which scans out all the character sequences in the sentence that can form words, and is very fast but cannot resolve ambiguity; (3) the search engine mode, which re-segments long words on the basis of the precise mode to improve recall, and is suitable for word segmentation in search engines. For example, segmenting the (originally Chinese) sentence "I come from the institute of automation" gives the following results. Precise mode: I / come from / institute of automation. Full mode: I / come from / automation / institute of automation. Search engine mode: I / come from / automation / institute / institute of automation. Therefore, the precise mode of the jieba word segmentation algorithm is selected to segment the text data, which further improves the accuracy of subsequent text classification.
S303: and removing stop words in the participle set of each text.
S304: and calculating the word frequency and the reverse file frequency of each word in each word segmentation set, and calculating the product of the word frequency and the reverse file frequency corresponding to each word in each word segmentation set to obtain the characteristic weight corresponding to each word in each word segmentation set.
S305: and respectively sorting, according to the feature weights, the words in the word segmentation set corresponding to each text, selecting the second preset number of words with the largest weights in each text's ranking as the initially selected feature words of each text, and combining the initially selected feature words to obtain the initially selected feature word set of the training text data set.
S306: and calculating the information gain value of each initially selected feature word in the initially selected feature word set by using an information gain algorithm, and sorting the information gain values.
S307: and selecting a third preset number of initially selected feature words with the largest gain values in the ranking as final feature words to obtain a feature item set formed by the final feature words, and classifying the texts in the training text data set according to the final feature words to obtain a text category set.
S308: and receiving the texts to be classified, and mapping the texts to be classified into target feature vectors of target dimensions according to a feature item set obtained by pre-training each text in the training text data set.
S309: and calculating Euclidean distances between the target characteristic vector and the characteristic vector of each text in the training text data set, and sequencing the Euclidean distances.
S310: and selecting the texts corresponding to the first preset number of Euclidean distances at the end with the small Euclidean distance in the sequence as the neighbor texts of the texts to be classified.
S311: and calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text.
S312: and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
Corresponding to the above method embodiments, the embodiments of the present invention further provide a text classification apparatus, and the text classification apparatus described below and the text classification method described above may be referred to in correspondence with each other.
Referring to fig. 4, fig. 4 is a block diagram of a text classification apparatus according to an embodiment of the present invention, where the apparatus includes:
the feature vector mapping module 41 is configured to receive a text to be classified, and map the text to be classified into a target feature vector of a target dimension according to a feature item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training a training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm;
the distance sorting module 42 is configured to calculate euclidean distances between the target feature vectors and feature vectors of each text in the training text data set, and sort the euclidean distances;
a neighbor text obtaining module 43, configured to select, as the neighbor texts of the text to be classified, the texts corresponding to the first preset number of Euclidean distances at the end of the ranking with the smallest Euclidean distances;
the weight calculation module 44 is configured to calculate, based on each neighbor text, a weight of the text to be classified to each type of text in the text category set by using a K-nearest neighbor algorithm; the text classification set is obtained by classifying each text in the training text data set in advance according to the feature item set;
and a text category determining module 45, configured to determine the text category corresponding to the maximum weight value as the text category of the text to be classified.
According to the technical scheme, the training text data set is trained in advance by combining a word segmentation algorithm, a calculation algorithm that computes feature weights as the product of word frequency and reverse file frequency, and an information gain algorithm, so as to obtain a feature item set formed by feature words that carry more classification information; this ensures the effectiveness of the feature item set for classifying texts and greatly improves the accuracy of text classification. The text to be classified is mapped into a target feature vector directly using the obtained feature item set, and the target feature vector is processed with a K nearest neighbor algorithm, so that the text to be classified is classified. The time required for classification is greatly shortened, and no manual participation is needed in the whole process, so the cost is greatly reduced.
In one embodiment of the invention, the apparatus comprises a training module comprising:
the word segmentation characteristic word obtaining submodule is used for performing word segmentation on each text in the training text data set by utilizing a jieba word segmentation algorithm to obtain a word segmentation set of each text;
the characteristic weight obtaining submodule is used for calculating the word frequency and the reverse file frequency of each word in each participle set, calculating the product of the word frequency and the reverse file frequency corresponding to each word in each participle set and obtaining the characteristic weight corresponding to each word in each participle set;
the feature word set obtaining sub-module is used for sorting, according to the feature weights, the words in the word segmentation set corresponding to each text, selecting the second preset number of words with the largest weights in each text's ranking as the initially selected feature words of each text, and combining the initially selected feature words to obtain the initially selected feature word set of the training text data set;
the gain value sorting submodule is used for calculating the information gain value of each initially selected feature word in the initially selected feature word set by using an information gain algorithm and sorting the information gain values;
and the feature item set and text category set submodule is used for selecting a third preset number of primarily selected feature words from one end with a large gain value in the gain value sequencing as final-level feature words to obtain a feature item set consisting of the final-level feature words, and classifying each text in the training text data set according to the final-level feature words to obtain a text category set.
In one embodiment of the present invention, the participle feature word obtaining sub-module includes a participle unit,
the word segmentation unit is specifically a unit for performing word segmentation on each text in the training text data set by using an accurate mode of a jieba word segmentation algorithm.
In one embodiment of the invention, the training module further comprises a character removal sub-module,
and the character removal submodule is used for removing illegal format characters of each text in the training text data set after the word segmentation set of each text is obtained.
In one embodiment of the invention, the training module further comprises a stop word removal submodule,
and the stop word removing submodule is used for removing the stop words in the segmentation sets of the texts after the segmentation sets of the texts are obtained.
Corresponding to the above method embodiment, referring to fig. 5, fig. 5 is a schematic diagram of a text classification device provided by the present invention, where the device may include:
a memory 51 for storing a computer program;
the processor 52, when executing the computer program stored in the memory 51, may implement the following steps:
receiving a text to be classified, and mapping the text to be classified into a target feature vector of a target dimension according to a feature item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training a training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm; calculating Euclidean distances between the target characteristic vector and the characteristic vector of each text in the training text data set, and sequencing the Euclidean distances; selecting texts corresponding to the first preset number of Euclidean distances at the end with the smaller Euclidean distance in the sequence as neighbor texts of the texts to be classified; calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text; the text classification set is obtained by classifying each text in the training text data set in advance according to the feature item set; and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
For the introduction of the device provided by the present invention, please refer to the above method embodiment, which is not described herein again.
Corresponding to the above method embodiment, the present invention further provides a computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the steps of:
receiving a text to be classified, and mapping the text to be classified into a target feature vector of a target dimension according to a feature item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training a training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm; calculating Euclidean distances between the target characteristic vector and the characteristic vector of each text in the training text data set, and sequencing the Euclidean distances; selecting texts corresponding to the first preset number of Euclidean distances at the end with the smaller Euclidean distance in the sequence as neighbor texts of the texts to be classified; calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text; the text classification set is obtained by classifying each text in the training text data set in advance according to the feature item set; and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
The computer-readable storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
For the introduction of the computer-readable storage medium provided by the present invention, please refer to the above method embodiments, which are not described herein again.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device, the apparatus and the computer-readable storage medium disclosed in the embodiments correspond to the method disclosed in the embodiments, so that the description is simple, and the relevant points can be referred to the description of the method.
The principle and the implementation of the present invention are explained in the present application by using specific examples, and the above description of the embodiments is only used to help understanding the technical solution and the core idea of the present invention. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (8)

1. A method of text classification, comprising:
receiving a text to be classified, and mapping the text to be classified into a target feature vector of a target dimension according to a feature item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training the training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm;
calculating Euclidean distances between the target characteristic vector and the characteristic vectors of all texts in the training text data set, and sorting the Euclidean distances in size;
selecting the texts corresponding to the first preset number of Euclidean distances at the end of the ranking with the smallest Euclidean distances as the neighbor texts of the texts to be classified;
calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text; the text classification set is obtained by classifying the texts in the training text data set in advance according to the feature item set;
the training process of training the training text data set to obtain the feature item set and the text category set comprises the following steps:
utilizing a jieba word segmentation algorithm to perform word segmentation on each text in the training text data set respectively to obtain a word segmentation set of each text;
calculating the word frequency and the reverse file frequency of each word in each word segmentation set, and calculating the product of the word frequency and the reverse file frequency corresponding to each word in each word segmentation set respectively to obtain the characteristic weight corresponding to each word in each word segmentation set respectively;
respectively carrying out weight sequencing on all words in the word segmentation set corresponding to each text according to the feature weight, selecting a first preset number of words as primary selection feature words of each text from one end of each text with a large weight in the weight sequencing, and combining the primary selection feature words to obtain a primary selection feature word set of the training text data set;
calculating the information gain value of each initially selected feature word in the initially selected feature word set by using the information gain algorithm, and sorting the information gain values;
selecting a third preset number of the primarily selected feature words from one end with a large gain value in the gain value sequence as final feature words to obtain a feature item set formed by all the final feature words, and classifying all texts in the training text data set according to all the final feature words to obtain the text category set;
and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
2. The text classification method according to claim 1, wherein the tokenizing of each text in the training text data set using a jieba tokenizing algorithm comprises:
and segmenting words of each text in the training text data set by utilizing an accurate mode of a jieba word segmentation algorithm.
3. The method of classifying texts according to claim 2, wherein before performing word segmentation on each text in the training text data set by using a jieba word segmentation algorithm, the method further comprises:
and removing illegal format characters of each text in the training text data set.
4. The method of claim 3, further comprising, after obtaining the set of participles for each of the texts:
and removing stop words in the word segmentation set of each text.
5. A text classification apparatus, comprising:
the characteristic vector mapping module is used for receiving texts to be classified and mapping the texts to be classified into target characteristic vectors of target dimensions according to a characteristic item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training the training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm;
the distance sorting module is used for calculating Euclidean distances between the target characteristic vector and the characteristic vectors of all texts in the training text data set and sorting the Euclidean distances;
the neighbor text obtaining module is used for selecting the texts corresponding to the first preset number of Euclidean distances at the end of the ranking with the smallest Euclidean distances as the neighbor texts of the texts to be classified;
the weight calculation module is used for calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text; the text classification set is obtained by classifying the texts in the training text data set in advance according to the feature item set;
the text category determining module is used for determining the text category corresponding to the maximum weight value as the text category of the text to be classified;
a training module, the training module comprising:
the word segmentation characteristic word obtaining submodule is used for performing word segmentation on each text in the training text data set by utilizing a jieba word segmentation algorithm to obtain a word segmentation set of each text;
the characteristic weight obtaining submodule is used for calculating the word frequency and the reverse file frequency of each word in each participle set, calculating the product of the word frequency and the reverse file frequency corresponding to each word in each participle set and obtaining the characteristic weight corresponding to each word in each participle set;
the feature word set obtaining sub-module is used for respectively carrying out weight sorting on all words in the word segmentation set corresponding to each text according to the feature weight, selecting a first preset number of words from one end of each text with a large weight in the weight sorting as primary feature words of each text, and combining the primary feature words to obtain a primary feature word set of the training text data set;
the gain value sorting submodule is used for calculating the information gain value of each initially selected feature word in the initially selected feature word set by using the information gain algorithm and sorting the information gain values;
and the feature item set and text category set submodule is used for selecting a third preset number of the primarily selected feature words from one end with a large gain value in the gain value sequence as final feature words to obtain a feature item set consisting of the final feature words, and classifying each text in the training text data set according to the final feature words to obtain the text category set.
6. The text classification apparatus according to claim 5, wherein the participle feature word obtaining submodule includes a participle unit,
the word segmentation unit is specifically a unit for performing word segmentation on each text in the training text data set by using an accurate mode of a jieba word segmentation algorithm.
7. A text classification apparatus, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the text classification method according to any one of claims 1 to 4 when executing said computer program.
8. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the text classification method according to one of the claims 1 to 4.
CN201910594623.9A 2019-07-03 2019-07-03 Text classification method, device and equipment and computer readable storage medium Active CN110287328B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910594623.9A CN110287328B (en) 2019-07-03 2019-07-03 Text classification method, device and equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910594623.9A CN110287328B (en) 2019-07-03 2019-07-03 Text classification method, device and equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110287328A CN110287328A (en) 2019-09-27
CN110287328B true CN110287328B (en) 2021-03-16

Family

ID=68020450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910594623.9A Active CN110287328B (en) 2019-07-03 2019-07-03 Text classification method, device and equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110287328B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177386B (en) * 2019-12-27 2021-05-14 安徽商信政通信息技术股份有限公司 Proposal classification method and system
CN111143303B (en) * 2019-12-31 2023-06-02 海南电网有限责任公司信息通信分公司 Log classification method based on information gain and improved KNN algorithm
CN111091161B (en) * 2019-12-31 2023-09-22 中国银行股份有限公司 Data classification method, device and system
CN111259148B (en) * 2020-01-19 2024-03-26 北京小米松果电子有限公司 Information processing method, device and storage medium
CN111538766B (en) * 2020-05-19 2023-06-30 支付宝(杭州)信息技术有限公司 Text classification method, device, processing equipment and bill classification system
CN111667152B (en) * 2020-05-19 2024-07-02 深圳莫比嗨客树莓派智能机器人有限公司 Automatic auditing method for text data calibration task based on crowdsourcing
CN111695353B (en) * 2020-06-12 2023-07-04 百度在线网络技术(北京)有限公司 Method, device, equipment and storage medium for identifying timeliness text
CN111708888B (en) * 2020-06-16 2023-10-24 腾讯科技(深圳)有限公司 Classification method, device, terminal and storage medium based on artificial intelligence
CN112214598B (en) * 2020-09-27 2023-01-13 吾征智能技术(北京)有限公司 Cognitive system based on hair condition
CN112861974A (en) * 2021-02-08 2021-05-28 和美(深圳)信息技术股份有限公司 Text classification method and device, electronic equipment and storage medium
CN113094494B (en) * 2021-04-19 2024-09-13 广东电网有限责任公司 Intelligent classification method, device, equipment and medium for electric power operation ticket text
CN113364751B (en) * 2021-05-26 2023-06-09 北京电子科技职业学院 Network attack prediction method, computer readable storage medium and electronic device
CN114068028A (en) * 2021-11-18 2022-02-18 泰康保险集团股份有限公司 Medical inquiry data processing method and device, readable storage medium and electronic equipment
CN114817526B (en) * 2022-02-21 2024-03-29 华院计算技术(上海)股份有限公司 Text classification method and device, storage medium and terminal

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622373B (en) * 2011-01-31 2013-12-11 中国科学院声学研究所 Statistical text classification system and method based on the term frequency-inverse document frequency (TF-IDF) algorithm
CN102789473A (en) * 2011-05-18 2012-11-21 国际商业机器公司 Identifier retrieval method and equipment
CN105426426B (en) * 2015-11-04 2018-11-02 北京工业大学 KNN text classification method based on improved K-Medoids
CN107045503B (en) * 2016-02-05 2019-03-05 华为技术有限公司 Feature set determination method and device
CN107169086B (en) * 2017-05-12 2020-10-27 北京化工大学 Text classification method
CN107481132A (en) * 2017-08-02 2017-12-15 上海前隆信息科技有限公司 Credit assessment method and system, storage medium and terminal device
CN108764399B (en) * 2018-05-22 2021-03-19 东南大学 kNN-based RFID tag classification method and device
CN109299263B (en) * 2018-10-10 2021-01-05 上海观安信息技术股份有限公司 Text classification method and electronic equipment

Also Published As

Publication number Publication date
CN110287328A (en) 2019-09-27

Similar Documents

Publication Publication Date Title
CN110287328B (en) Text classification method, device and equipment and computer readable storage medium
Trstenjak et al. KNN with TF-IDF based framework for text categorization
CN111767403B (en) Text classification method and device
CN100583101C (en) Text categorization feature selection and weight computation method based on field knowledge
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CN108228541B (en) Method and device for generating document abstract
CN107145560B (en) Text classification method and device
JPH07114572A (en) Document classifying device
CN111159359A (en) Document retrieval method, document retrieval device and computer-readable storage medium
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN109492217B (en) Word segmentation method based on machine learning and terminal equipment
CN110858217A (en) Method and device for detecting microblog sensitive topics and readable storage medium
CN107357895B (en) Text representation processing method based on bag-of-words model
CN112579783B (en) Short text clustering method based on Laplace atlas
CN112667806B (en) Text classification screening method using LDA
CN113807073B (en) Text content anomaly detection method, device and storage medium
CN116910599A (en) Data clustering method, system, electronic equipment and storage medium
CN113934848B (en) Data classification method and device and electronic equipment
CN114896398A (en) Text classification system and method based on feature selection
CN112711944B (en) Word segmentation method and system, and word segmentation device generation method and system
CN117171331A (en) Professional field information interaction method, device and equipment based on large language model
CN111831819B (en) Text updating method and device
CN116881451A (en) Text classification method based on machine learning
CN116935057A (en) Target evaluation method, electronic device, and computer-readable storage medium
CN107729509B (en) Discourse similarity determination method based on latent high-dimensional distributed feature representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant