CN110287328B - Text classification method, device and equipment and computer readable storage medium - Google Patents
- Publication number
- CN110287328B CN110287328B CN201910594623.9A CN201910594623A CN110287328B CN 110287328 B CN110287328 B CN 110287328B CN 201910594623 A CN201910594623 A CN 201910594623A CN 110287328 B CN110287328 B CN 110287328B
- Authority
- CN
- China
- Prior art keywords
- text
- training
- word
- feature
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention discloses a text classification method, which comprises the following steps: receiving a text to be classified, and mapping the text to be classified into a target feature vector according to a feature item set obtained by training; the feature item set is obtained by training a training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm; calculating Euclidean distances between the target characteristic vector and the characteristic vectors of all texts in the training text data set; selecting neighbor texts of the texts to be classified according to the Euclidean distances; calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text; and determining the text category of the text to be classified according to the weights. The invention greatly improves the accuracy of text classification, shortens the classification time and greatly reduces the cost. The invention also discloses a text classification device, equipment and a storage medium, and has corresponding technical effects.
Description
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a text classification method, apparatus, device, and computer-readable storage medium.
Background
With the rapid development of network technologies, including social software such as microblogs, WeChat and QQ, text has become an important form of information, and the demand for finding relevant information rapidly, accurately and comprehensively keeps growing. Text classification is one of the basic tasks in natural language processing, and generally includes text representation, classifier selection and training, and classification result evaluation and feedback.
The existing text classification approaches mainly include a text classification method that performs multi-dimensional feature selection through an integrated statistical learning method and a deep learning method, and a text classification method based on a fast text classification model and a convolutional neural network model. The first, multi-dimensional feature selection method considers the selection of feature words across multiple dimensions and then classifies with a neural network classifier, which can improve the accuracy and stability of text classification to a certain extent; its drawback is a complex and time-consuming preprocessing process. The second, based on the fast text classification model and the convolutional neural network model, requires manual word segmentation, so much time is needed to train on the observed data; moreover, different people understand feature words differently, so manual word segmentation varies from person to person and is easily influenced by subjective factors. As a result, the final classification accuracy is not high, the computation cost is excessive, and the time consumed is too long.
In summary, how to effectively solve the problems of long time consumption, high labor cost and low classification accuracy in existing text classification methods is a problem that those skilled in the art urgently need to solve.
Disclosure of Invention
The invention aims to provide a text classification method, which greatly improves the accuracy of text classification, greatly shortens the classification duration and greatly reduces the cost; another object of the present invention is to provide a text classification apparatus, a device and a computer readable storage medium.
In order to solve the technical problems, the invention provides the following technical scheme:
a method of text classification, comprising:
receiving a text to be classified, and mapping the text to be classified into a target feature vector of a target dimension according to a feature item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training the training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm;
calculating Euclidean distances between the target feature vector and the feature vectors of all texts in the training text data set, and sorting the Euclidean distances by magnitude;
selecting the texts corresponding to the first preset number of Euclidean distances at the small-distance end of the sorting as the neighbor texts of the text to be classified;
calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text; the text classification set is obtained by classifying the texts in the training text data set in advance according to the feature item set;
and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
In a specific embodiment of the present invention, a training process of training the training text data set to obtain the feature item set and the text category set includes:
utilizing a jieba word segmentation algorithm to perform word segmentation on each text in the training text data set respectively to obtain a word segmentation set of each text;
calculating the word frequency and the reverse file frequency of each word in each word segmentation set, and calculating the product of the word frequency and the reverse file frequency corresponding to each word in each word segmentation set respectively to obtain the characteristic weight corresponding to each word in each word segmentation set respectively;
respectively sorting all words in the word segmentation set corresponding to each text by feature weight, selecting a second preset number of words from the large-weight end of each text's sorting as the initially selected feature words of each text, and combining the initially selected feature words to obtain an initially selected feature word set of the training text data set;
calculating the information gain value of each initially selected feature word in the initially selected feature word set by using the information gain algorithm, and sorting the information gain values;
selecting a third preset number of the initially selected feature words from the large-gain-value end of the sorting as final feature words to obtain a feature item set formed by all the final feature words, and classifying all texts in the training text data set according to all the final feature words to obtain the text category set.
In a specific embodiment of the present invention, the segmenting each text in the training text data set by using a jieba word segmentation algorithm includes:
and segmenting words of each text in the training text data set by utilizing an accurate mode of a jieba word segmentation algorithm.
In a specific embodiment of the present invention, before performing word segmentation on each text in the training text data set by using a jieba word segmentation algorithm, the method further includes:
and removing illegal format characters of each text in the training text data set.
In an embodiment of the present invention, after obtaining the set of word segments of each text, the method further includes:
and removing stop words in the word segmentation set of each text.
A text classification apparatus comprising:
the characteristic vector mapping module is used for receiving texts to be classified and mapping the texts to be classified into target characteristic vectors of target dimensions according to a characteristic item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training the training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm;
the distance sorting module is used for calculating Euclidean distances between the target characteristic vector and the characteristic vectors of all texts in the training text data set and sorting the Euclidean distances;
the neighbor text obtaining module is used for selecting the texts corresponding to the first preset number of Euclidean distances at the small-distance end of the sorting as the neighbor texts of the text to be classified;
the weight calculation module is used for calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text; the text classification set is obtained by classifying the texts in the training text data set in advance according to the feature item set;
and the text category determining module is used for determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
In one embodiment of the invention, the apparatus comprises a training module comprising:
the word segmentation characteristic word obtaining submodule is used for performing word segmentation on each text in the training text data set by utilizing a jieba word segmentation algorithm to obtain a word segmentation set of each text;
the feature weight obtaining submodule is used for calculating the word frequency and the reverse file frequency of each word in each word segmentation set, and calculating the product of the word frequency and the reverse file frequency corresponding to each word in each word segmentation set to obtain the feature weight corresponding to each word in each word segmentation set;
the feature word set obtaining submodule is used for respectively sorting all words in the word segmentation set corresponding to each text by feature weight, selecting a second preset number of words from the large-weight end of each text's sorting as the initially selected feature words of each text, and combining the initially selected feature words to obtain an initially selected feature word set of the training text data set;
the gain value sorting submodule is used for calculating the information gain value of each initially selected feature word in the initially selected feature word set by using the information gain algorithm and sorting the information gain values;
and the feature item set and text category set submodule is used for selecting a third preset number of the initially selected feature words from the large-gain-value end of the sorting as final feature words to obtain a feature item set consisting of the final feature words, and classifying each text in the training text data set according to the final feature words to obtain the text category set.
In one embodiment of the present invention, the word segmentation characteristic word obtaining sub-module includes a word segmentation unit,
the word segmentation unit is specifically a unit for performing word segmentation on each text in the training text data set by using an accurate mode of a jieba word segmentation algorithm.
A text classification apparatus comprising:
a memory for storing a computer program;
a processor for implementing the steps of the text classification method as described above when executing the computer program.
A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the text classification method as described above.
The application provides a text classification method which comprises the following steps: receiving a text to be classified, and mapping the text to be classified into a target feature vector of a target dimension according to a feature item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training a training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm; calculating Euclidean distances between the target characteristic vector and the characteristic vector of each text in the training text data set, and sequencing the Euclidean distances; selecting texts corresponding to the first preset number of Euclidean distances at the end with the smaller Euclidean distance in the sequence as neighbor texts of the texts to be classified; calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text; the text classification set is obtained by classifying the training text data set in advance according to the feature item set; and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
According to the technical scheme, the training text data set is trained in advance by combining a word segmentation algorithm, a calculation algorithm for calculating the feature weight by utilizing the product of the word frequency and the reverse file frequency and an information gain algorithm to obtain the feature item set formed by feature words containing more classification information, so that the effectiveness of the feature item set for classifying the text is ensured, and the accuracy of text classification is greatly improved. The text to be classified is mapped into a target feature vector by directly utilizing the obtained feature item set, and the target feature vector is calculated by utilizing a K nearest neighbor algorithm, so that the text to be classified is classified. The time for classification is greatly shortened, and the whole process does not need personnel to participate, so that the cost is greatly reduced.
Correspondingly, the embodiment of the invention also provides a text classification device, equipment and a computer readable storage medium corresponding to the text classification method, which have the technical effects and are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart illustrating an implementation of a text classification method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating another embodiment of a text classification method according to the present invention;
FIG. 3 is a flowchart illustrating another implementation of a text classification method according to an embodiment of the present invention;
FIG. 4 is a block diagram of a text classification apparatus according to an embodiment of the present invention;
fig. 5 is a block diagram of a text classification device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one:
referring to fig. 1, fig. 1 is a flowchart of an implementation of a text classification method according to an embodiment of the present invention, where the method may include the following steps:
S101: Receiving a text to be classified, and mapping the text to be classified into a target feature vector of a target dimension according to a feature item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training a training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm.
The training text data set can be preset; each text in the training text data set is segmented by using a preset word segmentation algorithm, the feature weight of each word in each text is calculated by using a calculation algorithm that computes the feature weight as the product of the word frequency and the reverse file frequency, and a feature item set consisting of feature words containing more classification information is obtained by combining an information gain algorithm. When a text to be classified is received (unstructured data, such as a post in a bus route community), the text to be classified can be mapped into a target feature vector of a target dimension according to the feature item set obtained by pre-training each text in the training text data set.
The target dimension is calculated from the features of the text to be classified; texts to be classified with different features correspond to different numbers of dimensions.
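As a minimal sketch of the mapping in S101 (the function name is illustrative, and the raw-count weighting is a simplifying assumption; the trained feature weights described in this patent would normally be used), each dimension of the target vector corresponds to one item in the feature item set:

```python
def text_to_vector(tokens, feature_items):
    # One dimension per feature item; the value is how often that feature
    # word occurs in the segmented text (a simplified stand-in for a weight).
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return [counts.get(f, 0) for f in feature_items]
```

The resulting vector's dimensionality equals the size of the feature item set, which is what makes distances between different texts comparable.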
S102: and calculating Euclidean distances between the target characteristic vector and the characteristic vector of each text in the training text data set, and sequencing the Euclidean distances.
After the target feature vector corresponding to the text to be classified is obtained, the Euclidean distance between the target feature vector and the feature vector of each text in the training text data set can be calculated, and the Euclidean distances are sorted, so that a sorting result is obtained. Specifically, the euclidean distances may be sorted from small to large, or the euclidean distances may be sorted from large to small, which is not limited in the embodiment of the present invention.
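The distance computation and small-to-large sorting of S102 can be sketched in Python as follows (function names are illustrative):

```python
import math

def euclidean(u, v):
    # Euclidean distance between two equal-length feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rank_training_texts(target_vec, training_vecs):
    # Return (index, distance) pairs sorted from the smallest distance up,
    # matching the small-to-large ordering used in the following step S103.
    dists = [(i, euclidean(target_vec, v)) for i, v in enumerate(training_vecs)]
    return sorted(dists, key=lambda p: p[1])
```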
S103: and selecting the texts corresponding to the first preset number of Euclidean distances at the end with the small Euclidean distance in the sequence as the neighbor texts of the texts to be classified.
After the sorting result is obtained by sorting the Euclidean distances according to the sizes, texts corresponding to the first preset number of Euclidean distances at the end with the smaller Euclidean distance in the sorting can be selected, and the first preset number of texts are used as the neighbor texts of the texts to be classified.
It should be noted that the first preset number may be set and adjusted according to actual situations, which is not limited in the embodiment of the present invention.
S104: and calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text.
The text classification set is obtained by classifying each text in the training text data set in advance according to the feature item set.
After the feature item set is obtained by training the training text data set in advance, the training text data set can be further classified according to the feature item set to obtain a text category set. After obtaining each Neighbor text of the text to be classified, the weights of the text to be classified to each type of text in the text category set can be calculated by using a K-Nearest Neighbor algorithm (KNN) based on each Neighbor text.
Assume that the set of text categories is C = {c1, c2, …, ck}. The weight of the text d to be classified with respect to each category of text in the text category set can be calculated through the following formula:
w(d, ci) = Σ over dj ∈ KNN(d) of sim(d, dj) × y(dj, ci)
wherein w(d, ci) represents the weight of the text d to be classified with respect to the text category ci, KNN(d) is the set of the K neighbor texts with the minimum Euclidean distance to d, sim(d, dj) represents the similarity between the text d to be classified and a certain neighbor text dj in KNN(d), derived from their Euclidean distance, and y(dj, ci) is a Boolean variable whose value is 1 when the neighbor text dj belongs to the text category ci and 0 otherwise.
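A minimal sketch of this weighted voting, assuming each neighbor's similarity has already been derived from its Euclidean distance (e.g. 1/(1 + distance); that conversion is an assumption, the patent does not fix it):

```python
def knn_weights(neighbors, categories):
    # neighbors: list of (similarity, category) pairs for the K nearest texts.
    # Accumulate w(d, c_i) = sum of sim(d, d_j) over neighbors d_j in c_i,
    # i.e. y(d_j, c_i) selects only the neighbors belonging to category c_i.
    weights = {c: 0.0 for c in categories}
    for sim, cat in neighbors:
        weights[cat] += sim
    return weights

def classify(neighbors, categories):
    # S105: the category with the maximum weight is the predicted category.
    w = knn_weights(neighbors, categories)
    return max(w, key=w.get)
```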
S105: and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
After the weights of the texts to be classified to various texts in the text category set are obtained by utilizing the calculation of all the neighbor texts, the text category corresponding to the maximum weight value can be determined as the text category of the texts to be classified, so that the text category of the texts to be classified can be quickly and accurately obtained.
According to the technical scheme, the training text data set is trained in advance by combining a word segmentation algorithm, a calculation algorithm for calculating the feature weight by utilizing the product of the word frequency and the reverse file frequency and an information gain algorithm to obtain the feature item set formed by feature words containing more classification information, so that the effectiveness of the feature item set for classifying the text is ensured, and the accuracy of text classification is greatly improved. The text to be classified is mapped into a target feature vector by directly utilizing the obtained feature item set, and the target feature vector is calculated by utilizing a K nearest neighbor algorithm, so that the text to be classified is classified. The time for classification is greatly shortened, and the whole process does not need personnel to participate, so that the cost is greatly reduced.
It should be noted that, based on the first embodiment, the embodiment of the present invention further provides a corresponding improvement scheme. In the following embodiments, steps that are the same as or correspond to those in the first embodiment may be referred to each other, and corresponding advantageous effects may also be referred to each other, which are not described in detail in the following modified embodiments.
Example two:
referring to fig. 2, fig. 2 is a flowchart of another implementation of the text classification method in the embodiment of the present invention, where the method may include the following steps:
S201: And removing illegal format characters of each text in the training text data set.
In the process of training the preset training text data set, the illegal format characters of each text in the training text data set can be removed first, so as to preprocess the training text data set. Since text data acquired from web pages is basically stored in HTML format, an HTML file usually contains many tags representing format information; these tags, together with emoticons, web addresses and the like, are collectively called "illegal format characters". The more thoroughly the illegal format characters are removed, the more obvious the classification effect, since their interference with text classification is avoided.
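A minimal cleaning sketch in Python; the regular expressions are illustrative assumptions, not the patent's exact filtering rules:

```python
import re

def strip_illegal_chars(text):
    # Drop HTML tags, web addresses, and non-text symbols (emoticons etc.),
    # keeping CJK characters, letters and digits, then collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", text)              # HTML format tags
    text = re.sub(r"https?://\S+", " ", text)         # web addresses
    text = re.sub(r"[^\w\u4e00-\u9fff]+", " ", text)  # emoticons/symbols
    return re.sub(r"\s+", " ", text).strip()
```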
S202: and performing word segmentation on each text in the training text data set by using a jieba word segmentation algorithm to obtain a word segmentation set of each text.
After removing the illegal format characters of each text in the training text data set, the jieba word segmentation algorithm can be used to segment each text in the training text data set to obtain the word segmentation set of each text; that is, each text in the training text data set is cut into a set of individual words, so that the meaning of the original text can be expressed to the maximum extent.
S203: and removing stop words in the participle set of each text.
After the words of each text in the training text data set are segmented by using a jieba word segmentation algorithm to obtain a word segmentation set of each text, each text is segmented into a single word set, but from the view of Natural Language Processing (NLP), a text main body is represented by verbs, nouns, adjectives and the like, and adverbs, punctuations and the like exist in the word segmentation set.
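The stop-word filtering can be sketched as follows (the stop-word list itself is an assumption; in practice it would cover punctuation, adverbs and similar low-information words):

```python
def remove_stop_words(tokens, stop_words):
    # Filter a text's word segmentation set against a stop-word list,
    # preserving the original order of the remaining words.
    stops = set(stop_words)
    return [t for t in tokens if t not in stops]
```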
S204: and calculating the word frequency and the reverse file frequency of each word in each word segmentation set, and calculating the product of the word frequency and the reverse file frequency corresponding to each word in each word segmentation set to obtain the characteristic weight corresponding to each word in each word segmentation set.
After the stop words in the word segmentation sets of the texts are removed, the word frequency and the reverse file frequency of each word in each word segmentation set can be calculated, and the product of the word frequency and the reverse file frequency corresponding to each word in each word segmentation set is calculated to obtain the characteristic weight corresponding to each word in each word segmentation set.
The Term Frequency (TF) for a given text refers to the frequency with which a given word appears in the text. The word frequency is normalized by the total number of words in the text to prevent it from being biased towards long texts. The importance of a word to a text can be expressed by its word frequency, given by the formula:
tf(i,j) = n(i,j) / Σk n(k,j)
wherein n(i,j) represents the number of times a certain feature word appears in the text, and Σk n(k,j) represents the sum of the occurrence counts of all words in the text.
The Inverse Document Frequency (IDF, rendered above as "reverse file frequency") is a measure of the general importance of a word. The IDF of a specific feature word is obtained by dividing the total number of texts by the number of texts containing the feature word, and then taking the logarithm of the quotient:
idf(i) = log( N / (Nk + 1) )
wherein N refers to the total number of texts in the text data set and Nk refers to the number of texts containing the feature word; 1 is usually added to the denominator to prevent the denominator from being 0 when no text contains the feature word.
Thus, for each word, its corresponding feature weight can be represented by the following formula:
w(i,j) = tf(i,j) × idf(i)
according to the method and the device, the characteristic weight corresponding to each word is calculated through the product of the word frequency corresponding to each word segmentation characteristic word and the reverse file frequency, and the accuracy of text representation is improved.
S205: and respectively carrying out weight sequencing on all words in the word segmentation set corresponding to each text according to the characteristic weight, selecting a second preset number of words at the end with a large weight in the weight sequencing of each text as the primary selection characteristic words of each text, and combining the primary selection characteristic words to obtain a primary selection characteristic word set of the training text data set.
After the feature weights respectively corresponding to the words in the word segmentation set are obtained, the words in the word segmentation set corresponding to each text can be respectively subjected to weight sorting according to the feature weights, a second preset number of words before the first preset number of words are selected as primary selection feature words of each text from one end of each text with a large weight in the weight sorting, and the primary selection feature words are combined to obtain a primary selection feature word set of the training text data set. And after the initial selection feature words of each text are obtained, the original unstructured text data can be represented by using a space vector model, each text in the training text data set is converted into an n-dimensional vector in a vector space, and the vector usually has high dimensionality and sparsity and is formally expressed as follows:
doc_i = (m_1, m_2, m_3, ..., m_j, ..., m_n);
wherein doc_i represents the i-th text in the training text data set, and m_j represents the weight of the j-th feature in the text representation of the i-th text.
It should be noted that the second preset number may be set and adjusted according to actual situations, which is not limited in the embodiment of the present invention.
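A minimal sketch of the mapping into this vector space, assuming the vocabulary is the primary selection feature word set in a fixed order (names are illustrative):

```python
def to_vector(word_weights, vocabulary):
    """Map one text's {word: weight} dict onto a fixed vocabulary.

    Words absent from the text get weight 0, which is what makes the
    resulting text vectors high-dimensional and sparse.
    """
    return [word_weights.get(word, 0.0) for word in vocabulary]
```

For a vocabulary of n feature words this yields exactly an n-dimensional weight vector for each text.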
S206: and calculating the information gain value of each initially selected feature word in the initially selected feature word set by using an information gain algorithm, and sorting the information gain values.
After the above steps, in order to further reduce the number of feature words and select the initially selected feature words containing as much text classification information as possible, an Information Gain (IG) algorithm may be used to calculate the information gain value of each initially selected feature word in the initially selected feature word set, and the information gain values may be sorted, thereby further improving the accuracy and stability of subsequent text classification. The information gain algorithm is a feature selection algorithm based on information entropy: the information quantity of the training text data set is calculated both with and without a given initially selected feature word, and the difference between the two quantities indicates the importance of that feature word to text classification. The larger the difference, the larger the information gain value and the stronger the classification capability of the initially selected feature word; conversely, the smaller the difference, the weaker its classification capability. The information gain value of an initially selected feature word t with respect to the training text data set is calculated by the following formula:

IG(t) = -Σ_{i=1..m} p(c_i)·log p(c_i) + p(t)·Σ_{i=1..m} p(c_i|t)·log p(c_i|t) + p(t')·Σ_{i=1..m} p(c_i|t')·log p(c_i|t')
wherein m represents the number of text categories after each text in the training text data set is pre-classified according to the initially selected feature word set, p(c_i) represents the probability of text category c_i in the training text data set, p(t) represents the probability of a text containing the initially selected feature word t in the training text data set, p(c_i|t) represents the conditional probability that a text belongs to category c_i given that the initially selected feature word t appears in it, and p(c_i|t') represents the conditional probability that a text belongs to category c_i given that the initially selected feature word t does not appear in it.
The information gain values may be sorted from small to large, or from large to small, which is not limited in the embodiment of the present invention.
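As an illustrative sketch (not the patent's exact implementation), the entropy-difference computation behind the information gain value can be written as:

```python
import math
from collections import Counter

def information_gain(labels, contains_t):
    """Information gain of a feature word t over a labeled text set.

    labels:     list of category labels, one per text.
    contains_t: list of booleans, True if the text contains t.
    """
    n = len(labels)

    def entropy(subset):
        total = len(subset)
        if total == 0:
            return 0.0
        counts = Counter(subset)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # Entropy of the whole set minus the weighted entropy of the
    # "t present" and "t absent" partitions.
    with_t = [lab for lab, has in zip(labels, contains_t) if has]
    without_t = [lab for lab, has in zip(labels, contains_t) if not has]
    p_t = len(with_t) / n
    return entropy(labels) - p_t * entropy(with_t) - (1 - p_t) * entropy(without_t)
```

A word that splits the texts cleanly by category gets the maximum gain, while a word present in every text gains nothing, matching the selection criterion described above.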
S207: and selecting a third preset number of primarily selected feature words from one end with a large gain value in the gain value sequence as final feature words to obtain a feature item set formed by all the final feature words, and classifying all texts in the training text data set according to all the final feature words to obtain a text category set.
The information gain values of all the initially selected feature words are obtained in step S206; the larger the value, the more classification information the initially selected feature word contains, that is, the more likely it is to indicate a certain text category in the training text data set. Therefore, after the sorting result of the information gain values is obtained, a third preset number of initially selected feature words are selected from the large-gain-value end of the ranking as final-level feature words to obtain a feature item set formed by the final-level feature words, and each text in the training text data set is classified according to the final-level feature words to obtain a text category set.
It should be noted that the third preset number may be set and adjusted according to actual situations, which is not limited in the embodiment of the present invention.
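The selection in S207 is then a top-k cut over the information gain ranking; a minimal sketch under this reading, with the third preset number passed as k (names are illustrative):

```python
import heapq

def select_final_features(gain_by_word, k):
    """Keep the k initially selected feature words with the largest
    information gain values as the final feature item set."""
    return set(heapq.nlargest(k, gain_by_word, key=gain_by_word.get))
```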
S208: and receiving the texts to be classified, and mapping the texts to be classified into target feature vectors of target dimensions according to a feature item set obtained by pre-training each text in the training text data set.
S209: and calculating Euclidean distances between the target characteristic vector and the characteristic vector of each text in the training text data set, and sequencing the Euclidean distances.
S210: and selecting the texts corresponding to the first preset number of Euclidean distances at the end with the small Euclidean distance in the sequence as the neighbor texts of the texts to be classified.
S211: and calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text.
S212: and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
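Steps S208 to S212 amount to a distance-weighted K-nearest-neighbor vote. The sketch below assumes texts are already mapped to equal-length feature vectors and uses inverse-distance weighting, which is one common choice; the patent itself only specifies that a weight is computed per category:

```python
import math
from collections import defaultdict

def classify(target, training_vectors, training_labels, k):
    """Classify `target` by a weighted vote among its k nearest neighbors."""
    # S209: Euclidean distance from the target to every training text.
    distances = [
        (math.dist(target, vector), label)
        for vector, label in zip(training_vectors, training_labels)
    ]
    # S210: keep the k neighbors with the smallest distances.
    neighbors = sorted(distances)[:k]

    # S211: accumulate a weight per text category; closer neighbors
    # contribute more (inverse-distance weighting, an assumption here).
    votes = defaultdict(float)
    for distance, label in neighbors:
        votes[label] += 1.0 / (distance + 1e-9)

    # S212: the category with the maximum weight wins.
    return max(votes, key=votes.get)
```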
Referring to fig. 3, fig. 3 is a flowchart of another implementation of the text classification method in the embodiment of the present invention, where the method may include the following steps:
S301: and removing illegal format characters of each text in the training text data set.
S302: and performing word segmentation on each text in the training text data set by using an accurate mode of a jieba word segmentation algorithm to obtain a word segmentation set of each text.
After removing illegal format characters from each text in the training text data set, each text in the training text data set is segmented using the accurate mode of the jieba word segmentation algorithm to obtain a word segmentation set for each text. The jieba word segmentation algorithm has three modes: (1) the accurate mode, which tries to cut the sentence most accurately and is suitable for text analysis; (2) the full mode, which scans all the words in the sentence that can form words, and is very fast but cannot resolve ambiguity; and (3) the search engine mode, which further segments long words on the basis of the accurate mode to improve the recall rate, and is suitable for word segmentation in search engines. For example, segmenting "I am from the institute of automation" gives the following results. Accurate mode: I / from / automation college. Full mode: I / from / automation / college of automation. Search engine mode: I / from / automation / college / automation / chemistry / college. Therefore, the accurate mode of the jieba word segmentation algorithm is selected to segment the text data, which further improves the accuracy of subsequent text classification.
S303: and removing stop words in the participle set of each text.
S304: and calculating the word frequency and the reverse file frequency of each word in each word segmentation set, and calculating the product of the word frequency and the reverse file frequency corresponding to each word in each word segmentation set to obtain the characteristic weight corresponding to each word in each word segmentation set.
S305: and respectively carrying out weight sequencing on all words in the word segmentation set corresponding to each text according to the characteristic weight, selecting a second preset number of words at the end with a large weight in the weight sequencing of each text as the primary selection characteristic words of each text, and combining the primary selection characteristic words to obtain a primary selection characteristic word set of the training text data set.
S306: and calculating the information gain value of each initially selected feature word in the initially selected feature word set by using an information gain algorithm, and sorting the information gain values.
S307: and selecting a third preset number of primarily selected feature words from one end with a large gain value in the gain value sequence as final feature words to obtain a feature item set formed by all the final feature words, and classifying all texts in the training text data set according to all the final feature words to obtain a text category set.
S308: and receiving the texts to be classified, and mapping the texts to be classified into target feature vectors of target dimensions according to a feature item set obtained by pre-training each text in the training text data set.
S309: and calculating Euclidean distances between the target characteristic vector and the characteristic vector of each text in the training text data set, and sequencing the Euclidean distances.
S310: and selecting the texts corresponding to the first preset number of Euclidean distances at the end with the small Euclidean distance in the sequence as the neighbor texts of the texts to be classified.
S311: and calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text.
S312: and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
Corresponding to the above method embodiments, the embodiments of the present invention further provide a text classification apparatus, and the text classification apparatus described below and the text classification method described above may be referred to in correspondence with each other.
Referring to fig. 4, fig. 4 is a block diagram of a text classification apparatus according to an embodiment of the present invention, where the apparatus includes:
the feature vector mapping module 41 is configured to receive a text to be classified, and map the text to be classified into a target feature vector of a target dimension according to a feature item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training a training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm;
the distance sorting module 42 is configured to calculate euclidean distances between the target feature vectors and feature vectors of each text in the training text data set, and sort the euclidean distances;
a neighbor text obtaining module 43, configured to select, as the neighbor texts of the text to be classified, the texts corresponding to a first preset number of Euclidean distances at the small-distance end of the ranking;
the weight calculation module 44 is configured to calculate, based on each neighbor text, a weight of the text to be classified to each type of text in the text category set by using a K-nearest neighbor algorithm; the text classification set is obtained by classifying each text in the training text data set in advance according to the feature item set;
and a text category determining module 45, configured to determine the text category corresponding to the maximum weight value as the text category of the text to be classified.
According to the technical scheme, the training text data set is trained in advance by combining a word segmentation algorithm, a calculation algorithm for calculating the feature weight by utilizing the product of the word frequency and the reverse file frequency and an information gain algorithm to obtain the feature item set formed by feature words containing more classification information, so that the effectiveness of the feature item set for classifying the text is ensured, and the accuracy of text classification is greatly improved. The text to be classified is mapped into a target feature vector by directly utilizing the obtained feature item set, and the target feature vector is calculated by utilizing a K nearest neighbor algorithm, so that the text to be classified is classified. The time for classification is greatly shortened, and the whole process does not need personnel to participate, so that the cost is greatly reduced.
In one embodiment of the invention, the apparatus comprises a training module comprising:
the word segmentation characteristic word obtaining submodule is used for performing word segmentation on each text in the training text data set by utilizing a jieba word segmentation algorithm to obtain a word segmentation set of each text;
the characteristic weight obtaining submodule is used for calculating the word frequency and the reverse file frequency of each word in each participle set, calculating the product of the word frequency and the reverse file frequency corresponding to each word in each participle set and obtaining the characteristic weight corresponding to each word in each participle set;
the characteristic word set obtaining sub-module is used for respectively sorting all words in the word segmentation set corresponding to each text by weight according to the characteristic weight, selecting a second preset number of words from the large-weight end of each text's ranking as the primary selection characteristic words of each text, and combining the primary selection characteristic words to obtain a primary selection characteristic word set of the training text data set;
the gain value sequencing submodule is used for calculating the information gain value of each initially selected feature word in the initially selected feature word set by using an information gain algorithm and sorting the information gain values;
and the feature item set and text category set submodule is used for selecting a third preset number of primarily selected feature words from one end with a large gain value in the gain value sequencing as final-level feature words to obtain a feature item set consisting of the final-level feature words, and classifying each text in the training text data set according to the final-level feature words to obtain a text category set.
In one embodiment of the present invention, the participle feature word obtaining sub-module includes a participle unit,
the word segmentation unit is specifically a unit for performing word segmentation on each text in the training text data set by using an accurate mode of a jieba word segmentation algorithm.
In one embodiment of the invention, the training module further comprises a character removal sub-module,
and the character removal submodule is used for removing illegal format characters of each text in the training text data set after the word segmentation set of each text is obtained.
In one embodiment of the invention, the training module further comprises a stop word removal submodule,
and the stop word removing submodule is used for removing the stop words in the segmentation sets of the texts after the segmentation sets of the texts are obtained.
Corresponding to the above method embodiment, referring to fig. 5, fig. 5 is a schematic diagram of a text classification device provided by the present invention, where the device may include:
a memory 51 for storing a computer program;
the processor 52, when executing the computer program stored in the memory 51, may implement the following steps:
receiving a text to be classified, and mapping the text to be classified into a target feature vector of a target dimension according to a feature item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training a training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm; calculating Euclidean distances between the target characteristic vector and the characteristic vector of each text in the training text data set, and sequencing the Euclidean distances; selecting texts corresponding to the first preset number of Euclidean distances at the end with the smaller Euclidean distance in the sequence as neighbor texts of the texts to be classified; calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text; the text classification set is obtained by classifying each text in the training text data set in advance according to the feature item set; and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
For the introduction of the device provided by the present invention, please refer to the above method embodiment, which is not described herein again.
Corresponding to the above method embodiment, the present invention further provides a computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the steps of:
receiving a text to be classified, and mapping the text to be classified into a target feature vector of a target dimension according to a feature item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training a training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm; calculating Euclidean distances between the target characteristic vector and the characteristic vector of each text in the training text data set, and sequencing the Euclidean distances; selecting texts corresponding to the first preset number of Euclidean distances at the end with the smaller Euclidean distance in the sequence as neighbor texts of the texts to be classified; calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text; the text classification set is obtained by classifying each text in the training text data set in advance according to the feature item set; and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
The computer-readable storage medium may include: various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
For the introduction of the computer-readable storage medium provided by the present invention, please refer to the above method embodiments, which are not described herein again.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device, the apparatus and the computer-readable storage medium disclosed in the embodiments correspond to the method disclosed in the embodiments, so that the description is simple, and the relevant points can be referred to the description of the method.
The principle and the implementation of the present invention are explained in the present application by using specific examples, and the above description of the embodiments is only used to help understanding the technical solution and the core idea of the present invention. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
Claims (8)
1. A method of text classification, comprising:
receiving a text to be classified, and mapping the text to be classified into a target feature vector of a target dimension according to a feature item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training the training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm;
calculating Euclidean distances between the target characteristic vector and the characteristic vectors of all texts in the training text data set, and sorting the Euclidean distances in size;
selecting the texts corresponding to the first preset number of Euclidean distances at the end with the small Euclidean distance in the sequence as the neighbor texts of the texts to be classified;
calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text; the text classification set is obtained by classifying the texts in the training text data set in advance according to the feature item set;
the training process of training the training text data set to obtain the feature item set and the text category set comprises the following steps:
utilizing a jieba word segmentation algorithm to perform word segmentation on each text in the training text data set respectively to obtain a word segmentation set of each text;
calculating the word frequency and the reverse file frequency of each word in each word segmentation set, and calculating the product of the word frequency and the reverse file frequency corresponding to each word in each word segmentation set respectively to obtain the characteristic weight corresponding to each word in each word segmentation set respectively;
respectively carrying out weight sequencing on all words in the word segmentation set corresponding to each text according to the feature weight, selecting a first preset number of words as primary selection feature words of each text from one end of each text with a large weight in the weight sequencing, and combining the primary selection feature words to obtain a primary selection feature word set of the training text data set;
calculating the information gain value of each initially selected feature word in the initially selected feature word set by using the information gain algorithm, and sorting the information gain values;
selecting a third preset number of the primarily selected feature words from one end with a large gain value in the gain value sequence as final feature words to obtain a feature item set formed by all the final feature words, and classifying all texts in the training text data set according to all the final feature words to obtain the text category set;
and determining the text category corresponding to the maximum weight value as the text category of the text to be classified.
2. The text classification method according to claim 1, wherein the tokenizing of each text in the training text data set using a jieba tokenizing algorithm comprises:
and segmenting words of each text in the training text data set by utilizing an accurate mode of a jieba word segmentation algorithm.
3. The method of classifying texts according to claim 2, wherein before performing word segmentation on each text in the training text data set by using a jieba word segmentation algorithm, the method further comprises:
and removing illegal format characters of each text in the training text data set.
4. The method of claim 3, further comprising, after obtaining the set of participles for each of the texts:
and removing stop words in the word segmentation set of each text.
5. A text classification apparatus, comprising:
the characteristic vector mapping module is used for receiving texts to be classified and mapping the texts to be classified into target characteristic vectors of target dimensions according to a characteristic item set obtained by pre-training each text in a training text data set; the feature item set is obtained by training the training text data set by combining a word segmentation algorithm, a calculation algorithm for calculating feature weight by using the product of word frequency and reverse file frequency and an information gain algorithm;
the distance sorting module is used for calculating Euclidean distances between the target characteristic vector and the characteristic vectors of all texts in the training text data set and sorting the Euclidean distances;
the neighbor text obtaining module is used for selecting the texts corresponding to the first preset number of Euclidean distances at the end with the small Euclidean distance in the sequence as the neighbor texts of the texts to be classified;
the weight calculation module is used for calculating the weight of the text to be classified to each type of text in the text category set by using a K nearest neighbor algorithm based on each neighbor text; the text classification set is obtained by classifying the texts in the training text data set in advance according to the feature item set;
the text category determining module is used for determining the text category corresponding to the maximum weight value as the text category of the text to be classified;
a training module, the training module comprising:
the word segmentation characteristic word obtaining submodule is used for performing word segmentation on each text in the training text data set by utilizing a jieba word segmentation algorithm to obtain a word segmentation set of each text;
the characteristic weight obtaining submodule is used for calculating the word frequency and the reverse file frequency of each word in each participle set, calculating the product of the word frequency and the reverse file frequency corresponding to each word in each participle set and obtaining the characteristic weight corresponding to each word in each participle set;
the feature word set obtaining sub-module is used for respectively carrying out weight sorting on all words in the word segmentation set corresponding to each text according to the feature weight, selecting a first preset number of words from one end of each text with a large weight in the weight sorting as primary feature words of each text, and combining the primary feature words to obtain a primary feature word set of the training text data set;
the gain value sequencing submodule is used for calculating the information gain value of each initially selected feature word in the initially selected feature word set by using the information gain algorithm and sorting the information gain values;
and the feature item set and text category set submodule is used for selecting a third preset number of the primarily selected feature words from one end with a large gain value in the gain value sequence as final feature words to obtain a feature item set consisting of the final feature words, and classifying each text in the training text data set according to the final feature words to obtain the text category set.
6. The text classification apparatus according to claim 5, wherein the participle feature word obtaining submodule includes a participle unit,
the word segmentation unit is specifically a unit for performing word segmentation on each text in the training text data set by using an accurate mode of a jieba word segmentation algorithm.
7. A text classification apparatus, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the text classification method according to any one of claims 1 to 4 when executing said computer program.
8. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the text classification method according to one of the claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910594623.9A CN110287328B (en) | 2019-07-03 | 2019-07-03 | Text classification method, device and equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110287328A CN110287328A (en) | 2019-09-27 |
CN110287328B true CN110287328B (en) | 2021-03-16 |
Family
ID=68020450
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910594623.9A Active CN110287328B (en) | 2019-07-03 | 2019-07-03 | Text classification method, device and equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110287328B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111177386B (en) * | 2019-12-27 | 2021-05-14 | 安徽商信政通信息技术股份有限公司 | Proposal classification method and system |
CN111143303B (en) * | 2019-12-31 | 2023-06-02 | 海南电网有限责任公司信息通信分公司 | Log classification method based on information gain and improved KNN algorithm |
CN111091161B (en) * | 2019-12-31 | 2023-09-22 | 中国银行股份有限公司 | Data classification method, device and system |
CN111259148B (en) * | 2020-01-19 | 2024-03-26 | 北京小米松果电子有限公司 | Information processing method, device and storage medium |
CN111538766B (en) * | 2020-05-19 | 2023-06-30 | 支付宝(杭州)信息技术有限公司 | Text classification method, device, processing equipment and bill classification system |
CN111667152B (en) * | 2020-05-19 | 2024-07-02 | 深圳莫比嗨客树莓派智能机器人有限公司 | Automatic auditing method for text data calibration task based on crowdsourcing |
CN111695353B (en) * | 2020-06-12 | 2023-07-04 | 百度在线网络技术(北京)有限公司 | Method, device, equipment and storage medium for identifying timeliness text |
CN111708888B (en) * | 2020-06-16 | 2023-10-24 | 腾讯科技(深圳)有限公司 | Classification method, device, terminal and storage medium based on artificial intelligence |
CN112214598B (en) * | 2020-09-27 | 2023-01-13 | 吾征智能技术(北京)有限公司 | Cognitive system based on hair condition |
CN112861974A (en) * | 2021-02-08 | 2021-05-28 | 和美(深圳)信息技术股份有限公司 | Text classification method and device, electronic equipment and storage medium |
CN113094494B (en) * | 2021-04-19 | 2024-09-13 | 广东电网有限责任公司 | Intelligent classification method, device, equipment and medium for electric power operation ticket text |
CN113364751B (en) * | 2021-05-26 | 2023-06-09 | 北京电子科技职业学院 | Network attack prediction method, computer readable storage medium and electronic device |
CN114068028A (en) * | 2021-11-18 | 2022-02-18 | 泰康保险集团股份有限公司 | Medical inquiry data processing method and device, readable storage medium and electronic equipment |
CN114817526B (en) * | 2022-02-21 | 2024-03-29 | 华院计算技术(上海)股份有限公司 | Text classification method and device, storage medium and terminal |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622373B (en) * | 2011-01-31 | 2013-12-11 | 中国科学院声学研究所 | Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm |
CN102789473A (en) * | 2011-05-18 | 2012-11-21 | 国际商业机器公司 | Identifier retrieval method and equipment |
CN105426426B (en) * | 2015-11-04 | 2018-11-02 | 北京工业大学 | A kind of KNN file classification methods based on improved K-Medoids |
CN107045503B (en) * | 2016-02-05 | 2019-03-05 | 华为技术有限公司 | A kind of method and device that feature set determines |
CN107169086B (en) * | 2017-05-12 | 2020-10-27 | 北京化工大学 | Text classification method |
CN107481132A (en) * | 2017-08-02 | 2017-12-15 | 上海前隆信息科技有限公司 | A kind of credit estimation method and system, storage medium and terminal device |
CN108764399B (en) * | 2018-05-22 | 2021-03-19 | 东南大学 | kNN-based RFID tag classification method and device |
CN109299263B (en) * | 2018-10-10 | 2021-01-05 | 上海观安信息技术股份有限公司 | Text classification method and electronic equipment |
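Several of the family citations above pair TF-IDF term weighting with a kNN classifier (e.g. CN102622373B's TF*IDF-based statistical classifier and CN105426426B's improved-K-Medoids kNN method). As a rough, hedged illustration of that general technique only — not the method claimed by this or any cited patent — a minimal pure-Python sketch of TF-IDF vectorization followed by cosine-similarity kNN classification might look like:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF weight vectors (as sparse dicts) for tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each term once per doc
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed IDF
    vecs = [{t: (c / len(doc)) * idf[t] for t, c in Counter(doc).items()}
            for doc in docs]
    return vecs, idf

def vectorize(doc, idf):
    """Vectorize a new document with the training IDF (unseen terms get 0)."""
    tf = Counter(doc)
    return {t: (c / len(doc)) * idf.get(t, 0.0) for t, c in tf.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train_vecs, labels, query_vec, k=3):
    """Label the query by majority vote among its k most similar training docs."""
    sims = sorted(((cosine(query_vec, v), lab)
                   for v, lab in zip(train_vecs, labels)), reverse=True)
    return Counter(lab for _, lab in sims[:k]).most_common(1)[0][0]
```

For example, training on four toy token lists labeled "spam"/"ham" and classifying `["meeting", "notes"]` returns "ham", since its cosine similarity to the meeting-related documents dominates the vote. Real systems in this family would add word segmentation, stop-word removal, and feature selection upstream of this step.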
- 2019-07-03: CN application CN201910594623.9A filed; granted as patent CN110287328B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN110287328A (en) | 2019-09-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110287328B (en) | Text classification method, device and equipment and computer readable storage medium | |
Trstenjak et al. | KNN with TF-IDF based framework for text categorization | |
CN111767403B (en) | Text classification method and device | |
CN100583101C (en) | Text categorization feature selection and weight computation method based on field knowledge | |
Wilkinson et al. | Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections | |
CN108228541B (en) | Method and device for generating document abstract | |
CN107145560B (en) | Text classification method and device | |
JPH07114572A (en) | Document classifying device | |
CN111159359A (en) | Document retrieval method, document retrieval device and computer-readable storage medium | |
CN110134777B (en) | Question duplication eliminating method and device, electronic equipment and computer readable storage medium | |
CN109492217B (en) | Word segmentation method based on machine learning and terminal equipment | |
CN110858217A (en) | Method and device for detecting microblog sensitive topics and readable storage medium | |
CN107357895B (en) | Text representation processing method based on bag-of-words model | |
CN112579783B (en) | Short text clustering method based on Laplace atlas | |
CN112667806B (en) | Text classification screening method using LDA | |
CN113807073B (en) | Text content anomaly detection method, device and storage medium | |
CN116910599A (en) | Data clustering method, system, electronic equipment and storage medium | |
CN113934848B (en) | Data classification method and device and electronic equipment | |
CN114896398A (en) | Text classification system and method based on feature selection | |
CN112711944B (en) | Word segmentation method and system, and word segmentation device generation method and system | |
CN117171331A (en) | Professional field information interaction method, device and equipment based on large language model | |
CN111831819B (en) | Text updating method and device | |
CN116881451A (en) | Text classification method based on machine learning | |
CN116935057A (en) | Target evaluation method, electronic device, and computer-readable storage medium | |
CN107729509B (en) | Discourse similarity determination method based on recessive high-dimensional distributed feature representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||