CN113641824A

CN113641824A - Text classification system and method based on deep learning

Info

Publication number: CN113641824A
Application number: CN202110971103.2A
Authority: CN
Inventors: 梅亮
Original assignee: Individual
Current assignee: Individual
Priority date: 2021-08-23
Filing date: 2021-08-23
Publication date: 2021-11-12

Abstract

The invention discloses a text classification system and method based on deep learning. The system comprises a text source acquisition module, a text source input module, a normalization processing module, a text preprocessing module, a training text module, a test text module, a feature extraction dimension reduction unit, a feature weight calculation unit, a training text feature weight calculation, a training text feature vector, a test text feature weight calculation, a test text feature vector and a classification module. The system and method reduce the dimensionality of high-dimensional text data; by segmenting the text into words and removing useless words, they extract more accurate features and improve classification accuracy.

Description

Text classification system and method based on deep learning

Technical Field

The invention relates to the technical field of text classification, in particular to a text classification system and method based on deep learning.

Background

With the rapid development of network technology, massive information resources exist in the form of text, and people hope to find the content that interests them quickly and effectively in this flood of information. Text classification is an important research direction in information processing and a common method for text information discovery. Deep learning is an unsupervised method of feature learning and hierarchical feature-structure learning; it generally extracts features by reconstructing the original input data and has become popular in the field of machine learning in recent years. In essence, it learns more abstract high-level features by training a network model containing multiple hidden layers on a large amount of training data.

To apply a deep learning neural network algorithm to text classification, the text must first be expressed in a form that a computer can readily process. However, massive data volumes and high feature dimensionality cause many problems for text classification, so existing processing approaches have certain defects and cannot meet people's need to obtain useful knowledge.

Disclosure of Invention

The present invention aims to provide a text classification system and method based on deep learning that solve the problems set forth in the background.

In order to achieve the purpose, the invention provides the following technical scheme:

A text classification system based on deep learning comprises a text source acquisition module, a text source input module, a normalization processing module, a text preprocessing module, a training text module, a test text module, a feature extraction dimension reduction unit, a feature weight calculation unit, a training text feature weight calculation, a training text feature vector, a test text feature weight calculation, a test text feature vector and a classification module, wherein the text preprocessing module comprises a new word adding unit, a word segmentation unit, a useless word removing unit and a full-text index establishing unit, and the classification module comprises a classification judgment unit and a classifier.

As a further scheme of the invention: the text source acquisition module acquires the initial text source; the text source input module inputs the initial source text; and the normalization processing module integrates the source files into text that meets the specification. In the text preprocessing module, the new word adding unit adds new feature words, the word segmentation unit performs word segmentation and passes the features of the segmented text set to the training text module and the test text module, the useless word removing unit removes useless words and stop words, and the full-text index establishing unit establishes full-text indexes for the training text set and the test text set. The training text module obtains the training text set, and the test text module obtains the test text set. The feature extraction dimension reduction unit reduces the feature dimension by using a feature extraction method, and the feature weight calculation unit performs feature weighting on the texts in the training set and the test set. The classification module classifies the texts in the test set: the classifier runs a corresponding classification algorithm to classify the texts in the training set and the test set, and the classification judgment unit judges the classification result and the classification performance.

The text classification method based on deep learning comprises the following steps:

S1: acquiring, inputting and normalizing a text source;

S2: preprocessing the text;

S3: reducing the feature dimension and forming the feature item set of the text feature vector;

S4: performing feature weighting on the texts in the training set and the test set;

S5: classifying the texts in the training set and the test set.

As a still further scheme of the invention: in S1, the source text is obtained by the text source acquisition module, the text source input module transmits the initial source text to the normalization processing module, and the normalization processing module processes the received initial source text into the text form that the model expects.
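The normalization in S1 can be illustrated with a minimal sketch. The source does not say which rules the normalization processing module applies, so the rules below (Unicode NFKC folding, removal of simple markup, whitespace collapsing, lower-casing) and the function name `normalize_text` are assumptions chosen only for illustration.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Bring a raw source text into one uniform form (hypothetical rules)."""
    # Fold full-width and other compatibility characters to their plain forms.
    text = unicodedata.normalize("NFKC", raw)
    # Replace simple markup remnants such as <b> or </p> with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse every run of whitespace (including newlines) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


if __name__ == "__main__":
    print(normalize_text("  Ｄｅｅｐ   <b>Learning</b>\n"))
```

Applying the same function to every incoming document gives the later steps a single, predictable input format.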

As a still further scheme of the invention: in S2, the text is preprocessed. Full-text indexes are established separately for the training text set and the test text set, yielding a training set index and a test set index. While the indexes are being built, the word segmentation unit segments the texts of the training text set and the test text set, producing an original feature word set for each, and these sets are transmitted to the training text module and the test text module respectively.
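The preprocessing in S2 can be sketched as follows. The source names neither the segmentation algorithm nor the index structure, so this example assumes forward maximum matching against a vocabulary and an inverted index mapping each term to per-document counts. The names `segment` and `build_index` and the sample vocabulary are hypothetical.

```python
from collections import defaultdict


def segment(text, vocab, max_len=4):
    """Forward maximum matching: take the longest vocabulary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no vocabulary word matches.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_index(docs, vocab, stop_words):
    """Build an inverted index: term -> {doc_id: occurrence count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for tok in segment(text, vocab):
            if tok in stop_words or tok.isspace():
                continue  # role of the useless word removing unit
            index[tok][doc_id] = index[tok].get(doc_id, 0) + 1
    return index


if __name__ == "__main__":
    vocab = {"深度", "学习", "文本", "分类"}
    print(segment("深度学习的文本分类", vocab))
    vocab.add("深度学习")  # role of the new word adding unit
    index = build_index({"d1": "深度学习的文本分类"}, vocab, {"的"})
    print(dict(index))
```

Adding "深度学习" to the vocabulary changes the segmentation from two tokens to one, which is the effect the new word adding unit is meant to have. The training set and the test set would each get their own index by calling `build_index` once per set.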

As a still further scheme of the invention: in S3, the feature extraction dimension reduction unit reduces the feature dimension by using a feature extraction method. The statistics that the method requires are gathered with an index query function and substituted into the method to compute a feature evaluation value for each feature; the features are then sorted in descending order of that value, and the features that best discriminate between categories are selected to form the feature item set of the text feature vector.
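The ranking in S3 can be illustrated with a short sketch. The source does not name the feature extraction method, so the chi-square statistic is used here purely as an assumed example of a feature evaluation value; `chi_square`, `select_features` and the toy documents are hypothetical. The counts computed inside the loop stand in for what the index query function would return.

```python
def chi_square(a, b, c, d):
    """Chi-square of a term/class pair from a 2x2 document-count table.

    a: in-class docs containing the term, b: out-of-class docs containing it,
    c: in-class docs without it,          d: out-of-class docs without it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(docs, k):
    """Rank terms by their best per-class chi-square value and keep the top k."""
    classes = {label for _, label in docs}
    terms = sorted({t for tokens, _ in docs for t in tokens})
    scores = {}
    for term in terms:
        best = 0.0
        for cls in classes:
            a = sum(1 for toks, lab in docs if term in toks and lab == cls)
            b = sum(1 for toks, lab in docs if term in toks and lab != cls)
            c = sum(1 for toks, lab in docs if term not in toks and lab == cls)
            d = len(docs) - a - b - c
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    # Descending order of the evaluation value, as described in S3.
    ranked = sorted(terms, key=lambda t: scores[t], reverse=True)
    return ranked[:k], scores


if __name__ == "__main__":
    training_docs = [
        ({"goal", "match"}, "sport"),
        ({"goal", "team"}, "sport"),
        ({"stock", "market"}, "finance"),
        ({"stock", "team"}, "finance"),
    ]
    top, scores = select_features(training_docs, k=2)
    print(top, scores)
```

In this toy data "team" appears equally in both classes and scores zero, so it is dropped, while "goal" and "stock" each occur in only one class and rank highest.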

As a still further scheme of the invention: in S4, feature weighting is performed on the texts of the training set and the test set in the training text module and the test text module, and each text is represented by a spatial feature vector. The feature items of the training text feature vector and the test text feature vector are the feature item set obtained from text feature extraction. For each text, a user-defined index query function is used to obtain the distribution statistics of the feature items in that text and in the corresponding categories.
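The weighting in S4 can be sketched as follows. The source does not specify the weighting formula, so TF-IDF with a smoothed inverse document frequency and length normalization is assumed here as one common choice; `tfidf_vectors` and its arguments are hypothetical names.

```python
import math


def tfidf_vectors(token_lists, features):
    """Turn tokenized texts into unit-length TF-IDF vectors over `features`."""
    n_docs = len(token_lists)
    # Document frequency of each selected feature item.
    df = {f: sum(1 for toks in token_lists if f in toks) for f in features}
    # Smoothed inverse document frequency (assumed formula).
    idf = {f: math.log((1 + n_docs) / (1 + df[f])) + 1.0 for f in features}
    vectors = []
    for toks in token_lists:
        vec = [toks.count(f) * idf[f] for f in features]
        norm = math.sqrt(sum(w * w for w in vec))
        # Leave all-zero vectors unchanged to avoid dividing by zero.
        vectors.append([w / norm for w in vec] if norm else vec)
    return vectors


if __name__ == "__main__":
    texts = [["goal", "goal", "team"], ["stock", "team"]]
    for vec in tfidf_vectors(texts, ["goal", "stock", "team"]):
        print([round(w, 3) for w in vec])
```

This sketch derives the document frequencies from whatever texts it is given. In practice the statistics would usually be taken from the training set and reused when weighting the test set, so that both sets share one feature space.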

As a still further scheme of the invention: in S5, the classifier runs a corresponding classification algorithm, taking the spatial feature vectors that represent all texts in the training set and the test set as input data, to classify the texts in the training set and the test set; after classification is completed, the classification judgment unit evaluates the classification performance.
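The classification and evaluation in S5 can be illustrated with a small stand-in. The source does not specify the classification algorithm or the performance measure, so this sketch assumes a nearest-centroid classifier with dot-product similarity and plain accuracy. It is not the deep learning model itself; a neural network classifier would consume the same feature vectors. All names are hypothetical.

```python
from collections import defaultdict


def train_centroids(vectors, labels):
    """Average the training vectors of each class into one centroid."""
    sums, counts = {}, defaultdict(int)
    for vec, lab in zip(vectors, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def predict(centroids, vec):
    """Return the class whose centroid has the largest dot product with vec."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return max(centroids, key=lambda lab: dot(centroids[lab], vec))


def accuracy(predicted, actual):
    """Fraction of texts whose predicted class matches the true class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    train_y = ["sport", "sport", "finance", "finance"]
    test_x, test_y = [[0.8, 0.2], [0.2, 0.7]], ["sport", "finance"]

    centroids = train_centroids(train_x, train_y)
    predictions = [predict(centroids, v) for v in test_x]
    print(predictions, accuracy(predictions, test_y))
```

Here `accuracy` plays the role of the classification judgment unit; per-class precision and recall could be computed in the same way from the predicted and true labels.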

Compared with the prior art, the invention has the beneficial effects that:

1. The text classification system and method based on deep learning reduce the dimensionality of high-dimensional text data. Segmenting the text into words and removing useless words allows more accurate features to be extracted, which improves classification accuracy. The new word adding unit adds new words, so the vocabulary can be updated iteratively, which improves the effect and efficiency of model training. In addition, feature weighting of the texts in the training set and the test set yields the distribution statistics of the feature items in each text and in the corresponding categories, which further improves the accuracy of text classification and recognition. The invention therefore has good application prospects.

Drawings

Fig. 1 is a block diagram of a text classification system and method based on deep learning.

Fig. 2 is a block diagram of a method flow of a text classification method based on deep learning.

Fig. 3 is a block diagram of a text preprocessing module in a deep learning based text classification system.

Fig. 4 is a block diagram of a classification module in the deep learning based text classification system and method.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to Figs. 1 to 4, in an embodiment of the present invention, a text classification system based on deep learning includes a text source acquisition module, a text source input module, a normalization processing module, a text preprocessing module, a training text module, a test text module, a feature extraction dimension reduction unit, a feature weight calculation unit, a training text feature weight calculation, a training text feature vector, a test text feature weight calculation, a test text feature vector, and a classification module. The text preprocessing module includes a new word adding unit, a word segmentation unit, a useless word removing unit, and a full-text index establishing unit, and the classification module includes a classification judgment unit and a classifier. The text source acquisition module acquires the initial text source, the text source input module inputs the initial source text, and the normalization processing module integrates the source files into text that meets the specification. In the text preprocessing module, the new word adding unit adds new feature words, the word segmentation unit performs word segmentation and passes the features of the segmented text set to the training text module and the test text module, the useless word removing unit removes useless words and stop words, and the full-text index establishing unit establishes full-text indexes for the training text set and the test text set. The training text module obtains the training text set, and the test text module obtains the test text set. The feature extraction dimension reduction unit reduces the feature dimension by using a feature extraction method, and the feature weight calculation unit performs feature weighting on the texts in the training set and the test set. The classification module classifies the texts in the test set: the classifier runs a corresponding classification algorithm to classify the texts in the training set and the test set, and the classification judgment unit judges the classification result and the classification performance.

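The text classification method based on deep learning comprises the following steps:

S1: acquiring, inputting and normalizing a text source;

S2: preprocessing the text;

S3: reducing the feature dimension and forming the feature item set of the text feature vector;

S4: performing feature weighting on the texts in the training set and the test set;

S5: classifying the texts in the training set and the test set.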

In S1, the source text is obtained by the text source acquisition module, the text source input module transmits the initial source text to the normalization processing module, and the normalization processing module processes the received initial source text into the text form that the model expects.

In S2, the text is preprocessed. Full-text indexes are established separately for the training text set and the test text set, yielding a training set index and a test set index. While the indexes are being built, the word segmentation unit segments the texts of the training text set and the test text set, producing an original feature word set for each, and these sets are transmitted to the training text module and the test text module respectively.

In S3, the feature extraction dimension reduction unit reduces the feature dimension by using a feature extraction method. The statistics that the method requires are gathered with an index query function and substituted into the method to compute a feature evaluation value for each feature; the features are then sorted in descending order of that value, and the features that best discriminate between categories are selected to form the feature item set of the text feature vector.

In S4, feature weighting is performed on the texts of the training set and the test set in the training text module and the test text module, and each text is represented by a spatial feature vector. The feature items of the training text feature vector and the test text feature vector are the feature item set obtained from text feature extraction. For each text, a user-defined index query function is used to obtain the distribution statistics of the feature items in that text and in the corresponding categories.

In S5, the classifier runs a corresponding classification algorithm, taking the spatial feature vectors that represent all texts in the training set and the test set as input data, to classify the texts in the training set and the test set; after classification is completed, the classification judgment unit judges the classification performance.

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes in the embodiments and/or modifications of the invention can be made, and equivalents and modifications of some features of the invention can be made without departing from the spirit and scope of the invention.

Claims

1. A text classification system based on deep learning, comprising a text source acquisition module, a text source input module, a normalization processing module, a text preprocessing module, a training text module, a test text module, a feature extraction dimension reduction unit, a feature weight calculation unit, a training text feature weight calculation, a training text feature vector, a test text feature weight calculation, a test text feature vector and a classification module, characterized in that: the text preprocessing module comprises a new word adding unit, a word segmentation unit, a useless word removing unit and a full-text index establishing unit, and the classification module comprises a classification judgment unit and a classifier.

2. The deep learning based text classification system according to claim 1, characterized in that: the text source acquisition module acquires the initial text source; the text source input module inputs the initial source text; the normalization processing module integrates the source files into text that meets the specification; in the text preprocessing module, the new word adding unit adds new feature words, the word segmentation unit performs word segmentation and passes the features of the segmented text set to the training text module and the test text module, the useless word removing unit removes useless words and stop words, and the full-text index establishing unit establishes full-text indexes for the training text set and the test text set; the training text module obtains the training text set; the test text module obtains the test text set; the feature extraction dimension reduction unit reduces the feature dimension by using a feature extraction method; the feature weight calculation unit performs feature weighting on the texts in the training set and the test set; the classification module classifies the texts in the test set; the classifier runs a corresponding classification algorithm to classify the texts in the training set and the test set; and the classification judgment unit judges the classification result and the classification performance.

3. A text classification method based on deep learning, characterized in that the method comprises the following steps:

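S1: acquiring, inputting and normalizing a text source;

S2: preprocessing the text;

S3: reducing the feature dimension and forming the feature item set of the text feature vector;

S4: performing feature weighting on the texts in the training set and the test set;

S5: classifying the texts in the training set and the test set.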

4. The deep learning based text classification method according to claim 3, characterized in that: in S1, the source text is obtained by the text source acquisition module, the text source input module transmits the initial source text to the normalization processing module, and the normalization processing module processes the received initial source text into the text form that the model expects.

5. The deep learning based text classification method according to claim 3, characterized in that: in S2, the text is preprocessed; full-text indexes are established separately for the training text set and the test text set, yielding a training set index and a test set index; and while the indexes are being built, the word segmentation unit segments the texts of the training text set and the test text set, producing an original feature word set for each, and these sets are transmitted to the training text module and the test text module respectively.

6. The deep learning based text classification method according to claim 3, characterized in that: in S3, the feature extraction dimension reduction unit reduces the feature dimension by using a feature extraction method; the statistics that the method requires are gathered with an index query function and substituted into the method to compute a feature evaluation value for each feature; and the features are sorted in descending order of that value, and the features that best discriminate between categories are selected to form the feature item set of the text feature vector.

7. The deep learning based text classification method according to claim 3, characterized in that: in S4, feature weighting is performed on the texts of the training set and the test set in the training text module and the test text module, and each text is represented by a spatial feature vector; the feature items of the training text feature vector and the test text feature vector are the feature item set obtained from text feature extraction; and for each text, a user-defined index query function is used to obtain the distribution statistics of the feature items in that text and in the corresponding categories.

8. The deep learning based text classification method according to claim 3, characterized in that: in S5, the classifier runs a corresponding classification algorithm, taking the spatial feature vectors that represent all texts in the training set and the test set as input data, to classify the texts in the training set and the test set, and after classification is completed, the classification judgment unit evaluates the classification performance.