CN113641824A - Text classification system and method based on deep learning - Google Patents
Text classification system and method based on deep learning
- Publication number
- CN113641824A (application number CN202110971103.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- module
- feature
- test
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text classification system and method based on deep learning. The system comprises a text source acquisition module, a text source input module, a normalization processing module, a text preprocessing module, a training text module, a test text module, a feature extraction and dimension reduction unit, a feature weight calculation unit, a training text feature weight calculation, a training text feature vector, a test text feature weight calculation, a test text feature vector and a classification module. The system and method reduce the dimensionality of high-dimensional text data; by segmenting the text into words and removing useless words, they extract more accurate features and improve classification accuracy.
Description
Technical Field
The invention relates to the technical field of text classification, in particular to a text classification system and method based on deep learning.
Background
With the rapid development of network technology, massive information resources exist in the form of text, and people hope to find the content that interests them quickly and effectively in this explosive wave of information. Text classification, an important research direction in information processing, is a common method for text information discovery. Deep learning is an unsupervised method for learning features and feature hierarchies; it generally extracts features by reconstructing the original input data, and it has become popular in the field of machine learning in recent years. In essence, it learns more abstract high-level features by training a network model that contains multiple hidden layers on a large amount of training data.
To apply a neural network algorithm from deep learning to text classification, the text must first be expressed in a form that a computer can process easily. However, massive data volumes and high feature dimensionality cause many problems for text classification, so existing processing approaches have certain defects and cannot meet people's need to obtain useful knowledge.
Disclosure of Invention
The present invention aims to provide a text classification system and method based on deep learning that solve the problems set forth in the background.
To achieve this purpose, the invention provides the following technical solution:
The text classification system based on deep learning comprises a text source obtaining module, a text source input module, a normalization processing module, a text preprocessing module, a training text module, a test text module, a feature extraction and dimension reduction unit, a feature weight calculation unit, a training text feature vector, a test text feature weight calculation unit, a test text feature vector and a classification module. The text preprocessing module comprises a new word adding unit, a word segmentation unit, a useless word removing unit and a full-text index establishing unit, and the classification module comprises a classification judgment unit and a classifier.
As a further scheme of the invention: the text source acquisition module acquires the initial text source; the text source input module inputs the initial source text; the normalization processing module integrates the source files into text that meets the specification; in the text preprocessing module, the new word adding unit adds new feature words, the word segmentation unit segments the text and assigns the segmented feature text sets to the training text module and the test text module, the useless word removing unit removes useless words and stop words, and the full-text index establishing unit builds full-text indexes for the training text set and the test text set; the training text module obtains the training text set and the test text module obtains the test text set; the feature extraction and dimension reduction unit reduces the feature dimension by using a feature extraction method; the feature weight calculation unit applies feature weighting to the texts in the training set and the test set; the classification module classifies the texts in the test set; the classifier classifies the texts in the training set and the test set by running a corresponding classification algorithm; and the classification judgment unit judges the classification result and the classification performance.
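The three token-level units of the text preprocessing module described above (new word adding, word segmentation, useless word removal) can be illustrated with a small sketch. The patent does not say which segmentation algorithm is used, so forward maximum matching against a dictionary is an assumption here, and the class name `Segmenter` and its methods are hypothetical names chosen for illustration.

```python
from typing import Iterable, List, Set


class Segmenter:
    """Dictionary-based segmenter using forward maximum matching (assumed algorithm)."""

    def __init__(self, vocabulary: Iterable[str], stop_words: Iterable[str] = ()):
        self.vocabulary: Set[str] = set(vocabulary)
        self.stop_words: Set[str] = set(stop_words)

    def add_new_words(self, words: Iterable[str]) -> None:
        # Plays the role of the new word adding unit: extend the dictionary.
        self.vocabulary.update(words)

    def segment(self, text: str) -> List[str]:
        # Plays the role of the word segmentation unit.
        max_len = max(1, max((len(w) for w in self.vocabulary), default=1))
        tokens: List[str] = []
        i = 0
        while i < len(text):
            # Try the longest candidate first; fall back to a single character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in self.vocabulary:
                    tokens.append(piece)
                    i += size
                    break
        return tokens

    def remove_stop_words(self, tokens: List[str]) -> List[str]:
        # Plays the role of the useless word removing unit.
        return [t for t in tokens if t not in self.stop_words and not t.isspace()]


segmenter = Segmenter(["深度", "学习", "文本", "分类"], stop_words=["的"])
before = segmenter.segment("深度学习的文本分类")
segmenter.add_new_words(["深度学习"])
after = segmenter.remove_stop_words(segmenter.segment("深度学习的文本分类"))
```

In this sketch, adding "深度学习" to the dictionary makes the segmenter emit it as a single token instead of two, which is one way to read the role the patent assigns to the new word adding unit.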
The text classification method based on deep learning comprises the following steps:
S1: acquiring, inputting and normalizing the text source;
S2: preprocessing the text;
S3: reducing the feature dimension and forming the feature item set of the text feature vector;
S4: applying feature weighting to the texts in the training set and the test set;
S5: classifying the texts in the training set and the test set.
As a still further scheme of the invention: in S1, the text source obtaining module obtains the source text, the text source input module transmits the initial source text to the normalization processing module, and the normalization processing module processes the received text into the form that the model expects.
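A minimal sketch of what the normalization in S1 might look like follows. The patent does not list the concrete normalization rules, so the choices below (Unicode NFKC folding, stripping simple markup tags, lowercasing, collapsing whitespace) are assumptions, and `normalize_text` is a hypothetical helper name.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Bring a raw source text into one consistent form (assumed rules)."""
    text = unicodedata.normalize("NFKC", raw)  # fold full-width and compatibility characters
    text = re.sub(r"<[^>]+>", " ", text)        # drop simple markup tags
    text = text.lower()                         # case-fold Latin letters
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()


cleaned = normalize_text("  <p>Ｄｅｅｐ   Learning</p>\n")
```

Applying the same function to both training and test sources keeps the two sets in the same form before preprocessing.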
As a still further scheme of the invention: in S2, the text is preprocessed: full-text indexes are built separately for the training text set and the test text set, yielding a training set index and a test set index. While the indexes are being built, the word segmentation unit segments the training text set and the test text set, producing an original feature word set for each, and these are passed to the training text module and the test text module respectively.
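The full-text index built in S2 can be pictured as an inverted index over the segmented texts. The sketch below is an assumption about its structure, since the patent does not describe the index layout; `FullTextIndex` and its query methods are hypothetical names standing in for the "index query function" mentioned in later steps.

```python
from collections import Counter, defaultdict
from typing import Dict, List


class FullTextIndex:
    """Inverted index: term -> {document id -> term frequency}."""

    def __init__(self) -> None:
        self.postings: Dict[str, Dict[int, int]] = defaultdict(dict)
        self.labels: Dict[int, str] = {}

    def add_document(self, doc_id: int, tokens: List[str], label: str) -> None:
        # Record the category and the per-term counts of one segmented text.
        self.labels[doc_id] = label
        for term, count in Counter(tokens).items():
            self.postings[term][doc_id] = count

    def document_frequency(self, term: str) -> int:
        # Number of indexed texts that contain the term.
        return len(self.postings.get(term, {}))

    def class_distribution(self, term: str) -> Dict[str, int]:
        # Number of texts per category that contain the term.
        return dict(Counter(self.labels[d] for d in self.postings.get(term, {})))


train_index = FullTextIndex()
train_index.add_document(0, ["goal", "match", "goal"], "sports")
train_index.add_document(1, ["match", "stock"], "finance")
```

A separate `FullTextIndex` instance would be built for the test text set, matching the patent's separate training set index and test set index.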
As a still further scheme of the invention: in S3, the feature extraction and dimension reduction unit reduces the feature dimension by using a feature extraction method. The statistics that the method requires are gathered with an index query function and substituted into the method to calculate a feature evaluation value for each feature; the features are then sorted in descending order of evaluation value, and the features that discriminate best between categories are selected to form the feature item set of the text feature vector.
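The ranking procedure in S3 can be sketched as follows. The patent does not name the feature evaluation function, so the chi-square statistic is used here purely as an assumed example; `chi_square` and `select_features` are hypothetical helper names, and documents are represented as sets of terms for simplicity.

```python
from typing import Dict, List, Sequence, Set


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for one (term, category) 2x2 contingency table.

    a: texts in the category containing the term, b: texts outside it containing the term,
    c: texts in the category without the term,    d: texts outside it without the term.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(docs: Sequence[Set[str]], labels: Sequence[str], k: int) -> List[str]:
    """Score every term, sort in descending order of score, keep the top k."""
    categories = sorted(set(labels))
    vocabulary = sorted(set().union(*docs))
    scores: Dict[str, float] = {}
    for term in vocabulary:
        best = 0.0
        for cat in categories:
            a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == cat)
            b = sum(1 for doc, y in zip(docs, labels) if term in doc and y != cat)
            c = sum(1 for doc, y in zip(docs, labels) if term not in doc and y == cat)
            d = len(docs) - a - b - c
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    # Descending by evaluation value; ties broken alphabetically for determinism.
    return sorted(vocabulary, key=lambda t: (-scores[t], t))[:k]


docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"stock", "match"}]
labels = ["sports", "sports", "finance", "finance"]
top_terms = select_features(docs, labels, k=2)
```

In this toy corpus "match" appears equally in both categories and scores zero, so it is dropped, which is the dimension reduction effect the step aims for.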
As a still further scheme of the invention: in S4, feature weighting is applied to the texts in the training set and the test set held by the training text module and the test text module. Each text is represented by a spatial feature vector whose feature items are the feature item set produced by text feature extraction, and a user-defined index query function is applied to each text to obtain the distribution statistics of its feature items and of the corresponding categories.
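One common way to realize the feature weighting in S4 is TF-IDF over the selected feature items. The patent does not name its weighting formula, so TF-IDF with a smoothed inverse document frequency and L2 normalization is an assumption, and the function names below are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def inverse_document_frequency(train_docs: Sequence[List[str]],
                               features: Sequence[str]) -> Dict[str, float]:
    """Smoothed IDF computed from the training texts only."""
    n = len(train_docs)
    idf: Dict[str, float] = {}
    for term in features:
        df = sum(1 for doc in train_docs if term in doc)
        idf[term] = math.log((1 + n) / (1 + df)) + 1.0
    return idf


def to_vector(tokens: List[str], features: Sequence[str],
              idf: Dict[str, float]) -> List[float]:
    """Map one segmented text to an L2-normalized TF-IDF vector over the feature items."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in features]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw


train_docs = [["goal", "goal", "match"], ["stock", "match"]]
features = ["goal", "stock"]
idf = inverse_document_frequency(train_docs, features)
train_vector = to_vector(train_docs[0], features, idf)
test_vector = to_vector(["unknown"], features, idf)
```

The IDF table is fitted on the training texts and then reused for the test texts, so that test-set statistics do not leak into the weights; this is a design choice of the sketch rather than something the patent specifies.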
As a still further scheme of the invention: in S5, the classifier runs a corresponding classification algorithm, taking the spatial feature vectors that represent all texts in the training set and the test set as input data, to classify those texts; after classification is complete, the classification judgment unit evaluates the classification performance.
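The classify-then-evaluate flow of S5 can be sketched with a deliberately simple stand-in. The patent does not specify the network architecture or the evaluation metric, so the nearest-centroid classifier with cosine similarity below is not a deep learning model and is only an assumed placeholder for the classifier; accuracy is likewise an assumed metric for the classification judgment unit. All names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


class CentroidClassifier:
    """Placeholder classifier: assigns the category whose mean vector is most similar."""

    def fit(self, vectors: Sequence[Vector], labels: Sequence[str]) -> "CentroidClassifier":
        grouped: Dict[str, List[Vector]] = defaultdict(list)
        for vec, label in zip(vectors, labels):
            grouped[label].append(vec)
        # Column-wise mean of the vectors of each category.
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()
        }
        return self

    def predict(self, vector: Vector) -> str:
        return max(sorted(self.centroids),
                   key=lambda label: cosine(vector, self.centroids[label]))


def accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Share of texts whose predicted category matches the expected one."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected) if expected else 0.0


train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_y = ["sports", "sports", "finance", "finance"]
clf = CentroidClassifier().fit(train_x, train_y)
test_x = [[0.8, 0.2], [0.2, 0.8]]
test_y = ["sports", "finance"]
predictions = [clf.predict(v) for v in test_x]
score = accuracy(predictions, test_y)
```

A multi-layer neural network could replace `CentroidClassifier` behind the same `fit` and `predict` interface without changing the evaluation step.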
Compared with the prior art, the invention has the beneficial effects that:
1. The text classification system and method based on deep learning reduce the dimensionality of high-dimensional text data; by segmenting the text and removing useless words, they extract more accurate features and improve classification accuracy. The new word adding unit allows new words to be added, which accommodates the iterative replacement of vocabulary and improves the effect and efficiency of model training. In addition, applying feature weighting to the texts in the training set and the test set yields distribution statistics of the feature items and their corresponding categories, which further improves the accuracy of text classification and recognition. The system and method therefore have good application prospects.
Drawings
FIG. 1 is a block diagram of a text classification system and method based on deep learning.
FIG. 2 is a flow diagram of a text classification method based on deep learning.
FIG. 3 is a block diagram of a text pre-processing module in a deep learning based text classification system.
FIG. 4 is a block diagram of a classification module in the deep learning based text classification system and method.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIGS. 1 to 4, in an embodiment of the present invention, a text classification system based on deep learning includes a text source obtaining module, a text source input module, a normalization processing module, a text preprocessing module, a training text module, a test text module, a feature extraction and dimension reduction unit, a feature weight calculation unit, a training text feature weight calculation, a training text feature vector, a test text feature weight calculation, a test text feature vector, and a classification module. The text preprocessing module includes a new word adding unit, a word segmentation unit, a useless word removing unit, and a full-text index establishing unit, and the classification module includes a classification judgment unit and a classifier.
The text source obtaining module obtains the initial text source, the text source input module inputs the initial source text, and the normalization processing module integrates the source files into text that meets the specification. In the text preprocessing module, the new word adding unit adds new feature words, the word segmentation unit segments the text and assigns the segmented feature text sets to the training text module and the test text module, the useless word removing unit removes useless words and stop words, and the full-text index establishing unit builds full-text indexes for the training text set and the test text set. The training text module obtains the training text set, and the test text module obtains the test text set. The feature extraction and dimension reduction unit reduces the feature dimension by using a feature extraction method, and the feature weight calculation unit applies feature weighting to the texts in the training set and the test set. The classification module classifies the texts in the test set, the classifier classifies the texts in the training set and the test set by running a corresponding classification algorithm, and the classification judgment unit judges the classification result and the classification performance.
The text classification method based on deep learning comprises the following steps:
S1: acquiring, inputting and normalizing the text source;
S2: preprocessing the text;
S3: reducing the feature dimension and forming the feature item set of the text feature vector;
S4: applying feature weighting to the texts in the training set and the test set;
S5: classifying the texts in the training set and the test set.
In S1, the text source obtaining module obtains the source text, the text source input module transmits the initial source text to the normalization processing module, and the normalization processing module processes the received text into the form that the model expects.
In S2, the text is preprocessed: full-text indexes are built separately for the training text set and the test text set, yielding a training set index and a test set index. While the indexes are being built, the word segmentation unit segments the training text set and the test text set, producing an original feature word set for each, and these are passed to the training text module and the test text module respectively.
In S3, the feature extraction and dimension reduction unit reduces the feature dimension by using a feature extraction method. The statistics that the method requires are gathered with an index query function and substituted into the method to calculate a feature evaluation value for each feature; the features are then sorted in descending order of evaluation value, and the features that discriminate best between categories are selected to form the feature item set of the text feature vector.
In S4, feature weighting is applied to the texts in the training set and the test set held by the training text module and the test text module. The features of each text are represented by a spatial feature vector, the feature items of the training text feature vector and the test text feature vector are the feature item set produced by text feature extraction, and a user-defined index query function is applied to each text to obtain the distribution statistics of its feature items and of the corresponding categories.
In S5, the classifier runs a corresponding classification algorithm, taking the spatial feature vectors that represent all texts in the training set and the test set as input data, to classify those texts; after classification is complete, the classification judgment unit evaluates the classification performance.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the embodiments may be modified, and that some of their features may be replaced by equivalents, without departing from the spirit and scope of the invention.
Claims (8)
1. The text classification system based on deep learning comprises a text source acquisition module, a text source input module, a normalization processing module, a text preprocessing module, a training text module, a test text module, a feature extraction dimension reduction unit, a feature weight calculation unit, a training text feature weight calculation, a training text feature vector, a test text feature weight calculation, a test text feature vector and a classification module, and is characterized in that: the text preprocessing module comprises a new word adding unit, a word segmentation unit, a useless word clearing unit and a full-text index establishing unit, and the classification module comprises a classification judgment unit and a classifier.
2. The deep learning based text classification system according to claim 1, characterized in that: the text source acquisition module is used for acquiring a text initial source part, the text source input module is used for inputting an initial source text, the normalization processing module is used for integrating a source file into a text meeting the specifications, the text preprocessing module is used for adding new characteristic words through a new word adding unit, the word segmentation unit is used for carrying out word segmentation and classifying the feature of a text set after word segmentation into a training text module and a test text module, the useless word removing unit is used for removing useless and non-used words, the full-text index establishing unit establishes full-text indexes for the training text set and the test text set, the training text module is used for obtaining a training text set, the test text module is used for obtaining a test text set, and the feature extraction dimension reduction unit carries out feature dimension reduction by using a feature extraction method, the feature weight calculation unit is used for performing feature weighting on texts in a training set and a test set, the classification module is used for classifying the texts in the test set, the classifier is used for classifying the texts in the training set and the test set by operating a corresponding classification algorithm, and the classification judgment unit is used for judging a classification result and classification performance.
3. The text classification method based on deep learning is characterized in that: the method comprises the following steps:
S1: acquiring, inputting and normalizing the text source;
S2: preprocessing the text;
S3: reducing the feature dimension and forming the feature item set of the text feature vector;
S4: applying feature weighting to the texts in the training set and the test set;
S5: classifying the texts in the training set and the test set.
4. The deep learning based text classification method according to claim 3, characterized in that: in S1, the source text is obtained by the text source obtaining module, the initial source text is transmitted to the normalization processing module by the text source input module, and the received initial source text is processed by the normalization processing module to conform to the text processing form of the model.
5. The deep learning based text classification method according to claim 3, characterized in that: in the S2, the text is preprocessed, full-text indexes are respectively established for the training text set and the test text set, and then a training set index and a test set index are respectively obtained, in the process of establishing indexes, word segmentation is performed on the training text set and the test text set in the text set by using word segmentation units, and an original feature word set of the training text set and an original feature word set of the test text set are respectively obtained and correspondingly transmitted to the training text module and the test text module.
6. The deep learning based text classification method according to claim 3, characterized in that: in S3, the feature extraction and dimension reduction unit performs feature dimension reduction by using a feature extraction method, performs statistics on data in the feature extraction method by using an index query function, performs descending order arrangement according to the feature evaluation value after substitution calculation, and selects features with the best category effect to form a feature item set in the text feature vector.
7. The deep learning based text classification method according to claim 3, characterized in that: in the S4, feature weighting is performed on the texts in the training set and the test set in the training text module and the test text module, and the feature items in the training text feature vector and the test text feature vector are represented by the spatial feature vector, and feature item sets extracted and processed for the text features are obtained by using a user-defined index query function for each text, so as to obtain the distribution statistics of the feature items corresponding to the text and the corresponding categories.
8. The deep learning based text classification method according to claim 3, characterized in that: in S5, the classifier operates a corresponding classification algorithm to perform classification, and spatial feature vectors representing all texts in the training set and the test set are used as input data to classify the texts in the training set and the test set, and after classification is completed, the classification performance is evaluated by the classification judgment unit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110971103.2A CN113641824A (en) | 2021-08-23 | 2021-08-23 | Text classification system and method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110971103.2A CN113641824A (en) | 2021-08-23 | 2021-08-23 | Text classification system and method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113641824A | 2021-11-12 |
Family
ID=78423466
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110971103.2A (Pending, published as CN113641824A) | 2021-08-23 | 2021-08-23 | Text classification system and method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113641824A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117851602A (en) * | 2024-03-07 | 2024-04-09 | 武汉百智诚远科技有限公司 | Automatic legal document classification method and system based on deep learning |
CN117851602B (en) * | 2024-03-07 | 2024-05-14 | 武汉百智诚远科技有限公司 | Automatic legal document classification method and system based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108363810B (en) | Text classification method and device | |
CN105469096B (en) | A kind of characteristic bag image search method based on Hash binary-coding | |
CN110825877A (en) | Semantic similarity analysis method based on text clustering | |
CN108509629B (en) | Text emotion analysis method based on emotion dictionary and support vector machine | |
CN108763213A (en) | Theme feature text key word extracting method | |
CN109886020A (en) | Software vulnerability automatic classification method based on deep neural network | |
CN110826618A (en) | Personal credit risk assessment method based on random forest | |
CN110046250A (en) | Three embedded convolutional neural networks model and its more classification methods of text | |
CN103699523A (en) | Product classification method and device | |
CN111556016B (en) | Network flow abnormal behavior identification method based on automatic encoder | |
CN108804595B (en) | Short text representation method based on word2vec | |
CN108846047A (en) | A kind of picture retrieval method and system based on convolution feature | |
CN113420294A (en) | Malicious code detection method based on multi-scale convolutional neural network | |
CN115098690B (en) | Multi-data document classification method and system based on cluster analysis | |
CN112579783B (en) | Short text clustering method based on Laplace atlas | |
CN112036511B (en) | Image retrieval method based on attention mechanism graph convolution neural network | |
CN112417893A (en) | Software function demand classification method and system based on semantic hierarchical clustering | |
CN113641824A (en) | Text classification system and method based on deep learning | |
CN113626604A (en) | Webpage text classification system based on maximum interval criterion | |
CN112200260B (en) | Figure attribute identification method based on discarding loss function | |
CN116935138A (en) | Picture subject content diversity calculation and automatic selection method and system | |
CN116881451A (en) | Text classification method based on machine learning | |
CN114298020B (en) | Keyword vectorization method based on topic semantic information and application thereof | |
Ojo et al. | Improved model for facial expression classification for fear and sadness using local binary pattern histogram | |
CN114925198A (en) | Knowledge-driven text classification method fusing character information |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20211112 |