CN113641824A - Text classification system and method based on deep learning - Google Patents

Info

Publication number
CN113641824A
Authority
CN
China
Prior art keywords
text
module
feature
test
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110971103.2A
Other languages
Chinese (zh)
Inventor
梅亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202110971103.2A
Publication of CN113641824A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification system and method based on deep learning. The system comprises a text source acquisition module, a text source input module, a normalization processing module, a text preprocessing module, a training text module, a test text module, a feature extraction dimension reduction unit, a feature weight calculation unit, a training text feature weight calculation, a training text feature vector, a test text feature weight calculation, a test text feature vector and a classification module. The system and method reduce the dimensionality of high-dimensional text data; by segmenting the text and removing useless words, they extract more accurate features and improve classification accuracy.

Description

Text classification system and method based on deep learning
Technical Field
The invention relates to the technical field of text classification, in particular to a text classification system and method based on deep learning.
Background
With the rapid development of network technology, massive information resources exist in the form of text, and people want to find the content that interests them quickly and effectively in this flood of information. Text classification is an important research direction in information processing and a common approach to text information discovery. Deep learning is a method of unsupervised feature learning and hierarchical feature learning. It generally extracts features by reconstructing the original input data and has become popular in the field of machine learning in recent years. In essence, it learns more abstract, high-level features by using a large amount of training data to build a network model that contains multiple hidden layers.
In order to apply a neural network algorithm from deep learning to text classification, the text must first be expressed in a form that a computer can process easily. However, massive data volumes and high feature dimensionality cause many problems for text classification, so people's need to obtain useful knowledge cannot be met, and the existing processing approaches have certain shortcomings.
Disclosure of Invention
The purpose of the present invention is to provide a text classification system and method based on deep learning that solve the problems set forth in the background art.
In order to achieve the purpose, the invention provides the following technical scheme:
The text classification system based on deep learning comprises a text source acquisition module, a text source input module, a normalization processing module, a text preprocessing module, a training text module, a test text module, a feature extraction dimension reduction unit, a feature weight calculation unit, a training text feature vector, a test text feature weight calculation unit, a test text feature vector and a classification module. The text preprocessing module comprises a new word adding unit, a word segmentation unit, a useless word removal unit and a full-text index establishing unit, and the classification module comprises a classification judgment unit and a classifier.
As a further scheme of the invention: the text source acquisition module acquires the initial text source, and the text source input module inputs the initial source text. The normalization processing module integrates the source files into texts that meet the specification. In the text preprocessing module, the new word adding unit adds new feature words; the word segmentation unit performs word segmentation and assigns the segmented feature words of the text set to the training text module and the test text module; the useless word removal unit removes useless and unused words; and the full-text index establishing unit establishes full-text indexes for the training text set and the test text set. The training text module obtains the training text set, and the test text module obtains the test text set. The feature extraction dimension reduction unit performs feature dimension reduction by using a feature extraction method, and the feature weight calculation unit performs feature weighting on the texts in the training set and the test set. The classification module classifies the texts in the test set; the classifier classifies the texts in the training set and the test set by running a corresponding classification algorithm, and the classification judgment unit evaluates the classification result and the classification performance.
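The three preprocessing units described above can be sketched as follows. This is only an illustration under stated assumptions: the patent does not name a segmentation algorithm, so forward maximum matching against a word dictionary is used here as a stand-in, and the function names, the sample dictionary, the useless-word list and the sample sentence are all hypothetical.

```python
def add_new_words(dictionary, new_words):
    """New word adding unit (hypothetical): extend the dictionary with new feature words."""
    return set(dictionary) | set(new_words)


def segment(text, dictionary, max_len=4):
    """Word segmentation unit (assumed algorithm: forward maximum matching)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is always accepted.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def remove_useless_words(tokens, useless_words):
    """Useless word removal unit: drop listed words and pure whitespace tokens."""
    return [t for t in tokens if t not in useless_words and not t.isspace()]


# Illustrative data only; none of these values come from the patent.
base_dictionary = {"深度", "学习", "文本", "分类"}
dictionary = add_new_words(base_dictionary, {"深度学习"})
tokens = segment("深度学习的文本分类", dictionary)
features = remove_useless_words(tokens, useless_words={"的"})
```

With the new word in the dictionary, "深度学习" is kept as a single feature word instead of being split into "深度" and "学习", which is the effect the new word adding unit is meant to have.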
The text classification method based on deep learning comprises the following steps:
S1: acquiring, inputting and normalizing a text source;
S2: preprocessing the text;
S3: reducing the feature dimension and collecting the feature items in the text feature vector;
S4: performing feature weighting on the texts in the training set and the test set;
S5: classifying the texts in the training set and the test set.
As a still further scheme of the invention: in S1, the source text is acquired by the text source acquisition module, the initial source text is transmitted to the normalization processing module by the text source input module, and the normalization processing module processes the received initial source text into a form that conforms to the text processing format of the model.
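A minimal sketch of what the normalization in S1 might look like is given below. The patent does not say which rules the normalization processing module applies, so the choices here (UTF-8 decoding, Unicode NFKC normalization, whitespace collapsing and lowercasing) are assumptions, and `normalize_text` is a hypothetical name.

```python
import re
import unicodedata


def normalize_text(raw):
    """Bring one source text into a uniform form (assumed rules, not the patent's)."""
    if isinstance(raw, bytes):
        # Assumption: source files are UTF-8; undecodable bytes are replaced.
        raw = raw.decode("utf-8", errors="replace")
    # NFKC folds full-width letters and ideographic spaces into their plain forms.
    text = unicodedata.normalize("NFKC", raw)
    # Collapse every run of whitespace (including line breaks) into one space.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


# Illustrative input only.
normalized = normalize_text("  Ｄｅｅｐ\u3000Learning\r\n\tText ")
```

Running every source through one function of this kind means that later steps such as word segmentation and indexing see the same character forms regardless of where the text came from.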
As a still further scheme of the invention: in S2, the text is preprocessed, and full-text indexes are established separately for the training text set and the test text set, yielding a training set index and a test set index. While the indexes are being established, the word segmentation unit segments the texts of the training text set and the test text set, yielding an original feature word set for the training text set and an original feature word set for the test text set, which are transmitted to the training text module and the test text module respectively.
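The full-text index of S2 can be pictured as an inverted index from each word to the documents that contain it. The sketch below assumes the texts have already been segmented into token lists; the data structure, the function names `build_index` and `document_frequency`, and the sample documents are illustrative assumptions, since the patent does not describe the index format.

```python
from collections import defaultdict


def build_index(tokenized_docs):
    """Map each word to {document id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(tokenized_docs):
        for token in tokens:
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return dict(index)


def document_frequency(index, word):
    """Index query: in how many documents does the word occur?"""
    return len(index.get(word, {}))


# Illustrative, already segmented texts; not taken from the patent.
train_tokens = [
    ["stock", "market", "rises"],
    ["team", "wins", "match"],
    ["market", "falls", "market"],
]
test_tokens = [["market", "match"]]

train_index = build_index(train_tokens)  # training set index
test_index = build_index(test_tokens)    # test set index
# The keys of each index form that set's original feature word set.
train_feature_words = sorted(train_index)
```

Keeping the two indexes separate mirrors the text above, and it also keeps test-set statistics out of anything computed from the training set.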
As a still further scheme of the invention: in S3, the feature extraction dimension reduction unit performs feature dimension reduction by using a feature extraction method. The statistics required by the feature extraction method are gathered with an index query function and substituted into the calculation; the features are then sorted in descending order of feature evaluation value, and the features that best discriminate between categories are selected to form the feature item set of the text feature vector.
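The ranking in S3 can be illustrated with a concrete evaluation function. The patent does not say which feature extraction method is used, so the chi-square statistic below is an assumption chosen because it needs only document counts that an index query can supply; `chi_square`, `select_features` and the sample data are hypothetical.

```python
def chi_square(n_docs, a, b, c, d):
    """Chi-square score of a word for one category.

    a: documents in the category that contain the word
    b: documents outside the category that contain the word
    c: documents in the category that lack the word
    d: documents outside the category that lack the word
    """
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n_docs * (a * d - c * b) ** 2 / denom


def select_features(index, labels, k):
    """Score each indexed word, sort in descending order and keep the top k."""
    n_docs = len(labels)
    scores = {}
    for word, postings in index.items():
        best = 0.0
        for category in sorted(set(labels)):
            a = sum(1 for doc_id in postings if labels[doc_id] == category)
            b = len(postings) - a
            c = sum(1 for label in labels if label == category) - a
            d = n_docs - a - b - c
            best = max(best, chi_square(n_docs, a, b, c, d))
        scores[word] = best
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:k], scores


# Illustrative index (word -> {doc id: count}) and one label per document.
index = {
    "market": {0: 1, 2: 1},
    "match": {1: 1, 3: 1},
    "today": {0: 1, 1: 1},
}
labels = ["finance", "sports", "finance", "sports"]
selected, scores = select_features(index, labels, k=2)
```

In this toy data "today" appears equally often in both categories and scores zero, so it is dropped, while "market" and "match" each occur in only one category and are kept.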
As a still further scheme of the invention: in S4, feature weighting is performed on the texts of the training set and the test set held in the training text module and the test text module, and each text is represented by a spatial feature vector. The feature items of the training text feature vector and the test text feature vector are the feature item set produced by the text feature extraction processing. For each text, a user-defined index query function is used to obtain the distribution statistics of the feature items in that text and in the corresponding categories.
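One way to turn the selected feature items into weighted vectors in S4 is sketched below. The patent does not give a weighting formula, so TF-IDF with smoothing and length normalization is an assumption, and the names `idf_table` and `to_vector` as well as the sample data are hypothetical.

```python
import math


def idf_table(train_docs, features):
    """Smoothed inverse document frequency, computed from the training set only."""
    n_docs = len(train_docs)
    table = {}
    for word in features:
        df = sum(1 for tokens in train_docs if word in tokens)
        table[word] = math.log((1 + n_docs) / (1 + df)) + 1.0
    return table


def to_vector(tokens, features, idf):
    """Represent one text as a length-normalized TF-IDF vector over the features."""
    vector = [tokens.count(word) * idf[word] for word in features]
    norm = math.sqrt(sum(weight * weight for weight in vector))
    return [weight / norm for weight in vector] if norm else vector


# Illustrative data; the feature list stands in for the output of S3.
features = ["market", "match"]
train_tokens = [["market", "rises", "market"], ["match", "today"]]
test_tokens = [["weather"], ["market", "match"]]

idf = idf_table(train_tokens, features)
train_vectors = [to_vector(tokens, features, idf) for tokens in train_tokens]
test_vectors = [to_vector(tokens, features, idf) for tokens in test_tokens]
```

The test texts reuse the weights learned from the training set, so both sets end up in the same vector space and a text containing none of the selected features maps to the zero vector.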
As a still further scheme of the invention: in S5, the classifier runs a corresponding classification algorithm, takes the spatial feature vectors representing all texts in the training set and the test set as input data, and classifies the texts in the training set and the test set; after classification is completed, the classification judgment unit evaluates the classification performance.
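The division of labour in S5 between the classifier and the classification judgment unit can be sketched as follows. The patent does not name the classification algorithm, so a nearest-centroid classifier is used here purely as a simple stand-in (a neural network with hidden layers would occupy the same place); the evaluation measures (accuracy, precision, recall, F1) are likewise an assumption about what "classification performance" covers, and all names and data are hypothetical.

```python
def train_centroids(vectors, labels):
    """Stand-in classifier training: average the training vectors of each category."""
    sums, counts = {}, {}
    for vector, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, weight in enumerate(vector):
            acc[i] += weight
        counts[label] = counts.get(label, 0) + 1
    return {label: [w / counts[label] for w in acc] for label, acc in sums.items()}


def classify(vector, centroids):
    """Assign the category whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(sorted(centroids), key=lambda label: sq_dist(vector, centroids[label]))


def evaluate(predicted, actual):
    """Judgment unit sketch: overall accuracy plus per-category precision, recall, F1."""
    pairs = list(zip(predicted, actual))
    accuracy = sum(p == a for p, a in pairs) / len(pairs)
    report = {}
    for category in sorted(set(actual) | set(predicted)):
        tp = sum(p == category and a == category for p, a in pairs)
        fp = sum(p == category and a != category for p, a in pairs)
        fn = sum(p != category and a == category for p, a in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[category] = (precision, recall, f1)
    return accuracy, report


# Illustrative feature vectors and labels; not taken from the patent.
train_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["finance", "finance", "sports", "sports"]
test_vectors = [[0.8, 0.2], [0.2, 0.8], [0.6, 0.4]]
test_labels = ["finance", "sports", "sports"]

centroids = train_centroids(train_vectors, train_labels)
predicted = [classify(vector, centroids) for vector in test_vectors]
accuracy, report = evaluate(predicted, test_labels)
```

Separating `classify` from `evaluate` follows the split in the text: the classifier only produces labels, and the judgment step compares them with the known labels of the test set.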
Compared with the prior art, the invention has the beneficial effects that:
1. The text classification system and method based on deep learning reduce the dimensionality of high-dimensional text data. Segmenting the text and removing useless words allows more accurate features to be extracted, which improves classification accuracy. The new word adding unit adds new words, which accommodates the iterative replacement of new words and improves the effect and efficiency of model training. Feature weighting of the texts in the training set and the test set yields the distribution statistics of the feature items in each text and in the corresponding categories, which further improves the accuracy of text classification and recognition. The system and method therefore have good application prospects.
Drawings
Fig. 1 is a block diagram of a text classification system and method based on deep learning.
Fig. 2 is a flow block diagram of a text classification method based on deep learning.
Fig. 3 is a block diagram of the text preprocessing module in a text classification system based on deep learning.
Fig. 4 is a block diagram of the classification module in a text classification system and method based on deep learning.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to Figs. 1 to 4, in an embodiment of the present invention, a text classification system based on deep learning includes a text source acquisition module, a text source input module, a normalization processing module, a text preprocessing module, a training text module, a test text module, a feature extraction dimension reduction unit, a feature weight calculation unit, a training text feature weight calculation, a training text feature vector, a test text feature weight calculation, a test text feature vector, and a classification module. The text preprocessing module includes a new word adding unit, a word segmentation unit, a useless word removal unit, and a full-text index establishing unit, and the classification module includes a classification judgment unit and a classifier. The text source acquisition module acquires the initial text source, and the text source input module inputs the initial source text. The normalization processing module integrates the source files into texts that meet the specification. In the text preprocessing module, the new word adding unit adds new feature words; the word segmentation unit performs word segmentation and assigns the segmented feature words of the text set to the training text module and the test text module; the useless word removal unit removes useless and unused words; and the full-text index establishing unit establishes full-text indexes for the training text set and the test text set. The training text module obtains the training text set, and the test text module obtains the test text set. The feature extraction dimension reduction unit performs feature dimension reduction by using a feature extraction method, and the feature weight calculation unit performs feature weighting on the texts in the training set and the test set. The classification module classifies the texts in the test set; the classifier classifies the texts in the training set and the test set by running a corresponding classification algorithm, and the classification judgment unit evaluates the classification result and the classification performance.
The text classification method based on deep learning comprises the following steps:
S1: acquiring, inputting and normalizing a text source;
S2: preprocessing the text;
S3: reducing the feature dimension and collecting the feature items in the text feature vector;
S4: performing feature weighting on the texts in the training set and the test set;
S5: classifying the texts in the training set and the test set.
In S1, the source text is acquired by the text source acquisition module, the initial source text is transmitted to the normalization processing module by the text source input module, and the normalization processing module processes the received initial source text into a form that conforms to the text processing format of the model.
In S2, the text is preprocessed, and full-text indexes are established separately for the training text set and the test text set, yielding a training set index and a test set index. While the indexes are being established, the word segmentation unit segments the texts of the training text set and the test text set, yielding an original feature word set for the training text set and an original feature word set for the test text set, which are transmitted to the training text module and the test text module respectively.
In S3, the feature extraction dimension reduction unit performs feature dimension reduction by using a feature extraction method. The statistics required by the feature extraction method are gathered with an index query function and substituted into the calculation; the features are then sorted in descending order of feature evaluation value, and the features that best discriminate between categories are selected to form the feature item set of the text feature vector.
In S4, feature weighting is performed on the texts of the training set and the test set held in the training text module and the test text module, and each text is represented by a spatial feature vector. The feature items of the training text feature vector and the test text feature vector are the feature item set produced by the text feature extraction processing. For each text, a user-defined index query function is used to obtain the distribution statistics of the feature items in that text and in the corresponding categories.
In S5, the classifier runs a corresponding classification algorithm, takes the spatial feature vectors representing all texts in the training set and the test set as input data, and classifies the texts in the training set and the test set; after classification is completed, the classification judgment unit evaluates the classification performance.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes in the embodiments and/or modifications of the invention can be made, and equivalents and modifications of some features of the invention can be made without departing from the spirit and scope of the invention.

Claims (8)

1. A text classification system based on deep learning, comprising a text source acquisition module, a text source input module, a normalization processing module, a text preprocessing module, a training text module, a test text module, a feature extraction dimension reduction unit, a feature weight calculation unit, a training text feature weight calculation, a training text feature vector, a test text feature weight calculation, a test text feature vector and a classification module, characterized in that: the text preprocessing module comprises a new word adding unit, a word segmentation unit, a useless word removal unit and a full-text index establishing unit, and the classification module comprises a classification judgment unit and a classifier.
2. The deep learning based text classification system according to claim 1, characterized in that: the text source acquisition module is used for acquiring the initial text source; the text source input module is used for inputting the initial source text; the normalization processing module is used for integrating the source files into texts that meet the specification; in the text preprocessing module, the new word adding unit is used for adding new feature words, the word segmentation unit is used for performing word segmentation and assigning the segmented feature words of the text set to the training text module and the test text module, the useless word removal unit is used for removing useless and unused words, and the full-text index establishing unit is used for establishing full-text indexes for the training text set and the test text set; the training text module is used for obtaining the training text set; the test text module is used for obtaining the test text set; the feature extraction dimension reduction unit is used for performing feature dimension reduction by using a feature extraction method; the feature weight calculation unit is used for performing feature weighting on the texts in the training set and the test set; the classification module is used for classifying the texts in the test set; the classifier is used for classifying the texts in the training set and the test set by running a corresponding classification algorithm; and the classification judgment unit is used for evaluating the classification result and the classification performance.
3. A text classification method based on deep learning, characterized in that the method comprises the following steps:
S1: acquiring, inputting and normalizing a text source;
S2: preprocessing the text;
S3: reducing the feature dimension and collecting the feature items in the text feature vector;
S4: performing feature weighting on the texts in the training set and the test set;
S5: classifying the texts in the training set and the test set.
4. The deep learning based text classification method according to claim 3, characterized in that: in S1, the source text is acquired by the text source acquisition module, the initial source text is transmitted to the normalization processing module by the text source input module, and the normalization processing module processes the received initial source text into a form that conforms to the text processing format of the model.
5. The deep learning based text classification method according to claim 3, characterized in that: in S2, the text is preprocessed, and full-text indexes are established separately for the training text set and the test text set, yielding a training set index and a test set index; while the indexes are being established, the word segmentation unit segments the texts of the training text set and the test text set, yielding an original feature word set for the training text set and an original feature word set for the test text set, which are transmitted to the training text module and the test text module respectively.
6. The deep learning based text classification method according to claim 3, characterized in that: in S3, the feature extraction dimension reduction unit performs feature dimension reduction by using a feature extraction method; the statistics required by the feature extraction method are gathered with an index query function and substituted into the calculation; the features are then sorted in descending order of feature evaluation value, and the features that best discriminate between categories are selected to form the feature item set of the text feature vector.
7. The deep learning based text classification method according to claim 3, characterized in that: in S4, feature weighting is performed on the texts of the training set and the test set held in the training text module and the test text module, and each text is represented by a spatial feature vector; the feature items of the training text feature vector and the test text feature vector are the feature item set produced by the text feature extraction processing; and for each text, a user-defined index query function is used to obtain the distribution statistics of the feature items in that text and in the corresponding categories.
8. The deep learning based text classification method according to claim 3, characterized in that: in S5, the classifier runs a corresponding classification algorithm, takes the spatial feature vectors representing all texts in the training set and the test set as input data, and classifies the texts in the training set and the test set; after classification is completed, the classification judgment unit evaluates the classification performance.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110971103.2A CN113641824A (en) | 2021-08-23 | 2021-08-23 | Text classification system and method based on deep learning

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110971103.2A CN113641824A (en) | 2021-08-23 | 2021-08-23 | Text classification system and method based on deep learning

Publications (1)

Publication Number | Publication Date
CN113641824A (en) | 2021-11-12

Family

ID=78423466

Family Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110971103.2A Pending CN113641824A (en) | 2021-08-23 | 2021-08-23 | Text classification system and method based on deep learning

Country Status (1)

Country | Link
CN (1) | CN113641824A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117851602A (en) * | 2024-03-07 | 2024-04-09 | 武汉百智诚远科技有限公司 | Automatic legal document classification method and system based on deep learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117851602A (en) * | 2024-03-07 | 2024-04-09 | 武汉百智诚远科技有限公司 | Automatic legal document classification method and system based on deep learning
CN117851602B (en) * | 2024-03-07 | 2024-05-14 | 武汉百智诚远科技有限公司 | Automatic legal document classification method and system based on deep learning

Similar Documents

Publication Publication Date Title
CN108363810B (en) Text classification method and device
CN105469096B (en) A kind of characteristic bag image search method based on Hash binary-coding
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN108509629B (en) Text emotion analysis method based on emotion dictionary and support vector machine
CN108763213A (en) Theme feature text key word extracting method
CN109886020A (en) Software vulnerability automatic classification method based on deep neural network
CN110826618A (en) Personal credit risk assessment method based on random forest
CN110046250A (en) Three embedded convolutional neural networks model and its more classification methods of text
CN103699523A (en) Product classification method and device
CN111556016B (en) Network flow abnormal behavior identification method based on automatic encoder
CN108804595B (en) Short text representation method based on word2vec
CN108846047A (en) A kind of picture retrieval method and system based on convolution feature
CN113420294A (en) Malicious code detection method based on multi-scale convolutional neural network
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN112579783B (en) Short text clustering method based on Laplace atlas
CN112036511B (en) Image retrieval method based on attention mechanism graph convolution neural network
CN112417893A (en) Software function demand classification method and system based on semantic hierarchical clustering
CN113641824A (en) Text classification system and method based on deep learning
CN113626604A (en) Webpage text classification system based on maximum interval criterion
CN112200260B (en) Figure attribute identification method based on discarding loss function
CN116935138A (en) Picture subject content diversity calculation and automatic selection method and system
CN116881451A (en) Text classification method based on machine learning
CN114298020B (en) Keyword vectorization method based on topic semantic information and application thereof
Ojo et al. Improved model for facial expression classification for fear and sadness using local binary pattern histogram
CN114925198A (en) Knowledge-driven text classification method fusing character information

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20211112