CN112328790A - Fast text classification method of corpus - Google Patents
Fast text classification method of corpus
- Publication number
- CN112328790A (application number CN202011235587.6A)
- Authority
- CN
- China
- Prior art keywords
- corpus
- words
- text
- feature
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Abstract
The invention discloses a fast text classification method for a corpus, comprising the following steps: selecting an existing corpus to be used; extracting information data from the corpus and preprocessing it; inputting the preprocessing result into a vector space model; processing the feature words; selecting a classifier for the feature words; evaluating the effect of the classifier; and classifying the corpus with the classifier. The method can classify the corpora in a corpus quickly and accurately, so the efficiency and accuracy of corpus classification are improved and researchers and scholars can conveniently conduct in-depth analysis and research on the corpora.
Description
Technical Field
The invention relates to the technical field of corpus text classification, in particular to a method for quickly classifying texts in a corpus.
Background
A corpus is a large-scale electronic text library that has been scientifically sampled and processed. Its function is to let researchers carry out linguistic theory and application research with the help of computer analysis tools. Because a corpus is a basic resource carrying linguistic knowledge, storing language materials that actually appeared in real use, it is one of the main data supports for researchers and scholars conducting linguistic research and one of the important theoretical sources of linguistic research methods. It is mainly applied to dictionary compilation, language teaching, traditional language research, and statistics- or example-based research in natural language processing. With the continuous development of the times and the continuous improvement of computer technology, text classification has come to mean automatic classification based on a classification system, the basis of classification being one or more text features. Because texts are similar to one another, text classification cannot achieve a perfect result; only the optimal classification result can be selected according to the classification features and the completeness of the evaluation standard. Patent CN103823824A discloses a method and system for automatically constructing a text classification corpus by means of the Internet, which classifies according to parts of speech. Its classification basis is too simple and single, it cannot accurately and effectively classify corpora of near-synonyms, and it is inconvenient for researchers and scholars with definite purposes. It is therefore particularly important to invent a method for quickly classifying the texts of a corpus.
the existing fast text classification method of the corpus is classified according to parts of speech, and the corpora in the corpus cannot be classified fast and accurately, so that the efficiency and the accuracy of the corpus classification cannot be improved, and researchers and scholars cannot conveniently conduct deep analysis and research on the corpora.
Disclosure of Invention
The invention aims to provide a method for quickly classifying texts in a corpus, so as to solve the problems raised in the background art: existing methods classify according to parts of speech and cannot classify the corpora in a corpus quickly and accurately, so that the efficiency and accuracy of corpus classification cannot be improved and in-depth analysis and research by researchers and scholars is inconvenient.
In order to achieve the purpose, the invention provides the following technical scheme: a method for fast text classification of a corpus, said classification method comprising the steps of:
(1) selecting an existing corpus to be used;
(2) extracting information data in the corpus and preprocessing the information data;
(3) inputting the preprocessing result into a vector space model;
(4) processing the feature words;
(5) selecting a classifier for the feature words;
(6) evaluating the effect of the classifier;
(7) classifying the corpus by using the classifier.
Preferably, the existing corpus in step (1) specifically refers to a Chinese corpus, and the corpus type is monolingual.
Preferably, the information data in step (2) refers to Chinese text corpora with close similarity, and the preprocessing specifically comprises: performing word segmentation and stop-word removal on the information data extracted from the corpus, thereby obtaining a training sample set;
the training sample set is the initial feature item set (the feature set for short) formed by the set of obtained keywords. The word segmentation is performed as follows: a corpus text in the Chinese text corpus set is divided into a number of words using a statistics-based word segmentation algorithm, which takes the frequency of adjacent co-occurrence of characters as the credibility criterion for word formation and counts the combination frequency of each pair of adjacent co-occurring characters in the corpus. The stop words comprise two types: words that are widely used and appear frequently in all corpora, and certain function words including modal particles, adverbs, prepositions, conjunctions, and interjections. The stop words are replaced with symbols and removed from the word segmentation result to obtain effective word combinations, the symbols including "(", ")", "-", "/", and "&".
Preferably, the vector space model in step (3) means that both the corpus text and the query contain independent attributes that are expressed by feature items and reveal their contents. Each attribute can be regarded as one dimension of the vector space, so the corpus text and the query can each be represented as a set of attributes; complex relationships among paragraphs, sentences, and words in the corpus text are ignored, and the similarity between text and query is measured by the distance between their vectors.
The similarity calculation adopts the cosine coefficient method: the similarity between the corpus text and the query is expressed by the cosine of the angle between their vectors, and the smaller the angle, the greater the similarity.
Preferably, the feature processing in step (4) means that tens of thousands of feature words are obtained after preprocessing. Feature words with a low occurrence frequency in the corpus are called weak-frequency related words, and those with a high occurrence frequency are called strong-frequency related words; the feature set is formed by removing the weak-frequency related words and extracting the strong-frequency related words;
the feature processing comprises a feature extraction method and feature word weight determination. The feature extraction method adopts frequency statistics, which comprise word frequency and document frequency. Feature word weight determination means extracting words that can represent text features from the extracted information data to form feature items and assigning corresponding weights to them; the feature word weighting algorithm is the Boolean weight method.
Preferably, the classifier in step (5), also called a classification model, maps corpus texts of unknown class to a specified class space; the classifier adopts a Bayesian algorithm based on Bayes' theorem.
Preferably, the effect evaluation in step (6) covers three aspects, namely effectiveness, computational complexity, and simplicity of description; effectiveness comprises three indexes: recall, precision, and F-measure.
The technical scheme of the invention has the following beneficial technical effects: extracting the information data in the corpus and preprocessing it helps format the text corpora in the existing corpus into a unified format, which is convenient for subsequent unified processing; inputting the preprocessing result into a vector space model helps decompose the text into basic processing units and thereby obtain feature words, from which feature items are finally generated; processing the feature words helps reflect the features of the information data, which is convenient for determining the weights of the feature items; selecting a classifier for the feature words, that is, selecting a suitable classification algorithm, helps improve the classification speed and accuracy of the corpus; and evaluating the effect of the classifier helps understand and judge its classification capability.
Drawings
Fig. 1 is a schematic structural diagram of a fast text classification method for a corpus according to the present invention.
Detailed Description
In order to make the technical means, creative features, objectives, and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.
As shown in fig. 1, a method for fast text classification of a corpus includes the following steps:
(1) selecting an existing corpus to be used;
(2) extracting information data in the corpus and preprocessing the information data;
(3) inputting the preprocessing result into a vector space model;
(4) processing the feature words;
(5) selecting a classifier for the feature words;
(6) evaluating the effect of the classifier;
(7) classifying the corpus by using the classifier.
The existing corpus in step (1) specifically refers to a Chinese corpus, and the corpus type is monolingual.
The information data in step (2) refers to Chinese text corpora with close similarity, and the preprocessing specifically comprises: performing word segmentation and stop-word removal on the information data extracted from the corpus, thereby obtaining a training sample set;
the training sample set is the initial feature item set (the feature set for short) formed by the set of obtained keywords. The word segmentation is performed as follows: a corpus text in the Chinese text corpus set is divided into a number of words using a statistics-based word segmentation algorithm, which takes the frequency of adjacent co-occurrence of characters as the credibility criterion for word formation and counts the combination frequency of each pair of adjacent co-occurring characters in the corpus. The stop words comprise two types: words that are widely used and appear frequently in all corpora, and certain function words including modal particles, adverbs, prepositions, conjunctions, and interjections. The stop words are replaced with symbols and removed from the word segmentation result to obtain effective word combinations, the symbols including "(", ")", "-", "/", and "&".
The vector space model in step (3) means that both the corpus text and the query contain independent attributes that are expressed by feature items and reveal their contents. Each attribute can be regarded as one dimension of the vector space, so the corpus text and the query can each be represented as a set of attributes; complex relationships among paragraphs, sentences, and words in the corpus text are ignored, and the similarity between text and query is measured by the distance between their vectors, which helps decompose the text into basic processing units and thereby obtain the feature words.
The similarity calculation adopts the cosine coefficient method: the similarity between the corpus text and the query is expressed by the cosine of the angle between their vectors, and the smaller the angle, the greater the similarity.
The feature processing in step (4) means that tens of thousands of feature words can be obtained after preprocessing. Feature words with a low occurrence frequency in the corpus are called weak-frequency related words, and those with a high occurrence frequency are called strong-frequency related words; the feature set is formed by removing the weak-frequency related words and extracting the strong-frequency related words;
the feature processing comprises a feature extraction method and feature word weight determination. The feature extraction method adopts frequency statistics, which comprise word frequency and document frequency. Feature word weight determination means extracting words that can represent text features from the extracted information data to form feature items and assigning corresponding weights to them; the feature word weighting algorithm is the Boolean weight method, which helps reflect the features of the information data.
The classifier in step (5), also called a classification model, maps corpus texts of unknown class to a specified class space; the classifier adopts a Bayesian algorithm based on Bayes' theorem.
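A Bayesian classifier of the kind described can be sketched as a multinomial naive Bayes over feature words. Laplace (add-one) smoothing is an assumption added here so that unseen words do not zero out a class score; the patent does not specify a smoothing scheme.

```python
import math
from collections import Counter

class NaiveBayesClassifier:
    """Multinomial naive Bayes over feature words, with Laplace smoothing."""

    def fit(self, documents, labels):
        self.classes = set(labels)
        self.class_counts = Counter(labels)                    # docs per class
        self.word_counts = {c: Counter() for c in self.classes}
        self.totals = Counter()                                # words per class
        self.vocab = set()
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc)
            self.totals[label] += len(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        """Map a word list of unknown class to the highest-scoring class."""
        best, best_score = None, -math.inf
        v, n = len(self.vocab), sum(self.class_counts.values())
        for c in self.classes:
            score = math.log(self.class_counts[c] / n)  # log prior P(c)
            for w in doc:
                # Laplace-smoothed log likelihood of P(w | c)
                score += math.log((self.word_counts[c][w] + 1) / (self.totals[c] + v))
            if score > best_score:
                best, best_score = c, score
        return best
```

Trained on a handful of labeled word lists, the classifier assigns a new word list to the class with the highest smoothed log-probability.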
The effect evaluation in step (6) covers three aspects, namely effectiveness, computational complexity, and simplicity of description; effectiveness comprises three indexes: recall, precision, and F-measure.
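The three effectiveness indexes can be computed as below. This sketch evaluates one class at a time against gold labels; it is an illustration of the standard definitions, not the patent's own evaluation procedure.

```python
def evaluate(predicted, actual, positive_class):
    """Return (precision, recall, F-measure) for one class.

    precision = TP / (TP + FP): how many predicted positives were correct.
    recall    = TP / (TP + FN): how many true positives were found.
    F-measure = harmonic mean of precision and recall.
    """
    tp = sum(1 for p, a in zip(predicted, actual)
             if p == positive_class and a == positive_class)
    fp = sum(1 for p, a in zip(predicted, actual)
             if p == positive_class and a != positive_class)
    fn = sum(1 for p, a in zip(predicted, actual)
             if p != positive_class and a == positive_class)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```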
In the invention, similar texts are identified and classified on the basis of a corpus. The type of corpus is determined by selecting an existing corpus to be used, which in turn determines the choice of classification method and the prediction of classification cost. Part of the text corpora in the existing corpus is then extracted; the extracted text corpora are called information data. The information data is then preprocessed, the preprocessing comprising word segmentation and stop-word removal. Each word of an English text is separated by a space, so word segmentation is very simple, whereas a Chinese text is only delimited by punctuation and paragraphs, which is troublesome and ambiguous; word segmentation is therefore needed before feature word processing can be performed on the corpus. Chinese also contains many function words, including modal particles, adverbs, prepositions, conjunctions, and interjections; in stop-word processing these words are replaced with symbols such as "(" and ")" and removed. The preprocessed result is input into a vector space model, which facilitates the generation of the feature set: feature word processing yields weak-frequency related words and strong-frequency related words, the weak-frequency related words are removed, and the strong-frequency related words are extracted to form the feature set. A classifier is then selected for the feature words via the feature set, and the assignment of corpus texts to classes is completed according to the algorithm of the classifier. The classifier adopts the Bayesian method, which helps classify the corpus texts accurately and rapidly, improves the efficiency and accuracy of corpus classification, and makes it convenient for researchers and scholars to carry out in-depth analysis and research on the corpora. The classifier is then evaluated on three aspects, namely effectiveness, computational complexity, and simplicity of description, which helps test and understand its processing speed and accuracy. Finally, corpus classification is performed with the classifier that passed the test.
The foregoing shows and describes the basic principles, main features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the embodiments and description only illustrate the principle of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.
Claims (7)
1. A method for fast text classification of a corpus, the method comprising the steps of:
(1) selecting an existing corpus to be used;
(2) extracting information data in the corpus and preprocessing the information data;
(3) inputting the preprocessing result into a vector space model;
(4) processing the feature words;
(5) selecting a classifier for the feature words;
(6) evaluating the effect of the classifier;
(7) classifying the corpus by using the classifier.
2. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the existing corpus in step (1) specifically refers to a Chinese corpus, and the corpus type is monolingual.
3. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the information data in step (2) refers to Chinese text corpora with close similarity, and the preprocessing specifically comprises: performing word segmentation and stop-word removal on the information data extracted from the corpus, thereby obtaining a training sample set;
the training sample set is the initial feature item set (the feature set for short) formed by the set of obtained keywords. The word segmentation is performed as follows: a corpus text in the Chinese text corpus set is divided into a number of words using a statistics-based word segmentation algorithm, which takes the frequency of adjacent co-occurrence of characters as the credibility criterion for word formation and counts the combination frequency of each pair of adjacent co-occurring characters in the corpus. The stop words comprise two types: words that are widely used and appear frequently in all corpora, and certain function words including modal particles, adverbs, prepositions, conjunctions, and interjections. The stop words are replaced with symbols and removed from the word segmentation result to obtain effective word combinations, the symbols including "(", ")", "-", "/", and "&".
4. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the vector space model in step (3) means that both the corpus text and the query contain independent attributes that are expressed by feature items and reveal their contents. Each attribute can be regarded as one dimension of the vector space, so the corpus text and the query can each be represented as a set of attributes; complex relationships among paragraphs, sentences, and words in the corpus text are ignored, and the similarity between text and query is measured by the distance between their vectors;
the similarity calculation adopts the cosine coefficient method: the similarity between the corpus text and the query is expressed by the cosine of the angle between their vectors, and the smaller the angle, the greater the similarity.
5. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the feature processing in step (4) means that tens of thousands of feature words are obtained after preprocessing. Feature words with a low occurrence frequency in the corpus are called weak-frequency related words, and those with a high occurrence frequency are called strong-frequency related words; the feature set is formed by removing the weak-frequency related words and extracting the strong-frequency related words;
the feature processing comprises a feature extraction method and feature word weight determination. The feature extraction method adopts frequency statistics, which comprise word frequency and document frequency. Feature word weight determination means extracting words that can represent text features from the extracted information data to form feature items and assigning corresponding weights to them; the feature word weighting algorithm is the Boolean weight method.
6. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the classifier in step (5), also called a classification model, maps corpus texts of unknown class to a specified class space; the classifier adopts a Bayesian algorithm based on Bayes' theorem.
7. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the effect evaluation in step (6) covers three aspects, namely effectiveness, computational complexity, and simplicity of description; effectiveness comprises three indexes: recall, precision, and F-measure.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011235587.6A CN112328790A (en) | 2020-11-06 | 2020-11-06 | Fast text classification method of corpus |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112328790A true CN112328790A (en) | 2021-02-05 |
Family
ID=74316428
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011235587.6A Pending CN112328790A (en) | 2020-11-06 | 2020-11-06 | Fast text classification method of corpus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112328790A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103116637A (en) * | 2013-02-08 | 2013-05-22 | 无锡南理工科技发展有限公司 | Text sentiment classification method facing Chinese Web comments |
CN104503960A (en) * | 2015-01-07 | 2015-04-08 | 渤海大学 | Text data processing method for English translation |
CN109002473A (en) * | 2018-06-13 | 2018-12-14 | 天津大学 | A kind of sentiment analysis method based on term vector and part of speech |
CN109960799A (en) * | 2019-03-12 | 2019-07-02 | 中南大学 | A kind of Optimum Classification method towards short text |
CN110069630A (en) * | 2019-03-20 | 2019-07-30 | 重庆信科设计有限公司 | A kind of improved mutual information feature selection approach |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108255813B (en) | Text matching method based on word frequency-inverse document and CRF | |
CN105701084A (en) | Characteristic extraction method of text classification on the basis of mutual information | |
CN102214189B (en) | Data mining-based word usage knowledge acquisition system and method | |
Lydia et al. | Correlative study and analysis for hidden patterns in text analytics unstructured data using supervised and unsupervised learning techniques | |
El-Shishtawy et al. | An accurate arabic root-based lemmatizer for information retrieval purposes | |
CN110362678A (en) | A kind of method and apparatus automatically extracting Chinese text keyword | |
Mustofa et al. | Sentiment analysis using lexicon-based method with naive bayes classifier algorithm on# newnormal hashtag in twitter | |
AL-Jibory | Hybrid system for plagiarism detection on a scientific paper | |
Puri et al. | An efficient hindi text classification model using svm | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
AL-Jumaili | A hybrid method of linguistic and statistical features for Arabic sentiment analysis | |
KR20110017129A (en) | Apparatus and method for words sense disambiguation using korean wordnet and its program stored recording medium | |
Dhar et al. | Bengali news headline categorization using optimized machine learning pipeline | |
Ma | Research on keyword extraction algorithm in english text based on cluster analysis | |
Mutuvi et al. | A dataset for multi-lingual epidemiological event extraction | |
Thielmann et al. | Coherence based document clustering | |
Mekala et al. | A survey on authorship attribution approaches | |
Saeed et al. | An abstractive summarization technique with variable length keywords as per document diversity | |
CN112328790A (en) | Fast text classification method of corpus | |
Tao et al. | The Text modeling method of Tibetan text combining Word2vec and improved TF-IDF | |
CN113963748A (en) | Protein knowledge map vectorization method | |
CN112800243A (en) | Project budget analysis method and system based on knowledge graph | |
CN113516202A (en) | Webpage accurate classification method for CBL feature extraction and denoising | |
BAZRFKAN et al. | Using machine learning methods to summarize persian texts | |
Osochkin et al. | Automatic Identification of Authors' Stylistics and Gender on the Basis of the Corpus of Russian Fiction Using Extended Set-theoretic Model with Collocation Extraction. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210205 |