CN112328790A - Fast text classification method of corpus - Google Patents

Publication number
CN112328790A
Authority
CN
China
Prior art keywords
corpus, words, text, feature, word
Legal status
Pending
Application number
CN202011235587.6A
Other languages
Chinese (zh)
Inventor
王大鹏
Current Assignee
Bohai University
Original Assignee
Bohai University
Priority date
2020-11-06
Filing date
2020-11-06
Publication date
2021-02-05
Application filed by Bohai University
Priority to CN202011235587.6A
Publication of CN112328790A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a fast text classification method for a corpus, comprising the following steps: selecting an existing corpus to be used; extracting information data from the corpus and preprocessing it; inputting the preprocessing result into a vector space model; processing the feature words; selecting a classifier for the feature words; evaluating the effect of the classifier; and classifying the corpus with the classifier. The method can classify the corpora in a corpus quickly and accurately, which improves the efficiency and accuracy of corpus classification and makes it easier for researchers and scholars to analyze and study the corpora in depth.

Description

Fast text classification method of corpus
Technical Field
The invention relates to the technical field of corpus text classification, in particular to a method for quickly classifying texts in a corpus.
Background
A corpus is a large-scale electronic text library that has been scientifically sampled and processed, with which researchers can carry out linguistic theory and application research using computer analysis tools. Because a corpus is a basic resource that carries linguistic knowledge and stores language material that has actually appeared in real use, it is one of the main data supports for researchers and scholars in linguistic research and one of the important theoretical sources of linguistic research methods. Corpora are mainly applied in dictionary compilation, language teaching, traditional language research, and statistics-based or example-based research in natural language processing. With the continuing development of computer technology, text classification refers to automatic classification based on a classification system, where the basis for classification is one or more features of the text. Because texts resemble one another, text classification cannot achieve a perfect result; the best classification result can only be selected according to the classification features and the completeness of the evaluation criteria. Patent CN103823824A discloses a method and system for automatically constructing a text classification corpus by means of the Internet. It classifies according to parts of speech, so its classification basis is too simple and single, it cannot accurately and effectively classify corpora of near-synonyms, and it is inconvenient for researchers and scholars with a definite purpose. A method for fast text classification of a corpus is therefore particularly important.
Existing fast text classification methods for a corpus classify according to parts of speech and cannot classify the corpora in a corpus quickly and accurately, so the efficiency and accuracy of corpus classification cannot be improved, and researchers and scholars cannot conveniently analyze and study the corpora in depth.
Disclosure of Invention
The invention aims to provide a method for fast text classification of a corpus, in order to solve the problems raised in the Background: existing methods usually classify according to parts of speech and cannot classify the corpora in a corpus quickly and accurately, so the efficiency and accuracy of corpus classification cannot be improved, and researchers and scholars cannot conveniently analyze and study the corpora in depth.
In order to achieve the purpose, the invention provides the following technical scheme: a method for fast text classification of a corpus, said classification method comprising the steps of:
(1) selecting an existing corpus to be used;
(2) extracting information data from the corpus and preprocessing the information data;
(3) inputting the preprocessing result into a vector space model;
(4) processing the feature words;
(5) selecting a classifier for the feature words;
(6) evaluating the effect of the classifier;
(7) classifying the corpus by using the classifier.
Preferably, the existing corpus in step (1) is specifically a Chinese corpus, and the corpus type is monolingual.
Preferably, the information data in step (2) refers to a set of Chinese text corpora that closely resemble one another, and the preprocessing specifically comprises performing word segmentation and stop-word removal on the information data extracted from the corpus, so as to obtain a training sample set.
The training sample set is an initial set of feature items (a feature set for short) formed from the keywords obtained. Word segmentation divides each corpus text in the Chinese text corpus set into a number of words. The segmentation technique adopted is a statistics-based word segmentation algorithm, which takes the frequency with which characters co-occur adjacently as the measure of confidence that they form a word, and counts the frequency of each combination of adjacent co-occurring characters in the corpus. Stop words are of two types: words that are widely used and appear frequently in all corpora, and certain function words, including modal particles, adverbs, prepositions, conjunctions and interjections. The stop words are replaced with symbols and removed from the word segmentation result to obtain the effective word combinations; the symbols include "()", "-", "/" and "&".
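The statistics-based segmentation idea described above can be illustrated with a minimal Python sketch. It is not the patent's algorithm: the greedy two-character pairing, the `min_count` threshold, the function names and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter

def count_adjacent_pairs(texts):
    """Count how often each pair of adjacent characters co-occurs in the corpus."""
    counts = Counter()
    for text in texts:
        for a, b in zip(text, text[1:]):
            counts[a + b] += 1
    return counts

def segment(text, pair_counts, min_count=2):
    """Greedily join two adjacent characters into a word when their
    co-occurrence count reaches min_count (an assumed threshold);
    otherwise emit the single character on its own."""
    words = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair_counts[pair] >= min_count:
            words.append(pair)
            i += 2
        else:
            words.append(text[i])
            i += 1
    return words

# Toy corpus (hypothetical); real corpora are far larger.
corpus = ["我爱北京", "北京很大", "我爱上海", "上海很大"]
pair_counts = count_adjacent_pairs(corpus)
print(segment("北京很大", pair_counts))
```

A raw frequency threshold also joins frequent sequences that are not words, so a practical system would likely combine the counts with a stronger association measure.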
Preferably, the vector space model in step (3) means that the corpus text and the query each contain independent attributes that are expressed by feature items and reveal their content, and each attribute can be regarded as one dimension of the vector space. The corpus text and the query can therefore each be represented as a set of attributes, ignoring the complex relationships among paragraphs, sentences and words in the corpus text. The similarity between the text and the query is measured by the distance between their vectors.
The similarity is calculated by the cosine coefficient method: the similarity between the corpus text and the query is expressed by the cosine of the angle between their vectors, and the smaller the angle, the greater the similarity.
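The cosine coefficient can be sketched in a few lines of Python. The function name and the sample vectors are illustrative assumptions; the text does not specify how a zero vector is handled, so returning 0.0 in that case is also an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: an empty text is similar to nothing
    return dot / (norm_u * norm_v)

# Each position is one feature item; 1 means the item occurs in the text.
text_vector = [1, 1, 0, 1]
query_vector = [1, 0, 0, 1]
print(cosine_similarity(text_vector, query_vector))
```

Because the cosine depends only on the angle, a long text and a short query that use the same feature items in the same proportions score as highly similar.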
Preferably, the feature processing in step (4) means that tens of thousands of feature words are obtained after preprocessing. Feature words that occur with low frequency in the corpus are called weak-frequency related words, and those that occur with high frequency are called strong-frequency related words. The feature set is formed by removing the weak-frequency related words and extracting the strong-frequency related words.
The feature processing comprises a feature extraction method and feature word weight determination. The feature extraction method uses frequency statistics, which comprise word frequency and document frequency. Feature word weight determination means extracting, from the extracted information data, words that can represent the features of the text to form feature items, and assigning corresponding weights to those feature items. The feature word weighting algorithm is the Boolean weight method.
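As a rough illustration of frequency-based feature extraction followed by Boolean weighting, the sketch below keeps words whose document frequency reaches a threshold and then encodes a text as a 0/1 vector. The threshold `min_doc_freq`, the function names and the sample documents are assumptions; the text does not state where the line between weak and strong frequency falls.

```python
from collections import Counter

def select_features(docs, min_doc_freq=2):
    """Keep words that appear in at least min_doc_freq documents
    (the 'strong-frequency' words under an assumed threshold)."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return sorted(w for w, df in doc_freq.items() if df >= min_doc_freq)

def boolean_vector(tokens, features):
    """Boolean weight: 1 if the feature word occurs in the text, else 0."""
    present = set(tokens)
    return [1 if f in present else 0 for f in features]

# Already segmented, stop-word-free documents (hypothetical).
docs = [
    ["股票", "上涨", "市场"],
    ["股票", "下跌", "市场"],
    ["球队", "比赛", "胜利"],
    ["球队", "比赛", "失利"],
]
features = select_features(docs)
print(features)
print(boolean_vector(["股票", "市场", "新闻"], features))
```

The same counting loop could use word frequency instead of document frequency by dropping the `set(...)` call.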
Preferably, the classifier in step (5), also called a classification model, maps corpus texts of unknown class to a specified class space; the classifier uses a Bayesian algorithm based on Bayes' theorem.
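The text names only "a Bayesian algorithm based on Bayes' theorem". One common instance is multinomial naive Bayes with add-one smoothing, sketched below; the independence assumption, the smoothing, the class name and the training data are all choices made for this illustration rather than details taken from the patent.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over token lists (illustrative sketch)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, tokens):
        best_label, best_score = None, -math.inf
        vocab_size = len(self.vocab)
        for label, n_docs in self.class_counts.items():
            # log P(class) + sum of log P(word | class), add-one smoothed
            score = math.log(n_docs / self.total_docs)
            total = sum(self.word_counts[label].values())
            for t in tokens:
                if t in self.vocab:  # words never seen in training are skipped
                    count = self.word_counts[label][t]
                    score += math.log((count + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train_docs = [["股票", "上涨"], ["股票", "下跌"], ["球队", "比赛"], ["球队", "胜利"]]
train_labels = ["finance", "finance", "sports", "sports"]
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["股票", "比赛", "上涨"]))
```

Training and prediction are both linear in the number of tokens, which fits the stated goal of fast classification.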
Preferably, the effect evaluation in step (6) includes three aspects of effectiveness, computational complexity and descriptive simplicity, and the effectiveness includes three indexes of recall, precision and F-measure.
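The three effectiveness indexes can be computed for one class as in the sketch below. The balanced F-measure (often written F1) is assumed here because the text does not say which weighting of precision and recall is used; the function name and the sample labels are hypothetical.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall and balanced F-measure for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

true_labels = ["finance", "finance", "finance", "sports", "sports"]
predicted = ["finance", "finance", "sports", "finance", "sports"]
print(precision_recall_f1(true_labels, predicted, positive="finance"))
```

For a multi-class corpus the same function can be called once per class and the results averaged.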
The technical scheme of the invention has the following beneficial technical effects. Extracting the information data from the corpus and preprocessing it helps bring the text corpora of the existing corpus into a unified format, which makes subsequent unified processing convenient. Inputting the preprocessing result into the vector space model helps decompose the text into basic processing units and thereby obtain the feature words, from which the feature items are finally generated. Processing the feature words helps reflect the features of the information data, which makes it convenient to determine the weights of the feature items. Selecting a classifier for the feature words, that is, choosing a suitable classification algorithm, helps improve the speed and accuracy of corpus classification. Evaluating the effect of the classifier helps in understanding and judging its classification capability.
Drawings
Fig. 1 is a schematic structural diagram of a fast text classification method for a corpus according to the present invention.
Detailed Description
To make the technical means, inventive features, objectives and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.
As shown in Fig. 1, a method for fast text classification of a corpus includes the following steps:
(1) selecting an existing corpus to be used;
(2) extracting information data from the corpus and preprocessing the information data;
(3) inputting the preprocessing result into a vector space model;
(4) processing the feature words;
(5) selecting a classifier for the feature words;
(6) evaluating the effect of the classifier;
(7) classifying the corpus by using the classifier.
The existing corpus in step (1) is specifically a Chinese corpus, and the corpus type is monolingual.
The information data in step (2) refers to a set of Chinese text corpora that closely resemble one another, and the preprocessing specifically comprises performing word segmentation and stop-word removal on the information data extracted from the corpus, so as to obtain a training sample set.
The training sample set is an initial set of feature items (a feature set for short) formed from the keywords obtained. Word segmentation divides each corpus text in the Chinese text corpus set into a number of words. The segmentation technique adopted is a statistics-based word segmentation algorithm, which takes the frequency with which characters co-occur adjacently as the measure of confidence that they form a word, and counts the frequency of each combination of adjacent co-occurring characters in the corpus. Stop words are of two types: words that are widely used and appear frequently in all corpora, and certain function words, including modal particles, adverbs, prepositions, conjunctions and interjections. The stop words are replaced with symbols and removed from the word segmentation result to obtain the effective word combinations; the symbols include "()", "-", "/" and "&".
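The replace-then-remove handling of stop words can be sketched as follows. The stop-word list, the choice of "/" as the placeholder and the function name are assumptions for illustration only.

```python
PLACEHOLDER = "/"  # assumed: any one of the listed symbols could serve

def remove_stop_words(tokens, stop_words, placeholder=PLACEHOLDER):
    """Replace each stop word with a placeholder symbol, then drop the
    placeholders so that only the effective words remain."""
    marked = [placeholder if t in stop_words else t for t in tokens]
    # Note: a token that is literally the placeholder is dropped as well.
    return [t for t in marked if t != placeholder]

# Hypothetical stop-word list: a few common Chinese function words.
stop_words = {"的", "了", "和", "啊"}
tokens = ["我", "的", "朋友", "和", "我", "去", "了", "北京"]
print(remove_stop_words(tokens, stop_words))
```

Marking before deleting keeps the two passes separate, so the marked sequence can be inspected or logged before the stop words are discarded.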
The vector space model in step (3) means that the corpus text and the query each contain independent attributes that are expressed by feature items and reveal their content, and each attribute can be regarded as one dimension of the vector space. The corpus text and the query can therefore each be represented as a set of attributes, ignoring the complex relationships among paragraphs, sentences and words in the corpus text. The similarity between the text and the query is measured by the distance between their vectors. This helps decompose the text into basic processing units and thereby obtain the feature words.
The similarity is calculated by the cosine coefficient method: the similarity between the corpus text and the query is expressed by the cosine of the angle between their vectors, and the smaller the angle, the greater the similarity.
The feature processing in step (4) means that tens of thousands of feature words can be obtained after preprocessing. Feature words that occur with low frequency in the corpus are called weak-frequency related words, and those that occur with high frequency are called strong-frequency related words. The feature set is formed by removing the weak-frequency related words and extracting the strong-frequency related words.
The feature processing comprises a feature extraction method and feature word weight determination. The feature extraction method uses frequency statistics, which comprise word frequency and document frequency. Feature word weight determination means extracting, from the extracted information data, words that can represent the features of the text to form feature items, and assigning corresponding weights to those feature items. The feature word weighting algorithm is the Boolean weight method, which helps reflect the features of the information data.
The classifier in the step (5) is also called a classification model, which means that the corpus text of unknown classes is mapped to a specified class space, and the classifier adopts a Bayes algorithm based on Bayes' theorem.
The effect evaluation in step (6) covers three aspects: effectiveness, computational complexity and simplicity of description. Effectiveness comprises three indexes: recall, precision and F-measure.
The working principle of the invention is as follows. Selecting the existing corpus to be used determines the type of the corpus, which in turn determines the choice of classification method and the estimate of classification cost. Part of the text corpus in the existing corpus is then extracted; the extracted text corpus is called the information data. The information data is then preprocessed, which comprises word segmentation and stop-word removal. In English text every word is separated by a space, so word segmentation is very simple, whereas Chinese text is delimited only by punctuation and paragraphs, which makes segmentation troublesome and ambiguous; word segmentation is therefore required when processing the feature words of the corpus. Chinese also contains many function words, including modal particles, adverbs, prepositions, conjunctions and interjections. Stop words are handled mainly by replacing them with symbols such as "()", "-", "/" and "&", in preparation for feature word extraction. The preprocessing result is input into the vector space model, which facilitates generation of the feature set. Feature word processing yields weak-frequency related words and strong-frequency related words; the weak-frequency related words are removed and the strong-frequency related words are extracted to form the feature set. A classifier is selected for the feature words through the feature set, and the assignment of corpus texts to classes is completed according to the classifier's algorithm. The classifier uses the Bayesian method, which helps classify corpus texts accurately and quickly, improves the efficiency and accuracy of corpus classification, and makes it easier for researchers and scholars to analyze and study the corpus in depth. The effect of the classifier is evaluated in terms of effectiveness, computational complexity and simplicity of description, which helps test and understand its processing speed and accuracy. Finally, the classifier that passes the test is used to classify the corpus.
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (7)

1. A method for fast text classification of a corpus, the method comprising the steps of:
(1) selecting an existing corpus to be used;
(2) extracting information data from the corpus and preprocessing the information data;
(3) inputting the preprocessing result into a vector space model;
(4) processing the feature words;
(5) selecting a classifier for the feature words;
(6) evaluating the effect of the classifier;
(7) classifying the corpus by using the classifier.
2. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the existing corpus in the step (1) refers to a Chinese corpus in particular, and the corpus type is a monolingual type.
3. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the information data in step (2) refers to a set of Chinese text corpora that closely resemble one another, and the preprocessing specifically comprises performing word segmentation and stop-word removal on the information data extracted from the corpus, so as to obtain a training sample set;
the training sample set is an initial set of feature items, a feature set for short, formed from the keywords obtained; the word segmentation divides each corpus text in the Chinese text corpus set into a number of words; the word segmentation technique adopted is a statistics-based word segmentation algorithm, which takes the frequency with which characters co-occur adjacently as the measure of confidence that they form a word and counts the frequency of each combination of adjacent co-occurring characters in the corpus; the stop words are of two types, namely words that are widely used and appear frequently in all corpora, and certain function words including modal particles, adverbs, prepositions, conjunctions and interjections; and the stop words are replaced with symbols and removed from the word segmentation result to obtain the effective word combinations, the symbols including "()", "-", "/" and "&".
4. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the vector space model in step (3) means that the corpus text and the query each contain independent attributes that are expressed by feature items and reveal their content, and each attribute can be regarded as one dimension of the vector space, so that the corpus text and the query can each be represented as a set of attributes, ignoring the complex relationships among paragraphs, sentences and words in the corpus text; the similarity between the text and the query is measured by the distance between their vectors;
the similarity is calculated by the cosine coefficient method, in which the similarity between the corpus text and the query is expressed by the cosine of the angle between their vectors, and the smaller the angle, the greater the similarity between the corpus text and the query.
5. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the feature processing in step (4) means that tens of thousands of feature words are obtained after preprocessing, wherein feature words that occur with low frequency in the corpus are called weak-frequency related words and feature words that occur with high frequency are called strong-frequency related words, and the feature set is formed by removing the weak-frequency related words and extracting the strong-frequency related words;
the feature processing comprises a feature extraction method and feature word weight determination; the feature extraction method uses frequency statistics, which comprise word frequency and document frequency; the feature word weight determination means extracting, from the extracted information data, words that can represent the features of the text to form feature items and assigning corresponding weights to the feature items; and the feature word weighting algorithm is the Boolean weight method.
6. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the classifier in the step (5) is also called a classification model, which means that the corpus text of unknown classes is mapped to a specified class space, and the classifier adopts a Bayes algorithm based on Bayes' theorem.
7. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the effect evaluation in step (6) covers three aspects, namely effectiveness, computational complexity and simplicity of description, wherein the effectiveness comprises three indexes, namely recall, precision and F-measure.
CN202011235587.6A 2020-11-06 2020-11-06 Fast text classification method of corpus Pending CN112328790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011235587.6A CN112328790A (en) 2020-11-06 2020-11-06 Fast text classification method of corpus

Publications (1)

Publication Number Publication Date
CN112328790A true CN112328790A (en) 2021-02-05

Family

ID=74316428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011235587.6A Pending CN112328790A (en) 2020-11-06 2020-11-06 Fast text classification method of corpus

Country Status (1)

Country Link
CN (1) CN112328790A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103116637A (en) * 2013-02-08 2013-05-22 无锡南理工科技发展有限公司 Text sentiment classification method facing Chinese Web comments
CN104503960A (en) * 2015-01-07 2015-04-08 渤海大学 Text data processing method for English translation
CN109002473A (en) * 2018-06-13 2018-12-14 天津大学 A kind of sentiment analysis method based on term vector and part of speech
CN109960799A (en) * 2019-03-12 2019-07-02 中南大学 A kind of Optimum Classification method towards short text
CN110069630A (en) * 2019-03-20 2019-07-30 重庆信科设计有限公司 A kind of improved mutual information feature selection approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210205