CN112328790A - Fast text classification method of corpus - Google Patents
Fast text classification method of corpus
- Publication number
- CN112328790A (application number CN202011235587.6A)
- Authority
- CN
- China
- Prior art keywords
- corpus
- words
- text
- feature
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Abstract
The invention discloses a fast text classification method for a corpus, comprising the following steps: selecting an existing corpus to be used; extracting information data from the corpus and preprocessing it; inputting the preprocessing result into a vector space model; processing the feature words; selecting a classifier for the feature words; evaluating the effect of the classifier; and classifying the corpus with the classifier. The method can classify the corpora in a corpus quickly and accurately, so the efficiency and accuracy of corpus classification are improved and researchers and scholars can conveniently conduct in-depth analysis and research on the corpora.
Description
Technical Field
The invention relates to the technical field of corpus text classification, in particular to a method for quickly classifying texts in a corpus.
Background
A corpus is a large-scale electronic text library that has been scientifically sampled and processed. Its function is to let researchers carry out linguistic theory and application research with the help of computer analysis tools. Because a corpus is a basic resource carrying linguistic knowledge, storing language materials that actually appeared in real use, it is one of the main data supports for researchers and scholars conducting linguistic research and one of the important theoretical sources of linguistic research methods. It is mainly applied to dictionary compilation, language teaching, traditional language research, and statistics- or example-based research in natural language processing. With the continuous development of the times and the continuous improvement of computer technology, text classification has come to mean automatic classification based on a classification system, the basis of classification being one or more text features. Because texts are similar to one another, text classification cannot achieve a perfect result; only the optimal classification result can be selected according to the classification features and the completeness of the evaluation standard. Patent CN103823824A discloses a method and system for automatically constructing a text classification corpus by means of the Internet, which classifies according to parts of speech. Its classification basis is too simple and single, it cannot accurately and effectively classify corpora of near-synonyms, and it is inconvenient for researchers and scholars with definite purposes. It is therefore particularly important to invent a method for quickly classifying the texts of a corpus.
the existing fast text classification method of the corpus is classified according to parts of speech, and the corpora in the corpus cannot be classified fast and accurately, so that the efficiency and the accuracy of the corpus classification cannot be improved, and researchers and scholars cannot conveniently conduct deep analysis and research on the corpora.
Disclosure of Invention
The invention aims to provide a method for quickly classifying texts in a corpus, so as to solve the problems raised in the background art: existing methods classify according to parts of speech and cannot classify the corpora in a corpus quickly and accurately, so that the efficiency and accuracy of corpus classification cannot be improved and in-depth analysis and research by researchers and scholars is inconvenient.
In order to achieve the purpose, the invention provides the following technical scheme: a method for fast text classification of a corpus, said classification method comprising the steps of:
(1) selecting an existing corpus to be used;
(2) extracting information data in the corpus and preprocessing the information data;
(3) inputting the preprocessing result into a vector space model;
(4) processing the feature words;
(5) selecting a classifier for the feature words;
(6) evaluating the effect of the classifier;
(7) classifying the corpus by using the classifier.
Preferably, the existing corpus in step (1) specifically refers to a Chinese corpus, and the corpus type is monolingual.
Preferably, the information data in step (2) refers to Chinese text corpora with close similarity, and the preprocessing specifically comprises: performing word segmentation and stop-word removal on the information data extracted from the corpus, thereby obtaining a training sample set;
the training sample set is the initial feature item set (the feature set for short) formed by the set of obtained keywords. The word segmentation is performed as follows: a corpus text in the Chinese text corpus set is divided into a number of words using a statistics-based word segmentation algorithm, which takes the frequency of adjacent co-occurrence of characters as the credibility criterion for word formation and counts the combination frequency of each pair of adjacent co-occurring characters in the corpus. The stop words comprise two types: words that are widely used and appear frequently in all corpora, and certain function words including modal particles, adverbs, prepositions, conjunctions, and interjections. The stop words are replaced with symbols and removed from the word segmentation result to obtain effective word combinations, the symbols including "(", ")", "-", "/", and "&".
Preferably, the vector space model in step (3) means that both the corpus text and the query contain independent attributes that are expressed by feature items and reveal their contents. Each attribute can be regarded as one dimension of the vector space, so the corpus text and the query can each be represented as a set of attributes; complex relationships among paragraphs, sentences, and words in the corpus text are ignored, and the similarity between text and query is measured by the distance between their vectors.
The similarity calculation adopts the cosine coefficient method: the similarity between the corpus text and the query is expressed by the cosine of the angle between their vectors, and the smaller the angle, the greater the similarity.
Preferably, the feature processing in step (4) means that tens of thousands of feature words are obtained after preprocessing. Feature words with a low occurrence frequency in the corpus are called weak-frequency related words, and those with a high occurrence frequency are called strong-frequency related words; the feature set is formed by removing the weak-frequency related words and extracting the strong-frequency related words;
the feature processing comprises a feature extraction method and feature word weight determination. The feature extraction method adopts frequency statistics, which comprise word frequency and document frequency. Feature word weight determination means extracting words that can represent text features from the extracted information data to form feature items and assigning corresponding weights to them; the feature word weighting algorithm is the Boolean weight method.
Preferably, the classifier in step (5), also called a classification model, maps corpus texts of unknown class to a specified class space; the classifier adopts a Bayesian algorithm based on Bayes' theorem.
Preferably, the effect evaluation in step (6) covers three aspects, namely effectiveness, computational complexity, and simplicity of description; effectiveness comprises three indexes: recall, precision, and F-measure.
The technical scheme of the invention has the following beneficial technical effects: extracting the information data in the corpus and preprocessing it helps format the text corpora in the existing corpus into a unified format, which is convenient for subsequent unified processing; inputting the preprocessing result into a vector space model helps decompose the text into basic processing units and thereby obtain feature words, from which feature items are finally generated; processing the feature words helps reflect the features of the information data, which is convenient for determining the weights of the feature items; selecting a classifier for the feature words, that is, selecting a suitable classification algorithm, helps improve the classification speed and accuracy of the corpus; and evaluating the effect of the classifier helps understand and judge its classification capability.
Drawings
Fig. 1 is a schematic structural diagram of a fast text classification method for a corpus according to the present invention.
Detailed Description
In order to make the technical means, creative features, objectives, and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.
As shown in fig. 1, a method for fast text classification of a corpus includes the following steps:
(1) selecting an existing corpus to be used;
(2) extracting information data in the corpus and preprocessing the information data;
(3) inputting the preprocessing result into a vector space model;
(4) processing the feature words;
(5) selecting a classifier for the feature words;
(6) evaluating the effect of the classifier;
(7) classifying the corpus by using the classifier.
The existing corpus in step (1) specifically refers to a Chinese corpus, and the corpus type is monolingual.
The information data in step (2) refers to Chinese text corpora with close similarity, and the preprocessing specifically comprises: performing word segmentation and stop-word removal on the information data extracted from the corpus, thereby obtaining a training sample set;
the training sample set is the initial feature item set (the feature set for short) formed by the set of obtained keywords. The word segmentation is performed as follows: a corpus text in the Chinese text corpus set is divided into a number of words using a statistics-based word segmentation algorithm, which takes the frequency of adjacent co-occurrence of characters as the credibility criterion for word formation and counts the combination frequency of each pair of adjacent co-occurring characters in the corpus. The stop words comprise two types: words that are widely used and appear frequently in all corpora, and certain function words including modal particles, adverbs, prepositions, conjunctions, and interjections. The stop words are replaced with symbols and removed from the word segmentation result to obtain effective word combinations, the symbols including "(", ")", "-", "/", and "&".
The vector space model in step (3) means that both the corpus text and the query contain independent attributes that are expressed by feature items and reveal their contents. Each attribute can be regarded as one dimension of the vector space, so the corpus text and the query can each be represented as a set of attributes; complex relationships among paragraphs, sentences, and words in the corpus text are ignored, and the similarity between text and query is measured by the distance between their vectors, which helps decompose the text into basic processing units and thereby obtain the feature words.
The similarity calculation adopts the cosine coefficient method: the similarity between the corpus text and the query is expressed by the cosine of the angle between their vectors, and the smaller the angle, the greater the similarity.
The feature processing in step (4) means that tens of thousands of feature words can be obtained after preprocessing. Feature words with a low occurrence frequency in the corpus are called weak-frequency related words, and those with a high occurrence frequency are called strong-frequency related words; the feature set is formed by removing the weak-frequency related words and extracting the strong-frequency related words;
the feature processing comprises a feature extraction method and feature word weight determination. The feature extraction method adopts frequency statistics, which comprise word frequency and document frequency. Feature word weight determination means extracting words that can represent text features from the extracted information data to form feature items and assigning corresponding weights to them; the feature word weighting algorithm is the Boolean weight method, which helps reflect the features of the information data.
The classifier in step (5), also called a classification model, maps corpus texts of unknown class to a specified class space; the classifier adopts a Bayesian algorithm based on Bayes' theorem.
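A Bayesian classifier of the kind described can be sketched as a multinomial naive Bayes over feature words. Laplace (add-one) smoothing is an assumption added here so that unseen words do not zero out a class score; the patent does not specify a smoothing scheme.

```python
import math
from collections import Counter

class NaiveBayesClassifier:
    """Multinomial naive Bayes over feature words, with Laplace smoothing."""

    def fit(self, documents, labels):
        self.classes = set(labels)
        self.class_counts = Counter(labels)                    # docs per class
        self.word_counts = {c: Counter() for c in self.classes}
        self.totals = Counter()                                # words per class
        self.vocab = set()
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc)
            self.totals[label] += len(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        """Map a word list of unknown class to the highest-scoring class."""
        best, best_score = None, -math.inf
        v, n = len(self.vocab), sum(self.class_counts.values())
        for c in self.classes:
            score = math.log(self.class_counts[c] / n)  # log prior P(c)
            for w in doc:
                # Laplace-smoothed log likelihood of P(w | c)
                score += math.log((self.word_counts[c][w] + 1) / (self.totals[c] + v))
            if score > best_score:
                best, best_score = c, score
        return best
```

Trained on a handful of labeled word lists, the classifier assigns a new word list to the class with the highest smoothed log-probability.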
The effect evaluation in step (6) covers three aspects, namely effectiveness, computational complexity, and simplicity of description; effectiveness comprises three indexes: recall, precision, and F-measure.
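The three effectiveness indexes can be computed as below. This sketch evaluates one class at a time against gold labels; it is an illustration of the standard definitions, not the patent's own evaluation procedure.

```python
def evaluate(predicted, actual, positive_class):
    """Return (precision, recall, F-measure) for one class.

    precision = TP / (TP + FP): how many predicted positives were correct.
    recall    = TP / (TP + FN): how many true positives were found.
    F-measure = harmonic mean of precision and recall.
    """
    tp = sum(1 for p, a in zip(predicted, actual)
             if p == positive_class and a == positive_class)
    fp = sum(1 for p, a in zip(predicted, actual)
             if p == positive_class and a != positive_class)
    fn = sum(1 for p, a in zip(predicted, actual)
             if p != positive_class and a == positive_class)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```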
In the invention, similar texts are identified and classified on the basis of a corpus. The type of corpus is determined by selecting an existing corpus to be used, which in turn determines the choice of classification method and the prediction of classification cost. Part of the text corpora in the existing corpus is then extracted; the extracted text corpora are called information data. The information data is then preprocessed, the preprocessing comprising word segmentation and stop-word removal. Each word of an English text is separated by a space, so word segmentation is very simple, whereas a Chinese text is only delimited by punctuation and paragraphs, which is troublesome and ambiguous; word segmentation is therefore needed before feature word processing can be performed on the corpus. Chinese also contains many function words, including modal particles, adverbs, prepositions, conjunctions, and interjections; in stop-word processing these words are replaced with symbols such as "(" and ")" and removed. The preprocessed result is input into a vector space model, which facilitates the generation of the feature set: feature word processing yields weak-frequency related words and strong-frequency related words, the weak-frequency related words are removed, and the strong-frequency related words are extracted to form the feature set. A classifier is then selected for the feature words via the feature set, and the assignment of corpus texts to classes is completed according to the algorithm of the classifier. The classifier adopts the Bayesian method, which helps classify the corpus texts accurately and rapidly, improves the efficiency and accuracy of corpus classification, and makes it convenient for researchers and scholars to carry out in-depth analysis and research on the corpora. The classifier is then evaluated on three aspects, namely effectiveness, computational complexity, and simplicity of description, which helps test and understand its processing speed and accuracy. Finally, corpus classification is performed with the classifier that passed the test.
The foregoing shows and describes the basic principles, main features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the embodiments and description only illustrate the principle of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.
Claims (7)
1. A method for fast text classification of a corpus, the method comprising the steps of:
(1) selecting an existing corpus to be used;
(2) extracting information data in the corpus and preprocessing the information data;
(3) inputting the preprocessing result into a vector space model;
(4) processing the feature words;
(5) selecting a classifier for the feature words;
(6) evaluating the effect of the classifier;
(7) classifying the corpus by using the classifier.
2. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the existing corpus in step (1) specifically refers to a Chinese corpus, and the corpus type is monolingual.
3. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the information data in step (2) refers to Chinese text corpora with close similarity, and the preprocessing specifically comprises: performing word segmentation and stop-word removal on the information data extracted from the corpus, thereby obtaining a training sample set;
the training sample set is the initial feature item set (the feature set for short) formed by the set of obtained keywords. The word segmentation is performed as follows: a corpus text in the Chinese text corpus set is divided into a number of words using a statistics-based word segmentation algorithm, which takes the frequency of adjacent co-occurrence of characters as the credibility criterion for word formation and counts the combination frequency of each pair of adjacent co-occurring characters in the corpus. The stop words comprise two types: words that are widely used and appear frequently in all corpora, and certain function words including modal particles, adverbs, prepositions, conjunctions, and interjections. The stop words are replaced with symbols and removed from the word segmentation result to obtain effective word combinations, the symbols including "(", ")", "-", "/", and "&".
4. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the vector space model in step (3) means that both the corpus text and the query contain independent attributes that are expressed by feature items and reveal their contents. Each attribute can be regarded as one dimension of the vector space, so the corpus text and the query can each be represented as a set of attributes; complex relationships among paragraphs, sentences, and words in the corpus text are ignored, and the similarity between text and query is measured by the distance between their vectors;
the similarity calculation adopts the cosine coefficient method: the similarity between the corpus text and the query is expressed by the cosine of the angle between their vectors, and the smaller the angle, the greater the similarity.
5. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the feature processing in step (4) means that tens of thousands of feature words are obtained after preprocessing. Feature words with a low occurrence frequency in the corpus are called weak-frequency related words, and those with a high occurrence frequency are called strong-frequency related words; the feature set is formed by removing the weak-frequency related words and extracting the strong-frequency related words;
the feature processing comprises a feature extraction method and feature word weight determination. The feature extraction method adopts frequency statistics, which comprise word frequency and document frequency. Feature word weight determination means extracting words that can represent text features from the extracted information data to form feature items and assigning corresponding weights to them; the feature word weighting algorithm is the Boolean weight method.
6. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the classifier in step (5), also called a classification model, maps corpus texts of unknown class to a specified class space; the classifier adopts a Bayesian algorithm based on Bayes' theorem.
7. A method for fast text classification of a corpus as claimed in claim 1, characterized in that: the effect evaluation in step (6) covers three aspects, namely effectiveness, computational complexity, and simplicity of description; effectiveness comprises three indexes: recall, precision, and F-measure.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011235587.6A CN112328790A (en) | 2020-11-06 | 2020-11-06 | Fast text classification method of corpus |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112328790A true CN112328790A (en) | 2021-02-05 |
Family
ID=74316428
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011235587.6A Pending CN112328790A (en) | 2020-11-06 | 2020-11-06 | Fast text classification method of corpus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112328790A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103116637A (en) * | 2013-02-08 | 2013-05-22 | 无锡南理工科技发展有限公司 | Text sentiment classification method facing Chinese Web comments |
CN104503960A (en) * | 2015-01-07 | 2015-04-08 | 渤海大学 | Text data processing method for English translation |
CN109002473A (en) * | 2018-06-13 | 2018-12-14 | 天津大学 | A kind of sentiment analysis method based on term vector and part of speech |
CN109960799A (en) * | 2019-03-12 | 2019-07-02 | 中南大学 | A kind of Optimum Classification method towards short text |
CN110069630A (en) * | 2019-03-20 | 2019-07-30 | 重庆信科设计有限公司 | A kind of improved mutual information feature selection approach |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108255813B (en) | Text matching method based on word frequency-inverse document and CRF | |
CN105701084A (en) | Characteristic extraction method of text classification on the basis of mutual information | |
CN102214189B (en) | Data mining-based word usage knowledge acquisition system and method | |
Lydia et al. | Correlative study and analysis for hidden patterns in text analytics unstructured data using supervised and unsupervised learning techniques | |
El-Shishtawy et al. | An accurate arabic root-based lemmatizer for information retrieval purposes | |
CN110362678A (en) | A kind of method and apparatus automatically extracting Chinese text keyword | |
Mustofa et al. | Sentiment analysis using lexicon-based method with naive bayes classifier algorithm on# newnormal hashtag in twitter | |
AL-Jibory | Hybrid system for plagiarism detection on a scientific paper | |
Puri et al. | An efficient hindi text classification model using svm | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
AL-Jumaili | A hybrid method of linguistic and statistical features for Arabic sentiment analysis | |
KR20110017129A (en) | Apparatus and method for words sense disambiguation using korean wordnet and its program stored recording medium | |
Dhar et al. | Bengali news headline categorization using optimized machine learning pipeline | |
Ma | Research on keyword extraction algorithm in english text based on cluster analysis | |
Mutuvi et al. | A dataset for multi-lingual epidemiological event extraction | |
Thielmann et al. | Coherence based document clustering | |
Mekala et al. | A survey on authorship attribution approaches | |
Saeed et al. | An abstractive summarization technique with variable length keywords as per document diversity | |
CN112328790A (en) | Fast text classification method of corpus | |
Tao et al. | The Text modeling method of Tibetan text combining Word2vec and improved TF-IDF | |
CN113963748A (en) | Protein knowledge map vectorization method | |
CN112800243A (en) | Project budget analysis method and system based on knowledge graph | |
CN113516202A (en) | Webpage accurate classification method for CBL feature extraction and denoising | |
BAZRFKAN et al. | Using machine learning methods to summarize persian texts | |
Osochkin et al. | Automatic Identification of Authors' Stylistics and Gender on the Basis of the Corpus of Russian Fiction Using Extended Set-theoretic Model with Collocation Extraction. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210205 |