CN116860963A - Text classification method, equipment and storage medium - Google Patents

Text classification method, equipment and storage medium

Info

Publication number
CN116860963A
CN116860963A CN202310550978.4A
Authority
CN
China
Prior art keywords
document
classified
word
keyword
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310550978.4A
Other languages
Chinese (zh)
Inventor
王吉煜
周树亮
冯偲
高延龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tibet Ningsuan Technology Group Co ltd
Original Assignee
Tibet Ningsuan Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tibet Ningsuan Technology Group Co ltd filed Critical Tibet Ningsuan Technology Group Co ltd
Priority to CN202310550978.4A
Publication of CN116860963A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/126: Character encoding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/0442: Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method, equipment, and a storage medium. The method comprises the following steps: determining the keywords in a document to be classified according to the keyword set of at least one category, wherein each keyword set comprises at least one keyword; splicing each segmented word in the document to be classified with each keyword appearing in that document to obtain at least one spliced word of the document; vectorizing and encoding all the segmented words of each document to be classified together with the corresponding spliced words to obtain at least one text feature vector; and inputting the text feature vectors into a trained neural network model to obtain the semantics corresponding to the document to be classified. The method can improve the accuracy of text classification.

Description

Text classification method, equipment and storage medium
Technical Field
The invention belongs to the technical field of text classification, and particularly relates to a text classification method, text classification equipment and a storage medium.
Background
Text data on the internet is growing by the day, and using text classification technology to organize and manage this mass of data scientifically is very important. Text data is widely distributed and large in volume, and managing it effectively has long been a difficult problem.
Text classification is a foundational task in the NLP field. It aims to sort texts into categories and can effectively alleviate the problem of information overload.
Text classification has many practical application scenarios, such as sensitive information classification, public opinion classification, and topic classification.
Sensitive information classification: industries such as securities, banking, and insurance place high demands on information asset security. Faced with huge volumes of internal information assets, they must manage sensitive data effectively, which in essence means labeling resources by category, identifying the data security level from the label, and retaining the label records as a basis for auditing.
Public opinion classification: online public opinion is a channel through which the public expresses views, and it reflects social conditions and sentiment. The first step in handling public opinion should therefore be to establish an efficient mechanism for collecting online opinion information: using opinion classification to monitor opinion across the whole network at multiple levels and from all directions, discovering opinion information in time, tracking its development, and preventing opinion crises.
Topic classification: the words of an article can be used to assign it to different topics, and when making recommendations, articles with similar topics can be suggested to a user based on the articles the user has browsed.
However, the classification accuracy of existing methods still leaves room for improvement.
Disclosure of Invention
To address the low classification accuracy of existing methods, the invention provides a text classification method, equipment, and a storage medium that can improve the accuracy of text classification.
The aim of the invention is achieved by the following technical scheme:
the first aspect of the present invention provides a text classification method, comprising the steps of:
acquiring a document to be classified, wherein the document comprises at least one word;
determining the keywords in the document to be classified according to the keyword set of at least one category, wherein each keyword set comprises at least one keyword;
splicing each segmented word in the document to be classified with each keyword appearing in that document, to obtain at least one spliced word of the document to be classified;
vectorizing and encoding all the segmented words of each document to be classified together with the corresponding spliced words, to obtain at least one text feature vector;
inputting the text feature vectors into a trained neural network model to obtain the semantics corresponding to the document to be classified.
In one possible design, the method further comprises determining the keyword sets of the categories before acquiring the at least one document.
In one possible design, the determining of the keyword set for the category includes:
acquiring at least one document;
calculating a TF-IDF value for each word in each of the at least one document, wherein the TF-IDF value is the product TF×IDF, TF being the term frequency of the word within its document and IDF being the inverse document frequency;
and determining the keyword set of at least one category according to the TF-IDF values.
In one possible design, IDF = log(N1/(N2+1)) and TF = N3/N4,
where N1 is the total number of documents, N2 is the number of documents in which the word appears, N3 is the number of times the word appears in a given document, and N4 is the total number of words in that document.
In one possible design, determining the keyword set of at least one category according to the TF-IDF values includes:
filtering out words whose TF-IDF value is below a first threshold; and
filtering out words whose term frequency within each document is greater than a second threshold.
In one possible design, the neural network model is an RNN, an LSTM, a Transformer, or BERT.
The second aspect of the invention provides a text classification device, comprising a to-be-classified document acquisition unit, a keyword determination unit, a splicing unit, a vectorization unit, and a recognition unit, which are communicatively connected in sequence, wherein:
the document to be classified acquisition unit is used for acquiring a document to be classified, wherein the document comprises at least one word;
the keyword determining unit is used for determining keywords in the documents to be classified according to at least one category of keyword set, wherein the keyword set comprises at least one keyword;
the splicing unit is used for respectively splicing each word in the document to be classified with each keyword in the document to be classified to obtain at least one spliced word of the document to be classified;
the vectorization unit is used for vectorizing and encoding all the segmentation words in each document to be classified and the corresponding splicing words to obtain at least one text feature vector;
the recognition unit is used for inputting the text feature vector into the trained neural network model to obtain the semantics corresponding to the document to be classified.
A third aspect of the invention provides a computer readable storage medium having instructions stored thereon which, when executed on a computer, perform a method of text classification as described in the first aspect and any of its possible designs.
Compared with the prior art, the invention has at least the following advantages and beneficial effects:
the invention splices the segmented words of the document to be classified with the keywords appearing in that document, then vectorizes and encodes the spliced words together with the segmented words, and the neural network obtains the semantics of the document to be classified from the text feature vectors produced by this vectorized encoding. Because the spliced keywords inject category-related prior knowledge into the representation, the classification accuracy of the text is improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the classification method of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, based on the embodiments of the invention, which are apparent to those of ordinary skill in the art without inventive faculty, are intended to be within the scope of the invention.
In addition, the embodiments of the present invention and the features of the embodiments may be combined with each other without collision.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present invention, it should be noted that directional or positional terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" indicate orientations or positional relationships based on the drawings, on the conventional placement of the inventive product in use, or on the conventional understanding of those skilled in the art. They are used only for convenience and simplicity of description, do not indicate or imply that the apparatus or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention. Furthermore, the terms "first", "second", and the like are used merely to distinguish descriptions and should not be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless explicitly specified and limited otherwise, the terms "disposed", "mounted", and "connected" are to be construed broadly: a connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intermediate medium, or an internal communication between two elements. The specific meaning of these terms in the present invention will be understood by those of ordinary skill in the art according to the specific circumstances.
As shown in fig. 1, the present invention discloses a text classification method, which may be, but is not limited to being, executed by a text classification device. The text classification device may be software, or a combination of software and hardware, and may be integrated in a smart mobile terminal, a tablet, a computer, or another smart device. Specifically, the text classification method includes the following steps S01 to S05.
Step S01, at least one document is acquired, wherein the document comprises at least one word. The document may be a cloud document, a document stored in a local server, or a document uploaded by a user.
Step S02, determining keywords in each document in at least one document according to at least one category of keyword sets, wherein the keyword sets comprise at least one keyword.
To determine a category's keyword set, at least one document is acquired, the TF-IDF value of each word in each of those documents is calculated, and the keyword set of at least one category is then determined from the TF-IDF values.
The TF-IDF value is the product TF×IDF, where TF is the term frequency of the word within its document and IDF is the inverse document frequency. The TF-IDF value can be understood as a trade-off between how often a word occurs within a document and how broadly it occurs across documents: if a word occurs frequently in one document (high TF) but rarely appears in other documents (high IDF), it distinguishes that type of document well from the others.
IDF= log(N1/(N2+1))
TF=N3/N4
where N1 is the total number of documents, N2 is the number of documents in which the word appears, N3 is the number of times the word appears in a given document, and N4 is the total number of words in that document. The +1 in the IDF denominator is a Laplace-style smoothing term that prevents division by zero.
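The two formulas above can be sketched in Python as follows; the toy tokenized corpus is an illustrative assumption, not data from the patent:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF per (document, word) using the patent's formulas:
    TF = N3/N4 (occurrences of the word in the document / words in the document),
    IDF = log(N1/(N2+1)), smoothed so the denominator is never zero."""
    n1 = len(docs)                                 # N1: total number of documents
    df = Counter(w for d in docs for w in set(d))  # N2 per word: documents containing it
    scores = []
    for d in docs:
        n4 = len(d)                                # N4: words in this document
        counts = Counter(d)                        # N3 per word in this document
        scores.append({w: (n3 / n4) * math.log(n1 / (df[w] + 1))
                       for w, n3 in counts.items()})
    return scores

# Toy corpus (assumed for illustration): three tokenized documents.
docs = [["sports", "basketball", "sports"],
        ["music", "classical", "rap"],
        ["doctor", "police", "netizen"]]
scores = tf_idf(docs)
```

For instance, "sports" occurs twice in the first three-word document and in only one of the three documents, so its score is (2/3)·log(3/2).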
When determining the keyword set of at least one category from the TF-IDF values, words whose TF-IDF value is below a first threshold are filtered out, as are words whose term frequency within a document exceeds a second threshold.
Words with very low TF-IDF values are the so-called stop words; they contribute little to classification, and filtering them out in advance reduces the amount of computation.
Words such as "we" and "the" occur frequently within a document (high TF) but also occur in many different documents (low IDF), so their overall TF-IDF is relatively low; such words do not distinguish one document from another well.
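The two filtering rules can be sketched as follows; the threshold values and the toy score dictionaries are illustrative assumptions, since the patent does not give concrete numbers:

```python
def select_keywords(tfidf_scores, tf_scores, first_threshold=0.05, second_threshold=0.5):
    """Keep words whose TF-IDF value is at least the first threshold
    (dropping stop-word-like terms) and whose in-document term frequency
    does not exceed the second threshold (dropping overly common terms)."""
    keywords = set()
    for doc_tfidf, doc_tf in zip(tfidf_scores, tf_scores):
        for word, score in doc_tfidf.items():
            if score >= first_threshold and doc_tf[word] <= second_threshold:
                keywords.add(word)
    return keywords

# Toy per-document scores: "the" has a stop-word-like low TF-IDF, so it is dropped.
tfidf_scores = [{"sports": 0.27, "the": 0.01}]
tf_scores = [{"sports": 0.4, "the": 0.9}]
category_keywords = select_keywords(tfidf_scores, tf_scores)
```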
Assume the categories are sports, music, and society.
Words with relatively high TF-IDF values in the sports category: sports, basketball, table tennis, soccer;
words with relatively high TF-IDF values in the music category: music, rap, classical, pop;
words with relatively high TF-IDF values in the society category: doctor, netizen, police.
The keyword set corresponding to the sports category therefore comprises sports, basketball, table tennis, and soccer;
the keyword set corresponding to the music category comprises music, rap, classical, and pop;
and the keyword set corresponding to the society category comprises doctor, netizen, and police.
After the keyword set of each category has been determined, the keywords appearing in a document can be identified against those sets. Taking as an example a text about a sports event accompanied by classical music, its keywords, determined from the keyword sets, include sports, classical, and music.
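Identifying a document's keywords against the per-category keyword sets can be sketched as follows, reusing the example categories above (the segmented word list is assumed to come from an upstream word segmenter):

```python
# Example keyword sets from the description above.
KEYWORD_SETS = {
    "sports": {"sports", "basketball", "table tennis", "soccer"},
    "music": {"music", "rap", "classical", "pop"},
    "society": {"doctor", "netizen", "police"},
}

def keywords_in_document(segmented_words):
    """Return, in order, the keywords (from any category's keyword set)
    that appear among the document's segmented words."""
    all_keywords = set().union(*KEYWORD_SETS.values())
    return [w for w in segmented_words if w in all_keywords]

# Assumed segmentation of the example text about a sports event with classical music.
words = ["home", "sports", "event", "classical", "music"]
doc_keywords = keywords_in_document(words)
```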
And S03, respectively splicing each word in the document to be classified with each keyword in the document to be classified to obtain at least one spliced word of the document to be classified.
Illustratively, segmenting the example text yields words including home, sports, event, accompaniment, ascending, classical, music, and so on.
Each of these segmented words is then spliced with each of the keywords determined for the text (sports, classical, and music), yielding the spliced words of the document.
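A minimal sketch of the splicing step, in which each segmented word is paired with each keyword appearing in the document; the hyphen separator is an assumption, since the patent does not specify how the two strings are joined:

```python
def splice(segmented_words, doc_keywords, sep="-"):
    """Concatenate every segmented word with every keyword appearing in
    the document, producing the document's spliced words."""
    return [f"{w}{sep}{k}" for w in segmented_words for k in doc_keywords]

# 2 segmented words x 3 keywords -> 6 spliced words, e.g. "home-sports".
spliced = splice(["home", "event"], ["sports", "classical", "music"])
```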
step S04, vectorizing and encoding all the segmentation words in each document to be classified and the corresponding splicing words to obtain at least one text feature vector;
in this step, word2vec may be used to vectorize the word and the corresponding concatenated word, and the word vectors of all the words may be used as text feature vectors.
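The vectorization step can be sketched as follows. The seeded random vectors below are only a stand-in for a trained word2vec model (which the patent specifies, but whose training is out of scope here); what is illustrated is the structure of the step: one vector per segmented or spliced word, stacked into a feature matrix:

```python
import numpy as np

def vectorize(tokens, dim=8, _cache={}):
    """Stack one vector per token (segmented words plus spliced words)
    into the document's text feature matrix. Each vector here is a
    deterministic random stand-in; in practice the vectors would come
    from a trained word2vec model."""
    rows = []
    for t in tokens:
        if t not in _cache:  # memoize so the same token always maps to the same vector
            rng = np.random.default_rng(abs(hash(t)) % (2**32))
            _cache[t] = rng.standard_normal(dim)
        rows.append(_cache[t])
    return np.stack(rows)

# One 8-dimensional vector per token, stacked into a (3, 8) feature matrix.
features = vectorize(["home", "sports", "home-sports"])
```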
And step S05, inputting the text feature vector into the trained neural network model to obtain the semantics corresponding to the document to be classified.
The neural network model is a recurrent neural network (Recurrent Neural Network, RNN), a long short-term memory network (Long Short-Term Memory, LSTM), a Transformer, or BERT. BERT is a general-purpose semantic representation model. The advantage of such models is that they can automatically discover latent patterns in text without manual feature engineering. Their drawback is that they are complex and hard to interpret, and the amount of text data and the number of training iterations affect the classification accuracy; used on their own, they may only achieve coarse-grained classification (for example, binary or five-way classification) and cannot cover the different facets of a subject.
The text feature vectors are input into the trained neural network model to obtain the semantics corresponding to the document to be classified, where "semantics" is understood as a generalization of the events the document expresses; that is, the neural network model can discover the latent hierarchical relations between the text and the different categories. Document classification can then be realized from these semantics. In this way, semi-automatic classification is achieved: combining constructed prior knowledge (the keyword sets) with a self-learning neural network model can greatly improve text classification accuracy and can also discover potential new categories, which is of high practical value in scenarios such as public opinion classification, topic classification, and sensitive information classification.
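As an illustration of the gated recurrent models named above, the following sketch runs a single LSTM cell over a document's text feature matrix and maps the final hidden state to category probabilities. The weights are random placeholders and the dimensions are assumptions; a real model would be trained on labeled documents as the patent describes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_classify(features, n_classes=3, hidden=16, seed=0):
    """Run a single-layer LSTM over the document's text feature vectors
    (one row per token) and map the final hidden state to class
    probabilities via a softmax. Weights are random placeholders."""
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    # One stacked weight matrix for the input, forget, cell, and output gates.
    W = rng.standard_normal((4 * hidden, dim + hidden)) * 0.1
    b = np.zeros(4 * hidden)
    Wy = rng.standard_normal((n_classes, hidden)) * 0.1
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in features:
        z = W @ np.concatenate([x, h]) + b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)   # gated cell-state update
        h = o * np.tanh(c)           # new hidden state
    logits = Wy @ h
    p = np.exp(logits - logits.max())
    return p / p.sum()               # probability per category

# A document of 5 tokens with 8-dimensional feature vectors (assumed shapes).
probs = lstm_classify(np.random.default_rng(1).standard_normal((5, 8)))
```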
The second aspect of the invention provides a text classification device, comprising a to-be-classified document acquisition unit, a keyword determination unit, a splicing unit, a vectorization unit, and a recognition unit, which are communicatively connected in sequence, wherein:
the document to be classified acquisition unit is used for acquiring a document to be classified, wherein the document comprises at least one word;
the keyword determining unit is used for determining keywords in the documents to be classified according to at least one category of keyword set, wherein the keyword set comprises at least one keyword;
the splicing unit is used for respectively splicing each word in the document to be classified with each keyword in the document to be classified to obtain at least one spliced word of the document to be classified;
the vectorization unit is used for vectorizing and encoding all the segmentation words in each document to be classified and the corresponding splicing words to obtain at least one text feature vector;
the recognition unit is used for inputting the text feature vector into the trained neural network model to obtain the semantics corresponding to the document to be classified.
A third aspect of the present invention provides a text classification computer device, comprising a memory and a controller that are communicatively connected, wherein the memory stores a computer program and the controller is configured to read the computer program and execute the text classification method described in the first aspect and any of its possible designs. As specific examples, the memory may include, but is not limited to, random-access memory (RAM), read-only memory (ROM), flash memory, first-in first-out memory (FIFO), first-in last-out memory (FILO), and the like; the controller may be, but is not limited to, a microcontroller of the STM32F105 series. In addition, the computer device may include, but is not limited to, a power supply unit, a display screen, and other necessary components.
A fourth aspect of the invention provides a computer-readable storage medium having instructions stored thereon which, when run on a computer, perform the text classification method described in the first aspect and any of its possible designs.
It should be noted that, the working principles of the apparatus and medium disclosed in the second to fourth aspects of the present invention are described in the method of the first aspect, and are not described herein.
Although the present invention has been described with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described, or equivalents may be substituted for elements thereof, and any modifications, equivalents, improvements and changes may be made without departing from the spirit and principles of the present invention.

Claims (9)

1. A method of classifying text, comprising the steps of:
acquiring a document to be classified, wherein the document comprises at least one word;
determining keywords in the document to be classified according to at least one category of keyword sets, wherein the keyword sets comprise at least one keyword;
respectively splicing each word in the document to be classified with each keyword in the document to be classified to obtain at least one spliced word of the document to be classified;
vectorizing and encoding all the segmented words in each document to be classified and the corresponding spliced words to obtain at least one text feature vector;
inputting the text feature vector into the trained neural network model to obtain the semantics corresponding to the document to be classified.
2. A method of text classification as claimed in claim 1, wherein: the method further comprises the step of determining a keyword set of the category before the at least one document is acquired.
3. A method of text classification as claimed in claim 2, wherein: the determining of the keyword set of the category includes:
acquiring at least one document;
calculating a TF-IDF value for each word in each document of the at least one document, wherein the TF-IDF value is the product TF×IDF, TF being the term frequency of the word within its document and IDF being the inverse document frequency;
and determining and obtaining a keyword set of at least one category according to the TF-IDF value.
4. A method of text classification as claimed in claim 3, wherein: IDF = log(N1/(N2+1)) and TF = N3/N4,
where N1 is the total number of documents, N2 is the number of documents in which the word appears, N3 is the number of times the word appears in a given document, and N4 is the total number of words in that document.
5. A method of text classification as claimed in claim 3, wherein determining the keyword set of at least one category according to the TF-IDF values comprises the following steps:
filtering out words whose TF-IDF value is below a first threshold; and
filtering out words whose term frequency within each document is greater than a second threshold.
6. A method of text classification as claimed in claim 1, wherein: the neural network model is RNN, LSTM, transformer or BERT.
7. A text classification device, characterized by comprising a to-be-classified document acquisition unit, a keyword determination unit, a splicing unit, a vectorization unit, and a recognition unit, which are communicatively connected in sequence, wherein
the document to be classified acquisition unit is used for acquiring a document to be classified, wherein the document comprises at least one word;
the keyword determining unit is used for determining keywords in the documents to be classified according to at least one category of keyword set, wherein the keyword set comprises at least one keyword;
the splicing unit is used for respectively splicing each word in the document to be classified with each keyword in the document to be classified to obtain at least one spliced word of the document to be classified;
the vectorization unit is used for vectorizing and encoding all the segmentation words in each document to be classified and the corresponding splicing words to obtain at least one text feature vector;
the recognition unit is used for inputting the text feature vector into the trained neural network model to obtain the semantics corresponding to the document to be classified.
8. A text classification apparatus comprising a memory and a controller that are communicatively connected, wherein the memory stores a computer program, and the controller is configured to read the computer program and perform a text classification method according to any of claims 1-6.
9. A computer-readable storage medium having instructions stored thereon, characterized in that: the instructions, when run on a computer, perform the text classification method according to any of claims 1 to 6.
CN202310550978.4A 2023-05-16 2023-05-16 Text classification method, equipment and storage medium Pending CN116860963A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310550978.4A CN116860963A (en) 2023-05-16 2023-05-16 Text classification method, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116860963A true CN116860963A (en) 2023-10-10

Family

ID=88222247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310550978.4A Pending CN116860963A (en) 2023-05-16 2023-05-16 Text classification method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116860963A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117648473A (en) * 2024-01-29 2024-03-05 河北省中医院 File classification method and platform
CN117648473B (en) * 2024-01-29 2024-04-16 河北省中医院 File classification method and platform

Similar Documents

Publication Publication Date Title
TWI735543B (en) Method and device for webpage text classification, method and device for webpage text recognition
US10394956B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN104778158B (en) A kind of document representation method and device
CN110909160A (en) Regular expression generation method, server and computer readable storage medium
KR20180011254A (en) Web page training methods and devices, and search intent identification methods and devices
EP2562659A1 (en) Data mapping acceleration
CN110188077B (en) Intelligent classification method and device for electronic files, electronic equipment and storage medium
CN109033200A (en) Method, apparatus, equipment and the computer-readable medium of event extraction
CN103324666A (en) Topic tracing method and device based on micro-blog data
CN109471944A (en) Training method, device and the readable storage medium storing program for executing of textual classification model
CN112347223B (en) Document retrieval method, apparatus, and computer-readable storage medium
CN110619051A (en) Question and sentence classification method and device, electronic equipment and storage medium
CN112990035B (en) Text recognition method, device, equipment and storage medium
CN110046251A (en) Community content methods of risk assessment and device
CN109918658A (en) A kind of method and system obtaining target vocabulary from text
CN114780746A (en) Knowledge graph-based document retrieval method and related equipment thereof
CN112948575A (en) Text data processing method, text data processing device and computer-readable storage medium
CN112507167A (en) Method and device for identifying video collection, electronic equipment and storage medium
CN109271624A (en) A kind of target word determines method, apparatus and storage medium
CN116860963A (en) Text classification method, equipment and storage medium
CN115795030A (en) Text classification method and device, computer equipment and storage medium
CN114997288A (en) Design resource association method
CN110363206A (en) Cluster, data processing and the data identification method of data object
CN110688540A (en) Cheating account screening method, device, equipment and medium
CN110019763B (en) Text filtering method, system, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination