CN111723199A - Text classification method and device and computer readable storage medium - Google Patents

Text classification method and device and computer readable storage medium

Info

Publication number
CN111723199A
CN111723199A (application CN201910206324.3A)
Authority
CN
China
Prior art keywords
word
classification
text
similarity
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910206324.3A
Other languages
Chinese (zh)
Inventor
王三鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN201910206324.3A
Publication of CN111723199A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a text classification method, a text classification device and a computer-readable storage medium, in the technical field of artificial intelligence. The method comprises the following steps: calculating a word vector of each word in a target text; calculating the similarity between the word vector of each word and the word vector of each classification label; determining the attention probability of each word by using an attention model according to the similarity; and determining the classification to which the target text belongs by using a classifier model according to the similarity. This technical solution can improve the accuracy with which a computer classifies text.

Description

Text classification method and device and computer readable storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a text classification method, a text classification device, and a computer-readable storage medium.
Background
With the development of artificial intelligence technology, computers can perform semantic understanding on natural language texts. On the basis, the classification of the article can be determined by semantically understanding the description text of the article.
For example, classification tags for describing attributes of articles are arranged on the e-commerce platform, but a plurality of articles with incomplete tags often exist in the e-commerce platform. The titles of the articles are processed to determine the classification of the articles, so that the articles are supplemented with labels.
In the related art, the classification to which a title belongs is determined using an LSTM (Long Short-Term Memory) model according to the title text of an article and the one-hot codes of the classification tags, thereby determining the classification of the article.
Disclosure of Invention
The inventors of the present disclosure found that the following problems exist in the above-described related art: semantic association between each word and each label in the text cannot be deeply mined, and each word in the text is processed indiscriminately, so that the accuracy of a computer for text classification is low.
In view of this, the present disclosure provides a text classification technical solution, which can improve the accuracy of text classification by a computer.
According to some embodiments of the present disclosure, there is provided a method of classifying text, including: calculating a word vector of each word in the target text; calculating the similarity between the word vector of each word and the word vector of each classification label; and determining the classification of the target text by utilizing a classifier model according to the similarity.
In some embodiments, the attention probability of each word is determined using an attention model according to the similarity, and the classification to which the target text belongs is determined using the classifier model according to the attention probability of each word.
In some embodiments, the word vectors of the words are sorted according to the similarity to form a word vector sequence, and the word vector sequence is input into the attention model to determine the attention probability of each word.
In some embodiments, the mean of the word vectors of the classification labels is calculated, and the similarity between the word vector of each word and the mean is taken as the similarity between the word vector of each word and the word vector of each classification label.
In some embodiments, the word vector sequence is input into the attention model to determine the weight of the word vector of each word, and the attention probability of each word is determined according to the weight and the word vector of that word.
In some embodiments, the target text is a description text of the item, the classification tag is used for identifying the category of the item, and the classification to which the target text belongs is the category to which the item belongs.
In some embodiments, a cosine similarity of the word vector of the words to the mean is calculated.
According to other embodiments of the present disclosure, there is provided a text classification apparatus including: the calculation unit is used for calculating word vectors of all words in the target text and calculating the similarity between the word vectors of all the words and the word vectors of all the classification labels; and the determining unit is used for determining the classification of the target text by utilizing a classifier model according to the similarity.
According to still other embodiments of the present disclosure, there is provided a text classification apparatus including: a memory; and a processor coupled to the memory, the processor configured to perform the method of classifying text in any of the above embodiments based on instructions stored in the memory.
According to still further embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of classifying text in any of the above embodiments.
In the above embodiment, the classification of the text is determined according to the semantic similarity between each word in the text and the tag. Therefore, semantic association between the text and the label can be mined, and the importance of each word in the target text to classification can be evaluated, so that the accuracy of the computer to text classification is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 illustrates a flow diagram of some embodiments of a method of classification of text of the present disclosure;
FIG. 2 illustrates a flow diagram of some embodiments of step 120 in FIG. 1;
FIG. 3 illustrates a flow diagram of some embodiments of an attention probability determination method of the present disclosure;
FIG. 4 illustrates a block diagram of some embodiments of a classification apparatus of the text of the present disclosure;
FIG. 5 shows a block diagram of further embodiments of a classification apparatus of the text of the present disclosure;
fig. 6 shows a block diagram of further embodiments of a classification apparatus of the text of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Fig. 1 illustrates a flow diagram of some embodiments of a method of classifying text of the present disclosure.
As shown in fig. 1, the method includes: step 110, calculating a word vector; step 120, calculating similarity; and step 130, determining the classification of the target text.
In step 110, a word vector for each word in the target text is calculated. For example, the target text may be a description text of the item (e.g., a title of the item in the e-commerce platform, etc.).
In some embodiments, word segmentation techniques may be employed to segment the title of an item. For example, the title of the article is "lady autumn clothing T-shirt 2018 new autumn korean version fresco", and the result after the word segmentation is "lady", "autumn clothing", "T-shirt", "2018", "autumn", "new version", "korean version", "fresco".
In some embodiments, the word vector of each word in the target text may be obtained by training with the word2vec method. The dimension of the word vectors can be customized. For example, according to the order in which the words appear in the target text, a word vector set X = {x_1, x_2, …, x_n, …, x_N} can be obtained, where N is the number of words in the target text and 1 ≤ n ≤ N.
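The construction of the word vector set X can be sketched as follows. The embedding table is a toy stand-in for a trained word2vec model; the words and their 3-dimensional vectors are illustrative assumptions, not values from the patent.

```python
# Sketch of step 110: look up a vector for each word of the segmented title.
# The embedding table below is a toy stand-in for a trained word2vec model;
# the words and the 3-dimensional vectors are illustrative assumptions.
embeddings = {
    "lady":    [0.9, 0.1, 0.0],
    "autumn":  [0.2, 0.8, 0.1],
    "T-shirt": [0.7, 0.3, 0.2],
}

def word_vectors(words, table):
    """Build the word vector set X = {x_1, ..., x_N} in title order,
    skipping out-of-vocabulary words."""
    return [table[w] for w in words if w in table]

X = word_vectors(["lady", "autumn", "T-shirt"], embeddings)
```

In practice the table would be replaced by embeddings learned from a corpus of item titles.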
In step 120, the similarity between the word vector of each word and the word vector of each classification label is calculated. In some embodiments, a database of the e-commerce platform may store the classification tags corresponding to each category. For example, the classification tags may be words such as "light girls", "maiden", "mature girls", "middle aged and old", etc., used to identify the groups of people for whom the articles are suitable.
In some embodiments, word vectors of the class labels may be trained using the word2vec method. And calculating the similarity degree of the word vector of each word and the word vector of each classification label, and evaluating the correlation degree of each word and each classification label in the target text. The words which are important for classification in the target text can be selected based on the method, so that the accuracy of the target text classification is improved.
In some embodiments, the similarity may be calculated by the steps in fig. 2.
Fig. 2 illustrates a flow diagram of some embodiments of step 120 in fig. 1.
As shown in fig. 2, step 120 includes: step 1210, calculating a word vector mean value; and step 1220, calculating the similarity.
In step 1210, the mean of the word vectors of the classification labels is calculated. For example, if the e-commerce platform has M classification labels with word vectors y_1, y_2, …, y_M, the mean is

ȳ = (1/M) · (y_1 + y_2 + … + y_M)

In step 1220, the similarity between the word vector of each word and the mean is calculated as the similarity between the word vector of each word and the word vectors of the classification labels. For example, the cosine similarity

sim(x_n, ȳ) = (x_n · ȳ) / (‖x_n‖ ‖ȳ‖)

can be calculated for each of x_1, x_2, …, x_n, …, x_N as an evaluation of the importance to the classification of each word vector (i.e. each word) in the word vector set X. In this way, the degree of relevance of each word to all the classification labels can be evaluated as the basis of text classification, improving the efficiency and accuracy of text classification.
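The label-mean and cosine-similarity computation above can be sketched in a few lines. The two 2-dimensional toy label vectors are illustrative assumptions.

```python
import math

def mean_vector(vectors):
    """Step 1210: label mean y_bar = (1/M) * (y_1 + ... + y_M)."""
    m = len(vectors)
    return [sum(dim) / m for dim in zip(*vectors)]

def cosine_similarity(a, b):
    """Step 1220: sim(a, b) = (a . b) / (|a| * |b|)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm

labels = [[1.0, 0.0], [0.0, 1.0]]           # toy label word vectors y_1, y_2
y_bar = mean_vector(labels)                  # [0.5, 0.5]
sim = cosine_similarity([1.0, 1.0], y_bar)   # parallel to the mean -> 1.0
```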
In some embodiments, the cosine similarity of each word vector x_1, x_2, …, x_n, …, x_N to each of the label word vectors y_1, y_2, …, y_M may instead be calculated separately, and the mean of these cosine similarities taken as the similarity between the selected word vector and the word vectors of the classification labels.
In some embodiments, the similarity between the word vector of each word and the word vector of each classification label is obtained, and the classification of the target text can be determined through step 130 in fig. 1.
In step 130, the classification to which the target text belongs is determined by using a classifier model according to the similarity. For example, the classification to which the target text belongs is the category to which the item belongs.
In some embodiments, the title text of an article is mostly discrete text, i.e. the association and ordering between the words in the text is weak. For example, the words "lady", "autumn clothing" and "T-shirt" in the title text do not follow any fixed order. This property of the text may degrade the processing effectiveness of the LSTM model. Therefore, the text can instead be classified according to the importance of its words by using an Attention Model, thereby improving the classification accuracy.
For example, the attention probability of each word can be determined by using an attention model according to the similarity; and determining the classification of the target text by utilizing a classifier model according to the attention probability of each word.
In some embodiments, the output of the attention model may be used as an input of an MLP (Multi-Layer Perceptron) to determine a classification label corresponding to the target text, and thus determine the category of the item described by the target text.
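The final classification step could look like the sketch below: a toy one-hidden-layer perceptron maps the pooled attention output to a label index. The layer sizes, weights, and pooling are illustrative assumptions, not the patent's trained MLP.

```python
def mlp_classify(features, W1, W2):
    """Toy one-hidden-layer perceptron: the pooled attention output is
    mapped through a ReLU hidden layer, and the index of the largest
    output logit selects the classification label. Layer sizes and
    weights are illustrative assumptions, not the patent's model."""
    hidden = [max(0.0, sum(f * w for f, w in zip(features, unit)))
              for unit in W1]
    logits = [sum(h * w for h, w in zip(hidden, unit)) for unit in W2]
    return max(range(len(logits)), key=lambda i: logits[i])

features = [0.6, 0.4]                  # pooled attention output (toy values)
W1 = [[1.0, 0.0], [0.0, 1.0]]          # weights of 2 hidden units
W2 = [[1.0, 0.0], [0.0, 1.0]]          # weights of 2 candidate labels
label_index = mlp_classify(features, W1, W2)   # first label wins (0.6 > 0.4)
```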
In some embodiments, the attention probability may be determined by the steps in fig. 3.
Fig. 3 illustrates a flow diagram of some embodiments of a method of determining attention probability of the present disclosure.
As shown in fig. 3, the method includes: step 310, ordering the word vectors; and step 320, determining the attention probability.
In step 310, the word vectors of the words are sorted by similarity to form a word vector sequence. For example, the word vectors in the set X are reordered from largest to smallest similarity to the label mean to obtain the word vector sequence X′ = {x′_1, x′_2, …, x′_n, …, x′_N}.
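The reordering of step 310 can be sketched directly; the 1-dimensional toy vectors and similarity scores are illustrative assumptions.

```python
def sort_by_similarity(X, sims):
    """Step 310: reorder word vectors by similarity, largest first.
    sims[n] is the similarity of X[n] to the label mean."""
    order = sorted(range(len(X)), key=lambda n: sims[n], reverse=True)
    return [X[n] for n in order]

X = [[0.1], [0.9], [0.5]]                # toy word vectors
sims = [0.2, 0.8, 0.5]                   # toy similarities to the label mean
X_prime = sort_by_similarity(X, sims)    # [[0.9], [0.5], [0.1]]
```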
In step 320, the sequence of word vectors is input into an attention model to determine an attention probability for each word.
In some embodiments, the word vector sequence is input into the attention model to determine the weight of each word vector. For example, the sequence X′ is input into a trained attention model, which assigns each word vector x′_n in X′ a weight from {W_1, W_2, …, W_n, …, W_N}.
Then, the attention probability of each word is determined from its weight and word vector; the attention probability describes how important the word is to the target text. For example, the attention probability of x′_n may be P_n = Softmax(x′_n · W_n), where Softmax() is the normalized exponential function.
With the above embodiments, the word vector set of the target text can be reordered into a word vector sequence according to each word's importance (similarity) to the classification. Therefore, on the one hand, the processing efficiency and accuracy of text classification can be improved in the classification stage; on the other hand, the training efficiency of the weights in the attention model can be improved in the training stage.
In some embodiments, the product P_n · x′_n may be taken as the output of the attention model. Having determined the attention probabilities of the words, the classification to which the target text belongs may be determined by step 130 of FIG. 1.
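The attention computation P_n = Softmax(x′_n · W_n) followed by the P_n · x′_n output can be sketched as follows. The fixed toy weight list W and the choice to normalize the scores across positions are assumptions; in the patent the weights come from a trained attention model.

```python
import math

def attention_output(X_prime, W):
    """Score each position as x'_n . W_n, normalize the scores with
    Softmax to obtain attention probabilities P_n, and weight each
    word vector by its probability (the model's output P_n * x'_n)."""
    scores = [sum(x * w for x, w in zip(xv, wv)) for xv, wv in zip(X_prime, W)]
    peak = max(scores)
    exp = [math.exp(s - peak) for s in scores]   # numerically stable softmax
    total = sum(exp)
    P = [e / total for e in exp]
    out = [[p * x for x in xv] for p, xv in zip(P, X_prime)]
    return out, P

X_prime = [[1.0, 0.0], [0.0, 1.0]]   # toy sorted word vectors
W = [[2.0, 0.0], [1.0, 0.0]]         # toy attention weights
out, P = attention_output(X_prime, W)
# score 2.0 vs 0.0: the first word receives the larger attention probability
```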
In the above embodiment, the classification of the text is determined according to the semantic similarity between each word in the text and the tag. Therefore, semantic association between the text and the label can be mined, and the importance of each word in the target text to classification can be evaluated, so that the accuracy of the computer to text classification is improved.
Fig. 4 illustrates a block diagram of some embodiments of a classification apparatus of the text of the present disclosure.
As shown in fig. 4, the text classification device 4 includes a calculation unit 41 and a determination unit 42.
The calculation unit 41 calculates a word vector of each word in the target text, and calculates the similarity of the word vector of each word and the word vector of each classification label.
In some embodiments, the calculation unit 41 calculates a mean value of the word vectors of the respective classification labels. The calculation unit 41 calculates the similarity between the word vector of each word and the mean as the similarity between the word vector of each word and the word vector of each classification label.
The determining unit 42 determines the classification to which the target text belongs using the classifier model according to the similarity. For example, the determination unit 42 determines the attention probability of each word using the attention model according to the similarity. The determination unit 42 also determines a classification to which the target text belongs using a classifier model based on the attention probability of each word. For example, the target text is a description text of the article, the classification tag is used for identifying the category of the article, and the category to which the target text belongs is the category to which the article belongs.
In some embodiments, the determining unit 42 orders the word vectors of the words according to the similarity to form a word vector sequence. The determining unit 42 inputs the word vector sequence into the attention model, and determines the attention probability of each word.
In some embodiments, the determining unit 42 inputs the sequence of word vectors into the attention model, determining the weights of the word vectors for the words. The determining unit 42 determines the attention probability of the corresponding word based on the weight and the word vector of the corresponding word.
In the above embodiment, the classification of the text is determined according to the semantic similarity between each word in the text and the tag. Therefore, semantic association between the text and the label can be mined, and the importance of each word in the target text to classification can be evaluated, so that the accuracy of the computer to text classification is improved.
Fig. 5 shows a block diagram of further embodiments of a classification apparatus of the text of the present disclosure.
As shown in fig. 5, the text classification device 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51, the processor 52 being configured to execute a method of classifying text in any one of the embodiments of the present disclosure based on instructions stored in the memory 51.
The memory 51 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), a database, and other programs.
Fig. 6 shows a block diagram of further embodiments of a classification apparatus of the text of the present disclosure.
As shown in fig. 6, the text classification device 6 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 being configured to perform the method of classifying text in any of the embodiments described above based on instructions stored in the memory 610.
The memory 610 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), and other programs.
The text classification means 6 may further comprise an input output interface 630, a network interface 640, a storage interface 650, etc. These interfaces 630, 640, 650 and the connections between the memory 610 and the processor 620 may be through a bus 660, for example. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networking devices. The storage interface 650 provides a connection interface for external storage devices such as an SD card and a usb disk.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
So far, a classification method of a text, a classification apparatus of a text, and a computer-readable storage medium according to the present disclosure have been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
The method and system of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are for purposes of illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (10)

1. A method of classifying text, comprising:
calculating a word vector of each word in the target text;
calculating the similarity between the word vector of each word and the word vector of each classification label;
and determining the classification of the target text by utilizing a classifier model according to the similarity.
2. The classification method of claim 1, wherein determining the classification to which the target text belongs comprises:
according to the similarity, determining the attention probability of each word by using an attention model;
and determining the classification of the target text by utilizing the classifier model according to the attention probability of each word.
3. The classification method of claim 2, wherein determining the attention probability of the words comprises:
sequencing the word vectors of the words according to the similarity to form a word vector sequence;
and inputting the word vector sequence into the attention model, and determining the attention probability of each word.
4. The classification method according to claim 1, wherein calculating the similarity of the word vector of each word and the word vector of each classification label comprises:
calculating the mean value of the word vectors of all the classification labels;
and calculating the similarity between the word vector of each word and the mean value to serve as the similarity between the word vector of each word and the word vector of each classification label.
5. The classification method of claim 3, wherein determining the attention probability of the words comprises:
inputting the word vector sequence into the attention model, and determining the weight of the word vector of each word;
and determining the attention probability of the corresponding word according to the weight and the word vector of the corresponding word.
6. The classification method according to claim 4, wherein calculating the similarity of the word vector of each word to the mean comprises:
and calculating the cosine similarity between the word vector of each word and the mean value.
7. The classification method according to any one of claims 1 to 6,
the target text is a description text of the article, the classification label is used for identifying the category of the article, and the category to which the target text belongs is the category to which the article belongs.
8. A device for classifying text, comprising:
the calculation unit is used for calculating word vectors of all words in the target text and calculating the similarity between the word vectors of all the words and the word vectors of all the classification labels;
and the determining unit is used for determining the classification of the target text by utilizing a classifier model according to the similarity.
9. A device for classifying text, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of classifying text of any of claims 1-7 based on instructions stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of classifying text according to any one of claims 1 to 7.
CN201910206324.3A 2019-03-19 2019-03-19 Text classification method and device and computer readable storage medium Pending CN111723199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910206324.3A CN111723199A (en) 2019-03-19 2019-03-19 Text classification method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910206324.3A CN111723199A (en) 2019-03-19 2019-03-19 Text classification method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111723199A true CN111723199A (en) 2020-09-29

Family

ID=72563030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910206324.3A Pending CN111723199A (en) 2019-03-19 2019-03-19 Text classification method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111723199A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818117A (en) * 2021-01-19 2021-05-18 新华智云科技有限公司 Label mapping method, system and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491541A (en) * 2017-08-24 2017-12-19 北京丁牛科技有限公司 File classification method and device
CN107609121A (en) * 2017-09-14 2018-01-19 深圳市玛腾科技有限公司 Newsletter archive sorting technique based on LDA and word2vec algorithms
CN108563722A (en) * 2018-04-03 2018-09-21 有米科技股份有限公司 Trade classification method, system, computer equipment and the storage medium of text message
CN108647205A (en) * 2018-05-02 2018-10-12 深圳前海微众银行股份有限公司 Fine granularity sentiment analysis model building method, equipment and readable storage medium storing program for executing
CN109189933A (en) * 2018-09-14 2019-01-11 腾讯科技(深圳)有限公司 A kind of method and server of text information classification


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HONG Wenxue et al., "Information Fusion and Pattern Recognition Techniques Based on the Principle of Multivariate Statistical Graph Representation", National Defense Industry Press, pages 153-156 *


Similar Documents

Publication Publication Date Title
US10643109B2 (en) Method and system for automatically classifying data expressed by a plurality of factors with values of text word and symbol sequence by using deep learning
JP6526329B2 (en) Web page training method and apparatus, search intention identification method and apparatus
JP5424001B2 (en) LEARNING DATA GENERATION DEVICE, REQUESTED EXTRACTION EXTRACTION SYSTEM, LEARNING DATA GENERATION METHOD, AND PROGRAM
US11521372B2 (en) Utilizing machine learning models, position based extraction, and automated data labeling to process image-based documents
CN109933686B (en) Song label prediction method, device, server and storage medium
CN112069321B (en) Method, electronic device and storage medium for text hierarchical classification
US20120136812A1 (en) Method and system for machine-learning based optimization and customization of document similarities calculation
CN104834651B (en) Method and device for providing high-frequency question answers
CA3059929C (en) Text searching method, apparatus, and non-transitory computer-readable storage medium
CN110019653B (en) Social content representation method and system fusing text and tag network
CN111666766A (en) Data processing method, device and equipment
WO2022174496A1 (en) Data annotation method and apparatus based on generative model, and device and storage medium
CN113221918B (en) Target detection method, training method and device of target detection model
CN110827112A (en) Deep learning commodity recommendation method and device, computer equipment and storage medium
CN111881671A (en) Attribute word extraction method
CN109271624A (en) A kind of target word determines method, apparatus and storage medium
Jayady et al. Theme Identification using Machine Learning Techniques
CN111611395A (en) Entity relationship identification method and device
CN111723199A (en) Text classification method and device and computer readable storage medium
JP2014021757A (en) Content evaluation value prediction device, method and program
CN111488400B (en) Data classification method, device and computer readable storage medium
CN112328655A (en) Text label mining method, device, equipment and storage medium
Paik et al. Malware family prediction with an awareness of label uncertainty
CN111488452A (en) Webpage tampering detection method, detection system and related equipment
CN113705692B (en) Emotion classification method and device based on artificial intelligence, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination