CN111723199A - Text classification method and device and computer readable storage medium - Google Patents

Text classification method and device and computer readable storage medium

Info

Publication number
CN111723199A
CN111723199A (application CN201910206324.3A)
Authority
CN
China
Prior art keywords
word
classification
text
similarity
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910206324.3A
Other languages
Chinese (zh)
Inventor
王三鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN201910206324.3A
Publication of CN111723199A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a text classification method, a text classification device and a computer-readable storage medium, in the technical field of artificial intelligence. The method comprises the following steps: calculating a word vector of each word in a target text; calculating the similarity between the word vector of each word and the word vector of each classification label; determining the attention probability of each word by using an attention model according to the similarity; and determining the classification to which the target text belongs by using a classifier model according to the similarity. This technical solution can improve the accuracy with which a computer classifies text.

Description

Text classification method and device and computer readable storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a text classification method, a text classification device, and a computer-readable storage medium.
Background
With the development of artificial intelligence technology, computers can perform semantic understanding on natural language texts. On the basis, the classification of the article can be determined by semantically understanding the description text of the article.
For example, classification tags for describing attributes of articles are arranged on the e-commerce platform, but a plurality of articles with incomplete tags often exist in the e-commerce platform. The titles of the articles are processed to determine the classification of the articles, so that the articles are supplemented with labels.
In the related art, the classification to which a title belongs is determined using an LSTM (Long Short-Term Memory) model according to the title text of an article and the one-hot codes of the classification tags, thereby determining the classification of the article.
Disclosure of Invention
The inventors of the present disclosure found that the following problems exist in the above-described related art: semantic association between each word and each label in the text cannot be deeply mined, and each word in the text is processed indiscriminately, so that the accuracy of a computer for text classification is low.
In view of this, the present disclosure provides a text classification technical solution, which can improve the accuracy of text classification by a computer.
According to some embodiments of the present disclosure, there is provided a method of classifying text, including: calculating a word vector of each word in the target text; calculating the similarity between the word vector of each word and the word vector of each classification label; and determining the classification of the target text by utilizing a classifier model according to the similarity.
In some embodiments, the attention probability of each word is determined using an attention model according to the similarity, and the classification to which the target text belongs is determined using the classifier model according to the attention probability of each word.
In some embodiments, the word vectors of the words are sorted according to the similarity to form a word vector sequence, and the word vector sequence is input into the attention model to determine the attention probability of each word.
In some embodiments, the mean of the word vectors of the classification labels is calculated, and the similarity between the word vector of each word and the mean is taken as the similarity between the word vector of each word and the word vector of each classification label.
In some embodiments, the word vector sequence is input into the attention model to determine the weight of the word vector of each word, and the attention probability of each word is determined according to the weight and the word vector of that word.
In some embodiments, the target text is a description text of the item, the classification tag is used for identifying the category of the item, and the classification to which the target text belongs is the category to which the item belongs.
In some embodiments, a cosine similarity of the word vector of the words to the mean is calculated.
According to other embodiments of the present disclosure, there is provided a text classification apparatus including: the calculation unit is used for calculating word vectors of all words in the target text and calculating the similarity between the word vectors of all the words and the word vectors of all the classification labels; and the determining unit is used for determining the classification of the target text by utilizing a classifier model according to the similarity.
According to still other embodiments of the present disclosure, there is provided a text classification apparatus including: a memory; and a processor coupled to the memory, the processor configured to perform the method of classifying text in any of the above embodiments based on instructions stored in the memory.
According to still further embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of classifying text in any of the above embodiments.
In the above embodiment, the classification of the text is determined according to the semantic similarity between each word in the text and the tag. Therefore, semantic association between the text and the label can be mined, and the importance of each word in the target text to classification can be evaluated, so that the accuracy of the computer to text classification is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 illustrates a flow diagram of some embodiments of a method of classification of text of the present disclosure;
FIG. 2 illustrates a flow diagram of some embodiments of step 120 in FIG. 1;
FIG. 3 illustrates a flow diagram of some embodiments of an attention probability determination method of the present disclosure;
FIG. 4 illustrates a block diagram of some embodiments of a classification apparatus of the text of the present disclosure;
FIG. 5 shows a block diagram of further embodiments of a classification apparatus of the text of the present disclosure;
fig. 6 shows a block diagram of further embodiments of a classification apparatus of the text of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Fig. 1 illustrates a flow diagram of some embodiments of a method of classifying text of the present disclosure.
As shown in fig. 1, the method includes: step 110, calculating a word vector; step 120, calculating similarity; and step 130, determining the classification of the target text.
In step 110, a word vector for each word in the target text is calculated. For example, the target text may be a description text of the item (e.g., a title of the item in the e-commerce platform, etc.).
In some embodiments, word segmentation techniques may be employed to segment the title of an item. For example, the title of the article is "lady autumn clothing T-shirt 2018 new autumn korean version fresco", and the result after the word segmentation is "lady", "autumn clothing", "T-shirt", "2018", "autumn", "new version", "korean version", "fresco".
In some embodiments, the word vector of each word in the target text may be obtained by training with the word2vec method. The dimension of the word vectors can be customized. For example, according to the order in which the words appear in the target text, a word vector set X = {x_1, x_2, …, x_n, …, x_N} can be obtained, where N is the number of words in the target text and 1 ≤ n ≤ N.
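The construction of the word vector set X can be sketched as follows. The embedding table is a toy stand-in for a trained word2vec model; the words and their 3-dimensional vectors are illustrative assumptions, not values from the patent.

```python
# Sketch of step 110: look up a vector for each word of the segmented title.
# The embedding table below is a toy stand-in for a trained word2vec model;
# the words and the 3-dimensional vectors are illustrative assumptions.
embeddings = {
    "lady":    [0.9, 0.1, 0.0],
    "autumn":  [0.2, 0.8, 0.1],
    "T-shirt": [0.7, 0.3, 0.2],
}

def word_vectors(words, table):
    """Build the word vector set X = {x_1, ..., x_N} in title order,
    skipping out-of-vocabulary words."""
    return [table[w] for w in words if w in table]

X = word_vectors(["lady", "autumn", "T-shirt"], embeddings)
```

In practice the table would be replaced by embeddings learned from a corpus of item titles.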
In step 120, the similarity between the word vector of each word and the word vector of each classification label is calculated. In some embodiments, a database of the e-commerce platform may store the classification tags corresponding to each category. For example, the classification tags may be words such as "light girls", "maiden", "mature girls", "middle aged and old", etc., used to identify the groups of people for whom the articles are suitable.
In some embodiments, word vectors of the class labels may be trained using the word2vec method. And calculating the similarity degree of the word vector of each word and the word vector of each classification label, and evaluating the correlation degree of each word and each classification label in the target text. The words which are important for classification in the target text can be selected based on the method, so that the accuracy of the target text classification is improved.
In some embodiments, the similarity may be calculated by the steps in fig. 2.
Fig. 2 illustrates a flow diagram of some embodiments of step 120 in fig. 1.
As shown in fig. 2, step 120 includes: step 1210, calculating a word vector mean value; and step 1220, calculating the similarity.
In step 1210, the mean of the word vectors of the classification labels is calculated. For example, if the e-commerce platform has M classification labels with word vectors y_1, y_2, …, y_M, the mean is

ȳ = (1/M) · (y_1 + y_2 + … + y_M)

In step 1220, the similarity between the word vector of each word and the mean is calculated as the similarity between the word vector of each word and the word vectors of the classification labels. For example, the cosine similarity

sim(x_n, ȳ) = (x_n · ȳ) / (‖x_n‖ ‖ȳ‖)

can be calculated for each of x_1, x_2, …, x_n, …, x_N as an evaluation of the importance to the classification of each word vector (i.e. each word) in the word vector set X. In this way, the degree of relevance of each word to all the classification labels can be evaluated as the basis of text classification, improving the efficiency and accuracy of text classification.
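The label-mean and cosine-similarity computation above can be sketched in a few lines. The two 2-dimensional toy label vectors are illustrative assumptions.

```python
import math

def mean_vector(vectors):
    """Step 1210: label mean y_bar = (1/M) * (y_1 + ... + y_M)."""
    m = len(vectors)
    return [sum(dim) / m for dim in zip(*vectors)]

def cosine_similarity(a, b):
    """Step 1220: sim(a, b) = (a . b) / (|a| * |b|)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm

labels = [[1.0, 0.0], [0.0, 1.0]]           # toy label word vectors y_1, y_2
y_bar = mean_vector(labels)                  # [0.5, 0.5]
sim = cosine_similarity([1.0, 1.0], y_bar)   # parallel to the mean -> 1.0
```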
In some embodiments, the cosine similarity of each word vector x_1, x_2, …, x_n, …, x_N to each of the label word vectors y_1, y_2, …, y_M may instead be calculated separately, and the mean of these cosine similarities taken as the similarity between the selected word vector and the word vectors of the classification labels.
In some embodiments, the similarity between the word vector of each word and the word vector of each classification label is obtained, and the classification of the target text can be determined through step 130 in fig. 1.
In step 130, the classification to which the target text belongs is determined by using a classifier model according to the similarity. For example, the classification to which the target text belongs is the category to which the item belongs.
In some embodiments, the title text of an article is mostly discrete text, i.e. the association and ordering between the words in the text is weak. For example, the words "lady", "autumn clothing" and "T-shirt" in the title text do not follow any fixed order. This property of the text may degrade the processing effectiveness of the LSTM model. Therefore, the text can instead be classified according to the importance of its words by using an Attention Model, thereby improving the classification accuracy.
For example, the attention probability of each word can be determined by using an attention model according to the similarity; and determining the classification of the target text by utilizing a classifier model according to the attention probability of each word.
In some embodiments, the output of the attention model may be used as an input of an MLP (Multi-Layer Perceptron) to determine a classification label corresponding to the target text, and thus determine the category of the item described by the target text.
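The final classification step could look like the sketch below: a toy one-hidden-layer perceptron maps the pooled attention output to a label index. The layer sizes, weights, and pooling are illustrative assumptions, not the patent's trained MLP.

```python
def mlp_classify(features, W1, W2):
    """Toy one-hidden-layer perceptron: the pooled attention output is
    mapped through a ReLU hidden layer, and the index of the largest
    output logit selects the classification label. Layer sizes and
    weights are illustrative assumptions, not the patent's model."""
    hidden = [max(0.0, sum(f * w for f, w in zip(features, unit)))
              for unit in W1]
    logits = [sum(h * w for h, w in zip(hidden, unit)) for unit in W2]
    return max(range(len(logits)), key=lambda i: logits[i])

features = [0.6, 0.4]                  # pooled attention output (toy values)
W1 = [[1.0, 0.0], [0.0, 1.0]]          # weights of 2 hidden units
W2 = [[1.0, 0.0], [0.0, 1.0]]          # weights of 2 candidate labels
label_index = mlp_classify(features, W1, W2)   # first label wins (0.6 > 0.4)
```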
In some embodiments, the attention probability may be determined by the steps in fig. 3.
Fig. 3 illustrates a flow diagram of some embodiments of a method of determining attention probability of the present disclosure.
As shown in fig. 3, the method includes: step 310, ordering the word vectors; and step 320, determining the attention probability.
In step 310, the word vectors of the words are sorted by similarity to form a word vector sequence. For example, the word vectors in the set X are reordered from largest to smallest similarity to the label mean to obtain the word vector sequence X′ = {x′_1, x′_2, …, x′_n, …, x′_N}.
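The reordering of step 310 can be sketched directly; the 1-dimensional toy vectors and similarity scores are illustrative assumptions.

```python
def sort_by_similarity(X, sims):
    """Step 310: reorder word vectors by similarity, largest first.
    sims[n] is the similarity of X[n] to the label mean."""
    order = sorted(range(len(X)), key=lambda n: sims[n], reverse=True)
    return [X[n] for n in order]

X = [[0.1], [0.9], [0.5]]                # toy word vectors
sims = [0.2, 0.8, 0.5]                   # toy similarities to the label mean
X_prime = sort_by_similarity(X, sims)    # [[0.9], [0.5], [0.1]]
```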
In step 320, the sequence of word vectors is input into an attention model to determine an attention probability for each word.
In some embodiments, the word vector sequence is input into the attention model to determine the weight of each word vector. For example, the sequence X′ is input into a trained attention model, which assigns each word vector x′_n in X′ a weight from {W_1, W_2, …, W_n, …, W_N}.
Then, the attention probability of each word is determined from its weight and word vector; the attention probability describes how important the word is to the target text. For example, the attention probability of x′_n may be P_n = Softmax(x′_n · W_n), where Softmax() is the normalized exponential function.
With the above embodiments, the word vector set of the target text can be reordered into a word vector sequence according to each word's importance (similarity) to the classification. Therefore, on the one hand, the processing efficiency and accuracy of text classification can be improved in the classification stage; on the other hand, the training efficiency of the weights in the attention model can be improved in the training stage.
In some embodiments, the product P_n · x′_n may be taken as the output of the attention model. Having determined the attention probabilities of the words, the classification to which the target text belongs may be determined by step 130 of FIG. 1.
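The attention computation P_n = Softmax(x′_n · W_n) followed by the P_n · x′_n output can be sketched as follows. The fixed toy weight list W and the choice to normalize the scores across positions are assumptions; in the patent the weights come from a trained attention model.

```python
import math

def attention_output(X_prime, W):
    """Score each position as x'_n . W_n, normalize the scores with
    Softmax to obtain attention probabilities P_n, and weight each
    word vector by its probability (the model's output P_n * x'_n)."""
    scores = [sum(x * w for x, w in zip(xv, wv)) for xv, wv in zip(X_prime, W)]
    peak = max(scores)
    exp = [math.exp(s - peak) for s in scores]   # numerically stable softmax
    total = sum(exp)
    P = [e / total for e in exp]
    out = [[p * x for x in xv] for p, xv in zip(P, X_prime)]
    return out, P

X_prime = [[1.0, 0.0], [0.0, 1.0]]   # toy sorted word vectors
W = [[2.0, 0.0], [1.0, 0.0]]         # toy attention weights
out, P = attention_output(X_prime, W)
# score 2.0 vs 0.0: the first word receives the larger attention probability
```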
In the above embodiment, the classification of the text is determined according to the semantic similarity between each word in the text and the tag. Therefore, semantic association between the text and the label can be mined, and the importance of each word in the target text to classification can be evaluated, so that the accuracy of the computer to text classification is improved.
Fig. 4 illustrates a block diagram of some embodiments of a classification apparatus of the text of the present disclosure.
As shown in fig. 4, the text classification device 4 includes a calculation unit 41 and a determination unit 42.
The calculation unit 41 calculates a word vector of each word in the target text, and calculates the similarity of the word vector of each word and the word vector of each classification label.
In some embodiments, the calculation unit 41 calculates a mean value of the word vectors of the respective classification labels. The calculation unit 41 calculates the similarity between the word vector of each word and the mean as the similarity between the word vector of each word and the word vector of each classification label.
The determining unit 42 determines the classification to which the target text belongs using the classifier model according to the similarity. For example, the determination unit 42 determines the attention probability of each word using the attention model according to the similarity. The determination unit 42 also determines a classification to which the target text belongs using a classifier model based on the attention probability of each word. For example, the target text is a description text of the article, the classification tag is used for identifying the category of the article, and the category to which the target text belongs is the category to which the article belongs.
In some embodiments, the determining unit 42 orders the word vectors of the words according to the similarity to form a word vector sequence. The determining unit 42 inputs the word vector sequence into the attention model, and determines the attention probability of each word.
In some embodiments, the determining unit 42 inputs the sequence of word vectors into the attention model, determining the weights of the word vectors for the words. The determining unit 42 determines the attention probability of the corresponding word based on the weight and the word vector of the corresponding word.
In the above embodiment, the classification of the text is determined according to the semantic similarity between each word in the text and the tag. Therefore, semantic association between the text and the label can be mined, and the importance of each word in the target text to classification can be evaluated, so that the accuracy of the computer to text classification is improved.
Fig. 5 shows a block diagram of further embodiments of a classification apparatus of the text of the present disclosure.
As shown in fig. 5, the text classification device 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51, the processor 52 being configured to execute a method of classifying text in any one of the embodiments of the present disclosure based on instructions stored in the memory 51.
The memory 51 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), a database, and other programs.
Fig. 6 shows a block diagram of further embodiments of a classification apparatus of the text of the present disclosure.
As shown in fig. 6, the text classification device 6 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 being configured to perform the method of classifying text in any of the embodiments described above based on instructions stored in the memory 610.
The memory 610 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), and other programs.
The text classification means 6 may further comprise an input output interface 630, a network interface 640, a storage interface 650, etc. These interfaces 630, 640, 650 and the connections between the memory 610 and the processor 620 may be through a bus 660, for example. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networking devices. The storage interface 650 provides a connection interface for external storage devices such as an SD card and a usb disk.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
So far, a classification method of a text, a classification apparatus of a text, and a computer-readable storage medium according to the present disclosure have been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
The method and system of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are for purposes of illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (10)

1. A method of classifying text, comprising:
calculating a word vector of each word in the target text;
calculating the similarity between the word vector of each word and the word vector of each classification label;
and determining the classification of the target text by utilizing a classifier model according to the similarity.
2. The classification method of claim 1, wherein determining the classification to which the target text belongs comprises:
according to the similarity, determining the attention probability of each word by using an attention model;
and determining the classification of the target text by utilizing the classifier model according to the attention probability of each word.
3. The classification method of claim 2, wherein determining the attention probability of the words comprises:
sequencing the word vectors of the words according to the similarity to form a word vector sequence;
and inputting the word vector sequence into the attention model, and determining the attention probability of each word.
4. The classification method according to claim 1, wherein calculating the similarity of the word vector of each word and the word vector of each classification label comprises:
calculating the mean value of the word vectors of all the classification labels;
and calculating the similarity between the word vector of each word and the mean value to serve as the similarity between the word vector of each word and the word vector of each classification label.
5. The classification method of claim 3, wherein determining the attention probability of the words comprises:
inputting the word vector sequence into the attention model, and determining the weight of the word vector of each word;
and determining the attention probability of the corresponding word according to the weight and the word vector of the corresponding word.
6. The classification method according to claim 4, wherein calculating the similarity of the word vector of each word to the mean comprises:
and calculating the cosine similarity between the word vector of each word and the mean value.
7. The classification method according to any one of claims 1 to 6,
the target text is a description text of the article, the classification label is used for identifying the category of the article, and the category to which the target text belongs is the category to which the article belongs.
8. A device for classifying text, comprising:
the calculation unit is used for calculating word vectors of all words in the target text and calculating the similarity between the word vectors of all the words and the word vectors of all the classification labels;
and the determining unit is used for determining the classification of the target text by utilizing a classifier model according to the similarity.
9. A device for classifying text, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of classifying text of any of claims 1-7 based on instructions stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of classifying text according to any one of claims 1 to 7.
CN201910206324.3A 2019-03-19 2019-03-19 Text classification method and device and computer readable storage medium Pending CN111723199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910206324.3A CN111723199A (en) 2019-03-19 2019-03-19 Text classification method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910206324.3A CN111723199A (en) 2019-03-19 2019-03-19 Text classification method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111723199A true CN111723199A (en) 2020-09-29

Family

ID=72563030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910206324.3A Pending CN111723199A (en) 2019-03-19 2019-03-19 Text classification method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111723199A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818117A (en) * 2021-01-19 2021-05-18 新华智云科技有限公司 Label mapping method, system and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491541A (en) * 2017-08-24 2017-12-19 北京丁牛科技有限公司 File classification method and device
CN107609121A (en) * 2017-09-14 2018-01-19 深圳市玛腾科技有限公司 Newsletter archive sorting technique based on LDA and word2vec algorithms
CN108563722A (en) * 2018-04-03 2018-09-21 有米科技股份有限公司 Trade classification method, system, computer equipment and the storage medium of text message
CN108647205A (en) * 2018-05-02 2018-10-12 深圳前海微众银行股份有限公司 Fine granularity sentiment analysis model building method, equipment and readable storage medium storing program for executing
CN109189933A (en) * 2018-09-14 2019-01-11 腾讯科技(深圳)有限公司 A kind of method and server of text information classification


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HONG Wenxue et al., "Information Fusion and Pattern Recognition Techniques Based on the Principle of Multivariate Statistical Graph Representation", National Defense Industry Press, pages 153-156 *


Similar Documents

Publication Publication Date Title
US10643109B2 (en) Method and system for automatically classifying data expressed by a plurality of factors with values of text word and symbol sequence by using deep learning
JP6526329B2 (en) Web page training method and apparatus, search intention identification method and apparatus
JP5424001B2 (en) LEARNING DATA GENERATION DEVICE, REQUESTED EXTRACTION EXTRACTION SYSTEM, LEARNING DATA GENERATION METHOD, AND PROGRAM
US11521372B2 (en) Utilizing machine learning models, position based extraction, and automated data labeling to process image-based documents
CN109933686B (en) Song label prediction method, device, server and storage medium
CN112069321B (en) Method, electronic device and storage medium for text hierarchical classification
US20120136812A1 (en) Method and system for machine-learning based optimization and customization of document similarities calculation
CN104834651B (en) Method and device for providing high-frequency question answers
CA3059929C (en) Text searching method, apparatus, and non-transitory computer-readable storage medium
CN110019653B (en) Social content representation method and system fusing text and tag network
CN111666766A (en) Data processing method, device and equipment
WO2022174496A1 (en) Data annotation method and apparatus based on generative model, and device and storage medium
CN113221918B (en) Target detection method, training method and device of target detection model
CN110827112A (en) Deep learning commodity recommendation method and device, computer equipment and storage medium
CN111881671A (en) Attribute word extraction method
CN109271624A (en) A kind of target word determines method, apparatus and storage medium
Jayady et al. Theme Identification using Machine Learning Techniques
CN111611395A (en) Entity relationship identification method and device
CN111723199A (en) Text classification method and device and computer readable storage medium
JP2014021757A (en) Content evaluation value prediction device, method and program
CN111488400B (en) Data classification method, device and computer readable storage medium
CN112328655A (en) Text label mining method, device, equipment and storage medium
Paik et al. Malware family prediction with an awareness of label uncertainty
CN111488452A (en) Webpage tampering detection method, detection system and related equipment
CN113705692B (en) Emotion classification method and device based on artificial intelligence, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination