CN113297379A - Text data multi-label classification method and device

Info

Publication number
CN113297379A
Authority
CN
China
Prior art keywords
label
word
text data
text
classification
Legal status
Pending
Application number
CN202110569710.6A
Other languages
Chinese (zh)
Inventor
胡任之
陈培华
Current Assignee
Good Diagnosis Shanghai Information Technology Co., Ltd.
Original Assignee
Good Diagnosis Shanghai Information Technology Co., Ltd.
Application filed by Good Diagnosis Shanghai Information Technology Co., Ltd.
Priority to CN202110569710.6A
Publication of CN113297379A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G06F 16/353 - Clustering; Classification into predefined classes
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks


Abstract

A text data multi-label classification method and device comprise the following steps: preprocessing text data to be analyzed to obtain a character/word sequence; inputting the character/word sequence into a multi-label classification model to obtain a label probability vector of the text data to be analyzed; and calculating a label list of the text data to be analyzed from the label probability vector and a preset label classification threshold. The multi-label classification model comprises an embedding layer and an encoding classification layer. The embedding layer obtains its output vector from the character/word sequence of the text data, a predetermined character/word dictionary, and a label category dictionary; the encoding classification layer outputs the label probability vector of the text data from the output vector of the embedding layer. The output vector of the embedding layer contains character/word embeddings, the relevance of the labels in the label category dictionary to the text data, and the positions of those labels in the text data. The method enriches the semantic information supplied to the model and can improve the model's accuracy.

Description

Text data multi-label classification method and device
Technical Field
The present disclosure relates to the field of natural language text classification, and in particular, to a text data multi-label classification method and apparatus.
Background
With the rapid development of internet technology and the spread of intelligent applications, analyzing mass data and extracting valuable information from it has become a focus of both academia and industry. Text processing technologies have accordingly received great attention, and text classification in particular has advanced steadily.
Traditional text classification focuses mainly on single-label classification, i.e., a piece of text carries only one category label. In practice, however, a piece of text often carries more than one label, especially in classification tasks over medical texts (such as physical examination reports), where one text may correspond to one or more label categories. For example, "hepatic cyst and double kidney stones" corresponds to the two labels "hepatic cyst" and "kidney stone". Multi-label classification is therefore of great significance for processing physical examination data.
Much research and practice has gone into text multi-label classification. Current methods fall mainly into two types. The first is based on traditional machine learning: a classifier is trained on manually designed features, so model quality depends on the quality of the feature design, contextual semantic information of the text is ignored, the curse of dimensionality is easily triggered, and classification accuracy is low. The second is based on deep neural networks, typically training a multi-label classification model with a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), an attention mechanism, a pre-trained model, and the like. Compared with traditional machine learning, these methods learn feature representations automatically and classify more accurately, but they adapt poorly to different usage scenarios (e.g., different data characteristics), and because they do not consider the correlation between texts and labels, they often lead to repeated development effort without further gains in classification accuracy.
Disclosure of Invention
The present disclosure addresses the prior-art problems that text classification does not consider the relevance between texts and labels, that classification accuracy is low, and that the applicable scenarios are limited.
In order to solve the above technical problem, a first aspect herein provides a text data multi-label classification method, including:
preprocessing text data to be analyzed to obtain a character/word sequence;
inputting the character/word sequence into a multi-label classification model to obtain a label probability vector of the text data to be analyzed;
calculating a label list of the text data to be analyzed according to the label probability vector of the text data to be analyzed and a preset label classification threshold;
wherein the multi-label classification model comprises an embedding layer and an encoding classification layer; the embedding layer is used for obtaining an output vector of the embedding layer according to the character/word sequence of the text data, a predetermined character/word dictionary and a label category dictionary; the encoding classification layer is used for outputting a label probability vector of the text data according to the output vector of the embedding layer;
wherein the output vector of the embedding layer comprises character/word embeddings, the relevance of the labels in the label category dictionary to the text data, and the positions of the labels in the text data.
As a further embodiment herein, the process of building the multi-label classification model comprises:
acquiring text samples and their label lists, and preprocessing the text samples to obtain a character/word sequence and a label list for each text sample;
obtaining a character/word dictionary and a label category dictionary from the character/word lists and label lists of all text samples, respectively;
generating a label vector for each sample according to the label list of each text sample and the label category dictionary;
inputting the character/word sequence of a text sample, the character/word dictionary, and the label category dictionary into the embedding layer to obtain an output vector of the sample embedding layer;
inputting the output vector of the sample embedding layer into the encoding classification layer, and predicting a sample classification probability vector;
and updating parameters of the multi-label classification model according to the predicted sample classification probability vector and the sample label vector.
As a further embodiment herein, the text data multi-label classification method further comprises:
inputting the character/word sequence of a text sample, the character/word dictionary, the label category dictionary, and a pre-trained character/word vector model into the embedding layer to obtain an output vector of the sample embedding layer.
When the text sample data has few labels and many samples, the character/word sequence, the character/word dictionary, and the label category dictionary of the text sample are input into the embedding layer to obtain the output vector of the sample embedding layer; in specific implementations, a pre-trained character/word vector model may additionally be input.
When the text sample data has many labels and few samples, the character/word sequence, the character/word dictionary, the label category dictionary, and a pre-trained character/word vector model of the text sample are input into the embedding layer to obtain the output vector of the sample embedding layer; in specific implementations, the pre-trained model may instead be omitted.
As a further embodiment herein, the process of building the multi-label classification model further comprises:
processing the character/word sequence with an n-gram model to obtain an n-gram character/word sequence;
inputting the character/word sequence, the character/word dictionary, and the label category dictionary of a text sample into the embedding layer then further comprises: inputting the character/word sequence, the n-gram character/word sequence, the character/word dictionary, and the label category dictionary of the text sample into the embedding layer to obtain the output vector of the sample embedding layer.
As a further embodiment herein, the encoding classification layer comprises an encoding layer and a classification layer;
the encoding layer is used for obtaining a text vector according to the output vector of the sample embedding layer;
the classification layer is used for determining a sample classification probability vector according to the text vector;
and the encoding layer network is selected according to the application scenario of the multi-label classification model.
As a further embodiment herein, the text data multi-label classification method further comprises:
identifying the text to be analyzed according to a preset label identification rule to obtain a supplementary label category of the text to be analyzed;
and retrieving the supplementary label category from the label list of the text data to be analyzed, and if it is not found, adding the supplementary label category to the label list.
As a further embodiment herein, the text data multi-label classification method further comprises:
identifying the text to be analyzed according to a preset label removal rule and/or a constraint condition to obtain error label categories of the text to be analyzed;
and deleting the error label categories from the label list of the text data to be analyzed.
A second aspect herein provides a text data multi-label classification apparatus comprising:
the preprocessing module is used for preprocessing the text data to be analyzed to obtain a character/word sequence;
the classification module is used for inputting the character/word sequence into a multi-label classification model to obtain a label probability vector of the text data to be analyzed;
the first output module is used for calculating a label list of the text data to be analyzed according to the label probability vector of the text data to be analyzed and a preset label classification threshold;
wherein the multi-label classification model comprises an embedding layer and an encoding classification layer; the embedding layer is used for obtaining an output vector of the embedding layer according to the character/word sequence of the text data, a predetermined character/word dictionary and a label category dictionary; the encoding classification layer is used for outputting a label probability vector of the text data according to the output vector of the embedding layer;
wherein the output vector of the embedding layer comprises character/word embeddings, the relevance of the labels in the label category dictionary to the text data, and the positions of the labels in the text data.
A third aspect herein provides a computer device comprising a memory, a processor, and a computer program stored on the memory which, when executed by the processor, performs the text data multi-label classification method of any of the preceding embodiments.
A fourth aspect herein provides a computer storage medium having stored thereon a computer program which, when executed by a processor of a computer device, performs the text data multi-label classification method of any of the preceding embodiments.
According to the text data multi-label classification method and device, the relevance of the labels in the label category dictionary and the text data and the positions of the labels in the text data are considered in the embedding layer of the multi-label classification model, so that the multi-label classification model enriches semantic information input by the model, and the accuracy of the multi-label classification model can be improved. The text data to be analyzed is identified through the multi-label classification model, so that the classification accuracy can be improved, and the obtained classification result is more in line with the actual situation.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments or technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 illustrates a first flowchart of a method for multi-label classification of textual data according to embodiments herein;
FIG. 2 illustrates a second flowchart of a method of multi-label classification of textual data of embodiments herein;
FIG. 3 illustrates a third flowchart of a multi-label classification method of text data in an embodiment herein;
FIG. 4 is a block diagram illustrating a multi-label classification model according to an embodiment herein;
FIG. 5 shows a first flowchart of a multi-label classification model building process of embodiments herein;
FIG. 6 shows a second flowchart of a multi-label classification model building process of embodiments herein;
FIG. 7 shows a third flowchart of a multi-label classification model building process of embodiments herein;
FIG. 8 shows a first block diagram of a text data multi-label classification apparatus according to embodiments herein;
FIG. 9 shows a second block diagram of the text data multi-label classification apparatus according to an embodiment of the present disclosure;
FIG. 10 is a block diagram illustrating a computer device according to an embodiment of the present disclosure.
Description of the symbols of the drawings:
410. an embedding layer;
420. an encoding classification layer;
421. a coding layer;
422. a classification layer;
801. an acquisition module;
802. a preprocessing module;
803. a classification module;
804. a first output module;
805. a post-processing module;
806. a second output module;
1002. a computer device;
1004. a processor;
1006. a memory;
1008. a drive mechanism;
1010. an input/output module;
1012. an input device;
1014. an output device;
1016. a presentation device;
1018. a graphical user interface;
1020. a network interface;
1022. a communication link;
1024. a communication bus.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments herein without making any creative effort, shall fall within the scope of protection.
The present specification provides method steps as described in the examples or flowcharts, but may include more or fewer steps based on routine or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an actual system or apparatus product executes, it can execute sequentially or in parallel according to the method shown in the embodiment or the figures.
The text data multi-label classification method provided by the invention is applicable to scenes in which multi-label classification exists and the labels are related to texts, such as physical examination text classification, news label classification, fine-grained classification of user evaluation and the like, and the specific application field is not limited in the text.
In an embodiment of the present disclosure, a text data multi-label classification method is provided to solve the prior-art problems that classification of multi-label data does not consider the relevance between texts and labels, that classification accuracy is low, and that applicable scenarios are limited. The method can run on a third-party system or an intelligent terminal independent of the source of the text data to be analyzed, including a smartphone, a tablet computer, a desktop computer, and the like; it may be a standalone application, an applet embedded in another program, a web page, and so on; the specific implementation form is not limited herein.
Specifically, as shown in fig. 1, the text data multi-label classification method includes:
step 101, receiving text data to be analyzed;
step 102, preprocessing the text data to be analyzed to obtain a character/word sequence;
step 103, inputting the character/word sequence into a multi-label classification model to obtain a label probability vector of the text data to be analyzed;
and step 104, calculating a label list of the text data to be analyzed according to the label probability vector of the text data to be analyzed and a preset label classification threshold.
As shown in fig. 4, the multi-label classification model includes an embedding layer 410 and an encoding classification layer 420. The embedding layer 410 is configured to obtain the output vector of the embedding layer from the character/word sequence of the text data, the predetermined character/word dictionary, and the label category dictionary. The encoding classification layer 420 is configured to output the label probability vector of the text data from the output vector of the embedding layer. Specifically, the encoding classification layer 420 includes an encoding layer 421 and a classification layer 422. The encoding layer 421 obtains a text vector from the output vector of the embedding layer, and the classification layer 422 determines a sample classification probability vector from the text vector produced by the encoding layer 421.
The output vector of the embedding layer comprises character/word embeddings, the relevance of the labels in the label category dictionary to the text data, and the positions of the labels in the text data (i.e., a label position vector).
The character/word dictionary and the label category dictionary can be determined when the multi-label classification model is built, and their contents differ across application fields. The character/word dictionary contains all the characters/words in the field of the text to be recognized, or a filtered subset of key characters/words (for example, those whose occurrence probability exceeds a certain value). The label category dictionary is the set of all category labels in that field. The character/word dictionary is used to map the character/word sequence of the text to be analyzed to a character/word ID sequence, where each ID is the index of the corresponding character/word in the character/word dictionary.
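As a minimal illustration of this mapping (the dictionary contents below are hypothetical placeholders, not taken from the patent), the ID lookup can be sketched as:

```python
# Minimal sketch of the character/word-to-ID mapping described above.
def to_id_sequence(seq: list[str], dictionary: list[str]) -> list[int]:
    index = {tok: i for i, tok in enumerate(dictionary)}  # token -> index position
    return [index[tok] for tok in seq]

# to_id_sequence(["肝", "脏"], ["正", "常", "肝", "脏"]) -> [2, 3]
```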
The embedding layer determines fixed-length character/word embeddings from the character/word sequence of the text data and the predetermined character/word dictionary (the embedding dimension can be chosen empirically, e.g., a randomly initialized vector of 64, 100, 128, or 256 dimensions). From the label category dictionary it determines the relevance of each label to the text data and the position of the label in the text data, i.e., a fixed-length label position vector. For example, with a 16-dimensional label position vector, the first 15 dimensions are randomly initialized data reflecting the relevance of the label to the text, and the 16th dimension indicates whether the label appears in the text data: if the label appears, the text is label text and the 16th dimension is set to 1; if no label appears, the text is non-label text and the 16th dimension is set to 0. In specific implementations the label position vector may have other dimensionalities, with the last dimension conventionally indicating whether a label appears in the text data. The relevance of a label in the label category dictionary to the text data is expressed by whether that label appears in the text data, and the position of the label in the text data can be determined automatically by a program. The output vector of the embedding layer is then the concatenation of the character/word embeddings and the label position vector. In this way, text data can be converted into a computable embedding vector.
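The sketch below shows one plausible reading of this embedding layer, assuming 64-dimensional character embeddings and a 16-dimensional per-token label position vector whose first 15 dimensions are learned relevance features and whose last dimension flags whether the character lies inside a matched label string. The class name, sizes, and per-token flagging scheme are illustrative assumptions, not the patent's mandated implementation.

```python
import torch
import torch.nn as nn

class LabelAwareEmbedding(nn.Module):
    """Concatenates character embeddings with a label position vector."""
    def __init__(self, vocab_size: int, char_dim: int = 64, label_dim: int = 16):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, char_dim)
        # First label_dim - 1 dimensions: learned relevance features,
        # randomly initialized and updated during training.
        self.relevance = nn.Parameter(torch.randn(1, label_dim - 1))

    def forward(self, char_ids: torch.Tensor, label_flags: torch.Tensor):
        # char_ids:    (seq_len,) IDs from the character/word dictionary
        # label_flags: (seq_len,) 1.0 where the character lies inside a label
        #              string matched in the text, else 0.0
        chars = self.char_emb(char_ids)                        # (seq_len, 64)
        rel = self.relevance.expand(char_ids.size(0), -1)      # (seq_len, 15)
        pos = torch.cat([rel, label_flags.unsqueeze(-1)], -1)  # (seq_len, 16)
        return torch.cat([chars, pos], dim=-1)                 # (seq_len, 80)
```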
The encoding layer may be a convolutional neural network (CNN), a recurrent neural network (RNN), or an attention-based Transformer network. CNN and Transformer networks suit scenarios with high computational-efficiency requirements, RNNs suit scenarios with smaller training corpora, and Transformer networks also suit applications over large-scale corpora.
The classification layer contains a classification function, usually a sigmoid function.
In the embodiment, the relevance of the labels in the label category dictionary in the field to which the text belongs and the text data and the positions of the labels in the text data are considered in the embedding layer of the multi-label classification model, so that the multi-label classification model enriches semantic information input by the model, and the accuracy of the multi-label classification model can be improved. The text data to be analyzed is identified through the multi-label classification model, so that the classification accuracy can be improved, and the obtained classification result is more in line with the actual situation.
In a further embodiment of the present disclosure, after the label list is obtained, it is further output in descending order of probability value, for example by displaying it on a display screen or broadcasting it by voice.
The notation "character/word" denotes either characters or words; whether to analyze at the character level or the word level can be decided per application scenario. Specifically, when no pre-trained language model is available, a character sequence is usually chosen; when a pre-trained language model is available, either a character sequence or a word sequence may be chosen, the decision depending on whether that model's input is a character sequence or a word sequence.
In step 101, taking physical examination text classification as an example, the text data to be analyzed may be uploaded by a user, who may be a patient or a doctor, or uploaded automatically by medical examination equipment. The text data to be analyzed may correspond to one or more classification labels. For example, for the text "on an empty stomach the gallbladder is of normal size and morphology, the gallbladder wall is rough and thickened to about 4.3 mm, internal sound transmission is poor, the common bile duct is not dilated, and repeated scanning shows no cyst image", the corresponding labels are "gallbladder wall thickening" and "gallbladder wall roughness"; for the text "the aortic node is widened, with calcified plaque", the corresponding labels are "aorta widened" and "aortic calcification".
In step 102, a character/word sequence means a character sequence or a word sequence. For example, for the text data "肝脏大小正常。" ("liver size is normal."), the corresponding character sequence is "肝", "脏", "大", "小", "正", "常", "。", and the corresponding word sequence is "肝脏", "大小", "正常", "。". Whether a character sequence or a word sequence is produced is determined by which of the two the multi-label classification model uses. Step 102 may be implemented with any existing character/word segmentation method, which is not limited herein.
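A minimal sketch of character-level preprocessing follows (word segmentation would instead use an existing tokenizer such as jieba; that choice is an assumption, since the text above leaves the segmentation method open):

```python
# Illustrative character-level preprocessing for Chinese text.
def to_char_sequence(text: str) -> list[str]:
    return [ch for ch in text if not ch.isspace()]

# to_char_sequence("肝脏大小正常。") -> ["肝", "脏", "大", "小", "正", "常", "。"]
```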
In step 103, the label probability vector of the text data has a fixed length, and each probability corresponds to one label: a label probability vector {probability 1, probability 2, …, probability n} corresponds to {label 1, label 2, …, label n}. The label probabilities are mutually independent and need not sum to 1.
In step 104, the label classification threshold is, for example, 0.5; the specific value is not limited herein. In specific implementations, each probability in the label probability vector is compared with the label classification threshold: if the probability exceeds the threshold, the element at the corresponding position of the label list is set to 1, indicating that the label is present; otherwise it is set to 0, indicating that the label is absent.
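A small sketch of this thresholding step, assuming the 0.5 threshold mentioned above and numpy arrays (the array representation is an assumption):

```python
import numpy as np

def probs_to_label_list(probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Compare each label probability with the threshold independently."""
    return (probs > threshold).astype(np.int32)

# probs_to_label_list(np.array([0.91, 0.08, 0.62])) -> array([1, 0, 1], dtype=int32)
```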
In one embodiment herein, as shown in fig. 2, the text data multi-label classification method further includes, in addition to the above steps 101 to 104:
step 105, identifying the text to be analyzed according to a preset label identification rule to obtain a supplementary label category of the text to be analyzed;
and step 106, retrieving the supplementary label category from the label list of the text data to be analyzed, and if it is not found, adding the supplementary label category to the label list.
In detail, the preset label identification rule in step 105 may be determined according to the actual application scenario; the specific content is not limited herein. In some embodiments, the rule matches text of the form "body part + non-negation word + the remainder of a label with the part removed" (the labels being taken from the label category dictionary), and the matched content yields a supplementary label category. For example, applying this rule to the text "the aortic node is slightly widened, with calcified plaque" yields the label "aortic calcification" as a supplementary label category.
This embodiment improves the recall of label identification and reduces missed labels; for example, for the text "fatty liver (mild)", the model alone may fail to identify the label "mild fatty liver".
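One hedged way to realize such a rule is a pattern match in which no negation word may occur between the body part and the remainder of the label; the negation vocabulary, part name, and label fragment below are hypothetical placeholders, not rules from the patent:

```python
import re

NEGATIONS = {"not", "no", "without"}  # assumed negation vocabulary

def matches_supplementary_rule(text: str, part: str, remainder: str) -> bool:
    """True if `part` is followed by `remainder` with no negation word between."""
    m = re.search(re.escape(part) + r"(.*?)" + re.escape(remainder), text)
    return bool(m) and not any(tok in NEGATIONS for tok in m.group(1).split())

# matches_supplementary_rule(
#     "the aortic node is slightly widened, with calcified plaque",
#     "aortic", "calcif")
# -> True, so "aortic calcification" would be added as a supplementary label
```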
In one embodiment herein, as shown in fig. 3, the text data multi-label classification method further includes, in addition to steps 101 to 104:
step 107, identifying the text to be analyzed according to a preset label removal rule and/or a constraint condition to obtain error label categories;
and step 108, deleting the error label category in the label list of the text data to be analyzed.
In detail, the preset label removal rule in step 107 may be determined according to the actual application scenario; the specific content is not limited herein. The preset label removal rule identifies error label categories by fuzzy matching.
In some embodiments, the rule matches text of the form "body part + negation word + the remainder of a label with the part removed" (the labels being taken from the predicted label list), and the matched label is treated as an error label category. For example, for the text "the aortic node is slightly widened and no calcified plaque is seen", the label "aortic calcification" is an error label category.
In other embodiments, the preset label removal rule may split each label X output by the multi-label classification model and match the fragments against the text data to be analyzed to decide whether label X is an error label category: if the matching succeeds, label X is not an error label category; if it fails, label X is an error label category.
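A hedged sketch of this split-and-match check follows; the splitting granularity (whitespace tokens here; individual characters would be natural for Chinese labels) is an assumption:

```python
def is_error_label(label: str, text: str) -> bool:
    """A predicted label is an error if any fragment of it is absent from the text."""
    return not all(fragment in text for fragment in label.split())

# is_error_label("aortic calcification", "the aortic node is slightly widened")
# -> True: "calcification" does not occur in the text, so the label is removed
```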
In detail, the constraint condition may be determined according to the actual application scenario, and its content is not specifically limited. The constraint condition is used to determine error label categories precisely: for example, if a label belonging to liver ultrasound (such as "intrahepatic calcification focus") appears in the label list predicted for a chest X-ray text (such as "the aortic node is slightly widened, with calcified plaque"), that label can be determined to be an error label category.
This embodiment prevents label misidentification and improves the accuracy of label identification.
In specific implementations, steps 105 to 106 and steps 107 to 108 may also both be performed, which is not limited herein.
In one embodiment of the present disclosure, in order to adapt to the multi-tag classification requirements of different usage scenarios, different embedding generation methods and coding layers may be selected according to different application scenarios and data characteristics to establish a multi-tag classification model of each usage scenario.
For a usage scenario with few labels and many samples, for example fewer than 10 labels and more than 10,000 samples, a multi-label classification model can be built using the process shown in fig. 5 below.
For a usage scenario with many labels and few samples, for example more than 100 labels with fewer than 100 samples for some labels, a multi-label classification model can be built using the process shown in fig. 6 below.
Of course, in specific implementations, if the resulting model quality is not a primary concern, the process of fig. 6 below may also be used for the few-labels, many-samples scenario, and the process of fig. 5 below may be used for the many-labels, few-samples scenario.
For usage scenarios with high computational-efficiency requirements, the encoding layer may use a CNN or a Transformer network; for scenarios with lower computational-efficiency requirements, an RNN encoder may be used.
In one embodiment herein, as shown in fig. 5, the process of establishing the multi-label classification model includes:
step 501, acquiring text samples and their label lists, and preprocessing the text samples to obtain a character/word sequence for each text sample;
step 502, obtaining a character/word dictionary and a label category dictionary from the character/word lists and label lists of all text samples, respectively;
step 503, generating a label vector for each sample according to the label list of each text sample and the label category dictionary;
step 504, inputting the character/word sequence of a text sample, the character/word dictionary, and the label category dictionary into the embedding layer of the multi-label classification model to obtain an output vector of the sample embedding layer;
step 505, inputting the output vector of the sample embedding layer into a coding layer of a multi-label classification model to obtain a text vector;
step 506, inputting the text vector obtained in the step 505 into a classification layer of a multi-label classification model, and predicting to obtain a sample classification probability vector;
and 507, updating parameters in the multi-label classification model according to the predicted sample classification probability vector and the sample label vector.
In detail, among the inputs of the multi-label classification model, the character/word dictionary and the label category dictionary can be assigned once during training and need not be re-input every time the parameters are updated.
In step 501, the text samples may be labeled manually in advance to determine their label lists, where the label list of each text sample contains at least one label category. Each text sample contains text content. Preprocessing comprises word segmentation or character segmentation.
In step 502, the labels contained in the label lists of all text samples form the label category dictionary, and the characters/words contained in all character/word lists form the character/word dictionary.
In step 503, the sample label vector may use a one-hot style (multi-hot) vector representation. All sample label vectors have the same length. In implementation, a sample label vector can be initialized to all zeros; the sample's label list is then compared against the label category dictionary, the positions of the matched labels in the dictionary are recorded, and the values at the corresponding positions of the sample label vector are set to 1, yielding the sample label vector.
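A minimal sketch of this step; the dictionary and labels below are illustrative placeholders:

```python
import numpy as np

def label_list_to_vector(labels: list[str], label_dict: list[str]) -> np.ndarray:
    vec = np.zeros(len(label_dict), dtype=np.float32)  # initialize to all zeros
    for label in labels:
        vec[label_dict.index(label)] = 1.0             # set matched positions to 1
    return vec

# label_dict = ["hepatic cyst", "kidney stone", "fatty liver"]
# label_list_to_vector(["hepatic cyst", "kidney stone"], label_dict)
# -> array([1., 1., 0.], dtype=float32)
```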
In step 504, two embodiments can be included as follows:
(1) Inputting the character sequence of each text sample and the character dictionary into the embedding layer of the multi-label classification model yields randomly initialized fixed-length character embeddings, for example randomly initialized vectors of dimension 64. A fixed-length label position vector can be defined according to whether each label in the label category dictionary appears in the text sample and where it appears. For example, with a 16-dimensional position vector, the first 15 dimensions represent the position of the label appearing in the text; they are randomly initialized during training and learned by the multi-label classification model. The 16th dimension indicates whether the label category appears in the text: 1 if it appears, 0 otherwise. The concatenation of the character embedding and the label position vector (a vector of dimension 80) serves as the output vector of the embedding layer of the multi-label classification model.
(2) Inputting the word sequence of each text sample and the word dictionary into the embedding layer of the multi-label classification model yields randomly initialized fixed-length word embeddings, and a fixed-length label position vector can likewise be defined according to whether each label in the label category dictionary appears in the text and where. The concatenation of the word embedding and the label position vector serves as the output vector of the sample embedding layer.
In step 505, the coding layer may be selected from a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), or a Attention-based Transformer network.
In step 506, the classification layer may choose a sigmoid function.
In step 507, updating the parameters of the multi-label classification model covers the neural network parameters of the embedding layer, the encoding layer, and the classification layer. The specific updating process is as follows: a loss, such as a cross-entropy loss, is computed from the model's predicted sample classification probability vector and the sample label vector; training ends when the loss satisfies a stopping condition, and otherwise continues with a training step size that can be set according to the actual situation and is not limited herein. In general, a smaller loss value is better.
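A hedged sketch of one training update, assuming the binary cross-entropy form of the cross-entropy loss named above applied to the sigmoid outputs of the classification layer; the linear stand-in for the encoder, the optimizer, and the learning rate are illustrative assumptions, not specified by the patent:

```python
import torch
import torch.nn as nn

NUM_LABELS, TEXT_DIM = 16, 80  # illustrative sizes (80 = 64 char dims + 16 label dims)

# Stand-in for "encoding-layer text vector -> classification layer":
# a linear head with one independent sigmoid per label.
classifier = nn.Sequential(nn.Linear(TEXT_DIM, NUM_LABELS), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def training_step(text_vec: torch.Tensor, label_vec: torch.Tensor) -> float:
    probs = classifier(text_vec)        # sample classification probability vector
    loss = criterion(probs, label_vec)  # compare prediction with sample label vector
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```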
The multi-label classification model trained this way suits the few-labels, many-samples scenario and addresses the low multi-label classification accuracy in that scenario.
To evaluate the multi-label classification model more reliably, after step 503 the processed text samples may further be divided into a training set, a validation set, and a test set in a certain ratio (usually 8:1:1), each containing a number of samples.
In one embodiment herein, as shown in fig. 6, the process of establishing the multi-label classification model includes:
step 601, preprocessing the acquired text samples and their label lists to obtain a character/word sequence and a label list for each text sample;
step 602, obtaining a character/word dictionary and a label category dictionary from the character/word lists and label lists of all text samples, respectively;
step 603, generating a label vector for each sample according to the label list of each text sample and the label category dictionary;
step 604, inputting the character/word sequence of a text sample, the character/word dictionary, the label category dictionary, and a pre-trained character/word vector model into the embedding layer to obtain an output vector of the sample embedding layer;
step 605, inputting the output vector of the sample embedding layer into a coding layer of a multi-label classification model to obtain a text vector;
step 606, inputting the text vector obtained in the step 605 into a classification layer of a multi-label classification model, and predicting to obtain a sample classification probability vector;
step 607, updating the parameters in the multi-label classification model according to the predicted sample classification probability vector and the sample label vector.
Specifically, the steps 601 to 603 and the steps 605 to 607 can be performed by referring to the steps 501 to 503 and the steps 505 to 507, which are not described in detail herein.
In step 604, the following two embodiments may be included:
(1) Inputting the word sequence of each text sample, the dictionary, and a pre-trained word vector model into the embedding layer of the multi-label classification model yields word embeddings of a specific length (matching the vector length of the word vector model); such a model can be trained on labeled data with a tool such as Word2vec and is used to look up the vector of each word in the word sequence (see the sketch after embodiment (2) below). A fixed-length label position vector can be defined according to whether each label in the label category dictionary appears in the text and where. The concatenation of the word embedding and the label position vector serves as the output vector of the embedding layer of the multi-label classification model. In specific implementations, since the pre-trained word vector model carries its own dictionary, the dictionary need not be input to the embedding layer.
(2) Embeddings of a specific length (matching the vector length of the pre-trained model) that carry contextual semantic information can be obtained from the character sequence of each text sample and a pre-trained model, for example BERT (Bidirectional Encoder Representations from Transformers). A fixed-length label position vector can be defined according to whether each label in the label category dictionary appears in the text and where. The concatenation of the embedding and the label position vector serves as the output vector of the embedding layer of the multi-label classification model.
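For embodiment (1), the pre-trained word vectors might be looked up as below; gensim is one common tool for this, and the file name and the zero-vector fallback for unknown words are assumptions:

```python
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load("word2vec.kv")  # hypothetical pre-trained word vectors

def embed_words(words: list[str]) -> np.ndarray:
    """Stack one pre-trained vector per word; unknown words get zero vectors."""
    return np.stack([wv[w] if w in wv else np.zeros(wv.vector_size)
                     for w in words])
```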
The multi-label classification model trained this way suits the many-labels scenario in which some labels have few samples, and addresses the low multi-label classification accuracy in that scenario.
In one embodiment herein, in order to enable the multi-label classification model to enhance the representation of the contextual semantic information in the text, as shown in fig. 7, the multi-label classification model is established by:
step 701, preprocessing the acquired text samples and their label lists to obtain a character/word sequence and a label list for each text sample;
step 702, obtaining a character/word dictionary and a label category dictionary from the character/word sequences and label lists of all text samples, respectively;
step 703, generating a label vector for each sample according to the label list of each text sample and the label category dictionary;
step 704, processing the character/word sequence with an n-gram model to obtain an n-gram character/word sequence;
step 705, inputting the character/word sequence, the n-gram character/word sequence, the character/word dictionary, and the label category dictionary of a text sample into the embedding layer to obtain an output vector of the sample embedding layer;
step 706, inputting the output vector of the sample embedding layer to a coding layer of a multi-label classification model to obtain a text vector;
step 707, inputting the text vector obtained in step 706 into a classification layer of the multi-label classification model, and predicting to obtain a sample classification probability vector;
and step 708, updating parameters in the multi-label classification model according to the predicted sample classification probability vector and the sample label vector.
In step 704, the n-gram model (n ≥ 2) follows the prior art and is not limited herein.
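A small sketch of the n-gram derivation (bigrams shown; n ≥ 2 is the only constraint stated above):

```python
def ngrams(seq: list[str], n: int = 2) -> list[str]:
    """Slide a window of width n over the character/word sequence."""
    return ["".join(seq[i:i + n]) for i in range(len(seq) - n + 1)]

# ngrams(["肝", "脏", "大", "小"]) -> ["肝脏", "脏大", "大小"]
```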
When step 705 is performed, a pre-trained character/word vector model may also be input to the embedding layer to obtain the output vector of the sample embedding layer; in that case the character/word embedding dimension in the output vector matches the vector dimension of the pre-trained model.
Through step 704, this embodiment enhances the representation of contextual semantic information in the text and improves the accuracy of the multi-label classification model.
Based on the same inventive concept, a text data multi-label classification device is also provided herein, as described in the following embodiments. Because the principle of the text data multi-label classification device for solving the problems is similar to the text data multi-label classification method, the implementation of the text data multi-label classification device can refer to the text data multi-label classification method, and repeated parts are not repeated.
The text data multi-label classification device provided by this embodiment includes a plurality of functional modules, which may be implemented by dedicated or general chips, and may also be implemented by software programs, which are not limited herein. The text data multi-tag classification device can be installed in a client, and the client can be a desktop computer, a tablet computer, a notebook computer, a smart phone, a digital assistant, a smart wearable device and the like. Of course, the client is not limited to the electronic device with a certain entity, and may also be software running in the electronic device.
Specifically, as shown in fig. 8, the text data multi-label classification apparatus includes:
an obtaining module 801, configured to obtain text data to be analyzed;
the preprocessing module 802 is configured to preprocess the text data to be analyzed to obtain a character/word sequence;
the classification module 803 is configured to input the character/word sequence into a multi-label classification model to obtain a label probability vector of the text data to be analyzed;
the first output module 804 is configured to calculate a label list of the text data to be analyzed according to the label probability vector of the text data to be analyzed and a preset label classification threshold;
wherein the multi-label classification model comprises an embedding layer and an encoding classification layer; the embedding layer is used for obtaining an output vector of the embedding layer according to the character/word sequence of the text data, a predetermined character/word dictionary and a label category dictionary; the encoding classification layer is used for outputting a label probability vector of the text data according to the output vector of the embedding layer;
wherein the output vector of the embedding layer comprises character/word embeddings, the relevance of the labels in the label category dictionary of the field to which the text belongs to the text data, and the positions of the labels in the text data.
The training process of the multi-label classification model refers to the foregoing embodiments, and is not described in detail here.
The text data multi-label classification device provided by the embodiment considers the relevance of the labels in the label category dictionary and the text data and the positions of the labels in the text data in the embedding layer of the multi-label classification model, so that the multi-label classification model enriches the semantic information input by the model, and the accuracy of the multi-label classification model can be improved. The text data to be analyzed is identified through the multi-label classification model, so that the classification accuracy can be improved, and the obtained classification result is more in line with the actual situation.
In one embodiment of this document, as shown in fig. 9, the text data multi-label classification apparatus further includes:
a post-processing module 805 configured to perform the following processes:
(1) identifying the text to be analyzed according to a preset label identification rule to obtain a supplementary label category of the text to be analyzed;
and retrieving the supplementary label category from the label list of the text data to be analyzed, and if it is not found, adding the supplementary label category to the label list.
(2) identifying the text to be analyzed according to a preset label removal rule and/or a constraint condition to obtain error label categories of the text to be analyzed; and deleting the error label categories from the label list of the text data to be analyzed.
And a second output module 806, configured to output the label list filtered by the post-processing module 805 in descending order of probability.
Based on the above implementation, a specific embodiment is provided: on physical examination text data with 108 labels, Table 1 compares the experimental results of a model that does not consider the correlation between texts and labels (prior art) with one that does (herein).
Table 1 (rendered as an image in the original publication, reporting precision, recall, and F1 for both models; not reproduced here)
where F1 = 2PR/(P + R), P is the precision, and R is the recall.
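As a worked illustration of the formula (the values are illustrative, not taken from Table 1):

```python
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

# f1(0.9, 0.8) -> 0.8470588235294118, i.e. 2 * 0.9 * 0.8 / 1.7
```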
As can be seen from Table 1, the multi-label classification model herein improves precision, recall, and F1 over the prior art.
In addition, in specific implementations, the multi-label classification model can select different input sequences (character sequences or word sequences) and different encoding layers, so it applies to scenarios with many labels and few samples, few labels and many samples, high computational-efficiency requirements, and so on.
In one embodiment herein, a computer device is also provided, as shown in fig. 10, the computer device 1002 may include one or more processors 1004, such as one or more Central Processing Units (CPUs), each of which may implement one or more hardware threads. The computer device 1002 may also include any memory 1006 for storing any kind of information, such as code, settings, data, etc. For example, and without limitation, the memory 1006 may include any one or more of the following in combination: any type of RAM, any type of ROM, flash memory devices, hard disks, optical disks, etc. More generally, any memory may use any technology to store information. Further, any memory may provide volatile or non-volatile retention of information. Further, any memory may represent fixed or removable components of computer device 1002. In one case, when the processor 1004 executes the associated instructions stored in any memory or combination of memories, the computer device 1002 can perform any of the operations of the associated instructions, in particular, for example, the text data multi-label classification method or the multi-label classification model building method described in any of the previous embodiments. The computer device 1002 also includes one or more drive mechanisms 1008, such as a hard disk drive mechanism, an optical disk drive mechanism, or the like, for interacting with any memory.
Computer device 1002 may also include an input/output module 1010 (I/O) for receiving various inputs (via input device 1012) and for providing various outputs (via output device 1014). One particular output mechanism may include a presentation device 1016 and an associated graphical user interface 1018 (GUI). In other embodiments, the input/output module 1010 (I/O), input device 1012, and output device 1014 may be omitted, the computer device then acting only as a node in a network. Computer device 1002 can also include one or more network interfaces 1020 for exchanging data with other devices via one or more communication links 1022. One or more communication buses 1024 couple the above-described components together.
Communication link 1022 may be implemented in any manner, such as over a local area network, a wide area network (e.g., the Internet), a point-to-point connection, etc., or any combination thereof. Communications link 1022 may include any combination of hardwired links, wireless links, routers, gateway functions, name servers, etc., governed by any protocol or combination of protocols.
Corresponding to the methods in fig. 1-3, 5-7, embodiments herein also provide a computer-readable storage medium having stored thereon a computer program, which, when executed by a processor, performs the steps of the above-described method.
Embodiments herein also provide computer readable instructions, wherein when executed by a processor, a program thereof causes the processor to perform the methods as shown in fig. 1-3, 5-7.
It should be understood that, in various embodiments herein, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments herein.
It should also be understood that, in the embodiments herein, the term "and/or" is only one kind of association relation describing an associated object, meaning that three kinds of relations may exist. For example, a and/or B, may represent: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided herein, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only one logical division, and other divisions may be used in practice; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the illustrated or discussed mutual couplings, direct couplings, or communication connections may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the embodiments herein.
In addition, the functional units in the embodiments herein may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions herein may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments herein. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The principles and embodiments of this document have been explained using specific examples, which are presented only to aid in understanding the methods and their core concepts. Meanwhile, those of ordinary skill in the art may, in accordance with the ideas herein, make changes to the specific implementation and the scope of application. In summary, this description should not be understood as limiting this document.

Claims (10)

1. A text data multi-label classification method, characterized in that the method comprises:
preprocessing text data to be analyzed to obtain a character/word sequence;
inputting the character/word sequence into a multi-label classification model to obtain a label probability vector of the text data to be analyzed;
calculating a label list of the text data to be analyzed according to the label probability vector of the text data to be analyzed and a preset label classification threshold;
wherein the multi-label classification model comprises an embedding layer and an encoding classification layer; the embedding layer is used for obtaining an output vector of the embedding layer according to the character/word sequence of the text data, a predetermined character/word dictionary, and a label category dictionary; the encoding classification layer is used for outputting the label probability vector of the text data according to the output vector of the embedding layer;
wherein the output vector of the embedding layer comprises character/word embeddings, the relevance of each label in the label category dictionary to the text data, and the position of that label in the text data.
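For illustration only, and not as part of the claim language, the following minimal Python sketch shows the thresholding step of claim 1: turning a label probability vector into a label list using a preset classification threshold. The label names, the dictionary, and the threshold value are hypothetical.

```python
import numpy as np

# Hypothetical label category dictionary: label name -> index
LABEL_DICT = {"cardiology": 0, "neurology": 1, "oncology": 2}
ID2LABEL = {i: name for name, i in LABEL_DICT.items()}

def labels_from_probabilities(prob_vector, threshold=0.5):
    """Turn a model's label probability vector into a label list by
    keeping every label whose probability reaches the preset
    classification threshold (claim 1, third step)."""
    prob_vector = np.asarray(prob_vector)
    selected = np.flatnonzero(prob_vector >= threshold)
    return [ID2LABEL[i] for i in selected]

# Example: a probability vector produced by the multi-label model
probs = [0.91, 0.12, 0.77]
print(labels_from_probabilities(probs))  # ['cardiology', 'oncology']
```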
2. The text data multi-label classification method according to claim 1, wherein the process of building the multi-label classification model comprises:
acquiring text samples and their label lists, and preprocessing the text samples to obtain a character/word sequence and a label list for each text sample;
obtaining a character/word dictionary and a label category dictionary from the character/word sequences and the label lists of all the text samples, respectively;
generating a label vector for each sample according to the label list of each text sample and the label category dictionary;
inputting the character/word sequence of a text sample, the character/word dictionary, and the label category dictionary into the embedding layer to obtain an output vector of the sample embedding layer;
inputting the output vector of the sample embedding layer into the encoding classification layer to predict a sample classification probability vector;
and updating parameters in the multi-label classification model according to the predicted sample classification probability vector and the sample label vector.
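As a hedged illustration of the dictionary-building and label-vector steps of claim 2, the sketch below builds a character/word dictionary and a label category dictionary from preprocessed samples, and encodes each sample's label list as a multi-hot training target. The sample data and the convention of reserving index 0 for padding are assumptions, not requirements of the claim.

```python
from collections import Counter

def build_dictionaries(samples):
    """Build a character/word dictionary and a label category dictionary
    from preprocessed (token_sequence, label_list) training samples."""
    token_counts, label_set = Counter(), set()
    for tokens, labels in samples:
        token_counts.update(tokens)
        label_set.update(labels)
    # index 0 reserved for padding (an assumed convention)
    token_dict = {tok: i + 1 for i, (tok, _) in enumerate(token_counts.most_common())}
    label_dict = {lab: i for i, lab in enumerate(sorted(label_set))}
    return token_dict, label_dict

def label_vector(label_list, label_dict):
    """Encode a sample's label list as a multi-hot vector over the
    label category dictionary, used as the training target."""
    vec = [0.0] * len(label_dict)
    for lab in label_list:
        vec[label_dict[lab]] = 1.0
    return vec

samples = [(["chest", "pain"], ["cardiology"]),
           (["headache"], ["neurology"])]
token_dict, label_dict = build_dictionaries(samples)
print(label_vector(["cardiology"], label_dict))  # [1.0, 0.0]
```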
3. The text data multi-label classification method according to claim 2, further comprising:
inputting the character/word sequence of the text sample, the character/word dictionary, the label category dictionary, and a pre-trained character/word vector model into the embedding layer to obtain the output vector of the sample embedding layer.
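Claim 3 adds a pre-trained character/word vector model as an input to the embedding layer. One plausible realization, sketched below using the gensim library, is to initialize the embedding matrix from the pre-trained vectors; the file path, random fallback initialization, and padding row are assumptions, not the patent's prescription.

```python
import numpy as np
from gensim.models import KeyedVectors  # any word2vec-format vectors work

def init_embedding_matrix(token_dict, vector_path):
    """Initialize the embedding layer's weight matrix from a pre-trained
    character/word vector model; tokens absent from the pre-trained
    vocabulary keep a small random vector (one reading of claim 3)."""
    kv = KeyedVectors.load_word2vec_format(vector_path)  # path is illustrative
    matrix = np.random.normal(0, 0.1, (len(token_dict) + 1, kv.vector_size))
    matrix[0] = 0.0  # padding row
    for token, idx in token_dict.items():
        if token in kv.key_to_index:
            matrix[idx] = kv[token]
    return matrix
```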
4. The text data multi-label classification method according to claim 2, wherein the process of building the multi-label classification model further comprises:
processing the character/word sequence according to an n-gram model to obtain an n-gram character/word sequence;
wherein the inputting of the character/word sequence of the text sample, the character/word dictionary, and the label category dictionary into the embedding layer to obtain the output vector of the sample embedding layer further comprises: inputting the character/word sequence of the text sample, the n-gram character/word sequence, the character/word dictionary, and the label category dictionary into the embedding layer to obtain the output vector of the sample embedding layer.
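A minimal sketch of the n-gram step in claim 4, assuming character-level tokens and bigrams; the joining convention is an assumption made for illustration.

```python
def ngram_sequence(tokens, n=2):
    """Derive an n-gram character/word sequence from a token sequence
    by joining every n consecutive tokens (claim 4)."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngram_sequence(["胸", "闷", "气", "短"], n=2))
# ['胸闷', '闷气', '气短']
```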
5. The text data multi-label classification method according to claim 2, wherein the encoding classification layer comprises a coding layer and a classification layer;
the coding layer is used for obtaining a text vector from the output vector of the sample embedding layer;
the classification layer is used for determining the sample classification probability vector from the text vector;
and the coding layer network is selected according to the application scenario of the multi-label classification model.
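Claim 5 leaves the coding layer network open to the application scenario. Purely as one assumed instantiation (PyTorch, a BiLSTM encoder, mean pooling, and a sigmoid classifier, none of which the claim mandates), the sketch below shows how a coding layer and a classification layer can map the embedding-layer output to a sample classification probability vector.

```python
import torch
import torch.nn as nn

class EncodeClassify(nn.Module):
    """One possible encoding classification layer: a BiLSTM coding layer
    produces a text vector, and a sigmoid classification layer maps it
    to a label probability vector. The BiLSTM is a replaceable choice
    per claim 5, not the patent's required network."""
    def __init__(self, embed_dim, hidden_dim, num_labels):
        super().__init__()
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, embedded):           # (batch, seq_len, embed_dim)
        encoded, _ = self.encoder(embedded)
        text_vector = encoded.mean(dim=1)  # simple pooling into a text vector
        return torch.sigmoid(self.classifier(text_vector))  # label probabilities

probs = EncodeClassify(embed_dim=64, hidden_dim=32, num_labels=3)(
    torch.randn(2, 10, 64))
print(probs.shape)  # torch.Size([2, 3])
```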
6. The text data multi-label classification method according to claim 1, further comprising:
identifying the text to be analyzed according to a preset label identification rule to obtain a supplementary label category of the text to be analyzed;
and searching the label list of the text data to be analyzed for the supplementary label category; if the supplementary label category is not found, adding it to the label list.
7. The text data multi-label classification method according to claim 1, further comprising:
identifying the text to be analyzed according to a preset label removal rule and/or constraint condition to obtain an erroneous label category of the text to be analyzed;
and deleting the erroneous label category from the label list of the text data to be analyzed.
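Claims 6 and 7 describe rule-based post-processing of the model's label list. The following sketch combines both steps under assumed rule sets (the regular expressions, label names, and sample text are illustrative only): supplementary labels are added when an identification rule matches the text, and erroneous labels are deleted when a removal rule or constraint condition fires.

```python
import re

# Illustrative rule sets; real rules would be domain-specific.
SUPPLEMENT_RULES = {"oncology": re.compile(r"肿瘤|癌")}        # claim 6: add when matched
REMOVAL_RULES = {"pediatrics": re.compile(r"成人|\d{2,}岁")}   # claim 7: drop when matched

def postprocess_labels(text, labels):
    """Apply preset rules to the model's label list: supplement labels the
    rules find in the text but the model missed, and delete labels whose
    removal rule or constraint condition fires."""
    labels = list(labels)
    for label, pattern in SUPPLEMENT_RULES.items():
        if pattern.search(text) and label not in labels:
            labels.append(label)
    for label, pattern in REMOVAL_RULES.items():
        if label in labels and pattern.search(text):
            labels.remove(label)
    return labels

text = "患者成人，45岁，发现肿瘤"
print(postprocess_labels(text, ["pediatrics"]))  # ['oncology']
```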
8. A text data multi-label classification apparatus, comprising:
a preprocessing module for preprocessing text data to be analyzed to obtain a character/word sequence;
a classification module for inputting the character/word sequence into a multi-label classification model to obtain a label probability vector of the text data to be analyzed;
a first output module for calculating a label list of the text data to be analyzed according to the label probability vector of the text data to be analyzed and a preset label classification threshold;
wherein the multi-label classification model comprises an embedding layer and an encoding classification layer; the embedding layer is used for obtaining an output vector of the embedding layer according to the character/word sequence of the text data, a predetermined character/word dictionary, and a label category dictionary; the encoding classification layer is used for outputting the label probability vector of the text data according to the output vector of the embedding layer;
wherein the output vector of the embedding layer comprises character/word embeddings, the relevance of each label in the label category dictionary to the text data, and the position of that label in the text data.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory, wherein the computer program, when executed by the processor, performs the steps of the method of any one of claims 1-7.
10. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor of a computer device, performs the steps of the method of any one of claims 1-7.
CN202110569710.6A 2021-05-25 2021-05-25 Text data multi-label classification method and device Pending CN113297379A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110569710.6A CN113297379A (en) 2021-05-25 2021-05-25 Text data multi-label classification method and device

Publications (1)

Publication Number Publication Date
CN113297379A true CN113297379A (en) 2021-08-24

Family

ID=77324671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110569710.6A Pending CN113297379A (en) 2021-05-25 2021-05-25 Text data multi-label classification method and device

Country Status (1)

Country Link
CN (1) CN113297379A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209823A (en) * 2019-06-12 2019-09-06 齐鲁工业大学 A kind of multi-tag file classification method and system
CN110442707A (en) * 2019-06-21 2019-11-12 电子科技大学 A kind of multi-tag file classification method based on seq2seq
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN111339250A (en) * 2020-02-20 2020-06-26 北京百度网讯科技有限公司 Mining method of new category label, electronic equipment and computer readable medium
CN111444726A (en) * 2020-03-27 2020-07-24 河海大学常州校区 Method and device for extracting Chinese semantic information of long-time and short-time memory network based on bidirectional lattice structure

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114492661A (en) * 2022-02-14 2022-05-13 平安科技(深圳)有限公司 Text data classification method and device, computer equipment and storage medium
CN115168594A (en) * 2022-09-08 2022-10-11 北京星天地信息科技有限公司 Alarm information processing method and device, electronic equipment and storage medium
CN115344504A (en) * 2022-10-19 2022-11-15 广州软件应用技术研究院 Software test case automatic generation method and tool based on requirement specification
CN115344504B (en) * 2022-10-19 2023-03-24 广州软件应用技术研究院 Software test case automatic generation method and tool based on requirement specification
CN116956289A (en) * 2023-07-21 2023-10-27 上海则一供应链管理有限公司 Method for dynamically adjusting potential blacklist and blacklist
CN116956289B (en) * 2023-07-21 2024-04-09 上海则一供应链管理有限公司 Method for dynamically adjusting potential blacklist and blacklist

Similar Documents

Publication Publication Date Title
CN106328147B (en) Speech recognition method and device
CN113297379A (en) Text data multi-label classification method and device
CN112464641A (en) BERT-based machine reading understanding method, device, equipment and storage medium
CN112434535B (en) Element extraction method, device, equipment and storage medium based on multiple models
CN114140673B (en) Method, system and equipment for identifying violation image
CN112667813B (en) Method for identifying sensitive identity information of referee document
CN112667782A (en) Text classification method, device, equipment and storage medium
CN112784580A (en) Financial data analysis method and device based on event extraction
CN112183102A (en) Named entity identification method based on attention mechanism and graph attention network
CN111462752A (en) Client intention identification method based on attention mechanism, feature embedding and BI-L STM
CN112347223A (en) Document retrieval method, document retrieval equipment and computer-readable storage medium
CN116150367A (en) Emotion analysis method and system based on aspects
CN115798661A (en) Knowledge mining method and device in clinical medicine field
CN114818718A (en) Contract text recognition method and device
CN112818117A (en) Label mapping method, system and computer readable storage medium
CN114528851B (en) Reply sentence determination method, reply sentence determination device, electronic equipment and storage medium
CN116341519A (en) Event causal relation extraction method, device and storage medium based on background knowledge
CN115168590A (en) Text feature extraction method, model training method, device, equipment and medium
CN115358817A (en) Intelligent product recommendation method, device, equipment and medium based on social data
CN114138954A (en) User consultation problem recommendation method, system, computer equipment and storage medium
JP7216627B2 (en) INPUT SUPPORT METHOD, INPUT SUPPORT SYSTEM, AND PROGRAM
CN113051869A (en) Method and system for identifying text difference content by combining semantic recognition
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
Chopra et al. Sequence Labeling using Conditional Random Fields
CN116757159B (en) End-to-end multitasking joint chapter level event extraction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination