CN113806507A - Multi-label classification method and device and readable medium - Google Patents

Multi-label classification method and device and readable medium

Info

Publication number
CN113806507A
Authority
CN
China
Prior art keywords: label, feature vector, sentence, i-th sentence, character
Prior art date
Legal status: Granted
Application number
CN202111087866.7A
Other languages
Chinese (zh)
Other versions
CN113806507B (en)
Inventor
蒋佳佳
肖龙源
李稀敏
邹辉
Current Assignee
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date: 2021-09-16
Filing date: 2021-09-16
Publication date: 2021-12-17
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202111087866.7A
Publication of CN113806507A
Application granted
Publication of CN113806507B
Legal status: Active

Classifications

    • G06F16/3329: Natural language query formulation or dialogue systems (information retrieval of unstructured textual data; querying; query formulation)
    • G06F16/3344: Query execution using natural language analysis
    • G06F16/35: Clustering; Classification (of unstructured textual data)
    • G06F40/242: Dictionaries (handling natural language data; lexical tools)
    • G06N3/048: Activation functions (neural network architectures)
    • G06N3/08: Learning methods (neural networks)
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a multi-label classification method, a multi-label classification device and a readable medium. A multi-label classification model is constructed in which an Attention mechanism is introduced so that the specific meanings of some special characters in the text can be captured and the classification effect improved, and a multi-label inference mechanism is introduced so that the interrelation among the multiple labels can be obtained without being affected by the order of the labels, further improving the precision of the model and achieving an excellent classification result.

Description

Multi-label classification method and device and readable medium
Technical Field
The invention relates to the field of natural language processing, in particular to a multi-label classification method, a multi-label classification device and a readable medium.
Background
Multi-label classification of unstructured text has become a very important branch of the dialogue field and can be applied to intention recognition, text emotion classification and other areas. Multi-label classification means that classifying a text yields zero or more labels that are not mutually exclusive. With the rapid development of large-scale pre-training models, a model pre-trained on a general data set or a vertical-domain data set can provide strong assistance for the training of downstream models and achieve very good results. Combining a pre-trained model with a traditional deep network model has become a standard way of solving problems quickly.
However, existing multi-label classification methods have certain limitations, mainly reflected in the following aspects:
(1) When the pre-training model is applied, the meaning of each individual character is not captured, so the classification result is biased toward the overall meaning of the sentence and the meaning expressed by some special characters is ignored.
(2) When a multi-label classification result is obtained, the labels usually influence one another; for example, some labels are highly likely to appear together, and the order in which the labels appear can also affect the classification result.
(3) General multi-label classification methods use complex models of high complexity, which brings certain difficulties and limitations in practical application.
Disclosure of Invention
To address the technical problems mentioned above, embodiments of the present application provide a multi-label classification method, apparatus and readable medium.
In a first aspect, an embodiment of the present application provides a multi-label classification method, including the following steps:
S1, acquiring a preprocessed dialogue data set, dividing the preprocessed dialogue data set into a plurality of batch data, and obtaining a feature vector of the t-th character in the i-th sentence through a trained pre-training model based on the batch data;
S2, inputting the feature vector of the t-th character in the i-th sentence into a first Attention layer to obtain a first feature vector of the i-th sentence;
S3, establishing a label information matrix, and obtaining the label feature vector of the last prediction from the label prediction result obtained last time and the label information matrix;
S4, splicing the first feature vector of the i-th sentence with the label feature vector of the last prediction, and inputting the spliced vector into a second Attention layer to obtain a second feature vector of the i-th sentence;
S5, splicing the first feature vector of the i-th sentence and the second feature vector of the i-th sentence, and inputting the spliced vector into a residual network structure for feature extraction to obtain a third feature vector of the i-th sentence;
S6, inputting the third feature vector of the i-th sentence into a classifier to obtain a label prediction result;
S7, repeating steps S1-S6 multiple times to obtain the final label prediction result.
In some embodiments, the preprocessed dialogue data set is obtained by removing useless characters from the dialogue data set, where the useless characters include special characters and emoticons.
In some embodiments, a pre-training model trained on a general data set is further pre-trained with the preprocessed dialogue data set, resulting in a trained pre-training model for the vertical domain.
In some embodiments, the pre-trained model comprises a Bert model comprising 12 layers of encoders.
In some embodiments, the trained pre-training model in step S1 is pre-trained again using the open-source RoBERTa released by Harbin Institute of Technology (HIT) as initial weights, and the input is the id in the dictionary of the t-th character of the i-th sentence of the batch data.
In some embodiments, the step S2 of obtaining the importance information of each character to the sentence features through the first Attention layer specifically includes:
u_it = tanh(W_w · h_it + b_w);
α_it = exp(u_it^T · u_w) / Σ_t exp(u_it^T · u_w);
s_i = Σ_t α_it · h_it;
where h_it is the feature vector of the t-th character in the i-th sentence, W_w and b_w are the corresponding weight and bias, u_w is a character-level context (weight) vector, α_it is the degree of importance of each character in the sentence, and s_i is the first feature vector of the i-th sentence.
In some embodiments, the label information matrix in step S3 is an initialized matrix weight W_{n×h}, where n is the number of all labels and h is the feature dimension; the label feature vector of the last prediction corresponding to each batch data is o_{t-1} * W_{n×h}, where o_{t-1} is the label prediction result obtained last time, i.e. the prediction score of each of the n labels, and the first label prediction result is zero.
In some embodiments, the importance information of the different tags is obtained through the second Attention layer in step S4.
In some embodiments, the classifier in step S6 includes a linear layer that employs a sigmoid activation function.
In a second aspect, an embodiment of the present application provides a multi-label classification apparatus, including:
the pre-training module is configured to acquire a preprocessed dialogue data set, divide the preprocessed dialogue data set into a plurality of batch data, and obtain a feature vector of the t-th character in the i-th sentence through a trained pre-training model based on the batch data;
the first Attention module is configured to input a feature vector of a t-th character in an ith sentence into a first Attention layer to obtain a first feature vector of the ith sentence;
the label characteristic vector acquisition module is configured to establish a label information matrix and obtain a label characteristic vector obtained by last prediction through a label prediction result obtained last time and the label information matrix;
the second Attention module is configured to splice the first feature vector of the i-th sentence with the label feature vector of the last prediction, and input the spliced vector into a second Attention layer to obtain a second feature vector of the i-th sentence;
the feature extraction module is configured to splice the first feature vector of the i-th sentence and the second feature vector of the i-th sentence, and input the spliced vector into a residual network structure for feature extraction to obtain a third feature vector of the i-th sentence;
the classification module is configured to input the third feature vector of the ith sentence into the classifier to obtain a label prediction result;
and the repeating module is configured to repeatedly execute the pre-training module to the classifying module for multiple times to obtain a final label prediction result.
In a third aspect, embodiments of the present application provide an electronic device comprising one or more processors; storage means for storing one or more programs which, when executed by one or more processors, cause the one or more processors to carry out a method as described in any one of the implementations of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the method as described in any of the implementations of the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention applies the pre-training model and performs the pre-training again on the data set in the vertical field, and can obviously improve the performance of the downstream task.
2. The invention applies the Attention mechanism, which can well capture the specific meaning of special characters when expressing the meaning features of a sentence, so that the classification result has finer granularity.
3. The invention applies a multi-label reasoning mechanism, can discover the mutual relation among the labels, is not influenced by the sequence among the labels, and has very good model effect.
4. The model of the invention is simple, the complexity is low, and the invention can be developed and deployed rapidly.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is an exemplary device architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow chart of a multi-label classification method according to an embodiment of the invention;
FIG. 3 is a flowchart of the steps performed before step S1 of a multi-label classification method according to an embodiment of the invention;
FIG. 4 is a flow chart of a multi-label classification model of a multi-label classification method according to an embodiment of the invention;
FIG. 5 is a schematic view of a multi-label classification apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a computer device suitable for implementing an electronic apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 illustrates an exemplary device architecture 100 to which a multi-label classification method or a multi-label classification device of an embodiment of the present application may be applied.
As shown in fig. 1, the apparatus architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various applications, such as data processing type applications, file processing type applications, etc., may be installed on the terminal apparatuses 101, 102, 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server that provides various services, such as a background data processing server that processes files or data uploaded by the terminal devices 101, 102, 103. The background data processing server can process the acquired file or data to generate a processing result.
It should be noted that the multi-tag classification method provided in the embodiment of the present application may be executed by the server 105, or may be executed by the terminal devices 101, 102, and 103, and accordingly, the multi-tag classification apparatus may be disposed in the server 105, or may be disposed in the terminal devices 101, 102, and 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case where the processed data does not need to be acquired from a remote location, the above device architecture may not include a network, but only a server or a terminal device.
Fig. 2 illustrates a multi-label classification method provided by an embodiment of the present application, including the following steps:
and S1, acquiring the preprocessed dialogue data set, dividing the preprocessed dialogue data set into a plurality of batch data, and obtaining the feature vector of the t-th character in the i-th sentence through a trained pre-training model based on the batch data.
In a specific embodiment, as shown in fig. 3, before step S1, the method further includes:
s11, preprocessing the dialogue data set;
and S12, pre-training the pre-training model trained on the general data set by adopting the dialogue data set preprocessed in the step S11.
Specifically, the preprocessing process of step S11 includes removing useless characters such as special characters and emoticons, which can improve the normalization of the text.
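As a minimal illustrative sketch only (the regular expressions and the helper name preprocess_sentence are assumptions, not part of the disclosure), the removal of special characters and emoticons in step S11 could look like this in Python:

import re

def preprocess_sentence(text: str) -> str:
    """Remove useless characters (special symbols and emoticons) from one sentence."""
    # Drop emoji / emoticon code points (illustrative ranges only, not exhaustive).
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)
    # Keep Chinese characters, letters, digits and basic punctuation; drop other special characters.
    text = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9，。？！,.?! ]", "", text)
    return text.strip()

# Example: preprocess_sentence("拉肚子是什么原因，怎么治疗？😊") -> "拉肚子是什么原因，怎么治疗？"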
In a specific embodiment, a BERT model is selected as the pre-training model, and pre-training is continued with a Masked Language Model task to obtain a pre-training model for the vertical domain, so that the pre-training model better matches the distribution of the dialogue data set and its performance on downstream tasks is improved. Of course, other pre-training models may be selected according to the specific dialogue scenario; in this embodiment the BERT model is taken as an example. BERT is a language representation model trained with very large data, a large model and very large computational overhead; it jointly relies on both left and right context in all layers.
On this basis, a multi-label classification model is constructed. As shown in fig. 4, the multi-label classification model comprises the trained BERT model, two Attention layers, a label information embedding matrix and a classifier. The preprocessed dialogue data set is divided into a plurality of batch data. During model training, each batch of data is repeatedly input into the multi-label classification model multiple times, and the same repetition is performed on the test set and the validation set.
In a specific embodiment, the BERT model in step S1 is pre-trained again using the open-source RoBERTa released by Harbin Institute of Technology (HIT) as initial weights; specifically, its 12-layer encoder takes as input the id in the dictionary of each character of the i-th sentence and outputs the feature vector of each character of the i-th sentence. Because the BERT model that uses the HIT open-source RoBERTa as initial weights already includes the dictionary, the initial weights can be loaded and used directly without rebuilding the dictionary.
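The following sketch shows one way to load such weights and obtain per-character feature vectors with the transformers library; the checkpoint name hfl/chinese-roberta-wwm-ext is an assumption for the HIT open-source RoBERTa, and continued Masked Language Model pre-training would additionally use the MLM head, which is omitted here:

import torch
from transformers import BertModel, BertTokenizerFast

# Assumption: the HIT open-source RoBERTa is loaded through its BERT-style interface.
checkpoint = "hfl/chinese-roberta-wwm-ext"           # assumed checkpoint name
tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
encoder = BertModel.from_pretrained(checkpoint)      # 12-layer encoder

sentences = ["拉肚子是什么原因，怎么治疗？"]
batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    out = encoder(**batch)

# out.last_hidden_state[i, t] is h_it, the feature vector of the t-th character of the i-th sentence.
h = out.last_hidden_state                             # shape: (batch, seq_len, 768)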
S2, inputting the feature vector of the t-th character in the ith sentence into the first Attention layer to obtain the first feature vector of the ith sentence.
In a specific embodiment, the first Attention layer in step S2 is mainly used to measure the importance of each character to the characteristics of the whole sentence, so as to capture the specific meaning of a special character. The specific calculation formula is as follows:
u_it = tanh(W_w · h_it + b_w);
α_it = exp(u_it^T · u_w) / Σ_t exp(u_it^T · u_w);
s_i = Σ_t α_it · h_it;
where h_it is the feature vector of the t-th character in the i-th sentence, W_w and b_w are the corresponding weight and bias, u_w is a character-level context (weight) vector, α_it is the degree of importance of each character in the sentence, and s_i is the first feature vector of the i-th sentence.
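A minimal PyTorch sketch of this character-level Attention layer is given below; the module name CharAttention and the hidden size of 768 are assumptions, while the formulas are the ones stated above:

import torch
import torch.nn as nn

class CharAttention(nn.Module):
    """u_it = tanh(W_w h_it + b_w); a_it = softmax(u_it^T u_w); s_i = sum_t a_it h_it."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)         # W_w, b_w
        self.context = nn.Parameter(torch.randn(hidden_size))   # u_w, character-level context vector

    def forward(self, h, mask=None):
        # h: (batch, seq_len, hidden) -- feature vectors of each character
        u = torch.tanh(self.proj(h))                 # (batch, seq_len, hidden)
        scores = u @ self.context                    # (batch, seq_len)
        if mask is not None:                         # ignore padding positions
            scores = scores.masked_fill(mask == 0, -1e9)
        alpha = torch.softmax(scores, dim=-1)        # importance of each character in the sentence
        s = (alpha.unsqueeze(-1) * h).sum(dim=1)     # (batch, hidden) -- first feature vector s_i
        return s, alpha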
And S3, establishing a label information matrix, and obtaining the label characteristic vector obtained by the last prediction through the label prediction result obtained by the last time and the label information matrix.
In a specific embodiment, the label information matrix in step S3 is an initialized matrix weight W_{n×h}, where n is the number of all labels and h is the feature dimension. Each batch of data is trained multiple times, and each time the matrix weight is multiplied by the last label prediction result o_{t-1} to obtain the label feature vector of the last prediction. That is, the label feature vector of the last prediction corresponding to each batch data is o_{t-1} * W_{n×h}, where * denotes element-wise multiplication, o_{t-1} is the prediction score of each of the n labels, and the first label prediction result is zero. This acts like a gate structure: useless label information can be filtered out and relevant label information enhanced, thereby introducing the interrelation among labels.
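A possible sketch of the label information matrix and its gate-like multiplication with the previous prediction is shown below; the class name LabelMemory and the tensor shapes are assumptions:

import torch
import torch.nn as nn

class LabelMemory(nn.Module):
    """Label information matrix W_{n x h}; the previous prediction gates each label's embedding."""
    def __init__(self, num_labels: int, hidden_size: int = 768):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_labels, hidden_size))   # W_{n x h}

    def forward(self, prev_scores):
        # prev_scores: (batch, n) -- o_{t-1}, sigmoid scores of the previous round (zeros on the first round)
        # Element-wise gating: each label row of W is scaled by its previous score.
        return prev_scores.unsqueeze(-1) * self.W    # (batch, n, hidden) -- label feature vectors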
S4, splicing the first feature vector of the ith sentence with the label feature vector obtained by last prediction, and inputting the first feature vector into the second Attention layer to obtain the second feature vector of the ith sentence.
In a specific embodiment, the importance information of the different labels is obtained through the second Attention layer in step S4. The second Attention layer is computed in the same way as the first Attention layer, and by analogy the second feature vector of the i-th sentence is obtained. Splicing in the label feature vector of the last prediction introduces information about the relations between labels; for example, if two labels always appear together, this relation is strengthened by introducing the label feature vector of the last prediction.
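As an illustrative sketch (the exact splicing order is an assumption), the second Attention step could reuse an attention module of the same form as the CharAttention sketch above on the spliced sequence:

import torch

def second_feature(s_i, label_feats, second_attention):
    # s_i: (batch, hidden) first feature vector; label_feats: (batch, n, hidden) gated label vectors;
    # second_attention: an attention layer of the same form as CharAttention above.
    spliced = torch.cat([s_i.unsqueeze(1), label_feats], dim=1)   # (batch, 1 + n, hidden)
    v_i, _ = second_attention(spliced)                            # (batch, hidden) -- second feature vector
    return v_i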
And S5, splicing the first characteristic vector of the ith sentence and the second characteristic vector of the ith sentence, and inputting the spliced first characteristic vector and the second characteristic vector into a residual error network structure to perform characteristic extraction to obtain a third characteristic vector of the ith sentence. The first feature vector and the second feature vector of the ith sentence are spliced together to increase the richness of features, so that the information among the labels can be acquired in addition to the classification information of the sentence.
Specifically, the residual network structure is y = F(x, W) + x, where x is the input, y is the third feature vector of the i-th sentence, and F(x, W) is the output of a network layer with weight W; adding the output of the network layer to the input x forms the residual connection. The residual network can improve the generalization of the model.
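A minimal sketch of such a residual block is given below; the two-layer form of F and the input dimension are assumptions:

import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x, W) + x, where F is a small weighted network layer."""
    def __init__(self, dim: int = 768 * 2):   # assumes the spliced first + second feature vectors as input
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        # The network-layer output plus the input x is the residual connection.
        return self.f(x) + x                  # third feature vector of the i-th sentence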
S6, inputting the third feature vector of the i-th sentence into the classifier to obtain a label prediction result.
In a specific embodiment, the classifier in step S6 includes a linear layer with a sigmoid activation function, and the output of the classifier is o_t, where o_t is the prediction score of each of the n labels. Classifiers include, but are not limited to, logistic regression, SVM and softmax.
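A corresponding sketch of the linear-plus-sigmoid classifier, with assumed dimensions, might be:

import torch
import torch.nn as nn

class MultiLabelClassifier(nn.Module):
    """Linear layer followed by sigmoid: one independent score per label (o_t)."""
    def __init__(self, in_dim: int = 768 * 2, num_labels: int = 10):
        super().__init__()
        self.linear = nn.Linear(in_dim, num_labels)

    def forward(self, third_feature):
        return torch.sigmoid(self.linear(third_feature))   # (batch, n) scores in [0, 1]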
S7, repeating the steps S1-S6 for multiple times to obtain the final label prediction result.
Specifically, during training each batch of data is trained multiple times and needs the label prediction result of the previous round; 2 rounds are used by default, i.e. each batch of data is trained twice, the label prediction result input in the first round is empty, and the label prediction result input in the second round is that of the first round, so that the interrelation among labels is captured. During verification on the test set, steps S1-S6 are likewise repeated to obtain the final prediction result.
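A sketch of this repeated prediction loop, assuming the modules from the earlier sketches, could look as follows; the function name and the way the modules are composed are assumptions:

import torch

def predict_batch(h, char_attention, label_memory, second_attention, residual, classifier,
                  num_labels, rounds=2):
    # h: (batch, seq_len, hidden) character feature vectors from the pre-trained model (S1).
    prev_scores = torch.zeros(h.size(0), num_labels)              # first round: label prediction is zero
    for _ in range(rounds):                                       # 2 rounds by default
        s_i, _ = char_attention(h)                                # S2: first feature vector
        label_feats = label_memory(prev_scores)                   # S3: gated label feature vectors
        v_i = second_feature(s_i, label_feats, second_attention)  # S4: second feature vector
        third = residual(torch.cat([s_i, v_i], dim=-1))           # S5: splice + residual feature extraction
        prev_scores = classifier(third)                           # S6: per-label sigmoid scores o_t
    return prev_scores                                            # final label prediction result (S7)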
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of a multi-tag classification apparatus, which corresponds to the method embodiment shown in fig. 2, and which can be applied in various electronic devices.
An embodiment of the present application provides a multi-label classification apparatus, including:
the pre-training module 1 is configured to obtain a preprocessed dialogue data set, divide the dialogue data set into a plurality of batch data, and obtain a feature vector of a t-th character in an ith sentence through a trained pre-training model based on the batch data;
a first Attention module 2, configured to input a feature vector of a t-th character in an ith sentence into a first Attention layer, so as to obtain a first feature vector of the ith sentence;
the label characteristic vector acquisition module 3 is configured to establish a label information matrix and obtain a label characteristic vector obtained by last prediction through a label prediction result obtained last time and the label information matrix;
the second Attention module 4 is configured to splice the first feature vector of the i-th sentence with the label feature vector of the last prediction, and input the spliced vector into the second Attention layer to obtain a second feature vector of the i-th sentence;
the feature extraction module 5 is configured to splice the first feature vector of the i-th sentence and the second feature vector of the i-th sentence, and input the spliced vector into a residual network structure for feature extraction to obtain a third feature vector of the i-th sentence;
the classification module 6 is configured to input the third feature vector of the ith sentence into the classifier to obtain a label prediction result;
and the repeating module 7 is configured to repeatedly execute the pre-training module 1 to the classifying module 6 for multiple times to obtain a final label prediction result.
The invention is mainly applied to analyzing the information of questions in a dialogue system, performing multi-intent recognition so as to facilitate subsequent targeted answers, and can be adapted to different vertical domains. In medical dialogue, disease information and the treatment method are often asked about at the same time; the corresponding scenario is then multi-label classification, and the method can be used to give well-targeted answers and improve the fluency and accuracy of the dialogue system. For example, the user asks: What causes diarrhea, and how is it treated? This question asks both the cause of the disease and the treatment. With the embodiment of the present application, the prediction result for this sentence is two labels: etiology and treatment method.
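For illustration only (the label names, scores and the 0.5 threshold below are assumptions, not values from the disclosure), such a multi-label result can be read off the classifier scores like this:

# Illustration only: label names, scores and the 0.5 threshold are assumptions.
labels = ["etiology", "treatment method", "symptom", "department"]   # hypothetical label set
scores = [0.93, 0.88, 0.12, 0.05]                                    # example classifier output o_t

predicted = [name for name, s in zip(labels, scores) if s > 0.5]
print(predicted)   # ['etiology', 'treatment method']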
The multi-label classification method provided by the invention can capture the specific meanings of some special characters in the text by introducing the Attention mechanism, improves the classification effect, can acquire the interrelation among the multi-labels by introducing the multi-label reasoning mechanism, and cannot be influenced by the sequence among the multi-labels. Finally, the pre-training model continuously pre-trained in the vertical dialogue field is used, so that the precision of the model is further improved, and an excellent classification result is achieved.
Referring now to fig. 6, a schematic diagram of a computer device 600 suitable for use in implementing an electronic device (e.g., the server or terminal device shown in fig. 1) according to an embodiment of the present application is shown. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer apparatus 600 includes a Central Processing Unit (CPU) 601 and a Graphics Processing Unit (GPU) 602, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 603 or a program loaded from a storage section 609 into a Random Access Memory (RAM) 604. In the RAM 604, various programs and data necessary for the operation of the apparatus 600 are also stored. The CPU 601, GPU 602, ROM 603 and RAM 604 are connected to each other via a bus 605. An input/output (I/O) interface 606 is also connected to the bus 605.
The following components are connected to the I/O interface 606: an input portion 607 including a keyboard, a mouse, and the like; an output section 608 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 609 including a hard disk and the like; and a communication section 610 including a network interface card such as a LAN card, a modem, or the like. The communication section 610 performs communication processing via a network such as the internet. The driver 611 may also be connected to the I/O interface 606 as needed. A removable medium 612 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 611 as necessary, so that a computer program read out therefrom is mounted into the storage section 609 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication section 610, and/or installed from the removable media 612. The computer programs, when executed by a Central Processing Unit (CPU)601 and a Graphics Processor (GPU)602, perform the above-described functions defined in the methods of the present application.
It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor device or apparatus, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution apparatus or device. A computer readable signal medium, however, may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution apparatus or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based devices that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present application may be implemented by software or hardware. The modules described may also be provided in a processor.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a preprocessed dialogue data set, divide the preprocessed dialogue data set into a plurality of batch data, and obtain a feature vector of the t-th character in the i-th sentence through a trained pre-training model based on the batch data; input the feature vector of the t-th character in the i-th sentence into a first Attention layer to obtain a first feature vector of the i-th sentence; establish a label information matrix, and obtain the label feature vector of the last prediction from the label prediction result obtained last time and the label information matrix; splice the first feature vector of the i-th sentence with the label feature vector of the last prediction, and input the spliced vector into a second Attention layer to obtain a second feature vector of the i-th sentence; splice the first feature vector of the i-th sentence with the second feature vector of the i-th sentence, and input the spliced vector into a residual network structure for feature extraction to obtain a third feature vector of the i-th sentence; input the third feature vector of the i-th sentence into a classifier to obtain a label prediction result; and repeat the above steps multiple times to obtain a final label prediction result.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (12)

1. A multi-label classification method, characterized by comprising the following steps:
S1, acquiring a preprocessed dialogue data set, dividing the preprocessed dialogue data set into a plurality of batch data, and obtaining a feature vector of the t-th character in the i-th sentence through a trained pre-training model based on the batch data;
S2, inputting the feature vector of the t-th character in the i-th sentence into a first Attention layer to obtain a first feature vector of the i-th sentence;
S3, establishing a label information matrix, and obtaining the label feature vector of the last prediction from the label prediction result obtained last time and the label information matrix;
S4, splicing the first feature vector of the i-th sentence with the label feature vector of the last prediction, and inputting the spliced vector into a second Attention layer to obtain a second feature vector of the i-th sentence;
S5, splicing the first feature vector of the i-th sentence and the second feature vector of the i-th sentence, and inputting the spliced vector into a residual network structure for feature extraction to obtain a third feature vector of the i-th sentence;
S6, inputting the third feature vector of the i-th sentence into a classifier to obtain a label prediction result;
S7, repeating steps S1-S6 multiple times to obtain the final label prediction result.
2. The multi-label classification method according to claim 1, characterized in that the preprocessed dialogue data set is obtained by: and removing useless characters in the dialogue data set, wherein the useless characters comprise special characters and emoticons.
3. The multi-label classification method according to claim 1, characterized in that the trained pre-trained model is obtained by the following steps: and pre-training the pre-training model trained on the general data set by adopting the preprocessed dialogue data set to obtain the trained pre-training model in the vertical field.
4. The multi-label classification method according to any one of claims 1-3, characterized in that the pre-trained model comprises a Bert model comprising 12-layer encoders.
5. The multi-label classification method according to claim 4, wherein the trained pre-training model in step S1 is pre-trained again using the open-source RoBERTa released by Harbin Institute of Technology (HIT) as initial weights, and the input is the id in the dictionary of the t-th character of the i-th sentence of the batch data.
6. The multi-label classification method according to claim 1, wherein the step S2 of obtaining the importance information of each character to the sentence characteristics through the first Attention layer specifically includes:
u_it = tanh(W_w · h_it + b_w);
α_it = exp(u_it^T · u_w) / Σ_t exp(u_it^T · u_w);
s_i = Σ_t α_it · h_it;
where h_it is the feature vector of the t-th character in the i-th sentence, W_w and b_w are the corresponding weight and bias, u_w is a character-level context (weight) vector, α_it is the degree of importance of each character in the sentence, and s_i is the first feature vector of the i-th sentence.
7. The multi-label classification method according to claim 1, wherein the label information matrix in step S3 is an initialized matrix weight W_{n×h}, where n is the number of all labels and h is the feature dimension; the label feature vector of the last prediction corresponding to each batch data is o_{t-1} * W_{n×h}, where o_{t-1} is the label prediction result obtained last time, i.e. the prediction score of each of the n labels, the first label prediction result is zero, and the multiplication is performed element by element.
8. The multi-label classification method according to claim 1, wherein the importance information of different labels is obtained through the second Attention layer in step S4.
9. The multi-label classification method according to claim 1, wherein the classifier in step S6 includes a linear layer, and the linear layer employs a sigmoid activation function.
10. A multi-label classification apparatus, comprising:
the pre-training module is configured to acquire a preprocessed dialogue data set, divide the preprocessed dialogue data set into a plurality of batch data, and obtain a feature vector of a t-th character in an ith sentence through a trained pre-training model based on the batch data;
the first Attention module is configured to input a feature vector of a t-th character in the ith sentence into a first Attention layer to obtain a first feature vector of the ith sentence;
the label characteristic vector acquisition module is configured to establish a label information matrix and obtain a label characteristic vector obtained by last prediction through a label prediction result obtained last time and the label information matrix;
the second Attention module is configured to splice the first feature vector of the ith sentence with the tag feature vector obtained by the last prediction, and input the spliced first feature vector into a second Attention layer to obtain a second feature vector of the ith sentence;
the feature extraction module is configured to splice the first feature vector of the i-th sentence and the second feature vector of the i-th sentence, and input the spliced vector into a residual network structure for feature extraction to obtain a third feature vector of the i-th sentence;
the classification module is configured to input the third feature vector of the ith sentence into a classifier to obtain a label prediction result;
and the repeating module is configured to repeatedly execute the pre-training module to the classifying module for multiple times to obtain a final label prediction result.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-9.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-9.
CN202111087866.7A 2021-09-16 2021-09-16 Multi-label classification method, device and readable medium Active CN113806507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111087866.7A CN113806507B (en) 2021-09-16 2021-09-16 Multi-label classification method, device and readable medium

Publications (2)

Publication Number Publication Date
CN113806507A true CN113806507A (en) 2021-12-17
CN113806507B CN113806507B (en) 2023-06-23

Family

ID=78895590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111087866.7A Active CN113806507B (en) 2021-09-16 2021-09-16 Multi-label classification method, device and readable medium

Country Status (1)

Country Link
CN (1) CN113806507B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829818A (en) * 2018-06-12 2018-11-16 中国科学院计算技术研究所 A kind of file classification method
WO2021027218A1 (en) * 2019-08-12 2021-02-18 北京国双科技有限公司 Text classification method and apparatus, and computer readable medium
CN111222320A (en) * 2019-12-17 2020-06-02 共道网络科技有限公司 Character prediction model training method and device
CN112487143A (en) * 2020-11-30 2021-03-12 重庆邮电大学 Public opinion big data analysis-based multi-label text classification method
CN112818086A (en) * 2021-02-04 2021-05-18 上海畅圣计算机科技有限公司 Multi-label classification method for acquiring client intention label by robot
CN113177102A (en) * 2021-06-30 2021-07-27 智者四海(北京)技术有限公司 Text classification method and device, computing equipment and computer readable medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117746167A (en) * 2024-02-20 2024-03-22 四川大学 Training method and classifying method for oral panorama image swing bit error classification model
CN117746167B (en) * 2024-02-20 2024-04-19 四川大学 Training method and classifying method for oral panorama image swing bit error classification model

Also Published As

Publication number Publication date
CN113806507B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN107273503B (en) Method and device for generating parallel text in same language
CN111444340B (en) Text classification method, device, equipment and storage medium
US11507748B2 (en) Method and apparatus for outputting information
CN112732911B (en) Semantic recognition-based speaking recommendation method, device, equipment and storage medium
CN116194912A (en) Method and system for aspect-level emotion classification using graph diffusion transducers
CN112699991A (en) Method, electronic device, and computer-readable medium for accelerating information processing for neural network training
CN112231569B (en) News recommendation method, device, computer equipment and storage medium
CN112214601B (en) Social short text sentiment classification method and device and storage medium
CN111428010A (en) Man-machine intelligent question and answer method and device
CN108491812B (en) Method and device for generating face recognition model
CN110704586A (en) Information processing method and system
CN111666500A (en) Training method of text classification model and related equipment
US11947920B2 (en) Man-machine dialogue method and system, computer device and medium
CN111666416A (en) Method and apparatus for generating semantic matching model
Trivedi et al. Chatbot generation and integration: A review
CN110377733A (en) A kind of text based Emotion identification method, terminal device and medium
CN113434683A (en) Text classification method, device, medium and electronic equipment
CN113806507B (en) Multi-label classification method, device and readable medium
Asaoka et al. Nonnegative/Binary matrix factorization for image classification using quantum annealing
Chan et al. Optimization of language models by word computing
CN111723186A (en) Knowledge graph generation method based on artificial intelligence for dialog system and electronic equipment
CN111666405A (en) Method and device for recognizing text implication relation
CN113268575A (en) Entity relationship identification method and device and readable medium
CN113657092A (en) Method, apparatus, device and medium for identifying label
CN112446738A (en) Advertisement data processing method, device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant