CN114691866A - Multilevel label-oriented text classification method, device, equipment and storage medium - Google Patents

Multilevel label-oriented text classification method, device, equipment and storage medium

Info

Publication number
CN114691866A
CN114691866A
Authority
CN
China
Prior art keywords
text
label
keyword
model
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210225366.3A
Other languages
Chinese (zh)
Inventor
王婧宜
禹宁
冯昊
孔庆超
王宇琪
许刚刚
曹家
罗引
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Wenge Technology Co ltd
AVIATION INDUSTRY INFORMATION CENTER
Institute of Automation of Chinese Academy of Science
Original Assignee
Beijing Zhongke Wenge Technology Co ltd
AVIATION INDUSTRY INFORMATION CENTER
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Wenge Technology Co ltd, AVIATION INDUSTRY INFORMATION CENTER, Institute of Automation of Chinese Academy of Science filed Critical Beijing Zhongke Wenge Technology Co ltd
Priority to CN202210225366.3A priority Critical patent/CN114691866A/en
Publication of CN114691866A publication Critical patent/CN114691866A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the disclosure relate to a multilevel label-oriented text classification method, apparatus, device, and storage medium. The method includes: obtaining a text and the labels corresponding to keywords in the text; encoding the text based on a text encoding model in a preset multilevel label-oriented text classification model to obtain a feature vector of the text that sensitively characterizes the text's keywords, and encoding the labels based on a label encoding model in the same classification model to obtain label vectors; computing the cosine similarity between the feature vector of the text and the vector of each label; and determining the labels whose cosine similarity is greater than a preset threshold as the labels of the text. By encoding the text and the existing category labels and computing cosine similarity, the labels matching the text content are selected, which reduces the dependence on manual annotation, lowers the maintenance cost of manual labeling and the label system, improves labeling accuracy, and makes the text classification result more accurate.

Description

Multilevel label-oriented text classification method, device, equipment and storage medium
Technical Field
The embodiments of the disclosure relate to the technical field of natural language processing, and in particular to a multilevel label-oriented text classification method, apparatus, device, and storage medium.
Background
In recent years, many industries have begun to build data platforms for the efficient storage, management, and exchange of information resources. As information resources accumulate, their scientific management becomes increasingly important, and information tag annotation is a key technology for efficient retrieval and management. Information tag annotation is the task of deeply analyzing the title and content of an article and finding, within a defined label system, one or more labels that reflect the text's themes and topics; it is also called the multi-label classification task.
At present, the multi-label classification task is mostly built on an industry label system in which each label subsumes sub-labels or keywords, and the labels corresponding to matched keywords are returned by a keyword matching method. In recent years, with the rise of deep learning algorithms, many deep learning methods have also been applied to the multi-label classification task.
However, in the related art, multi-label classification of text requires a large amount of annotated label data, and most label systems are constructed and maintained manually, which is subjective, costly, and slow to update. Labels matched by keywords usually carry considerable noise and misjudgment, so a text and its labels may be only weakly related and the accuracy is low. On the other hand, deep-learning-based multi-label classification requires a large number of manually annotated labels, and annotating multi-label data becomes harder as the label space grows; for long texts in particular, an annotator may not traverse all candidate labels and may give only a subset of the true labels. Meanwhile, the related multi-label classification techniques usually consider only the labels and ignore keyword information, which easily leads to low label recall. A simple multilevel label-oriented text classification method is therefore needed to improve labeling accuracy.
Disclosure of Invention
To solve the above technical problem, or at least partially solve it, the embodiments of the present disclosure provide a multilevel label-oriented text classification method, apparatus, device, and storage medium.
A first aspect of the embodiments of the present disclosure provides a multilevel label-oriented text classification method, the method including:
obtaining a text and the labels corresponding to keywords in the text; encoding the text based on a text encoding model in a preset multilevel label-oriented text classification model to obtain a feature vector of the text that sensitively characterizes the text's keywords, and encoding the labels based on a label encoding model in the preset multilevel label-oriented text classification model to obtain label vectors; computing the cosine similarity between the feature vector of the text and the vector of each label; and determining the labels whose cosine similarity is greater than a preset threshold as the labels of the text.
A second aspect of the embodiments of the present disclosure provides a text classification apparatus for multilevel labels, including:
the acquisition module is used for acquiring the text and the label corresponding to the keyword in the text;
the encoding module is used for encoding the text based on a text encoding model in a preset multilevel label-oriented text classification model to obtain a feature vector of the text that sensitively characterizes the text's keywords, and for encoding the labels based on a label encoding model in the preset multilevel label-oriented text classification model to obtain label vectors;
the calculation module is used for calculating cosine similarity between the feature vector of the text and the vector of each label;
and the determining module is used for determining the label with the cosine similarity larger than the preset threshold as the label of the text.
A third aspect of embodiments of the present disclosure provides a computing device comprising a memory and a processor, wherein the memory has stored therein a computer program, which when executed by the processor, may implement the method of the first aspect described above.
A fourth aspect of embodiments of the present disclosure provides a computer-readable storage medium having a computer program stored therein, which, when executed by a processor, may implement the method of the first aspect described above.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
according to the embodiment of the disclosure, the text and the label corresponding to the keyword in the text are obtained; the method comprises the steps of coding a text based on a text coding model in a preset multistage label-oriented text classification model to obtain a feature vector of the text, carrying out sensitive representation on a keyword of the text by the feature vector of the text, and coding a label based on a label coding model in the preset multistage label-oriented text classification model to obtain a vector of the label; respectively calculating cosine similarity between the feature vector of the text and the vector of each label; and determining the label with the cosine similarity larger than a preset threshold as the label of the text. According to the method and the device for text classification, the text and the existing class labels are subjected to coding processing and cosine similarity calculation processing, and the labels matched with the text content are selected, so that the dependence on a large number of manual labeling labels is reduced, the maintenance cost of manual labeling and a label system is reduced, the label labeling accuracy is improved, and the text classification result is more accurate.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
To illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed in their description are briefly introduced below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a flowchart of a training method of a text classification model for multi-level labels according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of an embedded layer structure of a BERT model of a text classification model for multi-level tags according to an embodiment of the present disclosure;
fig. 3 is a flowchart of a text classification method for a multi-level label according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a text classification apparatus for multilevel labels according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
Fig. 1 is a flowchart of a method for training a multilevel label-oriented text classification model according to an embodiment of the present disclosure. The method may be executed by a computing device, understood as any device with computing functions and processing capabilities. As shown in fig. 1, the training method provided in this embodiment includes the following steps:
step 101, obtaining a text and a label corresponding to a keyword in the text to form an original data set.
In the embodiment of the present disclosure, the obtained text and the tags corresponding to the keywords in the text may be obtained through steps S11-S12:
and S11, matching the text to obtain the keywords based on the existing keyword list.
The keyword table referred to in the embodiments of the present disclosure may be understood as a set of predefined keywords in a certain field, and one or more keywords may be obtained from one text by matching.
S12, determining the labels corresponding to the keywords in the text based on the mapping relationship between keywords and labels.
The mapping relationship between keywords and labels can be obtained from an existing label-keyword system: one label can correspond to multiple keywords, and each text can correspond to one or more labels. The label-keyword system can be constructed manually; the related construction techniques are prior art and are not repeated here.
For example, in the text "XXX company manufactures the first 3D printed drone and successfully completes the test flight", the keywords "3D printing" and "test flight" appear; therefore, under the existing keyword-label mapping, the labels of this text are "structural strength technique" (corresponding to the keyword "3D printing") and "flight test technique" (corresponding to the keyword "test flight"). The labels given here are only exemplary.
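As a minimal illustration of steps S11-S12, the following sketch matches keywords by substring lookup and maps them to labels; the keyword table, the mapping, and the matching strategy are hypothetical examples, not part of the patent:

```python
# Hypothetical keyword table and keyword -> label mapping, for illustration only.
KEYWORD_TO_LABEL = {
    "3D printing": "structural strength technique",
    "test flight": "flight test technique",
}

def match_labels(text: str) -> list[str]:
    """S11: match keywords in the text; S12: map matched keywords to labels."""
    hits = [kw for kw in KEYWORD_TO_LABEL if kw in text]
    return sorted({KEYWORD_TO_LABEL[kw] for kw in hits})

print(match_labels("XXX company completes 3D printing of its first drone "
                   "and successfully completes the test flight"))
# -> ['flight test technique', 'structural strength technique']
```

A production system would typically scan the full keyword table with a trie or an Aho-Corasick automaton rather than per-keyword substring checks.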
In this way, multiple texts and the labels corresponding to each text are obtained to form an original data set: one text together with its labels forms a sample, and the original data set contains multiple samples. Note that the texts need not be labeled manually; approximate labels are obtained from the text data itself, and the resulting labels may contain noise, i.e., labels that do not match the text semantics.
Step 102, dividing the original data set into a training set and a verification set, inputting the training set into the multilevel label-oriented text classification model for training, and constructing positive and negative samples in each Batch.
The multilevel label-oriented text classification model in the embodiments of the disclosure is trained by contrastive learning: the original data set is divided into a training set and a verification set, the training set is input into the model for training, and positive and negative samples are constructed in each Batch.
Specifically, N samples may be randomly drawn from the data set to form one Batch. Any sample in the Batch contains a text and its k corresponding labels; the text paired with each of its k labels forms k positive samples, and the text paired with the other K−k labels of the full label set forms K−k negative samples. The full label set here can be understood as all labels in the constructed label system, and K is its total number of labels. Each Batch thus yields N·K samples; N, K, and k are positive integers with k ≤ K.
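A sketch of this Batch construction under the above notation (the data structures are assumptions for illustration):

```python
import random

def build_batch(dataset, all_labels, n):
    """Build one Batch of (text, label, is_positive) pairs.

    dataset:    list of (text, labels) samples from weak keyword matching
    all_labels: full label set of the label system (size K)
    n:          number N of samples drawn per Batch
    """
    batch = []
    for text, pos_labels in random.sample(dataset, n):
        for label in pos_labels:                      # k positive samples
            batch.append((text, label, True))
        for label in all_labels - set(pos_labels):    # K - k negative samples
            batch.append((text, label, False))
    return batch                                      # N * K pairs in total
```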
Step 103, encoding the text in each sample to obtain a feature vector of the text that sensitively characterizes the text's keywords, and encoding the label to obtain a label vector.
The multilevel label-oriented text classification model in the embodiments of the disclosure learns from positive and negative samples in a feature space, so the texts and labels in the samples need to be converted into vector representations.
Saying that the feature vector of the text sensitively characterizes its keywords can be understood as emphasizing, within the feature vector, which keywords were matched in the text. In one embodiment, non-keyword content and keyword content may be represented in different ways, so that the keywords are sensitively characterized.
In the embodiments of the disclosure, the text is encoded with an existing encoding model. To emphasize the keyword information matched in the text, a keyword embedding layer may be added to the encoding model; the matched keyword information is injected into this layer, and the text is then converted into a feature vector that contains the text's keyword information and sensitively characterizes its keywords.
In one implementation of the embodiments of the present disclosure, the encoding model for the text may be a BERT (Bidirectional Encoder Representations from Transformers) model. BERT is a deep bidirectional pre-trained language model that learns a large amount of lexical, syntactic, and semantic information by self-supervised learning on large-scale corpora, and through its bidirectional representation it can output a feature vector of the text that fuses contextual semantics.
On top of the original three embedding features of the BERT embedding layer, namely the token embedding, segment embedding, and position embedding layers, a keyword embedding layer (keyword embedding) is added, and the keyword information extracted from the text is injected into it to form a sensitive representation of the keywords.
Illustratively, fig. 2 shows the embedding layer structure of the BERT model for the input text "XXXX completes the new type of radar flight test". [CLS] (classification) is placed at the beginning of the input sentence and indicates that a classification task may follow; [SEP] (separator) is placed in the middle or at the end of the input text and separates two input sentences. The text hits the keyword "flight test", so in the keyword embedding layer the positions covered by "flight test" can be represented as E_K and all other positions as E_N, forming a sensitive representation of the keyword "flight test". In the subsequent model training, E_K and E_N can be randomly initialized and updated as model parameters by minimizing the loss function.
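A minimal PyTorch sketch of such a keyword-aware embedding layer, assuming the convention that keyword positions get id 1 (E_K) and all other positions id 0 (E_N); the vocabulary size and hidden dimension are illustrative defaults, not values from the patent:

```python
import torch
import torch.nn as nn

class KeywordAwareEmbeddings(nn.Module):
    """BERT-style embeddings plus an added keyword embedding layer."""

    def __init__(self, vocab_size=21128, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)    # token embedding
        self.segment = nn.Embedding(2, hidden)           # segment embedding
        self.position = nn.Embedding(max_len, hidden)    # position embedding
        self.keyword = nn.Embedding(2, hidden)           # id 0 -> E_N, id 1 -> E_K

    def forward(self, token_ids, segment_ids, keyword_mask):
        # keyword_mask (LongTensor): 1 at positions covered by a matched keyword, else 0.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids) + self.segment(segment_ids)
                + self.position(positions) + self.keyword(keyword_mask))
```

The four embeddings are summed position-wise, matching how BERT combines its original three embedding layers.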
After the embedding layer has processed the text, the result is input into the Transformer network architecture of the BERT model, which makes the output feature representations carry the text's semantic information. After this processing, the BERT model finally outputs the feature vector of the text, which contains the text's keyword information.
In some embodiments of the present disclosure, the feature vector of the text may be the vector corresponding to the placeholder [CLS] in the last Transformer layer, or the vector obtained by summing the outputs of the first and last Transformer layers and then average-pooling; this is not limited here.
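The two pooling options read, in a PyTorch-style sketch (the hidden-state shapes are assumptions):

```python
import torch

def pool_text_vector(first_layer: torch.Tensor,
                     last_layer: torch.Tensor,
                     use_cls: bool = True) -> torch.Tensor:
    """Hidden states are [batch, seq_len, dim]; returns [batch, dim]."""
    if use_cls:
        return last_layer[:, 0]                    # [CLS] vector of the last layer
    return (first_layer + last_layer).mean(dim=1)  # sum first+last, then average-pool
```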
In some embodiments of the present disclosure, the number of tokens the encoding model can receive may be limited; a BERT model, for example, accepts at most 512 tokens of input. When the text is long, the title and the first and last paragraphs can be extracted from the text and spliced together, and the feature vector of the text is obtained by encoding the spliced content. In other embodiments, the title and the abstract may be extracted from the text and spliced, and the feature vector is obtained by encoding the spliced content.
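A sketch of the title plus first/last paragraph splice (the character budget is a hypothetical stand-in for the 512-token limit):

```python
def splice_long_text(title: str, paragraphs: list[str], budget: int = 1000) -> str:
    """Splice the title with the first and last paragraphs of a long text."""
    if len(paragraphs) <= 1:
        body = "".join(paragraphs)
    else:
        body = paragraphs[0] + paragraphs[-1]
    return (title + body)[:budget]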
In the embodiments of the disclosure, before the text is encoded, it may be preprocessed to reduce word variation and text noise in at least one of the following ways: deleting HyperText Markup Language (HTML) tags, converting traditional Chinese characters into simplified characters, unifying the case of English letters, and deleting content matching a preset regular expression. Regular expressions can remove information such as the author, source, and release time appearing in the text; they are a mature technique and are not described here again.
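A sketch of such preprocessing; the specific patterns, and the use of the OpenCC library for traditional-to-simplified conversion, are assumptions for illustration:

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")
META_LINE = re.compile(r"(author|source|release time)\s*[::].*", re.IGNORECASE)

def preprocess(text: str) -> str:
    text = HTML_TAG.sub("", text)    # delete HTML tags
    text = META_LINE.sub("", text)   # delete content matching the preset regex
    text = text.lower()              # unify the case of English letters
    # Traditional -> simplified conversion could use a library such as OpenCC,
    # e.g. opencc.OpenCC("t2s").convert(text); omitted here as an assumption.
    return text.strip()
```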
When the labels are encoded, note that in many scenarios a label is a general term with a broad semantic range: its expressed semantics are not specific, and understanding what it concretely covers requires domain knowledge and keywords, so it is difficult for a computing device to understand a label from its literal meaning. For example, the label "aircraft overall integrated design technology" has corresponding keywords such as "aircraft exterior design technology", "aerodynamic layout technology", and "stealth technology"; given only the label text, the device can hardly determine which semantics the label specifically corresponds to. Therefore, to avoid the added semantic complexity of representing label semantics, the embodiments of the present disclosure may use an existing encoding model to convert the label into a label vector that does not encode the label's semantics; the model only models the mapping between the text feature vector and the label vector.
In one implementation of the embodiments of the present disclosure, the discrete labels may be mapped to vector representations by one-hot encoding, and the vector dimension is then changed by a matrix transformation so that the resulting label vector has the same dimension as the text feature vector. The matrix is constructed and randomly initialized during model training, and its parameters are determined by back-propagation and minimizing the loss function, which achieves the goal of modeling the mapping between text feature vectors and label vectors. One-hot encoding is a mature technique and is not detailed here.
Illustratively, assume a one-hot vector $y \in \mathbb{R}^{K}$, where $\mathbb{R}$ denotes the set of real numbers (all rational and irrational numbers) and $K$ is the total number of labels. During model training, a matrix $A$ is constructed and randomly initialized, $A \in \mathbb{R}^{M \times K}$, where $M$ is the feature dimension of the BERT model. The label vector is then computed as $p = A y \in \mathbb{R}^{M}$. The parameters in matrix $A$ are determined during model training by back-propagation and minimizing the loss function.
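Since multiplying $A$ by a one-hot vector simply selects one column of $A$, the transformation can be implemented as an embedding lookup; a minimal sketch (the hidden dimension is an illustrative default):

```python
import torch
import torch.nn as nn

class LabelEncoder(nn.Module):
    """Label id -> M-dimensional label vector, equivalent to p = A @ one_hot(label)."""

    def __init__(self, num_labels: int, hidden: int = 768):
        super().__init__()
        self.A = nn.Embedding(num_labels, hidden)  # columns of A, randomly initialized

    def forward(self, label_ids: torch.Tensor) -> torch.Tensor:
        return self.A(label_ids)                   # trained by back-propagation
```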
Step 104, inputting the feature vector of the text and the vector of each label into a loss function, determining the loss value, and iteratively updating the model parameters by minimizing the loss function.
The loss function referred to in the embodiments of the present disclosure may be understood as a function that maps a value of a random event or its related random variable to a non-negative real number to represent a "risk" or a "loss" of the random event. In application, the loss function is usually associated with the optimization problem as a learning criterion, i.e. the accuracy of the model is solved and evaluated by minimizing the loss function.
The loss function in the embodiments of the present disclosure may be the InfoNCE (Information Noise-Contrastive Estimation) loss, a contrastive loss function for self-supervised learning, where NCE stands for Noise-Contrastive Estimation. Specifically, the text feature vectors and label vectors of the positive samples obtained in the steps above, together with those of the corresponding negative samples, are input into the loss function, which may be written as:

$$\mathcal{L} = -\sum_{(q_i,\,p_i^{+}) \in K^{+}} \log \frac{\exp\!\big(\mathrm{sim}(q_i, p_i^{+})/\tau\big)}{\sum_{(q_j,\,p_j^{+}) \in K^{+}} \exp\!\big(\mathrm{sim}(q_j, p_j^{+})/\tau\big) + \sum_{(q_j,\,p_j^{-}) \in K^{-}} \exp\!\big(\mathrm{sim}(q_j, p_j^{-})/\tau\big)}$$

where $(q_i, p_i^{+})$ denotes the $i$-th sample, belonging to the positive sample set, with $q_i$ the feature vector of the article and $p_i^{+}$ the vector of its label; $(q_j, p_j^{+})$ denotes the $j$-th sample of the positive sample set and $(q_j, p_j^{-})$ the $j$-th sample of the negative sample set, defined analogously; $\mathrm{sim}$ denotes cosine similarity; $\tau$ is a temperature parameter, a model hyper-parameter; and $K^{+}$ and $K^{-}$ denote the positive and negative sample sets, respectively.
Cosine similarity here is the cosine of the angle between two vectors; it compares the two vectors by their angle in space, and the closer the angle, the more similar the vectors. It can serve as the measure of similarity between a text feature vector and a label vector; here it ranges from 0 to 1, and the larger the cosine similarity, the more similar the two vectors.
The value of the loss function in the embodiments of the present disclosure is defined as the loss value, a non-negative real number representing the loss or error of the multilevel label-oriented text classification model. The smaller the loss value, the greater the cosine similarity between text feature vectors and their label vectors, and the higher the model accuracy.
In one implementation of the embodiments of the present disclosure, during back-propagation an Adam optimizer is used to iteratively update the model parameters by minimizing the loss function until a stop condition is reached; Adam is a mature technique and is not described here again. Other related techniques may also be used to minimize the loss function, which is not limited here.
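A PyTorch sketch of one training step with a per-anchor form of this loss; note that this common variant restricts each row's denominator to the anchor's own positive and negatives, a simplification of the pooled denominator in the formula above, and the temperature value is an assumption:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(text_vecs, pos_label_vecs, neg_label_vecs, tau=0.07):
    """text_vecs, pos_label_vecs: [P, M]; neg_label_vecs: [P, Q, M]."""
    pos = F.cosine_similarity(text_vecs, pos_label_vecs, dim=-1) / tau               # [P]
    neg = F.cosine_similarity(text_vecs.unsqueeze(1), neg_label_vecs, dim=-1) / tau  # [P, Q]
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                               # [P, 1+Q]
    # The positive pair sits in column 0 of each row's softmax.
    target = torch.zeros(len(logits), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)

# One training step, sketched:
# loss = info_nce_loss(q, p_pos, p_neg)
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```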
Step 105, training the multilevel label-oriented text classification model on the training set and verifying it on the verification set, computing the model's loss value on the verification set, stopping training when that loss value is less than or equal to a first preset threshold, and determining the final parameters of the model.
In the embodiments of the present disclosure, the first preset threshold of the loss value may be set to the minimum of the loss function. The loss reaching its minimum can be understood as the loss value converging and no longer decreasing, i.e., the model accuracy reaching its maximum: the probability that a text feature vector is similar to its label vector, i.e., their cosine similarity, is maximal, and the label corresponding to that label vector is then the label closest to the text semantics.
In other embodiments of the present disclosure, the first preset threshold of the loss value may be set by a user according to actual needs, or may be set by a computing device as a default, which is not limited to this.
The multilevel label-oriented text classification model is trained on the training set and verified on the verification set, and the model's loss value on the verification set is computed. If the loss value is greater than the first preset threshold, training continues; if it is less than or equal to the first preset threshold, training stops, and the parameters for which the loss value is less than or equal to the first preset threshold are determined as the final model parameters.
In other embodiments of the present disclosure, verification may instead be performed over a preset number of cycles: the model's loss on the verification set is computed every preset cycle, and when the loss no longer decreases within the preset number of cycles, training stops and the parameters from the previous cycle's iteration are determined as the final model parameters.
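A sketch of this patience-based stopping rule (the cycle functions are placeholders for the actual training and validation passes):

```python
def train_with_early_stopping(model, run_train_cycle, val_loss, patience=3):
    """Stop when the validation loss has not decreased for `patience` cycles."""
    best, best_state, waited = float("inf"), None, 0
    while waited < patience:
        run_train_cycle(model)            # one cycle over the training set
        loss = val_loss(model)            # loss on the verification set
        if loss < best:
            best, best_state, waited = loss, model.state_dict(), 0
        else:
            waited += 1
    model.load_state_dict(best_state)     # restore the best previous cycle
    return model
```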
Through repeated training on a large amount of data, the multilevel label-oriented text classification model pulls positive (similar) samples closer and pushes negative samples apart in the feature space, thereby characterizing the samples' feature representations, and learns a text encoding model for texts and a label encoding model for labels. Even when the labels contain noisy data, as long as the data volume is large enough, the model can still learn the correct correspondence between labels and texts.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
according to the method and the device, an original data set is formed by obtaining the text and the labels corresponding to the keywords in the text; dividing an original data set into a training set and a verification set, inputting the training set into a text classification model facing to a multi-level label for training, and constructing a positive sample and a negative sample in each Batch processing Batch; coding the text in the sample to obtain a feature vector of the text, performing sensitive representation on the key words of the text by the feature vector of the text, and coding the label to obtain a vector of the label; inputting the feature vector of the text and the vector of each label into a loss function, and determining a loss value; training a text classification model facing the multilevel labels based on a training set, verifying the text classification model facing the multilevel labels based on a verification set, calculating loss values of the model on the verification set until the loss values of the model on the verification set are less than or equal to a first preset threshold, stopping training, determining final parameters of the text classification model facing the multilevel labels, obtaining the text classification model facing the multilevel labels, and being applicable to text classification facing the multilevel labels. The embodiment of the disclosure obtains the text classification model facing the multilevel labels by training the text classification model facing the multilevel labels for the texts and the existing category labels, adds the keyword embedding layer in the input of the text classification model facing the multilevel labels, effectively injects the keywords in the texts into the model to form the sensitive representation for the keywords, uses the massive noise data as the training signal, learns the text coding model for the texts and the label coding model for the labels by a contrast learning method, achieves the effects of short distance between the texts and the related labels and long distance between the unrelated labels, and the model is applied to the text classification facing the multilevel labels, can greatly reduce the dependence on a large number of manually labeled labels, improves the label labeling accuracy, makes the text classification result more accurate, and the model learns the mapping relation between the existing keywords and the labels, the method can also learn the mapping relation between new keywords and the labels, has good generalization capability, and reduces the maintenance cost of a manual labeling and label system.
Fig. 3 is a flowchart of a multilevel label-oriented text classification method according to an embodiment of the present disclosure, which may be executed by a computing device, understood as any device with computing functions and processing capabilities. As shown in fig. 3, the multilevel label-oriented text classification method provided in this embodiment includes the following steps:
step 301, obtaining a text and a label corresponding to a keyword in the text.
The text and the labels corresponding to its keywords are obtained in the same manner as steps S11-S12 of step 101, which is not repeated here.
Step 302, encoding the text based on a text encoding model in a preset multilevel label-oriented text classification model to obtain a feature vector of the text that sensitively characterizes the text's keywords, and encoding the labels based on a label encoding model in the preset multilevel label-oriented text classification model to obtain label vectors.
That the feature vector of the text sensitively characterizes its keywords can be understood as emphasizing, within the feature vector, which keywords were matched in the text. In one embodiment, non-keyword content and keyword content may be represented in different ways, so that the keywords are sensitively characterized.
The preset multilevel label-oriented text classification model referred to here is the model trained as in fig. 1: the obtained text is input into its text encoding model and the obtained labels into its label encoding model; the text is encoded to obtain a feature vector that sensitively characterizes its keywords, and the labels are encoded to obtain label vectors.
As above, before the text is encoded, it may be preprocessed to reduce word variation and text noise in at least one of the following ways: deleting HTML tags, converting traditional Chinese characters into simplified characters, unifying the case of English letters, and deleting content matching a preset regular expression. Regular expressions can remove information such as the author, source, and release time appearing in the text; they are a mature technique and are not described here again.
In the embodiments of the present disclosure, the number of characters the multilevel label-oriented text classification model can receive may be limited. When the text is long, the title and the first and last paragraphs can be extracted from the text and spliced, and the feature vector of the text is obtained by encoding the spliced content. In other embodiments, the title and the abstract may be extracted and spliced, and the feature vector is obtained by encoding the spliced content.
Step 303, respectively calculating cosine similarity between the feature vector of the text and the vector of each label.
Cosine similarity in the embodiments of the present disclosure evaluates the similarity of two vectors by computing the cosine of the angle between them. Its range here is 0 to 1, and the larger the cosine similarity, the more similar the text feature vector and the label vector. The computation of cosine similarity is disclosed in the related art and is not repeated here.
And step 304, determining the label with the cosine similarity larger than a preset threshold as a label of the text.
In the embodiment of the present disclosure, the preset threshold of the cosine similarity may be set by a user according to actual requirements, or may be set by default by a computing device, which is not limited to this.
In the embodiments of the present disclosure, if the cosine similarity is less than or equal to the preset threshold, the corresponding label is discarded; if it is greater than the preset threshold, the corresponding label is determined as a label of the text. There may be one or more such labels, which can be considered the labels closest to the text semantics.
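An end-to-end sketch of steps 303-304 (the threshold value is a placeholder; the patent leaves it to the user or a device default):

```python
import torch
import torch.nn.functional as F

def classify(text_vec, label_vecs, label_names, threshold=0.5):
    """Return the labels whose cosine similarity with the text exceeds the threshold.

    text_vec:   [M] feature vector of the text
    label_vecs: [K, M] vectors of all labels
    """
    sims = F.cosine_similarity(text_vec.unsqueeze(0), label_vecs)  # one score per label
    return [name for name, s in zip(label_names, sims.tolist()) if s > threshold]
```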
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
the method comprises the steps of obtaining a text and a label corresponding to a keyword in the text; the method comprises the steps of coding a text based on a text coding model in a preset multistage label-oriented text classification model to obtain a feature vector of the text, carrying out sensitive representation on a keyword of the text by the feature vector of the text, and coding a label based on a label coding model in the preset multistage label-oriented text classification model to obtain a vector of the label; respectively calculating cosine similarity between the feature vector of the text and the vector of each label; and determining the label with the cosine similarity larger than a preset threshold as the label of the text. According to the method and the device for text classification, the text and the existing class labels are subjected to coding processing and cosine similarity calculation processing, and the labels matched with the text content are selected, so that the dependence on a large number of manual labeling labels is greatly reduced, the maintenance cost of manual labeling and a label system is reduced, the label labeling accuracy is improved, and the text classification result is more accurate.
Fig. 4 is a schematic structural diagram of a text classification apparatus for a multilevel tag according to an embodiment of the present disclosure, where the apparatus may be understood as the above-mentioned computing device or a part of functional modules in the above-mentioned computing device. As shown in fig. 4, the apparatus 400 for classifying a text based on a multi-level label includes:
an obtaining module 410, configured to obtain a text and a tag corresponding to a keyword in the text;
the encoding module 420 is configured to encode the text based on a text encoding model in a preset multilevel label-oriented text classification model to obtain a feature vector of the text that sensitively characterizes the text's keywords, and to encode the labels based on a label encoding model in the preset multilevel label-oriented text classification model to obtain label vectors;
a calculating module 430, configured to calculate cosine similarity between the feature vector of the text and the vector of each label;
the determining module 440 is configured to determine, as a label of the text, a label whose cosine similarity is greater than a preset threshold.
Optionally, the obtaining module 410 includes:
the first matching submodule is used for matching keywords from the text based on the existing keyword table;
and the first determining sub-module is used for determining and obtaining the label corresponding to the keyword in the text based on the mapping relation between the keyword and the label.
Optionally, the apparatus 400 for classifying texts facing multi-level labels further includes:
the preprocessing module is used for preprocessing the text, and the preprocessing mode at least comprises one of deleting hypertext markup language, converting traditional characters into simplified characters, unifying capital and small English cases and deleting contents which accord with a preset regular expression.
Optionally, the encoding module 420 includes:
the first splicing submodule is used for extracting the title and the contents of the first section and the last section from the text and splicing the title and the contents of the first section and the last section, or extracting the title and the abstract from the text and splicing the title and the abstract;
and the first coding submodule is used for coding the text content obtained by splicing to obtain the characteristic vector.
The multi-level tag-oriented text classification apparatus provided in this embodiment can execute the method in any embodiment in fig. 3, and the execution manner and the beneficial effects are similar, and are not described herein again.
The embodiment of the present disclosure further provides a computing device, where the computing device includes a processor and a memory, where the memory stores a computer program, and when the computer program is executed by the processor, the method in any embodiment in fig. 3 may be implemented, and an execution manner and beneficial effects of the method are similar, and are not described herein again.
The embodiment of the present disclosure provides a computer-readable storage medium, where a computer program is stored in the storage medium, and when the computer program is executed by a processor, the method of any embodiment in fig. 3 may be implemented, and the execution manner and the beneficial effect are similar, and are not described herein again.
The computer-readable storage medium described above may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer programs described above may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages, for performing the operations of the embodiments of the present disclosure. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A multilevel label-oriented text classification method, characterized in that the method comprises:
acquiring a text and a label corresponding to a keyword in the text;
encoding the text based on a text encoding model in a preset multilevel label-oriented text classification model to obtain a feature vector of the text, the feature vector sensitively characterizing the keywords of the text, and encoding the label based on a label encoding model in the preset multilevel label-oriented text classification model to obtain a label vector;
respectively calculating cosine similarity between the feature vector of the text and the vector of each label;
and determining the label with the cosine similarity larger than a preset threshold value as the label of the text.
2. The method of claim 1, wherein the obtaining the text and the tags corresponding to the keywords in the text comprises:
matching the text to obtain keywords based on the existing keyword list;
and determining to obtain a label corresponding to the keyword in the text based on the mapping relation between the keyword and the label.
3. The method according to claim 1, characterized in that before the text is encoded based on the text encoding model in the preset multilevel label-oriented text classification model to obtain the feature vector of the text, which sensitively characterizes the keywords of the text, and the label is encoded based on the label encoding model in the preset multilevel label-oriented text classification model to obtain the label vector, the method further comprises:
and preprocessing the text, wherein the preprocessing mode at least comprises one of deleting hypertext markup language, converting traditional characters into simplified characters, unifying capital and lowercase forms of English and deleting contents conforming to a preset regular expression.
4. The method of claim 1, wherein encoding the text comprises:
extracting the title and the contents of the first section and the last section from the text, and splicing the title and the contents of the first section and the last section, or extracting the title and the abstract from the text, and splicing the title and the abstract;
and carrying out coding processing based on the text content obtained by splicing to obtain the feature vector of the text.
5. An apparatus for classifying text based on multilevel labels, the apparatus comprising:
the acquisition module is used for acquiring the text and the label corresponding to the keyword in the text;
the encoding module is used for encoding the text based on a text encoding model in a preset multilevel label-oriented text classification model to obtain a feature vector of the text, the feature vector sensitively characterizing the keywords of the text, and encoding the label based on a label encoding model in the preset multilevel label-oriented text classification model to obtain a label vector;
the calculation module is used for respectively calculating cosine similarity between the feature vector of the text and the vector of each label;
and the determining module is used for determining the label of which the cosine similarity is greater than a preset threshold as the label of the text.
6. The apparatus of claim 5, wherein the obtaining module comprises:
the first matching submodule is used for matching and obtaining keywords from the text based on the existing keyword table;
and the first determining submodule is used for determining and obtaining the label corresponding to the keyword in the text based on the mapping relation between the keyword and the label.
7. The apparatus of claim 5, further comprising:
and the preprocessing module is used for preprocessing the text, and the preprocessing mode at least comprises one of deleting hypertext markup language, converting traditional characters into simplified characters, unifying capital and lower cases of English and deleting contents conforming to a preset regular expression.
8. The apparatus of claim 5, wherein the encoding module comprises:
the first splicing submodule is used for extracting the title and the contents of the first segment and the last segment from the text and splicing the title and the contents of the first segment and the last segment, or extracting the title and the abstract from the text and splicing the title and the abstract;
and the first coding submodule is used for coding the text content obtained by splicing to obtain the feature vector of the text.
9. A computing device, comprising:
memory and a processor, wherein the memory has stored therein a computer program which, when executed by the processor, implements the multilevel tag oriented text classification method of any of claims 1-4.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method for multi-level tag oriented text classification according to any one of claims 1-4.
CN202210225366.3A 2022-03-09 2022-03-09 Multilevel label-oriented text classification method, device, equipment and storage medium Pending CN114691866A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210225366.3A CN114691866A (en) 2022-03-09 2022-03-09 Multilevel label-oriented text classification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114691866A (en) 2022-07-01

Family

ID=82137160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210225366.3A Pending CN114691866A (en) 2022-03-09 2022-03-09 Multilevel label-oriented text classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114691866A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115310564A (en) * 2022-10-11 2022-11-08 北京睿企信息科技有限公司 Classification label updating method and system
CN115409130A (en) * 2022-10-11 2022-11-29 北京睿企信息科技有限公司 Optimization method and system for updating classification label
CN115964658A (en) * 2022-10-11 2023-04-14 北京睿企信息科技有限公司 Classification label updating method and system based on clustering
CN115409130B (en) * 2022-10-11 2023-08-15 北京睿企信息科技有限公司 Optimization method and system for updating classification labels
CN115964658B (en) * 2022-10-11 2023-10-20 北京睿企信息科技有限公司 Classification label updating method and system based on clustering

Similar Documents

Publication Publication Date Title
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN114691866A (en) Multilevel label-oriented text classification method, device, equipment and storage medium
CN111639171A (en) Knowledge graph question-answering method and device
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN113191148B (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN108664512B (en) Text object classification method and device
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN111930939A (en) Text detection method and device
CN112182167B (en) Text matching method and device, terminal equipment and storage medium
CN114416979A (en) Text query method, text query equipment and storage medium
CN111241410A (en) Industry news recommendation method and terminal
CN117217277A (en) Pre-training method, device, equipment, storage medium and product of language model
CN115374786A (en) Entity and relationship combined extraction method and device, storage medium and terminal
CN113312498B (en) Text information extraction method for embedding knowledge graph by undirected graph
CN113051886B (en) Test question duplicate checking method, device, storage medium and equipment
CN117828024A (en) Plug-in retrieval method, device, storage medium and equipment
CN118037261A (en) Knowledge graph-based power transmission and transformation equipment operation and maintenance method, device, equipment and medium
CN113988085B (en) Text semantic similarity matching method and device, electronic equipment and storage medium
CN114330350B (en) Named entity recognition method and device, electronic equipment and storage medium
CN112765940B (en) Webpage deduplication method based on theme features and content semantics
CN113886520A (en) Code retrieval method and system based on graph neural network and computer readable storage medium
CN113255319A (en) Model training method, text segmentation method, abstract extraction method and device
CN113850336B (en) Evaluation method, device, equipment and storage medium of semantic similarity model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination