CN117763152A - Text classification method, device and computer equipment in vertical field - Google Patents


Info

Publication number
CN117763152A
CN117763152A
Authority
CN
China
Prior art keywords
text
classified
level
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410002843.9A
Other languages
Chinese (zh)
Inventor
李斌
谢鸣晓
张海霞
张圳
李昱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
CCB Finetech Co Ltd
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN202410002843.9A priority Critical patent/CN117763152A/en
Publication of CN117763152A publication Critical patent/CN117763152A/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a text classification method, apparatus, computer device and storage medium in the vertical domain, in the technical field of big data processing. The method comprises the following steps: acquiring a text to be classified and extracting text features from it; determining, based on the text to be classified and a predetermined classification label system corresponding to the vertical domain of the target industry, the prior probability of the label corresponding to the text to be classified in each level of the classification label system, and determining a plurality of labels corresponding to the text to be classified; calculating the conditional probability in each level according to the text features and the number of labels corresponding to the text to be classified in each level; and calculating the content classification label corresponding to the text to be classified using a naive Bayes formula, based on the prior probability of the label corresponding to the text to be classified and the conditional probability in each level. With this method, text in a subdivided vertical domain can be classified.

Description

Text classification method, device and computer equipment in vertical field
Technical Field
The disclosure relates to the technical field of big data processing, in particular to a text classification method, a text classification device and computer equipment in the vertical field.
Background
With the development of artificial intelligence, natural language processing technology has advanced rapidly over the last one to two decades, from Word2Vec through deep-learning sequence models to the Transformer, BERT, and GPT introduced in recent years. Along with the development of financial technology in the banking industry, NLP technology is also widely used. Content intelligence is increasingly emphasized in the field of financial management, and its applications are increasingly rich.
Content intelligence applications play an important role in client relationship maintenance in the financial management field, in scenarios such as pre-investment education, mid-investment marketing, and post-investment accompaniment. One of the basic capabilities of content intelligence applications is content understanding. The most important approach to content understanding is to classify and label the content, establishing a classification system and a label system for it. Content classification and automatic labeling (or multi-label classification) of content are important tasks in natural language processing, with deep and wide application across major Internet content platforms.
However, content classification and automatic label generation in the conventional technology mainly adopt deep learning algorithms; for example, sequence models, convolutional neural networks, BERT and other methods are used to directly perform classification or multi-label classification. These conventional methods cannot handle classification in subdivided vertical fields.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, and a computer device for classifying texts in a vertical domain, which are capable of classifying texts in a subdivided vertical domain.
In a first aspect, the present disclosure provides a method of text classification in the vertical domain. The method comprises the following steps:
acquiring a text to be classified, and extracting text characteristics in the text to be classified, wherein the text to be classified is a text in the vertical field of the target industry;
determining prior probabilities of labels corresponding to the texts to be classified in each level in a classification label system based on the texts to be classified and a predetermined classification label system corresponding to the vertical field of the target industry, and determining a plurality of labels corresponding to the texts to be classified;
according to the text characteristics and the number of labels corresponding to the text to be classified in each level, calculating to obtain the conditional probability in each level;
based on the prior probability of the label corresponding to the text to be classified in each level and the conditional probability in each level, calculating to obtain the content classification label corresponding to the text to be classified by using a naive Bayes formula.
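The final step above is a standard naive Bayes maximization. As a minimal sketch (the function and data-structure names are illustrative assumptions, not from the patent), the posterior for each candidate label can be accumulated in log space:

```python
import math

def classify(features, labels, prior, cond_prob):
    """Pick the label maximizing log P(label) + sum_f log P(f | label).

    `prior` maps label -> prior probability; `cond_prob` maps
    (feature, label) -> conditional probability. Both structures are
    illustrative assumptions, not specified by the patent.
    """
    best_label, best_score = None, float("-inf")
    for label in labels:
        score = math.log(prior[label])
        for f in features:
            # small floor stands in for smoothing of unseen pairs
            score += math.log(cond_prob.get((f, label), 1e-6))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space avoids numeric underflow when many feature probabilities are multiplied; the same maximization would be applied per level of the label hierarchy.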
In one embodiment, the calculating, according to the text feature and the number of labels corresponding to the text to be classified in each hierarchy, the conditional probability in each hierarchy includes:
expanding the text features based on a predetermined expansion library to obtain an expansion feature group, wherein the expansion feature group comprises a plurality of expansion features associated with the text features, and weight relations exist between the expansion features and the text features; the expansion library comprises: the domain dictionary and the knowledge graph of the target industry;
and determining the comprehensive frequency of the text features according to the weight relation corresponding to the expansion features associated with the text features in the expansion feature group, and calculating the conditional probability in each level according to the number of labels corresponding to the text to be classified in each level.
In one embodiment, the extended feature set is stored in a hash table, and/or an inverted index is constructed for the extended feature set.
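An inverted index over the expansion groups lets each expanded term be mapped back to the source features it belongs to in constant average time. A minimal sketch, assuming the expansion groups are plain dicts (the patent does not fix a concrete layout):

```python
from collections import defaultdict

def build_inverted_index(expansion_groups):
    """expansion_groups: source feature -> iterable of expanded terms.

    Returns the inverted view: expanded term -> set of source features.
    """
    index = defaultdict(set)
    for feature, terms in expansion_groups.items():
        for term in terms:
            index[term].add(feature)
    return index
```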
In one embodiment, the weight relationship is determined based on a degree of correlation between the extended features and the text features; the method for determining the comprehensive frequency of the text features according to the weight relation corresponding to the expansion features associated with the text features in the expansion feature group, and calculating the conditional probability in each level according to the number of labels corresponding to the text to be classified in each level comprises the following steps:
Acquiring associated expansion features directly or indirectly associated with each text feature, and determining a weight relation corresponding to the associated expansion features;
determining the comprehensive frequency of each text feature according to the sum of weight relations corresponding to the associated expansion features;
and calculating the conditional probability in each level according to the target text characteristics and the number of labels corresponding to the texts to be classified in each level, wherein the target text characteristics are text characteristics with the largest comprehensive frequency.
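The "comprehensive frequency" described above can be sketched as the sum of the weight relations of a feature's associated expansions, with the target feature being the one with the largest sum. The pair-list structure below is an illustrative assumption:

```python
def select_target_feature(associations):
    """associations: text feature -> list of (expansion, weight) pairs
    for the expansions directly or indirectly associated with it.

    Returns the feature with the largest summed weight (its
    "comprehensive frequency") together with that sum.
    """
    totals = {f: sum(w for _, w in pairs) for f, pairs in associations.items()}
    target = max(totals, key=totals.get)
    return target, totals[target]
```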
In one embodiment, the taxonomy tag hierarchy includes a plurality of levels, each level having a plurality of tags; wherein the hierarchy and the labels are determined based on a vertical domain of the target industry; the determining the prior probability of the label corresponding to the text to be classified in each level in the classification label system based on the text to be classified and a predetermined classification label system corresponding to the vertical field of the target industry comprises the following steps:
determining the prior probability corresponding to the labels of the text to be classified in the first hierarchy according to the number of the labels of the text to be classified in the first hierarchy and the total number of the labels;
And determining the prior probability corresponding to the tags of the text to be classified in each level according to the prior probability corresponding to the tags of the text to be classified in the first level, the number of the tags corresponding to the text to be classified in each level and the total number of the tags in each level.
In one embodiment, the determining the prior probability corresponding to the label of the text to be classified in each level according to the prior probability corresponding to the label of the text to be classified in the first level, the number of labels corresponding to the text to be classified in each level, and the total number of labels in each level includes:
calculating the ratio of the number of the labels corresponding to the text to be classified in each level to the total number of the labels in each level;
normalizing the quantity ratio, and determining the prior probability corresponding to each label in each level according to the normalized quantity ratio and the prior probability corresponding to each label in the first level.
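One reading of the two steps above: each deeper-level label's count ratio is normalized so the ratios sum to one, then scaled by the parent's first-level prior, so the children of a label share its probability mass. A sketch under that reading (the patent gives no explicit formula):

```python
def level_prior(parent_prior, level_counts, level_total):
    """parent_prior: prior of the first-level label being refined.
    level_counts: deeper-level label -> number of matching labels.
    level_total: total number of labels in that level.
    """
    ratios = {lab: c / level_total for lab, c in level_counts.items()}
    norm = sum(ratios.values())
    return {lab: parent_prior * r / norm for lab, r in ratios.items()}
```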
In one embodiment, the extracting text features in the text to be classified includes:
extracting features of the text to be classified by using a keyword extraction technology to obtain first feature information; the weight value of the first characteristic information is a preset first value;
Extracting target sentences in the text to be classified by utilizing semantic analysis rules, and carrying out dependency analysis and semantic role analysis based on the target sentences;
obtaining second characteristic information based on the dependency analysis and semantic role analysis results; the weight value of the second characteristic information is a preset second value;
performing entity matching extraction on the text to be classified by using a predetermined expansion library to obtain third characteristic information, wherein the weight value of the third characteristic information is a preset third numerical value; the sizes of the first numerical value, the second numerical value and the third numerical value are sequentially increased;
and extracting text features in the text to be classified based on the weight value of the first feature information, the weight value of the second feature information and the weight value of the third feature information.
In one embodiment, the extracting text features in the text to be classified based on the weight value of the first feature information, the weight value of the second feature information, and the weight value of the third feature information includes:
determining fourth characteristic information and a weight value of the fourth characteristic information according to the same first characteristic information, second characteristic information and third characteristic information in response to the existence of the same characteristic information in the first characteristic information, the second characteristic information and the third characteristic information; the weight value of the fourth characteristic information is obtained by adding the weight values corresponding to the same first characteristic information, second characteristic information and third characteristic information;
And extracting text features in the text to be classified according to the weight values corresponding to the first feature information, the second feature information, the third feature information and the fourth feature information.
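The merge rule above — identical features found by multiple extractors get their weight values added — can be sketched as a dictionary fold (the weight-map representation is an assumption):

```python
def merge_feature_weights(*sources):
    """Each source maps feature -> weight (e.g. the keyword,
    semantic-role, and entity-match extractions). A feature appearing
    in several sources has its weights summed, matching the
    "fourth feature information" rule described above."""
    merged = {}
    for src in sources:
        for feature, weight in src.items():
            merged[feature] = merged.get(feature, 0) + weight
    return merged
```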
In one embodiment, before the extracting the text features in the text to be classified, the method further includes:
preprocessing the text to be classified, wherein the preprocessing comprises: performing word segmentation on the text to be classified using the domain dictionary of the target industry and removing stop words.
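As a toy stand-in for the preprocessing step (real segmentation of Chinese financial text would use the domain dictionary with a tokenizer such as jieba; here the input is assumed pre-spaced to stay dependency-free):

```python
def preprocess(text, stop_words):
    """Whitespace segmentation followed by stop-word removal."""
    return [tok for tok in text.split() if tok not in stop_words]
```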
In a second aspect, the present disclosure also provides a text classification method in the vertical domain, the method comprising:
acquiring a text to be classified;
inputting the text to be classified into a classification model obtained by training in advance, and outputting a content classification label corresponding to the text to be classified through the classification model;
the classification model is obtained by training in the following mode: acquiring a training classified text, processing the training classified text by adopting the method in any embodiment, and determining a content classification label corresponding to the training classified text; and training a language processing model based on the training classification text and the content classification label corresponding to the training classification text to obtain a classification model.
In a third aspect, the present disclosure also provides a text classification apparatus in the vertical domain. The device comprises:
the feature extraction module is used for obtaining a text to be classified and extracting text features in the text to be classified, wherein the text to be classified is a text in the vertical field of the target industry;
the prior probability determining module is used for determining the prior probability of a label corresponding to the text to be classified in each level in the classification label system based on the text to be classified and a predetermined classification label system corresponding to the vertical field of the target industry, and determining a plurality of labels corresponding to the text to be classified;
the conditional probability determining module is used for calculating the conditional probability in each level according to the text characteristics and the number of the labels corresponding to the text to be classified in each level;
the classification module is used for calculating and obtaining content classification labels corresponding to the texts to be classified by using a naive Bayesian formula based on the prior probabilities of the labels corresponding to the texts to be classified in each level and the conditional probabilities in each level.
In one embodiment, the conditional probability determination module includes:
The feature expansion module is used for expanding the text features based on a predetermined expansion library to obtain an expansion feature group, wherein the expansion feature group comprises a plurality of expansion features associated with the text features, and weight relations exist between the expansion features and the text features; the expansion library comprises: the domain dictionary and the knowledge graph of the target industry;
and the conditional probability calculation module is used for determining the comprehensive frequency of the text features according to the weight relation corresponding to the expansion features associated with the text features in the expansion feature group, and calculating the conditional probability in each level according to the number of labels corresponding to the text to be classified in each level.
In one embodiment, the extended feature set is stored in a hash table, and/or an inverted index is constructed for the extended feature set.
In one embodiment, the weight relationship is determined based on a degree of correlation between the extended features and the text features; the conditional probability calculation module comprises:
the weight relation determining module is used for acquiring the association expansion feature directly or indirectly associated with each text feature and determining the weight relation corresponding to the association expansion feature;
The comprehensive frequency determining module is used for determining the comprehensive frequency of each text feature according to the sum of the weight relationships corresponding to the associated expansion features;
and the calculating sub-module is used for calculating the conditional probability in each level according to the target text characteristics and the number of labels corresponding to the texts to be classified in each level, wherein the target text characteristics are text characteristics with the largest comprehensive frequency.
In one embodiment, the taxonomy tag hierarchy includes a plurality of levels, each level having a plurality of tags; wherein the hierarchy and the labels are determined based on a vertical domain of the target industry; the prior probability determining module includes:
the first determining module is used for determining the prior probability corresponding to the labels of the text to be classified in the first hierarchy according to the number of the labels of the text to be classified in the first hierarchy and the total number of the labels;
and the second determining module is used for determining the prior probability corresponding to the labels of the texts to be classified in each level according to the prior probability corresponding to the labels of the texts to be classified in the first level, the number of the labels corresponding to the texts to be classified in each level and the total number of the labels in each level.
In one embodiment, the second determining module is further configured to calculate a number ratio of the number of labels corresponding to the text to be classified in each level to the total number of labels in each level; normalizing the quantity ratio, and determining the prior probability corresponding to each label in each level according to the normalized quantity ratio and the prior probability corresponding to each label in the first level.
In one embodiment, the feature extraction module includes:
the first extraction module is used for extracting the characteristics of the text to be classified by utilizing a keyword extraction technology to obtain first characteristic information; the weight value of the first characteristic information is a preset first value;
the second extraction module is used for extracting target sentences in the text to be classified by utilizing semantic analysis rules, and performing dependency analysis and semantic role analysis based on the target sentences; obtaining second characteristic information based on the dependency analysis and semantic role analysis results; the weight value of the second characteristic information is a preset second value;
the third extraction module is used for carrying out entity matching extraction on the text to be classified by utilizing a predetermined expansion library to obtain third characteristic information, wherein the weight value of the third characteristic information is a preset third value; the sizes of the first numerical value, the second numerical value and the third numerical value are sequentially increased;
And the feature determining module is used for extracting text features in the text to be classified based on the weight value of the first feature information, the weight value of the second feature information and the weight value of the third feature information.
In one embodiment, the feature determining module is further configured to determine, in response to the presence of the same feature information in the first feature information, the second feature information, and the third feature information, fourth feature information, and a weight value of the fourth feature information according to the same first feature information, the second feature information, and the third feature information; the weight value of the fourth characteristic information is obtained by adding the weight values corresponding to the same first characteristic information, second characteristic information and third characteristic information; and extracting text features in the text to be classified according to the weight values corresponding to the first feature information, the second feature information, the third feature information and the fourth feature information.
In one embodiment, the apparatus further comprises: a text processing module used for preprocessing the text to be classified, wherein the preprocessing comprises: performing word segmentation on the text to be classified using the domain dictionary of the target industry and removing stop words.
In a fourth aspect, the present disclosure also provides a text classification apparatus in the vertical field, the apparatus comprising: the data acquisition module is used for acquiring texts to be classified;
the model processing module is used for inputting the text to be classified into a classification model obtained by training in advance, and outputting a content classification label corresponding to the text to be classified through the classification model; the classification model is obtained by training in the following mode: acquiring a training classified text, processing the training classified text by adopting the method in any one of the embodiments, and determining a content classification label corresponding to the training classified text; and training a language processing model based on the training classification text and the content classification label corresponding to the training classification text to obtain a classification model.
In a fifth aspect, the present disclosure also provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, implements the steps of any of the method embodiments described above.
In a sixth aspect, the present disclosure also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
In a seventh aspect, the present disclosure also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
In the above embodiments, when text in the vertical domain of the target industry needs to be classified, the specialized knowledge of that domain cannot be covered comprehensively by the processing manner of the conventional technology. Therefore, the prior probability of the label corresponding to the text to be classified in each level, and the multiple labels corresponding to the text to be classified, are determined from the text to be classified and the predetermined classification label system corresponding to the vertical domain of the target industry. Because the text to be classified is divided according to a classification label system specific to that vertical domain, all of its labels fall within the vertical domain of the target industry, which ensures professional content classification and better adapts to the target industry. The content classification problem for text in the vertical domain is thus converted into a hierarchical multi-label classification problem. After the prior probability is calculated, a conditional probability is generally needed to calculate the posterior probability; text features of the text to be classified are therefore extracted, and the conditional probability in each level is calculated from the text features and the number of labels corresponding to the text to be classified in each level. The probability of the label in each level is then computed from the prior probability and the conditional probability in each level, and the content classification label corresponding to the text to be classified is obtained using a naive Bayes formula.
In addition, by using different types of classification label systems, all common text classification problems such as two classification, multi-classification, hierarchical multi-label classification and the like can be compatible and processed simultaneously.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the prior art, the drawings that are required in the detailed description or the prior art will be briefly described, it will be apparent that the drawings in the following description are some embodiments of the present disclosure, and other drawings may be obtained according to the drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a diagram of an application environment for a text classification method in the vertical field in one embodiment;
FIG. 2 is a flow diagram of a text classification method in the vertical field in one embodiment;
FIG. 3 is a schematic diagram of a classification system in one embodiment;
FIG. 4 is a flow chart of step S206 in one embodiment;
FIG. 5 is a schematic diagram of expanding text features in one embodiment;
FIG. 6 is another schematic diagram of expanding text features in one embodiment;
FIG. 7 is a flow chart of step S304 in one embodiment;
FIG. 8 is a flow chart of step S204 in one embodiment;
FIG. 9 is a flow chart of step S504 in one embodiment;
FIG. 10 is a flowchart illustrating the step S204 in one embodiment;
FIG. 11 is a flow chart of step S710 in one embodiment;
FIG. 12 is a flow chart of a text classification method in the vertical field in another embodiment;
FIG. 13 is a block diagram schematically illustrating the structure of a text classification apparatus in the vertical field in one embodiment;
FIG. 14 is a block diagram schematically illustrating the structure of a text classification apparatus in the vertical field in one embodiment;
FIG. 15 is a schematic diagram of the internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims herein and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, apparatus, article, or device that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or device.
In this document, the term "and/or" describes an association between objects and indicates that three relationships may exist. For example, "A and/or B" may represent: A alone, both A and B together, or B alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
As described in the background, content classification and automatic labeling of content are important tasks in natural language processing, with deep and wide application across major Internet content platforms. Internet scenarios have large-scale corpora and pre-trained models, and their content range is very open, covering many different fields. In the banking industry, however, there are very strict management processes for the production, management and distribution of content, so both the content classification system (from the management perspective) and the content label system (from the application perspective) contain classifications and labels for special service purposes, and these are adjusted relatively frequently.
In the financial management field of banks, content is mainly acquired through writing and creation by content service experts and through purchasing from external third-party suppliers, and the scale and quantity of the content are relatively small. In terms of the content classification system and labels, suppliers provide only a small number of rough classifications, which cannot meet the business scenarios' need for deep understanding of the content. Classification and labeling of content in the vertical domain of banking mainly faces the following problems and challenges:
1. The classification and label system, established for service management and application purposes, lacks enough training samples to construct a classification model; there is a serious cold-start problem in the initial application stage, and supervised methods are not easy to implement at the outset.
2. Frequent adjustment of the content classification and label system in the early stage leads to continuous optimization iterations of the classification model and continuous re-labeling effort on the corpus.
3. The field contains much existing knowledge, including dictionaries, business terms and business knowledge; it is difficult for an end-to-end method or model to achieve comprehensive coverage and effective utilization of this knowledge, which leads to inaccurate final classification results.
Accordingly, to solve the above-mentioned problems, the embodiments of the present disclosure provide a text classification method in the vertical field, which can be applied to the application environment shown in fig. 1. The terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process; it may be integrated on the server 104 or located on a cloud or other network server. Various classification label systems and various texts to be classified may be stored in the data storage system. The server 104 may send the text to be classified to the terminal 102. The terminal 102 may obtain the text to be classified and extract text features from it. The text to be classified is typically text within the vertical domain of the target industry. The terminal 102 may determine, based on the text to be classified and a predetermined classification label system corresponding to the vertical domain of the target industry, the prior probabilities of the respective labels in each level of the classification label system, as well as a plurality of labels corresponding to the text to be classified. The terminal 102 may calculate the conditional probability according to the text features and the number of labels corresponding to the text to be classified. The terminal 102 may calculate, based on the prior probability of the label corresponding to the text to be classified in each level and the conditional probability in each level, a content classification label corresponding to the text to be classified using a naive Bayes formula. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, etc. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers.
In one embodiment, as shown in fig. 2, a text classification method in the vertical domain is provided, and the method is applied to the terminal 102 in fig. 1 for illustration, and includes the following steps:
S202, obtaining a text to be classified, and extracting text features in the text to be classified, wherein the text to be classified is a text in the vertical field of the target industry.
In some embodiments of the present disclosure, the text to be classified is generally text whose categories need to be determined. The target industry may be determined based on actual needs, and may be, for example, healthcare, financial services, information technology, law, education, real estate, etc. In these fields, professionals often need a deep understanding of relevant regulations, industry standards, the latest technology and market trends in order to provide professional services and advice to customers or organizations. Thus, in some embodiments of the disclosure, a vertical domain generally refers to a domain with specific expertise in a particular industry. For example, the vertical domain of the financial industry is a domain requiring financial-industry expertise.
Specifically, because a vertical field requires a wide range of specialized information, labeled text data in the vertical field of the target industry is generally scarce, making it difficult to classify such text with an existing model. Therefore, when the text to be classified is a text in the vertical field of the target industry, it can be obtained from the server, and text features in the text to be classified can be extracted.
In some exemplary embodiments, Natural Language Processing (NLP) techniques may be used to extract features from the text to be classified. For example, the text to be classified is split into words, and the frequency of occurrence of each word is counted; text features are then extracted by counting the occurrences of words in the text to be classified or by using the TF-IDF (term frequency - inverse document frequency) method. The text to be classified can further be divided into sequences of N consecutive words or characters, and an N-gram model can be used to capture the relevance among words or sequences in order to extract features. Named Entity Recognition (NER) techniques may also be used to identify entities in the text to be classified, which can serve as text features.
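As a minimal sketch of the word-frequency and TF-IDF extraction described above (the tokenization, corpus and function names here are illustrative assumptions, not part of the disclosure):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Word-frequency + TF-IDF feature scores for pre-tokenized documents.
    Tokenization (and the N-gram / NER steps mentioned above) are outside
    this sketch; all data here is illustrative."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: (c / len(doc)) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scores

docs = [["fund", "account", "opening"],
        ["fund", "equity"],
        ["stock", "account"]]
scores = tf_idf(docs)
# "fund" appears in 2 of 3 documents, so its IDF is log(3/2)
```

Words that occur in every document receive an IDF of log(1) = 0, which is the usual way TF-IDF suppresses uninformative vocabulary.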
S204, determining prior probabilities of labels corresponding to the texts to be classified in each level in a classification label system based on the texts to be classified and a predetermined classification label system corresponding to the vertical field of the target industry, and determining a plurality of labels corresponding to the texts to be classified.
Wherein a taxonomy label hierarchy may generally be a hierarchy corresponding to a vertical domain of the target industry, wherein multiple tiers may be included, and multiple labels may be present in each tier. The hierarchy and the labels are determined based on a vertical domain of the target industry. Taking the banking industry as an example, the classification tag system corresponding to the vertical domain of the banking industry may be shown in the classification tag system table in table 1.
Table 1: Classification label system
It should be noted that the above examples only schematically illustrate the class labels in the classification label system, where the first class of labels may be determined as one hierarchy and the second class as another hierarchy. In practical application, the classification hierarchy is generally 3 to 6 layers deep, which allows the text to be classified more effectively. Moreover, as shown in fig. 3, the classification label system as a whole forms a tree or forest, or may be viewed as a directed acyclic graph. Each class may be represented as a path in the class tree, such as C0→C1→C12 or C0→Cn→Cn1→Cn12, where C0 (not shown in the figure) is a virtual root class. Classifying a piece of content can then be seen as finding one or more suitable paths, and all candidate paths can be obtained by a depth-first search (DFS) traversal of the entire tree. The content classification problem for text in the vertical field of the target industry is thus converted into a hierarchical multi-label classification problem; note that each piece of content may belong to several classifications at once. In Bayesian statistics, a prior probability is the initial estimate of the probability of an event before new data is considered; in Bayesian inference, the prior is combined with the likelihood of the observed data to obtain the posterior probability, which updates that estimate. In some embodiments of the present disclosure, prior probabilities are used in classifying the content of the text to be classified.
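The path view described above can be sketched as a DFS enumeration over the label tree (the tree contents and function names are illustrative assumptions):

```python
def all_paths(tree, node="C0", prefix=None):
    """DFS enumeration of every path in the label tree; C0 is the virtual
    root and is dropped from the emitted paths."""
    prefix = (prefix or []) + ([node] if node != "C0" else [])
    paths = [prefix] if prefix else []
    for child in tree.get(node, []):
        paths += all_paths(tree, child, prefix)
    return paths

# A tiny two-level hierarchy in the style of Fig. 3 (labels illustrative).
tree = {"C0": ["C1", "C2"],
        "C1": ["C11", "C12"],
        "C2": ["C21"]}
paths = all_paths(tree)
# e.g. ["C1", "C12"] is one classification path of the content
```

Each emitted path corresponds to one candidate classification of the content, so a piece of content that belongs to several classifications simply maps to several paths.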
Specifically, the text to be classified can be classified according to a classification label system, a plurality of labels corresponding to the text to be classified are determined, and the prior probability of each label belonging to the text to be classified in each level is calculated.
In some exemplary embodiments, taking fig. 3 as an example, suppose the text to be classified belongs to C1 in the first hierarchy and to C11 and C12 in the second hierarchy; the labels of the text to be classified then include C1, C11 and C12. The prior probability may be obtained by dividing the number of labels belonging to the text to be classified in each hierarchy by the total number of labels in that hierarchy. The prior probability of C1 in the first hierarchy may thus be C1 / (C1 + C2 + ... + Cn), where each term denotes the count of the corresponding label.
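A minimal sketch of this prior computation, using the Fig. 3 example (label names and the function name are assumptions for illustration):

```python
def level_prior(text_labels, level_labels):
    """Prior for one level: labels of the text present in this level,
    divided by the total number of labels in the level."""
    hits = [lab for lab in level_labels if lab in text_labels]
    return len(hits) / len(level_labels)

# The Fig. 3 example: the text belongs to C1 (level 1) and C11, C12 (level 2).
text_labels = {"C1", "C11", "C12"}
p_level1 = level_prior(text_labels, ["C1", "C2", "C3"])       # 1/3
p_level2 = level_prior(text_labels, ["C11", "C12", "C21"])    # 2/3
```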
S206, calculating the conditional probability in each level according to the text characteristics and the number of labels corresponding to the text to be classified in each level.
Where a conditional probability refers to the probability of an event occurring given the occurrence of another event. The conditional probability is generally denoted as P (a|b), which is read as "probability of a occurring under the condition that B occurs".
Specifically, the conditional probability in each hierarchy may be calculated according to the number of occurrences of the text feature in the label corresponding to the text to be classified in each hierarchy divided by the number of labels corresponding to the text to be classified in each hierarchy.
In addition, the vocabulary of the vertical field of the target industry is highly sparse, that is, it contains many OOV (Out-Of-Vocabulary) words, so a great number of conditional probabilities become 0 when corpus samples are absent. To avoid this situation, the text features can be expanded, and the conditional probabilities in each level can be calculated using the expanded text features and the number of labels corresponding to the text to be classified in each level.
S208, calculating to obtain content classification labels corresponding to the texts to be classified by using a naive Bayesian formula based on prior probabilities of labels corresponding to the texts to be classified in each level and conditional probabilities in each level.
The naive Bayes formula is a Bayes theorem-based classification algorithm, and is used for calculating the probability of a certain category under the condition of a given feature.
Specifically, after the prior probability and the conditional probability of the label corresponding to the text to be classified in each level are calculated, a probability result corresponding to the label corresponding to the text to be classified in each level can be calculated by using a naive bayes formula, and the content classification label of the text to be classified is determined based on the probability result.
In some embodiments, the probability result corresponding to the labels of the text to be classified in each hierarchy may be calculated by the following formula (reconstructed here from the surrounding description):

y = ∏_{i=1}^{n} P(c_i) · P(x | c_i)

wherein y is the probability result corresponding to the candidate labels of the text to be classified, P(c_i) is the prior probability of the label corresponding to the text to be classified in the i-th hierarchy, P(x | c_i) is the conditional probability in the i-th hierarchy, n is the number of levels, and i indexes a hierarchy.
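Reading the garbled formula as a product of per-level prior and conditional terms, a candidate label path can be scored in log space as follows (an interpretive sketch, not the disclosure's implementation; all numbers are illustrative):

```python
import math

def path_score(level_terms):
    """Log-space score of one candidate label path through the hierarchy:
    sum over levels i of log P(c_i) + log P(x | c_i), i.e. the log of the
    reconstructed product formula. Log space avoids floating underflow."""
    return sum(math.log(prior) + math.log(cond) for prior, cond in level_terms)

# Two candidate paths through a two-level hierarchy; each entry is the
# (prior, conditional) pair for one level. All numbers are made up.
paths = {
    "C1>C11": [(0.4, 0.3), (0.5, 0.2)],
    "C1>C12": [(0.4, 0.3), (0.3, 0.6)],
}
best = max(paths, key=lambda p: path_score(paths[p]))
```

Because log is monotonic, comparing log scores selects the same path as comparing the raw products.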
In the above text classification method in the vertical domain, when text in the vertical domain of the target industry needs to be classified, conventional processing cannot fully cover the specialized knowledge that exists in that domain. Therefore, the prior probability of the labels corresponding to the text to be classified in each level, together with the multiple labels corresponding to the text to be classified, is determined according to the text to be classified and a predetermined classification label system corresponding to the vertical domain of the target industry. In this way, the content classification problem for vertical-domain text is converted into a hierarchical multi-label classification problem, and the text to be classified is divided according to the classification label system of that domain, so that all of its candidate labels lie within the vertical domain of the target industry. After classification, the resulting labels are therefore guaranteed to belong to the vertical domain, ensuring that the content classification is professional and well adapted to the target industry. After the prior probability is calculated, a conditional probability is generally also needed in order to compute the posterior probability; therefore, text features of the text to be classified are extracted, and the conditional probability in each hierarchy is calculated from the text features and the number of labels corresponding to the text to be classified in each hierarchy.
And further, calculating the probability of the label in each level according to the prior probability of the label corresponding to the text to be classified in each level and the conditional probability in each level, and further calculating the content classification label corresponding to the text to be classified by using a naive Bayes formula. In addition, by using different types of classification label systems, all common text classification problems such as two classification, multi-classification, hierarchical multi-label classification and the like can be compatible and processed simultaneously.
In one embodiment, as shown in fig. 4, the calculating, according to the text feature and the number of labels corresponding to the text to be classified in each hierarchy, the conditional probability in each hierarchy includes:
s302, expanding the text features based on a predetermined expansion library to obtain an expansion feature group,
the text feature group comprises a plurality of text features, wherein the text features are associated with the text features, and weight relations exist between the text features and the text features; the expansion library comprises: domain dictionary and knowledge graph of the target industry. The weighting relationships may generally be data that characterizes the degree of importance between extended features and text features. The weight relationship may be determined based on a degree of correlation between the extended features and the text features. A domain dictionary is a dictionary that contains terms and definitions commonly used in a particular industry or domain. It provides an explanation and illustration of various terms and related concepts within the art. In the financial industry, domain dictionaries may contain definitions and interpretations of terms such as stocks, bonds, futures, derivatives, and the like. In the scientific industry, domain dictionaries may contain definitions and interpretations of terms such as artificial intelligence, big data, cloud computing, and the like. Domain lexicons in the banking industry are mainly the carding of business terms, business functions. Such as: the investment financial management comprises the following steps: funding, financial, precious metals, etc.; the foundation further comprises: the different business operations of fund account opening, fund equity, etc. are dictionaries that describe business terms and business operations more fully. The knowledge graph is a structured, expressible, graphical model for representing and storing knowledge. It represents various concepts in the real world and relationships between them by organizing entities, attributes and relationships into the form of nodes and edges. Knowledge maps can be used to store and query large-scale knowledge data. 
In the banking industry, a great amount of knowledge-graph data has also been constructed and accumulated with the popularization of artificial intelligence technology, such as the relationship between a fund product and its fund company, its fund manager, and the fund type, stocks and industries corresponding to the underlying investment assets.
Specifically, when the amount of data in the text to be classified is small or the extracted text features are few, the calculated conditional probabilities are generally inaccurate, and problems such as data cold start and OOV arise that affect the final classification result. Therefore, in order to calculate the conditional probability accurately, the text features may be expanded using a predetermined expansion library to obtain an expansion feature group.
In some exemplary embodiments, text features may be semantically expanded through the expansion library. Taking the banking industry as an example, if a certain fund product entity appears in the text features, entities within an N-degree relation can be selected as expansion phrases through the knowledge graph, and different weight relations can be assigned according to the strength of the relation; for example, the weight of a first-degree relation is 0.8 and the weight of a second-degree relation is 0.5. As shown in fig. 5, if the text feature is w, semantic expansion may yield first-degree relations w11, w12, ..., w1n and second-degree relations w21, w22, .... In general, third-degree, fourth-degree and further relations could be determined in the same way, but since such relations are only weakly related to the text feature, expansion may stop at the second degree. It should also be noted that the weight corresponding to each degree of relation may be set according to actual requirements; the specific numerical values of the weight relations are not limited in some embodiments of the present disclosure. In addition, the text features can be expanded using the domain dictionary. The domain dictionary can be regarded as a special relation between entities, and its expansion weight can be set to 1, because the containing and contained relations under different business knowledge systems are strong relations. During expansion, some features may become expansion words of each other; in such cases the expansion weights may be superimposed. As shown in fig. 6, if there is an expanded semantic relationship between the features w' and w, then when counting the number of occurrences, the calculation may be performed on the superimposed result. For example, if the superimposed weight between w' and w is calculated to be 1.8, the final conditional probability could exceed 1; since a probability is at most 1, any weight relation greater than 1 may be adjusted to 1. When an OOV word appears in the text to be classified (i.e., a word that does not appear in the sample corpus), it can be looked up through the expansion feature group: w22 in the figure above does not appear in the sample corpus, but it is in the expansion feature group of the word w, so when w22 appears in the text to be classified, its corresponding comprehensive frequency can be set to 0.5.
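A hedged sketch of the degree-weighted expansion with weight superposition capped at 1, as described above (graph contents, weights and function names are illustrative assumptions):

```python
def expand(feature, graph, max_degree=2, weights=(0.8, 0.5)):
    """Expand one text feature via the knowledge graph: neighbors within
    `max_degree` hops become expansion features, first-degree hops weighted
    0.8 and second-degree 0.5 (the illustrative values from the text).
    Overlapping weights are superimposed and capped at 1, since a
    probability cannot exceed 1."""
    expanded, frontier = {}, {feature}
    for degree in range(max_degree):
        hops = [n for f in frontier for n in graph.get(f, [])]
        for n in hops:
            expanded[n] = min(1.0, expanded.get(n, 0.0) + weights[degree])
        frontier = set(hops)
    return expanded

# Hypothetical graph in the style of Fig. 5; w21 is reachable twice at
# degree 2, so its weights 0.5 + 0.5 superimpose to 1.0.
graph = {
    "w":   ["w11", "w12"],
    "w11": ["w21"],
    "w12": ["w21", "w22"],
}
exp = expand("w", graph)
```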
S304, determining the comprehensive frequency of the text features according to the weight relation corresponding to the expansion features associated with the text features in the expansion feature group, and calculating the conditional probability in each level according to the number of labels corresponding to the text to be classified in each level.
Specifically, once the expansion feature group has been obtained, the text features have been expanded; thus, the comprehensive frequency of a text feature may be derived from the weight relations of the expansion features associated with it, and the conditional probability in each hierarchy is then calculated according to the number of labels corresponding to the text to be classified in each hierarchy.
In this embodiment, expanding the text features through the expansion library effectively alleviates the problems of missing sample data and OOV words, generalizes the comprehensive frequency of the text features, and thus helps ensure the accuracy of text classification.
In one embodiment, to increase the speed of data retrieval, the extended feature set may be stored in a hash table, and/or an inverted index may be constructed for the extended feature set. Among them, an Inverted Index (Inverted Index) is a data structure for quickly searching a document. It is commonly used in information retrieval systems, such as search engines. The core idea of inverted indexing is to map each feature onto a list of documents that contain the feature.
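A minimal sketch of such an inverted index over expansion feature groups (data and names are illustrative assumptions):

```python
from collections import defaultdict

def build_inverted_index(expansion_groups):
    """Map each expansion feature back to the set of text features whose
    expansion group contains it, so a word met in the text to be classified
    (possibly OOV) can be resolved in O(1) average time."""
    index = defaultdict(set)
    for text_feature, group in expansion_groups.items():
        for expansion_feature in group:
            index[expansion_feature].add(text_feature)
    return index

# Hypothetical expansion groups (text feature -> {expansion feature: weight}).
groups = {"w": {"w11": 0.8, "w22": 0.5},
          "v": {"w22": 0.5, "v11": 0.8}}
index = build_inverted_index(groups)
```

A Python dict is itself a hash table, so storing the groups this way already gives the fast lookup the embodiment describes; the inverted index adds the reverse direction.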
In one embodiment, as shown in fig. 7, the determining the comprehensive frequency of the text feature according to the weight relation corresponding to the extended feature associated with the text feature in the extended feature group, and calculating to obtain the conditional probability in each level according to the number of tags corresponding to the text to be classified in each level includes:
s402, acquiring associated expansion features directly or indirectly associated with each text feature, and determining a weight relation corresponding to the associated expansion features.
S404, determining the comprehensive frequency of each text feature according to the sum of the weight relations corresponding to the associated expansion features.
S406, calculating to obtain the conditional probability in each level according to the target text characteristics and the number of labels corresponding to the texts to be classified in each level, wherein the target text characteristics are text characteristics with the largest comprehensive frequency. The composite frequency may be data characterizing the importance of the text feature.
Specifically, expansion features directly or indirectly associated with a text feature can be found; these are its associated expansion features. The weight relations corresponding to the associated expansion features are then determined, and the weights of the associated expansion features of each text feature are added to obtain its comprehensive frequency. The text feature with the highest comprehensive frequency is taken as the target text feature. Finally, the conditional probability in each level is calculated from the number of occurrences of the target text feature in the labels corresponding to the text to be classified in each level and the number of such labels.
In some exemplary embodiments, suppose the text features include A, B and C. The associated expansion features of A are A1, A2 and A3 with weights 0.8, 0.8 and 0.5; those of B are B1 and B2 with weights 0.8 and 0.5; those of C are C1 and C2 with weights 0.8 and 0.8. The comprehensive frequencies are therefore 2.1, 1.3 and 1.6 respectively, so the target text feature is A. The conditional probability in each level is then obtained from the number of occurrences of A in the labels corresponding to the text to be classified in that level and the number of such labels.
The conditional probability in each hierarchy can be calculated using the following formula (reconstructed from the surrounding description):

P(x | c_i) = n_{x,i} / N_i

wherein P(x | c_i) is the conditional probability in the i-th hierarchy, n_{x,i} is the number of occurrences of the target text feature x in the labels corresponding to the text to be classified in the i-th hierarchy, and N_i is the number of labels corresponding to the text to be classified in the i-th hierarchy.
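Under one reading of this calculation, the per-level conditional probability can be sketched as follows (the cap at 1 follows the earlier remark that a probability is at most 1; data and names are assumptions):

```python
def conditional_prob(target_feature, level_docs):
    """Occurrences of the target text feature across the documents of this
    level's labels, divided by the number of those labels, capped at 1."""
    hits = sum(doc.count(target_feature) for doc in level_docs.values())
    return min(1.0, hits / len(level_docs))

# Hypothetical level: two labels of the text to be classified, each with
# a token list standing in for its sample documents.
level_docs = {"C11": ["fund", "equity", "fund"],
              "C12": ["fund", "account"]}
p_fund = conditional_prob("fund", level_docs)        # 3 / 2, capped to 1.0
p_account = conditional_prob("account", level_docs)  # 1 / 2 = 0.5
```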
In this embodiment, the comprehensive frequency of each text feature is determined by using the weight relationship, and then the text feature with the largest comprehensive frequency is used as the target text feature, so that the feature used for calculating the conditional probability is more adapted to the text to be classified, and the accuracy of the calculated conditional probability is ensured.
In one embodiment, as shown in fig. 8, the determining, based on the text to be classified and a predetermined classification label system corresponding to the vertical field of the target industry, the prior probability of the label corresponding to the text to be classified in each level in the classification label system includes:
s502, determining the prior probability corresponding to the labels of the text to be classified in the first hierarchy according to the number of the labels of the text to be classified in the first hierarchy and the total number of the labels.
S504, determining the prior probability corresponding to the labels of the texts to be classified in each level according to the prior probability corresponding to the labels of the texts to be classified in the first level, the number of the labels corresponding to the texts to be classified in each level and the total number of the labels in each level.
Specifically, the prior probabilities may be calculated in batches according to the classification label system: the prior probability corresponding to the labels of the text to be classified in each hierarchy is obtained by dividing the number of its labels in that hierarchy by the total number of labels in that hierarchy and then normalizing. The normalization may be computed with Softmax. The prior probability corresponding to the labels of the text to be classified in the first hierarchy is simply the number of its labels in the first hierarchy divided by the total number of labels in the first hierarchy.
In this embodiment, the classification label system converts content text classification into a hierarchical multi-label classification problem: the text to be classified can belong to different labels in multiple hierarchies at the same time, and the labels are independent of each other and do not affect each other. Moreover, the data sparsity problem is not amplified as the number of classification levels grows; that is, adding levels has no influence on the prior probability calculated in this scheme. Therefore, even with many levels and many labels, the accuracy of the calculated prior probability can be ensured.
In one embodiment, as shown in fig. 9, the determining the prior probability corresponding to the labels of the text to be classified in each level according to the prior probability corresponding to the labels of the text to be classified in the first level, the number of the labels corresponding to the text to be classified in each level, and the total number of the labels in each level includes:
S602, calculating the ratio of the number of labels corresponding to the text to be classified in each level to the total number of labels in that level.
S604, normalizing the quantity ratio, and determining the prior probability corresponding to each label in each level according to the normalized quantity ratio and the prior probability corresponding to each label in the first level.
Specifically, the prior probability corresponding to the labels of the text to be classified in the first hierarchy is calculated by the following formula (reconstructed from the surrounding description):

P(c_1) = d_1 / D

wherein P(c_1) is the prior probability corresponding to the labels of the text to be classified in the first hierarchy, d_1 is the number of labels of the text to be classified in the first hierarchy, and D is the total number of labels in the first hierarchy.
For every hierarchy other than the first, the prior probability corresponding to the labels of the text to be classified is calculated by the following formula (reconstructed from the surrounding description):

P(c_i) = P(c_1) · softmax(d_i / D_i)

wherein P(c_1) is the prior probability corresponding to the labels of the text to be classified in the first hierarchy, D_i is the total number of labels in the i-th hierarchy, d_i is the number of labels corresponding to the text to be classified in the i-th hierarchy, and the softmax is taken over the label-count ratios within the hierarchy. In addition, it should be noted that normalization may also be performed in ways other than softmax, which is not limited in some embodiments of the present disclosure.
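Under this reading of the formula, the softmax-scaled priors can be sketched as follows (the exact placement of the softmax is an assumption, since the original formula is garbled in extraction):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def level_priors(p_first, label_counts, total_labels):
    """Priors for a level beyond the first: the count ratios d_i / D_i are
    softmax-normalized, then scaled by the first-level prior P(c_1)."""
    ratios = [d / total_labels for d in label_counts]
    return [p_first * s for s in softmax(ratios)]

p_first = 1 / 3                      # e.g. 1 of 3 labels in the first level
priors = level_priors(p_first, label_counts=[2, 1, 0], total_labels=3)
```

Because the softmax terms sum to 1, the level's priors always sum to the first-level prior, regardless of how many labels the level contains; this is what keeps deeper hierarchies from amplifying sparsity.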
In this embodiment, since the text to be classified may belong to multiple labels at the same time, normalization may be performed, so as to ensure accuracy of the final result. In addition, the method can perform independent calculation for different levels, and the classification effect among different levels is not greatly influenced by the classification levels, so that a plurality of different classifiers are not required to be trained for different classification systems.
In one embodiment, as shown in fig. 10, the extracting text features in the text to be classified includes:
S702, extracting features of the text to be classified by using a keyword extraction technique to obtain first feature information, wherein the weight value of the first feature information is a preset first value. The keyword extraction technique may be TextRank. TextRank is a method similar to PageRank: it performs propagation iterations over a graph whose links are word co-occurrence relations, and finally yields the core vocabulary.
Specifically, the text to be classified can be subjected to feature extraction by using TextRank to obtain first feature information. The first characteristic information may be a word or a vocabulary.
S704, extracting target sentences in the text to be classified by utilizing semantic analysis rules, and carrying out dependency analysis and semantic role analysis based on the target sentences.
S706, obtaining second characteristic information based on the dependency analysis and semantic role analysis results; the weight value of the second characteristic information is a preset second value.
The semantic-analysis-rule extraction mainly performs dependency analysis and semantic role analysis on core sentences (the core sentences include titles, sentences scored as central by TextRank, or simply the first N and last N sentences of the content, or all sentences). Dependency analysis is an analytical method in linguistics and computational linguistics for studying the dependency relationships between the words of a sentence. It focuses on the syntactic and semantic relationships between words; by analyzing these dependencies, the structure and meaning of the sentence can be better understood. Semantic role analysis is a natural language processing technique for identifying the semantic roles played by the individual components of a sentence (subject, object, predicate, etc.). It aims at understanding the semantic information in sentences and helps a machine resolve their meaning and structure. Through semantic role analysis, the computer can recognize the relations among the components of a sentence and understand its meaning more accurately, which facilitates natural language processing tasks.
Specifically, the semantic analysis rule may be used to extract a core sentence in the text to be classified, and then the dependency analysis and the semantic role analysis are performed on the core sentence, so that the second feature information is obtained according to the results of the dependency analysis and the semantic role analysis. The second characteristic information may also be words or vocabularies in general.
S708, performing entity matching extraction on the text to be classified by using a predetermined expansion library to obtain third characteristic information, wherein the weight value of the third characteristic information is a preset third numerical value; the magnitudes of the first numerical value, the second numerical value and the third numerical value are sequentially increased. The weight value of the first characteristic information may be set to 0.2. The weight value of the second characteristic information may be set to 0.5. The weight value of the third characteristic information may be set to 0.8. The predetermined extended library may be referred to in the above embodiments, and will not be repeated here. The first, second, and third values may be set according to actual needs, and are not limited in some embodiments of the present disclosure.
Specifically, the words in the text to be classified and the entities in the expansion library can be compared, the words matched with the entities in the expansion library are reserved, and then the third characteristic information is extracted.
S710, extracting text features in the text to be classified based on the weight value of the first feature information, the weight value of the second feature information and the weight value of the third feature information.
Specifically, feature information extracted in different ways differs, and so do the corresponding weight values; therefore, in order to find the most accurate text features, the feature information with the highest weight values is selected as the text features of the text to be classified.
In this embodiment, feature extraction is performed in three ways, so that the most accurate text feature can be found, and further, the accuracy of content classification is improved.
In one embodiment, as shown in fig. 11, the extracting text features in the text to be classified based on the weight value of the first feature information, the weight value of the second feature information, and the weight value of the third feature information includes:
s802, determining fourth characteristic information and a weight value of the fourth characteristic information according to the same first characteristic information, second characteristic information and third characteristic information in response to the existence of the same characteristic information in the first characteristic information, the second characteristic information and the third characteristic information; the weight value of the fourth characteristic information is obtained by adding the weight values corresponding to the same first characteristic information, second characteristic information and third characteristic information.
S804, extracting text features in the text to be classified according to the respective weight values of the first characteristic information, the second characteristic information, the third characteristic information, and the fourth characteristic information.
Specifically, when the same characteristic information exists in the first, second, and third characteristic information, that information has been retained by several different extraction modes applied to the text to be classified, so it can be judged accurate. For example, a word may survive both keyword extraction and entity matching against the expansion library; that is, the same characteristic information exists in both the first characteristic information and the third characteristic information. In this case, any characteristic information shared by at least two of the first, second, and third characteristic information is found and taken as the fourth characteristic information, and its weight value is obtained by adding the weight values of the same characteristic information. Then, according to the weight values of the first, second, third, and fourth characteristic information, the one or more pieces of characteristic information with the largest weight values are selected, yielding the text features of the text to be classified. Alternatively, the pieces of characteristic information whose weight values exceed a preset weight threshold may be taken as the text features.
In some exemplary embodiments, suppose the first characteristic information contains a1 and a2, the second contains a1, a3, and a4, and the third contains a2, a3, and a5; the fourth characteristic information is then a1, a2, and a3. Their weight values are a1: 0.2+0.5=0.7, a2: 0.2+0.8=1, and a3: 0.5+0.8=1.3. If three final text features are selected, they are a3, a2, and a5 (with combined weights 1.3, 1, and 0.8, respectively). It will be understood that the above is only for illustration.
In this embodiment, when the same characteristic information is extracted by different feature extraction modes, that information can be judged important; its weight values from the different modes are therefore added, and the summed weights yield more accurate text features.
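The weight-merging logic of this embodiment can be sketched as follows, assuming the illustrative weights 0.2/0.5/0.8 and the a1–a5 example above; the function name and `top_k` parameter are hypothetical conveniences:

```python
def merge_features(first, second, third, weights=(0.2, 0.5, 0.8), top_k=3):
    """Sum each feature's weights across the three extraction modes,
    then keep the top_k features by combined weight."""
    totals = {}
    for feats, w in zip((first, second, third), weights):
        for f in feats:
            totals[f] = totals.get(f, 0.0) + w
    # Rank by combined weight, highest first.
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:top_k], totals

# Worked example from the text: a3 (1.3), a2 (1.0), a5 (0.8) come out on top.
selected, totals = merge_features({"a1", "a2"}, {"a1", "a3", "a4"},
                                  {"a2", "a3", "a5"})
```

With three features selected, the result is a3, a2, and a5, matching the worked example above (a1 scores only 0.7 and a4 only 0.5).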
In some embodiments, before the extracting the text feature in the text to be classified, the method further includes:
preprocessing the text to be classified, wherein the preprocessing includes: segmenting the text to be classified and removing stop words by using the domain dictionary of the target industry.
Specifically, performing word segmentation with the domain dictionary of the target industry and removing stop words improves the accuracy of segmentation and stop-word removal for text in the vertical domain of the target industry, which in turn ensures the accuracy of the final classification result. In addition, the preprocessing may include other operations, such as coreference resolution.
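This preprocessing step can be sketched with a toy forward-maximum-matching segmenter; the dictionary, stop-word list, and sample string below are hypothetical stand-ins (a production system would typically load a real domain dictionary into a tokenizer such as jieba):

```python
DICTIONARY = {"abc", "de"}   # hypothetical domain dictionary
STOP_WORDS = {"x"}           # hypothetical stop-word list

def fmm_segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, greedily take the
    longest dictionary word, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

def preprocess(text):
    # Segment with the domain dictionary, then drop stop words.
    return [t for t in fmm_segment(text, DICTIONARY) if t not in STOP_WORDS]

print(preprocess("abcxde"))
```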
In one embodiment, the present disclosure also provides another method of text classification in the vertical domain, as shown in fig. 12, the method comprising:
s902, acquiring a text to be classified.
S904, inputting the text to be classified into a pre-trained classification model, and outputting the content classification label corresponding to the text to be classified through the classification model. The classification model is trained as follows: acquiring training texts, processing each training text with the method of the above embodiments to determine its content classification label, and then training a language processing model on the training texts and their content classification labels to obtain the classification model. The language processing model may include sequence models, convolutional neural networks, BERT (Bidirectional Encoder Representations from Transformers), and the like. For how to process the training texts and determine their content classification labels, refer to the above embodiments; the repeated description is omitted here.
In this embodiment, processing with a pre-trained classification model improves classification speed. In addition, because the training data for the classification model are labeled by the method of the above embodiments, they are relatively accurate and adapted to the vertical domain, which ensures the accuracy of the final classification result.
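The distillation idea of this embodiment — rule-produced labels supervising a faster model — can be sketched with a toy frequency model; the `rule_label` function and corpus are hypothetical stand-ins, and a real system would instead fine-tune BERT or a similar model on the rule-labeled corpus:

```python
from collections import Counter, defaultdict

def rule_label(text):
    """Hypothetical stand-in for the prior/conditional-probability
    labeling pipeline of the above embodiments."""
    return "finance" if "loan" in text else "other"

def train(corpus):
    """Count word frequencies per auto-generated label."""
    counts = defaultdict(Counter)
    for text in corpus:
        counts[rule_label(text)].update(text.split())
    return counts

def predict(counts, text):
    """Pick the label whose word counts best overlap the input text."""
    return max(counts, key=lambda lab: sum(counts[lab][w] for w in text.split()))

model = train(["loan rate high", "loan due", "sports news", "news today"])
```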
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the steps are not strictly limited to this order and may be executed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps, sub-steps, or stages.
Based on the same inventive concept, the embodiments of the present disclosure also provide a text classification apparatus in the vertical domain for implementing the text classification method in the vertical domain described above. The implementation of the solution provided by the apparatus is similar to that described for the method above, so for the specific limitations of the one or more embodiments of the text classification apparatus provided below, reference may be made to the limitations of the text classification method in the vertical domain above; repeated description is omitted here.
In one embodiment, as shown in fig. 13, there is provided a text classification apparatus 900 in the vertical domain, comprising: a feature extraction module 902, a priori probability determination module 904, a conditional probability determination module 906, and a classification module 908, wherein:
the feature extraction module 902 is configured to obtain a text to be classified, and extract text features in the text to be classified, where the text to be classified is a text in a vertical field of a target industry;
the prior probability determining module 904 is configured to determine, based on the text to be classified and a predetermined classification tag system corresponding to the vertical field of the target industry, the prior probability of the tags corresponding to the text to be classified in each level of the classification tag system, and to determine the plurality of tags corresponding to the text to be classified;
The conditional probability determining module 906 is configured to calculate, according to the text feature and the number of labels corresponding to the text to be classified in each level, a conditional probability in each level;
the classification module 908 is configured to calculate, based on the prior probability of the label corresponding to the text to be classified in each level and the conditional probability in each level, a content classification label corresponding to the text to be classified by using a naive bayes formula.
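The scoring performed by the classification module can be sketched minimally, assuming the standard naive Bayes combination (posterior proportional to prior times the product of conditional probabilities), computed in log space for numerical stability; the labels and probabilities below are illustrative:

```python
import math

def score_label(prior, conditionals):
    """log P(label) + sum of log P(feature | label)."""
    return math.log(prior) + sum(math.log(c) for c in conditionals)

def classify(candidates):
    """candidates: {label: (prior, [conditional, ...])} -> best label."""
    return max(candidates, key=lambda lab: score_label(*candidates[lab]))

# Illustrative numbers: label "B" wins despite its lower prior.
best = classify({"A": (0.6, [0.2, 0.5]), "B": (0.4, [0.9, 0.8])})
```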
In one embodiment of the apparatus, the conditional probability determination module 906 includes:
the feature expansion module is used for expanding the text features based on a predetermined expansion library to obtain an expansion feature group, wherein the expansion feature group comprises a plurality of expansion features associated with the text features, and weight relations exist between the expansion features and the text features; the expansion library comprises: the domain dictionary and the knowledge graph of the target industry;
and the conditional probability calculation module is used for determining the comprehensive frequency of the text features according to the weight relation corresponding to the expansion features associated with the text features in the expansion feature group, and calculating the conditional probability in each level according to the number of labels corresponding to the text to be classified in each level.
In an embodiment of the apparatus, the expansion feature group is stored in a hash table, and/or an inverted index is constructed for the expansion feature group.
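The inverted-index option can be sketched as follows (all names are illustrative): each expansion feature maps back to the text features it is associated with, so lookups avoid scanning the whole group.

```python
from collections import defaultdict

def build_inverted_index(expansion_groups):
    """expansion_groups: {text_feature: [expansion_feature, ...]}.
    Returns {expansion_feature: [text_feature, ...]}."""
    index = defaultdict(list)
    for text_feature, expansions in expansion_groups.items():
        for e in expansions:
            index[e].append(text_feature)
    return dict(index)

idx = build_inverted_index({"loan": ["credit", "debt"], "rate": ["credit"]})
```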
In an embodiment of the apparatus, the weight relation is determined based on a degree of correlation between the extended features and the text features; the conditional probability calculation module comprises:
the weight relation determining module is used for acquiring the association expansion feature directly or indirectly associated with each text feature and determining the weight relation corresponding to the association expansion feature;
the comprehensive frequency determining module is used for determining the comprehensive frequency of each text feature according to the sum of the weight relationships corresponding to the associated expansion features;
and the calculating sub-module is used for calculating the conditional probability in each level according to the target text characteristics and the number of labels corresponding to the texts to be classified in each level, wherein the target text characteristics are text characteristics with the largest comprehensive frequency.
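The first two sub-modules above can be sketched as follows; the exact weighting scheme is not fully specified in the text, so this assumes the comprehensive frequency is simply the sum of the weights of a text feature's associated expansion features (all weights are illustrative):

```python
def comprehensive_frequency(assoc_weights):
    """assoc_weights: {text_feature: [weight of each associated
    expansion feature]} -> {text_feature: summed weight}."""
    return {f: sum(ws) for f, ws in assoc_weights.items()}

def target_feature(assoc_weights):
    """The text feature with the largest comprehensive frequency."""
    freq = comprehensive_frequency(assoc_weights)
    return max(freq, key=freq.get)

# Illustrative weights: "loan" (0.9 + 0.4 = 1.3) beats "rate" (0.6).
best = target_feature({"loan": [0.9, 0.4], "rate": [0.6]})
```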
In one embodiment of the apparatus, the taxonomy includes a plurality of levels, each level having a plurality of tags therein; wherein the hierarchy and the labels are determined based on a vertical domain of the target industry; the prior probability determination module 904 includes:
The first determining module is used for determining the prior probability corresponding to the labels of the text to be classified in the first hierarchy according to the number of the labels of the text to be classified in the first hierarchy and the total number of the labels;
and the second determining module is used for determining the prior probability corresponding to the labels of the texts to be classified in each level according to the prior probability corresponding to the labels of the texts to be classified in the first level, the number of the labels corresponding to the texts to be classified in each level and the total number of the labels in each level.
In an embodiment of the apparatus, the second determining module is further configured to calculate, for each level, the ratio of the number of labels corresponding to the text to be classified to the total number of labels in that level; normalize the ratios; and determine the prior probability corresponding to each label in each level according to the normalized ratios and the prior probability corresponding to each label in the first level.
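A hedged sketch of this computation follows; the disclosure does not spell out the exact formula, so this assumes the per-level count ratios are normalized to sum to one and then scaled by the first-level prior (all counts are illustrative):

```python
def level_priors(matched_per_level, total_per_level):
    """matched_per_level[i]: number of labels matching the text at level i;
    total_per_level[i]: total number of labels at level i."""
    ratios = [m / t for m, t in zip(matched_per_level, total_per_level)]
    s = sum(ratios)
    normalized = [r / s for r in ratios]                  # sums to 1
    first_level_prior = matched_per_level[0] / total_per_level[0]
    return [first_level_prior * n for n in normalized]

# Two levels: 2 of 4 labels match at level 1, 1 of 10 at level 2.
priors = level_priors([2, 1], [4, 10])
```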
In one embodiment of the apparatus, the feature extraction module 902 includes:
the first extraction module is used for extracting the characteristics of the text to be classified by utilizing a keyword extraction technology to obtain first characteristic information; the weight value of the first characteristic information is a preset first value;
The second extraction module is used for extracting target sentences in the text to be classified by utilizing semantic analysis rules, and performing dependency analysis and semantic role analysis based on the target sentences; obtaining second characteristic information based on the dependency analysis and semantic role analysis results; the weight value of the second characteristic information is a preset second value;
the third extraction module is used for performing entity matching extraction on the text to be classified by using a predetermined expansion library to obtain third characteristic information, wherein the weight value of the third characteristic information is a preset third numerical value; the first numerical value, the second numerical value, and the third numerical value increase in that order;
and the feature determining module is used for extracting text features in the text to be classified based on the weight value of the first feature information, the weight value of the second feature information and the weight value of the third feature information.
In one embodiment of the apparatus, the feature determining module is further configured to determine, in response to the presence of the same feature information in the first feature information, the second feature information, and the third feature information, fourth feature information, and a weight value of the fourth feature information according to the same first feature information, the second feature information, and the third feature information; the weight value of the fourth characteristic information is obtained by adding the weight values corresponding to the same first characteristic information, second characteristic information and third characteristic information; and extracting text features in the text to be classified according to the weight values corresponding to the first feature information, the second feature information, the third feature information and the fourth feature information.
In one embodiment of the apparatus, the apparatus further includes: a text processing module, configured to preprocess the text to be classified, wherein the preprocessing includes: segmenting the text to be classified and removing stop words by using the domain dictionary of the target industry.
The respective modules in the text classification apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them to execute the operations corresponding to each module.
In one embodiment, as shown in fig. 14, the disclosed embodiments also provide another text classification apparatus 1000 in the vertical domain, the apparatus comprising:
a data acquisition module 1002, configured to acquire a text to be classified;
the model processing module 1004 is configured to input the text to be classified into a classification model obtained by training in advance, and output a content classification label corresponding to the text to be classified through the classification model; the classification model is obtained by training in the following mode: acquiring a training classified text, processing the training classified text by adopting the method in any embodiment, and determining a content classification label corresponding to the training classified text; and training a language processing model based on the training classification text and the content classification label corresponding to the training classification text to obtain a classification model.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 15. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing the text to be classified, the classified label system and other data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text classification method in the vertical domain.
It will be appreciated by those skilled in the art that the structure shown in fig. 15 is merely a block diagram of a portion of the structure associated with the disclosed aspects and is not limiting of the computer device to which the disclosed aspects apply, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of any of the method embodiments described above when the computer program is executed.
In one embodiment, a computer readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
In an embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
It should be noted that, the text to be classified, the training text, the extended library, and the like in the application are all information and data authorized by the user or fully authorized by each party, and the collection, the use, and the processing of the related data need to comply with related laws and regulations and standards.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided by the present disclosure may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM is available in a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided by the present disclosure may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors involved in the embodiments provided by the present disclosure may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum-computing-based data processing logic, and the like, without limitation thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The foregoing examples express only a few embodiments of the present disclosure, and while they are described in considerable detail, they are not to be construed as limiting the scope of the disclosure. It should be noted that those skilled in the art can make variations and modifications without departing from the spirit of the disclosure, and such variations and modifications are within its scope. Accordingly, the scope of the present disclosure should be determined by the appended claims.

Claims (23)

1. A method of text classification in the vertical field, the method comprising:
acquiring a text to be classified, and extracting text characteristics in the text to be classified, wherein the text to be classified is a text in the vertical field of the target industry;
determining prior probabilities of labels corresponding to the texts to be classified in each level in a classification label system based on the texts to be classified and a predetermined classification label system corresponding to the vertical field of the target industry, and determining a plurality of labels corresponding to the texts to be classified;
According to the text characteristics and the number of labels corresponding to the text to be classified in each level, calculating to obtain the conditional probability in each level;
based on the prior probability of the label corresponding to the text to be classified in each level and the conditional probability in each level, calculating to obtain the content classification label corresponding to the text to be classified by using a naive Bayes formula.
2. The method according to claim 1, wherein the calculating the conditional probability in each hierarchy according to the text feature and the number of tags corresponding to the text to be classified in each hierarchy includes:
expanding the text features based on a predetermined expansion library to obtain an expansion feature group, wherein the expansion feature group comprises a plurality of expansion features associated with the text features, and weight relations exist between the expansion features and the text features; the expansion library comprises: the domain dictionary and the knowledge graph of the target industry;
and determining the comprehensive frequency of the text features according to the weight relation corresponding to the expansion features associated with the text features in the expansion feature group, and calculating the conditional probability in each level according to the number of labels corresponding to the text to be classified in each level.
3. The method according to claim 2, wherein the expansion feature group is stored in a hash table and/or an inverted index is constructed for the expansion feature group.
4. The method of claim 2, wherein the weight relationship is determined based on a degree of correlation between the extended features and the text features; the method for determining the comprehensive frequency of the text features according to the weight relation corresponding to the expansion features associated with the text features in the expansion feature group, and calculating the conditional probability in each level according to the number of labels corresponding to the text to be classified in each level comprises the following steps:
acquiring associated expansion features directly or indirectly associated with each text feature, and determining a weight relation corresponding to the associated expansion features;
determining the comprehensive frequency of each text feature according to the sum of weight relations corresponding to the associated expansion features;
and calculating the conditional probability in each level according to the target text characteristics and the number of labels corresponding to the texts to be classified in each level, wherein the target text characteristics are text characteristics with the largest comprehensive frequency.
5. The method of claim 1, wherein the taxonomy comprises a plurality of levels, each level having a plurality of tags therein; wherein the hierarchy and the labels are determined based on a vertical domain of the target industry; the determining the prior probability of the label corresponding to the text to be classified in each level in the classification label system based on the text to be classified and a predetermined classification label system corresponding to the vertical field of the target industry comprises the following steps:
determining the prior probability corresponding to the labels of the text to be classified in the first hierarchy according to the number of the labels of the text to be classified in the first hierarchy and the total number of the labels;
and determining the prior probability corresponding to the tags of the text to be classified in each level according to the prior probability corresponding to the tags of the text to be classified in the first level, the number of the tags corresponding to the text to be classified in each level and the total number of the tags in each level.
6. The method of claim 5, wherein determining the prior probability corresponding to the tags of the text to be classified in each hierarchy based on the prior probability corresponding to the tags of the text to be classified in the first hierarchy, the number of tags corresponding to the text to be classified in each hierarchy, and the total number of tags in each hierarchy, comprises:
Calculating the ratio of the number of the labels corresponding to the text to be classified in each level to the total number of the labels in each level;
normalizing the quantity ratio, and determining the prior probability corresponding to each label in each level according to the normalized quantity ratio and the prior probability corresponding to each label in the first level.
7. The method of claim 1, wherein the extracting text features in the text to be classified comprises:
extracting features of the text to be classified by using a keyword extraction technology to obtain first feature information; the weight value of the first characteristic information is a preset first value;
extracting target sentences in the text to be classified by utilizing semantic analysis rules, and carrying out dependency analysis and semantic role analysis based on the target sentences;
obtaining second characteristic information based on the dependency analysis and semantic role analysis results; the weight value of the second characteristic information is a preset second value;
performing entity matching extraction on the text to be classified by using a predetermined expansion library to obtain third characteristic information, wherein the weight value of the third characteristic information is a preset third numerical value; the first numerical value, the second numerical value, and the third numerical value increase in that order;
And extracting text features in the text to be classified based on the weight value of the first feature information, the weight value of the second feature information and the weight value of the third feature information.
8. The method according to claim 7, wherein the extracting text features in the text to be classified based on the weight value of the first feature information, the weight value of the second feature information, and the weight value of the third feature information includes:
determining fourth characteristic information and a weight value of the fourth characteristic information according to the same first characteristic information, second characteristic information and third characteristic information in response to the existence of the same characteristic information in the first characteristic information, the second characteristic information and the third characteristic information; the weight value of the fourth characteristic information is obtained by adding the weight values corresponding to the same first characteristic information, second characteristic information and third characteristic information;
and extracting text features in the text to be classified according to the weight values corresponding to the first feature information, the second feature information, the third feature information and the fourth feature information.
9. The method of claim 1, wherein prior to extracting text features in the text to be classified, the method further comprises:
Preprocessing the text to be classified, wherein the preprocessing includes: segmenting the text to be classified and removing stop words by using the domain dictionary of the target industry.
10. A method of text classification in the vertical field, the method comprising:
acquiring a text to be classified;
inputting the text to be classified into a classification model obtained by training in advance, and outputting a content classification label corresponding to the text to be classified through the classification model;
the classification model is obtained by training in the following mode: acquiring a training classified text, processing the training classified text by adopting the method according to any one of claims 1 to 9, and determining a content classification label corresponding to the training classified text; and training a language processing model based on the training classification text and the content classification label corresponding to the training classification text to obtain a classification model.
11. A text classification apparatus in the vertical field, the apparatus comprising:
the feature extraction module is used for obtaining a text to be classified and extracting text features in the text to be classified, wherein the text to be classified is a text in the vertical field of the target industry;
The prior probability determining module is used for determining the prior probability of a label corresponding to the text to be classified in each level in the classification label system based on the text to be classified and a predetermined classification label system corresponding to the vertical field of the target industry, and determining a plurality of labels corresponding to the text to be classified;
the conditional probability determining module is used for calculating the conditional probability in each level according to the text characteristics and the number of the labels corresponding to the text to be classified in each level;
the classification module is used for calculating and obtaining content classification labels corresponding to the texts to be classified by using a naive Bayesian formula based on the prior probability of the labels corresponding to the texts to be classified and the conditional probability in each level.
12. The apparatus of claim 11, wherein the conditional probability determination module comprises:
the feature expansion module is used for expanding the text features based on a predetermined expansion library to obtain an expansion feature group, wherein the expansion feature group comprises a plurality of expansion features associated with the text features, and weight relations exist between the expansion features and the text features; the expansion library comprises: the domain dictionary and the knowledge graph of the target industry;
And the conditional probability calculation module is used for determining the comprehensive frequency of the text features according to the weight relation corresponding to the expansion features associated with the text features in the expansion feature group, and calculating the conditional probability in each level according to the number of labels corresponding to the text to be classified in each level.
13. The apparatus of claim 12, wherein the set of extended features is stored in a hash table and/or has a corresponding inverted index built for it.
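The storage scheme of claim 13 can be sketched as a hash table (a Python dict) holding the expansion feature groups, plus an inverted index that maps each expansion feature back to the text features it is associated with; the group contents below are hypothetical:

```python
from collections import defaultdict

def build_inverted_index(groups):
    """Map each expansion feature to the text features it expands.

    groups: {text feature: [(expansion feature, weight), ...]} (the hash table)
    """
    index = defaultdict(list)
    for text_feature, expansions in groups.items():
        for exp_feature, _weight in expansions:
            index[exp_feature].append(text_feature)
    return dict(index)

groups = {
    "mortgage": [("home loan", 0.9), ("collateral", 0.6)],
    "interest": [("interest rate", 0.8), ("collateral", 0.2)],
}
print(build_inverted_index(groups)["collateral"])  # ['mortgage', 'interest']
```

The inverted index makes the "directly or indirectly associated" lookup of claim 14 a constant-time dictionary access rather than a scan over every group.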
14. The apparatus of claim 12, wherein the weight relationship is determined based on a degree of correlation between the extended features and the text features; the conditional probability calculation module comprises:
the weight relation determining module is used for acquiring the association expansion feature directly or indirectly associated with each text feature and determining the weight relation corresponding to the association expansion feature;
the comprehensive frequency determining module is used for determining the comprehensive frequency of each text feature according to the sum of the weight relationships corresponding to the associated expansion features;
and the calculating sub-module is used for calculating the conditional probability in each level according to the target text characteristics and the number of labels corresponding to the texts to be classified in each level, wherein the target text characteristics are text characteristics with the largest comprehensive frequency.
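One plausible reading of claims 12 and 14 is that the comprehensive frequency of a text feature is the sum of the weight relations of its associated expansion features, and the target text feature is the one with the largest sum. A sketch under that assumption, with hypothetical group contents:

```python
def comprehensive_frequency(groups):
    """Sum the weights of each text feature's associated expansion features."""
    return {feat: sum(w for _, w in exps) for feat, exps in groups.items()}

def target_feature(groups):
    """Return the text feature with the largest comprehensive frequency."""
    freqs = comprehensive_frequency(groups)
    return max(freqs, key=freqs.get)

groups = {
    "mortgage": [("home loan", 0.9), ("collateral", 0.6)],  # sums to 1.5
    "interest": [("interest rate", 0.8)],                   # sums to 0.8
}
print(target_feature(groups))  # "mortgage"
```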
15. The apparatus of claim 11, wherein the taxonomy comprises a plurality of levels, each level having a plurality of tags therein; wherein the hierarchy and the labels are determined based on a vertical domain of the target industry; the prior probability determining module includes:
the first determining module is used for determining the prior probability corresponding to the labels of the text to be classified in the first hierarchy according to the number of the labels of the text to be classified in the first hierarchy and the total number of the labels;
and the second determining module is used for determining the prior probability corresponding to the labels of the texts to be classified in each level according to the prior probability corresponding to the labels of the texts to be classified in the first level, the number of the labels corresponding to the texts to be classified in each level and the total number of the labels in each level.
16. The apparatus of claim 15, wherein the second determining module is further configured to calculate, for each level, the ratio of the number of labels corresponding to the text to be classified to the total number of labels in that level; normalize the ratios, and determine the prior probability corresponding to each label in each level according to the normalized ratios and the prior probability corresponding to each label in the first level.
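The per-level prior computation of claim 16 can be sketched as follows, under the assumption that the level ratios are normalized jointly and then scaled by the first-level prior; the label counts are hypothetical:

```python
def level_priors(first_level_prior, counts_per_level, totals_per_level):
    """Sketch of claim 16: normalize each level's label-count ratio, then
    scale by the first-level prior to get each level's prior probability.

    counts_per_level: labels matching the text, per level.
    totals_per_level: total labels per level.
    """
    ratios = [c / t for c, t in zip(counts_per_level, totals_per_level)]
    norm = sum(ratios)
    normalized = [r / norm for r in ratios]
    return [first_level_prior * r for r in normalized]

# Hypothetical three-level label system.
priors = level_priors(0.5, counts_per_level=[2, 3, 1], totals_per_level=[4, 10, 5])
print([round(p, 3) for p in priors])  # [0.25, 0.15, 0.1]
```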
17. The apparatus of claim 11, wherein the feature extraction module comprises:
the first extraction module is used for extracting the characteristics of the text to be classified by utilizing a keyword extraction technology to obtain first characteristic information; the weight value of the first characteristic information is a preset first value;
the second extraction module is used for extracting target sentences in the text to be classified by utilizing semantic analysis rules, and performing dependency analysis and semantic role analysis based on the target sentences; obtaining second characteristic information based on the dependency analysis and semantic role analysis results; the weight value of the second characteristic information is a preset second value;
the third extraction module is used for carrying out entity matching extraction on the text to be classified by utilizing a predetermined expansion library to obtain third characteristic information, wherein the weight value of the third characteristic information is a preset third value; the sizes of the first numerical value, the second numerical value and the third numerical value are sequentially increased;
and the feature determining module is used for extracting text features in the text to be classified based on the weight value of the first feature information, the weight value of the second feature information and the weight value of the third feature information.
18. The apparatus of claim 17, wherein the feature determination module is further configured to, in response to the same feature information being present in the first feature information, the second feature information and the third feature information, determine fourth feature information and a weight value of the fourth feature information based on that same feature information; the weight value of the fourth feature information is obtained by adding the weight values corresponding to the same first feature information, second feature information and third feature information; and to extract text features in the text to be classified according to the weight values corresponding to the first feature information, the second feature information, the third feature information and the fourth feature information.
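The weight merging of claims 17 and 18 can be sketched as summing channel weights for features extracted by more than one channel. The concrete weight values below are illustrative; the claims only require them to increase from the first channel to the third:

```python
from collections import defaultdict

# Preset weights for the keyword, semantic-analysis, and entity-matching
# channels; only the ordering first < second < third is prescribed.
W_KEYWORD, W_SEMANTIC, W_ENTITY = 1.0, 2.0, 3.0

def merge_features(keyword_feats, semantic_feats, entity_feats):
    """Add up channel weights; a feature found by several channels gets the
    summed weight (the 'fourth feature information' of claim 18)."""
    weights = defaultdict(float)
    for feats, w in ((keyword_feats, W_KEYWORD),
                     (semantic_feats, W_SEMANTIC),
                     (entity_feats, W_ENTITY)):
        for f in feats:
            weights[f] += w
    return dict(weights)

merged = merge_features({"rate", "loan"}, {"loan"}, {"loan", "bank"})
print(merged["loan"])  # 6.0 = 1.0 + 2.0 + 3.0
```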
19. The apparatus of claim 11, wherein the apparatus further comprises: a text processing module, configured to preprocess the text to be classified, the preprocessing comprising: performing word segmentation on the text to be classified by using the domain dictionary of the target industry, and removing stop words.
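The preprocessing of claim 19 can be sketched with a forward-maximum-matching segmenter over a hypothetical domain dictionary followed by stop-word removal. It is illustrated on whitespace-separated English tokens; a production Chinese segmenter would operate on characters:

```python
# Hypothetical domain dictionary and stop-word list.
DOMAIN_DICT = {"housing loan", "interest rate", "loan"}
STOP_WORDS = {"the", "a", "of"}

def segment(tokens, max_len=2):
    """Greedily merge up to max_len tokens when the span is in the dictionary."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            span = " ".join(tokens[i:i + n])
            if n == 1 or span in DOMAIN_DICT:
                out.append(span)
                i += n
                break
    return out

def preprocess(text):
    """Segment with the domain dictionary, then drop stop words."""
    words = segment(text.lower().split())
    return [w for w in words if w not in STOP_WORDS]

print(preprocess("The interest rate of the housing loan"))
# ['interest rate', 'housing loan']
```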
20. A text classification apparatus in the vertical field, the apparatus comprising:
the data acquisition module is used for acquiring texts to be classified;
the model processing module is used for inputting the text to be classified into a classification model obtained by training in advance, and outputting a content classification label corresponding to the text to be classified through the classification model; the classification model is obtained by training in the following mode: acquiring a training classified text, processing the training classified text by adopting the method according to any one of claims 1 to 9, and determining a content classification label corresponding to the training classified text; and training a language processing model based on the training classified text and the content classification label corresponding to the training classified text to obtain the classification model.
21. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 9 or claim 10 when the computer program is executed.
22. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 9 or claim 10.
23. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 9 or claim 10.
CN202410002843.9A 2024-01-02 2024-01-02 Text classification method, device and computer equipment in vertical field Pending CN117763152A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410002843.9A CN117763152A (en) 2024-01-02 2024-01-02 Text classification method, device and computer equipment in vertical field

Publications (1)

Publication Number Publication Date
CN117763152A 2024-03-26

Family

ID=90318107

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination