CN110347821B - Text category labeling method, electronic equipment and readable storage medium - Google Patents

Text category labeling method, electronic equipment and readable storage medium

Info

Publication number
CN110347821B
CN110347821B (application CN201910456149.3A)
Authority
CN
China
Prior art keywords
text
category
candidate
classification model
annotated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910456149.3A
Other languages
Chinese (zh)
Other versions
CN110347821A (en)
Inventor
过弋
张振豪
王志宏
樊振
韩美琪
王家辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China University of Science and Technology
Shihezi University
Original Assignee
East China University of Science and Technology
Shihezi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China University of Science and Technology and Shihezi University
Priority to CN201910456149.3A
Publication of CN110347821A
Application granted
Publication of CN110347821B
Legal status: Active
Anticipated expiration: as listed

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9536: Search customisation based on social or collaborative filtering
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of this application relate to the field of computer technology and disclose a method, an electronic device, and a readable storage medium for category labeling over a text category hierarchy. The labeling method comprises: searching, in combination with cognitive factors, for the candidate categories corresponding to the text to be annotated; determining a classification model according to the candidate categories, where the classification model is trained on first sample texts and the bottom subcategory corresponding to each first sample text, the bottom subcategories lie at the bottom of the category hierarchy containing the candidate categories, the candidate categories are parent categories at the top of that hierarchy, and the hierarchy comprises at least two levels of categories; and determining the actual text category of the text to be annotated from the text and the classification model, then labeling the text according to that actual category. These embodiments reduce the probability of labeling errors and improve labeling accuracy.

Description

Text category labeling method, electronic equipment and readable storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a text category labeling method, electronic equipment and a readable storage medium.
Background
With the rapid growth of the internet, people depend increasingly on the network for information. However, the rapid increase in the volume of text data greatly affects both the efficiency and the quality of information retrieval. To support text retrieval, text data is typically labeled by category; news text, for example, is labeled by category (sports, entertainment, and so on) so that relevant text can be retrieved quickly and accurately. Early text labeling was rooted in human cognition: people judged the category of a text from experience accumulated in daily life and from certain inference rules. With the development of computer technology, researchers have sought to endow machines with artificial intelligence (AI) so that, through learning, a machine accumulates experience and judges text categories automatically. Work in this direction has been highly successful, to the point that people rely directly on algorithmic decisions and ignore other factors.
The inventors have found at least the following problem in related text classification techniques: because the volume of text data is now large, and the number of categories grows with the data, the probability of errors during automatic category labeling increases greatly, while the speed of labeling decreases.
Disclosure of Invention
The embodiments of this application aim to provide a text category labeling method, an electronic device, and a readable storage medium that reduce the probability of labeling errors and improve labeling accuracy.
To solve the above technical problem, an embodiment of this application provides a text category labeling method comprising: searching for the candidate categories corresponding to the text to be annotated; determining a classification model according to the candidate categories, where the classification model is trained on first sample texts and the bottom subcategory corresponding to each first sample text, the bottom subcategories are at the bottom of the category hierarchy containing the candidate categories, the candidate categories are parent categories at the top of that hierarchy, and the hierarchy comprises at least two levels of categories; and determining the actual text category of the text to be annotated according to the text and the classification model, then labeling the text according to that actual category.
An embodiment of this application also provides an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor that enable it to perform the text category labeling method described above.
An embodiment of this application also provides a computer-readable storage medium storing a computer program; when the program is executed by a processor, the text category labeling method described above is realized.
Compared with the prior art, in which the number of text categories is huge, searching for the candidate categories corresponding to the text to be annotated reduces the set of categories that must be considered. During the search, the obtained cognitive factors are combined with the similarity values, correcting the similarities and improving the accuracy of the search. Because the number of bottom subcategories under the candidate categories is far smaller than the number of all bottom subcategories, the classification model can quickly determine the actual text category from the candidate categories' bottom subcategories, speeding up category determination. The result produced by the machine learning algorithm is used as a cognitive factor rather than relied on outright, and integrating the similarity with the cognitive factor improves the accuracy of the candidate search. Determining the classification model from the candidate categories of the text to be annotated makes the model more targeted, improving the accuracy of category determination. Finally, because the candidate categories are parent categories at the top of a hierarchy of at least two levels, the subcategories at each layer need not be judged level by level; all bottom subcategories of the candidate categories are obtained directly, which reduces the impact of per-level judgment errors, lowers the probability of labeling errors, and improves labeling accuracy.
In addition, searching for the candidate categories corresponding to the text to be annotated specifically comprises: acquiring the cognitive-factor set of the candidate categories of the text, where the set may include the matched candidate categories and the initial probability value corresponding to each; calculating the similarity between the text's category and each candidate category and gathering the similarities into a similarity set; determining a candidate-category probability set from the cognitive-factor set and the similarity set; and selecting the candidate categories of the text according to a preset rule and the probability set. Combining the cognitive-factor set with the similarity set greatly improves the accuracy of finding the candidate categories corresponding to the text.
In addition, acquiring the cognitive-factor set of the candidate categories of the text to be annotated specifically comprises: inputting the text into a preset initial candidate-category classification model to obtain the cognitive-factor set, where the initial model is trained on second sample texts and the candidate category corresponding to each second sample text. With this pre-built initial model, the cognitive-factor set of the text can be determined rapidly.
In addition, selecting the candidate categories of the text to be annotated according to a preset rule and the candidate-category probability set specifically comprises: selecting from the probability set the values larger than a preset threshold and taking the corresponding candidate categories as the candidate categories of the text; or arranging the probability values in descending order, selecting a preset number of them from the sorted set, and taking the corresponding candidate categories. Selecting the candidate categories with larger probability values improves the speed and accuracy of the subsequent determination of the actual text category.
In addition, determining the classification model according to the candidate categories specifically comprises: acquiring, from the candidate categories and their corresponding category hierarchy, the bottom-subcategory set formed by all bottom subcategories in that hierarchy; and determining the classification model from the bottom-subcategory set. Selecting a classification model that corresponds to the bottom subcategories improves the speed of determining the category of the text to be annotated.
In addition, determining the classification model from the bottom-subcategory set specifically comprises: determining it from that set and a preset correspondence between bottom-subcategory sets and classification models. With this correspondence, the classification model can be determined rapidly.
In addition, the training process of the classification model specifically comprises: acquiring, from the bottom-subcategory set, the first sample set corresponding to it, which contains the first sample texts for each bottom subcategory; taking each first sample text as the model's input data and its bottom subcategory as the output data; and training the model from that input and output. Because the model is trained on the sample set that corresponds to the bottom-subcategory set, it is more targeted, which speeds up determining the actual category of the text to be annotated.
In addition, after training and before determining the classification model from the bottom-subcategory set, the method further comprises storing the trained model together with the correspondence between it and the bottom-subcategory set. Storing each trained model continuously enriches the variety of models, improving the accuracy of later determinations of the actual text category.
Drawings
One or more embodiments are illustrated by way of example, and not limitation, in the accompanying figures, in which like references indicate similar elements; the figures are not to be taken in a limiting sense unless otherwise indicated.
FIG. 1 is a specific flow chart of a method for text category labeling according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of a class hierarchy provided in accordance with a first embodiment of the present application;
FIG. 3 is a schematic view of a class hierarchy corresponding to a candidate class according to a first embodiment of the present application;
FIG. 4 is a schematic flow chart of training a classification model in a text class labeling method according to a second embodiment of the present application;
fig. 5 is a schematic diagram of a specific structure of an electronic device according to a third embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments clearer, the embodiments of this application are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand that numerous technical details are set forth in order to provide a better understanding of the application; the claimed application may nevertheless be practiced without these specific details and with various changes and modifications based on the following embodiments.
The following embodiments are divided for convenience of description, and should not be construed as limiting the specific implementation of the present application, and the embodiments can be mutually combined and referred to without contradiction.
The first embodiment of the application relates to a text category labeling method. It can be applied to electronic equipment with a search function, such as a news client or a search-engine device. Automatically labeling text categories helps users quickly locate relevant texts when searching and improves search speed. The text to be annotated may be news, papers, magazines, and so on. A specific flow of the method is shown in fig. 1.
Step 101: searching candidate categories corresponding to the text to be annotated.
Specifically, to facilitate text management, text categories generally have a hierarchical structure in which each level indicates a different scope of text content. The hierarchy is described below with fig. 2. The categories at the top layer are parent categories, whose indicated scope is the widest; for example, A may represent sports news and B entertainment news. Categories C and D are on layer L2: C may be ball-game news and D skating news. Layer L3 holds the specific categories, for example H for football, I for badminton, and G for baseball; E is a further bottom-level subcategory.
Each parent category includes multiple layers of subcategories, with lower layers of subcategories indicating more specific ranges of text content and more relevant to the corresponding text.
In a specific implementation, the process of searching for the candidate categories of the text to be annotated is as follows: acquire the cognitive-factor set of the text's candidate categories, which comprises the matched candidate categories and the initial probability value corresponding to each; calculate the similarity between the text's category and each stored category and gather the similarities into a similarity set; determine the candidate-category probability set from the cognitive-factor set and the similarity set; and select the candidate categories of the text according to a preset rule and the probability set.
To calculate the similarity between the category of the text to be annotated and each stored category, consider the similarity between the text and a candidate category c:
First, acquire the term frequency-inverse document frequency (TF-IDF) set D over a corpus, the corpus being a collection of texts. Compute the tf-idf vector w of the text to be annotated against D. Compute the sum of the tf-idf vectors of the texts in D that belong to candidate category c, denoted sum(c). Finally, compute the similarity between w and sum(c); that value is taken as the similarity between the text's category and candidate category c.
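The tf-idf and sum(c) computation described above can be sketched in pure Python as follows. This is an illustrative sketch only; the function names (`tfidf_vectors`, `category_similarity`) and the cosine choice of similarity are assumptions, not taken from the patent.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists; returns one sparse tf-idf dict per document
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))                      # document frequency per term
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({w: (tf[w] / len(d)) * math.log(n / df[w]) for w in tf})
    return vecs

def cosine(u, v):
    # cosine similarity between two sparse vectors stored as dicts
    dot = sum(u.get(w, 0.0) * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def category_similarity(text_vec, category_doc_vecs):
    # sum(c): element-wise sum of the tf-idf vectors of texts in category c
    total = Counter()
    for v in category_doc_vecs:
        for w, x in v.items():
            total[w] += x
    return cosine(text_vec, dict(total))
```

A text about football should then be more similar to a sports category than to a finance one, which is what the search step relies on.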
Specifically, the set of cognitive factors for the candidate class of the text to be annotated may be determined by the initial candidate classification model. Inputting the text to be annotated into a preset initial candidate class classification model to obtain a candidate class cognition factor set of the text to be annotated, wherein the initial candidate class classification model is obtained according to each second sample text and the candidate class training corresponding to each second sample text.
The second sample texts may be sample texts manually labeled with a parent category. Each second sample text serves as the input data of the initial candidate-category classification model, the labeled parent category serves as the output data, and the model is trained from this input and output.
The manner in which the initial candidate classification model is trained is described below in a specific example:
Collect news texts and manually label each collected text with a parent category. Take the collected texts as second sample texts, the texts as input data, and the corresponding labeled parent categories as output data, then train the initial candidate-category model with a machine learning algorithm, which may be naive Bayes, logistic regression, a long short-term memory neural network, and so on.
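A minimal sketch of training such an initial candidate-category model, using multinomial naive Bayes (one of the algorithms the text names). The class name and interface are assumptions made for illustration, not the patent's implementation.

```python
import math
from collections import Counter

class NaiveBayesCategorizer:
    """Minimal multinomial naive Bayes for the initial candidate-category model."""

    def fit(self, texts, labels):
        # texts: list of token lists (second sample texts); labels: parent categories
        self.classes = sorted(set(labels))
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        for tokens, c in zip(texts, labels):
            self.word_counts[c].update(tokens)
        self.vocab = {w for wc in self.word_counts.values() for w in wc}
        return self

    def predict_proba(self, tokens):
        # log-posterior per class with Laplace smoothing, normalised to probabilities
        scores = {}
        for c in self.classes:
            total = sum(self.word_counts[c].values()) + len(self.vocab)
            s = self.prior[c]
            for w in tokens:
                s += math.log((self.word_counts[c][w] + 1) / total)
            scores[c] = s
        m = max(scores.values())
        exp = {c: math.exp(s - m) for c, s in scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}
```

The normalised posteriors returned by `predict_proba` play the role of the initial probability values, i.e. the cognitive factors, of the matched parent categories.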
It may be appreciated that the cognitive-factor set contains the candidate categories matching the text to be annotated and the initial probability values of those categories; each cognitive factor is the initial probability value of the corresponding candidate category. That is, the initial candidate-category model may output initial probability values for one or more parent categories. For example, inputting a text I into the model may output the cognitive factor for I belonging to parent category A and the cognitive factor for I belonging to parent category B.
The product of the similarity and the cognitive factor is calculated for each category, and the set of these products is taken as the candidate-category probability set. For example, suppose there are m parent categories c1, c2, …, cm. The cognitive-factor set, i.e. the initial probability values of the candidate categories of the text to be annotated, can be written P = {p1, p2, …, pm}; the similarities between the text and the m candidate categories form U = {μ1, μ2, …, μm}; and the candidate-category probability set is then P_s = {p1μ1, p2μ2, …, pmμm}.
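The element-wise combination of cognitive factors and similarities can be sketched in one small function; the function name and the dict-based representation are illustrative assumptions.

```python
def candidate_probability_set(cognitive_factors, similarities):
    """Element-wise product p_i * mu_i over the candidate categories.

    Both arguments map a candidate-category name to a value, so the
    result is the candidate-category probability set P_s."""
    if cognitive_factors.keys() != similarities.keys():
        raise ValueError("category sets must match")
    return {c: cognitive_factors[c] * similarities[c] for c in cognitive_factors}
```

This is the step that corrects each machine-produced initial probability with the corpus similarity instead of relying on either signal alone.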
In order to facilitate the subsequent quick determination of the actual text category of the text to be annotated, the candidate category of the text to be annotated can be selected according to a preset rule and a candidate category probability set.
In a specific implementation, the selection follows a preset rule, which can be set as needed. For example, the probability values in the candidate-category probability set may be arranged in descending order, a preset number of them (at least 2) selected from the sorted set, and the candidate categories corresponding to the selected values taken as the candidate categories of the text to be annotated. The preset number can be set according to actual needs.
Alternatively, the probability values larger than a preset threshold may be selected from the candidate-category probability set, and the corresponding candidate categories taken as the candidate categories of the text to be annotated. The threshold can be set as required; the larger it is, the more interfering candidate categories are filtered out.
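The two preset selection rules just described can be sketched as follows; the function and parameter names are assumptions made for illustration.

```python
def select_candidates(prob_set, threshold=None, top_k=None):
    """Apply one of the two preset rules from the text:
    keep the categories whose combined value exceeds a threshold, or
    keep the top_k categories after sorting in descending order."""
    if threshold is not None:
        return sorted((c for c, p in prob_set.items() if p > threshold),
                      key=prob_set.get, reverse=True)
    ranked = sorted(prob_set, key=prob_set.get, reverse=True)
    return ranked[:top_k]
```

Either rule leaves a short, high-confidence candidate list, which is what keeps the later per-candidate classification fast.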
Step 102: determining a classification model according to the candidate category; the classification model is obtained through training according to each first sample text and the bottom sub-category corresponding to each first sample text, wherein the bottom sub-category is positioned at the bottom of the category hierarchy structure where the candidate category is positioned, the candidate category is positioned at the top of the category hierarchy structure, the candidate category is a father category, and the category hierarchy structure at least comprises 2 layers of categories.
In a specific implementation, according to a candidate category and a category hierarchy structure corresponding to the candidate category, acquiring a bottom subcategory set formed by all bottom subcategories in the category hierarchy structure; and determining a classification model according to the bottom subcategory set.
Specifically, the category hierarchy corresponding to the candidate category is preset, after the candidate category is determined, the category hierarchy corresponding to the candidate category can be obtained, and all the bottom sub-categories in the category hierarchy corresponding to the candidate category form a bottom sub-category set. Wherein the category hierarchy includes at least 2 levels of categories. The process of determining the underlying subcategory set is described below in one specific example:
Suppose the search yields candidate categories A and B, with the category hierarchy shown in fig. 3. Ignoring the intermediate subcategories, all bottom subcategories of the candidates are obtained: in fig. 3 these are a1 to a3 and b1 to b4, which combine into the bottom-subcategory set G = {a1, a2, a3, b1, b2, b3, b4}.
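Gathering the bottom-subcategory set amounts to a leaf traversal of the hierarchy. The dictionary representation of the hierarchy below is an assumption of this sketch, not the patent's data structure.

```python
def bottom_subcategories(hierarchy, candidates):
    """Collect all bottom (leaf) subcategories reachable from the candidate
    parent categories, skipping the intermediate layers.

    hierarchy maps a category to the list of its direct subcategories;
    a category absent from the map (or with no children) is a leaf."""
    leaves = set()

    def walk(node):
        children = hierarchy.get(node, [])
        if not children:
            leaves.add(node)
        for child in children:
            walk(child)

    for candidate in candidates:
        walk(candidate)
    return leaves
```

Run on a hierarchy shaped like the fig. 3 example, this yields exactly the set G above.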
There are various ways to determine the classification model based on the underlying subcategory set. In a specific implementation, the classification model is determined according to the bottom subcategory set and a preset corresponding relation between the bottom subcategory set and the classification model.
Specifically, according to all the bottom subcategories in the overall category hierarchy structure of the text, the classification model corresponding to each possible bottom subcategory set can be trained in advance. And storing the corresponding relation between the preset bottom subcategory set and the classification model, so that after the bottom subcategory set is determined, a proper classification model can be quickly determined. Wherein the stored classification models may be aggregated to form a classification model set.
A new classification model may also be trained based on the scale of the bottom-subcategory set and the text data corresponding to it, and the newly trained model then stored.
For example, let the bottom-subcategory set be G, the stored classification-model set be C, and the text to be annotated be t. Search C for a classification model corresponding to G; if one exists, acquire it and use it to determine the actual text category of t. If none exists, acquire training data corresponding to G according to its scale, train a new classification model B, and add B to the stored set C.
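The lookup-or-train logic over the stored model set C can be sketched as a cache keyed by the frozen bottom-subcategory set. `train_fn` stands in for whatever training routine is used; its interface, like the class name, is an assumption of this sketch.

```python
class ClassifierCache:
    """Stored classification-model set C, keyed by bottom-subcategory set.

    train_fn is a caller-supplied callback that trains a new model when no
    stored model matches the requested subcategory set."""

    def __init__(self, train_fn):
        self.models = {}          # frozenset of subcategories -> trained model
        self.train_fn = train_fn

    def get(self, subcategory_set):
        key = frozenset(subcategory_set)
        if key not in self.models:
            self.models[key] = self.train_fn(key)   # train model B, add to C
        return self.models[key]
```

Using `frozenset` makes the lookup independent of the order in which the bottom subcategories were gathered, so the same set always maps to the same stored model.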
It will be appreciated that when the number of candidate categories is less than n (n set according to actual needs, e.g. n = 100), the classification model may be obtained by training with a conventional machine learning method; when the number is greater than or equal to n, it may be obtained by training with a deep learning method, such as a convolutional neural network (CNN), a recurrent neural network (RNN), or a generative adversarial network (GAN).
Step 103: and determining the actual text category of the text to be annotated according to the text to be annotated and the classification model, and annotating the text to be annotated according to the actual text category.
The text to be annotated is used as the input data of the classification model to obtain its actual text category, and the text is then labeled with that determined category.
Compared with the prior art, in which the number of text categories is huge, searching for the candidate categories corresponding to the text to be annotated reduces the set of categories that must be considered. During the search, the obtained cognitive factors are combined with the similarity values, correcting the similarities and improving the accuracy of the search for each category. Because the number of bottom subcategories of the candidate categories is far smaller than the number of all bottom subcategories, the classification model can rapidly determine the actual text category from them, speeding up category determination. Determining the classification model from the candidate categories of the text makes the model more targeted and improves accuracy. And because the candidate categories are parent categories at the top of a hierarchy of at least two levels, the subcategories at each layer need not be judged level by level; all bottom subcategories of the candidates are obtained directly, reducing the impact of per-level judgment errors, lowering the probability of labeling errors, and improving labeling accuracy.
A second embodiment of the present application relates to a method of text category labeling. The second embodiment is a further improvement on the first embodiment, the main improvement being that in the second embodiment, a classification model corresponding to the bottom subcategory set is obtained through training according to the bottom subcategory set, and the classification model is saved after training, so as to enrich the number of classification models and reduce the labeling cost. The flow of training the classification model in this embodiment is shown in fig. 4.
Step 201: acquiring a first text sample set corresponding to the bottom subcategory set according to the bottom subcategory set, wherein the first text sample set comprises a first sample text corresponding to each bottom subcategory.
Specifically, a first text sample set corresponding to the bottom subcategory set may be collected according to the bottom subcategory set, where the first text sample set includes first sample texts labeled with their bottom subcategories.
Step 202: taking each first sample text as input data of the classification model, and taking the bottom subcategory corresponding to each first sample text as output data of the classification model.
Step 203: training to obtain the classification model according to the input data and the output data.
Specifically, the input data are the first sample texts and the output data are the bottom subcategories corresponding to each first sample text. The classification model continuously optimizes the parameters of the correspondence between the input data and the output data through machine learning, and the classification model is finally obtained through training. The machine learning method may be a support vector machine, a random forest, a long short-term memory neural network, or the like.
Step 204: and storing the classification model obtained by training and the corresponding relation between the classification model and the bottom subcategory set.
Specifically, after the classification model is obtained, the classification model and the correspondence between the classification model and the bottom subcategory set can be saved, which enriches the number of classification models and allows the category of the text to be annotated to be judged more accurately.
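Steps 201 to 204 above can be sketched as follows. This is a minimal illustration assuming a TF-IDF feature extractor with a linear support vector machine (one of the machine learning methods mentioned above) via scikit-learn; the sample texts, the bottom subcategory labels, and the `model_store` lookup structure are hypothetical stand-ins, not the patent's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Step 201: a first text sample set, each first sample text labeled
# with its bottom subcategory (hypothetical examples).
samples = [
    "stock prices rose sharply",
    "the striker scored twice",
    "bond yields fell",
    "the goalkeeper saved a penalty",
]
labels = [
    "finance/markets",
    "sports/football",
    "finance/markets",
    "sports/football",
]

# Steps 202-203: the sample texts are the input data and the bottom
# subcategories are the output data; fitting optimizes the parameters
# of the correspondence between them.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(samples, labels)

# Step 204: save the trained model together with its bottom subcategory
# set, so a later query can look up the model from the bottom
# subcategory set of a candidate category.
model_store = {frozenset(labels): model}
```

The saved correspondence lets a subsequent labeling run retrieve a targeted classifier for exactly the bottom subcategories under the matched candidate categories, rather than one model over all categories.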
According to the text category labeling method provided by this embodiment, the classification model is obtained through training according to the first text sample set corresponding to the bottom subcategory set, so that the classification model is more targeted and the speed of determining the actual text category of the text to be annotated by using the classification model is improved. The trained classification model is stored in real time, so the variety of classification models can be continuously enriched, improving the accuracy of subsequently determining the actual text category of the text to be annotated with a classification model.
The above division of method steps is for clarity of description. When implemented, steps may be combined into one step or a single step may be split into multiple steps; as long as the same logical relationship is included, they are within the protection scope of this patent. Adding insignificant modifications to the algorithm or flow, or introducing insignificant designs, without altering the core design of the algorithm and flow, is also within the scope of this patent.
A third embodiment of the present application relates to an electronic device, a specific structure of which is shown in fig. 5, comprising: at least one processor 301; and a memory 302 communicatively coupled to the at least one processor 301. The memory 302 stores instructions executable by the at least one processor 301, the instructions being executable by the at least one processor 301 to enable the at least one processor 301 to perform the method of text category labeling described above.
The memory 302 and the processor 301 are connected by a bus. The bus may comprise any number of interconnected buses and bridges that link together various circuits of the one or more processors 301 and the memory 302. The bus may also link together various other circuits, such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and therefore are not further described herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a unit for communicating with various other apparatus over a transmission medium. Data processed by the processor 301 is transmitted over a wireless medium via an antenna, and the antenna also receives data and transmits it to the processor 301.
The processor 301 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 302 may be used to store data used by the processor 301 in performing operations.
Those skilled in the art will appreciate that all or part of the steps in implementing the methods of the embodiments described above may be implemented by a program stored in a storage medium, the program including instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the application and that various changes in form and details may be made therein without departing from the spirit and scope of the application.

Claims (9)

1. A method for labeling text categories, comprising:
searching candidate categories corresponding to the text to be annotated;
determining a classification model according to the candidate category, wherein the classification model is obtained through training according to each first sample text and a bottom sub-category corresponding to each first sample text, the bottom sub-category is positioned at the bottom of a category hierarchy structure where the candidate category is positioned, the candidate category is positioned at the top of the category hierarchy structure and is a father category, and the category hierarchy structure at least comprises 2 layers of categories;
determining the actual text category of the text to be annotated according to the text to be annotated and the classification model, and annotating the text to be annotated according to the actual text category;
the searching for the candidate category corresponding to the text to be annotated specifically comprises the following steps:
acquiring a cognitive factor set of the candidate categories of the text to be annotated, wherein the cognitive factor set comprises matched candidate categories and an initial probability value corresponding to each matched candidate category;
calculating the similarity between the category of the text to be annotated and each candidate category, and gathering each similarity to obtain a similarity set;
determining a candidate category probability set according to the cognitive factor set and the similarity set;
wherein the determining the candidate category probability set according to the cognitive factor set and the similarity set comprises:
calculating the product of the similarity and the cognitive factor corresponding to each candidate category according to the cognitive factor set and the similarity set, and determining the set formed by the calculated products as the candidate category probability set;
and selecting the candidate category of the text to be annotated according to a preset rule and the candidate category probability set.
2. The method for labeling text categories according to claim 1, wherein the acquiring the cognitive factor set of the candidate categories of the text to be annotated specifically comprises:
inputting the text to be annotated into a preset initial candidate category classification model to obtain the cognitive factor set of the candidate categories of the text to be annotated, wherein the initial candidate category classification model is obtained through training according to each second sample text and the candidate category corresponding to each second sample text.
3. The method for labeling text categories according to claim 1, wherein selecting the candidate category of the text to be annotated according to a preset rule and the candidate category probability set specifically comprises:
selecting a probability value larger than a preset threshold from the candidate category probability set, and taking a candidate category corresponding to the selected probability value as the candidate category of the text to be annotated;
or,
or arranging the probability values in the candidate category probability set in descending order, selecting a preset number of probability values from the ordered set, and taking the candidate categories corresponding to the selected probability values as the candidate categories of the text to be annotated, wherein the preset number is at least 2.
4. A method of text category labeling according to any of claims 1-3, characterized in that the determining a classification model from the candidate categories comprises in particular:
acquiring a bottom subcategory set formed by all bottom subcategories in a category hierarchy according to the candidate category and the category hierarchy corresponding to the candidate category;
and determining a classification model according to the bottom subcategory set.
5. The method for labeling text categories according to claim 4, wherein determining a classification model according to the bottom subcategory set specifically comprises:
and determining the classification model according to the bottom subcategory set and the preset corresponding relation between the bottom subcategory set and the classification model.
6. The method for labeling text categories according to claim 4, wherein the training process of the classification model specifically comprises:
according to the bottom subcategory set, acquiring a first text sample set corresponding to the bottom subcategory set, wherein the first text sample set comprises the first sample text corresponding to each bottom subcategory;
taking each first sample text as input data of the classification model, and taking the bottom subcategory corresponding to each first sample text as output data of the classification model;
and training to obtain the classification model according to the input data and the output data.
7. The method of text category labeling of claim 6, wherein after training to obtain the classification model and before the determining a classification model from the underlying set of subcategories, the method of text category labeling further comprises:
and storing the classification model obtained through training and the corresponding relation between the classification model and the bottom subcategory set.
8. An electronic device, comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of text category labeling of any of claims 1-7.
9. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the method of text category labeling of any of claims 1 to 7.
CN201910456149.3A 2019-05-29 2019-05-29 Text category labeling method, electronic equipment and readable storage medium Active CN110347821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910456149.3A CN110347821B (en) 2019-05-29 2019-05-29 Text category labeling method, electronic equipment and readable storage medium


Publications (2)

Publication Number Publication Date
CN110347821A CN110347821A (en) 2019-10-18
CN110347821B true CN110347821B (en) 2023-08-25

Family

ID=68174432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910456149.3A Active CN110347821B (en) 2019-05-29 2019-05-29 Text category labeling method, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN110347821B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680155A (en) * 2020-05-13 2020-09-18 新华网股份有限公司 Text classification method and device, electronic equipment and computer storage medium
CN112001169B (en) * 2020-07-17 2022-03-25 北京百度网讯科技有限公司 Text error correction method and device, electronic equipment and readable storage medium
CN112417857A (en) * 2020-12-02 2021-02-26 北京华彬立成科技有限公司 Patent text analysis method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016162155A (en) * 2015-03-02 2016-09-05 本田技研工業株式会社 Electronic manual display system, terminal device and program
CN107273295A (en) * 2017-06-23 2017-10-20 中国人民解放军国防科学技术大学 A kind of software problem reporting sorting technique based on text randomness
CN107679035A (en) * 2017-10-11 2018-02-09 石河子大学 A kind of information intent detection method, device, equipment and storage medium
CN108804512A (en) * 2018-04-20 2018-11-13 平安科技(深圳)有限公司 Generating means, method and the computer readable storage medium of textual classification model
CN108875051A (en) * 2018-06-28 2018-11-23 中译语通科技股份有限公司 Knowledge mapping method for auto constructing and system towards magnanimity non-structured text
CN109614703A (en) * 2018-12-11 2019-04-12 南京天航智能装备研究院有限公司 A kind of multi- disciplinary integrated modeling of the electric-hydraulic combined steering system of automobile and optimization method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7024408B2 (en) * 2002-07-03 2006-04-04 Word Data Corp. Text-classification code, system and method
US20100169243A1 (en) * 2008-12-27 2010-07-01 Kibboko, Inc. Method and system for hybrid text classification
US20180060728A1 (en) * 2016-08-31 2018-03-01 Microsoft Technology Licensing, Llc Deep Embedding Forest: Forest-based Serving with Deep Embedding Features
US10896385B2 (en) * 2017-07-27 2021-01-19 Logmein, Inc. Real time learning of text classification models for fast and efficient labeling of training data and customization


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on a short text classification method based on keyword similarity; Zhang Zhenhao et al.; Application Research of Computers; Vol. 37, No. 1, pp. 26-29 *

Also Published As

Publication number Publication date
CN110347821A (en) 2019-10-18

Similar Documents

Publication Publication Date Title
CN111444320B (en) Text retrieval method and device, computer equipment and storage medium
CN108463795B (en) Self-service classification system
El Kourdi et al. Automatic Arabic document categorization based on the Naïve Bayes algorithm
CN105893609B (en) A kind of mobile APP recommended method based on weighted blend
CN106940726B (en) Creative automatic generation method and terminal based on knowledge network
CN110347821B (en) Text category labeling method, electronic equipment and readable storage medium
CN108845988B (en) Entity identification method, device, equipment and computer readable storage medium
CN106997341B (en) A kind of innovation scheme matching process, device, server and system
KR102069621B1 (en) Apparatus and Method for Documents Classification Using Documents Organization and Deep Learning
CN106294783A (en) A kind of video recommendation method and device
US20080168056A1 (en) On-line iterative multistage search engine with text categorization and supervised learning
US11256991B2 (en) Method of and server for converting a categorical feature value into a numeric representation thereof
CN108846097B (en) User interest tag representation method, article recommendation device and equipment
JP2008257732A (en) Method for document clustering or categorization
CN106874292A (en) Topic processing method and processing device
CN112749330B (en) Information pushing method, device, computer equipment and storage medium
CN112650923A (en) Public opinion processing method and device for news events, storage medium and computer equipment
CN110032631B (en) Information feedback method, device and storage medium
CN110019794A (en) Classification method, device, storage medium and the electronic device of textual resources
US20190287018A1 (en) Categorization for a global taxonomy
CN110347701B (en) Target type identification method for entity retrieval query
CN103412888A (en) Point of interest (POI) identification method and device
CN114238573B (en) Text countercheck sample-based information pushing method and device
CN111159414A (en) Text classification method and system, electronic equipment and computer readable storage medium
CN110008309A (en) A kind of short phrase picking method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant