CN113536806B

CN113536806B - Text classification method and device

Info

Publication number: CN113536806B
Application number: CN202110810164.0A
Authority: CN
Inventors: 刘洋
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2021-07-18
Filing date: 2021-07-18
Publication date: 2023-09-08
Anticipated expiration: 2041-07-18
Also published as: CN113536806A

Abstract

Embodiments of the invention provide a text classification method and device, relating to the technical field of information processing. The method comprises: calling, in parallel, the first classification modes corresponding to the respective first classification levels to classify a text to be classified, obtaining candidate classification types of the text at each first classification level; obtaining, according to the candidate classification types and along the hierarchical relationship of the candidate classification types between the first classification levels, an intermediate classification type for classifying the text; and, taking the second classification mode corresponding to the intermediate classification type as the initial classification mode, serially calling, level by level along the hierarchical relationship of the classification types between the second classification levels, the second classification mode corresponding to each second classification level to classify the text until a preset lowest classification level is reached, thereby obtaining the classification result of the text. Classifying text with the scheme provided by the embodiments of the invention can improve the accuracy of text classification.

Description

Text classification method and device

Technical Field

The present invention relates to the field of information processing technologies, and in particular, to a text classification method and apparatus.

Background

With the development of internet technology, network-based application scenarios involve more and more data. Some scenarios require the texts they contain to be classified so that subsequent processing can be based on the classification results. For example, in a customer service scenario, the text sent by a user can be classified and the category of problem the user wants solved can be located from the classification result, which improves the efficiency of solving the user's problem. In comment scenarios and social feed scenarios, the texts posted by users can be classified and the users' emotional tendencies determined from the classification results, so that the posted texts can be filtered and the network environment maintained.

In the prior art, a single preset unified classification mode is generally used to classify text. For example, the unified classification mode may be a classification mode based on a Bayesian algorithm. However, in many application scenarios texts fall into a large number of categories. For example, in a news scenario, the categories of text may include basketball news, football news, badminton news, financial news, real estate news, health news, child care news, and so on. In this case, the unified classification mode must account for the characteristics of many different types of text, which demands high robustness. In practice, however, each single classification mode has its own characteristics and can hardly account for the characteristics of all the different types of text, so the accuracy of text classification tends to be low.

Disclosure of Invention

The embodiment of the invention aims to provide a text classification method and device so as to improve the accuracy of text classification. The specific technical scheme is as follows:

In a first aspect, an embodiment of the present invention provides a text classification method, where the method includes:

calling, in parallel, the first classification modes corresponding to the respective first classification levels to classify the text to be classified, obtaining candidate classification types of the text to be classified at each first classification level, wherein each first classification mode is used for classifying text among all classification types included in the corresponding classification level;

obtaining, according to the candidate classification types and along the hierarchical relationship of the candidate classification types between the first classification levels, an intermediate classification type for classifying the text to be classified;

and taking the second classification mode corresponding to the intermediate classification type as the initial classification mode, and serially calling, level by level along the hierarchical relationship of the classification types between the second classification levels, the second classification mode corresponding to each second classification level to classify the text to be classified until a preset lowest classification level is reached, obtaining the classification result of the text to be classified, wherein one second classification mode is used for classifying text among all subtypes included in one classification type in the corresponding classification level, the first classification levels are higher than the second classification levels, and the second classification mode serially called for a second classification level is the mode that classifies text among all subtypes included in the classification type obtained by classification at the previous level.

In one embodiment of the present invention, the obtaining, according to the candidate classification types and along the hierarchical relationship of the candidate classification types between the first classification levels, an intermediate classification type for classifying the text to be classified includes:

selecting classification types from the candidate classification types along the hierarchical relationship of the candidate classification types between the first classification levels to obtain a plurality of type groups, wherein the number of classification types belonging to the same classification level in each type group is 1, and classification types belonging to adjacent classification levels satisfy the following condition: the classification type belonging to the lower classification level is a subtype of the classification type belonging to the higher classification level;

calculating, for each type group, the global probability that the classification types of the text to be classified are the candidate classification types included in that type group, according to the probability that the classification type of the text to be classified is each candidate classification type;

selecting a candidate type group of the text to be classified from the plurality of type groups in descending order of global probability;

and determining the candidate classification type with the lowest classification level in the selected candidate type group as the intermediate classification type for classifying the text to be classified.

In one embodiment of the present invention, the calculating of the global probability for each type group, according to the probability that the classification type of the text to be classified is each candidate classification type, includes:

for each type group, taking the preset weight of each classification level as the weight of the candidate classification type in the type group that belongs to that classification level, and performing a weighted calculation on the probabilities that the classification type of the text to be classified is each candidate classification type in the type group, to obtain the global probability.

In one embodiment of the present invention, the selecting the candidate type group of the text to be classified from a plurality of type groups according to the order of global probability from high to low includes:

selecting the type group with the highest global probability from the plurality of type groups as the candidate type group of the text to be classified.
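The weighted calculation and the selection of the intermediate classification type can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the "weighted calculation" is a plain weighted sum, it numbers levels so that a larger number means a lower level, and the probabilities, weights, and function names are hypothetical values chosen for the example.

```python
from typing import Dict, List, Tuple

# A type group holds one (level, classification type) pair per first level.
TypeGroup = List[Tuple[int, str]]


def global_probability(group: TypeGroup,
                       probs: Dict[str, float],
                       level_weights: Dict[int, float]) -> float:
    """Weighted sum of per-type probabilities (assumed form of the weighting)."""
    return sum(level_weights[level] * probs[name] for level, name in group)


def select_intermediate_type(groups: List[TypeGroup],
                             probs: Dict[str, float],
                             level_weights: Dict[int, float]) -> str:
    """Pick the group with the highest global probability, then return
    its classification type at the lowest level (largest level number)."""
    best = max(groups, key=lambda g: global_probability(g, probs, level_weights))
    return max(best, key=lambda pair: pair[0])[1]


# Hypothetical probabilities produced by the first classification modes.
PROBS = {"ball news": 0.7, "track and field news": 0.3,
         "basketball news": 0.5, "sprint news": 0.4}
# Hypothetical preset weight per classification level.
WEIGHTS = {1: 0.4, 2: 0.6}
GROUPS = [
    [(1, "ball news"), (2, "basketball news")],
    [(1, "track and field news"), (2, "sprint news")],
]

intermediate = select_intermediate_type(GROUPS, PROBS, WEIGHTS)
```

With these example values the first group scores 0.4 × 0.7 + 0.6 × 0.5 and the second 0.4 × 0.3 + 0.6 × 0.4, so the first group is selected and its lowest-level type becomes the intermediate classification type.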

In one embodiment of the invention, the first classification levels are consecutive classification levels determined, in order of classification level from high to low, according to the number of classification types each classification level includes.

In one embodiment of the invention, the method further comprises:

performing semantic analysis on the text to be classified to obtain a semantic vector representing the semantics of the text to be classified;

the step of calling, in parallel, the first classification modes corresponding to the respective first classification levels to classify the text to be classified, obtaining candidate classification types of the text to be classified at each first classification level, comprises:

taking the semantic vector as the input data of the first classification mode corresponding to each first classification level, and calling the first classification modes in parallel to classify the text to be classified, obtaining candidate classification types of the text to be classified at each first classification level;

the step of serially calling, level by level, the second classification mode corresponding to each second classification level to classify the text to be classified comprises:

taking the semantic vector as the input data of the second classification mode corresponding to each second classification level, and serially calling the second classification modes level by level to classify the text to be classified.

In a second aspect, an embodiment of the present invention provides a text classification apparatus, including:

a first text classification module, used for calling, in parallel, the first classification modes corresponding to the respective first classification levels to classify the text to be classified, obtaining candidate classification types of the text to be classified at each first classification level, wherein each first classification mode is used for classifying text among all classification types included in the corresponding classification level;

an intermediate type obtaining module, used for obtaining, according to the candidate classification types and along the hierarchical relationship of the candidate classification types between the first classification levels, an intermediate classification type for classifying the text to be classified;

a second text classification module, used for taking the second classification mode corresponding to the intermediate classification type as the initial classification mode, and serially calling, level by level along the hierarchical relationship of the classification types between the second classification levels, the second classification mode corresponding to each second classification level to classify the text to be classified until a preset lowest classification level is reached, obtaining the classification result of the text to be classified, wherein one second classification mode is used for classifying text among all subtypes included in one classification type in the corresponding classification level, the first classification levels are higher than the second classification levels, and the second classification mode serially called for a second classification level is the mode that classifies text among all subtypes included in the classification type obtained by classification at the previous level.

In one embodiment of the present invention, the intermediate type obtaining module includes:

a type group obtaining submodule, used for selecting classification types from the candidate classification types along the hierarchical relationship of the candidate classification types between the first classification levels to obtain a plurality of type groups, wherein the number of classification types belonging to the same classification level in each type group is 1, and classification types belonging to adjacent classification levels satisfy the following condition: the classification type belonging to the lower classification level is a subtype of the classification type belonging to the higher classification level;

a probability calculation submodule, used for calculating, for each type group, the global probability that the classification types of the text to be classified are the candidate classification types included in that type group, according to the probability that the classification type of the text to be classified is each candidate classification type;

a type group selection submodule, used for selecting a candidate type group of the text to be classified from the plurality of type groups in descending order of global probability;

and a type determining submodule, used for determining the candidate classification type with the lowest classification level in the selected candidate type group as the intermediate classification type for classifying the text to be classified.

In one embodiment of the present invention, the probability calculation sub-module is specifically configured to:

In one embodiment of the present invention, the type group selection submodule is specifically configured to:

In one embodiment of the invention, the apparatus further comprises:

a semantic analysis module, used for performing semantic analysis on the text to be classified to obtain a semantic vector representing the semantics of the text to be classified;

the first text classification module is specifically configured to:

the second text classification module is specifically configured to:

and taking the second classification mode corresponding to the intermediate classification type as an initial classification mode, taking the semantic vector as input data of the second classification mode corresponding to each second classification level along the hierarchical relation of each classification type among the second classification levels, and calling each second classification mode in a serial manner level by level to classify the text to be classified until reaching a preset lowest classification level, so as to obtain a classification result of the text to be classified.

In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus;

the memory is used for storing a computer program;

and the processor is used for implementing the steps of the text classification method according to any implementation of the first aspect when executing the program stored in the memory.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the text classification method according to any implementation of the first aspect.

As can be seen from the above, when the scheme provided by the embodiments of the invention is applied to classify text, the first classification modes corresponding to the first classification levels are called in parallel to classify the text to be classified, yielding candidate classification types of the text at each first classification level. The hierarchical relationship of the candidate classification types between the first classification levels is then used to obtain an intermediate classification type for classifying the text. Finally, with the second classification mode corresponding to the intermediate classification type as the initial classification mode, the second classification modes corresponding to the second classification levels are called serially, level by level, along the hierarchical relationship of the classification types between the second classification levels until the preset lowest classification level is reached, yielding the classification result of the text. Because the first classification modes are called in parallel to obtain the intermediate classification type and the second classification modes are then called serially to obtain the classification result, the scheme both follows the hierarchical relationship of the classification types between levels and combines multiple classification modes, so that each classification mode applies its strengths at a different position in the hierarchy. Classifying text with the scheme provided by the embodiments of the invention can therefore improve the accuracy of text classification.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.

Fig. 1 is a schematic diagram of a tree structure of hierarchical classification according to an embodiment of the present invention;

Fig. 2 is a flow chart of a first text classification method according to an embodiment of the present invention;

Fig. 3 is a flow chart of a second text classification method according to an embodiment of the present invention;

Fig. 4 is a flow chart of a third text classification method according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a first text classification device according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a second text classification device according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a third text classification device according to an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.

Text classification accuracy is low when the prior art is applied. To solve this technical problem, embodiments of the invention provide a text classification method and device.

In one embodiment of the present invention, there is provided a text classification method, including:

taking the second classification mode corresponding to the intermediate classification type as the initial classification mode, and serially calling, level by level along the hierarchical relationship of the classification types between the second classification levels, the second classification mode corresponding to each second classification level to classify the text to be classified until the preset lowest classification level is reached, obtaining the classification result of the text to be classified, wherein one second classification mode is used for classifying text among all subtypes included in one classification type in the corresponding classification level, the first classification levels are higher than the second classification levels, and the second classification mode serially called for a second classification level is the mode that classifies text among all subtypes included in the classification type obtained by classification at the previous level.

As can be seen from the above, when text is classified with the scheme provided by the embodiments of the invention, the first classification modes are called in parallel to obtain the intermediate classification type, and the second classification modes are then called serially to obtain the classification result. The scheme thus both follows the hierarchical relationship of the classification types between levels and combines multiple classification modes, so that each classification mode applies its strengths at a different position in the hierarchy. Classifying text with the scheme provided by the embodiments of the invention can therefore improve the accuracy of text classification.

To explain the concepts involved in the embodiments of the present invention clearly, Fig. 1 gives an example of a hierarchical classification represented as a tree structure. As can be seen from Fig. 1:

The classification type "sports news" includes two subtypes, "track and field news" and "ball news".

The classification type "track and field news" includes two subtypes, "javelin news" and "sprint news".

The classification type "javelin news" includes two subtypes, "male javelin news" and "female javelin news".

The classification type "sprint news" includes two subtypes, "male sprint news" and "female sprint news".

The classification type "ball news" includes three subtypes, "basketball news", "football news" and "volleyball news".

The classification type "basketball news" includes two subtypes, "male basketball news" and "female basketball news".

The classification type "football news" includes two subtypes, "male football news" and "female football news".

The classification type "volleyball news" includes two subtypes, "male volleyball news" and "female volleyball news".

The concepts involved in the embodiments of the present invention will be described with reference to fig. 1.

A classification type is a type of content described by text, obtained when text is classified according to a particular characteristic. For example, according to the kind of sport, "sports news" text can be divided into "track and field news" text and "ball news" text, where "track and field news" and "ball news" are two classification types.

Likewise, "sports news", "javelin news", "sprint news", "basketball news", "football news", "volleyball news", and so on in Fig. 1 are also classification types.

A classification level can be understood as the level at which a classification type sits relative to the root classification type. The root classification type is the classification type at the root of the hierarchy, and it covers the widest range of text. In a hierarchical classification, the classification level of the root classification type is the highest.

Accordingly, the classification type "sports news" in Fig. 1 is the root classification type, and its classification level is the highest. "Track and field news" and "ball news" are both at the first classification level below the root classification type "sports news", so they are at the same classification level. "Javelin news", "sprint news", "basketball news", "football news" and "volleyball news" are all at the second classification level below the root classification type "sports news", so they are at the same classification level. Similarly, "male javelin news", "female javelin news", ..., "male volleyball news" and "female volleyball news" are at the same classification level.
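As an illustration of these concepts, the tree described for Fig. 1 can be written as a mapping from each classification type to its subtypes, and the classification levels recovered by a breadth-first walk from the root. This is only a sketch: the names `TAXONOMY` and `classification_levels` are hypothetical, and numbering the root level as 0 is an assumption made so that the levels below it are numbered 1, 2 and 3.

```python
from collections import deque
from typing import Dict, List

# Subtype lists transcribed from the tree described for Fig. 1.
TAXONOMY: Dict[str, List[str]] = {
    "sports news": ["track and field news", "ball news"],
    "track and field news": ["javelin news", "sprint news"],
    "ball news": ["basketball news", "football news", "volleyball news"],
    "javelin news": ["male javelin news", "female javelin news"],
    "sprint news": ["male sprint news", "female sprint news"],
    "basketball news": ["male basketball news", "female basketball news"],
    "football news": ["male football news", "female football news"],
    "volleyball news": ["male volleyball news", "female volleyball news"],
}


def classification_levels(taxonomy: Dict[str, List[str]],
                          root: str) -> Dict[int, List[str]]:
    """Group classification types by their depth below the root type."""
    levels: Dict[int, List[str]] = {}
    queue = deque([(root, 0)])  # assumption: the root sits at level 0
    while queue:
        name, depth = queue.popleft()
        levels.setdefault(depth, []).append(name)
        for child in taxonomy.get(name, []):
            queue.append((child, depth + 1))
    return levels


LEVELS = classification_levels(TAXONOMY, "sports news")
```

Under this numbering, `LEVELS[1]` holds "track and field news" and "ball news", `LEVELS[2]` holds the five second-level types, and `LEVELS[3]` holds the ten leaf types.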

The text classification method, the device, the electronic equipment and the storage medium provided by the embodiment of the invention are described in detail below by means of a specific embodiment in combination with the tree structure schematic diagram of the hierarchical classification shown in fig. 1.

Referring to fig. 2, a flow chart of a first text classification method is provided, which includes the following steps S201-S203.

Step S201: Call, in parallel, the first classification modes corresponding to the respective first classification levels to classify the text to be classified, obtaining candidate classification types of the text to be classified at each first classification level.

Each first classification mode is used for classifying text among all classification types included in the corresponding classification level.

That is, there is a correspondence between the first classification levels and the first classification modes, and one first classification level corresponds to one first classification mode. A first classification hierarchy includes a plurality of classification types. When classifying the text by using a first classification mode corresponding to the first classification level, determining one or more classification types to which the text belongs from all classification types contained in the first classification level, wherein the determined one or more classification types are called candidate classification types of the text to be classified in the first classification level.

For example, in the schematic diagram of Fig. 1, the classification level containing the classification types "track and field news" and "ball news" may correspond to one first classification mode. That mode classifies text among the "track and field news" and "ball news" types included in the level, determining whether the text is "track and field news" text or "ball news" text.

There may be one first classification level or a plurality of first classification levels. Because one first classification level corresponds to one first classification mode, one first classification mode is called to classify the text, and the candidate classification type of the text to be classified in the first classification level can be obtained.

The first classification mode may be implemented with a classifier, for example a Bayesian classifier. It may also be implemented with a classification model, for example a BERT classification model.
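The parallel call in step S201 might look like the sketch below, where each first classification level has its own classifier and all of them run on the same text concurrently. Using a thread pool is an assumption (the text only says "in parallel"), and the stub classifiers with fixed probabilities are hypothetical stand-ins for a Bayesian classifier or a BERT model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# A first classification mode maps a text to {classification type: probability}.
Classifier = Callable[[str], Dict[str, float]]


def classify_first_levels(text: str,
                          classifiers: Dict[int, Classifier]
                          ) -> Dict[int, Dict[str, float]]:
    """Run one classifier per first classification level concurrently."""
    with ThreadPoolExecutor(max_workers=max(1, len(classifiers))) as pool:
        futures = {level: pool.submit(clf, text)
                   for level, clf in classifiers.items()}
        return {level: future.result() for level, future in futures.items()}


def level1_stub(text: str) -> Dict[str, float]:
    # Hypothetical stand-in for the level 1 classification mode.
    return {"ball news": 0.7, "track and field news": 0.3}


def level2_stub(text: str) -> Dict[str, float]:
    # Hypothetical stand-in for the level 2 classification mode.
    return {"basketball news": 0.5, "sprint news": 0.4}


candidates = classify_first_levels("The team won the basketball final.",
                                   {1: level1_stub, 2: level2_stub})
```

Each entry of `candidates` holds the candidate classification types, with their probabilities, for one first classification level.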

Since the classification level at which the root classification type is located may be considered to include only one classification type, there is typically no need to classify the text at that classification level, nor is the classification level typically regarded as the first classification level.

In one embodiment of the invention, the first classification levels may be consecutive classification levels determined, in order of classification level from high to low, according to the number of classification types each classification level includes.

The lower a classification level is, the more classification types it contains. When one classification mode classifies among the classification types of a level, it must account for the characteristics of the text of every classification type, so more types mean more information to consider. To keep a classification mode from having to account for too much information, in one implementation whether a classification level is a first classification level may be determined from the number of classification types the level includes.

Specifically, if the number of classification types included in one classification level is smaller than a preset classification type number threshold, determining that the classification level is a first classification level, and if the number of classification types included in one classification level is larger than the preset threshold, determining that the classification level is not the first classification level.

For example, in the schematic diagram of Fig. 1, excluding the highest classification level and taking the levels from high to low, classification level 1, classification level 2 and classification level 3 include 2, 5 and 10 classification types respectively. If the preset classification type number threshold is 8, classification level 1 and classification level 2 are first classification levels; if the threshold is 4, only classification level 1 is a first classification level. Because the number of classification types keeps increasing from higher to lower levels, the first classification levels whose number of classification types is below the threshold are consecutive classification levels.
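The threshold rule can be sketched in a few lines. The function name is hypothetical, and treating a level whose type count equals the threshold as not being a first classification level is an assumption, since the text only covers the "smaller than" and "larger than" cases.

```python
from typing import Dict, List


def first_classification_levels(type_counts: Dict[int, int],
                                threshold: int) -> List[int]:
    """Walk levels from high (small number) to low and keep the consecutive
    levels whose number of classification types is below the threshold."""
    selected: List[int] = []
    for level in sorted(type_counts):
        if type_counts[level] >= threshold:  # assumption: equality excludes
            break
        selected.append(level)
    return selected


# Number of classification types per level in Fig. 1, root level excluded.
FIG1_COUNTS = {1: 2, 2: 5, 3: 10}

levels_for_8 = first_classification_levels(FIG1_COUNTS, 8)
levels_for_4 = first_classification_levels(FIG1_COUNTS, 4)
```

For the Fig. 1 counts this reproduces the two cases in the text: a threshold of 8 selects levels 1 and 2, and a threshold of 4 selects only level 1.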

Step S202: Obtain, according to the candidate classification types and along the hierarchical relationship of the candidate classification types between the first classification levels, an intermediate classification type for classifying the text to be classified.

The hierarchical relationship of classification types between classification levels is based on the inclusion relationships between the classification types: one classification type may include multiple subtypes. Accordingly, for two adjacent classification levels, if several classification types in the lower classification level are subtypes of one classification type in the higher classification level, a hierarchical relationship exists between that classification type in the higher level and those classification types in the lower level.

For example, in Fig. 1, classification level 1 and classification level 2 are adjacent classification levels, and classification level 1 is higher than classification level 2. "Javelin news" and "sprint news" are subtypes of "track and field news", and "basketball news", "football news" and "volleyball news" are subtypes of "ball news". A hierarchical relationship therefore exists between "javelin news", "sprint news" and "track and field news", and between "basketball news", "football news", "volleyball news" and "ball news".

Step S201 yields the candidate classification types of the text to be classified at each first classification level, and hierarchical relationships may exist between candidate classification types at adjacent classification levels. The candidate classification types can therefore be grouped according to the hierarchical relationship of the candidate classification types between the first classification levels, and the classification type with the lowest classification level in a grouping result can be taken as the intermediate classification type.

Specifically, when the candidate classification types are grouped, each resulting group satisfies the following: the number of classification types belonging to the same classification level is 1, and classification types belonging to adjacent classification levels satisfy: the classification type belonging to the lower classification level is a subtype of the classification type belonging to the higher classification level.

For example, in the schematic diagram of Fig. 1, "track and field news" includes "javelin news" and "sprint news", and "ball news" includes "basketball news", "football news" and "volleyball news". If the candidate classification type at one first classification level is "ball news" and the candidate classification type at the other is "football news", the intermediate classification type obtained is "football news". If the candidate classification type at one first classification level is "ball news" and the candidate classification types at the other are "basketball news" and "sprint news", then "ball news" and "sprint news" do not conform to the hierarchical relationship between the first classification levels, whereas "ball news" and "basketball news" do, so the intermediate classification type obtained is "basketball news".
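The grouping described above can be sketched by enumerating one candidate per first classification level and keeping only the combinations in which every lower-level type is a subtype of the type chosen one level above. The `PARENT` map and the function name are hypothetical; the map covers only the level 1 and level 2 types of Fig. 1.

```python
from itertools import product
from typing import Dict, List, Tuple

# Parent of each level 2 classification type in Fig. 1.
PARENT: Dict[str, str] = {
    "javelin news": "track and field news",
    "sprint news": "track and field news",
    "basketball news": "ball news",
    "football news": "ball news",
    "volleyball news": "ball news",
}


def build_type_groups(candidates: Dict[int, List[str]]
                      ) -> List[Tuple[str, ...]]:
    """Return every chain with one candidate per level, ordered from the
    highest level to the lowest, that respects the subtype relationship."""
    levels = sorted(candidates)
    groups = []
    for combo in product(*(candidates[level] for level in levels)):
        if all(PARENT.get(child) == parent
               for parent, child in zip(combo, combo[1:])):
            groups.append(combo)
    return groups


groups = build_type_groups({1: ["ball news"],
                            2: ["basketball news", "sprint news"]})
# The last element of a group is its type at the lowest classification level.
intermediate_type = groups[0][-1]
```

In this example the pair of "ball news" and "sprint news" is discarded because "sprint news" is not a subtype of "ball news", leaving a single group whose lowest-level type is "basketball news".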

Step S203: and taking a second classification mode corresponding to the intermediate classification type as an initial classification mode, and calling the second classification mode corresponding to each second classification level in a serial manner level by level along the level relation of each classification type among the second classification levels to classify the text to be classified until the preset lowest classification level is reached, so as to obtain the classification result of the text to be classified.

Wherein, one second classification mode is used to classify text according to all subtypes included in one classification type at the corresponding classification level; the first classification levels are higher than the second classification levels; and the second classification mode serially called for a second classification level is the mode that classifies text according to all subtypes included in the classification type obtained by classification at the previous level.

In the schematic diagram shown in fig. 1, the second classification level is a classification level other than the classification level where the root classification type is located and the first classification level. If class level 1 and class level 2 in fig. 1 are the first class level, class level 3 and class level 4 are the second class level; if class level 1 is the first class level in fig. 1, class levels 2, 3, and 4 are the second class levels.

Specifically, that a second classification mode classifies text according to all the subtypes included in one classification type at the corresponding classification level means the following: once the text to be classified has been assigned to a classification type at a classification level, calling the second classification mode corresponding to that classification type assigns the text to one of the subtypes included in that classification type. For example, if the classification type of the text to be classified at one classification level is "ball news", the second classification mode corresponding to "ball news" may be called to classify the text into one of "basketball news", "football news" and "volleyball news", which are included in "ball news".

When a second classification mode is called, the classification type of the text to be classified at the second classification level corresponding to that mode is already known. Calling the second classification mode corresponding to the known classification type then determines the classification type of the text to be classified at the next level.

The second classification modes are specific to the classification types contained in the corresponding second classification level. If a second classification level contains three classification types, there are also three second classification modes for that level, one for each of the three classification types. When the classification type of the text to be classified at that level is one of the three classification types, the second classification mode corresponding to that classification type is called to classify the text to be classified.

The second classification mode serially called for a second classification level classifies text according to all subtypes included in the classification type obtained at the previous level. That classification type was obtained by the previous call, in which the second classification mode corresponding to the previous level classified the text according to all subtypes of one classification type at that level. The classification type obtained at the previous level is therefore the classification type of the text to be classified at the second classification level of the current serial call, and the second classification mode called this time classifies the text according to all subtypes included in that classification type.

This step is illustrated below with reference to the schematic diagram shown in fig. 1.

As shown in the schematic diagram of fig. 1, suppose the intermediate classification type is "football news". The second classification mode corresponding to "football news" is called to classify the text to be classified, giving the classification type of the text at the highest second classification level, for example "men's football news"; this mode classifies text according to the subtypes of "football news", namely "men's football news" and "women's football news". Next, the second classification mode corresponding to "men's football news" is called to classify the text to be classified, giving the classification type of the text at the second-highest second classification level, for example "men's beach football news"; this mode classifies text according to the subtypes of "men's football news". Then the second classification mode corresponding to "men's beach football news" is called to classify the text to be classified, giving the classification type of the text at the third-highest second classification level; this mode classifies text according to the subtypes of "men's beach football news". The process is repeated until the preset lowest classification level is reached.
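The serial, level-by-level calling described above can be sketched as follows. This is a minimal sketch under stated assumptions: the `SECOND_MODES` table and the function `classify_serially` are hypothetical names, the lambdas are stand-ins for trained classifiers, the table is truncated to two entries for brevity, and a classification type with no entry in the table is treated as lying at the preset lowest classification level.

```python
# Hypothetical stand-ins for trained second classification modes. Each key is
# a classification type; each value picks one subtype of that type.
SECOND_MODES = {
    "football news": lambda text: "men's football news",
    "men's football news": lambda text: "men's beach football news",
}

def classify_serially(text, intermediate_type, modes):
    """Call second classification modes level by level, starting with the
    mode of the intermediate type, and return the path of types visited."""
    path = [intermediate_type]
    # Assumption: a type without an entry in `modes` is at the lowest level.
    while path[-1] in modes:
        path.append(modes[path[-1]](text))
    return path

path = classify_serially("some text", "football news", SECOND_MODES)
print(path)
```

Each iteration uses the classification type produced by the previous call to look up the next mode, so only one second classification mode is called per level.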

The second classification mode may be a classifier or a classification model, for example a Bayesian classifier or a BERT classification model.

In the scheme provided by the embodiment of the invention, the first classification modes corresponding to the first classification levels are called in parallel to classify the text to be classified, giving the candidate classification types of the text at each first classification level. The hierarchical relationship of the candidate classification types between the first classification levels is then used to obtain an intermediate classification type for classifying the text. Taking the second classification mode corresponding to the intermediate classification type as the initial classification mode, the second classification modes corresponding to the second classification levels are called serially, level by level, along the hierarchical relationship of the classification types between the second classification levels, until the preset lowest classification level is reached, giving the classification result of the text to be classified. Because the first classification modes are called in parallel to obtain the intermediate classification type and the second classification modes are then called serially to obtain the classification result, the hierarchical relationship of the classification types between levels is followed, and multiple classification modes are used in combination, each contributing its own characteristics at a different position. Therefore, applying the text classification scheme provided by the embodiment of the invention can improve the accuracy of text classification.

In addition, each first classification level includes a plurality of classification types. Compared with providing one classification mode for each classification type in a first classification level, having each first classification level correspond to a single first classification mode, as in the embodiment of the invention, can greatly reduce the number of classification modes involved in text classification.

For example, assume that there are two first classification levels and two second classification levels, the two first classification levels containing 20 and 100 classification types respectively, and the two second classification levels each containing 200 classification types. In this case, when the scheme provided by the embodiment of the invention is applied to text classification, the number of classification modes required is 1+1+200=202. If instead each classification type in each first classification level corresponded to its own classification mode, the number of classification modes required would be 1+100+200=301. It can be seen that 202 is considerably smaller than 301.

In an embodiment of the present invention, when the first classification modes and the second classification modes are classification models, the models may be obtained in advance by supervised training, using training methods known in the art, which are not described in detail here.

In one embodiment of the present invention, referring to fig. 3, a flow chart of a second text classification method is provided. Compared with the embodiment shown in fig. 2, in this embodiment step S202 above, which obtains, according to the candidate classification types, an intermediate classification type for classifying the text to be classified along the hierarchical relationship of the candidate classification types between the first classification levels, includes the following steps S202A-S202D.

Step S202A: Select classification types from the candidate classification types along the hierarchical relationship of the candidate classification types between the first classification levels to obtain a plurality of type groups.

Wherein, the number of classification types belonging to the same classification hierarchy in each type group is 1, and classification types belonging to adjacent classification hierarchies satisfy: the class types belonging to the low class hierarchy are subtypes of class types belonging to the high hierarchy.

The text to be classified may have one or more candidate classification types at a first classification level. One classification type is chosen from the candidate classification types of each first classification level, and the chosen classification types are combined, giving a plurality of classification type combinations. A classification type combination whose classification types conform to the hierarchical relationship between the first classification levels is called a type group.

For example, as shown in the schematic diagram of fig. 1, the candidate classification type of the text to be classified at the highest first classification level is "sports news"; the candidate classification types at the second-highest first classification level are "track and field news" and "ball news"; and the candidate classification types at the third-highest first classification level are "sprint news" and "basketball news". Choosing one classification type from the candidates of each first classification level and combining the chosen types gives four classification type combinations: the first is "sports news, track and field news, sprint news"; the second is "sports news, ball news, sprint news"; the third is "sports news, track and field news, basketball news"; and the fourth is "sports news, ball news, basketball news". Since only the first and the fourth classification type combinations satisfy the hierarchical relationship of the candidate classification types between the first classification levels, the first classification type combination is a type group and the fourth classification type combination is a type group.
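The enumeration in this example can be sketched with `itertools.product`, which builds every combination containing one candidate per level; the combinations are then filtered by the subtype relation. The `PARENT` table, the function name `type_groups` and the tuple representation of a type group are assumptions made for illustration only.

```python
from itertools import product

# Hypothetical parent table for the part of fig. 1 used in the example.
PARENT = {
    "track and field news": "sports news",
    "ball news": "sports news",
    "sprint news": "track and field news",
    "basketball news": "ball news",
}

def type_groups(candidates_per_level, parent):
    """Return every combination with one candidate per level in which each
    type is a subtype of the type chosen at the level directly above."""
    groups = []
    for combo in product(*candidates_per_level):
        if all(parent.get(low) == high for high, low in zip(combo, combo[1:])):
            groups.append(combo)
    return groups

candidates = [
    ["sports news"],
    ["track and field news", "ball news"],
    ["sprint news", "basketball news"],
]
groups = type_groups(candidates, PARENT)
for group in groups:
    print(group)
```

Of the four combinations generated, only the two that respect the hierarchy are kept as type groups.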

Step S202B: Calculate, according to the probability that the classification type of the text to be classified is each candidate classification type, the global probability that the classification types of the text to be classified are the candidate classification types included in each type group.

In one case, for each type group, the preset weight value of each classification level is used as the weight of the candidate classification type belonging to that classification level in the type group, and a weighted calculation is performed on the probabilities that the classification type of the text to be classified is each candidate classification type in the type group, to obtain the global probability.

The preset weight value of the classification level may be set according to the importance degree of the classification level in all classification levels. When the preset weight values of the two classification levels are the same, the importance degrees of the two classification levels can be considered to be the same.

The preset weight value of a classification level may be 0.2, 0.5, 0.7, or another value. The sum of the weights of the candidate classification types included in one type group may be equal to 1.
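One way to read the weighted calculation is as a weighted sum over the levels of a type group. The sketch below assumes that reading; the probabilities, the level weights (chosen to sum to 1) and the function name `global_probability` are all hypothetical values and names introduced for illustration.

```python
def global_probability(group, probabilities, level_weights):
    """Weighted sum of the per-type probabilities of one type group.
    group lists one type per level, highest level first;
    level_weights lists the preset weight of each level in the same order."""
    return sum(w * probabilities[t] for w, t in zip(level_weights, group))

# Hypothetical classifier outputs and preset level weights.
probabilities = {"sports news": 0.9, "ball news": 0.6, "basketball news": 0.7}
level_weights = [0.2, 0.3, 0.5]
group = ("sports news", "ball news", "basketball news")

score = global_probability(group, probabilities, level_weights)
print(round(score, 2))
```

Here the score is 0.2 x 0.9 + 0.3 x 0.6 + 0.5 x 0.7, so a level with a larger preset weight contributes more to the global probability.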

Step S202C: Select candidate type groups of the text to be classified from the plurality of type groups in descending order of global probability.

In one case, the type group with the highest global probability may be selected from the plurality of type groups as the candidate type group of the text to be classified.

In another case, a plurality of type groups whose global probability satisfies a preset selection condition may be selected as candidate type groups of the text to be classified. For example, the preset selection condition may be: selecting the type groups whose global probability is greater than a preset value; it may also be: selecting a preset number of type groups with the highest global probability; and so on.
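The three selection conditions mentioned above can be sketched as small functions over a mapping from type group to global probability. The function names, the threshold and the scores are hypothetical and only illustrate the selection logic.

```python
def select_highest(scores):
    """The single type group with the highest global probability."""
    return [max(scores, key=scores.get)]

def select_above(scores, threshold):
    """All type groups whose global probability exceeds a preset value."""
    return [group for group, p in scores.items() if p > threshold]

def select_top_k(scores, k):
    """A preset number of type groups with the highest global probability."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical global probabilities of two type groups.
scores = {
    ("sports news", "track and field news", "sprint news"): 0.41,
    ("sports news", "ball news", "basketball news"): 0.71,
}
print(select_highest(scores))
print(select_above(scores, 0.4))
print(select_top_k(scores, 1))
```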

Step S202D: Determine the candidate classification type with the lowest classification level in the selected candidate type group as the intermediate classification type for classifying the text to be classified along the hierarchical relationship of the candidate classification types between the first classification levels.

The candidate type group is selected from the plurality of type groups, and each type group contains exactly one classification type from each classification level. The classification types included in the candidate type group therefore belong to different classification levels, and the classification type with the lowest level among them is the basis for continuing the classification. For this reason, the classification type with the lowest level is determined as the intermediate classification type.

If multiple type groups are selected as candidate type groups of the text to be classified in step S202C, multiple intermediate classification types for classifying the text to be classified are obtained. After the second classification modes corresponding to the second classification levels are called serially for each of them, multiple classification results of the text to be classified are obtained, and the final classification result of the text to be classified can be determined according to the confidence of each classification result.
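A minimal sketch of handling several candidate type groups follows. The text does not specify how the confidence is computed, so the sketch assumes that the serial stage returns a (result, confidence) pair and simply keeps the result with the highest confidence; `final_result`, `fake_serial_stage` and the result names are hypothetical.

```python
def final_result(candidate_groups, run_second_modes):
    """Run the serial stage once per candidate type group and keep the
    classification result with the highest confidence."""
    results = []
    for group in candidate_groups:
        intermediate = group[-1]  # lowest-level type of the group
        results.append(run_second_modes(intermediate))  # (result, confidence)
    best_result, _ = max(results, key=lambda pair: pair[1])
    return best_result

def fake_serial_stage(intermediate):
    # Hypothetical stand-in for the serially called second classification modes.
    table = {
        "sprint news": ("subtype of sprint news", 0.55),
        "basketball news": ("subtype of basketball news", 0.80),
    }
    return table[intermediate]

groups = [
    ("sports news", "track and field news", "sprint news"),
    ("sports news", "ball news", "basketball news"),
]
print(final_result(groups, fake_serial_stage))
```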

In the above scheme provided by the embodiment of the invention, the probability that the classification type of the text to be classified is each candidate classification type is used to obtain the global probability that the classification types of the text to be classified are the candidate classification types included in each type group. The global probability serves as the basis for selecting the candidate type group of the text to be classified, and the candidate classification type with the lowest classification level in the selected candidate type group is then determined as the intermediate classification type for classifying the text to be classified. The probability that the classification type of the text to be classified is a given candidate classification type directly reflects how likely the text is to belong to that type. The global probability derived from these probabilities therefore reflects, on the premise that the hierarchical relationship of the candidate classification types between the first classification levels is respected, how likely the text is to belong to the candidate classification types included in each type group. Selecting the candidate type group according to the global probability, and from it the intermediate classification type, helps ensure the accuracy of the intermediate classification type obtained.

In an embodiment of the present invention, referring to fig. 4, a flow chart of a third text classification method is provided. Compared with the embodiment shown in fig. 2, this embodiment further includes the following step S204 before step S201, in which the first classification modes corresponding to the first classification levels are called in parallel to classify the text to be classified and obtain the candidate classification types of the text to be classified at each first classification level.

Step S204: Perform semantic analysis on the text to be classified to obtain a semantic vector representing the semantics of the text to be classified.

The purpose of semantic analysis is to determine the meaning that a sentence expresses. Performing semantic analysis on the text to be classified means analyzing the meaning that the text is intended to express, which in turn supports judging which classification type the text belongs to.

The semantic vector represents the semantics of the text to be classified: the content of the text is converted into a specific vector in a semantic space, so that semantics can be compared and analyzed quantitatively. In other words, the text to be classified can exist in the semantic space in the form of a semantic vector.

The semantic space is a medium for information transmission, and information transmission can be completed only when the transmission side and the receiving side have a common semantic space. Each text has its own inherent meaning that constitutes a semantic vector in semantic space.

Specifically, the semantic analysis may consist of preprocessing the text to be classified and then performing semantic analysis on the preprocessed text to obtain the semantic vector of the text to be classified.

Preprocessing the text to be classified may include extracting header information and performing font conversion, character conversion, filtering and the like on the text.

For example, the text may be converted from traditional Chinese characters to simplified Chinese characters, converted from full-width characters to half-width characters, and filtered to remove dirty (unwanted) characters.
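Two of these preprocessing operations can be sketched with Python's standard `unicodedata` module. This is an assumption-laden sketch: NFKC normalization is used here as one possible way to map full-width characters to half-width ones (it also applies other compatibility mappings), "dirty characters" is interpreted as control and format characters, and the function name `preprocess` is hypothetical. Conversion from traditional to simplified Chinese needs a mapping table or a third-party library and is left out.

```python
import unicodedata

def preprocess(text):
    """Convert full-width characters to half-width and drop control or
    format characters (an assumed reading of 'dirty characters')."""
    # NFKC maps full-width letters, digits, punctuation and the
    # ideographic space to their half-width equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Unicode general categories starting with "C" cover control, format,
    # surrogate, private-use and unassigned code points.
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )

print(preprocess("ＡＢＣ１２３\u200b\x00"))
```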

For example, the preprocessed text may be input character by character into a pre-trained BERT model to obtain a 768-dimensional semantic vector.

In this embodiment, step S201 above calls the first classification modes corresponding to the first classification levels in parallel to classify the text to be classified, so as to obtain candidate classification types of the text to be classified in the first classification levels, including the following step S201A.

Step S201A: and taking the semantic vector as input data of a first classification mode corresponding to each first classification level, and calling each first classification mode in parallel to classify the text to be classified to obtain candidate classification types of the text to be classified in each first classification level.

When the semantic vector is used as input data, the first classification mode corresponding to each first classification level classifies the semantic vector within the semantic space in which the vector lies, giving the candidate classification types of the text to be classified at each first classification level.
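Calling the first classification modes in parallel on one shared semantic vector can be sketched with a thread pool from the standard library. The two mode functions below are hypothetical stand-ins for trained classifiers that return candidate types with probabilities, and the three-element vector is a placeholder for the real semantic vector; a thread pool is only one possible way to run the calls concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the first classification modes, one per level.
def level_1_mode(vector):
    return {"sports news": 0.9}

def level_2_mode(vector):
    return {"track and field news": 0.3, "ball news": 0.6}

def classify_in_parallel(semantic_vector, first_modes):
    """Submit every first classification mode with the same semantic vector
    and collect the candidate types in level order."""
    with ThreadPoolExecutor(max_workers=max(1, len(first_modes))) as pool:
        futures = [pool.submit(mode, semantic_vector) for mode in first_modes]
        return [future.result() for future in futures]

vector = [0.1, 0.2, 0.3]  # placeholder for the semantic vector
candidates = classify_in_parallel(vector, [level_1_mode, level_2_mode])
print(candidates)
```

Because every mode receives the same vector and none depends on another mode's output, the calls can run independently, which is what allows the first stage to be parallel while the second stage must be serial.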

In this embodiment, the step S203 uses a second classification mode corresponding to the intermediate classification type as an initial classification mode, and serially calls the second classification modes corresponding to the second classification levels level by level along the level relation of the classification types between the second classification levels, so as to classify the text to be classified until reaching the preset lowest classification level, thereby obtaining the classification result of the text to be classified, including the following step S203A.

Step S203A: and taking a second classification mode corresponding to the intermediate classification type as an initial classification mode, taking a semantic vector as input data of the second classification mode corresponding to each second classification level along the hierarchical relation of each classification type among the second classification levels, and calling each second classification mode in a serial manner level by level to classify the text to be classified until the preset lowest classification level is reached, so as to obtain a classification result of the text to be classified.

As in step S201A, when the semantic vector is used as input data, the second classification mode corresponding to each second classification level classifies the semantic vector within the semantic space in which the vector lies, giving the classification type of the text to be classified at each second classification level.

In the scheme provided by the embodiment of the invention, semantic analysis is performed on the text to be classified to obtain a semantic vector representing the semantics of the text. Because assigning a text to a specific classification type is usually based on the meaning the text expresses, using the semantic vector as the input data of the first classification modes and the second classification modes allows the candidate classification types of the text to be classified at the first classification level corresponding to each first classification mode, and the classification type of the text to be classified at the second classification level corresponding to each second classification mode, to be obtained accurately, which improves the accuracy of the classification result of the text to be classified.

Corresponding to the text classification method, the embodiment of the invention also provides a text classification device.

Referring to fig. 5, there is provided a schematic structural view of a first text classification apparatus, the apparatus comprising:

a first text classification module 501, configured to call in parallel the first classification mode corresponding to each first classification level to classify the text to be classified, so as to obtain candidate classification types of the text to be classified at each first classification level, where a first classification mode is used to classify text according to all classification types included in the corresponding classification level;

an intermediate type obtaining module 502, configured to obtain, according to the candidate classification types, an intermediate classification type for classifying the text to be classified along the hierarchical relationship of the candidate classification types between the first classification levels;

a second text classification module 503, configured to take the second classification mode corresponding to the intermediate classification type as the initial classification mode, and call serially, level by level, the second classification mode corresponding to each second classification level along the hierarchical relationship of the classification types between the second classification levels to classify the text to be classified until the preset lowest classification level is reached, so as to obtain the classification result of the text to be classified, where one second classification mode is used to classify text according to all subtypes included in one classification type at the corresponding classification level, the first classification levels are higher than the second classification levels, and the second classification mode serially called for a second classification level is the mode that classifies text according to all subtypes included in the classification type obtained by classification at the previous level.

In the scheme provided by the embodiment of the invention, the first classification modes corresponding to the first classification levels are called in parallel to obtain the candidate classification types of the text to be classified at each first classification level. The hierarchical relationship of the candidate classification types between the first classification levels is then used to obtain an intermediate classification type for classifying the text. Taking the second classification mode corresponding to the intermediate classification type as the initial classification mode, the second classification modes corresponding to the second classification levels are called serially, level by level, until the preset lowest classification level is reached, giving the classification result of the text to be classified. Because the first classification modes are called in parallel to obtain the intermediate classification type and the second classification modes are then called serially to obtain the classification result, the hierarchical relationship of the classification types between levels is followed, and multiple classification modes are used in combination, each contributing its own characteristics at a different position. Therefore, applying the text classification scheme provided by the embodiment of the invention can improve the accuracy of text classification.
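The way the three modules cooperate can be sketched as a small class that chains them. The class name `TextClassificationDevice`, the callable interfaces and the lambda stand-ins are assumptions made for illustration; they are not the structure of the apparatus in fig. 5.

```python
class TextClassificationDevice:
    """Chains three callables standing in for modules 501, 502 and 503."""

    def __init__(self, first_module, intermediate_module, second_module):
        self.first_module = first_module                # role of module 501
        self.intermediate_module = intermediate_module  # role of module 502
        self.second_module = second_module              # role of module 503

    def classify(self, text):
        candidates = self.first_module(text)
        intermediate = self.intermediate_module(candidates)
        return self.second_module(text, intermediate)

# Hypothetical stand-ins with fixed outputs.
device = TextClassificationDevice(
    first_module=lambda text: [["ball news"], ["football news"]],
    intermediate_module=lambda candidates: candidates[-1][0],
    second_module=lambda text, intermediate: "men's " + intermediate,
)
result = device.classify("some text")
print(result)
```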

In an embodiment of the present invention, referring to fig. 6, a schematic structural diagram of a second text classification apparatus is provided, and in this embodiment, compared to the embodiment shown in fig. 5, the intermediate type obtaining module 502 includes:

a type group obtaining submodule 502A, configured to select classification types from the candidate classification types along the hierarchical relationship of the candidate classification types between the first classification levels to obtain a plurality of type groups, where the number of classification types belonging to the same classification level in each type group is 1, and classification types belonging to adjacent classification levels satisfy: the classification type belonging to the lower classification level is a subtype of the classification type belonging to the higher classification level.

a probability calculation submodule 502B, configured to calculate, according to the probability that the classification type of the text to be classified is each candidate classification type, the global probability that the classification types of the text to be classified are the candidate classification types included in each type group.

In one embodiment of the present invention, the probability calculation sub-module 502B is specifically configured to:

a type group selection submodule 502C, configured to select the candidate type groups of the text to be classified from the plurality of type groups in descending order of global probability.

In one embodiment of the present invention, the type group selection submodule 502C is specifically configured to:

a type determining submodule 502D, configured to determine the candidate classification type with the lowest classification level in the selected candidate type group as the intermediate classification type for classifying the text to be classified along the hierarchical relationship of the candidate classification types between the first classification levels.

In the above scheme provided by the embodiment of the invention, the probability that the classification type of the text to be classified is each candidate classification type is used to obtain the global probability that the classification types of the text to be classified are the candidate classification types included in each type group. The global probability serves as the basis for selecting the candidate type group of the text to be classified, and the candidate classification type with the lowest classification level in the selected candidate type group is then determined as the intermediate classification type for classifying the text to be classified. The probability that the classification type of the text to be classified is a given candidate classification type directly reflects how likely the text is to belong to that type. The global probability derived from these probabilities therefore reflects, on the premise that the hierarchical relationship of the candidate classification types between the first classification levels is respected, how likely the text is to belong to the candidate classification types included in each type group. Selecting the candidate type group according to the global probability, and from it the intermediate classification type, helps ensure the accuracy of the intermediate classification type obtained.

In one embodiment of the present invention, referring to fig. 7, there is provided a schematic structural diagram of a third text classification apparatus, the apparatus further including:

a semantic analysis module 504, configured to, before the first text classification module 501 calls in parallel the first classification mode corresponding to each first classification level to classify the text to be classified and obtain the candidate classification types of the text to be classified at each first classification level, perform semantic analysis on the text to be classified to obtain a semantic vector representing the semantics of the text to be classified.

In this embodiment, the first text classification module 501 is specifically configured to:

and taking the semantic vector as input data of a first classification mode corresponding to each first classification level, and calling each first classification mode in parallel to classify the text to be classified to obtain candidate classification types of the text to be classified in each first classification level.

In this embodiment, when serially calling, level by level, the second classification mode corresponding to each second classification level to classify the text to be classified, the second text classification module 503 is specifically configured to:

The embodiment of the present invention further provides an electronic device, as shown in fig. 8, including a processor 801, a communication interface 802, a memory 803, and a communication bus 804, where the processor 801, the communication interface 802, and the memory 803 communicate with one another through the communication bus 804;

a memory 803 for storing a computer program;

the processor 801, when executing the program stored in the memory 803, implements the following steps:

Other schemes for implementing text classification by the processor 801 executing the program stored in the memory 803 are the same as those mentioned in the foregoing method embodiment, and will not be repeated here.

The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is represented in the figure by a single thick line, but this does not mean that there is only one bus or only one type of bus.

The communication interface is used for communication between the electronic device and other devices.

The memory may include Random Access Memory (RAM) or Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The processor may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In yet another embodiment of the present invention, a computer readable storage medium is provided, in which a computer program is stored, which when executed by a processor, implements the text classification method according to any of the above embodiments.

In yet another embodiment of the present invention, a computer program product containing instructions that, when run on a computer, cause the computer to perform the text classification method of any of the above embodiments is also provided.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example in a wired manner (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or in a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)).

It is noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The embodiments in this specification are described in a related manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus, electronic device, computer-readable storage medium, and computer program product embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant points, refer to the corresponding description of the method embodiments.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims

1. A method of text classification, the method comprising:

and taking the second classification mode corresponding to the intermediate classification type as an initial classification mode, and serially calling, level by level along the hierarchical relationship of the classification types between the second classification levels, the second classification mode corresponding to each second classification level to classify the text to be classified until a preset lowest classification level is reached, so as to obtain a classification result of the text to be classified, wherein one second classification mode is used for classifying text according to all subtypes included in one classification type in the corresponding classification level, the first classification levels are higher than the second classification levels, and the second classification mode serially called for a second classification level is used for classifying text according to all subtypes included in the classification type obtained by classification at the previous level;

the obtaining, according to each candidate classification type, of an intermediate classification type for classifying the text to be classified along the hierarchical relationship of the candidate classification types between the first classification levels comprises:

determining the candidate classification type with the lowest classification level in the selected candidate type group as the intermediate classification type for classifying the text to be classified along the hierarchical relationship of the candidate classification types between the first classification levels;

the first classification levels are consecutive classification levels determined, in order from high to low along the classification levels, according to the number of classification types included in each classification level.

2. The method according to claim 1, wherein calculating the global probability that the classification type of the text to be classified is the candidate classification type included in the various types of groups according to the probability that the classification type of the text to be classified is each candidate classification type comprises:

3. The method of claim 1, wherein selecting the candidate type group of the text to be classified from a plurality of type groups in order of global probability from high to low comprises:

4. A method according to any one of claims 1 to 3, wherein,

5. A method according to any one of claims 1-3, characterized in that the method further comprises:

6. A text classification device, the device comprising:

the second text classification module is configured to take the second classification mode corresponding to the intermediate classification type as an initial classification mode, and to serially call, level by level along the hierarchical relationship of the classification types between the second classification levels, the second classification mode corresponding to each second classification level to classify the text to be classified until a preset lowest classification level is reached, so as to obtain a classification result of the text to be classified, wherein one second classification mode is used for classifying text according to all subtypes included in one classification type in the corresponding classification level, the first classification levels are higher than the second classification levels, and the second classification mode serially called for a second classification level is used for classifying text according to all subtypes included in the classification type obtained by classification at the previous level;

The intermediate type obtaining module includes:

the type determining submodule is used for determining a candidate classification type with the lowest classification level in the selected candidate type group as an intermediate classification type for classifying the text to be classified along the level relation of each candidate classification type between the first classification levels;

7. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

a processor for carrying out the method steps of any one of claims 1-5 when executing a program stored on a memory.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-5.