WO2022142025A1 - Text classification method and apparatus, and terminal device and storage medium - Google Patents


Info

Publication number
WO2022142025A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
node
grouping
target
classified
Prior art date
Application number
PCT/CN2021/090954
Other languages
French (fr)
Chinese (zh)
Inventor
马龙
梁宸
周元笙
蒋佳惟
陈思姣
李炫�
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022142025A1 publication Critical patent/WO2022142025A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Definitions

  • the present application belongs to the technical field of artificial intelligence, and in particular, relates to a text classification method, apparatus, terminal device and storage medium.
  • text classification is an important aspect in the field of artificial intelligence.
  • text classification aims to automatically classify unlabeled documents into a predetermined set of categories to alleviate information clutter.
  • the existing fast text classifier models or convolutional neural network models are usually used to classify texts.
  • the inventor realizes that classifying text by training a neural network model takes a lot of time to train the model, and the classification accuracy is low when classifying text at a finer granularity.
  • one of the purposes of the embodiments of the present application is to provide a text classification method, apparatus, terminal device and storage medium, aiming to solve the problem in the prior art of low accuracy of text classification when classifying text through a neural network model.
  • an embodiment of the present application provides a text classification method, including:
  • the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and two text nodes are connected by a node line;
  • the text to be classified in the node structure diagram of the target text is classified, and a plurality of target groups of the text to be classified are obtained.
  • an embodiment of the present application provides a text classification device, including:
  • an acquisition module configured to acquire a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and two text nodes are connected by a node line;
  • a computing module, configured to extract the text features of each of the texts to be classified, and to calculate the similarity distance between the two text nodes according to the text features;
  • a filtering module configured to filter the node lines in the text node structure diagram according to the similarity distance to obtain a target text node structure diagram including each text node and the remaining node lines;
  • the first classification module is configured to classify the text to be classified in the node structure diagram of the target text based on the community discovery algorithm and the remaining node lines, and obtain a plurality of target groups of the text to be classified.
  • a third aspect of the embodiments of the present application provides a terminal device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
  • the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and two text nodes are connected by a node line;
  • the text to be classified in the node structure diagram of the target text is classified, and a plurality of target groups of the text to be classified are obtained.
  • a fourth aspect of the embodiments of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement:
  • the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and two text nodes are connected by a node line;
  • the text to be classified in the node structure diagram of the target text is classified, and a plurality of target groups of the text to be classified are obtained.
  • the embodiment of the present application has the beneficial effect of improving the accuracy of classifying multiple texts to be classified.
  • Fig. 1 is the realization flow chart of a kind of text classification method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of an implementation manner of S103 of a text classification method provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of an implementation manner of S104 of a text classification method provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of an implementation manner of S1043 of a text classification method provided by an embodiment of the present application.
  • Fig. 5 is the realization flow chart of a kind of text classification method provided by another embodiment of the present application.
  • FIG. 6 is a schematic diagram of an implementation manner of S104B of a text classification method provided by an embodiment of the present application.
  • FIG. 7 is a structural block diagram of a text classification device provided by an embodiment of the present application.
  • FIG. 8 is a structural block diagram of a terminal device provided by an embodiment of the present application.
  • the text classification method provided by the embodiments of the present application can be applied to terminal devices such as tablet computers, notebook computers, ultra-mobile personal computers (UMPCs), netbooks, etc.
  • terminal devices such as tablet computers, notebook computers, ultra-mobile personal computers (UMPCs), netbooks, etc.
  • UMPCs ultra-mobile personal computers
  • the embodiments of the present application do not impose any limitation on the specific type of the terminal device.
  • the above-mentioned multiple texts to be classified may be texts such as papers, journals, or magazines, respectively, or may be a sentence or a paragraph, which is not limited.
  • each text to be classified can be regarded as a text node respectively, and the multiple text nodes can be connected by node lines to generate a text node structure diagram.
  • the text to be classified may be pre-stored in a designated storage path of the terminal device, and then acquired by the terminal device; it may also be multiple texts to be classified transmitted by the user in real time, which is not limited.
  • the number of the above-mentioned node lines is related to the number of text nodes.
  • when the number of text nodes is 2, the number of node lines is 1; when the number of text nodes is 3, the number of node lines is 3; when the number of text nodes is 4, the number of node lines is 6; and so on.
  • in general, y = (n-1)*n/2, where n (n is an integer, and n ≥ 2) is the number of text nodes, and y is the number of node lines.
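As a quick illustration (not part of the disclosure; the function name is hypothetical), the relation between text nodes and node lines can be checked in Python:

```python
def node_line_count(n: int) -> int:
    """Number of node lines connecting n text nodes pairwise: y = n*(n-1)/2."""
    if n < 2:
        raise ValueError("at least two text nodes are required")
    return n * (n - 1) // 2

# Matches the examples in the text: 2 nodes -> 1 line, 3 -> 3, 4 -> 6.
```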
  • the above similarity distance can be understood as the degree of similarity between two texts to be classified.
  • the terminal device may extract text features of each text to be classified, and calculate the similarity between the two text features as the similarity distance between the two texts to be classified.
  • the above-mentioned extraction of the text features of the text to be classified may proceed as follows: the text to be classified is first segmented to obtain multiple text segments; afterwards, the word consistent with each text segment is found in a preset word vector library and the sequence number of the text segment is determined; finally, the word feature of the text segment is generated according to the sequence number.
  • determining the words consistent with the text segments in the preset word vector library may be implemented according to a forward matching algorithm. Specifically, if the longest word in the preset word vector library is 5 characters, the first to fifth characters of the text to be classified may be taken as an initial segment, and it is determined whether this segment exists in the word vector library.
  • if it does, the initial segment is taken as a target segment, and matching then proceeds to the subsequent characters. If the initial segment does not exist in the word vector library, the character length is reduced from right to left: the first to fourth characters of the text to be classified are taken as the initial segment, and it is again determined whether the initial segment exists in the word vector library. In this way, word segmentation is performed on the multiple texts to be classified to obtain the text segments.
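The forward matching procedure described above can be sketched as follows. This is a minimal illustration under the assumption that the word vector library can be queried as a set of known words; the function name and the toy lexicon are hypothetical:

```python
def forward_max_match(text, lexicon, max_len=5):
    """Segment `text` greedily: try the longest candidate (up to max_len
    characters) starting at the current position, shrinking from right to
    left until a word in the lexicon is found; a single character always
    matches as a fallback segment."""
    segments = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                segments.append(candidate)
                i += length
                break
    return segments
```

With the toy lexicon {"ab", "abc", "d"}, the text "abcd" is segmented as ["abc", "d"]: the longest match is tried first, and unmatched single characters fall through as their own segments.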
  • the word feature of each text segment can be generated according to its sequence number in the word vector library. Specifically, the word feature dimension of each text segment is preset; then, the one-dimensional value of each text segment (the number corresponding to its sequence number) is mapped into a multi-dimensional continuous vector space. Exemplarily, for a text segment with sequence number "5" and a feature vector dimension of 10, the word feature of the text segment can be [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]. In this way, the word feature of each text segment is obtained, and the text features of the text to be classified can be represented by the word features of its multiple text segments. After that, the similarity distance between two text nodes can be calculated with reference to the following steps, specifically:
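The mapping from a sequence number to a one-hot word feature can be sketched as follows (a minimal illustration; the function name is hypothetical):

```python
def word_feature(sequence_number, dim):
    """Map a text segment's sequence number in the word vector library to a
    one-hot word feature of length `dim` (sequence numbers are 1-indexed)."""
    if not 1 <= sequence_number <= dim:
        raise ValueError("sequence number outside the preset feature dimension")
    feature = [0] * dim
    feature[sequence_number - 1] = 1
    return feature

# Example from the text: sequence number "5" with dimension 10.
```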
  • the similarity distance includes the Euclidean distance; in S102, the text features of each of the texts to be classified are extracted, and the similarity distances between the two text nodes are respectively calculated according to the text features, specifically including the following sub-steps:
  • the Euclidean distance is calculated as p(a, b) = √((a1 - b1)² + (a2 - b2)² + … + (an - bn)²), where:
  • n is the number of text segments;
  • ai represents the word vector of the i-th text segment in the a-th text to be classified;
  • bi represents the word vector of the i-th text segment in the b-th text to be classified;
  • p(a, b) represents the similarity distance between text a to be classified and text b to be classified.
  • when the two texts to be classified contain different numbers of text segments, n may be determined as m1, the larger of the two numbers of text segments.
  • for the text with fewer segments, the missing word features are all represented by "0".
  • the above similarity distance may also be the cosine distance between two text nodes calculated according to the text features, which is not limited.
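Both similarity distances can be sketched in plain Python. This is an illustration only, using the zero-padding of unequal-length feature vectors described above; the function names are hypothetical:

```python
import math

def euclidean_distance(a, b):
    """Similarity distance p(a, b) = sqrt(sum_i (a_i - b_i)^2) between the
    word-feature vectors of two texts to be classified, padding the shorter
    vector with zeros as described in the text."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """Alternative similarity distance mentioned in the text: 1 - cosine
    similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm
```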
  • filtering the above node lines may be understood as deleting the node lines whose similarity distance is less than or equal to a preset threshold.
  • in this way, a target text node structure diagram composed of the text nodes and the remaining node lines is obtained. It can be understood that the similarity distance between two text nodes is the length of the node line between them.
  • a preset threshold may be pre-stored in the terminal device.
  • when the similarity distance is greater than the preset threshold, the node line between the two texts to be classified is retained; when the similarity distance is less than or equal to the preset threshold, the node line between the two texts to be classified is deleted. It can be understood that, in the target text node structure diagram obtained at this time, any two texts to be classified that are still connected by a node line have a certain similarity; when they are clustered and grouped, the probability that the two texts to be classified belong to the same category grouping can be considered greater.
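The threshold filtering of S103 can be sketched as follows, assuming the node lines are stored as a mapping from text-node pairs to similarity distances (a hypothetical representation, following the document's rule of retaining only distances greater than the threshold):

```python
def filter_node_lines(node_lines, threshold):
    """Keep only node lines whose similarity distance is greater than the
    preset threshold; lines at or below the threshold are deleted, as
    described in the text. `node_lines` maps (node_a, node_b) -> distance."""
    return {pair: d for pair, d in node_lines.items() if d > threshold}
```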
  • the above community discovery algorithms include, but are not limited to, the Leiden algorithm and the Louvain algorithm.
  • community discovery algorithms are modularity-based classification algorithms that can be used to group multiple objects (texts to be classified) into different categories. Here, the modularity can be regarded as a metric for evaluating the effect of a grouping after the multiple objects are grouped in different ways. It can be understood that the multiple texts to be classified are classified such that each category contains at least one text to be classified. Exemplarily, if the target grouping contains 3 categories after classifying 5 texts to be classified, the number of texts to be classified contained in each category grouping may be 1 or 2, which is not limited.
  • the modularity Qi of each category of groupings in the target grouping needs to be calculated.
  • the calculation formula is as follows: Qi = Σin/(2m) - (Σtot/(2m))², where Qi is the modularity of the grouping of the i-th category, m is the sum of the similarity distances of all node lines in the diagram, Σin is the sum of the similarity distances of all node lines within the grouping of the i-th category, and Σtot is the sum of the similarity distances of all node lines connected to the text nodes within the grouping of the i-th category.
  • the modularity corresponding to the grouping of each category can be obtained after each classification. Furthermore, by adding up multiple modularities, the overall modularity Q of the target grouping can be obtained.
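Reading the formula as the standard Louvain-style modularity Qi = Σin/(2m) - (Σtot/(2m))², the per-category and overall modularity can be sketched as follows (function names are hypothetical):

```python
def grouping_modularity(sigma_in, sigma_tot, m):
    """Modularity Qi of one category grouping:
    Qi = sigma_in/(2m) - (sigma_tot/(2m))**2, where sigma_in is the weight
    of node lines inside the grouping, sigma_tot the weight of node lines
    connected to its text nodes, and m the total node-line weight."""
    return sigma_in / (2 * m) - (sigma_tot / (2 * m)) ** 2

def overall_modularity(groupings, m):
    """Overall modularity Q of a target grouping: the sum of the
    per-category modularities, as described in the text."""
    return sum(grouping_modularity(s_in, s_tot, m) for s_in, s_tot in groupings)
```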
  • in summary, a text node structure diagram is generated by taking each text to be classified as a text node; the similarity distance between each pair of texts to be classified is then calculated, and the text node structure diagram is preliminarily filtered to obtain the target text node structure diagram, realizing a preliminary grouping of the texts to be classified. Then, according to the community discovery algorithm and the remaining node lines, the texts to be classified in the target text node structure diagram are classified, and the multiple texts to be classified are grouped again to obtain the target grouping. In this way, the accuracy of classifying the multiple texts to be classified is improved.
  • the node lines in the text node structure diagram are filtered to obtain a target text node structure diagram including each text node and the remaining node lines, which specifically includes the following sub-steps S1031-S1034, described in detail as follows:
  • the above-mentioned preset threshold may be set by the user according to the actual situation, or may be a fixed value preset in the terminal device, which is not limited. It has been explained in S103 that when the similarity distance is less than or equal to the preset threshold, the node line corresponding to the similarity distance is deleted. And, when the similarity distance is greater than the preset threshold, the node line corresponding to the similarity distance is retained. In this way, the structure diagram of the target text node is generated, which will not be described again.
  • the text to be classified in the node structure diagram of the target text is classified, and a plurality of target groups of the text to be classified are obtained.
  • S1041-S1046 which are detailed as follows:
  • first, the text node corresponding to each text to be classified is taken as a separate node grouping. Then, the following steps are performed on each node grouping in turn to obtain the multiple target groupings of the texts to be classified.
  • each grouping module includes the node grouping formed by fusing the any node grouping with one of the adjacent node groupings, and the remaining individual node groupings.
  • any node grouping and any adjacent node grouping can be fused each time to obtain a fused node grouping and other individual node groupings.
  • exemplarily, node grouping i can be fused with the adjacent node grouping j, that is, the texts to be classified in the two node groupings are regarded as one category grouping.
  • in the formed grouping module, there are k-1 node groupings. It can be understood that, among the k-1 groupings, there is one node grouping formed by the fusion and k-2 other individual node groupings. The other node groupings also need to be processed as in S1042 to obtain their corresponding grouping modules.
  • the grouping modularity of the grouping module is obtained by adding the modularity of the node grouping formed by the fusion and the modularities of the other individual node groupings. That is, for a grouping module containing the above k-1 node groupings, the modularity corresponding to each of the k-1 node groupings is calculated, and the results are added to obtain the overall modularity of the grouping module (the grouping modularity).
  • the formula for calculating Qi in the above S104 may be referred to.
  • Q represents the overall modularity of the grouping module (grouping modularity)
  • Qi represents the modularity of the grouping of the i-th category (among the k-1 node groupings included in the grouping module, the modularity of the i-th node grouping), where i is an integer and i ≤ k-1.
  • the maximum value among the various grouping modularities can be determined according to their values, and the grouping module corresponding to the maximum grouping modularity is taken as the current target grouping module.
  • after determining the current target grouping module, it is necessary to determine whether it is the optimal grouping of the multiple texts to be classified. Based on this, the grouping modularity of the current target grouping module is compared with the grouping modularity of the previous target grouping module to determine whether the current target grouping is the optimal grouping. Specifically, if the grouping modularity of the current target grouping module is less than or equal to that of the previous target grouping module, the previous target grouping module is the optimal grouping (target grouping) of the multiple texts to be classified.
  • if the grouping modularity of the current target grouping module is greater than that of the previous target grouping module, the previous target grouping is not the optimal grouping. Afterwards, in order to determine whether the current target grouping is the optimal grouping, S1042 to S1045 are repeated for each node grouping in the current target grouping module until the target grouping of each text to be classified is obtained.
  • S1042 is then directed to each node grouping in the current target grouping module. That is, if the current target grouping module includes the node grouping formed by fusing node grouping i with node grouping j, together with the other individual node groupings, then when S1042 is executed, the number of node groupings is k-1, and the node grouping formed by the fusion of node grouping i and node grouping j is treated as a single node grouping.
  • if the grouping modularity Qk-1 of the current target grouping module is less than or equal to the grouping modularity Qk of the previous target grouping module, the classification effect of the current target grouping module does not exceed that of the previous target grouping module; therefore, the previous target grouping module is the target grouping with the best effect. If Qk-1 is greater than Qk, the classification effect of the current target grouping module is higher than that of the previous target grouping module; afterwards, S1042 to S1045 are repeated for the k-1 node groupings of the current target grouping module until the target grouping with the best classification effect is obtained.
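The iterative procedure of S1042-S1045 can be sketched as a greedy merge loop. This is a simplified illustration rather than the exact claimed algorithm: it tries every pair of node groupings (not only adjacent ones) and uses a Louvain-style modularity in which each node line's weight is counted once; all names are hypothetical:

```python
def partition_modularity(partition, edges):
    """Grouping modularity of a partition: sum over groups of
    Qi = s_in/(2m) - (s_tot/(2m))**2, on weighted node lines."""
    m = sum(edges.values())
    q = 0.0
    for group in partition:
        s_in = sum(w for (u, v), w in edges.items() if u in group and v in group)
        s_tot = sum(w for (u, v), w in edges.items() if u in group or v in group)
        q += s_in / (2 * m) - (s_tot / (2 * m)) ** 2
    return q

def greedy_merge(nodes, edges):
    """Start from singleton node groupings; repeatedly apply the fusion that
    yields the highest grouping modularity, and stop when no fusion improves
    on the current value (the previous grouping is then the target grouping)."""
    partition = [frozenset([n]) for n in nodes]
    best_q = partition_modularity(partition, edges)
    while len(partition) > 1:
        candidates = []
        for i in range(len(partition)):
            for j in range(i + 1, len(partition)):
                fused = partition[i] | partition[j]
                rest = [g for k, g in enumerate(partition) if k not in (i, j)]
                candidates.append(
                    (partition_modularity(rest + [fused], edges), rest + [fused]))
        q, merged = max(candidates, key=lambda c: c[0])
        if q <= best_q:  # no improvement: keep the previous grouping
            break
        best_q, partition = q, merged
    return partition
```

On a toy graph with two well-separated pairs of text nodes, the loop stops once fusing the two pairs any further would lower the grouping modularity.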
  • the grouping modularity of the grouping module is obtained by adding the modularity of the node grouping formed by the fusion in the grouping module and the modularities of the remaining individual node groupings; in S1043, respectively calculating the grouping modularity of each grouping module specifically includes the following sub-steps S10431-S10434, detailed as follows:
  • the current node grouping is the grouping of the modularity to be calculated currently in the grouping module.
  • the modularity of the current node grouping is calculated according to the first quantity and the second quantity.
  • the modularity calculation formula in the above S104 which will not be described again.
  • the above-mentioned current node grouping is the grouping whose modularity is currently being calculated in the grouping module. It can be understood that, if the current node grouping has only one text node, the first number of node lines included in the current node grouping is 0. The above-mentioned second number is the number of node lines connecting the remaining node groupings to the current node grouping. It can be understood that if the current node grouping has multiple text nodes (a grouping obtained by fusing multiple node groupings), the remaining node groupings are connected to any text node in the current node grouping, that is, there are connections between the remaining node groupings and the current node grouping.
  • the text to be classified in the node structure diagram of the target text is classified, and after a plurality of target groups of the text to be classified are obtained, It also includes the following steps S104A-S104B, which are detailed as follows:
  • S104A Determine a classification range corresponding to each node group in the target group, where the classification range includes multiple target classifications.
  • the number of node groups contained in the target group will be smaller than the initial number of node groups. That is, each node group in the target group now contains multiple text nodes.
  • the node grouping formed by the fusion of the multiple text nodes can be considered to belong to a classification range.
  • the above classification scope includes, but is not limited to, the scope of physical classification, the scope of chemical classification, the scope of computer classification and other subject classifications. It can also be the next-level classification range of a subject classification range.
  • the classification range may be a mechanical classification range, an electrical classification range, and an optical classification range. It can be understood that, for each classification range, it can also contain multiple specific target classifications.
  • the target classification included in the mechanical classification range may be quantum mechanical target classification, Newtonian mechanical target classification, and the like.
  • multiple texts to be classified can be in the same classification range but belong to different target classifications. It can also be texts that belong to multiple classification ranges and have different target classifications. Among them, the classification range corresponding to each node group in the target group is determined, which can be marked manually.
  • a model for determining the classification range may also be pre-trained in the terminal device, and the model is only used to determine the classification range in each node group according to the text to be classified in each node group in the target group. It can be understood that the text similarity (similar distance between text nodes) of texts to be classified in different classification ranges across disciplines is usually very small. Therefore, multiple texts to be classified between the same node grouping are usually texts of one subject (classification scope). Based on this, the model only needs to predict the classification range according to any text to be classified in each node group, and then the classification range corresponding to the corresponding node group can be determined.
  • S104B For any node grouping, perform secondary classification on the multiple text nodes included in the node grouping according to the preset keywords corresponding to the multiple target classifications in the classification range, to obtain the final target grouping of the multiple texts to be classified.
  • each category range includes multiple target categories
  • the terminal device may pre-store the keywords corresponding to each target category.
  • the preset keywords of each target classification corresponding to the node grouping are compared with each text to be classified.
  • in this way, the final target grouping of the multiple texts to be classified is obtained.
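The keyword-based secondary classification can be sketched as follows; the keyword lists are hypothetical examples, and texts matching no preset keyword are set aside for the fallback of S104B1-S104B3:

```python
def secondary_classify(texts, target_keywords):
    """Assign each text to the first target classification whose preset
    keywords it contains; texts matching no keyword are returned separately
    for further handling. `target_keywords` maps classification -> keywords."""
    grouping = {target: [] for target in target_keywords}
    unmatched = []
    for text in texts:
        for target, keywords in target_keywords.items():
            if any(k in text for k in keywords):
                grouping[target].append(text)
                break
        else:
            unmatched.append(text)
    return grouping, unmatched
```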
  • the above steps S101 to S104 can be common to different business scenarios (different text classification business scenarios).
  • a system implementing the above steps S101 to S104 can also be reused for processing, which can greatly reduce the system resources required for text classification by different services, and there is no need to specially design different system resources for each text classification business scenario.
  • in S104B, performing secondary classification on the multiple text nodes included in the node grouping according to the preset keywords corresponding to the multiple target classifications in the classification range to obtain the final target grouping of the multiple texts to be classified further includes the following sub-steps S104B1-S104B3, described in detail as follows:
  • S104B1 If the text to be classified corresponding to any text node does not contain the preset keywords of the multiple target classifications, obtain the text segments in the text to be classified.
  • the keywords corresponding to each target classification category usually have limitations.
  • the preset keywords corresponding to the target categories should be different from each other.
  • the user cannot preset all keywords for each target classification. Based on this, there may be a situation in which a text to be classified included in the node grouping does not contain the preset keywords corresponding to any target classification.
  • S104B3 Perform secondary classification on the text to be classified according to the text segmentation and a plurality of preset texts of known target classifications to obtain a final target group.
  • the above-mentioned preset texts of known target categories may be multiple texts preset by a user, and the preset texts include but are not limited to texts such as journals and papers.
  • the secondary classification of the text to be classified may be that the terminal device recognizes the text semantics according to the text segments and then performs secondary classification on the text to be classified. It is also possible to classify the text to be classified into a target classification when it is determined that its text segments often appear in the preset texts corresponding to that target classification.
  • a: All objects remain in a state of rest or a state of uniform linear motion when they are not acted upon by a force or the resultant force is zero.
  • b: When you measure a particle's momentum, you cannot tell where it is.
  • the text segmentation of the text a to be classified includes text segmentation such as force, resultant force, static state, and uniform linear motion state.
  • the terminal device can count the number of articles that simultaneously contain the above text segments (the text segments of text a to be classified). Afterwards, among the articles that contain the above text segments, the number of articles belonging to each target classification is determined, and the target classification corresponding to the maximum number of articles is taken as the target classification of the text to be classified. Through the above processing, it is found that most of the articles containing the above text segments belong to the "Newtonian mechanics" classification.
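The counting procedure described above can be sketched as follows (an illustration only; the preset texts and function name are hypothetical):

```python
from collections import Counter

def classify_by_cooccurrence(segments, preset_texts):
    """Count, per target classification, how many preset texts contain all
    of the text segments of the text to be classified, and return the
    target classification with the maximum count. `preset_texts` maps
    classification -> list of known texts."""
    counts = Counter()
    for target, texts in preset_texts.items():
        counts[target] = sum(all(s in t for s in segments) for t in texts)
    return counts.most_common(1)[0][0]
```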
  • FIG. 7 is a structural block diagram of a text classification apparatus provided by an embodiment of the present application. Each module included in the text classification apparatus in this embodiment is used to execute each step in the embodiment corresponding to FIG. 1 to FIG. 6 .
  • the text classification apparatus 700 includes: an acquisition module 710, a calculation module 720, a filtering module 730 and a first classification module 740, wherein:
  • the obtaining module 710 is configured to obtain a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and two text nodes are connected by a node line.
  • the calculation module 720 is configured to extract the text features of each of the texts to be classified, and respectively calculate the similarity distance between the two text nodes according to the text features.
  • the filtering module 730 is configured to filter the node lines in the text node structure diagram according to the similarity distance to obtain a target text node structure diagram including each text node and the remaining node lines.
  • the first classification module 740 is configured to classify the text to be classified in the node structure diagram of the target text based on the community discovery algorithm and the remaining node lines to obtain a plurality of target groups of the text to be classified.
  • the text to be classified includes multiple text segments, the text features are composed of the word vectors of the multiple text segments, and the similarity distance includes the Euclidean distance; the computing module 720 is further configured to:
  • the filtering module 730 is also used to:
  • for any similarity distance, determine whether the similarity distance is less than or equal to the preset threshold; if it is determined that the similarity distance is less than or equal to the preset threshold, delete the node line corresponding to that similarity distance; and, if it is determined that the similarity distance is greater than the preset threshold, retain the node line corresponding to that similarity distance; based on the remaining node lines and the plurality of text nodes, generate the target text node structure diagram.
  • the first classification module 740 is further configured to:
  • take each text node in the target text node structure diagram as a separate node grouping; for any node grouping, fuse the any node grouping with each adjacent node grouping in turn to obtain various grouping modules, where each grouping module includes the node grouping formed by fusing the any node grouping with one of the adjacent node groupings, together with the remaining individual node groupings; respectively calculate the grouping modularity of each grouping module, the grouping modularity being used to represent the classification effect of the grouping module; among the various grouping modules, take the grouping module corresponding to the maximum grouping modularity as the current target grouping module, and judge whether the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the previous target grouping module; if so, take the previous target grouping module as the target grouping of the multiple texts to be classified; if the grouping modularity of the current target grouping module is greater than that of the previous target grouping module, then for each node grouping in the current target grouping module, execute the step of fusing the any node grouping with each adjacent node grouping to obtain various grouping modules, until the target grouping of each text to be classified is obtained.
  • the grouping modularity of a grouping module is obtained by adding the modularity of the fused node grouping in the grouping module to the modularity of the remaining individual node groupings; the first classification module 740 is further configured to:
  • for any grouping module, determine a first number of node lines included in the current node grouping in the grouping module, and a second number of node lines by which the remaining individual node groupings are respectively connected to the current node grouping, where the current node grouping is the grouping whose modularity is currently to be calculated in the grouping module; calculate the modularity of the current node grouping according to the first number and the second number; take the remaining individual node groupings in turn as the current node grouping, and calculate the modularity of each of the remaining individual node groupings in the grouping module respectively; and take the sum of the modularity of the current node grouping and the modularity of the remaining individual node groupings as the grouping modularity of the grouping module.
  • the text classification apparatus 700 further includes:
  • a determination module, configured to determine a classification range corresponding to each node grouping in the target grouping, where the classification range includes multiple target classifications.
  • the second classification module is configured to, for any node grouping, perform secondary classification on the multiple text nodes included in the node grouping according to the preset keywords corresponding to the multiple target classifications in the classification range, to obtain the final target grouping of the multiple texts to be classified.
  • the second classification module is further configured to:
  • for the multiple text nodes included in any of the node groupings, judge whether the text to be classified corresponding to any text node does not contain the preset keywords of the multiple target classifications; if, among the multiple text nodes, the text to be classified corresponding to any text node does not contain the preset keywords of the multiple target classifications, obtain the text segments in that text to be classified; and perform secondary classification on that text to be classified according to the text segments and the preset texts of the known target classifications, to obtain the final target grouping.
  • each unit/module is configured to execute the steps in the embodiments corresponding to FIG. 1 to FIG. 6; the steps have been explained in detail in the above-mentioned embodiments and are not repeated here.
  • FIG. 8 is a structural block diagram of a terminal device provided by another embodiment of the present application.
  • the terminal device 800 of this embodiment includes: a processor 810 , a memory 820 , and a computer program 830 stored in the memory 820 and executable on the processor 810 , such as a program of a text classification method.
  • when the processor 810 executes the computer program 830, the steps in each of the above text classification method embodiments are implemented, for example, S101 to S104 shown in FIG. 1 .
  • when the processor 810 executes the computer program 830, the functions of each module in the embodiment corresponding to FIG. 7 are implemented, for example, the functions of the modules 710 to 740 shown in FIG. 7 .
  • Specifically as follows:
  • a terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
  • obtaining a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected by a node line;
  • classifying the text to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
  • a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements:
  • obtaining a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected by a node line;
  • classifying the text to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
  • the computer program 830 may be divided into one or more modules, and the one or more modules are stored in the memory 820 and executed by the processor 810 to complete the present application.
  • One or more modules may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 830 in the terminal device 800 .
  • the terminal device may include, but is not limited to, the processor 810 and the memory 820 .
  • FIG. 8 is only an example of the terminal device 800 and does not constitute a limitation on the terminal device 800; it may include more or fewer components than those shown in the figure, or combine some components; for example, the terminal device may also include input and output devices, buses, etc.
  • the so-called processor 810 may be a central processing unit, or may be another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 820 may be an internal storage unit of the terminal device 800 , such as a hard disk or a memory of the terminal device 800 .
  • the memory 820 may also be an external storage device of the terminal device 800 , such as a plug-in hard disk, a smart memory card, a flash memory card, etc., which are equipped on the terminal device 800 . Further, the memory 820 may also include both an internal storage unit of the terminal device 800 and an external storage device.
  • the computer-readable storage medium may be an internal storage unit of the terminal device described in the foregoing embodiments, such as a hard disk or a memory of the terminal device.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium may also be an external storage device of the terminal device, for example, a pluggable hard disk, a smart memory card, a secure digital card, a flash memory card, etc. equipped on the terminal device.
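The keyword-based secondary classification described for the second classification module above can be sketched as follows. This is an illustrative sketch only: the function name, the keyword dictionary, and the word-overlap fallback used when no keyword matches are assumptions, not the patented implementation.

```python
def secondary_classify(texts, keywords_by_class, reference_texts):
    """For each text, assign the first target classification whose preset
    keywords appear in the text; if no keyword of any classification is
    found, fall back to comparing the text's words with preset texts of
    known classifications (here: simple word overlap)."""
    result = {}
    for text in texts:
        label = next((cls for cls, kws in keywords_by_class.items()
                      if any(kw in text for kw in kws)), None)
        if label is None:
            # fallback: known classification whose preset text shares the
            # most words with this text to be classified
            words = set(text.split())
            label = max(reference_texts,
                        key=lambda cls: len(words & set(reference_texts[cls].split())))
        result[text] = label
    return result
```

In practice the patent's fallback compares the text segments of the unmatched text with preset texts of known target classifications; plain word overlap stands in for that comparison here.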

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application is applicable to the technical field of artificial intelligence. Provided are a text classification method and apparatus, and a terminal device and a storage medium. The method comprises: acquiring a text node structural diagram generated by a plurality of texts to be classified; extracting text features of each text to be classified, and respectively calculating a similarity distance between text nodes in pairs according to the text features; according to the similarity distance, filtering node lines in the text node structural diagram, so as to obtain a target text node structural diagram including each text node and the remaining node lines; and on the basis of a community discovery algorithm and the remaining node lines, classifying texts to be classified in the target text node structural diagram, so as to obtain a target group of the plurality of texts to be classified. By means of the method, preliminary filtering is first performed on the text node structural diagram generated by the texts to be classified, and the text node structural diagram is then grouped to obtain the target group, such that the accuracy of classifying the plurality of texts to be classified can be improved.

Description

Text classification method, apparatus, terminal device and storage medium

This application claims priority to the Chinese patent application with application number 202011586743.3, entitled "Text Classification Method, Apparatus, Terminal Equipment and Storage Medium", filed with the China Patent Office on December 28, 2020, the entire contents of which are incorporated herein by reference.
Technical Field

The present application belongs to the technical field of artificial intelligence, and in particular relates to a text classification method, apparatus, terminal device and storage medium.

Background

At present, text classification is an important aspect of the field of artificial intelligence. As an important information processing task, text classification aims to automatically classify unlabeled documents into a predetermined set of categories to resolve the problem of information clutter. Existing text classification methods usually employ an existing fast text classifier model or a convolutional neural network model to classify texts. However, the inventor realized that classifying texts by training a neural network model takes a great deal of time, and the classification accuracy is low when texts are classified at a finer granularity.
Technical Problem

One of the purposes of the embodiments of the present application is to provide a text classification method, apparatus, terminal device and storage medium, aiming to solve the technical problem in the prior art of low text classification accuracy when classifying texts through a neural network model.

Technical Solutions

In order to solve the above technical problem, the technical solutions adopted in the embodiments of the present application are as follows:
In a first aspect, an embodiment of the present application provides a text classification method, including:

obtaining a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected by a node line;

extracting the text features of each of the texts to be classified, and calculating the similarity distance between every two text nodes according to the text features;

filtering the node lines in the text node structure diagram according to the similarity distance, to obtain a target text node structure diagram including each text node and the remaining node lines;

classifying the text to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
In a second aspect, an embodiment of the present application provides a text classification apparatus, including:

an acquisition module, configured to obtain a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected by a node line;

a computing module, configured to extract the text features of each of the texts to be classified, and calculate the similarity distance between every two text nodes according to the text features;

a filtering module, configured to filter the node lines in the text node structure diagram according to the similarity distance, to obtain a target text node structure diagram including each text node and the remaining node lines;

a first classification module, configured to classify the text to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
A third aspect of the embodiments of the present application provides a terminal device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements:

obtaining a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected by a node line;

extracting the text features of each of the texts to be classified, and calculating the similarity distance between every two text nodes according to the text features;

filtering the node lines in the text node structure diagram according to the similarity distance, to obtain a target text node structure diagram including each text node and the remaining node lines;

classifying the text to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements:

obtaining a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected by a node line;

extracting the text features of each of the texts to be classified, and calculating the similarity distance between every two text nodes according to the text features;

filtering the node lines in the text node structure diagram according to the similarity distance, to obtain a target text node structure diagram including each text node and the remaining node lines;

classifying the text to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
Beneficial Effects

Compared with the prior art, the embodiments of the present application have the beneficial effect of improving the accuracy of classifying multiple texts to be classified.

Description of Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the accompanying drawings used in the description of the embodiments or exemplary technologies are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of the implementation of a text classification method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of an implementation of S103 of a text classification method provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of an implementation of S104 of a text classification method provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of an implementation of S1043 of a text classification method provided by an embodiment of the present application;

FIG. 5 is a flowchart of the implementation of a text classification method provided by another embodiment of the present application;

FIG. 6 is a schematic diagram of an implementation of S104B of a text classification method provided by an embodiment of the present application;

FIG. 7 is a structural block diagram of a text classification apparatus provided by an embodiment of the present application;

FIG. 8 is a structural block diagram of a terminal device provided by an embodiment of the present application.
Embodiments of the Present Invention

In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, not to limit it.

The text classification method provided by the embodiments of the present application can be applied to terminal devices such as tablet computers, notebook computers, ultra-mobile personal computers (UMPC), and netbooks; the embodiments of the present application do not place any restriction on the specific type of terminal device.
S101. Obtain a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected by a node line.

In application, the above multiple texts to be classified may each be a text such as a paper, a journal or a magazine, or may be a sentence or a paragraph, which is not limited. For multiple texts to be classified, each text to be classified may be taken as a text node, and the multiple text nodes may be connected by node lines to generate the text node structure diagram. The texts to be classified may be pre-stored under a designated storage path of the terminal device and then obtained by the terminal device, or may be multiple texts to be classified transmitted by a user in real time, which is not limited.

It can be understood that the number of the above node lines is related to the number of text nodes. Exemplarily, when the number of text nodes is 2, the number of node lines is 1; when the number of text nodes is 3, the number of node lines is 3; when the number of text nodes is 4, the number of node lines is 6; and so on. Specifically, y = (n-1)*n/2, where n (n is an integer, and n ≥ 2) is the number of text nodes and y is the number of node lines.
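The relation y = (n-1)*n/2 above is simply the number of edges in a complete graph on n nodes; a minimal check (the function name is illustrative):

```python
def node_line_count(n: int) -> int:
    """Number of node lines when n text nodes (n >= 2) are pairwise
    connected by node lines, i.e. y = (n - 1) * n / 2."""
    if n < 2:
        raise ValueError("at least two text nodes are required")
    return (n - 1) * n // 2
```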
S102. Extract the text features of each of the texts to be classified, and calculate the similarity distance between every two text nodes according to the text features.

In application, the above similarity distance can be understood as the degree of similarity between two texts to be classified. Specifically, the terminal device may extract the text features of each text to be classified, and calculate the similarity between two text features as the similarity distance between the two texts to be classified.

Exemplarily, extracting the text features of a text to be classified may be: segmenting the text to be classified to obtain multiple text segments; then determining, in a preset word vector library, the words consistent with the text segments, as well as the sequence numbers of the text segments; and finally generating the word features of the text segments according to the sequence numbers. Determining the words consistent with the text segments in the preset word vector library may be implemented with a forward matching algorithm. Specifically, if the longest word in the preset word segmentation library has 5 characters, the first to fifth characters of the text to be classified may be taken as an initial segment, and it is determined whether this segment exists in the word vector library. If it exists, the initial segment is taken as a target segment, and matching continues with the subsequent characters. If the initial segment does not exist in the word vector library, the character length is reduced from right to left; that is, the first to fourth characters of the text to be classified are taken as the initial segment, and it is again determined whether this initial segment exists in the word vector library. In this way, the multiple texts to be classified are segmented to obtain text segments.
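The forward matching procedure just described can be sketched as follows, assuming (as in the example above) a maximum word length of 5 characters; the vocabulary set stands in for the preset word vector library, and a single character is emitted when nothing matches:

```python
def forward_max_match(text, vocab, max_len=5):
    """Forward maximum matching: start with the longest window at the
    current position, shrink it from the right until a vocabulary entry
    is found, emit that segment, and continue after it."""
    segments, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                segments.append(piece)
                i += length
                break
    return segments
```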
In application, after the multiple text segments are determined, the word feature of each text segment can be generated according to its sequence number in the word vector library. Specifically, the word feature dimension of each text segment is preset; then the one-dimensional space of each text segment (the number corresponding to its sequence number) is mapped into a multi-dimensional continuous vector space. Exemplarily, for a text segment with sequence number "5" and a feature vector dimension of 10, the word feature of the text segment may be [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]; in this way, the word feature of each text segment is obtained. On this basis, the text features of a text to be classified can be represented by the word features of its multiple text segments. The similarity distance between every two text nodes can then be calculated with reference to the following steps, specifically:
The similarity distance includes the Euclidean distance. The step of S102, extracting the text features of each of the texts to be classified and calculating the similarity distance between every two text nodes according to the text features, specifically includes the following sub-steps:

Extract the word vectors of the multiple text segments in each of the texts to be classified, and determine the word order of each of the multiple text segments in the text to be classified.

For any pair of text nodes, calculate the Euclidean distance of each text segment according to the word vectors at the same word order in the two texts to be classified, and sum the Euclidean distances of the text segments to obtain the Euclidean distance between the two text nodes.

In application, the way text segments are extracted from the text to be classified, and the word vectors of the text segments, have been explained in S102 above and are not described again. It can be understood that, since the text to be classified is composed of a sequence of consecutive characters, each character has a corresponding character order in the text to be classified. On this basis, the multiple text segments obtained by segmenting the text to be classified character by character as described above also have a corresponding word order.
In application, the Euclidean distance can be calculated as:

$$p(a,b)=\sqrt{\sum_{i=1}^{n}\left(a_i-b_i\right)^2}$$

where n is the number of text segments, a_i denotes the word vector of the i-th text segment in the a-th text to be classified, b_i denotes the word vector of the i-th text segment in the b-th text to be classified, and p(a,b) denotes the similarity distance between text a and text b. If the number of segments m1 of text a is greater than the number of segments m2 of text b, then n may be taken as m1, and the word features of the segments of text b beyond position m2 are all represented by "0". In other applications, the similarity distance may also be the cosine distance between two text nodes calculated from the text features, which is not limited.
S103. Filter the node lines in the text node structure diagram according to the similarity distance, to obtain a target text node structure diagram including each text node and the remaining node lines.

In application, filtering the above node lines can be understood as deleting the node lines whose similarity distance is less than or equal to a preset threshold, thereby obtaining a target text node structure diagram composed of the text nodes and the remaining node lines. It can be understood that the similarity distance between two text nodes is the length of the node line between those two text nodes.

Specifically, a preset threshold may be pre-stored inside the terminal device. When the similarity distance is greater than the preset threshold, the node line between the two texts to be classified is retained; when the similarity distance is less than or equal to the preset threshold, the node line between the two texts to be classified is deleted. It can be understood that, in the target text node structure diagram obtained at this point, the two texts to be classified connected by a node line have a certain similarity, and when they are clustered and grouped, the probability that these two texts belong to the same category grouping is considered greater.
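The threshold filtering step can be sketched as a single pass over the node lines; the dictionary representation and the threshold value used in the test are assumptions for illustration:

```python
def filter_node_lines(node_lines, threshold):
    """Keep only the node lines whose similarity distance is greater
    than the preset threshold; lines whose distance is less than or
    equal to the threshold are deleted, as described above.
    `node_lines` maps (node_a, node_b) pairs to similarity distances."""
    return {pair: dist for pair, dist in node_lines.items() if dist > threshold}
```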
S104. Classify the text to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.

In application, the above community discovery algorithm includes, but is not limited to, the Leiden algorithm and the Louvain algorithm. A community discovery algorithm is a modularity-based classification algorithm that can be used to group multiple hierarchical (different-category) targets (texts to be classified). Modularity can be regarded as a metric that evaluates the effect of each grouping after the multiple targets have been grouped in different ways. It can be understood that when multiple texts to be classified are classified, each category contains at least one text to be classified. Exemplarily, if, after 5 texts to be classified are classified, the target grouping contains 3 categories, then the number of texts to be classified contained in the grouping of each category may be 1 or may be 2, which is not limited.
Specifically, after the target text node structure diagram is classified, target groupings of the multiple texts to be classified are obtained. To determine the overall modularity Q of the target grouping at this point, the modularity Q_i of the grouping of each category in the target grouping is calculated as follows:

$$Q_i=\frac{\sum_{in}}{2m}-\left(\frac{\sum_{tot}}{2m}\right)^2$$

where Q_i is the modularity of the grouping of the i-th category, m is the sum of the similarity distances of all node lines across the groupings of all categories, Σ_in is the sum of the similarity distances of all node lines within the grouping of the i-th category, and Σ_tot is the sum of the similarity distances of all node lines connected to the text nodes within the grouping of the i-th category. On this basis, the modularity corresponding to the grouping of each category after each classification can be obtained; summing these modularities then yields the overall modularity Q of the target grouping.
In this embodiment, a text node structure graph is generated by taking each text to be classified as a text node; the similarity distance between each pair of texts to be classified is then calculated, and the text node structure graph is preliminarily filtered to obtain the target text node structure graph, realizing a preliminary grouping of the texts to be classified. Then, according to the community detection algorithm and the remaining node lines, the texts to be classified in the target text node structure graph are classified, and the plurality of texts to be classified are grouped again to obtain the target grouping. In this way, the accuracy of classifying the plurality of texts to be classified is improved.
Referring to FIG. 2, in a specific embodiment, S103 (filtering the node lines in the text node structure graph according to the similarity distance to obtain a target text node structure graph containing each text node and the remaining node lines) specifically includes the following sub-steps S1031-S1034, detailed as follows:
S1031. For any similarity distance, determine whether the similarity distance is less than or equal to a preset threshold.
S1032. If it is determined that the similarity distance is less than or equal to the preset threshold, delete the node line corresponding to that similarity distance. And,
S1033. If it is determined that the similarity distance is greater than the preset threshold, keep the node line corresponding to that similarity distance.
S1034. Generate the target text node structure graph based on the remaining node lines and the multiple text nodes.
In application, the above preset threshold may be set by the user according to the actual situation, or may be a fixed value preset in the terminal device; this is not limited here. As explained in S103, when a similarity distance is less than or equal to the preset threshold, the node line corresponding to that similarity distance is deleted, and when a similarity distance is greater than the preset threshold, the node line corresponding to that similarity distance is retained. The target text node structure graph is thereby generated; this is not described again.
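Sub-steps S1031-S1034 amount to a simple edge filter. A Python sketch, under the assumption that node lines are stored as weighted tuples (an illustrative layout, not the original implementation):

```python
def filter_node_lines(node_lines, threshold):
    """S1031-S1033: keep only node lines whose similarity distance is
    strictly greater than the preset threshold.

    node_lines: list of (node_a, node_b, similarity_distance)
    """
    return [(a, b, d) for a, b, d in node_lines if d > threshold]


def build_target_graph(text_nodes, node_lines, threshold):
    """S1034: the target text node structure graph keeps every text node
    together with the remaining node lines."""
    return {"nodes": list(text_nodes),
            "lines": filter_node_lines(node_lines, threshold)}
```

Note that every text node is kept even if all of its node lines are removed; only node lines are filtered.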
Referring to FIG. 3, in a specific embodiment, S104 (classifying the texts to be classified in the target text node structure graph based on the community detection algorithm and the remaining node lines to obtain target groupings of the plurality of texts to be classified) specifically includes the following sub-steps S1041-S1046, detailed as follows:
S1041. Take each text node in the target text node structure graph as a separate node grouping.
In application, when the plurality of texts to be classified are classified, the text node corresponding to each text to be classified is taken as one node grouping. The same processing steps are then performed on each node grouping in turn to obtain the target groupings of the plurality of texts to be classified.
S1042. For any node grouping, fuse that node grouping with any adjacent node grouping, one at a time, to obtain multiple grouping modules, where each grouping module includes the node grouping formed by fusing that node grouping with one of its adjacent node groupings, together with the remaining separate node groupings.
In application, the target text node structure graph also contains node lines connecting text nodes with other text nodes. Based on this, for the node grouping of any text node, that node grouping can be fused with any adjacent node grouping, one at a time, to obtain a fused node grouping together with the remaining separate node groupings.
For example, if there are k texts to be classified, there are k node groupings. For any node grouping i, node grouping i can be fused with an adjacent node grouping j, that is, the node groupings of the two texts to be classified are taken as one grouping. The resulting grouping module then contains k-1 node groupings. It can be understood that among these k-1 groupings, there is one node grouping formed by fusion, together with k-2 remaining separate node groupings. The other node groupings also need to be processed as in S1042 to obtain the corresponding grouping modules.
S1043. Calculate the grouping modularity of each grouping module, where the grouping modularity is used to represent the classification effect of the grouping module.
In application, after a grouping module is determined, the grouping modularity of that grouping module is obtained by adding the modularity of the node grouping formed by fusion to the modularity of each of the remaining separate node groupings. That is, for the grouping module containing the above k-1 node groupings, the modularity of each of the k-1 node groupings needs to be calculated separately and summed to obtain the overall modularity of the grouping module (the grouping modularity). Specifically, reference may be made to the formula for calculating Qi in S104, where Q represents the overall modularity of the grouping module (the grouping modularity) and Qi represents the modularity of the grouping of the i-th category (among the k-1 node groupings contained in the grouping module, the modularity of the i-th node grouping), with i an integer and i ≤ k-1.
In application, according to the above steps, if node grouping i has another adjacent node grouping x, the above steps also need to be repeated for node grouping i and node grouping x to obtain the grouping modularity of another grouping module. Then, after node grouping i has been processed, steps S1042 and S1043 need to be repeated for the remaining separate node groupings (j, x, or any other node grouping). In this way, the multiple grouping modules obtained by classifying each node grouping, together with the grouping modularity corresponding to each grouping module, can be obtained.
S1044. Among the multiple grouping modules, take the grouping module corresponding to the maximum grouping modularity as the current target grouping module, and determine whether the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the previous target grouping module.
In application, after the grouping modularity of each grouping module is obtained, the maximum of the multiple grouping modularities can be determined from their values, and the grouping module corresponding to that maximum grouping modularity is taken as the current target grouping module.
S1045. If it is determined that the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the previous target grouping module, take the previous target grouping module as the target grouping of the plurality of texts to be classified.
S1046. If it is determined that the grouping modularity of the current target grouping module is greater than the grouping modularity of the previous target grouping module, then for each node grouping in the current target grouping module, perform again the step of fusing any node grouping with any adjacent node grouping to obtain multiple grouping modules, until the target grouping of each text to be classified is obtained.
In application, after the current target grouping module is determined, it is necessary to determine whether the current target grouping module is the optimal grouping of the plurality of texts to be classified. For this, the grouping modularity of the current target grouping module needs to be compared with the grouping modularity of the previous target grouping module. Specifically, if the grouping modularity of the current target grouping module is less than or equal to that of the previous target grouping module, the previous target grouping module is the optimal grouping (target grouping) of the plurality of texts to be classified. If the grouping modularity of the current target grouping module is greater than that of the previous target grouping module, the previous target grouping is not the optimal grouping. In that case, to determine whether the current target grouping is the optimal grouping, S1042 to S1045 need to be repeated for each node grouping in the current target grouping module until the target grouping of each text to be classified is obtained.
It should be noted that when S1042 to S1045 are repeated, S1042 now applies to each node grouping in the current target grouping. That is, if the node groupings contained in the current target grouping module are the node grouping formed by fusing node grouping i with node grouping j, together with the remaining separate node groupings, then when S1042 is executed, the number of node groupings is k-1. At this point, the node grouping formed by fusing node grouping i with node grouping j is treated as a single node grouping.
Specifically, for the grouping modularity Qk-1 of the current target grouping module, if Qk-1 is less than or equal to the grouping modularity Qk of the previous target grouping module, the classification effect of the current target grouping module is no better than that of the previous target grouping module, which indicates that the previous target grouping module is the target grouping with the best effect. If the grouping modularity Qk-1 of the current target grouping module is greater than the grouping modularity Qk of the previous target grouping module, the classification effect of the current target grouping module is better than that of the previous target grouping module. S1042 to S1045 are then repeated for the k-1 node groupings of the current target grouping module until the target grouping module with the best classification effect is obtained.
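Steps S1041-S1046 describe a greedy, Louvain-style merging loop. A self-contained Python sketch under the formula of S104 (the helper names and data layout are illustrative assumptions, not the original implementation):

```python
import itertools


def modularity(edges, groups):
    """Overall modularity: sum over groupings of
    Qi = sigma_in/(2m) - (sigma_tot/(2m))**2."""
    m = sum(w for _, _, w in edges)
    q = 0.0
    for g in groups:
        s_in = sum(w for a, b, w in edges if a in g and b in g)
        s_tot = sum(w for a, b, w in edges if a in g or b in g)
        q += s_in / (2 * m) - (s_tot / (2 * m)) ** 2
    return q


def greedy_grouping(nodes, edges):
    # S1041: every text node starts as its own node grouping
    groups = [frozenset([n]) for n in nodes]
    best_q = modularity(edges, groups)
    while True:
        # S1042: each candidate grouping module fuses one pair of
        # adjacent node groupings and keeps the rest separate
        candidates = []
        for g1, g2 in itertools.combinations(groups, 2):
            adjacent = any((a in g1 and b in g2) or (a in g2 and b in g1)
                           for a, b, _ in edges)
            if adjacent:
                fused = [g for g in groups if g not in (g1, g2)]
                fused.append(g1 | g2)
                candidates.append(fused)
        if not candidates:
            return groups
        # S1043-S1044: pick the grouping module with maximal modularity
        current = max(candidates, key=lambda gs: modularity(edges, gs))
        current_q = modularity(edges, current)
        # S1045: if modularity stops improving, keep the previous grouping
        if current_q <= best_q:
            return groups
        # S1046: otherwise repeat with the fused groupings
        groups, best_q = current, current_q
```

On a small graph with two strongly connected pairs joined by a weak node line, the loop fuses each pair and then stops, since fusing everything would lower the modularity.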
Referring to FIG. 4, in a specific embodiment, the grouping modularity of the grouping module is obtained by adding the modularity of the node grouping formed by fusion to the modularities of the remaining separate node groupings; S1043 (calculating the grouping modularity of each grouping module) specifically includes the following sub-steps S10431-S10434, detailed as follows:
S10431. For any grouping module, determine a first number of node lines contained in the current node grouping in that grouping module, and a second number of node lines connecting each of the remaining separate node groupings to the current node grouping, where the current node grouping is the grouping in the grouping module whose modularity is currently to be calculated.
S10432. Calculate the modularity of the current node grouping according to the first number and the second number.
S10433. Take each of the remaining separate node groupings as the current node grouping in turn, and calculate the modularity of each of the remaining separate node groupings in the grouping module.
S10434. Take the sum of the modularity of the current node grouping and the modularities of the remaining separate node groupings as the grouping modularity of the grouping module.
In application, the modularity of the current node grouping is calculated according to the first number and the second number; for details, reference may be made to the modularity calculation formula in S104, which is not described again.
In application, the current node grouping is the grouping in the grouping module whose modularity is currently being calculated. It can be understood that if the current node grouping contains only one text node, the first number of node lines contained in the current node grouping is 0. The second number is the number of node lines connecting the remaining node groupings to the current node grouping. It can be understood that if the current node grouping contains multiple text nodes (a grouping obtained by fusing multiple node groupings), then a remaining node grouping connected to any node grouping within the current node grouping counts as being connected to the current node grouping.
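The first and second numbers of S10431 reduce to simple edge counts. A Python sketch, with an illustrative data layout assumed for the node lines:

```python
def count_node_lines(node_lines, current_group):
    """S10431: first number = node lines inside the current node grouping;
    second number = node lines connecting it to the remaining groupings.

    node_lines: list of (node_a, node_b) pairs
    current_group: set of text nodes in the current node grouping
    """
    first = sum(1 for a, b in node_lines
                if a in current_group and b in current_group)
    second = sum(1 for a, b in node_lines
                 if (a in current_group) != (b in current_group))
    return first, second
```

As stated above, a grouping with a single text node yields a first number of 0, since it can contain no internal node lines.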
Referring to FIG. 5, in an embodiment, after S104 (classifying the texts to be classified in the target text node structure graph based on the community detection algorithm and the remaining node lines to obtain the target groupings of the plurality of texts to be classified), the method further includes the following steps S104A-S104B, detailed as follows:
S104A. Determine a classification range corresponding to each node grouping in the target grouping, where the classification range contains multiple target classifications.
In application, after the target grouping is obtained, the number of node groupings contained in the target grouping is smaller than the initial number of node groupings; that is, each node grouping in the target grouping now contains multiple text nodes. The node grouping formed by fusing these text nodes can be regarded as belonging to one classification range. The classification range includes, but is not limited to, subject classifications such as physics, chemistry, and computer science; it may also be a classification range one level below a subject classification range. For example, a classification range may be mechanics, electricity, or optics. It can be understood that each classification range may in turn contain multiple specific target classifications. For example, the target classifications contained in the mechanics classification range may include a quantum mechanics target classification, a Newtonian mechanics target classification, and the like.
In application, the plurality of texts to be classified may belong to the same classification range but different target classifications, or may belong to multiple classification ranges and different target classifications. The classification range corresponding to each node grouping in the target grouping may be annotated manually, or a model for determining the classification range may be pre-trained in the terminal device; this model is only used to determine the classification range of each node grouping according to the texts to be classified within that node grouping in the target grouping. It can be understood that the text similarity (the similarity distance between text nodes) of texts to be classified belonging to different cross-subject classification ranges is usually very small; therefore, the multiple texts to be classified within the same node grouping usually belong to one subject (classification range). Based on this, the model only needs to predict the classification range from any one text to be classified in each node grouping to determine the classification range corresponding to that node grouping.
S104B. For any node grouping, perform a secondary classification on the multiple text nodes contained in that node grouping according to preset keywords respectively corresponding to the multiple target classifications in the classification range, to obtain the final target grouping of the plurality of texts to be classified.
In application, as explained above, each classification range contains multiple target classifications. Based on this, the terminal device may pre-store the keywords corresponding to each target classification. Then, for any node grouping, the preset keywords of each target classification corresponding to that node grouping are compared with each text to be classified, so as to perform a secondary classification on the multiple texts to be classified contained in the node grouping and obtain the final target grouping of the plurality of texts to be classified.
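The keyword comparison of S104B can be sketched as a simple scoring rule in Python. Scoring by raw keyword occurrence counts is an illustrative assumption; the original does not fix a particular comparison rule:

```python
def keyword_classify(text, keywords_by_classification):
    """Secondary classification: assign the text to the target
    classification whose preset keywords occur most often in it.

    keywords_by_classification: dict mapping a target classification
    to its list of preset keywords
    """
    scores = {cls: sum(text.count(kw) for kw in kws)
              for cls, kws in keywords_by_classification.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: no keyword matched
```

Returning None when no keyword matches corresponds to the case handled by sub-steps S104B1-S104B3 below, where a text contains none of the preset keywords.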
It can be understood that the steps from S101 to S104 above may be shared by different business scenarios (different text classification business scenarios). When other business scenarios need to implement a text classification service, the system resources covering steps S101 to S104 can also be used for processing, which greatly reduces the system resources required for text classification across different services, without dedicating separate system resources to each text classification business scenario.
Referring to FIG. 6, in an embodiment, S104B (performing a secondary classification on the multiple text nodes contained in the node grouping according to the preset keywords respectively corresponding to the multiple target classifications in the classification range, to obtain the final target grouping of the plurality of texts to be classified) further includes the following sub-steps S104B1-S104B3, detailed as follows:
S104B1. For the multiple text nodes contained in any node grouping, determine whether, among these text nodes, there is any text node whose corresponding text to be classified does not contain the preset keywords of the multiple target classifications.
S104B2. If it is determined that, among the multiple text nodes, there is a text node whose corresponding text to be classified does not contain the preset keywords of the multiple target classifications, obtain the text segments of that text to be classified.
In application, the keywords corresponding to each target classification are usually limited. In practice, to distinguish the target classification of each text to be classified, the preset keywords corresponding to the respective target classifications should differ from one another, and the user cannot preset all possible keywords for every classification. Based on this, a situation may arise in which a text to be classified contained in a node grouping contains none of the preset keywords corresponding to any target classification.
In application, performing text segmentation on the text to be classified may be done with reference to the segmentation example in S102, which is not explained again.
S104B3. Perform a secondary classification on the text to be classified according to the text segments and multiple preset texts of known target classification, to obtain the final target grouping.
In application, the above preset texts of known target classification may be multiple texts preset by the user, including but not limited to journals, papers, and the like.
In application, performing a secondary classification on the text to be classified according to the text segments and the multiple preset texts of known target classification may mean that the terminal device recognizes the text semantics from the text segments and then performs the secondary classification, or that, when it is determined that the text segments frequently appear in the preset texts corresponding to a certain target classification, the text to be classified is assigned to that target classification.
For example, consider two texts to be classified within one node grouping: a, "Every object remains at rest or in uniform linear motion unless acted upon by a force or when the resultant force is zero"; and b, "When you measure the momentum of a particle, you cannot accurately measure its position." Performing text segmentation on text a yields multiple text segments, including, for example, force, resultant force, state of rest, and state of uniform linear motion. Across other preset texts of known target classification (for example, papers or journals), the terminal device can count the number of articles that simultaneously contain these text segments (the text segments of text a). Then, from the articles that simultaneously contain these text segments, the number of articles belonging to each target classification is determined, and the target classification with the largest number of such articles is taken as the target classification of the text to be classified. Through this processing it is found that most articles containing the text segments of text a belong to the "Newtonian mechanics" category, while after the same processing for text b, most articles that simultaneously contain the text segments of text b belong to the "quantum mechanics" category. Based on this, it can be determined that text a belongs to the final target grouping of the "Newtonian mechanics" category, and text b belongs to the final target grouping of "quantum mechanics". In this way, the accuracy of text classification is improved when more precise classification of the texts to be classified is performed.
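The co-occurrence counting of S104B3 can be sketched in Python. The corpus layout (a list of labeled segment sets) and the "contains all segments" criterion are illustrative assumptions drawn from the example above:

```python
def classify_by_cooccurrence(text_segments, preset_texts):
    """S104B3 sketch: among preset texts that contain all of the text
    segments, count how many belong to each target classification and
    pick the classification with the largest count.

    preset_texts: list of (target_classification, set_of_segments)
    """
    wanted = set(text_segments)
    counts = {}
    for classification, segments in preset_texts:
        if wanted <= segments:  # article contains all the text segments
            counts[classification] = counts.get(classification, 0) + 1
    return max(counts, key=counts.get) if counts else None
```

In practice the strict "all segments" criterion could be relaxed to a fraction of matching segments; the original leaves this choice open.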
Referring to FIG. 7, FIG. 7 is a structural block diagram of a text classification apparatus provided by an embodiment of the present application. The modules included in the text classification apparatus of this embodiment are used to execute the steps in the embodiments corresponding to FIG. 1 to FIG. 6; for details, refer to FIG. 1 to FIG. 6 and the related descriptions of the embodiments corresponding to FIG. 1 to FIG. 6. For ease of explanation, only the parts related to this embodiment are shown. Referring to FIG. 7, the text classification apparatus 700 includes an acquisition module 710, a calculation module 720, a filtering module 730, and a first classification module 740, where:
The acquisition module 710 is configured to acquire a text node structure graph, where the text node structure graph contains multiple text nodes, each text node corresponds to one text to be classified, and each pair of text nodes is connected by a node line.
The calculation module 720 is configured to extract the text features of each text to be classified and to calculate, according to the text features, the similarity distance between each pair of text nodes.
The filtering module 730 is configured to filter the node lines in the text node structure graph according to the similarity distances to obtain a target text node structure graph containing each text node and the remaining node lines.
The first classification module 740 is configured to classify the texts to be classified in the target text node structure graph based on the community detection algorithm and the remaining node lines to obtain target groupings of the plurality of texts to be classified.
In an embodiment, the text to be classified contains multiple text segments, the text features are composed of the word vectors of the multiple text segments, and the similarity distance includes a Euclidean distance; the calculation module 720 is further configured to:
extract the word vectors of the multiple text segments in each text to be classified, and determine the word order of the multiple text segments in their respective texts to be classified; and, for any pair of text nodes, calculate the Euclidean distance of each text segment according to the word vectors at the same word order in the texts to be classified, and sum the Euclidean distances of the text segments to obtain the Euclidean distance between the pair of text nodes.
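This distance can be sketched in Python, under the assumption that the two texts have already been segmented and embedded into equal-length sequences of word vectors (how the vectors are produced is left open by the description):

```python
import math


def text_euclidean_distance(vectors_a, vectors_b):
    """Sum of per-word Euclidean distances between two texts, pairing
    word vectors by their word order within each text."""
    total = 0.0
    for va, vb in zip(vectors_a, vectors_b):
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
    return total
```

A smaller result indicates more similar texts, which is why node lines with small similarity distances connect texts likely to fall into the same grouping.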
In an embodiment, the filtering module 730 is further configured to:
determine, for any similarity distance, whether the similarity distance is less than or equal to a preset threshold; if it is determined that the similarity distance is less than or equal to the preset threshold, delete the node line corresponding to that similarity distance; and, if it is determined that the similarity distance is greater than the preset threshold, keep the node line corresponding to that similarity distance; and generate the target text node structure graph based on the remaining node lines and the multiple text nodes.
在一实施例中,第一分类模块740还用于:In one embodiment, the first classification module 740 is further configured to:
Each text node in the target text node structure diagram is taken as a separate node grouping. For any node grouping, that node grouping is fused with each adjacent node grouping in turn to obtain multiple grouping modules, where each grouping module consists of the node grouping formed by fusing the node grouping with one of its adjacent node groupings, together with the remaining individual node groupings. The grouping modularity of each grouping module, which represents the classification effect of that grouping module, is calculated. Among the multiple grouping modules, the grouping module corresponding to the maximum grouping modularity is taken as the current target grouping module, and it is determined whether the grouping modularity of the current target grouping module is less than or equal to that of the previous target grouping module. If the grouping modularity of the current target grouping module is determined to be less than or equal to that of the previous target grouping module, the previous target grouping module is taken as the target grouping of the multiple texts to be classified; if the grouping modularity of the current target grouping module is determined to be greater than that of the previous target grouping module, the step of fusing each node grouping with its adjacent node groupings to obtain multiple grouping modules is performed again for each node grouping in the current target grouping module, until the target grouping of each text to be classified is obtained.
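The fuse-and-compare loop described above resembles agglomerative modularity maximization. The sketch below illustrates the idea, using standard Newman modularity purely as a stand-in for the specification's grouping modularity (which is defined by its own edge counts in the following paragraphs); all names and the example graph are hypothetical:

```python
from itertools import combinations

def modularity(groups, edges):
    """Standard Newman modularity, a stand-in for the grouping modularity."""
    m = len(edges)
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    q = 0.0
    for g in groups:
        internal = sum(1 for a, b in edges if a in g and b in g)
        total_deg = sum(deg.get(n, 0) for n in g)
        q += internal / m - (total_deg / (2 * m)) ** 2
    return q

def greedy_merge(nodes, edges):
    """Start from singleton node groupings; repeatedly apply the adjacent-group
    fusion with the highest modularity; stop when no fusion improves it."""
    groups = [frozenset([n]) for n in nodes]
    best_q = modularity(groups, edges)
    while True:
        best = None
        for i, j in combinations(range(len(groups)), 2):
            # only fuse groupings joined by at least one node line
            adjacent = any((a in groups[i] and b in groups[j]) or
                           (a in groups[j] and b in groups[i]) for a, b in edges)
            if not adjacent:
                continue
            trial = [g for k, g in enumerate(groups) if k not in (i, j)]
            trial.append(groups[i] | groups[j])
            q = modularity(trial, edges)
            if best is None or q > best[0]:
                best = (q, trial)
        if best is None or best[0] <= best_q:
            return groups  # no fusion improves the modularity: stop
        best_q, groups = best

# two tightly-knit triangles joined by a single node line
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
communities = greedy_merge(nodes, edges)
```

On this toy graph the loop converges to the two triangles, mirroring the stopping rule above: the last grouping whose modularity could not be improved is kept as the target grouping.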
In one embodiment, the grouping modularity of a grouping module is obtained by adding the modularity of the fused node grouping that forms the grouping module to the modularities of the remaining individual node groupings; the first classification module 740 is further configured to:
For any grouping module, a first number of node lines contained in the current node grouping of the grouping module, and a second number of node lines connecting each of the remaining individual node groupings to the current node grouping, are determined, where the current node grouping is the grouping in the grouping module whose modularity is currently to be calculated. The modularity of the current node grouping is calculated according to the first number and the second number. Each of the remaining individual node groupings is then taken in turn as the current node grouping, and the modularity of each of the remaining individual node groupings in the grouping module is calculated in the same way. The sum of the modularity of the current node grouping and the modularities of the remaining individual node groupings is taken as the grouping modularity of the grouping module.
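The specification fixes only the inputs to the per-group modularity (the first and second numbers of node lines) and the final summation, not the formula itself. The sketch below therefore uses the standard Newman per-community term purely as an assumed stand-in for the unspecified formula:

```python
def group_modularity(first, second, total_edges):
    """first: node lines inside the group; second: node lines linking the
    group to the other groups. The formula here is the standard Newman
    per-community term, an assumption: the specification does not fix it."""
    m = total_edges
    degree_sum = 2 * first + second  # each internal node line counts twice
    return first / m - (degree_sum / (2 * m)) ** 2

def grouping_modularity(per_group_counts, total_edges):
    # Sum the per-group terms, as the final step above prescribes.
    return sum(group_modularity(first, second, total_edges)
               for first, second in per_group_counts)

# two groups, each with 3 internal node lines and 1 line to the other group
value = grouping_modularity([(3, 1), (3, 1)], total_edges=7)  # ≈ 0.357
```

A larger value indicates a stronger community structure, matching the text's use of grouping modularity as a measure of classification effect.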
In one embodiment, the text classification apparatus 700 further includes:
a determination module, configured to determine a classification range corresponding to each node grouping in the target grouping, the classification range containing multiple target classifications; and
a second classification module, configured to, for any node grouping, perform secondary classification on the multiple text nodes contained in the node grouping according to preset keywords respectively corresponding to the multiple target classifications in the classification range, to obtain a final target grouping of the multiple texts to be classified.
In one embodiment, the second classification module is further configured to:
for the multiple text nodes contained in any node grouping, determine whether there is any text node whose corresponding text to be classified contains none of the preset keywords of the multiple target classifications; if it is determined that such a text node exists, obtain the text segments of that text to be classified; and perform secondary classification on the text to be classified according to the text segments and multiple preset texts of known target classification, to obtain the final target grouping.
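A minimal sketch of this two-stage assignment follows, with whitespace tokenization standing in for real text segmentation and token overlap standing in for the word-vector comparison the specification implies; all labels, keywords, and preset texts are invented for illustration:

```python
def classify_in_group(texts, keywords, preset_texts):
    """texts: the texts to be classified in one node grouping.
    keywords: {target classification: set of preset keywords}.
    preset_texts: {target classification: token set of a preset text of
    known classification}. All data here is invented for illustration."""
    result = {}
    for text in texts:
        tokens = set(text.split())  # whitespace split stands in for segmentation
        # first pass: direct assignment when a preset keyword is present
        hit = next((label for label, kws in keywords.items() if tokens & kws), None)
        if hit is None:
            # second pass: no keyword matched, so fall back to the most
            # similar preset text (overlap stands in for vector similarity)
            hit = max(preset_texts, key=lambda label: len(tokens & preset_texts[label]))
        result[text] = hit
    return result

keywords = {"billing": {"invoice"}, "support": {"error"}}
preset_texts = {"billing": {"pay", "refund"}, "support": {"crash", "bug"}}
assignments = classify_in_group(
    ["invoice overdue", "app crash bug today"], keywords, preset_texts)
```

Here "invoice overdue" is assigned in the first pass by keyword, while "app crash bug today" contains no preset keyword and falls through to the preset-text comparison.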
It should be understood that, in the structural block diagram of the text classification apparatus shown in FIG. 7, each unit/module is used to execute the steps of the embodiments corresponding to FIG. 1 to FIG. 6. Those steps have been explained in detail in the foregoing embodiments; for details, refer to FIG. 1 to FIG. 6 and the related descriptions of the embodiments corresponding thereto, which are not repeated here.
FIG. 8 is a structural block diagram of a terminal device provided by another embodiment of the present application. As shown in FIG. 8, the terminal device 800 of this embodiment includes: a processor 810, a memory 820, and a computer program 830 stored in the memory 820 and executable on the processor 810, for example, a program implementing the text classification method. When executing the computer program 830, the processor 810 implements the steps in the above embodiments of the text classification method, for example, S101 to S104 shown in FIG. 1. Alternatively, when executing the computer program 830, the processor 810 implements the functions of the modules in the embodiment corresponding to FIG. 7, for example, the functions of the modules 710 to 740 shown in FIG. 7. Specifically:
A terminal device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the following:
obtaining a text node structure diagram, the text node structure diagram containing multiple text nodes, each text node corresponding to one text to be classified, and every two text nodes being connected by a node line;
extracting a text feature of each text to be classified, and calculating similarity distances between every two text nodes according to the text features;
filtering the node lines in the text node structure diagram according to the similarity distances, to obtain a target text node structure diagram containing each text node and the remaining node lines; and
classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the following:
obtaining a text node structure diagram, the text node structure diagram containing multiple text nodes, each text node corresponding to one text to be classified, and every two text nodes being connected by a node line;
extracting a text feature of each text to be classified, and calculating similarity distances between every two text nodes according to the text features;
filtering the node lines in the text node structure diagram according to the similarity distances, to obtain a target text node structure diagram containing each text node and the remaining node lines; and
classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
Exemplarily, the computer program 830 may be divided into one or more modules, and the one or more modules are stored in the memory 820 and executed by the processor 810 to complete the present application. The one or more modules may be a series of computer program instruction segments capable of accomplishing specific functions, the instruction segments being used to describe the execution process of the computer program 830 in the terminal device 800.
The terminal device may include, but is not limited to, the processor 810 and the memory 820. Those skilled in the art can understand that FIG. 8 is only an example of the terminal device 800 and does not constitute a limitation on the terminal device 800; the terminal device may include more or fewer components than shown, or combine certain components; for example, the terminal device may also include input/output devices, a bus, and the like.
The processor 810 may be a central processing unit, or may be another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 820 may be an internal storage unit of the terminal device 800, such as a hard disk or memory of the terminal device 800. The memory 820 may also be an external storage device of the terminal device 800, such as a plug-in hard disk, a smart media card, or a flash memory card equipped on the terminal device 800. Further, the memory 820 may include both an internal storage unit and an external storage device of the terminal device 800.
The computer-readable storage medium may be an internal storage unit of the terminal device described in the foregoing embodiments, such as a hard disk or memory of the terminal device. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium may also be an external storage device of the terminal device, such as a plug-in hard disk, a smart media card, a secure digital card, or a flash memory card equipped on the terminal device.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (20)

  1. A text classification method, wherein the method comprises:
    obtaining a text node structure diagram, the text node structure diagram containing multiple text nodes, each text node corresponding to one text to be classified, and every two text nodes being connected by a node line;
    extracting a text feature of each text to be classified, and calculating similarity distances between every two text nodes according to the text features;
    filtering the node lines in the text node structure diagram according to the similarity distances, to obtain a target text node structure diagram containing each text node and the remaining node lines; and
    classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
  2. The text classification method according to claim 1, wherein the text to be classified contains multiple text segments, the text feature consists of word vectors of the multiple text segments, and the similarity distance comprises a Euclidean distance; and the extracting a text feature of each text to be classified, and calculating similarity distances between every two text nodes according to the text features, comprises:
    extracting the word vectors of the multiple text segments in each text to be classified, and determining the word order of each of the multiple text segments in its corresponding text to be classified; and
    for any two text nodes, calculating the Euclidean distance of each text segment according to the word vectors at the same word order in the two texts to be classified, and summing the Euclidean distances of the text segments to obtain the Euclidean distance between the two text nodes.
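By way of illustration only (not part of the claims), the per-word-order Euclidean distance recited above can be sketched as follows, using toy 2-D word vectors and assuming, as the claim implies, that the two texts contribute the same number of segments:

```python
import math

def pairwise_text_distance(vectors_a, vectors_b):
    """vectors_a / vectors_b: word vectors listed in each text's word order
    (assumed equal length). Per word order, take the Euclidean distance
    between the two vectors, then sum over all positions."""
    total = 0.0
    for va, vb in zip(vectors_a, vectors_b):
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
    return total

# toy 2-D word vectors for two texts of two segments each
distance = pairwise_text_distance([(0.0, 0.0), (1.0, 0.0)],
                                  [(3.0, 4.0), (1.0, 2.0)])  # 5.0 + 2.0 = 7.0
```

Real word vectors would come from a trained embedding model; the 2-D tuples here are invented so the arithmetic is easy to follow.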
  3. The text classification method according to claim 1 or 2, wherein the filtering the node lines in the text node structure diagram according to the similarity distances, to obtain a target text node structure diagram containing each text node and the remaining node lines, comprises:
    for any similarity distance, determining whether the similarity distance is less than or equal to a preset threshold;
    if the similarity distance is determined to be less than or equal to the preset threshold, deleting the node line corresponding to the similarity distance; and
    if the similarity distance is determined to be greater than the preset threshold, retaining the node line corresponding to the similarity distance; and
    generating the target text node structure diagram based on the remaining node lines and the multiple text nodes.
  4. The text classification method according to claim 1, wherein the classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified, comprises:
    taking each text node in the target text node structure diagram as a separate node grouping;
    for any node grouping, fusing the node grouping with each adjacent node grouping in turn to obtain multiple grouping modules, each grouping module comprising the node grouping formed by fusing the node grouping with one of its adjacent node groupings, and the remaining individual node groupings;
    calculating a grouping modularity of each grouping module, the grouping modularity representing the classification effect of the grouping module;
    taking, among the multiple grouping modules, the grouping module corresponding to the maximum grouping modularity as the current target grouping module, and determining whether the grouping modularity of the current target grouping module is less than or equal to that of the previous target grouping module;
    if the grouping modularity of the current target grouping module is determined to be less than or equal to that of the previous target grouping module, taking the previous target grouping module as the target grouping of the multiple texts to be classified; and
    if the grouping modularity of the current target grouping module is determined to be greater than that of the previous target grouping module, performing again, for each node grouping in the current target grouping module, the step of fusing the node grouping with each adjacent node grouping to obtain multiple grouping modules, until the target grouping of each text to be classified is obtained.
  5. The text classification method according to claim 4, wherein the grouping modularity of the grouping module is obtained by adding the modularity of the fused node grouping that forms the grouping module to the modularities of the remaining individual node groupings; and
    the calculating a grouping modularity of each grouping module comprises:
    for any grouping module, determining a first number of node lines contained in a current node grouping of the grouping module, and a second number of node lines connecting each of the remaining individual node groupings to the current node grouping, the current node grouping being the grouping in the grouping module whose modularity is currently to be calculated;
    calculating the modularity of the current node grouping according to the first number and the second number;
    taking each of the remaining individual node groupings in turn as the current node grouping, and calculating the modularity of each of the remaining individual node groupings in the grouping module; and
    taking the sum of the modularity of the current node grouping and the modularities of the remaining individual node groupings as the grouping modularity of the grouping module.
  6. The text classification method according to claim 4 or 5, wherein after the classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified, the method further comprises:
    determining a classification range corresponding to each node grouping in the target grouping, the classification range containing multiple target classifications; and
    for any node grouping, performing secondary classification on the multiple text nodes contained in the node grouping according to preset keywords respectively corresponding to the multiple target classifications in the classification range, to obtain a final target grouping of the multiple texts to be classified.
  7. The text classification method according to claim 6, wherein the performing secondary classification on the multiple text nodes contained in the node grouping according to the preset keywords respectively corresponding to the multiple target classifications in the classification range, to obtain the final target grouping of the multiple texts to be classified, comprises:
    for the multiple text nodes contained in any node grouping, determining whether there is any text node whose corresponding text to be classified does not contain the preset keywords of the multiple target classifications;
    if it is determined that there is a text node whose corresponding text to be classified does not contain the preset keywords of the multiple target classifications, obtaining the text segments of the text to be classified; and
    performing secondary classification on the text to be classified according to the text segments and multiple preset texts of known target classification, to obtain the final target grouping.
  8. A text classification apparatus, wherein the apparatus comprises:
    an obtaining module, configured to obtain a text node structure diagram, the text node structure diagram containing multiple text nodes, each text node corresponding to one text to be classified, and every two text nodes being connected by a node line;
    a calculation module, configured to extract a text feature of each text to be classified, and calculate similarity distances between every two text nodes according to the text features;
    a filtering module, configured to filter the node lines in the text node structure diagram according to the similarity distances, to obtain a target text node structure diagram containing each text node and the remaining node lines; and
    a first classification module, configured to classify the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
  9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following:
    obtaining a text node structure diagram, the text node structure diagram containing multiple text nodes, each text node corresponding to one text to be classified, and every two text nodes being connected by a node line;
    extracting a text feature of each text to be classified, and calculating similarity distances between every two text nodes according to the text features;
    filtering the node lines in the text node structure diagram according to the similarity distances, to obtain a target text node structure diagram containing each text node and the remaining node lines; and
    classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
  10. The terminal device according to claim 9, wherein the text to be classified contains multiple text segments, the text feature consists of word vectors of the multiple text segments, and the similarity distance comprises a Euclidean distance; and the processor, when executing the computer program, further implements:
    extracting the word vectors of the multiple text segments in each text to be classified, and determining the word order of each of the multiple text segments in its corresponding text to be classified; and
    for any two text nodes, calculating the Euclidean distance of each text segment according to the word vectors at the same word order in the two texts to be classified, and summing the Euclidean distances of the text segments to obtain the Euclidean distance between the two text nodes.
  11. The terminal device according to claim 9 or 10, wherein the processor, when executing the computer program, further implements:
    for any similarity distance, determining whether the similarity distance is less than or equal to a preset threshold;
    if the similarity distance is determined to be less than or equal to the preset threshold, deleting the node line corresponding to the similarity distance; and
    if the similarity distance is determined to be greater than the preset threshold, retaining the node line corresponding to the similarity distance; and
    generating the target text node structure diagram based on the remaining node lines and the multiple text nodes.
  12. The terminal device according to claim 9 or 10, wherein the processor, when executing the computer program, further implements:
    taking each text node in the target text node structure diagram as a separate node grouping;
    for any node grouping, fusing the node grouping with each adjacent node grouping in turn to obtain multiple grouping modules, each grouping module comprising the node grouping formed by fusing the node grouping with one of its adjacent node groupings, and the remaining individual node groupings;
    calculating a grouping modularity of each grouping module, the grouping modularity representing the classification effect of the grouping module;
    taking, among the multiple grouping modules, the grouping module corresponding to the maximum grouping modularity as the current target grouping module, and determining whether the grouping modularity of the current target grouping module is less than or equal to that of the previous target grouping module;
    if the grouping modularity of the current target grouping module is determined to be less than or equal to that of the previous target grouping module, taking the previous target grouping module as the target grouping of the multiple texts to be classified; and
    if the grouping modularity of the current target grouping module is determined to be greater than that of the previous target grouping module, performing again, for each node grouping in the current target grouping module, the step of fusing the node grouping with each adjacent node grouping to obtain multiple grouping modules, until the target grouping of each text to be classified is obtained.
  13. The terminal device according to claim 12, wherein the grouping modularity of the grouping module is obtained by adding the modularity of the fused node grouping that forms the grouping module to the modularities of the remaining individual node groupings; and the processor, when executing the computer program, further implements:
    for any grouping module, determining a first number of node lines contained in a current node grouping of the grouping module, and a second number of node lines connecting each of the remaining individual node groupings to the current node grouping, the current node grouping being the grouping in the grouping module whose modularity is currently to be calculated;
    calculating the modularity of the current node grouping according to the first number and the second number;
    taking each of the remaining individual node groupings in turn as the current node grouping, and calculating the modularity of each of the remaining individual node groupings in the grouping module; and
    taking the sum of the modularity of the current node grouping and the modularities of the remaining individual node groupings as the grouping modularity of the grouping module.
  14. The terminal device according to claim 12 or 13, wherein the processor, when executing the computer program, further implements:
    determining a classification range corresponding to each node grouping in the target grouping, the classification range containing multiple target classifications; and
    for any node grouping, performing secondary classification on the multiple text nodes contained in the node grouping according to preset keywords respectively corresponding to the multiple target classifications in the classification range, to obtain a final target grouping of the multiple texts to be classified.
  15. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the following:
    obtaining a text node structure diagram, the text node structure diagram containing multiple text nodes, each text node corresponding to one text to be classified, and every two text nodes being connected by a node line;
    extracting a text feature of each text to be classified, and calculating similarity distances between every two text nodes according to the text features;
    filtering the node lines in the text node structure diagram according to the similarity distances, to obtain a target text node structure diagram containing each text node and the remaining node lines; and
    classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
  16. The computer-readable storage medium according to claim 15, wherein the text to be classified comprises a plurality of text word segments, the text features are composed of word vectors of the plurality of text word segments, and the similarity distance comprises a Euclidean distance; the computer program, when executed by the processor, further implements:
    extracting word vectors of the plurality of text word segments in each text to be classified, and determining the word order of the plurality of text word segments in their respective texts to be classified;
    for any two text nodes, calculating a Euclidean distance for each text word segment according to the word vectors at the same word order in the texts to be classified, and summing the Euclidean distances of the text word segments to obtain the Euclidean distance between the two text nodes.
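The per-position summation in claim 16 can be illustrated as follows. This is a minimal sketch, not the claimed implementation: the claims do not specify how texts of unequal length are handled, so zero-padding the shorter word-vector sequence is an assumption made here.

```python
import math

def text_distance(vectors_a, vectors_b):
    """Similarity distance between two text nodes per claim 16: the
    Euclidean distances between word vectors at the same word order are
    computed per position, then summed. Zero-padding the shorter text so
    that positions align is an assumption of this sketch."""
    if not vectors_a and not vectors_b:
        return 0.0
    dim = len(vectors_a[0]) if vectors_a else len(vectors_b[0])
    n = max(len(vectors_a), len(vectors_b))
    pad = [0.0] * dim
    a = vectors_a + [pad] * (n - len(vectors_a))
    b = vectors_b + [pad] * (n - len(vectors_b))
    total = 0.0
    for va, vb in zip(a, b):
        # Euclidean distance of one aligned word-vector pair.
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
    return total
```

For example, two texts whose first word vectors differ by (3, 4) and whose second word vectors coincide are at distance 5.0.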
  17. The computer-readable storage medium according to claim 15 or 16, wherein the text to be classified comprises a plurality of text word segments, the text features are composed of word vectors of the plurality of text word segments, and the similarity distance comprises a Euclidean distance; the computer program, when executed by the processor, further implements:
    for any similarity distance, determining whether the similarity distance is less than or equal to a preset threshold;
    if it is determined that the similarity distance is less than or equal to the preset threshold, deleting the node line corresponding to that similarity distance; and
    if it is determined that the similarity distance is greater than the preset threshold, retaining the node line corresponding to that similarity distance;
    generating the target text node structure graph based on the remaining node lines and the plurality of text nodes.
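The filtering rule of claim 17 reduces to a single pass over the node lines. A minimal sketch, assuming node lines are represented as `(node_a, node_b, distance)` tuples (a representation not specified by the claims):

```python
def filter_node_lines(edges, threshold):
    """Node-line filtering per claim 17: lines whose similarity distance
    is less than or equal to the preset threshold are deleted; lines
    whose distance is greater than the threshold are retained."""
    return [(a, b, d) for (a, b, d) in edges if d > threshold]
```

The retained tuples, together with the unchanged set of text nodes, form the target text node structure graph.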
  18. The computer-readable storage medium according to claim 15, wherein the computer program, when executed by the processor, further implements:
    treating each text node in the target text node structure graph as a separate node group;
    for any node group, fusing that node group with any adjacent node group each time to obtain a plurality of grouping modules, each grouping module comprising the node group formed by fusing that node group with one of its adjacent node groups, together with the remaining separate node groups;
    calculating a grouping modularity of each grouping module, the grouping modularity representing the classification effect of the grouping module;
    among the plurality of grouping modules, taking the grouping module with the maximum grouping modularity as the current target grouping module, and determining whether the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the previous target grouping module;
    if it is determined that the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the previous target grouping module, taking the previous target grouping module as the target grouping of the plurality of texts to be classified;
    if it is determined that the grouping modularity of the current target grouping module is greater than the grouping modularity of the previous target grouping module, performing again, for each node group in the current target grouping module, the step of fusing that node group with any adjacent node group to obtain a plurality of grouping modules, until the target grouping of each text to be classified is obtained.
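The iteration in claim 18 is a greedy agglomerative loop in the style of the Louvain community-detection method: start from singleton groups, at each round keep the fusion of two adjacent groups that maximizes grouping modularity, and stop once modularity no longer increases. A minimal sketch, where `modularity` is a caller-supplied scoring function over a partition (the claims leave its formula to claim 19):

```python
def greedy_grouping(nodes, adjacency, modularity):
    """Greedy merge loop per claim 18. `adjacency` maps each node to the
    set of nodes it shares a node line with; `modularity` scores a
    partition (a list of node sets). Returns the partition in force when
    modularity stops improving, i.e. the previous target grouping module."""
    groups = [{n} for n in nodes]
    best_score = modularity(groups)
    while True:
        candidates = []
        for i, gi in enumerate(groups):
            for j, gj in enumerate(groups):
                # Only fuse groups connected by at least one node line.
                if i < j and any(b in gj for a in gi for b in adjacency[a]):
                    merged = [g for k, g in enumerate(groups) if k not in (i, j)]
                    merged.append(gi | gj)
                    candidates.append((modularity(merged), merged))
        if not candidates:
            return groups
        score, partition = max(candidates, key=lambda c: c[0])
        if score <= best_score:
            # Claim 18: modularity did not rise, so the previous
            # grouping module is the target grouping.
            return groups
        groups, best_score = partition, score
```

On a graph with two disconnected edges (1-2 and 3-4) and a modularity function counting intra-group edges, the loop converges to the two groups {1, 2} and {3, 4}.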
  19. The computer-readable storage medium according to claim 18, wherein the grouping modularity of a grouping module is obtained by adding the modularity of the node group formed by fusion in the grouping module to the modularities of the remaining separate node groups; the computer program, when executed by the processor, further implements:
    for any grouping module, determining a first number of node lines contained in a current node group in the grouping module, and a second number of node lines by which the remaining separate node groups are respectively connected to the current node group, the current node group being the group in the grouping module whose modularity is currently to be calculated;
    calculating the modularity of the current node group according to the first number and the second number;
    taking the remaining separate node groups in turn as the current node group, and calculating the modularity of each of the remaining separate node groups in the grouping module;
    taking the sum of the modularity of the current node group and the modularities of the remaining separate node groups as the grouping modularity of the grouping module.
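Claim 19 derives each group's modularity from two counts (internal node lines, and node lines connecting other groups to it) and sums the per-group values. The claims do not state the exact formula, so the standard Newman modularity term is assumed in this sketch: the fraction of edges inside the group minus the squared fraction of edge endpoints attached to it.

```python
def group_term(first_number, second_number, total_edges):
    """Modularity of one node group from the counts named in claim 19.
    first_number: node lines inside the group; second_number: node lines
    connecting other groups to it. The Newman-style formula used here is
    an assumption; the claims only name the two inputs."""
    e_in = first_number / total_edges
    degree_frac = (2 * first_number + second_number) / (2 * total_edges)
    return e_in - degree_frac ** 2

def grouping_modularity(groups, total_edges):
    """Grouping modularity of a grouping module per claim 19: the sum of
    the per-group modularities over all node groups in the module."""
    return sum(group_term(fn, sn, total_edges) for fn, sn in groups)
```

For a module of two groups, each holding one internal node line and no external ones in a two-edge graph, each term is 0.5 − 0.25 = 0.25 and the grouping modularity is 0.5.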
  20. The computer-readable storage medium according to claim 18 or 19, wherein the computer program, when executed by the processor, further implements:
    determining a classification range corresponding to each node group in the target grouping, the classification range comprising a plurality of target categories;
    for any node group, performing secondary classification on the plurality of text nodes contained in the node group according to preset keywords respectively corresponding to the plurality of target categories in the classification range, to obtain a final target grouping of the plurality of texts to be classified.
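The secondary classification of claims 14 and 20 can be sketched as keyword scoring within one node group. The claims only say that preset keywords per target category drive the classification; scoring by substring-occurrence count and taking the best-scoring category are assumptions of this illustration.

```python
def secondary_classify(node_texts, category_keywords):
    """Secondary classification per claim 20: each text node in a node
    group is assigned to the target category (within the group's
    classification range) whose preset keywords it matches most often.
    node_texts: {node_id: text}; category_keywords: {category: [keyword, ...]}."""
    result = {}
    for node_id, text in node_texts.items():
        scores = {cat: sum(text.count(kw) for kw in kws)
                  for cat, kws in category_keywords.items()}
        # Hypothetical tie-breaking: max() keeps the first best category.
        result[node_id] = max(scores, key=scores.get)
    return result
```

For instance, a text mentioning "refund" twice lands in a hypothetical "billing" category rather than "tech" when "refund" is among billing's preset keywords.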
PCT/CN2021/090954 2020-12-28 2021-04-29 Text classification method and apparatus, and terminal device and storage medium WO2022142025A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011586743.3 2020-12-28
CN202011586743.3A CN112632280B (en) 2020-12-28 2020-12-28 Text classification method and device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022142025A1 true WO2022142025A1 (en) 2022-07-07

Family ID: 75286176

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090954 WO2022142025A1 (en) 2020-12-28 2021-04-29 Text classification method and apparatus, and terminal device and storage medium

Country Status (2)

Country Link
CN (1) CN112632280B (en)
WO (1) WO2022142025A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632280B (en) * 2020-12-28 2022-05-24 平安科技(深圳)有限公司 Text classification method and device, terminal equipment and storage medium
CN115767204A (en) * 2022-11-10 2023-03-07 北京奇艺世纪科技有限公司 Video processing method, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570178A (en) * 2016-11-10 2017-04-19 重庆邮电大学 High-dimension text data characteristic selection method based on graph clustering
US20180181639A1 (en) * 2016-12-23 2018-06-28 Alcatel-Lucent Usa Inc. Method and apparatus for data-driven face-to-face interaction detection
CN108388651A (en) * 2018-02-28 2018-08-10 北京理工大学 A kind of file classification method based on the kernel of graph and convolutional neural networks
CN110929509A (en) * 2019-10-16 2020-03-27 上海大学 Louvain community discovery algorithm-based field event trigger word clustering method
CN112632280A (en) * 2020-12-28 2021-04-09 平安科技(深圳)有限公司 Text classification method and device, terminal equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10671936B2 (en) * 2017-04-06 2020-06-02 Universite Paris Descartes Method for clustering nodes of a textual network taking into account textual content, computer-readable storage device and system implementing said method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FAN WEICHENT: "Research on Text Analysis Model Based on Community Detection", MASTER THESIS, TIANJIN POLYTECHNIC UNIVERSITY, CN, no. 1, 15 January 2019 (2019-01-15), CN , XP055949270, ISSN: 1674-0246 *

Also Published As

Publication number Publication date
CN112632280B (en) 2022-05-24
CN112632280A (en) 2021-04-09


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912787

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21912787

Country of ref document: EP

Kind code of ref document: A1