CN112632280B - Text classification method and device, terminal equipment and storage medium - Google Patents


Info

Publication number
CN112632280B
CN112632280B (application CN202011586743.3A)
Authority
CN
China
Prior art keywords
text
node
grouping
target
classified
Prior art date
Legal status
Active
Application number
CN202011586743.3A
Other languages
Chinese (zh)
Other versions
CN112632280A
Inventor
马龙
梁宸
周元笙
蒋佳惟
陈思姣
李炫�
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011586743.3A priority Critical patent/CN112632280B/en
Publication of CN112632280A publication Critical patent/CN112632280A/en
Priority to PCT/CN2021/090954 priority patent/WO2022142025A1/en
Application granted
Publication of CN112632280B publication Critical patent/CN112632280B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application is applicable to the technical field of artificial intelligence, and provides a text classification method, a text classification device, a terminal device and a storage medium. The method comprises the following steps: acquiring a text node structure diagram generated from a plurality of texts to be classified; extracting the text features of each text to be classified, and calculating the similar distance between every two text nodes from the text features; filtering the node lines in the text node structure diagram according to the similar distances to obtain a target text node structure diagram containing each text node and the remaining node lines; and classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines to obtain a plurality of target groups of the texts to be classified. With this method, the text node structure diagram generated from the texts to be classified is first preliminarily filtered and then grouped again to obtain the target groups, which improves the accuracy of classifying a plurality of texts to be classified.

Description

Text classification method and device, terminal equipment and storage medium
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a text classification method and device, a terminal device and a storage medium.
Background
At present, text classification is an important aspect of the artificial intelligence field and an important task of information processing; its purpose is to automatically sort unlabeled documents into a predetermined set of classes and so resolve information disorder. Existing text classification methods generally classify text with an existing fastText classifier model or a convolutional neural network model. However, training such a neural network model to classify text takes a lot of time, and classification accuracy is low when the text must be classified at a finer granularity.
Disclosure of Invention
The embodiment of the application provides a text classification method, a text classification device, a terminal device and a storage medium, and can solve the problem that in the prior art, when a text is classified through a neural network model, the text classification accuracy is low.
In a first aspect, an embodiment of the present application provides a text classification method, including:
acquiring a text node structure chart, wherein the text node structure chart comprises a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected through a node line;
Extracting text features of each text to be classified, and respectively calculating the similar distance between every two text nodes according to the text features;
filtering the node lines in the text node structure diagram according to the similar distance to obtain a target text node structure diagram containing each text node and the residual node lines;
classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the residual node lines to obtain a plurality of target groups of the texts to be classified.
In one embodiment, the text to be classified includes a plurality of text segments, the text features are composed of word vectors of the text segments, and the similarity distance includes a euclidean distance; the extracting the text features of each text to be classified and respectively calculating the similar distance between every two text nodes according to the text features comprises the following steps:
extracting word vectors of a plurality of text participles in each text to be classified, and determining word sequences of the text participles in the text to be classified respectively corresponding to the text to be classified;
and aiming at any two text nodes, calculating the Euclidean distance of each text participle according to the word vector with the same word sequence in the text to be classified, and adding the Euclidean distances of each text participle to obtain the Euclidean distance between the two text nodes.
In an embodiment, the filtering the node lines in the text node structure diagram according to the similar distance to obtain a target text node structure diagram including each text node and remaining node lines includes:
judging, for any similarity distance, whether the similarity distance is smaller than or equal to a preset threshold value;
if the similarity distance is judged to be smaller than or equal to a preset threshold value, deleting the node line corresponding to the similarity distance smaller than or equal to the preset threshold value; and (c) a second step of,
if the similarity distance is judged to be larger than a preset threshold value, a node line corresponding to the similarity distance larger than the preset threshold value is reserved;
and generating the target text node structure chart based on the residual node lines and the plurality of text nodes.
In an embodiment, the classifying the texts to be classified in the target text node structure diagram based on the community discovery algorithm and the remaining node lines to obtain a plurality of target groups of the texts to be classified includes:
taking each text node in the target text node structure chart as an independent node grouping;
for any node group, fusing the node group with any adjacent node group each time to obtain a plurality of grouping modules, wherein each grouping module comprises a node group formed by fusing the node group with one adjacent node group and the rest of independent node groups;
Respectively calculating grouping modularity of each grouping module, wherein the grouping modularity is used for representing the classification effect of the grouping modules;
taking the grouping module corresponding to the maximum value of the grouping modularity among the various grouping modules as a current target grouping module, and judging whether the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the last target grouping module;
if the grouping modularity of the current target grouping module is judged to be less than or equal to the grouping modularity of the last target grouping module, taking the last target grouping module as the target grouping of the texts to be classified;
and if the grouping modularity of the current target grouping module is judged to be larger than that of the last target grouping module, the step of fusing any node group and any adjacent node group to obtain a plurality of grouping modules is executed again aiming at each node group in the current target grouping module until the target group of each text to be classified is obtained.
In one embodiment, the grouping modularity of the grouping module is obtained by adding the modularity of the node grouping fused to form the grouping module and the modularity of the other individual node grouping;
The calculating the grouping modularity of each grouping module respectively comprises:
for any grouping module, determining a first number of node lines included in a current node grouping in the grouping module and a second number of node lines connected with the current node grouping by the other independent node groupings respectively, wherein the current node grouping is a grouping of the current modularity to be calculated in the grouping module;
calculating the modularity of the current node grouping according to the first number and the second number;
sequentially taking the other independent node groups as the current node group, and respectively calculating the modularity of the other independent node groups in the grouping module;
and taking the sum of the modularity of the current node grouping and the modularity of the other independent node groupings as the grouping modularity of the grouping module.
In an embodiment, after classifying the texts to be classified in the target text node structure diagram based on the community discovery algorithm and the remaining node lines to obtain a plurality of target groups of the texts to be classified, the method further includes:
determining a classification range corresponding to each node group in the target group, wherein the classification range comprises a plurality of target classifications;
And aiming at any node group, carrying out secondary classification on a plurality of text nodes contained in the node group according to preset keywords respectively corresponding to a plurality of target classifications in the classification range to obtain a final target group of the plurality of texts to be classified.
In an embodiment, the performing secondary classification on a plurality of text nodes included in the node group according to preset keywords respectively corresponding to a plurality of target classifications in the classification range to obtain a final target group of the plurality of texts to be classified includes:
aiming at a plurality of text nodes contained in any node group, judging whether there exists a text node whose corresponding text to be classified contains none of the preset keywords of the target classifications;
if it is judged that such a text node exists, namely the text to be classified corresponding to some text node contains none of the preset keywords of the target classifications, acquiring the text participles in that text to be classified;
and performing secondary classification on the texts to be classified according to the text segmentation and a plurality of preset texts classified by known targets to obtain a final target group.
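The keyword-based check and its fallback can be sketched roughly as follows. This is a minimal illustration only: the function names, the keyword map, and the token-overlap heuristic used for the secondary classification are hypothetical stand-ins, not the patent's actual implementation.

```python
def keyword_classify(tokens, keyword_map):
    """Return the first target classification whose preset keywords appear
    among the text's participles, or None when no keyword matches."""
    for label, keywords in keyword_map.items():
        if set(tokens) & set(keywords):
            return label
    return None

def fallback_classify(tokens, preset_texts):
    """Secondary classification for unmatched texts: pick the known-target
    preset text whose participles overlap most with this text's participles."""
    return max(preset_texts,
               key=lambda label: len(set(tokens) & set(preset_texts[label])))
```

A text that matches no preset keyword (`keyword_classify` returns `None`) would then be routed to `fallback_classify` against the preset texts of known target classification.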
In a second aspect, an embodiment of the present application provides a text classification apparatus, including:
The system comprises an acquisition module, a classification module and a classification module, wherein the acquisition module is used for acquiring a text node structure chart which comprises a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected through a node line;
the calculation module is used for extracting the text characteristics of each text to be classified and respectively calculating the similar distance between every two text nodes according to the text characteristics;
the filtering module is used for filtering the node lines in the text node structure diagram according to the similar distance to obtain a target text node structure diagram containing each text node and the rest node lines;
and the first classification module is used for classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the residual node lines to obtain a plurality of target groups of the texts to be classified.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method according to any one of the above first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium, where a computer program is stored, where the computer program is configured to, when executed by a processor, implement the method of any one of the above first aspects.
In a fifth aspect, an embodiment of the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to execute the method described in any one of the above first aspects.
Compared with the prior art, the embodiment of the application has the advantages that: the method comprises the steps of generating a text node structure diagram by taking each text to be classified as a text node, then calculating the similar distance between the texts to be classified, and preliminarily filtering the text node structure diagram to obtain a target text node structure diagram, so as to realize the preliminary grouping of the texts to be classified. And then, classifying the texts to be classified in the target text node structure diagram according to a community discovery algorithm and the residual node lines, and grouping the texts to be classified again to obtain a target group. Therefore, the accuracy of classifying a plurality of texts to be classified is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a flowchart illustrating an implementation of a text classification method according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating an implementation manner of S103 of a text classification method according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating an implementation manner of S104 in a text classification method according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating an implementation manner of S1043 in a text classification method according to an embodiment of the application;
FIG. 5 is a flowchart illustrating an implementation of a text classification method according to another embodiment of the present application;
fig. 6 is a schematic diagram illustrating an implementation manner of S104B of a text classification method according to an embodiment of the present application;
fig. 7 is a block diagram illustrating a structure of a text classification apparatus according to an embodiment of the present application;
Fig. 8 is a block diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The text classification method provided by the embodiment of the application can be applied to terminal devices such as a tablet computer, a notebook computer, a super-mobile personal computer (UMPC), a netbook and the like, and the embodiment of the application does not limit the specific types of the terminal devices.
S101, a text node structure chart is obtained, the text node structure chart comprises a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected through a node line.
In application, the texts to be classified may be texts such as papers, periodicals, or magazines, or may be a sentence or a paragraph, which is not limited herein. For a plurality of texts to be classified, each text to be classified can be respectively used as a text node, and the plurality of text nodes are connected through a node line to generate a text node structure diagram. The text to be classified can be pre-stored in a designated storage path of the terminal equipment and then acquired by the terminal equipment; the text to be classified may also be transmitted in real time for the user, which is not limited to this.
It will be appreciated that the number of node lines described above depends on the number of text nodes. Illustratively, the number of node lines is 1 when the number of text nodes is 2, 3 when the number of text nodes is 3, 6 when the number of text nodes is 4, and so on. Specifically, y = n(n − 1)/2, where n is the number of text nodes (an integer, n ≥ 2) and y is the number of node lines.
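The pair count above can be checked with a short sketch (the names are hypothetical): a fully connected graph over n text nodes has n(n − 1)/2 node lines, one per pair of nodes.

```python
from itertools import combinations

def node_line_count(n: int) -> int:
    """Number of node lines in a fully connected graph of n text nodes."""
    return (n - 1) * n // 2

# Every pair of text nodes is joined by exactly one node line.
nodes = ["t1", "t2", "t3", "t4"]
node_lines = list(combinations(nodes, 2))
```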
S102, extracting text features of each text to be classified, and respectively calculating the similar distance between every two text nodes according to the text features.
In application, the similarity distance may be understood as a similarity degree between two texts to be classified. Specifically, the terminal device may extract text features of each text to be classified, and calculate a similarity between two text features as a similarity distance between two texts to be classified.
Illustratively, extracting the text features of a text to be classified may begin by segmenting the text to obtain a plurality of text participles. A word consistent with each text participle can then be looked up in a preset word vector library, and the sequence number of that participle determined. Finally, the word feature of each text participle can be generated from its sequence number. Looking up a word consistent with a text participle in the preset word vector library can be implemented with a forward matching algorithm. Specifically, if the longest word in the preset lexicon is 5 characters long, the first through fifth characters of the text to be classified are taken as an initial participle, and it is determined whether this initial participle exists in the word vector library. If it exists, the initial participle is taken as a target participle, and matching continues with the subsequent characters. If the initial participle does not exist in the word vector library, the character length is reduced from right to left: that is, the first through fourth characters of the text to be classified are taken as the initial participle, and the lookup is repeated. In this way the text to be classified is segmented into its text participles.
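The forward maximum matching procedure described above might be sketched as follows. This is a toy example under stated assumptions: the lexicon and function names are hypothetical, and Latin letters stand in for the characters of the text to be classified.

```python
def forward_max_match(text, lexicon, max_len=5):
    """Greedy forward maximum matching: try the longest candidate first and
    shrink from right to left until a lexicon word (or a single character)
    is found, then continue from the next position."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens
```

For instance, with lexicon {"ab", "abc"} the text "abcab" segments into ["abc", "ab"], because the longer candidate "abc" is tried and accepted first.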
In application, after the plurality of text participles are determined, the word feature of each text participle can be generated from its sequence number in the word vector library. Specifically, the word feature dimension of each text participle is preset. Then, the one-dimensional value of each text participle (the number corresponding to its sequence number) is mapped to a multi-dimensional vector space. For example, for a text participle with sequence number "5" and a feature vector dimension of 10, the word feature may be [0, 0, 0, 0, 1, 0, 0, 0, 0, 0], so that the word feature of each text participle is obtained. On this basis, the text features of a text to be classified can be represented by the word features of its text participles. The similar distance between every two text nodes can then be calculated by referring to the following steps:
the similar distance comprises a euclidean distance; in the step S102 of extracting the text feature of each text to be classified and calculating the similarity distance between every two text nodes according to the text feature, the method specifically includes the following substeps:
extracting word vectors of a plurality of text participles in each text to be classified, and determining word sequences of the text participles in the text to be classified.
And aiming at any two text nodes, calculating the Euclidean distance of each text word according to the word vectors of the same word sequence in the text to be classified, and adding the Euclidean distances of each text word to obtain the Euclidean distance between every two text nodes.
In the application, the above-mentioned extraction manner of the text participles in the text to be classified and the text participle word vector are already explained in the above-mentioned S102, which will not be explained again. It is understood that, since the text to be classified is composed of a plurality of characters in succession, each character has a corresponding character sequence in the text to be classified. Based on the above, the plurality of text participles obtained by participling the text to be classified in the character matching manner also have corresponding word sequences.
In application, the euclidean distance may be calculated as:
ρ(a, b) = Σᵢ₌₁ⁿ ‖aᵢ − bᵢ‖ = Σᵢ₌₁ⁿ √( Σⱼ (aᵢⱼ − bᵢⱼ)² )
wherein n is the number of text participles, aᵢ represents the word vector of the i-th text participle in the text a to be classified, bᵢ represents the word vector of the i-th text participle in the text b to be classified, and ρ(a, b) represents the similar distance between the text a to be classified and the text b to be classified. If the number m1 of participles of the text a to be classified is greater than the number m2 of participles of the text b to be classified, n can be determined as m1, and the word features of the participles of the text b to be classified beyond the m2-th are all represented by zero vectors. In other applications, the similar distance may also be calculated as the cosine distance between every two text nodes according to the text features, which is not limited here.
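Putting the one-hot word features and the summed per-participle Euclidean distance together gives the following rough sketch. The names are hypothetical, and the shorter text is padded with zero vectors as described above.

```python
import math

def word_feature(seq_no, dim):
    """One-hot word feature: component seq_no (1-based) set to 1, rest 0."""
    vec = [0.0] * dim
    vec[seq_no - 1] = 1.0
    return vec

def text_distance(a, b):
    """Similar distance between two texts, each given as a list of word
    vectors: per-participle Euclidean distances are summed, with the
    shorter text padded by zero vectors."""
    dim = len(a[0]) if a else len(b[0])
    n = max(len(a), len(b))
    pad = [0.0] * dim
    a = a + [pad] * (n - len(a))
    b = b + [pad] * (n - len(b))
    return sum(
        math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
        for va, vb in zip(a, b)
    )
```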
S103, filtering the node lines in the text node structure diagram according to the similar distance to obtain a target text node structure diagram containing each text node and the rest node lines.
In application, the filtering processing is performed on the node line, and it may be considered that the node line with the similar distance smaller than the preset threshold is deleted. Thus, a target text node structure chart consisting of the text nodes and the residual node lines is obtained. It can be understood that the similar distance between every two text nodes is the length of the node line between every two text nodes.
Specifically, a preset threshold value can be prestored in the terminal device, and when the similarity distance is greater than the preset threshold value, a node line between the texts to be classified is reserved; and when the similarity distance is smaller than or equal to a preset threshold value, deleting the node lines between the texts to be classified. It can be understood that, in the node structure diagram of the target text obtained at this time, two texts to be classified connected by a node line have certain similarity, and when clustering and grouping are performed on the two texts to be classified, the probability that the two texts to be classified belong to the same category group is considered to be higher.
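A minimal sketch of this filtering step follows; the names and distance values are hypothetical. Per the description, node lines whose similar distance is at or below the preset threshold are deleted, and the rest are kept.

```python
def filter_node_lines(lines, distances, threshold):
    """Keep node lines whose similar distance exceeds the preset threshold;
    lines at or below the threshold are deleted."""
    return [line for line in lines if distances[line] > threshold]

# Hypothetical three-node example.
edges = [("t1", "t2"), ("t1", "t3"), ("t2", "t3")]
dist = {("t1", "t2"): 0.9, ("t1", "t3"): 0.2, ("t2", "t3"): 0.7}
remaining = filter_node_lines(edges, dist, threshold=0.5)
```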
And S104, classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the residual node lines to obtain a plurality of target groups of the texts to be classified.
In application, the community discovery algorithm includes, but is not limited to, Leiden algorithm and Louvain algorithm, which is not limited to this. The community discovery algorithm is a modularity-based classification algorithm that can be used to group multiple hierarchical (different classes) objects (text to be classified). Wherein, the modularity may be considered as a measure for evaluating the effect of each grouping after the plurality of objects are grouped differently. It is understood that a plurality of texts to be classified are classified, at least one text to be classified in each class. For example, if 5 texts to be classified are classified, and the target group includes 3 classes, the number of texts to be classified included in each class group may be 1 or 2, and this is not limited.
Specifically, after the target text node structure diagram is classified, a plurality of target groups of texts to be classified are obtained. To determine the overall modularity Q of the target group, the modularity Qi of each class of group in the target group is calculated. Specifically, the calculation formula is as follows:
Qᵢ = Σin / (2m) − ( Σtot / (2m) )²
wherein Qᵢ is the modularity of the i-th class group, m is the sum of the similar distances of all node lines across all class groups, Σin is the sum of the similar distances of all node lines within the i-th class group, and Σtot is the sum of the similar distances of all node lines connected to text nodes in the i-th class group. On this basis, the modularity corresponding to each class group after each classification can be obtained. Further, the plurality of modularities are added to obtain the overall modularity Q of the target grouping.
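Under the definitions above, the per-group modularity can be computed as in this sketch (hypothetical names; the formula is the standard community-modularity form consistent with the quantities described here):

```python
def group_modularity(sigma_in, sigma_tot, m):
    """Qi = Σin/(2m) − (Σtot/(2m))², where m is the total node-line weight,
    Σin the weight inside the group, and Σtot the weight of all lines
    touching the group's text nodes."""
    return sigma_in / (2 * m) - (sigma_tot / (2 * m)) ** 2
```

Summing `group_modularity` over all class groups then yields the overall modularity Q of the target grouping.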
In this embodiment, each text to be classified is used as a text node, a text node structure diagram is generated, then a similar distance between the texts to be classified is calculated, and the text node structure diagram is preliminarily filtered to obtain a target text node structure diagram, so that preliminary grouping of the texts to be classified is realized. And then, classifying the texts to be classified in the target text node structure diagram according to a community discovery algorithm and the residual node lines, and grouping the texts to be classified again to obtain a target group. Therefore, the accuracy of classifying a plurality of texts to be classified is improved.
Referring to fig. 2, in a specific embodiment, in step S103, according to the similar distance, filtering the node lines in the text node structure diagram to obtain a target text node structure diagram including each text node and remaining node lines, the following substeps S1031 to S1034 are specifically included, which are detailed as follows:
S1031, for any similar distance, judging whether the similar distance is smaller than or equal to a preset threshold value;
and S1032, if the similarity distance is judged to be smaller than or equal to a preset threshold value, deleting the node line corresponding to the similarity distance smaller than or equal to the preset threshold value. And (c) a second step of,
And S1033, if the similarity distance is judged to be larger than a preset threshold value, reserving a node line corresponding to the similarity distance larger than the preset threshold value.
S1034, generating the target text node structure chart based on the residual node lines and the text nodes.
In application, the preset threshold may be set by a user according to actual conditions, or may be a fixed value preset in the terminal device, which is not limited to this. In S103, it has been described that when the similarity distance is smaller than or equal to the preset threshold value, the node line corresponding to the similarity distance is deleted. And when the similar distance is larger than a preset threshold value, reserving the node line corresponding to the similar distance. Thus, the target text node structure diagram is generated, which will not be described again.
Referring to fig. 3, in a specific embodiment, in S104, based on the community discovery algorithm and the remaining node lines, the method specifically includes the following substeps S1041-S1046 in classifying the texts to be classified in the target text node structure diagram to obtain a plurality of target groups of texts to be classified, which are described in detail as follows:
S1041, taking each text node in the target text node structure chart as an independent node grouping.
In the application, when a plurality of texts to be classified are classified, text nodes corresponding to each text to be classified are grouped as one node. And then, sequentially carrying out the same step processing on each node group to obtain a plurality of target groups of texts to be classified.
And S1042, for any node group, fusing the node group with any adjacent node group each time to obtain a plurality of grouping modules, wherein each grouping module comprises a node group formed by fusing the node group with one of the adjacent node groups, and the rest of the node groups.
In application, in the target text node structure diagram, node lines connecting text nodes with other text nodes also exist. Based on this, for any node grouping of text nodes, any node grouping can be fused with any adjacent node grouping at a time to obtain a fused node grouping and the rest of the individual node groupings.
Illustratively, if there are k texts to be classified, there are k node groupings. For any node grouping i, node grouping i can be fused with an adjacent node grouping j, that is, the node groupings of two texts to be classified are taken as one group. The resulting grouping module then contains k-1 node groupings: one fused node grouping and k-2 remaining individual node groupings. The other node groupings are likewise processed as in S1042, each yielding a corresponding grouping module.
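The fusion step can be sketched as follows, assuming node groupings are represented as sets of text nodes; the four singleton groupings are hypothetical.

```python
def fuse(groupings, i, j):
    """Fuse groupings i and j into one; the remaining groupings stay
    separate, so k groupings become k-1 (one fused, k-2 individual)."""
    fused = groupings[i] | groupings[j]
    rest = [g for idx, g in enumerate(groupings) if idx not in (i, j)]
    return [fused] + rest

groupings = [{"a"}, {"b"}, {"c"}, {"d"}]   # k = 4 singleton node groupings
module = fuse(groupings, 0, 1)
# module has 3 groupings: the fused {"a", "b"} plus {"c"} and {"d"}
```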
And S1043, respectively calculating the grouping modularity of each grouping module, wherein the grouping modularity is used for expressing the classification effect of the grouping modules.
In application, after a grouping module is determined, the grouping modularity of the grouping module is obtained by adding the modularity of the fused node grouping to the modularity of each of the remaining individual node groupings. That is, for a grouping module containing k-1 node groupings, the modularity of each of the k-1 node groupings needs to be calculated and summed to obtain the overall modularity (grouping modularity) of the grouping module. Specifically, Qi may be calculated by the formula given in S104, where Q represents the overall modularity (grouping modularity) of the grouping module and Qi represents the modularity of the ith node grouping among the k-1 node groupings contained in the grouping module, i being an integer with i ≦ k-1.
In application, if node grouping i has another adjacent node grouping x, the above steps are repeated for node grouping i and node grouping x, obtaining the grouping modularity of another grouping module. Then, after node grouping i has been processed, steps S1042 and S1043 are repeated for the remaining individual node groupings (j, x, or any other node grouping). In this way, the plurality of grouping modules produced by classifying each node grouping, and the grouping modularity corresponding to each grouping module, can be obtained.
S1044, among the plurality of grouping modules, taking the grouping module corresponding to the maximum grouping modularity as the current target grouping module, and judging whether the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the last target grouping module.
In application, after the grouping modularity of each grouping module is obtained, the maximum value of the grouping modularity can be determined according to the numerical value of the grouping modularity, and the grouping module corresponding to the grouping modularity with the maximum value is used as the current target grouping module.
S1045, if the grouping modularity of the current target grouping module is judged to be less than or equal to the grouping modularity of the last target grouping module, taking the last target grouping module as the target grouping of the texts to be classified;
and S1046, if the grouping modularity of the current target grouping module is judged to be greater than that of the last target grouping module, for each node group in the current target grouping module, performing the step of fusing any node group with any adjacent node group again to obtain a plurality of grouping modules until the target group of each text to be classified is obtained.
In application, after the current target grouping module is determined, whether the current target grouping module is an optimal grouping in a plurality of texts to be classified needs to be determined. Based on the above, the grouping modularity of the current target grouping module needs to be compared with the grouping modularity of the last target grouping module to determine whether the current target grouping is the optimal grouping. Specifically, if the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the previous target grouping module, it indicates that the previous target grouping module is the optimal grouping (target grouping) of the texts to be classified. If the grouping modularity of the current target grouping module is larger than that of the last target grouping module, the last target grouping module is not the optimal grouping. Then, in order to determine whether the current target grouping belongs to the optimal grouping, S1042 to S1045 are repeatedly executed on each node grouping in the current target grouping module until a target grouping of each text to be classified is obtained.
It should be noted that when S1042 to S1045 are repeatedly executed, S1042 operates on each node grouping in the current target grouping module. That is, if the current target grouping module contains the node grouping formed by fusing node grouping i and node grouping j, together with the remaining separate node groupings, then the number of node groupings when S1042 is executed is k-1. At this time, the node grouping formed by fusing node grouping i and node grouping j is treated as a single node grouping.
Specifically, for the grouping modularity Qk-1 of the current target grouping module, if Qk-1 is less than or equal to the grouping modularity Qk of the previous target grouping module, it indicates that the classification effect of the current target grouping module is less than or equal to the classification effect of the previous target grouping module. Therefore, the last target grouping module is shown as the target grouping with the optimal effect. If the grouping modularity Qk-1 of the current target grouping module is larger than the grouping modularity Qk of the last target grouping module, the classification effect of the current target grouping module is higher than that of the last target grouping module. And then repeating the steps S1042 to S1045 for k-1 node groups of the current target grouping module until a target grouping module Q with the optimal classification effect is obtained.
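The loop of S1041-S1046 can be sketched as follows: start from singleton node groupings, evaluate every fusion of adjacent groupings, keep the candidate grouping module with the highest grouping modularity, and stop as soon as it no longer improves. The exact Qi formula of S104 is not reproduced in this excerpt, so the standard Newman modularity is assumed here, and the seven-edge graph (two triangles joined by one bridge line) is hypothetical.

```python
from itertools import combinations

def modularity(groupings, edges):
    """Grouping modularity: sum over groupings of l_i/m - (d_i/(2m))^2,
    where l_i counts node lines inside a grouping and d_i its total degree.
    (Standard Newman definition, assumed in place of S104's Qi formula.)"""
    m = len(edges)
    total = 0.0
    for g in groupings:
        inside = sum(1 for u, v in edges if u in g and v in g)       # "first number"
        crossing = sum(1 for u, v in edges if (u in g) != (v in g))  # "second number"
        degree = 2 * inside + crossing
        total += inside / m - (degree / (2 * m)) ** 2
    return total

def greedy_merge(nodes, edges):
    """S1041: each text node starts as its own grouping. S1042-S1044: fuse
    each pair of adjacent groupings and keep the grouping module with the
    maximum grouping modularity. S1045-S1046: stop when it stops rising."""
    groupings = [frozenset([n]) for n in nodes]
    best_q = modularity(groupings, edges)
    while len(groupings) > 1:
        candidates = []
        for gi, gj in combinations(groupings, 2):
            adjacent = any((u in gi and v in gj) or (u in gj and v in gi)
                           for u, v in edges)
            if adjacent:  # only adjacent groupings may be fused
                merged = [gi | gj] + [g for g in groupings if g not in (gi, gj)]
                candidates.append((modularity(merged, edges), merged))
        if not candidates:
            break
        q, partition = max(candidates, key=lambda c: c[0])
        if q <= best_q:   # current target module no better than the last one
            break
        best_q, groupings = q, partition
    return groupings

# Two triangles {a,b,c} and {d,e,f} joined by the single bridge line c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
target_groups = greedy_merge("abcdef", edges)
# The loop recovers the two triangles as the final target groupings.
```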
Referring to fig. 4, in a specific embodiment, the grouping modularity of a grouping module is obtained by adding the modularity of the fused node grouping constituting the grouping module to the modularities of the remaining individual node groupings; the step of calculating the grouping modularity of each grouping module in S1043 specifically includes the following substeps S10431-S10434, which are detailed as follows:
S10431, for any grouping module, determining a first number of node lines contained within the current node grouping in the grouping module and a second number of node lines connecting the remaining individual node groupings to the current node grouping, wherein the current node grouping is the grouping whose modularity is currently to be calculated in the grouping module.
And S10432, calculating the modularity of the current node grouping according to the first quantity and the second quantity.
And S10433, sequentially taking the other independent node groups as the current node group, and respectively calculating the modularity of the other independent node groups in the grouping module.
And S10434, taking the sum of the modularity of the current node grouping and the modularity of the other independent node grouping as the grouping modularity of the grouping module.
In application, the modularity of the current node grouping is calculated according to the first number and the second number, which may specifically refer to the modularity calculation formula in S104, and will not be described again.
In application, the current node grouping is the grouping in the grouping module whose modularity is currently being calculated. It can be understood that if the current node grouping has only one text node, the first number of node lines contained in the current node grouping is 0. The second number is the number of node lines connecting the other node groupings to the current node grouping. It can also be understood that if the current node grouping has a plurality of text nodes (a grouping obtained by fusing several node groupings), a remaining node grouping is considered connected to the current node grouping as long as it is connected to any text node within it.
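Assuming the modularity formula in S104 is the standard Newman per-community term, the contribution of one node grouping can be computed from the first and second numbers alone (with m the total number of remaining node lines in the diagram); the specific values below are illustrative.

```python
def grouping_modularity_term(first_number, second_number, m):
    """Modularity of one node grouping: Q_i = l_i/m - (d_i/(2m))^2, where
    l_i is the first number (node lines inside the grouping), e_i the second
    number (lines to other groupings), and d_i = 2*l_i + e_i the grouping's
    total degree. Assumes the standard Newman definition."""
    degree = 2 * first_number + second_number
    return first_number / m - (degree / (2 * m)) ** 2

# A singleton grouping (first number 0) with 3 connecting node lines,
# in a diagram with 7 remaining node lines:
q = grouping_modularity_term(0, 3, 7)   # negative: -(3/14)**2
```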
Referring to fig. 5, in an embodiment, after the step S104 classifies the texts to be classified in the target text node structure diagram based on the community discovery algorithm and the remaining node lines to obtain a plurality of target groups of the texts to be classified, the following steps S104A-S104B are further included, which are detailed as follows:
S104A, determining a classification range corresponding to each node group in the target group, wherein the classification range comprises a plurality of target classifications.
In application, after the target grouping is obtained, the number of node groupings contained in the target grouping is smaller than the initial number of node groupings. That is, each node grouping in the target grouping contains a plurality of text nodes. The node grouping formed by fusing these text nodes can be considered to belong to one classification range. The classification range includes, but is not limited to, a physics classification range, a chemistry classification range, a computer-science classification range, or another discipline classification, or a next-level classification range within a discipline. For example, the classification range may be a mechanics classification range, an electricity classification range, or an optics classification range. It can be understood that each classification range may in turn contain a plurality of specific target classifications. For example, the target classifications contained in the mechanics classification range may be a quantum mechanics target classification, a Newtonian mechanics target classification, and the like.
In application, the plurality of texts to be classified may be texts within the same classification range but belonging to different target classifications, or texts spanning a plurality of classification ranges and different target classifications. The classification range corresponding to each node grouping in the target grouping may be determined by manual labeling, or by a model trained in advance in the terminal device for determining classification ranges; the model only needs to determine the classification range of each node grouping according to the texts to be classified in that grouping. It can be understood that the text similarity (the similarity distance between text nodes) between texts to be classified across different discipline classification ranges is usually very small, so the texts to be classified within the same node grouping are generally texts of one discipline (classification range). Based on this, the model only needs to predict the classification range from any one text to be classified in each node grouping to determine the classification range corresponding to that node grouping.
S104B, aiming at any node group, carrying out secondary classification on a plurality of text nodes contained in the node group according to preset keywords respectively corresponding to a plurality of target classifications in the classification range to obtain a final target group of the plurality of texts to be classified.
In application, it has been described above that each classification range includes a plurality of target classifications, and based on this, the terminal device may store keywords corresponding to each target classification in advance. And then, for any node group, comparing the preset keywords of each target class corresponding to the node group with each text to be classified respectively. And performing secondary classification on the texts to be classified contained in the node groups to obtain a final target group of the texts to be classified.
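A minimal sketch of this keyword comparison, assuming each target classification stores a list of preset keywords; the labels, keywords, and texts are hypothetical.

```python
def keyword_classify(texts, preset_keywords):
    """Assign each text to the first target classification whose preset
    keywords appear in it; texts matching no keyword are left unassigned
    (they fall through to the fallback of S104B1-S104B3)."""
    assignments = {}
    for text in texts:
        for label, keywords in preset_keywords.items():
            if any(kw in text for kw in keywords):
                assignments[text] = label
                break
    return assignments

preset_keywords = {
    "newtonian_mechanics": ["force", "inertia"],
    "quantum_mechanics": ["momentum", "particle"],
}
texts = ["the resultant force is zero", "measure the particle momentum"]
result = keyword_classify(texts, preset_keywords)
# first text -> "newtonian_mechanics", second -> "quantum_mechanics"
```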
It is understood that the steps from step S101 to step S104 may be common to different service scenarios (service scenarios of different text classifications). When other service scenes need to realize the text classification service, the system resources comprising the steps from S101 to S104 can be used for processing, so that the system resources of the text classification required by different services can be greatly reduced, and different system resources do not need to be specially set for each service scene of the text classification.
Referring to fig. 6, in an embodiment, in S104B, according to preset keywords respectively corresponding to a plurality of target classifications in the classification range, performing secondary classification on a plurality of text nodes included in the node grouping to obtain a final target grouping of the plurality of texts to be classified, the method further includes the following sub-steps S104B1-S104B3, which are detailed as follows:
S104B1, for the plurality of text nodes contained in any node grouping, judging whether there exists among them a text node whose corresponding text to be classified contains none of the preset keywords of the plurality of target classifications.
S104B2, if it is judged that the text to be classified corresponding to any text node does not contain the preset keywords of the target classifications, obtaining text word segmentation in the text to be classified.
In application, the preset keywords corresponding to each target classification are necessarily limited. In practice, to distinguish the target classification of each text to be classified, the preset keywords corresponding to different target classifications should differ from one another, and the user cannot preset every possible keyword for each classification. Based on this, there may be texts to be classified in the node grouping that contain none of the preset keywords corresponding to any target classification.
In application, the text to be classified is subjected to text segmentation, and specific reference may be made to the above-mentioned exemplary description of the text to be classified being subjected to text segmentation in S102, which is not explained again.
S104B3, performing secondary classification on the texts to be classified according to the text segmentation and a plurality of preset texts classified by known targets to obtain a final target group.
In application, the preset text with the known target classification may be a plurality of texts preset by a user, and the preset text includes, but is not limited to, texts of periodicals, papers, and the like.
In application, when performing the secondary classification according to the text segments and the plurality of preset texts of known target classifications, the terminal device may identify the semantics of the text from its segments and classify accordingly. Alternatively, when a text's segments are determined to appear frequently in the preset texts corresponding to a particular target classification, the text to be classified may be assigned to that target classification.
Illustratively, consider two texts to be classified in one node grouping: a, "All objects always keep a static state or a uniform linear motion state when not subjected to force or when the resultant force is zero"; and b, "When you measure the momentum of a particle, you cannot measure its position". Text a is segmented to obtain a plurality of text segments, for example "force", "resultant force", "static state", and "uniform linear motion state". Among the preset texts of known target classifications (e.g., papers or periodicals), the terminal device may then count the number of articles simultaneously containing these segments, determine how many of those articles fall under each target classification, and take the target classification with the largest article count as the target classification of the text to be classified. After this processing, most of the articles containing the segments of text a are found to belong to the Newtonian mechanics category. Processing text b in the same way, most of the articles in which its segments co-occur belong to the quantum mechanics category. On this basis, the final target grouping of text a is determined to be the Newtonian mechanics category, and that of text b the quantum mechanics category.
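The counting procedure in the example above can be sketched as follows, assuming the preset texts are held as plain strings grouped by known target classification; the corpus contents and labels are illustrative.

```python
def classify_by_corpus(segments, labelled_corpus):
    """Count, per target classification, the preset texts containing all of
    the query's text segments, and return the classification with the
    largest count (the secondary classification of S104B3)."""
    counts = {
        label: sum(1 for doc in docs if all(seg in doc for seg in segments))
        for label, docs in labelled_corpus.items()
    }
    return max(counts, key=counts.get)

labelled_corpus = {
    "newtonian_mechanics": [
        "the resultant force determines acceleration",
        "a body at rest has zero resultant force",
    ],
    "quantum_mechanics": [
        "the momentum of a particle cannot be measured jointly with position",
    ],
}
label = classify_by_corpus(["resultant force", "force"], labelled_corpus)
# both Newtonian articles contain the segments -> "newtonian_mechanics"
```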
In this way, the texts to be classified are classified at a finer granularity, improving the accuracy of text classification.
Referring to fig. 7, fig. 7 is a block diagram of a text classification apparatus according to an embodiment of the present disclosure. In this embodiment, each module included in the text classification apparatus is used to execute each step in the embodiments corresponding to fig. 1 to 6; please refer to those embodiments for details. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 7, the text classification apparatus 700 includes: an obtaining module 710, a calculating module 720, a filtering module 730, and a first classification module 740, wherein:
the obtaining module 710 is configured to obtain a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to one text to be classified, and every two text nodes are connected by a node line.
And the calculating module 720 is configured to extract text features of each text to be classified, and calculate a similar distance between every two text nodes according to the text features.
And a filtering module 730, configured to filter the node lines in the text node structure diagram according to the similar distance, so as to obtain a target text node structure diagram including each text node and the remaining node lines.
The first classification module 740 is configured to classify the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, so as to obtain a plurality of target groups of the texts to be classified.
In one embodiment, the text to be classified comprises a plurality of text segments, the text features are composed of word vectors of the text segments, and the similarity distance comprises a euclidean distance; the calculation module 720 is further configured to:
extracting word vectors of a plurality of text participles in each text to be classified, and determining word sequences of the plurality of text participles in the corresponding texts to be classified;
and aiming at any two text nodes, calculating the Euclidean distance of each text participle according to the word vector with the same word sequence in the text to be classified, and adding the Euclidean distances of each text participle to obtain the Euclidean distance between the two text nodes.
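A sketch of this distance calculation, assuming each text yields the same number of word vectors of equal dimension in matching word order; the two-dimensional vectors are illustrative.

```python
import math

def text_node_distance(vectors_a, vectors_b):
    """Sum the Euclidean distances between word vectors that share the same
    word order in the two texts to be classified, giving the similarity
    distance between the two text nodes."""
    return sum(
        math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
        for va, vb in zip(vectors_a, vectors_b)
    )

d = text_node_distance([(1.0, 0.0), (0.0, 1.0)], [(0.0, 0.0), (0.0, 0.0)])
# each word-vector pair contributes 1.0, so d == 2.0
```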
In one embodiment, the filtering module 730 is further configured to:
judging, for any similarity distance, whether the similarity distance is smaller than or equal to a preset threshold;
if the similarity distance is judged to be smaller than or equal to a preset threshold value, deleting the node line corresponding to the similarity distance smaller than or equal to the preset threshold value; and the number of the first and second groups,
If the similarity distance is judged to be larger than a preset threshold value, a node line corresponding to the similarity distance larger than the preset threshold value is reserved;
and generating the target text node structure chart based on the residual node lines and the plurality of text nodes.
In an embodiment, the first classification module 740 is further configured to:
grouping each text node in the target text node structure chart as an independent node;
for any node group, fusing the node group with any adjacent node group each time to obtain a plurality of grouping modules, wherein each grouping module comprises a node group formed by fusing the node group with one adjacent node group and the rest of independent node groups;
respectively calculating grouping modularity of each grouping module, wherein the grouping modularity is used for representing the classification effect of the grouping modules;
taking, among the plurality of grouping modules, the grouping module corresponding to the maximum grouping modularity as the current target grouping module, and judging whether the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the last target grouping module;
if the grouping modularity of the current target grouping module is judged to be less than or equal to the grouping modularity of the last target grouping module, taking the last target grouping module as the target grouping of the texts to be classified;
And if the grouping modularity of the current target grouping module is judged to be larger than that of the last target grouping module, the step of fusing any node group and any adjacent node group to obtain a plurality of grouping modules is executed again aiming at each node group in the current target grouping module until the target group of each text to be classified is obtained.
In one embodiment, the grouping modularity of the grouping module is obtained by adding the modularity of the node grouping fused to form the grouping module and the modularity of the other individual node grouping; the first classification module 740 is further configured to:
for any grouping module, determining a first number of node lines contained in a current node grouping in the grouping module and a second number of node lines respectively connected with the current node grouping by the other independent node groupings, wherein the current node grouping is a grouping of a current modularity to be calculated in the grouping module;
calculating the modularity of the current node grouping according to the first number and the second number;
sequentially taking the rest of the independent node groups as the current node group, and respectively calculating the modularity of the rest of the independent node groups in the grouping module;
And taking the sum of the modularity of the current node grouping and the modularity of the other independent node groupings as the grouping modularity of the grouping module.
In one embodiment, the text classification apparatus 700 further includes:
and the determining module is used for determining a classification range corresponding to each node group in the target group, wherein the classification range comprises a plurality of target classifications.
And the second classification module is used for carrying out secondary classification on a plurality of text nodes contained in the node group according to preset keywords respectively corresponding to a plurality of target classifications in the classification range aiming at any node group to obtain a final target group of the plurality of texts to be classified.
In an embodiment, the second classification module is further configured to:
aiming at a plurality of text nodes contained in any node group, judging whether a text to be classified corresponding to any text node does not contain preset keywords of the target classifications in the text nodes;
if the fact that the text to be classified corresponding to any text node does not contain the preset keywords of the target classifications is judged to exist in the text nodes, text word segmentation in the text to be classified is obtained;
And carrying out secondary classification on the texts to be classified according to the text segmentation and a plurality of preset texts classified by known targets to obtain a final target group.
It should be understood that, in the structural block diagram of the text classification apparatus shown in fig. 7, each unit/module is used to execute each step in the embodiments corresponding to fig. 1 to 6, and each of those steps has been explained in detail above; please refer to the relevant descriptions in the embodiments corresponding to fig. 1 to 6, which are not repeated herein.
Fig. 8 is a block diagram of a terminal device according to another embodiment of the present application. As shown in fig. 8, the terminal apparatus 800 of this embodiment includes: a processor 810, a memory 820, and a computer program 830, such as a program for a text classification method, stored in the memory 820 and executable on the processor 810. The processor 810, when executing the computer program 830, implements the steps in the various embodiments of the text classification methods described above, such as S101 to S104 shown in fig. 1. Alternatively, the processor 810, when executing the computer program 830, implements the functions of the modules in the embodiment corresponding to fig. 7, for example, the functions of the modules 710 to 740 shown in fig. 7, and refer to the related description in the embodiment corresponding to fig. 7 specifically.
Illustratively, the computer program 830 may be divided into one or more units, which are stored in the memory 820 and executed by the processor 810 to accomplish the present application. One or more elements may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 830 in the terminal device 800.
The terminal equipment may include, but is not limited to, a processor 810, a memory 820. Those skilled in the art will appreciate that fig. 8 is merely an example of a terminal device 800, and does not constitute a limitation of terminal device 800, and may include more or fewer components than shown, or some of the components may be combined, or different components, e.g., the terminal device may also include input-output devices, network access devices, buses, etc.
The processor 810 may be a central processing unit, but may also be other general purpose processors, digital signal processors, application specific integrated circuits, off-the-shelf programmable gate arrays or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 820 may be an internal storage unit of the terminal device 800, such as a hard disk or memory of the terminal device 800. The memory 820 may also be an external storage device of the terminal device 800, such as a plug-in hard disk, a smart card, or a flash memory card provided on the terminal device 800. Further, the memory 820 may include both internal and external storage units of the terminal device 800.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (9)

1. A method of text classification, comprising:
acquiring a text node structure chart, wherein the text node structure chart comprises a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected through a node line;
Extracting text features of each text to be classified, and respectively calculating the similar distance between every two text nodes according to the text features;
filtering the node lines in the text node structure chart according to the similar distance to obtain a target text node structure chart containing each text node and the rest node lines;
classifying the texts to be classified in the target text node structure chart based on a community discovery algorithm and the residual node lines to obtain a plurality of target groups of the texts to be classified;
the step of classifying the texts to be classified in the target text node structure diagram based on the community discovery algorithm and the residual node lines to obtain a plurality of target groups of the texts to be classified comprises the following steps:
grouping each text node in the target text node structure chart as an independent node;
for any node group, fusing the node group with any adjacent node group each time to obtain a plurality of grouping modules, wherein each grouping module comprises a node group formed by fusing the node group with one adjacent node group and the rest of independent node groups;
Respectively calculating grouping modularity of each grouping module, wherein the grouping modularity is used for representing the classification effect of the grouping modules;
taking the grouping module corresponding to the maximum value of the grouping modularity as a current target grouping module among the plurality of grouping modules, and judging whether the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the last target grouping module;
if the grouping modularity of the current target grouping module is judged to be less than or equal to the grouping modularity of the last target grouping module, taking the last target grouping module as the target grouping of the texts to be classified;
and if the grouping modularity of the current target grouping module is judged to be larger than that of the last target grouping module, the step of fusing any node group and any adjacent node group to obtain a plurality of grouping modules is executed again aiming at each node group in the current target grouping module until the target group of each text to be classified is obtained.
2. The text classification method according to claim 1, wherein the text to be classified contains a plurality of text segments, the text features are composed of word vectors of the plurality of text segments, and the similarity distance includes a euclidean distance; the extracting the text features of each text to be classified and respectively calculating the similar distance between every two text nodes according to the text features comprises the following steps:
Extracting word vectors of a plurality of text participles in each text to be classified, and determining word sequences of the plurality of text participles in the corresponding texts to be classified;
and aiming at any two text nodes, calculating the Euclidean distance of each text participle according to the word vector with the same word sequence in the text to be classified, and adding the Euclidean distances of each text participle to obtain the Euclidean distance between the two text nodes.
3. The text classification method according to claim 1 or 2, wherein the filtering the node lines in the text node structure diagram according to the similarity distance to obtain a target text node structure diagram containing each text node and the remaining node lines comprises:
for any similarity distance, judging whether the similarity distance is less than or equal to a preset threshold;
if the similarity distance is less than or equal to the preset threshold, deleting the node line corresponding to that similarity distance; and
if the similarity distance is greater than the preset threshold, retaining the node line corresponding to that similarity distance;
and generating the target text node structure diagram based on the remaining node lines and the plurality of text nodes.
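Taken as written, claim 3's filtering is a single pass over the edge list: node lines with a similarity distance at or below the preset threshold are deleted, and the rest are retained. A hypothetical sketch (the name `filter_edges` and the data shapes are assumptions):

```python
def filter_edges(edges, distances, threshold):
    """Claim 3 taken literally: delete a node line whose similarity
    distance is less than or equal to the preset threshold; retain the
    node lines whose distance exceeds it.  `edges` is a list of (u, v)
    pairs and `distances` maps each pair to its similarity distance."""
    return [e for e in edges if distances[e] > threshold]
```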
4. The text classification method of claim 1, wherein the grouping modularity of a grouping module is obtained by adding the modularity of the fused node grouping that makes up the grouping module to the modularity of the remaining independent node groupings;
the respectively calculating the grouping modularity of each grouping module comprises:
for any grouping module, determining a first number of node lines contained in the current node grouping in the grouping module and a second number of node lines by which the remaining independent node groupings are respectively connected to the current node grouping, wherein the current node grouping is the grouping in the grouping module whose modularity is currently to be calculated;
calculating the modularity of the current node grouping according to the first number and the second number;
sequentially taking each of the remaining independent node groupings as the current node grouping, and respectively calculating the modularity of the remaining independent node groupings in the grouping module;
and taking the sum of the modularity of the current node grouping and the modularity of the remaining independent node groupings as the grouping modularity of the grouping module.
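The claim names the two inputs per node grouping (internal node lines, and node lines to the other groupings) but not an exact formula. One plausible reading uses the standard Newman per-community modularity term and sums it over the groupings of the module; the formula choice and the function names below are assumptions:

```python
def node_grouping_modularity(first_number, second_number, total_lines):
    """Modularity of one node grouping from the quantities named in
    claim 4: `first_number` node lines inside the grouping and
    `second_number` node lines connecting it to the other groupings,
    plugged into the standard Newman per-community term."""
    degree_sum = 2 * first_number + second_number
    return first_number / total_lines - (degree_sum / (2 * total_lines)) ** 2

def grouping_modularity(per_grouping_counts, total_lines):
    """Grouping modularity of a module: the sum of the modularities of
    the fused grouping and the remaining independent groupings."""
    return sum(node_grouping_modularity(first, second, total_lines)
               for first, second in per_grouping_counts)
```

For example, with two groupings each holding a triangle (3 internal lines, 1 bridge line) in a 7-line graph, the module's grouping modularity comes out to 5/14.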
5. The text classification method according to claim 1 or 4, wherein after the classifying the texts to be classified in the target text node structure diagram based on the community discovery algorithm and the remaining node lines to obtain a plurality of target groups of the texts to be classified, the method further comprises:
determining a classification range corresponding to each node grouping in the target groups, wherein the classification range comprises a plurality of target classifications;
and for any node grouping, performing secondary classification on the plurality of text nodes contained in the node grouping according to preset keywords respectively corresponding to the plurality of target classifications in the classification range, to obtain final target groups of the plurality of texts to be classified.
6. The text classification method according to claim 5, wherein the performing secondary classification on the plurality of text nodes contained in the node grouping according to the preset keywords respectively corresponding to the plurality of target classifications in the classification range to obtain final target groups of the plurality of texts to be classified comprises:
for the plurality of text nodes contained in any node grouping, judging whether there exists a text node whose corresponding text to be classified contains none of the preset keywords of the plurality of target classifications;
if the text to be classified corresponding to a text node contains none of the preset keywords of the plurality of target classifications, obtaining the text participles in that text to be classified;
and performing secondary classification on that text to be classified according to its text participles and a plurality of preset texts with known target classifications, to obtain the final target groups.
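Claims 5 and 6 together describe a two-pass classifier: a keyword pass over each target classification's preset keywords, then, for texts matching no keyword, a fallback comparison of the text's participles against preset texts of known classification. The claim leaves the comparison method unspecified; the token-overlap fallback, the dictionary shapes, and the name `classify` below are illustrative assumptions:

```python
def classify(text_tokens, class_keywords, reference_texts):
    """Two-pass classification: a keyword hit decides the class; texts
    containing no preset keyword fall back to the preset text (of known
    classification) with the largest token overlap."""
    tokens = set(text_tokens)
    # primary pass: any preset keyword decides the classification
    for label, keywords in class_keywords.items():
        if tokens & set(keywords):
            return label
    # secondary pass: nearest preset text by Jaccard overlap of tokens
    def overlap(label):
        ref = set(reference_texts[label])
        union = tokens | ref
        return len(tokens & ref) / len(union) if union else 0.0
    return max(reference_texts, key=overlap)
```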
7. A text classification apparatus, comprising:
an acquisition module, configured to acquire a text node structure diagram, wherein the text node structure diagram comprises a plurality of text nodes, each text node corresponds to one text to be classified, and every two text nodes are connected by a node line;
a calculation module, configured to extract the text features of each text to be classified and respectively calculate the similarity distance between every two text nodes according to the text features;
a filtering module, configured to filter the node lines in the text node structure diagram according to the similarity distance to obtain a target text node structure diagram containing each text node and the remaining node lines; and
a first classification module, configured to classify the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines to obtain a plurality of target groups of the texts to be classified;
wherein the first classification module is specifically configured to: take each text node in the target text node structure diagram as an independent node grouping; for any node grouping, fuse the node grouping with each adjacent node grouping in turn to obtain a plurality of grouping modules, wherein each grouping module comprises the node grouping formed by fusing the node grouping with one adjacent node grouping, together with the remaining independent node groupings; respectively calculate the grouping modularity of each grouping module, wherein the grouping modularity represents the classification effect of the grouping module; take, among the grouping modules, the grouping module corresponding to the maximum value of the grouping modularity as the current target grouping module, and judge whether the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the previous target grouping module; if the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the previous target grouping module, take the previous target grouping module as the target groups of the texts to be classified; and if the grouping modularity of the current target grouping module is greater than that of the previous target grouping module, re-execute, for each node grouping in the current target grouping module, the step of fusing the node grouping with each adjacent node grouping to obtain a plurality of grouping modules, until the target group of each text to be classified is obtained.
8. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN202011586743.3A 2020-12-28 2020-12-28 Text classification method and device, terminal equipment and storage medium Active CN112632280B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011586743.3A CN112632280B (en) 2020-12-28 2020-12-28 Text classification method and device, terminal equipment and storage medium
PCT/CN2021/090954 WO2022142025A1 (en) 2020-12-28 2021-04-29 Text classification method and apparatus, and terminal device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011586743.3A CN112632280B (en) 2020-12-28 2020-12-28 Text classification method and device, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112632280A CN112632280A (en) 2021-04-09
CN112632280B true CN112632280B (en) 2022-05-24

Family

ID=75286176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011586743.3A Active CN112632280B (en) 2020-12-28 2020-12-28 Text classification method and device, terminal equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112632280B (en)
WO (1) WO2022142025A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632280B (en) * 2020-12-28 2022-05-24 平安科技(深圳)有限公司 Text classification method and device, terminal equipment and storage medium
CN115767204A (en) * 2022-11-10 2023-03-07 北京奇艺世纪科技有限公司 Video processing method, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388651A (en) * 2018-02-28 2018-08-10 Beijing Institute of Technology Text classification method based on graph kernels and convolutional neural networks
CN110929509A (en) * 2019-10-16 2020-03-27 Shanghai University Louvain community discovery algorithm-based field event trigger word clustering method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570178B (en) * 2016-11-10 2020-09-29 Chongqing University of Posts and Telecommunications High-dimensional text data feature selection method based on graph clustering
US10423646B2 (en) * 2016-12-23 2019-09-24 Nokia Of America Corporation Method and apparatus for data-driven face-to-face interaction detection
US10671936B2 (en) * 2017-04-06 2020-06-02 Universite Paris Descartes Method for clustering nodes of a textual network taking into account textual content, computer-readable storage device and system implementing said method
CN112632280B (en) * 2020-12-28 2022-05-24 平安科技(深圳)有限公司 Text classification method and device, terminal equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on a Text Analysis Model Based on Community Discovery; Fan Weicheng; China Masters' Theses Full-text Database, Information Science and Technology; 2019-01-15; pp. I138-5559 *

Also Published As

Publication number Publication date
WO2022142025A1 (en) 2022-07-07
CN112632280A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
CN107463605B (en) Method and device for identifying low-quality news resource, computer equipment and readable medium
Kouw et al. Feature-level domain adaptation
Ertugrul et al. Movie genre classification from plot summaries using bidirectional LSTM
CN111582409B (en) Training method of image tag classification network, image tag classification method and device
CN107357793B (en) Information recommendation method and device
Garg et al. Understanding probabilistic classifiers
US10929751B2 (en) Finding K extreme values in constant processing time
US7478075B2 (en) Reducing the size of a training set for classification
US11657590B2 (en) Method and system for video analysis
CN112613308A (en) User intention identification method and device, terminal equipment and storage medium
CN112632280B (en) Text classification method and device, terminal equipment and storage medium
WO2023179429A1 (en) Video data processing method and apparatus, electronic device, and storage medium
CN111159409A (en) Text classification method, device, equipment and medium based on artificial intelligence
CN111444387A (en) Video classification method and device, computer equipment and storage medium
CN111522953B (en) Marginal attack method and device for naive Bayes classifier and storage medium
Zamzami et al. High-dimensional count data clustering based on an exponential approximation to the multinomial beta-liouville distribution
CN114416998A (en) Text label identification method and device, electronic equipment and storage medium
CN115456043A (en) Classification model processing method, intent recognition method, device and computer equipment
CN110134852B (en) Document duplicate removal method and device and readable medium
Olaode et al. Unsupervised image classification by probabilistic latent semantic analysis for the annotation of images
Ozsarfati et al. Book genre classification based on titles with comparative machine learning algorithms
Poczeta et al. A multi-label text message classification method designed for applications in call/contact centre systems
Kapadia et al. Improved CBIR system using Multilayer CNN
CN111488400B (en) Data classification method, device and computer readable storage medium
Schmitt et al. Outlier detection on semantic space for sentiment analysis with convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant