WO2022142025A1 - Text classification method and apparatus, and terminal device and storage medium - Google Patents


Info

Publication number
WO2022142025A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
node
grouping
target
classified
Prior art date
Application number
PCT/CN2021/090954
Other languages
French (fr)
Chinese (zh)
Inventor
马龙
梁宸
周元笙
蒋佳惟
陈思姣
李炫�
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022142025A1 publication Critical patent/WO2022142025A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Definitions

  • the present application belongs to the technical field of artificial intelligence, and in particular, relates to a text classification method, apparatus, terminal device and storage medium.
  • text classification is an important aspect in the field of artificial intelligence.
  • text classification aims to automatically classify unlabeled documents into a predetermined set of categories to alleviate information clutter.
  • the existing fast text classifier models or convolutional neural network models are usually used to classify texts.
  • the inventor realizes that classifying text by training a neural network model takes a lot of time to train the model, and the classification accuracy is low when classifying text at a finer granularity.
  • one of the purposes of the embodiments of the present application is to provide a text classification method, apparatus, terminal device and storage medium, aiming to solve the problem in the prior art of low accuracy of text classification when classifying text through a neural network model.
  • an embodiment of the present application provides a text classification method, including:
  • the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and two text nodes are connected by a node line;
  • the text to be classified in the node structure diagram of the target text is classified, and a plurality of target groups of the text to be classified are obtained.
  • an embodiment of the present application provides a text classification device, including:
  • an acquisition module configured to acquire a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and two text nodes are connected by a node line;
  • a computing module, configured to extract the text features of each of the texts to be classified, and to calculate the similarity distance between the two text nodes according to the text features;
  • a filtering module configured to filter the node lines in the text node structure diagram according to the similarity distance to obtain a target text node structure diagram including each text node and the remaining node lines;
  • the first classification module is configured to classify the text to be classified in the node structure diagram of the target text based on the community discovery algorithm and the remaining node lines, and obtain a plurality of target groups of the text to be classified.
  • a third aspect of the embodiments of the present application provides a terminal device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
  • the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and two text nodes are connected by a node line;
  • the text to be classified in the node structure diagram of the target text is classified, and a plurality of target groups of the text to be classified are obtained.
  • a fourth aspect of the embodiments of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement:
  • the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and two text nodes are connected by a node line;
  • the text to be classified in the node structure diagram of the target text is classified, and a plurality of target groups of the text to be classified are obtained.
  • the embodiment of the present application has the beneficial effect of improving the accuracy of classifying multiple texts to be classified.
  • Fig. 1 is the realization flow chart of a kind of text classification method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of an implementation manner of S103 of a text classification method provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of an implementation manner of S104 of a text classification method provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of an implementation manner of S1043 of a text classification method provided by an embodiment of the present application.
  • Fig. 5 is the realization flow chart of a kind of text classification method provided by another embodiment of the present application.
  • FIG. 6 is a schematic diagram of an implementation manner of S104B of a text classification method provided by an embodiment of the present application.
  • FIG. 7 is a structural block diagram of a text classification device provided by an embodiment of the present application.
  • FIG. 8 is a structural block diagram of a terminal device provided by an embodiment of the present application.
  • the text classification method provided by the embodiments of the present application can be applied to terminal devices such as tablet computers, notebook computers, ultra-mobile personal computers (UMPCs), netbooks, etc.
  • terminal devices such as tablet computers, notebook computers, ultra-mobile personal computers (UMPCs), netbooks, etc.
  • UMPCs ultra-mobile personal computers
  • the embodiments of the present application do not impose any limitation on the specific type of the terminal device.
  • the above-mentioned multiple texts to be classified may be texts such as papers, journals, or magazines, respectively, or may be a sentence or a paragraph, which is not limited.
  • each text to be classified can be regarded as a text node respectively, and the multiple text nodes can be connected by node lines to generate a text node structure diagram.
  • the text to be classified may be pre-stored in a designated storage path of the terminal device, and then acquired by the terminal device; it may also be multiple texts to be classified transmitted by the user in real time, which is not limited.
  • the number of the above-mentioned node lines is related to the number of text nodes.
  • when the number of text nodes is 2, the number of node lines is 1; when the number of text nodes is 3, the number of node lines is 3; when the number of text nodes is 4, the number of node lines is 6; and so on.
  • in general, y = (n-1)*n/2, where n (n is an integer, and n ≥ 2) is the number of text nodes, and y is the number of node lines.
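As a quick illustration (not part of the disclosure; the function name is hypothetical), the relation between text nodes and node lines can be checked in Python:

```python
def node_line_count(n: int) -> int:
    """Number of node lines connecting n text nodes pairwise: y = n*(n-1)/2."""
    if n < 2:
        raise ValueError("at least two text nodes are required")
    return n * (n - 1) // 2

# Matches the examples in the text: 2 nodes -> 1 line, 3 -> 3, 4 -> 6.
```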
  • the above similarity distance can be understood as the degree of similarity between two texts to be classified.
  • the terminal device may extract text features of each text to be classified, and calculate the similarity between the two text features as the similarity distance between the two texts to be classified.
  • the above-mentioned extraction of the text features of the text to be classified may proceed as follows: the text to be classified is first segmented to obtain multiple text segments; afterwards, the word consistent with each text segment is found in a preset word vector library and the sequence number of the text segment is determined; finally, the word feature of the text segment is generated according to the sequence number.
  • determining the words consistent with the text segments in the preset word vector library may be implemented according to a forward matching algorithm. Specifically, if the longest word in the preset word vector library is 5 characters, the first to fifth characters of the text to be classified may be taken as an initial segment, and it is determined whether this segment exists in the word vector library.
  • if it does, the initial segment is taken as a target segment, and matching then proceeds to the subsequent characters. If the initial segment does not exist in the word vector library, the character length is reduced from right to left: the first to fourth characters of the text to be classified are taken as the initial segment, and it is again determined whether the initial segment exists in the word vector library. In this way, word segmentation is performed on the multiple texts to be classified to obtain the text segments.
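The forward matching procedure described above can be sketched as follows. This is a minimal illustration under the assumption that the word vector library can be queried as a set of known words; the function name and the toy lexicon are hypothetical:

```python
def forward_max_match(text, lexicon, max_len=5):
    """Segment `text` greedily: try the longest candidate (up to max_len
    characters) starting at the current position, shrinking from right to
    left until a word in the lexicon is found; a single character always
    matches as a fallback segment."""
    segments = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                segments.append(candidate)
                i += length
                break
    return segments
```

With the toy lexicon {"ab", "abc", "d"}, the text "abcd" is segmented as ["abc", "d"]: the longest match is tried first, and unmatched single characters fall through as their own segments.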
  • the word feature of each text segment can be generated according to its sequence number in the word vector library. Specifically, the word feature dimension of each text segment is preset; then, the one-dimensional value of each text segment (the number corresponding to its sequence number) is mapped into a multi-dimensional continuous vector space. Exemplarily, for a text segment with sequence number "5" and a feature vector dimension of 10, the word feature of the text segment can be [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]. In this way, the word feature of each text segment is obtained, and the text features of the text to be classified can be represented by the word features of its multiple text segments. After that, the similarity distance between two text nodes can be calculated with reference to the following steps, specifically:
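The mapping from a sequence number to a one-hot word feature can be sketched as follows (a minimal illustration; the function name is hypothetical):

```python
def word_feature(sequence_number, dim):
    """Map a text segment's sequence number in the word vector library to a
    one-hot word feature of length `dim` (sequence numbers are 1-indexed)."""
    if not 1 <= sequence_number <= dim:
        raise ValueError("sequence number outside the preset feature dimension")
    feature = [0] * dim
    feature[sequence_number - 1] = 1
    return feature

# Example from the text: sequence number "5" with dimension 10.
```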
  • the similarity distance includes the Euclidean distance; in S102, the text features of each of the texts to be classified are extracted, and the similarity distances between the two text nodes are respectively calculated according to the text features, specifically including the following sub-steps:
  • the Euclidean distance is calculated as p(a, b) = √((a1 - b1)² + (a2 - b2)² + … + (an - bn)²), where:
  • n is the number of text segments;
  • ai represents the word vector of the i-th text segment in the a-th text to be classified;
  • bi represents the word vector of the i-th text segment in the b-th text to be classified;
  • p(a, b) represents the similarity distance between text a to be classified and text b to be classified.
  • when the two texts to be classified contain different numbers of text segments, n may be determined as m1, the larger of the two numbers of text segments.
  • for the text with fewer segments, the missing word features are all represented by "0".
  • the above similarity distance may also be the cosine distance between two text nodes calculated according to the text features, which is not limited.
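Both similarity distances can be sketched in plain Python. This is an illustration only, using the zero-padding of unequal-length feature vectors described above; the function names are hypothetical:

```python
import math

def euclidean_distance(a, b):
    """Similarity distance p(a, b) = sqrt(sum_i (a_i - b_i)^2) between the
    word-feature vectors of two texts to be classified, padding the shorter
    vector with zeros as described in the text."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """Alternative similarity distance mentioned in the text: 1 - cosine
    similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm
```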
  • filtering the above node lines may be understood as deleting the node lines whose similarity distance is less than or equal to a preset threshold.
  • in this way, a target text node structure diagram composed of the text nodes and the remaining node lines is obtained. It can be understood that the similarity distance between two text nodes is the length of the node line between them.
  • a preset threshold may be pre-stored in the terminal device.
  • when the similarity distance is greater than the preset threshold, the node line between the two texts to be classified is retained; when the similarity distance is less than or equal to the preset threshold, the node line between the two texts to be classified is deleted. It can be understood that, in the target text node structure diagram obtained at this time, any two texts to be classified that are still connected by a node line have a certain similarity; when they are clustered and grouped, the probability that the two texts to be classified belong to the same category grouping can be considered greater.
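The threshold filtering of S103 can be sketched as follows, assuming the node lines are stored as a mapping from text-node pairs to similarity distances (a hypothetical representation, following the document's rule of retaining only distances greater than the threshold):

```python
def filter_node_lines(node_lines, threshold):
    """Keep only node lines whose similarity distance is greater than the
    preset threshold; lines at or below the threshold are deleted, as
    described in the text. `node_lines` maps (node_a, node_b) -> distance."""
    return {pair: d for pair, d in node_lines.items() if d > threshold}
```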
  • the above community discovery algorithms include, but are not limited to, the Leiden algorithm and the Louvain algorithm.
  • community discovery algorithms are modularity-based classification algorithms that can be used to group multiple objects (texts to be classified) into different categories. Here, the modularity can be regarded as a metric for evaluating the effect of a grouping after the multiple objects are grouped in different ways. It can be understood that the multiple texts to be classified are classified such that each category contains at least one text to be classified. Exemplarily, if the target grouping contains 3 categories after classifying 5 texts to be classified, the number of texts to be classified contained in each category grouping may be 1 or 2, which is not limited.
  • the modularity Qi of each category of groupings in the target grouping needs to be calculated.
  • the calculation formula is as follows: Qi = Σin/(2m) - (Σtot/(2m))², where Qi is the modularity of the grouping of the i-th category, m is the sum of the similarity distances of all node lines in the diagram, Σin is the sum of the similarity distances of all node lines within the grouping of the i-th category, and Σtot is the sum of the similarity distances of all node lines connected to the text nodes within the grouping of the i-th category.
  • the modularity corresponding to the grouping of each category can be obtained after each classification. Furthermore, by adding up multiple modularities, the overall modularity Q of the target grouping can be obtained.
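Reading the formula as the standard Louvain-style modularity Qi = Σin/(2m) - (Σtot/(2m))², the per-category and overall modularity can be sketched as follows (function names are hypothetical):

```python
def grouping_modularity(sigma_in, sigma_tot, m):
    """Modularity Qi of one category grouping:
    Qi = sigma_in/(2m) - (sigma_tot/(2m))**2, where sigma_in is the weight
    of node lines inside the grouping, sigma_tot the weight of node lines
    connected to its text nodes, and m the total node-line weight."""
    return sigma_in / (2 * m) - (sigma_tot / (2 * m)) ** 2

def overall_modularity(groupings, m):
    """Overall modularity Q of a target grouping: the sum of the
    per-category modularities, as described in the text."""
    return sum(grouping_modularity(s_in, s_tot, m) for s_in, s_tot in groupings)
```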
  • in summary, a text node structure diagram is generated by taking each text to be classified as a text node; the similarity distance between each pair of texts to be classified is then calculated, and the text node structure diagram is preliminarily filtered to obtain the target text node structure diagram, realizing a preliminary grouping of the texts to be classified. Then, according to the community discovery algorithm and the remaining node lines, the texts to be classified in the target text node structure diagram are classified, and the multiple texts to be classified are grouped again to obtain the target grouping. In this way, the accuracy of classifying the multiple texts to be classified is improved.
  • the node lines in the text node structure diagram are filtered to obtain a target text node structure diagram including each text node and the remaining node lines, which specifically includes the following sub-steps S1031-S1034, described in detail as follows:
  • the above-mentioned preset threshold may be set by the user according to the actual situation, or may be a fixed value preset in the terminal device, which is not limited. It has been explained in S103 that when the similarity distance is less than or equal to the preset threshold, the node line corresponding to the similarity distance is deleted. And, when the similarity distance is greater than the preset threshold, the node line corresponding to the similarity distance is retained. In this way, the structure diagram of the target text node is generated, which will not be described again.
  • the text to be classified in the node structure diagram of the target text is classified, and a plurality of target groups of the text to be classified are obtained.
  • S1041-S1046 which are detailed as follows:
  • first, the text node corresponding to each text to be classified is taken as a separate node grouping. Then, the following steps are performed on each node grouping in turn to obtain the multiple target groupings of the texts to be classified.
  • each grouping module includes the node grouping formed by fusing the any node grouping with one of the adjacent node groupings, and the remaining individual node groupings.
  • any node grouping and any adjacent node grouping can be fused each time to obtain a fused node grouping and other individual node groupings.
  • exemplarily, node grouping i can be fused with the adjacent node grouping j, that is, the texts to be classified in the two node groupings are regarded as one category grouping.
  • in the formed grouping module, there are k-1 node groupings. It can be understood that, among the k-1 groupings, there is one node grouping formed by the fusion and k-2 other individual node groupings. The other node groupings also need to be processed as in S1042 to obtain their corresponding grouping modules.
  • the grouping modularity of the grouping module is obtained by adding the modularity of the node grouping formed by the fusion and the modularities of the other individual node groupings. That is, for a grouping module containing the above k-1 node groupings, the modularity corresponding to each of the k-1 node groupings is calculated, and the results are added to obtain the overall modularity of the grouping module (the grouping modularity).
  • the formula for calculating Qi in the above S104 may be referred to.
  • Q represents the overall modularity of the grouping module (grouping modularity)
  • Qi represents the modularity of the grouping of the i-th category (among the k-1 node groupings included in the grouping module, the modularity of the i-th node grouping), where i is an integer and i ≤ k-1.
  • the maximum value among the various grouping modularities can be determined according to their values, and the grouping module corresponding to the maximum grouping modularity is taken as the current target grouping module.
  • after determining the current target grouping module, it is necessary to determine whether it is the optimal grouping of the multiple texts to be classified. Based on this, the grouping modularity of the current target grouping module is compared with the grouping modularity of the previous target grouping module to determine whether the current target grouping is the optimal grouping. Specifically, if the grouping modularity of the current target grouping module is less than or equal to that of the previous target grouping module, the previous target grouping module is the optimal grouping (target grouping) of the multiple texts to be classified.
  • if the grouping modularity of the current target grouping module is greater than that of the previous target grouping module, the previous target grouping is not the optimal grouping. Afterwards, in order to determine whether the current target grouping is the optimal grouping, S1042 to S1045 are repeated for each node grouping in the current target grouping module until the target grouping of each text to be classified is obtained.
  • S1042 is then directed to each node grouping in the current target grouping module. That is, if the current target grouping module includes the node grouping formed by fusing node grouping i with node grouping j, together with the other individual node groupings, then when S1042 is executed, the number of node groupings is k-1, and the node grouping formed by the fusion of node grouping i and node grouping j is treated as a single node grouping.
  • if the grouping modularity Qk-1 of the current target grouping module is less than or equal to the grouping modularity Qk of the previous target grouping module, the classification effect of the current target grouping module does not exceed that of the previous target grouping module; therefore, the previous target grouping module is the target grouping with the best effect. If Qk-1 is greater than Qk, the classification effect of the current target grouping module is higher than that of the previous target grouping module; afterwards, S1042 to S1045 are repeated for the k-1 node groupings of the current target grouping module until the target grouping with the best classification effect is obtained.
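The iterative procedure of S1042-S1045 can be sketched as a greedy merge loop. This is a simplified illustration rather than the exact claimed algorithm: it tries every pair of node groupings (not only adjacent ones) and uses a Louvain-style modularity in which each node line's weight is counted once; all names are hypothetical:

```python
def partition_modularity(partition, edges):
    """Grouping modularity of a partition: sum over groups of
    Qi = s_in/(2m) - (s_tot/(2m))**2, on weighted node lines."""
    m = sum(edges.values())
    q = 0.0
    for group in partition:
        s_in = sum(w for (u, v), w in edges.items() if u in group and v in group)
        s_tot = sum(w for (u, v), w in edges.items() if u in group or v in group)
        q += s_in / (2 * m) - (s_tot / (2 * m)) ** 2
    return q

def greedy_merge(nodes, edges):
    """Start from singleton node groupings; repeatedly apply the fusion that
    yields the highest grouping modularity, and stop when no fusion improves
    on the current value (the previous grouping is then the target grouping)."""
    partition = [frozenset([n]) for n in nodes]
    best_q = partition_modularity(partition, edges)
    while len(partition) > 1:
        candidates = []
        for i in range(len(partition)):
            for j in range(i + 1, len(partition)):
                fused = partition[i] | partition[j]
                rest = [g for k, g in enumerate(partition) if k not in (i, j)]
                candidates.append(
                    (partition_modularity(rest + [fused], edges), rest + [fused]))
        q, merged = max(candidates, key=lambda c: c[0])
        if q <= best_q:  # no improvement: keep the previous grouping
            break
        best_q, partition = q, merged
    return partition
```

On a toy graph with two well-separated pairs of text nodes, the loop stops once fusing the two pairs any further would lower the grouping modularity.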
  • the grouping modularity of the grouping module is obtained by adding the modularity of the node grouping formed by the fusion in the grouping module and the modularities of the remaining individual node groupings; in S1043, respectively calculating the grouping modularity of each grouping module specifically includes the following sub-steps S10431-S10434, detailed as follows:
  • the current node grouping is the grouping of the modularity to be calculated currently in the grouping module.
  • the modularity of the current node grouping is calculated according to the first quantity and the second quantity.
  • the modularity calculation formula in the above S104 which will not be described again.
  • the above-mentioned current node grouping is the grouping whose modularity is currently being calculated in the grouping module. It can be understood that, if the current node grouping has only one text node, the first number of node lines included in the current node grouping is 0. The above-mentioned second number is the number of node lines connecting the remaining node groupings to the current node grouping. It can be understood that if the current node grouping has multiple text nodes (a grouping obtained by fusing multiple node groupings), the remaining node groupings are connected to any text node in the current node grouping, that is, there are connections between the remaining node groupings and the current node grouping.
  • the text to be classified in the node structure diagram of the target text is classified, and after a plurality of target groups of the text to be classified are obtained, It also includes the following steps S104A-S104B, which are detailed as follows:
  • S104A Determine a classification range corresponding to each node group in the target group, where the classification range includes multiple target classifications.
  • the number of node groups contained in the target group will be smaller than the initial number of node groups. That is, each node group in the target group now contains multiple text nodes.
  • the node grouping formed by the fusion of the multiple text nodes can be considered to belong to a classification range.
  • the above classification scope includes, but is not limited to, the scope of physical classification, the scope of chemical classification, the scope of computer classification and other subject classifications. It can also be the next-level classification range of a subject classification range.
  • the classification range may be a mechanical classification range, an electrical classification range, and an optical classification range. It can be understood that, for each classification range, it can also contain multiple specific target classifications.
  • the target classification included in the mechanical classification range may be quantum mechanical target classification, Newtonian mechanical target classification, and the like.
  • multiple texts to be classified can be in the same classification range but belong to different target classifications. It can also be texts that belong to multiple classification ranges and have different target classifications. Among them, the classification range corresponding to each node group in the target group is determined, which can be marked manually.
  • a model for determining the classification range may also be pre-trained in the terminal device, and the model is only used to determine the classification range in each node group according to the text to be classified in each node group in the target group. It can be understood that the text similarity (similar distance between text nodes) of texts to be classified in different classification ranges across disciplines is usually very small. Therefore, multiple texts to be classified between the same node grouping are usually texts of one subject (classification scope). Based on this, the model only needs to predict the classification range according to any text to be classified in each node group, and then the classification range corresponding to the corresponding node group can be determined.
  • S104B For any node grouping, perform secondary classification on the multiple text nodes included in the node grouping according to the preset keywords corresponding to the multiple target classifications in the classification range, to obtain the final target grouping of the multiple texts to be classified.
  • each category range includes multiple target categories
  • the terminal device may pre-store the keywords corresponding to each target category.
  • the preset keywords of each target classification corresponding to the node grouping are compared with each text to be classified.
  • in this way, the final target grouping of the multiple texts to be classified is obtained.
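The keyword-based secondary classification can be sketched as follows; the keyword lists are hypothetical examples, and texts matching no preset keyword are set aside for the fallback of S104B1-S104B3:

```python
def secondary_classify(texts, target_keywords):
    """Assign each text to the first target classification whose preset
    keywords it contains; texts matching no keyword are returned separately
    for further handling. `target_keywords` maps classification -> keywords."""
    grouping = {target: [] for target in target_keywords}
    unmatched = []
    for text in texts:
        for target, keywords in target_keywords.items():
            if any(k in text for k in keywords):
                grouping[target].append(text)
                break
        else:
            unmatched.append(text)
    return grouping, unmatched
```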
  • the above steps S101 to S104 can be common to different business scenarios (different text classification business scenarios).
  • a system implementing the above steps S101 to S104 can also be reused for processing, which can greatly reduce the system resources required for text classification by different services, and there is no need to specially design different system resources for each text classification business scenario.
  • in S104B, performing secondary classification on the multiple text nodes included in the node grouping according to the preset keywords corresponding to the multiple target classifications in the classification range to obtain the final target grouping of the multiple texts to be classified further includes the following sub-steps S104B1-S104B3, described in detail as follows:
  • S104B1 If the text to be classified corresponding to any text node does not contain the preset keywords of the multiple target classifications, obtain the text segments in the text to be classified.
  • the keywords corresponding to each target classification category usually have limitations.
  • the preset keywords corresponding to the target categories should be different from each other.
  • the user cannot preset all keywords for each target classification. Based on this, there may be a situation in which a text to be classified included in the node grouping does not contain the preset keywords corresponding to any target classification.
  • S104B3 Perform secondary classification on the text to be classified according to the text segmentation and a plurality of preset texts of known target classifications to obtain a final target group.
  • the above-mentioned preset texts of known target categories may be multiple texts preset by a user, and the preset texts include but are not limited to texts such as journals and papers.
  • the secondary classification of the text to be classified may be that the terminal device recognizes the text semantics according to the text segments and then performs secondary classification on the text to be classified. It is also possible to classify the text to be classified into a target classification when it is determined that its text segments often appear in the preset texts corresponding to that target classification.
  • a: All objects remain in a state of rest or a state of uniform linear motion when they are not acted upon by a force or the resultant force is zero.
  • b: When you measure a particle's momentum, you cannot tell where it is.
  • the text segmentation of the text a to be classified includes text segmentation such as force, resultant force, static state, and uniform linear motion state.
  • the terminal device can count the number of articles that simultaneously contain the above text segments (the text segments of text a to be classified). Afterwards, among the articles that contain the above text segments, the number of articles belonging to each target classification is determined, and the target classification corresponding to the maximum number of articles is taken as the target classification of the text to be classified. Through the above processing, it is found that most of the articles containing the above text segments belong to the "Newtonian mechanics" classification.
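The counting procedure described above can be sketched as follows (an illustration only; the preset texts and function name are hypothetical):

```python
from collections import Counter

def classify_by_cooccurrence(segments, preset_texts):
    """Count, per target classification, how many preset texts contain all
    of the text segments of the text to be classified, and return the
    target classification with the maximum count. `preset_texts` maps
    classification -> list of known texts."""
    counts = Counter()
    for target, texts in preset_texts.items():
        counts[target] = sum(all(s in t for s in segments) for t in texts)
    return counts.most_common(1)[0][0]
```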
  • FIG. 7 is a structural block diagram of a text classification apparatus provided by an embodiment of the present application. Each module included in the text classification apparatus in this embodiment is used to execute each step in the embodiment corresponding to FIG. 1 to FIG. 6 .
  • the text classification apparatus 700 includes: an acquisition module 710, a calculation module 720, a filtering module 730 and a first classification module 740, wherein:
  • the obtaining module 710 is configured to obtain a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and two text nodes are connected by a node line.
  • the calculation module 720 is configured to extract the text features of each of the texts to be classified, and respectively calculate the similarity distance between the two text nodes according to the text features.
  • the filtering module 730 is configured to filter the node lines in the text node structure diagram according to the similarity distance to obtain a target text node structure diagram including each text node and the remaining node lines.
  • the first classification module 740 is configured to classify the text to be classified in the node structure diagram of the target text based on the community discovery algorithm and the remaining node lines to obtain a plurality of target groups of the text to be classified.
  • the text to be classified includes multiple text segments, the text features are composed of the word vectors of the multiple text segments, and the similarity distance includes the Euclidean distance; the computing module 720 is further configured to:
  • the filtering module 730 is also used to:
  • for any similarity distance, determine whether the similarity distance is less than or equal to the preset threshold; if it is determined that the similarity distance is less than or equal to the preset threshold, delete the node line corresponding to that similarity distance; and, if it is determined that the similarity distance is greater than the preset threshold, retain the node line corresponding to that similarity distance; based on the remaining node lines and the plurality of text nodes, generate the target text node structure diagram.
  • the first classification module 740 is further configured to:
  • take each text node in the target text node structure diagram as a separate node grouping; for any node grouping, fuse the any node grouping with each adjacent node grouping in turn to obtain various grouping modules, where each grouping module includes the node grouping formed by fusing the any node grouping with one of the adjacent node groupings, together with the remaining individual node groupings; respectively calculate the grouping modularity of each grouping module, the grouping modularity being used to represent the classification effect of the grouping module; among the various grouping modules, take the grouping module corresponding to the maximum grouping modularity as the current target grouping module, and judge whether the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the previous target grouping module; if so, take the previous target grouping module as the target grouping of the multiple texts to be classified; if the grouping modularity of the current target grouping module is greater than that of the previous target grouping module, then for each node grouping in the current target grouping module, execute the step of fusing the any node grouping with each adjacent node grouping to obtain various grouping modules, until the target grouping of each text to be classified is obtained.
  • the grouping modularity of a grouping module is obtained by adding the modularity of the fused node grouping in the grouping module to the modularity of the remaining individual node groupings; the first classification module 740 is further configured to:
  • for any grouping module, determine a first number of node lines included in the current node grouping in the grouping module, and a second number of node lines by which the remaining individual node groupings are respectively connected to the current node grouping, where the current node grouping is the grouping whose modularity is currently to be calculated in the grouping module; calculate the modularity of the current node grouping according to the first number and the second number; take the remaining individual node groupings in turn as the current node grouping, and calculate the modularity of each of the remaining individual node groupings in the grouping module respectively; and take the sum of the modularity of the current node grouping and the modularity of the remaining individual node groupings as the grouping modularity of the grouping module.
  • the text classification apparatus 700 further includes:
  • a determination module, configured to determine a classification range corresponding to each node grouping in the target grouping, where the classification range includes multiple target classifications.
  • the second classification module is configured to, for any node grouping, perform secondary classification on the multiple text nodes included in the node grouping according to the preset keywords corresponding to the multiple target classifications in the classification range, to obtain the final target grouping of the multiple texts to be classified.
  • the second classification module is further configured to:
  • for the multiple text nodes included in any of the node groupings, judge whether the text to be classified corresponding to any text node does not contain the preset keywords of the multiple target classifications; if, among the multiple text nodes, the text to be classified corresponding to any text node does not contain the preset keywords of the multiple target classifications, obtain the text segments in that text to be classified; and perform secondary classification on that text to be classified according to the text segments and the preset texts of the known target classifications, to obtain the final target grouping.
  • each unit/module is configured to execute the steps in the embodiments corresponding to FIG. 1 to FIG. 6; the steps have been explained in detail in the above-mentioned embodiments and are not repeated here.
  • FIG. 8 is a structural block diagram of a terminal device provided by another embodiment of the present application.
  • the terminal device 800 of this embodiment includes: a processor 810 , a memory 820 , and a computer program 830 stored in the memory 820 and executable on the processor 810 , such as a program of a text classification method.
  • when the processor 810 executes the computer program 830, the steps in each of the above text classification method embodiments are implemented, for example, S101 to S104 shown in FIG. 1 .
  • when the processor 810 executes the computer program 830, the functions of each module in the embodiment corresponding to FIG. 7 are implemented, for example, the functions of the modules 710 to 740 shown in FIG. 7 .
  • Specifically as follows:
  • a terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
  • obtaining a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected by a node line;
  • classifying the text to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
  • a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements:
  • obtaining a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected by a node line;
  • classifying the text to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
  • the computer program 830 may be divided into one or more modules, and the one or more modules are stored in the memory 820 and executed by the processor 810 to complete the present application.
  • One or more modules may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 830 in the terminal device 800 .
  • the terminal device may include, but is not limited to, the processor 810 and the memory 820 .
  • FIG. 8 is only an example of the terminal device 800 and does not constitute a limitation on the terminal device 800; it may include more or fewer components than those shown in the figure, or combine some components; for example, the terminal device may also include input and output devices, buses, etc.
  • the so-called processor 810 may be a central processing unit, or may be another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 820 may be an internal storage unit of the terminal device 800 , such as a hard disk or a memory of the terminal device 800 .
  • the memory 820 may also be an external storage device of the terminal device 800 , such as a plug-in hard disk, a smart memory card, a flash memory card, etc., which are equipped on the terminal device 800 . Further, the memory 820 may also include both an internal storage unit of the terminal device 800 and an external storage device.
  • the computer-readable storage medium may be an internal storage unit of the terminal device described in the foregoing embodiments, such as a hard disk or a memory of the terminal device.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium may also be an external storage device of the terminal device, for example, a pluggable hard disk, a smart memory card, a secure digital card, a flash memory card, etc. equipped on the terminal device.
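The keyword-based secondary classification described for the second classification module above can be sketched as follows. This is an illustrative sketch only: the function name, the keyword dictionary, and the word-overlap fallback used when no keyword matches are assumptions, not the patented implementation.

```python
def secondary_classify(texts, keywords_by_class, reference_texts):
    """For each text, assign the first target classification whose preset
    keywords appear in the text; if no keyword of any classification is
    found, fall back to comparing the text's words with preset texts of
    known classifications (here: simple word overlap)."""
    result = {}
    for text in texts:
        label = next((cls for cls, kws in keywords_by_class.items()
                      if any(kw in text for kw in kws)), None)
        if label is None:
            # fallback: known classification whose preset text shares the
            # most words with this text to be classified
            words = set(text.split())
            label = max(reference_texts,
                        key=lambda cls: len(words & set(reference_texts[cls].split())))
        result[text] = label
    return result
```

In practice the patent's fallback compares the text segments of the unmatched text with preset texts of known target classifications; plain word overlap stands in for that comparison here.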

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application is applicable to the technical field of artificial intelligence. Provided are a text classification method and apparatus, and a terminal device and a storage medium. The method comprises: acquiring a text node structural diagram generated by a plurality of texts to be classified; extracting text features of each text to be classified, and respectively calculating a similarity distance between text nodes in pairs according to the text features; according to the similarity distance, filtering node lines in the text node structural diagram, so as to obtain a target text node structural diagram including each text node and the remaining node lines; and on the basis of a community discovery algorithm and the remaining node lines, classifying texts to be classified in the target text node structural diagram, so as to obtain a target group of the plurality of texts to be classified. By means of the method, preliminary filtering is first performed on the text node structural diagram generated by the texts to be classified, and the text node structural diagram is then grouped to obtain the target group, such that the accuracy of classifying the plurality of texts to be classified can be improved.

Description

Text classification method, apparatus, terminal device and storage medium

This application claims priority to the Chinese patent application with application number 202011586743.3, entitled "Text Classification Method, Apparatus, Terminal Equipment and Storage Medium", filed with the China Patent Office on December 28, 2020, the entire contents of which are incorporated herein by reference.
Technical Field

The present application belongs to the technical field of artificial intelligence, and in particular relates to a text classification method, apparatus, terminal device and storage medium.

Background

At present, text classification is an important aspect of the field of artificial intelligence. As an important information processing task, text classification aims to automatically classify unlabeled documents into a predetermined set of categories to resolve the problem of information clutter. Existing text classification methods usually employ an existing fast text classifier model or a convolutional neural network model to classify texts. However, the inventor realized that classifying texts by training a neural network model takes a great deal of time, and the classification accuracy is low when texts are classified at a finer granularity.
Technical Problem

One of the purposes of the embodiments of the present application is to provide a text classification method, apparatus, terminal device and storage medium, aiming to solve the technical problem in the prior art of low text classification accuracy when classifying texts through a neural network model.

Technical Solutions

In order to solve the above technical problem, the technical solutions adopted in the embodiments of the present application are as follows:
In a first aspect, an embodiment of the present application provides a text classification method, including:

obtaining a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected by a node line;

extracting the text features of each of the texts to be classified, and calculating the similarity distance between every two text nodes according to the text features;

filtering the node lines in the text node structure diagram according to the similarity distance, to obtain a target text node structure diagram including each text node and the remaining node lines;

classifying the text to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
In a second aspect, an embodiment of the present application provides a text classification apparatus, including:

an acquisition module, configured to obtain a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected by a node line;

a computing module, configured to extract the text features of each of the texts to be classified, and calculate the similarity distance between every two text nodes according to the text features;

a filtering module, configured to filter the node lines in the text node structure diagram according to the similarity distance, to obtain a target text node structure diagram including each text node and the remaining node lines;

a first classification module, configured to classify the text to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
A third aspect of the embodiments of the present application provides a terminal device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements:

obtaining a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected by a node line;

extracting the text features of each of the texts to be classified, and calculating the similarity distance between every two text nodes according to the text features;

filtering the node lines in the text node structure diagram according to the similarity distance, to obtain a target text node structure diagram including each text node and the remaining node lines;

classifying the text to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements:

obtaining a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected by a node line;

extracting the text features of each of the texts to be classified, and calculating the similarity distance between every two text nodes according to the text features;

filtering the node lines in the text node structure diagram according to the similarity distance, to obtain a target text node structure diagram including each text node and the remaining node lines;

classifying the text to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
Beneficial Effects

Compared with the prior art, the embodiments of the present application have the beneficial effect of improving the accuracy of classifying multiple texts to be classified.

Description of Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the accompanying drawings used in the description of the embodiments or exemplary technologies are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of the implementation of a text classification method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of an implementation of S103 of a text classification method provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of an implementation of S104 of a text classification method provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of an implementation of S1043 of a text classification method provided by an embodiment of the present application;

FIG. 5 is a flowchart of the implementation of a text classification method provided by another embodiment of the present application;

FIG. 6 is a schematic diagram of an implementation of S104B of a text classification method provided by an embodiment of the present application;

FIG. 7 is a structural block diagram of a text classification apparatus provided by an embodiment of the present application;

FIG. 8 is a structural block diagram of a terminal device provided by an embodiment of the present application.
Embodiments of the Present Invention

In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, not to limit it.

The text classification method provided by the embodiments of the present application can be applied to terminal devices such as tablet computers, notebook computers, ultra-mobile personal computers (UMPC), and netbooks; the embodiments of the present application do not place any restriction on the specific type of terminal device.
S101. Obtain a text node structure diagram, where the text node structure diagram includes a plurality of text nodes, each text node corresponds to a text to be classified, and every two text nodes are connected by a node line.

In application, the above multiple texts to be classified may each be a text such as a paper, a journal or a magazine, or may be a sentence or a paragraph, which is not limited. For multiple texts to be classified, each text to be classified may be taken as a text node, and the multiple text nodes may be connected by node lines to generate the text node structure diagram. The texts to be classified may be pre-stored under a designated storage path of the terminal device and then obtained by the terminal device, or may be multiple texts to be classified transmitted by a user in real time, which is not limited.

It can be understood that the number of the above node lines is related to the number of text nodes. Exemplarily, when the number of text nodes is 2, the number of node lines is 1; when the number of text nodes is 3, the number of node lines is 3; when the number of text nodes is 4, the number of node lines is 6; and so on. Specifically, y = (n-1)*n/2, where n (n is an integer, and n ≥ 2) is the number of text nodes and y is the number of node lines.
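The relation y = (n-1)*n/2 above is simply the number of edges in a complete graph on n nodes; a minimal check (the function name is illustrative):

```python
def node_line_count(n: int) -> int:
    """Number of node lines when n text nodes (n >= 2) are pairwise
    connected by node lines, i.e. y = (n - 1) * n / 2."""
    if n < 2:
        raise ValueError("at least two text nodes are required")
    return (n - 1) * n // 2
```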
S102. Extract the text features of each of the texts to be classified, and calculate the similarity distance between every two text nodes according to the text features.

In application, the above similarity distance can be understood as the degree of similarity between two texts to be classified. Specifically, the terminal device may extract the text features of each text to be classified, and calculate the similarity between two text features as the similarity distance between the two texts to be classified.

Exemplarily, extracting the text features of a text to be classified may be: segmenting the text to be classified to obtain multiple text segments; then determining, in a preset word vector library, the words consistent with the text segments, as well as the sequence numbers of the text segments; and finally generating the word features of the text segments according to the sequence numbers. Determining the words consistent with the text segments in the preset word vector library may be implemented with a forward matching algorithm. Specifically, if the longest word in the preset word segmentation library has 5 characters, the first to fifth characters of the text to be classified may be taken as an initial segment, and it is determined whether this segment exists in the word vector library. If it exists, the initial segment is taken as a target segment, and matching continues with the subsequent characters. If the initial segment does not exist in the word vector library, the character length is reduced from right to left; that is, the first to fourth characters of the text to be classified are taken as the initial segment, and it is again determined whether this initial segment exists in the word vector library. In this way, the multiple texts to be classified are segmented to obtain text segments.
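The forward matching procedure just described can be sketched as follows, assuming (as in the example above) a maximum word length of 5 characters; the vocabulary set stands in for the preset word vector library, and a single character is emitted when nothing matches:

```python
def forward_max_match(text, vocab, max_len=5):
    """Forward maximum matching: start with the longest window at the
    current position, shrink it from the right until a vocabulary entry
    is found, emit that segment, and continue after it."""
    segments, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                segments.append(piece)
                i += length
                break
    return segments
```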
In application, after the multiple text segments are determined, the word feature of each text segment can be generated according to its sequence number in the word vector library. Specifically, the word feature dimension of each text segment is preset; then the one-dimensional space of each text segment (the number corresponding to its sequence number) is mapped into a multi-dimensional continuous vector space. Exemplarily, for a text segment with sequence number "5" and a feature vector dimension of 10, the word feature of the text segment may be [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]; in this way, the word feature of each text segment is obtained. On this basis, the text features of a text to be classified can be represented by the word features of its multiple text segments. The similarity distance between every two text nodes can then be calculated with reference to the following steps, specifically:
The similarity distance includes the Euclidean distance. The step of S102, extracting the text features of each of the texts to be classified and calculating the similarity distance between every two text nodes according to the text features, specifically includes the following sub-steps:

Extract the word vectors of the multiple text segments in each of the texts to be classified, and determine the word order of each of the multiple text segments in the text to be classified.

For any pair of text nodes, calculate the Euclidean distance of each text segment according to the word vectors at the same word order in the two texts to be classified, and sum the Euclidean distances of the text segments to obtain the Euclidean distance between the two text nodes.

In application, the way text segments are extracted from the text to be classified, and the word vectors of the text segments, have been explained in S102 above and are not described again. It can be understood that, since the text to be classified is composed of a sequence of consecutive characters, each character has a corresponding character order in the text to be classified. On this basis, the multiple text segments obtained by segmenting the text to be classified character by character as described above also have a corresponding word order.
In application, the Euclidean distance can be calculated as:

$$p(a,b)=\sqrt{\sum_{i=1}^{n}\left(a_i-b_i\right)^2}$$

where n is the number of text segments, a_i denotes the word vector of the i-th text segment in the a-th text to be classified, b_i denotes the word vector of the i-th text segment in the b-th text to be classified, and p(a,b) denotes the similarity distance between text a and text b. If the number of segments m1 of text a is greater than the number of segments m2 of text b, then n may be taken as m1, and the word features of the segments of text b beyond position m2 are all represented by "0". In other applications, the similarity distance may also be the cosine distance between two text nodes calculated from the text features, which is not limited.
S103. Filter the node lines in the text node structure diagram according to the similarity distance, to obtain a target text node structure diagram including each text node and the remaining node lines.

In application, filtering the above node lines can be understood as deleting the node lines whose similarity distance is less than or equal to a preset threshold, thereby obtaining a target text node structure diagram composed of the text nodes and the remaining node lines. It can be understood that the similarity distance between two text nodes is the length of the node line between those two text nodes.

Specifically, a preset threshold may be pre-stored inside the terminal device. When the similarity distance is greater than the preset threshold, the node line between the two texts to be classified is retained; when the similarity distance is less than or equal to the preset threshold, the node line between the two texts to be classified is deleted. It can be understood that, in the target text node structure diagram obtained at this point, the two texts to be classified connected by a node line have a certain similarity, and when they are clustered and grouped, the probability that these two texts belong to the same category grouping is considered greater.
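The threshold filtering step can be sketched as a single pass over the node lines; the dictionary representation and the threshold value used in the test are assumptions for illustration:

```python
def filter_node_lines(node_lines, threshold):
    """Keep only the node lines whose similarity distance is greater
    than the preset threshold; lines whose distance is less than or
    equal to the threshold are deleted, as described above.
    `node_lines` maps (node_a, node_b) pairs to similarity distances."""
    return {pair: dist for pair, dist in node_lines.items() if dist > threshold}
```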
S104. Classify the text to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.

In application, the above community discovery algorithm includes, but is not limited to, the Leiden algorithm and the Louvain algorithm. A community discovery algorithm is a modularity-based classification algorithm that can be used to group multiple hierarchical (different-category) targets (texts to be classified). Modularity can be regarded as a metric that evaluates the effect of each grouping after the multiple targets have been grouped in different ways. It can be understood that when multiple texts to be classified are classified, each category contains at least one text to be classified. Exemplarily, if, after 5 texts to be classified are classified, the target grouping contains 3 categories, then the number of texts to be classified contained in the grouping of each category may be 1 or may be 2, which is not limited.
Specifically, after the target text node structure diagram is classified, target groupings of the multiple texts to be classified are obtained. To determine the overall modularity Q of the target grouping at this point, the modularity Q_i of the grouping of each category in the target grouping is calculated as follows:

$$Q_i=\frac{\sum_{in}}{2m}-\left(\frac{\sum_{tot}}{2m}\right)^2$$

where Q_i is the modularity of the grouping of the i-th category, m is the sum of the similarity distances of all node lines across the groupings of all categories, Σ_in is the sum of the similarity distances of all node lines within the grouping of the i-th category, and Σ_tot is the sum of the similarity distances of all node lines connected to the text nodes within the grouping of the i-th category. On this basis, the modularity corresponding to the grouping of each category after each classification can be obtained; summing these modularities then yields the overall modularity Q of the target grouping.
In this embodiment, a text node structure graph is generated by taking each text to be classified as a text node; the similarity distance between each pair of texts to be classified is then calculated, and the text node structure graph is preliminarily filtered to obtain the target text node structure graph, realizing a preliminary grouping of the texts to be classified. Then, according to the community detection algorithm and the remaining node lines, the texts to be classified in the target text node structure graph are classified, and the plurality of texts to be classified are grouped again to obtain the target grouping. In this way, the accuracy of classifying the plurality of texts to be classified is improved.
Referring to FIG. 2, in a specific embodiment, S103 (filtering the node lines in the text node structure graph according to the similarity distance to obtain a target text node structure graph containing each text node and the remaining node lines) specifically includes the following sub-steps S1031-S1034, detailed as follows:
S1031. For any similarity distance, determine whether the similarity distance is less than or equal to a preset threshold.
S1032. If it is determined that the similarity distance is less than or equal to the preset threshold, delete the node line corresponding to that similarity distance. And,
S1033. If it is determined that the similarity distance is greater than the preset threshold, keep the node line corresponding to that similarity distance.
S1034. Generate the target text node structure graph based on the remaining node lines and the multiple text nodes.
In application, the above preset threshold may be set by the user according to the actual situation, or may be a fixed value preset in the terminal device; this is not limited here. As explained in S103, when a similarity distance is less than or equal to the preset threshold, the node line corresponding to that similarity distance is deleted, and when a similarity distance is greater than the preset threshold, the node line corresponding to that similarity distance is retained. The target text node structure graph is thereby generated; this is not described again.
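Sub-steps S1031-S1034 amount to a simple edge filter. A Python sketch, under the assumption that node lines are stored as weighted tuples (an illustrative layout, not the original implementation):

```python
def filter_node_lines(node_lines, threshold):
    """S1031-S1033: keep only node lines whose similarity distance is
    strictly greater than the preset threshold.

    node_lines: list of (node_a, node_b, similarity_distance)
    """
    return [(a, b, d) for a, b, d in node_lines if d > threshold]


def build_target_graph(text_nodes, node_lines, threshold):
    """S1034: the target text node structure graph keeps every text node
    together with the remaining node lines."""
    return {"nodes": list(text_nodes),
            "lines": filter_node_lines(node_lines, threshold)}
```

Note that every text node is kept even if all of its node lines are removed; only node lines are filtered.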
Referring to FIG. 3, in a specific embodiment, S104 (classifying the texts to be classified in the target text node structure graph based on the community detection algorithm and the remaining node lines to obtain target groupings of the plurality of texts to be classified) specifically includes the following sub-steps S1041-S1046, detailed as follows:
S1041. Take each text node in the target text node structure graph as a separate node grouping.
In application, when the plurality of texts to be classified are classified, the text node corresponding to each text to be classified is taken as one node grouping. The same processing steps are then performed on each node grouping in turn to obtain the target groupings of the plurality of texts to be classified.
S1042. For any node grouping, fuse that node grouping with any adjacent node grouping, one at a time, to obtain multiple grouping modules, where each grouping module includes the node grouping formed by fusing that node grouping with one of its adjacent node groupings, together with the remaining separate node groupings.
In application, the target text node structure graph also contains node lines connecting text nodes with other text nodes. Based on this, for the node grouping of any text node, that node grouping can be fused with any adjacent node grouping, one at a time, to obtain a fused node grouping together with the remaining separate node groupings.
For example, if there are k texts to be classified, there are k node groupings. For any node grouping i, node grouping i can be fused with an adjacent node grouping j, that is, the node groupings of the two texts to be classified are taken as one grouping. The resulting grouping module then contains k-1 node groupings. It can be understood that among these k-1 groupings, there is one node grouping formed by fusion, together with k-2 remaining separate node groupings. The other node groupings also need to be processed as in S1042 to obtain the corresponding grouping modules.
S1043. Calculate the grouping modularity of each grouping module, where the grouping modularity is used to represent the classification effect of the grouping module.
In application, after a grouping module is determined, the grouping modularity of that grouping module is obtained by adding the modularity of the node grouping formed by fusion to the modularity of each of the remaining separate node groupings. That is, for the grouping module containing the above k-1 node groupings, the modularity of each of the k-1 node groupings needs to be calculated separately and summed to obtain the overall modularity of the grouping module (the grouping modularity). Specifically, reference may be made to the formula for calculating Qi in S104, where Q represents the overall modularity of the grouping module (the grouping modularity) and Qi represents the modularity of the grouping of the i-th category (among the k-1 node groupings contained in the grouping module, the modularity of the i-th node grouping), with i an integer and i ≤ k-1.
In application, according to the above steps, if node grouping i has another adjacent node grouping x, the above steps also need to be repeated for node grouping i and node grouping x to obtain the grouping modularity of another grouping module. Then, after node grouping i has been processed, steps S1042 and S1043 need to be repeated for the remaining separate node groupings (j, x, or any other node grouping). In this way, the multiple grouping modules obtained by classifying each node grouping, together with the grouping modularity corresponding to each grouping module, can be obtained.
S1044. Among the multiple grouping modules, take the grouping module corresponding to the maximum grouping modularity as the current target grouping module, and determine whether the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the previous target grouping module.
In application, after the grouping modularity of each grouping module is obtained, the maximum of the multiple grouping modularities can be determined from their values, and the grouping module corresponding to that maximum grouping modularity is taken as the current target grouping module.
S1045. If it is determined that the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the previous target grouping module, take the previous target grouping module as the target grouping of the plurality of texts to be classified.
S1046. If it is determined that the grouping modularity of the current target grouping module is greater than the grouping modularity of the previous target grouping module, then for each node grouping in the current target grouping module, perform again the step of fusing any node grouping with any adjacent node grouping to obtain multiple grouping modules, until the target grouping of each text to be classified is obtained.
In application, after the current target grouping module is determined, it is necessary to determine whether the current target grouping module is the optimal grouping of the plurality of texts to be classified. For this, the grouping modularity of the current target grouping module needs to be compared with the grouping modularity of the previous target grouping module. Specifically, if the grouping modularity of the current target grouping module is less than or equal to that of the previous target grouping module, the previous target grouping module is the optimal grouping (target grouping) of the plurality of texts to be classified. If the grouping modularity of the current target grouping module is greater than that of the previous target grouping module, the previous target grouping is not the optimal grouping. In that case, to determine whether the current target grouping is the optimal grouping, S1042 to S1045 need to be repeated for each node grouping in the current target grouping module until the target grouping of each text to be classified is obtained.
It should be noted that when S1042 to S1045 are repeated, S1042 now applies to each node grouping in the current target grouping. That is, if the node groupings contained in the current target grouping module are the node grouping formed by fusing node grouping i with node grouping j, together with the remaining separate node groupings, then when S1042 is executed, the number of node groupings is k-1. At this point, the node grouping formed by fusing node grouping i with node grouping j is treated as a single node grouping.
Specifically, for the grouping modularity Qk-1 of the current target grouping module, if Qk-1 is less than or equal to the grouping modularity Qk of the previous target grouping module, the classification effect of the current target grouping module is no better than that of the previous target grouping module, which indicates that the previous target grouping module is the target grouping with the best effect. If the grouping modularity Qk-1 of the current target grouping module is greater than the grouping modularity Qk of the previous target grouping module, the classification effect of the current target grouping module is better than that of the previous target grouping module. S1042 to S1045 are then repeated for the k-1 node groupings of the current target grouping module until the target grouping module with the best classification effect is obtained.
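Steps S1041-S1046 describe a greedy, Louvain-style merging loop. A self-contained Python sketch under the formula of S104 (the helper names and data layout are illustrative assumptions, not the original implementation):

```python
import itertools


def modularity(edges, groups):
    """Overall modularity: sum over groupings of
    Qi = sigma_in/(2m) - (sigma_tot/(2m))**2."""
    m = sum(w for _, _, w in edges)
    q = 0.0
    for g in groups:
        s_in = sum(w for a, b, w in edges if a in g and b in g)
        s_tot = sum(w for a, b, w in edges if a in g or b in g)
        q += s_in / (2 * m) - (s_tot / (2 * m)) ** 2
    return q


def greedy_grouping(nodes, edges):
    # S1041: every text node starts as its own node grouping
    groups = [frozenset([n]) for n in nodes]
    best_q = modularity(edges, groups)
    while True:
        # S1042: each candidate grouping module fuses one pair of
        # adjacent node groupings and keeps the rest separate
        candidates = []
        for g1, g2 in itertools.combinations(groups, 2):
            adjacent = any((a in g1 and b in g2) or (a in g2 and b in g1)
                           for a, b, _ in edges)
            if adjacent:
                fused = [g for g in groups if g not in (g1, g2)]
                fused.append(g1 | g2)
                candidates.append(fused)
        if not candidates:
            return groups
        # S1043-S1044: pick the grouping module with maximal modularity
        current = max(candidates, key=lambda gs: modularity(edges, gs))
        current_q = modularity(edges, current)
        # S1045: if modularity stops improving, keep the previous grouping
        if current_q <= best_q:
            return groups
        # S1046: otherwise repeat with the fused groupings
        groups, best_q = current, current_q
```

On a small graph with two strongly connected pairs joined by a weak node line, the loop fuses each pair and then stops, since fusing everything would lower the modularity.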
Referring to FIG. 4, in a specific embodiment, the grouping modularity of the grouping module is obtained by adding the modularity of the node grouping formed by fusion to the modularities of the remaining separate node groupings; S1043 (calculating the grouping modularity of each grouping module) specifically includes the following sub-steps S10431-S10434, detailed as follows:
S10431. For any grouping module, determine a first number of node lines contained in the current node grouping in that grouping module, and a second number of node lines connecting each of the remaining separate node groupings to the current node grouping, where the current node grouping is the grouping in the grouping module whose modularity is currently to be calculated.
S10432. Calculate the modularity of the current node grouping according to the first number and the second number.
S10433. Take each of the remaining separate node groupings as the current node grouping in turn, and calculate the modularity of each of the remaining separate node groupings in the grouping module.
S10434. Take the sum of the modularity of the current node grouping and the modularities of the remaining separate node groupings as the grouping modularity of the grouping module.
In application, the modularity of the current node grouping is calculated according to the first number and the second number; for details, reference may be made to the modularity calculation formula in S104, which is not described again.
In application, the current node grouping is the grouping in the grouping module whose modularity is currently being calculated. It can be understood that if the current node grouping contains only one text node, the first number of node lines contained in the current node grouping is 0. The second number is the number of node lines connecting the remaining node groupings to the current node grouping. It can be understood that if the current node grouping contains multiple text nodes (a grouping obtained by fusing multiple node groupings), then a remaining node grouping connected to any node grouping within the current node grouping counts as being connected to the current node grouping.
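The first and second numbers of S10431 reduce to simple edge counts. A Python sketch, with an illustrative data layout assumed for the node lines:

```python
def count_node_lines(node_lines, current_group):
    """S10431: first number = node lines inside the current node grouping;
    second number = node lines connecting it to the remaining groupings.

    node_lines: list of (node_a, node_b) pairs
    current_group: set of text nodes in the current node grouping
    """
    first = sum(1 for a, b in node_lines
                if a in current_group and b in current_group)
    second = sum(1 for a, b in node_lines
                 if (a in current_group) != (b in current_group))
    return first, second
```

As stated above, a grouping with a single text node yields a first number of 0, since it can contain no internal node lines.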
Referring to FIG. 5, in an embodiment, after S104 (classifying the texts to be classified in the target text node structure graph based on the community detection algorithm and the remaining node lines to obtain the target groupings of the plurality of texts to be classified), the method further includes the following steps S104A-S104B, detailed as follows:
S104A. Determine a classification range corresponding to each node grouping in the target grouping, where the classification range contains multiple target classifications.
In application, after the target grouping is obtained, the number of node groupings contained in the target grouping is smaller than the initial number of node groupings; that is, each node grouping in the target grouping now contains multiple text nodes. The node grouping formed by fusing these text nodes can be regarded as belonging to one classification range. The classification range includes, but is not limited to, subject classifications such as physics, chemistry, and computer science; it may also be a classification range one level below a subject classification range. For example, a classification range may be mechanics, electricity, or optics. It can be understood that each classification range may in turn contain multiple specific target classifications. For example, the target classifications contained in the mechanics classification range may include a quantum mechanics target classification, a Newtonian mechanics target classification, and the like.
In application, the plurality of texts to be classified may belong to the same classification range but different target classifications, or may belong to multiple classification ranges and different target classifications. The classification range corresponding to each node grouping in the target grouping may be annotated manually, or a model for determining the classification range may be pre-trained in the terminal device; this model is only used to determine the classification range of each node grouping according to the texts to be classified within that node grouping in the target grouping. It can be understood that the text similarity (the similarity distance between text nodes) of texts to be classified belonging to different cross-subject classification ranges is usually very small; therefore, the multiple texts to be classified within the same node grouping usually belong to one subject (classification range). Based on this, the model only needs to predict the classification range from any one text to be classified in each node grouping to determine the classification range corresponding to that node grouping.
S104B. For any node grouping, perform a secondary classification on the multiple text nodes contained in that node grouping according to preset keywords respectively corresponding to the multiple target classifications in the classification range, to obtain the final target grouping of the plurality of texts to be classified.
In application, as explained above, each classification range contains multiple target classifications. Based on this, the terminal device may pre-store the keywords corresponding to each target classification. Then, for any node grouping, the preset keywords of each target classification corresponding to that node grouping are compared with each text to be classified, so as to perform a secondary classification on the multiple texts to be classified contained in the node grouping and obtain the final target grouping of the plurality of texts to be classified.
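The keyword comparison of S104B can be sketched as a simple scoring rule in Python. Scoring by raw keyword occurrence counts is an illustrative assumption; the original does not fix a particular comparison rule:

```python
def keyword_classify(text, keywords_by_classification):
    """Secondary classification: assign the text to the target
    classification whose preset keywords occur most often in it.

    keywords_by_classification: dict mapping a target classification
    to its list of preset keywords
    """
    scores = {cls: sum(text.count(kw) for kw in kws)
              for cls, kws in keywords_by_classification.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: no keyword matched
```

Returning None when no keyword matches corresponds to the case handled by sub-steps S104B1-S104B3 below, where a text contains none of the preset keywords.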
It can be understood that the steps from S101 to S104 above may be shared by different business scenarios (different text classification business scenarios). When other business scenarios need to implement a text classification service, the system resources covering steps S101 to S104 can also be used for processing, which greatly reduces the system resources required for text classification across different services, without dedicating separate system resources to each text classification business scenario.
Referring to FIG. 6, in an embodiment, S104B (performing a secondary classification on the multiple text nodes contained in the node grouping according to the preset keywords respectively corresponding to the multiple target classifications in the classification range, to obtain the final target grouping of the plurality of texts to be classified) further includes the following sub-steps S104B1-S104B3, detailed as follows:
S104B1. For the multiple text nodes contained in any node grouping, determine whether, among these text nodes, there is any text node whose corresponding text to be classified does not contain the preset keywords of the multiple target classifications.
S104B2. If it is determined that, among the multiple text nodes, there is a text node whose corresponding text to be classified does not contain the preset keywords of the multiple target classifications, obtain the text segments of that text to be classified.
In application, the keywords corresponding to each target classification are usually limited. In practice, to distinguish the target classification of each text to be classified, the preset keywords corresponding to the respective target classifications should differ from one another, and the user cannot preset all possible keywords for every classification. Based on this, a situation may arise in which a text to be classified contained in a node grouping contains none of the preset keywords corresponding to any target classification.
In application, performing text segmentation on the text to be classified may be done with reference to the segmentation example in S102, which is not explained again.
S104B3. Perform a secondary classification on the text to be classified according to the text segments and multiple preset texts of known target classification, to obtain the final target grouping.
In application, the above preset texts of known target classification may be multiple texts preset by the user, including but not limited to journals, papers, and the like.
In application, performing a secondary classification on the text to be classified according to the text segments and the multiple preset texts of known target classification may mean that the terminal device recognizes the text semantics from the text segments and then performs the secondary classification, or that, when it is determined that the text segments frequently appear in the preset texts corresponding to a certain target classification, the text to be classified is assigned to that target classification.
For example, consider two texts to be classified within one node grouping: a, "Every object remains at rest or in uniform linear motion unless acted upon by a force or when the resultant force is zero"; and b, "When you measure the momentum of a particle, you cannot accurately measure its position." Performing text segmentation on text a yields multiple text segments, including, for example, force, resultant force, state of rest, and state of uniform linear motion. Across other preset texts of known target classification (for example, papers or journals), the terminal device can count the number of articles that simultaneously contain these text segments (the text segments of text a). Then, from the articles that simultaneously contain these text segments, the number of articles belonging to each target classification is determined, and the target classification with the largest number of such articles is taken as the target classification of the text to be classified. Through this processing it is found that most articles containing the text segments of text a belong to the "Newtonian mechanics" category, while after the same processing for text b, most articles that simultaneously contain the text segments of text b belong to the "quantum mechanics" category. Based on this, it can be determined that text a belongs to the final target grouping of the "Newtonian mechanics" category, and text b belongs to the final target grouping of "quantum mechanics". In this way, the accuracy of text classification is improved when more precise classification of the texts to be classified is performed.
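The co-occurrence counting of S104B3 can be sketched in Python. The corpus layout (a list of labeled segment sets) and the "contains all segments" criterion are illustrative assumptions drawn from the example above:

```python
def classify_by_cooccurrence(text_segments, preset_texts):
    """S104B3 sketch: among preset texts that contain all of the text
    segments, count how many belong to each target classification and
    pick the classification with the largest count.

    preset_texts: list of (target_classification, set_of_segments)
    """
    wanted = set(text_segments)
    counts = {}
    for classification, segments in preset_texts:
        if wanted <= segments:  # article contains all the text segments
            counts[classification] = counts.get(classification, 0) + 1
    return max(counts, key=counts.get) if counts else None
```

In practice the strict "all segments" criterion could be relaxed to a fraction of matching segments; the original leaves this choice open.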
Referring to FIG. 7, FIG. 7 is a structural block diagram of a text classification apparatus provided by an embodiment of the present application. The modules included in the text classification apparatus of this embodiment are used to execute the steps in the embodiments corresponding to FIG. 1 to FIG. 6; for details, refer to FIG. 1 to FIG. 6 and the related descriptions of the embodiments corresponding to FIG. 1 to FIG. 6. For ease of explanation, only the parts related to this embodiment are shown. Referring to FIG. 7, the text classification apparatus 700 includes an acquisition module 710, a calculation module 720, a filtering module 730, and a first classification module 740, where:
The acquisition module 710 is configured to acquire a text node structure graph, where the text node structure graph contains multiple text nodes, each text node corresponds to one text to be classified, and each pair of text nodes is connected by a node line.
The calculation module 720 is configured to extract the text features of each text to be classified and to calculate, according to the text features, the similarity distance between each pair of text nodes.
The filtering module 730 is configured to filter the node lines in the text node structure graph according to the similarity distances to obtain a target text node structure graph containing each text node and the remaining node lines.
The first classification module 740 is configured to classify the texts to be classified in the target text node structure graph based on the community detection algorithm and the remaining node lines to obtain target groupings of the plurality of texts to be classified.
In an embodiment, the text to be classified contains multiple text segments, the text features are composed of the word vectors of the multiple text segments, and the similarity distance includes a Euclidean distance; the calculation module 720 is further configured to:
extract the word vectors of the multiple text segments in each text to be classified, and determine the word order of the multiple text segments in their respective texts to be classified; and, for any pair of text nodes, calculate the Euclidean distance of each text segment according to the word vectors at the same word order in the texts to be classified, and sum the Euclidean distances of the text segments to obtain the Euclidean distance between the pair of text nodes.
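This distance can be sketched in Python, under the assumption that the two texts have already been segmented and embedded into equal-length sequences of word vectors (how the vectors are produced is left open by the description):

```python
import math


def text_euclidean_distance(vectors_a, vectors_b):
    """Sum of per-word Euclidean distances between two texts, pairing
    word vectors by their word order within each text."""
    total = 0.0
    for va, vb in zip(vectors_a, vectors_b):
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
    return total
```

A smaller result indicates more similar texts, which is why node lines with small similarity distances connect texts likely to fall into the same grouping.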
In an embodiment, the filtering module 730 is further configured to:
determine, for any similarity distance, whether the similarity distance is less than or equal to a preset threshold; if it is determined that the similarity distance is less than or equal to the preset threshold, delete the node line corresponding to that similarity distance; and, if it is determined that the similarity distance is greater than the preset threshold, keep the node line corresponding to that similarity distance; and generate the target text node structure graph based on the remaining node lines and the multiple text nodes.
在一实施例中,第一分类模块740还用于:In one embodiment, the first classification module 740 is further configured to:
Each text node in the target text node structure diagram is taken as a separate node grouping. For any node grouping, that node grouping is fused with each adjacent node grouping in turn to obtain multiple grouping modules, where each grouping module consists of the node grouping formed by fusing the node grouping with one of its adjacent node groupings, together with the remaining individual node groupings. The grouping modularity of each grouping module, which represents the classification effect of that grouping module, is calculated. Among the multiple grouping modules, the grouping module corresponding to the maximum grouping modularity is taken as the current target grouping module, and it is determined whether the grouping modularity of the current target grouping module is less than or equal to that of the previous target grouping module. If the grouping modularity of the current target grouping module is determined to be less than or equal to that of the previous target grouping module, the previous target grouping module is taken as the target grouping of the multiple texts to be classified; if the grouping modularity of the current target grouping module is determined to be greater than that of the previous target grouping module, the step of fusing each node grouping with its adjacent node groupings to obtain multiple grouping modules is performed again for each node grouping in the current target grouping module, until the target grouping of each text to be classified is obtained.
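The fuse-and-compare loop described above resembles agglomerative modularity maximization. The sketch below illustrates the idea, using standard Newman modularity purely as a stand-in for the specification's grouping modularity (which is defined by its own edge counts in the following paragraphs); all names and the example graph are hypothetical:

```python
from itertools import combinations

def modularity(groups, edges):
    """Standard Newman modularity, a stand-in for the grouping modularity."""
    m = len(edges)
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    q = 0.0
    for g in groups:
        internal = sum(1 for a, b in edges if a in g and b in g)
        total_deg = sum(deg.get(n, 0) for n in g)
        q += internal / m - (total_deg / (2 * m)) ** 2
    return q

def greedy_merge(nodes, edges):
    """Start from singleton node groupings; repeatedly apply the adjacent-group
    fusion with the highest modularity; stop when no fusion improves it."""
    groups = [frozenset([n]) for n in nodes]
    best_q = modularity(groups, edges)
    while True:
        best = None
        for i, j in combinations(range(len(groups)), 2):
            # only fuse groupings joined by at least one node line
            adjacent = any((a in groups[i] and b in groups[j]) or
                           (a in groups[j] and b in groups[i]) for a, b in edges)
            if not adjacent:
                continue
            trial = [g for k, g in enumerate(groups) if k not in (i, j)]
            trial.append(groups[i] | groups[j])
            q = modularity(trial, edges)
            if best is None or q > best[0]:
                best = (q, trial)
        if best is None or best[0] <= best_q:
            return groups  # no fusion improves the modularity: stop
        best_q, groups = best

# two tightly-knit triangles joined by a single node line
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
communities = greedy_merge(nodes, edges)
```

On this toy graph the loop converges to the two triangles, mirroring the stopping rule above: the last grouping whose modularity could not be improved is kept as the target grouping.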
In one embodiment, the grouping modularity of a grouping module is obtained by adding the modularity of the fused node grouping that forms the grouping module to the modularities of the remaining individual node groupings; the first classification module 740 is further configured to:
For any grouping module, a first number of node lines contained in the current node grouping of the grouping module, and a second number of node lines connecting each of the remaining individual node groupings to the current node grouping, are determined, where the current node grouping is the grouping in the grouping module whose modularity is currently to be calculated. The modularity of the current node grouping is calculated according to the first number and the second number. Each of the remaining individual node groupings is then taken in turn as the current node grouping, and the modularity of each of the remaining individual node groupings in the grouping module is calculated in the same way. The sum of the modularity of the current node grouping and the modularities of the remaining individual node groupings is taken as the grouping modularity of the grouping module.
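The specification fixes only the inputs to the per-group modularity (the first and second numbers of node lines) and the final summation, not the formula itself. The sketch below therefore uses the standard Newman per-community term purely as an assumed stand-in for the unspecified formula:

```python
def group_modularity(first, second, total_edges):
    """first: node lines inside the group; second: node lines linking the
    group to the other groups. The formula here is the standard Newman
    per-community term, an assumption: the specification does not fix it."""
    m = total_edges
    degree_sum = 2 * first + second  # each internal node line counts twice
    return first / m - (degree_sum / (2 * m)) ** 2

def grouping_modularity(per_group_counts, total_edges):
    # Sum the per-group terms, as the final step above prescribes.
    return sum(group_modularity(first, second, total_edges)
               for first, second in per_group_counts)

# two groups, each with 3 internal node lines and 1 line to the other group
value = grouping_modularity([(3, 1), (3, 1)], total_edges=7)  # ≈ 0.357
```

A larger value indicates a stronger community structure, matching the text's use of grouping modularity as a measure of classification effect.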
In one embodiment, the text classification apparatus 700 further includes:
a determination module, configured to determine a classification range corresponding to each node grouping in the target grouping, the classification range containing multiple target classifications; and
a second classification module, configured to, for any node grouping, perform secondary classification on the multiple text nodes contained in the node grouping according to preset keywords respectively corresponding to the multiple target classifications in the classification range, to obtain a final target grouping of the multiple texts to be classified.
In one embodiment, the second classification module is further configured to:
for the multiple text nodes contained in any node grouping, determine whether there is any text node whose corresponding text to be classified contains none of the preset keywords of the multiple target classifications; if it is determined that such a text node exists, obtain the text segments of that text to be classified; and perform secondary classification on the text to be classified according to the text segments and multiple preset texts of known target classification, to obtain the final target grouping.
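A minimal sketch of this two-stage assignment follows, with whitespace tokenization standing in for real text segmentation and token overlap standing in for the word-vector comparison the specification implies; all labels, keywords, and preset texts are invented for illustration:

```python
def classify_in_group(texts, keywords, preset_texts):
    """texts: the texts to be classified in one node grouping.
    keywords: {target classification: set of preset keywords}.
    preset_texts: {target classification: token set of a preset text of
    known classification}. All data here is invented for illustration."""
    result = {}
    for text in texts:
        tokens = set(text.split())  # whitespace split stands in for segmentation
        # first pass: direct assignment when a preset keyword is present
        hit = next((label for label, kws in keywords.items() if tokens & kws), None)
        if hit is None:
            # second pass: no keyword matched, so fall back to the most
            # similar preset text (overlap stands in for vector similarity)
            hit = max(preset_texts, key=lambda label: len(tokens & preset_texts[label]))
        result[text] = hit
    return result

keywords = {"billing": {"invoice"}, "support": {"error"}}
preset_texts = {"billing": {"pay", "refund"}, "support": {"crash", "bug"}}
assignments = classify_in_group(
    ["invoice overdue", "app crash bug today"], keywords, preset_texts)
```

Here "invoice overdue" is assigned in the first pass by keyword, while "app crash bug today" contains no preset keyword and falls through to the preset-text comparison.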
It should be understood that, in the structural block diagram of the text classification apparatus shown in FIG. 7, each unit/module is used to execute the steps of the embodiments corresponding to FIG. 1 to FIG. 6. Those steps have been explained in detail in the foregoing embodiments; for details, refer to FIG. 1 to FIG. 6 and the related descriptions of the embodiments corresponding thereto, which are not repeated here.
FIG. 8 is a structural block diagram of a terminal device provided by another embodiment of the present application. As shown in FIG. 8, the terminal device 800 of this embodiment includes: a processor 810, a memory 820, and a computer program 830 stored in the memory 820 and executable on the processor 810, for example, a program implementing the text classification method. When executing the computer program 830, the processor 810 implements the steps in the above embodiments of the text classification method, for example, S101 to S104 shown in FIG. 1. Alternatively, when executing the computer program 830, the processor 810 implements the functions of the modules in the embodiment corresponding to FIG. 7, for example, the functions of the modules 710 to 740 shown in FIG. 7. Specifically:
A terminal device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the following:
obtaining a text node structure diagram, the text node structure diagram containing multiple text nodes, each text node corresponding to one text to be classified, and every two text nodes being connected by a node line;
extracting a text feature of each text to be classified, and calculating similarity distances between every two text nodes according to the text features;
filtering the node lines in the text node structure diagram according to the similarity distances, to obtain a target text node structure diagram containing each text node and the remaining node lines; and
classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the following:
obtaining a text node structure diagram, the text node structure diagram containing multiple text nodes, each text node corresponding to one text to be classified, and every two text nodes being connected by a node line;
extracting a text feature of each text to be classified, and calculating similarity distances between every two text nodes according to the text features;
filtering the node lines in the text node structure diagram according to the similarity distances, to obtain a target text node structure diagram containing each text node and the remaining node lines; and
classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
Exemplarily, the computer program 830 may be divided into one or more modules, and the one or more modules are stored in the memory 820 and executed by the processor 810 to complete the present application. The one or more modules may be a series of computer program instruction segments capable of accomplishing specific functions, the instruction segments being used to describe the execution process of the computer program 830 in the terminal device 800.
The terminal device may include, but is not limited to, the processor 810 and the memory 820. Those skilled in the art can understand that FIG. 8 is only an example of the terminal device 800 and does not constitute a limitation on the terminal device 800; the terminal device may include more or fewer components than shown, or combine certain components; for example, the terminal device may also include input/output devices, a bus, and the like.
The processor 810 may be a central processing unit, or may be another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 820 may be an internal storage unit of the terminal device 800, such as a hard disk or memory of the terminal device 800. The memory 820 may also be an external storage device of the terminal device 800, such as a plug-in hard disk, a smart media card, or a flash memory card equipped on the terminal device 800. Further, the memory 820 may include both an internal storage unit and an external storage device of the terminal device 800.
The computer-readable storage medium may be an internal storage unit of the terminal device described in the foregoing embodiments, such as a hard disk or memory of the terminal device. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium may also be an external storage device of the terminal device, such as a plug-in hard disk, a smart media card, a secure digital card, or a flash memory card equipped on the terminal device.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (20)

  1. A text classification method, wherein the method comprises:
    obtaining a text node structure diagram, the text node structure diagram containing multiple text nodes, each text node corresponding to one text to be classified, and every two text nodes being connected by a node line;
    extracting a text feature of each text to be classified, and calculating similarity distances between every two text nodes according to the text features;
    filtering the node lines in the text node structure diagram according to the similarity distances, to obtain a target text node structure diagram containing each text node and the remaining node lines; and
    classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
  2. The text classification method according to claim 1, wherein the text to be classified contains multiple text segments, the text feature consists of word vectors of the multiple text segments, and the similarity distance comprises a Euclidean distance; and the extracting a text feature of each text to be classified, and calculating similarity distances between every two text nodes according to the text features, comprises:
    extracting the word vectors of the multiple text segments in each text to be classified, and determining the word order of each of the multiple text segments in its corresponding text to be classified; and
    for any two text nodes, calculating the Euclidean distance of each text segment according to the word vectors at the same word order in the two texts to be classified, and summing the Euclidean distances of the text segments to obtain the Euclidean distance between the two text nodes.
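By way of illustration only (not part of the claims), the per-word-order Euclidean distance recited above can be sketched as follows, using toy 2-D word vectors and assuming, as the claim implies, that the two texts contribute the same number of segments:

```python
import math

def pairwise_text_distance(vectors_a, vectors_b):
    """vectors_a / vectors_b: word vectors listed in each text's word order
    (assumed equal length). Per word order, take the Euclidean distance
    between the two vectors, then sum over all positions."""
    total = 0.0
    for va, vb in zip(vectors_a, vectors_b):
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
    return total

# toy 2-D word vectors for two texts of two segments each
distance = pairwise_text_distance([(0.0, 0.0), (1.0, 0.0)],
                                  [(3.0, 4.0), (1.0, 2.0)])  # 5.0 + 2.0 = 7.0
```

Real word vectors would come from a trained embedding model; the 2-D tuples here are invented so the arithmetic is easy to follow.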
  3. The text classification method according to claim 1 or 2, wherein the filtering the node lines in the text node structure diagram according to the similarity distances, to obtain a target text node structure diagram containing each text node and the remaining node lines, comprises:
    for any similarity distance, determining whether the similarity distance is less than or equal to a preset threshold;
    if the similarity distance is determined to be less than or equal to the preset threshold, deleting the node line corresponding to the similarity distance; and
    if the similarity distance is determined to be greater than the preset threshold, retaining the node line corresponding to the similarity distance; and
    generating the target text node structure diagram based on the remaining node lines and the multiple text nodes.
  4. The text classification method according to claim 1, wherein the classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified, comprises:
    taking each text node in the target text node structure diagram as a separate node grouping;
    for any node grouping, fusing the node grouping with each adjacent node grouping in turn to obtain multiple grouping modules, each grouping module comprising the node grouping formed by fusing the node grouping with one of its adjacent node groupings, and the remaining individual node groupings;
    calculating a grouping modularity of each grouping module, the grouping modularity representing the classification effect of the grouping module;
    taking, among the multiple grouping modules, the grouping module corresponding to the maximum grouping modularity as the current target grouping module, and determining whether the grouping modularity of the current target grouping module is less than or equal to that of the previous target grouping module;
    if the grouping modularity of the current target grouping module is determined to be less than or equal to that of the previous target grouping module, taking the previous target grouping module as the target grouping of the multiple texts to be classified; and
    if the grouping modularity of the current target grouping module is determined to be greater than that of the previous target grouping module, performing again, for each node grouping in the current target grouping module, the step of fusing the node grouping with each adjacent node grouping to obtain multiple grouping modules, until the target grouping of each text to be classified is obtained.
  5. The text classification method according to claim 4, wherein the grouping modularity of the grouping module is obtained by adding the modularity of the fused node grouping that forms the grouping module to the modularities of the remaining individual node groupings; and
    the calculating a grouping modularity of each grouping module comprises:
    for any grouping module, determining a first number of node lines contained in a current node grouping of the grouping module, and a second number of node lines connecting each of the remaining individual node groupings to the current node grouping, the current node grouping being the grouping in the grouping module whose modularity is currently to be calculated;
    calculating the modularity of the current node grouping according to the first number and the second number;
    taking each of the remaining individual node groupings in turn as the current node grouping, and calculating the modularity of each of the remaining individual node groupings in the grouping module; and
    taking the sum of the modularity of the current node grouping and the modularities of the remaining individual node groupings as the grouping modularity of the grouping module.
  6. The text classification method according to claim 4 or 5, wherein after the classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified, the method further comprises:
    determining a classification range corresponding to each node grouping in the target grouping, the classification range containing multiple target classifications; and
    for any node grouping, performing secondary classification on the multiple text nodes contained in the node grouping according to preset keywords respectively corresponding to the multiple target classifications in the classification range, to obtain a final target grouping of the multiple texts to be classified.
  7. The text classification method according to claim 6, wherein the performing secondary classification on the multiple text nodes contained in the node grouping according to the preset keywords respectively corresponding to the multiple target classifications in the classification range, to obtain the final target grouping of the multiple texts to be classified, comprises:
    for the multiple text nodes contained in any node grouping, determining whether there is any text node whose corresponding text to be classified does not contain the preset keywords of the multiple target classifications;
    if it is determined that there is a text node whose corresponding text to be classified does not contain the preset keywords of the multiple target classifications, obtaining the text segments of the text to be classified; and
    performing secondary classification on the text to be classified according to the text segments and multiple preset texts of known target classification, to obtain the final target grouping.
  8. A text classification apparatus, wherein the apparatus comprises:
    an obtaining module, configured to obtain a text node structure diagram, the text node structure diagram containing multiple text nodes, each text node corresponding to one text to be classified, and every two text nodes being connected by a node line;
    a calculation module, configured to extract a text feature of each text to be classified, and calculate similarity distances between every two text nodes according to the text features;
    a filtering module, configured to filter the node lines in the text node structure diagram according to the similarity distances, to obtain a target text node structure diagram containing each text node and the remaining node lines; and
    a first classification module, configured to classify the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
  9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following:
    obtaining a text node structure diagram, the text node structure diagram containing multiple text nodes, each text node corresponding to one text to be classified, and every two text nodes being connected by a node line;
    extracting a text feature of each text to be classified, and calculating similarity distances between every two text nodes according to the text features;
    filtering the node lines in the text node structure diagram according to the similarity distances, to obtain a target text node structure diagram containing each text node and the remaining node lines; and
    classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
  10. The terminal device according to claim 9, wherein the text to be classified contains multiple text segments, the text feature consists of word vectors of the multiple text segments, and the similarity distance comprises a Euclidean distance; and the processor, when executing the computer program, further implements:
    extracting the word vectors of the multiple text segments in each text to be classified, and determining the word order of each of the multiple text segments in its corresponding text to be classified; and
    for any two text nodes, calculating the Euclidean distance of each text segment according to the word vectors at the same word order in the two texts to be classified, and summing the Euclidean distances of the text segments to obtain the Euclidean distance between the two text nodes.
  11. The terminal device according to claim 9 or 10, wherein the processor, when executing the computer program, further implements:
    for any similarity distance, determining whether the similarity distance is less than or equal to a preset threshold;
    if the similarity distance is determined to be less than or equal to the preset threshold, deleting the node line corresponding to the similarity distance; and
    if the similarity distance is determined to be greater than the preset threshold, retaining the node line corresponding to the similarity distance; and
    generating the target text node structure diagram based on the remaining node lines and the multiple text nodes.
  12. The terminal device according to claim 9 or 10, wherein the processor, when executing the computer program, further implements:
    taking each text node in the target text node structure diagram as a separate node grouping;
    for any node grouping, fusing the node grouping with each adjacent node grouping in turn to obtain multiple grouping modules, each grouping module comprising the node grouping formed by fusing the node grouping with one of its adjacent node groupings, and the remaining individual node groupings;
    calculating a grouping modularity of each grouping module, the grouping modularity representing the classification effect of the grouping module;
    taking, among the multiple grouping modules, the grouping module corresponding to the maximum grouping modularity as the current target grouping module, and determining whether the grouping modularity of the current target grouping module is less than or equal to that of the previous target grouping module;
    if the grouping modularity of the current target grouping module is determined to be less than or equal to that of the previous target grouping module, taking the previous target grouping module as the target grouping of the multiple texts to be classified; and
    if the grouping modularity of the current target grouping module is determined to be greater than that of the previous target grouping module, performing again, for each node grouping in the current target grouping module, the step of fusing the node grouping with each adjacent node grouping to obtain multiple grouping modules, until the target grouping of each text to be classified is obtained.
  13. The terminal device according to claim 12, wherein the grouping modularity of the grouping module is obtained by adding the modularity of the fused node grouping that forms the grouping module to the modularities of the remaining individual node groupings; and the processor, when executing the computer program, further implements:
    for any grouping module, determining a first number of node lines contained in a current node grouping of the grouping module, and a second number of node lines connecting each of the remaining individual node groupings to the current node grouping, the current node grouping being the grouping in the grouping module whose modularity is currently to be calculated;
    calculating the modularity of the current node grouping according to the first number and the second number;
    taking each of the remaining individual node groupings in turn as the current node grouping, and calculating the modularity of each of the remaining individual node groupings in the grouping module; and
    taking the sum of the modularity of the current node grouping and the modularities of the remaining individual node groupings as the grouping modularity of the grouping module.
  14. The terminal device according to claim 12 or 13, wherein the processor, when executing the computer program, further implements:
    determining a classification range corresponding to each node grouping in the target grouping, the classification range containing multiple target classifications; and
    for any node grouping, performing secondary classification on the multiple text nodes contained in the node grouping according to preset keywords respectively corresponding to the multiple target classifications in the classification range, to obtain a final target grouping of the multiple texts to be classified.
  15. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the following:
    obtaining a text node structure diagram, the text node structure diagram containing multiple text nodes, each text node corresponding to one text to be classified, and every two text nodes being connected by a node line;
    extracting a text feature of each text to be classified, and calculating similarity distances between every two text nodes according to the text features;
    filtering the node lines in the text node structure diagram according to the similarity distances, to obtain a target text node structure diagram containing each text node and the remaining node lines; and
    classifying the texts to be classified in the target text node structure diagram based on a community discovery algorithm and the remaining node lines, to obtain target groupings of the multiple texts to be classified.
  16. The computer-readable storage medium according to claim 15, wherein the text to be classified comprises a plurality of text word segments, the text features are composed of word vectors of the plurality of text word segments, and the similarity distance comprises a Euclidean distance; the computer program, when executed by the processor, further implements:
    extracting word vectors of the plurality of text word segments in each text to be classified, and determining the word order of the plurality of text word segments in their respective texts to be classified;
    for any two text nodes, calculating a Euclidean distance for each text word segment according to the word vectors at the same word order in the texts to be classified, and summing the Euclidean distances of the text word segments to obtain the Euclidean distance between the two text nodes.
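The per-position summation in claim 16 can be illustrated as follows. This is a minimal sketch, not the claimed implementation: the claims do not specify how texts of unequal length are handled, so zero-padding the shorter word-vector sequence is an assumption made here.

```python
import math

def text_distance(vectors_a, vectors_b):
    """Similarity distance between two text nodes per claim 16: the
    Euclidean distances between word vectors at the same word order are
    computed per position, then summed. Zero-padding the shorter text so
    that positions align is an assumption of this sketch."""
    if not vectors_a and not vectors_b:
        return 0.0
    dim = len(vectors_a[0]) if vectors_a else len(vectors_b[0])
    n = max(len(vectors_a), len(vectors_b))
    pad = [0.0] * dim
    a = vectors_a + [pad] * (n - len(vectors_a))
    b = vectors_b + [pad] * (n - len(vectors_b))
    total = 0.0
    for va, vb in zip(a, b):
        # Euclidean distance of one aligned word-vector pair.
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
    return total
```

For example, two texts whose first word vectors differ by (3, 4) and whose second word vectors coincide are at distance 5.0.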
  17. The computer-readable storage medium according to claim 15 or 16, wherein the text to be classified comprises a plurality of text word segments, the text features are composed of word vectors of the plurality of text word segments, and the similarity distance comprises a Euclidean distance; the computer program, when executed by the processor, further implements:
    for any similarity distance, determining whether the similarity distance is less than or equal to a preset threshold;
    if it is determined that the similarity distance is less than or equal to the preset threshold, deleting the node line corresponding to that similarity distance; and
    if it is determined that the similarity distance is greater than the preset threshold, retaining the node line corresponding to that similarity distance;
    generating the target text node structure graph based on the remaining node lines and the plurality of text nodes.
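The filtering rule of claim 17 reduces to a single pass over the node lines. A minimal sketch, assuming node lines are represented as `(node_a, node_b, distance)` tuples (a representation not specified by the claims):

```python
def filter_node_lines(edges, threshold):
    """Node-line filtering per claim 17: lines whose similarity distance
    is less than or equal to the preset threshold are deleted; lines
    whose distance is greater than the threshold are retained."""
    return [(a, b, d) for (a, b, d) in edges if d > threshold]
```

The retained tuples, together with the unchanged set of text nodes, form the target text node structure graph.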
  18. The computer-readable storage medium according to claim 15, wherein the computer program, when executed by the processor, further implements:
    treating each text node in the target text node structure graph as a separate node group;
    for any node group, fusing that node group with any adjacent node group each time to obtain a plurality of grouping modules, each grouping module comprising the node group formed by fusing that node group with one of its adjacent node groups, together with the remaining separate node groups;
    calculating a grouping modularity of each grouping module, the grouping modularity representing the classification effect of the grouping module;
    among the plurality of grouping modules, taking the grouping module with the maximum grouping modularity as the current target grouping module, and determining whether the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the previous target grouping module;
    if it is determined that the grouping modularity of the current target grouping module is less than or equal to the grouping modularity of the previous target grouping module, taking the previous target grouping module as the target grouping of the plurality of texts to be classified;
    if it is determined that the grouping modularity of the current target grouping module is greater than the grouping modularity of the previous target grouping module, performing again, for each node group in the current target grouping module, the step of fusing that node group with any adjacent node group to obtain a plurality of grouping modules, until the target grouping of each text to be classified is obtained.
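The iteration in claim 18 is a greedy agglomerative loop in the style of the Louvain community-detection method: start from singleton groups, at each round keep the fusion of two adjacent groups that maximizes grouping modularity, and stop once modularity no longer increases. A minimal sketch, where `modularity` is a caller-supplied scoring function over a partition (the claims leave its formula to claim 19):

```python
def greedy_grouping(nodes, adjacency, modularity):
    """Greedy merge loop per claim 18. `adjacency` maps each node to the
    set of nodes it shares a node line with; `modularity` scores a
    partition (a list of node sets). Returns the partition in force when
    modularity stops improving, i.e. the previous target grouping module."""
    groups = [{n} for n in nodes]
    best_score = modularity(groups)
    while True:
        candidates = []
        for i, gi in enumerate(groups):
            for j, gj in enumerate(groups):
                # Only fuse groups connected by at least one node line.
                if i < j and any(b in gj for a in gi for b in adjacency[a]):
                    merged = [g for k, g in enumerate(groups) if k not in (i, j)]
                    merged.append(gi | gj)
                    candidates.append((modularity(merged), merged))
        if not candidates:
            return groups
        score, partition = max(candidates, key=lambda c: c[0])
        if score <= best_score:
            # Claim 18: modularity did not rise, so the previous
            # grouping module is the target grouping.
            return groups
        groups, best_score = partition, score
```

On a graph with two disconnected edges (1-2 and 3-4) and a modularity function counting intra-group edges, the loop converges to the two groups {1, 2} and {3, 4}.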
  19. The computer-readable storage medium according to claim 18, wherein the grouping modularity of a grouping module is obtained by adding the modularity of the node group formed by fusion in the grouping module to the modularities of the remaining separate node groups; the computer program, when executed by the processor, further implements:
    for any grouping module, determining a first number of node lines contained in a current node group in the grouping module, and a second number of node lines by which the remaining separate node groups are respectively connected to the current node group, the current node group being the group in the grouping module whose modularity is currently to be calculated;
    calculating the modularity of the current node group according to the first number and the second number;
    taking the remaining separate node groups in turn as the current node group, and calculating the modularity of each of the remaining separate node groups in the grouping module;
    taking the sum of the modularity of the current node group and the modularities of the remaining separate node groups as the grouping modularity of the grouping module.
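Claim 19 derives each group's modularity from two counts (internal node lines, and node lines connecting other groups to it) and sums the per-group values. The claims do not state the exact formula, so the standard Newman modularity term is assumed in this sketch: the fraction of edges inside the group minus the squared fraction of edge endpoints attached to it.

```python
def group_term(first_number, second_number, total_edges):
    """Modularity of one node group from the counts named in claim 19.
    first_number: node lines inside the group; second_number: node lines
    connecting other groups to it. The Newman-style formula used here is
    an assumption; the claims only name the two inputs."""
    e_in = first_number / total_edges
    degree_frac = (2 * first_number + second_number) / (2 * total_edges)
    return e_in - degree_frac ** 2

def grouping_modularity(groups, total_edges):
    """Grouping modularity of a grouping module per claim 19: the sum of
    the per-group modularities over all node groups in the module."""
    return sum(group_term(fn, sn, total_edges) for fn, sn in groups)
```

For a module of two groups, each holding one internal node line and no external ones in a two-edge graph, each term is 0.5 − 0.25 = 0.25 and the grouping modularity is 0.5.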
  20. The computer-readable storage medium according to claim 18 or 19, wherein the computer program, when executed by the processor, further implements:
    determining a classification range corresponding to each node group in the target grouping, the classification range comprising a plurality of target categories;
    for any node group, performing secondary classification on the plurality of text nodes contained in the node group according to preset keywords respectively corresponding to the plurality of target categories in the classification range, to obtain a final target grouping of the plurality of texts to be classified.
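The secondary classification of claims 14 and 20 can be sketched as keyword scoring within one node group. The claims only say that preset keywords per target category drive the classification; scoring by substring-occurrence count and taking the best-scoring category are assumptions of this illustration.

```python
def secondary_classify(node_texts, category_keywords):
    """Secondary classification per claim 20: each text node in a node
    group is assigned to the target category (within the group's
    classification range) whose preset keywords it matches most often.
    node_texts: {node_id: text}; category_keywords: {category: [keyword, ...]}."""
    result = {}
    for node_id, text in node_texts.items():
        scores = {cat: sum(text.count(kw) for kw in kws)
                  for cat, kws in category_keywords.items()}
        # Hypothetical tie-breaking: max() keeps the first best category.
        result[node_id] = max(scores, key=scores.get)
    return result
```

For instance, a text mentioning "refund" twice lands in a hypothetical "billing" category rather than "tech" when "refund" is among billing's preset keywords.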
PCT/CN2021/090954 2020-12-28 2021-04-29 Text classification method and apparatus, and terminal device and storage medium WO2022142025A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011586743.3 2020-12-28
CN202011586743.3A CN112632280B (en) 2020-12-28 2020-12-28 Text classification method and device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022142025A1 true WO2022142025A1 (en) 2022-07-07

Family ID: 75286176

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090954 WO2022142025A1 (en) 2020-12-28 2021-04-29 Text classification method and apparatus, and terminal device and storage medium

Country Status (2)

Country Link
CN (1) CN112632280B (en)
WO (1) WO2022142025A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632280B (en) * 2020-12-28 2022-05-24 平安科技(深圳)有限公司 Text classification method and device, terminal equipment and storage medium
CN115767204A (en) * 2022-11-10 2023-03-07 北京奇艺世纪科技有限公司 Video processing method, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570178A (en) * 2016-11-10 2017-04-19 重庆邮电大学 High-dimension text data characteristic selection method based on graph clustering
US20180181639A1 (en) * 2016-12-23 2018-06-28 Alcatel-Lucent Usa Inc. Method and apparatus for data-driven face-to-face interaction detection
CN108388651A (en) * 2018-02-28 2018-08-10 北京理工大学 A kind of file classification method based on the kernel of graph and convolutional neural networks
CN110929509A (en) * 2019-10-16 2020-03-27 上海大学 Louvain community discovery algorithm-based field event trigger word clustering method
CN112632280A (en) * 2020-12-28 2021-04-09 平安科技(深圳)有限公司 Text classification method and device, terminal equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10671936B2 (en) * 2017-04-06 2020-06-02 Universite Paris Descartes Method for clustering nodes of a textual network taking into account textual content, computer-readable storage device and system implementing said method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FAN WEICHENT: "Research on Text Analysis Model Based on Community Detection", MASTER THESIS, TIANJIN POLYTECHNIC UNIVERSITY, CN, no. 1, 15 January 2019 (2019-01-15), CN , XP055949270, ISSN: 1674-0246 *

Also Published As

Publication number Publication date
CN112632280B (en) 2022-05-24
CN112632280A (en) 2021-04-09


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912787

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21912787

Country of ref document: EP

Kind code of ref document: A1