CN111259154B - Data processing method and device, computer equipment and storage medium - Google Patents

Data processing method and device, computer equipment and storage medium

Info

Publication number
CN111259154B
CN111259154B (application CN202010082953.2A)
Authority
CN
China
Prior art keywords: node, tree, nodes, topology, text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010082953.2A
Other languages
Chinese (zh)
Other versions
CN111259154A (en)
Inventor
黄文璨
邱东洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010082953.2A priority Critical patent/CN111259154B/en
Publication of CN111259154A publication Critical patent/CN111259154A/en
Application granted granted Critical
Publication of CN111259154B publication Critical patent/CN111259154B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification

Abstract

The embodiments of the application disclose a data processing method and apparatus, a computer device, and a storage medium. The method includes: acquiring at least two original texts; acquiring text vectors corresponding to the at least two original texts, and constructing a clustering tree according to the vector distances between the text vectors; taking the tree nodes used for clustering in the clustering tree as node clusters; assembling the original texts corresponding to the text vectors in each node cluster to obtain a fused text; and extracting keywords from the fused text as the keywords of the at least two original texts. The embodiments of the application can improve clustering efficiency.

Description

Data processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of natural language text intelligent analysis technologies, and in particular, to a data processing method and apparatus, a computer device, and a storage medium.
Background
In the process of clustering original texts, a computer device with a text clustering function can select K original texts as the center points of K clusters, where K is a positive integer smaller than the number of original texts, and can determine the vector distance between each remaining original text (for example, original text a) and each center point. The computer device then adds original text a to the cluster whose center point has the minimum distance to original text a (e.g., cluster A), and re-determines the center point of cluster A.
It should be understood that when the number of original texts is large, determining the cluster to which each non-center original text belongs requires all the original texts to participate in the computation, so the computation load of the computer device during clustering is large, and the clustering efficiency of the computer device is reduced.
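For context, the prior-art procedure described above is essentially a K-means-style loop. A minimal Python sketch (illustrative only, not part of the patent; the function name, iteration count, and empty-cluster handling are assumptions):

import numpy as np

def kmeans_style_clustering(vectors, k, iters=10):
    """Baseline described above: every original text participates in every
    distance computation, which becomes costly for large corpora."""
    rng = np.random.default_rng(0)
    # Take K original texts (here: their vectors) as the K cluster center points.
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Each text joins the cluster whose center point is nearest to it.
        dists = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-determine each cluster's center point.
        centers = np.stack([
            vectors[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
    return labels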
Summary of the application
The embodiments of the application provide a data processing method and apparatus, a computer device, and a storage medium, which can improve clustering efficiency.
An embodiment of the present application provides a data processing method, including:
acquiring at least two original texts;
acquiring text vectors corresponding to at least two original texts respectively, and constructing a clustering tree according to a vector distance between the at least two text vectors;
taking the tree nodes used for clustering in the clustering tree as node clusters;
assembling original texts corresponding to the text vectors in the node clusters to obtain a fusion text;
and extracting keywords of the fused text as keywords of at least two original texts.
An embodiment of the present application provides a data processing apparatus, which is integrated in a computer device, and includes:
the first acquisition module is used for acquiring at least two original texts;
the second acquisition module is used for acquiring text vectors corresponding to at least two original texts respectively;
the building module is used for building a clustering tree according to the vector distance between at least two text vectors;
the first determining module is used for taking the tree nodes used for clustering in the clustering tree as node clusters;
the assembling module is used for assembling the original texts corresponding to the text vectors in the node clusters to obtain fused texts;
and the extraction module is used for extracting the keywords of the fused text as the keywords of the at least two original texts.
The second acquisition module includes:
the first word segmentation unit is used for performing part-of-speech tagging on at least two original texts and performing word segmentation on the at least two original texts according to the tagged part-of-speech;
the matching unit is used for matching the participles according to the stop word list and the reserved word list to obtain a text to be filtered; the stop word list is used for storing the participles to be filtered, and the reserved word list is used for determining the filtering relation of the participles that do not belong to the stop word list;
the eliminating unit is used for eliminating texts to be filtered from at least two original texts to obtain texts to be coded;
the extraction unit is used for extracting the characteristics of the text to be coded to obtain an initial text vector corresponding to the text to be coded;
and the dimension reduction unit is used for carrying out dimension reduction processing on the initial text vector to obtain the text vector.
Wherein the extraction unit includes:
the character encoding subunit is used for performing character encoding on characters in the text to be encoded to obtain character vectors corresponding to the characters in the text to be encoded;
the position coding subunit is used for carrying out position coding on characters in the text to be coded to obtain position vectors corresponding to the characters in the text to be coded;
and the first determining subunit is used for determining an initial text vector corresponding to the text to be coded based on the character vector and the position vector.
The construction module includes:
the first determining unit is used for determining at least two text vectors as topological nodes and acquiring vector distances between the topological nodes;
the construction unit is used for constructing a shortest path topological graph according to the topological nodes and the vector distance between the topological nodes; the weight parameter of the edge in the shortest path topology graph is determined by the vector distance;
the partitioning unit is used for partitioning the topological nodes in the shortest path topological graph based on the weight parameters of the edges in the shortest path topological graph to obtain an initial clustering tree corresponding to the shortest path topological graph;
and the deleting unit is used for deleting the tree nodes in the initial clustering tree according to the number of the topology nodes contained in the tree nodes in the initial clustering tree to obtain the clustering tree.
The construction unit includes:
the construction subunit is used for constructing an initial topology graph according to the topology nodes and the vector distances among the topology nodes; the initial topology graph comprises topology node t_i, topology node t_j, and topology node t_k, where i, j, and k are mutually different positive integers less than or equal to N, and N is the number of the at least two text vectors;
a first adding subunit, configured to select a topology node t_i from the initial topology graph, determine the topology node t_j corresponding to the connecting edge of t_i with the minimum weight parameter, and add topology node t_i and topology node t_j to the shortest path topology graph;
the second determining subunit is configured to determine, as remaining topology nodes, topology nodes in the initial topology map, except for the topology nodes included in the shortest path topology map;
a second adding subunit, configured to determine, among the remaining topology nodes, the topology node t_k corresponding to the connecting edge with the minimum weight parameter to topology node t_i and topology node t_j, and add topology node t_k to the shortest path topology graph, until the remaining topology nodes are empty and the construction of the shortest path topology graph is completed.
Wherein the first determination unit includes:
a third determining subunit, configured to determine, among the original distances between topology node t_i and its K associated topology nodes, the maximum original distance as the first distance; K is a positive integer less than or equal to N;
a fourth determining subunit, configured to determine, among the original distances between topology node t_j and its K associated topology nodes, the maximum original distance as the second distance;
a fifth determining subunit, configured to determine the original distance between topology node t_i and topology node t_j as the third distance;
a sixth determining subunit, configured to determine the vector distance between topology node t_i and topology node t_j based on the first distance, the second distance, and the third distance.
The deleting unit includes:
a first obtaining subunit, configured to obtain the number of topology nodes included in subtree node a_m of a parent tree node in the initial clustering tree as a first number, and obtain the number of topology nodes included in subtree node a_n of the parent tree node as a second number; m and n are mutually different positive integers less than or equal to F, and F is the number of tree nodes included in the initial clustering tree;
a first deleting subunit, configured to delete subtree node a_m and subtree node a_n if neither the first number nor the second number reaches the number threshold;
a retaining subunit, configured to retain subtree node a_m and subtree node a_n if both the first number and the second number reach the number threshold;
a second deleting subunit, configured to delete subtree node a_m and replace the parent tree node with subtree node a_n if the first number does not reach the number threshold and the second number reaches the number threshold;
and a seventh determining subunit, configured to obtain the clustering tree according to the retained tree nodes.
Wherein the first determining module comprises:
a second determining unit, configured to take the stability parameter of parent tree node b_x in the clustering tree as a first stability parameter, and take the sum of the stability parameter of subtree node b_y and the stability parameter of subtree node b_z corresponding to the parent tree node as a second stability parameter; the stability parameter is determined based on the vector distance;
the replacing unit is used for replacing the first stability parameter with the second stability parameter if the first stability parameter is smaller than the second stability parameter;
a deleting unit, configured to determine parent tree node b_x as a tree node for clustering and delete subtree node b_y and subtree node b_z if the first stability parameter is greater than or equal to the second stability parameter;
And a third determining unit, configured to use the tree nodes for clustering as node clusters in the clustering tree.
Wherein the second deletion subunit is further configured to:
if the first number does not reach the number threshold and the second number reaches the number threshold, determine the topology nodes in the deleted subtree node a_m as noise nodes;
acquiring a central topological node of a node cluster in a clustering tree;
determining a node cluster with the minimum vector distance to the noise node as a target node cluster based on the vector distance between the noise node and the central topological node;
and adding the noise node to the target node cluster.
The extraction module includes:
the second word segmentation unit is used for segmenting words of the fused text;
the statistical unit is used for counting the frequency of the participles in the node cluster and the inverse frequency of the participles in other node clusters except the node cluster;
a fourth determining unit, configured to determine a keyword evaluation parameter of the participle based on the frequency and the inverse frequency;
a fifth determining unit, configured to select p segmented words as keywords of the node cluster based on the keyword evaluation parameter; p is a positive integer.
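The frequency and inverse frequency used by the statistical unit and the keyword evaluation parameter correspond to a TF-IDF-style score. A minimal sketch under that assumption (illustrative only; the patent does not prescribe this exact formula):

import math
from collections import Counter

def keyword_scores(cluster_words, all_clusters_words):
    """Score each participle by its frequency in this node cluster times its
    inverse frequency across the other node clusters (TF-IDF style)."""
    tf = Counter(cluster_words)
    total = sum(tf.values())
    n_clusters = len(all_clusters_words)
    scores = {}
    for word, count in tf.items():
        # Number of node clusters whose fused text contains this participle.
        df = sum(1 for words in all_clusters_words if word in words)
        idf = math.log(n_clusters / (1 + df)) + 1
        scores[word] = (count / total) * idf
    return scores

def top_p_keywords(scores, p):
    """Select the p highest-scoring participles as the cluster's keywords."""
    return sorted(scores, key=scores.get, reverse=True)[:p]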
The apparatus further includes:
the second determining module is used for carrying out mean processing on the vector distance between the topological nodes in the node cluster to obtain a first mean value corresponding to the topological nodes in the node cluster;
and the first sorting module is used for sorting the first mean values and displaying the original texts corresponding to the topology nodes in the node cluster according to the sorting result of the first mean values.
The number of the node clusters is at least two;
The apparatus further includes:
the third determining module is used for carrying out mean processing on the first mean value of each of the at least two node clusters to obtain a second mean value used for representing the clustering effect of the node clusters;
and the second sorting module is used for sorting the second average values respectively corresponding to the at least two node clusters and displaying the at least two node clusters according to the sorting result of the second average values.
One aspect of the present application provides a computer device, comprising: a processor, a memory, a network interface;
the processor is connected with a memory and a network interface, wherein the network interface is used for providing data communication functions, the memory is used for storing a computer program, and the processor is used for calling the computer program to execute the method in one aspect of the embodiment of the application.
An aspect of the application provides a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a processor, perform a method in an aspect of an embodiment of the application.
According to the text vectors associated with the at least two original texts, the vector distances between the text vectors can be determined, where each text vector is obtained by encoding the deep semantic information of an original text. Furthermore, the clustering tree can be constructed according to the vector distances; the tree nodes used for clustering can then be quickly taken as node clusters according to the stability parameters of the tree nodes of the clustering tree, and the keywords of the node clusters can be extracted, so that the clustering efficiency for the original texts can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
fig. 2 is a schematic view of a scenario for performing service data interaction according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a BERT model provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a scenario for determining a vector distance between two topological nodes according to an embodiment of the present application;
fig. 6 is a schematic view of a scenario for constructing a shortest path topology provided in an embodiment of the present application;
fig. 7 is a schematic view of a scenario for constructing an initial clustering tree according to an embodiment of the present application;
fig. 8 is a schematic view of a scenario for constructing a cluster tree according to an embodiment of the present application;
fig. 9 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 10 is a schematic view of a scenario for determining an answer text corresponding to a question text according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Please refer to fig. 1, which is a schematic structural diagram of a network architecture according to an embodiment of the present application. As shown in fig. 1, the network architecture may include a server 2000 and a user terminal cluster, and the user terminal cluster may include a plurality of user terminals, as shown in fig. 1, specifically, a user terminal 3000a, a user terminal 3000b, user terminals 3000c, …, and a user terminal 3000 n.
As shown in fig. 1, the user terminals 3000a, 3000b, 3000c, …, and 3000n may be respectively in network connection with the server 2000, so that each user terminal may interact data with the server 2000 through the network connection.
As shown in fig. 1, each user terminal in the user terminal cluster may be installed with a target application, and when the target application runs in each user terminal, the target application may perform data interaction with the server 2000 shown in fig. 1, where the target application may be an application that performs service processing in an artificial intelligence field such as an automation customer service field. For example, the target application may be used for pre-sale shopping guide, after-sale service, hospital referral, medical aid diagnosis, and game assistants (e.g., game acquaintances) in a game scene, and the like.
It is understood that Artificial Intelligence (AI) refers to a new technological science that uses a digital computer, or a computer device controlled by a digital computer (e.g., the server 2000 shown in fig. 1), to simulate, extend, and expand human intelligence. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Among them, Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language that people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and the like.
It is to be understood that the computer device in the embodiment of the present application may be an entity terminal having a text clustering function, the entity terminal may be the server 2000 shown in fig. 1, and may also be a terminal device corresponding to the question and answer robot, which is not limited herein.
For convenience of understanding, in the embodiment of the present application, one user terminal may be selected from the plurality of user terminals shown in fig. 1 as a target user terminal, and the target user terminal may include: the intelligent terminal comprises an intelligent terminal with a text clustering function, such as a smart phone, a tablet computer and a desktop computer. For example, in the embodiment of the present application, the user terminal 3000a shown in fig. 1 may be used as a target user terminal, and a target application may be integrated in the target user terminal, and at this time, the target user terminal may implement data interaction with the server 2000 through a service data platform corresponding to the target application.
For easy understanding, please refer to fig. 2, which is a schematic view of a scenario for performing service data interaction according to an embodiment of the present application. The server 10a in this embodiment of the application may take the server 2000 shown in fig. 1 as an example to illustrate a specific implementation manner of clustering the original text by a computer device. As shown in fig. 2, the user terminal 10b in this embodiment may be any one of the user terminals (e.g., the user terminal 3000b) in the user terminal cluster shown in fig. 1. The database 10c in the embodiment of the present application may be a database 10c having a network connection relationship with the server 10 a. The data stored in the database 10c includes a question text and an answer text.
It should be understood that the user terminal 10b may respond to a trigger operation of the user corresponding to the user terminal 10b, so that a question text may be sent to the server 10a. The trigger operation may include a contact operation such as a click or a long press, or a non-contact operation such as voice or a gesture, which is not limited herein. The question text may be a question associated with a game that is posed by a game player in a game scenario, for example, "how much speed can xx equipment with a speed attribute raise?". The question text may also be a professional-knowledge question posed by a student in a learning scenario, for example, "what is the concept of acceleration?". The question text may also be a question about an item posed by a customer in a shopping scenario, for example, "on which floor of the mall is women's clothing?". The question text may also be a question posed in other scenarios, which is not limited herein.
Further, after receiving the question text (e.g., question text 1), the server 10a can quickly determine the similarity between the question text 1 and each question text in the database 10c, so that the question text (e.g., question text 2) with the highest matching degree can be determined from the database 10 c. Further, the server 10a may use the answer text (for example, answer text 2) corresponding to the question text 2 as the answer text of the question text 1, and return the answer text 2 to the user terminal 10 b.
It should be understood that the server 10a, when obtaining the question text sent by the user terminal 10b, may send the question text to the database 10c, so that the database 10c may store the question text proposed by the user corresponding to the user terminal 10 b. It will be appreciated that the server 10a may cluster the question text stored in the database 10 c. Wherein the server 10a may periodically retrieve at least two original texts from the database 10 c. The original text acquired by the server 10a may be the question text stored in the database 10 c.
Further, the server 10a may obtain text vectors corresponding to at least two original texts, respectively. The server 10a may determine a vector distance between at least two text vectors so that a cluster tree may be constructed. The cluster tree may include a root node, a leaf node, and an intermediate node between the root node and the leaf node. The root node contains text vectors associated with at least two original texts, and the leaf nodes include text vectors having the same vector distance. For example, if the original text obtained by the server from the database 10c may correspond to 10 text vectors, respectively, the root node of the cluster tree may include the 10 text vectors.
Further, the server 10a may use the tree nodes for clustering in the clustering tree as the node clusters according to the stability parameters corresponding to the tree nodes in the clustering tree. Wherein the stability parameter is used to indicate a degree of stability of the tree node, the stability parameter being determined based on the vector distance. It can be understood that if the stability parameter of a tree node (e.g., tree node a) in the cluster tree is larger, it can be understood that the stability of the tree node a is better. The node cluster in this embodiment is a cluster obtained by clustering the original text by the server 10 a.
It can be understood that the server 10a may assemble the original texts corresponding to the text vectors included in the node cluster to obtain the fused text. Further, the server 10a may extract a keyword in the fused text as a keyword of the node cluster, so as to obtain the keyword corresponding to at least two original texts acquired by the server 10 a. Further, the server 10a may display the node clusters obtained by clustering the at least two original texts and the keywords corresponding to each node cluster on the display interface 100 of the user terminal 10b, so that the staff associated with the user terminal 10b may quickly perform manual screening, thereby expanding the problem texts stored in the database 10 c. For example, the server 10a performs clustering processing on the acquired original text to obtain two clustering clusters, i.e., a node cluster a and a node cluster B. The extracted keywords in the node cluster a may be "how", "advertisement", and "indexing". The keywords extracted by the node cluster B may be "memory", "network", and "black screen". The latest clustered text data may be displayed on the display interface 100, for example, the question text 10 shown in fig. 2 may be the latest clustered text data.
It can be understood that, in the embodiment of the present application, the server 10a performs clustering processing on the original text, and can quickly obtain the node cluster associated with the original text and the keyword corresponding to the node cluster, so as to perform subsequent manual screening, thereby finding a new problem type in a large amount of original texts (i.e., problem texts), and also identifying different question methods of the problem type, thereby expanding the database 10 c.
The specific implementation manner of clustering the original text by the computer device to obtain the keywords corresponding to the original text may refer to the following embodiments corresponding to fig. 3 to fig. 10.
Further, please refer to fig. 3, which is a flowchart illustrating a data processing method according to an embodiment of the present application. As shown in fig. 3, the method may include:
s101, at least two original texts are obtained.
Specifically, the computer device in the embodiment of the present application may periodically obtain at least two original texts from a database having a network connection relationship with the computer device. Wherein the original text is a question text stored in a database, and the question text is sent to the database by a user terminal having a network connection relationship with the computer device through the computer device.
In this embodiment of the present application, the computer device may be an entity terminal having a text clustering function, and the entity terminal may be the server 2000 shown in fig. 1, and may also be a terminal device corresponding to the question and answer robot, which is not limited herein.
S102, text vectors corresponding to at least two original texts are obtained, and a clustering tree is constructed according to the vector distance between the at least two text vectors.
Wherein the cluster tree may include a root node, a leaf node, and an intermediate node between the root node and the leaf node; the root node includes the text vectors associated with the at least two original texts, and the leaf node includes text vectors having the same vector distance.
Specifically, the computer device may obtain text vectors corresponding to at least two original texts, respectively. Wherein one original text corresponds to one text vector. Further, the computer device may determine each of the at least two text vectors as a topological node, and may obtain a vector distance between the topological nodes. At this time, the computer device may construct a shortest path topology map according to the topology nodes and the vector distances between the topology nodes. Wherein the weight parameter of the edge in the shortest path topology map is determined by the vector distance. The computer device may divide the topology nodes in the shortest path topology graph based on the weight parameters of the edges in the shortest path topology graph to obtain an initial cluster tree corresponding to the shortest path topology graph. Further, the computer device may delete the tree node in the initial clustering tree according to the number of topology nodes included in the tree node in the initial clustering tree, so as to obtain the clustering tree.
It should be understood that the computer device may obtain text vectors corresponding to at least two respective original texts. It is understood that the computer device may perform part-of-speech tagging on at least two original texts obtained from the database, and perform word segmentation on the at least two original texts according to the tagged part-of-speech. Further, the computer device can match the participle according to the stop word list and the reserved word list to obtain the text to be filtered. In this embodiment, a text without an actual meaning may be determined as a text to be filtered. For example, a sentence "aaaaaaaa" is a text having no actual meaning.
Stop words are words that are automatically filtered out before processing natural language data (or text) in information retrieval, in order to save storage space and improve search efficiency; the stop word list stores such words. For example, "hey", "hum", "or", "but", and punctuation marks. In other words, according to the stop word list, the computer device may collectively refer to the participles that do not belong to the stop word list as non-stop words.
The reserved vocabulary may include a special vocabulary (e.g., a utility vocabulary in a question-answering robot scenario) and an acceptable part-of-speech table. The special vocabulary consists of manually defined words with real meanings, such as game-specific item names (e.g., "XX armor" and "XX character seal"). The acceptable part-of-speech table may include adjectives, exclamations, idioms, abbreviations, colloquialisms, nouns, pronouns, place words, time words, verbs, modal words, and the like.
Further, the computer device may determine a filtering relationship for the non-stop words based on the reserved vocabulary. If a word (e.g., word a) does not belong to a word in the inactive word list and the word a does not belong to a word in the reserved word list, the computer device may determine the filtering relation of the word a as a filtering word to be filtered. If a word (e.g., word b) does not belong to a word in the deactivated vocabulary and the word b belongs to a word in the retained vocabulary, the computer device may determine the filtering relationship of the word b as a retained word that needs to be retained.
It can be understood that if, after word segmentation of an original text a, all the obtained participles are filter words, the computer device may determine the original text a as a text to be filtered. For example, the text a to be filtered may be "12345678980", which is a text having no actual meaning.
Further, the computer device may remove the texts to be filtered from the acquired original texts, thereby obtaining the texts to be encoded. In this embodiment of the application, the original texts other than the texts to be filtered may be determined as the texts to be encoded. For example, the original texts acquired by the computer device may be original text a, original text b, original text c, and original text d. Further, the computer device may determine the text to be filtered, i.e., original text a, according to the stop word list and the reserved word list. At this time, the computer device may determine original text b, original text c, and original text d as the texts to be encoded.
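As an illustration of this segmentation-and-filtering step, the following sketch uses the open-source jieba part-of-speech tagger; the patent does not name a segmentation tool, and the word lists and tags here are placeholders:

import jieba.posseg as pseg  # open-source Chinese segmentation with POS tagging

STOP_WORDS = {"的", "了", "吗"}            # placeholder stop word list
SPECIAL_WORDS = {"XX armor"}               # placeholder reserved special vocabulary
ACCEPTED_POS = {"n", "v", "a", "t", "s"}   # placeholder acceptable POS tags

def retained_participles(text):
    """Return the participles kept after filtering; an empty result means the
    whole text is treated as a text to be filtered."""
    retained = []
    for seg in pseg.cut(text):             # yields (word, part-of-speech) pairs
        if seg.word in STOP_WORDS:
            continue                       # participle in the stop word list
        if seg.word in SPECIAL_WORDS or seg.flag in ACCEPTED_POS:
            retained.append(seg.word)      # reserved word: keep it
        # otherwise the participle is a filter word and is dropped
    return retained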
It should be understood that the computer device may perform feature extraction on a text to be encoded, so that an initial text vector corresponding to the text to be encoded may be obtained. It can be understood that, when the computer device performs feature extraction on a text to be encoded, the initial text vector corresponding to the text to be encoded can be obtained through a language representation model (BERT). The BERT model is a multi-layer bidirectional encoder pre-trained on a large Chinese Wikipedia data set through the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks.
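The patent does not specify a BERT implementation; as one possible illustration, the open-source transformers library can produce such initial text vectors (the model name and pooling choice are assumptions):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")  # 12 layers, d_model = 768

def initial_text_vector(text):
    """Encode a text to be encoded into its initial text vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # One 768-dimensional vector per token, matching the 30 x 768 example below.
    return outputs.last_hidden_state.squeeze(0)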
For easy understanding, please refer to fig. 4, which is a schematic structural diagram of a BERT model according to an embodiment of the present application. As shown in fig. 4, the BERT model may include h identical layers stacked together, where h may be a positive integer. Each layer has two branch layers, one of which can be a multi-headed self-attention layer and a normalization layer, and the other of which can be a simple fully-connected feedforward network (i.e., a position-by-position forward propagation layer) and a normalization layer.
It should be understood that the computer device may perform character encoding on the characters in the text to be encoded, so that a character vector corresponding to each character in the text to be encoded may be obtained. It will be appreciated that the computer device may convert each character in the text to be encoded into a character vector by representing each character with a unique token (e.g., an integer number). The dimension of the character vector can be the output dimension d_model of the BERT model branch layer. Further, the computer device may perform position encoding on the characters in the text to be encoded, so as to obtain a position vector corresponding to each character in the text to be encoded. There are various choices of position encoding, including learned and fixed position encodings.
Specifically, the position encoding may be given by the following formulas (1) and (2):

PE(pos, 2i) = sin(pos / 10000^(2i / d_model)),  (1)

PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model)),  (2)

where pos is the position of the character, i is the dimension index, and d_model is the output dimension of the BERT model branch layer.
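Formulas (1) and (2) are the standard sinusoidal position encoding; a direct transcription in Python (illustrative):

import numpy as np

def position_encoding(seq_len, d_model=768):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]      # character positions
    i = np.arange(0, d_model, 2)[None, :]  # even dimension indices (2i)
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe  # e.g. 30 position vectors of dimension 768 for a 30-character text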
It can be understood that the dimension of both the character vector and the position vector of each character of the text to be encoded can be d_model. The computer device can superimpose the character vector and the position vector of each character, resulting in superimposed vectors for the characters of the text to be encoded, which are transmitted to the multi-head self-attention layer in the BERT model.
Specifically, the output of the multi-head self-attention layer may be as shown in formulas (3) to (5):

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),  (3)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,  (4)

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,  (5)

where Q, K, and V are the feature vectors transmitted from the text to be encoded to the multi-head self-attention layer; W_i^Q, W_i^K, and W_i^V are projection parameter matrices; W^O is a trainable weight parameter of the multi-head self-attention layer; h is the number of attention heads; and d_k is the dimension of K, with d_k = d_model / h.
Further, the computer device may transmit an output vector of text to be encoded output from the multi-headed self-attention layer to a position-by-position forward propagation layer in the BERT model. It should be understood that the computer device may refer to the output vector output from the BERT model as the initial text vector. Specifically, the formula of the position-by-position forward propagation layer may be represented by the following formula (6):
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2,  (6)

where x is the output vector of the text to be encoded output from the multi-head self-attention layer, and W_1, W_2, b_1, and b_2 are trainable weight parameters.
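Formulas (5) and (6) can be transcribed directly; a minimal single-head sketch (illustrative only; in practice the parameter matrices are trained):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Formula (5): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def ffn(x, W1, b1, W2, b2):
    """Formula (6): max(0, x W1 + b1) W2 + b2, applied position by position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2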
As shown in fig. 4, the BERT model may include 12 multi-head self-attention layers, and the vector dimension d_model encoded by BERT may be 768. For example, the text a to be encoded may contain 30 characters; performing character encoding on each character of text a yields 30 character vectors of dimension 1 × 768. Further, performing position encoding on each character of text a yields 30 position vectors of dimension 1 × 768.
At this time, the computer device may obtain 30 superimposed vectors of 1 × 768 dimensions based on the position vector and the character vector. Further, the computer device may merge the superimposed vectors of the characters of the text to be encoded together, so that a feature vector of 30 × 768 dimensions of the text to be encoded transmitted to the multi-head self-attention layer may be obtained. It should be understood that the feature vector of 30 × 768 dimensions passes through the multi-head self-attention layer and the position-by-position forward propagation layer in the BERT model, and an initial text vector a corresponding to the text a to be encoded can be obtained.
Further, the computer device may perform dimension reduction on the initial text vector, so as to obtain the text vector of the text to be encoded. In this embodiment of the application, the initial text vector after the dimension reduction processing may be referred to as the text vector. It will be appreciated that the computer device may obtain the search domain K and the minimum accommodation distance. The search domain K can be used for determining the number of adjacent initial text vectors, where K is a positive integer less than or equal to N, and N is the number of the at least two text vectors; the minimum accommodation distance may be used to control how closely the initial text vectors are allowed to be embedded.
The computer device may arbitrarily select an initial text vector from the obtained initial text vectors as the target initial text vector. The target initial text vector may be a feature vector in a high-dimensional space; for example, the target initial text vector may be a feature vector with dimensions of 30 × 768. Then, the computer device may determine the original distances corresponding to the K initial text vectors associated with the target initial text vector in the high-dimensional space (e.g., a 768-dimensional space), and project the target initial text vector into a low-dimensional space (e.g., a 2-dimensional space) to obtain the text vector corresponding to the target initial text vector. Further, the computer device may determine, in the low-dimensional space, the original distances corresponding to the K text vectors associated with the text vector corresponding to the target initial text vector.
At this time, the computer device may optimize the text vectors represented in the low-dimensional space based on the minimum accommodation distance. In other words, the computer device may use a gradient descent algorithm (e.g., a stochastic gradient algorithm) to minimize the difference between the determined original distances. The gradient descent algorithm may include a plurality of different variants, such as a batch gradient algorithm, a stochastic gradient algorithm, and a compromise (mini-batch) gradient algorithm. For the stochastic gradient algorithm, the computer device can continuously evaluate and select the shortest path under the current target initial text vector, so that an optimal result can be reached along the shortest path.
For a specific implementation manner of determining, by the computer device, the original distances corresponding to the K text vectors associated with the text vector corresponding to the target initial text vector, respectively, refer to the description in the embodiment corresponding to fig. 5 below, and details are not repeated here.
It should be understood that a data dimension reduction tool may be employed in this embodiment of the application to perform dimension reduction processing on the initial text vector, for example, the manifold dimension reduction tool UMAP (Uniform Manifold Approximation and Projection). UMAP is a dimension reduction algorithm based on manifold learning techniques and ideas from topological data analysis; it can preserve the manifold structure and density distribution of the data set (i.e., the initial text vectors of the texts to be encoded), so that the clustering effect can be improved when the texts to be encoded are clustered. UMAP can perform dimension reduction processing on the text vectors of the original texts by setting a search domain (e.g., 24) and a minimum accommodation distance (e.g., 0).
For example, as shown in fig. 4, the initial text vector a corresponding to the text a to be encoded may be a vector with dimensions of 30 × 768, and the computer device may perform dimension reduction on the initial text vector a by using the open-source manifold dimension reduction tool UMAP, so as to obtain the text vector corresponding to the text a to be encoded, which may be a vector with dimensions of 30 × 2. The dimension of the initial text vector of the text to be encoded is thus greatly reduced, so that the texts to be encoded are more densely distributed in space and occupy less memory, which in turn yields a better clustering effect.
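With the open-source umap-learn package, the dimension reduction described above can be sketched as follows; mapping the search domain to n_neighbors and the minimum accommodation distance to min_dist is an assumption:

import umap  # pip install umap-learn

def reduce_text_vectors(initial_text_vectors):
    """Project initial text vectors down to 2 dimensions; n_neighbors plays the
    role of the search domain (e.g., 24) and min_dist the minimum accommodation
    distance (e.g., 0)."""
    reducer = umap.UMAP(n_neighbors=24, min_dist=0.0, n_components=2)
    return reducer.fit_transform(initial_text_vectors)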
Further, the computer device may perform clustering processing on the text vectors of the texts to be encoded, so that a clustering tree associated with the original texts may be obtained. In this embodiment of the application, an open-source clustering method (for example, Density-Based Clustering Based on Hierarchical Density Estimates, abbreviated as HDBSCAN) may be used to perform clustering processing on the at least two text vectors obtained by the computer device. It should be appreciated that the computer device may determine each text vector associated with the original texts as a topology node, where the number of text vectors determined by the computer device is N, and N is a positive integer greater than or equal to 2. Further, the computer device may obtain the vector distances between the topology nodes.
Specifically, the formula for determining the vector distance may be as shown in the following formula (7):
d_mreach-K(a, b) = max{core_K(a), core_K(b), d(a, b)},  (7)

where K may be the search domain determined by the computer device, the search domain is used for determining the number of adjacent topology nodes, and K is a positive integer less than or equal to N. core_K(x) represents the largest original distance between topology node x and its K associated topology nodes within the search domain, and d(a, b) represents the original distance between topology node a and topology node b.
It is understood that the topology nodes determined by the computer device may include topology node t_i and topology node t_j, where i and j are mutually different positive integers. It should be appreciated that the computer device may determine the original distances between topology node t_i and its K associated topology nodes, and determine the largest of these original distances as the first distance. Further, the computer device may determine the original distances between topology node t_j and its K associated topology nodes, and determine the largest of these original distances as the second distance. The computer device may then determine the original distance between topology node t_i and topology node t_j as the third distance. Further, the computer device may determine the vector distance between topology node t_i and topology node t_j based on the first distance, the second distance, and the third distance.
For easy understanding, please refer to fig. 5, which is a schematic diagram of a scenario for determining the vector distance between two topology nodes according to an embodiment of the present application. As shown in fig. 5, the topology nodes in the topology node map are the text vectors corresponding to the original texts acquired by the computer device; the number N of these text vectors may be 12, so that there may be 12 topology nodes in the topology node map. In this embodiment, the search domain K determined by the computer device may be 3, as an example, to describe a specific implementation of determining the vector distance.
As shown in fig. 5, the computer device may determine a vector distance between topology node 1 and topology node 2. Here, it is understood that the computer device may determine the original distances corresponding to the 3 topology nodes associated with the topology node 1, so as to determine that the original distance between the topology node 1 and the topology node 3 (i.e., d (1,3)) is the largest, and further, may determine the original distance d (1,3) as the first distance.
Further, the computer device may determine the original distances corresponding to the 3 topology nodes associated with the topology node 2, respectively, so that it may be determined that the original distance between the topology node 2 and the topology node 4 (i.e., d (2,4)) is the largest, and further, it may be determined that the original distance d (2,4) is the second distance. Then, the computer device may determine an original distance between the topology node 1 and the topology node 2 (i.e., d (1,2)) as a third distance. Further, the computer device may select a maximum distance from the first distance, the second distance, and the third distance according to equation (7), and determine the maximum distance as a vector distance (e.g., a third distance) between the topology node 1 and the topology node 2.
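Formula (7) and the fig. 5 walkthrough can be sketched as follows (illustrative; D is assumed to be a precomputed symmetric matrix of original distances):

import numpy as np

def core_k(D, x, K):
    """core_K(x): the largest original distance from topology node x to its K
    nearest topology nodes within the search domain."""
    return np.sort(D[x])[1:K + 1].max()    # skip D[x, x] = 0 itself

def vector_distance(D, a, b, K):
    """Formula (7): d_mreach-K(a, b) = max{core_K(a), core_K(b), d(a, b)}."""
    return max(core_k(D, a, K), core_k(D, b, K), D[a, b])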
It should be appreciated that the computer device can construct a shortest path topology graph from the topology nodes and the vector distances between the topology nodes. The topology graph constructed by always selecting the connecting edge with the smallest weight parameter to the topology nodes already added can be determined as the shortest path topology graph, where the weight parameters of the edges in the shortest path topology graph are determined by the vector distances between the topology nodes.
It is understood that the computer device may construct an initial topology graph based on the topology nodes and the vector distances between the topology nodes. The initial topology graph may include topology node t_i, topology node t_j, and topology node t_k, where i, j, and k are mutually different positive integers less than or equal to N, and N is the number of the at least two text vectors.
Further, the computer device may select a topology node t_i in the initial topology graph, determine the topology node t_j corresponding to the connecting edge of t_i with the minimum weight parameter, and add topology node t_i and topology node t_j to the shortest path topology graph. Further, the computer device may determine the topology nodes in the initial topology graph other than those included in the shortest path topology graph as the remaining topology nodes. The computer device may then determine, among the remaining topology nodes, the topology node t_k corresponding to the connecting edge with the minimum weight parameter to topology node t_i and topology node t_j, and add topology node t_k to the shortest path topology graph, repeating this step until the remaining topology nodes are empty, at which point the construction of the shortest path topology graph is complete.
For easy understanding, please refer to fig. 6, which is a schematic view of a scenario for constructing a shortest path topology diagram according to an embodiment of the present application. As shown in fig. 6, the initial topology a in the embodiment of the present application is constructed by the computer device according to the vector distances between the topology nodes. The initial topology map a may include a plurality of topology nodes, and specifically may include: topology node a, topology node b, topology node c, topology node d, topology node e and topology node f.
It should be understood that the computer device may randomly select one topology node a in the initial topology graph a and determine the weight parameters of the connection edges of other topology nodes and the topology node a. As shown in fig. 6, (a, 1) under the topology node c of the topology map a indicates that the weight parameter of the connection edge of the topology node c and the topology node a is 1, and the weight parameter is determined by the vector distance between the topology node a and the topology node c.
Further, as shown in the topology a in fig. 6, the computer device may determine the topology node c corresponding to the connection edge having the smallest weight parameter (i.e., 1) of the topology node a. At this time, the computer device may add the topology node a and the topology node c to the shortest path topology map. It should be understood that the computer device may determine topology nodes other than the topology nodes comprised by the shortest path topology graph as remaining topology nodes. The remaining topology nodes in the topology map a of fig. 6 may specifically include a topology node b, a topology node d, a topology node e, and a topology node f.
As shown in the topology b in fig. 6, the computer apparatus may determine the topology node f corresponding to the connection edge having the smallest weight parameter (i.e., 4) of the topology node a and the topology node c among the remaining topology nodes in the topology a of fig. 6. At this point, the computer device may add the topology node f to the shortest path topology graph. The remaining topology nodes in the topology map b of fig. 6 may specifically include a topology node b, a topology node d, and a topology node e.
As shown in the topology c in fig. 6, the computer apparatus may determine the topology node d corresponding to the connection edge having the smallest weight parameter (i.e., 2) among the topology nodes remaining in the topology b of fig. 6, the topology node a, the topology node c, and the topology node f. At this point, the computer device may add the topology node d to the shortest path topology graph. The remaining topology nodes in the topology map c of fig. 6 may specifically include a topology node b and a topology node e.
As shown in the topology d in fig. 6, the computer apparatus may determine the topology node b corresponding to the connection edge having the smallest weight parameter (i.e., 5) among the topology nodes remaining in the topology c of fig. 6, the topology node a, the topology node c, the topology node f, and the topology node d. At this point, the computer device may add the topology node b to the shortest path topology graph. The remaining topology nodes in the topology map d of fig. 6 may specifically include the topology node e.
As shown in the topology e in fig. 6, the computer apparatus may determine the topology node e corresponding to the connection edge having the smallest weight parameter (i.e., 3) of the topology node a, the topology node c, the topology node f, the topology node d, and the topology node b among the remaining topology nodes in the topology d of fig. 6. At this point, the computer device may add the topology node e to the shortest path topology graph. The remaining topology nodes in topology e of fig. 6 are empty.
It should be appreciated that the computer device may complete the construction of the shortest path topology graph when the remaining topology nodes are empty. The shortest path topology map associated with the original text, which is constructed by the computer device in the embodiment of the present application, may be as shown in the shortest path topology map B in fig. 6.
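The construction just illustrated grows the graph by repeatedly adding the minimum-weight edge to a remaining node, i.e., a Prim-style minimum spanning tree construction. A minimal sketch (illustrative only; W is assumed to be a matrix of weight parameters):

import numpy as np

def shortest_path_topology_graph(W):
    """Grow the graph from one topology node, repeatedly adding the remaining
    node reachable through the minimum-weight connecting edge (W holds the
    weight parameters, i.e., vector distances)."""
    n = len(W)
    in_graph = [0]                          # start from an arbitrary topology node
    remaining = set(range(1, n))
    edges = []
    while remaining:                        # until the remaining nodes are empty
        u, v = min(((u, v) for u in in_graph for v in remaining),
                   key=lambda e: W[e[0], e[1]])
        edges.append((u, v, W[u, v]))
        in_graph.append(v)
        remaining.remove(v)
    return edges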
It should be understood that the computer device may also divide the topology nodes in the shortest path topology graph based on the weight parameters of the edges in the shortest path topology graph, so that an initial cluster tree corresponding to the shortest path topology graph may be obtained.
For easy understanding, please refer to fig. 7, which is a schematic view of a scenario for constructing an initial clustering tree according to an embodiment of the present application, and as shown in fig. 7, a weight parameter of an edge in a shortest path topology graph X in the embodiment of the present application is determined by a vector distance between topology nodes. The number of text vectors respectively corresponding to original texts acquired by the computer device in the embodiment of the present application may be 15, so that the shortest path topology graph X may include 15 topology nodes, and specifically may include topology node a, topology node b, topology node c, topology node d, topology node e, topology node f, topology node g, topology node h, topology node i, topology node j, topology node k, topology node l, topology node m, topology node n, and topology node o.
It should be appreciated that the computer device may sort the edges in the shortest path topology graph X according to their weight parameters (e.g., in increasing order), so that the topology nodes at both ends of an edge may be divided into tree nodes in the initial clustering tree. As shown in fig. 7, the computer device may refer to the edge between the topology node a and the topology node b as edge ab, whose weight parameter is 1, and so on for the other edges, which will not be listed here.
The sorted result may then be: edge ab, edge bc, edge cd, edge ij, edge jk, edge mn, edge ef, edge eg, edge hl, edge mo, edge hi, edge de, edge dh, and edge ln. Further, since the weight parameters of the edge ab, the edge bc, and the edge cd are all 1, and the topology node a, the topology node b, the topology node c, and the topology node d are adjacent, the computer device may divide the topology nodes at the two ends of the edge ab, the edge bc, and the edge cd into the same level (i.e., the region A), and determine that level as the tree node A for constructing the initial clustering tree.
Further, the computer apparatus may divide the topology node i, the topology node j, and the topology node k into a hierarchy (i.e., a region B) according to the edge ij and the edge jk having the weight parameter of 1 and determine the hierarchy as a tree node B for constructing the initial clustering tree. The computer apparatus may divide the topology nodes m and the topology nodes n into a hierarchy (i.e., region C) according to the edge mn having the weight parameter of 1 and determine the hierarchy as the tree node C for constructing the initial clustering tree. The computer apparatus may divide the topology node e, the topology node f, and the topology node g into a hierarchy (i.e., region D) according to the edge ef and the edge eg having the weight parameter of 2 and determine the hierarchy as a tree node D for constructing the initial cluster tree. The computer device may divide the topology node h and the topology node l into a hierarchy (i.e., region E) according to the edge hl having the weight parameter of 2 and determine the hierarchy as the tree node E for constructing the initial clustering tree.
For the edge mo having the weight parameter of 3, since the topology node m belongs to the topology node in the area C, the computer apparatus may divide the topology node m, the topology node n, and the topology node o in the area C into one hierarchy (i.e., the area F) and determine the hierarchy as the tree node F for constructing the initial clustering tree.
For the edge hi with the weight parameter of 3, since the topology node h belongs to the topology node in the area E and the topology node i belongs to the topology node in the area B, the computer apparatus may divide the topology node in the area E and the topology node in the area B into one hierarchy (i.e., the area G) and determine the hierarchy as the tree node G for constructing the initial clustering tree.
For the edge de with the weight parameter of 4, since the topology node D belongs to the topology node in the area a and the topology node e belongs to the topology node in the area D, the computer device may divide the topology node in the area a and the topology node in the area D into a hierarchy (i.e., an area H) and determine the hierarchy as the tree node H for constructing the initial clustering tree.
For an edge dh with a weight parameter of 5, since the topology node d belongs to the topology node in the region H and the topology node H belongs to the topology node in the region G, the computer apparatus may divide the topology node in the region H and the topology node in the region G into a hierarchy (i.e., a region I) and determine the hierarchy as a tree node I for constructing the initial clustering tree.
For the edge ln with the weight parameter of 7, since the topology node l belongs to the topology nodes in the region I and the topology node n belongs to the topology nodes in the region F, the computer device may divide the topology nodes in the region I and the topology nodes in the region F into one hierarchy (i.e., a region J) and determine the hierarchy as the tree node J for constructing the initial clustering tree.
At this time, the tree nodes of the initial cluster tree may include all the topology nodes in the shortest path topology graph X, and the computer device may complete the construction of the initial cluster tree. Wherein the initial cluster tree may be as shown by initial cluster tree Y in fig. 7.
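This bottom-up merging can be expressed with a union-find structure over the sorted edges, in a single-linkage style. Below is a rough sketch under the same assumptions as the previous one; note that each merge creates one tree node, so runs of equal-weight adjacent edges such as ab, bc, cd appear as several merges rather than one region.

```python
def build_initial_cluster_tree(edges, n):
    edges = sorted(edges, key=lambda e: e[2])      # sort by weight parameter
    parent = list(range(2 * n - 1))                # union-find over tree-node ids
    members = {i: [i] for i in range(n)}           # topology nodes inside each tree node
    children = {}                                  # tree node -> (left, right, weight)
    nxt = n                                        # id of the next created tree node

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]          # path compression
            x = parent[x]
        return x

    for u, v, w in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                               # already in the same hierarchy
        parent[ru] = parent[rv] = nxt              # merge two hierarchies into a new one
        members[nxt] = members[ru] + members[rv]
        children[nxt] = (ru, rv, w)
        nxt += 1
    return children, members                       # root tree node has id 2 * n - 2
```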
Further, the computer device may delete tree nodes in the initial clustering tree according to the number of topology nodes included in the tree nodes of the initial clustering tree, so as to obtain a clustering tree. Wherein it is understood that the computer device may obtain the number of topology nodes contained in a subtree node a_m of a parent tree node in the initial cluster tree as the first number. Further, the computer device may obtain the number of topology nodes contained in a subtree node a_n of the parent tree node as the second number. Wherein m and n are both positive integers less than or equal to F, F is the number of tree nodes contained in the initial clustering tree, and m and n are different.
If neither the first number nor the second number reaches a number threshold, the computer device may delete the subtree node a_m and the subtree node a_n. If the first number and the second number both reach the number threshold, the computer device may reserve the subtree node a_m and the subtree node a_n. If the first number does not reach the number threshold and the second number reaches the number threshold, the computer device may delete the subtree node a_m and replace the parent tree node with the subtree node a_n. Further, the computer device may derive the clustering tree from the retained tree nodes.
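A sketch of this deletion rule over the `children`/`members` maps from the previous sketch is given below; the number threshold of 3 matches the example that follows, and the recursive formulation is an illustrative reading of the rule rather than the patent's exact procedure.

```python
MIN_SIZE = 3                                       # illustrative number threshold

def prune(node, children, members):
    # Returns the tree node that survives for this subtree.
    if node not in children:                       # no subtree nodes left to test
        return node
    left, right, w = children[node]
    big_l = len(members[left]) >= MIN_SIZE         # first number reaches threshold?
    big_r = len(members[right]) >= MIN_SIZE        # second number reaches threshold?
    if big_l and big_r:                            # reserve both subtree nodes
        children[node] = (prune(left, children, members),
                          prune(right, children, members), w)
        return node
    if not big_l and not big_r:                    # delete both subtree nodes
        del children[node]
        return node
    survivor = left if big_l else right            # delete the small subtree node and
    return prune(survivor, children, members)      # replace the parent with the big one
```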
For easy understanding, please refer to fig. 8, which is a schematic view of a scenario for constructing a cluster tree according to an embodiment of the present application. As shown in fig. 8, the initial cluster tree 80 in the embodiment of the present application may be the initial cluster tree Y shown in fig. 7. It should be understood that the computer device in the embodiment of the present application may obtain the cluster tree 81 corresponding to the initial cluster tree 80 according to the number of topology nodes included in the tree nodes in the initial cluster tree 80.
Therein, it is to be understood that the computer device may traverse the entire initial cluster tree 80 from top to bottom, starting from the root node J in the initial cluster tree 80. For the tree node J, the tree node I and the tree node F in the initial clustering tree 80, the tree node J may be referred to as a parent tree node in the present embodiment, and the tree node J may include the sub-tree node (tree node I) and the sub-tree node (tree node F). It is understood that the computer device can obtain the number of topology nodes contained by the tree node I (e.g., 12) and can also obtain the number of topology nodes contained by the tree node F (e.g., 3). At this time, the computer device may determine that both the number of topology nodes included in the tree node I and the number of topology nodes included in the tree node F reach a number threshold (e.g., 3), and the computer device may reserve the tree node I and the tree node F.
For the tree node I, the tree node H and the tree node G in the initial clustering tree 80, the tree node I may be referred to as a parent tree node in the present embodiment, and the tree node I may include the sub-tree node (tree node H) and the sub-tree node (tree node G). It is to be understood that the computer device may obtain the number of topology nodes contained by the tree node H (e.g., 7) and may also obtain the number of topology nodes contained by the tree node G (e.g., 5). At this time, the computer device may determine that the number of topology nodes included in the tree node H and the number of topology nodes included in the tree node G both reach a number threshold (e.g., 3), and then the computer device may reserve the tree node H and the tree node G.
For the tree node H, the tree node a and the tree node D in the initial clustering tree 80, the tree node H may be referred to as a parent tree node in the present embodiment, and the tree node H may include the sub-tree node (tree node a) and the sub-tree node (tree node D). It is understood that the computer device can obtain the number of topology nodes contained by the tree node a (e.g., 4) and can also obtain the number of topology nodes contained by the tree node D (e.g., 3). At this time, the computer device may determine that the number of topology nodes included in the tree node a and the number of topology nodes included in the tree node D both reach a number threshold (e.g., 3), and the computer device may reserve the tree node a and the tree node D.
For the tree node G, the tree node E and the tree node B in the initial clustering tree 80, the tree node G may be referred to as a parent tree node in the present embodiment, and the tree node G may include the sub-tree node (tree node E) and the sub-tree node (tree node B). It is understood that the computer device may obtain the number of topology nodes contained by the tree node E (e.g., 2) and may also obtain the number of topology nodes contained by the tree node B (e.g., 3). At this time, the computer device may determine that the number of topology nodes included in the tree node E does not reach the number threshold (e.g., 3) while the number included in the tree node B reaches the number threshold; the computer device may then delete the tree node E and replace the tree node G with the tree node B.
For the tree node F, the tree node C, and the topology node o in the initial clustering tree 80, the tree node F may be referred to as a parent tree node, and the tree node F may include the subtree node (tree node C) and the subtree node (topology node o). It is understood that the computer device may obtain the number of topology nodes contained by the tree node C (i.e., 2) and the number of topology nodes contained by the topology node o (i.e., 1). At this time, the computer device may determine that neither the number of topology nodes contained by the tree node C nor that contained by the topology node o reaches the number threshold (e.g., 3), and the computer device may delete the tree node C and the topology node o.
At this time, the computer device may obtain a cluster tree 81 as shown in fig. 8 from the retained tree nodes.
And S103, taking the tree nodes used for clustering in the clustering tree as node clusters.
In particular, the computer device may take the stability parameter of a parent tree node b_x in the cluster tree as the first stability parameter, and may take the sum of the stability parameter of the subtree node b_y and the stability parameter of the subtree node b_z corresponding to the parent tree node as the second stability parameter. Wherein x, y and z are positive integers less than or equal to O, O is the number of tree nodes in the clustering tree, and x, y and z are different from each other; the stability parameter is determined based on the vector distance. It should be appreciated that the computer device may replace the first stability parameter with the second stability parameter if the first stability parameter is less than the second stability parameter. If the first stability parameter is greater than or equal to the second stability parameter, the computer device may determine the parent tree node b_x as a tree node for clustering, and may delete the subtree node b_z and the subtree node b_y. The computer device may then treat the tree nodes for clustering as node clusters in the cluster tree.
In addition, if the first number does not reach the number threshold and the second number reaches the number threshold, the computer device may determine the topology nodes in the deleted subtree node a_m as noise nodes. Further, the computer device may obtain a central topology node of each node cluster in the cluster tree. The computer device may determine the node cluster having the minimum vector distance from a noise node as a target node cluster based on the vector distances between the noise node and the central topology nodes, and may further add the noise node to the target node cluster.
It should be appreciated that the computer device may determine a stability parameter for a tree node in a cluster tree corresponding to the original text based on the vector distance between the topological nodes. Specifically, the computer device determines the stability parameter of the tree node S as shown in the following equation (8):
stability(S) = Σ_{p∈S} (λ_p − λ_death), (8)
where λ_p is the reciprocal of the vector distance corresponding to the broken connecting edge at which a topology node p contained in the tree node S leaves the tree node S due to a split, and λ_death is the reciprocal of the vector distance corresponding to the connecting edge that is broken when the tree node S itself splits.
It is understood that the clustering tree 81 corresponding to fig. 8 in the embodiment of the present application is determined according to the shortest path topology diagram X in fig. 7. For example, the stability parameter of the tree node H in the cluster tree 81 may be determined according to equation (8), where λ_death of the tree node H may be the reciprocal of the vector distance between the topology node d and the topology node e (i.e., 1/4). On this basis, the computer device may determine that the stability parameter of the tree node H is 2.75.
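Equation (8) can be evaluated directly once the two kinds of distances are known. In the sketch below both are passed in as plain vector distances and the reciprocals are taken inside, which is an assumption about the data layout rather than something the patent fixes.

```python
def node_stability(exit_dists, split_dist):
    # exit_dists: for each topology node p in tree node S, the vector distance of
    # the connecting edge whose break makes p leave S (lambda_p = 1 / distance).
    # split_dist: the vector distance of the edge broken when S itself splits
    # (lambda_death = 1 / distance).
    lam_death = 1.0 / split_dist
    return sum(1.0 / d - lam_death for d in exit_dists)
```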
Further, the computer device may traverse the entire cluster tree 81 from bottom to top, starting from the leaf nodes (e.g., the tree node A), so that the tree nodes for clustering in the cluster tree 81 may be taken as node clusters. For example, for a tree node A (subtree node), a tree node D (subtree node), and a tree node H (parent tree node) in the clustering tree 81, the computer device may take the stability parameter of the tree node H as a first stability parameter (e.g., 2.75), and the sum of the stability parameter of the tree node A (e.g., 3) and the stability parameter of the tree node D (e.g., 4) as a second stability parameter (e.g., 7). At this time, the computer device may determine that the first stability parameter is less than the second stability parameter, and the computer device may replace the first stability parameter of the tree node H with the second stability parameter, that is, the stability parameter of the tree node H becomes 7.
For a tree node H (subtree node), a tree node B (subtree node), and a tree node I (parent tree node) in the clustering tree 81, the computer apparatus may take the stability parameter of the tree node I as a first stability parameter (e.g., 10), and the sum of the stability parameter of the tree node H (e.g., 7) and the stability parameter of the tree node B (e.g., 1.3) as a second stability parameter (e.g., 8.3). At this time, the computer device may determine that the first stability parameter is greater than the second stability parameter, and then the computer device may determine the tree node I as a tree node for clustering, and may delete the tree node H and the tree node B. Further, the computer device may treat the tree node I as a node cluster in the cluster tree 81.
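This bottom-up comparison can be sketched as a recursion over the cluster tree. Here `stab` holds the per-node stability parameters, and the HDBSCAN-style reading of the rule is an assumption on our part:

```python
def select_clusters(node, children, stab):
    # Returns (propagated stability, chosen node clusters) for this subtree.
    if node not in children:
        return stab[node], [node]                  # a leaf tree node is its own cluster
    left, right, _w = children[node]
    s_l, picks_l = select_clusters(left, children, stab)
    s_r, picks_r = select_clusters(right, children, stab)
    if stab[node] >= s_l + s_r:                    # first stability >= second stability
        return stab[node], [node]                  # keep the parent, drop its subtrees
    return s_l + s_r, picks_l + picks_r            # replace the first stability parameter
```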
In addition, in order to reduce the data loss caused by deleting some tree nodes in the initial clustering tree, the computer device may re-determine, for the topology nodes in the deleted tree nodes, the node clusters to which those topology nodes belong, which may also reduce the time complexity of the computer device when performing subsequent service processing.
It is to be understood that, for the tree node G, the tree node E and the tree node B, since the number of topology nodes included in the tree node E in the initial clustering tree Y shown in fig. 7 does not reach the number threshold (for example, 3) while the number included in the tree node B reaches the number threshold, the computer device may delete the tree node E and replace the tree node G with the tree node B. At this time, the computer device may determine the topology nodes in the tree node E (i.e., the topology node h and the topology node l) as noise nodes.
Further, the computer device may obtain the central topology nodes of the node clusters in the cluster tree 81 shown in fig. 8. For example, the computer device may determine that the node clusters in the cluster tree 81 are the tree node H and the tree node F; the computer device may then determine that the central topology node of the tree node H is the topology node d and that the central topology node of the tree node F is the topology node m.
At this time, for the topology node h (i.e., a noise node), the computer device may determine the vector distances from the topology node h to the topology node d and to the topology node m, respectively. Further, the computer device may determine the node cluster having the minimum vector distance from the topology node h as the target node cluster (e.g., the tree node H), and add the topology node h to the target node cluster.
For the topology node l (i.e., a noise node), the computer device may likewise determine the vector distances from the topology node l to the topology node d and to the topology node m, respectively. Further, the computer device may determine the node cluster having the minimum vector distance from the topology node l as the target node cluster (e.g., the tree node H), and add the topology node l to the target node cluster.
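A sketch of this reassignment step follows, assuming the text vectors of the noise nodes and of the central topology nodes are available as arrays; Euclidean distance stands in for the patent's vector distance here.

```python
import numpy as np

def assign_noise_nodes(noise_vecs, center_vecs):
    # noise_vecs: (M, dim) vectors of the noise nodes
    # center_vecs: (C, dim) vectors of the central topology nodes
    targets = []
    for nv in noise_vecs:
        d = np.linalg.norm(center_vecs - nv, axis=1)   # distance to each node cluster
        targets.append(int(np.argmin(d)))              # index of the target node cluster
    return targets
```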
And S104, assembling the original texts corresponding to the text vectors in the node cluster to obtain a fusion text.
Specifically, the computer device can assemble the original texts corresponding to the text vectors in the node cluster, so as to obtain the fused text.
It should be understood that the node cluster determined by the computer device may contain a plurality of topological nodes, wherein a topological node corresponds to a text vector. For example, the number of text vectors respectively corresponding to the original texts acquired by the computer device may be 100. It will be appreciated that the computer device may perform a clustering process on 100 text vectors, so that a plurality of node clusters (e.g., 5) may be obtained. The computer device can assemble the original texts corresponding to the text vectors in each of the 5 node clusters, so as to obtain fused texts corresponding to each node cluster.
It is understood that a certain node cluster (e.g., node cluster a) determined by the computer device may contain 15 topological nodes, i.e., 15 text vectors. At this time, the computer device may assemble the original texts corresponding to the 15 text vectors, respectively, so as to obtain a fused text corresponding to the node cluster a.
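The assembly itself reduces to concatenating the original texts behind a cluster's text vectors; the comma separator in this sketch is an illustrative choice, since the patent does not specify one.

```python
def fuse_texts(cluster_node_ids, id_to_text):
    # cluster_node_ids: topology-node ids in one node cluster
    # id_to_text: maps a topology-node id to its original text
    return ", ".join(id_to_text[i] for i in cluster_node_ids)
```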
It is understood that, as shown in the display interface 100 of the user terminal 10b in fig. 2, the fused text corresponding to the node cluster A may be "how to shift, how to change the original three-level skin, how to add the ID number, how to quickly turn the gun around, and how to upgrade". The fused text corresponding to the node cluster B may be "5g opens vpn, after the network is restarted due to a problem, the mobile phone network memory has no problem, the real name system also needs to be forced to be offline, and the network is stable".
And S105, extracting the keywords of the fused text as the keywords of at least two original texts.
In particular, the computer device may perform word segmentation on the fused text. Further, the computer device may count the frequency of each participle in the node cluster, and may count the inverse frequency of the participle in the node clusters other than that node cluster, where the inverse frequency may be determined by the number of node clusters determined by the computer device and the number of node clusters in which the participle appears. Based on the frequency and the inverse frequency, the computer device may determine a keyword evaluation parameter for the participle. At this time, the computer device may select p participles as the keywords of the node cluster based on the keyword evaluation parameter, so as to obtain the keywords of all the node clusters determined from the at least two original texts; p is a positive integer.
Specifically, the formula for the computer device to determine the keyword evaluation parameter of the participle may be as shown in the following formula (9):
w=tf*log(N/df), (9)
where tf may be the number of times the participle appears in the node cluster, df may be the number of node clusters where the participle appears, and N may be the determined number of node clusters.
It should be appreciated that the computer device may employ a Chinese word segmentation tool such as jieba to segment the fused text corresponding to the node cluster A shown in the display interface 100 of the user terminal 10b in FIG. 2. Further, the computer device may count the number of times (frequency) that each participle occurs in the node cluster A, count the inverse frequency of the participle in the node clusters other than the node cluster A, and then determine the keyword evaluation parameter of the participle according to formula (9).
Further, the computer device may rank (e.g., in descending or ascending order) the participles in the node cluster A based on the keyword evaluation parameter, so that the p participles with the largest keyword evaluation parameters may be selected as the keywords of the node cluster A. As shown in fig. 2, the computer device may determine 3 participles as keywords of the node cluster A, i.e., how, advertising, indexing.
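Putting equation (9) together with jieba segmentation gives the short sketch below; stop-word filtering is omitted, the descending ranking direction is an assumption, and jieba is only one possible segmentation tool.

```python
import math
from collections import Counter
import jieba

def cluster_keywords(fused_texts, p=3):
    # fused_texts: one fused text per node cluster
    tf_per_cluster = [Counter(jieba.lcut(t)) for t in fused_texts]
    n = len(fused_texts)                            # N in equation (9)
    df = Counter()
    for tf in tf_per_cluster:
        df.update(tf.keys())                        # node clusters containing each word
    keywords = []
    for tf in tf_per_cluster:
        w = {word: cnt * math.log(n / df[word]) for word, cnt in tf.items()}
        keywords.append(sorted(w, key=w.get, reverse=True)[:p])
    return keywords                                 # top-p participles per node cluster
```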
The computer device in the embodiment of the present application can determine the vector distance between text vectors according to the acquired text vectors associated with at least two original texts, where each text vector is determined by encoding the deep semantic information of an original text. Furthermore, a clustering tree can be constructed according to the vector distance; the tree nodes used for clustering can then be quickly taken as node clusters according to the stability parameters of the tree nodes of the clustering tree, and the keywords of the node clusters can be extracted, so that the clustering efficiency of the original texts can be improved.
Further, please refer to fig. 9, which is a flowchart illustrating a data processing method according to an embodiment of the present application. As shown in fig. 9, the method may include:
s201, at least two original texts are obtained.
Specifically, the computer device in the embodiment of the present application may periodically obtain at least two original texts from a database having a network connection relationship with the computer device. Wherein the original text is a question text entered by a user into the computer device.
S202, text vectors corresponding to at least two original texts are obtained, and a clustering tree is constructed according to the vector distance between the at least two text vectors.
The clustering tree comprises a root node, leaf nodes and intermediate nodes between the root node and the leaf nodes; the root node includes text vectors associated with the at least two original texts, and the leaf node includes text vectors having the same vector distance.
Specifically, the computer device may obtain text vectors corresponding to at least two original texts, respectively. Wherein one original text corresponds to one text vector. Further, the computer device may determine each of the at least two text vectors as a topological node, and may obtain a vector distance between the topological nodes. At this time, the computer device may construct a shortest path topology map according to the topology nodes and the vector distances between the topology nodes. Wherein the weight parameter of the edge in the shortest path topology map is determined by the vector distance. The computer device may divide the topology nodes in the shortest path topology graph based on the weight parameters of the edges in the shortest path topology graph to obtain an initial cluster tree corresponding to the shortest path topology graph. Further, the computer device may delete the tree node in the initial clustering tree according to the number of topology nodes included in the tree node in the initial clustering tree, so as to obtain the clustering tree.
And S203, taking the tree nodes used for clustering in the clustering tree as node clusters.
In particular, the computer device may take the stability parameter of a parent tree node b_x in the cluster tree as the first stability parameter, and may take the sum of the stability parameter of the subtree node b_y and the stability parameter of the subtree node b_z corresponding to the parent tree node as the second stability parameter. Wherein x, y and z are positive integers less than or equal to O, O is the number of tree nodes in the clustering tree, and x, y and z are different from each other; the stability parameter is determined based on the vector distance. It should be appreciated that the computer device may replace the first stability parameter with the second stability parameter if the first stability parameter is less than the second stability parameter. If the first stability parameter is greater than or equal to the second stability parameter, the computer device may determine the parent tree node b_x as a tree node for clustering, and may delete the subtree node b_z and the subtree node b_y. The computer device may then treat the tree nodes for clustering as node clusters in the cluster tree.
And S204, assembling the original texts corresponding to the text vectors in the node cluster to obtain a fusion text.
Specifically, the computer device can assemble the original texts corresponding to the text vectors in the node cluster to obtain the fused text.
S205, extracting the key words of the fused text as the key words of at least two original texts.
In particular, the computer device may perform word segmentation on the fused text. Further, the computer device may count the frequency of each participle in the node cluster, and may count the inverse frequency of the participle in the node clusters other than that node cluster, where the inverse frequency may be determined by the number of node clusters determined by the computer device and the number of node clusters in which the participle appears. Based on the frequency and the inverse frequency, the computer device may determine a keyword evaluation parameter for the participle. At this time, the computer device may select p participles as the keywords of the node cluster based on the keyword evaluation parameter, so as to obtain the keywords of all the node clusters determined from the at least two original texts; p is a positive integer.
For specific implementation of steps S201 to S205, reference may be made to the description of steps S101 to S105 in the embodiment corresponding to fig. 3, which will not be described herein again.
S206, carrying out mean processing on the vector distance between the topological nodes in the node cluster to obtain a first mean value corresponding to the topological nodes in the node cluster;
s207, sorting the first average value, and displaying the original text corresponding to the topological node in the node cluster according to the sorting result of the first average value.
S208, carrying out mean processing on the first mean value of each node cluster in the at least two node clusters to obtain a second mean value for representing the clustering effect of the node clusters;
s209, sorting the second average values respectively corresponding to the at least two node clusters, and displaying the at least two node clusters according to the sorting result of the second average values.
In order to improve the efficiency of later manual screening, the computer device may sort the determined node clusters and the original texts within the node clusters, so that node clusters with a better clustering effect can be displayed first.
It should be understood that the node clusters that the computer device derives based on the original texts may be the node cluster A and the node cluster B shown in fig. 2. At this time, for the node cluster A, the computer device may perform mean processing on the vector distances between the topology nodes in the node cluster A, so that a first average value corresponding to each topology node in the node cluster A may be obtained.
For example, the node cluster A may contain 6 topology nodes, namely, topology node 1, topology node 2, topology node 3, topology node 4, topology node 5 and topology node 6. The computer device may perform mean processing on the vector distance d(1,2) between the topology nodes 1 and 2, the vector distance d(1,3) between the topology nodes 1 and 3, the vector distance d(1,4) between the topology nodes 1 and 4, the vector distance d(1,5) between the topology nodes 1 and 5, and the vector distance d(1,6) between the topology nodes 1 and 6, so as to obtain a first average value d_11 corresponding to the topology node 1.
Further, the computer device may also determine a first average value d_12 corresponding to the topology node 2, a first average value d_13 corresponding to the topology node 3, a first average value d_14 corresponding to the topology node 4, a first average value d_15 corresponding to the topology node 5, and a first average value d_16 corresponding to the topology node 6. It is understood that, for the mean processing the computer device performs on the vector distances of the other topology nodes in the node cluster A, reference may be made to the specific implementation process of the mean processing performed on the vector distances of the topology node 1 to obtain d_11, which will not be further described here.
It should be understood that, as shown in FIG. 2, the computer device may sort d_11, d_12, d_13, d_14, d_15 and d_16 (for example, in ascending order), and may then display the original texts corresponding to the topology nodes according to the sorting result. It can be understood that the first average value of the topology node corresponding to the original text "how to wrap the skin with the original three levels" is larger than the first average value of the topology node corresponding to the original text "how to fast turn the gun".
For the node cluster B, the computer device may also perform mean processing on the vector distances between the topology nodes in the node cluster B, so as to obtain the first average values corresponding to the topology nodes in the node cluster B, i.e., d_21, d_22, d_23, d_24 and d_25. Further, the computer device may perform sorting according to the first average values of the node cluster B, and may then display the original texts corresponding to the topology nodes according to the sorting result.
Further, the computer device may perform mean processing on the first average values of the node cluster A (i.e., d_11, d_12, d_13, d_14, d_15 and d_16) to obtain a second average value d_1 for representing the clustering effect of the node cluster A. The computer device may also perform mean processing on the first average values of the node cluster B (i.e., d_21, d_22, d_23, d_24 and d_25) to obtain a second average value d_2 for representing the clustering effect of the node cluster B.
It should be understood that, as shown in FIG. 2, the computer device may sort d_1 and d_2, and display the node cluster A and the node cluster B according to the sorting result of the second average values. It can be understood that the second average value of the node cluster A is greater than the second average value of the node cluster B; in other words, the clustering effect of the node cluster A is better than that of the node cluster B, so that the efficiency of later manually screening question texts can be improved.
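Both averages can be computed in a few lines. This sketch uses Euclidean distances and sorts in ascending order, which are assumptions about details that fig. 2 does not pin down.

```python
import numpy as np

def cluster_order(cluster_vecs):
    # cluster_vecs: list of (n_i, dim) arrays, one per node cluster;
    # assumes each node cluster has at least two topology nodes.
    scored = []
    for vecs in cluster_vecs:
        d = np.linalg.norm(vecs[:, None, :] - vecs[None, :, :], axis=-1)
        first = d.sum(axis=1) / (len(vecs) - 1)     # first average per topology node
        second = float(first.mean())                # second average for the cluster
        scored.append((second, np.argsort(first)))  # text order inside the cluster
    order = sorted(range(len(scored)), key=lambda i: scored[i][0])
    return order, scored                            # cluster display order and scores
```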
Further, please refer to fig. 10, which is a schematic diagram of a scenario of determining an answer text corresponding to a question text according to an embodiment of the present application. It should be appreciated that the computer device may be the server 2000 shown in FIG. 1, where the computer device may be a server corresponding to a game (e.g., game A) application. The target user terminal in this embodiment may be a user terminal on which the game A application is installed, and may be any one user terminal (for example, the user terminal 3000b) in the user terminal cluster in fig. 1. The database in the embodiment of the present application may store the question texts and the answer texts related to the game A.
It can be understood that, in the application scenario of the game A, the computer device may obtain, according to the acquired original texts, the node clusters corresponding to the original texts through clustering. The original texts may be the question texts raised for the game A by the user terminals associated with the game A application. For example, a question text may be "how many defense attributes XX armors can add", "how many game coins one game diamond can exchange", or "what the periodic activities opened every five weeks are", which are not listed here.
The number of node clusters determined by the computer device may be plural; in the embodiment of the present application, 3 node clusters are taken as an example. As shown in the clustering result in fig. 10, these may specifically include the node cluster 10, the node cluster 20, and the node cluster 30, wherein the node cluster 10 may be a cluster of questions associated with equipment attributes in the game A, the node cluster 20 may be a cluster of questions associated with the game mall in the game A, and the node cluster 30 may be a cluster of questions associated with game activities in the game A. Further, background staff corresponding to the game A can manually screen the questions in the node clusters based on the node clusters determined by the computer device and the keywords corresponding to the node clusters, so that the question texts in the database can be expanded.
It should be understood that the target user corresponding to the target user terminal may raise a question text a for the game A; for example, the question text a may be "how many gold coins are needed to purchase the XX sword". At this time, the target user terminal may transmit the question text a to the computer device. Further, the computer device may quickly determine, based on the question text a, the similarity between the question text a and each question text in the database, and may determine the question text with the highest similarity as the target question text.
At this time, the computer device may obtain, from the database, the answer text (e.g., answer text b) corresponding to the target question text. For example, the target question text may be "what the XX sword is worth in the game mall", and the answer text b may be "the XX sword is worth 40 medals in the game mall". At this time, the computer device may take the answer text b as the answer text of the question text a and return the answer text b to the target user terminal.
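The matching step can be sketched as a nearest-neighbour search over question vectors. Cosine similarity is an assumed choice here, since the patent only speaks of a similarity; all names are illustrative.

```python
import numpy as np

def answer_question(question_vec, db_question_vecs, db_answers):
    q = question_vec / np.linalg.norm(question_vec)
    db = db_question_vecs / np.linalg.norm(db_question_vecs, axis=1, keepdims=True)
    sims = db @ q                                   # similarity to every stored question
    return db_answers[int(np.argmax(sims))]         # answer of the target question text
```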
The computer device in the embodiment of the present application can determine the vector distance between text vectors according to the acquired text vectors associated with at least two original texts, where each text vector is determined by encoding the deep semantic information of an original text. Furthermore, a clustering tree can be constructed according to the vector distance; the tree nodes used for clustering can then be quickly taken as node clusters according to the stability parameters of the tree nodes of the clustering tree, and the keywords of the node clusters can be extracted, so that the clustering efficiency of the original texts can be improved.
Further, please refer to fig. 11, which is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing means may be a computer program (comprising program code) running on a computer device, e.g. an application software; the data processing device can be used for executing the corresponding steps in the method provided by the embodiment of the application. As shown in fig. 11, the data processing apparatus 1 may be operated on a computer device, which may be the server 10a in the embodiment corresponding to fig. 2. The data processing apparatus 1 may include: the system comprises a first obtaining module 11, a second obtaining module 12, a constructing module 13, a first determining module 14, a splicing module 15, an extracting module 16, a second determining module 17, a first ordering module 18, a third determining module 19 and a second ordering module 20.
The first obtaining module 11 is configured to obtain at least two original texts;
the second obtaining module 12 is configured to obtain text vectors corresponding to at least two original texts, respectively.
Wherein, the second obtaining module 12 includes: the system comprises a first segmentation unit 121, a matching unit 122, a rejection unit 123, an extraction unit 124 and a dimension reduction unit 125.
The first word segmentation unit 121 is configured to perform part-of-speech tagging on at least two original texts, and perform word segmentation on the at least two original texts according to the tagged part-of-speech;
The matching unit 122 is configured to match the participles according to the stop word list and the reserved word list to obtain a text to be filtered; the stop word list is used for storing the participles to be filtered, and the reserved word list is used for determining the filtering relation of the participles that do not belong to the stop word list;
the eliminating unit 123 is configured to eliminate a text to be filtered from at least two original texts to obtain a text to be encoded;
the extracting unit 124 is configured to perform feature extraction on the text to be encoded, so as to obtain an initial text vector corresponding to the text to be encoded.
Wherein the extracting unit 124 includes: a character encoding sub-unit 1241, a position encoding sub-unit 1242, and a first determination sub-unit 1243.
The character encoding subunit 1241 is configured to perform character encoding on characters in a text to be encoded, so as to obtain character vectors corresponding to the characters in the text to be encoded;
the position encoding subunit 1242 is configured to perform position encoding on characters in the text to be encoded, so as to obtain position vectors corresponding to the characters in the text to be encoded;
the first determining subunit 1243 is configured to determine, based on the character vector and the position vector, an initial text vector corresponding to the text to be encoded.
The specific implementation manners of the character encoding subunit 1241, the position encoding subunit 1242 and the first determining subunit 1243 may refer to the description of the initial text vector in the embodiment corresponding to fig. 4, and details will not be repeated here.
The dimension reduction unit 125 is configured to perform dimension reduction processing on the initial text vector to obtain a text vector.
For specific implementation manners of the first word segmentation unit 121, the matching unit 122, the removing unit 123, the extracting unit 124 and the dimension reduction unit 125, reference may be made to the description of step S102 in the embodiment corresponding to fig. 3, and details will not be further described here.
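As a rough illustration of the character-plus-position encoding described by the subunits above, the sketch below sums two embedding tables and mean-pools into an initial text vector; both the summation and the pooling are assumptions, since the patent states only that the character vectors and position vectors are combined.

```python
import numpy as np

def initial_text_vector(char_ids, char_emb, pos_emb):
    # char_ids: character indices of the text to be encoded
    # char_emb: (vocab, dim) character-embedding table
    # pos_emb: (max_len, dim) position-embedding table
    chars = char_emb[char_ids]                      # character vector per character
    pos = pos_emb[: len(char_ids)]                  # position vector per character
    return (chars + pos).mean(axis=0)               # one initial text vector
```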
The building module 13 is configured to build a cluster tree according to a vector distance between at least two text vectors.
Wherein, this construction module 13 includes: a first determination unit 131, a construction unit 132, a division unit 133, and a deletion unit 134.
The first determining unit 131 is configured to determine at least two text vectors as topology nodes, and obtain vector distances between the topology nodes.
Wherein the first determining unit 131 includes: a third determining subunit 1311, a fourth determining subunit 1312, a fifth determining subunit 1313, and a sixth determining subunit 1314.
The third determining subunit 1311 is configured to determine, among the original distances corresponding to the K topology nodes associated with the topology node t_i, the maximum original distance as a first distance; K is a positive integer less than or equal to N;
The fourth determining subunit 1312 is configured to determine, among the original distances corresponding to the K topology nodes associated with the topology node t_j, the maximum original distance as a second distance;
The fifth determining subunit 1313 is configured to determine the original distance between the topology node t_i and the topology node t_j as a third distance;
The sixth determining subunit 1314 is configured to determine, based on the first distance, the second distance and the third distance, the vector distance between the topology node t_i and the topology node t_j.
For specific implementation manners of the third determining subunit 1311, the fourth determining subunit 1312, the fifth determining subunit 1313, and the sixth determining subunit 1314, reference may be made to the description of the vector distance in the embodiment corresponding to fig. 5, and details will not be further described here.
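Taken together, these four subunits resemble a mutual-reachability-style distance: the first and second distances are the largest of the K associated (nearest) original distances of t_i and t_j, the third is their direct original distance. Combining the three with a maximum, as in the sketch below, is an assumed reading, since the patent leaves the combination to the sixth determining subunit.

```python
import numpy as np

def vector_distances(orig, k):
    # orig: (N, N) matrix of original distances; column 0 of each sorted row is
    # the zero self-distance, so index k picks the k-th associated neighbour.
    core = np.sort(orig, axis=1)[:, k]              # first/second distance per node
    out = np.maximum(orig, core[:, None])           # at least t_i's first distance
    return np.maximum(out, core[None, :])           # and t_j's second distance
```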
The constructing unit 132 is configured to construct a shortest path topology map according to the topology nodes and the vector distances between the topology nodes; the weight parameter of an edge in the shortest path topology graph is determined by the vector distance.
Wherein, this construction element 132 includes: a construction sub-unit 1321, a first addition sub-unit 1322, a second determination sub-unit 1323 and a second addition sub-unit 1324.
The constructing subunit 1321 is configured to construct an initial topology map according to the topology nodes and the vector distances between the topology nodes; the initial topology map includes a topology node t_i, a topology node t_j and a topology node t_k, where i, j and k are positive integers less than or equal to N, N is the number of the at least two text vectors, and i, j and k are different from each other;
The first adding subunit 1322 is configured to select the topology node t_i from the initial topology map, determine the topology node t_j corresponding to the connecting edge with the smallest weight parameter to the topology node t_i, and add the topology node t_i and the topology node t_j to the shortest path topology graph;
the second determining subunit 1323 is configured to determine, as a remaining topology node, a topology node in the initial topology map, except for the topology node included in the shortest path topology map;
the second adding subunit 1324 is configured to determine the topology node t from the remaining topology nodesiAnd topology node tjTopological node t corresponding to connecting edge with minimum weight parameterkConnecting the topology node tkAnd adding the nodes to the shortest path topological graph until the remaining topological nodes are empty, and finishing the construction of the shortest path topological graph.
For specific implementation manners of the constructing subunit 1321, the first adding subunit 1322, the second determining subunit 1323, and the second adding subunit 1324, reference may be made to the description of the shortest path topology diagram in the embodiment corresponding to fig. 6, and details will not be described here again.
The partitioning unit 133 is configured to partition the topology nodes in the shortest path topology graph based on the weight parameters of the edges in the shortest path topology graph, so as to obtain an initial cluster tree corresponding to the shortest path topology graph;
the deleting unit 134 is configured to delete the tree nodes in the initial clustering tree according to the number of topology nodes included in the tree nodes in the initial clustering tree, so as to obtain the clustering tree.
Wherein the pruning unit 134 includes: a first acquiring sub-unit 1341, a first deleting sub-unit 1342, a retaining sub-unit 1343, a second deleting sub-unit 1344 and a seventh determining sub-unit 1345.
The first obtaining subunit 1341 is configured to obtain the number of topology nodes contained in a subtree node a_m of a parent tree node in the initial cluster tree as a first number, and obtain the number of topology nodes contained in a subtree node a_n of the parent tree node as a second number; m and n are positive integers less than or equal to F, F is the number of tree nodes contained in the initial clustering tree, and m and n are different;
The first deleting subunit 1342 is configured to delete the subtree node a_m and the subtree node a_n if neither the first number nor the second number reaches the number threshold;
The reservation subunit 1343 is configured to reserve the subtree node a_m and the subtree node a_n if the first number and the second number both reach the number threshold;
The second deleting subunit 1344 is configured to delete the subtree node a_m and replace the parent tree node with the subtree node a_n if the first number does not reach the number threshold and the second number reaches the number threshold;
Wherein the second deletion subunit 1344 is further configured to:
if the first number does not reach the number threshold and the second number reaches the number threshold, determine the topology nodes in the deleted subtree node a_m as noise nodes; acquire the central topology node of each node cluster in the clustering tree; determine the node cluster having the minimum vector distance to a noise node as a target node cluster based on the vector distance between the noise node and the central topology node; and add the noise node to the target node cluster.
The seventh determining subunit 1345 is configured to obtain a cluster tree according to the reserved tree nodes.
The specific implementation manners of the first obtaining subunit 1341, the first deleting subunit 1342, the reserving subunit 1343, the second deleting subunit 1344, and the seventh determining subunit 1345 may refer to the description of constructing the cluster tree in the embodiment corresponding to fig. 3, and will not be described again.
For specific implementation manners of the first determining unit 131, the constructing unit 132, the dividing unit 133 and the deleting unit 134, reference may be made to the description of step S102 in the embodiment corresponding to fig. 3, and details will not be further described here.
The first determining module 14 is configured to use the tree nodes for clustering in the cluster tree as the node cluster.
Wherein the first determining module 14 comprises: a second determination unit 141, a replacement unit 142, a deletion unit 143, and a third determination unit 144.
The second determining unit 141 is configured to take the stability parameter of a parent tree node b_x in the cluster tree as a first stability parameter, and take the sum of the stability parameter of the subtree node b_y and the stability parameter of the subtree node b_z corresponding to the parent tree node as a second stability parameter; the stability parameter is determined based on the vector distance;
The replacing unit 142 is configured to replace the first stability parameter with the second stability parameter if the first stability parameter is less than the second stability parameter;
The deleting unit 143 is configured to, if the first stability parameter is greater than or equal to the second stability parameter, determine the parent tree node b_x as a tree node for clustering and delete the subtree node b_z and the subtree node b_y;
The third determining unit 144 is configured to use the tree nodes for clustering as node clusters in the cluster tree.
For specific implementation manners of the second determining unit 141, the replacing unit 142, the deleting unit 143, and the third determining unit 144, reference may be made to the description of step S103 in the embodiment corresponding to fig. 3, and details will not be further described here.
The assembling module 15 is configured to assemble original texts corresponding to the text vectors in the node clusters to obtain fused texts;
the extraction module 16 is configured to extract keywords of the fused text as keywords of at least two original texts.
Wherein the extraction module 16 comprises: a second segmentation unit 161, a statistics unit 162, a fourth determination unit 163 and a fifth determination unit 164.
The second word segmentation unit 161 is configured to segment the fused text;
the counting unit 162 is configured to count the frequency of the participle in the node cluster and the inverse frequency of the participle in other node clusters except the node cluster;
the fourth determining unit 163 for determining keyword evaluation parameters of the participles based on the frequency and the inverse frequency;
the fifth determining unit 164, configured to select p segmented words as the keywords of the node cluster based on the keyword evaluation parameter; p is a positive integer.
For specific implementation manners of the second word segmentation unit 161, the statistics unit 162, the fourth determination unit 163, and the fifth determination unit 164, reference may be made to the description of step S105 in the embodiment corresponding to fig. 3, and details will not be further described here.
The second determining module 17 is configured to perform an average processing on the vector distances between the topological nodes in the node cluster to obtain a first average value corresponding to the topological nodes in the node cluster;
the first sorting module 18 is configured to sort the first average value, and display an original text corresponding to a topology node in a node cluster according to a sorting result of the first average value.
The number of the node clusters is at least two;
the third determining module 19 is configured to perform an average processing on the first average value of each of the at least two node clusters to obtain a second average value used for representing a clustering effect of the node clusters;
the second sorting module 20 is configured to sort the second average values respectively corresponding to the at least two node clusters, and display the at least two node clusters according to a sorting result of the second average values.
For specific implementation manners of the first obtaining module 11, the second obtaining module 12, the constructing module 13, the first determining module 14, the assembling module 15, the extracting module 16, the second determining module 17, the first sorting module 18, the third determining module 19, and the second sorting module 20, reference may be made to the description of step S201 to step S209 in the embodiment corresponding to fig. 9, which will not be further described herein. In addition, the beneficial effects of the same method are not described in detail.
Further, please refer to fig. 12, which is a schematic diagram of a computer device according to an embodiment of the present application. As shown in fig. 12, the computer device 1000 may be the user terminal 10b in the corresponding embodiment of fig. 2, and the computer device 1000 may include: at least one processor 1001 (such as a CPU), at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002. The communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display and a Keyboard, and the network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the aforementioned processor 1001. As shown in fig. 12, the memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer apparatus 1000 shown in fig. 12, the network interface 1004 is mainly used for network communication with the user terminal; the user interface 1003 is an interface for providing a user with input; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
acquiring at least two original texts;
acquiring text vectors corresponding to at least two original texts respectively, and constructing a clustering tree according to a vector distance between the at least two text vectors;
taking the tree nodes used for clustering in the clustering tree as node clusters;
assembling original texts corresponding to the text vectors in the node clusters to obtain a fusion text;
and extracting keywords of the fused text as keywords of at least two original texts.
It should be understood that the computer device 1000 described in this embodiment of the present application may perform the description of the data processing method in the embodiment corresponding to fig. 3 and fig. 9, and may also perform the description of the data processing apparatus 1 in the embodiment corresponding to fig. 11, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, here, it is to be noted that: an embodiment of the present application further provides a computer-readable storage medium, where the computer program executed by the aforementioned data processing apparatus 1 is stored in the computer-readable storage medium, and the computer program includes program instructions, and when the processor executes the program instructions, the description of the data processing method in the embodiment corresponding to fig. 3 or fig. 9 can be performed, so that details are not repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in embodiments of the computer-readable storage medium referred to in the present application, reference is made to the description of embodiments of the method of the present application. As an example, program instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network, which may comprise a block chain system.
It will be understood by those skilled in the art that all or part of the processes of the embodiments may be implemented by hardware related to instructions of a computer program, and the computer program may be stored in a computer readable storage medium, and when executed, may include processes such as those of the embodiments of the methods. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application; the present application is therefore not limited to the above embodiments, and equivalent variations and modifications made in accordance with the present application remain within its scope.

Claims (13)

1. A data processing method, comprising:
acquiring at least two original texts;
acquiring text vectors corresponding to the at least two original texts respectively;
determining the at least two text vectors as topological nodes, and acquiring vector distances between the topological nodes;
constructing a shortest path topological graph according to the topological nodes and the vector distance between the topological nodes; the weight parameter of an edge in the shortest path topology graph is determined by the vector distance;
based on the weight parameters of the edges in the shortest path topological graph, dividing the topological nodes in the shortest path topological graph to obtain an initial clustering tree corresponding to the shortest path topological graph;
pruning tree nodes in the initial clustering tree according to the number of topology nodes contained in each tree node, to obtain a clustering tree;
taking the tree nodes used for clustering in the clustering tree as node clusters;
assembling original texts corresponding to the text vectors in the node cluster to obtain a fusion text;
extracting keywords of the fused text to serve as keywords of the at least two original texts;
wherein the pruning of the tree nodes in the initial clustering tree comprises:
deleting tree nodes whose number of included topology nodes does not reach a number threshold, and determining the topology nodes in the deleted tree nodes as noise nodes;
acquiring a central topology node of each node cluster in the clustering tree;
determining, based on the vector distances between the noise node and the central topology nodes, the node cluster having the minimum vector distance from the noise node as a target node cluster;
adding the noise node to the target node cluster.
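For illustration, a minimal Python sketch of this noise-handling step follows. Two assumptions are labelled in the comments: node-cluster membership arrives as an integer label array, and the central topology node of a node cluster is taken to be its mean vector, since the claim does not fix how the central node is obtained. The name reassign_noise is a stand-in, not an identifier from the patent.

```python
import numpy as np

def reassign_noise(vectors, labels, min_cluster_size):
    # vectors: (n, d) text vectors; labels: integer cluster id per topology node.
    labels = labels.copy()
    for c in set(labels.tolist()):
        members = np.flatnonzero(labels == c)
        if len(members) < min_cluster_size:      # undersized tree node
            labels[members] = -1                 # its topology nodes become noise
    # Central topology node of each surviving node cluster, taken here as its centroid.
    centroids = {c: vectors[labels == c].mean(axis=0)
                 for c in set(labels.tolist()) if c != -1}
    for i in np.flatnonzero(labels == -1):       # attach each noise node to the
        if centroids:                            # cluster with the nearest centre
            labels[i] = min(centroids, key=lambda c:
                            np.linalg.norm(vectors[i] - centroids[c]))
    return labels
```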
2. The method according to claim 1, wherein the obtaining text vectors corresponding to the at least two original texts respectively comprises:
performing word segmentation on the at least two original texts;
matching the participles against a stop word list and a reserved word list to obtain a text to be filtered; the stop word list is used for storing participles to be filtered, and the reserved word list is used for determining whether to filter participles that do not belong to the stop word list;
removing the text to be filtered from the at least two original texts to obtain a text to be coded;
extracting the characteristics of the text to be coded to obtain an initial text vector corresponding to the text to be coded;
and performing dimensionality reduction on the initial text vector to obtain the text vector.
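An illustrative pipeline for claim 2. Whitespace tokenization stands in for a real word segmenter, the kept tokens directly form the text to be encoded (rather than materializing and removing a separate text to be filtered), and TF-IDF plus truncated SVD stand in for the unspecified feature extraction and dimensionality reduction; none of these choices come from the patent.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def texts_to_vectors(texts, stop_words, reserved_words, dim=50):
    filtered = []
    for text in texts:
        tokens = text.split()                    # stand-in for a real word segmenter
        kept = [t for t in tokens if t not in stop_words
                and (not reserved_words or t in reserved_words)]
        filtered.append(" ".join(kept))          # the "text to be encoded"
    initial = TfidfVectorizer().fit_transform(filtered)   # initial text vectors
    svd = TruncatedSVD(n_components=max(1, min(dim, initial.shape[1] - 1)))
    return svd.fit_transform(initial)            # dimension-reduced text vectors
```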
3. The method according to claim 2, wherein the extracting features of the text to be encoded to obtain an initial text vector corresponding to the text to be encoded comprises:
carrying out character encoding on characters in the text to be encoded to obtain character vectors corresponding to the characters in the text to be encoded;
carrying out position coding on characters in the text to be coded to obtain position vectors corresponding to the characters in the text to be coded;
and determining an initial text vector corresponding to the text to be coded based on the character vector and the position vector.
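A sketch of claim 3's two encodings, assuming a character-embedding table (left untrained here for brevity) and the standard sinusoidal position encoding, with mean pooling to form the initial text vector; the claim specifies none of these choices.

```python
import numpy as np

def encode_text(text, char_to_id, d_model=64):
    rng = np.random.default_rng(0)
    # Character vectors: one row per known character (untrained in this sketch).
    char_table = rng.normal(size=(len(char_to_id), d_model))
    ids = [char_to_id[c] for c in text if c in char_to_id]
    if not ids:
        return np.zeros(d_model)
    pos = np.arange(len(ids))[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, 2 * (i // 2) / d_model)
    pos_vec = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))  # position vectors
    # Initial text vector: character vector plus position vector, mean-pooled.
    return (char_table[np.array(ids)] + pos_vec).mean(axis=0)
```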
4. The method of claim 1, wherein constructing the shortest path topology graph according to the topology nodes and the vector distances between the topology nodes comprises:
constructing an initial topology graph according to the topology nodes and the vector distances between the topology nodes; the initial topology graph comprises a topology node ti, a topology node tj, and a topology node tk; i, j, and k are positive integers less than or equal to N, N is the number of the at least two text vectors, and i, j, and k are different from each other;
selecting the topology node ti from the initial topology graph, determining the topology node tj corresponding to the connecting edge of the topology node ti having the minimum weight parameter, and adding the topology node ti and the topology node tj to the shortest path topology graph;
determining the topology nodes in the initial topology graph other than the topology nodes contained in the shortest path topology graph as remaining topology nodes;
determining, among the remaining topology nodes, the topology node tk whose connecting edge to the topology node ti or the topology node tj has the minimum weight parameter, and adding the topology node tk to the shortest path topology graph, until the remaining topology nodes are empty and construction of the shortest path topology graph is complete.
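The construction in claim 4 is, in effect, Prim's algorithm for a minimum spanning tree over the pairwise vector distances; the cited non-patent reference on Prim's and Kruskal's algorithms reads the same way. A minimal numpy sketch under that assumption, with dist a symmetric distance matrix:

```python
import numpy as np

def prim_mst(dist):
    # dist: symmetric (n, n) matrix of vector distances between topology nodes.
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True                     # start from an arbitrary topology node
    best = dist[0].copy()                 # cheapest known edge into the tree
    parent = np.zeros(n, dtype=int)       # tree endpoint realizing that edge
    edges = []
    for _ in range(n - 1):
        best[in_tree] = np.inf            # never re-select nodes already in the tree
        k = int(np.argmin(best))          # remaining node with the minimum-weight edge
        edges.append((int(parent[k]), k, float(best[k])))
        in_tree[k] = True
        closer = dist[k] < best           # does k offer a cheaper edge to anyone left?
        best[closer] = dist[k][closer]
        parent[closer] = k
    return edges                          # the n-1 edges of the shortest path topology graph
```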
5. The method according to claim 4, wherein the determining the at least two text vectors as topological nodes and obtaining the vector distance between the topological nodes comprises:
determining original distances respectively corresponding to K topology nodes associated with the topology node ti, and determining the largest original distance to the topology node ti as a first distance; K is a positive integer less than or equal to N;
determining original distances respectively corresponding to K topology nodes associated with the topology node tj, and determining the largest original distance to the topology node tj as a second distance;
determining the original distance between the topology node ti and the topology node tj as a third distance;
determining the vector distance between the topology node ti and the topology node tj based on the first distance, the second distance, and the third distance.
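The first and second distances here are the core distances used by HDBSCAN (which the non-patent citations discuss), and the vector distance built from them is a mutual reachability distance. A minimal numpy sketch, assuming the three distances are combined by taking their maximum, a rule the claim itself leaves open:

```python
import numpy as np

def mutual_reachability(dist, k):
    # dist: symmetric (n, n) matrix of original distances (zero diagonal).
    # Core distance: largest original distance among a node's K nearest neighbours.
    core = np.sort(dist, axis=1)[:, k]    # index 0 is the zero self-distance
    # Vector distance = max(core_i, core_j, original distance) for each node pair.
    return np.maximum(np.maximum.outer(core, core), dist)
```

Taking the maximum pushes distances up around sparse points, so the subsequent spanning tree tends to split dense regions apart before sparse ones.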
6. The method according to claim 1, wherein the pruning the tree nodes in the initial clustering tree according to the number of topology nodes included in the tree nodes in the initial clustering tree to obtain a clustering tree comprises:
acquiring the number of topology nodes contained in a subtree node am of a parent tree node in the initial clustering tree as a first number, and acquiring the number of topology nodes contained in a subtree node an of the parent tree node as a second number; m and n are positive integers less than or equal to F, F is the number of tree nodes contained in the initial clustering tree, and m and n are different from each other;
if neither the first number nor the second number reaches the number threshold, deleting the subtree node am and the subtree node an;
if both the first number and the second number reach the number threshold, retaining the subtree node am and the subtree node an;
if the first number does not reach the number threshold and the second number reaches the number threshold, deleting the subtree node am and replacing the parent tree node with the subtree node an;
obtaining the clustering tree according to the retained tree nodes.
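A recursive sketch of the three cases of claim 6, assuming the initial clustering tree is binary and stored as nested dicts in which 'points' lists every topology node in the subtree and 'left'/'right' are optional children; this node layout is an assumption of the sketch, not the patent's data structure.

```python
def prune(node, min_size):
    if "left" not in node:                 # leaf: nothing to prune
        return node
    left = prune(node["left"], min_size)
    right = prune(node["right"], min_size)
    big_l = len(left["points"]) >= min_size
    big_r = len(right["points"]) >= min_size
    if big_l and big_r:                    # genuine split: keep both subtrees
        node["left"], node["right"] = left, right
        return node
    if big_l != big_r:                     # one survivor replaces its parent
        return left if big_l else right
    node.pop("left"); node.pop("right")    # both undersized: parent becomes a leaf
    return node
```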
7. The method according to claim 1, wherein the step of regarding the tree nodes for clustering in the cluster tree as node clusters comprises:
taking a stability parameter of a parent tree node bx in the clustering tree as a first stability parameter, and taking the sum of the stability parameter of a subtree node by and the stability parameter of a subtree node bz corresponding to the parent tree node as a second stability parameter; the stability parameters are determined based on the vector distance;
if the first stability parameter is less than the second stability parameter, replacing the first stability parameter with the second stability parameter;
if the first stability parameter is greater than or equal to the second stability parameter, determining the parent tree node bx as a tree node for clustering, and deleting the subtree node by and the subtree node bz;
taking the tree nodes for clustering as node clusters of the clustering tree.
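Claim 7 mirrors HDBSCAN's stability-based cluster extraction: children survive only when their combined stability beats the parent's. A sketch using the dict nodes of the previous example, extended with a precomputed 'stability' value; how that value derives from the vector distances is left open here.

```python
def select_clusters(node):
    if "left" not in node:                 # leaf: a cluster candidate by default
        return [node]
    kept_l = select_clusters(node["left"])
    kept_r = select_clusters(node["right"])
    child_sum = node["left"]["stability"] + node["right"]["stability"]
    if node["stability"] < child_sum:
        node["stability"] = child_sum      # children win; bubble their score upward
        return kept_l + kept_r
    return [node]                          # parent wins; its subtree is discarded
```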
8. The method according to claim 1, wherein the extracting the keywords of the fused text as the keywords of the at least two original texts comprises:
segmenting the fused text;
counting the frequency of the participle in the node cluster and the inverse frequency of the participle in other node clusters except the node cluster;
determining keyword evaluation parameters of the participles based on the frequency and the inverse frequency;
selecting p participles as keywords of the node cluster based on the keyword evaluation parameters; p is a positive integer.
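Claim 8 amounts to a per-cluster TF-IDF: a participle's frequency inside the node cluster is weighted by its inverse frequency across the node clusters. A sketch, with add-one smoothing in the denominator as an assumption, since the claim fixes no exact formula:

```python
import math
from collections import Counter

def cluster_keywords(cluster_tokens, all_cluster_tokens, p=5):
    # cluster_tokens: participles of this node cluster's fused text.
    # all_cluster_tokens: one token list per node cluster (this one included).
    tf = Counter(cluster_tokens)
    total = sum(tf.values())
    n = len(all_cluster_tokens)
    scores = {}
    for word, count in tf.items():
        containing = sum(1 for toks in all_cluster_tokens if word in toks)
        idf = math.log(n / (1 + containing))    # inverse frequency, smoothed
        scores[word] = (count / total) * idf    # keyword evaluation parameter
    # The p participles with the highest evaluation parameters become keywords.
    return sorted(scores, key=scores.get, reverse=True)[:p]
```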
9. The method of claim 1, further comprising:
performing mean processing on the vector distances between the topology nodes in the node cluster to obtain a first mean corresponding to each topology node in the node cluster;
sorting the first means, and displaying the original texts corresponding to the topology nodes in the node cluster according to the sorting result of the first means.
10. The method of claim 9, wherein the number of node clusters is at least two;
the method further comprises the following steps:
performing mean processing on the first means within each of the at least two node clusters to obtain, for each node cluster, a second mean representing its clustering effect;
sorting the second means respectively corresponding to the at least two node clusters, and displaying the at least two node clusters according to the sorting result of the second means.
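A combined sketch of claims 9 and 10, assuming Euclidean vector distances: each topology node's first mean is its average distance to the other members of its node cluster, and each node cluster's second mean averages those first means to score the clustering effect.

```python
import numpy as np

def rank_clusters(vectors, labels):
    # vectors: (n, d) text vectors; labels: integer node-cluster label per vector.
    first_means, second_means = {}, {}
    for c in sorted(set(labels.tolist())):
        idx = np.flatnonzero(labels == c)
        member = vectors[idx]
        d = np.linalg.norm(member[:, None, :] - member[None, :, :], axis=-1)
        first = d.sum(axis=1) / max(len(idx) - 1, 1)   # first mean: avg distance to peers
        order = np.argsort(first)                      # display the tightest texts first
        first_means[c] = [(int(idx[i]), float(first[i])) for i in order]
        second_means[c] = float(first.mean())          # second mean: clustering effect
    cluster_order = sorted(second_means, key=second_means.get)
    return first_means, cluster_order                  # per-text and per-cluster ranking
```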
11. A data processing apparatus, comprising:
the first acquisition module is used for acquiring at least two original texts;
the second obtaining module is used for obtaining text vectors corresponding to the at least two original texts respectively;
the building module is used for building a clustering tree according to the vector distance between at least two text vectors;
the first determining module is used for taking the tree nodes used for clustering in the clustering tree as node clusters;
the assembling module is used for assembling the original texts corresponding to the text vectors in the node clusters to obtain fused texts;
the extraction module is used for extracting the keywords of the fused text as the keywords of the at least two original texts;
wherein the building block comprises:
the first determining unit is used for determining at least two text vectors as topological nodes and acquiring vector distances between the topological nodes;
the construction unit is used for constructing a shortest path topological graph according to the topological nodes and the vector distance between the topological nodes; the weight parameter of the edge in the shortest path topology graph is determined by the vector distance;
the partitioning unit is used for partitioning the topological nodes in the shortest path topological graph based on the weight parameters of the edges in the shortest path topological graph to obtain an initial clustering tree corresponding to the shortest path topological graph;
a deleting unit, configured to delete a tree node whose number of included topology nodes does not reach a number threshold according to the number of topology nodes included in the tree node in the initial clustering tree, and determine the topology node in the deleted tree node as a noise node; acquiring a central topological node of a node cluster in the clustering tree; determining a node cluster having a minimum vector distance with the noise node as a target node cluster based on the vector distance between the noise node and the central topology node; adding the noise node to the target node cluster.
12. A computer device, comprising: a processor, a memory, a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is used for providing data communication functions, the memory is used for storing a computer program, and the processor is used for calling the computer program to perform the method according to any one of claims 1-10.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the method according to any one of claims 1-10.
CN202010082953.2A 2020-02-07 2020-02-07 Data processing method and device, computer equipment and storage medium Active CN111259154B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010082953.2A CN111259154B (en) 2020-02-07 2020-02-07 Data processing method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111259154A CN111259154A (en) 2020-06-09
CN111259154B true CN111259154B (en) 2021-04-13

Family

ID=70948234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010082953.2A Active CN111259154B (en) 2020-02-07 2020-02-07 Data processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111259154B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111884832B (en) * 2020-06-29 2022-06-14 华为技术有限公司 Method for acquiring passive network topology information and related equipment
CN111768001B (en) * 2020-06-30 2024-01-23 平安国际智慧城市科技股份有限公司 Language model training method and device and computer equipment
CN113380414B (en) * 2021-05-20 2023-11-10 心医国际数字医疗系统(大连)有限公司 Data acquisition method and system based on big data
CN113537416A (en) * 2021-09-17 2021-10-22 深圳市安软科技股份有限公司 Method and related equipment for converting text into image based on generative adversarial network
CN116566995B (en) * 2023-07-10 2023-09-22 安徽中科晶格技术有限公司 Block chain data transmission method based on classification and clustering algorithm

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102053992B (en) * 2009-11-10 2014-12-10 阿里巴巴集团控股有限公司 Clustering method and system
CN107656948B (en) * 2016-11-14 2019-05-07 平安科技(深圳)有限公司 The problems in automatically request-answering system clustering processing method and device
CN109885684B (en) * 2019-01-31 2022-11-22 腾讯科技(深圳)有限公司 Cluster-like processing method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105159998A (en) * 2015-09-08 2015-12-16 海南大学 Keyword calculation method based on document clustering
CN105468713A (en) * 2015-11-19 2016-04-06 西安交通大学 Multi-model fused short text classification method
CN107103043A (en) * 2017-03-29 2017-08-29 国信优易数据有限公司 A kind of Text Clustering Method and system
CN108875760A (en) * 2017-05-11 2018-11-23 阿里巴巴集团控股有限公司 clustering method and device
CN110413745A (en) * 2019-06-21 2019-11-05 阿里巴巴集团控股有限公司 Selection represents the method for text, determines the method and device of typical problem

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
[Machine Learning] HDBSCAN, a density-based clustering algorithm; ACM_hades's blog; https://blog.csdn.net/ACM_hades/article/details/90906677; 2019-06-05; pp. 1-10 *
Dynamic tracking of customer electricity consumption behavior patterns based on HDBSCAN; Wang Jiye et al.; 《供用电》 (Distribution & Utilization); 2019-01-31; full text *
Minimum spanning tree: Prim's algorithm and Kruskal's algorithm; 裸睡的猪; https://www.cnbolgs.com/ggzhangxiaochao/p/9070873.html; 2018-05-22; pp. 1-7 *

Also Published As

Publication number Publication date
CN111259154A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN111259154B (en) Data processing method and device, computer equipment and storage medium
Garreta et al. Learning scikit-learn: machine learning in python
RU2628436C1 (en) Classification of texts on natural language based on semantic signs
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN108874878A (en) A kind of building system and method for knowledge mapping
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN106959946B (en) Text semantic feature generation optimization method based on deep learning
CN112417289B (en) Information intelligent recommendation method based on deep clustering
CN112580328A (en) Event information extraction method and device, storage medium and electronic equipment
CN111309916B (en) Digest extracting method and apparatus, storage medium, and electronic apparatus
CN115130538A (en) Training method of text classification model, text processing method, equipment and medium
CN114528898A (en) Scene graph modification based on natural language commands
CN110162637A (en) Information Atlas construction method, device and equipment
CN113849653B (en) Text classification method and device
CN114942994A (en) Text classification method, text classification device, electronic equipment and storage medium
CN110889505A (en) Cross-media comprehensive reasoning method and system for matching image-text sequences
CN114490926A (en) Method and device for determining similar problems, storage medium and terminal
CN113761192A (en) Text processing method, text processing device and text processing equipment
CN114461943B (en) Deep learning-based multi-source POI semantic matching method and device and storage medium thereof
CN112463974A (en) Method and device for establishing knowledge graph
CN113255345B (en) Semantic recognition method, related device and equipment
CN115329210A (en) False news detection method based on interactive graph layered pooling
CN111368531B (en) Translation text processing method and device, computer equipment and storage medium
CN113392220A (en) Knowledge graph generation method and device, computer equipment and storage medium
KR20210150103A (en) Collaborative partner recommendation system and method based on user information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40024294)
GR01 Patent grant