CN109033084B - Semantic hierarchical tree construction method and device

Info

Publication number
CN109033084B
Authority
CN
China
Prior art keywords
words
group
cluster
clustering
data set
Prior art date
Legal status
Active
Application number
CN201810836275.7A
Other languages
Chinese (zh)
Other versions
CN109033084A
Inventor
蔡世清
郑凯
段立新
江建军
夏虎
Current Assignee
Guoxin Youe Data Co Ltd
Original Assignee
Guoxin Youe Data Co Ltd
Priority date
Filing date
Publication date
Application filed by Guoxin Youe Data Co Ltd
Priority to CN201810836275.7A
Publication of CN109033084A
Application granted
Publication of CN109033084B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The application provides a semantic hierarchical tree construction method and a semantic hierarchical tree construction device, wherein the method comprises the following steps: classifying the data set to obtain at least one group, wherein each group comprises at least one word; performing interclass hierarchical clustering on the at least one group to obtain a first hierarchical clustering subtree, wherein each group is a leaf node of the first hierarchical clustering subtree; performing intra-group hierarchical clustering on each group to obtain a second hierarchical clustering subtree corresponding to each group, wherein words included in each group are leaf nodes of the second hierarchical clustering subtree corresponding to the group; and constructing the semantic hierarchical tree according to the first hierarchical clustering subtree and the second hierarchical clustering subtree. The embodiment of the application can be used for quickly constructing the semantic hierarchical tree based on the large-scale data set.

Description

Semantic hierarchical tree construction method and device
Technical Field
The application relates to the technical field of natural language processing, in particular to a semantic hierarchical tree construction method and a semantic hierarchical tree construction device.
Background
In the field of natural language processing, existing language models typically rely on machine learning algorithms. The nature of machine learning is prediction: after a machine learning model is trained on a large training data set to obtain a natural language processing model, the data to be processed can be input into the trained natural language processing model to obtain a prediction result corresponding to the data to be processed.
When performing language processing tasks, most natural language processing models need to predict the option with the highest probability from a million-level vocabulary or entity set. For example, a machine translation model needs to predict, at each time step, the meaning of a word in the context of the target to be translated; likewise, an entity recognition model needs to predict the entity pointed to by a text segment, i.e., the entity class. Because the option with the highest probability is predicted from a million-level vocabulary or entity set, the language processing task executed by the natural language processing model requires an ultra-large-scale matrix operation at the output layer, which consumes substantial computing resources and performs poorly in scenarios with high real-time requirements.
To address these problems, the currently adopted approach is to replace the one-step matrix operation that maps the hidden layer to the output layer of the original natural language processing model with a step-by-step traversal of a Huffman coding tree, so that predicting each word requires only a small number of binary logistic regressions to reach a leaf node of the Huffman coding tree and obtain the final prediction result. However, the Huffman coding tree is built from word frequencies and cannot represent the relationships between words, so two very similar words may be placed on completely different branches. As a result, when the prediction of a word deviates slightly, the returned result can be completely unrelated.
If a semantic hierarchical tree is used to replace the Huffman coding tree in the natural language processing model, the computational complexity cannot meet implementation requirements for large-scale (e.g., million-word) data sets, because constructing a semantic hierarchical tree requires measuring a pairwise relationship matrix between every two data items.
Disclosure of Invention
In view of this, an object of the embodiments of the present application is to provide a semantic hierarchy tree construction method and apparatus, which can quickly construct a semantic hierarchy tree based on a large-scale data set.
In a first aspect, an embodiment of the present application provides a semantic hierarchy tree construction method, including:
classifying the data set to obtain at least one group, wherein each group comprises at least one word;
performing intergroup hierarchical clustering on the at least one group to obtain a first hierarchical clustering subtree, wherein each group is a leaf node of the first hierarchical clustering subtree; and
performing intra-group hierarchical clustering on each group to obtain a second hierarchical clustering subtree corresponding to each group, wherein words included in each group are leaf nodes of the second hierarchical clustering subtree corresponding to the group;
and constructing the semantic hierarchical tree according to the first hierarchical clustering subtree and the second hierarchical clustering subtree.
Optionally, classifying the data set to obtain at least one group, specifically including:
and clustering the words in the data set according to the similarity among the words in the data set to obtain the at least one group.
Optionally, clustering the words in the data set according to the similarity between the words in the data set to obtain the at least one group, specifically including:
(i) Taking all words in the data set as a first cluster;
(ii) Determining a cluster center of a first cluster, wherein a vector corresponding to the cluster center is an average value of vectors corresponding to all words in the first cluster, and determining words in a preset similarity range with a word closest to the cluster center of the first cluster as a center to form a second cluster;
(iii) Taking the second cluster as the first cluster and returning to step (ii) for calculation until an iteration stop condition is met; taking the finally obtained second cluster as a clustered group and the words in the second cluster as words that have completed clustering;
(iv) Taking all words in the data set that have not completed clustering as a new first cluster and returning to step (ii) for calculation, until all words in the data set are clustered, to obtain a plurality of clustered groups.
Optionally, clustering the words in the data set according to the similarity between the words in the data set, specifically including:
randomly selecting K words from the data set as initial clustering centers according to a preset grouping number K;
for each initial cluster center, performing the following steps:
(i) Taking the words with the similarity smaller than a first preset similarity with the clustering center and the clustering center as a first cluster, and calculating the cluster center of the first cluster, wherein the vector corresponding to the cluster center is the average value of the vectors corresponding to the words in the first cluster;
(ii) Determining words in a preset similarity range with the center by taking the word closest to the cluster center similarity of the first cluster as the center to form a second cluster;
(iii) Taking the second cluster as a new first cluster and returning to step (i) for calculation until an iteration stop condition is met; taking the finally obtained second cluster as a clustered group.
Optionally, clustering the words in the data set according to the similarity between the words in the data set, specifically including:
(i) Taking any one word in the words which are not clustered currently in the data set as a clustering center, and sequentially calculating the similarity between other words which are not clustered currently and the clustering center;
(ii) In descending order of similarity to the clustering center, taking a preset number of words from the other currently unclustered words and assigning them to the same group as the clustering center, and regarding all words in the group as having completed clustering;
(iii) Returning to step (i) for calculation until all words in the data set are clustered.
Optionally, clustering the words in the data set according to the similarity between the words in the data set, specifically including:
(i) Taking the data set as a set to be split; (ii) Randomly selecting 2 words from the set to be split as an initial clustering center;
(iii) Respectively calculating the similarity between each word in the set to be split and the two initial clustering centers, grouping the words into a group with the clustering center with higher similarity, and splitting to obtain two intermediate groups;
(iv) If the number of words contained in an intermediate group is greater than a preset word-count threshold, taking that intermediate group as a new set to be split and returning to step (ii) for calculation, until the number of words contained in each intermediate group is not greater than the preset word-count threshold; taking each intermediate group whose word count is not greater than the preset word-count threshold as a clustered group.
Optionally, the iteration stop condition comprises one or more of: the words in the second cluster are not changed any more, the number of the words in the second cluster is not more than a preset word number threshold, and the iteration number reaches a preset number threshold.
Optionally, the performing inter-group hierarchical clustering on the at least one group to obtain a first hierarchical clustering sub-tree specifically includes:
calculating the average value of the vectors corresponding to all words in each group to obtain the average vector corresponding to the group;
and performing intergroup hierarchical clustering on each group according to the average vector corresponding to each group.
Optionally, the constructing the semantic hierarchy tree according to the first-level clustering sub-tree and the second-level clustering sub-tree specifically includes:
connecting the first hierarchical clustering subtree and the second hierarchical clustering subtrees by taking the root node of the first hierarchical clustering subtree as the root node of the semantic hierarchical tree and taking the root node of each second hierarchical clustering subtree as the corresponding leaf node of the first hierarchical clustering subtree, so as to generate the semantic hierarchical tree.
In a second aspect, an embodiment of the present application further provides a semantic hierarchy tree constructing apparatus, including:
the grouping module is used for classifying the data set to obtain at least one group, and each group comprises at least one word;
the inter-group hierarchical clustering module is used for performing inter-group hierarchical clustering on the at least one group to obtain a first hierarchical clustering sub-tree, wherein each group is a leaf node of the first hierarchical clustering sub-tree;
the intra-group hierarchical clustering module is used for performing intra-group hierarchical clustering on each group to obtain a second hierarchical clustering subtree corresponding to each group, wherein words included in each group are leaf nodes of the second hierarchical clustering subtree corresponding to the group;
and the semantic hierarchical clustering tree constructing module is used for constructing the semantic hierarchical tree according to the first hierarchical clustering subtree and the second hierarchical clustering subtree.
The method comprises the steps of classifying a data set to obtain at least one group, wherein each group comprises at least one word; performing interclass hierarchical clustering on the at least one group to obtain a first hierarchical clustering subtree, wherein each group is a leaf node of the first hierarchical clustering subtree; performing intra-group hierarchical clustering on each group to obtain a second hierarchical clustering subtree corresponding to each group, wherein words included in each group are leaf nodes of the second hierarchical clustering subtree corresponding to the group; and constructing the semantic hierarchical tree according to the first hierarchical clustering subtree and the second hierarchical clustering subtree, so that the speed of constructing the hierarchical clustering tree can be increased, the calculation amount required by constructing the hierarchical clustering tree is reduced, and the calculation complexity is reduced, thereby meeting the requirement of quickly constructing the semantic hierarchical tree on the basis of a large-scale data set.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
FIG. 1 is a flowchart illustrating a semantic hierarchy tree construction method provided in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a specific method for obtaining a data set in a semantic hierarchy tree construction method provided in an embodiment of the present application;
fig. 3 is a flowchart illustrating a specific method for clustering words in a data set in the semantic hierarchy tree building method provided in the embodiment of the present application;
fig. 4 is a flowchart illustrating a specific method for clustering words in a data set in the semantic hierarchy tree constructing method provided in the embodiment of the present application;
fig. 5 is a flowchart illustrating a third specific method for clustering words in a data set in the semantic hierarchy tree construction method provided in the embodiment of the present application;
fig. 6 is a flowchart illustrating a fourth specific method for clustering words in a data set in the semantic hierarchy tree construction method provided in the embodiment of the present application;
FIG. 7 is a diagram illustrating a structure of a second-level clustering sub-tree in an example provided by an embodiment of the present application;
FIG. 8 is a diagram illustrating a structure of a first-level clustering sub-tree in an example provided by an embodiment of the present application;
FIG. 9 is a diagram illustrating a structure of a hierarchical clustering tree in an example provided by an embodiment of the present application;
FIG. 10 is a schematic structural diagram illustrating a semantic hierarchy tree building apparatus according to an embodiment of the present application;
fig. 11 shows a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
At present, when a semantic hierarchical tree is constructed, the relationship matrix between every two data items must be measured, so the computational complexity cannot meet implementation requirements for a large-scale (e.g., million-level) data set. In view of this, the semantic hierarchical tree construction method and device provided by the present application can quickly construct a semantic hierarchical tree on the basis of a large-scale data set.
Different from the prior art, the present application classifies the data set into at least one group, performs inter-group hierarchical clustering and intra-group hierarchical clustering, and constructs the semantic hierarchical tree from the first hierarchical clustering subtree obtained by the inter-group clustering and the second hierarchical clustering subtrees obtained by the intra-group clustering. The semantic hierarchical tree replaces the role of the Huffman coding tree in a natural language processing model, so that even if the prediction of a word deviates, a similar option is returned rather than a completely unrelated result.
In order to facilitate understanding of the embodiment, a semantic hierarchy tree construction method disclosed in the embodiment of the present application is first described in detail.
Referring to fig. 1, a semantic hierarchy tree construction method provided in the embodiment of the present application includes:
s101: the data set is classified to obtain at least one group, and each group comprises at least one word.
When implemented in detail, a data set refers to a collection of data that includes a plurality of words. In the process of training the natural language processing model, in order to ensure the accuracy of the natural language processing model, it is necessary to ensure that the number of words in the data set is as large as possible.
Referring to fig. 2, the embodiment of the present application provides a way to obtain the data set:
s201: and obtaining the corpus from a preset platform.
Here, the corpus may be crawled from a preset platform through technologies such as web crawlers and crawling tools. When the corpus is crawled, no restriction needs to be imposed: any text that appears on the preset platform can be used as crawled corpus. Optionally, since the usage of a word changes continuously, its relevance to other words can also change with usage; for example, the term "dog food" originally means "special food for feeding dogs" but is now also used colloquially in connection with "love". Certain restrictions can also be imposed on the crawled corpus, for example, limiting the generation time of the crawled corpus, such as acquiring only corpora produced within 3 years of the current time.
Optionally, when the corpus is obtained, in order to determine the domain keyword of a certain domain more quickly, the corpus of the domain determined in the preset platform may be obtained in a targeted manner. Thus, the domain keyword corresponding to each domain can be quickly acquired.
S202: and performing word segmentation processing on the speech by adopting a word segmentation model obtained by pre-training to obtain a plurality of words, and taking a set formed by the words as the data set.
For example, the segmentation model may be any one of a character string matching-based segmentation model, a statistical-based segmentation model, a neural network-based segmentation model, and an N-shortest path-based segmentation model.
The principle of the string-matching-based word segmentation model is as follows: according to a certain strategy, the Chinese character string to be analyzed is matched against the entries of a "sufficiently large" machine dictionary; if a certain character string is found in the dictionary, the match succeeds and a word is recognized. According to the scanning direction, string-matching segmentation can be divided into forward matching and reverse matching; according to the preferred matching length, into maximum (longest) matching and minimum (shortest) matching; and according to whether it is combined with part-of-speech tagging, into pure segmentation methods and integrated methods that combine segmentation and tagging.
The principle of the statistics-based word segmentation model is as follows: the frequency of adjacent co-occurring character combinations in the corpus is counted, and their co-occurrence information is calculated. The mutual co-occurrence information of two characters is defined, and the adjacent co-occurrence probability of two Chinese characters X and Y is calculated. This co-occurrence information reflects how tightly the characters are bound to each other; when the tightness is above a certain threshold, the character group is considered likely to constitute a word.
The principle of the N-shortest-path word segmentation model is as follows: all possible words in the character string are found according to the dictionary, and a word-segmentation directed acyclic graph is constructed. Each word corresponds to a directed edge in the graph and is assigned a corresponding edge length (weight). Then, among all paths from the start point to the end point of the segmentation graph, the path sets whose length values are in strictly ascending order (values at any two different positions are unequal, the same below) are taken in turn as the 1st, 2nd, ..., i-th, ..., N-th path sets and serve as the corresponding coarse segmentation result sets. If two or more paths have equal lengths, they are tied at the i-th rank and are all included in the coarse segmentation result set without affecting the ranks of the other paths, so the size of the final coarse segmentation result set is greater than or equal to N.
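As an illustration of how the data set can be obtained from the crawled corpus, the following is a minimal sketch using the jieba segmenter; the patent does not prescribe a particular segmentation tool, and the sample sentences are hypothetical placeholders.

```python
# Minimal sketch: build the word data set by segmenting crawled corpus texts.
# jieba is used only as an illustrative segmenter; the sample texts are placeholders.
import jieba

corpus = [
    "自然语言处理模型需要大量训练数据",
    "语义层次树可以替换哈夫曼编码树",
]

data_set = set()
for text in corpus:
    for word in jieba.lcut(text):   # dictionary/statistics-based segmentation
        word = word.strip()
        if word:                    # drop whitespace tokens
            data_set.add(word)

print(sorted(data_set))
```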
After the data sets are obtained, the data sets are classified. In order to be able to divide similar or homogeneous words into the same branch of the semantic hierarchy tree when constructing the semantic hierarchy tree, the classification of the data set is generally performed based on the similarity between words in the data set.
To obtain the similarity between words, the words in the data set may be mapped into a high-dimensional space, forming a vector for each word. The distance between vectors can then be used to characterize the similarity between the corresponding words: the closer the distance between the vectors, the higher the similarity between the corresponding words; the farther the distance, the lower the similarity.
In the embodiment of the present application, the word2vec algorithm may be adopted to obtain a vector for each word in the data set. word2vec is a word-to-vector mapping: words are mapped into a new space and, through statistics computed over a large corpus and training in a neural network, each word is represented by a multi-dimensional continuous real-valued vector; the word2vec model is a large matrix that stores the representation vectors of all words. The similarity between words can then be determined by computing the distance between the vectors corresponding to the words. The distance between vectors may include one or more of: Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, standardized Euclidean distance, Mahalanobis distance, cosine of the included angle, Hamming distance, Jaccard distance, correlation distance, and information entropy.
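The following is a minimal sketch of this similarity measure, assuming word vectors are already available; here they are random placeholders standing in for trained word2vec embeddings, and the words and the 128-dimensional size are illustrative.

```python
# Sketch of the similarity measure: each word is mapped to a vector and the
# vector distance characterizes word similarity.  Vectors are placeholders.
import numpy as np

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=128) for w in ["猫", "狗", "汽车"]}

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Smaller distance (or larger cosine) means the two words are more similar.
print(euclidean(vectors["猫"], vectors["狗"]), cosine(vectors["猫"], vectors["狗"]))
```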
The data set is classified to form one or more groups, each containing at least one word, and the similarity between the words within each group needs to meet a certain similarity requirement. In an embodiment of the present application, the at least one group may be obtained by clustering the words in the data set according to the similarity between them. In a specific implementation, any one of the following clustering methods may be used to cluster the words in the data set:
one is as follows: a first method for clustering words in a data set provided in an embodiment of the present application is shown in fig. 3, and includes:
s301: all words in the data set are taken as a first cluster.
S302: determining a cluster center of a first cluster, wherein a vector corresponding to the cluster center is an average value of vectors corresponding to all words in the first cluster, and determining words in a preset similarity range with a word closest to the cluster center of the first cluster as a center to form a second cluster;
s303: detecting whether an iteration stop condition is met; if so, S304 is performed. If not, the second cluster is taken as the first cluster, and the step S302 is returned to for calculation.
S304: taking the finally obtained second cluster as a group after clustering, and taking the words in the second cluster as the words completing clustering; s305 is executed.
S305: detecting whether unfinished clustered words exist currently; if yes, taking all unfinished clustered words in the data set as a first cluster, returning to the step S302 for calculation, and if not, ending.
In the method for clustering words in the data set, when clustering is performed for the first time, all words in the data set are not completely clustered, so that when clustering is performed for the first time, all words are used as a first cluster, and a vector corresponding to a cluster center of the first cluster is calculated. When clustering is not performed for the first time, because some words are clustered, the words which are clustered in the clustering process are removed, and only the remaining words which are not clustered are clustered, that is, the words which are not clustered in the data set are used as a first cluster, and a vector corresponding to the cluster center of the first cluster is calculated.
Here, when calculating the vector corresponding to the cluster center of the first cluster, the average value of the vectors corresponding to the words in the first cluster is obtained. At this time, the dimensions of the vectors of the words in the first cluster are the same. Assuming that the dimension of the vector of each word in the first cluster is m × n, the dimension of the vector of the cluster center is also m × n, and each element in the vector of the cluster center is an average value of corresponding position elements in the vectors of all words in the first cluster.
The element in row i and column j of the cluster center can be denoted $B_{i,j}$. Assuming there are k words in the first cluster and the element in row i and column j of the vector of the s-th word in the first cluster is denoted $A_{i,j}^{(s)}$, then $B_{i,j}$ satisfies the following formula (1):

$$B_{i,j} = \frac{1}{k}\sum_{s=1}^{k} A_{i,j}^{(s)} \qquad (1)$$
In each iteration cycle, after the vector of the cluster center of the first cluster is computed, the distances between the vector of each word in the first cluster and the vector of the cluster center are calculated in turn in order to find the word most similar to the cluster center; the word corresponding to the vector with the smallest distance is taken as the word closest to the cluster center of the first cluster. Then, the words whose similarity to this center word falls within a preset similarity range are determined from the remaining words in the first cluster to form a second cluster, the second cluster is taken as a new first cluster, and the step of calculating the vector of the cluster center of the first cluster is executed again, until the iteration stop condition is met.
It should be noted that, in order to increase the convergence rate of each classification, in the process of multiple iterations, the preset similarity ranges used in each iteration are different, and the values of the preset similarity ranges are gradually decreased as the number of iterations increases.
And when the iteration stop condition is met, taking the second cluster obtained by the last iteration as a classification after clustering, taking the words in the second cluster obtained by the last iteration as the words completing clustering, and then performing the iteration process on the words not completing clustering until all the words complete clustering.
The iteration stop condition here includes at least one of the following conditions: 1) The words in the second cluster no longer change; in this case, it is required that the preset similarity range also has a threshold value in the process of multiple iterations. 2) The iteration times reach a set time threshold value. 3) The number of words in the second cluster is not greater than a preset word number threshold.
In condition 1), the words in the second cluster no longer changing indicates that the best cluster has been formed, and the iteration may be stopped. In condition 2), to save computation, a maximum number of iterations may be set; if the number of iterations reaches the set threshold, the iteration cycle is stopped and the words included in the second cluster obtained in the last iteration are regarded as one group. In condition 3), if the number of words in the second cluster is not greater than the preset word-count threshold, then when the second hierarchical clustering subtree is subsequently constructed, the amount of computation is limited within a certain range and can meet the current constraints on computation.
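The following is a minimal sketch of this first clustering scheme. The word vectors, the initial radius, the shrinking schedule, and the word-count threshold are illustrative assumptions rather than values prescribed by the patent.

```python
# Sketch of the first clustering scheme (Fig. 3): repeatedly shrink a cluster
# around its centre until an iteration-stop condition is met, then start over
# on the words that are still unclustered.
import numpy as np

def cluster_method_1(vectors, radius=1.5, shrink=0.9, max_words=3, max_iters=20):
    groups, remaining = [], dict(vectors)              # word -> vector
    while remaining:
        first = dict(remaining)                        # (i) current first cluster
        for it in range(max_iters):
            centre = np.mean(list(first.values()), axis=0)   # (ii) cluster centre
            anchor = min(first, key=lambda w: np.linalg.norm(first[w] - centre))
            r = radius * (shrink ** it)                # preset range shrinks each round
            second = {w: v for w, v in first.items()
                      if np.linalg.norm(v - first[anchor]) <= r}
            if not second:                             # defensive guard: keep the anchor
                second = {anchor: first[anchor]}
            if second.keys() == first.keys() or len(second) <= max_words:
                first = second                         # iteration stop condition met
                break
            first = second                             # (iii) iterate on the second cluster
        groups.append(sorted(first))                   # the final second cluster is a group
        for w in first:                                # (iv) mark its words as clustered
            remaining.pop(w)
    return groups

rng = np.random.default_rng(1)
demo = {f"w{i}": rng.normal(size=8) for i in range(12)}
print(cluster_method_1(demo))
```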
And the second step is as follows: a second method for clustering words in a data set provided in the embodiment of the present application is shown in fig. 4, and includes:
randomly selecting K words from the data set as initial clustering centers according to a preset grouping number K;
for each initial cluster center, performing the following steps:
s401: taking the words with the similarity smaller than a first preset similarity with the clustering center and the clustering center as a first cluster, and calculating the cluster center of the first cluster, wherein the vector corresponding to the cluster center is the average value of the vectors corresponding to the words in the first cluster;
s402: determining words in a preset similarity range with the center by taking the word closest to the cluster center similarity of the first cluster as the center to form a second cluster;
s403: and taking the second cluster as a new first cluster, returning to the step S401 to execute calculation until an iteration stop condition is met, and taking the finally obtained second cluster as a clustered group.
In specific implementation, the number of the initial clustering centers can be specifically set according to actual needs; specifically, in order to limit the amount of computation required in constructing the hierarchical clustering subtree based on each grouping, the number of words in each grouping needs to be limited to a certain range, and the larger the number of words included in the data set, the larger the value of K.
For example, if the data set contains 1,000,000 words and the number of words in each group formed is required to be no greater than 10,000, the ratio of the number of words in the data set to the maximum number of words per group may be taken as the value of K; in this example, K is taken to be 100.
In addition, to leave some margin in the number of words per group, when setting the value of K, the sum of this ratio and a preset percentage of the ratio may also be used as K. For example, K may be determined as the ratio of the number of words in the data set to the maximum number of words per group plus 10% of that ratio, i.e., K = 100 + 100 × 10% = 110.
After K is determined, K initial clustering centers are selected from the vocabulary of the data set. Then, for each initial clustering center, calculating the similarity between each vocabulary except the initial clustering center and the initial clustering center in turn.
Here, the similarity between each word and the initial clustering center is similar to the method for determining the similarity between the word and the clustering center of the first cluster in the embodiment corresponding to fig. 3, and is not repeated herein.
For example, if there are 1,000,000 words in the data set and the value of K is determined to be 110, the 110 initial cluster centers determined from the one million words are X1 to X110. For X1, the distances between X1 and each of the 999,999 other words are calculated in turn. If the distance between a certain word and X1 is smaller than a first preset similarity threshold, the word is placed in the same cluster as X1, i.e., the first cluster. The mean of the vectors of all words in the first cluster is then taken as the cluster center. Next, taking the word closest to the cluster center as the center, the words within a preset similarity range of this center are determined to form a second cluster; the second cluster is taken as a new first cluster, and the step of calculating the cluster center of the first cluster is repeated until an iteration stop condition is met, at which point the finally obtained second cluster is taken as a clustered group.
Specifically, if a word selected as an initial clustering center has already been assigned to a certain group during the iteration process, the iteration process is not performed for that initial clustering center; instead, a word is chosen from the remaining unclustered words as a new initial clustering center and the above iteration process is performed for it. In this case, the number of groups finally obtained equals K. Alternatively, once an initial clustering center has been assigned to a certain group, the iteration process may simply be performed only for the other initial clustering centers; in this case, the final number of groups is less than K.
In this embodiment, the iteration stop condition is similar to that in the embodiment corresponding to fig. 3, and is not described again here.
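A minimal sketch of this second clustering scheme follows. Consistent with the worked example above, vector distance below a threshold is treated as "similar enough"; the thresholds, the value of K, and the word vectors are illustrative assumptions.

```python
# Sketch of the second clustering scheme (Fig. 4): K seed words are drawn at
# random and a cluster is grown and refined around each seed.
import numpy as np

def cluster_method_2(vectors, k=3, seed_radius=2.0, radius=1.5, max_iters=20, rng=None):
    rng = rng or np.random.default_rng(0)
    words = list(vectors)
    seeds = list(rng.choice(words, size=k, replace=False))   # K random initial centres
    groups, clustered = [], set()
    for s in seeds:
        if s in clustered:                 # seed already absorbed into an earlier group
            continue
        # (i) words close enough to the seed form the first cluster
        first = {w: vectors[w] for w in words
                 if w not in clustered
                 and np.linalg.norm(vectors[w] - vectors[s]) < seed_radius}
        first[s] = vectors[s]
        for _ in range(max_iters):
            centre = np.mean(list(first.values()), axis=0)
            anchor = min(first, key=lambda w: np.linalg.norm(first[w] - centre))
            # (ii) words within the preset range of the anchor form the second cluster
            second = {w: v for w, v in first.items()
                      if np.linalg.norm(v - first[anchor]) <= radius}
            if second.keys() == first.keys():          # cluster no longer changes
                break
            first = second                              # (iii) iterate on the second cluster
        groups.append(sorted(first))
        clustered.update(first)
    # Words still unclustered here would be handled by drawing further seeds,
    # as described in the text.
    return groups

rng = np.random.default_rng(2)
demo = {f"w{i}": rng.normal(size=8) for i in range(12)}
print(cluster_method_2(demo, k=3, rng=rng))
```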
And the third step: a third method for clustering words in a data set provided in the embodiment of the present application is shown in fig. 5, and includes:
s501: taking any one word in the words which are not clustered currently in the data set as a clustering center, and sequentially calculating the similarity between other words which are not clustered currently and the clustering center;
s502: according to the sequence from large to small, the similarity between other words which are not clustered currently and the clustering center obtains a preset number of words from other words which are not clustered currently and divides the words into the same group with the clustering center, and all the words in the group are used as the words which are clustered completely;
s503: and detecting whether the word of the unfinished cluster exists currently. If yes, the process jumps to S501, and if not, the process is ended.
S503: and returning to the step S501 for calculation until all the words in the data set are clustered.
In a specific implementation, the method for calculating the similarity between a currently unclustered word and the clustering center is similar to the method for calculating the similarity between a word and the cluster center of the first cluster in the embodiment corresponding to fig. 3, and details are not repeated here.
In this embodiment, the condition for constraining the clustering result is the number of words in each classification, so that the number of words included in each classification is limited within a certain range, so as to reduce the amount of computation required for hierarchical clustering in a group.
The third clustering method is simpler and more computationally efficient than the first and second clustering methods, but the accuracy is reduced compared to the two clustering methods.
Further, in addition to constraining each group by the number of words in each group, each group may be constrained by the similarity between words in each group.
For example, the above S502 may also be: in descending order of similarity between the other currently unclustered words and the clustering center, at most a preset number of words whose similarity with the center meets a preset similarity threshold are taken from the currently unclustered words and assigned to the same group as the clustering center, and all words in the group are regarded as having completed clustering.
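A minimal sketch of this third clustering scheme, including the optional distance cap just described, is given below; the group size, the distance cap, and the word vectors are illustrative assumptions.

```python
# Sketch of the third clustering scheme (Fig. 5): greedily peel off one group
# at a time around an arbitrary unclustered word.
import numpy as np

def cluster_method_3(vectors, group_size=4, max_distance=None):
    remaining = dict(vectors)
    groups = []
    while remaining:
        centre_word, centre_vec = next(iter(remaining.items()))  # (i) any unclustered word
        others = [(w, np.linalg.norm(v - centre_vec))
                  for w, v in remaining.items() if w != centre_word]
        others.sort(key=lambda t: t[1])          # closest (most similar) first
        picked = [w for w, d in others[:group_size - 1]
                  if max_distance is None or d <= max_distance]   # optional distance cap
        group = [centre_word] + picked           # (ii) the centre plus its nearest words
        groups.append(group)
        for w in group:                          # (iii) mark the whole group as clustered
            remaining.pop(w)
    return groups

rng = np.random.default_rng(3)
demo = {f"w{i}": rng.normal(size=8) for i in range(10)}
print(cluster_method_3(demo, group_size=4))
```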
And the fourth step: a fourth method for clustering words in a data set provided in the embodiment of the present application is shown in fig. 6, and includes:
s601: taking the data set as a set to be split;
s602: randomly selecting 2 words from the set to be split as an initial clustering center;
s603: respectively calculating the similarity between each word in the set to be split and the two initial clustering centers, grouping the words into a group with the clustering center with higher similarity, and splitting to obtain two intermediate groups;
s604: and detecting whether the number of words contained in the intermediate packet is greater than a preset word number threshold, if so, taking the intermediate packet as a set to be split, and returning to the step S602 to execute calculation. If not, S605 is executed.
S605: and taking the middle group of which the number of the included words is not more than a preset word number threshold value as a clustered group.
In a specific implementation, the similarity between each word in the set to be split and the two initial clustering centers is similar to the method for obtaining the similarity between the word and the clustering center of the first cluster in the embodiment corresponding to fig. 3, and details are not repeated here.
This embodiment divides the words in the data set into a plurality of groups in a recursive manner based on the similarity between each word in the set to be split and the two selected clustering centers, so that the words within each group are relatively similar to each other, the number of words in each group does not exceed the preset word-count threshold, and the amount of computation required for intra-group hierarchical clustering is reduced.
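A minimal sketch of this fourth, recursive bisecting scheme follows; the word-count threshold and the word vectors are illustrative assumptions.

```python
# Sketch of the fourth clustering scheme (Fig. 6): recursively bisect the word
# set around two randomly chosen seeds until every group is small enough.
import numpy as np

def cluster_method_4(vectors, max_words=4, rng=None):
    rng = rng or np.random.default_rng(0)
    groups = []

    def split(subset):
        if len(subset) <= max_words:                  # small enough: a final group
            groups.append(sorted(subset))
            return
        seeds = rng.choice(list(subset), size=2, replace=False)  # two random centres
        left, right = {}, {}
        for w, v in subset.items():
            d0 = np.linalg.norm(v - subset[seeds[0]])
            d1 = np.linalg.norm(v - subset[seeds[1]])
            (left if d0 <= d1 else right)[w] = v      # assign to the closer centre
        if not left or not right:                     # degenerate split: stop recursing
            groups.append(sorted(subset))
            return
        split(left)                                   # keep splitting oversized groups
        split(right)

    split(dict(vectors))
    return groups

rng = np.random.default_rng(4)
demo = {f"w{i}": rng.normal(size=8) for i in range(13)}
print(cluster_method_4(demo, max_words=4, rng=rng))
```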
After the data set is classified into a plurality of groups, intra-group hierarchical clustering is performed on each group, and inter-group hierarchical clustering is performed across all groups.
In a specific implementation, there is no required order between the intra-group hierarchical clustering and the inter-group hierarchical clustering.
S102: and performing interclass hierarchical clustering on the at least one group to obtain a first hierarchical clustering subtree, wherein each group is a leaf node of the first hierarchical clustering subtree.
In a specific implementation, when performing inter-group hierarchical clustering on the groups formed in S101, the words in each group are regarded as a whole, and hierarchical clustering is performed to generate a first-level clustering sub-tree.
Here, the average vector corresponding to each group may be obtained by taking an average of vectors corresponding to all words in each group, and the corresponding group may be characterized by the average vector. Then, when performing inter-group hierarchical clustering on each group, it is possible to perform the clustering based on the average vector corresponding to each group.
When performing inter-group hierarchical clustering on each group, the following method may be adopted:
each group is taken as a cluster, and the similarity between every two clusters is calculated.
Determining a plurality of cluster pairs according to the sequence of similarity from large to small; each cluster pair includes two clusters, and different cluster pairs include different clusters.
And merging the two clusters belonging to the same cluster pair together to form a new cluster, and executing the process of calculating the similarity between the two clusters aiming at the new cluster until all the clusters corresponding to the groups are merged together.
Wherein each group is a leaf node of the formed first-level clustering subtree; the root node of the first hierarchical clustering sub-tree includes all of the groups. Each cluster pair constitutes a node between a leaf node and a root node.
It should be noted that, after two clusters belonging to the same cluster pair are merged together to form a new cluster, an average vector of the two clusters is obtained according to average vectors corresponding to the two clusters in the merged cluster pair, and the average vector of the two clusters is used to characterize the new cluster formed by merging the two clusters together.
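The following is a minimal sketch of this inter-group procedure: each group is represented by its average vector, in every round the most similar disjoint cluster pairs are merged, and each merged cluster is represented by the average of the two merged vectors. The group names, the vectors, and the nested-tuple tree representation are illustrative assumptions.

```python
# Sketch of the inter-group hierarchical clustering (S102).
import numpy as np
from itertools import combinations

def agglomerate(items):
    """items: list of (node, vector) pairs. Returns the (root_node, root_vector)."""
    clusters = list(items)
    while len(clusters) > 1:
        # All index pairs, ordered so the most similar (closest) pair comes first.
        pairs = sorted(combinations(range(len(clusters)), 2),
                       key=lambda ij: np.linalg.norm(clusters[ij[0]][1]
                                                     - clusters[ij[1]][1]))
        used, merged = set(), []
        for i, j in pairs:                      # merge disjoint pairs, closest first
            if i in used or j in used:
                continue
            (ni, vi), (nj, vj) = clusters[i], clusters[j]
            merged.append(((ni, nj), (vi + vj) / 2))   # new cluster = averaged vector
            used.update((i, j))
        leftover = [c for k, c in enumerate(clusters) if k not in used]
        clusters = merged + leftover
    return clusters[0]

# Inter-group clustering: one average vector per group (illustrative values).
rng = np.random.default_rng(5)
group_vectors = {g: rng.normal(size=8) for g in "ABCDEFGHIJ"}
first_level_subtree, _ = agglomerate(list(group_vectors.items()))
print(first_level_subtree)
```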
S103: and carrying out intra-group hierarchical clustering on each group to obtain a second hierarchical clustering subtree corresponding to each group, wherein words included in each group are leaf nodes of the second hierarchical clustering subtree corresponding to the group.
In a specific implementation, hierarchical clustering is performed on each group, namely hierarchical clustering is performed on words included in each group, and each word in the group is a single individual.
When performing intra-group hierarchical clustering on each group, the following method may be adopted: each word is taken as a cluster, and the similarity between every two clusters is calculated. Determining a plurality of cluster pairs according to the sequence of similarity from large to small; each cluster pair includes two clusters, and different cluster pairs include different clusters. Combining two clusters belonging to the same cluster pair together to form a new cluster, and executing the process for calculating the similarity between the two clusters aiming at the new cluster until all clusters corresponding to the groups are combined together.
Each word is a leaf node of the formed second-level clustering subtree; the root node of the second hierarchical clustering sub-tree includes all the words in all the corresponding groupings. Each cluster pair constitutes a node between a leaf node and a root node.
It should be noted here that, after two clusters belonging to the same cluster pair are merged to form a new cluster, the average of the vectors corresponding to the two merged clusters is taken as the vector of the new cluster and is used to characterize the new cluster.
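For the intra-group step, the same kind of agglomerative procedure is applied to the word vectors within one group. The sketch below illustrates it with SciPy's standard average-linkage routine, which merges one closest pair at a time rather than several disjoint pairs per round, but yields an analogous second hierarchical clustering subtree; the group's words and vectors are illustrative placeholders.

```python
# Sketch of the intra-group hierarchical clustering (S103) for a single group.
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

rng = np.random.default_rng(6)
group_words = [f"A{i}" for i in range(1, 11)]        # the words A1..A10 of group A
word_vectors = rng.normal(size=(len(group_words), 8))

Z = linkage(word_vectors, method="average")          # agglomerative, closest pairs first
second_level_subtree = to_tree(Z)                    # leaves correspond to the words

def render(node):
    """Render the cluster tree as nested tuples with word labels at the leaves."""
    if node.is_leaf():
        return group_words[node.id]
    return (render(node.get_left()), render(node.get_right()))

print(render(second_level_subtree))
```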
S104: and constructing the semantic hierarchical tree according to the first hierarchical clustering subtree and the second hierarchical clustering subtree.
In a specific implementation, the first-level clustering subtree and the second-level clustering subtree are both part of a semantic hierarchy tree to be constructed. And the hierarchy of the first-level clustering subtree is higher than that of the second-level clustering subtree.
Since each leaf node of the first hierarchical clustering subtree is a group, i.e., contains all the words in the corresponding group, and the root node of each second hierarchical clustering subtree likewise contains all the words in the corresponding group, the first hierarchical clustering subtree and the second hierarchical clustering subtrees can be connected at exactly these nodes, that is: the root node of the first hierarchical clustering subtree is taken as the root node of the semantic hierarchy tree, the root node of each second hierarchical clustering subtree is taken as the corresponding leaf node of the first hierarchical clustering subtree, and the first hierarchical clustering subtree and the second hierarchical clustering subtrees are connected to generate the semantic hierarchy tree.
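A minimal sketch of this connection step follows, using nested tuples as a toy tree representation: each leaf of the first hierarchical clustering subtree (a group name) is replaced by the root of that group's second hierarchical clustering subtree. The group names and word lists are illustrative.

```python
# Sketch of S104: build the semantic hierarchy tree by replacing each leaf of
# the first-level subtree with the root of that group's second-level subtree.
first_level_subtree = ((("A", "B"), "C"), ("D", "E"))      # leaves are group names
second_level_subtrees = {                                   # one subtree per group
    "A": ("A1", ("A2", "A3")),
    "B": ("B1", "B2"),
    "C": (("C1", "C2"), "C3"),
    "D": ("D1", "D2"),
    "E": ("E1", ("E2", "E3")),
}

def attach(node):
    if isinstance(node, str):                 # leaf of the first-level subtree: a group
        return second_level_subtrees[node]    # replace it with the group's subtree root
    left, right = node
    return (attach(left), attach(right))

semantic_tree = attach(first_level_subtree)
print(semantic_tree)
```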
The embodiment of the present application further provides an example to illustrate the above process. It should be noted that the size of the data set used in this example is chosen only for clarity of description and does not represent the size of the data set in an actual implementation.
The data set comprises 100 words. The data set is classified according to the similarity between the words to obtain 10 groups A-J, whose words are respectively: A1-A10, B1-B10, C1-C10, ..., J1-J10.
When the intra-group hierarchical clustering is performed on the class a, a second hierarchical clustering sub-tree is obtained as shown in fig. 7. When the interclass hierarchical clustering is performed on the A-J, the obtained first hierarchical clustering subtree is shown in fig. 8, and the first hierarchical clustering subtree and the second hierarchical clustering subtree are connected together to form a hierarchical clustering tree shown in fig. 9.
In this embodiment of the present application, if the hierarchical clustering tree were constructed in the conventional manner, then assuming the vocabulary contains f words and similarity must be calculated between every two of them, the amount of computation is:

$$\frac{f(f-1)}{2}$$

When the semantic hierarchy tree is constructed by the method provided in the embodiment of the present application, assuming the data set is divided into 100 groups, the number of words in each group is $f/100$, and the amount of computation satisfies:

$$\frac{100(100-1)}{2} + 100\cdot\frac{\frac{f}{100}\left(\frac{f}{100}-1\right)}{2} \ll \frac{f(f-1)}{2}$$
it can be seen that when f reaches a certain magnitude, the number of times of similarity matching is greatly reduced, so that the speed of constructing the hierarchical clustering tree can be increased, the calculation amount required for constructing the hierarchical clustering tree is reduced, the calculation complexity is reduced, and the requirement of quickly constructing the semantic hierarchical tree on the basis of a million-level data set is met.
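As a quick arithmetic check of the comparison above, assuming f = 1,000,000 words split into 100 groups of f/100 words each and using the reconstructed counts given earlier:

```python
# Quick check of the similarity-matching counts for f = 1,000,000 and 100 groups.
f, k = 1_000_000, 100

flat = f * (f - 1) // 2                                            # all-pairs similarity
grouped = k * (k - 1) // 2 + k * ((f // k) * (f // k - 1) // 2)    # inter- + intra-group

print(f"{flat:,} vs {grouped:,}  (ratio ≈ {flat / grouped:.0f}x)")
```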
The method comprises the steps of classifying a data set to obtain at least one group, wherein each group comprises at least one word; performing interclass hierarchical clustering on the at least one group to obtain a first hierarchical clustering subtree, wherein each group is a leaf node of the first hierarchical clustering subtree; performing intra-group hierarchical clustering on each group to obtain a second hierarchical clustering subtree corresponding to each group, wherein words included in each group are leaf nodes of the second hierarchical clustering subtree corresponding to the group; and constructing the semantic hierarchical tree according to the first hierarchical clustering subtree and the second hierarchical clustering subtree, so that the speed of constructing the hierarchical clustering tree can be increased, the calculation amount required for constructing the hierarchical clustering tree is reduced, and the calculation complexity is reduced, thereby meeting the requirement of quickly constructing the semantic hierarchical tree on the basis of a million-level data set.
Based on the same inventive concept, the embodiment of the present application further provides a semantic hierarchical tree construction device corresponding to the semantic hierarchical tree construction method, and because the principle of solving the problem of the device in the embodiment of the present application is similar to that of the semantic hierarchical tree construction method in the embodiment of the present application, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 10, a semantic hierarchy tree constructing apparatus provided in the embodiment of the present application specifically includes:
a grouping module 10, configured to classify a data set to obtain at least one group, where each of the at least one group includes at least one word;
the inter-group hierarchical clustering module 20 is configured to perform inter-group hierarchical clustering on the at least one group to obtain a first hierarchical clustering sub-tree, where each group is a leaf node of the first hierarchical clustering sub-tree;
the intra-group hierarchical clustering module 30 is configured to perform intra-group hierarchical clustering on each group to obtain a second hierarchical clustering sub-tree corresponding to each group, where words included in each group are leaf nodes of the second hierarchical clustering sub-tree corresponding to the group;
and the semantic hierarchical clustering tree constructing module 40 is configured to construct the semantic hierarchical tree according to the first hierarchical clustering subtree and the second hierarchical clustering subtree.
The method comprises the steps of classifying a data set to obtain at least one group, wherein each group comprises at least one word; performing interclass hierarchical clustering on the at least one group to obtain a first hierarchical clustering subtree, wherein each group is a leaf node of the first hierarchical clustering subtree; performing intra-group hierarchical clustering on each group to obtain a second hierarchical clustering subtree corresponding to each group, wherein words included in each group are leaf nodes of the second hierarchical clustering subtree corresponding to the group; and constructing the semantic hierarchical tree according to the first hierarchical clustering subtree and the second hierarchical clustering subtree, so that the speed of constructing the hierarchical clustering tree can be increased, the calculation amount required for constructing the hierarchical clustering tree is reduced, and the calculation complexity is reduced, thereby meeting the requirement of quickly constructing the semantic hierarchical tree on the basis of a million-level data set.
Optionally, the grouping module 10 is specifically configured to classify the data set to obtain at least one group through the following steps:
and clustering the words in the data set according to the similarity among the words in the data set to obtain the at least one group.
Optionally, the grouping module 10 is specifically configured to cluster the words in the data set according to the similarity between the words in the data set by the following steps to obtain the at least one grouping:
(i) Taking all words in the data set as a first cluster;
(ii) Determining a cluster center of a first cluster, wherein a vector corresponding to the cluster center is an average value of vectors corresponding to all words in the first cluster, and determining words in a preset similarity range with a word closest to the cluster center of the first cluster as a center to form a second cluster;
(iii) Taking the second cluster as the first cluster and returning to step (ii) for calculation until an iteration stop condition is met; taking the finally obtained second cluster as a clustered group and the words in the second cluster as words that have completed clustering;
(iv) Taking all words in the data set that have not completed clustering as a new first cluster and returning to step (ii) for calculation, until all words in the data set are clustered, to obtain a plurality of clustered groups.
Optionally, the grouping module 10 is specifically configured to cluster the words in the data set according to the similarity between the words in the data set by the following steps to obtain the at least one group: randomly selecting K words from the data set as initial clustering centers according to a preset grouping number K;
for each initial cluster center, performing the following steps:
(i) Taking the words with the similarity smaller than a first preset similarity with the clustering center and the clustering center as a first cluster, and calculating the cluster center of the first cluster, wherein the vector corresponding to the cluster center is the average value of the vectors corresponding to the words in the first cluster;
(ii) Determining words in a preset similarity range with the center by taking the word closest to the cluster center similarity of the first cluster as the center to form a second cluster;
(iii) Taking the second cluster as a new first cluster and returning to step (i) for calculation until an iteration stop condition is met; taking the finally obtained second cluster as a clustered group.
Optionally, the grouping module 10 is specifically configured to cluster the words in the data set according to the similarity between the words in the data set by the following steps to obtain the at least one grouping:
clustering the words in the data set according to the similarity among the words in the data set, specifically comprising:
(i) Taking any one word in the words which are not clustered currently in the data set as a clustering center, and sequentially calculating the similarity between other words which are not clustered currently and the clustering center;
(ii) In descending order of similarity to the clustering center, taking a preset number of words from the other currently unclustered words and assigning them to the same group as the clustering center, and regarding all words in the group as having completed clustering;
(iii) Returning to step (i) for calculation until all words in the data set are clustered.
Optionally, the grouping module 10 is specifically configured to cluster the words in the data set according to the similarity between them by the following steps, so as to obtain the at least one group:
(i) Taking the data set as a set to be split; (ii) Randomly selecting 2 words from the set to be split as initial clustering centers;
(iii) Calculating the similarity between each word in the set to be split and each of the two initial clustering centers, assigning each word to the group of the clustering center with the higher similarity, and thereby splitting the set into two intermediate groups;
(iv) If the number of words contained in an intermediate group is greater than a preset word-count threshold, taking that intermediate group as a new set to be split and returning to step (ii), until the number of words contained in every intermediate group is not greater than the preset word-count threshold, and taking each intermediate group whose number of words is not greater than the threshold as a clustered group (a code sketch of this variant follows).
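A minimal sketch of the bisecting variant is given below, again assuming the `vectors` dict and cosine similarity, with a hypothetical `max_words` threshold standing in for the preset word-count limit:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def bisecting_grouping(vectors, max_words=200, seed=0):
    rng = np.random.default_rng(seed)
    groups, stack = [], [list(vectors)]
    while stack:
        current = stack.pop()
        if len(current) <= max_words:
            groups.append(current)          # small enough: final clustered group
            continue
        # steps (i)-(ii): pick two random seed words from the set to be split
        i, j = rng.choice(len(current), size=2, replace=False)
        seed_a, seed_b = current[i], current[j]
        # step (iii): assign each word to the more similar seed
        part_a, part_b = [], []
        for w in current:
            sim_a = cosine(vectors[w], vectors[seed_a])
            sim_b = cosine(vectors[w], vectors[seed_b])
            (part_a if sim_a >= sim_b else part_b).append(w)
        if not part_a or not part_b:        # degenerate split; keep as one group
            groups.append(current)
        else:
            stack.extend([part_a, part_b])  # step (iv): split again if still too large
    return groups
```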
Optionally, the iteration stop condition comprises one or more of the following: the words in the second cluster no longer change, the number of words in the second cluster is not greater than a preset word-count threshold, and the number of iterations reaches a preset count threshold.
Optionally, the inter-group hierarchical clustering module 20 is specifically configured to perform inter-group hierarchical clustering on the at least one group to obtain the first hierarchical clustering subtree by the following steps (a code sketch follows the steps):
calculating the average of the vectors corresponding to all words in each group to obtain the average vector corresponding to that group; and
performing inter-group hierarchical clustering on the groups according to the average vector corresponding to each group.
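One way to realize this step is sketched below: the group average vectors are computed, and the groups are then merged bottom-up with a naive average-linkage agglomeration. The nested-tuple tree representation, the `agglomerate` and `group_average_vectors` names, and the brute-force pairwise search are illustrative choices, not details taken from the application.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def group_average_vectors(groups, vectors):
    # Average of the word vectors in each group, one vector per group.
    return [np.mean([vectors[w] for w in g], axis=0) for g in groups]

def agglomerate(items, item_vectors):
    # Naive agglomeration: repeatedly merge the two most similar nodes.
    # Returns a nested-tuple tree whose leaves are the original items.
    nodes = list(items)
    vecs = [np.asarray(v, dtype=float) for v in item_vectors]
    sizes = [1] * len(nodes)
    while len(nodes) > 1:
        best, best_sim = None, -2.0
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                s = cosine(vecs[i], vecs[j])
                if s > best_sim:
                    best, best_sim = (i, j), s
        i, j = best
        merged_size = sizes[i] + sizes[j]
        merged_vec = (vecs[i] * sizes[i] + vecs[j] * sizes[j]) / merged_size
        merged_node = (nodes[i], nodes[j])
        for idx in (j, i):                  # j > i, so delete the larger index first
            del nodes[idx], vecs[idx], sizes[idx]
        nodes.append(merged_node)
        vecs.append(merged_vec)
        sizes.append(merged_size)
    return nodes[0]
```

With these helpers, `agglomerate(list(range(len(groups))), group_average_vectors(groups, vectors))` would yield a first hierarchical clustering subtree whose leaves are the group indices.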
Optionally, the semantic hierarchical clustering tree constructing module 40 is specifically configured to construct the semantic hierarchy tree according to the first hierarchical clustering subtree and the second hierarchical clustering subtrees by the following step (a code sketch follows):
connecting the first hierarchical clustering subtree and the second hierarchical clustering subtrees by taking the root node of the first hierarchical clustering subtree as the root node of the semantic hierarchy tree and taking the root node of each second hierarchical clustering subtree as the corresponding leaf node of the first hierarchical clustering subtree, so as to generate the semantic hierarchy tree.
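Under the same illustrative tree representation, the final assembly might look like the sketch below. It reuses the hypothetical `agglomerate` and `group_average_vectors` helpers from the previous sketch, together with `groups` and `vectors` from the grouping sketches; splicing each intra-group subtree into the leaf position of its group is the only new step.

```python
def build_semantic_tree(first_tree, second_trees):
    # Replace each leaf of the inter-group tree (a group index) with the
    # root of that group's intra-group subtree.
    if isinstance(first_tree, tuple):
        return tuple(build_semantic_tree(child, second_trees) for child in first_tree)
    return second_trees[first_tree]

# Illustrative wiring, assuming `groups` and `vectors` from the earlier sketches:
# group_vecs    = group_average_vectors(groups, vectors)
# first_tree    = agglomerate(list(range(len(groups))), group_vecs)
# second_trees  = [agglomerate(g, [vectors[w] for w in g]) for g in groups]
# semantic_tree = build_semantic_tree(first_tree, second_trees)
```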
In this embodiment, specific functions and interaction manners of the grouping module 10, the inter-group hierarchical clustering module 20, the intra-group hierarchical clustering module 30, and the semantic hierarchical clustering tree constructing module 40 may refer to the descriptions of the embodiments corresponding to fig. 1 to 8, and are not described herein again.
For the semantic hierarchy tree construction method in fig. 1, an embodiment of the present application further provides a computer device, as shown in fig. 11. The computer device includes a memory 1000, a processor 2000, and a computer program stored in the memory 1000 and executable on the processor 2000, where the processor 2000 implements the steps of the semantic hierarchy tree construction method when executing the computer program.
Specifically, the memory 1000 and the processor 2000 may be a general-purpose memory and a general-purpose processor, which are not limited herein. When the processor 2000 runs the computer program stored in the memory 1000, the semantic hierarchy tree construction method can be executed. This solves the problem that, because constructing a semantic hierarchy tree conventionally requires measuring a pairwise relationship matrix over the data, the computational complexity cannot meet practical requirements for a million-scale data set; the speed of constructing the hierarchical tree is thereby increased, and the amount of computation and the computational complexity required for constructing the hierarchical tree are reduced, so that a semantic hierarchy tree can be constructed quickly on the basis of a million-scale data set.
Corresponding to the semantic hierarchy tree construction method in fig. 1, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the semantic hierarchy tree construction method.
Specifically, the storage medium may be a general-purpose storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is run, the semantic hierarchy tree construction method can be executed. This solves the problem that, because constructing a semantic hierarchy tree conventionally requires measuring a pairwise relationship matrix over the data, the computational complexity cannot meet practical requirements for a million-scale data set; the speed of constructing the hierarchical clustering tree is thereby increased, and the amount of computation and the computational complexity required for constructing the hierarchical clustering tree are reduced, so that a semantic hierarchy tree can be constructed quickly on the basis of a million-scale data set.
The computer program product of the semantic hierarchy tree construction method and apparatus provided in the embodiments of the present application includes a computer-readable storage medium storing program code, and the instructions included in the program code may be used to execute the method described in the foregoing method embodiments. For the specific implementation, reference may be made to the method embodiments, which are not described herein again.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working process of the system and the apparatus described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A semantic hierarchy tree construction method is characterized by comprising the following steps:
classifying the data set to obtain at least one group, wherein each group comprises at least one word;
performing inter-group hierarchical clustering on the at least one group to obtain a first hierarchical clustering subtree, wherein each group is a leaf node of the first hierarchical clustering subtree; and
performing intra-group hierarchical clustering on each group to obtain a second hierarchical clustering subtree corresponding to each group, wherein words included in each group are leaf nodes of the second hierarchical clustering subtree corresponding to the group;
constructing the semantic hierarchical tree according to the first hierarchical clustering subtree and the second hierarchical clustering subtree;
the classifying of the data set to obtain at least one group specifically includes:
(i) Taking all words in the data set as a first cluster;
(ii) Determining the cluster center of the first cluster, wherein the vector corresponding to the cluster center is the average of the vectors corresponding to all words in the first cluster, and, taking the word closest to the cluster center as a new center, determining the words within a preset similarity range of that center to form a second cluster;
(iii) Taking the second cluster as the first cluster and returning to step (ii) until an iteration stop condition is met, taking the finally obtained second cluster as a clustered group, and taking the words in the second cluster as words that have completed clustering;
(iv) Taking all words in the data set that have not yet been clustered as a first cluster and returning to step (ii), until all words in the data set have been clustered, thereby obtaining a plurality of clustered groups.
2. The method according to claim 1, wherein clustering the words in the data set according to the similarity between the words in the data set specifically comprises:
randomly selecting K words from the data set as initial clustering centers according to a preset grouping number K;
for each initial cluster center, performing the following steps:
(i) Taking, as a first cluster, the clustering center together with the words whose similarity to the clustering center is smaller than a first preset similarity, and calculating the cluster center of the first cluster, wherein the vector corresponding to the cluster center is the average of the vectors corresponding to the words in the first cluster;
(ii) Taking the word most similar to the cluster center of the first cluster as a new center, and determining the words within a preset similarity range of that center to form a second cluster;
(iii) Taking the second cluster as a new first cluster and returning to step (i) until an iteration stop condition is met, and taking the finally obtained second cluster as a clustered group.
3. The method according to claim 1, wherein clustering the words in the data set according to the similarity between the words in the data set specifically comprises:
(i) Taking any word among the words in the data set that have not yet been clustered as a clustering center, and calculating in turn the similarity between each of the other unclustered words and the clustering center;
(ii) Selecting, in descending order of similarity to the clustering center, a preset number of words from the other unclustered words, placing them in the same group as the clustering center, and taking all words in the group as words that have completed clustering;
(iii) Returning to step (i) until all words in the data set have been clustered.
4. The method according to claim 1, wherein clustering the words in the data set according to the similarity between the words in the data set specifically comprises:
(i) Taking the data set as a set to be split; (ii) Randomly selecting 2 words from the set to be split as initial clustering centers;
(iii) Calculating the similarity between each word in the set to be split and each of the two initial clustering centers, assigning each word to the group of the clustering center with the higher similarity, and thereby splitting the set into two intermediate groups;
(iv) If the number of words contained in an intermediate group is greater than a preset word-count threshold, taking that intermediate group as a new set to be split and returning to step (ii), until the number of words contained in every intermediate group is not greater than the preset word-count threshold, and taking each intermediate group whose number of words is not greater than the preset word-count threshold as a clustered group.
5. The method according to claim 1 or 2, wherein the iteration stop condition comprises one or more of the following: the words in the second cluster no longer change, the number of words in the second cluster is not greater than a preset word-count threshold, and the number of iterations reaches a preset count threshold.
6. The method according to claim 1, wherein the inter-group hierarchical clustering of the at least one group to obtain a first hierarchical clustering sub-tree specifically comprises:
calculating the average value of the vectors corresponding to all the words in each group to obtain the average vector corresponding to the group;
and performing inter-group hierarchical clustering on each group according to the average vector corresponding to each group.
7. The method according to claim 1, wherein the constructing the semantic hierarchy tree according to the first hierarchical clustering sub-tree and the second hierarchical clustering sub-tree comprises:
and connecting the first hierarchical clustering subtree and the second hierarchical clustering subtrees by taking the root node of the first hierarchical clustering subtree as the root node of the semantic hierarchy tree and taking the root node of each second hierarchical clustering subtree as the corresponding leaf node of the first hierarchical clustering subtree, so as to generate the semantic hierarchy tree.
8. A semantic hierarchy tree construction apparatus, comprising:
the grouping module is used for classifying the data set to obtain at least one group, and each group comprises at least one word;
the system comprises a group hierarchical clustering module, a group hierarchical clustering module and a group hierarchical clustering module, wherein the group hierarchical clustering module is used for performing group hierarchical clustering on at least one group to obtain a first hierarchical clustering subtree, and each group is a leaf node of the first hierarchical clustering subtree;
the intra-group hierarchical clustering module is used for performing intra-group hierarchical clustering on each group to obtain a second hierarchical clustering sub-tree corresponding to each group, wherein words included in each group are leaf nodes of the second hierarchical clustering sub-tree corresponding to the group;
the semantic hierarchical clustering tree constructing module is used for constructing the semantic hierarchical tree according to the first hierarchical clustering subtree and the second hierarchical clustering subtree;
the grouping module is configured to classify the data set, and when at least one group is obtained, the grouping module is specifically configured to:
(i) Taking all words in the data set as a first cluster;
(ii) Determining the cluster center of the first cluster, wherein the vector corresponding to the cluster center is the average of the vectors corresponding to all words in the first cluster, and, taking the word closest to the cluster center as a new center, determining the words within a preset similarity range of that center to form a second cluster;
(iii) Taking the second cluster as the first cluster and returning to step (ii) until an iteration stop condition is met, taking the finally obtained second cluster as a clustered group, and taking the words in the second cluster as words that have completed clustering;
(iv) Taking all words in the data set that have not yet been clustered as a first cluster and returning to step (ii), until all words in the data set have been clustered, thereby obtaining a plurality of clustered groups.
CN201810836275.7A 2018-07-26 2018-07-26 Semantic hierarchical tree construction method and device Active CN109033084B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810836275.7A CN109033084B (en) 2018-07-26 2018-07-26 Semantic hierarchical tree construction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810836275.7A CN109033084B (en) 2018-07-26 2018-07-26 Semantic hierarchical tree construction method and device

Publications (2)

Publication Number Publication Date
CN109033084A CN109033084A (en) 2018-12-18
CN109033084B true CN109033084B (en) 2022-10-28

Family

ID=64646735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810836275.7A Active CN109033084B (en) 2018-07-26 2018-07-26 Semantic hierarchical tree construction method and device

Country Status (1)

Country Link
CN (1) CN109033084B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11397859B2 (en) * 2019-09-11 2022-07-26 International Business Machines Corporation Progressive collocation for real-time discourse
CN110942783B (en) * 2019-10-15 2022-06-17 国家计算机网络与信息安全管理中心 Group call type crank call classification method based on audio multistage clustering

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011227688A (en) * 2010-04-20 2011-11-10 Univ Of Tokyo Method and device for extracting relation between two entities in text corpus
CN104199853A (en) * 2014-08-12 2014-12-10 南京信息工程大学 Clustering method
CN106203507A (en) * 2016-07-11 2016-12-07 上海凌科智能科技有限公司 A kind of k means clustering method improved based on Distributed Computing Platform
CN106649783A (en) * 2016-12-28 2017-05-10 上海智臻智能网络科技股份有限公司 Synonym mining method and apparatus
CN107291847A (en) * 2017-06-02 2017-10-24 东北大学 A kind of large-scale data Distributed Cluster processing method based on MapReduce

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8542763B2 (en) * 2004-04-02 2013-09-24 Rearden, Llc Systems and methods to coordinate transmissions in distributed wireless systems via user clustering
CN101281530A (en) * 2008-05-20 2008-10-08 上海大学 Key word hierarchy clustering method based on conception deriving tree
CN101866337B (en) * 2009-04-14 2014-07-02 日电(中国)有限公司 Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model
US20140258771A1 (en) * 2013-03-06 2014-09-11 Fortinet, Inc. High-availability cluster architecture and protocol
CN105808581B (en) * 2014-12-30 2020-05-01 Tcl集团股份有限公司 Data clustering method and device and Spark big data platform
CN106815310B (en) * 2016-12-20 2020-04-21 华南师范大学 Hierarchical clustering method and system for massive document sets
CN107180422A (en) * 2017-04-02 2017-09-19 南京汇川图像视觉技术有限公司 A kind of labeling damage testing method based on bag of words feature
CN107609102A (en) * 2017-09-12 2018-01-19 电子科技大学 A kind of short text on-line talking method


Also Published As

Publication number Publication date
CN109033084A (en) 2018-12-18

Similar Documents

Publication Publication Date Title
US20220050967A1 (en) Extracting definitions from documents utilizing definition-labeling-dependent machine learning background
US9633002B1 (en) Systems and methods for coreference resolution using selective feature activation
CN111368535A (en) Sensitive word recognition method, device and equipment
CN111191466B (en) Homonymous author disambiguation method based on network characterization and semantic characterization
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
CN110377725B (en) Data generation method and device, computer equipment and storage medium
WO2019212006A1 (en) Phenomenon prediction device, prediction model generation device, and phenomenon prediction program
US20220138534A1 (en) Extracting entity relationships from digital documents utilizing multi-view neural networks
Grzegorczyk Vector representations of text data in deep learning
WO2021169453A1 (en) Text processing method and apparatus
Niu et al. Scaling inference for markov logic via dual decomposition
CN109033084B (en) Semantic hierarchical tree construction method and device
Vidyadhari et al. Particle grey wolf optimizer (pgwo) algorithm and semantic word processing for automatic text clustering
Sharma et al. Using self-organizing maps for sentiment analysis
CN115357715A (en) Short text clustering method based on singular value decomposition and field pre-training
Shah et al. A hybrid approach of text summarization using latent semantic analysis and deep learning
CN113297854A (en) Method, device and equipment for mapping text to knowledge graph entity and storage medium
Jin et al. A general framework for efficient clustering of large datasets based on activity detection
CN112835798A (en) Cluster learning method, test step clustering method and related device
Yuan et al. SSF: sentence similar function based on Word2vector similar elements
Arif et al. Word sense disambiguation for Urdu text by machine learning
Kumari et al. Performance of Optimizers in Text Summarization for News Articles
CN110968668A (en) Method and device for calculating similarity of network public sentiment subjects based on hyper-network
Wei An iterative approach to keywords extraction
Hernández et al. Evaluation of deep learning models for sentiment analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 101-8, 1st floor, building 31, area 1, 188 South Fourth Ring Road West, Fengtai District, Beijing

Applicant after: Guoxin Youyi Data Co.,Ltd.

Address before: 100070, No. 188, building 31, headquarters square, South Fourth Ring Road West, Fengtai District, Beijing

Applicant before: SIC YOUE DATA Co.,Ltd.

GR01 Patent grant