CN109033084A - Semantic hierarchy tree construction method and device - Google Patents

Semantic hierarchy tree construction method and device

Info

Publication number
CN109033084A
Authority
CN
China
Prior art keywords
cluster
word
grouping
hierarchical clustering
subtree
Prior art date
Legal status
Granted
Application number
CN201810836275.7A
Other languages
Chinese (zh)
Other versions
CN109033084B (en)
Inventor
蔡世清
郑凯
段立新
江建军
夏虎
Current Assignee
Guoxin Youe Data Co Ltd
Original Assignee
Guoxin Youe Data Co Ltd
Priority date
Filing date
Publication date
Application filed by Guoxin Youe Data Co Ltd
Priority to CN201810836275.7A
Publication of CN109033084A
Application granted
Publication of CN109033084B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis


Abstract

This application provides a semantic hierarchy tree construction method and device, wherein the method comprises: classifying a data set to obtain at least one grouping, each of the at least one grouping containing at least one word; performing inter-group hierarchical clustering on the at least one grouping to obtain a first hierarchical clustering subtree, wherein each grouping is a leaf node of the first hierarchical clustering subtree; performing intra-group hierarchical clustering on each grouping to obtain a second hierarchical clustering subtree corresponding to each grouping, wherein the words contained in each grouping are the leaf nodes of the second hierarchical clustering subtree corresponding to that grouping; and constructing the semantic hierarchy tree from the first hierarchical clustering subtree and the second hierarchical clustering subtrees. The embodiments of this application can rapidly construct a semantic hierarchy tree from a large-scale data set.

Description

Semantic hierarchy tree construction method and device
Technical field
This application relates to the field of natural language processing, and in particular to a semantic hierarchy tree construction method and device.
Background technique
In the field of natural language processing, existing language models often rely on machine learning algorithms. The essence of machine learning is prediction: a machine learning model is trained on a large training data set to obtain a natural language processing model, after which pending data can be input into the trained natural language processing model to obtain the prediction result corresponding to the pending data.
When executing language processing tasks, most natural language processing models must select the option with the highest predicted probability from a vocabulary or entity set on the order of millions of items. For example, a machine translation model must predict, at each time step, the meaning of a word in the target context to be translated; an entity recognition model must predict the entity referred to by a text fragment, i.e., the entity category. Because the option with the highest predicted probability must be selected from a million-scale vocabulary or entity set, the language processing tasks executed by the natural language processing model require ultra-large matrix operations at the output layer, which consumes enormous computational resources and poorly supports scenarios with strict real-time requirements.
To solve this problem, the approach currently taken is to replace the hidden-layer-to-output-layer mapping of the original natural language processing model, computed as a single matrix operation, with a step-by-step traversal of a Huffman coding tree. In this way, predicting each word only requires a small number of binary logistic regressions to reach a leaf node of the Huffman coding tree and obtain the final prediction result. However, the Huffman coding tree is built from word frequencies and cannot represent the relationships between words; two closely related words may be assigned to entirely different branches. As a result, when the word prediction deviates, highly implausible results can be produced.
If a semantic hierarchy tree is used in place of the Huffman coding tree in the natural language processing model, a problem arises: constructing a semantic hierarchy tree requires a pairwise relation matrix over the data, and for large-scale data sets (e.g., on the order of millions of items) the computational complexity cannot meet practical requirements.
Summary of the invention
In view of this, embodiments of this application aim to provide a semantic hierarchy tree construction method and device capable of rapidly constructing a semantic hierarchy tree from a large-scale data set.
In a first aspect, an embodiment of this application provides a semantic hierarchy tree construction method, comprising:
classifying a data set to obtain at least one grouping, each of the at least one grouping containing at least one word;
performing inter-group hierarchical clustering on the at least one grouping to obtain a first hierarchical clustering subtree, wherein each grouping is a leaf node of the first hierarchical clustering subtree; and
performing intra-group hierarchical clustering on each grouping to obtain a second hierarchical clustering subtree corresponding to each grouping, wherein the words contained in each grouping are the leaf nodes of the second hierarchical clustering subtree corresponding to that grouping;
constructing the semantic hierarchy tree from the first hierarchical clustering subtree and the second hierarchical clustering subtrees.
Optionally, classifying the data set to obtain at least one grouping specifically comprises:
clustering the words in the data set according to the similarity between words in the data set, to obtain the at least one grouping.
Optionally, clustering the words in the data set according to the similarity between words in the data set to obtain the at least one grouping specifically comprises:
(i) taking all words in the data set as a first cluster;
(ii) determining the cluster center of the first cluster, the vector corresponding to the cluster center being the average of the vectors corresponding to the words in the first cluster; taking the word closest in similarity to the cluster center of the first cluster as the center, determining the words within a preset similarity range of that center, and forming a second cluster;
(iii) taking the second cluster as the first cluster and returning to step (ii) for calculation until an iteration stop condition is met; taking the finally obtained second cluster as one grouping after clustering, and marking the words in the second cluster as words that have completed clustering;
(iv) taking all words in the data set that have not completed clustering as the first cluster and returning to step (ii) for calculation, until all words in the data set have completed clustering, to obtain multiple groupings after clustering.
Optionally, clustering the words in the data set according to the similarity between words in the data set specifically comprises:
randomly selecting, according to a preset grouping number K, K words from the data set as initial cluster centers;
for each initial cluster center, executing the following steps:
(i) taking the words whose similarity to the cluster center is less than a first preset similarity, together with the cluster center, as a first cluster, and calculating the cluster center of the first cluster, the vector corresponding to the cluster center being the average of the vectors corresponding to the words in the first cluster;
(ii) taking the word closest in similarity to the cluster center of the first cluster as the center, determining the words within a preset similarity range of that center, and forming a second cluster;
(iii) taking the second cluster as the new first cluster and returning to step (i) for calculation until the iteration stop condition is met, and taking the finally obtained second cluster as one grouping after clustering.
Optionally, clustering the words in the data set according to the similarity between words in the data set specifically comprises:
(i) taking any word among the words in the data set that have not currently completed clustering as a cluster center, and successively calculating the similarity between each of the other words that have not completed clustering and the cluster center;
(ii) in descending order of similarity to the cluster center, taking a preset number of words from the other words that have not completed clustering, dividing them into the same grouping as the cluster center, and marking all words in that grouping as words that have completed clustering;
(iii) returning to step (i) for calculation until all words in the data set have completed clustering.
Optionally, clustering the words in the data set according to the similarity between words in the data set specifically comprises:
(i) taking the data set as the set to be split; (ii) randomly selecting 2 words from the set to be split as initial cluster centers;
(iii) separately calculating the similarity between each word in the set to be split and the two initial cluster centers, and dividing each word into the grouping of the cluster center with which its similarity is higher, the split yielding two intermediate groupings;
(iv) if the number of words contained in an intermediate grouping is greater than a preset word quantity threshold, taking that intermediate grouping as a new set to be split and returning to step (ii) for calculation, until the number of words contained in each intermediate grouping is not greater than the preset word quantity threshold, and taking each intermediate grouping whose word count is not greater than the preset word quantity threshold as one grouping after clustering.
Optionally, the iteration stop condition includes one or more of the following: the words in the second cluster no longer change; the number of words in the second cluster is not greater than a preset word quantity threshold; the number of iterations reaches a preset count threshold.
Optionally, performing inter-group hierarchical clustering on the at least one grouping to obtain the first hierarchical clustering subtree specifically comprises:
for each grouping, calculating the average of the vectors corresponding to all words in the grouping, to obtain the average vector corresponding to the grouping;
performing inter-group hierarchical clustering on the groupings according to the average vector corresponding to each grouping.
Optionally, constructing the semantic hierarchy tree from the first hierarchical clustering subtree and the second hierarchical clustering subtrees specifically comprises:
taking the root node of the first hierarchical clustering subtree as the root node of the semantic hierarchy tree, and taking the root node of each second hierarchical clustering subtree as a leaf node of the first hierarchical clustering subtree, connecting the first hierarchical clustering subtree with the second hierarchical clustering subtrees to generate the semantic hierarchy tree.
In a second aspect, an embodiment of this application further provides a semantic hierarchy tree construction device, comprising:
a grouping module, configured to classify a data set to obtain at least one grouping, each of the at least one grouping containing at least one word;
an inter-group hierarchical clustering module, configured to perform inter-group hierarchical clustering on the at least one grouping to obtain a first hierarchical clustering subtree, wherein each grouping is a leaf node of the first hierarchical clustering subtree;
an intra-group hierarchical clustering module, configured to perform intra-group hierarchical clustering on each grouping to obtain a second hierarchical clustering subtree corresponding to each grouping, wherein the words contained in each grouping are the leaf nodes of the second hierarchical clustering subtree corresponding to that grouping;
a semantic hierarchy tree construction module, configured to construct the semantic hierarchy tree from the first hierarchical clustering subtree and the second hierarchical clustering subtrees.
In the embodiments of this application, a data set is classified to obtain at least one grouping, each of the at least one grouping containing at least one word; inter-group hierarchical clustering is performed on the at least one grouping to obtain a first hierarchical clustering subtree, wherein each grouping is a leaf node of the first hierarchical clustering subtree; intra-group hierarchical clustering is performed on each grouping to obtain a second hierarchical clustering subtree corresponding to each grouping, wherein the words contained in each grouping are the leaf nodes of the corresponding second hierarchical clustering subtree; and the semantic hierarchy tree is constructed from the first hierarchical clustering subtree and the second hierarchical clustering subtrees. This accelerates the construction of the hierarchical clustering tree, reduces the computation required to construct it, and lowers the computational complexity, thereby meeting the requirement of rapidly constructing a semantic hierarchy tree from a large-scale data set.
To make the above objects, features, and advantages of this application clearer and easier to understand, preferred embodiments are described in detail below in conjunction with the accompanying drawings.
Brief description of the drawings
In order to more clearly illustrate the technical solutions in the embodiments of this application, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of this application and should therefore not be construed as limiting its scope. For those of ordinary skill in the art, other relevant drawings can be obtained from these drawings without creative effort.
Fig. 1 shows a flowchart of a semantic hierarchy tree construction method provided by an embodiment of this application;
Fig. 2 shows a flowchart of a specific method of obtaining the data set in the semantic hierarchy tree construction method provided by an embodiment of this application;
Fig. 3 shows a flowchart of a first specific method of clustering the words in the data set, in the semantic hierarchy tree construction method provided by an embodiment of this application;
Fig. 4 shows a flowchart of a second specific method of clustering the words in the data set, in the semantic hierarchy tree construction method provided by an embodiment of this application;
Fig. 5 shows a flowchart of a third specific method of clustering the words in the data set, in the semantic hierarchy tree construction method provided by an embodiment of this application;
Fig. 6 shows a flowchart of a fourth specific method of clustering the words in the data set, in the semantic hierarchy tree construction method provided by an embodiment of this application;
Fig. 7 shows a schematic structural diagram of a second hierarchical clustering subtree in an example provided by an embodiment of this application;
Fig. 8 shows a schematic structural diagram of a first hierarchical clustering subtree in an example provided by an embodiment of this application;
Fig. 9 shows a schematic structural diagram of a hierarchical clustering tree in an example provided by an embodiment of this application;
Fig. 10 shows a schematic structural diagram of a semantic hierarchy tree construction device provided by an embodiment of this application;
Fig. 11 shows a schematic structural diagram of a computer device provided by an embodiment of this application.
Specific embodiments
To make the objects, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments of this application are described below clearly and completely in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. The components of the embodiments of this application, as generally described and illustrated in the drawings herein, can be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments of this application provided in the drawings is not intended to limit the claimed scope of this application, but merely represents selected embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
At present, constructing a semantic hierarchy tree requires a pairwise relation matrix over the data; for large-scale data sets (e.g., on the order of millions of items), the computational complexity cannot meet practical requirements. On this basis, the semantic hierarchy tree construction method and device provided by this application can rapidly construct a semantic hierarchy tree from a large-scale data set.
Unlike the prior art, this application classifies the data set into at least one grouping, performs inter-group hierarchical clustering and intra-group hierarchical clustering, and constructs the semantic hierarchy tree from the first hierarchical clustering subtree obtained by inter-group hierarchical clustering and the second hierarchical clustering subtrees obtained by intra-group hierarchical clustering. Using this semantic hierarchy tree in place of the Huffman coding tree in a natural language processing model means that even when the word prediction deviates, a similar option is returned rather than a wildly implausible result.
To facilitate understanding of the present embodiment, the semantic hierarchy tree construction method disclosed in the embodiments of this application is first described in detail.
As shown in Fig. 1, the semantic hierarchy tree construction method provided by the embodiments of this application comprises:
S101: classifying a data set to obtain at least one grouping, each of the at least one grouping containing at least one word.
In specific implementation, the data set refers to a data collection containing multiple words. In the process of training a natural language processing model, to guarantee the precision of the model, the number of words in the data set should be as large as possible.
As shown in Fig. 2, an embodiment of this application provides a way of obtaining the data set:
S201: obtaining corpus text from a preset platform.
Here, corpus text can be crawled from the preset platform by means of crawlers, scraping tools, and similar techniques. When crawling, no restriction need be imposed on what is crawled; that is, any corpus appearing on the preset platform can serve as crawled corpus. Optionally, since the vocabulary in use changes constantly, the degree of correlation between a word and other words also changes accordingly; for example, the term "dog food" originally meant only "special food for feeding dogs", but can now also be read as a reference to displays of affection between couples. A certain restriction can therefore also be applied to the crawled corpus, such as limiting its creation time, for example obtaining only corpus produced within three years of the current time.
Optionally, in order to determine the domain keywords of a certain domain more quickly, corpus already known to belong to that domain can be obtained from the preset platform in a targeted manner when obtaining corpus. In this way, the domain keywords corresponding to each domain can be obtained quickly.
S202: performing word segmentation on the corpus using a pre-trained segmentation model to obtain multiple words, and taking the set formed by these words as the data set.
For example, the segmentation model may be any one of a string-matching-based segmentation model, a statistics-based segmentation model, a neural-network-based segmentation model, and an N-shortest-path-based segmentation model.
The segmentation principle of the string-matching-based segmentation model is as follows: according to a certain strategy, the Chinese character string to be analyzed is matched against the entries of a "sufficiently large" machine dictionary; if some character string is found in the dictionary, the match succeeds, that is, a word is identified. According to the scanning direction, string-matching segmentation methods can be divided into forward matching and reverse matching; according to the length-priority rule, into maximum (longest) matching and minimum (shortest) matching; and according to whether they are combined with part-of-speech tagging, into pure segmentation methods and integrated methods that combine segmentation with tagging.
The segmentation principle of the statistics-based segmentation model is as follows: the co-occurrence frequency of each pair of adjacent characters in the corpus is counted, and their mutual information is calculated. The mutual information of two characters quantifies the adjacent co-occurrence probability of the Chinese characters X and Y, and reflects the tightness of the bond between the characters. When the tightness exceeds a certain threshold, the character pair can be judged to possibly constitute a word.
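For reference, the mutual information mentioned above is commonly computed as follows (a standard formulation supplied here for clarity; the patent text itself does not spell out the formula):

$$MI(X, Y) = \log \frac{p(X, Y)}{p(X)\,p(Y)}$$

where $p(X, Y)$ is the adjacent co-occurrence probability of the characters X and Y estimated from corpus frequencies, and $p(X)$ and $p(Y)$ are their individual occurrence probabilities; a larger value indicates a tighter bond between the characters.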
The segmentation principle of the N-shortest-path segmentation model is as follows: according to a dictionary, all possible words in the character string are found, and a directed acyclic word-segmentation graph is constructed. Each word corresponds to a directed edge in the graph and is assigned a corresponding edge length (weight). Then, among all paths from the start point to the end point of this segmentation graph, the paths ranked in strict ascending order of length 1st, 2nd, ..., i-th, ..., N-th are collected as the coarse segmentation result set (length values at any two different rank positions necessarily differ, likewise below). If two or more paths are of equal length, they tie at the i-th rank; all of them are included in the coarse segmentation result set without affecting the rank order of the other paths, so the final coarse segmentation result set is of size greater than or equal to N.
After the data set is obtained, it must be classified. In order to place similar or related words under the same branch of the semantic hierarchy tree when constructing it, the classification of the data set is generally carried out based on the similarity between words in the data set.
To obtain the similarity between words, the words in the data set can be mapped into a high-dimensional space, forming a vector for each word. The distance between vectors can be used to characterize the similarity between the corresponding words: the closer the vectors, the higher the similarity between the corresponding words; the farther apart the vectors, the lower the similarity between the corresponding words.
In the embodiments of this application, the vector of each word in the data set can be obtained using the word2vec algorithm. Word2vec is a word-vector mapping that maps words into a new space: through statistics and training over a large corpus, each word is represented by a multi-dimensional continuous real-valued vector, and a word2vec model is a large matrix in a neural network that stores the representation vectors of all words. The similarity between words can be determined by computing the distance between the vectors corresponding to the words. The distance between vectors may include one or more of: Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, standardized Euclidean distance, Mahalanobis distance, cosine of the included angle, Hamming distance, Jaccard distance, correlation distance, and information entropy.
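As a minimal sketch of this vector-similarity step (the cosine choice and the toy vectors below are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two word vectors; higher means more similar."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical word vectors, standing in for the output of a trained word2vec model.
vectors = {
    "cat": np.array([0.8, 0.1, 0.3]),
    "dog": np.array([0.7, 0.2, 0.4]),
    "car": np.array([0.1, 0.9, 0.2]),
}
print(cosine_similarity(vectors["cat"], vectors["dog"]))  # high: related words
print(cosine_similarity(vectors["cat"], vectors["car"]))  # lower: unrelated words
```

Any of the distances listed above could be substituted for the cosine here; the clustering sketches below reuse this same similarity function.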
Classifying the data set divides the words into one or more groupings, and the similarity between the words within each grouping must satisfy a certain similarity requirement. In one embodiment of this application, the at least one grouping can be obtained by clustering the words in the data set according to the similarity between words in the data set. In specific implementation, any one of the following clustering approaches can be adopted to cluster the words in the data set:
First, the first method provided by the embodiments of this application for clustering the words in the data set, as shown in Fig. 3, comprises:
S301: taking all words in the data set as a first cluster.
S302: determining the cluster center of the first cluster, the vector corresponding to the cluster center being the average of the vectors corresponding to the words in the first cluster; taking the word closest in similarity to the cluster center of the first cluster as the center, determining the words within a preset similarity range of that center, and forming a second cluster.
S303: detecting whether the iteration stop condition is met; if so, executing S304; if not, taking the second cluster as the first cluster and returning to step S302 for calculation.
S304: taking the finally obtained second cluster as one grouping after clustering, and marking the words in the second cluster as words that have completed clustering; executing S305.
S305: detecting whether there are currently words that have not completed clustering; if so, taking all words in the data set that have not completed clustering as the first cluster and returning to step S302 for calculation; if not, ending.
In the above method for clustering the words in the data set, at the first round of clustering none of the words in the data set has completed clustering, so all words are taken as the first cluster and the vector corresponding to the cluster center of the first cluster is calculated. In subsequent rounds, since some words have already completed clustering, the words that completed clustering in earlier rounds are removed, and clustering is performed only on the currently remaining words that have not completed clustering; that is, the words in the data set that have not completed clustering are taken as the first cluster, and the vector corresponding to the cluster center of the first cluster is calculated.
Here, when calculating the vector corresponding to the cluster center of the first cluster, the average of the vectors corresponding to the words in the first cluster is taken. The vectors of the words in the first cluster all have the same dimensions. Supposing the vector of each word in the first cluster has dimensions m*n, the vector of the cluster center also has dimensions m*n, and each element of the cluster-center vector is the average of the elements at the corresponding position in the vectors of all words in the first cluster.
The element in the i-th row and j-th column of the cluster center can be denoted $B_{i,j}$. Supposing there are k words in the first cluster and the element in the i-th row and j-th column of the s-th word in the first cluster is denoted $A_{i,j}^{s}$, then $B_{i,j}$ satisfies the following formula (1):

$$B_{i,j} = \frac{1}{k}\sum_{s=1}^{k} A_{i,j}^{s} \qquad (1)$$
In each iteration cycle, after the vector of the cluster center of the first cluster is calculated, in order to obtain the word closest in similarity to the cluster center, the distance between the vector of each word in the first cluster and the vector of the cluster center of the first cluster is calculated in turn, and the word corresponding to the vector with the smallest distance is taken as the word closest in similarity to the cluster center of the first cluster. From the remaining words in the first cluster, the words whose similarity to this center lies within the preset similarity range are then determined, forming a second cluster; the second cluster is taken as the new first cluster, and the step of calculating the cluster-center vector of the first cluster is executed again, until the iteration stop condition is met.
At this point, it should be noted that, in order to accelerate the convergence of each classification, the preset similarity range used in each round of the multi-round iteration differs, and as the number of iterations increases, the extent of the preset similarity range gradually shrinks.
When the iteration stop condition is met, the second cluster obtained in the last iteration is taken as one class after clustering, and the words in the second cluster obtained in the last iteration are marked as words that have completed clustering; the above iterative process is then carried out again on the words that have not completed clustering, until all words have completed clustering.
The iteration stop condition here includes at least one of the following conditions: 1) the words in the second cluster no longer change (in this case the preset similarity range is also required to have a bound during the multi-round iteration); 2) the number of iterations reaches a set count threshold; 3) the number of words in the second cluster is not greater than a preset word quantity threshold.
Under condition 1), if the words in the second cluster no longer change, an optimal cluster has been formed and iteration can stop. Under condition 2), to save computation, a maximum number of iterations can be set; if the number of iterations reaches the set count threshold, the iteration of this iteration cycle can be stopped, and the words included in the finally obtained second cluster are taken as one class. Under condition 3), if the number of words in the second cluster is not greater than the preset word quantity threshold, the computation required when subsequently constructing the second hierarchical clustering subtree is bounded within a certain range, which can satisfy the current limits on computation.
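The following is a minimal sketch of this first method, under the assumptions that cosine similarity is used and that the shrinking "preset similarity range" is modeled as a per-iteration similarity floor `sim_threshold(it)` (all names here are illustrative, not from the patent):

```python
import numpy as np

def cluster_method_one(vectors, sim_threshold, max_words, max_iters):
    """Cluster words per Fig. 3: repeatedly shrink a cluster around the word
    nearest its center, then restart on the words not yet clustered."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    remaining = dict(vectors)                    # words not yet clustered
    groupings = []
    while remaining:                             # S305 loop
        first = dict(remaining)                  # S301: all unclustered words
        second = first
        for it in range(max_iters):              # stop condition 2: iteration cap
            heart = np.mean(list(first.values()), axis=0)    # formula (1)
            center = max(first, key=lambda w: cos(first[w], heart))
            second = {w: v for w, v in first.items()
                      if cos(v, first[center]) >= sim_threshold(it)}
            if not second:                       # guard: keep at least the center word
                second = {center: first[center]}
            if second.keys() == first.keys():    # stop condition 1: no change
                break
            if len(second) <= max_words:         # stop condition 3: small enough
                break
            first = second
        groupings.append(sorted(second))
        for w in second:                         # mark as having completed clustering
            del remaining[w]
    return groupings
```

A schedule such as `sim_threshold=lambda it: 0.5 + 0.05 * it` realizes the gradually shrinking similarity range described above.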
Second, the second method provided by the embodiments of this application for clustering the words in the data set, as shown in Fig. 4, comprises:
randomly selecting, according to a preset grouping number K, K words from the data set as initial cluster centers;
for each initial cluster center, executing the following steps:
S401: taking the words whose similarity to the cluster center is less than the first preset similarity, together with the cluster center, as a first cluster, and calculating the cluster center of the first cluster, the vector corresponding to the cluster center being the average of the vectors corresponding to the words in the first cluster;
S402: taking the word closest in similarity to the cluster center of the first cluster as the center, determining the words within the preset similarity range of that center, and forming a second cluster;
S403: taking the second cluster as the new first cluster and returning to step S401 for calculation until the iteration stop condition is met, and taking the finally obtained second cluster as one grouping after clustering.
In specific implementation, the number of initial cluster centers can be set according to actual needs. Specifically, in order to bound the computation required when constructing a hierarchical clustering subtree for each grouping, the number of words in each grouping needs to be limited to a certain range; accordingly, the more words the data set contains, the larger the value of K.
For example, if the data set contains 1,000,000 words and the number of words in each grouping is required to be no greater than 10,000, the ratio of the number of words in the data set to the maximum number of words per grouping can be taken as the value of K; in this example, K is 100.
In addition, in this example, in order to leave headroom in the word capacity of each grouping, the value of K can also be set to the sum of the above ratio and a preset percentage of that ratio. For example, K can be determined as the ratio of the number of words in the data set to the maximum number of words per grouping plus 10% of that ratio, i.e., K = 100 + 100*10% = 110.
After K is determined, K words can be selected from the vocabulary of the data set as initial cluster centers. Then, for each initial cluster center, the similarity between each word other than that initial cluster center and the initial cluster center is calculated in turn.
The method of computing the similarity between each word and the initial cluster center here is similar to the method of computing the similarity between the cluster center and a word in the embodiment corresponding to Fig. 3 above, and is not repeated here.
For example, if the data set contains 1,000,000 words and the value of K is determined to be 110, the 110 initial cluster centers determined from the 1,000,000 words are X1-X110. For X1, the distance between X1 and each of the other 999,999 of the 1,000,000 words is calculated in turn. If the distance between some word and X1 is less than the first preset similarity, that word and X1 are divided into the same cluster, namely the first cluster. The average of the vectors of all words in the first cluster is then taken as the cluster center. Next, with the word nearest to the cluster center as the center, the words within the preset similarity range of that center are determined to form a second cluster; the second cluster is taken as the new first cluster, and the step of calculating the cluster-center coordinates of the first cluster is returned to, until the iteration stop condition is met, and the finally obtained second cluster is taken as one class after clustering.
In particular, if a word selected as an initial cluster center has already been divided into some class during the iterative process, the above iterative process is no longer carried out based on that initial cluster center. Either a word can again be selected from the remaining words that have not completed clustering to serve as a new initial cluster center, and the above iterative process carried out for this new cluster center, in which case the final number of classes obtained is the same as K; or the initial cluster center that has been divided into some class can simply be skipped, and the above iterative process carried out only for the other initial cluster centers, in which case the final number of classes obtained is less than K.
In this embodiment, the iteration stop condition is similar to that in the embodiment corresponding to Fig. 3 above, and is not repeated here.
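A minimal sketch of this second method under the same cosine assumption (the reading of the "first preset similarity" as a similarity floor, and all names, are our assumptions):

```python
import numpy as np

def cluster_method_two(vectors, k, first_sim, sim_threshold, max_iters, rng):
    """Cluster words per Fig. 4: seed K random centers, grow a first cluster
    around each, then iterate as in the first method."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    words = sorted(vectors)
    centers = list(rng.choice(words, size=k, replace=False))  # K initial centers
    clustered, groupings = set(), []
    for c in centers:
        if c in clustered:   # center already absorbed: skip (the fewer-than-K variant)
            continue
        first = {w for w in words                             # S401
                 if w not in clustered and cos(vectors[w], vectors[c]) >= first_sim}
        first.add(c)
        second = first
        for it in range(max_iters):
            heart = np.mean([vectors[w] for w in first], axis=0)
            center = max(first, key=lambda w: cos(vectors[w], heart))
            second = {w for w in first                        # S402
                      if cos(vectors[w], vectors[center]) >= sim_threshold(it)} or {center}
            if second == first:              # words no longer change
                break
            first = second                   # S403
        groupings.append(sorted(second))
        clustered |= second
    return groupings
```

Here `rng` is a `numpy.random.Generator`, e.g. `np.random.default_rng(0)`.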
Third, the third method provided by the embodiments of this application for clustering the words in the data set, as shown in Fig. 5, comprises:
S501: taking any word among the words in the data set that have not currently completed clustering as a cluster center, and successively calculating the similarity between each of the other words that have not completed clustering and the cluster center;
S502: in descending order of similarity to the cluster center, taking a preset number of words from the other words that have not completed clustering, dividing them into the same grouping as the cluster center, and marking all words in that grouping as words that have completed clustering;
S503: detecting whether there are currently words that have not completed clustering; if so, returning to step S501 for calculation, until all words in the data set have completed clustering; if not, ending.
In specific implementation, the method of calculating the similarity between each of the other words that have not completed clustering and the cluster center is similar to the method of computing the similarity between the cluster center of the first cluster and a word in the embodiment corresponding to Fig. 3 above, and is not repeated here.
In this embodiment, the condition constraining the clustering result is the number of words in each class, so that the number of words contained in each class is limited to a certain range, thereby reducing the computation required for intra-group hierarchical clustering.
This third clustering method is simpler than the first and second clustering methods above and has higher computational efficiency, but its precision is somewhat lower than that of the two methods above.
In addition, besides constraining each grouping by the number of words in it, the similarity between the words in each grouping can also be used to constrain each grouping.
For example, the above S502 may instead be: in descending order of similarity to the cluster center, taking from the current words that have not completed clustering no more than the preset number of words whose similarity to the center satisfies a preset similarity threshold, dividing them into the same grouping as the cluster center, and marking all words in that grouping as words that have completed clustering.
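A minimal sketch of this third method, covering both the basic S502 and the similarity-constrained variant just described (cosine similarity and all names are assumptions):

```python
import numpy as np

def cluster_method_three(vectors, group_size, rng, sim_floor=None):
    """Cluster words per Fig. 5: pick any unclustered word as center and group
    it with its most similar unclustered words."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    remaining = set(vectors)
    groupings = []
    while remaining:
        center = rng.choice(sorted(remaining))   # S501: any unclustered word
        others = sorted((w for w in remaining if w != center),
                        key=lambda w: cos(vectors[w], vectors[center]), reverse=True)
        if sim_floor is not None:                # S502 variant: similarity constraint
            others = [w for w in others if cos(vectors[w], vectors[center]) >= sim_floor]
        group = {center, *others[:group_size]}   # S502: at most a preset number of words
        groupings.append(sorted(group))
        remaining -= group                       # S503: mark as clustered, loop
    return groupings
```

Because each pass only sorts the unclustered words against a single center, this matches the text's observation that the method trades some precision for speed.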
Fourth, the fourth method provided by the embodiments of this application for clustering the words in the data set, as shown in Fig. 6, comprises:
S601: taking the data set as the set to be split;
S602: randomly selecting 2 words from the set to be split as initial cluster centers;
S603: separately calculating the similarity between each word in the set to be split and the two initial cluster centers, and dividing each word into the grouping of the cluster center with which its similarity is higher, the split yielding two intermediate groupings;
S604: detecting whether the number of words contained in an intermediate grouping is greater than the preset word quantity threshold; if so, taking that intermediate grouping as the set to be split and returning to step S602 for calculation; if not, executing S605.
S605: taking each intermediate grouping whose word count is not greater than the preset word quantity threshold as one grouping after clustering.
In specific implementation, the similarity between each word in the set to be split and the two initial cluster centers is computed in a manner similar to the computation of the similarity between the cluster center of the first cluster and a word in the embodiment corresponding to Fig. 3 above, and is not repeated here.
Based on the similarity between each word in the set to be split and the two initial cluster centers, this embodiment divides the words in the data set into multiple classes in a recursive manner, so that the words within each class are all relatively similar to one another and the number of words in each grouping is not greater than the preset word quantity threshold, reducing the computation required for intra-group hierarchical clustering.
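A minimal sketch of this fourth, recursive-bisection method (cosine similarity and the arbitrary-halving fallback for degenerate splits are assumptions):

```python
import numpy as np

def cluster_method_four(vectors, words, max_words, rng):
    """Cluster words per Fig. 6: recursively split the set to be split around
    two random centers until every grouping holds at most `max_words` words."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    words = sorted(words)
    if len(words) <= max_words:                  # S604/S605: small enough to keep
        return [words]
    c1, c2 = rng.choice(words, size=2, replace=False)   # S602: two initial centers
    left, right = [], []
    for w in words:                              # S603: join the more similar center
        (left if cos(vectors[w], vectors[c1]) >= cos(vectors[w], vectors[c2])
         else right).append(w)
    if not left or not right:                    # degenerate split: halve arbitrarily
        left, right = words[: len(words) // 2], words[len(words) // 2:]
    return (cluster_method_four(vectors, left, max_words, rng)
            + cluster_method_four(vectors, right, max_words, rng))
```

Called as `cluster_method_four(vectors, list(vectors), 10000, np.random.default_rng(0))`, this yields groupings of at most 10,000 words each.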
After the data set has been classified and multiple groupings formed, intra-group hierarchical clustering is carried out for each grouping, and inter-group hierarchical clustering is carried out over all groupings.
In specific implementation, intra-group hierarchical clustering and inter-group hierarchical clustering can be carried out in either order.
S102: performing inter-group hierarchical clustering on the at least one grouping to obtain a first hierarchical clustering subtree, wherein each grouping is a leaf node of the first hierarchical clustering subtree.
In specific implementation, when inter-group hierarchical clustering is performed on the groupings formed in S101, the words in each grouping are regarded as a whole, and hierarchical clustering is carried out to generate the first hierarchical clustering subtree.
Here, the average of the vectors corresponding to all words in each grouping can be calculated to obtain the average vector corresponding to each grouping, and each grouping is characterized by its average vector. The inter-group hierarchical clustering of the groupings can then be carried out based on the average vector corresponding to each grouping.
When performing inter-group hierarchical clustering on the groupings, the following method can be used:
treating each grouping as a cluster, and calculating the similarity between every two clusters;
determining multiple cluster pairs in descending order of similarity, each cluster pair containing two clusters, with no cluster contained in more than one cluster pair;
merging the two clusters belonging to the same cluster pair to form a new cluster, and executing the above process of calculating the similarity between every two clusters for the new clusters, until the clusters corresponding to all groupings have been merged together.
Here, each grouping is a leaf node of the first hierarchical clustering subtree formed; the root node of the first hierarchical clustering subtree contains all groupings; and each cluster pair constitutes a node between the leaf nodes and the root node.
It should be noted here that when the two clusters belonging to the same cluster pair are merged to form a new cluster, the average of the average vectors corresponding to the two merged clusters of the cluster pair is computed, and this average vector is used to characterize the new cluster formed by merging the two clusters.
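A minimal sketch of this agglomerative step, usable both for the inter-group clustering here and for the intra-group clustering of S103 below (cosine similarity, the greedy pairing by descending similarity, and the nested-tuple tree representation are our assumptions):

```python
import numpy as np

def agglomerate(items):
    """Build a hierarchical clustering subtree over `items`, a dict mapping a
    leaf label (a grouping, or a word) to its vector. Returns a nested tuple."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    active = [(label, vec) for label, vec in items.items()]  # (subtree, vector)
    while len(active) > 1:
        # Rank all cluster pairs by similarity, most similar first.
        pairs = sorted(((i, j) for i in range(len(active))
                        for j in range(i + 1, len(active))),
                       key=lambda p: cos(active[p[0]][1], active[p[1]][1]),
                       reverse=True)
        used, merged = set(), []
        for i, j in pairs:
            if i in used or j in used:           # no cluster in more than one pair
                continue
            used |= {i, j}
            (t1, v1), (t2, v2) = active[i], active[j]
            merged.append(((t1, t2), (v1 + v2) / 2))  # new cluster vector = average
        merged += [active[i] for i in range(len(active)) if i not in used]
        active = merged                          # repeat on the merged clusters
    return active[0][0]
```

Each round pairs off the most similar clusters and averages their vectors, which is exactly the merge rule described in the preceding paragraphs.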
S103: performing intra-group hierarchical clustering on each grouping to obtain a second hierarchical clustering subtree corresponding to each grouping, wherein the words contained in each grouping are the leaf nodes of the second hierarchical clustering subtree corresponding to that grouping.
In specific implementation, performing intra-group hierarchical clustering on each grouping means performing hierarchical clustering over the words contained in that grouping, each word being an independent element within the grouping.
When performing intra-group hierarchical clustering on each grouping, the following method can be used: treating each word as a cluster, and calculating the similarity between every two clusters; determining multiple cluster pairs in descending order of similarity, each cluster pair containing two clusters, with no cluster contained in more than one cluster pair; merging the two clusters belonging to the same cluster pair to form a new cluster, and executing the above process of calculating the similarity between every two clusters for the new clusters, until all words in the corresponding grouping have been merged together.
Here, each word is a leaf node of the second hierarchical clustering subtree formed; the root node of the second hierarchical clustering subtree contains all words in the corresponding grouping; and each cluster pair constitutes a node between the leaf nodes and the root node.
It should be noted here that when the two clusters belonging to the same cluster pair are merged to form a new cluster, the average of the vectors corresponding to the two merged clusters of the cluster pair is taken as the vector of the new cluster, and this vector is used to characterize the new cluster formed by merging the two clusters.
S104: constructing the semantic hierarchy tree from the first hierarchical clustering subtree and the second hierarchical clustering subtrees.
In specific implementation, the first hierarchical clustering subtree and the second hierarchical clustering subtrees are each a part of the semantic hierarchy tree to be constructed, and the level of the first hierarchical clustering subtree is higher than the level of the second hierarchical clustering subtrees.
Since each leaf node of the first hierarchical clustering subtree is a grouping (that is, each leaf node of the first hierarchical clustering subtree contains all words of the corresponding grouping), and the root node of each second hierarchical clustering subtree contains all words in the corresponding grouping, the first hierarchical clustering subtree and the second hierarchical clustering subtrees can be connected at exactly these nodes; namely, the root node of the first hierarchical clustering subtree is taken as the root node of the semantic hierarchy tree, and the root node of each second hierarchical clustering subtree is taken as the corresponding leaf node of the first hierarchical clustering subtree, thereby connecting the first hierarchical clustering subtree with the second hierarchical clustering subtrees and generating the semantic hierarchy tree.
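A minimal sketch of this connection step, continuing the nested-tuple representation of the `agglomerate` sketch above (all names are illustrative):

```python
def build_semantic_tree(first_subtree, second_subtrees):
    """S104: splice each second subtree into the first subtree at the leaf
    carrying its grouping label; the result's root is the semantic tree's root."""
    if isinstance(first_subtree, tuple):         # internal node: recurse on children
        return tuple(build_semantic_tree(child, second_subtrees)
                     for child in first_subtree)
    return second_subtrees[first_subtree]        # leaf = grouping label: splice in

# Assumed usage, given `groupings` (label -> word set) and word `vectors`:
# avg = {g: np.mean([vectors[w] for w in ws], axis=0) for g, ws in groupings.items()}
# first = agglomerate(avg)
# second = {g: agglomerate({w: vectors[w] for w in ws}) for g, ws in groupings.items()}
# tree = build_semantic_tree(first, second)
```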
By way of illustration, the embodiments of this application also provide an example of the above process. It should be noted that the magnitude of the data set used in this example is chosen purely for clarity of explanation and does not represent the magnitude of the data set in practical implementation.
The data set contains 100 words. Classifying the data set according to the similarity between words yields 10 classes A-J in total, and the words contained in these 10 classes are respectively A1-A10, B1-B10, C1-C10, ..., J1-J10.
When intra-group hierarchical clustering is performed on class A, the second hierarchical clustering subtree obtained is as shown in Fig. 7. When inter-group hierarchical clustering is performed on A-J, the first hierarchical clustering subtree obtained is as shown in Fig. 8. The first hierarchical clustering subtree and the second hierarchical clustering subtrees are then connected together, and the hierarchical clustering tree formed is as shown in Fig. 9.
In the embodiments of this application, if the hierarchical clustering tree were constructed in the traditional manner, then supposing the word set contains f words, a similarity would have to be calculated separately for every two of these f words, so the amount of computation is:

$$\frac{f(f-1)}{2}$$

When the semantic hierarchy tree is constructed by the method provided by the embodiments of this application, supposing there are 100 groupings and the number of words in each grouping is f/100, the amount of computation satisfies:

$$100 \cdot \frac{\frac{f}{100}\left(\frac{f}{100}-1\right)}{2} + \frac{100 \times 99}{2} \ll \frac{f(f-1)}{2}$$
It can be seen that once f reaches a certain magnitude, the number of similarity computations is greatly reduced, which accelerates the construction of the hierarchical clustering tree, reduces the computation required to construct it, and lowers the computational complexity, thereby meeting the requirement of rapidly constructing a semantic hierarchy tree from a million-scale data set.
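As a concrete check of the counts above (the arithmetic is supplied here for illustration), take $f = 10^6$ with 100 groupings of $10^4$ words each:

$$\frac{f(f-1)}{2} \approx 5 \times 10^{11}, \qquad 100 \cdot \frac{10^4 \left(10^4 - 1\right)}{2} + \frac{100 \times 99}{2} \approx 5 \times 10^{9},$$

roughly a hundredfold reduction in the number of similarity computations.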
In the embodiments of this application, a data set is classified to obtain at least one grouping, each of the at least one grouping containing at least one word; inter-group hierarchical clustering is performed on the at least one grouping to obtain a first hierarchical clustering subtree, wherein each grouping is a leaf node of the first hierarchical clustering subtree; intra-group hierarchical clustering is performed on each grouping to obtain a second hierarchical clustering subtree corresponding to each grouping, wherein the words contained in each grouping are the leaf nodes of the corresponding second hierarchical clustering subtree; and the semantic hierarchy tree is constructed from the first hierarchical clustering subtree and the second hierarchical clustering subtrees. This accelerates the construction of the hierarchical clustering tree, reduces the computation required to construct it, and lowers the computational complexity, thereby meeting the requirement of rapidly constructing a semantic hierarchy tree from a million-scale data set.
Based on the same inventive concept, the embodiments of this application also provide a semantic hierarchy tree construction device corresponding to the semantic hierarchy tree construction method. Since the principle by which the device of the embodiments of this application solves the problem is similar to the semantic hierarchy tree construction method of the embodiments of this application described above, the implementation of the device may refer to the implementation of the method, and repeated details are omitted.
As shown in Fig. 10, the semantic hierarchy tree construction device provided by the embodiments of this application specifically comprises:
a grouping module 10, configured to classify a data set to obtain at least one grouping, each of the at least one grouping containing at least one word;
an inter-group hierarchical clustering module 20, configured to perform inter-group hierarchical clustering on the at least one grouping to obtain a first hierarchical clustering subtree, wherein each grouping is a leaf node of the first hierarchical clustering subtree;
an intra-group hierarchical clustering module 30, configured to perform intra-group hierarchical clustering on each grouping to obtain a second hierarchical clustering subtree corresponding to each grouping, wherein the words contained in each grouping are the leaf nodes of the corresponding second hierarchical clustering subtree;
a semantic hierarchy tree construction module 40, configured to construct the semantic hierarchy tree from the first hierarchical clustering subtree and the second hierarchical clustering subtrees.
As set out above for the method, this arrangement accelerates the construction of the hierarchical clustering tree, reduces the computation required to construct it, and lowers the computational complexity, thereby meeting the requirement of rapidly constructing a semantic hierarchy tree from a million-scale data set.
Optionally, the grouping module 10 is specifically configured to classify the data set through the following step to obtain at least one grouping:
clustering the words in the data set according to the similarity between words in the data set, to obtain the at least one grouping.
Optionally, the grouping module 10 is specifically configured to cluster the words in the data set according to the similarity between words in the data set through the following steps, to obtain the at least one grouping:
(i) taking all words in the data set as a first cluster;
(ii) determining the cluster center of the first cluster, the vector corresponding to the cluster center being the average of the vectors corresponding to the words in the first cluster; taking the word closest in similarity to the cluster center of the first cluster as the center, determining the words within a preset similarity range of that center, and forming a second cluster;
(iii) taking the second cluster as the first cluster and returning to step (ii) for calculation until the iteration stop condition is met; taking the finally obtained second cluster as one grouping after clustering, and marking the words in the second cluster as words that have completed clustering;
(iv) taking all words in the data set that have not completed clustering as the first cluster and returning to step (ii) for calculation, until all words in the data set have completed clustering, to obtain multiple groupings after clustering.
Optionally, the grouping module 10 is specifically configured to cluster the words in the data set according to the similarity between words in the data set through the following steps, to obtain the at least one grouping: randomly selecting, according to a preset grouping number K, K words from the data set as initial cluster centers;
for each initial cluster center, executing the following steps:
(i) taking the words whose similarity to the cluster center is less than the first preset similarity, together with the cluster center, as a first cluster, and calculating the cluster center of the first cluster, the vector corresponding to the cluster center being the average of the vectors corresponding to the words in the first cluster;
(ii) taking the word closest in similarity to the cluster center of the first cluster as the center, determining the words within the preset similarity range of that center, and forming a second cluster;
(iii) taking the second cluster as the new first cluster and returning to step (i) for calculation until the iteration stop condition is met, and taking the finally obtained second cluster as one grouping after clustering.
Optionally, the grouping module 10 is specifically configured to cluster the words in the data set according to the similarity between words in the data set through the following steps, to obtain the at least one grouping:
(i) taking any word among the words in the data set that have not currently completed clustering as a cluster center, and successively calculating the similarity between each of the other words that have not completed clustering and the cluster center;
(ii) in descending order of similarity to the cluster center, taking a preset number of words from the other words that have not completed clustering, dividing them into the same grouping as the cluster center, and marking all words in that grouping as words that have completed clustering;
(iii) returning to step (i) for calculation until all words in the data set have completed clustering.
Optionally, the grouping module 10 is specifically configured to cluster the words in the data set according to the similarity between the words in the data set through the following steps, to obtain the at least one grouping (a sketch follows these steps):
(i) taking the data set as a set to be split; (ii) randomly selecting 2 words from the set to be split as initial cluster centers;
(iii) calculating separately the similarity between each word in the set to be split and each of the two initial cluster centers, and dividing each word into the grouping of the cluster center with which its similarity is higher, the split yielding two intermediate groupings;
(iv) if the quantity of words comprised in an intermediate grouping is greater than a preset word-quantity threshold, taking that intermediate grouping as a new set to be split and returning to step (ii) for calculation, until the quantity of words comprised in each intermediate grouping is not greater than the preset word-quantity threshold, and taking each intermediate grouping whose word quantity is not greater than the preset word-quantity threshold as one grouping after clustering.
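A sketch of this recursive bisection variant; `max_group_size` plays the role of the preset word-quantity threshold. Termination relies on the two random seeds separating the set, which holds unless word vectors coincide exactly.

```python
import numpy as np

def group_by_bisecting(vecs, max_group_size=100, seed=0):
    """Sketch: split a set around two random seed words until every
    grouping holds no more than max_group_size words."""
    rng = np.random.default_rng(seed)
    stack = [np.arange(len(vecs))]                  # step (i): the whole data set
    groups = []
    while stack:
        current = stack.pop()
        if len(current) <= max_group_size:          # small enough: a final grouping
            groups.append(current)
            continue
        c1, c2 = rng.choice(current, size=2, replace=False)  # step (ii): 2 random seeds
        sims1 = vecs[current] @ vecs[c1]            # step (iii): similarity to each seed
        sims2 = vecs[current] @ vecs[c2]
        left, right = current[sims1 >= sims2], current[sims1 < sims2]
        for part in (left, right):                  # step (iv): re-split oversized parts
            if len(part):
                stack.append(part)
    return groups
```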
Optionally, the iteration-stopping condition includes one or more of the following: the words in the second cluster no longer change; the quantity of words in the second cluster is not greater than a preset word-quantity threshold; the number of iterations reaches a preset count threshold.
Optionally, the inter-group hierarchical clustering module 20 is specifically configured to perform inter-group hierarchical clustering on the at least one grouping through the following steps, to obtain the first hierarchical clustering subtree:
for each grouping, calculating the average of the vectors corresponding to all words in the grouping, to obtain the average vector corresponding to the grouping;
performing inter-group hierarchical clustering on the groupings according to the average vector corresponding to each grouping (a sketch follows).
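Since only one average vector per grouping enters this step, an off-the-shelf agglomerative routine suffices and its cost depends on the number of groupings rather than the number of words. A sketch using SciPy's `linkage`; the patent names no linkage criterion, so average linkage with cosine distance is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def build_intergroup_subtree(vecs, groups):
    """Sketch of the inter-group step: standard agglomerative hierarchical
    clustering over one average vector per grouping."""
    centroids = np.stack([vecs[g].mean(axis=0) for g in groups])
    Z = linkage(centroids, method="average", metric="cosine")  # linkage criterion assumed
    return to_tree(Z)   # each leaf id indexes a grouping, i.e. a leaf
                        # node of the first hierarchical clustering subtree
```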
Optionally, the semantic hierarchy clustering tree construction module 40 is specifically configured to construct the semantic hierarchy tree from the first hierarchical clustering subtree and the second hierarchical clustering subtrees through the following step:
taking the root node of the first hierarchical clustering subtree as the root node of the semantic hierarchy tree, and taking the root node of each second hierarchical clustering subtree as the corresponding leaf node of the first hierarchical clustering subtree, thereby connecting the first hierarchical clustering subtree with the second hierarchical clustering subtrees and generating the semantic hierarchy tree (a sketch of this assembly step follows).
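A sketch of this assembly step, using a hypothetical `Node` structure (the `children`, `group_id` and `word` fields are illustrative, not from the source): each leaf of the inter-group subtree is replaced by the root of the corresponding intra-group subtree, so the words become the leaves of the full tree.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)
    group_id: int = -1     # set on leaves of the inter-group subtree
    word: str = ""         # set on leaves of the intra-group subtrees

def assemble_semantic_tree(root: Node, intragroup_roots: List[Node]) -> Node:
    """Replace each leaf of the first (inter-group) subtree with the root
    of that grouping's second (intra-group) subtree."""
    if not root.children:                          # a leaf = one grouping
        return intragroup_roots[root.group_id]
    root.children = [assemble_semantic_tree(c, intragroup_roots)
                     for c in root.children]
    return root                                    # root of the semantic hierarchy tree
```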
In this embodiment, for the specific functions and interactions of the grouping module 10, the inter-group hierarchical clustering module 20, the intra-group hierarchical clustering module 30 and the semantic hierarchy clustering tree construction module 40, reference may be made to the description of the embodiments corresponding to Fig. 1 to Fig. 8, which is not repeated here.
For the semantic hierarchy tree construction method in Fig. 1, an embodiment of the present application further provides a computer device. As shown in Fig. 11, the device includes a memory 1000, a processor 2000 and a computer program stored on the memory 1000 and executable on the processor 2000, where the processor 2000, when executing the computer program, implements the steps of the above semantic hierarchy tree construction method.
Specifically, the memory 1000 and the processor 2000 may be a general-purpose memory and processor, which are not specifically limited here. When the processor 2000 runs the computer program stored in the memory 1000, it can execute the above semantic hierarchy tree construction method. This solves the problem that constructing a semantic hierarchy tree otherwise requires the pairwise relation matrix over all data, whose computational complexity is infeasible for data sets on the order of millions of items, and thereby speeds up construction of the hierarchical clustering tree, reduces the required computation and its complexity, and meets the requirement of quickly constructing a semantic hierarchy tree on million-scale data sets.
Corresponding to the semantic hierarchy tree construction method in Fig. 1, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when run by a processor, executes the steps of the above semantic hierarchy tree construction method.
Specifically, the storage medium may be a general-purpose storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is run, it can execute the above semantic hierarchy tree construction method, thereby solving the problem that constructing a semantic hierarchy tree otherwise requires the pairwise relation matrix over all data, whose computational complexity is infeasible for million-scale data sets, and achieving the effects of speeding up construction of the hierarchical clustering tree, reducing the required computation and its complexity, and quickly constructing a semantic hierarchy tree on million-scale data sets.
The computer program product of the semantic hierarchy tree construction method and device provided by the embodiments of the present application includes a computer-readable storage medium storing program code; the instructions included in the program code can be used to execute the methods described in the foregoing method embodiments. For specific implementation, reference may be made to the method embodiments, which are not repeated here.
It is clear to those skilled in the art that, for convenience and brevity of description, for the specific working process of the system and device described above, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.
If the functions are realized in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person familiar with the art can easily think of changes or substitutions within the technical scope disclosed in the present application, which should all be covered within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A semantic hierarchy tree construction method, characterized by comprising:
classifying a data set to obtain at least one grouping, each of the at least one grouping comprising at least one word;
performing inter-group hierarchical clustering on the at least one grouping to obtain a first hierarchical clustering subtree, wherein each grouping is a leaf node of the first hierarchical clustering subtree; and
performing intra-group hierarchical clustering on each grouping to obtain a second hierarchical clustering subtree corresponding to each grouping, wherein the words comprised in each grouping are the leaf nodes of the second hierarchical clustering subtree corresponding to that grouping;
constructing the semantic hierarchy tree according to the first hierarchical clustering subtree and the second hierarchical clustering subtrees.
2. The method according to claim 1, characterized in that classifying the data set to obtain the at least one grouping specifically comprises:
clustering the words in the data set according to the similarity between the words in the data set, to obtain the at least one grouping.
3. The method according to claim 2, characterized in that clustering the words in the data set according to the similarity between the words in the data set, to obtain the at least one grouping, specifically comprises:
(i) taking all words in the data set as a first cluster;
(ii) determining a centroid of the first cluster, the vector corresponding to the centroid being the average of the vectors corresponding to the words in the first cluster; taking the word most similar to the centroid of the first cluster as a center, and determining the words within a preset similarity range of that center to form a second cluster;
(iii) taking the second cluster as the first cluster and returning to step (ii) for calculation, until an iteration-stopping condition is met; taking the finally obtained second cluster as one grouping after clustering, and taking the words in the second cluster as words that have completed clustering;
(iv) taking all words in the data set that have not completed clustering as the first cluster and returning to step (ii) for calculation, until all words in the data set have completed clustering, to obtain multiple groupings after clustering.
4. The method according to claim 2, characterized in that clustering the words in the data set according to the similarity between the words in the data set specifically comprises:
randomly selecting, according to a preset grouping number K, K words from the data set as initial cluster centers;
for each initial cluster center, executing the following steps:
(i) taking the words whose similarity to the cluster center is less than a first preset similarity, together with the cluster center, as a first cluster, and calculating a centroid of the first cluster, the vector corresponding to the centroid being the average of the vectors corresponding to the words in the first cluster;
(ii) taking the word most similar to the centroid of the first cluster as a center, and determining the words within a preset similarity range of that center to form a second cluster;
(iii) taking the second cluster as a new first cluster and returning to step (i) for calculation, until the iteration-stopping condition is met; taking the finally obtained second cluster as one grouping after clustering.
5. The method according to claim 2, characterized in that clustering the words in the data set according to the similarity between the words in the data set specifically comprises:
(i) taking any one word among the words in the data set that have not yet completed clustering as a cluster center, and calculating in turn the similarity between each of the other words that have not completed clustering and the cluster center;
(ii) in descending order of the similarity between the other words that have not completed clustering and the cluster center, taking a preset quantity of words from the other words that have not completed clustering and dividing them into the same grouping as the cluster center, and taking all words in that grouping as words that have completed clustering;
(iii) returning to step (i) for calculation, until all words in the data set have completed clustering.
6. The method according to claim 2, characterized in that clustering the words in the data set according to the similarity between the words in the data set specifically comprises:
(i) taking the data set as a set to be split; (ii) randomly selecting 2 words from the set to be split as initial cluster centers;
(iii) calculating separately the similarity between each word in the set to be split and each of the two initial cluster centers, and dividing each word into the grouping of the cluster center with which its similarity is higher, the split yielding two intermediate groupings;
(iv) if the quantity of words comprised in an intermediate grouping is greater than a preset word-quantity threshold, taking that intermediate grouping as a new set to be split and returning to step (ii) for calculation, until the quantity of words comprised in each intermediate grouping is not greater than the preset word-quantity threshold, and taking each intermediate grouping whose word quantity is not greater than the preset word-quantity threshold as one grouping after clustering.
7. The method according to claim 3 or 4, characterized in that the iteration-stopping condition comprises one or more of the following: the words in the second cluster no longer change; the quantity of words in the second cluster is not greater than a preset word-quantity threshold; the number of iterations reaches a preset count threshold.
8. The method according to claim 1, characterized in that performing inter-group hierarchical clustering on the at least one grouping to obtain the first hierarchical clustering subtree specifically comprises:
for each grouping, calculating the average of the vectors corresponding to all words in the grouping, to obtain the average vector corresponding to the grouping;
performing inter-group hierarchical clustering on the groupings according to the average vector corresponding to each grouping.
9. The method according to claim 1, characterized in that constructing the semantic hierarchy tree according to the first hierarchical clustering subtree and the second hierarchical clustering subtrees specifically comprises:
taking the root node of the first hierarchical clustering subtree as the root node of the semantic hierarchy tree, taking the root node of each second hierarchical clustering subtree as the corresponding leaf node of the first hierarchical clustering subtree, connecting the first hierarchical clustering subtree with the second hierarchical clustering subtrees, and generating the semantic hierarchy tree.
10. A semantic hierarchy tree construction device, characterized by comprising:
a grouping module, configured to classify a data set to obtain at least one grouping, each of the at least one grouping comprising at least one word;
an inter-group hierarchical clustering module, configured to perform inter-group hierarchical clustering on the at least one grouping to obtain a first hierarchical clustering subtree, wherein each grouping is a leaf node of the first hierarchical clustering subtree;
an intra-group hierarchical clustering module, configured to perform intra-group hierarchical clustering on each grouping to obtain a second hierarchical clustering subtree corresponding to each grouping, wherein the words comprised in each grouping are the leaf nodes of the second hierarchical clustering subtree corresponding to that grouping;
a semantic hierarchy clustering tree construction module, configured to construct the semantic hierarchy tree according to the first hierarchical clustering subtree and the second hierarchical clustering subtrees.
CN201810836275.7A 2018-07-26 2018-07-26 Semantic hierarchical tree construction method and device Active CN109033084B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810836275.7A CN109033084B (en) 2018-07-26 2018-07-26 Semantic hierarchical tree construction method and device

Publications (2)

Publication Number Publication Date
CN109033084A true CN109033084A (en) 2018-12-18
CN109033084B CN109033084B (en) 2022-10-28

Family

ID=64646735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810836275.7A Active CN109033084B (en) 2018-07-26 2018-07-26 Semantic hierarchical tree construction method and device

Country Status (1)

Country Link
CN (1) CN109033084B (en)

Also Published As

Publication number Publication date
CN109033084B (en) 2022-10-28

Similar Documents

Publication Publication Date Title
Zamani et al. From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing
EP3696810B1 (en) Training encoder model and/or using trained encoder model to determine responsive action(s) for natural language input
Ding et al. Extreme learning machine with kernel model based on deep learning
Sánchez et al. Prototype selection for the nearest neighbour rule through proximity graphs
CN108038122B (en) Trademark image retrieval method
CN110889282B (en) Text emotion analysis method based on deep learning
CN110059181A (en) Short text stamp methods, system, device towards extensive classification system
CN109033084A (en) A kind of semantic hierarchies tree constructing method and device
Zhang et al. A view-based 3D CAD model reuse framework enabling product lifecycle reuse
US9501569B2 (en) Automatic taxonomy construction from keywords
WO2015099810A1 (en) Learning graph
Mac Parthaláin et al. Fuzzy-rough set bireducts for data reduction
CN112364132A (en) Similarity calculation model and system based on dependency syntax and method for building system
WO2021169453A1 (en) Text processing method and apparatus
Dai et al. A survey on dialog management: Recent advances and challenges
ElAlami Supporting image retrieval framework with rule base system
Liparulo et al. Fuzzy clustering using the convex hull as geometrical model
Scarselli et al. Solving graph data issues using a layered architecture approach with applications to web spam detection
CN116737956A (en) Entity alignment method and device for multi-mode knowledge graph
KR20230122872A (en) Transfer learning system and method for deep neural network
CN111078886B (en) Special event extraction system based on DMCNN
CN111046181B (en) Actor-critic method for automatic classification induction
Chen English translation template retrieval based on semantic distance ontology knowledge recognition algorithm
CN111159335A (en) Short text classification method based on pyramid pooling and LDA topic model
Khalilipour et al. Machine learning-based model categorization using textual and structural features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 101-8, 1st floor, building 31, area 1, 188 South Fourth Ring Road West, Fengtai District, Beijing

Applicant after: Guoxin Youyi Data Co.,Ltd.

Address before: 100070, No. 188, building 31, headquarters square, South Fourth Ring Road West, Fengtai District, Beijing

Applicant before: SIC YOUE DATA Co.,Ltd.

GR01 Patent grant