CN103744889B

CN103744889B - Method and apparatus for clustering problems

Info

Publication number: CN103744889B
Application number: CN201310718033.5A
Authority: CN
Inventors: 李皛皛; 方高林; 孟新萍; 杨帆
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2013-12-23
Filing date: 2013-12-23
Publication date: 2019-02-22
Anticipated expiration: 2033-12-23
Also published as: CN103744889A

Abstract

The present invention provides a method and apparatus for clustering problems: obtaining a target problem to be clustered and a candidate problem; determining the feature vectors of the target problem and the candidate problem according to their problem features; calculating the similarity between the target problem and the candidate problem according to their feature vectors; and judging, according to the similarity and in combination with a fragmentation threshold, whether to cluster the target problem and the candidate problem into one class. Compared with the prior art, the present invention determines feature vectors from the problem features of the target problem and the candidate problem to be clustered, calculates their similarity, and, in combination with the fragmentation threshold, judges whether to cluster them into one class. Problems are thereby clustered efficiently and accurately, problem resources are consolidated, and the user's retrieval experience is improved.

Description

Method and apparatus for clustering problems

Technical field

The present invention relates to the field of computer technology, and in particular to a technique for clustering problems.

Background art

At present, the "Zhidao" (Knows) question-and-answer site contains a large number of unorganized duplicate resources. When a user searches for the answer to some problem, the user has to browse several identical problems before it is resolved. The quality of these resources is also uneven, so the user must further screen the answers to obtain a relatively satisfactory one. The click cost the user pays in this process is high, and the retrieval experience is poor. Data analysis shows that 39.5% of the resolved resources on the site are duplicate problems, of which 22.92% are consistent in the text semantics of the problem description.

Currently, for short question strings or phrases, several tools can judge semantic consistency. They mainly rely on techniques such as type classification, synonym replacement, and omission of non-key words, and they work fairly well. However, they are often not the best fit for questions that carry a problem description, because questions in a UGC question-and-answer community take many different forms. For example: a) when asking, a user may write a rather general title such as "consulting a mathematical problem" and put the detailed description in the content; b) a user with several problems may not include all of them in the problem title and may continue asking in the detailed description; c) question information may also appear in the comments, and so on. These cases show that existing semantic consistency judgment methods cannot simply be copied, and that a question semantic consistency judgment algorithm suited to question-and-answer UGC products must be developed.

Therefore, how to cluster problems efficiently and accurately has become one of the issues that those skilled in the art urgently need to resolve.

Summary of the invention

The object of the present invention is to provide a method and apparatus for clustering problems.

According to one aspect of the invention, a method for clustering problems is provided, wherein the method comprises the following steps:

A. obtaining a target problem to be clustered and a candidate problem;

B. determining the feature vectors of the target problem and the candidate problem according to the problem features of the target problem and the candidate problem;

C. calculating the similarity between the target problem and the candidate problem according to the feature vectors of the target problem and the candidate problem;

D. judging, according to the similarity and in combination with a fragmentation threshold, whether to cluster the target problem and the candidate problem into one class.

According to another aspect of the present invention, a clustering apparatus for clustering problems is also provided, wherein the apparatus includes:

an acquisition device, for obtaining a target problem to be clustered and a candidate problem;

a determining device, for determining the feature vectors of the target problem and the candidate problem according to the problem features of the target problem and the candidate problem;

a computing device, for calculating the similarity between the target problem and the candidate problem according to the feature vectors of the target problem and the candidate problem;

a judgment means, for judging, according to the similarity and in combination with a fragmentation threshold, whether to cluster the target problem and the candidate problem into one class.

Compared with the prior art, the present invention obtains a target problem to be clustered and a candidate problem, determines their feature vectors according to their problem features, calculates their similarity according to the feature vectors, and judges, according to the similarity and in combination with a fragmentation threshold, whether to cluster the target problem and the candidate problem into one class. Problems are thereby clustered efficiently and accurately, problem resources are consolidated, and the user's retrieval experience is improved.

Further, clustering apparatus 1 relaxes the threshold used for judging question semantic consistency and adds further filter-matching steps, such as filtering by problem type, by keyword, and by key expression. More semantically consistent problems can thus be found, which further improves the efficiency and accuracy of problem clustering and the user's retrieval experience.

Further, the present invention additionally uses the problem content information and supplementary content information of the target problem or the candidate problem to calculate the similarity between the target problem and the candidate problem, in order to judge whether to cluster them into one class. Clustering apparatus 1 clusters problems based on the semantic consistency of the problem description: by analyzing the title and the specific content of a problem together, it clusters problems, further consolidates problem resources, and improves the user's retrieval experience.

Further, the present invention applies incremental clustering to handle large numbers of problems that are added in real time, further consolidating problem resources and improving the user's retrieval experience.

Brief description of the drawings

Other features, objects, and advantages of the invention will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

Fig. 1 shows a schematic diagram of an apparatus for clustering problems according to one aspect of the present invention;

Fig. 2 shows a schematic diagram of an apparatus for clustering problems according to a preferred embodiment of the present invention;

Fig. 3 shows a flow chart of a method for clustering problems according to another aspect of the present invention;

Fig. 4 shows a flow chart of a method for clustering problems according to a preferred embodiment of the present invention.

The same or similar reference numerals in the drawings denote the same or similar components.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings.

Fig. 1 shows a schematic diagram of an apparatus for clustering problems according to one aspect of the present invention. Clustering apparatus 1 includes acquisition device 101, determining device 102, computing device 103, and judgment means 104.

Acquisition device 101 obtains a target problem to be clustered and a candidate problem. Specifically, acquisition device 101 obtains the target problem and the candidate problem from a problem base, for example by interacting with that problem base. Alternatively, acquisition device 101 interacts directly with a user equipment, for example by calling an application programming interface (API) provided by the user equipment one or more times, or through dynamic web page technologies such as ASP, JSP, or PHP, obtains a problem input by a user, and uses it as the target problem or the candidate problem.

Here, the problem base stores the problems input by users, together with the problem content information, supplementary content information, and the like of each problem. For example, the problem base obtains and stores problems input by users periodically or in real time, so that the problem base is established or updated. The problem base may be located in clustering apparatus 1, or in a third-party device connected to clustering apparatus 1 through a network.

Those skilled in the art will understand that the above ways of obtaining the target problem to be clustered and the candidate problem are only examples. Other existing or future ways of obtaining the target problem and the candidate problem, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.

Determining device 102 determines the feature vectors of the target problem and the candidate problem according to their problem features. Specifically, determining device 102 uses problem features such as the keyword features, structure features, semantic features, and problem type features in the title of the target problem or the candidate problem. For example, determining device 102 performs keyword identification and weighting on the target problem or the candidate problem, that is, it extracts the words that are important for question matching and assigns them different weights. Alternatively, determining device 102 performs structural analysis and weight adjustment: it analyzes the structure of the question sentence and identifies semantically redundant parts by means of semantic templates and word structures. Alternatively, determining device 102 performs semantic mapping: it introduces synonym resources and normalizes different words that express the same meaning. Or, determining device 102 performs problem type identification: it classifies the problem into one of several types, and the type participates in the similarity weight calculation as an important factor. Determining device 102 then determines the feature vectors of the target problem and the candidate problem according to one or more of these problem features.
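The feature extraction described above can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the synonym table, the list of redundant words, the weight values, and the function name `build_feature_vector` are hypothetical and do not come from the patent, which does not specify how the weights are computed.

```python
from collections import defaultdict

# Hypothetical resources; the source does not specify any of these values.
SYNONYMS = {"laptop": "notebook", "pc": "computer"}
REDUNDANT_WORDS = {"a", "the", "please", "to", "how", "my", "i", "do"}


def build_feature_vector(title, keyword_boost=None):
    """Map a problem title to a {normalized word: weight} feature vector."""
    keyword_boost = keyword_boost or {}
    vector = defaultdict(float)
    for token in title.lower().split():
        token = token.strip("?.,!")
        if not token:
            continue
        # Semantic mapping: normalize synonyms to one canonical word.
        word = SYNONYMS.get(token, token)
        # Structure-based adjustment: down-weight semantically redundant words.
        base = 0.2 if word in REDUNDANT_WORDS else 1.0
        # Keyword weighting: optionally boost words deemed important.
        vector[word] += base * keyword_boost.get(word, 1.0)
    return dict(vector)
```

With these assumed tables, `build_feature_vector("How to reset my laptop?")` maps "laptop" to the canonical word "notebook" and gives the redundant words a lower weight than "reset".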

Here, the problem features of the target problem and the candidate problem include, but are not limited to:

keyword features;

structure features;

semantic features;

problem type features.

Those skilled in the art will understand that the above problem features are only examples. Other existing or future problem features, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.

Computing device 103 calculates the similarity between the target problem and the candidate problem according to their feature vectors. Specifically, computing device 103 uses the feature vectors of the target problem and the candidate problem determined by determining device 102. For example, computing device 103 calculates the distance between the target problem and the candidate problem according to the following formula:

[formula not reproduced in the source text]

where Sim(S₁, S₂) denotes the similarity between the target problem and the candidate problem, and Wgt(w) denotes the weight of word w. The subscript 1k_t marks a word in the target problem, and the subscript 2k_j marks a word in the candidate problem. The numerator covers the words that co-occur in the target problem and the candidate problem: the more co-occurring words there are, or the higher their weights, the larger the numerator. The denominator is the sum of the weights of all words in the target problem and the candidate problem. SentType(S₁, S₂) denotes the similarity of the problem types of the target problem and the candidate problem: the more similar the problem types, the larger the value of SentType(S₁, S₂).

After calculating the distance between the target problem and the candidate problem, computing device 103 further determines their similarity from that distance. For example, computing device 103 directly uses the calculated distance value as the similarity between the target problem and the candidate problem; alternatively, it maps the calculated distance value to the similarity through some numerical conversion.
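Because the formula itself is not reproduced in this text, the following is only a sketch of one computation that fits the verbal description: the weights of co-occurring words over the weights of all words, scaled by a problem type similarity. The Dice-style form, the way the type factor is multiplied in, and the name `weighted_overlap` are assumptions, not the patent's formula.

```python
def weighted_overlap(target, candidate, type_similarity=1.0):
    """Weighted word-overlap similarity between two {word: weight} vectors.

    Assumed form: (weights of shared words) / (weights of all words),
    multiplied by a problem type similarity in [0, 1].
    """
    shared = set(target) & set(candidate)
    # Numerator: grows with the number and the weight of co-occurring words.
    numerator = sum(target[w] + candidate[w] for w in shared)
    # Denominator: total weight of all words in both problems.
    denominator = sum(target.values()) + sum(candidate.values())
    if denominator == 0:
        return 0.0
    return type_similarity * numerator / denominator
```

Under this assumed form, identical vectors score 1.0 when the type similarity is 1.0, and the score falls as fewer or lighter words are shared.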

Those skilled in the art will understand that the above ways of determining the similarity between the target problem and the candidate problem are only examples. Other existing or future ways of determining this similarity, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.

Judgment means 104 judges, according to the similarity and in combination with the fragmentation threshold, whether to cluster the target problem and the candidate problem into one class. For example, assume there is only one preset fragmentation threshold, with value A. If the similarity between the target problem and the candidate problem calculated by computing device 103 is greater than or equal to fragmentation threshold A, judgment means 104 judges that the target problem and the candidate problem are clustered into one class; if the similarity is less than fragmentation threshold A, the target problem and the candidate problem are not clustered into one class.

As another example, assume there are two preset fragmentation thresholds B and C, where the value of B is less than the value of C. Judgment means 104 then treats a target problem and candidate problem whose similarity is less than fragmentation threshold B as having low similarity and does not cluster them; treats a target problem and candidate problem whose similarity is greater than or equal to fragmentation threshold B and less than fragmentation threshold C as having medium similarity; and treats a target problem and candidate problem whose similarity is greater than or equal to fragmentation threshold C as having high similarity. Target problems and candidate problems with medium or high similarity undergo subsequent processing by clustering apparatus 1, as described in detail below.
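The two-threshold case can be written down directly. In this sketch the parameters `low_cut` and `high_cut` stand for thresholds B and C; the function name and the string labels are illustrative choices, not terms from the patent.

```python
def segment(similarity, low_cut, high_cut):
    """Classify a similarity value using two fragmentation thresholds (B < C)."""
    if low_cut >= high_cut:
        raise ValueError("low_cut (B) must be less than high_cut (C)")
    if similarity < low_cut:
        return "low"    # not clustered
    if similarity < high_cut:
        return "mid"    # passed on to subsequent matching
    return "high"       # passed on to similarity recalculation
```

Both boundaries are inclusive on the upper segment, matching the "greater than or equal to" wording above: `segment(0.4, 0.4, 0.8)` returns `"mid"` and `segment(0.8, 0.4, 0.8)` returns `"high"`.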

Here, the fragmentation thresholds are similarity thresholds that divide target problems and candidate problems into segments of different similarity, for example target problems and candidate problems with high similarity, with medium similarity, and with low similarity. The number and values of the fragmentation thresholds may be preset, fixed values, or may be adjusted dynamically according to how the target problems and candidate problems cluster.

Those skilled in the art will understand that the above ways of judging whether to cluster the target problem and the candidate problem into one class are only examples. Other existing or future ways of making this judgment, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.

Preferably, the devices of clustering apparatus 1 work continuously with one another. Specifically, acquisition device 101 obtains a target problem to be clustered and a candidate problem; determining device 102 determines the feature vectors of the target problem and the candidate problem according to their problem features; computing device 103 calculates the similarity between the target problem and the candidate problem according to their feature vectors; and judgment means 104 judges, according to the similarity and in combination with the fragmentation threshold, whether to cluster the target problem and the candidate problem into one class. Here, those skilled in the art will understand that "continuously" means that the devices of clustering apparatus 1 respectively perform the obtaining of target problems and candidate problems, the determination of feature vectors, the calculation of similarity, and the clustering judgment according to a set or real-time adjusted operating mode, until clustering apparatus 1 stops obtaining target problems to be clustered and candidate problems for a relatively long time.

Here, clustering apparatus 1 obtains a target problem to be clustered and a candidate problem, determines their feature vectors according to their problem features, calculates their similarity according to the feature vectors, and judges, according to the similarity and in combination with the fragmentation threshold, whether to cluster them into one class. Problems are thereby clustered efficiently and accurately, problem resources are consolidated, and the user's retrieval experience is improved.

Preferably, judgment means 104 determines, according to the similarity and in combination with the fragmentation threshold, a target problem and a candidate problem with medium similarity. Clustering apparatus 1 further includes a matching device (not shown), which judges, based on subsequent matching, whether to cluster the target problem and the candidate problem with medium similarity into one class. Specifically, judgment means 104 takes the similarity calculated by computing device 103 and treats a target problem and candidate problem whose similarity falls within the fragmentation thresholds corresponding to medium similarity as having medium similarity. The matching device then judges, based on subsequent matching such as problem type matching, keyword matching, or key expression matching, whether to cluster the target problem and the candidate problem with medium similarity into one class.

For example, for a target problem and a candidate problem judged to have medium similarity, the matching device further obtains the key expressions in the target problem and the candidate problem. If both the target problem and the candidate problem contain a given key expression, the matching device judges that they are clustered into one class; if only the target problem contains the key expression and the candidate problem does not, or vice versa, the matching device judges that the target problem and the candidate problem cannot be clustered into one class.

More preferably, the subsequent matching includes at least any one of the following:

problem type matching;

keyword matching;

key expression matching.

For example, the matching device judges, based on problem type matching, whether to cluster the target problem and the candidate problem with medium similarity into one class. Problem type matching mainly constrains two conditions: the problem content type and the number of question sentences. Here, the problem content type is obtained by classifying the content of the problem as question sentence (Q), non-question sentence (N), or descriptive sentence (D). Combined with the type of the problem title, this forms a composite marker of the form "problem title type + problem content type". The number of question sentences is counted separately for the problem title and for the problem content information.

During problem type matching, rule-based filtering is applied according to how well the problem types match and how many question sentences there are. Problem pairs whose types do not match are filtered out directly; problem pairs whose types match are further filtered according to the number of question sentences in the problem title and the total number of question sentences. Here, both the question type judgment and the question sentence count combine the results of processing the problem title and the problem content information.
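A minimal sketch of this filter follows. The text states that question sentence counts are used for filtering but not how, so the rule used here (counts may differ by at most `max_count_gap`) is an assumption, as are the class name `ProblemProfile`, its field names, and the example title type labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemProfile:
    title_type: str         # hypothetical label, e.g. "how-to"
    content_type: str       # "Q", "N" or "D", as defined in the text
    title_questions: int    # question sentences in the problem title
    content_questions: int  # question sentences in the problem content

    @property
    def marker(self):
        # Composite marker: "problem title type + problem content type".
        return f"{self.title_type}+{self.content_type}"

    @property
    def total_questions(self):
        return self.title_questions + self.content_questions


def types_match(a, b, max_count_gap=0):
    """Return True if the pair survives the problem type filter."""
    if a.marker != b.marker:
        return False  # mismatched types are filtered out directly
    # Assumed rule: title and total question counts must be close enough.
    if abs(a.title_questions - b.title_questions) > max_count_gap:
        return False
    return abs(a.total_questions - b.total_questions) <= max_count_gap
```

Setting `max_count_gap` to a larger value loosens the filter; the strict default of 0 simply requires equal counts.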

As another example, the matching device judges, based on keyword matching, whether to cluster the target problem and the candidate problem with medium similarity into one class. Two sentences with the same meaning should have certain important words that are identical or synonymous; keyword matching is based on this. Keywords are obtained by sorting the wordrank results in descending order of rank value and selecting the N (N ≥ 1) words with the highest rank values. Here, wordrank is a method for calculating word weights, and the rank value is the value it computes: the larger the rank value, the more critical the word. Different numbers of keywords are selected for problem titles of different lengths. For shorter problem titles, the keywords must match exactly; for longer problem titles, most of the keywords must match.
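The keyword rule above can be sketched as follows. The rank values are taken as given inputs, since the text does not describe how wordrank computes them. The length cutoff `short_limit`, the keyword counts `n_short` and `n_long`, and the ratio `min_ratio` that stands in for "most of the keywords" are all assumed values chosen for illustration.

```python
def top_keywords(rank, n):
    """Select the n words with the highest rank value (ties broken by word)."""
    ordered = sorted(rank.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ordered[:n]}


def keywords_match(rank_a, rank_b, short_limit=3, n_short=2, n_long=4,
                   min_ratio=0.75):
    """Compare two titles given as {word: rank value} mappings."""
    # Assumed length measure: number of ranked words in the longer title.
    is_short = max(len(rank_a), len(rank_b)) <= short_limit
    n = n_short if is_short else n_long
    keys_a, keys_b = top_keywords(rank_a, n), top_keywords(rank_b, n)
    if is_short:
        return keys_a == keys_b  # short titles: keywords must match exactly
    # Long titles: most keywords must match.
    overlap = len(keys_a & keys_b)
    return overlap / max(len(keys_a), len(keys_b)) >= min_ratio
```

Synonym handling is left out of this sketch; in practice the words would be normalized before comparison so that synonymous keywords count as equal.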

As yet another example, the matching device judges, based on key expression matching, whether to cluster the target problem and the candidate problem with medium similarity into one class. Certain important expressions should appear in both problems; if one appears in one problem but not in the other, the two problems cannot be considered semantically consistent. Here, key expressions include, but are not limited to:

1) certain types of named entities, for example place names and novel titles;

2) expressions in closed form, for example character strings enclosed in book-title marks or quotation marks;

3) time expressions, for example year X, month X, or weekday X;

4) quantity expressions, including specific quantities and grades;

5) alphanumeric strings, mainly the mathematical expressions in mathematics problems.

Before key expression matching, the special expressions in the target problem and the candidate problem must be identified. Matching is bidirectional: the candidate problem must contain the special expressions of the target problem, and at the same time must not contain any additional special expressions.
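Bidirectional matching amounts to requiring that both problems contain the same set of key expressions. The sketch below illustrates this with three simple regular expressions covering only part of the list above (book-title marks, double-quoted strings, and numbers); the patterns and function names are assumptions, and named-entity recognition is out of scope for this example.

```python
import re

# Illustrative patterns only; a real system would need a much richer set.
PATTERNS = [
    re.compile(r"《[^》]+》"),      # strings enclosed in book-title marks
    re.compile(r'"[^"]+"'),        # strings enclosed in double quotes
    re.compile(r"\d+(?:\.\d+)?"),  # numbers (quantities, dates, math terms)
]


def key_expressions(text):
    """Collect the set of key expressions found in a problem text."""
    found = set()
    for pattern in PATTERNS:
        found.update(pattern.findall(text))
    return found


def key_expressions_match(target, candidate):
    """Bidirectional match: neither side may have an expression the other lacks."""
    return key_expressions(target) == key_expressions(candidate)
```

Set equality captures both directions at once: a missing expression and an extra expression each make the sets differ.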

Preferably, the matching device judges, based on any combination of the above subsequent matching processes, whether to cluster the target problem and the candidate problem with medium similarity into one class.

Those skilled in the art will understand that the above subsequent matching processes are only examples. Other existing or future subsequent matching processes, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.

Because a stricter question semantic consistency judgment screens out many problems that are in fact semantically consistent, a strategy for expanding the recall of semantically consistent problems is added here. Clustering apparatus 1 relaxes the threshold used for judging question semantic consistency and adds further filter-matching steps, such as filtering by problem type, by keyword, and by key expression. More semantically consistent problems can thus be found, which further improves the efficiency and accuracy of problem clustering and the user's retrieval experience.

Preferably, judgment means 104 determines, according to the similarity and in combination with the fragmentation threshold, a target problem and a candidate problem with high similarity. Clustering apparatus 1 further includes a re-computation device (not shown), which recalculates the similarity between the target problem and the candidate problem based on their problem content information and supplementary content information, in order to judge whether to cluster the target problem and the candidate problem with high similarity into one class.

Specifically, judgment means 104 takes the similarity calculated by computing device 103 and treats a target problem and candidate problem whose similarity meets the fragmentation threshold corresponding to high similarity as having high similarity. The re-computation device then obtains the problem content information and supplementary content information of the target problem or the candidate problem from the problem base, for example by interacting with the problem base. Alternatively, the re-computation device interacts directly with the user equipment, for example by calling an application programming interface (API) provided by the user equipment one or more times, or through dynamic web page technologies such as ASP, JSP, or PHP, and obtains the problem content information, supplementary content information, and the like of a problem input by the user. Based on the problem content information and supplementary content information of the target problem and the candidate problem, it then recalculates their similarity, in order to judge whether to cluster the target problem and the candidate problem with high similarity into one class.

For example, when the recalculated similarity between the target problem and the candidate problem is greater than or equal to a preset similarity threshold, the target problem and the candidate problem are judged to be clustered into one class; when the recalculated similarity is less than the preset similarity threshold, the target problem and the candidate problem are judged not to be clustered into one class.

Here, the preset similarity threshold is the threshold against which the similarity between the target problem and the candidate problem is compared to judge whether to cluster them into one class; its value is preset.

Here, the problem content information is the specific description of the target problem or the candidate problem that the user gave when first raising it; the supplementary content information is, for example, information that the user added some time after raising the target problem or the candidate problem.

More preferably, the re-computation device further performs problem type matching on the target problem and the candidate problem with high similarity, in order to judge whether to cluster them into one class. Specifically, after judgment means 104 determines, according to the similarity and in combination with the fragmentation threshold, a target problem and a candidate problem with high similarity, the re-computation device performs problem type matching on them. Alternatively, after the re-computation device recalculates the similarity between the target problem and the candidate problem, it performs problem type matching on the target problem and the candidate problem with high similarity. For example, when the problem types of the target problem and the candidate problem with high similarity match, they are judged to be clustered into one class; when their problem types do not match, they are judged not to be clustered into one class.

Here, this problem type matching is similar to the problem type matching described above, so it is not repeated here and is incorporated herein by reference.

Here, clustering apparatus 1 additionally uses the problem content information and supplementary content information of the target problem or the candidate problem to calculate the similarity between the target problem and the candidate problem, in order to judge whether to cluster them into one class. Clustering apparatus 1 clusters problems based on the semantic consistency of the problem description: by analyzing the title and the specific content of a problem together, it clusters problems, further consolidates problem resources, and improves the user's retrieval experience.

Fig. 2 shows a schematic diagram of an apparatus for clustering problems according to a preferred embodiment of the present invention. Clustering apparatus 1 further includes filter device 205. The preferred embodiment is described in detail below with reference to Fig. 2. Specifically, acquisition device 201 obtains a target problem to be clustered and a candidate problem; filter device 205 performs pre-processing filtering on the target problem to be clustered and the candidate problem, and obtains the pre-filtered target problem and candidate problem; determining device 202 determines the feature vectors of the target problem and the candidate problem according to the problem features of the pre-filtered target problem and candidate problem; computing device 203 calculates the similarity between the target problem and the candidate problem according to their feature vectors; and judgment means 204 judges, according to the similarity and in combination with the fragmentation threshold, whether to cluster the target problem and the candidate problem into one class. Acquisition device 201, computing device 203, and judgment means 204 are identical or essentially identical to the corresponding devices shown in Fig. 1, so they are not described again here and are incorporated herein by reference.

In this embodiment, filter device 205 performs pre-processing filtering on the target problem to be clustered and the candidate problem to obtain the pre-filtered target problem and candidate problem; determining device 202 then determines the feature vectors of the target problem and the candidate problem according to the problem features of the pre-filtered target problem and candidate problem.

Specifically, filter device 205 performs pre-processing filtering on the target problem to be clustered and the candidate problem obtained by acquisition device 201, for example, according to the application scenario, filtering out irrelevant problems, filtering out strongly time-sensitive problems, or filtering out problems whose clustering has already been completed, and thereby obtains the pre-filtered target problem and candidate problem. Then, determining device 202 determines the feature vectors of the target problem and the candidate problem according to the problem features of the pre-filtered target problem and candidate problem, such as the keyword features, structure features, semantic features and problem type features in the title of the pre-filtered target problem or candidate problem.
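The pre-processing filtering described above can be illustrated with a minimal Python sketch. The `Problem` record and its boolean flags (`irrelevant`, `time_sensitive`, `clustered`) are hypothetical names introduced only for this illustration; how each condition is actually detected is not specified in the text.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """Hypothetical problem record; the flags stand in for checks the text leaves open."""
    pid: int
    title: str
    irrelevant: bool = False      # unrelated to the application scenario
    time_sensitive: bool = False  # strongly time-sensitive
    clustered: bool = False       # clustering already completed


def prefilter(problems):
    """Return the problems that pass pre-processing filtering, in their original order."""
    return [
        p for p in problems
        if not (p.irrelevant or p.time_sensitive or p.clustered)
    ]


problems = [
    Problem(1, "how to repair a laptop screen"),
    Problem(2, "today's train schedule", time_sensitive=True),
    Problem(3, "laptop screen repair cost", clustered=True),
]
kept_ids = [p.pid for p in prefilter(problems)]
```

Only the problems that survive this step would be passed on to feature extraction.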

Preferably, the target problem includes a newly added problem; judging device 204 determines the candidate problem having the maximum similarity with the newly added problem and, based on the maximum similarity and in combination with a predetermined threshold, judges whether to cluster the newly added problem and that candidate problem into one class. Because newly added problems appear continuously, they are clustered incrementally. Incremental clustering means that, for a newly added problem to be clustered, either a new cluster is created or the newly added problem is merged into an existing cluster. Incremental clustering does not change the original cluster structure; that is, a newly added problem has no effect on the labels of problems that have already been clustered.

Specifically, for the newly added problem and the candidate problems obtained by acquisition device 201, determining device 202 determines their feature vectors according to the problem features of the newly added target problem and the candidate problems; computing device 203 calculates the similarity between the target problem and each candidate problem according to these feature vectors; judging device 204 selects the maximum similarity from the similarities calculated by computing device 203, thereby determining the candidate problem having the maximum similarity with the newly added problem. Judging device 204 then judges, based on the maximum similarity and in combination with the predetermined threshold, whether to cluster the newly added problem and that candidate problem into one class: when the maximum similarity is greater than or equal to the predetermined threshold, the newly added problem and the candidate problem corresponding to the maximum similarity are clustered into one class, that is, the newly added problem is merged into the cluster of that candidate problem; when the maximum similarity is less than the predetermined threshold, the newly added problem and the candidate problem corresponding to the maximum similarity are not clustered into one class.

Here, the predetermined threshold is a preset similarity threshold used to judge whether the newly added problem and the candidate problem having the maximum similarity with it are clustered into one class.

Preferably, clustering apparatus 1 performs incremental clustering according to the chronological order in which the newly added problems were raised, that is, a newly added problem raised earlier is clustered first; clustering apparatus 1 processes all newly added problems serially, in order of the time at which they were raised. More preferably, clustering apparatus 1 performs incremental clustering on the newly added problems periodically. More preferably, clustering apparatus 1 performs pre-processing filtering on the newly added problems, filtering out newly added problems that have been deleted or that have already undergone incremental clustering.

Here, by using incremental clustering, clustering apparatus 1 handles the clustering of large numbers of problems added in real time, further consolidates problem resources, and improves the user's retrieval experience.

Preferably, clustering apparatus 1 further includes a creating device (not shown); if the maximum similarity is less than the predetermined threshold, the creating device creates a new class for the newly added problem. Specifically, for a newly added problem, computing device 203 calculates the similarity between the newly added problem and each existing candidate problem; when judging device 204 determines the maximum similarity among them, if even the maximum similarity is less than the predetermined threshold, the newly added problem cannot be clustered into one class with any existing candidate problem, and the creating device creates a new class for the newly added problem.
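The incremental procedure described above (find the most similar existing problem, merge if the maximum similarity reaches the predetermined threshold, otherwise create a new class) can be sketched as follows. This is a minimal illustration under stated assumptions: feature vectors are represented as `{feature: weight}` dictionaries, cosine similarity stands in for the similarity measure (the text does not fix one), and the names `cosine` and `assign_incremental` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts (assumed measure)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_incremental(new_vec, clusters, threshold, similarity=cosine):
    """Merge new_vec into the cluster of its most similar candidate, or open a new cluster.

    clusters maps a cluster id to the list of member vectors; existing members
    are never moved, so labels assigned earlier stay unchanged.
    """
    best_id, best_sim = None, float("-inf")
    for cid, members in clusters.items():
        for member in members:
            sim = similarity(new_vec, member)
            if sim > best_sim:
                best_id, best_sim = cid, sim
    if best_id is not None and best_sim >= threshold:
        clusters[best_id].append(new_vec)
        return best_id
    new_id = max(clusters, default=-1) + 1  # no candidate is close enough
    clusters[new_id] = [new_vec]
    return new_id


clusters = {0: [{"laptop": 1.0, "repair": 1.0}]}
first = assign_incremental({"laptop": 2.0, "repair": 2.0}, clusters, threshold=0.8)
second = assign_incremental({"visa": 1.0}, clusters, threshold=0.8)
```

To follow the serial, time-ordered processing, a caller would sort the new problems by the time they were raised and call `assign_incremental` once per problem.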

Fig. 3 shows a flow chart of a method for clustering problems according to another aspect of the present invention.

In step S301, clustering apparatus 1 obtains a target problem to be clustered and a candidate problem. Specifically, in step S301, clustering apparatus 1 obtains the target problem to be clustered and the candidate problem from a problem base, for example, through interaction with the problem base; alternatively, in step S301, clustering apparatus 1 interacts directly with user equipment, for example, by calling one or more times an application programming interface (API) provided by the user equipment, or by means of dynamic web page technologies such as ASP, JSP or PHP, to obtain a problem entered by a user and use it as the target problem or the candidate problem.

In step S302, clustering apparatus 1 determines the feature vectors of the target problem and the candidate problem according to the problem features of the target problem and the candidate problem. Specifically, in step S302, clustering apparatus 1 determines these feature vectors according to problem features such as the keyword features, structure features, semantic features and problem type features in the title of the target problem or the candidate problem. For example, in step S302, clustering apparatus 1 performs keyword identification and weighting on the target problem or the candidate problem, that is, it extracts the words that are important for problem matching and assigns them different weights; or, in step S302, clustering apparatus 1 performs structural analysis and weight adjustment on the target problem or the candidate problem, that is, it analyses the structure of the question sentence and identifies semantically redundant components by means of semantic templates and word structure; or, in step S302, clustering apparatus 1 performs semantic mapping on the target problem or the candidate problem, introducing synonym resources to normalize words that express the same meaning with different terms; or, in step S302, clustering apparatus 1 performs problem type identification on the target problem or the candidate problem, classifying it into one of several types and using the type as an important factor in the similarity weight calculation. Clustering apparatus 1 then determines the feature vectors of the target problem and the candidate problem according to one or more of the above problem features.

Preferably, the problem features include at least any one of the following:

Keyword feature;

Structure feature;

Semantic feature;

Problem type feature.
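As an illustration of how several of these features might be combined into one feature vector, the following Python sketch applies synonym normalization (semantic feature), per-word weights (keyword feature) and a type marker (problem type feature). The tables `SYNONYMS` and `WEIGHTS`, the whitespace tokenization and the `__type__:` key are all assumptions made for this example; structure features are omitted for brevity.

```python
# Hypothetical resources; a real system would load these from curated data.
SYNONYMS = {"notebook": "laptop", "fix": "repair"}
WEIGHTS = {"laptop": 2.0, "repair": 1.5}
DEFAULT_WEIGHT = 1.0


def feature_vector(title, problem_type):
    """Build a sparse {feature: weight} vector from a problem title and its type."""
    vec = {}
    for token in title.lower().split():
        word = SYNONYMS.get(token, token)  # map different wordings to one term
        vec[word] = vec.get(word, 0.0) + WEIGHTS.get(word, DEFAULT_WEIGHT)
    vec["__type__:" + problem_type] = 1.0  # the problem type takes part in similarity
    return vec


a = feature_vector("Fix notebook screen", "how")
b = feature_vector("repair laptop screen", "how")
```

With these tables the two titles map to the same vector, which is the effect that normalizing synonyms is meant to achieve.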

In step S303, clustering apparatus 1 calculates the similarity between the target problem and the candidate problem according to their feature vectors. Specifically, in step S303, clustering apparatus 1 calculates this similarity according to the feature vectors of the target problem and the candidate problem determined in step S302; for example, in step S303, clustering apparatus 1 calculates the distance between the target problem and the candidate problem according to the following formula:

After the distance between the target problem and the candidate problem has been calculated, in step S303 clustering apparatus 1 further determines the similarity between the target problem and the candidate problem according to that distance; for example, clustering apparatus 1 directly uses the calculated distance value as the similarity between the target problem and the candidate problem, or maps the calculated distance value to the similarity through a certain numerical conversion.
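The mapping from a distance value to a similarity can be illustrated with a short Python sketch. Both choices below are assumptions made only for illustration: Euclidean distance over sparse `{feature: weight}` vectors stands in for the distance formula, which is not reproduced here, and the conversion `1 / (1 + d)` is one common way to turn a non-negative distance into a similarity in (0, 1].

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two sparse {feature: weight} vectors (assumed metric)."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


def distance_to_similarity(distance):
    """Map a non-negative distance to a similarity in (0, 1]; smaller distance, larger similarity."""
    return 1.0 / (1.0 + distance)


d = euclidean_distance({"laptop": 3.0}, {"repair": 4.0})
s = distance_to_similarity(d)
```

Any monotonically decreasing conversion would serve the same purpose; the thresholds used afterwards must be chosen on the same scale.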

In step S304, clustering apparatus 1 judges, according to the similarity and in combination with the fragmentation threshold, whether to cluster the target problem and the candidate problem into one class. For example, assume there is only one preset fragmentation threshold, with value A. If the similarity between the target problem and the candidate problem calculated by clustering apparatus 1 in step S303 is greater than or equal to fragmentation threshold A, then in step S304 clustering apparatus 1 judges that the target problem and the candidate problem are clustered into one class; if the similarity is less than fragmentation threshold A, the target problem and the candidate problem are not clustered into one class.

As another example, assume there are two preset fragmentation thresholds B and C, where the value of B is less than the value of C. Then, in step S304, clustering apparatus 1 determines a target problem and candidate problem whose similarity is less than fragmentation threshold B to have low similarity and does not cluster them; it determines a target problem and candidate problem whose similarity is greater than or equal to fragmentation threshold B and less than fragmentation threshold C to have middle similarity; and it determines a target problem and candidate problem whose similarity is greater than or equal to fragmentation threshold C to have high similarity. Clustering apparatus 1 performs subsequent processing on target problems and candidate problems determined to have middle similarity or high similarity, as described in detail below.
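The two-threshold case can be written down directly. In this minimal sketch the function name `similarity_band` and the string labels are illustrative; the comparison rules (below B is low, from B up to but excluding C is middle, C and above is high) follow the paragraph above.

```python
def similarity_band(similarity, b, c):
    """Classify a similarity value using two thresholds b < c."""
    if b >= c:
        raise ValueError("threshold b must be smaller than threshold c")
    if similarity < b:
        return "low"      # not clustered
    if similarity < c:
        return "middle"   # passed on to subsequent matching
    return "high"         # passed on to re-computation with content information


bands = [similarity_band(s, 0.5, 0.8) for s in (0.3, 0.5, 0.79, 0.8)]
```

The values 0.5 and 0.8 are placeholders; the text only states that the thresholds are preset.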

Preferably, the steps of clustering apparatus 1 operate continuously. Specifically, in step S301, clustering apparatus 1 obtains a target problem to be clustered and a candidate problem; in step S302, clustering apparatus 1 determines the feature vectors of the target problem and the candidate problem according to the problem features of the target problem and the candidate problem; in step S303, clustering apparatus 1 calculates the similarity between the target problem and the candidate problem according to their feature vectors; in step S304, clustering apparatus 1 judges, according to the similarity and in combination with the fragmentation threshold, whether to cluster the target problem and the candidate problem into one class. Here, those skilled in the art will understand that "continuously" means that the steps of clustering apparatus 1 respectively carry out the obtaining of target problems and candidate problems, the determination of feature vectors, the calculation of similarity and the clustering judgment according to a set or real-time adjusted operating mode, until clustering apparatus 1 stops obtaining target problems to be clustered and candidate problems for a long time.

Preferably, in step S304, clustering apparatus 1 determines, according to the similarity and in combination with the fragmentation threshold, the target problem and candidate problem having middle similarity; the method further includes step S306 (not shown), in which clustering apparatus 1 judges, based on subsequent matching processing, whether to cluster the target problem and candidate problem having middle similarity into one class. Specifically, in step S304, according to the similarity calculated in step S303, clustering apparatus 1 determines a target problem and candidate problem whose similarity satisfies the fragmentation threshold corresponding to middle similarity as having middle similarity; then, in step S306, clustering apparatus 1 judges, based on subsequent matching processing such as problem type matching, keyword matching or key expression matching, whether to cluster the target problem and candidate problem having middle similarity into one class.

For example, for a target problem and candidate problem determined to have middle similarity, in step S306 clustering apparatus 1 further obtains the key expressions in the target problem and the candidate problem. If the target problem and the candidate problem both contain a certain key expression, clustering apparatus 1 judges in step S306 that the target problem and the candidate problem are clustered into one class; if only the target problem contains the key expression and the candidate problem does not, or vice versa, clustering apparatus 1 judges in step S306 that the target problem and the candidate problem cannot be clustered into one class.

Preferably, the subsequent matching processing includes at least any one of the following:

Problem type matching;

Keyword matching;

Key expression matching.

For example, in step S306, clustering apparatus 1 judges, based on problem type matching, whether to cluster the target problem and candidate problem having middle similarity into one class. When clustering apparatus 1 performs problem type matching in step S306, the subsequent matching is carried out mainly by constraining two conditions: the problem content type and the number of question sentences. Here, the problem content type refers to classifying the content of a problem as question sentence (Q), non-question sentence (N) or descriptive sentence (D). Combined with the type of the problem title, this forms a composite label of the form "problem title type + problem content type". The number of question sentences is counted separately for the problem title and for the problem content information.
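A possible data structure for this check is sketched below in Python. The text names the two conditions (the composite label and the question-sentence counts) but not the exact comparison rule, so requiring both to be equal is an assumption of this sketch; `TypeSignature` and `types_match` are hypothetical names, and the title type values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeSignature:
    """Type information of one problem (illustrative structure)."""
    title_type: str         # type of the problem title, e.g. "how"
    content_type: str       # "Q" question, "N" non-question, "D" descriptive
    title_questions: int    # number of question sentences in the title
    content_questions: int  # number of question sentences in the content

    @property
    def marker(self):
        """Composite label: problem title type + problem content type."""
        return f"{self.title_type}+{self.content_type}"


def types_match(a, b):
    """Assumed rule: same composite label and same question-sentence counts."""
    return (
        a.marker == b.marker
        and a.title_questions == b.title_questions
        and a.content_questions == b.content_questions
    )


p1 = TypeSignature("how", "D", 1, 0)
p2 = TypeSignature("how", "D", 1, 0)
p3 = TypeSignature("how", "Q", 1, 2)
```

A looser rule, for example tolerating a difference of one question sentence in the content, would fit the same structure.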

As another example, in step S306, clustering apparatus 1 judges, based on keyword matching, whether to cluster the target problem and candidate problem having middle similarity into one class. Two sentences with the same semantics should have certain important words that are identical or synonymous; keyword matching is based on this observation. The keywords are the N (N ≥ 1) words with the highest rank values, obtained by sorting the wordrank results in descending order of rank value. Here, wordrank is a method for calculating word weights, and the rank value is the value calculated by wordrank; the larger the rank value, the more critical the word. Different numbers of keywords are selected for problem titles of different lengths. For shorter problem titles, the keywords must match exactly; for longer problem titles, most of the keywords must match.
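The keyword selection and matching rule can be sketched as follows. The sketch assumes that the rank values are already available as a `{word: rank}` dictionary for each title (the wordrank computation itself is not shown), and the concrete numbers (`short_len`, `n_short`, `n_long`, `min_overlap`) are illustrative parameters, not values given in the text.

```python
def top_keywords(ranks, n):
    """Return the n words with the highest rank values (ties broken alphabetically)."""
    ordered = sorted(ranks.items(), key=lambda item: (-item[1], item[0]))
    return {word for word, _ in ordered[:n]}


def keywords_match(ranks_a, ranks_b, short_len=5, n_short=2, n_long=4, min_overlap=0.6):
    """Exact keyword match for short titles, majority overlap for long titles."""
    is_short = max(len(ranks_a), len(ranks_b)) <= short_len
    n = n_short if is_short else n_long
    keys_a, keys_b = top_keywords(ranks_a, n), top_keywords(ranks_b, n)
    if not keys_a or not keys_b:
        return False
    if is_short:
        return keys_a == keys_b
    overlap = len(keys_a & keys_b) / max(len(keys_a), len(keys_b))
    return overlap >= min_overlap


short_a = {"laptop": 2.0, "repair": 1.5, "screen": 1.0}
short_b = {"laptop": 2.0, "repair": 1.5, "cost": 0.5}
short_c = {"phone": 2.0, "repair": 1.5}
long_a = {"a": 6.0, "b": 5.0, "c": 4.0, "d": 3.0, "e": 2.0, "f": 1.0}
long_b = {"a": 6.0, "b": 5.0, "c": 4.0, "x": 3.0, "y": 2.0, "z": 1.0}
```

Here title length is approximated by the number of ranked words; synonym normalization, if used, would be applied before ranking.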

As another example, in step S306, clustering apparatus 1 judges, based on key expression matching, whether to cluster the target problem and candidate problem having middle similarity into one class. Certain important expressions should be present in both problems; if such an expression appears in one problem but not in the other, the two problems cannot be regarded as semantically consistent. Here, key expressions include but are not limited to:

3) temporal expressions, for example, times such as X, month X, or week X;

4) quantitative expressions, including specific quantities and grades;
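Key expression matching for temporal and quantitative expressions can be sketched with regular expressions. The patterns below are hypothetical English examples chosen only for illustration (real patterns depend on the language and domain of the problems), and treating two problems as compatible when neither contains any key expression is a design choice of this sketch.

```python
import re

# Hypothetical patterns: four-digit years and weekday names as temporal expressions,
# numbers with a unit and "level N" / "grade N" as quantitative expressions.
TIME_RE = re.compile(
    r"\b(?:(?:19|20)\d{2}|monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b",
    re.IGNORECASE,
)
QUANTITY_RE = re.compile(
    r"\b(?:\d+(?:\.\d+)?\s?(?:kg|km|gb)|(?:level|grade)\s?\d+)\b",
    re.IGNORECASE,
)


def key_expressions(text):
    """Collect normalized temporal and quantitative expressions found in text."""
    found = set()
    for pattern in (TIME_RE, QUANTITY_RE):
        for match in pattern.finditer(text):
            found.add(re.sub(r"\s+", "", match.group(0).lower()))
    return found


def key_expressions_match(text_a, text_b):
    """True when both texts carry exactly the same key expressions (possibly none)."""
    return key_expressions(text_a) == key_expressions(text_b)


same = key_expressions_match(
    "How to pass the level 4 exam in 2013", "2013 level 4 exam tips"
)
different = key_expressions_match(
    "How to pass the level 4 exam in 2013", "How to pass the level 6 exam in 2013"
)
```

An expression present in only one of the two problems makes the sets differ, so the pair is rejected, as the text requires.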

Preferably, in step S306, clustering apparatus 1 judges, based on any several of the above subsequent matching processes, whether to cluster the target problem and candidate problem having middle similarity into one class.

Preferably, in step S304, clustering apparatus 1 determines, according to the similarity and in combination with the fragmentation threshold, the target problem and candidate problem having high similarity; the method further includes step S307 (not shown), in which clustering apparatus 1 recalculates the similarity between the target problem and the candidate problem based on the problem content information and augmented content information of the target problem and the candidate problem, so as to judge whether to cluster the target problem and candidate problem having high similarity into one class.

Specifically, in step S304, according to the similarity calculated in step S303, clustering apparatus 1 determines a target problem and candidate problem whose similarity satisfies the fragmentation threshold corresponding to high similarity as having high similarity. Then, in step S307, clustering apparatus 1 obtains the problem content information and augmented content information of the target problem or the candidate problem from a problem base, for example, through interaction with the problem base; alternatively, in step S307, clustering apparatus 1 interacts directly with user equipment, for example, by calling one or more times an application programming interface (API) provided by the user equipment, or by means of dynamic web page technologies such as ASP, JSP or PHP, to obtain the problem content information or augmented content information entered by the user. Clustering apparatus 1 then recalculates the similarity between the target problem and the candidate problem based on the problem content information and augmented content information of the target problem and the candidate problem, so as to judge whether to cluster the target problem and candidate problem having high similarity into one class.

More preferably, in step S307, clustering apparatus 1 further performs problem type matching on the target problem and candidate problem having high similarity, so as to judge whether to cluster them into one class. Specifically, after clustering apparatus 1 has determined in step S304, according to the similarity and in combination with the fragmentation threshold, the target problem and candidate problem having high similarity, in step S307 it further performs problem type matching on them; alternatively, after recalculating the similarity between the target problem and the candidate problem in step S307, clustering apparatus 1 further performs problem type matching on the target problem and candidate problem having high similarity, so as to judge whether to cluster them into one class. For example, when the problem types of the target problem and candidate problem having high similarity match, it is judged that the target problem and the candidate problem are clustered into one class; when their problem types do not match, it is judged that the target problem and the candidate problem are not clustered into one class.
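The handling of a high-similarity pair (recalculate the similarity with the content information included, then check the problem types) can be sketched as below. Several details are assumptions of this sketch: problems are plain dictionaries with hypothetical keys `title`, `content`, `extra` and `type`; Jaccard similarity over word sets stands in for the recalculated similarity; and the recalculated value is compared against an illustrative threshold, which the text does not specify.

```python
def jaccard(text_a, text_b):
    """Jaccard similarity of the word sets of two texts (assumed measure)."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def full_text(problem):
    """Join title, content and supplementary content of a problem."""
    parts = (problem["title"], problem.get("content", ""), problem.get("extra", ""))
    return " ".join(parts)


def confirm_high_similarity(target, candidate, threshold, similarity=jaccard):
    """Re-check a high-similarity pair on the full text, then require matching types."""
    recomputed = similarity(full_text(target), full_text(candidate))
    return recomputed >= threshold and target["type"] == candidate["type"]


target = {"title": "repair laptop screen", "content": "screen cracked", "type": "how"}
close = {"title": "repair laptop screen", "content": "screen cracked badly", "type": "how"}
other_type = dict(close, type="why")
accepted = confirm_high_similarity(target, close, threshold=0.7)
rejected = confirm_high_similarity(target, other_type, threshold=0.7)
```

Comparing a single type label is a simplification; a fuller type check could also compare content types and question-sentence counts.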

Fig. 4 shows a flow chart of a method for clustering problems according to a preferred embodiment of the present invention. The preferred embodiment is described in detail below with reference to Fig. 4. Specifically, in step S401, clustering apparatus 1 obtains a target problem to be clustered and a candidate problem; in step S405, clustering apparatus 1 performs pre-processing filtering on the target problem to be clustered and the candidate problem to obtain the pre-filtered target problem and candidate problem; in step S402, clustering apparatus 1 determines the feature vectors of the target problem and the candidate problem according to the problem features of the pre-filtered target problem and candidate problem; in step S403, clustering apparatus 1 calculates the similarity between the target problem and the candidate problem according to their feature vectors; in step S404, clustering apparatus 1 judges, according to the similarity and in combination with the fragmentation threshold, whether to cluster the target problem and the candidate problem into one class. Steps S401, S403 and S404 are identical or substantially identical to the corresponding steps shown in Fig. 3, so they are not described again here and are incorporated herein by reference.

In step S405, clustering apparatus 1 performs pre-processing filtering on the target problem to be clustered and the candidate problem to obtain the pre-filtered target problem and candidate problem; then, in step S402, clustering apparatus 1 determines the feature vectors of the target problem and the candidate problem according to the problem features of the pre-filtered target problem and candidate problem.

Specifically, in step S405, clustering apparatus 1 performs pre-processing filtering on the target problem to be clustered and the candidate problem obtained in step S401, for example, according to the application scenario, filtering out irrelevant problems, filtering out strongly time-sensitive problems, or filtering out problems whose clustering has already been completed, and thereby obtains the pre-filtered target problem and candidate problem. Then, in step S402, clustering apparatus 1 determines the feature vectors of the target problem and the candidate problem according to the problem features of the pre-filtered target problem and candidate problem, such as the keyword features, structure features, semantic features and problem type features in the title of the pre-filtered target problem or candidate problem.

Preferably, the target problem includes a newly added problem; in step S404, clustering apparatus 1 determines the candidate problem having the maximum similarity with the newly added problem and, based on the maximum similarity and in combination with a predetermined threshold, judges whether to cluster the newly added problem and that candidate problem into one class. Because newly added problems appear continuously, they are clustered incrementally. Incremental clustering means that, for a newly added problem to be clustered, either a new cluster is created or the newly added problem is merged into an existing cluster. Incremental clustering does not change the original cluster structure; that is, a newly added problem has no effect on the labels of problems that have already been clustered.

Specifically, for the newly added problem and the candidate problems obtained in step S401, in step S402 clustering apparatus 1 determines their feature vectors according to the problem features of the newly added target problem and the candidate problems; in step S403, clustering apparatus 1 calculates the similarity between the target problem and each candidate problem according to these feature vectors; in step S404, clustering apparatus 1 selects the maximum similarity from the similarities calculated in step S403, thereby determining the candidate problem having the maximum similarity with the newly added problem. Then, in step S404, clustering apparatus 1 judges, based on the maximum similarity and in combination with the predetermined threshold, whether to cluster the newly added problem and that candidate problem into one class: when the maximum similarity is greater than or equal to the predetermined threshold, the newly added problem and the candidate problem corresponding to the maximum similarity are clustered into one class, that is, the newly added problem is merged into the cluster of that candidate problem; when the maximum similarity is less than the predetermined threshold, the newly added problem and the candidate problem corresponding to the maximum similarity are not clustered into one class.

Preferably, the method further includes step S408 (not shown); if the maximum similarity is less than the predetermined threshold, then in step S408 clustering apparatus 1 creates a new class for the newly added problem. Specifically, for a newly added problem, in step S403 clustering apparatus 1 calculates the similarity between the newly added problem and each existing candidate problem; when clustering apparatus 1 determines the maximum similarity among them in step S404, if even the maximum similarity is less than the predetermined threshold, the newly added problem cannot be clustered into one class with any existing candidate problem, and in step S408 clustering apparatus 1 creates a new class for the newly added problem.

It should be noted that the present invention can be implemented in software and/or in a combination of software and hardware; for example, it can be implemented using an application-specific integrated circuit (ASIC), a general-purpose computer or any other similar hardware device. In one embodiment, the software program of the present invention can be executed by a processor to implement the steps or functions described above. Likewise, the software program of the present invention (including related data structures) can be stored in a computer-readable recording medium, for example, RAM memory, a magnetic or optical drive, a floppy disk or similar devices. In addition, some steps or functions of the present invention may be implemented in hardware, for example, as a circuit that cooperates with a processor to perform each step or function.

In addition, part of the present invention can be embodied as a computer program product, for example, computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present invention through the operation of that computer. The program instructions that invoke the method of the present invention may be stored in a fixed or removable recording medium, and/or transmitted by means of a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, one embodiment of the present invention includes an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the apparatus is triggered to run the methods and/or technical solutions based on the foregoing embodiments of the present invention.

It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential characteristics. Therefore, from every point of view, the embodiments are to be regarded as illustrative and not restrictive; the scope of the present invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are therefore intended to be embraced in the present invention. No reference sign in the claims should be construed as limiting the claim concerned. Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or devices recited in an apparatus claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.

Claims

1. A method for clustering problems, wherein the method comprises the following steps:

a. obtaining a target problem to be clustered and a candidate problem;

b. determining the feature vectors of the target problem and the candidate problem according to the problem features of the target problem and the candidate problem;

c. calculating the similarity between the target problem and the candidate problem according to the feature vectors of the target problem and the candidate problem;

d. determining, according to the similarity and in combination with a fragmentation threshold, the target problem and candidate problem having low similarity, middle similarity or high similarity;

wherein the method further comprises:

judging, based on subsequent matching processing, whether to cluster the target problem and candidate problem having middle similarity into one class;

x. recalculating, based on the problem content information and augmented content information of the target problem and the candidate problem, the similarity between the target problem and the candidate problem, so as to judge whether to cluster the target problem and candidate problem having high similarity into one class.

2. The method according to claim 1, wherein the problem features include at least any one of the following:

Keyword feature;

Structure feature;

Semantic feature;

Problem type feature.

3. The method according to claim 1 or 2, wherein the subsequent matching processing includes at least any one of the following:

Problem type matching;

Keyword matching;

Key expression matching.

4. The method according to claim 1 or 2, wherein step x further comprises:

performing problem type matching on the target problem and candidate problem having high similarity, so as to judge whether to cluster the target problem and candidate problem having high similarity into one class.

5. The method according to claim 1 or 2, wherein the method further comprises:

performing pre-processing filtering on the target problem to be clustered and the candidate problem to obtain the pre-filtered target problem and candidate problem;

wherein step b comprises:

determining the feature vectors of the target problem and the candidate problem according to the problem features of the pre-filtered target problem and candidate problem.

6. The method according to claim 1 or 2, wherein the target problem includes a newly added problem; wherein step d comprises:

determining the candidate problem having the maximum similarity with the newly added problem;

judging, based on the maximum similarity and in combination with a predetermined threshold, whether to cluster the newly added problem and the candidate problem into one class.

7. The method according to claim 6, wherein the method further comprises:

if the maximum similarity is less than the predetermined threshold, creating a new class for the newly added problem.

8. A clustering apparatus for clustering problems, wherein the clustering apparatus comprises:

a determining device, configured to determine the feature vectors of the target problem and the candidate problem according to the problem features of the target problem and the candidate problem;

a computing device, configured to calculate the similarity between the target problem and the candidate problem according to the feature vectors of the target problem and the candidate problem;

a judging device, configured to determine, according to the similarity and in combination with a fragmentation threshold, the target problem and candidate problem having low similarity, middle similarity or high similarity;

wherein the apparatus further comprises:

a matching device, configured to judge, based on subsequent matching processing, whether to cluster the target problem and candidate problem having middle similarity into one class;

a re-computation device, configured to recalculate, based on the problem content information and augmented content information of the target problem and the candidate problem, the similarity between the target problem and the candidate problem, so as to judge whether to cluster the target problem and candidate problem having high similarity into one class.

9. The clustering apparatus according to claim 8, wherein the problem features include at least any one of the following:

Keyword feature;

Structure feature;

Semantic feature;

Problem type feature.

10. The clustering apparatus according to claim 8 or 9, wherein the subsequent matching processing includes at least any one of the following:

Problem type matching;

Keyword matching;

Key expression matching.

11. The clustering apparatus according to claim 8 or 9, wherein the re-computation device is further configured to:

12. The clustering apparatus according to claim 8 or 9, wherein the apparatus further comprises:

a filter device, configured to perform pre-processing filtering on the target problem to be clustered and the candidate problem to obtain the pre-filtered target problem and candidate problem;

wherein the determining device is configured to:

13. The clustering apparatus according to claim 8 or 9, wherein the target problem includes a newly added problem; wherein the judging device is configured to:

14. The clustering apparatus according to claim 13, wherein the apparatus further comprises a creating device, configured to: