CN107491425A

CN107491425A - Determine method, determining device, computer installation and computer-readable recording medium

Info

Publication number: CN107491425A
Application number: CN201710620516.XA
Authority: CN
Inventors: 闫永刚; 沈亮
Original assignee: Hefei Midea Intelligent Technologies Co Ltd
Current assignee: Hefei Midea Intelligent Technologies Co Ltd
Priority date: 2017-07-26
Filing date: 2017-07-26
Publication date: 2017-12-19

Abstract

The present invention proposes a kind of determination method, determining device, computer installation and computer-readable recording medium, wherein it is determined that method includes：The Similar Text data set for including at least one second short text is determined according to the first short text；Determine the morphology similarity between the first short text and the second short text；Determine the word order similarity of the first short text or the second short text and standard word order；According to the first of morphology similarity the default weighted value and the second default weighted value of word order similarity, weighting operations are performed to morphology similarity and word order similarity, to determine the similarity of the first short text and the second short text.By technical scheme, short text similarity can be more all-sidedly and accurately weighed.

Description

Determine method, determining device, computer installation and computer-readable recording medium

Technical field

It is short in particular to a kind of determination method of short text similarity, one kind the present invention relates to text-processing field The determining device of text similarity, a kind of computer installation and a kind of computer-readable recording medium.

Background technology

In correlation technique, short text Similarity Measure is the knowledge point of natural language processing field very core, and one good Short text similarity calculating method can considerably improve the performance of existing system.

The method of short text Similarity Measure has much at present, can be largely classified into following a few classes：Knowledge based storehouse, it is based on Corpus, evaluating method based on Expressive Features, based on machine translation result etc., are primarily present following defect：

(1) method in knowledge based storehouse highly dependent upon the semantic dictionary inquired about completeness because short text may be deposited Its semantic similarity can not be calculated in unregistered word, causes result inaccurate.And this method have ignored short text between count The similarity of feature；

(2) the method difficult point of feature based is how effectively to extract feature and obtains these characteristic values automatically.This side Method have ignored the similarity of semantic information between short text；

(3) different from long text Similarity Measure, indivedual noise words may be to the similar of whole short text in short text Degree calculates serious interference

The content of the invention

At least one in order to solve the above-mentioned technical problem, it is an object of the present invention to provide a kind of short text similarity Determination method.

It is another object of the present invention to provide a kind of determining device of short text similarity.

It is yet a further object of the present invention to provide a kind of computer installation.

A further object of the present invention is to provide a kind of computer-readable recording medium.

To achieve these goals, the embodiment of first aspect present invention proposes a kind of determination side of short text similarity Method, including：The Similar Text data set for including at least one second short text is determined according to the first short text；Determine the first short essay Morphology similarity between sheet and the second short text；Determine that the first short text or the second short text are similar to the word order of standard word order Degree；According to the first of morphology similarity the default weighted value and the second default weighted value of word order similarity, to morphology similarity with Word order similarity performs weighting operations, to determine the similarity of the first short text and the second short text.

In the technical scheme, after the first short text is inputted, the second end text is determined according to default scene thesaurus This, is with it is determined that morphology similarity between the first short text and the second short text, and the first short text or the second short text With the word order similarity of standard word order, and then morphology similarity and word order similarity are weighted, it is short to obtain first Similarity between text and the second short text, according to the statistical nature and semantic information of short text to be detected, it is determined that with phase Like the text similarity between neighbouring several second short texts, short text similarity can be more all-sidedly and accurately weighed.

In addition, the determination method of the short text similarity in above-described embodiment provided by the invention can also be with following attached Add technical characteristic：

In the above-mentioned technical solutions, it is preferable that it is determined that morphology similarity between the first short text and the second short text Before, in addition to：Collect short text sample set；Pretreatment operation is performed to the short text sample in short text sample set, to obtain Text is handled, pretreatment operation includes Chinese word segmentation, removal stop words, text feature, text duplicate removal and text and made by oneself Adopted lexicon configuration；It is determined that multiple Similar Texts are imported presetting database by the Similar Text in processing text；According to default Cutting ratio, multiple Similar Texts in presetting database are respectively allocated to training dataset and test data set；According to machine Device learning algorithm, the multiple Similar Texts concentrated to training data establish iterative model；According to iterative model, to the first weighted value Renewal is iterated with the second weighted value, by the input of iterative model and the first weighted value and second during output difference minimum Weighted value, the first default weighted value and the second default weighted value are identified as, wherein, the first weighted value is multiple Similar Texts Morphology similarity weighted value, the second weighted value for multiple Similar Texts word order similarity weighted value.

In the technical scheme, by before similarity-rough set operation is performed, according to the short text sample set being collected into, Using the mode of learning based on machine, short text sample set is iterated renewal, respectively with λ₁=0.1, λ₂=0.99 starts to change Generation, iterations are 100 times, in an iterative process, the input value of iterative model are contrasted with output valve, comparing result is poor Different in nature minimum λ₁With λ₂, second as the first default weighted value and the word order similarity of corresponding determination morphology similarity be pre- If weighted value, on the one hand, realize the optimization selection of weighted value, on the other hand, can be weighed according to convenient adjustment is actually needed Weight occupies value, and use is more convenient.

Specifically, short text similarity comprise at least morphology similarity, word order similarity linear and, based on machine learning Mode determine λ₁(the first default weighted value) and λ₂(the second default weighted value), mainly includes：(1) short text data collection is collected, And short text is pre-processed, processing procedure includes Chinese word segmentation, removes stop words, text feature, text duplicate removal, text certainly Define the steps such as lexicon configuration；(2) pretreated short text is put in storage, and each short text sample is marked, with true Fixed most like short text, by most like short text model training, and according to default cutting ratio, such as 7:3, by short essay Originally training set, test set are distributed to；(3) machine learning algorithm, such as KNN (k-Nearest Neighbor, k arest neighbors are used Sorting algorithm), for training set modeling analysis, set λ₁' with 0.01 it is starting point, step-length 0.01, during iteration 100 times (simultaneously Meet λ₁'+λ₂'=1), the precision excursion of sample, and with λ during full accuracy₁'、λ₂' numerical value as optimal λ₁、λ₂。

In any of the above-described technical scheme, it is preferable that determine that the morphology between the first short text and the second short text is similar Degree, specifically includes following steps：According to the first calculation formula, morphology similarity is determined, wherein, the first calculation formula isX is the first short text, and y is the second short text, x^tRemoved for the first short text The later short text of stop words, y^tThe later short text of stop words, s (x are removed for the second short text^t) removed for the first short text Effective word quantity after stop words, s (y^t) effective word quantity after stop words, ts (x are removed for the second short text^t,y^t) be First short text and the second short text remove after stop words and remove the same word quantity after repetitor.

In the technical scheme, morphology similarity is determined by the first calculation formula, is removing stop words and repetition respectively Hereafter, and according to effective single quantity of the first short text and the second short text, determine morphology similarity, i.e. two short texts Co-occurrence degree, on the one hand, calculating process is fairly simple, on the other hand improves the accuracy of calculating.

Specifically, stop words, in information retrieval, to save memory space and improving search efficiency, in processing nature language Some words or word are fallen in meeting automatic fitration before or after speech data (or text), and these words or word are referred to as stop words, these Word or word are all manually entered, non-automated generation, and the stop words after generation can form a deactivation vocabulary.

Wherein, co-occurrence degree, refer to one of external Cohesion for establishing the contact of sentence border, refer to synonymity word and correlation word more While occur, make sentence is front and rear to connect, form coherent language, occasion contact, causal relation in external Cohesion etc., all Along with term co-occurrence.

In any of the above-described technical scheme, it is preferable that determine the word of the first short text or the second short text and standard word order Sequence similarity, specifically includes following steps：According to the second calculation formula, word order similarity is determined, wherein, the second calculation formula isBaseline is the mark of given scenario Quasi- word order, relative to baseline permutation number, maxInvCount (baseline) is invCount (y) the second short text y Baseline maximum permutation number, n are the word quantity in the second short text.

In the technical scheme, by the way that the first short text or the second short text are carried out into similarity word order pair with standard word order Than, the similar detection of word order is realized, can be by the first short text or the second short text and standard by setting standard word order Word order is compared, and when short text is null, the word order degree of correlation is 0, and in an only word, the word order degree of correlation is 1, During with n word, according to relative permutation number and the ratio of maximum permutation number, word order similarity is determined, realizes word order phase Like the determination of degree, determination mode is simple, and reliability is high.

In any of the above-described technical scheme, it is preferable that similar to word order according to the first of morphology similarity the default weighted value Second default weighted value of degree, weighting operations are performed to morphology similarity and word order similarity, to determine the first short text and the The similarity of two short texts, specifically includes following steps：According to the 3rd calculation formula, similarity is determined, wherein, the 3rd formula is SenSim (x)=λ₁×TermSim(x,y)+λ₂×{Order_sim(baseline,y)}_k, TermSim (x, y) is morphology phase Like degree, λ₁For the first default weighted value, Order_sim (baseline, y) is word order similarity, λ₂For the second default weighted value, k For the quantity of the second weighted value with the first weighted value similar neighborhoods.

In the technical scheme, similitude of first short text relative to k the second short texts is determined by the 3rd formula, To determine the similarity of the first short text, short text similarity can be more all-sidedly and accurately weighed.

In any of the above-described technical scheme, it is preferable that it is determined that the first short text or the second short text and standard word order Before word order similarity, in addition to：The standard word order of given scenario is determined according to the attribute of given scenario；It is raw according to standard word order Into the thesaurus of given scenario.

In the technical scheme, by determining the standard word order of the scene according to the attribute of given scenario, realize more The determination of the standard word order of individual scene, in the natural language processing application in vertical field, it can be built and stopped according to standard word order Word dictionary and thesaurus, exploitativeness are stronger.

The embodiment of second aspect of the present invention proposes a kind of determining device of short text similarity, including：Determining unit, For determining the Similar Text data set for including at least one second short text according to the first short text；Determining unit is additionally operable to： Determine the morphology similarity between the first short text and the second short text；Determining unit is additionally operable to：Determine the first short text or The word order similarity of two short texts and standard word order；Determining device also includes：Operating unit, for according to morphology similarity One default weighted value and the second default weighted value of word order similarity, weighting behaviour is performed to morphology similarity and word order similarity Make, to determine the similarity of the first short text and the second short text.

In the above-mentioned technical solutions, it is preferable that also include：Collector unit, for collecting short text sample set；Operating unit It is additionally operable to：Pretreatment operation is performed to the short text sample in short text sample set, to obtain processing text, pretreatment operation bag Include including Chinese word segmentation, remove stop words, text feature, text duplicate removal and the configuration of text Custom Dictionaries；Determining unit is also For：It is determined that multiple Similar Texts are imported presetting database by the Similar Text in processing text；Determining device also includes： Allocation unit, for according to default cutting ratio, multiple Similar Texts in presetting database to be respectively allocated into training number According to collection and test data set；Unit is established, for according to machine learning algorithm, the multiple Similar Texts concentrated to training data to be built Vertical iterative model；Determining unit is additionally operable to：According to iterative model, renewal is iterated to the first weighted value and the second weighted value, So that with exporting the first weighted value and the second weighted value during difference minimum, it is default that the input of iterative model is identified as into first Weighted value and the second default weighted value, wherein, the first weighted value for the morphology similarity of multiple Similar Texts weighted value, second Weighted value is the weighted value of the word order similarity of multiple Similar Texts.

In any of the above-described technical scheme, it is preferable that determining unit is additionally operable to：According to the first calculation formula, morphology is determined Similarity, wherein, the first calculation formula isX is the first short text, y Two short texts, x^tThe later short text of stop words, y are removed for the first short text^tLater short of stop words is removed for the second short text Text, s (x^t) effective word quantity after stop words, s (y are removed for the first short text^t) it is after the second short text removes stop words Effective word quantity, ts (x^t,y^t) after the first short text and the second short text removal stop words and to remove the phase after repetitor With word quantity.

In any of the above-described technical scheme, it is preferable that determining unit is additionally operable to：According to the second calculation formula, word order is determined Similarity, wherein, the second calculation formula is Baseline be given scenario standard word order, invCount (y) the second short text y relative to baseline permutation number, MaxInvCount (baseline) is baseline maximum permutation number, and n is the word quantity in the second short text.

In any of the above-described technical scheme, it is preferable that determining unit is additionally operable to：According to the 3rd calculation formula, determine similar Degree, wherein, the 3rd formula is SenSim (x)=λ₁×TermSim(x,y)+λ₂×{Order_sim(baseline,y)}_k, TermSim (x, y) is morphology similarity, λ₁For the first default weighted value, Order_sim (baseline, y) is that word order is similar Degree, λ₂For the second default weighted value, k is the quantity with the second weighted value of the first weighted value similar neighborhoods.

In any of the above-described technical scheme, it is preferable that determining unit is additionally operable to：Determine to specify according to the attribute of given scenario The standard word order of scene；Determining device also includes：Generation unit：For according to standard word order, generating the synonym of given scenario Storehouse.

According to the third aspect of the invention we, it is also proposed that a kind of computer installation, computer installation include processor, processing Device is realized such as the step of above-mentioned any one method when being used to perform the computer program stored in memory.

According to the fourth aspect of the invention, it is also proposed that a kind of computer-readable recording medium, be stored thereon with computer Program (instruction), when computer program (instruction) is executed by processor realize as described above middle any one method the step of.

The additional aspect and advantage of the present invention will become obvious in following description section, or the practice by the present invention Recognize.

Brief description of the drawings

The above-mentioned and/or additional aspect and advantage of the present invention will become in the description from combination accompanying drawings below to embodiment Substantially and it is readily appreciated that, wherein：

Fig. 1 shows the schematic flow diagram of the determination method of short text similarity according to an embodiment of the invention；

Fig. 2 shows the schematic block diagram of the determining device of short text similarity according to an embodiment of the invention；

Fig. 3 shows the exemplary flow of the determination method of short text similarity according to another embodiment of the invention Figure.

Embodiment

It is below in conjunction with the accompanying drawings and specific real in order to be more clearly understood that the above objects, features and advantages of the present invention Mode is applied the present invention is further described in detail.It should be noted that in the case where not conflicting, the implementation of the application Feature in example and embodiment can be mutually combined.

Many details are elaborated in the following description to facilitate a thorough understanding of the present invention, still, the present invention may be used also To be different from other modes described here using other to implement, therefore, protection scope of the present invention is not by described below Specific embodiment limitation.

Fig. 1 shows the schematic flow diagram of the determination method of short text similarity according to an embodiment of the invention.

As shown in figure 1, the determination method of short text similarity according to an embodiment of the invention, including：Step 102, the Similar Text data set for including at least one second short text is determined according to the first short text；Step 104, first is determined Morphology similarity between short text and the second short text；Step 106, the first short text or the second short text and standard words are determined The word order similarity of sequence；Step 108, according to the first of morphology similarity the default weighted value and the second default power of word order similarity Weight values, weighting operations are performed to morphology similarity and word order similarity, to determine that the first short text is similar to the second short text Degree.

Specifically, short text similarity comprise at least morphology similarity, word order similarity linear and, based on machine learning Mode determine λ₁(the first default weighted value) and λ₂(the second default weighted value), mainly includes：(1) short text data collection is collected, And short text is pre-processed, processing procedure includes Chinese word segmentation, removes stop words, text feature, text duplicate removal, text certainly Define the steps such as lexicon configuration；(2) pretreated short text is put in storage, and each short text sample is marked, with true Fixed most like short text, by most like short text model training, and according to default cutting ratio, such as 7:3, by short essay Originally training set, test set are distributed to；(3) machine learning algorithm, such as KNN (k-Nearest Neighbor, k arest neighbors are used Sorting algorithm), for training set modeling analysis, set λ₁' with 0.01 it is starting point, step-length 0.01, during iteration 100 times (simultaneously Meet λ₁′+λ₂'=1), the precision excursion of sample, and with λ during full accuracy₁′、λ₂' numerical value as optimal λ₁、λ₂。

Fig. 2 shows the schematic block diagram of the determining device of short text similarity according to an embodiment of the invention.

As shown in Fig. 2 the determining device 200 of short text similarity according to an embodiment of the invention, including：It is determined that Unit 202, for determining the Similar Text data set for including at least one second short text according to the first short text；Determining unit 202 are additionally operable to：Determine the morphology similarity between the first short text and the second short text；Determining unit 202 is additionally operable to：Determine The word order similarity of one short text or the second short text and standard word order；Determining device 200 also includes：Operating unit 204, is used for According to the first of morphology similarity the default weighted value and the second default weighted value of word order similarity, to morphology similarity and word order Similarity performs weighting operations, to determine the similarity of the first short text and the second short text.

In the above-mentioned technical solutions, it is preferable that also include：Collector unit 206, for collecting short text sample set；Operation Unit 204 is additionally operable to：Pretreatment operation is performed to the short text sample in short text sample set, to obtain processing text, pre- place Reason operation includes Chinese word segmentation, removes stop words, text feature, text duplicate removal and the configuration of text Custom Dictionaries；Really Order member 202 is additionally operable to：It is determined that multiple Similar Texts are imported presetting database by the Similar Text in processing text；It is determined that Device 200 also includes：Allocation unit 208, for according to default cutting ratio, by multiple Similar Texts in presetting database It is respectively allocated to training dataset and test data set；Unit 210 is established, for according to machine learning algorithm, to training data The multiple Similar Texts concentrated establish iterative model；Determining unit 202 is additionally operable to：According to iterative model, to the first weighted value with Second weighted value is iterated renewal, by the input of iterative model and the first weighted value and the second power during output difference minimum Weight values, the first default weighted value and the second default weighted value are identified as, wherein, the first weighted value is multiple Similar Texts The weighted value of morphology similarity, the second weighted value are the weighted value of the word order similarity of multiple Similar Texts.

In any of the above-described technical scheme, it is preferable that determining unit 202 is additionally operable to：According to the first calculation formula, word is determined Shape similarity, wherein, the first calculation formula isX is the first short text, and y is Second short text, x^tThe later short text of stop words, y are removed for the first short text^tIt is later that stop words is removed for the second short text Short text, s (x^t) effective word quantity after stop words, s (y are removed for the first short text^t) it is that the second short text removes stop words Effective word quantity afterwards, ts (x^t,y^t) for after the first short text and the second short text removal stop words and after removing repetitor Same word quantity.

In any of the above-described technical scheme, it is preferable that determining unit 202 is additionally operable to：According to the second calculation formula, word is determined Sequence similarity, wherein, the second calculation formula is Baseline be given scenario standard word order, invCount (y) the second short text y relative to baseline permutation number, MaxInvCount (baseline) is baseline maximum permutation number, and n is the word quantity in the second short text.

In any of the above-described technical scheme, it is preferable that determining unit 202 is additionally operable to：According to the 3rd calculation formula, phase is determined Like degree, wherein, the 3rd formula is SenSim (x)=λ₁×TermSim(x,y)+λ₂×{Order_sim(baseline,y)}_k, TermSim (x, y) is morphology similarity, λ₁For the first default weighted value, Order_sim (baseline, y) is that word order is similar Degree, λ₂For the second default weighted value, k is the quantity with the second weighted value of the first weighted value similar neighborhoods.

In any of the above-described technical scheme, it is preferable that determining unit 202 is additionally operable to：Determined according to the attribute of given scenario The standard word order of given scenario；Determining device 200 also includes：Generation unit 212：For according to standard word order, generating and specifying field The thesaurus of scape.

As shown in figure 3, the determination method of short text similarity according to another embodiment of the invention, including：Step 302, collect short text sample set；Step 304, pretreatment operation is performed to the short text sample in short text sample set, to obtain Text must be handled, pretreatment operation includes Chinese word segmentation, removes stop words, text feature, text duplicate removal and text certainly Define lexicon configuration；Step 306, it is determined that handling the Similar Text in text, multiple Similar Texts are imported into presetting database； Step 308, according to default cutting ratio, multiple Similar Texts in presetting database are respectively allocated to test data set； Step 310, according to default cutting ratio, by multiple Similar Texts in presetting database be respectively allocated to training dataset with Test data set；Step 312, according to machine learning algorithm, the multiple Similar Texts concentrated to training data establish iterative model； Step 314, for any given scene, its standard word order is determined, establishes the thesaurus of all scenes, to any sentence, The word for wherein possessing synonymous scene is replaced, makes it consistent with standard scene；Step 316, according to iterative model, to the first weight Value and the second weighted value are iterated renewal, the first weighted value during by the input of iterative model with exporting difference minimum and the Two weighted values, the first default weighted value and the second default weighted value are identified as, wherein, the first weighted value is multiple similar texts The weighted value of this morphology similarity, the second weighted value are the weighted value of the word order similarity of multiple Similar Texts；Step 318, According to the first of morphology similarity the default weighted value and the second default weighted value of word order similarity, to morphology similarity and word order Similarity performs weighting operations, to determine the similarity of the first short text and the second short text；Step 320, descending arrangement output As a result.

Computer installation according to an embodiment of the invention, computer installation include processor, and processor is deposited for execution Realized during the computer program stored in reservoir such as the step of above-mentioned any one method.

Computer-readable recording medium according to an embodiment of the invention, computer program (instruction) is stored thereon with, counted The step of middle any one method as described above is realized when calculation machine program (instruction) is executed by processor.

In the present invention, term " first ", " second ", " the 3rd " are only used for the purpose described, and it is not intended that instruction Or imply relative importance；Term " multiple " then refers to two or more, is limited unless otherwise clear and definite.Term " installation ", The term such as " connected ", " connection ", " fixation " all should be interpreted broadly, for example, " connection " can be fixedly connected or can Dismantling connection, or be integrally connected；" connected " can be joined directly together, and can also be indirectly connected by intermediary.For this For the those of ordinary skill in field, the concrete meaning of above-mentioned term in the present invention can be understood as the case may be.

In description of the invention, it is to be understood that the instruction such as term " on ", " under ", "left", "right", "front", "rear" Orientation or position relationship are based on orientation shown in the drawings or position relationship, are for only for ease of the description present invention and simplification is retouched State, rather than instruction or imply signified device or unit there must be specific direction, with specific azimuth configuration and operation, It is thus impossible to it is interpreted as limitation of the present invention.

In the description of this specification, the description of term " one embodiment ", " some embodiments ", " specific embodiment " etc. Mean to combine at least one reality that specific features, structure, material or the feature that the embodiment or example describe are contained in the present invention Apply in example or example.In this manual, identical embodiment or reality are not necessarily referring to the schematic representation of above-mentioned term Example.Moreover, description specific features, structure, material or feature can in any one or more embodiments or example with Suitable mode combines.

The preferred embodiments of the present invention are these are only, are not intended to limit the invention, for those skilled in the art For member, the present invention can have various modifications and variations.Any modification within the spirit and principles of the invention, being made, Equivalent substitution, improvement etc., should be included in the scope of the protection.

Claims

A kind of 1. determination method of short text similarity, it is characterised in that including：

The Similar Text data set for including at least one second short text is determined according to the first short text；

Determine the morphology similarity between first short text and second short text；

Determine the word order similarity of first short text or second short text and standard word order；

According to the first of the morphology similarity the default weighted value and the second default weighted value of the word order similarity, to described Morphology similarity performs weighting operations with the word order similarity, to determine first short text and second short text Similarity.
2. the determination method of short text similarity according to claim 1, it is characterised in that described it is determined that described first Before morphology similarity between short text and second short text, in addition to：

Collect short text sample set；

Pretreatment operation is performed to the short text sample in the short text sample set, to obtain processing text, the pretreatment Operation includes Chinese word segmentation, removes stop words, text feature, text duplicate removal and the configuration of text Custom Dictionaries；

The Similar Text in the processing text is determined, multiple Similar Texts are imported into presetting database；

According to default cutting ratio, the multiple Similar Text in the presetting database is respectively allocated to training data Collection and test data set；

According to machine learning algorithm, the multiple Similar Text concentrated to the training data establishes iterative model；

According to the iterative model, renewal is iterated to the first weighted value and the second weighted value, by the iterative model Input and first weighted value during output difference minimum and second weighted value, are identified as the described first default power Weight values and the described second default weighted value,

Wherein, first weighted value for the multiple Similar Text morphology similarity weighted value, second weighted value For the weighted value of the word order similarity of the multiple Similar Text.
3. the determination method of short text similarity according to claim 2, it is characterised in that determine first short text Morphology similarity between second short text, specifically includes following steps：

According to the first calculation formula, the morphology similarity is determined,

Wherein, first calculation formula isX is first short text, y For second short text, x^tThe later short text of stop words, y are removed for first short text^tGone for second short text Except the later short text of stop words, s (x^t) effective word quantity after the stop words, s (y are removed for first short text^t) Effective word quantity after the stop words, ts (x are removed for second short text^t,y^t) for first short text with it is described Second short text removes after the stop words and removes the same word quantity after repetitor.
4. the determination method of short text similarity according to claim 3, it is characterised in that described to determine that described first is short The word order similarity of text or second short text and standard word order, specifically includes following steps：

According to the second calculation formula, the word order similarity is determined,

Wherein, second calculation formula is The baseline is the standard word order of given scenario, and invCount (y) the second short text y is relative to described Baseline permutation number, the maxInvCount (baseline) are the maximum permutation number of the baseline, and the n is Word quantity in second short text.
5. the determination method of short text similarity according to claim 4, it is characterised in that described according to the morphology phase Weighted value and the second default weighted value of the word order similarity are preset like the first of degree, to the morphology similarity and institute's predicate Sequence similarity perform weighting operations, to determine the similarity of first short text and second short text, specifically include with Lower step：

According to the 3rd calculation formula, the similarity is determined,

Wherein, the 3rd formula is SenSim (x)=λ₁×TermSim(x,y)+λ₂×{Order_sim(baseline, y)}_k, TermSim (x, y) is the morphology similarity, λ₁For the described first default weighted value, Order_sim (baseline, y) For the word order similarity, λ₂For the described second default weighted value, k is described second with the first weighted value similar neighborhoods The quantity of weighted value.
6. the determination method of short text similarity according to claim 4, it is characterised in that described it is determined that described first Before the word order similarity of short text or second short text and standard word order, in addition to：

The standard word order of the given scenario is determined according to the attribute of given scenario；

According to the standard word order, the thesaurus of the given scenario is generated.
A kind of 7. determining device of short text similarity, it is characterised in that including：

Determining unit, for determining the Similar Text data set for including at least one second short text according to the first short text；

The determining unit is additionally operable to：Determine the morphology similarity between first short text and second short text；

The determining unit is additionally operable to：Determine that first short text or second short text are similar to the word order of standard word order Degree；

The determining device also includes：

Operating unit, for being preset according to the first of the morphology similarity the default weighted value with the second of the word order similarity Weighted value, weighting operations are performed to the morphology similarity and the word order similarity, to determine first short text and institute State the similarity of the second short text.
8. the determining device of short text similarity according to claim 7, it is characterised in that also include：

Collector unit, for collecting short text sample set；

The operating unit is additionally operable to：Pretreatment operation is performed to the short text sample in the short text sample set, to obtain Text is handled, the pretreatment operation includes Chinese word segmentation, removes stop words, text feature, text duplicate removal and text Custom Dictionaries configure；

The determining unit is additionally operable to：The Similar Text in the processing text is determined, multiple Similar Texts are imported default Database；

The determining device also includes：

Allocation unit, for according to default cutting ratio, the multiple Similar Text in the presetting database to be distinguished Distribute to training dataset and test data set；

Unit is established, for according to machine learning algorithm, establishing and changing to the multiple Similar Text that the training data is concentrated For model；

The determining unit is additionally operable to：According to the iterative model, renewal is iterated to the first weighted value and the second weighted value, So that the input of the iterative model and first weighted value during output difference minimum and second weighted value, difference is true It is set to the described first default weighted value and the described second default weighted value,

Wherein, first weighted value for the multiple Similar Text morphology similarity weighted value, second weighted value For the weighted value of the word order similarity of the multiple Similar Text.
9. the determining device of short text similarity according to claim 8, it is characterised in that

The determining unit is additionally operable to：According to the first calculation formula, the morphology similarity is determined,

Wherein, first calculation formula isX is first short text, y For second short text, x^tThe later short text of stop words, y are removed for first short text^tGone for second short text Except the later short text of stop words, s (x^t) effective word quantity after the stop words, s (y are removed for first short text^t) Effective word quantity after the stop words, ts (x are removed for second short text^t,y^t) for first short text with it is described Second short text removes after the stop words and removes the same word quantity after repetitor.
10. the determining device of short text similarity according to claim 9, it is characterised in that

The determining unit is additionally operable to：According to the second calculation formula, the word order similarity is determined,

Wherein, second calculation formula is The baseline is the standard word order of given scenario, and invCount (y) the second short text y is relative to described Baseline permutation number, the maxInvCount (baseline) are the maximum permutation number of the baseline, and the n is Word quantity in second short text.
11. the determining device of short text similarity according to claim 10, it is characterised in that

The determining unit is additionally operable to：According to the 3rd calculation formula, the similarity is determined,

Wherein, the 3rd formula is SenSim (x)=λ₁×TermSim(x,y)+λ₂×{Order_sim(baseline, y)}_k, TermSim (x, y) is the morphology similarity, λ₁For the described first default weighted value, Order_sim (baseline, y) For the word order similarity, λ₂For the described second default weighted value, k is described second with the first weighted value similar neighborhoods The quantity of weighted value.
12. the determining device of short text similarity according to claim 10, it is characterised in that

The determining unit is additionally operable to：The standard word order of the given scenario is determined according to the attribute of given scenario；

The determining device also includes：

Generation unit：For according to the standard word order, generating the thesaurus of the given scenario.
13. a kind of computer installation, it is characterised in that the computer installation includes processor, and the processor is used to perform The step of any one methods described in such as claim 1-6 is realized during the computer program stored in memory.
14. a kind of computer-readable recording medium, it is stored thereon with computer program (instruction), it is characterised in that：The calculating The step of any one methods described in such as claim 1-6 is realized when machine program (instruction) is executed by processor.