CN107491425A - Determination method, determination apparatus, computer device and computer-readable storage medium - Google Patents

Determination method, determination apparatus, computer device and computer-readable storage medium Download PDF

Info

Publication number
CN107491425A
CN107491425A CN201710620516.XA CN201710620516A
Authority
CN
China
Prior art keywords
short text
similarity
text
weighted value
word order
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710620516.XA
Other languages
Chinese (zh)
Inventor
闫永刚
沈亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Midea Intelligent Technologies Co Ltd
Original Assignee
Hefei Midea Intelligent Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Midea Intelligent Technologies Co Ltd filed Critical Hefei Midea Intelligent Technologies Co Ltd
Priority to CN201710620516.XA priority Critical patent/CN107491425A/en
Publication of CN107491425A publication Critical patent/CN107491425A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis

Abstract

The present invention provides a determination method, a determination apparatus, a computer device and a computer-readable storage medium. The determination method includes: determining, from a first short text, a similar-text data set containing at least one second short text; determining the lexical similarity between the first short text and the second short text; determining the word-order similarity of the first short text or the second short text with respect to a standard word order; and performing, according to a first preset weight of the lexical similarity and a second preset weight of the word-order similarity, a weighting operation on the lexical similarity and the word-order similarity, to determine the similarity between the first short text and the second short text. With this technical scheme, short-text similarity can be measured more comprehensively and accurately.

Description

Determination method, determination apparatus, computer device and computer-readable storage medium
Technical field
The present invention relates to the field of text processing, and in particular to a determination method for short-text similarity, a determination apparatus for short-text similarity, a computer device, and a computer-readable storage medium.
Background art
In the related art, short-text similarity computation is a core topic in the field of natural language processing, and a good short-text similarity algorithm can considerably improve the performance of existing systems.
There are currently many methods for short-text similarity computation, which can be roughly divided into the following classes: knowledge-base methods, corpus-based methods, evaluation methods based on expressive features, methods based on machine-translation results, and so on. They mainly have the following defects:
(1) Knowledge-base methods depend heavily on the completeness of the queried semantic dictionary; because a short text may contain out-of-vocabulary words whose semantic similarity cannot be computed, the results are inaccurate. Moreover, such methods ignore the similarity of statistical features between short texts;
(2) The difficulty of feature-based methods lies in how to extract features effectively and obtain the feature values automatically. Such methods ignore the similarity of semantic information between short texts;
(3) Unlike long-text similarity computation, a few noise words in a short text may seriously interfere with the similarity computation of the whole short text.
Summary of the invention
In order to solve at least one of the above technical problems, an object of the present invention is to provide a determination method for short-text similarity.
Another object of the present invention is to provide a determination apparatus for short-text similarity.
Yet another object of the present invention is to provide a computer device.
A further object of the present invention is to provide a computer-readable storage medium.
To achieve these objects, an embodiment of the first aspect of the present invention provides a determination method for short-text similarity, including: determining, from a first short text, a similar-text data set containing at least one second short text; determining the lexical similarity between the first short text and the second short text; determining the word-order similarity of the first short text or the second short text with respect to a standard word order; and performing, according to a first preset weight of the lexical similarity and a second preset weight of the word-order similarity, a weighting operation on the lexical similarity and the word-order similarity, to determine the similarity between the first short text and the second short text.
In this technical scheme, after the first short text is input, the second short text is determined from a preset scene synonym lexicon. The lexical similarity between the first short text and the second short text is determined, together with the word-order similarity of the first short text or the second short text with respect to the standard word order, and the lexical similarity and the word-order similarity are then weighted to obtain the similarity between the first short text and the second short text. By determining, from the statistical features and semantic information of the short text to be detected, the text similarity with several neighboring second short texts, short-text similarity can be measured more comprehensively and accurately.
In addition, the determination method for short-text similarity in the above embodiment provided by the present invention may further have the following additional technical features:
In the above technical scheme, preferably, before the lexical similarity between the first short text and the second short text is determined, the method further includes: collecting a short-text sample set; performing a preprocessing operation on the short-text samples in the sample set to obtain processed texts, the preprocessing including Chinese word segmentation, stopword removal, text feature extraction, text de-duplication and custom-dictionary configuration; determining the similar texts among the processed texts and importing the multiple similar texts into a preset database; allocating, according to a preset splitting ratio, the multiple similar texts in the preset database to a training data set and a test data set; building, according to a machine-learning algorithm, an iterative model over the multiple similar texts in the training data set; and iteratively updating a first weight value and a second weight value according to the iterative model, the first and second weight values at which the difference between the input and the output of the iterative model is smallest being determined as the first preset weight and the second preset weight, where the first weight value is the weight of the lexical similarity of the multiple similar texts and the second weight value is the weight of the word-order similarity of the multiple similar texts.
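A minimal sketch of the preprocessing and data-split steps above, assuming a toy whitespace tokenizer in place of a real Chinese word segmenter, and an invented stopword list and corpus (the patent names the steps but not their implementations; the 7:3 ratio matches the example splitting ratio given later):

```python
# Sketch of the preprocessing pipeline: tokenize, drop stopwords,
# de-duplicate tokens, then split the sample set into train/test portions.
# The tokenizer, stopword list, and corpus below are illustrative stand-ins.

STOPWORDS = {"the", "a", "of"}  # hypothetical stopword list

def preprocess(text):
    """Tokenize, remove stopwords, and de-duplicate while keeping word order."""
    tokens = text.lower().split()          # stand-in for Chinese word segmentation
    tokens = [t for t in tokens if t not in STOPWORDS]
    seen, deduped = set(), []
    for t in tokens:                       # token-level text de-duplication
        if t not in seen:
            seen.add(t)
            deduped.append(t)
    return deduped

def split_corpus(samples, ratio=0.7):
    """Allocate samples to a training set and a test set (default 7:3)."""
    cut = int(len(samples) * ratio)
    return samples[:cut], samples[cut:]

corpus = ["the cat sat of the mat", "a dog sat", "the mat the cat"]
processed = [preprocess(s) for s in corpus]
train, test = split_corpus(processed)
print(processed[0])        # ['cat', 'sat', 'mat']
print(len(train), len(test))  # 2 1
```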
In this technical scheme, before the similarity comparison is performed, the collected short-text sample set is iteratively updated using a machine-learning approach, starting the iteration from λ1 = 0.01 and λ2 = 0.99, with 100 iterations. In each iteration, the input value of the iterative model is compared with its output value, and the λ1 and λ2 for which the comparison difference is smallest are taken as the first preset weight of the lexical similarity and, correspondingly, the second preset weight of the word-order similarity. On the one hand, this realizes an optimized selection of the weight values; on the other hand, the weight values can be conveniently adjusted according to actual needs, making the scheme more convenient to use.
Specifically, the short-text similarity comprises at least a linear combination of the lexical similarity and the word-order similarity, and λ1 (the first preset weight) and λ2 (the second preset weight) are determined by machine learning, mainly including: (1) collecting a short-text data set and preprocessing the short texts, the processing including steps such as Chinese word segmentation, stopword removal, text feature extraction, text de-duplication and custom-dictionary configuration; (2) storing the preprocessed short texts in a database and labeling each short-text sample so as to determine its most similar short text, training the model on the most similar short texts, and allocating the short texts to a training set and a test set according to a preset splitting ratio, for example 7:3; (3) using a machine-learning algorithm such as KNN (k-Nearest Neighbor classification) to model and analyze the training set, letting λ1′ start at 0.01 with a step of 0.01 over 100 iterations (while satisfying λ1′ + λ2′ = 1), observing the range over which the accuracy on the samples varies, and taking the λ1′ and λ2′ values at the highest accuracy as the optimal λ1 and λ2.
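The weight search in step (3) can be sketched as follows. The sweep parameters (start 0.01, step 0.01, 100 iterations, λ1′ + λ2′ = 1) come from the text above, while the accuracy function is a stand-in for the KNN evaluation the patent leaves unspecified:

```python
# Sweep lambda1 from 0.01 in steps of 0.01 for 100 iterations, with
# lambda2 = 1 - lambda1, and keep the pair with the highest accuracy.

def pick_weights(accuracy, start=0.01, step=0.01, iterations=100):
    best = (None, None, -1.0)
    for i in range(iterations):
        lam1 = start + i * step        # 0.01, 0.02, ..., 1.00
        lam2 = 1.0 - lam1              # enforce lambda1 + lambda2 = 1
        acc = accuracy(lam1, lam2)
        if acc > best[2]:
            best = (lam1, lam2, acc)
    return best

# Hypothetical accuracy surface peaking near lambda1 = 0.7:
demo_accuracy = lambda l1, l2: 1.0 - (l1 - 0.7) ** 2
lam1, lam2, acc = pick_weights(demo_accuracy)
print(round(lam1, 2), round(lam2, 2))  # 0.7 0.3
```

In practice `accuracy` would evaluate a KNN classifier over the labeled training samples for each candidate weight pair.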
In any of the above technical schemes, preferably, determining the lexical similarity between the first short text and the second short text specifically includes the following step: determining the lexical similarity according to a first calculation formula, where the first calculation formula is TermSim(x, y) = 2 × ts(xt, yt) / (s(xt) + s(yt)), x is the first short text, y is the second short text, xt is the first short text after stopword removal, yt is the second short text after stopword removal, s(xt) is the number of effective words of the first short text after stopword removal, s(yt) is the number of effective words of the second short text after stopword removal, and ts(xt, yt) is the number of identical words shared by the first short text and the second short text after stopword removal and removal of repeated words.
In this technical scheme, the lexical similarity is determined by the first calculation formula: after stopwords and repeated words are removed, the lexical similarity, that is, the co-occurrence degree of the two short texts, is determined from the numbers of effective words of the first short text and the second short text. On the one hand, the calculation process is fairly simple; on the other hand, the accuracy of the calculation is improved.
Specifically, regarding stopwords: in information retrieval, in order to save storage space and improve retrieval efficiency, certain characters or words are automatically filtered out before or after natural-language data (or text) is processed; these characters or words are called stopwords. Stopwords are entered manually rather than generated automatically, and the generated stopwords form a stopword list.
Here, the co-occurrence degree refers to one of the external cohesion devices that establish connections between sentences; it mostly refers to the simultaneous occurrence of synonyms and related words, which links the preceding and following sentences into coherent language. Temporal connection, causal relations and the other external cohesion devices are all accompanied by term co-occurrence.
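Under the assumption that the first formula (garbled in the source) is the Dice-style ratio 2 × ts(xt, yt) / (s(xt) + s(yt)) implied by its term definitions, the lexical similarity can be sketched as follows; the tokenizer and stopword list are illustrative stand-ins:

```python
# Lexical similarity over stopword-free, de-duplicated texts, assuming
# the co-occurrence degree is 2 * shared_words / (|xt| + |yt|).

STOPWORDS = {"the", "a", "of"}  # hypothetical stopword list

def effective_words(text):
    """Remove stopwords and duplicate words, as the formula requires."""
    return {t for t in text.lower().split() if t not in STOPWORDS}

def term_sim(x, y):
    xt, yt = effective_words(x), effective_words(y)
    if not xt or not yt:
        return 0.0
    ts = len(xt & yt)                      # shared words after cleanup
    return 2.0 * ts / (len(xt) + len(yt))  # assumed Dice-style ratio

print(term_sim("the cat sat", "a cat sat of mat"))  # 2*2/(2+3) = 0.8
```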
In any of the above technical schemes, preferably, determining the word-order similarity of the first short text or the second short text with respect to the standard word order specifically includes the following step: determining the word-order similarity according to a second calculation formula, where the second calculation formula is Order_sim(baseline, y) = 0 when the second short text y is empty, Order_sim(baseline, y) = 1 when y contains only one word, and Order_sim(baseline, y) = 1 − invCount(y) / maxInvCount(baseline) when y contains n words; baseline is the standard word order of the specified scene, invCount(y) is the permutation (inversion) number of the second short text y relative to baseline, maxInvCount(baseline) is the maximum permutation number relative to baseline, and n is the number of words in the second short text.
In this technical scheme, comparing the word order of the first short text or the second short text with the standard word order realizes the detection of word-order similarity. By setting a standard word order, the first short text or the second short text can be compared against it: when the short text is empty, the word-order correlation is 0; when it contains only one word, the word-order correlation is 1; and when it contains n words, the word-order similarity is determined from the ratio of the relative permutation number to the maximum permutation number. This realizes the determination of word-order similarity in a manner that is simple and highly reliable.
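The cases described above (0 for an empty text, 1 for a single word, otherwise one minus the ratio of the inversion count to the maximum inversion count) can be sketched as follows; the baseline sentence is an invented example, and words of y absent from the baseline are simply skipped, which the patent does not specify:

```python
# Word-order similarity: map y's words to their baseline positions, count
# inverted pairs, and normalize by the maximum count n*(n-1)/2.

def order_sim(baseline, y):
    rank = {w: i for i, w in enumerate(baseline)}
    pos = [rank[w] for w in y if w in rank]  # assumption: ignore unknown words
    n = len(pos)
    if n == 0:
        return 0.0
    if n == 1:
        return 1.0
    inv = sum(1 for i in range(n) for j in range(i + 1, n) if pos[i] > pos[j])
    max_inv = n * (n - 1) // 2               # maximum permutation number
    return 1.0 - inv / max_inv

baseline = ["turn", "on", "living", "room", "light"]
print(order_sim(baseline, ["turn", "on", "light"]))  # 1.0 (no inversions)
print(order_sim(baseline, ["light", "on", "turn"]))  # 0.0 (fully reversed)
```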
In any of the above technical schemes, preferably, performing the weighting operation on the lexical similarity and the word-order similarity according to the first preset weight of the lexical similarity and the second preset weight of the word-order similarity, to determine the similarity between the first short text and the second short text, specifically includes the following step: determining the similarity according to a third calculation formula, where the third formula is SenSim(x) = λ1 × TermSim(x, y) + λ2 × {Order_sim(baseline, y)}^k, TermSim(x, y) is the lexical similarity, λ1 is the first preset weight, Order_sim(baseline, y) is the word-order similarity, λ2 is the second preset weight, and k is the number of second short texts in the similar neighborhood.
In this technical scheme, the third formula determines the similarity of the first short text relative to the k second short texts, so as to determine the similarity of the first short text; short-text similarity can thus be measured more comprehensively and accurately.
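A sketch of the third formula with illustrative placeholder values for the weights and the component similarities (none of the numbers below come from the patent):

```python
# SenSim(x) = lambda1 * TermSim(x, y) + lambda2 * Order_sim(baseline, y)**k

def sen_sim(term_similarity, order_similarity, lam1=0.7, lam2=0.3, k=1):
    """Weighted combination of lexical and word-order similarity."""
    return lam1 * term_similarity + lam2 * order_similarity ** k

# With TermSim = 0.8 and Order_sim = 1.0 for some hypothetical text pair:
score = sen_sim(0.8, 1.0)
print(round(score, 2))  # 0.86
```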
In any of the above technical schemes, preferably, before determining the word-order similarity of the first short text or the second short text with respect to the standard word order, the method further includes: determining the standard word order of a specified scene according to the attributes of the scene; and generating the synonym lexicon of the specified scene according to the standard word order.
In this technical scheme, determining the standard word order of each scene from the attributes of the specified scene realizes the determination of standard word orders for multiple scenes. In natural-language-processing applications in vertical domains, a stopword dictionary and a synonym lexicon can be built from the standard word order, giving the scheme stronger practicability.
An embodiment of the second aspect of the present invention provides a determination apparatus for short-text similarity, including: a determining unit, configured to determine, from a first short text, a similar-text data set containing at least one second short text, the determining unit being further configured to determine the lexical similarity between the first short text and the second short text, and further configured to determine the word-order similarity of the first short text or the second short text with respect to a standard word order; the determination apparatus further includes an operating unit, configured to perform, according to a first preset weight of the lexical similarity and a second preset weight of the word-order similarity, a weighting operation on the lexical similarity and the word-order similarity, to determine the similarity between the first short text and the second short text.
In this technical scheme, after the first short text is input, the second short text is determined from a preset scene synonym lexicon. The lexical similarity between the first short text and the second short text is determined, together with the word-order similarity of the first short text or the second short text with respect to the standard word order, and the lexical similarity and the word-order similarity are then weighted to obtain the similarity between the first short text and the second short text. By determining, from the statistical features and semantic information of the short text to be detected, the text similarity with several neighboring second short texts, short-text similarity can be measured more comprehensively and accurately.
In the above technical scheme, preferably, the apparatus further includes: a collecting unit, configured to collect a short-text sample set; the operating unit being further configured to perform a preprocessing operation on the short-text samples in the sample set to obtain processed texts, the preprocessing including Chinese word segmentation, stopword removal, text feature extraction, text de-duplication and custom-dictionary configuration; the determining unit being further configured to determine the similar texts among the processed texts and import the multiple similar texts into a preset database; the determination apparatus further including: an allocating unit, configured to allocate, according to a preset splitting ratio, the multiple similar texts in the preset database to a training data set and a test data set; and a building unit, configured to build, according to a machine-learning algorithm, an iterative model over the multiple similar texts in the training data set; the determining unit being further configured to iteratively update a first weight value and a second weight value according to the iterative model, and to determine the first and second weight values at which the difference between the input and the output of the iterative model is smallest as the first preset weight and the second preset weight, where the first weight value is the weight of the lexical similarity of the multiple similar texts and the second weight value is the weight of the word-order similarity of the multiple similar texts.
In this technical scheme, before the similarity comparison is performed, the collected short-text sample set is iteratively updated using a machine-learning approach, starting the iteration from λ1 = 0.01 and λ2 = 0.99, with 100 iterations. In each iteration, the input value of the iterative model is compared with its output value, and the λ1 and λ2 for which the comparison difference is smallest are taken as the first preset weight of the lexical similarity and, correspondingly, the second preset weight of the word-order similarity. On the one hand, this realizes an optimized selection of the weight values; on the other hand, the weight values can be conveniently adjusted according to actual needs, making the scheme more convenient to use.
Specifically, the short-text similarity comprises at least a linear combination of the lexical similarity and the word-order similarity, and λ1 (the first preset weight) and λ2 (the second preset weight) are determined by machine learning, mainly including: (1) collecting a short-text data set and preprocessing the short texts, the processing including steps such as Chinese word segmentation, stopword removal, text feature extraction, text de-duplication and custom-dictionary configuration; (2) storing the preprocessed short texts in a database and labeling each short-text sample so as to determine its most similar short text, training the model on the most similar short texts, and allocating the short texts to a training set and a test set according to a preset splitting ratio, for example 7:3; (3) using a machine-learning algorithm such as KNN (k-Nearest Neighbor classification) to model and analyze the training set, letting λ1′ start at 0.01 with a step of 0.01 over 100 iterations (while satisfying λ1′ + λ2′ = 1), observing the range over which the accuracy on the samples varies, and taking the λ1′ and λ2′ values at the highest accuracy as the optimal λ1 and λ2.
In any of the above technical schemes, preferably, the determining unit is further configured to determine the lexical similarity according to a first calculation formula, where the first calculation formula is TermSim(x, y) = 2 × ts(xt, yt) / (s(xt) + s(yt)), x is the first short text, y is the second short text, xt is the first short text after stopword removal, yt is the second short text after stopword removal, s(xt) is the number of effective words of the first short text after stopword removal, s(yt) is the number of effective words of the second short text after stopword removal, and ts(xt, yt) is the number of identical words shared by the first short text and the second short text after stopword removal and removal of repeated words.
In this technical scheme, the lexical similarity is determined by the first calculation formula: after stopwords and repeated words are removed, the lexical similarity, that is, the co-occurrence degree of the two short texts, is determined from the numbers of effective words of the first short text and the second short text. On the one hand, the calculation process is fairly simple; on the other hand, the accuracy of the calculation is improved.
Specifically, regarding stopwords: in information retrieval, in order to save storage space and improve retrieval efficiency, certain characters or words are automatically filtered out before or after natural-language data (or text) is processed; these characters or words are called stopwords. Stopwords are entered manually rather than generated automatically, and the generated stopwords form a stopword list.
Here, the co-occurrence degree refers to one of the external cohesion devices that establish connections between sentences; it mostly refers to the simultaneous occurrence of synonyms and related words, which links the preceding and following sentences into coherent language. Temporal connection, causal relations and the other external cohesion devices are all accompanied by term co-occurrence.
In any of the above technical schemes, preferably, the determining unit is further configured to determine the word-order similarity according to a second calculation formula, where the second calculation formula is Order_sim(baseline, y) = 0 when the second short text y is empty, Order_sim(baseline, y) = 1 when y contains only one word, and Order_sim(baseline, y) = 1 − invCount(y) / maxInvCount(baseline) when y contains n words; baseline is the standard word order of the specified scene, invCount(y) is the permutation (inversion) number of the second short text y relative to baseline, maxInvCount(baseline) is the maximum permutation number relative to baseline, and n is the number of words in the second short text.
In this technical scheme, comparing the word order of the first short text or the second short text with the standard word order realizes the detection of word-order similarity. By setting a standard word order, the first short text or the second short text can be compared against it: when the short text is empty, the word-order correlation is 0; when it contains only one word, the word-order correlation is 1; and when it contains n words, the word-order similarity is determined from the ratio of the relative permutation number to the maximum permutation number. This realizes the determination of word-order similarity in a manner that is simple and highly reliable.
In any of the above technical schemes, preferably, the determining unit is further configured to determine the similarity according to a third calculation formula, where the third formula is SenSim(x) = λ1 × TermSim(x, y) + λ2 × {Order_sim(baseline, y)}^k, TermSim(x, y) is the lexical similarity, λ1 is the first preset weight, Order_sim(baseline, y) is the word-order similarity, λ2 is the second preset weight, and k is the number of second short texts in the similar neighborhood.
In this technical scheme, the third formula determines the similarity of the first short text relative to the k second short texts, so as to determine the similarity of the first short text; short-text similarity can thus be measured more comprehensively and accurately.
In any of the above technical schemes, preferably, the determining unit is further configured to determine the standard word order of a specified scene according to the attributes of the scene; the determination apparatus further includes a generating unit, configured to generate the synonym lexicon of the specified scene according to the standard word order.
In this technical scheme, determining the standard word order of each scene from the attributes of the specified scene realizes the determination of standard word orders for multiple scenes. In natural-language-processing applications in vertical domains, a stopword dictionary and a synonym lexicon can be built from the standard word order, giving the scheme stronger practicability.
According to the third aspect of the present invention, a computer device is also provided. The computer device includes a processor, and the processor, when executing a computer program stored in a memory, implements the steps of any one of the methods described above.
According to the fourth aspect of the present invention, a computer-readable storage medium is also provided, on which a computer program (instructions) is stored; when the computer program (instructions) is executed by a processor, the steps of any one of the methods described above are implemented.
Additional aspects and advantages of the present invention will become apparent in the following description, or will be learned through practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in combination with the accompanying drawings, in which:
Fig. 1 shows a schematic flowchart of a determination method for short-text similarity according to an embodiment of the present invention;
Fig. 2 shows a schematic block diagram of a determination apparatus for short-text similarity according to an embodiment of the present invention;
Fig. 3 shows a schematic flowchart of a determination method for short-text similarity according to another embodiment of the present invention.
Detailed description of the embodiments
In order that the above objects, features and advantages of the present invention can be understood more clearly, the present invention is further described in detail below in combination with the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the application and the features in the embodiments may be combined with one another.
Many specific details are set forth in the following description to facilitate a thorough understanding of the present invention; however, the present invention may also be implemented in other ways different from those described here. Therefore, the scope of protection of the present invention is not limited by the specific embodiments described below.
Fig. 1 shows a schematic flowchart of a determination method for short-text similarity according to an embodiment of the present invention.
As shown in Fig. 1, the determination method for short-text similarity according to an embodiment of the present invention includes: step 102, determining, from a first short text, a similar-text data set containing at least one second short text; step 104, determining the lexical similarity between the first short text and the second short text; step 106, determining the word-order similarity of the first short text or the second short text with respect to the standard word order; and step 108, performing, according to a first preset weight of the lexical similarity and a second preset weight of the word-order similarity, a weighting operation on the lexical similarity and the word-order similarity, to determine the similarity between the first short text and the second short text.
In the technical scheme, after the first short text is inputted, the second end text is determined according to default scene thesaurus This, is with it is determined that morphology similarity between the first short text and the second short text, and the first short text or the second short text With the word order similarity of standard word order, and then morphology similarity and word order similarity are weighted, it is short to obtain first Similarity between text and the second short text, according to the statistical nature and semantic information of short text to be detected, it is determined that with phase Like the text similarity between neighbouring several second short texts, short text similarity can be more all-sidedly and accurately weighed.
In addition, the determination method of the short text similarity in above-described embodiment provided by the invention can also be with following attached Add technical characteristic:
In the above-mentioned technical solutions, it is preferable that it is determined that morphology similarity between the first short text and the second short text Before, in addition to:Collect short text sample set;Pretreatment operation is performed to the short text sample in short text sample set, to obtain Text is handled, pretreatment operation includes Chinese word segmentation, removal stop words, text feature, text duplicate removal and text and made by oneself Adopted lexicon configuration;It is determined that multiple Similar Texts are imported presetting database by the Similar Text in processing text;According to default Cutting ratio, multiple Similar Texts in presetting database are respectively allocated to training dataset and test data set;According to machine Device learning algorithm, the multiple Similar Texts concentrated to training data establish iterative model;According to iterative model, to the first weighted value Renewal is iterated with the second weighted value, by the input of iterative model and the first weighted value and second during output difference minimum Weighted value, the first default weighted value and the second default weighted value are identified as, wherein, the first weighted value is multiple Similar Texts Morphology similarity weighted value, the second weighted value for multiple Similar Texts word order similarity weighted value.
In this technical scheme, before the similarity comparison operation is performed, the collected short text sample set is iteratively updated using a machine-learning-based approach, starting the iteration with λ1 = 0.01 and λ2 = 0.99 and running for 100 iterations. During the iteration, the input value of the iterative model is compared with the output value, and the λ1 and λ2 with the smallest difference in the comparison result are taken as the first preset weighted value for the morphology similarity and the second preset weighted value for the word order similarity. On the one hand, this realizes an optimized selection of the weighted values; on the other hand, the weight values can be conveniently adjusted according to actual needs, making the scheme more convenient to use.
Specifically, the short text similarity comprises at least a linear combination of the morphology similarity and the word order similarity, and λ1 (the first preset weighted value) and λ2 (the second preset weighted value) are determined in a machine-learning-based manner, which mainly includes: (1) collecting a short text data set and pre-processing the short texts, the processing procedure including steps such as Chinese word segmentation, stop word removal, text feature extraction, text de-duplication and text custom dictionary configuration; (2) storing the pre-processed short texts in a database and labelling each short text sample to determine the most similar short texts, training a model on the most similar short texts, and splitting the short texts into a training set and a test set according to a preset splitting ratio, for example 7:3; (3) using a machine learning algorithm, for example KNN (k-Nearest Neighbor classification), to model and analyse the training set, sweeping λ1′ from a starting point of 0.01 with a step of 0.01 over 100 iterations (while satisfying λ1′ + λ2′ = 1), observing the variation range of the accuracy on the samples, and taking the values of λ1′ and λ2′ at the highest accuracy as the optimal λ1 and λ2.
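The weight sweep in step (3) can be sketched as a simple grid search: λ1 runs from 0.01 in steps of 0.01 under the constraint λ1 + λ2 = 1, and the pair with the highest training-set accuracy wins. The scoring function here is a caller-supplied stand-in for the KNN-based evaluation, and the toy score below is illustrative only.

```python
# Grid search over λ1 with λ2 = 1 − λ1 (100 candidate pairs, step 0.01),
# keeping the pair with the highest accuracy. `score` stands in for the
# KNN-based evaluation on the training set described in the text.
def search_weights(score):
    """score(lam1, lam2) -> accuracy; returns the best (lam1, lam2) pair."""
    best = (0.0, None, None)
    for step in range(1, 101):                 # 100 iterations
        lam1 = step * 0.01
        lam2 = 1.0 - lam1                      # constraint λ1 + λ2 = 1
        acc = score(lam1, lam2)
        if acc > best[0]:
            best = (acc, lam1, lam2)
    return best[1], best[2]

# toy score peaking at λ1 = 0.3 (illustrative only)
lam1, lam2 = search_weights(lambda a, b: 1.0 - abs(a - 0.3))
print(round(lam1, 2), round(lam2, 2))
```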
In any of the above technical schemes, preferably, determining the morphology similarity between the first short text and the second short text specifically includes the following step: determining the morphology similarity according to a first calculation formula, wherein in the first calculation formula x is the first short text, y is the second short text, xt is the first short text after stop word removal, yt is the second short text after stop word removal, s(xt) is the effective word count of the first short text after stop word removal, s(yt) is the effective word count of the second short text after stop word removal, and ts(xt, yt) is the number of words shared by the first short text and the second short text after stop word removal and removal of repeated words.
In this technical scheme, the morphology similarity is determined by the first calculation formula: after stop words and repeated words are removed from each text, the morphology similarity, i.e. the degree of co-occurrence of the two short texts, is determined from the effective word counts of the first short text and the second short text. On the one hand, the calculation process is fairly simple; on the other hand, the accuracy of the calculation is improved.
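The first calculation formula itself is not reproduced in this text (the formula image is omitted), but the variable definitions above suggest a Dice-style overlap: twice the shared-word count over the sum of the effective word counts. The closed form below is therefore an assumption, not the patent's exact formula.

```python
# Hedged reconstruction of the morphology similarity: a Dice-style
# overlap over stop-word-filtered token lists. Treating "effective word
# count" as the count of distinct words is also an assumption.
def term_sim(x_t: list[str], y_t: list[str]) -> float:
    """Morphology similarity of two stop-word-filtered token lists."""
    s_x, s_y = len(set(x_t)), len(set(y_t))    # effective word counts
    if s_x + s_y == 0:
        return 0.0
    ts = len(set(x_t) & set(y_t))              # shared words after dedup
    return 2 * ts / (s_x + s_y)

print(term_sim(["buy", "air", "conditioner"], ["buy", "conditioner"]))
```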
Specifically, regarding stop words: in information retrieval, to save storage space and improve search efficiency, certain characters or words are automatically filtered out before or after natural language data (or text) is processed. These characters or words are called stop words. They are all entered manually rather than generated automatically, and the generated stop words form a stop word list.
Here, the degree of co-occurrence refers to one of the external cohesion devices that establish links between sentences; it mostly refers to synonymous and related words appearing together, connecting the preceding and following sentences to form coherent language. Situational links, causal relations and the like in external cohesion are all accompanied by term co-occurrence.
In any of the above technical schemes, preferably, determining the word order similarity between the first short text or the second short text and the standard word order specifically includes the following step: determining the word order similarity according to a second calculation formula, wherein in the second calculation formula baseline is the standard word order of the given scenario, invCount(y) is the inversion (permutation) count of the second short text y relative to the baseline, maxInvCount(baseline) is the maximum inversion count of the baseline, and n is the number of words in the second short text.
In this technical scheme, word order similarity detection is realized by comparing the word order of the first short text or the second short text with the standard word order. By setting a standard word order, the first short text or the second short text can be compared with the standard word order: when the short text is empty, the word order correlation is 0; when there is only one word, the word order correlation is 1; and when there are n words, the word order similarity is determined from the ratio of the relative inversion count to the maximum inversion count. The determination of the word order similarity is thus realized in a simple and highly reliable manner.
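The description above (0 for an empty text, 1 for a single word, otherwise a ratio of inversion counts) can be sketched as follows. The second calculation formula's image is omitted from this text, so the closed form here, one minus invCount(y)/maxInvCount, is a plausible reading rather than the patent's exact expression; skipping words absent from the baseline is likewise an assumption.

```python
# Hedged sketch of the word order similarity: map each word to its index
# in the baseline (standard word order), count inversions in y's index
# sequence, and normalize by the maximum inversion count n*(n-1)/2.
def order_sim(baseline: list[str], y: list[str]) -> float:
    """Word-order similarity of y against the standard word order."""
    rank = {w: i for i, w in enumerate(baseline)}
    idx = [rank[w] for w in y if w in rank]    # words absent from baseline skipped
    n = len(idx)
    if n == 0:
        return 0.0                             # empty text -> 0
    if n == 1:
        return 1.0                             # single word -> 1
    inv = sum(1 for i in range(n) for j in range(i + 1, n) if idx[i] > idx[j])
    max_inv = n * (n - 1) // 2                 # fully reversed order
    return 1.0 - inv / max_inv

print(order_sim(["open", "living-room", "air-conditioner"],
                ["air-conditioner", "living-room", "open"]))
```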
In any of the above technical schemes, preferably, performing the weighting operation on the morphology similarity and the word order similarity according to the first preset weighted value of the morphology similarity and the second preset weighted value of the word order similarity, to determine the similarity between the first short text and the second short text, specifically includes the following step: determining the similarity according to a third calculation formula, wherein the third formula is SenSim(x) = λ1 × TermSim(x, y) + λ2 × {Order_sim(baseline, y)}k, TermSim(x, y) is the morphology similarity, λ1 is the first preset weighted value, Order_sim(baseline, y) is the word order similarity, λ2 is the second preset weighted value, and k is the number of second short texts in the similar neighbourhood of the first short text.
In this technical scheme, the similarity of the first short text relative to the k second short texts is determined by the third formula, so as to determine the similarity of the first short text; in this way short text similarity can be measured more comprehensively and accurately.
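The third calculation formula combines the two components linearly: SenSim = λ1·TermSim + λ2·Order_sim. The role of the subscript k (scoring against the k nearest second short texts) is ambiguous in this translation, so the sketch below scores one candidate at a time; repeating it over k candidates and ranking is one plausible reading of that step. The default weights are illustrative, not values from the patent.

```python
# Weighted combination of morphology and word order similarity per the
# third calculation formula. lam1/lam2 would come from the learned
# preset weighted values; 0.6/0.4 here are placeholder examples.
def sen_sim(term: float, order: float,
            lam1: float = 0.6, lam2: float = 0.4) -> float:
    """SenSim = lam1 * TermSim + lam2 * Order_sim (lam1 + lam2 assumed 1)."""
    return lam1 * term + lam2 * order

print(sen_sim(0.8, 0.5))  # 0.6*0.8 + 0.4*0.5
```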
In any of the above technical schemes, preferably, before determining the word order similarity between the first short text or the second short text and the standard word order, the method further includes: determining the standard word order of a given scenario according to the attributes of the given scenario; and generating the thesaurus of the given scenario according to the standard word order.
In this technical scheme, by determining the standard word order of each scenario according to the attributes of the given scenario, the determination of the standard word orders of multiple scenarios is realized. In natural language processing applications in a vertical field, a stop word dictionary and a thesaurus can be built according to the standard word order, so the scheme is highly practicable.
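Using the per-scenario thesaurus amounts to mapping words with the same meaning in a scenario onto the standard term, so that candidate texts share the standard word order's vocabulary before the word order comparison. The thesaurus contents below are hypothetical examples for a smart-home-style scenario, not entries from the patent.

```python
# Sketch of thesaurus normalization: replace scenario synonyms with the
# standard word before word-order comparison. Contents are illustrative.
THESAURUS = {
    "aircon": "air-conditioner",
    "AC": "air-conditioner",
    "turn-on": "open",
}

def normalize(tokens: list[str]) -> list[str]:
    """Map each token to its standard word if the thesaurus knows it."""
    return [THESAURUS.get(t, t) for t in tokens]

print(normalize(["turn-on", "the", "AC"]))
```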
Fig. 2 shows a schematic block diagram of a determining device for short text similarity according to an embodiment of the invention.
As shown in Fig. 2 the determining device 200 of short text similarity according to an embodiment of the invention, including:It is determined that Unit 202, for determining the Similar Text data set for including at least one second short text according to the first short text;Determining unit 202 are additionally operable to:Determine the morphology similarity between the first short text and the second short text;Determining unit 202 is additionally operable to:Determine The word order similarity of one short text or the second short text and standard word order;Determining device 200 also includes:Operating unit 204, is used for According to the first of morphology similarity the default weighted value and the second default weighted value of word order similarity, to morphology similarity and word order Similarity performs weighting operations, to determine the similarity of the first short text and the second short text.
In this technical scheme, after the first short text is input, the second short text is determined according to a preset scene thesaurus. The morphology similarity between the first short text and the second short text is then determined, together with the word order similarity between the first short text (or the second short text) and a standard word order, and the two similarities are weighted to obtain the similarity between the first short text and the second short text. Based on the statistical features and semantic information of the short text to be detected, the text similarities to the several most similar neighbouring second short texts are determined, so that short text similarity can be measured more comprehensively and accurately.
In the above technical scheme, preferably, the device further includes: a collecting unit 206 for collecting a short text sample set; the operating unit 204 is further configured to perform a pretreatment operation on the short text samples in the short text sample set to obtain processed texts, the pretreatment operation including Chinese word segmentation, stop word removal, text feature extraction, text de-duplication and text custom dictionary configuration; the determining unit 202 is further configured to determine the similar texts among the processed texts and import the multiple similar texts into a preset database; the determining device 200 further includes: an allocating unit 208 for allocating, according to a preset splitting ratio, the multiple similar texts in the preset database to a training data set and a test data set respectively; and an establishing unit 210 for establishing, according to a machine learning algorithm, an iterative model for the multiple similar texts in the training data set; the determining unit 202 is further configured to iteratively update a first weighted value and a second weighted value according to the iterative model, and to determine the first weighted value and the second weighted value at which the difference between the input and the output of the iterative model is minimal as the first preset weighted value and the second preset weighted value respectively, wherein the first weighted value is the weighted value of the morphology similarity of the multiple similar texts, and the second weighted value is the weighted value of the word order similarity of the multiple similar texts.
In this technical scheme, before the similarity comparison operation is performed, the collected short text sample set is iteratively updated using a machine-learning-based approach, starting the iteration with λ1 = 0.01 and λ2 = 0.99 and running for 100 iterations. During the iteration, the input value of the iterative model is compared with the output value, and the λ1 and λ2 with the smallest difference in the comparison result are taken as the first preset weighted value for the morphology similarity and the second preset weighted value for the word order similarity. On the one hand, this realizes an optimized selection of the weighted values; on the other hand, the weight values can be conveniently adjusted according to actual needs, making the scheme more convenient to use.
Specifically, the short text similarity comprises at least a linear combination of the morphology similarity and the word order similarity, and λ1 (the first preset weighted value) and λ2 (the second preset weighted value) are determined in a machine-learning-based manner, which mainly includes: (1) collecting a short text data set and pre-processing the short texts, the processing procedure including steps such as Chinese word segmentation, stop word removal, text feature extraction, text de-duplication and text custom dictionary configuration; (2) storing the pre-processed short texts in a database and labelling each short text sample to determine the most similar short texts, training a model on the most similar short texts, and splitting the short texts into a training set and a test set according to a preset splitting ratio, for example 7:3; (3) using a machine learning algorithm, for example KNN (k-Nearest Neighbor classification), to model and analyse the training set, sweeping λ1′ from a starting point of 0.01 with a step of 0.01 over 100 iterations (while satisfying λ1′ + λ2′ = 1), observing the variation range of the accuracy on the samples, and taking the values of λ1′ and λ2′ at the highest accuracy as the optimal λ1 and λ2.
In any of the above technical schemes, preferably, the determining unit 202 is further configured to determine the morphology similarity according to the first calculation formula, wherein in the first calculation formula x is the first short text, y is the second short text, xt is the first short text after stop word removal, yt is the second short text after stop word removal, s(xt) is the effective word count of the first short text after stop word removal, s(yt) is the effective word count of the second short text after stop word removal, and ts(xt, yt) is the number of words shared by the first short text and the second short text after stop word removal and removal of repeated words.
In this technical scheme, the morphology similarity is determined by the first calculation formula: after stop words and repeated words are removed from each text, the morphology similarity, i.e. the degree of co-occurrence of the two short texts, is determined from the effective word counts of the first short text and the second short text. On the one hand, the calculation process is fairly simple; on the other hand, the accuracy of the calculation is improved.
Specifically, regarding stop words: in information retrieval, to save storage space and improve search efficiency, certain characters or words are automatically filtered out before or after natural language data (or text) is processed. These characters or words are called stop words. They are all entered manually rather than generated automatically, and the generated stop words form a stop word list.
Here, the degree of co-occurrence refers to one of the external cohesion devices that establish links between sentences; it mostly refers to synonymous and related words appearing together, connecting the preceding and following sentences to form coherent language. Situational links, causal relations and the like in external cohesion are all accompanied by term co-occurrence.
In any of the above technical schemes, preferably, the determining unit 202 is further configured to determine the word order similarity according to the second calculation formula, wherein in the second calculation formula baseline is the standard word order of the given scenario, invCount(y) is the inversion count of the second short text y relative to the baseline, maxInvCount(baseline) is the maximum inversion count of the baseline, and n is the number of words in the second short text.
In this technical scheme, word order similarity detection is realized by comparing the word order of the first short text or the second short text with the standard word order. By setting a standard word order, the first short text or the second short text can be compared with the standard word order: when the short text is empty, the word order correlation is 0; when there is only one word, the word order correlation is 1; and when there are n words, the word order similarity is determined from the ratio of the relative inversion count to the maximum inversion count. The determination of the word order similarity is thus realized in a simple and highly reliable manner.
In any of the above technical schemes, preferably, the determining unit 202 is further configured to determine the similarity according to the third calculation formula, wherein the third formula is SenSim(x) = λ1 × TermSim(x, y) + λ2 × {Order_sim(baseline, y)}k, TermSim(x, y) is the morphology similarity, λ1 is the first preset weighted value, Order_sim(baseline, y) is the word order similarity, λ2 is the second preset weighted value, and k is the number of second short texts in the similar neighbourhood of the first short text.
In this technical scheme, the similarity of the first short text relative to the k second short texts is determined by the third formula, so as to determine the similarity of the first short text; in this way short text similarity can be measured more comprehensively and accurately.
In any of the above technical schemes, preferably, the determining unit 202 is further configured to determine the standard word order of a given scenario according to the attributes of the given scenario; the determining device 200 further includes a generating unit 212 for generating the thesaurus of the given scenario according to the standard word order.
In this technical scheme, by determining the standard word order of each scenario according to the attributes of the given scenario, the determination of the standard word orders of multiple scenarios is realized. In natural language processing applications in a vertical field, a stop word dictionary and a thesaurus can be built according to the standard word order, so the scheme is highly practicable.
Fig. 3 shows a schematic flow chart of a determination method for short text similarity according to another embodiment of the invention.
As shown in Fig. 3, the determination method for short text similarity according to another embodiment of the invention includes: step 302, collecting a short text sample set; step 304, performing a pretreatment operation on the short text samples in the short text sample set to obtain processed texts, the pretreatment operation including Chinese word segmentation, stop word removal, text feature extraction, text de-duplication and text custom dictionary configuration; step 306, determining the similar texts among the processed texts and importing the multiple similar texts into a preset database; step 308, according to a preset splitting ratio, allocating the multiple similar texts in the preset database to a test data set; step 310, according to a preset splitting ratio, allocating the multiple similar texts in the preset database to a training data set and a test data set respectively; step 312, according to a machine learning algorithm, establishing an iterative model for the multiple similar texts in the training data set; step 314, for any given scenario, determining its standard word order and establishing the thesauri of all scenarios, and, for any sentence, replacing the words that have scenario synonyms so as to make the sentence consistent with the standard scenario; step 316, according to the iterative model, iteratively updating a first weighted value and a second weighted value, and determining the first weighted value and the second weighted value at which the difference between the input and the output of the iterative model is minimal as the first preset weighted value and the second preset weighted value respectively, wherein the first weighted value is the weighted value of the morphology similarity of the multiple similar texts, and the second weighted value is the weighted value of the word order similarity of the multiple similar texts; step 318, performing a weighting operation on the morphology similarity and the word order similarity according to the first preset weighted value of the morphology similarity and the second preset weighted value of the word order similarity, to determine the similarity between the first short text and the second short text; and step 320, outputting the results in descending order.
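Steps 318 and 320 of the flow above, weighting the two similarities for each candidate and outputting results in descending order, can be sketched as follows. The candidate scores and weights are illustrative placeholders, not values from the patent.

```python
# Rank candidate second short texts by their weighted similarity and
# output names in descending score order. Scores/weights are examples.
candidates = {
    "text-A": (0.8, 0.9),   # (morphology similarity, word order similarity)
    "text-B": (0.5, 0.4),
    "text-C": (0.9, 0.2),
}
lam1, lam2 = 0.6, 0.4       # stand-ins for the learned preset weights

ranked = sorted(candidates.items(),
                key=lambda kv: lam1 * kv[1][0] + lam2 * kv[1][1],
                reverse=True)  # descending order, as in step 320
print([name for name, _ in ranked])
```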
A computer installation according to an embodiment of the invention includes a processor, and the processor implements the steps of any one of the above methods when executing a computer program stored in a memory.
A computer-readable recording medium according to an embodiment of the invention stores a computer program (instructions) thereon, and the steps of any one of the above methods are implemented when the computer program (instructions) is executed by a processor.
In the present invention, the terms "first", "second" and "third" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance; the term "multiple" means two or more, unless otherwise expressly limited. Terms such as "installation", "connected", "connection" and "fixation" shall be interpreted broadly; for example, "connection" may be a fixed connection, a detachable connection or an integral connection, and "connected" may be a direct connection or an indirect connection through an intermediary. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific circumstances.
In the description of the invention, it should be understood that the orientations or positional relationships indicated by terms such as "on", "under", "left", "right", "front" and "rear" are based on the orientations or positional relationships shown in the drawings, are only for the convenience of describing the present invention and simplifying the description, and do not indicate or imply that the device or unit referred to must have a specific orientation or be configured and operated in a specific orientation; they therefore cannot be understood as limiting the present invention.
In the description of this specification, the descriptions of the terms "one embodiment", "some embodiments", "specific embodiment" and the like mean that the specific features, structures, materials or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present invention. In this specification, the schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the described specific features, structures, materials or characteristics may be combined in a suitable manner in any one or more embodiments or examples.
The above are only the preferred embodiments of the present invention and are not intended to limit the invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims (14)

  1. A determination method for short text similarity, characterised by including:
    determining, according to a first short text, a similar text data set including at least one second short text;
    determining the morphology similarity between the first short text and the second short text;
    determining the word order similarity between the first short text or the second short text and a standard word order; and
    performing a weighting operation on the morphology similarity and the word order similarity according to a first preset weighted value of the morphology similarity and a second preset weighted value of the word order similarity, to determine the similarity between the first short text and the second short text.
  2. The determination method for short text similarity according to claim 1, characterised in that, before the determining of the morphology similarity between the first short text and the second short text, the method further includes:
    collecting a short text sample set;
    performing a pretreatment operation on the short text samples in the short text sample set to obtain processed texts, the pretreatment operation including Chinese word segmentation, stop word removal, text feature extraction, text de-duplication and text custom dictionary configuration;
    determining the similar texts in the processed texts, and importing multiple similar texts into a preset database;
    according to a preset splitting ratio, allocating the multiple similar texts in the preset database to a training data set and a test data set respectively;
    according to a machine learning algorithm, establishing an iterative model for the multiple similar texts in the training data set; and
    according to the iterative model, iteratively updating a first weighted value and a second weighted value, and determining the first weighted value and the second weighted value at which the difference between the input and the output of the iterative model is minimal as the first preset weighted value and the second preset weighted value respectively,
    wherein the first weighted value is the weighted value of the morphology similarity of the multiple similar texts, and the second weighted value is the weighted value of the word order similarity of the multiple similar texts.
  3. The determination method for short text similarity according to claim 2, characterised in that determining the morphology similarity between the first short text and the second short text specifically includes the following step:
    determining the morphology similarity according to a first calculation formula,
    wherein in the first calculation formula x is the first short text, y is the second short text, xt is the first short text after stop word removal, yt is the second short text after stop word removal, s(xt) is the effective word count of the first short text after removal of the stop words, s(yt) is the effective word count of the second short text after removal of the stop words, and ts(xt, yt) is the number of words shared by the first short text and the second short text after removal of the stop words and removal of repeated words.
  4. The determination method for short text similarity according to claim 3, characterised in that the determining of the word order similarity between the first short text or the second short text and the standard word order specifically includes the following step:
    determining the word order similarity according to a second calculation formula,
    wherein in the second calculation formula the baseline is the standard word order of a given scenario, invCount(y) is the inversion count of the second short text y relative to the baseline, maxInvCount(baseline) is the maximum inversion count of the baseline, and n is the number of words in the second short text.
  5. The determination method for short text similarity according to claim 4, characterised in that the performing of the weighting operation on the morphology similarity and the word order similarity according to the first preset weighted value of the morphology similarity and the second preset weighted value of the word order similarity, to determine the similarity between the first short text and the second short text, specifically includes the following step:
    determining the similarity according to a third calculation formula,
    wherein the third formula is SenSim(x) = λ1 × TermSim(x, y) + λ2 × {Order_sim(baseline, y)}k, TermSim(x, y) is the morphology similarity, λ1 is the first preset weighted value, Order_sim(baseline, y) is the word order similarity, λ2 is the second preset weighted value, and k is the number of second short texts in the similar neighbourhood of the first short text.
  6. The determination method for short text similarity according to claim 4, characterised in that, before the determining of the word order similarity between the first short text or the second short text and the standard word order, the method further includes:
    determining the standard word order of a given scenario according to the attributes of the given scenario; and
    generating the thesaurus of the given scenario according to the standard word order.
  7. A determining device for short text similarity, characterised by including:
    a determining unit for determining, according to a first short text, a similar text data set including at least one second short text;
    the determining unit being further configured to determine the morphology similarity between the first short text and the second short text;
    the determining unit being further configured to determine the word order similarity between the first short text or the second short text and a standard word order;
    the determining device further including:
    an operating unit for performing a weighting operation on the morphology similarity and the word order similarity according to a first preset weighted value of the morphology similarity and a second preset weighted value of the word order similarity, to determine the similarity between the first short text and the second short text.
  8. The determining device for short text similarity according to claim 7, characterised by further including:
    a collecting unit for collecting a short text sample set;
    the operating unit being further configured to perform a pretreatment operation on the short text samples in the short text sample set to obtain processed texts, the pretreatment operation including Chinese word segmentation, stop word removal, text feature extraction, text de-duplication and text custom dictionary configuration;
    the determining unit being further configured to determine the similar texts in the processed texts and import multiple similar texts into a preset database;
    the determining device further including:
    an allocating unit for allocating, according to a preset splitting ratio, the multiple similar texts in the preset database to a training data set and a test data set respectively;
    an establishing unit for establishing, according to a machine learning algorithm, an iterative model for the multiple similar texts in the training data set;
    the determining unit being further configured to iteratively update a first weighted value and a second weighted value according to the iterative model, and to determine the first weighted value and the second weighted value at which the difference between the input and the output of the iterative model is minimal as the first preset weighted value and the second preset weighted value respectively,
    wherein the first weighted value is the weighted value of the morphology similarity of the multiple similar texts, and the second weighted value is the weighted value of the word order similarity of the multiple similar texts.
  9. The determining device for short text similarity according to claim 8, characterised in that
    the determining unit is further configured to determine the morphology similarity according to a first calculation formula,
    wherein in the first calculation formula x is the first short text, y is the second short text, xt is the first short text after stop word removal, yt is the second short text after stop word removal, s(xt) is the effective word count of the first short text after removal of the stop words, s(yt) is the effective word count of the second short text after removal of the stop words, and ts(xt, yt) is the number of words shared by the first short text and the second short text after removal of the stop words and removal of repeated words.
  10. The determining device for short text similarity according to claim 9, characterised in that
    the determining unit is further configured to determine the word order similarity according to a second calculation formula,
    wherein in the second calculation formula the baseline is the standard word order of a given scenario, invCount(y) is the inversion count of the second short text y relative to the baseline, maxInvCount(baseline) is the maximum inversion count of the baseline, and n is the number of words in the second short text.
  11. The device for determining short text similarity according to claim 10, characterized in that
    the determining unit is further operable to determine the similarity according to a third calculation formula,
    wherein the third calculation formula is SenSim(x) = λ1 × TermSim(x, y) + λ2 × {Order_sim(baseline, y)}^k, where TermSim(x, y) is the morphology similarity, λ1 is the first preset weighted value, Order_sim(baseline, y) is the word order similarity, λ2 is the second preset weighted value, and k is the number of second weighted values in the neighbourhood of the first weighted value.
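Unlike the first two formulas, the third formula survives in the text, so it can be transcribed directly. The sketch below implements it as stated; the particular values of λ1, λ2 and k are illustrative assumptions (the claim only says they are preset).

```python
# Claim 11's combined score: SenSim = lambda1 * TermSim + lambda2 * Order_sim ** k.
# lam1, lam2, k are the preset parameters; their defaults here are assumptions.

def sen_sim(term_similarity, order_similarity, lam1=0.7, lam2=0.3, k=1):
    """Weighted combination of lexical and word-order similarity."""
    return lam1 * term_similarity + lam2 * order_similarity ** k

# e.g. sen_sim(0.5, 0.8) blends a lexical score of 0.5 with a
# word-order score of 0.8 under the assumed weights
```

With k = 1 this reduces to a plain weighted sum; larger k would discount all but near-perfect word-order agreement.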
  12. The device for determining short text similarity according to claim 10, characterized in that
    the determining unit is further operable to determine the standard word order of the given scenario according to an attribute of the given scenario;
    the determining device further includes:
    a generation unit for generating a thesaurus of the given scenario according to the standard word order.
  13. A computer device, characterized in that the computer device includes a processor, and the processor is configured to implement the steps of the method according to any one of claims 1-6 when executing a computer program stored in a memory.
  14. A computer-readable storage medium on which a computer program (instructions) is stored, characterized in that the steps of the method according to any one of claims 1-6 are implemented when the computer program (instructions) is executed by a processor.
CN201710620516.XA 2017-07-26 2017-07-26 Determine method, determining device, computer installation and computer-readable recording medium Pending CN107491425A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710620516.XA CN107491425A (en) 2017-07-26 2017-07-26 Determine method, determining device, computer installation and computer-readable recording medium


Publications (1)

Publication Number Publication Date
CN107491425A true CN107491425A (en) 2017-12-19

Family

ID=60644855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710620516.XA Pending CN107491425A (en) 2017-07-26 2017-07-26 Determine method, determining device, computer installation and computer-readable recording medium

Country Status (1)

Country Link
CN (1) CN107491425A (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002057961A2 (en) * 2001-01-18 2002-07-25 Science Applications International Corporation Method and system of ranking and clustering for document indexing and retrieval
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
CN102968500A (en) * 2012-12-04 2013-03-13 中国飞行试验研究院 Quick retrieving method for special treatment of flight based on layered retrieval
CN104091054A (en) * 2014-06-26 2014-10-08 中国科学院自动化研究所 Mass disturbance warning method and system applied to short texts
CN104216968A (en) * 2014-08-25 2014-12-17 华中科技大学 Rearrangement method and system based on document similarity
CN105224518A (en) * 2014-06-17 2016-01-06 腾讯科技(深圳)有限公司 The lookup method of the computing method of text similarity and system, Similar Text and system
CN105630767A (en) * 2015-12-22 2016-06-01 北京奇虎科技有限公司 Text similarity comparison method and device
CN106484678A (en) * 2016-10-13 2017-03-08 北京智能管家科技有限公司 A kind of short text similarity calculating method and device
CN106776558A (en) * 2016-12-14 2017-05-31 北京工业大学 Merge the domain term recognition method of language ambience information
CN106970912A (en) * 2017-04-21 2017-07-21 北京慧闻科技发展有限公司 Chinese sentence similarity calculating method, computing device and computer-readable storage medium


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304378A (en) * 2018-01-12 2018-07-20 深圳壹账通智能科技有限公司 Text similarity computing method, apparatus, computer equipment and storage medium
WO2019136993A1 (en) * 2018-01-12 2019-07-18 深圳壹账通智能科技有限公司 Text similarity calculation method and device, computer apparatus, and storage medium
CN110781662A (en) * 2019-10-21 2020-02-11 腾讯科技(深圳)有限公司 Method for determining point-to-point mutual information and related equipment
CN111401031A (en) * 2020-03-05 2020-07-10 支付宝(杭州)信息技术有限公司 Target text determination method, device and equipment
CN114491177A (en) * 2022-02-15 2022-05-13 北京百度网讯科技有限公司 Information determination method, model training method, model determination device and electronic equipment

Similar Documents

Publication Publication Date Title
CN105389349B (en) Dictionary update method and device
WO2018157805A1 (en) Automatic questioning and answering processing method and automatic questioning and answering system
CN110019658B (en) Method and related device for generating search term
Guo et al. Question generation from sql queries improves neural semantic parsing
CN107491425A (en) Determine method, determining device, computer installation and computer-readable recording medium
CN107688608A (en) Intelligent sound answering method, device, computer equipment and readable storage medium storing program for executing
CN109542247B (en) Sentence recommendation method and device, electronic equipment and storage medium
WO2021139262A1 (en) Document mesh term aggregation method and apparatus, computer device, and readable storage medium
JPH08500691A (en) Bilingual database processing method and apparatus
JP6355840B2 (en) Stopword identification method and apparatus
CN104516903A (en) Keyword extension method and system and classification corpus labeling method and system
WO2021068683A1 (en) Method and apparatus for generating regular expression, server, and computer-readable storage medium
CN106557777B (en) One kind being based on the improved Kmeans document clustering method of SimHash
KR101933953B1 (en) Software domain topics extraction system using PageRank and topic modeling
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN103970733A (en) New Chinese word recognition method based on graph structure
CN109558166A (en) A kind of code search method of facing defects positioning
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN106202034A (en) A kind of adjective word sense disambiguation method based on interdependent constraint and knowledge and device
CN110502742A (en) A kind of complexity entity abstracting method, device, medium and system
CN110348020A (en) A kind of English- word spelling error correction method, device, equipment and readable storage medium storing program for executing
CN109508460A (en) Unsupervised composition based on Subject Clustering is digressed from the subject detection method and system
CN106815209B (en) Uygur agricultural technical term identification method
CN104572633A (en) Method for determining meanings of polysemous word

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 230088 Building No. 198, building No. 198, Mingzhu Avenue, Anhui high tech Zone, Anhui

Applicant after: Hefei Hualing Co.,Ltd.

Address before: 230601 R & D building, No. 176, Jinxiu Road, Hefei economic and Technological Development Zone, Anhui 501

Applicant before: Hefei Hualing Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20171219
