CN110347796A - Short text similarity calculation method in vector semantic tensor space - Google Patents

Short text similarity calculation method in vector semantic tensor space

Info

Publication number
CN110347796A
CN110347796A CN201910607928.9A
Authority
CN
China
Prior art keywords
word
semantic
vector
document
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910607928.9A
Other languages
Chinese (zh)
Inventor
Li Min (李民)
Chen Long (陈龙)
Shan Yingzhe (单英哲)
Cui Haonan (崔豪楠)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Synthesis Electronic Technology Co Ltd
Original Assignee
Synthesis Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Synthesis Electronic Technology Co Ltd filed Critical Synthesis Electronic Technology Co Ltd
Priority to CN201910607928.9A priority Critical patent/CN110347796A/en
Publication of CN110347796A publication Critical patent/CN110347796A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a short text similarity calculation method in a vector semantic tensor space. The method first performs natural language processing, statistical analysis and vector analysis on an input training dataset or test sample to obtain the vector distance between the user query sentence and the training samples; it then performs semantic regression prediction on the statistical data set obtained through the natural language processing and statistical analysis, to obtain the semantic distance between the user query sentence and the training samples; finally, a tensor space is established based on the vector-distance and semantic-distance distributions, and the similarity between short texts is calculated in the established tensor space. The invention solves the problem of short-text similarity calculation under small-sample conditions, mitigates or eliminates the influence of differing document and word lengths on similarity, and completes text similarity calculation in a combined vector-semantic tensor space.

Description

Short text similarity calculation method in vector semantic tensor space
Technical field
The present invention relates to a short text similarity calculation method, in particular to a short text similarity calculation method in a vector semantic tensor space, and belongs to the field of artificial intelligence.
Background technique
The core of an intelligent dialogue system is the accuracy of semantic understanding: the higher the accuracy, the more precise the service and the better the user experience. Chinese natural-language semantic understanding is difficult in two main respects. First, language is a human abstraction of objective things and events, so it is subjective and variable, which is especially evident in Chinese processing; second, the information conveyed by language is often context-dependent. Meanwhile, the information sparsity and wording randomness of short texts further increase the difficulty of short-text analysis and processing, which also shows that research on short-text analysis has very important theoretical significance.
The prior art uses Word2Vec to realize text vectorization; Word2Vec is demanding on the quantity and quality of training data and on the completeness of domain knowledge. In practice, however, industry-specific data are usually insufficient, and high-quality data samples generally cannot be obtained through third-party channels, so a high-quality vectorized text representation cannot be obtained in an actual production environment or under small-sample scenarios, which in turn affects the accuracy of short-text similarity.
Summary of the invention
The technical problem to be solved by the present invention is to provide a short text similarity calculation method in a vector semantic tensor space, which solves the problem of short-text similarity calculation under small-sample conditions, mitigates or eliminates the influence of differing document and word lengths on similarity, and completes text similarity calculation in a combined vector-semantic tensor space.
To solve the above problem, the technical solution adopted by the present invention is a short text similarity calculation method in a vector semantic tensor space, comprising the following steps: S01), performing natural language processing, statistical analysis and vector analysis on the input training dataset or test sample to obtain the vector distance between the user query sentence and the training samples; S02), performing semantic regression prediction on the statistical data set obtained in step S01 through natural language processing and statistical analysis, to obtain the semantic distance between the user query sentence and the training samples; S03), establishing a tensor space based on the vector-distance distribution obtained in step S01 and the semantic-distance distribution obtained in step S02, the values of the vector distance and the semantic distance lying in [0, 1], and then calculating the similarity between short texts in the established tensor space.
Further, in step S02, semantic analysis prediction is carried out by a semantic regression model. The specific process of the semantic analysis prediction is: S51), first initialize the semantic regression model, i.e., obtain the entity weight set WordIDF of all training samples and the corresponding semantic code set in the Chinese thesaurus (Tongyici Cilin); then extract the Cilin codes of all entity words in the user query sentence; then, according to the entity codes and the branches where their leaf nodes are located, calculate the semantic distance between words based on the semantic regression model. In the model, Score(word_i, word_j) denotes the semantic distance between word_i and word_j, and the value of weight depends on the level of the code branch where word_i and word_j are located: when the code branches of the two words separate at level 1, weight = 0.00; at level 2, weight = 0.65; at level 3, weight = 0.80; at level 4, weight = 0.90; at level 5, weight = 0.96. n is the total number of nodes in the branch layer where the words are located, and k is the branch distance between the different sense entries. S52), traverse the training-sample entity words and the query-sentence entity words, calculate the minimum semantic distance between words, and output the mean of the minimum semantic distances of the query-sentence entity words as the semantic distance dist_sem.
Further, when the code branches of word_i and word_j are identical, Score = 0.00 if the code ends with "=", and Score = 0.5 if the code ends with "#".
Further, in step S03, the similarity between short texts is characterized by the Minkowski distance, with the formula: dist = (dist_vec^p + dist_sem^p)^(1/p) (11), where dist_vec denotes the vector distance, dist_sem denotes the semantic distance, and p is the distance coefficient.
Further, the value of p is determined by cross-validation, p being equal to 1 or 2.
Further, the statistical analysis includes a training process and a prediction process. The training process completes the weight calculation of the entity words in the training word set; the prediction process calculates the weight value of each entity word when the user inputs a query sentence. The training process is specifically: S31), successively perform duplicate-deletion and synonym-merging on the structured text generated by the natural language processing, and construct the training word set and the document keyword matrix DocList, where each record of the document keyword matrix comprises one standard question sentence and its corresponding word list; S32), traverse and calculate the contribution degree, i.e., weight, of each word word_i in the training word set to the problems, finally obtaining the entity word weight set WordIDF; the document keyword matrix DocList and the entity word weight set WordIDF constitute the statistical data set obtained by the statistical analysis. The prediction process extracts, when the user inputs a query sentence, the weights of the entity words of the query sentence from the WordIDF set.
Further, the weight of each entity word in the training word set is calculated with the IDF algorithm, the specific formula being: IDF(word_i) = log((N - n_i + 0.5)/(n_i + 0.5)) (1), where N is the total number of problems in the training word set, n_i is the number of problems containing word_i, and 0.5 is a harmonic coefficient.
Further, the vector analysis includes a training process and a prediction process. The training process completes the calculation of the relevance between the entity word weights of the training word set and the documents and the generation of the document vectors; the prediction process calculates the vector distance between the query sentence and the training samples when the user inputs a query sentence. The training process is specifically: S41), traverse DocList and WordIDF, calculate the relevance scores between the keywords of each DocList record and the question sentence, and obtain the relevance set WordDocCoef of each word with each document; the calculation formulas are: WordDocCoef(word_i, d_j) = IDF(word_i)·R(word_i, d_j) (2), K = k1·(1 - b + b·Ratl_j) (3), R(word_i, d_j) = (f_i·(k1 + 1))/(f_i + K) · (qf_i·(k2 + 1))/(qf_i + k2) (4), where b, k1, k2 are regulatory factors, f_i is the frequency of word_i in document d_j, Ratl_j is the relative length of document d_j, and qf_i is the frequency of word_i in the query. The relative document length is calculated as follows: first compute the average length Avgl of each document, then compute the ratio of the document's average length to the mean average length of all documents, i.e., the relative length Ratl of the document: Avgl_i = Len(d_i)/Num(d_i) (6), Ratl_i = Avgl_i/((1/N)·Σ_{j=1}^{N} Avgl_j) (7), where Len(d_i) is the total length of document d_i, Num(d_i) is the number of sentences in document d_i, and N is the total number of documents in the training set. S42), traverse the document word matrix DocList to obtain the document vectors: first sort WordIDF in descending order of weight and set the vector space dimension to M; choose the first M records of WordIDF as the entity keywords for constructing the vector space and initialize the word vectors V_word with one-hot encoding; then traverse DocList to obtain the entity word list of each document, query the relevance of the entity word list with the document in the word-document relevance matrix WordDocCoef, and obtain the sentence vector representation via the weights: S(d_j) = Σ_{word_i ∈ d_j} WordDocCoef(word_i, d_j)·V_{word_i} (8). The prediction process is specifically: after the user input completes structured analysis, statistical analysis is performed, the intersection of the query sentence and the vector-space entity keywords is obtained, the vectorized representation of the query sentence is calculated based on the entity word vectors and the word-document relevance matrix, the cosine distance between the query sentence vector and the training sample vectors is then calculated and normalized, and the normalized result is output as the vector distance dist_vec.
Further, the natural language processing performs entity information extraction and multi-dimensional label annotation on the user input and outputs structured text. The entity information extraction uses basic natural language processing techniques to segment the natural language; after segmentation, high-frequency meaningless words are filtered, specific entity information vocabulary and interrogative words are retained, and word expansion and normalization are then performed based on the word list obtained from the segmentation. The multi-dimensional label annotation is part-of-speech labeling of the entity set based on a user dictionary; the grammatical relations between key pieces of information are obtained through syntactic analysis, completing the POS tagging and POS correction of the entity words.
Further, the input training dataset is a question set of a certain industry or field, comprising a number of records, each record comprising one typical question and a number of similar questions; the test sample is a text question input by the user.
Beneficial effects of the present invention: the present invention addresses short-text similarity calculation under small-sample conditions. Compared with word embedding models, the probabilistic model underlying the short text similarity calculation method in the vector semantic tensor space is not demanding on sample size and quality, and thus has better practicability and applicability in actual production practice. Compared with a single probabilistic model, semantic distance is introduced for distance correction, improving the computational accuracy of the model. Compared with traditional pure semantic models, a mathematical model is introduced to reduce complexity and improve generalization ability. Compared with the traditional vector space model, the weight coefficient b is introduced to control the influence of document length on text similarity, and the weight coefficient k1 is introduced to modulate the influence of different word frequencies on text similarity, improving the accuracy of short-text similarity.
Detailed description of the invention
Fig. 1 is a flow chart of the method.
Specific embodiment
The present invention is further described below with reference to the drawings and a specific embodiment.
Embodiment 1
This embodiment discloses a short text similarity calculation method in a vector semantic tensor space which, as shown in Fig. 1, comprises the following steps:
S01), the training dataset or test sample is obtained via the input module.
The training dataset is a question set of a certain industry or field, comprising a number of records, each record comprising one typical question and several similar questions. The test sample is a text question input by the user.
The training dataset is not limited by source; sources include disk documents, structured and unstructured databases, web pages, forums, etc. Data from different sources are standardized in the input module and processed into the form of one typical question and multiple similar questions. The sentences are also manually expanded semantically to enhance the generalization ability of the training set, and standardized text is output. The standardized text consists of the problem set and the corresponding sets of similar questions.
S02), the natural language processing module processes the standardized text, completes entity information extraction and multi-dimensional label annotation, and outputs structured text.
The entity extraction uses basic natural language processing techniques to segment the natural language. After segmentation, high-frequency meaningless words (for example "我" (I) and punctuation marks) are filtered, and specific entity information vocabulary and interrogative words are retained, finally yielding the word set of the natural language text; word expansion and normalization are then performed based on the word list obtained from segmentation. Specific methods include, but are not limited to: efficient word-graph scanning based on a Trie structure, generating the directed acyclic graph (DAG) formed by all possible word formations of the Chinese characters in a sentence; searching the maximum-probability path with dynamic programming to find the best segmentation combination based on word frequency; and, for out-of-vocabulary words, using a character-based HMM model with the Viterbi algorithm to find the most probable combination.
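The segmentation pipeline just described (Trie-based word-graph scan, DAG of candidate words, maximum-probability path by dynamic programming, HMM plus Viterbi for unlisted words) is the algorithm implemented by the open-source tokenizer jieba, so a minimal sketch of this preprocessing step could look as follows; the stopword list and the part-of-speech filter are illustrative assumptions, not taken from the patent:

```python
# Sketch of the entity-extraction step; assumes jieba as the tokenizer.
import jieba.posseg as pseg

# Hypothetical high-frequency noise words ("我"/I, punctuation, etc.)
STOPWORDS = {"我", "的", "了", "吗", "，", "。", "？"}

def extract_entities(sentence):
    """Segment a sentence, drop noise words, keep candidate entity/query words."""
    kept = []
    for token in pseg.cut(sentence):      # DAG + max-probability path + HMM internally
        if token.word in STOPWORDS:
            continue
        # keep nouns, verbs and pronouns/interrogatives as candidate entity words
        if token.flag and token.flag[0] in ("n", "v", "r"):
            kept.append((token.word, token.flag))
    return kept

print(extract_entities("如何办理身份证挂失？"))
```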
The multi-dimensional label annotation refers to part-of-speech labeling of the entity set based on a user dictionary, using syntactic analysis techniques (such as finite-state analysis, phrase-structure analysis, full grammar, local grammar and dependency analysis) to obtain the grammatical dependencies, modification relations, etc. between key pieces of information, completing the POS tagging and POS correction of the entity words.
In this embodiment, the structured text output by the natural language processing module includes the entity word set of the natural language text and the corresponding part-of-speech set.
S03), the statistical analysis module performs statistical analysis on the structured text output by the natural language processing module, including a training process and a prediction process. The training process mainly completes the weight calculation of the entity words in the training word set; the prediction process calculates the weight value of each entity word in the query sentence when the user inputs a query.
In the present embodiment, the training process of statistical analysis is as follows:
S31), for the structured text, construct the training word set and the document keyword matrix DocList by successively deleting duplicate items and merging synonyms, where each record of the document keyword matrix comprises one standard question sentence and its corresponding word list.
S32), traverse and calculate the contribution degree, i.e., weight, of each word word_i in the word set to the problems. There are many ways to calculate the weights; this embodiment uses the IDF algorithm and finally obtains the term weight set WordIDF.
The calculation formula is: IDF(word_i) = log((N - n_i + 0.5)/(n_i + 0.5)) (1), where N is the total number of problems in the training set, n_i is the number of problems containing word_i, 0.5 is a harmonic coefficient, and the log function makes the IDF value respond more smoothly to N and n_i.
The statistical data set finally output by the training consists of the term weight set WordIDF and the document keyword matrix DocList.
The prediction process of the statistical analysis is: when the user inputs a query sentence, the entity weights are extracted from the WordIDF set according to the entity words in the query sentence.
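Taking formula (1) in the smoothed form reconstructed above (an assumption about the lost formula image), the training and prediction halves of this statistics module can be sketched as:

```python
import math

def build_word_idf(doc_list):
    """Training (S31-S32): doc_list maps each standard question to its entity
    word list; returns WordIDF, the weight of each word per formula (1)."""
    n_docs = len(doc_list)
    df = {}                                  # n_i: number of problems containing word_i
    for words in doc_list.values():
        for w in set(words):
            df[w] = df.get(w, 0) + 1
    return {w: math.log((n_docs - n + 0.5) / (n + 0.5)) for w, n in df.items()}

def query_weights(query_words, word_idf):
    """Prediction: look up the WordIDF weight of each entity word in the query."""
    return {w: word_idf[w] for w in query_words if w in word_idf}
```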
S04), the vector analysis module performs vector analysis on the data output by the statistical analysis module, including a training process and a prediction process. The training process mainly completes the calculation of the relevance between the training-set entity word weights and the documents and the generation of the document vectors. The prediction process calculates the vector distance between the query and the training samples when the user inputs a query sentence.
The training process of the vector analysis is as follows: S41), traverse the training document keyword matrix DocList and WordIDF, calculate the relevance scores between the keywords of each DocList record and the question sentence, and obtain the relevance set WordDocCoef of each word with each document. The calculation formulas are:

WordDocCoef(word_i, d_j) = IDF(word_i)·R(word_i, d_j) (2),

K = k1·(1 - b + b·Ratl_j) (3),

R(word_i, d_j) = (f_i·(k1 + 1))/(f_i + K) · (qf_i·(k2 + 1))/(qf_i + k2) (4),

where b, k1, k2 are regulatory factors, generally set empirically (typically b = 0.75, k1 = 2); f_i is the frequency of word_i in document d_j; Ratl_j is the relative length of document d_j; and qf_i is the frequency of word_i in the query. In the vast majority of cases word_i appears only once in the query, i.e., qf_i = 1, so formula (4) simplifies to:

R(word_i, d_j) = (f_i·(k1 + 1))/(f_i + K) (5).

The relative length Ratl_j is calculated as follows: first compute the average length Avgl of each document, then compute the ratio of the document's average length to the mean average length of all documents, i.e., the relative length Ratl of the document; the relative length is used to correct the influence of document length on short-text similarity. The calculation formulas are:

Avgl_i = Len(d_i)/Num(d_i) (6),

Ratl_i = Avgl_i / ((1/N)·Σ_{j=1}^{N} Avgl_j) (7),

where Len(d_i) is the total length of document d_i, Num(d_i) is the number of sentences in document d_i, and N is the total number of documents in the training set.
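Under the BM25-style reading of formulas (2)-(7) reconstructed above (the formula images are lost from this record, so the exact forms are assumptions), the relevance computation could be sketched as:

```python
def relative_lengths(doc_sentences):
    """Formulas (6)-(7): doc_sentences maps doc id -> list of its sentences."""
    avgl = {d: sum(len(s) for s in sents) / len(sents)
            for d, sents in doc_sentences.items()}
    mean_avgl = sum(avgl.values()) / len(avgl)
    return {d: a / mean_avgl for d, a in avgl.items()}   # Ratl per document

def word_doc_coef(word, doc, idf, freq, ratl, b=0.75, k1=2.0):
    """Formulas (2)-(5) with qf_i = 1: relevance of `word` to document `doc`."""
    f_i = freq[doc].get(word, 0)             # occurrences of word in the document
    K = k1 * (1 - b + b * ratl[doc])         # length correction, formula (3)
    r = f_i * (k1 + 1) / (f_i + K)           # simplified R, formula (5)
    return idf.get(word, 0.0) * r            # formula (2)
```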
S42), traverse the document word matrix DocList to obtain the document vectors. First sort WordIDF in descending order of weight and set the vector space dimension to M; choose the first M records of WordIDF as the entity keywords for constructing the vector space and initialize the word vectors V_word with one-hot encoding (the word's own dimension is 1, all others 0). Then traverse DocList to obtain the entity word list of each document, query the relevance of the entity word list with the document in the word-document relevance matrix WordDocCoef, and obtain the sentence vector representation via the weights. The formula is:

S(d_j) = Σ_{word_i ∈ d_j} WordDocCoef(word_i, d_j)·V_{word_i} (8).
In this embodiment, the prediction process of the vector analysis is as follows: after the user input has completed structured analysis, statistical analysis is performed to obtain the intersection of the query sentence and the vector-space entity keywords; the vectorized representation of the query sentence is computed from the entity word vectors and the word-document relevance matrix; the cosine distance between the query sentence vector and each training sample vector is then calculated and normalized, and finally the normalized result is output as the vector distance dist_vec.
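Formula (8) and the prediction step can then be sketched as follows; mapping the cosine similarity into a [0, 1] distance via (1 - cos)/2 is an assumption, since the patent only states that the cosine distance is normalized:

```python
import numpy as np

def doc_vector(entity_words, doc, vocab_index, coef):
    """Formula (8): weighted sum of one-hot word vectors for one document."""
    v = np.zeros(len(vocab_index))
    for w in entity_words:
        if w in vocab_index:
            v[vocab_index[w]] = coef(w, doc)   # WordDocCoef(word_i, d_j)
    return v

def vector_distance(q_vec, d_vec):
    """Cosine distance between query and sample vectors, normalized into [0, 1]."""
    denom = np.linalg.norm(q_vec) * np.linalg.norm(d_vec)
    cos = (q_vec @ d_vec) / denom if denom else 0.0
    return (1.0 - cos) / 2.0                   # assumed normalization into [0, 1]
```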
S05), the semantic regression module performs semantic regression prediction; the prediction process calculates the semantic distance between the user's input query sentence and the training samples.
In this embodiment, the prediction process of the semantic regression model is as follows: S51), first initialize the semantic regression model, i.e., obtain, for all training-sample entity words in the weight set WordIDF, their semantic code set in the Chinese thesaurus (Tongyici Cilin); then extract the Cilin codes of all entity words in the query sentence; then, according to the entity codes and the branches where their leaf nodes are located, calculate the semantic distance between words based on the semantic regression model.

In the model, Score(word_i, word_j) denotes the semantic distance between word_i and word_j, and the value of weight depends on the level of the code branch where word_i and word_j are located: when the code branches of the two words separate at level 1, weight = 0.00; at level 2, weight = 0.65; at level 3, weight = 0.80; at level 4, weight = 0.90; at level 5, weight = 0.96. n is the total number of nodes in the branch layer where the words are located, and k is the branch distance between the different sense entries. When the code branches are identical, Score = 0.00 if the code ends with "=", and Score = 0.5 if it ends with "#".
The node total n of the branch layer where a word is located and the branch distance k between non-synonymous words both influence the similarity of the sense entries. For example:
Aa01A01= 人士 人物 人氏 人选 (personage, figure, person, candidate)
Aa01A02= 人类 生人 全人类 (mankind, stranger, all mankind)
Aa01A03= 人手 人员 人口 人丁 口 食指 (manpower, personnel, population, household member, mouth, index finger)
Aa01A04= 劳力 劳动力 工作者 (labor, labor force, worker)
Aa01A05= 匹夫 个人 (ordinary man, individual)
人士 (personage) and 人员 (personnel) lie in the same branch Aa01A; there are 5 codes in total beginning with Aa01A, so n = 5. The branch level at which 人士 and 人员 are located is level 5 (each pair of digits in the code represents one level), so weight = 0.96. The branch distance between 人士 (Aa01A01) and 人员 (Aa01A03) is 2, so k = 2. Substituting weight, n and k into the semantic regression model yields the semantic distance between the two words.
S52), traverse the training-sample entity words and the query-sentence entity words and calculate the minimum semantic distance between words; finally, the mean of the minimum semantic distances of the query-sentence entity words is output as the semantic distance dist_sem.
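The weight table and the roles of n and k are stated in the text, but the combining formula itself was an image lost from this record. The sketch below assumes the Cilin-style combination Score = 1 - weight · cos(n·π/180) · (n - k + 1)/n, modeled on the published Tongyici-Cilin similarity literature the patent cites, and a hypothetical pair_score callable for step S52:

```python
import math

WEIGHT_BY_LEVEL = {1: 0.00, 2: 0.65, 3: 0.80, 4: 0.90, 5: 0.96}

def split_level(code_a, code_b):
    """Level (1-5) at which two 7-character Cilin codes first differ, else None."""
    for level, end in enumerate((1, 2, 4, 5, 7), start=1):
        if code_a[:end] != code_b[:end]:
            return level
    return None

def score(code_a, code_b, n, k):
    """Semantic distance between two words from their Cilin codes; n is the node
    count of the branch layer, k the branch distance between the sense entries."""
    level = split_level(code_a[:7], code_b[:7])
    if level is None:                      # identical branch: '=' synonym, '#' related
        return 0.0 if code_a.endswith("=") else 0.5
    w = WEIGHT_BY_LEVEL[level]
    # assumed combination of weight, n and k (not verbatim from the patent)
    return 1.0 - w * math.cos(n * math.pi / 180) * (n - k + 1) / n

def dist_sem(query_words, sample_words, pair_score):
    """S52: mean over query words of the minimum distance to any sample word."""
    return sum(min(pair_score(q, s) for s in sample_words)
               for q in query_words) / len(query_words)

# worked example from the description: 人士 (Aa01A01=) vs 人员 (Aa01A03=), n=5, k=2
print(round(score("Aa01A01=", "Aa01A03=", n=5, k=2), 3))
```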
S06), the similarity calculation module calculates the short-text similarity in the tensor space.
A tensor space is first established based on the sentence vector-distance distribution and the semantic-distance distribution, where the values of both the vector distance and the semantic distance lie in [0, 1]. The vector distance between sentences characterizes their semantics from a statistical perspective, while the semantic distance represents the differences between sentences from a semantic perspective; the two distances of different dimensions complement each other, so, following the idea of ensemble learning, both are placed in the same tensor space for comprehensive consideration. The final similarity between short texts is characterized by the Minkowski distance in the tensor space. The formula is:

dist = (dist_vec^p + dist_sem^p)^(1/p) (11),

where dist_vec denotes the vector distance, dist_sem denotes the semantic distance, and p is the distance coefficient. In practical applications, the value of p can be determined by cross-validation according to the specific business data; the classical values are 1 and 2. When p = 1 the distance is the city-block (Manhattan) distance; when p = 2 it is the classical Euclidean distance.
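A minimal sketch of formula (11); whether similarity is then reported as 1 - dist or via another monotone mapping is left implicit by the patent:

```python
def minkowski(dist_vec, dist_sem, p=2):
    """Formula (11): combined distance in the vector-semantic tensor space."""
    return (dist_vec ** p + dist_sem ** p) ** (1.0 / p)

print(minkowski(0.30, 0.24, p=1))   # Manhattan / city-block combination: 0.54
print(minkowski(0.30, 0.24, p=2))   # Euclidean combination: ~0.384
```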
The above describes only the basic principle and a preferred embodiment of the present invention; improvements and substitutions made by those skilled in the art according to the present invention fall within the protection scope of the present invention.

Claims (10)

1. A short text similarity calculation method in a vector semantic tensor space, characterized by comprising the following steps: S01), performing natural language processing, statistical analysis and vector analysis on an input training dataset or test sample to obtain the vector distance between the user query sentence and the training samples; S02), performing semantic regression prediction on the statistical data set obtained in step S01 through natural language processing and statistical analysis, to obtain the semantic distance between the user query sentence and the training samples; S03), establishing a tensor space based on the vector-distance distribution obtained in step S01 and the semantic-distance distribution obtained in step S02, the values of the vector distance and the semantic distance lying in [0, 1], and then calculating the similarity between short texts in the established tensor space.
2. The short text similarity calculation method in a vector semantic tensor space according to claim 1, characterized in that: in step S02, semantic analysis prediction is carried out by a semantic regression model, and the specific process of the semantic analysis prediction is: S51), first initialize the semantic regression model, i.e., obtain the entity weight set WordIDF of all training samples and the corresponding semantic code set in the Chinese thesaurus; then extract the Chinese-thesaurus codes of all entity words in the user query sentence; then, according to the entity codes and the branches where their leaf nodes are located, calculate the semantic distance between words based on the semantic regression model, wherein Score(word_i, word_j) denotes the semantic distance between word_i and word_j, and the value of weight depends on the level of the code branch where word_i and word_j are located: when the code branches of the two words separate at level 1, weight = 0.00; at level 2, weight = 0.65; at level 3, weight = 0.80; at level 4, weight = 0.90; at level 5, weight = 0.96; n is the total number of nodes in the branch layer where the words are located, and k is the branch distance between the different sense entries; S52), traverse the training-sample entity words and the query-sentence entity words, calculate the minimum semantic distance between words, and output the mean of the minimum semantic distances of the query-sentence entity words as the semantic distance dist_sem.
3. The short text similarity calculation method in a vector semantic tensor space according to claim 2, characterized in that: when the code branches of word_i and word_j are identical, Score = 0.00 if the code ends with "=", and Score = 0.5 if the code ends with "#".
4. The short text similarity calculation method in a vector semantic tensor space according to claim 1, characterized in that: in step S03, the similarity between short texts is characterized by the Minkowski distance, with the formula: dist = (dist_vec^p + dist_sem^p)^(1/p), where dist_vec denotes the vector distance, dist_sem denotes the semantic distance, and p is the distance coefficient.
5. The short text similarity calculation method in a vector semantic tensor space according to claim 4, characterized in that: the value of p is determined by cross-validation, p being equal to 1 or 2.
6. The short text similarity calculation method in a vector semantic tensor space according to claim 1, characterized in that: the statistical analysis comprises a training process and a prediction process; the training process completes the weight calculation of the entity words in the training word set, and the prediction process calculates the weight value of each entity word when the user inputs a query sentence; the training process is specifically: S31), successively perform duplicate-deletion and synonym-merging on the structured text generated by the natural language processing, and construct the training word set and the document keyword matrix DocList, where each record of the document keyword matrix comprises one standard question sentence and its corresponding word list; S32), traverse and calculate the contribution degree, i.e., weight, of each word word_i in the training word set to the problems, finally obtaining the entity word weight set WordIDF; the document keyword matrix DocList and the entity word weight set WordIDF constitute the statistical data set obtained by the statistical analysis; the prediction process is: when the user inputs a query sentence, the entity weights are extracted from the WordIDF set according to the entity words in the query sentence.
7. The short text similarity calculation method in a vector semantic tensor space according to claim 6, characterized in that: the weight of each entity word in the training word set is calculated with the IDF algorithm, the specific formula being: IDF(word_i) = log((N - n_i + 0.5)/(n_i + 0.5)), where N is the total number of problems in the training word set, n_i is the number of problems containing word_i, and 0.5 is a harmonic coefficient.
8. The short text similarity calculation method in a vector semantic tensor space according to claim 7, characterized in that: the vector analysis comprises a training process and a prediction process; the training process completes the calculation of the relevance between the entity word weights of the training word set and the documents and the generation of the document vectors, and the prediction process calculates the vector distance between the query sentence and the training samples when the user inputs a query sentence; the training process is specifically: S41), traverse DocList and WordIDF, calculate the relevance scores between the keywords of each DocList record and the question sentence, and obtain the relevance set WordDocCoef of each word with each document, the calculation formulas being: WordDocCoef(word_i, d_j) = IDF(word_i)·R(word_i, d_j) (2), K = k1·(1 - b + b·Ratl_j) (3), R(word_i, d_j) = (f_i·(k1 + 1))/(f_i + K) · (qf_i·(k2 + 1))/(qf_i + k2) (4), where b, k1, k2 are regulatory factors, f_i is the frequency of word_i in document d_j, Ratl_j is the relative length of document d_j, and qf_i is the frequency of word_i in the query; the relative document length is calculated by first computing the average length Avgl of each document and then computing the ratio of the document's average length to the mean average length of all documents, i.e., the relative length Ratl of the document: Avgl_i = Len(d_i)/Num(d_i) (6), Ratl_i = Avgl_i/((1/N)·Σ_{j=1}^{N} Avgl_j) (7), where Len(d_i) is the total length of document d_i, Num(d_i) is the number of sentences in document d_i, and N is the total number of documents in the training set; S42), traverse the document word matrix DocList to obtain the document vectors: first sort WordIDF in descending order of weight and set the vector space dimension to M, choose the first M records of WordIDF as the entity keywords for constructing the vector space, and initialize the word vectors V_word with one-hot encoding; then traverse DocList to obtain the entity word list of each document, query the relevance of the entity word list with the document in the word-document relevance matrix WordDocCoef, and obtain the sentence vector representation via the weights, with the formula: S(d_j) = Σ_{word_i ∈ d_j} WordDocCoef(word_i, d_j)·V_{word_i} (8); the prediction process is specifically: after the user input completes structured analysis, statistical analysis is performed, the intersection of the query sentence and the vector-space entity keywords is obtained, the vectorized representation of the query sentence is calculated based on the entity word vectors and the word-document relevance matrix, the cosine distance between the query sentence vector and the training sample vectors is then calculated and normalized, and the normalized result is output as the vector distance dist_vec.
9. The short text similarity calculation method in a vector semantic tensor space according to claim 1, characterized in that: the natural language processing performs entity information extraction and multi-dimensional label annotation on the user input and outputs structured text; the entity information extraction uses basic natural language processing techniques to segment the natural language, filters high-frequency meaningless words after segmentation, retains specific entity information vocabulary and interrogative words, and then performs word expansion and normalization based on the word list obtained from the segmentation; the multi-dimensional label annotation is part-of-speech labeling of the entity set based on a user dictionary, obtaining the grammatical relations between key pieces of information through syntactic analysis and completing the POS tagging and POS correction of the entity words.
10. The short text similarity calculation method in a vector semantic tensor space according to claim 1, characterized in that: the input training dataset is a question set of a certain industry or field, comprising a number of records, each record comprising one typical question and a number of similar questions; the test sample is a text question input by the user.
CN201910607928.9A 2019-07-05 2019-07-05 Short text similarity calculation method in vector semantic tensor space Pending CN110347796A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910607928.9A CN110347796A (en) 2019-07-05 2019-07-05 Short text similarity calculation method in vector semantic tensor space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910607928.9A CN110347796A (en) 2019-07-05 2019-07-05 Short text similarity calculation method in vector semantic tensor space

Publications (1)

Publication Number Publication Date
CN110347796A true CN110347796A (en) 2019-10-18

Family

ID=68177949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910607928.9A Pending CN110347796A (en) 2019-07-05 2019-07-05 Short text similarity calculation method in vector semantic tensor space

Country Status (1)

Country Link
CN (1) CN110347796A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079026A (en) * 2007-07-02 2007-11-28 北京百问百答网络技术有限公司 Text similarity, acceptation similarity calculating method and system and application system
US20170011279A1 (en) * 2015-07-07 2017-01-12 Xerox Corporation Latent embeddings for word images and their semantics
JP2017162112A (en) * 2016-03-08 2017-09-14 日本電信電話株式会社 Word extraction device, method and program
CN109858028A (en) * 2019-01-30 2019-06-07 神思电子技术股份有限公司 A kind of short text similarity calculating method based on probabilistic model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHANG Fangfang et al., "Intelligent passage ranking based on literal and semantic relevance matching", Journal of Shandong University (Natural Science) *
PIAO Yong et al., "Tensor-based XML similarity calculation method", Control and Decision *
CHEN Hongchao et al., "Word similarity calculation of Tongyici Cilin based on path and depth", Journal of Chinese Information Processing *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144129A (en) * 2019-12-26 2020-05-12 成都航天科工大数据研究院有限公司 Semantic similarity obtaining method based on autoregression and self-coding
CN111144129B (en) * 2019-12-26 2023-06-06 成都航天科工大数据研究院有限公司 Semantic similarity acquisition method based on autoregressive and autoencoding
CN111460556A (en) * 2020-04-01 2020-07-28 上海建工四建集团有限公司 Method and device for determining relevance between drawings, storage medium and terminal
CN112131341A (en) * 2020-08-24 2020-12-25 博锐尚格科技股份有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN112287080A (en) * 2020-10-23 2021-01-29 平安科技(深圳)有限公司 Question sentence rewriting method and device, computer equipment and storage medium
CN112287080B (en) * 2020-10-23 2023-10-03 平安科技(深圳)有限公司 Method and device for rewriting problem statement, computer device and storage medium
CN114254090A (en) * 2021-12-08 2022-03-29 马上消费金融股份有限公司 Question-answer knowledge base expansion method and device
CN114462413A (en) * 2022-02-16 2022-05-10 平安科技(深圳)有限公司 User entity matching method and device, computer equipment and readable storage medium
CN114462413B (en) * 2022-02-16 2023-06-23 平安科技(深圳)有限公司 User entity matching method, device, computer equipment and readable storage medium
CN116521875A (en) * 2023-05-09 2023-08-01 江南大学 Prototype enhanced small sample dialogue emotion recognition method for introducing group emotion infection
CN116521875B (en) * 2023-05-09 2023-10-31 江南大学 Prototype enhanced small sample dialogue emotion recognition method for introducing group emotion infection

Similar Documents

Publication Publication Date Title
CN109858028B (en) Short text similarity calculation method based on probability model
CN110347796A (en) Short text similarity calculation method in vector semantic tensor space
Chieu et al. A maximum entropy approach to information extraction from semi-structured and free text
CN109960786A (en) Chinese Measurement of word similarity based on convergence strategy
CN110674252A (en) High-precision semantic search system for judicial domain
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
CN109783806A (en) A kind of text matching technique using semantic analytic structure
US20220207240A1 (en) System and method for analyzing similarity of natural language data
US20200073890A1 (en) Intelligent search platforms
CN114065758A (en) Document keyword extraction method based on hypergraph random walk
CN117474703B (en) Topic intelligent recommendation method based on social network
CN114997288B (en) Design resource association method
Suleiman et al. Bag-of-concept based keyword extraction from Arabic documents
Jbara et al. Knowledge discovery in Al-Hadith using text classification algorithm
CN118245564B (en) Method and device for constructing feature comparison library supporting semantic review and repayment
Zhao et al. Keyword extraction for social media short text
CN114265936A (en) Method for realizing text mining of science and technology project
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN110990003A (en) API recommendation method based on word embedding technology
Mansour et al. Text vectorization method based on concept mining using clustering techniques
CN114580557A (en) Document similarity determination method and device based on semantic analysis
Dong et al. Knowledge graph construction of high-performance computing learning platform
Su et al. Automatic ontology population using deep learning for triple extraction
MalarSelvi et al. Analysis of Different Approaches for Automatic Text Summarization
Kyjánek et al. Constructing a lexical resource of Russian derivational morphology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191018)