CN109344236A - A question similarity calculation method based on multiple features - Google Patents
A question similarity calculation method based on multiple features
- Publication number: CN109344236A
- Application number: CN201811041071.0A
- Authority: CN (China)
- Prior art keywords: new, rel, similarity, word, sentence
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
Abstract
The invention discloses a question similarity calculation method based on multiple features, comprising the steps of: for an input new question sentence, comparing it with stored historical questions and their corresponding answers, and calculating, between the new question and a historical question, a similarity based on character features, a similarity based on word semantic features, a similarity based on sentence semantic features, a similarity based on latent sentence topic features, and a similarity based on answer semantic features. The final similarity is the sum of the products of these five similarities and their respective weights, where the weights are obtained by linear regression training. By using multiple features, the invention increases the diversity of sample attributes and improves the generalization ability of the model. At the same time, by fusing TF-IDF with edit distance, word semantics and other information through the soft cosine distance, the invention overcomes the semantic gap between words and improves the accuracy of similarity calculation.
Description
Technical field
The present invention relates to the research fields of computer natural language processing and automatic question answering systems, and in particular to a question similarity calculation method based on multiple features.
Background art
With the rapid growth of digital information, it is increasingly difficult for people to obtain the information resources they need from the network. How to find the required information for users quickly and precisely in massive digital information poses a severe challenge to natural language processing (NLP) and information retrieval technology. Therefore, in order to provide users with a real-time and accurate channel for obtaining information, research institutions and technology companies have begun to study automatic question answering (QA) systems. In an automatic question answering system, the user only needs to input a question to obtain the corresponding answer directly; the user no longer needs to extract keywords from the question for retrieval and read a large number of web pages to find the answer. Automatic question answering systems are simpler, easier to use, more real-time and more accurate than traditional search engines, provide users with a comfortable human-computer interaction experience, and have become a research hotspot of the new generation of information technology. An automatic question answering system allows the user to describe a problem in natural language, accurately understands the user's question, organizes an answer from information retrieved from a question-answer base or the Internet, and finally returns a refined and accurate result, providing an efficient channel for information acquisition.
Question similarity calculation is a key link in automatic question answering: the goal is to find, in an existing question set, the historical questions most similar to a newly posed question, so that the answer to the new question can be provided from the answer set of those historical questions.
At present, there have also been achievements in the field of automatic question answering in China. General community question answering systems include Quora, Toutiao Q&A and Baidu Knows, while professional community question answering systems cover many specialties, such as the IT-related systems Stack Overflow and CSDN. The question similarity calculation method therefore directly affects the accuracy of a question answering system and has good industrial prospects.
Through years of research, automatic question answering systems have formed a general framework, mainly composed of three modules: information retrieval, question analysis and answer acquisition. The main task of the question analysis module is to analyze the question input by the user and find, in the existing question set, the historical questions most similar to the newly posed question; its research content covers question similarity analysis and question ranking, of which the most important part is the similarity calculation between questions, by which the historical question set is ranked. The answer acquisition module mainly obtains the corresponding answer set according to the set of similar questions retrieved for the question.
Text similarity technology is the basis of question similarity technology (questions and answers are both text). There are mainly three kinds of text similarity calculation methods.
The first is similarity calculation based on the vector space model (VSM), which maps a text to a point in a vector space and then uses mathematical methods to calculate the distance between points. Researchers have applied the VSM to the similar-question retrieval task of frequently asked questions (FAQ) and improved the VSM for the characteristics of the FAQ task. However, in this method sparse text leads to excessive dimensionality, and the semantic gap problem easily arises.
The second is similarity calculation based on syntactic analysis, which introduces a graphical representation to describe how the phrases in a sentence govern and are governed by one another. Researchers have proposed an analysis method based on deep structure, which first performs dependency analysis on the question, selects the most important word in the sentence together with the effective words directly attached to it for matching, and then carries out Chinese text similarity calculation based on the dependency structure. However, tools such as syntactic analysis and dependency analysis are complex, require a linguistics background, and perform poorly on complicated long sentence patterns.
The third is semantics-based similarity calculation, which includes two kinds of similarity: word semantics and sentence semantics. Word semantic similarity is usually calculated using semantic dictionaries such as WordNet and HowNet, which contain the semantic relations between words. Some researchers hold that the complete expression of a short sentence depends not only on its syntactic structure but also on its words and their weights, and therefore use WordNet to improve the semantic representation of words. For sentence semantic similarity, researchers have used IBM's machine translation model to learn the transition probability between two question sentences, representing the semantic similarity of the sentences in order to retrieve similar questions. However, these methods depend too heavily on semantic dictionaries, and the accuracy of the similarity calculation is affected by the completeness and correctness of the dictionary; likewise, semantics-based similarity calculation methods perform poorly on syntactically complex long sentences.
Meanwhile, most methods in the prior art extract text representation features from a single type of information and focus on single-type features, ignoring the fact that the meaning of a text representation is composed of many-sided, multi-level information; the accuracy of the calculated similarity is therefore not high.
Summary of the invention
To overcome the deficiencies of the prior art, the object of the present invention is to provide a question similarity calculation method based on multiple features. The method is used for similarity calculation between questions in English question answering systems and has the advantage of high accuracy.
The object of the present invention is achieved by the following technical solution: a question similarity calculation method based on multiple features, comprising the steps of: for an input new question sentence, comparing it with stored historical questions and their corresponding answers, and calculating, between the new question and a historical question, a similarity based on character features, a similarity based on word semantic features, a similarity based on sentence semantic features, a similarity based on latent sentence topic features, and a similarity based on answer semantic features; the final similarity is the sum of the products of the above five similarities and their respective weights, where the weights are obtained by linear regression training.
Preferably, the new question sentence, the historical questions and the corresponding answers used for comparison are first preprocessed, including removing punctuation, case conversion (all capital letters are converted to lowercase), and removing stop words and low-frequency words.
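As a concrete illustration, this preprocessing step can be sketched in Python as follows. The stop-word list here is a small hypothetical sample (not the one used by the invention), and low-frequency-word filtering, which requires corpus-wide counts, is omitted:

```python
import re

# Hypothetical sample stop-word list; the invention's actual list is not specified.
STOP_WORDS = {"where", "i", "can", "for", "is", "there", "any", "in", "it"}

def preprocess(sentence, stop_words=STOP_WORDS):
    """Remove punctuation, lowercase, tokenize, and drop stop words."""
    sentence = re.sub(r"[^\w\s]", " ", sentence)       # remove punctuation marks
    tokens = sentence.lower().split()                  # case conversion + tokenization
    return [t for t in tokens if t not in stop_words]  # remove stop words

print(preprocess("Where I can buy good oil for massage?"))
# → ['buy', 'good', 'oil', 'massage'], matching Q_new after preprocessing in the embodiment
```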
Preferably, the method for calculating the similarity based on character features is: first, the edit distance between each pair of words is calculated to obtain a relation matrix between words; then the TF-IDF (term frequency-inverse document frequency) representations of the question sentences and the relation matrix are used to calculate a soft cosine distance, which serves as the similarity based on character features.
Further, the method for calculating the relation matrix between words is:
Define the corpus as the question-sentence and answer text data set used for training and testing the model. Assume the size of the dictionary of the corpus is n; then the edit distances between words form a relation matrix M_lev ∈ R^(n×n), where R^(n×n) is the set of real matrices of size n × n (the same meaning hereinafter), and the element m_(i,j) of M_lev is the edit-distance-based similarity between the i-th word w_i and the j-th word w_j in the dictionary. The edit-distance similarity is calculated as follows:
m_(i,j) = α · (1 − lev(w_i, w_j) / max(‖w_i‖, ‖w_j‖))^β
where ‖w_i‖ is the number of characters contained in word w_i, ‖w_j‖ is the number of characters contained in word w_j, α is the weighting factor of the diagonal elements, and β is the intensifier of the distance score. The recursive formula for lev(w_i, w_j) is as follows:
lev(m, n) = max(m, n), if min(m, n) = 0;
lev(m, n) = min( lev(m−1, n) + 1, lev(m, n−1) + 1, lev(m−1, n−1) + cost ), otherwise
where m and n are the lengths (numbers of characters) of w_i and w_j, and cost is the cost of replacing the m-th character of w_i with the n-th character of w_j: if the two characters are identical, cost is 0; otherwise cost is 1.
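Assuming the Gensim-style weighting suggested by the description above (similarity = α(1 − lev/max length)^β, with the embodiment's values α = 1.8 and β = 5 — the exact form of the patent's formula image is a reconstruction), the relation matrix can be sketched as:

```python
from functools import lru_cache

def levenshtein(a: str, b: str) -> int:
    """Classic edit-distance recursion: insertion, deletion, substitution (cost 1)."""
    @lru_cache(maxsize=None)
    def lev(m, n):
        if min(m, n) == 0:
            return max(m, n)
        cost = 0 if a[m - 1] == b[n - 1] else 1
        return min(lev(m - 1, n) + 1, lev(m, n - 1) + 1, lev(m - 1, n - 1) + cost)
    return lev(len(a), len(b))

def relation_matrix_lev(vocab, alpha=1.8, beta=5.0):
    """M_lev[i][j] = alpha * (1 - lev(wi, wj) / max(|wi|, |wj|)) ** beta (assumed form)."""
    n = len(vocab)
    M = [[0.0] * n for _ in range(n)]
    for i, wi in enumerate(vocab):
        for j, wj in enumerate(vocab):
            sim = 1.0 - levenshtein(wi, wj) / max(len(wi), len(wj))
            M[i][j] = alpha * sim ** beta
    return M

M = relation_matrix_lev(["oil", "oils", "massage"])
print(M[0][1])  # "oil" vs "oils": lev = 1, so 1.8 * (1 - 1/4) ** 5
```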
Further, the TF-IDF representation of a question sentence is calculated as follows: in a sentence, for word w_i, the TF value and the IDF value are calculated; the TF value is the frequency with which the word appears in the current sentence, and the IDF value is the inverse document frequency. The TF-IDF of word w_i is calculated as:
TFIDF_i = TF_i × IDF_i.
Further, for a newly posed question sentence Q_new and a historical question sentence Q_rel, the soft cosine distance is calculated as follows:
The question sentences Q_new and Q_rel are represented as TFIDF_new and TFIDF_rel:
TFIDF_new = [d_(new,1), d_(new,2), …, d_(new,n)]^T
TFIDF_rel = [d_(rel,1), d_(rel,2), …, d_(rel,n)]^T
where d_(new,i) is the TF-IDF value of w_i in Q_new, d_(rel,j) is the TF-IDF value of w_j in Q_rel, n is the number of words in the corpus dictionary, and T denotes the transpose of a vector.
With the relation matrix between words M_lev obtained above, the similarity R_lev(Q_new, Q_rel) between Q_new and Q_rel based on character features is calculated by the soft cosine distance, with the following formula:
R_lev(Q_new, Q_rel) = (TFIDF_new · M_lev · TFIDF_rel) / ( sqrt(TFIDF_new · M_lev · TFIDF_new) × sqrt(TFIDF_rel · M_lev · TFIDF_rel) )
where "·" is the dot product of vectors and matrices (the same meaning hereinafter), calculated as:
a · M · b = Σ_(i=1..n) Σ_(j=1..n) a_i m_(i,j) b_j
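The soft cosine distance itself is only a few lines; the sketch below also checks that with the identity relation matrix it reduces to the ordinary cosine similarity:

```python
import math

def soft_cosine(a, b, M):
    """Soft cosine: (a·M·b) / (sqrt(a·M·a) * sqrt(b·M·b))."""
    def quad(x, y):
        # Bilinear form x^T M y over the word relation matrix.
        return sum(x[i] * M[i][j] * y[j]
                   for i in range(len(x)) for j in range(len(y)))
    denom = math.sqrt(quad(a, a)) * math.sqrt(quad(b, b))
    return quad(a, b) / denom if denom else 0.0

# With the identity matrix, soft cosine equals ordinary cosine similarity.
I3 = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
print(soft_cosine([1.0, 0.0, 1.0], [0.0, 1.0, 1.0], I3))  # → 0.5
```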
Preferably, the similarity based on word semantic features is calculated in the following steps:
(1) The distributed representation of each word in the corpus is obtained by training with the word2vec tool, i.e. each word corresponds to a K-dimensional real vector.
(2) For a corpus whose dictionary size is n, the pairwise semantic relation m_(i,j) (i, j ∈ [1, n]) between words in the dictionary is calculated from the cosine distance between word vectors, yielding the relation matrix M_w2v, M_w2v ∈ R^(n×n).
(3) The TF-IDF representations of the question sentences Q_new and Q_rel, namely TFIDF_new and TFIDF_rel, are read.
(4) The similarity R_w2v(Q_new, Q_rel) between Q_new and Q_rel based on word semantic features is calculated by the soft cosine distance, with the following formula:
R_w2v(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_rel) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_rel · M_w2v · TFIDF_rel) )
Preferably, the similarity based on sentence semantic features is calculated in the following steps:
According to the word vectors w_i ∈ R^K obtained by word2vec training, and assuming a question sentence contains M words, the question sentence can be expressed as Q_matrix = (w_1, w_2, …, w_M). The question sentence is then expressed as the arithmetic mean of the word vectors in the sentence, i.e. the vector:
AVG = (1/M) Σ_(i=1..M) w_i
According to the above formula, the vectors AVG_new and AVG_rel are obtained by calculating the arithmetic mean of the word vectors in Q_new and Q_rel respectively, and the similarity R_vec(Q_new, Q_rel) between Q_new and Q_rel based on sentence semantic features is obtained as the cosine distance between AVG_new and AVG_rel.
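The sentence-semantic similarity above can be sketched as follows, using toy 3-dimensional vectors in place of the trained word2vec vectors:

```python
import math

def sentence_vector(word_vectors):
    """Arithmetic mean of the word vectors in a sentence: AVG = (1/M) * sum(w_i)."""
    M = len(word_vectors)
    K = len(word_vectors[0])
    return [sum(w[k] for w in word_vectors) / M for k in range(K)]

def cosine(u, v):
    """Ordinary cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Toy 3-dimensional word vectors standing in for word2vec output:
q_new = [[1.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
q_rel = [[1.0, 1.0, 1.0]]
avg_new = sentence_vector(q_new)  # [0.5, 1.0, 1.0]
r_vec = cosine(avg_new, sentence_vector(q_rel))
print(r_vec)
```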
Preferably, the similarity based on latent sentence topic features is calculated in the following steps:
The corpus is used as the input of the LdaModel function in Gensim, and an LDA model is obtained by training. Then the newly posed question sentence Q_new and the historical question sentence Q_rel, whose topic distributions need to be calculated, are input into the trained LDA model to obtain the vector representations of the two question sentences based on topic distribution, denoted LDA_new and LDA_rel respectively. The similarity R_lda(Q_new, Q_rel) between Q_new and Q_rel based on latent sentence topic features is obtained as the cosine distance between LDA_new and LDA_rel.
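The topic-based similarity can be sketched as follows. The topic distributions here are toy values standing in for the output of Gensim's LdaModel (inferring real distributions requires a trained model and a bag-of-words representation of each sentence):

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Toy topic distributions over 3 latent topics; in Gensim these would come
# from querying a trained LdaModel with the sentence's bag-of-words.
lda_new = [0.7, 0.2, 0.1]
lda_rel = [0.6, 0.3, 0.1]
r_lda = cosine(lda_new, lda_rel)
print(round(r_lda, 6))
```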
Preferably, the similarity based on answer semantic features is calculated in the following steps:
In a question answering system, each historical question corresponds to a set of candidate answers. For a newly posed question sentence Q_new, a historical question sentence Q_rel and the candidate answer A_rel corresponding to Q_rel, the word-level semantic similarity between Q_new and A_rel is calculated as the similarity R_qa(Q_new, Q_rel) of the two question sentences based on answer semantic features. The detailed process is as follows:
(1) The distributed representation of each word in the corpus is obtained by training with the word2vec tool, i.e. each word corresponds to a K-dimensional real vector.
(2) For a corpus whose dictionary size is n, the pairwise semantic relation m_(i,j) (i, j ∈ [1, n]) between words in the dictionary is calculated from the cosine distance between word vectors, yielding the relation matrix M_w2v, M_w2v ∈ R^(n×n).
(3) The TF-IDF representation TFIDF_new of the question sentence Q_new is read; A_rel is represented by TF-IDF to obtain TFIDF_ans = [d_(ans,1), d_(ans,2), …, d_(ans,n)]^T, where d_(ans,i) is the TF-IDF value of w_i in A_rel, n is the number of words in the corpus dictionary, and T denotes the transpose of a vector.
(4) The similarity R_qa(Q_new, Q_rel) between Q_new and Q_rel based on answer semantic features is calculated by the soft cosine distance, with the following formula:
R_qa(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_ans) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_ans · M_w2v · TFIDF_ans) )
Preferably, the final similarity is calculated as:
Sim(Q_new, Q_rel) = Σ_k λ_k R_k(Q_new, Q_rel)
where R_k(Q_new, Q_rel) is the similarity based on feature k, namely the similarity based on character features, the similarity based on word semantic features, the similarity based on sentence semantic features, the similarity based on latent sentence topic features, and the similarity based on answer semantic features, respectively; λ_k is a parameter to be trained, obtained by linear regression analysis.
Further, the iterative steps of the linear regression training process are as follows:
(1) The weight λ_k corresponding to each feature is randomly initialized, and the squared loss of the current iteration is calculated from the weights. The squared loss function is as follows:
L = Σ_(i=1..I) (Ŷ^(i) − Y^(i))²
where I is the number of given training samples, a training sample being a known question pair together with the similarity of that pair; Ŷ^(i) is the predicted similarity of the i-th sample, and Y^(i) is the given similarity of the i-th sample.
(2) According to the squared loss, the partial derivative with respect to the weight λ_k of each feature is taken, yielding the gradient ∇λ_k^t of the weight at the current iteration, where t denotes the t-th iteration.
(3) The weight of each feature is updated according to λ_k^(t+1) = λ_k^t − α∇λ_k^t, where α is the step size.
(4) The squared loss is recalculated with the new weights; if the current squared loss is not less than the squared loss of the previous iteration, the iteration stops and the final value of the weight λ_k of each feature is obtained; otherwise steps (2)-(4) are repeated.
Compared with the prior art, the present invention has the following advantages and technical effects:
(1) In machine learning, a training sample is described by a set of attributes, and different attribute sets provide different perspectives on the observed data. The present invention describes question and answer sentences expressed in natural language from five different perspectives and extracts five types of features. Compared with representations based on a single type of feature, multiple features increase the diversity of sample attributes and improve the generalization ability of the model.
(2) The present invention uses the soft cosine distance to fuse TF-IDF with edit distance, word semantics and other information when calculating the similarity between questions. Compared with traditional similarity calculation methods, the present invention overcomes the semantic gap between words and improves the accuracy of the similarity calculation.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present embodiment.
Specific embodiment
The present invention will now be described in further detail with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
The question similarity calculation method based on multiple features of the present embodiment measures the similarity between two question sentences using five kinds of features: character features, word semantic features, sentence semantic features, latent sentence topic features and answer semantic features. The similarities based on these five kinds of features are integrated into the final similarity between the new question and the historical question. Referring to Fig. 1, each step of the method is described in detail below with an example.
(1) A new question sentence Q_new is input: Where I can buy good oil for massage?
(2) A historical question sentence Q_rel is read: Is there any place i can find scented massage oils in qatar?
At the same time, the answer A_rel of the historical question is read: Yes. It is right behind Kahrama in the National area.
(3) Q_new, Q_rel and A_rel are preprocessed respectively, including removing punctuation, case conversion (all capital letters are converted to lowercase), and removing stop words and low-frequency words. This yields:
Q_new: buy good oil massage
Q_rel: place find scented massage oils qatar
A_rel: yes right behind kahrama national area
(4) The similarities between Q_new and Q_rel based on the following five kinds of features are calculated separately.
In the following it is assumed that {area, behind, buy, find, good, kahrama, massage, national, oil, oils, place, qatar, right, scented, yes} is the dictionary set of the corpus.
(4-1) Similarity based on character features
Character features measure the similarity between words at the character level, and the character-level similarity of words is obtained with the edit distance. First, the relation matrix M_lev between words is obtained by calculating the edit distance between each pair of words; then the soft cosine distance between the TF-IDF representations of the question sentences under the relation matrix M_lev is calculated as the similarity based on character features. The details are as follows:
(4-1-1) Calculating the relation matrix M_lev
Assuming the size of the dictionary of the corpus (the corpus refers to the question-sentence and answer text data set used for training and testing the model; the same meaning hereinafter) is n, the edit distances between words form a relation matrix M_lev ∈ R^(n×n), where R^(n×n) is the set of real matrices of size n × n, and the element m_(i,j) of M_lev is the edit-distance-based similarity between the i-th word w_i and the j-th word w_j in the dictionary. The edit-distance similarity is calculated as follows:
m_(i,j) = α · (1 − lev(w_i, w_j) / max(‖w_i‖, ‖w_j‖))^β
where ‖w_i‖ is the number of characters contained in word w_i, ‖w_j‖ is the number of characters contained in word w_j, α is the weighting factor of the diagonal elements, and β is the intensifier of the distance score; here α is taken as 1.8 and β as 5. The recursive formula for lev(w_i, w_j) is:
lev(m, n) = max(m, n), if min(m, n) = 0;
lev(m, n) = min( lev(m−1, n) + 1, lev(m, n−1) + 1, lev(m−1, n−1) + cost ), otherwise
where m and n are the lengths (numbers of characters) of w_i and w_j, and cost is the cost of replacing the m-th character of w_i with the n-th character of w_j: if the two characters are identical, cost is 0; otherwise cost is 1.
Since the dictionary contains 15 words, M_lev is a 15 × 15 matrix.
(4-1-2) Calculating the TF-IDF representations of the question sentences
TF-IDF is composed of TF and IDF.
In a sentence, the TF value of word w_i is calculated as follows:
TF_i = n_i / Σ_k n_k
where TF_i is the frequency of the i-th word in the current sentence, n_i is the number of times the word appears in the current sentence, and n_k is the number of times the k-th word appears in the current sentence.
For a given corpus, the IDF value of each word is fixed; for word w_i it is calculated as follows:
IDF_i = log( |D| / |{d : w_i ∈ d}| )
where |D| is the total number of texts and the denominator is the number of texts containing the word.
In a sentence, the TF-IDF of word w_i is calculated as:
TFIDF_i = TF_i × IDF_i
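A minimal sketch of this TF-IDF calculation over the three preprocessed texts of the embodiment follows. Note that the embodiment's example vectors appear to use a different IDF variant with normalization, so the exact numbers below are illustrative rather than those of the patent:

```python
import math

def tf(sentence):
    """TF_i = count of word i in the sentence / total words in the sentence."""
    total = len(sentence)
    counts = {}
    for w in sentence:
        counts[w] = counts.get(w, 0) + 1
    return {w: c / total for w, c in counts.items()}

def idf(corpus):
    """IDF_i = log(|D| / number of texts containing word i)."""
    D = len(corpus)
    vocab = {w for doc in corpus for w in doc}
    return {w: math.log(D / sum(1 for doc in corpus if w in doc)) for w in vocab}

def tfidf(sentence, idf_values):
    """TFIDF_i = TF_i * IDF_i for each word of the sentence."""
    weights = tf(sentence)
    return {w: weights[w] * idf_values[w] for w in weights}

corpus = [["buy", "good", "oil", "massage"],
          ["place", "find", "scented", "massage", "oils", "qatar"],
          ["yes", "right", "behind", "kahrama", "national", "area"]]
v = tfidf(corpus[0], idf(corpus))
print(v)
```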
The question sentences Q_new and Q_rel are expressed as the following vectors:
TFIDF_new = [d_(new,1), d_(new,2), …, d_(new,n)]^T
TFIDF_rel = [d_(rel,1), d_(rel,2), …, d_(rel,n)]^T
where d_(new,i) is the TF-IDF value of w_i in Q_new, d_(rel,j) is the TF-IDF value of w_j in Q_rel, n is the number of words in the corpus dictionary, and T denotes the transpose of a vector.
The question sentences Q_new and Q_rel obtained by the preprocessing of step (3) are expressed as 15-dimensional vectors of the following form:
TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
TFIDF_rel = [0.0, 0.0, 0.0, 0.423394, 0.0, …]^T
(4-1-3) Calculating the soft cosine distance
The soft cosine distance (Soft Cosine) is an improved cosine similarity calculation method proposed by Sidorov in 2014, which introduces a relation matrix into the cosine distance to express the relations between words.
With Q_new and Q_rel represented as TFIDF_new and TFIDF_rel by step (4-1-2), and the relation matrix between words M_lev obtained by step (4-1-1), the similarity R_lev(Q_new, Q_rel) between Q_new and Q_rel based on character features is calculated by the soft cosine distance, with the following formula:
R_lev(Q_new, Q_rel) = (TFIDF_new · M_lev · TFIDF_rel) / ( sqrt(TFIDF_new · M_lev · TFIDF_new) × sqrt(TFIDF_rel · M_lev · TFIDF_rel) )
where "·" is the dot product of vectors and matrices (the same meaning hereinafter), calculated as a · M · b = Σ_i Σ_j a_i m_(i,j) b_j.
In the present embodiment, R_lev(Q_new, Q_rel) = 0.225969.
(4-2) Similarity based on word semantic features
(4-2-1) The distributed representation of each word in the corpus is obtained by training with the word2vec tool, i.e. each word corresponds to a 200-dimensional word vector.
(4-2-2) For a corpus whose dictionary size is n, the pairwise semantic relation m_(i,j) (i, j ∈ [1, n]) between words in the dictionary is calculated from the cosine distance between word vectors, yielding the relation matrix M_w2v, M_w2v ∈ R^(n×n). m_(i,j) is calculated as:
m_(i,j) = max(0, cos(w_i, w_j))²
where w_i and w_j are the K-dimensional real vectors of the i-th and j-th words in the corpus, w_i, w_j ∈ R^K, R^K being the set of one-dimensional vectors of length K (the same meaning hereinafter), and "·" here is the standard dot product between vectors (the same meaning hereinafter):
w_i · w_j = Σ_(m=1..K) w_(i,m) w_(j,m)
where w_(i,m) is the m-th component of w_i and w_(j,m) is the m-th component of w_j.
Since the dictionary size in the present embodiment is 15 words, M_w2v is a 15 × 15 matrix.
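The construction of M_w2v can be sketched as follows, using toy 3-dimensional vectors in place of the 200-dimensional word2vec vectors:

```python
import math

def cosine(u, v):
    """Standard cosine between two word vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def relation_matrix_w2v(vectors):
    """m_ij = max(0, cos(w_i, w_j)) ** 2 over all dictionary word vectors."""
    n = len(vectors)
    return [[max(0.0, cosine(vectors[i], vectors[j])) ** 2 for j in range(n)]
            for i in range(n)]

# Toy vectors standing in for word2vec output:
vecs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
M = relation_matrix_w2v(vecs)
print(M[0][1], M[0][2])  # cos = 1/sqrt(2) → 0.5; negative cosine is clipped to 0
```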
(4-2-3) The TF-IDF representations of Q_new and Q_rel calculated in (4-1-2) are read:
TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
TFIDF_rel = [0.0, 0.0, 0.0, 0.423394, 0.0, …]^T
(4-2-4) The similarity between Q_new and Q_rel based on word semantic features is obtained by the soft cosine distance formula: R_w2v(Q_new, Q_rel) = 0.304225.
(4-3) Similarity based on sentence semantic features
According to the word vectors w_i ∈ R^K obtained by word2vec training, and assuming a question sentence contains M words, the question sentence can be expressed as Q_matrix = (w_1, w_2, …, w_M). The question sentence is expressed as the arithmetic mean of the word vectors in the sentence, i.e. the vector:
AVG = (1/M) Σ_(i=1..M) w_i
According to the above formula, the vectors AVG_new and AVG_rel are obtained by calculating the arithmetic mean of the word vectors in Q_new and Q_rel respectively:
AVG_new = [0.014657, 0.075914, −0.042454, 0.219559, −0.117374, …]
AVG_rel = [−0.088187, −0.025432, −0.05328, 0.17098376, −0.13033055, …]
The similarity R_vec(Q_new, Q_rel) between Q_new and Q_rel based on sentence semantic features is obtained as the cosine distance between AVG_new and AVG_rel: R_vec(Q_new, Q_rel) = 0.738933.
(4-4) Similarity based on latent sentence topic features
The present invention uses the LDA (Latent Dirichlet Allocation) topic model to obtain the latent topics of the question sentences. After LDA training, the latent topic distribution of every document in the document set is available, from which the topic vector of a sentence is obtained. For example, if the latent topic distribution of sentence Q_m is (p_1, p_2, …, p_I), where p_i is the probability of belonging to the i-th topic and I is the number of latent topics, then Q_m is expressed as the vector [p_1, p_2, …, p_I].
The topic distribution of a question sentence is calculated with the Gensim (https://radimrehurek.com/gensim/) topic-model open-source tool. First the corpus is used as the input of the LdaModel function in Gensim, and an LDA model is obtained by training; then a question sentence whose topic distribution needs to be calculated is input into the trained LDA model, the output is the topic distribution of that question sentence, and the sentence is expressed as a vector.
For the newly posed question sentence Q_new and the historical question sentence Q_rel, the vector representations of the two question sentences based on topic distribution are calculated with the Gensim topic-model open-source tool, denoted LDA_new and LDA_rel respectively:
LDA_new = [0.001784, 0.001934, 0.002056, 0.002072, 0.001772, …]
LDA_rel = [0.001706, 0.001850, 0.001967, 0.001982, 0.001695, …]
The similarity R_lda(Q_new, Q_rel) based on latent sentence topic features is obtained as the cosine distance between LDA_new and LDA_rel: R_lda(Q_new, Q_rel) = 0.685844.
(4-5) Similarity based on answer semantic features
In a question answering system, each historical question corresponds to a set of candidate answers. For the newly posed question sentence Q_new, the historical question sentence Q_rel and the candidate answer A_rel corresponding to Q_rel, the word-level semantic similarity between Q_new and A_rel is calculated with a method similar to that of (4-2), as the similarity R_qa(Q_new, Q_rel) of the two question sentences based on answer semantic features. The detailed process is as follows:
(4-5-1) The 200-dimensional word vector of each word in the dictionary calculated in (4-2-1) is read;
(4-5-2) The matrix M_w2v calculated in (4-2-2) is read;
(4-5-3) The TF-IDF representation of Q_new obtained in (4-1-2) is read:
TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
The TF-IDF representation of the A_rel obtained by the preprocessing of step (3) is calculated, yielding the following 15-dimensional vector:
TFIDF_ans = [0.408248, 0.408248, 0.0, 0.0, 0.0, …]^T
(4-5-4) The similarity R_qa(Q_new, Q_rel) between Q_new and Q_rel based on answer semantic features is calculated by the soft cosine distance:
R_qa(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_ans) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_ans · M_w2v · TFIDF_ans) ) = 0.018413
(5) Calculating the final similarity
The final similarity Sim(Q_new, Q_rel) between Q_new and Q_rel is calculated with the following formula:
Sim(Q_new, Q_rel) = Σ_k λ_k R_k(Q_new, Q_rel)
where R_k(Q_new, Q_rel) is the similarity based on feature k, obtained by (4-1)~(4-5) respectively:
R_lev(Q_new, Q_rel) = 0.225969, R_w2v(Q_new, Q_rel) = 0.304225, R_vec(Q_new, Q_rel) = 0.738933, R_lda(Q_new, Q_rel) = 0.685844, R_qa(Q_new, Q_rel) = 0.018413
λ_k is a parameter to be trained, obtained by linear regression analysis. The training uses forward stepwise regression, reducing the error as much as possible at every step, with the squared loss iterated for a certain number of times until the loss function is minimized. The squared loss function is as follows:
L = Σ_(i=1..I) (Ŷ^(i) − Y^(i))²
where I is the number of given training samples, a training sample being a known question pair together with the similarity of that pair; Ŷ^(i) is the predicted similarity of the i-th sample, and Y^(i) is the given similarity of the i-th sample.
The iterative process is as follows:
1. The weight λ_k corresponding to each feature is randomly initialized, and the squared loss of the current iteration is calculated from the weights;
2. According to the squared loss, the partial derivative with respect to the weight λ_k of each feature is taken, yielding the gradient ∇λ_k^t of the weight at the current iteration, where t denotes the t-th iteration;
3. The weight of each feature is updated according to λ_k^(t+1) = λ_k^t − α∇λ_k^t, where the step size α is taken as 0.1;
4. The squared loss is recalculated with the new weights; if the current squared loss is not less than the squared loss of the previous iteration, the iteration stops and the final value of the weight λ_k of each feature is obtained; otherwise steps 2-4 are repeated.
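Under the assumption of plain batch gradient descent on the squared loss (the forward stepwise variant mentioned above may differ in detail), the four steps can be sketched as follows; the training data here are synthetic:

```python
import random

def train_weights(features, targets, alpha=0.1, max_iter=1000, seed=0):
    """Batch gradient descent on the squared loss L = sum((y_hat - y)^2).

    features: one list of per-feature similarities R_k per training question pair.
    targets:  the given similarity Y of each pair.
    """
    rng = random.Random(seed)
    K = len(features[0])
    lam = [rng.uniform(-0.5, 0.5) for _ in range(K)]       # step 1: random init

    def predict(w, r):
        return sum(wk * rk for wk, rk in zip(w, r))

    def loss(w):
        return sum((predict(w, r) - y) ** 2 for r, y in zip(features, targets))

    prev = loss(lam)
    for _ in range(max_iter):
        grad = [sum(2 * (predict(lam, r) - y) * r[k]       # step 2: dL/d lambda_k
                    for r, y in zip(features, targets))
                for k in range(K)]
        lam = [wk - alpha * g for wk, g in zip(lam, grad)]  # step 3: update
        cur = loss(lam)
        if cur >= prev:                                     # step 4: stop criterion
            break
        prev = cur
    return lam, loss(lam)

# Synthetic check: targets generated from known weights [0.5, 0.25].
feats = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.5], [0.7, 0.3]]
ys = [0.5 * a + 0.25 * b for a, b in feats]
w, final_loss = train_weights(feats, ys)
print(w, final_loss)
```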
According to the above training steps, the present embodiment obtains the weights of R_k(Q_new, Q_rel):
λ_lev = 0.055985, λ_w2v = 0.753228, λ_vec = 0.207070, λ_lda = 0.475735, λ_qa = −0.122604
The final similarity is then calculated as:
Sim(Q_new, Q_rel) = 0.055985 × 0.225969 + 0.753228 × 0.304225 + 0.207070 × 0.738933 + 0.475735 × 0.685844 + (−0.122604) × 0.018413 ≈ 0.718835
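The final weighted sum over the embodiment's values can be checked directly:

```python
# Per-feature similarities and trained weights from the embodiment above:
R = {"lev": 0.225969, "w2v": 0.304225, "vec": 0.738933,
     "lda": 0.685844, "qa": 0.018413}
LAMBDA = {"lev": 0.055985, "w2v": 0.753228, "vec": 0.207070,
          "lda": 0.475735, "qa": -0.122604}

sim = sum(LAMBDA[k] * R[k] for k in R)  # Sim = sum_k lambda_k * R_k
print(round(sim, 6))  # → 0.718835
```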
The technology described in the present invention can be implemented by various means. For example, the technology may be implemented in hardware, firmware, software, or a combination thereof. For a hardware implementation, the processing modules may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, electronic devices, other electronic units designed to perform the functions described in the invention, or a combination thereof.
For a firmware and/or software implementation, the technology may be implemented with modules (e.g. procedures, steps, processes, etc.) that perform the functions described herein. The firmware and/or software code may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be carried out by program instructions running on related hardware; the aforementioned program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments; the aforementioned storage medium includes any medium capable of storing program code, such as ROM, RAM, magnetic disk, or optical disk.
The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included within the protection scope of the present invention.
Claims (10)
1. A multi-feature-based question similarity calculation method, characterized by comprising the steps of: for an input new question sentence, comparing it against the stored historical questions and their corresponding answers, and calculating, between the new question and a historical question, a similarity based on character features, a similarity based on word semantic features, a similarity based on sentence semantic features, a similarity based on latent sentence topic features, and a similarity based on answer semantic features; the final similarity is the sum of the products of the above 5 similarities and their respective weights, the weights being obtained by linear regression training.
2. The multi-feature-based question similarity calculation method according to claim 1, characterized in that the new question sentence, historical questions, and corresponding answers used for the comparison are first preprocessed, the preprocessing including removing punctuation, case conversion, and removing stop words and low-frequency words.
3. The multi-feature-based question similarity calculation method according to claim 1, characterized in that the method of calculating the character-feature-based similarity is: obtaining a relational matrix between words by calculating the edit distance between each pair of words, then calculating the soft cosine distance from the TF-IDF representations of the question sentences and the relational matrix, and taking the soft cosine distance as the character-feature-based similarity.
4. The multi-feature-based question similarity calculation method according to claim 3, characterized in that the relational matrix between words is calculated as follows:
Define the corpus as the question and answer text dataset used for training and testing the model, and assume the dictionary size of the corpus is n. The edit distances between words then form the relational matrix Mlev, in which element mi,j is the edit-distance score between the i-th word wi and the j-th word wj in the dictionary, calculated as:

mi,j = α, if i = j;  mi,j = (1 − lev(wi,wj) / max(||wi||, ||wj||))^β, otherwise

where ||wi|| is the number of characters contained in word wi, ||wj|| is the number of characters contained in word wj, α is the weighting factor of the diagonal elements, and β is an intensifier of the distance score. lev(wi,wj) is computed by the recursion:

lev(m, n) = max(m, n), if min(m, n) = 0;
lev(m, n) = min( lev(m−1, n) + 1, lev(m, n−1) + 1, lev(m−1, n−1) + cost ), otherwise

where m and n denote the lengths of the words wi and wj, and cost is the cost of replacing the m-th character of wi with the n-th character of wj: if the two characters agree, cost is 0, otherwise cost is 1.
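A sketch of claim 4 in Python: `lev` implements the standard dynamic-programming form of the recursion, while the exact off-diagonal form used in `relation_matrix` (a normalized edit similarity raised to β, with α on the diagonal) is an assumption for illustration, since the original formula image is not reproduced here:

```python
def lev(a, b):
    """Edit distance between strings a and b (iterative DP form of the recursion)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def relation_matrix(vocab, alpha=1.0, beta=2.0):
    """Edit-distance relation matrix M_lev over a small dictionary."""
    n = len(vocab)
    M = [[0.0] * n for _ in range(n)]
    for i, wi in enumerate(vocab):
        for j, wj in enumerate(vocab):
            if i == j:
                M[i][j] = alpha  # alpha weights the diagonal
            else:
                sim = 1.0 - lev(wi, wj) / max(len(wi), len(wj))
                M[i][j] = sim ** beta  # beta sharpens the score
    return M
```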
5. The multi-feature-based question similarity calculation method according to claim 3, characterized in that the TF-IDF representation of a question sentence is calculated as follows: in a sentence, for each word wi, a TF value and an IDF value are calculated, where the TF value is the frequency with which the word occurs in the current sentence and the IDF value is its inverse document frequency; the TF-IDF of word wi is:
TFIDFi = TFi * IDFi
For a newly posed question Qnew and a historical question Qrel, the soft cosine distance is calculated as follows. Qnew and Qrel are represented as TFIDFnew and TFIDFrel:
TFIDFnew = [dnew,1, dnew,2, …, dnew,n]T
TFIDFrel = [drel,1, drel,2, …, drel,n]T
where dnew,i is the TF-IDF value of wi in Qnew, drel,j is the TF-IDF value of wj in Qrel, n is the number of words in the corpus dictionary, and T denotes vector transposition.
Then, using the inter-word relational matrix Mlev obtained above, the character-feature-based similarity Rlev(Qnew,Qrel) between Qnew and Qrel is calculated as the soft cosine distance:

Rlev(Qnew,Qrel) = (TFIDFnewT·Mlev·TFIDFrel) / ( √(TFIDFnewT·Mlev·TFIDFnew) · √(TFIDFrelT·Mlev·TFIDFrel) )

where "·" denotes the dot product of vectors and matrices.
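The soft cosine distance of claim 5 can be sketched as follows for dense TF-IDF vectors `x`, `y` and a relational matrix `M`; all inputs here are toy values:

```python
import math

def soft_cosine(x, y, M):
    """Soft cosine similarity of dense vectors x, y under relation matrix M."""
    def quad(u, v):
        # computes u^T M v
        return sum(ui * M[i][j] * vj
                   for i, ui in enumerate(u)
                   for j, vj in enumerate(v))

    den = math.sqrt(quad(x, x)) * math.sqrt(quad(y, y))
    return quad(x, y) / den if den else 0.0
```

With M the identity matrix this reduces to the ordinary cosine similarity; off-diagonal entries let related words contribute to the score even when the questions share no exact word.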
6. The multi-feature-based question similarity calculation method according to claim 1, characterized in that the method of calculating the similarity based on word semantic features comprises the steps of:
(6-1) training a distributed representation of each word in the corpus with the word2vec tool, i.e. each word corresponds to a K-dimensional real-valued vector;
(6-2) for a corpus with dictionary size n, computing the pairwise semantic relation mi,j, i,j ∈ [1,n], between words in the dictionary as the cosine distance between their word vectors, yielding the relational matrix Mw2v, Mw2v ∈ R^{n×n};
(6-3) reading the TF-IDF representations of the questions Qnew and Qrel, namely TFIDFnew and TFIDFrel;
(6-4) calculating the word-semantic-feature-based similarity Rw2v(Qnew,Qrel) between Qnew and Qrel as the soft cosine distance:

Rw2v(Qnew,Qrel) = (TFIDFnewT·Mw2v·TFIDFrel) / ( √(TFIDFnewT·Mw2v·TFIDFnew) · √(TFIDFrelT·Mw2v·TFIDFrel) )
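Step (6-2), building Mw2v from pairwise cosines, can be sketched as follows; the toy 2-dimensional vectors stand in for trained word2vec vectors:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den if den else 0.0

def w2v_relation_matrix(word_vectors):
    """Pairwise cosine matrix M_w2v over the dictionary; word_vectors[i]
    stands in for the word2vec vector of the i-th dictionary word."""
    n = len(word_vectors)
    return [[cosine(word_vectors[i], word_vectors[j]) for j in range(n)]
            for i in range(n)]
```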
7. The multi-feature-based question similarity calculation method according to claim 1, characterized in that the method of calculating the similarity based on sentence semantic features comprises the steps of:
Given the word vectors wi ∈ R^K trained by the word2vec tool, and assuming a question sentence contains M words, the question is expressed as Qmatrix = (w1, w2, …, wM).
The question sentence is then represented as the arithmetic mean of the word vectors in the sentence, i.e. the vector:

AVG = (1/M) · Σ_{i=1}^{M} wi

According to the above formula, the questions Qnew and Qrel yield the vectors AVGnew and AVGrel by averaging the word vectors of their respective sentences, and the sentence-semantic-feature-based similarity Rvec(Qnew,Qrel) between Qnew and Qrel is obtained as the cosine distance between AVGnew and AVGrel.
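The sentence-averaging similarity of claim 7 can be sketched as follows; the nested lists stand in for per-word word2vec vectors:

```python
import math

def avg_vector(word_vectors):
    """Represent a sentence as the arithmetic mean of its word vectors."""
    m = len(word_vectors)
    k = len(word_vectors[0])
    return [sum(v[d] for v in word_vectors) / m for d in range(k)]

def r_vec(sent_a_vectors, sent_b_vectors):
    """Cosine similarity of the two averaged sentence vectors."""
    u, v = avg_vector(sent_a_vectors), avg_vector(sent_b_vectors)
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den if den else 0.0
```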
8. The multi-feature-based question similarity calculation method according to claim 1, characterized in that the method of calculating the similarity based on latent sentence topics comprises the steps of:
The corpus is used as the input of the LdaModel function in Gensim, and an LDA model is obtained by training; the newly posed question Qnew and the historical question Qrel whose topic distributions are to be computed are then fed to the trained LDA model, yielding topic-distribution vector representations of the two questions, denoted LDAnew and LDArel respectively; the sentence-latent-topic-feature-based similarity Rlda(Qnew,Qrel) between Qnew and Qrel is obtained as the cosine distance between LDAnew and LDArel.
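Assuming the topic distributions LDAnew and LDArel have already been inferred (in the method, via Gensim's LdaModel), the final step of claim 8 reduces to a cosine over the two distributions:

```python
import math

def topic_similarity(lda_new, lda_rel):
    """Cosine similarity between two LDA topic distributions.

    The distributions would come from a trained topic model, e.g.
    gensim's LdaModel(corpus, num_topics=T, id2word=dictionary);
    here they are given directly as dense probability vectors.
    """
    num = sum(a * b for a, b in zip(lda_new, lda_rel))
    den = (math.sqrt(sum(a * a for a in lda_new))
           * math.sqrt(sum(b * b for b in lda_rel)))
    return num / den if den else 0.0
```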
9. The multi-feature-based question similarity calculation method according to claim 1, characterized in that the method of calculating the similarity based on answer semantic features is: for the newly posed question Qnew, the historical question Qrel, and the candidate answer Arel corresponding to Qrel, calculating the word-level semantic similarity between Qnew and Arel as the answer-semantic-feature-based similarity Rqa(Qnew,Qrel) of the two questions; the detailed process is as follows:
(9-1) training a distributed representation of each word in the corpus with the word2vec tool, i.e. each word corresponds to a K-dimensional real-valued vector;
(9-2) for a corpus with dictionary size n, computing the pairwise semantic relation mi,j, i,j ∈ [1,n], between words in the dictionary as the cosine distance between their word vectors, yielding the relational matrix Mw2v, Mw2v ∈ R^{n×n};
(9-3) reading the TF-IDF representation TFIDFnew of the question Qnew; representing Arel by TF-IDF to obtain TFIDFans = [dans,1, dans,2, …, dans,n]T, where dans,i is the TF-IDF value of wi in Arel, n is the number of words in the corpus dictionary, and T denotes vector transposition;
(9-4) calculating the answer-semantic-feature-based similarity Rqa(Qnew,Qrel) as the soft cosine distance:

Rqa(Qnew,Qrel) = (TFIDFnewT·Mw2v·TFIDFans) / ( √(TFIDFnewT·Mw2v·TFIDFnew) · √(TFIDFansT·Mw2v·TFIDFans) )
10. The multi-feature-based question similarity calculation method according to claim 1, characterized in that the final similarity is calculated as:

R(Qnew,Qrel) = Σ_k λk·Rk(Qnew,Qrel)

where Rk(Qnew,Qrel) denotes the similarity between Qnew and Qrel based on feature k, namely the character-feature-based similarity, the word-semantic-feature-based similarity, the sentence-semantic-feature-based similarity, the sentence-latent-topic-feature-based similarity, and the answer-semantic-feature-based similarity, respectively; λk is a parameter obtained by linear regression training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811041071.0A CN109344236B (en) | 2018-09-07 | 2018-09-07 | Problem similarity calculation method based on multiple characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109344236A true CN109344236A (en) | 2019-02-15 |
CN109344236B CN109344236B (en) | 2020-09-04 |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399615A (en) * | 2019-07-29 | 2019-11-01 | 中国工商银行股份有限公司 | Transaction risk monitoring method and device |
CN110532565A (en) * | 2019-08-30 | 2019-12-03 | 联想(北京)有限公司 | Sentence processing method and processing device and electronic equipment |
CN110543551A (en) * | 2019-09-04 | 2019-12-06 | 北京香侬慧语科技有限责任公司 | question and statement processing method and device |
CN110738049A (en) * | 2019-10-12 | 2020-01-31 | 招商局金融科技有限公司 | Similar text processing method and device and computer readable storage medium |
CN110781662A (en) * | 2019-10-21 | 2020-02-11 | 腾讯科技(深圳)有限公司 | Method for determining point-to-point mutual information and related equipment |
CN110825857A (en) * | 2019-09-24 | 2020-02-21 | 平安科技(深圳)有限公司 | Multi-turn question and answer identification method and device, computer equipment and storage medium |
CN111191464A (en) * | 2020-01-17 | 2020-05-22 | 珠海横琴极盛科技有限公司 | Semantic similarity calculation method based on combined distance |
CN111259668A (en) * | 2020-05-07 | 2020-06-09 | 腾讯科技(深圳)有限公司 | Reading task processing method, model training device and computer equipment |
CN111368177A (en) * | 2020-03-02 | 2020-07-03 | 北京航空航天大学 | Answer recommendation method and device for question-answer community |
CN111401031A (en) * | 2020-03-05 | 2020-07-10 | 支付宝(杭州)信息技术有限公司 | Target text determination method, device and equipment |
CN111414765A (en) * | 2020-03-20 | 2020-07-14 | 北京百度网讯科技有限公司 | Sentence consistency determination method and device, electronic equipment and readable storage medium |
CN111582498A (en) * | 2020-04-30 | 2020-08-25 | 重庆富民银行股份有限公司 | QA (quality assurance) assistant decision method and system based on machine learning |
CN111680515A (en) * | 2020-05-21 | 2020-09-18 | 平安国际智慧城市科技股份有限公司 | Answer determination method and device based on AI (Artificial Intelligence) recognition, electronic equipment and medium |
CN111723297A (en) * | 2019-11-20 | 2020-09-29 | 中共南通市委政法委员会 | Grid social situation research and judgment-oriented dual semantic similarity discrimination method |
CN112380830A (en) * | 2020-06-18 | 2021-02-19 | 达而观信息科技(上海)有限公司 | Method, system and computer readable storage medium for matching related sentences in different documents |
CN112507097A (en) * | 2020-12-17 | 2021-03-16 | 神思电子技术股份有限公司 | Method for improving generalization capability of question-answering system |
CN112632252A (en) * | 2020-12-25 | 2021-04-09 | 中电金信软件有限公司 | Dialogue response method, dialogue response device, computer equipment and storage medium |
CN112926340A (en) * | 2021-03-25 | 2021-06-08 | 东南大学 | Semantic matching model for knowledge point positioning |
CN113139034A (en) * | 2020-01-17 | 2021-07-20 | 深圳市优必选科技股份有限公司 | Statement matching method, statement matching device and intelligent equipment |
CN113377927A (en) * | 2021-06-28 | 2021-09-10 | 成都卫士通信息产业股份有限公司 | Similar document detection method and device, electronic equipment and storage medium |
CN113392176A (en) * | 2020-09-28 | 2021-09-14 | 腾讯科技(深圳)有限公司 | Text similarity determination method, device, equipment and medium |
CN113722459A (en) * | 2021-08-31 | 2021-11-30 | 平安国际智慧城市科技股份有限公司 | Question and answer searching method based on natural language processing model and related device |
CN113779183A (en) * | 2020-06-08 | 2021-12-10 | 北京沃东天骏信息技术有限公司 | Text matching method, device, equipment and storage medium |
CN113792125A (en) * | 2021-08-25 | 2021-12-14 | 北京库睿科技有限公司 | Intelligent retrieval sorting method and device based on text relevance and user intention |
CN113779183B (en) * | 2020-06-08 | 2024-05-24 | 北京沃东天骏信息技术有限公司 | Text matching method, device, equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101286161A (en) * | 2008-05-28 | 2008-10-15 | 华中科技大学 | Intelligent Chinese request-answering system based on concept |
CN103729381A (en) * | 2012-10-16 | 2014-04-16 | 佳能株式会社 | Method and device used for recognizing semantic information in series of documents |
US20140229163A1 (en) * | 2013-02-12 | 2014-08-14 | International Business Machines Corporation | Latent semantic analysis for application in a question answer system |
CN105701253A (en) * | 2016-03-04 | 2016-06-22 | 南京大学 | Chinese natural language interrogative sentence semantization knowledge base automatic question-answering method |
CN106997376A (en) * | 2017-02-28 | 2017-08-01 | 浙江大学 | The problem of one kind is based on multi-stage characteristics and answer sentence similarity calculating method |
CN107092596A (en) * | 2017-04-24 | 2017-08-25 | 重庆邮电大学 | Text emotion analysis method based on attention CNNs and CCR |
US20180075016A1 (en) * | 2016-09-15 | 2018-03-15 | International Business Machines Corporation | System and method for automatic, unsupervised paraphrase generation using a novel framework that learns syntactic construct while retaining semantic meaning |
Non-Patent Citations (1)
Title |
---|
YU Y et al.: "A noval similarity calculation method based on Chinese sentence keyword weight", Journal of Software * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||