CN109344236A - A multi-feature-based question similarity calculation method - Google Patents

A multi-feature-based question similarity calculation method Download PDF

Info

Publication number
CN109344236A
CN109344236A (application CN201811041071.0A / CN201811041071A; granted as CN109344236B)
Authority
CN
China
Prior art keywords
new
rel
similarity
word
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811041071.0A
Other languages
Chinese (zh)
Other versions
CN109344236B (en)
Inventor
刘波 (Liu Bo)
彭永幸 (Peng Yongxing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
University of Jinan
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University filed Critical Jinan University
Priority to CN201811041071.0A priority Critical patent/CN109344236B/en
Publication of CN109344236A publication Critical patent/CN109344236A/en
Application granted granted Critical
Publication of CN109344236B publication Critical patent/CN109344236B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/30 — Semantic analysis
    • G06F40/20 — Natural language analysis
    • G06F40/205 — Parsing
    • G06F40/211 — Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars


Abstract

The invention discloses a multi-feature-based question similarity calculation method, comprising the steps of: for an input new question sentence, comparing it with the stored historical questions and their corresponding answers, and computing the similarity between the new question and a historical question based on character features, the similarity based on word semantic features, the similarity based on sentence semantic features, the similarity based on latent sentence topic features, and the similarity based on answer semantic features. The final similarity is the sum of the products of these five similarities and their respective weights, where the weights are trained by linear regression. By using multiple features, the invention increases the diversity of sample attributes and improves the generalization ability of the model. Meanwhile, the soft cosine distance is used to fuse TF-IDF with edit distance, word semantics and other information, which bridges the semantic gap between words and improves the accuracy of similarity calculation.

Description

A multi-feature-based question similarity calculation method
Technical field
The present invention relates to the field of computer natural language processing and automatic question answering systems, and in particular to a multi-feature-based question similarity calculation method.
Background technique
With the rapid growth of digital information, it has become increasingly difficult for people to obtain the information they need from the network. How to quickly and precisely find the required information for users among massive digital information poses a serious challenge to natural language processing (NLP) and information retrieval technology. Therefore, in order to provide users with a real-time, accurate channel for obtaining information, research institutions and technology companies have begun to study automatic question answering (QA) systems. In a QA system, the user only needs to input a question to directly obtain the corresponding answer; the user no longer needs to extract keywords from the question, run a search, and read a large number of web pages to find the answer. QA systems are simpler, more real-time and more accurate than traditional search engines, provide a comfortable human-computer interaction experience, and have become a research hotspot of the new generation of information technology. A QA system allows the user to describe a problem in natural language, accurately understands the user's question, organizes an answer by retrieving a QA repository or searching the Internet, and finally returns a refined, accurate result, providing an efficient channel for information acquisition.
Question similarity calculation is an important link in a QA system. Its goal is to find, from the existing question set, the historical question most similar to the newly posed question, so that the answer to the new question can be provided from the answer set of that historical question.
At present, there are already achievements in the domestic automatic question answering field. General community QA systems include Quora, Toutiao Q&A and Baidu Knows; professional community QA systems cover many specialties, such as IT-related systems like Stack Overflow and CSDN. The question similarity calculation method directly affects the accuracy of a QA system, so it has good industrial prospects.
Through years of research, automatic QA systems have formed a general framework, mainly composed of three modules: information retrieval, question analysis, and answer acquisition. The main task of the question analysis module is to analyze the question input by the user and find, from the existing question set, the historical questions most similar to the newly posed question. Its research content involves question similarity analysis and question ranking, the most important part of which is the similarity calculation between questions, by which the historical question set is ranked. The answer acquisition module mainly obtains the corresponding answer set according to the retrieved set of similar questions.
Text similarity techniques are the basis of question similarity techniques (questions and answers are both text). There are mainly three kinds of text similarity calculation methods.
The first is similarity calculation based on the vector space model (VSM), which maps a text to a point in a vector space and then uses mathematical methods to calculate the distance between points. Researchers have proposed applying the VSM model to the similar-question retrieval task of frequently asked questions (Frequently Asked Questions, FAQ), improving VSM for the characteristics of the FAQ task. However, with this method text sparsity leads to excessive dimensionality, and semantic divergence problems easily arise.
The second is similarity calculation based on syntactic analysis, which introduces a graphical representation to describe how the phrases in a sentence govern and are governed by each other. Researchers have proposed an analysis method based on deep structure: first perform dependency analysis on the question, select the most important word in the sentence together with the effective words directly attached to it for matching, and then carry out Chinese text similarity calculation based on the dependency structure. However, tools such as syntactic analysis and dependency analysis are relatively complex, require a linguistics background, and perform poorly on complicated long sentence patterns.
The third is semantics-based similarity calculation, which includes word-level and sentence-level semantic similarity. Word semantic similarity is usually computed using semantic dictionaries such as WordNet and HowNet, which contain the semantic relations between words. Some researchers believe that the complete expression of a short sentence depends not only on its syntactic structure but also on its words and their weights, and therefore use WordNet to improve the semantic representation of words. For sentence semantic similarity, researchers have used IBM machine translation models to learn the transition probability between two question sentences, representing the semantic similarity of sentences in order to retrieve similar questions. However, such methods depend too heavily on semantic dictionaries, and the accuracy of the similarity calculation is affected by the completeness and correctness of the dictionary; likewise, semantics-based similarity calculation methods perform poorly on syntactically complex long sentences.
Meanwhile, most prior-art methods extract text representation features from a single type of information, focusing on single-type features, and do not take into account that the meaning of a text is composed of many-sided, multi-level information; therefore the accuracy of the calculated similarity is not high.
Summary of the invention
In order to overcome the deficiencies of the prior art, the object of the present invention is to provide a multi-feature-based question similarity calculation method. The method is applied to similarity calculation between questions in English QA systems and has the advantage of high accuracy.
The object of the present invention is achieved by the following technical solution: a multi-feature-based question similarity calculation method, comprising the steps of: for an input new question sentence, comparing it with the stored historical questions and corresponding answers, and calculating the similarity between the new question and a historical question based on character features, the similarity based on word semantic features, the similarity based on sentence semantic features, the similarity based on latent sentence topic features, and the similarity based on answer semantic features; the final similarity is the sum of the products of the above five similarities and their respective weights, where the weights are obtained by linear regression training.
Preferably, the new question sentence, the historical questions and the corresponding answers used for the comparison are first preprocessed, including removing punctuation marks, case conversion (all uppercase letters are converted to lowercase), and removing stop words and low-frequency words.
Preferably, the method for calculating the similarity based on character features is: first, obtain the relation matrix between words by calculating the edit distance between each pair of words, and then compute the soft cosine distance from the TF-IDF (term frequency-inverse document frequency) representations of the question sentences and the relation matrix, as the similarity based on character features.
Further, the calculation method of the relation matrix between words is:
The corpus is defined as the question-and-answer text data set used for training and testing the model. Assuming the size of the dictionary in the corpus is n, the relation matrix formed by the edit distances between words is M_lev ∈ R^(n×n), where R^(n×n) is the set of real matrices of size n × n (the same meaning applies hereafter), and the element m_i,j of M_lev is derived from the edit distance between the i-th word w_i and the j-th word w_j of the dictionary:

m_i,j = α · (1 − lev(w_i, w_j) / max(||w_i||, ||w_j||))^β

where ||w_i|| is the number of characters contained in word w_i, ||w_j|| is the number of characters contained in word w_j, α is the weighting factor of the diagonal elements, and β is the intensifier of the distance score. lev(w_i, w_j) is computed by the recursion

lev(m, n) = max(m, n), if min(m, n) = 0;
lev(m, n) = min( lev(m−1, n) + 1, lev(m, n−1) + 1, lev(m−1, n−1) + cost ), otherwise,

where m and n are the lengths (numbers of characters) of w_i and w_j, and cost denotes the cost of replacing the m-th character of w_i with the n-th character of w_j: if the two characters are identical, cost is 0; otherwise cost is 1.
Further, the TF-IDF representation of a question sentence is calculated as follows: in a sentence, for word w_i, compute the TF value and the IDF value, where the TF value is the frequency with which the word occurs in the current sentence and the IDF value is the inverse document frequency. The TF-IDF of word w_i is:

TFIDF_i = TF_i × IDF_i
Further, for a newly posed question Q_new and a historical question Q_rel, the soft cosine distance is calculated as follows:
The questions Q_new and Q_rel are represented as TFIDF_new and TFIDF_rel:

TFIDF_new = [d_new,1, d_new,2, …, d_new,n]^T
TFIDF_rel = [d_rel,1, d_rel,2, …, d_rel,n]^T

where d_new,i is the TF-IDF value of w_i in Q_new, d_rel,j is the TF-IDF value of w_j in Q_rel, n is the number of words in the corpus dictionary, and T denotes vector transposition.
With the word relation matrix M_lev obtained above, the similarity based on character features between Q_new and Q_rel, R_lev(Q_new, Q_rel), is computed by the soft cosine distance:

R_lev(Q_new, Q_rel) = (TFIDF_new · M_lev · TFIDF_rel) / ( sqrt(TFIDF_new · M_lev · TFIDF_new) × sqrt(TFIDF_rel · M_lev · TFIDF_rel) )

where "·" denotes the dot product of vectors with a matrix (the same meaning applies hereafter), computed as a · M · b = Σ_{i=1}^{n} Σ_{j=1}^{n} a_i · m_i,j · b_j.
Preferably, the steps of calculating the similarity based on word semantic features are:
(1) Train distributed representations of every word in the corpus with the word2vec tool, i.e., each word corresponds to a K-dimensional real vector.
(2) For the corpus whose dictionary size is n, compute the pairwise semantic relations m_i,j, i, j ∈ [1, n], between the words of the dictionary from the cosine distances between their word vectors, obtaining the relation matrix M_w2v ∈ R^(n×n).
(3) Read the TF-IDF representations of the questions Q_new and Q_rel, namely TFIDF_new and TFIDF_rel.
(4) Compute the similarity based on word semantic features between Q_new and Q_rel using the soft cosine distance:

R_w2v(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_rel) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_rel · M_w2v · TFIDF_rel) )
Preferably, the steps of calculating the similarity based on sentence semantic features are:
According to the word vectors w_i ∈ R^K trained by word2vec, assuming a question sentence contains M words, the sentence can be represented as Q_matrix = (w_1, w_2, …, w_M), Q_matrix ∈ R^(K×M). The question sentence is represented as the arithmetic mean of its word vectors, i.e., the vector

AVG = (1/M) · Σ_{m=1}^{M} w_m

According to the above formula, the vectors AVG_new and AVG_rel are obtained by computing the arithmetic means of the word vectors of Q_new and Q_rel respectively, and the similarity based on sentence semantic features between Q_new and Q_rel, R_vec(Q_new, Q_rel), is obtained by taking the cosine distance between AVG_new and AVG_rel.
Preferably, the steps of calculating the similarity based on latent sentence topic features are:
The corpus is used as the input of the LdaModel function in Gensim, and an LDA model is obtained by training; then the newly posed question Q_new and the historical question Q_rel, whose topic distributions are needed, are input to the trained LDA model, giving the topic-distribution vector representations of the two questions, denoted LDA_new and LDA_rel respectively. The similarity based on latent sentence topic features between Q_new and Q_rel, R_lda(Q_new, Q_rel), is obtained by taking the cosine distance between LDA_new and LDA_rel.
Preferably, the steps of calculating the similarity based on answer semantic features are:
In a QA system, each historical question corresponds to a set of candidate answers. For a newly posed question Q_new, a historical question Q_rel and the candidate answer A_rel corresponding to Q_rel, the word-level semantic similarity between Q_new and A_rel is computed as the similarity of the two questions based on answer semantic features, R_qa(Q_new, Q_rel). The detailed process is as follows:
(1) Train distributed representations of every word in the corpus with the word2vec tool, i.e., each word corresponds to a K-dimensional real vector.
(2) For the corpus whose dictionary size is n, compute the pairwise semantic relations m_i,j, i, j ∈ [1, n], between the words of the dictionary from the cosine distances between their word vectors, obtaining the relation matrix M_w2v ∈ R^(n×n).
(3) Read the TF-IDF representation TFIDF_new of Q_new; represent A_rel by TF-IDF, obtaining TFIDF_ans = [d_ans,1, d_ans,2, …, d_ans,n]^T, where d_ans,i is the TF-IDF value of w_i in A_rel, n is the number of words in the corpus dictionary, and T denotes vector transposition.
(4) Compute the similarity based on answer semantic features between Q_new and Q_rel using the soft cosine distance:

R_qa(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_ans) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_ans · M_w2v · TFIDF_ans) )
Preferably, the final similarity is calculated as

Sim(Q_new, Q_rel) = Σ_k λ_k · R_k(Q_new, Q_rel)

where R_k(Q_new, Q_rel) is the similarity based on feature k, i.e., respectively the similarity based on character features, on word semantic features, on sentence semantic features, on latent sentence topic features, and on answer semantic features, and λ_k is a parameter to be trained, obtained by linear regression analysis.
Further, the iterative steps in the linear regression training process are as follows:
(1) Randomly initialize the weight λ_k corresponding to each feature and compute the squared loss of the current iteration from the weights; the squared loss function is

L(λ) = Σ_{i=1}^{I} ( Ŷ^(i) − Y^(i) )²

where I is the number of given training samples (a training sample is a known question pair together with the similarity of that pair), Ŷ^(i) is the predicted similarity of the i-th sample, and Y^(i) is the given similarity of the i-th sample;
(2) Take the partial derivative of the squared loss with respect to each feature weight λ_k, obtaining the gradient ∇λ_k^t of the weight at the current iteration, where t denotes the t-th iteration;
(3) Update each feature weight according to λ_k^(t+1) = λ_k^t − α∇λ_k^t, where α is the step size;
(4) Recompute the squared loss with the new weights; if the current squared loss is not less than that of the previous iteration, stop iterating and take the current λ_k as the final weight of each feature; otherwise repeat steps (2)-(4).
Compared with the prior art, the present invention has the following advantages and technical effects:
(1) In machine learning, a training sample is described by a set of attributes, and different attribute sets provide different perspectives for observing the data. The present invention extracts five types of features, viewing natural-language question and answer sentences from five different perspectives. Compared with representations based on a single feature type, multiple features increase the diversity of sample attributes and improve the generalization ability of the model.
(2) The present invention uses the soft cosine distance to fuse TF-IDF with edit distance, word semantics and other information when calculating the similarity between questions. Compared with traditional similarity calculation methods, the present invention bridges the semantic gap between words and improves the accuracy of similarity calculation.
Brief description of the drawings
Fig. 1 is a flow chart of the method of this embodiment.
Specific embodiment
The present invention will now be described in further detail with reference to the embodiments and the accompanying drawings, but embodiments of the present invention are not limited thereto.
In this embodiment, the multi-feature-based question similarity calculation method measures the similarity between two question sentences using five kinds of features: character features, word semantic features, sentence semantic features, latent sentence topic features, and answer semantic features. The similarities based on these five features are combined into the final similarity between the new question and the historical question. Referring to Fig. 1, each step of the method is described in detail below with an example.
(1) Input a new question sentence Q_new: Where I can buy good oil for massage?
(2) Read a historical question sentence Q_rel: Is there any place i can find scented massage oils in qatar?
At the same time, read the answer A_rel of the historical question: Yes. It is right behind Kahrama in the National area.
(3) Preprocess Q_new, Q_rel and A_rel respectively, including: removing punctuation marks, case conversion (all uppercase letters are converted to lowercase), and removing stop words and low-frequency words. This gives:
Q_new: buy good oil massage
Q_rel: place find scented massage oils qatar
A_rel: yes right behind kahrama national area
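The preprocessing step above can be sketched in Python. The stop-word list here is a hypothetical minimal set chosen only to reproduce this example; a real system would use a standard list:

```python
import re

# Hypothetical minimal stop-word list (assumption, chosen to match the example).
STOPWORDS = {"where", "i", "can", "for", "is", "there", "any", "in", "it", "the"}

def preprocess(sentence):
    """Lowercase, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Where I can buy good oil for massage?"))
# → ['buy', 'good', 'oil', 'massage']
```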
(4) Calculate the similarities between Q_new and Q_rel based on the following five kinds of features.
In the following it is assumed that {area, behind, buy, find, good, kahrama, massage, national, oil, oils, place, qatar, right, scented, yes} is the corpus dictionary set.
(4-1) Similarity based on character features
Character features measure the similarity between words at the character level, using the edit distance. First, the relation matrix M_lev between words is obtained by calculating the edit distance between each pair of words; then the soft cosine distance is computed from the TF-IDF representations of the question sentences and the relation matrix M_lev, as the similarity based on character features. The details are as follows:
(4-1-1) Calculate the relation matrix M_lev
Assume the size of the dictionary in the corpus (the question-and-answer text data set used for training and testing the model; the same meaning applies hereafter) is n. Then the relation matrix formed by the edit distances between words is M_lev ∈ R^(n×n), where R^(n×n) is the set of real matrices of size n × n, and the element m_i,j of M_lev is derived from the edit distance between the i-th word w_i and the j-th word w_j of the dictionary:

m_i,j = α · (1 − lev(w_i, w_j) / max(||w_i||, ||w_j||))^β

where ||w_i|| is the number of characters contained in word w_i, ||w_j|| is the number of characters contained in word w_j, α is the weighting factor of the diagonal elements, and β is the intensifier of the distance score; here α = 1.8 and β = 5. lev(w_i, w_j) is computed by the recursion

lev(m, n) = max(m, n), if min(m, n) = 0;
lev(m, n) = min( lev(m−1, n) + 1, lev(m, n−1) + 1, lev(m−1, n−1) + cost ), otherwise,

where m and n are the lengths (numbers of characters) of w_i and w_j, and cost denotes the cost of replacing the m-th character of w_i with the n-th character of w_j: if the two characters are identical, cost is 0; otherwise cost is 1.
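A sketch of the edit distance and one relation-matrix entry. The closed form m_ij = α(1 − lev/max(||w_i||, ||w_j||))^β with α = 1.8, β = 5 is an assumption reconstructed from the stated roles of α and β, since the original formula images are not reproduced in this text:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (insert/delete/substitute)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

def relation_entry(wi: str, wj: str, alpha: float = 1.8, beta: float = 5.0) -> float:
    """One entry m_ij of the edit-distance relation matrix (assumed form)."""
    return alpha * (1 - levenshtein(wi, wj) / max(len(wi), len(wj))) ** beta

print(levenshtein("oil", "oils"))  # → 1
```

Filling M_lev is then a double loop over the 15-word dictionary.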
Since the dictionary contains 15 words, M_lev is a 15 × 15 matrix.
(4-1-2) Calculate the TF-IDF representations of the question sentences
TF-IDF is composed of TF and IDF.
In a sentence, the TF value of word w_i is

TF_i = n_i / Σ_k n_k

where TF_i is the frequency with which the i-th word occurs in the current sentence, n_i is the number of times that word occurs in the current sentence, and n_k is the number of times the k-th word occurs in the current sentence.
For a given corpus, the IDF value of each word is fixed; for word w_i it is calculated as

IDF_i = log( |D| / |{d : w_i ∈ d}| )

where |D| is the total number of texts and the denominator is the number of texts containing the word.
In a sentence, the TF-IDF of word w_i is:

TFIDF_i = TF_i × IDF_i
The questions Q_new and Q_rel are expressed as the following vectors:

TFIDF_new = [d_new,1, d_new,2, …, d_new,n]^T
TFIDF_rel = [d_rel,1, d_rel,2, …, d_rel,n]^T

where d_new,i is the TF-IDF value of w_i in Q_new, d_rel,j is the TF-IDF value of w_j in Q_rel, n is the number of words in the corpus dictionary, and T denotes vector transposition.
After the preprocessing of step (3), the question sentences Q_new and Q_rel are represented as 15-dimensional vectors of the following form:

TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
TFIDF_rel = [0.0, 0.0, 0.0, 0.423394, 0.0, …]^T
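A minimal TF-IDF sketch over a toy two-document corpus. The exact normalization behind the embodiment's numbers is not shown in this text, so only the TF·IDF structure defined above is illustrated:

```python
import math

def tfidf_vector(tokens, documents, vocab):
    """TF = count / sentence length; IDF = log(|D| / #docs containing the word)."""
    n_total = len(tokens)
    vec = []
    for w in vocab:
        tf = tokens.count(w) / n_total
        df = sum(1 for d in documents if w in d)
        idf = math.log(len(documents) / df) if df else 0.0
        vec.append(tf * idf)
    return vec

q_new = ["buy", "good", "oil", "massage"]
q_rel = ["place", "find", "scented", "massage", "oils", "qatar"]
docs = [q_new, q_rel]
vocab = sorted(set(q_new) | set(q_rel))
v = tfidf_vector(q_new, docs, vocab)
# "massage" appears in every document, so its IDF (and TF-IDF) is 0.
```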
(4-1-3) Calculate the soft cosine distance
The soft cosine distance (Soft Cosine) is an improvement of the cosine distance proposed by Sidorov in 2014; it introduces a relation matrix into the cosine computation to express the relations between words.
The questions Q_new and Q_rel are represented by step (4-1-2) as TFIDF_new and TFIDF_rel, and the relation matrix between words obtained in step (4-1-1) is M_lev. The similarity based on character features between Q_new and Q_rel is computed with the soft cosine distance:

R_lev(Q_new, Q_rel) = (TFIDF_new · M_lev · TFIDF_rel) / ( sqrt(TFIDF_new · M_lev · TFIDF_new) × sqrt(TFIDF_rel · M_lev · TFIDF_rel) )

where "·" is the dot product of vectors with a matrix, a · M · b = Σ_i Σ_j a_i · m_i,j · b_j. In the present embodiment, R_lev(Q_new, Q_rel) = 0.225969.
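The soft cosine itself is a short computation; a sketch with a toy 2 × 2 relation matrix (not the embodiment's 15 × 15 M_lev):

```python
import math

def soft_cosine(a, b, M):
    """Soft cosine: (a·M·b) / (sqrt(a·M·a) * sqrt(b·M·b))."""
    def quad(x, y):
        return sum(x[i] * M[i][j] * y[j]
                   for i in range(len(x)) for j in range(len(y)))
    return quad(a, b) / (math.sqrt(quad(a, a)) * math.sqrt(quad(b, b)))

# With the identity matrix, soft cosine reduces to the ordinary cosine:
identity = [[1.0, 0.0], [0.0, 1.0]]
print(soft_cosine([1.0, 0.0], [0.0, 1.0], identity))  # → 0.0

# An off-diagonal relation m_01 > 0 lets different-but-related words contribute:
related = [[1.0, 0.5], [0.5, 1.0]]
```

With `related`, two sentences sharing no word still get a nonzero similarity, which is exactly how the method bridges the lexical gap.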
(4-2) Similarity based on word semantic features
(4-2-1) Train distributed representations of every word in the corpus with the word2vec tool, i.e., each word corresponds to a 200-dimensional word vector.
(4-2-2) For the corpus whose dictionary size is n, compute the pairwise semantic relations m_i,j, i, j ∈ [1, n], between the words of the dictionary from the cosine distances between their word vectors, obtaining the relation matrix M_w2v ∈ R^(n×n). m_i,j is calculated as

m_i,j = max(0, cos(w_i, w_j))²

where w_i and w_j are the K-dimensional real vectors of the i-th and j-th words of the corpus, w_i, w_j ∈ R^K, R^K being the set of one-dimensional real vectors of length K (the same meaning applies hereafter), and "·" is the standard dot product between vectors (the same meaning applies hereafter):

cos(w_i, w_j) = (w_i · w_j) / ( ||w_i|| × ||w_j|| ),  w_i · w_j = Σ_{m=1}^{K} w_i,m · w_j,m

where w_i,m is the m-th component of w_i and w_j,m is the m-th component of w_j.
In this embodiment the dictionary size is 15 words, so M_w2v is a 15 × 15 matrix.
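The word-vector relation entry m_ij = max(0, cos(w_i, w_j))² can be sketched directly (toy 3-dimensional vectors stand in for the 200-dimensional word2vec vectors and are assumptions):

```python
import math

def cosine(u, v):
    """Ordinary cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def w2v_relation(u, v):
    """m_ij = max(0, cos(u, v)) ** 2 — negative similarities are clipped to 0."""
    return max(0.0, cosine(u, v)) ** 2

# Hypothetical low-dimensional "word vectors":
oil  = [0.9, 0.1, 0.0]
oils = [0.8, 0.2, 0.1]
print(round(w2v_relation(oil, oil), 6))  # → 1.0
```

Clipping at zero keeps M_w2v entries non-negative, so unrelated or opposed words simply contribute nothing to the soft cosine.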
(4-2-3) Read the TF-IDF representations of Q_new and Q_rel computed in (4-1-2):
TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
TFIDF_rel = [0.0, 0.0, 0.0, 0.423394, 0.0, …]^T
(4-2-4) Compute the similarity based on word semantic features between Q_new and Q_rel by the soft cosine formula:

R_w2v(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_rel) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_rel · M_w2v · TFIDF_rel) ) = 0.304225
(4-3) Similarity based on sentence semantic features
According to the word vectors w_i ∈ R^K trained by word2vec, assuming a question sentence contains M words, the sentence can be represented as Q_matrix = (w_1, w_2, …, w_M), Q_matrix ∈ R^(K×M). The question sentence is represented as the arithmetic mean of its word vectors, i.e., the vector

AVG = (1/M) · Σ_{m=1}^{M} w_m

According to the above formula, the vectors AVG_new and AVG_rel are obtained by computing the arithmetic means of the word vectors of Q_new and Q_rel respectively:
AVG_new = [0.014657, 0.075914, −0.042454, 0.219559, −0.117374, …]
AVG_rel = [−0.088187, −0.025432, −0.05328, 0.17098376, −0.13033055, …]
The similarity based on sentence semantic features between Q_new and Q_rel, R_vec(Q_new, Q_rel), is obtained from the cosine distance between AVG_new and AVG_rel:

R_vec(Q_new, Q_rel) = (AVG_new · AVG_rel) / ( ||AVG_new|| × ||AVG_rel|| ) = 0.738933
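The sentence-level feature — average the word vectors, then take the cosine — can be sketched as (toy 2-dimensional vectors are assumptions standing in for 200-dimensional ones):

```python
import math

def sentence_vector(word_vectors):
    """Arithmetic mean of the word vectors of a sentence."""
    M = len(word_vectors)
    K = len(word_vectors[0])
    return [sum(w[k] for w in word_vectors) / M for k in range(K)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Hypothetical word vectors for a two-word sentence:
q_new_vecs = [[1.0, 0.0], [0.0, 1.0]]
avg_new = sentence_vector(q_new_vecs)
print(avg_new)  # → [0.5, 0.5]
```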
(4-4) Similarity based on latent sentence topic features
The present invention uses the LDA (Latent Dirichlet Allocation) topic model to find the latent topics of question sentences. After LDA training, the latent topic distribution of every document in the document set is available, from which the topic vector of a sentence is obtained. For example, if sentence Q_m has latent topic distribution (p_1, p_2, …, p_I), where p_i is the probability of belonging to the i-th topic and I is the number of latent topics, then Q_m is represented as the vector [p_1, p_2, …, p_I].
The topic distribution of a question sentence is computed with the Gensim (https://radimrehurek.com/gensim/) open-source topic-modeling tool. First the corpus is used as the input of the LdaModel function in Gensim, and the LDA model is obtained by training; then a question sentence whose topic distribution is needed is input to the trained model, whose output is the topic distribution of that question, and the sentence is thereby expressed as a vector.
For the newly posed question Q_new and the historical question Q_rel, the topic-distribution vector representations of the two questions, denoted LDA_new and LDA_rel, are computed with the Gensim topic-model tool:
LDA_new = [0.001784, 0.001934, 0.002056, 0.002072, 0.001772, …]
LDA_rel = [0.001706, 0.001850, 0.001967, 0.001982, 0.001695, …]
The similarity based on latent sentence topic features, R_lda(Q_new, Q_rel), is obtained from the cosine distance between LDA_new and LDA_rel:

R_lda(Q_new, Q_rel) = (LDA_new · LDA_rel) / ( ||LDA_new|| × ||LDA_rel|| ) = 0.685844
(4-5) Similarity based on answer semantic features
In a QA system, each historical question corresponds to a set of candidate answers. For the newly posed question Q_new, the historical question Q_rel and its corresponding candidate answer A_rel, the word-level semantic similarity between Q_new and A_rel is computed with a method similar to that of (4-2), as the similarity of the two questions based on answer semantic features, R_qa(Q_new, Q_rel). The detailed process is as follows:
(4-5-1) Read the 200-dimensional word vector of each dictionary word computed in (4-2-1);
(4-5-2) Read the matrix M_w2v computed in (4-2-2);
(4-5-3) Read the TF-IDF representation of Q_new obtained in (4-1-2):
TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
Compute the TF-IDF representation of the A_rel preprocessed in step (3), obtaining the following 15-dimensional vector:
TFIDF_ans = [0.408248, 0.408248, 0.0, 0.0, 0.0, …]^T
(4-5-4) Compute the similarity based on answer semantic features between Q_new and Q_rel using the soft cosine distance:

R_qa(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_ans) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_ans · M_w2v · TFIDF_ans) ) = 0.018413
(5) Calculate the final similarity
The final similarity Sim(Q_new, Q_rel) between Q_new and Q_rel is calculated as

Sim(Q_new, Q_rel) = Σ_k λ_k · R_k(Q_new, Q_rel)
where R_k(Q_new, Q_rel) is the similarity based on feature k, obtained in (4-1)~(4-5) respectively:

R_lev(Q_new, Q_rel) = 0.225969, R_w2v(Q_new, Q_rel) = 0.304225, R_vec(Q_new, Q_rel) = 0.738933, R_lda(Q_new, Q_rel) = 0.685844, R_qa(Q_new, Q_rel) = 0.018413
λ_k is a parameter to be trained, obtained by linear regression analysis. The training uses forward stepwise regression, reducing the error as much as possible at each step and iterating a certain number of times until the loss function is minimal. The squared loss function is

L(λ) = Σ_{i=1}^{I} ( Ŷ^(i) − Y^(i) )²

where I is the number of given training samples (a training sample is a known question pair together with the similarity of that pair), Ŷ^(i) is the predicted similarity of the i-th sample, and Y^(i) is the given similarity of the i-th sample.
The iterative process is as follows:
1. Randomly initialize the weight λk of each feature, and compute the squared loss of the current iteration from these weights;
2. Take the partial derivative of the squared loss with respect to each feature weight λk, obtaining the gradient of the weight at the current iteration, where t denotes the t-th iteration;
3. Update each feature weight along the negative gradient with step size α, taken as 0.1;
4. Recompute the squared loss with the new weights; if the current squared loss is not less than the squared loss of the previous iteration, stop iterating and take the current λk as the final weight of each feature; otherwise repeat steps 2, 3, and 4.
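The four-step iteration above can be sketched as a short gradient-descent loop in Python; the two-feature synthetic data set is purely illustrative:

```python
import numpy as np

def train_weights(R, Y, alpha=0.1, max_iter=1000, seed=0):
    """Fit feature weights lambda_k by gradient descent on the
    squared loss L = 1/2 * sum_i (R[i] @ lam - Y[i])**2.
    R: (I, K) per-feature similarities, Y: (I,) target similarities."""
    rng = np.random.default_rng(seed)
    lam = rng.uniform(-0.5, 0.5, R.shape[1])    # step 1: random init
    prev_loss = np.inf
    for _ in range(max_iter):
        err = R @ lam - Y
        loss = 0.5 * np.sum(err ** 2)
        if loss >= prev_loss:                   # step 4: stop condition
            break
        prev_loss = loss
        grad = R.T @ err                        # step 2: dL/dlambda
        lam -= alpha * grad                     # step 3: update
    return lam

# Tiny synthetic example: targets generated by known weights,
# so the fitted lam should recover them.
R = np.array([[0.2, 0.8], [0.6, 0.1], [0.9, 0.5]])
true_lam = np.array([0.3, 0.7])
Y = R @ true_lam
lam = train_weights(R, Y)
```

Because the loss is a convex quadratic in the weights, plain gradient descent with a small fixed step converges here; the stopping rule mirrors step 4 of the patent's iteration.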
Training according to the above steps, this embodiment obtains the following weights for Rk(Qnew, Qrel):
λlev = 0.055985, λw2v = 0.753228, λvec = 0.207070, λlda = 0.475735, λqa = −0.122604
The final similarity is then:
Sim(Qnew, Qrel) = 0.055985 × 0.225969 + 0.753228 × 0.304225 + 0.207070 × 0.738933 + 0.475735 × 0.685844 + (−0.122604) × 0.018413 ≈ 0.718835
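As a numeric check, the weighted sum of the five feature similarities can be reproduced directly from the values given in this embodiment:

```python
# Per-feature similarities and trained weights from the embodiment.
R = {"lev": 0.225969, "w2v": 0.304225, "vec": 0.738933,
     "lda": 0.685844, "qa": 0.018413}
lam = {"lev": 0.055985, "w2v": 0.753228, "vec": 0.207070,
       "lda": 0.475735, "qa": -0.122604}

# Final similarity: sum over features of weight * similarity.
sim = sum(lam[k] * R[k] for k in R)
print(round(sim, 4))  # ≈ 0.7188
```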
The techniques described in the present invention may be implemented by various means, for example in hardware, firmware, software, or a combination thereof. For hardware implementations, the processing modules may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, electronic devices, other electronic units designed to perform the functions described in the present invention, or a combination thereof.
For firmware and/or software implementations, the techniques may be implemented with modules (e.g., procedures, steps, processes, etc.) that perform the functions described herein. The firmware and/or software code may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by hardware under the instruction of a program; the aforementioned program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes ROM, RAM, magnetic disks, optical disks, and other media capable of storing program code.
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by the above embodiment. Any other change, modification, substitution, combination, or simplification made without departing from the spirit and principles of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.

Claims (10)

1. A problem similarity calculation method based on multiple features, characterized by comprising the steps of: for an input new problem sentence, comparing it against stored historical problems and their corresponding answers, and computing, between the new problem and a historical problem, a similarity based on a character feature, a similarity based on a word semantic feature, a similarity based on a sentence semantic feature, a similarity based on a sentence latent topic feature, and a similarity based on an answer semantic feature; the final similarity is the sum of the products of the above five similarities and their respective weights, the weights being obtained by linear regression training.
2. The multi-feature-based problem similarity calculation method according to claim 1, characterized in that the new problem sentence, the historical problems, and the corresponding answers used for the comparison are first preprocessed, including removing punctuation, converting letter case, and removing stop words and low-frequency words.
3. The multi-feature-based problem similarity calculation method according to claim 1, characterized in that the similarity based on the character feature is computed as follows: obtain the relation matrix between words by computing the edit distance between each pair of words, then compute the soft cosine distance using the TF-IDF representations of the question sentences and the relation matrix, and take the soft cosine distance as the similarity based on the character feature.
4. The multi-feature-based problem similarity calculation method according to claim 3, characterized in that the relation matrix between words is computed as follows:
Define the corpus as the question-sentence and answer text data set used for training and testing the model. Assuming the dictionary size of the corpus is n, the edit distances between words form the relation matrix Mlev, whose element mi,j is derived from the edit distance between the i-th word wi and the j-th word wj in the dictionary; ||wi|| is the number of characters contained in word wi, ||wj|| is the number of characters contained in word wj, α is the weighting factor of the diagonal elements, and β is an intensifier of the distance score. The recursive formula for lev(wi, wj) is:
lev(m, n) = max(m, n), if min(m, n) = 0; otherwise lev(m, n) = min(lev(m−1, n) + 1, lev(m, n−1) + 1, lev(m−1, n−1) + cost)
where m and n denote the lengths of wi and wj, and cost denotes the cost of substituting the m-th character of wi with the n-th character of wj: cost is 0 if the two characters are identical, and 1 otherwise.
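A minimal Python sketch of claim 4: the Levenshtein recursion is the standard dynamic-programming form, while the exact way α and β combine in mi,j is an assumption (diagonal weighted by α, off-diagonal rescaled edit distance raised to β), since the patent's formula image is not reproduced in this text:

```python
import numpy as np

def lev(a: str, b: str) -> int:
    """Classic Levenshtein edit distance by dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def relation_matrix(words, alpha=1.8, beta=5.0):
    """Word-relation matrix M_lev: each off-diagonal entry rescales
    the edit distance into [0, 1] and sharpens it with exponent beta;
    the diagonal carries the weight alpha. This combination of alpha
    and beta is an assumed form, not taken from the patent."""
    n = len(words)
    M = np.eye(n) * alpha
    for i in range(n):
        for j in range(i + 1, n):
            s = 1.0 - lev(words[i], words[j]) / max(len(words[i]),
                                                    len(words[j]))
            M[i][j] = M[j][i] = s ** beta
    return M

words = ["cat", "cart", "dog"]
M = relation_matrix(words)
```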
5. The multi-feature-based problem similarity calculation method according to claim 3, characterized in that the TF-IDF representation of a question sentence is computed as follows: in a sentence, for a word wi, compute its TF value and IDF value, where the TF value denotes the frequency of the word in the current sentence and the IDF value denotes the inverse document frequency; the TF-IDF of word wi is:
TFIDFi = TFi * IDFi
For a newly posed question sentence Qnew and a historical question sentence Qrel, the soft cosine distance is computed as follows:
Qnew and Qrel are represented as TFIDFnew and TFIDFrel:
TFIDFnew = [dnew,1, dnew,2, ..., dnew,n]^T
TFIDFrel = [drel,1, drel,2, ..., drel,n]^T
where dnew,i denotes the TF-IDF value of wi in Qnew, drel,j denotes the TF-IDF value of wj in Qrel, n denotes the number of words contained in the corpus dictionary, and T denotes vector transposition;
Then, using the relation matrix Mlev between words obtained above, the similarity Rlev(Qnew, Qrel) between Qnew and Qrel based on the character feature is computed via the soft cosine distance:
Rlev(Qnew, Qrel) = (TFIDFnew^T · Mlev · TFIDFrel) / (sqrt(TFIDFnew^T · Mlev · TFIDFnew) · sqrt(TFIDFrel^T · Mlev · TFIDFrel))
where "·" denotes the dot product of vectors and matrices.
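A minimal sketch of the TF-IDF representation used in this claim; the unsmoothed IDF variant and the toy documents are assumptions for illustration, since the patent does not fix a particular IDF formula:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Represent each tokenized document as a TF-IDF vector over the
    corpus dictionary (one component per dictionary word)."""
    vocab = sorted({w for d in docs for w in d})
    N = len(docs)
    # Plain IDF = log(N / document frequency); variant is an assumption.
    idf = {w: math.log(N / sum(1 for d in docs if w in d)) for w in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d)
        # TF = in-sentence frequency, as the claim describes.
        vecs.append([tf[w] / len(d) * idf[w] for w in vocab])
    return vocab, vecs

docs = [["how", "to", "reset", "password"],
        ["reset", "password", "steps"],
        ["change", "email", "address"]]
vocab, vecs = tfidf_vectors(docs)
```

The resulting vectors can be fed directly into the soft cosine distance together with the relation matrix Mlev.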
6. The multi-feature-based problem similarity calculation method according to claim 1, characterized in that the similarity based on the word semantic feature is computed by the steps of:
(6-1) Train distributed representations of all words in the corpus with the word2vec tool, i.e., each word corresponds to a K-dimensional real vector;
(6-2) For a corpus with dictionary size n, compute the pairwise semantic relation mi,j, i, j ∈ [1, n], between dictionary words as the cosine distance between their word vectors, obtaining the relation matrix Mw2v, Mw2v ∈ R^{n×n};
(6-3) Read the TF-IDF representations of the question sentences Qnew and Qrel, namely TFIDFnew and TFIDFrel;
(6-4) Compute the similarity Rw2v(Qnew, Qrel) between Qnew and Qrel based on the word semantic feature via the soft cosine distance:
Rw2v(Qnew, Qrel) = (TFIDFnew^T · Mw2v · TFIDFrel) / (sqrt(TFIDFnew^T · Mw2v · TFIDFnew) · sqrt(TFIDFrel^T · Mw2v · TFIDFrel))
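Step (6-2), building Mw2v from pairwise cosine similarities of word vectors, can be sketched as follows; the random embedding matrix stands in for trained word2vec vectors:

```python
import numpy as np

def w2v_relation_matrix(E):
    """Build M_w2v from an (n, K) word-embedding matrix E:
    entry (i, j) is the cosine similarity between vectors i and j."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    U = E / norms      # unit-normalize each word vector
    return U @ U.T     # all pairwise cosines at once

rng = np.random.default_rng(1)
E = rng.normal(size=(4, 8))   # 4 words, K = 8 (toy stand-in)
M = w2v_relation_matrix(E)
```

The matrix is symmetric with ones on the diagonal, which is exactly the shape the soft cosine distance expects.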
7. The multi-feature-based problem similarity calculation method according to claim 1, characterized in that the similarity based on the sentence semantic feature is computed by the steps of:
With the word vectors wi ∈ R^K trained by the word2vec tool, and assuming a question sentence contains M words, the question sentence is represented as Qmatrix = (w1, w2, ..., wM);
The problem sentence is then expressed as the arithmetic mean of the word vectors in the sentence, i.e., the vector:
AVG = (1/M) Σ_{i=1}^{M} wi
According to the above formula, the vectors AVGnew and AVGrel are obtained by computing the arithmetic means of the word vectors in Qnew and Qrel respectively, and the similarity Rvec(Qnew, Qrel) between Qnew and Qrel based on the question semantic feature is obtained as the cosine distance between AVGnew and AVGrel.
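Claim 7's average-then-cosine computation, sketched with toy K = 4 word vectors (the vectors are illustrative, not trained embeddings):

```python
import numpy as np

def sentence_similarity(vec_new, vec_rel):
    """R_vec: cosine similarity between the word-vector averages
    of two questions (each argument is an (M, K) matrix)."""
    avg_new, avg_rel = vec_new.mean(axis=0), vec_rel.mean(axis=0)
    return float(avg_new @ avg_rel /
                 (np.linalg.norm(avg_new) * np.linalg.norm(avg_rel)))

# Two questions, two words each, K = 4.
q_new = np.array([[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0]])
q_rel = np.array([[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]])
r_vec = sentence_similarity(q_new, q_rel)
```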
8. The multi-feature-based problem similarity calculation method according to claim 1, characterized in that the similarity based on the sentence latent topic feature is computed by the steps of:
Take the corpus as the input of the LdaModel function in Gensim and obtain an LDA model by training; then input the newly posed question sentence Qnew and the historical question sentence Qrel, whose topic distributions are to be computed, into the trained LDA model, obtaining topic-distribution vector representations of the two question sentences, denoted LDAnew and LDArel respectively; the similarity Rlda(Qnew, Qrel) between Qnew and Qrel based on the sentence latent topic feature is obtained as the cosine distance between LDAnew and LDArel.
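Claim 8 compares topic distributions by cosine; the sketch below assumes the distributions have already been produced by a trained model (e.g. Gensim's LdaModel via get_document_topics) and uses illustrative four-topic vectors:

```python
import numpy as np

def topic_similarity(theta_new, theta_rel):
    """R_lda: cosine similarity between two questions' topic
    distributions over the same set of latent topics."""
    a, b = np.asarray(theta_new), np.asarray(theta_rel)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative topic distributions over 4 latent topics.
lda_new = [0.70, 0.10, 0.10, 0.10]
lda_rel = [0.60, 0.20, 0.10, 0.10]
r_lda = topic_similarity(lda_new, lda_rel)
```

Because both inputs are probability distributions, the cosine here is always in [0, 1], high when the two questions concentrate on the same topics.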
9. The multi-feature-based problem similarity calculation method according to claim 1, characterized in that the similarity based on the answer semantic feature is computed by the steps of:
For a newly posed question sentence Qnew, a historical question sentence Qrel, and the candidate answer Arel corresponding to Qrel, compute the word-level semantic similarity between Qnew and Arel as the similarity Rqa(Qnew, Qrel) of the two question sentences based on the answer semantic feature. The detailed process is as follows:
(9-1) Train distributed representations of all words in the corpus with the word2vec tool, i.e., each word corresponds to a K-dimensional real vector;
(9-2) For a corpus with dictionary size n, compute the pairwise semantic relation mi,j, i, j ∈ [1, n], between dictionary words as the cosine distance between their word vectors, obtaining the relation matrix Mw2v, Mw2v ∈ R^{n×n};
(9-3) Read the TF-IDF representation TFIDFnew of the question sentence Qnew; represent Arel by TF-IDF, obtaining TFIDFans = [dans,1, dans,2, ..., dans,n]^T, where dans,i denotes the TF-IDF value of wi in Arel, n denotes the number of words contained in the corpus dictionary, and T denotes vector transposition;
(9-4) Compute the similarity Rqa(Qnew, Qrel) between Qnew and Qrel based on the answer semantic feature via the soft cosine distance:
Rqa(Qnew, Qrel) = (TFIDFnew^T · Mw2v · TFIDFans) / (sqrt(TFIDFnew^T · Mw2v · TFIDFnew) · sqrt(TFIDFans^T · Mw2v · TFIDFans))
10. The multi-feature-based problem similarity calculation method according to claim 1, characterized in that the final similarity is computed as:
Sim(Qnew, Qrel) = Σk λk · Rk(Qnew, Qrel)
where Rk(Qnew, Qrel) denotes the similarity between Qnew and Qrel based on feature k, namely, respectively, the similarity based on the character feature, the similarity based on the word semantic feature, the similarity based on the sentence semantic feature, the similarity based on the sentence latent topic feature, and the similarity based on the answer semantic feature; λk are parameters obtained by linear regression training.
CN201811041071.0A 2018-09-07 2018-09-07 Problem similarity calculation method based on multiple characteristics Active CN109344236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811041071.0A CN109344236B (en) 2018-09-07 2018-09-07 Problem similarity calculation method based on multiple characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811041071.0A CN109344236B (en) 2018-09-07 2018-09-07 Problem similarity calculation method based on multiple characteristics

Publications (2)

Publication Number Publication Date
CN109344236A true CN109344236A (en) 2019-02-15
CN109344236B CN109344236B (en) 2020-09-04

Family

ID=65304890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811041071.0A Active CN109344236B (en) 2018-09-07 2018-09-07 Problem similarity calculation method based on multiple characteristics

Country Status (1)

Country Link
CN (1) CN109344236B (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399615A (en) * 2019-07-29 2019-11-01 中国工商银行股份有限公司 Transaction risk monitoring method and device
CN110532565A (en) * 2019-08-30 2019-12-03 联想(北京)有限公司 Sentence processing method and processing device and electronic equipment
CN110543551A (en) * 2019-09-04 2019-12-06 北京香侬慧语科技有限责任公司 question and statement processing method and device
CN110738049A (en) * 2019-10-12 2020-01-31 招商局金融科技有限公司 Similar text processing method and device and computer readable storage medium
CN110781662A (en) * 2019-10-21 2020-02-11 腾讯科技(深圳)有限公司 Method for determining point-to-point mutual information and related equipment
CN110825857A (en) * 2019-09-24 2020-02-21 平安科技(深圳)有限公司 Multi-turn question and answer identification method and device, computer equipment and storage medium
CN111191464A (en) * 2020-01-17 2020-05-22 珠海横琴极盛科技有限公司 Semantic similarity calculation method based on combined distance
CN111259668A (en) * 2020-05-07 2020-06-09 腾讯科技(深圳)有限公司 Reading task processing method, model training device and computer equipment
CN111368177A (en) * 2020-03-02 2020-07-03 北京航空航天大学 Answer recommendation method and device for question-answer community
CN111401031A (en) * 2020-03-05 2020-07-10 支付宝(杭州)信息技术有限公司 Target text determination method, device and equipment
CN111414765A (en) * 2020-03-20 2020-07-14 北京百度网讯科技有限公司 Sentence consistency determination method and device, electronic equipment and readable storage medium
CN111582498A (en) * 2020-04-30 2020-08-25 重庆富民银行股份有限公司 QA (quality assurance) assistant decision method and system based on machine learning
CN111680515A (en) * 2020-05-21 2020-09-18 平安国际智慧城市科技股份有限公司 Answer determination method and device based on AI (Artificial Intelligence) recognition, electronic equipment and medium
CN111723297A (en) * 2019-11-20 2020-09-29 中共南通市委政法委员会 Grid social situation research and judgment-oriented dual semantic similarity discrimination method
CN112380830A (en) * 2020-06-18 2021-02-19 达而观信息科技(上海)有限公司 Method, system and computer readable storage medium for matching related sentences in different documents
CN112507097A (en) * 2020-12-17 2021-03-16 神思电子技术股份有限公司 Method for improving generalization capability of question-answering system
CN112632252A (en) * 2020-12-25 2021-04-09 中电金信软件有限公司 Dialogue response method, dialogue response device, computer equipment and storage medium
CN112926340A (en) * 2021-03-25 2021-06-08 东南大学 Semantic matching model for knowledge point positioning
CN113139034A (en) * 2020-01-17 2021-07-20 深圳市优必选科技股份有限公司 Statement matching method, statement matching device and intelligent equipment
CN113377927A (en) * 2021-06-28 2021-09-10 成都卫士通信息产业股份有限公司 Similar document detection method and device, electronic equipment and storage medium
CN113392176A (en) * 2020-09-28 2021-09-14 腾讯科技(深圳)有限公司 Text similarity determination method, device, equipment and medium
CN113722459A (en) * 2021-08-31 2021-11-30 平安国际智慧城市科技股份有限公司 Question and answer searching method based on natural language processing model and related device
CN113779183A (en) * 2020-06-08 2021-12-10 北京沃东天骏信息技术有限公司 Text matching method, device, equipment and storage medium
CN113792125A (en) * 2021-08-25 2021-12-14 北京库睿科技有限公司 Intelligent retrieval sorting method and device based on text relevance and user intention
CN113779183B (en) * 2020-06-08 2024-05-24 北京沃东天骏信息技术有限公司 Text matching method, device, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
CN103729381A (en) * 2012-10-16 2014-04-16 佳能株式会社 Method and device used for recognizing semantic information in series of documents
US20140229163A1 (en) * 2013-02-12 2014-08-14 International Business Machines Corporation Latent semantic analysis for application in a question answer system
CN105701253A (en) * 2016-03-04 2016-06-22 南京大学 Chinese natural language interrogative sentence semantization knowledge base automatic question-answering method
CN106997376A (en) * 2017-02-28 2017-08-01 浙江大学 The problem of one kind is based on multi-stage characteristics and answer sentence similarity calculating method
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
US20180075016A1 (en) * 2016-09-15 2018-03-15 International Business Machines Corporation System and method for automatic, unsupervised paraphrase generation using a novel framework that learns syntactic construct while retaining semantic meaning


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YU Y et al.: "A noval similarity calculation method based on Chinese sentence keyword weight", Journal of Software *

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399615A (en) * 2019-07-29 2019-11-01 中国工商银行股份有限公司 Transaction risk monitoring method and device
CN110399615B (en) * 2019-07-29 2023-08-18 中国工商银行股份有限公司 Transaction risk monitoring method and device
CN110532565A (en) * 2019-08-30 2019-12-03 联想(北京)有限公司 Sentence processing method and processing device and electronic equipment
CN110532565B (en) * 2019-08-30 2022-03-25 联想(北京)有限公司 Statement processing method and device and electronic equipment
CN110543551A (en) * 2019-09-04 2019-12-06 北京香侬慧语科技有限责任公司 question and statement processing method and device
CN110543551B (en) * 2019-09-04 2022-11-08 北京香侬慧语科技有限责任公司 Question and statement processing method and device
CN110825857B (en) * 2019-09-24 2023-07-21 平安科技(深圳)有限公司 Multi-round question and answer identification method and device, computer equipment and storage medium
CN110825857A (en) * 2019-09-24 2020-02-21 平安科技(深圳)有限公司 Multi-turn question and answer identification method and device, computer equipment and storage medium
CN110738049A (en) * 2019-10-12 2020-01-31 招商局金融科技有限公司 Similar text processing method and device and computer readable storage medium
CN110738049B (en) * 2019-10-12 2023-04-18 招商局金融科技有限公司 Similar text processing method and device and computer readable storage medium
CN110781662A (en) * 2019-10-21 2020-02-11 腾讯科技(深圳)有限公司 Method for determining point-to-point mutual information and related equipment
CN111723297B (en) * 2019-11-20 2023-05-12 中共南通市委政法委员会 Dual-semantic similarity judging method for grid society situation research and judgment
CN111723297A (en) * 2019-11-20 2020-09-29 中共南通市委政法委员会 Grid social situation research and judgment-oriented dual semantic similarity discrimination method
CN113139034A (en) * 2020-01-17 2021-07-20 深圳市优必选科技股份有限公司 Statement matching method, statement matching device and intelligent equipment
CN111191464A (en) * 2020-01-17 2020-05-22 珠海横琴极盛科技有限公司 Semantic similarity calculation method based on combined distance
CN111368177A (en) * 2020-03-02 2020-07-03 北京航空航天大学 Answer recommendation method and device for question-answer community
CN111368177B (en) * 2020-03-02 2023-10-24 北京航空航天大学 Answer recommendation method and device for question-answer community
CN111401031A (en) * 2020-03-05 2020-07-10 支付宝(杭州)信息技术有限公司 Target text determination method, device and equipment
CN111414765A (en) * 2020-03-20 2020-07-14 北京百度网讯科技有限公司 Sentence consistency determination method and device, electronic equipment and readable storage medium
CN111582498A (en) * 2020-04-30 2020-08-25 重庆富民银行股份有限公司 QA (quality assurance) assistant decision method and system based on machine learning
CN111259668A (en) * 2020-05-07 2020-06-09 腾讯科技(深圳)有限公司 Reading task processing method, model training device and computer equipment
CN111680515B (en) * 2020-05-21 2022-05-03 平安国际智慧城市科技股份有限公司 Answer determination method and device based on AI (Artificial Intelligence) recognition, electronic equipment and medium
CN111680515A (en) * 2020-05-21 2020-09-18 平安国际智慧城市科技股份有限公司 Answer determination method and device based on AI (Artificial Intelligence) recognition, electronic equipment and medium
CN113779183B (en) * 2020-06-08 2024-05-24 北京沃东天骏信息技术有限公司 Text matching method, device, equipment and storage medium
CN113779183A (en) * 2020-06-08 2021-12-10 北京沃东天骏信息技术有限公司 Text matching method, device, equipment and storage medium
CN112380830A (en) * 2020-06-18 2021-02-19 达而观信息科技(上海)有限公司 Method, system and computer readable storage medium for matching related sentences in different documents
CN112380830B (en) * 2020-06-18 2024-05-17 达观数据有限公司 Matching method, system and computer readable storage medium for related sentences in different documents
CN113392176B (en) * 2020-09-28 2023-08-22 腾讯科技(深圳)有限公司 Text similarity determination method, device, equipment and medium
CN113392176A (en) * 2020-09-28 2021-09-14 腾讯科技(深圳)有限公司 Text similarity determination method, device, equipment and medium
CN112507097A (en) * 2020-12-17 2021-03-16 神思电子技术股份有限公司 Method for improving generalization capability of question-answering system
CN112507097B (en) * 2020-12-17 2022-11-18 神思电子技术股份有限公司 Method for improving generalization capability of question-answering system
CN112632252A (en) * 2020-12-25 2021-04-09 中电金信软件有限公司 Dialogue response method, dialogue response device, computer equipment and storage medium
CN112926340A (en) * 2021-03-25 2021-06-08 东南大学 Semantic matching model for knowledge point positioning
CN112926340B (en) * 2021-03-25 2024-05-07 东南大学 Semantic matching model for knowledge point positioning
CN113377927A (en) * 2021-06-28 2021-09-10 成都卫士通信息产业股份有限公司 Similar document detection method and device, electronic equipment and storage medium
CN113792125A (en) * 2021-08-25 2021-12-14 北京库睿科技有限公司 Intelligent retrieval sorting method and device based on text relevance and user intention
CN113792125B (en) * 2021-08-25 2024-04-02 北京库睿科技有限公司 Intelligent retrieval ordering method and device based on text relevance and user intention
CN113722459A (en) * 2021-08-31 2021-11-30 平安国际智慧城市科技股份有限公司 Question and answer searching method based on natural language processing model and related device

Also Published As

Publication number Publication date
CN109344236B (en) 2020-09-04

Similar Documents

Publication Publication Date Title
CN109344236A (en) One kind being based on the problem of various features similarity calculating method
Bhat et al. Iiit-h system submission for fire2014 shared task on transliterated search
CN105095204B (en) The acquisition methods and device of synonym
Song et al. Named entity recognition based on conditional random fields
CN104794169B (en) A kind of subject terminology extraction method and system based on sequence labelling model
CN108052593A (en) A kind of subject key words extracting method based on descriptor vector sum network structure
US20160350288A1 (en) Multilingual embeddings for natural language processing
CN106844658A (en) A kind of Chinese text knowledge mapping method for auto constructing and system
CN110705612A (en) Sentence similarity calculation method, storage medium and system with mixed multi-features
CN109960756A (en) Media event information inductive method
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
Meena et al. Survey on graph and cluster based approaches in multi-document text summarization
CN109255012A (en) A kind of machine reads the implementation method and device of understanding
Shah et al. Literature study on multi-document text summarization techniques
CN113157860A (en) Electric power equipment maintenance knowledge graph construction method based on small-scale data
Zhang et al. PKU paraphrase bank: A sentence-level paraphrase corpus for Chinese
De Melo et al. UWN: A large multilingual lexical knowledge base
Islam et al. Applications of corpus-based semantic similarity and word segmentation to database schema matching
Hu et al. BIM oriented intelligent data mining and representation
AL-Khassawneh et al. Improving triangle-graph based text summarization using hybrid similarity function
Juan An effective similarity measurement for FAQ question answering system
CN115827988B (en) Self-media content heat prediction method
Sidhu et al. Role of machine translation and word sense disambiguation in natural language processing
Yadav et al. Graph-based extractive text summarization based on single document
BAZRFKAN et al. Using machine learning methods to summarize persian texts

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant