CN108399163A - Text similarity measurement method combining word aggregation and word combination semantic features - Google Patents

Text similarity measurement method combining word aggregation and word combination semantic features

Info

Publication number
CN108399163A
Authority
CN
China
Prior art keywords
word
text
embedded
context
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810234539.1A
Other languages
Chinese (zh)
Other versions
CN108399163B (en)
Inventor
罗森林
周晓瑞
潘丽敏
魏超
吴舟婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN201810234539.1A
Publication of CN108399163A
Application granted
Publication of CN108399163B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The present invention relates to a text similarity measurement method that combines the word-aggregation and word-combination distributed semantic features of a text set, and belongs to the fields of natural language processing and machine learning. The method first performs an autoencoding fill-in-the-blank prediction that jointly uses the word-aggregation and word-combination distributed semantic features of the text set, and builds a word-embedding encoder network through the autoencoder training process; word-embedding representations are then constructed with this network, and the maximum weighted matching of the word embeddings is computed as the text similarity measure. The invention has the characteristics of high accuracy and rich distributed semantic features. The constructed word-embedding encoder network can make effective use of the semantic relations between words and build word-embedding representations with richer distributed semantic information that better describe the semantic similarity between words, further improving the accuracy of text similarity measurement.

Description

Text similarity measurement method combining word aggregation and word combination semantic features
Technical field
The present invention relates to a text similarity measurement method that combines the word-aggregation and word-combination distributed semantic features of a text set, and belongs to the fields of natural language processing and machine learning.
Background technology
At present, text similarity measurement plays an increasingly important foundational role in many application scenarios that require text processing, such as multi-text matching, text clustering/classification, and information retrieval. Moreover, in research on manifold-learning-based low-dimensional text representation, most algorithms are essentially built on the construction of nearest-neighbour graphs over the text set or on the selection of neighbouring text collections, and the foundation of all these algorithms is a good text similarity measure.
Text similarity measures fall broadly into string-based methods and corpus-based methods. Corpus-based text similarity measurement can in turn be divided into two steps: first, word-embedding representations of the text are trained from the contexts of the words in the corpus, so that the word embeddings capture the distributed semantic information of the text; the word embeddings are then used to compute the similarity between texts.
1. String-based methods
String-based text similarity measures usually convert a text into a vector or a vector-like discrete sequence, and use some distance measure computed by comparing these vectors or sequences as the text similarity. For example, the Damerau-Levenshtein distance transforms one text into another through four operations — insertion, deletion, substitution, and transposition of adjacent characters — and measures the similarity of two texts by the number of operations required. A text can also be converted into a point or a vector in a vector space, and similarity measured by comparing the distances between points or vectors. The Euclidean distance computes the straight-line distance between two coordinate points in the vector space; a smaller distance indicates higher text similarity. The Manhattan distance uses the sum of the projections of the straight-line distance between two points onto each coordinate axis as the similarity measure of two texts; again, a smaller distance means higher similarity. The cosine similarity computes the cosine of the angle between two vectors in the vector space; a smaller angle indicates higher similarity between the two texts. Finally, the longest common substring is another common text similarity measure: it compares the longest identical contiguous character subsequence present in two strings and uses it as the measure of their similarity.
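For concreteness, here is a minimal Python sketch of two of the measures above — the restricted (adjacent-transposition) Damerau-Levenshtein distance and cosine similarity over character-count vectors; the character-count vectorisation is an illustrative assumption, since the text above leaves the vectorisation scheme open:

    # Minimal sketch of two string-based measures: restricted Damerau-Levenshtein
    # distance and cosine similarity over character-count vectors.
    from collections import Counter
    import math

    def damerau_levenshtein(a: str, b: str) -> int:
        """Edit distance with insertion, deletion, substitution, transposition."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[len(a)][len(b)]

    def cosine_similarity(a: str, b: str) -> float:
        """Cosine of the angle between character-count vectors of two texts."""
        va, vb = Counter(a), Counter(b)
        dot = sum(va[ch] * vb[ch] for ch in va)
        norm = math.sqrt(sum(v * v for v in va.values())) * \
               math.sqrt(sum(v * v for v in vb.values()))
        return dot / norm if norm else 0.0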
2. Corpus-based methods
String-based text similarity measures treat characters, words, or phrases as independent semantic units and do not adequately consider the semantic relations implied by the matching relations between words, which makes it difficult for them to accurately capture the semantic similarity of texts.
This loss of word-sense association information reduces the accuracy of the final text similarity measure. To exploit these word-sense associations effectively, corpus-based text similarity measures construct vector representations of words by analysing the distributed semantic features of the words in the text set; the word-embedding learning theory that has emerged in recent years offers an effective way to address this problem.
(1) Word embedding models: the earliest work on word embeddings was proposed by Bengio in 2003. In a series of papers he used neural probabilistic language models to let machines "learn a distributed representation of language", thereby achieving dimensionality reduction of the language space. Since the whole modelling process is based on the N-gram model, the resulting word embeddings can reflect the continuity between a word and its context, i.e. the semantic relations of the text. Word embedding methods are inspired by the distributional hypothesis: linguistic units with similar distributions have similar meanings. Mainstream word-embedding research focuses on extracting features from a target word and its context and modelling them through distributed semantic associations.
(2) word2vec models: word2vec was proposed by Tomas Mikolov in 2013 and comprises two models, Continuous Bag-of-Words (CBOW) and Skip-Gram (SG). Both learn word representations with a simple neural network architecture: a coding vector is first randomly assigned to every word, a fixed-length sliding window collects the distributed context information C_t of a target word w_t from the corpus, and a neural network is then trained either to predict w_t given C_t (CBOW) or to predict C_t given w_t (SG). The resulting word embeddings reflect distributed semantic information well. Because of their simplicity, the Skip-Gram and CBOW models can be trained on large data sets; with parallelised processing they can learn a model over more than one hundred billion words within 24 hours. The drawback of both models, however, is that they do not take global statistical information into account.
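Both word2vec variants are available in off-the-shelf libraries; a minimal sketch with the gensim library follows (assuming gensim >= 4.0; the toy corpus and hyper-parameter values are illustrative):

    # Minimal sketch: training CBOW and Skip-Gram embeddings with gensim.
    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "lay", "on", "the", "rug"]]

    cbow = Word2Vec(sentences, vector_size=100, window=5, sg=0, min_count=1)  # CBOW: predict w_t from C_t
    sg = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)    # SG: predict C_t from w_t

    print(sg.wv.most_similar("cat", topn=3))  # nearest neighbours by cosine similarity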
(3) GloVe model: in 2014 Pennington et al. of Stanford University proposed the GloVe model, which characterises distributed semantic information with a global word-word matrix; its main goal is to observe the co-occurrence rates of word pairs in order to compare word meanings in text. The GloVe model first builds a word co-occurrence matrix and then obtains vector representations of words through singular value decomposition (SVD). GloVe improved the quality of word embeddings.
(4) In 2013 Stephane Clinchant built word-embedding representations with a Gaussian mixture model (GMM), in which every dimension of a word embedding can be regarded as a topic. An entire text is thus converted into a set of word embeddings, so a text can be represented by a vector of indefinite length; a Fisher kernel then maps this indefinite-length vector representing the original text into a document representation vector of fixed length, and cosine distance is finally used to compute the similarity between texts.
(5) In 2014 Baotian Hu designed a sentence-level matching model trained with deep convolutional neural networks (CNN). The model preserves the order of the words in a sentence: the sentence is converted into a matrix composed of word-embedding vectors, convolution operations are applied layer by layer to obtain combined representations of adjacent words in the matrix, the pooling layers of the deep CNN select salient word combinations through local max-pooling, and a fully connected layer finally produces a fixed-length sentence-level text vector. The drawback of this model is that ordinary texts usually contain several sentences, so the vector representation of a single sentence cannot reflect the semantic features of the whole text; how to construct a vector representation of the entire text from sentence vectors still requires deeper research. The approach therefore still has room for improvement.
(6) In 2014 Quoc Le and Tomas Mikolov proposed Paragraph Vector, again based on the CBOW and SG architectures. It is an unsupervised learning algorithm that treats a text as a "virtual word equivalent to a real word": after an additional mapping layer, the text enters the word2vec network structure for training and is trained jointly with the co-occurring w_t and C_t; the original text is finally represented as a fixed-length vector in the word-embedding space. A similar result is the Doc2vec model proposed by Chen Minmin in 2017, which randomly selects a group of word embeddings from the text and constructs the text representation vector by averaging them, improving the efficiency of the training process.
Summarising the above text similarity measurement methods: (1) string-based text similarity measures ignore the semantic information between the words of a text, so they cannot accurately reflect deep text similarity; (2) word-embedding methods do not consider the difference between the two kinds of distributed semantic features — word combination and word aggregation — which inevitably causes the word embeddings to lose part of the distributed semantic information.
Summary of the invention
The purpose of the present invention is to solve the problem of high-accuracy text similarity measurement by proposing a non-parametric text similarity measurement method. The method combines the word-aggregation and word-combination association features of text, treats a text as a set of word embeddings, builds a self-encoding word-embedding learning framework (SPC), and obtains the text similarity measure through a cloze-like "fill-in-the-blank prediction" process.
The design principle of the present invention is as follows. According to the distributional hypothesis' definition of word distribution patterns, there exists a pair of "orthogonal" two-dimensional distributed semantic relation features between words: word aggregation and word combination. Exploiting the semantic-level differences brought by the different distribution patterns of word aggregation and word combination, the target word is first removed from its context window; an encode-decode process then predicts the target word together with the other words that could substitute for it to "fill the blank"; finally, through the training of the autoencoding network, a word-embedding encoder network can be built for extracting word-embedding representations.
The technical scheme of the present invention comprises two processes — constructing a full-text-statistics adjacency matrix, and autoencoding fill-in-the-blank prediction — and the specific implementation steps are as follows:
Step 1, construct a directed connected graph over the dictionary V, in which each node represents a word in the dictionary and a directed edge represents the sequential combination association between words. This directed graph is expressed as an adjacency matrix: the off-diagonal elements of the adjacency matrix represent the sequential combination associations of words in context (local information), while the diagonal elements collect the statistical information of words over the full text (global information); a minimal sketch of this construction follows below;
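A minimal sketch of step 1, assuming ordered co-occurrence counts within a fixed window for the off-diagonal entries and raw word frequencies on the diagonal (the patent does not fix the exact counting scheme, so these choices are illustrative):

    # Minimal sketch of Step 1: full-text-statistics adjacency matrix.
    import numpy as np

    def build_adjacency(tokens, vocab, window=2):
        """Adjacency matrix over dictionary `vocab`: diagonal = global counts,
        off-diagonal = ordered within-window co-occurrence counts."""
        idx = {w: i for i, w in enumerate(vocab)}
        A = np.zeros((len(vocab), len(vocab)))
        for i, w in enumerate(tokens):
            if w not in idx:
                continue
            A[idx[w], idx[w]] += 1.0                       # diagonal: full-text statistics
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                v = tokens[j]
                if v in idx and v != w:
                    A[idx[w], idx[v]] += 1.0               # off-diagonal: combination link
        return A

    vocab = ["the", "cat", "sat", "on", "mat"]
    A = build_adjacency("the cat sat on the mat".split(), vocab)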
Step 2, based on the adjacency matrix established in step 1, predict the target word and the words associated with it by aggregation through an encode-decode process (see the sketch after step 2.1.2). The detailed process is as follows:
Step 2.1, given a context c_i = {c_(i,k)} and p_i = {p_i(w_t)}_(t∈V) ∈ C, suppose the aggregated word is w_t and the target word is c_(i,(l+L)) = w_s ∈ V (the word at the centre position of the window). Let z_i(w_s) denote the reconstruction of the context with respect to the target word and z_i(w_t) the reconstruction of the context with respect to the aggregated word. Following the encode-decode process, c_i and p_i(w_t) are predicted from the context with the target word removed; the whole process can be expressed as the mapping

$$\big(z_i(w_s),\ z_i(w_t)\big) = \mathrm{dec}\big(\mathrm{enc}(c_i \setminus \{w_s\})\big)$$
The objective of the model is to minimise the reconstruction loss, measured with the mean squared error; the objective function of the model can be expressed as the following reconstruction-loss formula

$$\mathcal{L} = \sum_i \Big( \big\| z_i(w_s) - c_i \big\|^2 + \big\| z_i(w_t) - p_i(w_t) \big\|^2 \Big)$$
By minimising the reconstruction loss, the most nearly correct "answers" to fill the blank are found. The specific steps are as follows:
Step 2.1.1, decompose the information provided by the context word-combination adjacency matrix of the target word into the diagonal-region element vector carrying full-text statistical information and the context combination-structure submatrix;
Step 2.1.2, iteratively update the formula parameters for the retained weights by gradient descent;
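The following sketch illustrates the autoencoding fill-in-the-blank prediction of step 2 under stated assumptions: a single tied-weight linear encoder/decoder (reflecting the shared-decoder note after step 4) and bag-of-words context vectors; the layer sizes and nonlinearity are illustrative, not the patent's exact architecture:

    # Minimal sketch of the Step-2 autoencoding cloze prediction (PyTorch).
    import torch
    import torch.nn as nn

    class SPCAutoencoder(nn.Module):
        def __init__(self, vocab_size: int, embed_dim: int):
            super().__init__()
            # One weight matrix shared by encoder and both decoders (tied weights);
            # its columns serve as the learned word embeddings.
            self.W = nn.Parameter(torch.randn(embed_dim, vocab_size) * 0.01)
            self.b_h = nn.Parameter(torch.zeros(embed_dim))
            self.b_c = nn.Parameter(torch.zeros(vocab_size))  # context-reconstruction bias
            self.b_p = nn.Parameter(torch.zeros(vocab_size))  # aggregation-reconstruction bias

        def forward(self, masked_context):
            # Encode the context with the target word removed.
            h = torch.tanh(masked_context @ self.W.t() + self.b_h)
            z_c = h @ self.W + self.b_c   # reconstruct the context c_i
            z_p = h @ self.W + self.b_p   # reconstruct the aggregation p_i(w_t)
            return z_c, z_p

    model = SPCAutoencoder(vocab_size=10000, embed_dim=128)
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    mse = nn.MSELoss()

    def train_step(masked_context, c_i, p_i):
        z_c, z_p = model(masked_context)
        loss = mse(z_c, c_i) + mse(z_p, p_i)  # mean-squared reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()                            # gradient-descent update (step 2.1.2)
        return loss.item()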
Step 3, compute the similarity measure δ_st between word pairs. Consider a word pair (w_s, w_t) obtained from step 2, where w_s ∈ Ω_i and w_t ∈ Ω_j. The similarity between the word embedding w_s and the text Ω_j is computed by taking inner products with the embeddings of all words of Ω_j, denoted δ^(s) = w_s · Ω_j; likewise, the similarity between the word embedding w_t and Ω_i is computed by taking inner products with the embeddings of all words in Ω_i, denoted δ^(t) = w_t · Ω_i. The softmax function

$$\sigma(\delta)_k = \frac{e^{\delta_k}}{\sum_m e^{\delta_m}}$$
is then used to normalise the two cases above, denoted σ(δ^(s))_t and σ(δ^(t))_s respectively; the larger of the two is finally selected as the similarity measure of the word pair (a sketch follows the formula below), computed as
$$\delta_{st} = \max\big(\sigma(\delta^{(s)})_t,\ \sigma(\delta^{(t)})_s\big);$$
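A minimal numpy sketch of step 3 (text embeddings as row matrices; the shapes and indexing conventions are illustrative assumptions):

    # Minimal sketch of Step 3: softmax-normalised inner-product similarity.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def word_pair_similarity(omega_i, omega_j, s, t):
        """delta_st for the s-th word of text omega_i and t-th word of omega_j.

        omega_i, omega_j: (n_words, dim) word-embedding matrices of the two texts.
        """
        delta_s = omega_j @ omega_i[s]  # inner products of w_s with all words of omega_j
        delta_t = omega_i @ omega_j[t]  # inner products of w_t with all words of omega_i
        return max(softmax(delta_s)[t], softmax(delta_t)[s])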
Step 4, from the word-and-text similarity measures δ_st obtained in step 3, obtain the text similarity measure through the normalized maximum weighted matching distance (Normalized Maximum Matching Distance, NMD), i.e. the maximum weighted matching over the word-pair similarities normalised by the text lengths; a sketch of one such computation follows below.
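The exact NMD formula in the original is given as an image; the self-contained sketch below implements one plausible reading consistent with the prose — each word is matched to its best-scoring counterpart in the other text and the matching weights are averaged (normalised) over the text lengths:

    # Minimal sketch of Step 4: one plausible NMD computation (assumption).
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def nmd_similarity(omega_i, omega_j):
        """omega_i, omega_j: (n, dim) word-embedding matrices of two texts."""
        n_i, n_j = len(omega_i), len(omega_j)
        sm_rows = [softmax(omega_j @ w) for w in omega_i]   # each w_s vs all of omega_j
        sm_cols = [softmax(omega_i @ w) for w in omega_j]   # each w_t vs all of omega_i
        delta = np.zeros((n_i, n_j))
        for s in range(n_i):
            for t in range(n_j):
                delta[s, t] = max(sm_rows[s][t], sm_cols[t][s])  # delta_st from step 3
        best_i = delta.max(axis=1).mean()   # best match for each word of omega_i
        best_j = delta.max(axis=0).mean()   # best match for each word of omega_j
        return 0.5 * (best_i + best_j)      # normalised over the text lengths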
Meanwhile, following the reconstruction-loss measurement process described in step 2, the decoder responsible for reconstructing the target word in step 2.1.1 is shared between the encoding network and the decoding network, which reduces the range of parameters that must be estimated and prevents model overfitting.
Advantageous effects
Compared with string-based text similarity measurement methods, the corpus-based text similarity measurement method used by the present invention has the characteristics of high accuracy and good reflection of the semantic features of text. Through the construction of a full-text-statistics adjacency matrix and autoencoding fill-in-the-blank prediction, the present invention achieves higher accuracy while making effective use of the semantic relations between words and providing rich distributed semantic information, thereby further improving the effect of text similarity measurement.
Computing the maximum weighted matching distance of the word embeddings represented by the word-embedding encoder network makes full use of the distributed semantic information carried by the word embeddings and captures the semantic similarity of texts more accurately, giving the method good application and promotion value.
Description of the drawings
Fig. 1 is a schematic diagram of the distributed-semantic word-embedding learning algorithm of the present invention that combines word aggregation and word combination;
Fig. 2 is an example of the directed-graph representation of context word-combination features in the specific implementation;
Fig. 3 is the encode-decode framework based on word-combination and word-aggregation associations in the specific implementation;
Fig. 4 shows the results of experiment (2) in the specific implementation;
Fig. 5 is the principle diagram of the text similarity measurement algorithm in the specific implementation;
Fig. 6 shows the results of experiment (3) in the specific implementation;
Fig. 7 shows the results of experiment (4) in the specific implementation.
Specific implementation mode
To better illustrate the objects and advantages of the present invention, the implementation of the method of the present invention is described in further detail below with reference to the accompanying drawings and examples.
1. A (1) word-analogy comparison experiment and a (2) word-sense discrimination comparison experiment are carried out against existing word-embedding construction methods, to verify that SPC enriches the distributed semantic information of word embeddings. The methods compared include CBOW and SG, GloVe, HDC, CWin, and SSG. In addition, a simplified SPC that performs fill-in-the-blank prediction using only the context and full-text information of the target word is denoted SPC-1; on the basis of SPC-1, predicting the target word from the context information alone yields SPC-0.
2. The clustering and classification performance of NMD is analysed quantitatively through a (3) text clustering experiment and a (4) text classification experiment; the methods compared include the Euclidean distance, cosine distance, and WMD distance.
The four experimental procedures are described one by one below. All tests were completed on the same computer, configured as follows: Intel(R) Core(TM) i7-4771K quad-core processor, 3.50 GHz, 32 GB physical memory (RAM). Experimental development environment: Windows 7 SP1, Microsoft VS2013, Python 2.7, Matlab 2015a.
(1) The word-analogy experiment uses the public Wikipedia corpus of April 2010 (Wikipedia 2010), which contains about 2,000,000 texts and 9,900,000 distinct words. After preprocessing, all words are lower-cased and the words occurring more than 20 times are selected for training. The details of the corpus used in the word-analogy experiment are given in Table 1:
Table 1. Details of the corpus used in the word-analogy experiment
The word-analogy experiment computes cosine distances between word embeddings and searches all word embeddings for the vector most similar to w_b − w_a + w_c as the answer to questions of the form "word embedding w_a is to word embedding w_b as word embedding w_c is to which word embedding". The experiment uses accuracy as the evaluation index, computed as:

Accuracy = N_true / N

where N is the total number of questions and N_true is the number of answers exactly matching the correct answer; a sketch of this evaluation follows.
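A minimal sketch of the analogy evaluation (the embedding dictionary and question format are illustrative assumptions):

    # Minimal sketch: answer "a is to b as c is to ?" by nearest cosine neighbour
    # of w_b - w_a + w_c, and score accuracy = N_true / N.
    import numpy as np

    def answer_analogy(emb: dict, a: str, b: str, c: str) -> str:
        query = emb[b] - emb[a] + emb[c]
        query /= np.linalg.norm(query)
        best, best_sim = None, -np.inf
        for w, v in emb.items():
            if w in (a, b, c):
                continue                       # exclude the question words
            sim = v @ query / np.linalg.norm(v)
            if sim > best_sim:
                best, best_sim = w, sim
        return best

    def accuracy(emb, questions):
        """questions: list of (a, b, c, correct_answer) tuples."""
        n_true = sum(answer_analogy(emb, a, b, c) == d for a, b, c, d in questions)
        return n_true / len(questions)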
(2) Word-sense discrimination experiments are carried out on two common data sets, WordSim-353 (WS-353) and Stanford's Contextual Word Similarities (SCWS). WS-353 consists of 353 noun pairs; each pair is independent, with no associated contextual information, and human subjects provide their subjective judgements of similarity and relatedness as scores from 0 to 10. In the SCWS data set each word is given together with its context, so the semantic variation of the target word can be reflected according to context.
First, the cosine similarity between arbitrary word pairs is computed from the word-embedding vectors of the different words; the Spearman rank correlation ρ between these cosine similarity scores and the human subjects' judgement scores is then computed, so as to assess the word-sense discrimination power of the constructed word-embedding vectors. For n word pairs, let y_i denote the rank of the human judgement score of a word pair and x_i the rank of the cosine similarity score computed from the embedding vectors of that pair; the Spearman rank correlation is then computed as:

$$\rho = 1 - \frac{6\sum_{i=1}^{n}(x_i - y_i)^2}{n(n^2 - 1)}$$
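For reference, the same rank correlation can be computed with scipy (a minimal sketch; the embedding store is illustrative):

    # Minimal sketch: Spearman rank correlation between cosine similarities
    # and human judgement scores.
    import numpy as np
    from scipy.stats import spearmanr

    def evaluate(emb, pairs, human_scores):
        """pairs: list of (word1, word2); human_scores: 0-10 judgements."""
        cos = [emb[a] @ emb[b] / (np.linalg.norm(emb[a]) * np.linalg.norm(emb[b]))
               for a, b in pairs]
        rho, _ = spearmanr(cos, human_scores)
        return rho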
(3) The text clustering experiment uses the 20newsgroups and RCV1 public data sets. 20newsgroups consists of news on 20 different topics; RCV1 is a multi-category text data set provided by Reuters containing more than 800,000 news releases, from which this experiment selects 4 categories for the tests: M11 (EQUITY MARKETS), M12 (BOND MARKETS), M131 (INTERBANK MARKETS) and M132 (FOREX MARKETS). The detailed statistics are given in Table 2.
Table 2. Partial statistics of the two text sets
Normalized mutual information (NMI) is used as the evaluation index; its value lies in [0, 1], and a higher value indicates that the clustering result is closer to the true grouping. It is computed as:

$$\mathrm{NMI}(C, T) = \frac{2\,\mathrm{MI}(C, T)}{H(C) + H(T)}$$
where C = {C_i} denotes the grouping produced by the clustering algorithm and T = {T_i} the true categorisation of the texts; H(C) and H(T) are the entropies of the two groupings, and MI(C, T) is the mutual information between them, computed as:

$$\mathrm{MI}(C, T) = \sum_{i,j} p(C_i, T_j)\,\log\frac{p(C_i, T_j)}{p(C_i)\,p(T_j)}$$
where p(C_i, T_j) is the joint probability of a text x ∈ C_i, T_j, i.e. the probability that a text drawn at random from the text set belongs to both C_i and T_j; p(C_i) is the probability of drawing a text x ∈ C_i, and p(T_j) the probability of drawing a text x ∈ T_j.
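A minimal sketch of the NMI scoring with scikit-learn (the label arrays are illustrative; recent scikit-learn versions default to the arithmetic-mean normalisation shown above):

    # Minimal sketch: scoring a clustering against the true labels with NMI.
    from sklearn.metrics import normalized_mutual_info_score

    true_labels = [0, 0, 1, 1, 2, 2]
    cluster_ids = [0, 0, 1, 2, 2, 2]
    print(normalized_mutual_info_score(true_labels, cluster_ids))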
(4) The text classification experiment uses the same data sets as the text clustering experiment. The classification experiment uses the weighted F score F̄ to assess the accuracy of the final classification result; a higher F̄ indicates a more accurate classification, and it is computed as:

$$\bar{F} = \sum_i c_i\, F_i$$
where c_i is the proportion of the entire test set (of total size C) occupied by the texts of category i. F_i is the F score of category i; it is closely related to the precision P_i and recall R_i of that category, and the three evaluation indices are computed as:

$$P_i = \frac{TP_i}{TP_i + FP_i},\qquad R_i = \frac{TP_i}{TP_i + FN_i},\qquad F_i = \frac{2\,P_i R_i}{P_i + R_i}$$
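A minimal sketch of the weighted F computation with scikit-learn (average="weighted" weights each class's F_i by its share of the test set, matching the formula above; the labels are illustrative):

    # Minimal sketch: weighted F, precision, and recall via scikit-learn.
    from sklearn.metrics import f1_score, precision_score, recall_score

    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    print(f1_score(y_true, y_pred, average="weighted"))
    print(precision_score(y_true, y_pred, average="weighted", zero_division=0))
    print(recall_score(y_true, y_pred, average="weighted", zero_division=0))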
Experimental results
For experiment (1), the results are given in Table 3. In overall performance SPC surpasses the other word-embedding construction methods. This shows that jointly using the word-aggregation and word-combination distributed semantic features of the text set for autoencoding fill-in-the-blank prediction, and building a word-embedding encoder network from it, can make full use of the semantic relations between words and produce word embeddings with richer distributed semantic information, which provides the basis for further improving the accuracy of text similarity measurement.
Table 3. Accuracy (%) of the answers in the word-analogy evaluation task
For experiment (2), SPC outperforms the compared methods on both data sets and across different embedding dimensions. This verifies that building a word-embedding encoder network by combining the word-aggregation and word-combination distributed semantic features of the text set can effectively improve the accuracy of word embeddings for word-sense discrimination and better describe the similarity between word senses. The results of experiment (2) are shown in Fig. 4.
For experiment (3), the overall results show that the embedding-based NMD and WMD outperform the traditional text similarity measures; NMD reaches 63.1% on the 20newsgroups corpus. This shows that the word-embedding encoder network can make effective use of the semantic relations between words and of the rich distributed semantic information it provides, further improving the effect of text similarity measurement. The results of experiment (3) are shown in Fig. 6.
For experiment (4), the experimental results show that the word-embedding representations constructed by the word-embedding encoder network make full use of the semantic relations between words and enhance the accuracy of within-class text similarity measurement, thus capturing text semantic similarity more accurately. When k increases to 20 the classification performance of WMD degrades, while NMD still performs well, reaching 71.59%. This shows that NMD can effectively use the distributed semantic information provided by SPC and further improve the accuracy of text similarity measurement. The results of experiment (4) are shown in Fig. 7.
The results of the above four experiments show that the present invention has the characteristics of high accuracy and rich distributed semantic features. In the word-analogy experiment on word embeddings the accuracy reaches 73.95%, and in the word-sense discrimination experiment the Spearman rank correlation reaches 74.12. By combining the word-aggregation and word-combination distributed semantic features of the text set, the constructed word-embedding encoder network makes effective use of the semantic relations of words and produces word embeddings with richer distributed semantic information that better describe the semantic similarity between words.
In the text clustering experiment the clustering NMI of NMD reaches 63.1%, and in the text classification experiment the weighted F score of NMD rises to 71.59%. Computing the maximum weighted matching distance of the word embeddings constructed by the word-embedding encoder network makes full use of the distributed semantic information carried by the word embeddings, captures the semantic similarity of texts more accurately, and further improves the accuracy of text similarity measurement.

Claims (2)

1. A text similarity measurement method combining word-aggregation and word-combination semantic features, characterised in that:
since the differences between the word-aggregation and word-combination distribution patterns of words in text bring differences at the semantic level, in order to make full use of the semantic information corresponding to these differences, the method first builds a context word-combination representation, then performs autoencoding fill-in-the-blank prediction using the word-aggregation and word-combination distributed semantic features of the words in the text set, and obtains a word-embedding encoder network through the autoencoder training process; word-embedding representations are then constructed with the obtained word-embedding encoder network, and finally the maximum weighted matching distance of the word embeddings is computed as the measure of text similarity; the method specifically comprises the following steps:
Step 1, construct a directed connected graph over the dictionary V, in which each node represents a word in the dictionary and a directed edge represents the sequential combination association between words; express this directed graph as an adjacency matrix, the off-diagonal elements of which represent the sequential combination associations of words in context (local information), while the diagonal elements collect the statistical information of words over the full text (global information);

Step 2, based on the adjacency matrix established in step 1, predict the target word and the words associated with it by aggregation through an encode-decode process, the detailed process being as follows:

Step 2.1, given a context c_i = {c_(i,k)} and p_i = {p_i(w_t)}_(t∈V) ∈ C, suppose the aggregated word is w_t and the target word is c_(i,(l+L)) = w_s ∈ V (the word at the centre position of the window); let z_i(w_s) denote the reconstruction of the context with respect to the target word and z_i(w_t) the reconstruction of the context with respect to the aggregated word; following the encode-decode process, predict c_i and p_i(w_t) from the context with the target word removed, the whole process being expressed as the mapping

$$\big(z_i(w_s),\ z_i(w_t)\big) = \mathrm{dec}\big(\mathrm{enc}(c_i \setminus \{w_s\})\big)$$

the objective of the model is to minimise the reconstruction loss, measured with the mean squared error, the objective function of the model being expressible as the following reconstruction-loss formula

$$\mathcal{L} = \sum_i \Big( \big\| z_i(w_s) - c_i \big\|^2 + \big\| z_i(w_t) - p_i(w_t) \big\|^2 \Big)$$

by minimising the reconstruction loss, the most nearly correct "answers" to fill the blank are found, with the following specific steps:

Step 2.1.1, decompose the information provided by the context word-combination adjacency matrix of the target word into the diagonal-region element vector carrying full-text statistical information and the context combination-structure submatrix;

Step 2.1.2, iteratively update the formula parameters for the retained weights by gradient descent;

Step 3, compute the similarity measure δ_st between word pairs; consider a word pair (w_s, w_t) obtained from step 2, where w_s ∈ Ω_i and w_t ∈ Ω_j; compute the similarity between the word embedding w_s and the text Ω_j by taking inner products with the embeddings of all words of Ω_j, denoted δ^(s) = w_s · Ω_j; likewise, compute the similarity between the word embedding w_t and Ω_i by taking inner products with the embeddings of all words in Ω_i, denoted δ^(t) = w_t · Ω_i; then use the softmax function

$$\sigma(\delta)_k = \frac{e^{\delta_k}}{\sum_m e^{\delta_m}}$$

to normalise the two cases above, denoted σ(δ^(s))_t and σ(δ^(t))_s respectively; finally select the larger of the two as the similarity measure of the word pair, computed as

$$\delta_{st} = \max\big(\sigma(\delta^{(s)})_t,\ \sigma(\delta^{(t)})_s\big);$$

Step 4, from the word-and-text similarity measures δ_st obtained in step 3, obtain the text similarity measure through the normalized maximum weighted matching distance (Normalized Maximum Matching Distance, NMD), i.e. the maximum weighted matching over the word-pair similarities normalised by the text lengths.
2. The reconstruction-loss minimisation process according to claim 1, characterised in that: in step 2.1.1 the decoder responsible for reconstructing the target word is shared between the encoding network and the decoding network, thereby reducing the magnitude of the parameters to be estimated and preventing model overfitting.
CN201810234539.1A 2018-03-21 2018-03-21 Text similarity measurement method combining word aggregation and word combination semantic features Expired - Fee Related CN108399163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810234539.1A CN108399163B (en) 2018-03-21 2018-03-21 Text similarity measurement method combining word aggregation and word combination semantic features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810234539.1A CN108399163B (en) 2018-03-21 2018-03-21 Text similarity measurement method combining word aggregation and word combination semantic features

Publications (2)

Publication Number Publication Date
CN108399163A true CN108399163A (en) 2018-08-14
CN108399163B CN108399163B (en) 2021-01-12

Family

ID=63092017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810234539.1A Expired - Fee Related CN108399163B (en) 2018-03-21 2018-03-21 Text similarity measurement method combining word aggregation and word combination semantic features

Country Status (1)

Country Link
CN (1) CN108399163B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8484148B2 (en) * 2009-05-28 2013-07-09 Microsoft Corporation Predicting whether strings identify a same subject
US20160078148A1 (en) * 2014-09-16 2016-03-17 Microsoft Corporation Estimating similarity of nodes using all-distances sketches
CN104699763A (en) * 2015-02-11 2015-06-10 中国科学院新疆理化技术研究所 Text similarity measuring system based on multi-feature fusion
CN106294350A (en) * 2015-05-13 2017-01-04 阿里巴巴集团控股有限公司 Text aggregation method and device
CN107273426A (en) * 2017-05-18 2017-10-20 四川新网银行股份有限公司 Short text clustering method based on deep semantic path search

Cited By (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190126B (en) * 2018-09-17 2023-08-15 北京神州泰岳软件股份有限公司 Training method and device for word embedding model
CN109190126A (en) * 2018-09-17 2019-01-11 北京神州泰岳软件股份有限公司 Training method and device for word embedding model
CN110968690A (en) * 2018-09-30 2020-04-07 百度在线网络技术(北京)有限公司 Clustering division method and device for words, equipment and storage medium
CN110968690B (en) * 2018-09-30 2023-05-23 百度在线网络技术(北京)有限公司 Clustering division method and device for words, equipment and storage medium
CN109472023A (en) * 2018-10-19 2019-03-15 中国人民解放军国防科技大学 Entity association degree measuring method and system based on entity and text combined embedding and storage medium
CN109582953B (en) * 2018-11-02 2023-04-07 中国科学院自动化研究所 Data support scoring method and equipment for information and storage medium
CN109582953A (en) * 2018-11-02 2019-04-05 中国科学院自动化研究所 Data support scoring method, device, and storage medium for information
CN109597876A (en) * 2018-11-07 2019-04-09 中山大学 Multi-round dialogue reply selection model based on reinforcement learning and method thereof
CN109597876B (en) * 2018-11-07 2023-04-11 中山大学 Multi-round dialogue reply selection model based on reinforcement learning and method thereof
WO2020098099A1 (en) * 2018-11-13 2020-05-22 平安科技(深圳)有限公司 Text accuracy calculation method and apparatus based on semantic parsing, and computer device
WO2020098098A1 (en) * 2018-11-13 2020-05-22 平安科技(深圳)有限公司 Semantic analysis-based text accuracy calculation method, device and computer device
CN109670171A (en) * 2018-11-23 2019-04-23 山西大学 Word vector representation learning method based on asymmetric word-pair co-occurrence
CN109829299A (en) * 2018-11-29 2019-05-31 电子科技大学 Unknown attack recognition method based on deep autoencoder
CN109543191B (en) * 2018-11-30 2022-12-27 重庆邮电大学 Word vector learning method based on word relation energy maximization
CN109543191A (en) * 2018-11-30 2019-03-29 重庆邮电大学 Word vector learning method based on word relation energy maximization
CN109783806B (en) * 2018-12-21 2023-05-02 众安信息技术服务有限公司 Text matching method utilizing semantic parsing structure
CN109783806A (en) * 2018-12-21 2019-05-21 众安信息技术服务有限公司 Text matching method utilizing semantic parsing structure
CN109815493A (en) * 2019-01-09 2019-05-28 厦门大学 Modeling method for intelligent hip-hop music lyric generation
CN110084440A (en) * 2019-05-15 2019-08-02 中国民航大学 Civil aviation passenger non-civilization grade prediction method and system based on joint similarity
CN110084440B (en) * 2019-05-15 2022-12-23 中国民航大学 Civil aviation passenger non-civilization grade prediction method and system based on joint similarity
CN110134965B (en) * 2019-05-21 2023-08-18 北京百度网讯科技有限公司 Method, apparatus, device and computer readable storage medium for information processing
CN110134965A (en) * 2019-05-21 2019-08-16 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for information processing
CN110321925A (en) * 2019-05-24 2019-10-11 中国工程物理研究院计算机应用研究所 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
CN110321925B (en) * 2019-05-24 2022-11-18 中国工程物理研究院计算机应用研究所 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
CN110309505A (en) * 2019-05-27 2019-10-08 重庆高开清芯科技产业发展有限公司 Data format self-parsing method based on word-embedding semantic analysis
WO2020250064A1 (en) * 2019-06-11 2020-12-17 International Business Machines Corporation Context-aware data mining
GB2599300A (en) * 2019-06-11 2022-03-30 Ibm Context-aware data mining
US11409754B2 (en) 2019-06-11 2022-08-09 International Business Machines Corporation NLP-based context-aware log mining for troubleshooting
CN112214995A (en) * 2019-07-09 2021-01-12 百度(美国)有限责任公司 Hierarchical multitask term embedding learning for synonym prediction
CN112214995B (en) * 2019-07-09 2023-12-22 百度(美国)有限责任公司 Hierarchical multitasking term embedded learning for synonym prediction
CN110674292A (en) * 2019-08-27 2020-01-10 腾讯科技(深圳)有限公司 Man-machine interaction method, device, equipment and medium
CN110674292B (en) * 2019-08-27 2023-04-18 腾讯科技(深圳)有限公司 Man-machine interaction method, device, equipment and medium
CN110705274A (en) * 2019-09-06 2020-01-17 电子科技大学 Fusion type word meaning embedding method based on real-time learning
CN110705274B (en) * 2019-09-06 2023-03-24 电子科技大学 Fusion type word meaning embedding method based on real-time learning
CN110866195A (en) * 2019-11-12 2020-03-06 腾讯科技(深圳)有限公司 Text description generation method and device, electronic equipment and storage medium
CN110866195B (en) * 2019-11-12 2024-03-19 腾讯科技(深圳)有限公司 Text description generation method and device, electronic equipment and storage medium
CN111143510A (en) * 2019-12-10 2020-05-12 广东电网有限责任公司 Searching method based on latent semantic analysis model
CN112749554B (en) * 2020-02-06 2023-08-08 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for determining text matching degree
CN112749554A (en) * 2020-02-06 2021-05-04 腾讯科技(深圳)有限公司 Method, device and equipment for determining text matching degree and storage medium
CN111581351B (en) * 2020-04-30 2023-05-02 识因智能科技(北京)有限公司 Dynamic element embedding method based on multi-head self-attention mechanism
CN111581351A (en) * 2020-04-30 2020-08-25 识因智能科技(北京)有限公司 Dynamic element embedding method based on multi-head self-attention mechanism
CN113220832A (en) * 2021-04-30 2021-08-06 北京金山数字娱乐科技有限公司 Text processing method and device
CN113220832B (en) * 2021-04-30 2023-09-05 北京金山数字娱乐科技有限公司 Text processing method and device
CN116108158A (en) * 2023-04-13 2023-05-12 合肥工业大学 Online interactive question-answering text feature construction method and system
CN117056902A (en) * 2023-09-27 2023-11-14 广东云百科技有限公司 Password management method and system for Internet of things

Also Published As

Publication number Publication date
CN108399163B (en) 2021-01-12

Similar Documents

Publication Publication Date Title
CN108399163A (en) Text similarity measurement method combining word aggregation and word combination semantic features
Fang et al. From captions to visual concepts and back
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN106599032B (en) Text event extraction method combining sparse coding and structured perceptron
Bruni et al. Distributional semantics from text and images
CN109697285A (en) Hierarchical BiLSTM disease-code annotation method for Chinese electronic health records with enhanced semantic representation
CN108491389B (en) Method and device for training click bait title corpus recognition model
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN110008323A (en) Question equivalence discrimination method combining semi-supervised learning and ensemble learning
CN112256866A (en) Text fine-grained emotion analysis method based on deep learning
CN112527981B (en) Open type information extraction method and device, electronic equipment and storage medium
CN111476024A (en) Text word segmentation method and device and model training method
CN112037909B (en) Diagnostic information review system
CN109934251A (en) Method, recognition system, and storage medium for text recognition in less-common languages
CN114548321A (en) Self-supervision public opinion comment viewpoint object classification method based on comparative learning
Wadud et al. Text coherence analysis based on misspelling oblivious word embeddings and deep neural network
CN114743029A (en) Image text matching method
CN110516230A (en) Pivot-based Chinese-Burmese bilingual parallel sentence pair extraction method and device
CN116628186B (en) Text abstract generation method and system
CN113284627A (en) Medication recommendation method based on patient characterization learning
Rakhsha et al. Detecting adverse drug reactions from social media based on multichannel convolutional neural networks modified by support vector machine
CN111079582A (en) Off-topic judgment method for English compositions based on image recognition
Bouchard-Côté et al. A probabilistic approach to language change
CN115660871A (en) Medical clinical process unsupervised modeling method, computer device, and storage medium
CN115269846A (en) Text processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210112