CN107273913B - Short text similarity calculation method based on multi-feature fusion - Google Patents

Short text similarity calculation method based on multi-feature fusion Download PDF

Info

Publication number
CN107273913B
CN107273913B CN201710328364.6A CN201710328364A CN107273913B CN 107273913 B CN107273913 B CN 107273913B CN 201710328364 A CN201710328364 A CN 201710328364A CN 107273913 B CN107273913 B CN 107273913B
Authority
CN
China
Prior art keywords
short text
matrix
feature
short
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710328364.6A
Other languages
Chinese (zh)
Other versions
CN107273913A (en
Inventor
高曙
周润
王讷
龚磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN201710328364.6A priority Critical patent/CN107273913B/en
Publication of CN107273913A publication Critical patent/CN107273913A/en
Application granted granted Critical
Publication of CN107273913B publication Critical patent/CN107273913B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a short text similarity calculation method based on multi-feature fusion, which comprises the following steps: firstly, an HTI method is designed to extract word frequency characteristics of a short text, secondly, the existing Skip _ gram training model of word2vec is utilized to extract grammatical characteristics of the short text, then, an HSBM model is designed to organically fuse the word frequency and the grammatical characteristics in semantic dimension, and finally, an MFSM model is designed to calculate to vectorize the fusion result and calculate the similarity between the short texts. The method extracts the features of the short text from multiple dimensions, so that the calculation accuracy of the similarity of the short text can be effectively improved.

Description

Short text similarity calculation method based on multi-feature fusion
Technical Field
The invention relates to a natural language processing technology, in particular to a short text similarity calculation method based on multi-feature fusion.
Background
The space vector model (VSM) converts the feature terms in the short text into a numerical form that can be recognized by a computer and reflects the importance degree of the feature terms in the short text to a certain extent.
The feature extraction based on word frequency is a process of selecting a feature term set which can reflect the features of the short text most in the original term set according to the calculation of a specific feature evaluation function. The word frequency-inverse document frequency (TF-IDF) and Mutual Information (MI) are two common word frequency feature extraction methods. The concept of Information Entropy (IE) is derived from statistical thermodynamics and is used for measuring the chaos degree of a system, and the concept is not directly used for feature extraction of texts but is often fused into other short text word frequency feature extraction methods.
The characteristic extraction based on grammar can be directly investigated from the context environment of words by utilizing a language model so as to extract the grammatical characteristic of the short text; and modeling the distribution of subsequent words in the short text under a given context condition by using a neural network, namely extracting grammatical features of the short text by using a deep learning method. The Skip _ gram training model of word2vec is an implementation of a Neural Network Language Model (NNLM), omits a nonlinear hidden layer of the NNLM, quickly improves the prediction process of words by sacrificing the training precision, and compensates the training precision by increasing the training corpus, so that the training model can effectively and quickly generate word vectors. The Skip _ gram training model predicts the probability of the context generation through the current word to obtain the feature words with different probabilities, thereby keeping the grammatical relation among the feature words.
The word pair topic model (BTM) is a relatively common short text semantic feature extraction model, and is a perfect combination of a unary mixed model and a topic model: first, to solve the data sparseness problem, BTMs combine the advantages of unary mixture models: all short texts share a topic distribution; then, in order to eliminate the defect that each short text only has one theme, the BTM models the co-occurrence word pair on the whole corpus; and finally, mapping the short text to a corresponding semantic space (or a theme space), thereby analyzing and judging the short text semantics. If the method is described by a mathematical language, the theme represents the conditional probability distribution of the feature words in the feature word set, and the conditional probability value of the feature words reflects the degree of closeness of the relationship between the feature words and the theme.
The short text similarity calculation can be defined as: for a given short text set, on the basis of researching a short text structure, various short text features (such as word frequency, grammar and semantic features) are extracted and quantized, so that the same points and different points among the short texts are reflected by data, the more the same points are, the higher the similarity is, and conversely, the lower the similarity is. The JS distance is a commonly used short text similarity calculation method, is suitable for the situation that short text features are presented in a probability form, can reflect the difference situation of two probability distributions in the same probability space, is based on the KL distance, and improves the results of the KL distance to be not satisfied with the defects of nonnegativity, symmetry and the like.
Short text similarity calculation is a difficult point and a hot point in the fields of Natural Language Processing (NLP) and machine learning, is an important task in NLP, can be used as a separate task, and can be used as the basis of other NLP applications. In the field of short text similarity calculation, students mostly prefer to extract single-dimensional features of word frequency or semantics, and few short text features across dimensions are extracted and fused, so that the obtained features are one-sided and incomplete, and the similarity precision obtained by using the features is not high. In addition, in the aspect of word frequency dimension feature combination, most of the current researches are combined in a feature pool or two-dimensional feature space mode, and deep integration is lacked; in the aspect of semantic dimension feature extraction, the current research direction generally applies BTM directly on the original short text set, that is, feature extraction is performed on information by directly using rich words in the original short text set, which may amplify adverse effects caused by noise features.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a short text similarity calculation method based on multi-feature fusion aiming at the defects in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows: a short text similarity calculation method based on multi-feature fusion comprises the following steps:
(1) acquiring the text quantity | M | and the topic quantity | K | in the short text set D to be calculated, and acquiring a short text-topic matrix S through a three-dimensional feature extraction and fusion process (namely process (A)) of word frequency, grammar and semantics;
1.1) extracting the word frequency characteristics of the short texts from the short text set D;
1.2) extracting short text grammatical features of the short text set D;
1.3) short text semantic fusion:
firstly, acquiring a weight matrix W' with fused word frequency and grammatical features; then, modeling by using BTM on a co-occurrence binary pair consisting of ternary elements < feature item t, short text d and fusion weight W '> in W'; finally, calculating to obtain the theme distribution probability of the short text, and obtaining a semantic fusion result short text-theme matrix S of the short text;
(2) converting the short text-theme matrix S into a short text vector set Z according to a formula (10), and initializing a similarity calculation result matrix Y;
(3) non-repeated selection of short text vectors d in set Z1If the set Z has no selectable short text, turning to the step (6);
(4) non-repeated selection of short text vectors d in set Z2If the set Z has no selectable short text, turning to the step (3);
(5) computing short text d1And d2Recording the result into a similarity result matrix Y according to the similarity between the two groups, and turning to the step (4);
(6) and obtaining a short text similarity result matrix Y.
According to the scheme, the text quantity in the short text set D is set to be | M |, the quantity of the non-repetitive characteristic items in the dictionary is set to be | N |, and a weight matrix is calculated by adopting an HTI method in the step 1.1), and the specific steps are as follows:
1.1.1): initializing the values of the characteristic item index i and the short text index j to be 0, and initializing a weight matrix W to be a zero matrix;
1.1.2): statistically calculating the feature term tiIn short text djFrequency of occurrence in and assigning to TF (t)i,dj);
1.1.3): calculating local factors of the characteristic items, wherein the adopted calculation formula is as follows:
localT(ti,dj)=log(TF(ti,dj)+β) (1)
wherein, TF (t)i,dj) Representing a feature item tiIn short text djβ is a constant factor (typically taken to be an empirical value of 1).
1.1.4): computing a feature term tiAnd short text djThe adopted calculation formula is as follows:
Figure BDA0001291776310000051
wherein,P(ti,dj) Representing a feature item tiAnd short text djProbability of co-occurrence, P (t)i) Representing a feature item tiProbability of occurrence in short text sets, P (d)j) Representing short text djProbability of occurrence in a short text set.
1.1.5): calculating the global factor of the feature item, wherein the adopted calculation formula is as follows:
Figure BDA0001291776310000052
where n is the total number of short texts, C (t)i,dj) Representing a feature item tiAnd short text djα is a constant factor (typically taken to be an empirical value of 1).
1.1.6): computing feature item-short text pairs (t)i,dj) And assigning to WijHTI weight calculation formula:
HTI(ti,dj)=localT(ti,dj)×globalT(ti,dj) (4)
wherein localT (t)i,dj) Representing a local factor of a feature item, globalT (t)i,dj) Representing a feature item global factor;
1.1.7): for each feature item-short text pair (t)i,dj) And repeating the operations from 1.1.2) to 1.1.6) to obtain an HTI weight matrix W of the short text set D.
According to the scheme, in the step 1.2), short text grammatical feature extraction is that a short text set D is trained by using a Skip _ gram model of word2vec to obtain a word vector set X:
X=(x1,x2,...,xi) (5)
wherein x isiRepresenting a feature item tiThe word vector of (2).
According to the scheme, the short text semantic fusion in the step 1.3) comprises the following specific steps:
1.3.1): each word vector X of the set of word vectors X obtained according to step 1.2)iAnd calculating a word vector normalization factor:
Figure BDA0001291776310000061
where m denotes the dimension of a predetermined word vector and k denotes the word vector xiThe value of the k-th dimension.
1.3.2): for each ternary element < feature t, text d, HTI weight W > in the HTI weight matrix W, a weight normalization factor is calculated:
Figure BDA0001291776310000062
wherein, HTI (t)i,dj) Representing short text djMiddle characteristic term tiHTI weight w.
1.3.3): and calculating fusion weight by using the word vector normalization factor and the weight normalization factor, and replacing the HTI weight W of each ternary element in the matrix W by using the fusion weight to obtain a new word frequency and grammar fusion weight matrix W'. The fusion weight calculation formula:
NL(ti,dj)=F(ti,dj)×G(i) (8)
1.3.4): generating corpus B (or called co-occurrence binary pair set B) by BTM on fusion weight matrix W')
1.3.5): for each co-occurrence binary pair B in set B ═ (c)i,cj) Randomly initializing a theme, wherein the initialization iteration number i is 0;
1.3.6): each co-occurring binary pair B in pair set B ═ ci,cj) Calculating the state transition probability:
Figure BDA0001291776310000071
1.3.7): step 1.3.6) is repeated while updating the frequency n in the state transition probability formula (9)s
Figure BDA0001291776310000072
And
Figure BDA0001291776310000073
until reaching the upper limit of iteration times;
1.3.8): calculating topic distribution theta of whole short text set by BTMsAnd the distribution of the ternary element c under a specific topicc|sThus, the topic probability distribution of each short text is obtained, namely, the short text-topic matrix S is obtained.
According to the scheme, the short text vector in the step 2) is calculated as follows:
after the short text set D passes through the HSBM model, a short text-topic distribution matrix S is obtained, each element in the S is a conditional probability, and each column in the matrix S is converted into a vector form of a short text:
di=(P(s1|di),P(s2|di),P(s3|di),...,P(s|K||di)) (10)
wherein, P(s)i|di) Representing short text diIs assigned to a topic siA conditional probability value, | K | represents the number of topics;
based on equation (10), the short text-topic distribution matrix S is converted into a set of short text vectors Z.
According to the scheme, the short text d is calculated in the step 6)1And d2The similarity between them is given by the following formula:
the calculation formula of the KL distance and the JS distance is as follows:
Figure BDA0001291776310000081
Figure BDA0001291776310000082
wherein d is1、d2As a probability distribution vector for short text, d1(k)、d2(k) Respectively representing probability distribution vectors d1、d2The k-th probability.
The invention has the following beneficial effects:
(1) the method is based on analyzing TF-IDF and mutual information two word frequency feature extraction methods, combines the concept of information entropy to effectively fuse the two word frequency feature extraction methods, provides a short text word frequency feature extraction method HTI, and realizes deep integration of multiple word frequency dimensionality features.
(2) According to the method, a semantic feature extraction model HSBM of a short text is constructed based on BTM, a word pair generation process in a short text corpus is not modeled directly, a short text-feature word fusion weight matrix W ' is obtained first, and then a co-occurrence binary pair formed by ternary elements < feature item t, text d and fusion weight W ' > in W ' is modeled, so that adverse effects caused by noise features are removed to a certain extent.
(3) The method extracts features from multiple dimensions of word frequency, grammar and semantics, and effectively improves the calculation accuracy of the similarity of the short text.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a model structure diagram of an HSBM of an embodiment of the present invention;
fig. 2 is a model structural diagram of an MFSM of an embodiment of the present invention;
fig. 3 is a flowchart of a short text similarity calculation method based on multi-feature fusion according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in FIG. 1, FIG. 1 is a Model structure diagram of HSBM (HTI-Skip _ gram-BTM fusion Model), in which the parameters are described as follows:
in process (I), rounded rectangles (e.g., "HTI") represent feature extraction methods or models, and hexagons represent short text collections; the circles represent the weight matrix: w is an HTI weight matrix obtained by an HTI method, X is a feature word vector set obtained by a Skip _ gram training model, and W' is a short text-feature word fusion weight matrix obtained by carrying out normalization operation (NL) on the HTI weight matrix W and the feature word vector set X; | M | represents the total text quantity in the short text set, and | N | represents the quantity of the feature terms.
The transparent circle (e.g., "θ") in Process (II) represents an implicit parameter, the shaded circle (e.g., "c)i") represents variables that can be directly obtained by sampling, α is a hyper-parameter for the Dirichlet distribution theta, β is the Dirichlet distribution theta
Figure BDA0001291776310000101
S is the underlying topic distribution, c is the ternary element in the fusion weight matrix W<Feature item t, text d, fusion weight w'>,(ci,cj) Is a co-occurrence binary pair consisting of ternary elements c; | K | represents the number of topics and | B | represents the number of co-occurrence binary pairs.
As shown in fig. 3, the basic steps of the method of the present invention are as follows:
if the number of texts in the short text set D is | M |, the number of topics is | K |
(1) Obtaining a short text-topic matrix S through a three-dimensional feature extraction and fusion process (namely, process (A)) of word frequency, grammar and semantics;
(2) converting the short text-theme matrix S into a short text vector set Z according to a formula (10), and initializing a similarity calculation result matrix Y;
(3) non-repeated selection of short text vectors d in set Z1If the set Z has no selectable short text, turning to the step (6);
(4) non-repeated selection of short text vectors d in set Z2If the set Z has no selectable short text, turning to the step (3);
(5) calculating short text d according to formula (11) and formula (12)1And d2Recording the result into a similarity result matrix Y according to the similarity between the two groups, and turning to the step (4);
(6) and obtaining a short text similarity result matrix Y.
Three-dimensional feature extraction and fusion process of word frequency, grammar and semantics
The three-dimensional feature extraction and fusion process of word frequency, grammar and semantics is mainly realized by using the HSBM model designed by the patent, and the three-dimensional feature extraction and fusion process mainly comprises the following basic steps: firstly, short text features are respectively extracted from two dimensions of word frequency and grammar, and then organic fusion is carried out in semantic dimension. Therefore, the implementation process is divided into three phases: extracting short text word frequency characteristics; short text grammatical feature extraction; and a short text semantic fusion stage. These three stages will be described separately below.
Short text word frequency characteristic extraction stage
The method is mainly realized by using an HTI (Hybrid TF-IDF) method designed by the patent at the short text word frequency feature extraction stage, improves the TF-IDF by using the concepts of MI and IE, retains the important function of the TF on short text feature extraction, and optimizes the structure of the IDF to more accurately reflect the distribution condition and the importance degree of feature words in all short texts, thereby more effectively adjusting the weight of the feature words and improving the precision of similarity calculation.
If the number of texts in the short text set D is | M |, and the number of unrepeated feature items in the dictionary is | N |, the basic steps of calculating the weight matrix by the HTI method are as follows:
the first step is as follows: initializing the values of the characteristic item index i and the short text index j to be 0, and initializing a weight matrix W to be a zero matrix;
the second step is that: statistically calculating the feature term tiIn short text djFrequency of occurrence in and assigning to TF (t)i,dj);
The third step: calculating a local factor of the characteristic item, wherein the calculation formula is as follows:
localT(ti,dj)=log(TF(ti,dj)+β) (1)
wherein, TF (t)i,dj) Representing a feature item tiIn short text djβ is a constant factor (typically taken to be an empirical value of 1).
The fourth step: computing a feature term tiAnd short text djThe correlation factor is calculated by the formula:
Figure BDA0001291776310000121
wherein, P (t)i,dj) Representing a feature item tiAnd short text djProbability of co-occurrence, P (t)i) Representing a feature item tiProbability of occurrence in short text sets, P (d)j) Representing short text djProbability of occurrence in a short text set.
The fifth step: calculating a global factor of the feature item, wherein the calculation formula is as follows:
Figure BDA0001291776310000122
where n is the total number of short texts, C (t)i,dj) Representing a feature item tiAnd short text djα is a constant factor (typically taken to be an empirical value of 1).
And a sixth step: computing feature item-short text pairs (t)i,dj) And assigning to WijHTI weight calculation formula:
HTI(ti,dj)=localT(ti,dj)×globalT(ti,dj) (4)
wherein localT (t)i,dj) Representing a local factor of a feature item, globalT (t)i,dj) Representing a feature item global factor.
The seventh step: for each feature item-short text pair (t)i,dj) And repeating the operations of the second step and the sixth step to obtain an HTI weight matrix W of the short text set D.
Short text grammatical feature extraction stage
In the short text grammatical feature extraction stage, a short text set D is trained by mainly utilizing a Skip _ gram model of word2vec to obtain a word vector set X:
X=(x1,x2,...,xi) (5)
wherein x isiRepresenting a feature item tiThe word vector of (2).
Short text semantic fusion phase
In the HSBM model, the implementation of the short text semantic fusion stage is as follows: firstly, acquiring a weight matrix W' with fused word frequency and grammatical features; then, modeling by using BTM on a co-occurrence binary pair consisting of ternary elements < feature item t, short text d and fusion weight W '> in W'; and finally, calculating to obtain the topic distribution probability of the short text, and obtaining the semantic fusion result of the short text. The specific steps of this stage (in fig. 1, process (I) includes steps one to three, and process (II) includes steps four to eight):
the first step is as follows: for each word vector X of the set of word vectors X in equation (5)iAnd calculating a word vector normalization factor:
Figure BDA0001291776310000131
where m denotes the dimension of a predetermined word vector and k denotes the word vector xiThe value of the k-th dimension.
The second step is that: for each ternary element < feature t, text d, HTI weight W > in the HTI weight matrix W, a weight normalization factor is calculated:
Figure BDA0001291776310000062
wherein, HTI (t)i,dj) Representing short text djMiddle characteristic term tiHTI weight w.
The third step: and calculating fusion weight by using the word vector normalization factor and the weight normalization factor, and replacing the HTI weight W of each ternary element in the matrix W by using the fusion weight to obtain a new word frequency and grammar fusion weight matrix W'. The fusion weight calculation formula:
NL(ti,dj)=F(ti,dj)×G(i) (8)
the fourth step: generating corpus B (or called co-occurrence binary pair set B) by BTM on fusion weight matrix W')
The fifth step: for each co-occurrence binary pair B in set B ═ (c)i,cj) Randomly initializing a theme, wherein the initialization iteration number i is 0;
and a sixth step: each co-occurring binary pair B in pair set B ═ ci,cj) Calculating the state transition probability:
Figure BDA0001291776310000141
the seventh step: repeating the step six, and simultaneously updating the frequency n in the state transition probability formula (9)s
Figure BDA0001291776310000142
And
Figure BDA0001291776310000143
until reaching the upper limit of iteration times;
eighth step: calculating topic distribution theta of whole short text set by BTMsAnd the distribution of the ternary element c under a specific topicc|sAnd then, the topic probability distribution of each short text is obtained, namely, a short text-topic matrix S is obtained.
Implementation of short text similarity calculation method based on multi-feature fusion
The basic idea of the short text similarity calculation method based on multi-feature fusion is as follows: firstly, short text features are extracted from word frequency, grammar and semantic dimensions respectively, then the short text features are organically fused, and a fusion result is quantized, so that the similarity between short texts is calculated. The method is mainly implemented by using an MFSM Model designed by the present patent, as shown in fig. 2, fig. 2 is a Model structure diagram of an MFSM (Multi-Feature based Similarity-calculation Model), where S is a short text-topic distribution matrix obtained by a short text set through an HSBM Model, Z is a short text vector set, Y is a short text Similarity result matrix, | M | represents a text quantity in the short text set, | K | represents a topic quantity, and JS represents a Similarity calculation method (that is, JS distance) for processing the short text vector set Z. The specific implementation of the method is mainly divided into 3 processes: (A) extracting and fusing three-dimensional characteristics of word frequency, grammar and semantics; (B) calculating a short text vector; (C) and calculating the similarity of the short texts. Wherein process (a) has been carried out as hereinbefore described. The following first describes process (B) and process (C), and then the basic steps of the method are described.
Calculation of short text vectors
After the short text set D passes through the HSBM model, a short text-topic distribution matrix S is obtained, each element in the S is a conditional probability, and each column in the matrix S is converted into a vector form of a short text:
di=(P(s1|di),P(s2|di),P(s3|di),...,P(s|K||di)) (10)
wherein, P(s)i|di) Representing short text diIs assigned to a topic siThe conditional probability value of, | K | represents the number of topics.
Obviously, the short text has been mapped to the corresponding semantic space (i.e., topic space). Based on formula (10), the short text-topic distribution matrix S is converted into a short text vector set Z (i.e., process (B) in fig. 2), which can be used as an input for the short text similarity calculation in process (C).
Calculation of short text similarity
Because each short text vector in the short text vector set Z is presented in a probability form, the method utilizes the JS distance to calculate the similarity between the short texts, and the similarity is based on the KL distance, and the result of improving the KL distance does not meet the defects of nonnegativity, symmetry and the like. The calculation formula of the KL distance and the JS distance is as follows:
Figure BDA0001291776310000161
Figure BDA0001291776310000162
wherein d is1、d2As a probability distribution vector for short text, d1(k)、d2(k) Respectively representing probability distribution vectors d1、d2The k-th probability.
At present, short text features used for short text similarity calculation are single in dimensionality, most of the short text features are biased to extracting single-dimensionality features of word frequency or semantics, and cross-dimensionality short text features are rarely extracted and fused, so that the obtained features are one-sided and incomplete, and the similarity precision obtained by using the features is not high. The patent provides a short text similarity calculation method based on multi-feature fusion, which comprises the steps of firstly, designing an HTI method to extract word frequency features of short texts, secondly, extracting grammatical features of the short texts by using an existing Skip _ gram training model of word2vec, then designing an HSBM model to organically fuse the word frequency and the grammatical features in semantic dimension, and finally designing an MFSM model to calculate a fusion result and calculate the similarity between the short texts. The short text feature extraction method based on the multi-dimension extraction can effectively improve the short text similarity calculation accuracy.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (6)

1. A short text similarity calculation method based on multi-feature fusion is characterized by comprising the following steps:
(1) acquiring the text quantity | M |, the quantity of unrepeated feature items in a dictionary | N |, and the quantity of topics | K |, in the short text set D to be calculated, and acquiring a short text-topic matrix S through three-dimensional feature extraction and fusion processes of word frequency, grammar and semantics;
1.1) extracting the word frequency characteristics of the short texts from the short text set D;
1.2) extracting the grammatical features of the short text from the short text set D to obtain a word vector set;
1.3) short text semantic fusion:
firstly, acquiring a weight matrix W' with fused word frequency and grammatical features; then, modeling by using BTM on a co-occurrence binary pair consisting of ternary elements < feature item t, short text d and fusion weight W '> in W'; finally, calculating to obtain the theme distribution probability of the short text, and obtaining a semantic fusion result short text-theme matrix S of the short text;
(2) converting the short text-theme matrix S into a short text vector set Z, and initializing a similarity calculation result matrix Y;
(3) non-repeated selection of short text vectors d in set Z1If the set Z has no selectable short text, turning to the step (6);
(4) non-repeated selection of short text vectors d in set Z2If the set Z has no selectable short text, turning to the step (3);
(5) computing short text d1And d2Recording the result into a similarity result matrix Y according to the similarity between the two groups, and turning to the step (4);
(6) and obtaining a short text similarity result matrix Y.
2. The method for calculating similarity of short texts according to claim 1, wherein the number of texts in the short text set D obtained in step 1.1) is | M |, the number of unrepeated feature items in the dictionary is | N |, and a weight matrix is calculated by using an HTI method, and the method specifically includes the following steps:
1.1.1): initializing the values of the characteristic item index i and the short text index j to be 0, and initializing a weight matrix W to be a zero matrix;
1.1.2): statistically calculating the feature term tiIn short text djFrequency of occurrence in and assigning to TF (t)i,dj);
1.1.3): calculating local factors of the characteristic items, wherein the adopted calculation formula is as follows:
localT(ti,dj)=log(TF(ti,dj)+β)
wherein, TF (t)i,dj) Representing a feature item tiIn short text djβ is a constant factor;
1.1.4): computing feature itemstiAnd short text djThe adopted calculation formula is as follows:
Figure FDA0002214183830000021
wherein, P (t)i,dj) Representing a feature item tiAnd short text djProbability of co-occurrence, P (t)i) Representing a feature item tiProbability of occurrence in short text sets, P (d)j) Representing short text djProbability of occurrence in a short text set;
1.1.5): calculating the global factor of the feature item, wherein the adopted calculation formula is as follows:
Figure FDA0002214183830000031
where n is the total number of short texts, C (t)i,dj) Representing a feature item tiAnd short text djα is a constant factor;
1.1.6): computing feature item-short text pairs (t)i,dj) And assigning to WijHTI weight calculation formula:
HTI(ti,dj)=localT(ti,dj)×globalT(ti,dj)
wherein localT (t)i,dj) Representing a local factor of a feature item, globalT (t)i,dj) Representing a feature item global factor;
1.1.7): for each feature item-short text pair (t)i,dj) And repeating the operations from 1.1.2) to 1.1.6) to obtain an HTI weight matrix W of the short text set D.
3. The short text similarity calculation method according to claim 1, wherein the short text grammar feature extraction in step 1.2) is to train a short text set D to obtain a word vector set X by using a Skip _ gram model of word2 vec:
X=(x1,x2,...,xi)
wherein x isiRepresenting a feature item tiThe word vector of (2).
4. The short text similarity calculation method according to claim 2, wherein the short text semantic fusion in step 1.3) specifically comprises the following steps:
1.3.1): each word vector X of the set of word vectors X obtained according to step 1.2)iAnd calculating a word vector normalization factor:
Figure FDA0002214183830000041
where m denotes the dimension of a predetermined word vector and k denotes the word vector xiA value of the k-th dimension;
1.3.2): for each ternary element < feature t, text d, HTI weight W > in the HTI weight matrix W, a weight normalization factor is calculated:
Figure FDA0002214183830000042
wherein, HTI (t)i,dj) Representing short text djMiddle characteristic term tiHTI weight w of (a);
1.3.3): calculating a fusion weight by using the word vector normalization factor and the weight normalization factor, and replacing the HTI weight W of each ternary element in the matrix W by using the fusion weight to obtain a new word frequency and grammar fusion weight matrix W'; the fusion weight calculation formula:
NL(ti,dj)=F(ti,dj)×G(i)
1.3.4): generating a corpus B by using BTM on the fusion weight matrix W';
1.3.5): for each co-occurrence binary pair B in set B ═ (c)i,cj) Randomly initializing a theme, wherein the initialization iteration number i is 0;
1.3.6): (ii) for each co-occurring binary pair B in the set Bci,cj) Calculating the state transition probability:
Figure FDA0002214183830000043
1.3.7): repeating step 1.3.6) while updating the frequency n in the state transition probability formulas
Figure FDA0002214183830000044
And
Figure FDA0002214183830000051
until reaching the upper limit of iteration times;
1.3.8): calculating topic distribution theta of whole short text set by BTMsAnd the distribution of the ternary element c under a specific topicc|sThus, the topic probability distribution of each short text is obtained, namely, the short text-topic matrix S is obtained.
5. The method for calculating similarity of short texts according to claim 1, wherein the short text vector in step 2) is calculated as follows:
after the short text set D passes through the HSBM model, a short text-topic distribution matrix S is obtained, each element in the S is a conditional probability, and each column in the matrix S is converted into a vector form of a short text:
di=(P(s1|di),P(s2|di),P(s3|di),...,P(s|K||di))
wherein, P(s)i|di) Representing short text diIs assigned to a topic siA conditional probability value, | K | represents the number of topics;
and converting the short text-topic distribution matrix S into a short text vector set Z based on the formula.
6. The short text similarity calculation method according to claim 1, wherein the short text similarity calculation method is applied to the short text similarity calculation methodCalculating short text d in step 6)1And d2The similarity between them is given by the following formula:
the calculation formula of the KL distance and the JS distance is as follows:
Figure FDA0002214183830000052
Figure FDA0002214183830000053
wherein d is1、d2As a probability distribution vector for short text, d1(k)、d2(k) Respectively representing probability distribution vectors d1、d2The k-th probability.
CN201710328364.6A 2017-05-11 2017-05-11 Short text similarity calculation method based on multi-feature fusion Expired - Fee Related CN107273913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710328364.6A CN107273913B (en) 2017-05-11 2017-05-11 Short text similarity calculation method based on multi-feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710328364.6A CN107273913B (en) 2017-05-11 2017-05-11 Short text similarity calculation method based on multi-feature fusion

Publications (2)

Publication Number Publication Date
CN107273913A CN107273913A (en) 2017-10-20
CN107273913B true CN107273913B (en) 2020-04-21

Family

ID=60074133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710328364.6A Expired - Fee Related CN107273913B (en) 2017-05-11 2017-05-11 Short text similarity calculation method based on multi-feature fusion

Country Status (1)

Country Link
CN (1) CN107273913B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832381A (en) * 2017-10-30 2018-03-23 北京大数元科技发展有限公司 A kind of government procurement acceptance of the bid bulletin judging method and system from internet collection
CN108182176B (en) * 2017-12-29 2021-08-10 太原理工大学 Method for enhancing semantic relevance and topic aggregation of topic words of BTM topic model
CN108920603B (en) * 2018-06-28 2021-12-21 厦门快商通信息技术有限公司 Customer service guiding method based on customer service machine model
CN109325117B (en) * 2018-08-24 2022-10-11 北京信息科技大学 Multi-feature fusion social security event detection method in microblog
CN109543003A (en) * 2018-11-21 2019-03-29 珠海格力电器股份有限公司 A kind of system object similarity determines method and device
CN110069635A (en) * 2019-04-30 2019-07-30 秒针信息技术有限公司 A kind of determination method and device of temperature word
CN110472002B (en) * 2019-08-14 2022-11-29 腾讯科技(深圳)有限公司 Text similarity obtaining method and device
CN111461566B (en) * 2020-04-10 2023-02-03 武汉大学 Cross-border service flow fusion method and system based on message flow division and combination
CN113554053B (en) * 2021-05-20 2023-06-20 重庆康洲大数据有限公司 Method for comparing similarity of traditional Chinese medicine prescriptions
CN113486176B (en) * 2021-07-08 2022-11-04 桂林电子科技大学 News classification method based on secondary feature amplification
CN114491318B (en) * 2021-12-16 2023-09-01 北京百度网讯科技有限公司 Determination method, device, equipment and storage medium of target information

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899188A (en) * 2015-03-11 2015-09-09 浙江大学 Problem similarity calculation method based on subjects and focuses of problems
CN106599029A (en) * 2016-11-02 2017-04-26 焦点科技股份有限公司 Chinese short text clustering method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899188A (en) * 2015-03-11 2015-09-09 浙江大学 Problem similarity calculation method based on subjects and focuses of problems
CN106599029A (en) * 2016-11-02 2017-04-26 焦点科技股份有限公司 Chinese short text clustering method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Vivek Kumar Rangarajan Sridhar.Unsupervised Topic Modeling for Short Texts Using Distributed Representations of Words.《Proceedings of NAACL-HLT 2015》.2015, *
初建崇 等.Web文档中词语权重计算方法的改进.《计算机工程与应用》.2007,第43卷(第19期), *
张芸.基于BTM特征扩展的短文本相似度计算.《中国优秀硕士学位论文全文数据库 信息科技辑》.2014,(第09期), *
王亚民 等.基于BTM的微博舆情热点发现.《情报杂志》.2016,第35卷(第11期), *

Also Published As

Publication number Publication date
CN107273913A (en) 2017-10-20

Similar Documents

Publication Publication Date Title
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN107291693B (en) Semantic calculation method for improved word vector model
Ji et al. Representation learning for text-level discourse parsing
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
CN111831789B (en) Question-answering text matching method based on multi-layer semantic feature extraction structure
Maharjan et al. A multi-task approach to predict likability of books
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN108038106B (en) Fine-grained domain term self-learning method based on context semantics
CN112232087A (en) Transformer-based specific aspect emotion analysis method of multi-granularity attention model
CN113704416A (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
Kathuria et al. Real time sentiment analysis on twitter data using deep learning (Keras)
CN116304748A (en) Text similarity calculation method, system, equipment and medium
CN114254645A (en) Artificial intelligence auxiliary writing system
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
Arora et al. Comparative question answering system based on natural language processing and machine learning
CN113806543B (en) Text classification method of gate control circulation unit based on residual jump connection
Hung Vietnamese diacritics restoration using deep learning approach
Alqaraleh Turkish Sentiment Analysis System via Ensemble Learning
CN116955644A (en) Knowledge fusion method, system and storage medium based on knowledge graph
CN111382333A (en) Case element extraction method in news text sentence based on case correlation joint learning and graph convolution
Zheng et al. A novel hierarchical convolutional neural network for question answering over paragraphs
CN112464673B (en) Language meaning understanding method for fusing meaning original information
CN111581339B (en) Method for extracting gene events of biomedical literature based on tree-shaped LSTM
CN110275957B (en) Name disambiguation method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200421

Termination date: 20210511