CN108090047B - Text similarity determination method and equipment

Info

Publication number: CN108090047B
Application number: CN201810022280.4A
Authority: CN
Inventors: 周春; 郑百成; 黄妍明; 方永毅; 瞿荣; 蒋运承
Original assignee: South China Normal University
Current assignee: South China Normal University
Priority date: 2018-01-10
Filing date: 2018-01-10
Publication date: 2022-05-24
Anticipated expiration: 2038-01-10
Also published as: CN108090047A

Abstract

The invention discloses a new method and equipment for determining text similarity, which can accurately reflect the similarity of texts. The text similarity determining method comprises the following steps: acquiring a first text and a second text whose similarity is to be determined; determining the grammar similarity and the topic similarity of the first text and the second text; and determining the similarity between the first text and the second text according to the determined grammar similarity and topic similarity.

Description

Text similarity determination method and equipment

Technical Field

The invention relates to the technical field of computers, in particular to a method and equipment for determining text similarity.

Background

In the prior art, the similarity of two texts is generally determined by segmenting two texts and then determining repeated words in the two texts.

However, this approach ignores the comprehensive information in the text. For example, text one, "I chased a dog today", and text two, "a dog chased me today", have opposite meanings, but under most current similarity algorithms the segmented words of the two texts are almost the same, so the similarity of the two texts is determined to be high, or the texts are even judged identical, which is obviously inaccurate.

Therefore, the similarity obtained by the current text similarity calculation method is low in accuracy and cannot reflect the similarity of the text.
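The limitation described above can be illustrated with a short sketch. It assumes a Jaccard overlap over segmented words as a stand-in for the prior-art measure (the text does not name a specific algorithm), and the two token lists are hypothetical segmentations that differ only in word order.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Word-overlap similarity: size of the shared word set over the combined word set."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical segmentations; only the word order differs.
text_one = ["today", "I", "chase", "a", "dog"]
text_two = ["today", "a", "dog", "chase", "I"]

overlap = jaccard_similarity(text_one, text_two)
print(overlap)  # 1.0: the measure cannot tell the two sentences apart
```

Because the measure only looks at which words occur, the two sentences are scored as identical even though they say opposite things.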

Disclosure of Invention

In view of the above problems, the present invention provides a new method and device for determining text similarity, which can accurately reflect the similarity of texts themselves.

In order to solve the above technical problem, in a first aspect, a method for determining text similarity is provided, where the method includes:

acquiring a first text and a second text of which the similarity is to be determined;

determining grammar similarity and theme similarity of the first text, and determining grammar similarity and theme similarity of the second text;

and determining the similarity between the first text and the second text according to the determined grammar similarity and the subject similarity.

Optionally, determining the topic similarity of the first text and the second text includes:

mapping the first text and the second text to a topic space, respectively; the first text and the second text respectively correspond to at least one theme;

obtaining at least one first theme vector corresponding to the first text and at least one second theme vector corresponding to the second text which are mapped to the theme space;

determining the topic similarity of the first text and the second text according to the at least one first topic vector, the at least one second topic vector and a first preset rule;

wherein the first preset rule is as follows:

wherein $S_{\text{topic}}$ indicates the topic similarity of the two texts, $A$ indicates a first topic vector, $B$ indicates a second topic vector, $A_i$ indicates the $i$-th first topic vector, $B_i$ indicates the $i$-th second topic vector, $n$ indicates the number of first topic vectors or second topic vectors, and $1 \le i \le n$.

Optionally, determining the grammar similarity of the first text and the second text includes:

segmenting the sentences in the first text to obtain a first word segmentation set, and segmenting the sentences in the second text to obtain a second word segmentation set;

determining the syntactic structure composition of the sentences in the first participle set and the second participle set respectively through a Stanford tool;

and determining the grammatical similarity of the first text and the second text according to the determined grammatical structure composition of the sentences in the first word segmentation set and the second word segmentation set.

Optionally, the grammar structure includes at least one grammar structure type, and determining the grammar similarity of the first text and the second text according to the determined grammar structure composition of the sentences in the first word segmentation set and the second word segmentation set includes:

determining the grammar structure types included in the first word segmentation set and their number, and the grammar structure types included in the second word segmentation set and their number;

determining the grammar similarity of the first text and the second text according to the obtained grammar structure types and the number of the grammar structure types of the first word set and the second word set and a second rule;

wherein the second rule is:

wherein $S_{\text{grammer}}$ indicates the grammar similarity between the two texts, $sameCount$ indicates the number of grammar structure types that the first word set and the second word set have in common, $m$ is the number of grammar structure types included in the first word set, and $n$ is the number of grammar structure types included in the second word set.

Optionally, before determining the similarity between the first text and the second text according to the determined grammar similarity and the determined topic similarity, the method further includes:

determining the position similarity of the participles in the first participle set and the second participle set; the position similarity is used for indicating the similarity of the position of a participle in a sentence in the text;

determining similarity between the first text and the second text according to the determined grammar similarity and the subject similarity, including:

and determining the similarity between the first text and the second text according to the determined grammar similarity, the determined subject similarity and the determined position similarity.

Optionally, determining the similarity between the first text and the second text according to the determined grammar similarity, the determined topic similarity, and the determined position similarity, includes:

determining similarity between the first text and the second text by a third rule, wherein the third rule is:

$$S1(Sen1, Sen2) = a \cdot S_{\text{topic}} + (1-a)\,\bigl(b \cdot S_{\text{grammer}} + (1-b) \cdot S_{\text{position}}\bigr)$$

wherein $S1(Sen1, Sen2)$ indicates the similarity between the two texts, $S_{\text{position}}$ indicates the similarity of the positions of the segmented words included in the two texts, $a$ indicates the topic weight, and $b$ indicates the grammar type weight.

Optionally, after the first text and the second text with the similarity to be determined are obtained, the method further includes:

determining the emotional similarity of the first text and the second text;

determining similarity between the first text and the second text according to the determined grammar similarity, the determined subject similarity and the determined position similarity, including:

and determining the similarity between the first text and the second text according to the determined grammar similarity, theme similarity, position similarity and emotion similarity.

Optionally, determining the emotional similarity between the first text and the second text includes:

extracting at least one degree adverb from the first text and the second text, a degree adverb being an adverb that qualifies content in terms of degree;

determining at least one weight corresponding to the obtained at least one degree adverb according to the obtained at least one degree adverb and the mapping relation between the degree adverb and the weight, wherein one degree adverb corresponds to one weight;

determining the emotional similarity of the first text and the second text according to the determined at least one weight and a fourth preset rule;

wherein the fourth preset rule is:

$$CDegSim(Sen1, Sen2) = c \cdot \bigl|Deg(Sent1) - Deg(Sent2)\bigr| + (1-c) \cdot S1(Sen1, Sen2)$$

wherein $CDegSim(Sen1, Sen2)$ indicates the emotional similarity between the two texts, $Deg(Sent1)$ indicates the weight of the degree adverbs in the first text, $Deg(Sent2)$ indicates the weight of the degree adverbs in the second text, and $c$ is the weight given to the influence of the inter-sentence difference in degree adverb weights on sentence similarity.

Optionally, determining the similarity between the first text and the second text according to the determined grammar similarity, the determined topic similarity, the determined position similarity, and the determined emotion similarity, including:

analyzing the determined grammar similarity, theme similarity, position similarity and emotion similarity through a similarity model, and determining the similarity between the first text and the second text;

the similarity model is a relation model between the word set of a text and emotion categories, obtained by training a deep learning network layer by layer on the grammar, topic, position and emotion words of the words in the text; the emotion categories comprise a positive emotion category and a negative emotion category.

In a second aspect, there is provided a text similarity determination apparatus, including:

the acquiring unit is used for acquiring a first text and a second text of which the similarity is to be determined;

the first determining unit is used for determining the grammar similarity and the theme similarity of the first text, and determining the grammar similarity and the theme similarity of the second text;

and the second determining unit is used for determining the similarity between the first text and the second text according to the determined grammar similarity and the subject similarity.

Optionally, the first determining unit is specifically configured to:

wherein the first preset rule is as follows:

wherein $S_{\text{topic}}$ indicates the topic similarity of the two texts, $A$ indicates a first topic vector, $B$ indicates a second topic vector, $A_i$ indicates the $i$-th first topic vector, $B_i$ indicates the $i$-th second topic vector, $n$ indicates the number of first topic vectors or second topic vectors, and $1 \le i \le n$.

Optionally, the first determining unit is specifically configured to:

Optionally, the syntax structure includes at least one syntax structure type, and the first determining unit is specifically configured to:

wherein the second rule is:

Optionally, the first determining unit is further configured to: determining the position similarity of the participles in the first participle set and the second participle set before determining the similarity between the first text and the second text according to the determined grammar similarity and the subject similarity; the position similarity is used for indicating the similarity of the position of a participle in a sentence in the text;

determining similarity between the first text and the second text according to the determined grammar similarity and the determined topic similarity, comprising:

Optionally, the second determining unit is specifically configured to:

Optionally, the determining device further includes a third determining unit, configured to:

determining the emotional similarity of the first text and the second text;

the second determining unit is specifically configured to: and determining the similarity between the first text and the second text according to the determined grammar similarity, theme similarity, position similarity and emotion similarity.

Optionally, the third determining unit is specifically configured to:

wherein the fourth preset rule is:

Optionally, the second determining unit is specifically configured to:

The embodiment of the invention provides a novel text similarity determination method, which determines the similarity between two texts by comprehensively considering the grammar similarity and the topic similarity between them. Compared with the prior art, in which the similarity between two texts is determined only from the similarity of their segmented words, the obtained similarity is more accurate because the comprehensive information of the texts is considered, and it better reflects the similarity of the texts themselves.

Drawings

Fig. 1 is a flowchart of a text similarity determination method according to an embodiment of the present invention;

fig. 2 is a schematic structural diagram of a text similarity determination apparatus according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of a text similarity determination method according to an embodiment of the present invention.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein.

The similarity obtained by the current text similarity calculation method is low in accuracy and cannot reflect the similarity of the text.

In view of this, the embodiment of the present invention provides a new method for determining text similarity, which determines the similarity between two texts by comprehensively considering the grammar similarity and the topic similarity between them. Compared with the prior art, in which the similarity between two texts is determined only from the similarity of their segmented words, the obtained similarity is more accurate because the comprehensive information of the texts is considered, and it better reflects the similarity of the texts themselves.

The method for determining text similarity provided by the embodiment of the present invention may be applied to electronic devices with computing capabilities, such as a personal computer or a server, and the embodiment of the present invention does not limit the type of the electronic device. Hereinafter, the method is described as applied to an electronic device.

The text in the embodiment of the present invention may include documents, such as papers, web pages, and other types of texts, and may be long text or short text, where the type and length of the text are not limited herein.

The technical scheme provided by the embodiment of the invention is described in the following with reference to the drawings in the specification.

Referring to fig. 1, an embodiment of the present invention provides a text similarity determination method, which can be executed by any electronic device with computing capability, and a specific flow of the determination method is described as follows:

s101: acquiring a first text and a second text of which the similarity is to be determined;

s102: determining grammar similarity and theme similarity of the first text, and determining grammar similarity and theme similarity of the second text;

s103: and determining the similarity between the first text and the second text according to the determined grammar similarity and the subject similarity.

Similarity is a measure of the degree of match between two or more texts: the larger the similarity, the more alike the contents, and the smaller the similarity, the less alike the contents.

Before determining the similarity between two texts, the electronic device in the embodiment of the present invention may acquire the two texts whose similarity is to be determined, that is, the first text and the second text. The first text and the second text may be texts stored locally on the electronic device or texts stored on the network side; texts on the network side may be collected texts uploaded to the network by user-side devices. After acquiring the first text and the second text, the embodiment of the invention first determines the grammar similarity and the topic similarity between the first text and the second text, and then determines the similarity between the two texts according to the determined grammar similarity and topic similarity. Grammar similarity may be understood as similarity relating to the grammatical structure of a sentence, the parts of speech of the words in the sentence, and so on. Topic similarity may be understood as the similarity of the topic semantics of the texts. Because the method comprehensively considers the grammar similarity and the topic similarity between the two texts, the similarity obtained is more accurate and better reflects the similarity of the texts themselves.

How the electronic device determines the grammar similarity and the theme similarity of the first text and the second text respectively in the embodiment of the invention is described below.

First, how to determine the topic similarity of the first text and the second text is described:

The electronic device in the embodiment of the present invention may map the first text and the second text to a topic space through a topic modeling technique such as the Hierarchical Dirichlet Process (HDP). The number of topics in the topic space may be determined as required, and the first text and the second text may each correspond to at least one topic. Because the topics of the topic space are built from the association information and semantic information between text features, determining text similarity after mapping the texts to the topic space brings that association and semantic information into the determination. This is more accurate than the prior-art approach of determining text similarity by mapping the texts only to a word space.

In the embodiment of the invention, the first text and the second text are mapped to the topic space through the HDP topic modeling technique. The HDP model may be regarded as a nonparametric counterpart of the Latent Dirichlet Allocation (LDA) model. Modeling is first performed with HDP, the topic information of each text is obtained from the HDP model, and the topic similarity is then computed from that topic information. The input to the HDP is the first word segmentation set, obtained by segmenting the first text, deleting stop words, and performing lemmatization; likewise, the second word segmentation set is obtained by segmenting the second text, deleting stop words, and performing lemmatization. The output of the HDP model is, for each text, the topic vector values over k topics, and the topic similarity is calculated from these topic vectors by formula (1).

The embodiment of the invention models with HDP; because topics can be selected as required, the number of topics in the established HDP model is extensible. The first text and the second text are each modeled and mapped to the topic space using Gibbs sampling. After the first text and the second text are mapped to the topic space, a vector file composed of the topic information of each text is obtained, and the topic similarity of the first text and the second text is calculated from that vector file.

Specifically, the embodiment of the present invention may obtain at least one first topic vector corresponding to a first text and at least one second topic vector corresponding to a second text, which are mapped to the topic space. And determining the topic similarity of the first text and the second text according to at least one first topic vector, at least one second topic vector and a first preset rule. In the embodiment of the present invention, the first preset rule may be formula (1).

In formula (1), $S_{\text{topic}}$ indicates the topic similarity, $A$ indicates a first topic vector, $B$ indicates a second topic vector, $A_i$ indicates the $i$-th first topic vector, $B_i$ indicates the $i$-th second topic vector, $n$ indicates the number of first topic vectors or second topic vectors, and $1 \le i \le n$. The embodiment of the invention can calculate the topic similarity of the first text and the second text through formula (1).
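Formula (1) itself is not reproduced in the text. As a minimal sketch, the code below assumes it is a cosine similarity over the topic weights, which is consistent with the symbols $A_i$, $B_i$ and $n$ but is an assumption rather than the confirmed rule. The function name and the example vectors are hypothetical.

```python
import math


def topic_similarity(a, b):
    """Cosine similarity between two topic vectors of equal length (assumed form of formula (1))."""
    if len(a) != len(b):
        raise ValueError("topic vectors must have the same number of topics")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a text with no topic weight is treated as dissimilar
    return dot / (norm_a * norm_b)


# Hypothetical topic weights over n = 3 topics for each text.
first_topics = [0.7, 0.2, 0.1]
second_topics = [0.6, 0.3, 0.1]
s_topic = topic_similarity(first_topics, second_topics)
print(round(s_topic, 3))
```

Under this assumption, two texts with the same topic distribution score 1 and texts with no shared topics score 0.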

How to determine the grammatical similarity of the first text and the second text is described below.

The embodiment of the present invention may use an existing word segmentation tool, for example the ICTCLAS4J tool, to segment the sentences in the first text to obtain a first word segmentation set, and to segment the sentences in the second text to obtain a second word segmentation set. Grammar analysis is then performed with a Stanford tool, the text similarity with respect to shared grammar structures is calculated, and the contribution of phrase order within the sentence to the similarity is taken into account.

The grammar structure composition of the sentences in the first participle set and the second participle set can be respectively determined through a Stanford tool, the grammar structure comprises at least one grammar structure type, the part of speech of the words and the like, and therefore the grammar similarity of the first text and the second text is determined according to the determined grammar structure composition of the sentences in the first participle set and the second participle set.

Specifically, taking the first word set as an example, the part of speech of each word in the set can be obtained with the Stanford tool; in the analysis result, the tag after the slash is the part of speech of the word. For example: [for/P, today/NT, afternoon/NT, of/DEG, exam/NN, ，/PU, I/PN, very/AD, have/VE, confidence/NN, 。/PU]. Processing the sentence grammar in this way extracts the grammatical structures of the sentence and the phrases each contains, as shown in table 1.

TABLE 1

VP:	Is very confident
PP:	For the examination in the afternoon of today
PU:	，。
NP:	I am concerned with

However, it is not enough to know the part of speech of a word, and the specific part of speech represents the characteristics of the word and does not represent the structural composition of a sentence. Therefore, the embodiment of the present invention further extracts the grammatical structure of the sentence, such as a simple clause, a nominal clause, a verb phrase, and the like, and then classifies the corresponding word or phrase into the grammatical structure set.

Specifically, in the embodiment of the present invention, word segmentation and grammar parsing can be performed on the sentences in the first text or the second text with the ICTCLAS4J tool and the Stanford tool, yielding part-of-speech tags for the most basic words or phrases. The tagged words or phrases are then extracted and summarized through layer-by-layer analysis and syntax tree analysis, and the structure types that make up the sentence, such as the simple clause type and the verb phrase type, are extracted. Finally, words or phrases with the same grammar structure type are stored in the set for that structure type, forming different grammar structure sets, such as a simple clause type set, a nominal clause type set and a verb phrase type set, which completes the analysis of the grammatical composition of the sentence. Extracting the syntax structure in this way yields the structural composition of the sentence, as shown in table 2. Compared with the part-of-speech tagging of single words in table 1, this gives the grammatical structure composition of the sentence, which better supports the study of the grammatical and semantic similarity of Chinese sentences.

TABLE 2

Grammar structure	Meaning
ROOT	Text sentence to be processed
IP	Simple clause
NP	Noun phrase
VP	Verb phrase
PP	Prepositional phrase
LCP	Localizer phrase
CP	Complementizer phrase (clause headed by a complementizer)
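The grouping of phrases by grammar structure type described above can be sketched as follows. The sketch assumes the parser output is available as a Penn-Treebank-style bracketed string; the example tree, the function names and the English glosses are hypothetical and are not actual Stanford tool output.

```python
from collections import defaultdict

# Phrase-level structure types from table 2 (ROOT is the whole sentence and is skipped).
PHRASE_TYPES = {"IP", "NP", "VP", "PP", "LCP", "CP"}


def tokenize(tree_text):
    """Split a bracketed parse string into '(' , ')' and label/word tokens."""
    return tree_text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse one '(LABEL child ...)' node starting at tokens[pos]; return (node, next_pos)."""
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])  # a word (leaf)
            pos += 1
    return (label, children), pos + 1


def leaves(node):
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def collect_structures(node, groups=None):
    """Store each phrase under its grammar structure type, visiting the tree top-down."""
    if groups is None:
        groups = defaultdict(list)
    if isinstance(node, str):
        return groups
    label, children = node
    if label in PHRASE_TYPES:
        groups[label].append(" ".join(leaves(node)))
    for child in children:
        collect_structures(child, groups)
    return groups


# Hypothetical bracketed parse with English glosses in place of the Chinese words.
tree_text = ("(ROOT (IP (PP (P for) (NP (NT today) (NN exam))) "
             "(NP (PN I)) (VP (AD very) (VE have) (NP (NN confidence)))))")
tree, _ = parse(tokenize(tree_text))
structures = collect_structures(tree)
for structure_type, phrases in structures.items():
    print(structure_type, phrases)
```

The keys of `structures` are the grammar structure types present in the sentence, which is the information the grammar similarity step needs.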

After syntactic structure extraction, each sentence is divided into several structure types; for example, some sentences include noun phrases and verb phrases, while others include simple clauses, noun phrases and adverb phrases. The grammar structure type similarity of two sentences is the ratio of the number of structure types the two sentences have in common to the number of all their structure types. This ratio reflects how similar the two sentences are in syntactic structure. Therefore, when determining the grammar similarity of the first text and the second text from the grammatical structure composition of the sentences in the parsed first and second word segmentation sets, the grammar structure types included in the parsed first word segmentation set and their number are determined first, together with the grammar structure types included in the parsed second word segmentation set and their number. The grammar similarity of the first text and the second text is then determined according to these grammar structure types, their numbers, and a second rule, where the second rule may be formula (2).

In formula (2), $S_{\text{grammer}}$ indicates the grammar similarity between the two texts, $sameCount$ indicates the number of grammar structure types that the first word set and the second word set have in common, $m$ is the number of grammar structure types included in the first word set, and $n$ is the number of grammar structure types included in the second word set.
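Formula (2) is not reproduced in the text, which only states that it is a ratio of shared structure types to all structure types. The sketch below assumes a Dice-style form, 2 * sameCount / (m + n), so that identical type sets score 1; a plain sameCount / (m + n) would differ only by the factor 2. The function name is hypothetical.

```python
def grammar_similarity(types_first, types_second):
    """Assumed form of formula (2): 2 * sameCount / (m + n) over grammar structure types."""
    first, second = set(types_first), set(types_second)
    m, n = len(first), len(second)
    if m + n == 0:
        return 0.0  # neither text yielded any structure type
    same_count = len(first & second)
    return 2 * same_count / (m + n)


# Structure types found in two hypothetical sentences.
s_grammer = grammar_similarity({"IP", "NP", "VP"}, {"NP", "PP"})
print(s_grammer)  # one shared type out of 3 + 2 types
```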

Because the position of the phrase in the sentence has an important influence on the similarity between the short texts, the similarity calculation method in the prior art does not consider the position similarity, so that the accuracy of the calculated similarity between the two texts is low. The embodiment of the invention can determine the position similarity of the participles in the first participle set and the second participle set after the grammar is analyzed. The position similarity can be used to indicate the similarity degree of the position of a word in a sentence.

In the embodiment of the invention, the basic unit is not a single Chinese character but a phrase obtained by word segmentation: a single Chinese character carries too little information, whereas a phrase reflects more, so calculating position similarity over phrases is more reasonable. Specifically, the parsed first and second word segmentation sets may be recorded as vectors T1 and T2, where T1 includes s phrases and T2 includes t phrases, and the phrases at each position are T11, T12, …, T1s and T21, T22, …, T2t. The union T of T1 and T2 is obtained, and T includes k phrases. For each phrase Ti in T, the phrase in T1 that is the same as or most similar to it is searched for; in a possible embodiment, a preset similarity threshold may be set for this match. The subscript j of the matched phrase in T1 is recorded, and a phrase position vector R1 is constructed such that R1i is equal to j. The vector R2 is constructed from T2 in the same way, which gives the phrase position vectors R1 and R2 corresponding to T1 and T2. The position similarity of the phrases in the first text and the second text may then be calculated by formula (3).

In formula (3), $S_{\text{position}}$ denotes the position similarity, $R_{1i}$ denotes the phrase vector of a word in the first text, and $R_{2i}$ denotes the phrase vector of the word in the second text whose meaning is similar to that word in the first text.
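The construction of R1 and R2 can be sketched as below. Two assumptions are made loudly here: matching is by exact phrase only (the "most similar phrase" search and the threshold are omitted, and an unmatched phrase gets position 0), and, since formula (3) is not reproduced in the text, a normalized difference 1 - |R1 - R2| / |R1 + R2| is used as a stand-in borrowed from common word-order similarity measures. All names are hypothetical.

```python
import math


def position_vector(union_phrases, phrases):
    """For each phrase in the union, record its 1-based position in `phrases` (0 if absent)."""
    index = {}
    for j, phrase in enumerate(phrases, start=1):
        index.setdefault(phrase, j)  # keep the first occurrence
    return [index.get(phrase, 0) for phrase in union_phrases]


def position_similarity(t1, t2):
    """Stand-in for formula (3): 1 - |R1 - R2| / |R1 + R2| with Euclidean norms."""
    union = list(dict.fromkeys(list(t1) + list(t2)))  # ordered union T
    r1 = position_vector(union, t1)
    r2 = position_vector(union, t2)
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(r1, r2)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(r1, r2)))
    if total == 0.0:
        return 1.0  # both texts are empty
    return 1.0 - diff / total


same_order = position_similarity(["I", "chase", "dog"], ["I", "chase", "dog"])
swapped = position_similarity(["I", "chase", "dog"], ["dog", "chase", "I"])
print(same_order, round(swapped, 3))
```

With this stand-in, the two sentences from the background example no longer score as identical, because the positions of "I" and "dog" are exchanged.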

Further, the embodiment of the invention can determine the similarity between the first text and the second text according to the determined grammar similarity, the determined theme similarity and the determined position similarity. In a possible implementation manner, the embodiment of the present invention may calculate the similarity between the first text and the second text by using formula (4).

$$S1(Sen1, Sen2) = a \cdot S_{\text{topic}} + (1-a)\,\bigl(b \cdot S_{\text{grammer}} + (1-b) \cdot S_{\text{position}}\bigr) \tag{4}$$

In formula (4), $S1(Sen1, Sen2)$ indicates the similarity between the two texts, $S_{\text{position}}$ indicates the similarity of the positions of the segmented words included in the two texts, $S_{\text{topic}}$ indicates the topic similarity of the two texts, $S_{\text{grammer}}$ indicates the grammar similarity of the two texts, $a$ indicates the topic weight, and $b$ indicates the grammar type weight.
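Formula (4) translates directly into code. In the sketch below the function name, the example scores and the restriction of a and b to the range 0 to 1 are assumptions; the text gives no values or ranges for the weights.

```python
def combined_similarity(s_topic, s_grammer, s_position, a, b):
    """Formula (4): a * S_topic + (1 - a) * (b * S_grammer + (1 - b) * S_position)."""
    for name, weight in (("a", a), ("b", b)):
        if not 0.0 <= weight <= 1.0:
            # Assumed range, so that the result stays a weighted average.
            raise ValueError(f"weight {name} must lie in [0, 1]")
    return a * s_topic + (1 - a) * (b * s_grammer + (1 - b) * s_position)


# Hypothetical component scores and weights.
s1 = combined_similarity(s_topic=0.8, s_grammer=0.4, s_position=0.6, a=0.5, b=0.5)
print(round(s1, 2))  # 0.65
```

Setting a = 1 reduces the score to the topic similarity alone, and b shifts the remaining weight between grammar and position.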

By considering comprehensive information of the text, such as grammar similarity, topic similarity and position similarity, the embodiment of the invention determines the similarity of the first text and the second text with higher accuracy.

Further, the embodiment of the invention considers that even if two sentences have the same grammatical structure, semantics, topic and so on, their similarity can still differ greatly if their emotions differ. Therefore, the embodiment of the invention further considers the influence of degree adverbs and emotion on sentence similarity, so that the similarity of the first text and the second text is determined according to the grammar similarity, the topic similarity, the position similarity and the emotion similarity, and the determined similarity is more accurate.

According to the embodiment of the invention, the emotion similarity of the first text and the second text can be determined. Specifically, at least one degree adverb may be extracted from the first text and the second text, a degree adverb being an adverb that qualifies content in terms of degree. In the prior art, HowNet divides degree adverbs into 6 classes according to modification direction and degree: 69 'extreme' class words, 42 'very' class words, 37 'comparatively' class words, 29 'slight' class words, 12 'under' class words and 30 'over' class words, for 219 degree adverbs in total.

On the basis of the prior art, the embodiment of the invention groups the 6 classes of degree adverbs by semantic degree into 2 categories: tone-strengthening degree adverbs and tone-weakening degree adverbs. The tone-strengthening degree adverbs are the 'extreme', 'over', 'very' and 'comparatively' classes, with strengthening degree extreme > over > very > comparatively. The tone-weakening degree adverbs are the 'slight' and 'under' classes, and 'slight' weakens less than 'under'. Specifically, the tone-strengthening degree adverbs may be assigned values in the interval 1 to 2, increasing from 1 in steps of 0.1. With candidate values assigned, the similarity of sentences in a corpus is calculated, and the final assignment is 1.4 for 'extreme' class words, 1.3 for 'over' class words, 1.2 for 'very' class words and 1.1 for 'comparatively' class words. Similarly, the tone-weakening degree adverbs may be assigned values in the interval 0 to 1, decreasing from 1 in steps of 0.1, and the final assignment is 0.8 for 'slight' class words and 0.4 for 'under' class words.
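The mapping between degree adverb classes and weights can be held in a simple lookup table. In the sketch below the class weights are the values stated above, while the small word-to-class lexicon and the function name are hypothetical; a real lexicon would list the 219 degree adverbs.

```python
# Class weights as stated in the description.
DEGREE_CLASS_WEIGHTS = {
    "extreme": 1.4,
    "over": 1.3,
    "very": 1.2,
    "comparatively": 1.1,
    "slight": 0.8,
    "under": 0.4,
}

# Hypothetical lexicon mapping individual adverbs to their class.
EXAMPLE_LEXICON = {
    "extremely": "extreme",
    "very": "very",
    "rather": "comparatively",
    "slightly": "slight",
}


def degree_weights(tokens, lexicon=EXAMPLE_LEXICON):
    """Return the weight of every degree adverb found in a segmented sentence, in order."""
    return [DEGREE_CLASS_WEIGHTS[lexicon[token]] for token in tokens if token in lexicon]


weights = degree_weights(["I", "extremely", "like", "slightly", "cold", "tea"])
print(weights)  # [1.4, 0.8]
```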

Further, the embodiment of the present invention may calculate the influence of degree adverbs on sentence similarity through formula (5).

In formula (5), w1 and w2 are any two degree adverbs in one sentence, Deg(w1, w2) indicates the degree of influence of the degree adverbs w1 and w2 on the sentence similarity, Ad(w1) and Ad(w2) are the weights corresponding to the degree adverbs w1 and w2 respectively, and abs(Ad(w1) - Ad(w2)) is the absolute value of the difference between the weights of w1 and w2.

The weight of the influence of all degree adverbs in a sentence on that sentence can be calculated by formula (6).

In formula (6), Deg(wi, wj) indicates the degree of influence of any two degree adverbs wi and wj on sentence similarity, deg(set) represents the weight of the influence of all degree adverbs in a sentence on that sentence, and n is the number of degree adverbs contained in the sentence.

The embodiment of the invention can determine at least one weight corresponding to the extracted at least one degree adverb according to the mapping relation between degree adverbs and weights, where each degree adverb corresponds to one weight, and can then determine the emotion similarity between the first text and the second text according to the determined at least one weight and a fourth preset rule.

The fourth preset rule may be formula (7):

CDegSim(Sen1, Sen2) = c * abs(Deg(Sen1) - Deg(Sen2)) + (1 - c) * S1(Sen1, Sen2)    (7)

In formula (7), CDegSim(Sen1, Sen2) indicates the similarity between the two texts after the influence of degree adverbs is taken into account on top of the aforementioned similarity determined from grammar, position and topic, Deg(Sen1) indicates the weight of the degree adverbs in the first text, Deg(Sen2) indicates the weight of the degree adverbs in the second text, and c is the weight given to the inter-sentence difference in degree adverb weights when computing sentence similarity.
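Formula (7) is a weighted blend of two terms and can be sketched directly. This is a minimal illustration under stated assumptions: `deg1` and `deg2` stand for Deg(Sen1) and Deg(Sen2) and are taken as precomputed inputs because formulas (5) and (6) are not reproduced in this text, `s1` stands for S1(Sen1, Sen2), and the default `c = 0.3` is a hypothetical value, not one given by the source.

```python
def cdeg_sim(s1, deg1, deg2, c=0.3):
    """Formula (7): c * |Deg(Sen1) - Deg(Sen2)| + (1 - c) * S1(Sen1, Sen2).

    s1   -- similarity from topic, grammar and position, i.e. S1(Sen1, Sen2)
    deg1 -- degree-adverb weight of the first text, Deg(Sen1)
    deg2 -- degree-adverb weight of the second text, Deg(Sen2)
    c    -- influence weight of the degree-adverb difference (hypothetical default)
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie in [0, 1]")
    return c * abs(deg1 - deg2) + (1.0 - c) * s1
```

With equal degree-adverb weights in both texts the first term vanishes and the result reduces to (1 - c) * S1(Sen1, Sen2).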

After determining the emotion similarity of the first text and the second text, the embodiment of the invention can further analyze the determined grammar similarity, topic similarity, position similarity and emotion similarity through a similarity model to determine the similarity between the first text and the second text. The similarity model is a relational model between the participle set of a text and its emotion class, obtained by training a deep learning network layer by layer on the grammar, topic, position and emotion words of the participles in texts. The emotion classes comprise a positive emotion class and a negative emotion class.

In particular, embodiments of the invention may consider positive and negative emotion words by means of a model built on a Long Short-Term Memory (LSTM) neural network. The whole network is divided into five layers. At the input layer, the participles in the sentence are represented as word vectors. At the LSTM layer, the LSTM is used for representation learning of the text in both the forward and backward directions. The topic layer automatically extracts document features using the LDA topic distribution. The pooling layer further extracts text semantic features by combining the pooling function with the document features of the topic layer. At the output layer, the emotion class is predicted using the softmax function. In the final emotion tendency judgment result, 1 indicates a positive emotion and 0 indicates a negative emotion, which can be represented by formula (8).
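Only the output layer of the network described above is simple enough to sketch without a deep learning framework. The following illustrates how a softmax over two class scores yields the 1 (positive) or 0 (negative) judgment. The score order (index 0 for negative, index 1 for positive) and the function names are assumptions made for illustration; the input, LSTM, topic and pooling layers are not modeled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def emotion_label(scores):
    """Map two raw scores to an emotion label.

    Assumed order: scores[0] is the negative class, scores[1] the positive
    class. Returns 1 for a positive emotion and 0 for a negative emotion.
    """
    probs = softmax(scores)
    return 1 if probs[1] > probs[0] else 0
```

In a full model the two scores would come from the pooling layer's output passed through a final linear transformation.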

In formula (8), SDegSim(Sen1, Sen2) indicates the text similarity with the above emotion taken into account; in the text similarity determination method that includes emotion similarity, neg(Sen1, Sen2) = 1 indicates a positive emotion and neg(Sen1, Sen2) = 0 indicates a negative emotion.

Since sentences may contain different topics, the more the topics differ, the lower the inter-sentence similarity will be. For 2 sentences with low similarity, it is therefore unnecessary to blend in modifiers when calculating the sentence similarity. It should be emphasized that a threshold can be set for CDegSim(xi, yj): when CDegSim(xi, yj) is greater than the threshold, the positive and negative emotion words are taken into account; when it is less than or equal to the threshold, the influence of positive or negative emotion words on sentence similarity is not considered. This threshold may be set to 0.8, for example. Blending the modifiers into the syntactic structure takes into account the similarity between the main components of the sentence structure and the synonym, near-synonym, antonym and other relations among words, and in particular the positive and negative emotion words and the degree adverbs, so that the similarity calculation result is closer to the manually judged value; this can be expressed by formula (9).

In formula (9), Sim(Sen1, Sen2) represents the similarity between the two texts. Formula (9) states that, for 2 texts, whether emotion similarity is considered when calculating the similarity depends on the similarity value calculated from grammar, topic and degree adverbs: emotion similarity is added only when that value exceeds a threshold; otherwise, no emotion similarity is added. When the grammar, topics and so on of 2 texts are far apart, the texts are usually dissimilar, so there is no need to consider emotion similarity.
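The threshold rule described for formula (9) can be sketched as a simple gate. This is a reading of the prose, not a reproduction of the formula itself, which is not available in this text: it assumes that Sim equals SDegSim when CDegSim exceeds the threshold and equals CDegSim otherwise. The function and parameter names are hypothetical; the default threshold of 0.8 is the example value given in the text.

```python
def final_similarity(cdeg_sim_value, sdeg_sim_value, threshold=0.8):
    """Gate emotion similarity on the grammar/topic/degree-adverb similarity.

    cdeg_sim_value -- CDegSim(Sen1, Sen2), similarity without emotion
    sdeg_sim_value -- SDegSim(Sen1, Sen2), similarity with emotion considered
    threshold      -- emotion is considered only strictly above this value
    """
    if cdeg_sim_value > threshold:
        return sdeg_sim_value   # texts are close enough: use emotion-aware score
    return cdeg_sim_value       # texts are far apart: emotion is not considered
```

A value exactly equal to the threshold falls into the second branch, matching the "less than or equal to the threshold" wording of the text.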

In summary, the embodiments of the present invention provide a new method for determining text similarity, which comprehensively considers the grammar similarity and the topic similarity between two texts when determining their similarity. Compared with the prior art, in which the similarity between two texts is determined only from the overlap of the participles segmented from them, the obtained similarity is more accurate because the comprehensive information of the texts is considered, and it better reflects how similar the texts are.

When calculating the similarity of two texts, the embodiment of the invention considers information such as sentence length, word form, word order and the grammatical structure of the sentence, and mainly considers the following three aspects: (1) the similarity of the grammar structures contained in the two sentences, (2) the similarity between word sets with the same grammar structure, and (3) the similarity of the positions of the segmented word groups in the sentences. For example, the 2 sentences 'I chased a dog today' and 'a dog chased me today' have opposite meanings according to human subjective judgment, but most current similarity algorithms give them a very high similarity value, which is obviously inaccurate. The method for determining text similarity provided by the embodiment of the invention can effectively eliminate such misjudgment of sentence similarity caused by neglecting grammar information.

The embodiment of the invention models the text with a topic model and exploits the statistical characteristics of the text, which can effectively reduce the dimensionality of the text representation, can handle synonyms and polysemous words, and does not require a similarity calculation method that relies on an external dictionary.

The embodiment of the invention also introduces the judgment of sentence emotional tendency. From the perspective of human thinking and cognition, this makes the similarity measure for sentences that describe the same subject but express opposite attitudes more consistent with human language use and semantic understanding. (1) The judgment of negative words is added. For example, the two sentences 'I like sports' and 'I do not like sports' have opposite meanings according to human subjective judgment, but most existing algorithms judge them to be highly similar, which is unreasonable. (2) The influence of degree adverbs on the emotion of a sentence is considered. For example, of 'I am happy' and 'I am especially happy', the latter expresses a stronger positive emotion because a degree adverb is added. (3) The influence of the positions of negative words and degree adverbs on sentence emotion similarity is considered. For example, 'I really do not pay attention' and 'I do not really pay attention' consist of exactly the same words, but the positions of the negative word and the degree adverb differ; by subjective judgment, the first sentence expresses a very strong negative emotion and the second expresses a weak positive emotion, so the similarity of the two sentences differs greatly. The text similarity determination method provided by the embodiment of the invention takes this into account, so the determined similarity is more accurate.

The following describes the apparatus provided by the embodiment of the present invention with reference to the drawings.

Referring to fig. 2, based on the same inventive concept, an embodiment of the present invention provides a text similarity determination apparatus, including: an acquisition unit 201, a first determination unit 202, and a second determination unit 203. The acquiring unit 201 is configured to acquire a first text and a second text with similarity to be determined. The first determining unit 202 is configured to determine a grammar similarity and a topic similarity of the first text, and determine a grammar similarity and a topic similarity of the second text. The second determining unit 203 is configured to determine a similarity between the first text and the second text according to the determined grammar similarity and the topic similarity.

Optionally, the first determining unit 202 is specifically configured to:

mapping the first text and the second text to a theme space respectively; the first text and the second text respectively correspond to at least one theme;

acquiring at least one first theme vector corresponding to a first text and at least one second theme vector corresponding to a second text which are mapped to a theme space;

determining the topic similarity of the first text and the second text according to at least one first topic vector, at least one second topic vector and a first preset rule;

wherein, the first preset rule is as follows:

Optionally, the first determining unit 202 is specifically configured to:

segmenting sentences in the first text to obtain a first word segmentation set, and segmenting sentences in the second text to obtain a second word segmentation set;

Optionally, the syntax structure includes at least one syntax structure type, and the second determining unit 203 is specifically configured to:

wherein the second rule is:

Optionally, the first determining unit 202 is specifically configured to:

determining the position similarity of the participles in the first participle set and the second participle set before determining the similarity between the first text and the second text according to the determined grammar similarity and the determined theme similarity; the position similarity is used for indicating the similarity of the position of a participle in a sentence in the text;

the second determining unit 203 is specifically configured to:

determining the similarity between the first text and the second text through a third rule, wherein the third rule is as follows:

S1(Sen1, Sen2) = a * S_topic + (1 - a) * (b * S_grammer + (1 - b) * S_position)

wherein S1(Sen1, Sen2) indicates the similarity between the two texts, S_position indicates the similarity of the positions of the participles included in the two texts, a indicates the topic weight and b indicates the grammar type weight.
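The third rule is a nested weighted average of the topic, grammar and position similarities and can be sketched as follows. This is a minimal illustration only: the three component similarities are taken as precomputed inputs, and the default weights `a = 0.5` and `b = 0.5` are hypothetical values that the source does not specify.

```python
def s1(s_topic, s_grammer, s_position, a=0.5, b=0.5):
    """Third rule: a*S_topic + (1 - a)*(b*S_grammer + (1 - b)*S_position).

    a -- topic weight (hypothetical default)
    b -- grammar type weight (hypothetical default)
    """
    for name, weight in (("a", a), ("b", b)):
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return a * s_topic + (1.0 - a) * (b * s_grammer + (1.0 - b) * s_position)
```

Because the weights at each level sum to 1, the result stays within [0, 1] whenever the three component similarities do.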

Optionally, the determining device further includes a third determining unit, specifically configured to determine the emotion similarity between the first text and the second text after acquiring the first text and the second text of which the similarity is to be determined;

the second determining unit 203 is further specifically configured to:

and determining the similarity between the first text and the second text according to the determined grammar similarity, the subject similarity, the position similarity and the emotion similarity.

Optionally, the third determining unit is specifically configured to:

extracting at least one degree adverb in the first text and the second text, wherein the degree adverb is used for indicating an adverb which limits the content in degree;

wherein, the fourth preset rule is:

Optionally, the second determining unit 203 is specifically configured to:

analyzing the determined grammar similarity, theme similarity, position similarity and emotion similarity through a similarity model to determine the similarity between the first text and the second text;

The device may be configured to execute the method provided in the embodiment shown in fig. 1. For the functions that can be realized by each functional module of the device, reference may therefore be made to the description of the embodiment shown in fig. 1, which is not repeated here.

Referring to fig. 3, an embodiment of the present invention further provides a device for determining text similarity, where the device for determining text similarity includes: at least one processor 301, and a memory 302 coupled to the at least one processor 301. Wherein the memory 302 stores instructions executable by the at least one processor 301, and the at least one processor 301 executes the instructions stored in the memory 302 to perform the method shown in fig. 1.

In a Specific implementation, each processor 301 may be specifically a central processing unit, an Application Specific Integrated Circuit (ASIC), one or more Integrated circuits for controlling program execution, a hardware Circuit developed by using a Field Programmable Gate Array (FPGA), or a baseband processor.

The memory 302 may include a Read-Only Memory (ROM), a Random Access Memory (RAM), and a disk memory, and is used for storing data required by the processor 301 during operation. The number of memories 302 is one or more. The memory 302 is also shown in fig. 3, but it should be understood that the memory 302 is not a mandatory functional module and is therefore shown in fig. 3 by a dotted line.

Based on the same inventive concept, embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed on a computer, cause the computer to perform the method shown in fig. 1.

In particular implementations, the computer-readable storage medium includes various storage media capable of storing program code, such as a Universal Serial Bus (USB) flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, and an optical disk.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, so that any simple modification, equivalent change and modification made to the above embodiment according to the technical spirit of the present invention will still fall within the scope of the technical solution of the present invention without departing from the content of the technical solution of the present invention.

Claims

1. A method for determining text similarity includes:

determining the similarity between the first text and the second text according to the determined grammar similarity and the determined theme similarity;

determining topic similarity of the first text and the second text, including:

wherein the first preset rule is as follows:

wherein S_topic indicates the topic similarity of the two texts, A indicates a first topic vector, B indicates a second topic vector, A_i indicates the i-th first topic vector, B_i indicates the i-th second topic vector, n indicates the number of first topic vectors or second topic vectors, and i is greater than or equal to 1 and less than or equal to n;

determining grammatical similarity of the first text and the second text, including:

determining the grammar similarity of the first text and the second text according to the determined grammar structure composition of the sentences in the first word segmentation set and the second word segmentation set;

the grammar structure comprises at least one grammar structure type, and the grammar similarity of the first text and the second text is determined according to the determined grammar structure composition of the sentences in the first word segmentation set and the second word segmentation set, and the grammar similarity comprises the following steps:

wherein the second rule is:

wherein S_grammer indicates the grammar similarity between the two texts, sameCount indicates the number of identical grammar structure types in the first word segmentation set and the second word segmentation set, m is the number of grammar structure types included in the first word segmentation set, and n is the number of grammar structure types included in the second word segmentation set;

before determining the similarity between the first text and the second text according to the determined grammar similarity and the topic similarity, the method further comprises the following steps:

determining the similarity between the first text and the second text according to the determined grammar similarity, the determined subject similarity and the determined position similarity;

S1(Sen1, Sen2) = a * S_topic + (1 - a) * (b * S_grammer + (1 - b) * S_position)

wherein S1(Sen1, Sen2) indicates the similarity between the two texts, S_position indicates the similarity of the positions of the participles included in the two texts, a indicates the topic weight and b indicates the grammar type weight;

after acquiring the first text and the second text with the similarity to be determined, the method further comprises the following steps:

determining the emotional similarity of the first text and the second text;

determining similarity between the first text and the second text according to the determined grammar similarity, theme similarity, position similarity and emotion similarity;

determining emotional similarity of the first text and the second text, including:

wherein the fourth preset rule is:

2. The method of claim 1, wherein determining the similarity between the first text and the second text based on the determined grammar similarity, topic similarity, position similarity, and emotion similarity comprises:

3. A text similarity determination device, for implementing the method of claim 1 or 2, comprising: