CN112507688A - Text similarity analysis method and device, electronic equipment and readable storage medium - Google Patents

Text similarity analysis method and device, electronic equipment and readable storage medium

Info

Publication number
CN112507688A
Authority
CN
China
Prior art keywords
triple
word
pairing
triplet
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011488930.8A
Other languages
Chinese (zh)
Inventor
徐欣辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Digital Media Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Digital Media Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Digital Media Co Ltd, MIGU Culture Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202011488930.8A priority Critical patent/CN112507688A/en
Publication of CN112507688A publication Critical patent/CN112507688A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a text similarity analysis method and device, electronic equipment and a readable storage medium, and relates to the technical field of information analysis. The text similarity analysis method comprises the following steps: acquiring a first sentence and a second sentence, wherein the first sentence and the second sentence both comprise at least two words; generating a first triple set of the first sentence and a second triple set of the second sentence, wherein each of the first triple set and the second triple set comprises at least one triple, and each triple comprises two words and a grammatical relation between the two words; and acquiring the text similarity of the first sentence and the second sentence according to the triples in the first triple set and the triples in the second triple set. The technical scheme provided by the application can solve the problem in the prior art that the accuracy of sentence similarity analysis results is low.

Description

Text similarity analysis method and device, electronic equipment and readable storage medium
Technical Field
The application relates to the technical field of information analysis, in particular to a text similarity analysis method and device, electronic equipment and a readable storage medium.
Background
Sentences, as structural units above words and below paragraphs, play an important role in various language processing tasks, and sentence similarity analysis is becoming one of the important directions of text research. At present, whether two sentences are similar is generally analyzed at the word level: for each word in one sentence, semantically similar words are searched for in the other sentence, and the similarity between the two sentences is calculated from these semantically similar words to judge whether the two sentences are similar. However, due to the complexity of sentence semantics, such word-level analysis of whether two sentences are similar generally has low accuracy.
Disclosure of Invention
The embodiment of the application provides a text similarity analysis method and device, an electronic device and a readable storage medium, which can solve the problem that in the prior art, the accuracy of an analysis result of sentence similarity is low.
In order to solve the technical problem, the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a text similarity analysis method, including:
acquiring a first sentence and a second sentence, wherein the first sentence and the second sentence both comprise at least two words;
generating a first triple set of the first sentence and a second triple set of the second sentence, wherein each of the first triple set and the second triple set comprises at least one triple, and the triple comprises two words and a grammatical relation between the two words;
and acquiring the text similarity of the first sentence and the second sentence according to the triples in the first triple set and the triples in the second triple set.
Optionally, a triplet in the first triplet set is a first triplet, and a triplet in the second triplet set is a second triplet;
the obtaining the text similarity between the first sentence and the second sentence according to the triples in the first triple set and the triples in the second triple set includes:
combining each first triple in the first triple set with each second triple in the second triple set to obtain a plurality of pairing triples; the pairing triplet comprises a first triplet and a second triplet;
acquiring the similarity value of each pairing triple;
and acquiring the text similarity of the first sentence and the second sentence based on the similarity value of each pairing triple.
Optionally, the obtaining the similarity value of each pairing-triplet includes:
obtaining a word matching score of each pairing triple based on two words in a first triple and two words in a second triple in each pairing triple;
obtaining a syntax relationship matching score of each pairing triple based on the syntax relationship in the first triple and the syntax relationship in the second triple in each pairing triple;
calculating a similarity value for each of the pair triplets based on the word matching score and the grammatical relationship matching score.
Optionally, each pairing triple includes a first pairing word and a second pairing word, wherein the first pairing word consists of the first word in the first triple and one of the third word and the fourth word in the second triple that constitute the pairing triple, and the second pairing word consists of the second word in the first triple and the other of the third word and the fourth word in the second triple;
the obtaining a term matching score for each paired triple based on two terms in the first triple and two terms in the second triple in each paired triple comprises:
based on a cosine similarity algorithm of word vectors, obtaining a first score of a first pairing word and a second score of a second pairing word in each pairing triple;
and performing weighted summation calculation on the first score and the second score to obtain the word matching score of each pairing triple.
Optionally, each pair triplet includes a third pair word, and the third pair word includes a first phrase and a second phrase, where the first phrase is a first word and a second word in a first triplet that constitutes the pair triplet, and the second phrase is a third word and a fourth word in a second triplet that constitutes the pair triplet;
the obtaining a term matching score for each paired triple based on two terms in the first triple and two terms in the second triple in each paired triple comprises:
and acquiring a third score of a third pairing word in each pairing triple based on a cosine similarity algorithm of the word vector, wherein the third score is a word matching score of the corresponding pairing triple.
Optionally, the obtaining the text similarity between the first sentence and the second sentence based on the similarity value of each pairing triplet includes:
obtaining a target pairing triple with the highest similarity value in the pairing triples formed by the target first triple and each second triple;
determining a target pairing triple corresponding to each first triple;
acquiring a weight value corresponding to each target pairing triple based on a preset sentence weight value table;
and acquiring the text similarity of the first sentence and the second sentence based on the similarity value of the target pairing triple and the corresponding weight value.
Optionally, the obtaining the text similarity between the first sentence and the second sentence based on the similarity value of the target pairing triple and the corresponding weight value includes:
acquiring the number of preset words included in the target pairing triple, and determining the weight attenuation coefficient of the target pairing triple;
attenuating the weight value corresponding to the target pairing triple based on the weight attenuation coefficient;
and acquiring the text similarity of the first sentence and the second sentence based on the similarity value of the target pairing triple and the attenuated weight value.
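The selection of target pairing triples described above can be sketched as follows. This is a minimal illustration only: the function name `select_target_pairings`, the matrix layout `similarity[i][j]`, and the numeric values are assumptions, and the weight table and attenuation steps are deliberately left out because their details are not given at this point in the text.

```python
def select_target_pairings(similarity):
    """For each first triple, keep the second triple with the highest similarity.

    similarity[i][j] is the similarity value of the pairing formed by
    first triple i and second triple j (hypothetical layout).
    Returns one (i, j, value) tuple per first triple.
    """
    targets = []
    for i, row in enumerate(similarity):
        # Index of the best-matching second triple; ties keep the first one.
        j = max(range(len(row)), key=row.__getitem__)
        targets.append((i, j, row[j]))
    return targets


# Made-up similarity values for 3 first triples x 3 second triples.
similarity = [
    [0.6, 0.9, 0.1],
    [0.8, 0.5, 0.2],
    [0.3, 0.3, 0.7],
]
targets = select_target_pairings(similarity)
```

Each first triple contributes exactly one target pairing triple, so the number of targets equals the size of the first triple set.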
In a second aspect, an embodiment of the present application provides a text similarity analysis apparatus, including:
a first obtaining module, configured to obtain a first sentence and a second sentence, where the first sentence and the second sentence both include at least two words;
a generating module, configured to generate a first triple set of the first sentence and a second triple set of the second sentence, where each of the first triple set and the second triple set includes at least one triple, and the triple includes two words and a grammatical relationship between the two words;
and a second obtaining module, configured to obtain the text similarity between the first sentence and the second sentence according to the triples in the first triple set and the triples in the second triple set.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, and a program or instructions stored on the memory and executable on the processor, where the program or instructions, when executed by the processor, implement the steps of the text similarity analysis method as described in the first aspect.
In a fourth aspect, the present application provides a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the text similarity analysis method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the text similarity analysis method according to the first aspect.
In the embodiment of the application, by generating the first triple set of the first sentence and the second triple set of the second sentence, the triple includes two words and a grammatical relationship between the two words, so that when the text similarity of the first sentence and the second sentence is obtained according to the first triple set and the second triple set, not only the similarity of the words in the first sentence but also the similarity of the grammatical relationship between the words are obtained, and the semantics of the sentences can be better considered based on the grammatical relationship, so that the accuracy of analyzing the similarity between the two sentences is further improved.
Drawings
Fig. 1 is a flowchart of a text similarity analysis method provided in an embodiment of the present application;
Fig. 1a is a flowchart of another text similarity analysis method provided in an embodiment of the present application;
Fig. 2 is a block diagram of a text similarity analysis apparatus according to an embodiment of the present application;
Fig. 3 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms first, second and the like in the description and in the claims of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that embodiments of the application may be practiced in sequences other than those illustrated or described herein, and that the terms "first," "second," and the like are generally used herein in a generic sense and do not limit the number of terms, e.g., the first term can be one or more than one. In addition, "and/or" in the specification and claims means at least one of connected objects, a character "/" generally means that a preceding and succeeding related objects are in an "or" relationship.
The text similarity analysis method, the text similarity analysis device and the electronic device provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
Referring to fig. 1, fig. 1 is a flowchart of a text similarity analysis method provided in an embodiment of the present application, where the text similarity analysis method may be applied to electronic devices such as a computer, a tablet computer, and a mobile phone.
As shown in fig. 1, the text similarity analysis method includes the following steps:
Step 101, obtaining a first sentence and a second sentence, wherein the first sentence and the second sentence both comprise at least two words.
In this embodiment of the present application, the first sentence and the second sentence may be two sentences in the same text, or may also be two sentences in different texts. For example, when the electronic device displays a novel text, if the electronic device receives a text similarity analysis instruction from a user, two sentences in the novel text currently displayed may be randomly selected as a first sentence and a second sentence. Or, the user may select the first sentence and the second sentence and input the first sentence and the second sentence into a specific text similarity analysis program, and then the electronic device acquires the first sentence and the second sentence. Of course, the first sentence and the second sentence may be obtained in other manners, and this embodiment is not particularly limited.
For example, if the first sentence is "the third middle school squad beats the fourth middle school squad", the first sentence may be subjected to word segmentation and split into words including "beat", "third middle school squad", and "fourth middle school squad".
Step 102, generating a first triple set of the first sentence and a second triple set of the second sentence.
The first triple set comprises at least one triple, and the triples in the first triple set are subsequently called first triples; and the second triple set also comprises at least one triple, and the triples in the second triple set are subsequently called second triples. Each triple comprises two words and a grammatical relation between the two words. That is, each first triplet includes two words and a grammatical relationship between the two words, and each second triplet also includes two words and a grammatical relationship between the two words.
When generating the first triple set of the first sentence, the first sentence may be split into a plurality of words, then each two of the words are combined, and a grammatical relationship between the words combined two by two is added, so as to obtain a plurality of first triples, and then the first triple set is obtained through the first triples. Similarly, a second triple set corresponding to the second sentence can be obtained in the same manner.
For example, if the first sentence is "the third middle school squad beats the fourth middle school squad", the first sentence may be split into three words, "third middle school squad", "beat", and "fourth middle school squad". By combining these three words two by two, a plurality of first triples may be generated, for example: the first first triple R1 = (third middle school squad, subject-predicate relation, beat), the second first triple R2 = (fourth middle school squad, verb-object relation, beat), and the third first triple R3 = (third middle school squad, subject-object relation, fourth middle school squad). The first triple set then includes at least R1, R2, and R3.
Of course, the arrangement order of the two words and the grammatical relation within each triple is not limited. For example, R1 may be written as R1 = (third middle school squad, beat, subject-predicate relation) or R1 = (beat, third middle school squad, subject-predicate relation).
Assuming that the second sentence is "the fourth middle school squad beats the third middle school squad", it may likewise be split into words including "beat", "third middle school squad", and "fourth middle school squad", and a plurality of second triples are generated from these words, for example: the first second triple R1' = (third middle school squad, verb-object relation, beat), the second second triple R2' = (fourth middle school squad, subject-predicate relation, beat), and the third second triple R3' = (fourth middle school squad, subject-object relation, third middle school squad). The second triple set then includes at least R1', R2', and R3'.
Thus, by obtaining the first sentence and the second sentence, performing word segmentation processing on the first sentence and the second sentence respectively, and combining the words obtained after the word segmentation processing, a first triple set including at least one first triple and a second triple set including at least one second triple can be obtained.
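As a minimal sketch of the triple structure described above, a triple can be modelled as two words plus a relation label. The `Triple` type, the `build_triple_set` helper, the relation label strings, and the hand-written relation table are all assumptions for illustration; the table merely stands in for a dependency parser, which is not shown.

```python
from itertools import combinations
from typing import NamedTuple


class Triple(NamedTuple):
    word_a: str
    relation: str
    word_b: str


def build_triple_set(words, relations):
    """Combine the words two by two and attach the relation of each pair.

    `relations` maps an unordered word pair (a frozenset) to a relation
    label; pairs without a known relation are skipped.
    """
    triples = []
    for a, b in combinations(words, 2):
        label = relations.get(frozenset((a, b)))
        if label is not None:
            triples.append(Triple(a, label, b))
    return triples


# Words of the first example sentence after word segmentation.
words_1 = ["third middle school squad", "beat", "fourth middle school squad"]
# Hypothetical relation table standing in for a parser's output.
relations_1 = {
    frozenset(("third middle school squad", "beat")): "subject-predicate",
    frozenset(("fourth middle school squad", "beat")): "verb-object",
    frozenset(("third middle school squad", "fourth middle school squad")): "subject-object",
}
first_triple_set = build_triple_set(words_1, relations_1)
```

The order of the two words inside a triple follows the segmentation order here, which is acceptable because the arrangement order within a triple is not limited.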
Step 103, obtaining the text similarity between the first sentence and the second sentence according to the triples in the first triple set and the triples in the second triple set.
Optionally, the text similarity analysis is performed on a first triple included in the first triple set and a second triple included in the second triple set, so as to obtain the text similarity between the first sentence and the second sentence.
For example, the first of the first triples in the first triple set may be compared one by one with each second triple in the second triple set, and text similarity analysis performed to obtain similarity values; then the second of the first triples is compared one by one with each second triple in the same way, and so on, until a similarity value between each first triple and each second triple is obtained. From all the obtained similarity values, the similarity value of the first sentence and the second sentence can then be calculated through a weighted average algorithm, giving the text similarity of the first sentence and the second sentence.
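The weighted average mentioned above can be sketched as follows. The function name `weighted_average`, the default of uniform weights, and the sample numbers are assumptions; the text does not say how the weights are chosen at this point.

```python
def weighted_average(similarities, weights=None):
    """Weighted average of similarity values; uniform weights if none are given."""
    if not similarities:
        raise ValueError("at least one similarity value is required")
    if weights is None:
        weights = [1.0] * len(similarities)
    if len(weights) != len(similarities):
        raise ValueError("weights and similarities must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(s * w for s, w in zip(similarities, weights)) / total


# Made-up similarity values for four triple comparisons, with made-up weights.
values = [1.0, 0.5, 0.0, 0.5]
weights = [2.0, 1.0, 1.0, 0.0]
sentence_similarity = weighted_average(values, weights)
```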
In the embodiment of the application, by generating the first triple set of the first sentence and the second triple set of the second sentence, the triple includes two words and a grammatical relationship between the two words, so that when the text similarity analysis is performed on the first triple set and the second triple set, not only the similarity of the words in the first sentence is analyzed, but also the similarity of the grammatical relationship between the words is analyzed, the semantics of the sentences can be better considered based on the grammatical relationship, and thus the accuracy of the similarity analysis between the two sentences is further improved.
Optionally, the step 103 may include:
combining each first triple in the first triple set with each second triple in the second triple set to obtain a plurality of pairing triples; the pairing triplet comprises a first triplet and a second triplet;
acquiring the similarity value of each pairing triple;
and acquiring the text similarity of the first sentence and the second sentence based on the similarity value of each pairing triple.
For example, if the first triple set includes three first triples R1, R2, and R3, and the second triple set includes three second triples R1', R2', and R3', the three first triples and the three second triples are paired one by one to obtain a plurality of pairing triples: R1R1', R1R2', R1R3', R2R1', R2R2', R2R3', R3R1', R3R2', and R3R3'. The pairing triples obtained in this way are more comprehensive, which facilitates the analysis of the similarity of the first sentence and the second sentence. Further, a similarity value of each pairing triple is obtained, and the text similarity of the first sentence and the second sentence is obtained based on the similarity value of each pairing triple.
It is understood that each pairing triple includes a first triple of the first sentence and a second triple of the second sentence, and each triple includes two words and a grammatical relationship between the two words, so the similarity value of a pairing triple can reflect the text similarity between the first sentence and the second sentence to some extent. In the embodiment of the application, the multiple pairing triples are obtained by combining each first triple in the first triple set with each second triple in the second triple set. The pairing triples obtained in this way cover the combinations between the words of the first sentence and the words of the second sentence more comprehensively, so that the analysis of the text similarity of the first sentence and the second sentence is more accurate.
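The exhaustive pairing described above is a Cartesian product of the two triple sets. A minimal sketch, in which the string labels stand in for actual triples and the helper name `pair_triples` is hypothetical:

```python
from itertools import product


def pair_triples(first_set, second_set):
    """Pair every first triple with every second triple."""
    return list(product(first_set, second_set))


# Labels stand in for the triples of the running example.
first_triple_set = ["R1", "R2", "R3"]
second_triple_set = ["R1'", "R2'", "R3'"]
paired = pair_triples(first_triple_set, second_triple_set)
```

With three triples on each side this yields nine pairing triples, in the same order as the list R1R1', R1R2', and so on given in the example.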
Optionally, the obtaining the similarity value of each pairing-triplet includes:
obtaining a word matching score of each pairing triple based on two words in a first triple and two words in a second triple in each pairing triple;
obtaining a syntax relationship matching score of each pairing triple based on the syntax relationship in the first triple and the syntax relationship in the second triple in each pairing triple;
calculating a similarity value for each of the pair triplets based on the word matching score and the grammatical relationship matching score.
It is to be understood that each pair triplet includes a first triplet and a second triplet, the first triplet includes two words and a grammatical relationship between the two words, and the second triplet also includes two words and a grammatical relationship between the two words, then the word matching score and the grammatical relationship matching score in the two triplets may be calculated respectively, and based on the word matching score and the grammatical relationship matching score, the similarity value of one pair triplet can be obtained.
It should be noted that the first triple includes two words and the second triple also includes two words, so a pairing triple includes four words, and the four words may be matched in different combinations to calculate the word matching score between the first triple and the second triple.
Optionally, in an embodiment, each pairing triple includes a first pairing word and a second pairing word, wherein the first pairing word consists of the first word in the first triple and one of the third word and the fourth word in the second triple that constitute the pairing triple, and the second pairing word consists of the second word in the first triple and the other of the third word and the fourth word in the second triple;
the obtaining a term matching score for each paired triple based on two terms in the first triple and two terms in the second triple in each paired triple comprises:
based on a cosine similarity algorithm of word vectors, obtaining a first score of a first pairing word and a second score of a second pairing word in each pairing triple;
and performing weighted summation calculation on the first score and the second score to obtain the word matching score of each pairing triple.
For example, the first sentence is "the third middle school squad beats the fourth middle school squad", and the first triple is R1 = (third middle school squad, subject-predicate relation, beat); the second sentence is "the fourth middle school squad beats the third middle school squad", and the second triple is R1' = (third middle school squad, verb-object relation, beat), giving the pairing triple R1R1'. Then the first pairing word is slot1 = (third middle school squad, third middle school squad) and the second pairing word is slot2 = (beat, beat); or slot1 = (third middle school squad, beat) and slot2 = (beat, third middle school squad). Based on a cosine similarity algorithm of word vectors, a first score of the first pairing word slot1 and a second score of the second pairing word slot2 are calculated, and the first score and the second score are weighted and summed to obtain the word matching score of the pairing triple.
Specifically, for the above pairing triple R1R1', with the first pairing word slot1 = (third middle school squad, third middle school squad) and the second pairing word slot2 = (beat, beat), the first score $score_{slot1}$ of the first pairing word slot1 is calculated as
$$score_{slot1} = sim(slot_{R1,1}, slot_{R1',1}),$$
where $slot_{R1,1}$ is the first word in the first triple (e.g., "third middle school squad" in slot1 above), and $slot_{R1',1}$ is the third word or the fourth word in the second triple (e.g., "third middle school squad" in slot1 above);
the second score $score_{slot2}$ of the second pairing word slot2 is calculated as
$$score_{slot2} = sim(slot_{R1,2}, slot_{R1',2}),$$
where $slot_{R1,2}$ is the second word in the first triple (e.g., "beat" in slot2 above), and $slot_{R1',2}$ is the third word or the fourth word in the second triple (e.g., "beat" in slot2 above);
the first score and the second score are weighted and summed to obtain the word matching score $score_{word}$ of the pairing triple:
$$score_{word} = a \times score_{slot1} + (1 - a) \times score_{slot2}$$
Wherein a is a preset weighted value, and 0< a < 1. Optionally, the preset weighting value may be preset by a user.
Therefore, the word matching score of each pairing triple can be calculated through the method.
In the embodiment, a word in the first triplet is combined with a word in the second triplet to obtain a first paired word and a second paired word, and scores of the two paired words are respectively calculated by a cosine similarity algorithm of a word vector, so that a similarity score of the two paired words can be obtained, and thus similarity scores of two groups of words in the first triplet and the second triplet included in the paired triplets can be obtained, so that the similarity calculation of the words in the first sentence and the second sentence is more precise, and the accuracy of the text similarity of the two sentences can be improved.
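The slot-based word matching score can be sketched as below. The two-dimensional vectors in `TOY_VECTORS` are made up for illustration and are not trained embeddings, and the function names are hypothetical. The sketch pairs the words in their stored order (first with third, second with fourth); the alternative pairing is obtained by swapping the two words of the second triple.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_match_score(first_words, second_words, vectors, a=0.5):
    """score_word = a * score_slot1 + (1 - a) * score_slot2, with 0 < a < 1."""
    if not 0.0 < a < 1.0:
        raise ValueError("a must satisfy 0 < a < 1")
    (w1, w2), (w3, w4) = first_words, second_words
    score_slot1 = cosine_similarity(vectors[w1], vectors[w3])
    score_slot2 = cosine_similarity(vectors[w2], vectors[w4])
    return a * score_slot1 + (1 - a) * score_slot2


# Made-up word vectors, for illustration only.
TOY_VECTORS = {
    "third middle school squad": [1.0, 0.0],
    "fourth middle school squad": [0.0, 1.0],
    "beat": [1.0, 1.0],
}

r1_words = ("third middle school squad", "beat")
score_same = word_match_score(r1_words, ("third middle school squad", "beat"), TOY_VECTORS)
score_diff = word_match_score(r1_words, ("fourth middle school squad", "beat"), TOY_VECTORS)
```

With these toy vectors, identical word pairs score close to 1.0, while replacing one word with an orthogonal one halves the score when a = 0.5.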
Or, in another embodiment, each pair triplet includes a third pair word, and the third pair word includes a first phrase and a second phrase, the first phrase is a first word and a second word in a first triplet that constitutes the pair triplet, and the second phrase is a third word and a fourth word in a second triplet that constitutes the pair triplet;
the obtaining a term matching score for each paired triple based on two terms in the first triple and two terms in the second triple in each paired triple comprises:
and acquiring a third score of a third pairing word in each pairing triple based on a cosine similarity algorithm of the word vector, wherein the third score is a word matching score of the corresponding pairing triple.
For example, the first sentence is "the third middle school squad beats the fourth middle school squad", and the first triple is R1 = (third middle school squad, subject-predicate relation, beat); the second sentence is "the fourth middle school squad beats the third middle school squad", and the second triple is R1' = (third middle school squad, verb-object relation, beat), giving the pairing triple R1R1'. Then the first phrase $slotpair_{R1}$ = (third middle school squad, beat) and the second phrase $slotpair_{R1'}$, formed from the two words of R1', are obtained, and the third pairing word includes the first phrase and the second phrase. Further, a third score of the third pairing word is obtained based on a cosine similarity algorithm of word vectors, and the third score is the word matching score of the corresponding pairing triple.
Specifically, in the pairing triple R1R1', the third pairing word includes $slotpair_{R1}$ and $slotpair_{R1'}$, and the word matching score $score_{word}$ of the pairing triple is the third score $score_{pair}$ calculated for the third pairing word, as follows:
$$score_{word} = score_{pair} = sim(slotpair_{R1}, slotpair_{R1'}).$$
Therefore, the word matching score of each pairing triple can be calculated through the method.
In this embodiment, the word match score of the paired triples is obtained by combining two words in the first triplet into a first phrase, combining two words in the second triplet into a second phrase, and then calculating the similarity between the first phrase and the second phrase. Therefore, the words of the first triple and the second triple can be combined in another word combination mode to obtain the word matching score of the paired triple, so that the accuracy of the similarity analysis of the two sentence texts is improved.
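A sketch of the phrase-based variant follows. The text does not say how a phrase vector is built from its two word vectors, so averaging them is purely an assumption made here; the vectors in `TOY_VECTORS` are likewise made up, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def phrase_vector(words, vectors):
    """Average the word vectors of a phrase (an assumed composition rule)."""
    columns = zip(*(vectors[w] for w in words))
    return [sum(column) / len(words) for column in columns]


def phrase_match_score(first_words, second_words, vectors):
    """score_word = score_pair = sim(first phrase, second phrase)."""
    return cosine_similarity(
        phrase_vector(first_words, vectors),
        phrase_vector(second_words, vectors),
    )


# Made-up word vectors, for illustration only.
TOY_VECTORS = {
    "third middle school squad": [1.0, 0.0],
    "fourth middle school squad": [0.0, 1.0],
    "beat": [1.0, 1.0],
}

pair_same = phrase_match_score(
    ("third middle school squad", "beat"),
    ("third middle school squad", "beat"),
    TOY_VECTORS,
)
pair_diff = phrase_match_score(
    ("third middle school squad", "beat"),
    ("fourth middle school squad", "beat"),
    TOY_VECTORS,
)
```

Unlike the slot-based variant, this yields a single score per pairing triple, so no weighting parameter is needed.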
Optionally, in this embodiment of the present application, word vectors for calculating word matching may be obtained by word2vec training. word2vec provides two training models, CBOW and skip-gram, both of which obtain low-dimensional word vectors from the co-occurrence probability of neighboring words in sentences. The word2vec training model is composed of an input layer, a hidden layer and an output layer, and the occurrence probability of each word is predicted from the words appearing before and after it. Assuming a word sequence $w_1, \ldots, w_t$, the word vector of each word is obtained by maximizing, during training, the log probability of its neighboring words occurring, and the formula is as follows:
Figure BDA0002840168630000111
wherein nb (t) is | wtP (w) is a set of neighboring wordsi|wt) To calculate the association vector wiAnd | wtThe specific calculation principle of the hidden layer softmax function may refer to related technologies, which is not described in detail in this embodiment.
It is to be understood that the pairing triple further includes the grammatical relation in the first triple and the grammatical relation in the second triple, so a grammatical relationship score of the pairing triple also needs to be obtained.
For example, the first sentence is "the third middle school squad beats the fourth middle school squad", and the first triple is R1 = (the third middle school squad, subject-predicate relation, beat); the second sentence is "the fourth middle school squad defeats the third middle school squad", and the second triple is R1' = (the third middle school squad, verb-object relation, defeat), which yields the pairing triple R1R1'. Although the two words in the first triple and the two words in the second triple are the same, the grammatical relations of the two triples differ, and so do the meanings expressed, so the grammatical relationship score of the pairing triple also needs to be calculated. The grammatical relations in the pairing triple R1R1' are a subject-predicate relation and a verb-object relation. In this embodiment of the present application, the grammatical relationship score of the pairing triple may be calculated based on the dependency relations of the Stanford parser, as follows:
$score_{rel} = match(rel_{R1}, rel_{R1'})$;
where $score_{rel}$ is the grammatical relationship score of the pairing triple, $rel_{R1}$ is the grammatical relation of the first triple in the pairing triple (e.g., the subject-predicate relation of the first triple in R1R1'), and $rel_{R1'}$ is the grammatical relation of the second triple in the pairing triple (e.g., the verb-object relation of the second triple in R1R1').
It can be understood that after the term matching score and the grammatical relationship matching score of the pairing-triplet are obtained, the similarity value of the pairing-triplet is further calculated, which is specifically as follows:
$score_{dep} = score_{word} \times score_{rel}$
Therefore, the similarity value of each pairing triple obtained from the first sentence and the second sentence can be obtained in the above manner, and the text similarity of the first sentence and the second sentence is calculated based on the similarity value of each pairing triple.
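The product of the word matching score and the grammatical relationship score can be sketched as follows. The text does not say how match() is defined, so the scoring rule here (1.0 for identical labels, an optional partial credit for label pairs declared related, 0.0 otherwise) is purely an assumption, as are the function names. The labels `nsubj`, `nsubjpass` and `dobj` are Stanford dependency labels used only as sample input.

```python
def relation_match(rel_first, rel_second, partial_credit=0.5, related=frozenset()):
    """Hypothetical match(): 1.0 for identical labels, partial credit for
    label pairs listed in `related`, 0.0 otherwise."""
    if rel_first == rel_second:
        return 1.0
    if frozenset((rel_first, rel_second)) in related:
        return partial_credit
    return 0.0


def pairing_similarity(score_word, rel_first, rel_second, **match_options):
    """score_dep = score_word * score_rel."""
    return score_word * relation_match(rel_first, rel_second, **match_options)


# Same words, same relation: the word score passes through unchanged.
same_relation = pairing_similarity(0.99, "nsubj", "nsubj")
# Same words, different relation: the pairing is scored 0 under this rule.
different_relation = pairing_similarity(0.99, "nsubj", "dobj")
```

Under this assumed rule, the squad example above scores 0 despite the near-identical words, because the subject-predicate and verb-object relations do not match.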
Optionally, the obtaining the text similarity between the first sentence and the second sentence based on the similarity value of each pairing triplet includes:
obtaining a target pairing triple with the highest similarity value in the pairing triples formed by the target first triple and each second triple;
determining a target pairing triple corresponding to each first triple;
acquiring a weight value corresponding to each target pairing triple based on a preset statement weight value table;
and acquiring the text similarity of the first statement and the second statement based on the similarity value of the target pairing triple and the corresponding weight value.
It is to be appreciated that the first sentence can generate a first triple set that includes at least one first triple, and the target first triple is any one of the first triples in that set. In this embodiment of the application, combining each first triple in the first triple set with each second triple in the second triple set yields a plurality of pairing triples; the target first triple is likewise paired with each second triple, and the similarity value of each resulting pairing triple is calculated. Because the target first triple is not equally similar to every second triple, the similarity values of the pairing triples formed from the target first triple differ, and the pairing triple with the highest similarity value is determined as the target pairing triple; the target pairing triple corresponding to each first triple is obtained in the same manner. Alternatively, one triple of the second triple set may be taken as a target second triple to determine a target pairing triple, and the target pairing triple corresponding to each second triple is obtained in the same manner. In this way, by comparing the similarity values one by one, the second triple most similar to each first triple can be determined more accurately.
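The selection of a target pairing triple for each first triple can be sketched as a maximum over the second triples. The triples, the label strings and the `toy_similarity` function below are hypothetical stand-ins for illustration; any function returning a similarity value for a pairing triple could be supplied instead.

```python
def select_target_pairings(first_triples, second_triples, similarity):
    """For each first triple, keep the pairing with the highest similarity value."""
    targets = []
    for first in first_triples:
        best = max(second_triples, key=lambda second: similarity(first, second))
        targets.append((first, best, similarity(first, best)))
    return targets


def toy_similarity(first, second):
    """Toy stand-in: word score on the first slot times an exact relation match."""
    word_score = 1.0 if first[0] == second[0] else 0.2
    rel_score = 1.0 if first[1] == second[1] else 0.0
    return word_score * rel_score


# Triples are (word, relation, word); all values are illustrative.
first_triples = [("squad3", "nsubj", "beat"), ("squad4", "dobj", "beat")]
second_triples = [("squad4", "nsubj", "defeat"), ("squad3", "dobj", "defeat")]

targets = select_target_pairings(first_triples, second_triples, toy_similarity)
```

In this toy run each first triple is paired with the second triple that shares its grammatical relation, not the one that shares its word, because a relation mismatch zeroes the similarity value.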
In the embodiment of the application, the weight value corresponding to each pairing triple can be set in advance according to the importance degree of the words or the grammatical relations included in the pairing triples in the first statement or the second statement, so that the preset statement weight value table can be obtained. The preset sentence weight value table also comprises a weight value corresponding to each pairing triple, so that the weight value corresponding to each target pairing triple can be obtained, and the text similarity between the first sentence and the second sentence is calculated based on the similarity value of the target pairing triple and the corresponding weight value.
Optionally, the text similarity $score_{sent}$ of the first sentence and the second sentence is calculated as follows:
Figure BDA0002840168630000131
Where $score_{highdep}$ is the similarity value of a target pairing triple; edgeweight is the weight value corresponding to that target pairing triple; and sentl denotes the longer of the first sentence and the second sentence. It can be understood that the analysis of implied meaning between sentences is directional: a long sentence is more likely to contain the meaning implied by a short sentence. For example, of the two sentences "zhang san zhang hospital walk" and "zhang san zhang walk", the long sentence contains the intent of the short sentence, whereas the short sentence cannot cover all the meaning implied in the long sentence.
In this way, the similarity values of all target pairing triples generated from the first sentence and the second sentence are weighted by their corresponding weight values to obtain the text similarity of the first sentence and the second sentence. In addition, in the embodiment of the application, the target pairing triple includes not only two words of the first sentence and two words of the second sentence but also the grammatical relation between the two words of each sentence, so the similarity value of the target pairing triple reflects not only the similarity of the words but also the similarity of the grammatical relations between the words, which further improves the accuracy of the similarity analysis between the two sentences.
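A weighted combination of this kind can be sketched as below. The exact formula is contained in an image that is not reproduced in this text, so the normalization by the sum of the weights is an assumption, as are the function name, the idea of keying the preset weight table by grammatical relation label, and the sample weights.

```python
def sentence_similarity(target_pairings, weight_table, default_weight=1.0):
    """Weighted average of target pairing similarity values.

    target_pairings: list of (relation_label, score_highdep) tuples, assumed
        here to hold one entry per triple of the longer sentence.
    weight_table: hypothetical preset table mapping relation label -> edgeweight.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for relation, score_highdep in target_pairings:
        edgeweight = weight_table.get(relation, default_weight)
        weighted_sum += score_highdep * edgeweight
        weight_total += edgeweight
    # Assumed normalization: divide by the total weight so the result stays in [0, 1].
    return weighted_sum / weight_total if weight_total else 0.0


weight_table = {"nsubj": 2.0, "dobj": 2.0, "amod": 1.0}
target_pairings = [("nsubj", 0.9), ("dobj", 0.5), ("amod", 1.0)]
score_sent = sentence_similarity(target_pairings, weight_table)
```

Dividing by the total weight keeps the score comparable across sentences with different numbers of triples; a different normalization, for instance by the length of the longer sentence, would fit the same description.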
Optionally, the obtaining the text similarity between the first sentence and the second sentence based on the similarity value of the target paired triple and the corresponding weight value includes:
acquiring the number of preset words included in the target pairing triple, and determining the weight attenuation coefficient of the target pairing triple;
attenuating the weight values corresponding to the target pairing triplet group based on the weight attenuation coefficient;
and acquiring the text similarity of the first statement and the second statement based on the similarity value of the target pairing triple and the attenuated weight value.
The preset word may be a word preset by a user, for example, the verb may be set as the preset word. In the embodiment of the present application, the preset word may be a core word in a sentence, for example, the core word may be a verb, a modifier, and the like, for example, the core word in "the third middle school team defeats the fourth middle school team" is "defeat", and the core word in "a beautiful piece of clothes is worn today" is "beautiful".
Optionally, after the preset words are set, the weight attenuation coefficient of the target pairing triple may be determined based on the number of the preset words included in the target pairing triple, and then the weight value corresponding to the target pairing triple is attenuated based on the weight attenuation coefficient. For example, if two words of a first triple or two words of a second triple included in the target pairing triple are both preset words, the weight value corresponding to the target pairing triple is not attenuated; if the first triple or the second triple included in the target pairing triple includes only one preset word, performing half attenuation on a weight value corresponding to the target pairing triple; and if the first triple and the second triple included in the target pairing triple do not include the preset words, the weight value corresponding to the target pairing triple is attenuated by one fourth. Further, based on the similarity value of the target paired triple and the attenuated weight value, the text similarity between the first sentence and the second sentence is obtained, and the calculation formula is as follows:
Figure BDA0002840168630000151
Where reduce(edgeweight) is the attenuated weight value of the target pairing triple; for the other parameters, refer to the description of the preceding formula. Optionally, the degree of correlation between the sentences may further be calculated using a Pearson correlation coefficient; for the calculation, refer to the related art, which is not described in this embodiment.
Therefore, the weight value corresponding to the target pairing triple is attenuated according to the number of the preset words in the target pairing triple, so that the text similarity analysis of the sentences can be influenced by different words or the number of the words in the sentences, the correlation between the text similarity analysis of the sentences and the included words is stronger, and the accuracy of the text similarity analysis of the sentences is further improved.
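The weight attenuation can be sketched as a lookup from the number of preset words to a coefficient. Several points below are assumptions: "attenuated by one fourth" is read as scaling the weight to one quarter, the count is taken as the larger of the counts in the two triples of the pairing, and the names `ATTENUATION`, `preset_word_count` and `reduce_weight` are hypothetical.

```python
# Assumed coefficients: 2 preset words -> no attenuation, 1 -> half, 0 -> one quarter.
ATTENUATION = {2: 1.0, 1: 0.5, 0: 0.25}


def preset_word_count(triple, preset_words):
    """Count how many of the two words in a (word, relation, word) triple are preset words."""
    word_a, _relation, word_b = triple
    return sum(word in preset_words for word in (word_a, word_b))


def reduce_weight(edgeweight, first_triple, second_triple, preset_words):
    """Attenuate edgeweight based on the preset words found in the pairing triple."""
    count = max(preset_word_count(first_triple, preset_words),
                preset_word_count(second_triple, preset_words))
    return edgeweight * ATTENUATION[count]


first_triple = ("squad", "nsubj", "beat")
second_triple = ("squad", "dobj", "defeat")

# Only the verbs are preset words, so each triple contains one preset word.
reduced = reduce_weight(2.0, first_triple, second_triple, {"beat", "defeat"})
```

The attenuated value would then replace edgeweight when the similarity values of the target pairing triples are combined.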
For a better understanding of the solution provided in the embodiment of the present application, please refer to fig. 1a, which is a flowchart of another text similarity analysis method provided in the embodiment of the present application. As shown in fig. 1a, once a first sentence and a second sentence are obtained, a triple set Rsent of the first sentence and a triple set Rsent' of the second sentence are generated, where Rsent includes a plurality of first triples R1, R2, R3 … Rn and Rsent' includes a plurality of second triples R1', R2', R3' … Rn'. For the manner of generating the triple sets, refer to the specific description in the embodiment of fig. 1, which is not repeated in this embodiment.
Optionally, the plurality of first triples and the plurality of second triples may be arbitrarily combined to obtain the pairing triples. As shown in fig. 1a, R1 is combined with R2', R2 with R3', R3 with R1', Rn with Rn', and so on, and each combined pairing triple, for example RnRn', is subjected to text similarity analysis. For example, a first word matching score is calculated from the word $slot_{n1}$ of the first triple Rn and the word $slot_{n'1}$ of the second triple Rn'; a second word matching score is calculated from the word $slot_{n2}$ of the first triple Rn and the word $slot_{n'2}$ of the second triple Rn'; a grammatical relationship matching score is calculated from the grammatical relation $rel_{n}$ of the first triple Rn and the grammatical relation $rel_{n'}$ of the second triple Rn'; and the similarity value of the pairing triple RnRn' is obtained based on the first word matching score, the second word matching score and the grammatical relationship matching score. In the same manner, the similarity value of each pairing triple can be obtained, and a weighted mean of these similarity values gives the similarity of the first sentence and the second sentence; for the specific calculation, refer to the description in the embodiment shown in fig. 1, which is not detailed in this embodiment. The text similarity analysis method provided by the embodiment of the application analyzes both the similarity of the words in the two sentences and the similarity of their grammatical relations, thereby improving the accuracy of text similarity analysis.
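The pairing step and the combination of the two slot-level scores can be sketched as follows. The function names are hypothetical, and the equal default weighting `alpha=0.5` is an assumption; the text only states that the first and second scores are combined by weighted summation.

```python
from itertools import product


def pairing_triples(first_triples, second_triples):
    """Combine every first triple with every second triple."""
    return list(product(first_triples, second_triples))


def word_match_score(first_score, second_score, alpha=0.5):
    """Weighted sum of the two slot-level scores; alpha is an assumed weight."""
    return alpha * first_score + (1.0 - alpha) * second_score


# Three first triples and two second triples give six pairing triples.
pairs = pairing_triples(["R1", "R2", "R3"], ["R1'", "R2'"])

# Slot 1 matches exactly and slot 2 is half similar.
combined = word_match_score(1.0, 0.5)
```

The combined word matching score would then be multiplied by the grammatical relationship matching score to give the similarity value of the pairing triple.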
It should be noted that the text similarity analysis method provided by the embodiment of the present application may be applied to the fields of machine translation, text mining, text analysis, text data acquisition, and the like, and can meet the requirements of users on text similarity analysis.
It should be noted that the text similarity analysis method provided in the embodiment of the present application may be executed by a text similarity analysis device, or by a control module in the text similarity analysis device that is used to execute the text similarity analysis method. The embodiment of the present application takes the case where the text similarity analysis device executes the text similarity analysis method as an example to describe the text similarity analysis device provided in the embodiment of the present application.
Referring to fig. 2, fig. 2 is a structural diagram of a text similarity analysis apparatus according to an embodiment of the present application. As shown in fig. 2, the text similarity analysis apparatus 200 includes:
a first obtaining module 201, configured to obtain a first sentence and a second sentence, where the first sentence and the second sentence both include at least two words;
a generating module 202, configured to generate a first triple set of the first sentence and a second triple set of the second sentence, where each of the first triple set and the second triple set includes at least one triple, and the triple includes two words and a grammatical relationship between the two words;
a second obtaining module 203, configured to obtain a text similarity between the first sentence and the second sentence according to the triples in the first triple set and the triples in the second triple set.
Optionally, a triplet in the first triplet set is a first triplet, and a triplet in the second triplet set is a second triplet;
the second obtaining module 203 comprises:
a matching sub-module, configured to combine each first triple in the first triple set with each second triple in the second triple set to obtain a plurality of matching triples; the pairing triplet comprises a first triplet and a second triplet;
the obtaining submodule is used for obtaining the similarity value of each pairing triple;
and the analysis submodule is used for acquiring the text similarity of the first sentence and the second sentence based on the similarity value of each pairing triple.
Optionally, the obtaining sub-module is further configured to:
obtaining a word matching score of each pairing triple based on two words in a first triple and two words in a second triple in each pairing triple;
obtaining a syntax relationship matching score of each pairing triple based on the syntax relationship in the first triple and the syntax relationship in the second triple in each pairing triple;
calculating a similarity value for each of the pair triplets based on the word matching score and the grammatical relationship matching score.
Optionally, each pairing triple includes a first pairing word and a second pairing word, where the first pairing word consists of a first word in the first triple and one of a third word and a fourth word in the second triple, and the second pairing word consists of a second word in the first triple and the other of the third word and the fourth word in the second triple, the first triple and the second triple being those that constitute the pairing triple;
the acquisition sub-module is further configured to:
based on a cosine similarity algorithm of word vectors, obtaining a first score of a first pairing word and a second score of a second pairing word in each pairing triple;
and performing weighted summation calculation on the first score and the second score to obtain the word matching score of each pairing triple.
Optionally, each pair triplet includes a third pair word, and the third pair word includes a first phrase and a second phrase, where the first phrase is a first word and a second word in a first triplet that constitutes the pair triplet, and the second phrase is a third word and a fourth word in a second triplet that constitutes the pair triplet;
the acquisition sub-module is further configured to:
and acquiring a third score of a third pairing word in each pairing triple based on a cosine similarity algorithm of the word vector, wherein the third score is a word matching score of the corresponding pairing triple.
Optionally, the analysis sub-module is further configured to:
obtaining a target pairing triple with the highest similarity value in the pairing triples formed by the target first triple and each second triple;
determining a target pairing triple corresponding to each first triple;
acquiring a weight value corresponding to each target pairing triple based on a preset statement weight value table;
and acquiring the text similarity of the first statement and the second statement based on the similarity value of the target pairing triple and the corresponding weight value.
Optionally, the analysis sub-module is further configured to:
acquiring the number of preset words included in the target pairing triple, and determining the weight attenuation coefficient of the target pairing triple;
attenuating the weight values corresponding to the target pairing triplet group based on the weight attenuation coefficient;
and acquiring the text similarity of the first statement and the second statement based on the similarity value of the target pairing triple and the attenuated weight value.
The text similarity analysis device 200 provided in the embodiment of the present application generates the first triple set of the first sentence and the second triple set of the second sentence, where each triple includes two words and the grammatical relation between the two words. When the text similarity of the first sentence and the second sentence is obtained according to the first triple set and the second triple set, not only the similarity of the words in the two sentences but also the similarity of the grammatical relations between the words is analyzed, so that the semantics of the sentences can be better taken into account based on the grammatical relations, which further improves the accuracy of the similarity analysis between the two sentences.
The text similarity analysis apparatus 200 in the embodiment of the present application may be an apparatus, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.
The text similarity analysis device 200 in the embodiment of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, and the embodiments of the present application are not specifically limited in this respect.
The text similarity analysis device 200 provided in the embodiment of the present application can implement each process implemented by the method embodiment described in fig. 1, and is not described here again to avoid repetition.
Referring to fig. 3, fig. 3 is a structural diagram of an electronic device according to an embodiment of the present disclosure, and as shown in fig. 3, the electronic device includes: a processor 300, a memory 320 and a program or instructions stored on the memory 320 and executable on the processor 300, the processor 300 for reading the program or instructions in the memory 320; the electronic device also includes a bus interface and transceiver 310.
A transceiver 310 for receiving and transmitting data under the control of the processor 300.
In fig. 3, the bus architecture may include any number of interconnected buses and bridges, which link together various circuits, in particular one or more processors represented by the processor 300 and memory represented by the memory 320. The bus architecture may also link together various other circuits such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore are not described further herein. The bus interface provides an interface. The transceiver 310 may be a plurality of elements, including a transmitter and a receiver, that provide a means for communicating with various other apparatus over a transmission medium. The processor 300 is responsible for managing the bus architecture and general processing, and the memory 320 may store data used by the processor 300 in performing operations.
In one implementation of this embodiment of the application, the processor 300, configured to read the program or the instructions in the memory 320, performs the following steps:
acquiring a first statement and a second statement, wherein the first statement and the second statement both comprise at least two words;
generating a first triple set of the first sentence and a second triple set of the second sentence, wherein each of the first triple set and the second triple set comprises at least one triple, and the triple comprises two words and a grammatical relation between the two words;
and acquiring the text similarity of the first sentence and the second sentence according to the triples in the first triple set and the triples in the second triple set.
Optionally, a triplet in the first triplet set is a first triplet, and a triplet in the second triplet set is a second triplet; the processor 300, for reading the program or instructions in the memory 320, executes the following steps:
combining each first triple in the first triple set with each second triple in the second triple set to obtain a plurality of pairing triples; the pairing triplet comprises a first triplet and a second triplet;
acquiring the similarity value of each pairing triple;
and acquiring the text similarity of the first sentence and the second sentence based on the similarity value of each pairing triple.
Optionally, the processor 300 is configured to read the program or the instructions in the memory 320, and perform the following steps:
obtaining a word matching score of each pairing triple based on two words in a first triple and two words in a second triple in each pairing triple;
obtaining a syntax relationship matching score of each pairing triple based on the syntax relationship in the first triple and the syntax relationship in the second triple in each pairing triple;
calculating a similarity value for each of the pair triplets based on the word matching score and the grammatical relationship matching score.
Optionally, each pairing triple includes a first pairing word and a second pairing word, where the first pairing word consists of a first word in the first triple and one of a third word and a fourth word in the second triple, and the second pairing word consists of a second word in the first triple and the other of the third word and the fourth word in the second triple;
the processor 300, for reading the program or instructions in the memory 320, executes the following steps:
based on a cosine similarity algorithm of word vectors, obtaining a first score of a first pairing word and a second score of a second pairing word in each pairing triple;
and performing weighted summation calculation on the first score and the second score to obtain the word matching score of each pairing triple.
Optionally, each pair triplet includes a third pair word, where the third pair word includes a first phrase and a second phrase, the first phrase is a first word and a second word in a first triplet that constitutes the pair triplet, and the second phrase is a third word and a fourth word in a second triplet that constitutes the pair triplet;
the processor 300, for reading the program or instructions in the memory 320, executes the following steps:
and acquiring a third score of a third pairing word in each pairing triple based on a cosine similarity algorithm of the word vector, wherein the third score is a word matching score of the corresponding pairing triple.
Optionally, the processor 300 is configured to read the program or the instructions in the memory 320, and perform the following steps:
obtaining a target pairing triple with the highest similarity value in the pairing triples formed by the target first triple and each second triple;
determining a target pairing triple corresponding to each first triple;
acquiring a weight value corresponding to each target pairing triple based on a preset statement weight value table;
and acquiring the text similarity of the first statement and the second statement based on the similarity value of the target pairing triple and the corresponding weight value.
Optionally, the processor 300 is configured to read the program or the instructions in the memory 320, and perform the following steps:
acquiring the number of preset words included in the target pairing triple, and determining the weight attenuation coefficient of the target pairing triple;
attenuating the weight values corresponding to the target pairing triplet group based on the weight attenuation coefficient;
and acquiring the text similarity of the first statement and the second statement based on the similarity value of the target pairing triple and the attenuated weight value.
In this embodiment, the electronic device can implement all the technical features of the text similarity analysis method embodiment shown in fig. 1 and can improve the accuracy of text similarity analysis; the implementation principle and technical effect are similar and are not described here again.
The embodiment of the invention also provides a readable storage medium, and the readable storage medium is stored with a computer program.
Wherein the computer program when executed by a processor implements the steps of:
acquiring a first statement and a second statement, wherein the first statement and the second statement both comprise at least two words;
generating a first triple set of the first sentence and a second triple set of the second sentence, wherein each of the first triple set and the second triple set comprises at least one triple, and the triple comprises two words and a grammatical relation between the two words;
and acquiring the text similarity of the first sentence and the second sentence according to the triples in the first triple set and the triples in the second triple set.
Optionally, a triplet in the first triplet set is a first triplet, and a triplet in the second triplet set is a second triplet; the computer program when executed by a processor further enables the following steps:
combining each first triple in the first triple set with each second triple in the second triple set to obtain a plurality of pairing triples; the pairing triplet comprises a first triplet and a second triplet;
acquiring the similarity value of each pairing triple;
and acquiring the text similarity of the first sentence and the second sentence based on the similarity value of each pairing triple.
Optionally, the computer program when executed by the processor further implements the following steps:
obtaining a word matching score of each pairing triple based on two words in a first triple and two words in a second triple in each pairing triple;
obtaining a syntax relationship matching score of each pairing triple based on the syntax relationship in the first triple and the syntax relationship in the second triple in each pairing triple;
calculating a similarity value for each of the pair triplets based on the word matching score and the grammatical relationship matching score.
Optionally, each pairing triple includes a first pairing word and a second pairing word, where the first pairing word consists of a first word in the first triple and one of a third word and a fourth word in the second triple, and the second pairing word consists of a second word in the first triple and the other of the third word and the fourth word in the second triple; the computer program when executed by a processor further implements the following steps:
based on a cosine similarity algorithm of word vectors, obtaining a first score of a first pairing word and a second score of a second pairing word in each pairing triple;
and performing weighted summation calculation on the first score and the second score to obtain the word matching score of each pairing triple.
Optionally, each pair triplet includes a third pair word, where the third pair word includes a first phrase and a second phrase, the first phrase is a first word and a second word in a first triplet that constitutes the pair triplet, and the second phrase is a third word and a fourth word in a second triplet that constitutes the pair triplet; the computer program when executed by a processor further enables the following steps:
and acquiring a third score of a third pairing word in each pairing triple based on a cosine similarity algorithm of the word vector, wherein the third score is a word matching score of the corresponding pairing triple.
Optionally, the computer program when executed by the processor further implements the following steps:
obtaining a target pairing triple with the highest similarity value in the pairing triples formed by the target first triple and each second triple;
determining a target pairing triple corresponding to each first triple;
acquiring a weight value corresponding to each target pairing triple based on a preset statement weight value table;
and acquiring the text similarity of the first statement and the second statement based on the similarity value of the target pairing triple and the corresponding weight value.
Optionally, the computer program when executed by the processor further implements the following steps:
acquiring the number of preset words included in the target pairing triple, and determining the weight attenuation coefficient of the target pairing triple;
attenuating the weight values corresponding to the target pairing triplet group based on the weight attenuation coefficient;
and acquiring the text similarity of the first statement and the second statement based on the similarity value of the target pairing triple and the attenuated weight value.
In this embodiment, the readable storage medium can perform all the technical features of the text similarity analysis method embodiment described in fig. 1, and the implementation principle and the technical effect are similar, which is not described herein again.
The readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
An embodiment of the present application further provides a chip. The chip includes a processor and a communication interface coupled to the processor, and the processor is configured to run a program or instructions to implement each process of the text similarity analysis method embodiment shown in Fig. 1 and achieve the same technical effect; to avoid repetition, details are not repeated here.
It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above describes only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A text similarity analysis method is characterized by comprising the following steps:
acquiring a first sentence and a second sentence, wherein the first sentence and the second sentence each comprise at least two words;
generating a first triple set of the first sentence and a second triple set of the second sentence, wherein each of the first triple set and the second triple set comprises at least one triple, and the triple comprises two words and a grammatical relation between the two words;
and acquiring the text similarity of the first sentence and the second sentence according to the triples in the first triple set and the triples in the second triple set.
2. The method of claim 1, wherein a triple in the first triple set is a first triple, and a triple in the second triple set is a second triple;
the obtaining the text similarity of the first sentence and the second sentence according to the triples in the first triple set and the triples in the second triple set comprises:
combining each first triple in the first triple set with each second triple in the second triple set to obtain a plurality of pairing triples, wherein each pairing triple comprises one first triple and one second triple;
acquiring the similarity value of each pairing triple;
and acquiring the text similarity of the first sentence and the second sentence based on the similarity value of each pairing triple.
3. The method of claim 2, wherein the acquiring the similarity value of each pairing triple comprises:
obtaining a word matching score of each pairing triple based on the two words in the first triple and the two words in the second triple of each pairing triple;
obtaining a grammatical relation matching score of each pairing triple based on the grammatical relation in the first triple and the grammatical relation in the second triple of each pairing triple;
calculating the similarity value of each pairing triple based on the word matching score and the grammatical relation matching score.
4. The method of claim 3, wherein each pairing triple includes a first pairing word and a second pairing word, the first pairing word being one of a third word and a fourth word in the first and second triples constituting the pairing triple, the second pairing word being the other of the third word and the fourth word in the first and second triples constituting the pairing triple;
the obtaining a word matching score of each pairing triple based on the two words in the first triple and the two words in the second triple of each pairing triple comprises:
obtaining, based on a cosine similarity algorithm of word vectors, a first score of the first pairing word and a second score of the second pairing word in each pairing triple;
and performing a weighted summation of the first score and the second score to obtain the word matching score of each pairing triple.
5. The method of claim 3, wherein each pairing triple includes a third pairing word, the third pairing word including a first phrase and a second phrase, the first phrase being formed by the first word and the second word in the first triple that constitutes the pairing triple, and the second phrase being formed by the third word and the fourth word in the second triple that constitutes the pairing triple;
the obtaining a word matching score of each pairing triple based on the two words in the first triple and the two words in the second triple of each pairing triple comprises:
and acquiring, based on a cosine similarity algorithm of word vectors, a third score of the third pairing word in each pairing triple, wherein the third score is the word matching score of the corresponding pairing triple.
6. The method of claim 2, wherein the acquiring the text similarity of the first sentence and the second sentence based on the similarity value of each pairing triple comprises:
obtaining, among the pairing triples formed by a target first triple and each second triple, the target pairing triple with the highest similarity value;
determining the target pairing triple corresponding to each first triple;
acquiring a weight value corresponding to each target pairing triple based on a preset sentence weight value table;
and acquiring the text similarity of the first sentence and the second sentence based on the similarity value of the target pairing triple and the corresponding weight value.
7. The method of claim 6, wherein the acquiring the text similarity of the first sentence and the second sentence based on the similarity value of the target pairing triple and the corresponding weight value comprises:
acquiring the number of preset words included in the target pairing triple, and determining the weight attenuation coefficient of the target pairing triple;
attenuating the weight value corresponding to the target pairing triple based on the weight attenuation coefficient;
and acquiring the text similarity of the first sentence and the second sentence based on the similarity value of the target pairing triple and the attenuated weight value.
8. A text similarity analysis apparatus, comprising:
a first obtaining module, configured to obtain a first sentence and a second sentence, wherein the first sentence and the second sentence each comprise at least two words;
a generating module, configured to generate a first triple set of the first sentence and a second triple set of the second sentence, wherein each of the first triple set and the second triple set includes at least one triple, and the triple includes two words and a grammatical relation between the two words;
and a second obtaining module, configured to obtain the text similarity of the first sentence and the second sentence according to the triples in the first triple set and the triples in the second triple set.
9. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the steps of the text similarity analysis method according to any one of claims 1 to 7.
10. A readable storage medium, on which a program or instructions are stored, which when executed by a processor, implement the steps of the text similarity analysis method according to any one of claims 1 to 7.
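
As an informal illustration of the pairing and scoring steps recited in claims 2 to 4 (it is not part of the claims), the sketch below pairs every first triple with every second triple and scores each pairing. Several points are assumptions rather than details taken from the claims: the two words of each triple are aligned by position, the grammatical relation matching score is 1.0 for identical relations and 0.0 otherwise, the similarity value is a weighted sum with a hypothetical factor `beta`, and `VECTORS` holds toy two-dimensional word vectors instead of trained embeddings.

```python
import math
from itertools import product
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (word_a, grammatical_relation, word_b)

# Toy word vectors; a real system would load trained embeddings instead.
VECTORS: Dict[str, Sequence[float]] = {
    "cat": (1.0, 0.0),
    "dog": (0.0, 1.0),
    "sat": (1.0, 1.0),
}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_score(t1: Triple, t2: Triple, vectors: Dict[str, Sequence[float]],
               alpha: float = 0.5) -> float:
    """Weighted sum of the scores of the two pairing words (positional alignment assumed)."""
    first = cosine(vectors[t1[0]], vectors[t2[0]])
    second = cosine(vectors[t1[2]], vectors[t2[2]])
    return alpha * first + (1 - alpha) * second


def relation_score(t1: Triple, t2: Triple) -> float:
    """Assumed scoring rule: identical grammatical relations match fully, others not at all."""
    return 1.0 if t1[1] == t2[1] else 0.0


def pairing_similarity(t1: Triple, t2: Triple, vectors: Dict[str, Sequence[float]],
                       beta: float = 0.7) -> float:
    """Combine word and relation scores; the weighted-sum form is an assumption."""
    return beta * word_score(t1, t2, vectors) + (1 - beta) * relation_score(t1, t2)


def all_pairings(first: List[Triple], second: List[Triple]) -> List[Tuple[Triple, Triple]]:
    """Combine each first triple with each second triple."""
    return list(product(first, second))


first = [("cat", "nsubj", "sat")]
second = [("cat", "nsubj", "sat"), ("dog", "obj", "sat")]
for t1, t2 in all_pairings(first, second):
    print(t1, t2, round(pairing_similarity(t1, t2, VECTORS), 3))
```

Pairing by Cartesian product means the number of pairing triples grows with the product of the two set sizes, which is usually acceptable for single sentences.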
CN202011488930.8A 2020-12-16 2020-12-16 Text similarity analysis method and device, electronic equipment and readable storage medium Pending CN112507688A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011488930.8A CN112507688A (en) 2020-12-16 2020-12-16 Text similarity analysis method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN112507688A true CN112507688A (en) 2021-03-16

Family

ID=74972803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011488930.8A Pending CN112507688A (en) 2020-12-16 2020-12-16 Text similarity analysis method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112507688A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120197631A1 (en) * 2011-02-01 2012-08-02 Accenture Global Services Limited System for Identifying Textual Relationships
WO2013172500A1 (en) * 2012-05-17 2013-11-21 한국과학기술정보연구원 Apparatus and method for determining similarity between paraphrase identification-based sentences
CN104462327A (en) * 2014-12-02 2015-03-25 百度在线网络技术(北京)有限公司 Computing method, search processing method, computing device and search processing device for sentence similarity
US20150127323A1 (en) * 2013-11-04 2015-05-07 Xerox Corporation Refining inference rules with temporal event clustering
CN109033073A (en) * 2018-06-28 2018-12-18 中国科学院自动化研究所 Textual entailment recognition method and device
JP2019082931A (en) * 2017-10-31 2019-05-30 三菱重工業株式会社 Retrieval device, similarity calculation method, and program
US20190197482A1 (en) * 2017-12-27 2019-06-27 International Business Machines Corporation Creating and using triplet representations to assess similarity between job description documents
US20190392066A1 (en) * 2018-06-26 2019-12-26 Adobe Inc. Semantic Analysis-Based Query Result Retrieval for Natural Language Procedural Queries
CN110705612A (en) * 2019-09-18 2020-01-17 重庆邮电大学 Sentence similarity calculation method, storage medium and system with mixed multi-features

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
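李璐旸: "基于表示学习的虚假信息检测研究" [Research on false information detection based on representation learning], 中国博士学位论文全文数据库 信息科技辑 [China Doctoral Dissertations Full-text Database, Information Science and Technology], 15 January 2018 (2018-01-15), pages 138 - 128 *
邓涵;朱新华;李奇;彭琦;: "基于句法结构与修饰词的句子相似度计算" [Sentence similarity calculation based on syntactic structure and modifiers], 计算机工程 [Computer Engineering], no. 09, 15 September 2017 (2017-09-15), pages 240 - 244 *
金健: "基于自然语言处理的疑似侵权专利智能检索研究" [Research on intelligent retrieval of suspected infringing patents based on natural language processing], 中国优秀硕士学位论文全文数据库 信息科技辑 [China Master's Theses Full-text Database, Information Science and Technology], 15 January 2018 (2018-01-15), pages 138 - 1877 *
铉静;吴琼;魏从悦;伍星;: "基于句法依存卷积神经网络的句子相似度计算" [Sentence similarity calculation based on a syntactic dependency convolutional neural network], 重庆大学学报 [Journal of Chongqing University], no. 09, 15 September 2020 (2020-09-15), pages 45 - 57 *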

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination