CN110348539B

CN110348539B - Short text relevance judging method

Info

Publication number: CN110348539B
Application number: CN201910653618.0A
Authority: CN
Inventors: 薛娇; 黄波; 李大任; 李大海
Original assignee: Zhizhe Information Technology Service Chengdu Co ltd
Current assignee: Zhizhe Information Technology Service Chengdu Co ltd
Priority date: 2019-07-19
Filing date: 2019-07-19
Publication date: 2021-05-07
Anticipated expiration: 2039-07-19
Also published as: CN110348539A

Abstract

The disclosure relates to a short text relevance judging method. According to one embodiment of the present disclosure, the method comprises: obtaining similarity scores of the two short texts; acquiring the repetition degrees of the two short texts; and judging the correlation between the two short texts as one of irrelevant, similar and repeated according to the similarity score and the repetition degree. The method and the device disclosed by the invention have at least one of the following beneficial technical effects: the similarity score and the repetition degree of the two short texts are combined, the three-level judgment is carried out on the correlation of the two short texts, the correlation judgment method is more reasonable and accurate, and the application scene is wider.

Description

Short text relevance judging method

Technical Field

The present disclosure relates to the field of network information processing, and in particular, to a method and an apparatus for determining short text relevance, a storage medium, and an electronic device.

Background

The main content of the question-answering community in the network is questions and answers, wherein the questions are mostly short texts, and the relevance judgment of the short texts is widely applied to the service scene. For example, according to a certain question, other questions related to the question are recommended; repositioning the relevant questions to expose more answers of high quality; when the problem proposed by the user has a repeated condition, the user can be guided to directly check the existing repeated problem or to more clearly state the problem of the user. The existing text relevance judging method can only judge two texts as irrelevant and similar, and does not further determine the degree of similarity, so that the application is single, and the requirement on a complex scene cannot be met.

Disclosure of Invention

A brief summary of the disclosure is provided below in order to provide a basic understanding of some aspects of the disclosure. It should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.

According to a first aspect of the present disclosure, a method for discriminating short text relevance is provided, which includes:

obtaining similarity scores of the two short texts;

acquiring the repetition degrees of the two short texts; and

and judging the correlation between the two short texts as one of irrelevant, similar and repeated according to the similarity score and the repetition degree.

In one embodiment of the present invention,

according to a second aspect of the present disclosure, there is provided a short text relevance determination apparatus, including:

a first acquisition unit configured to acquire similarity scores of two short texts;

a second acquisition unit configured to acquire a repetition degree of the two short texts; and

and the judging unit is configured to judge the correlation between the two short texts into one of irrelevant, similar and repeated according to the similarity score and the repetition degree.

According to a third aspect of the present disclosure, there is provided a storage medium having stored thereon a program that implements the short text relevance determination method of the first aspect described above.

According to a fourth aspect of the present disclosure, there is provided an electronic device comprising a processor; a memory, in which instructions are stored, which when executed by the processor implement the program of the short text relevance determination method according to the first aspect.

The technical scheme of the disclosure has at least one of the following technical effects: the similarity score and the repetition degree of the two short texts are combined, the three-level judgment is carried out on the correlation of the two short texts, and the correlation judgment method is more reasonable and accurate, so that the application scene is wider.

Drawings

The disclosure may be better understood by reference to the following description taken in conjunction with the accompanying drawings, which are incorporated in and form a part of this specification, along with the following detailed description. In the drawings:

FIG. 1 is a schematic flow diagram of a short text relevance determination method according to one embodiment of the present disclosure;

FIG. 2 is a schematic flow diagram of obtaining a similarity score for two short texts, according to one embodiment of the present disclosure;

FIG. 3 is a schematic flow chart diagram of separately obtaining similarity score vectors for two short texts, according to one embodiment of the present disclosure;

fig. 4 is a schematic flow diagram of normalized IDF weighting of each similarity score vector according to one embodiment of the present disclosure;

FIG. 5 is a schematic flow chart diagram for obtaining the repetition degree of two short texts according to one embodiment of the present disclosure;

FIG. 6 is a schematic flow diagram of extracting keywords of short text according to one embodiment of the present disclosure;

FIG. 7 is a schematic flow diagram of determining a relevance between two short texts based on a similarity score and a repetition according to one embodiment of the present disclosure; and

fig. 8 is a block diagram of a short text relevance determination apparatus according to an embodiment of the present disclosure;

fig. 9 is a schematic structural diagram of an electronic device according to one embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual embodiment are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions may be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another.

Here, it should be further noted that, in order to avoid obscuring the present disclosure with unnecessary details, only the device structure closely related to the scheme according to the present disclosure is shown in the drawings, and other details not so related to the present disclosure are omitted.

It is to be understood that the disclosure is not limited to the described embodiments, as described below with reference to the drawings. In this context, embodiments may be combined with each other, features may be replaced or borrowed between different embodiments, one or more features may be omitted in one embodiment, where feasible.

According to one embodiment of the disclosure, the similarity score and the repeatability of the two short texts are combined, and the correlation of the two short texts is judged in three levels, so that the correlation judging method is more reasonable and accurate, and the application scene is wider.

Specifically, fig. 1 is a schematic flow chart diagram of a short text relevance determination method according to an embodiment of the present disclosure. Referring to fig. 1, the method for determining short text relevance includes:

s10, obtaining similarity scores of the two short texts;

s11, acquiring the repetition degree of the two short texts; and

and S12, judging the correlation between the two short texts as one of irrelevant, similar and repeated according to the similarity score and the repetition degree.

Wherein, the repetition degree of the two short texts refers to the repetition degree of words in the two short texts. In the embodiment, the similarity score and the repetition degree of the two short texts are combined, and the relevance of the two short texts is subjected to three-level judgment, namely, irrelevance, similarity and repetition. The three levels of correlation are sequentially enhanced. The irrelevant representation has the weakest relevance and represents that the semantic difference of two short text expressions is larger; the repeated representation has the strongest relevance, and represents that the semantics of two short texts have no difference, namely describing the same event, stating the same fact, expressing the same viewpoint and the like. Relevance is two similar short texts, and the relevance degree of the two short texts is between irrelevance and repetition.

In a complex service scenario, when a user proposes a new question, it is necessary to determine whether there is a semantic duplication problem in the existing question bank, i.e., whether two short texts are duplicated. If the questions posed by the user are repeated in the question bank, the user can be guided to directly view the existing repeated questions or to more clearly state the own questions. In addition, the existing problem library can be subjected to filtering and repeating treatment, namely only one repeated problem is reserved, so that the quality of the problems in the problem library can be effectively improved.

In addition, for short texts with similar relevance, which belong to a very important data source of a recommendation system, high-quality questions and answers can be exposed as much as possible by recommending questions with similar relevance, so that a better experience is provided for a user.

Fig. 2 is a schematic flow diagram of obtaining a similarity score of two short texts according to one embodiment of the present disclosure. Referring to fig. 2, obtaining the similarity score of two short texts may include, for example:

s20, respectively obtaining similarity score vectors of the two short texts;

s21, carrying out normalization IDF (inverse text frequency index) weighting on each similarity score vector to obtain a weighted value; and

and S22, calculating the average value of all weighted values to serve as the similarity score of the two short texts.

Fig. 3 is a schematic flow chart of respectively obtaining similarity score vectors of two short texts according to an embodiment of the present disclosure. Referring to fig. 3, obtaining similarity score vectors of two short texts respectively may include, for example:

s30, performing word segmentation preprocessing on the short text sensor 1 and the short text sensor 2 respectively to obtain a word segmentation sequence terms1 and a word segmentation sequence terms 2; wherein, the sequence length of the segmentation sequence terms1 can be M, the sequence length of the segmentation sequence terms2 can be N, and M and N can be natural numbers greater than or equal to 1;

and S31, respectively searching a word vector corresponding to each word in the word segmentation sequence term 1 and the word segmentation sequence term 2 from a word vector lookup table, wherein the word vector lookup table can be obtained by training the whole library content by utilizing an open source word2vec algorithm. For example, the word vector corresponding to each word in the segmentation sequence terms1 can be represented by vecs1, vecs1 is a vector of 1xD, and all the word vectors corresponding to the segmentation sequence terms1 form a matrix with a dimension of MxD; the word vector corresponding to each word in the word segmentation sequence terms2 can be represented by vecs2, vecs2 is a vector of 1xD, and all the word vectors corresponding to the word segmentation sequence terms2 form a matrix with the dimension of NxD; in this embodiment, D is 256, but not limited thereto;

s32, calculating cosine similarity scores between word vectors corresponding to each word in the word segmentation sequence terms1 and word vectors corresponding to each word in the word segmentation sequence terms2 to obtain a similarity matrix; the similarity matrix is a matrix with a dimension of MxN;

the cosine similarity score s _ ij can be calculated by the following formula (1):

s_ij＝dot_product(v_i,v_j)/||v_i||*||v_j|| (1)

in the formula (1), v _ i represents a word vector corresponding to the ith word in the word segmentation sequence term 1, v _ j represents a word vector corresponding to the jth word in the word segmentation sequence term 2, dot _ product represents a point multiplication operation, and s _ ij represents a cosine similarity score between v _ i and v _ j;

and S33, aiming at the similarity matrix, respectively sampling according to the maximum of the rows and the maximum of the columns to obtain similarity score vectors of the two short texts. The maximum sampling by row means that the maximum value of each row element in the similarity matrix is selected to obtain a similarity score vector similarity1 corresponding to the word segmentation sequence terms 1; the maximum sampling by column means that the maximum value of each column element in the similarity matrix is selected to obtain a similarity score vector similarity2 corresponding to the participle sequence term 2.

Fig. 4 is a schematic flow diagram of normalized IDF weighting of each similarity score vector according to one embodiment of the present disclosure. Referring to fig. 4, the normalized IDF weighting is performed on each similarity score vector to obtain a weighted value, which may include, for example:

s40, solving the IDF value of each word in the word segmentation sequence to obtain an IDF vector;

s41, carrying out normalization processing on the IDF vector to obtain a normalized IDF vector;

s42, performing a dot product on the similarity score vector and the normalized IDF vector to obtain a weighted value.

For example, for a word t in a word segmentation sequence, the IDF value of the word t can be calculated by the following formula (2):

IDF＝log(N_total/N_df) (2)

in the formula (2), N _ total represents the number of all articles, and N _ df represents the number of articles including the word t, wherein the articles are articles in a specified website, such as a known website.

From the definition of IDF of the word t, the larger the IDF is, the less likely the word is to be a common word, and the more likely the word is to be a word representing the meaning of the main meaning of the sentence. Therefore, the importance of the subject word can be fully considered through IDF weighting, and semantic information of the short text can be better represented. And normalizing the IDF of the word sequence during weighting, and limiting the final score interval to be 0-1.0, so that different scores have stronger comparability.

Fig. 5 is a schematic flowchart of obtaining the repetition degrees of two short texts according to an embodiment of the present disclosure. Referring to fig. 5, obtaining the repetition degrees of two short texts may include, for example:

s50, acquiring the keyword repetition degree and the editing repetition degree of the two short texts;

and S51, accumulating the repetition degrees of the keywords and the editing repetition degrees to obtain the repetition degrees of the two short texts.

The obtaining of the keyword repetition degrees of the two short texts may include, for example:

acquiring the keyword coverage rate of the two short texts; if the coverage rate of the keywords is greater than or equal to a first set threshold, the repetition rate of the keywords is 1, otherwise, the repetition rate of the keywords is 0. Wherein, preferably, the first set threshold is 0.5.

The keyword coverage rate of the two short texts can be defined as the ratio of the number of the same keywords in the two short texts to the number of non-repeated keywords in the two short texts; the number of the same keywords refers to the number of the elements of the intersection of the keywords of the two short texts, and the number of the non-repeated keywords refers to the number of the elements of the union of the keywords of the two short texts. For example, if the keywords of one short text are t _1 and t _4, and the keywords of the other short text are t _1 and t _2, the same keywords in the two short texts are t _1, and the non-repeated keywords are t _1, t _2 and t _4, so that the number of the same keywords in the two short texts is 1, and the number of the non-repeated keywords in the two short texts is 3.

Further, before the step of obtaining the keyword coverage rates of the two short texts, the method may further include: and respectively extracting the keywords of the two short texts.

Fig. 6 is a schematic flow diagram of extracting keywords of a short text according to one embodiment of the present disclosure. Referring to fig. 6, extracting the keywords of the short text may include, for example:

s60, performing word segmentation processing on the short text to obtain a word segmentation sequence;

s61, removing stop words in the word segmentation sequence to obtain a processed word segmentation sequence; the stop words include functional words such as tone assist words, adverbs, prepositions, conjunctions and the like. Specifically, word segmentation processing is carried out on the short text to obtain a word segmentation sequence; and matching the words in the segmentation sequence with the words in the stop word list, and removing the stop words from the segmentation sequence if the stop words matched with the stop words in the stop word list exist in the segmentation sequence. The stop word list is obtained by arranging all stop words in advance.

S62, calculating the IDF value of each word in the processed participle sequence, as described above with reference to formula (2);

s63, taking the set percentage of the maximum value in the IDF value as an extraction threshold value; the set percentage may be, for example, 80%, but is not limited thereto. The percentage is selected to enable the reserved words to have higher similarity or repeatability, namely, the reserved words all express more similar or repeated semantics

S64, selecting the words corresponding to the IDF values larger than the extraction threshold value from the IDF values as keywords.

In summary, in the above extraction of the short text keywords, stop words in the segmentation sequence corresponding to the short text are removed first, then the weight of the keywords is obtained by adopting a method of calculating the IDF value, which fully considers the characteristic that the number of words of the short text is mostly about 10 words, topics or main words in the problem are rarely repeated, and most repeated words are stop words or other meaningless words, so that the interference of meaningless words can be reduced to the greatest extent by removing the stop words and calculating the IDF value, and the main topic words are retained. In addition, the set percentage of the maximum value in the IDF value is used as an extraction threshold, the characteristic that the number of words of the short text is small is also considered, the fact that the theme or the main body described by the short text is single and specific is considered, and accurate keywords can be found out better by adopting the percentile threshold of the maximum value.

Acquiring the editing repetition degrees of the two short texts may include, for example:

and if the editing distance coefficient of the two short texts is larger than a second set threshold, the editing repetition degree is 1, otherwise, the editing repetition degree is 0.

The edit distance coefficients of the two short texts can be calculated by the following formula (3):

score＝1-editLength/max(length1,length2) (3)

in the formula (3), score is an edit distance coefficient of two short texts, length1 is the length of a word segmentation sequence corresponding to one short text, length2 is the length of a word segmentation sequence corresponding to the other short text, and editLength is an edit distance of two short texts.

Preferably, the second set threshold may be 0.65. If the edit distance coefficient for two short texts is 0.65, at most 35% of the words in the two short texts need to be edited, and the editing operation here includes replacing a word with another word, inserting a word, and deleting a word, i.e., at least 75% of the words in the two short texts are identical.

FIG. 7 is a schematic flow diagram of determining a relevance between two short texts based on a similarity score and a repetition according to one embodiment of the present disclosure. Referring to fig. 7, discriminating the correlation between two short texts as one of irrelevant, similar and repeated according to the similarity score and the repetition degree includes: after obtaining the similarity score and the repetition degree

If the similarity score is smaller than a third set threshold value, judging the correlation between the two short texts as irrelevant;

if the similarity score is greater than or equal to a third set threshold and less than a fourth set threshold, or the similarity score is greater than or equal to the fourth set threshold and the repetition degree is equal to 0, judging the correlation between the two short texts to be similar; wherein the fourth set threshold is greater than the third set threshold;

and if the similarity score is greater than or equal to a fourth set threshold and the repetition degree is not equal to 0, judging the correlation between the two short texts as repetition.

Preferably, the third set threshold may be 0.5, and the fourth set threshold may be 0.8. In the embodiment, by setting the third threshold, two short texts with similarity scores smaller than the third threshold are judged to be irrelevant; and further judging the two short texts with the similarity greater than the third set threshold by using a fourth set threshold and the repetition degree, and further judging the short texts to be similar or repeated. The method for judging the relevance is more reasonable and accurate, and the application scene is wider.

In a second aspect of the present disclosure, a device for determining relevance of a short text is also provided. Fig. 8 is a block diagram of a short text relevance determination apparatus in an embodiment provided according to the present disclosure. Referring to fig. 8, the short text relevance discriminating apparatus includes: a first acquisition unit 80, a second acquisition unit 81, and a determination unit 82.

A first obtaining unit 80, configured to obtain similarity scores of two short texts;

a second obtaining unit 81, configured to obtain a repetition degree of the two short texts; and

and the judging unit 82 is used for judging the correlation between the two short texts into one of irrelevant, similar and repeated texts according to the similarity score and the repetition degree.

In a third aspect of the present disclosure, an electronic device is provided, comprising one or more processors; a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to perform the method for short text relevance discrimination of the first aspect. Fig. 9 is a schematic structural diagram of an electronic device 900 according to one embodiment of the present disclosure. As shown in fig. 9, the electronic apparatus 900 includes a Central Processing Unit (CPU)901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the electronic apparatus are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.

The following components are connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is mounted into the storage section 908 as necessary.

In a fourth aspect of the present disclosure, there is provided a computer readable medium having stored thereon executable instructions, which when executed by a processor, cause the processor to execute the short text relevance determination method of the first aspect. In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer-readable medium bearing instructions that, in such embodiments, may be downloaded and installed from a network via the communication section 909, and/or installed from the removable medium 911. The various method steps described in the present invention are performed when the instructions are executed by a Central Processing Unit (CPU) 901.

Although example embodiments have been described, it will be apparent to those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the inventive concept. Accordingly, it should be understood that the above-described exemplary embodiments are not limiting, but illustrative.

Claims

1. A short text relevance judging method comprises the following steps:

obtaining similarity scores of the two short texts;

acquiring the repetition degrees of the two short texts; and

judging the correlation between the two short texts as one of irrelevant, similar and repeated according to the similarity score and the repetition;

the obtaining of the repetition degrees of the two short texts comprises:

acquiring the keyword repetition degree and the editing repetition degree of the two short texts; and

accumulating the keyword repetition degree and the editing repetition degree to obtain the repetition degrees of the two short texts;

the discriminating the correlation between the two short texts as one of irrelevant, similar and repeated according to the similarity score and the repetition degree comprises:

if the similarity score is greater than or equal to the third set threshold and less than a fourth set threshold, or the similarity score is greater than or equal to the fourth set threshold and the repetition degree is equal to 0, judging the correlation between the two short texts to be similar; wherein the fourth set threshold is greater than the third set threshold;

and if the similarity score is greater than or equal to the fourth set threshold and the repetition degree is not equal to 0, judging the correlation between the two short texts as repetition.

2. The short text relevance discrimination method according to claim 1, wherein obtaining the similarity score of two short texts comprises:

respectively obtaining similarity score vectors of the two short texts;

carrying out normalization IDF weighting on each similarity score vector to obtain a weighted value; and

and calculating the average value of all the weighted values to serve as the similarity score of the two short texts.

3. The method for determining relevance of short texts according to claim 1, wherein obtaining the repetition degrees of keywords of the two short texts comprises:

acquiring the keyword coverage rate of the two short texts;

if the keyword coverage rate is greater than or equal to a first set threshold value, the keyword repetition degree is 1, otherwise, the keyword repetition degree is 0.

4. The short text relevance determination method according to claim 3, wherein before the step of obtaining the keyword coverage of the two short texts, the method further comprises:

respectively extracting keywords of the two short texts, including:

performing word segmentation processing on the two short texts respectively to obtain word segmentation sequences;

removing stop words in the word segmentation sequence to obtain a processed word segmentation sequence;

calculating the IDF value of each word in the processed word segmentation sequence;

taking a set percentage of a maximum value in the IDF values as an extraction threshold; and

and selecting words corresponding to the IDF values larger than the extraction threshold value in the IDF values as the keywords.

5. The method for discriminating the relevance of short texts according to claim 1, wherein the obtaining of the editing repetition of the two short texts comprises:

6. A short text relevance discriminating apparatus comprising:

a judging unit configured to judge the correlation between the two short texts as one of irrelevant, similar and repeated according to the similarity score and the repetition;

the second obtaining unit is further configured to obtain a keyword repetition degree and an editing repetition degree of the two short texts; accumulating the keyword repetition degree and the editing repetition degree to obtain the repetition degrees of the two short texts;

the judging unit is further configured to judge the correlation between the two short texts as irrelevant if the similarity score is smaller than a third set threshold; if the similarity score is greater than or equal to the third set threshold and less than a fourth set threshold, or the similarity score is greater than or equal to the fourth set threshold and the repetition degree is equal to 0, judging the correlation between the two short texts to be similar; wherein the fourth set threshold is greater than the third set threshold; and if the similarity score is greater than or equal to the fourth set threshold and the repetition degree is not equal to 0, judging the correlation between the two short texts as repetition.

7. An electronic device, comprising:

one or more processors;

a memory for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-5.

8. A computer readable medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 5.