CN110348539B - Short text relevance judging method - Google Patents


Info

Publication number
CN110348539B
CN110348539B (granted publication of application CN201910653618.0A; earlier publication CN110348539A)
Authority
CN
China
Prior art keywords
short texts
repetition
short
similarity score
set threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910653618.0A
Other languages
Chinese (zh)
Other versions
CN110348539A (en)
Inventor
薛娇
黄波
李大任
李大海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhizhe Information Technology Service Chengdu Co ltd
Original Assignee
Zhizhe Information Technology Service Chengdu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhizhe Information Technology Service Chengdu Co ltd filed Critical Zhizhe Information Technology Service Chengdu Co ltd
Priority to CN201910653618.0A priority Critical patent/CN110348539B/en
Publication of CN110348539A publication Critical patent/CN110348539A/en
Application granted granted Critical
Publication of CN110348539B publication Critical patent/CN110348539B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The disclosure relates to a method for determining the relevance of short texts. According to one embodiment of the present disclosure, the method comprises: obtaining a similarity score for two short texts; obtaining the repetition degree of the two short texts; and determining the relevance between the two short texts as one of irrelevant, similar, and repeated according to the similarity score and the repetition degree. The disclosed method and device have at least the following beneficial technical effect: by combining the similarity score and the repetition degree of the two short texts and making a three-level judgment on their relevance, the relevance determination method becomes more reasonable and accurate, and its range of application scenarios becomes wider.

Description

Short text relevance judging method
Technical Field
The present disclosure relates to the field of network information processing, and in particular, to a method and an apparatus for determining short text relevance, a storage medium, and an electronic device.
Background
The main content of an online question-and-answer community consists of questions and answers. The questions are mostly short texts, and short text relevance determination is widely used in such business scenarios: for example, recommending other questions related to a given question; redirecting users to related questions so that more high-quality answers are exposed; or, when a user's newly posed question duplicates an existing one, guiding the user either to view the existing duplicate question directly or to state their own question more clearly. Existing text relevance determination methods can only classify two texts as irrelevant or similar, without further determining the degree of similarity; their application is therefore limited and cannot meet the requirements of complex scenarios.
Disclosure of Invention
A brief summary of the disclosure is provided below in order to provide a basic understanding of some aspects of the disclosure. It should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
According to a first aspect of the present disclosure, a method for determining short text relevance is provided, which comprises:
obtaining a similarity score for two short texts;
obtaining the repetition degree of the two short texts; and
determining the relevance between the two short texts as one of irrelevant, similar, and repeated according to the similarity score and the repetition degree.
According to a second aspect of the present disclosure, there is provided a short text relevance determination apparatus, comprising:
a first acquisition unit configured to obtain a similarity score for two short texts;
a second acquisition unit configured to obtain the repetition degree of the two short texts; and
a determination unit configured to determine the relevance between the two short texts as one of irrelevant, similar, and repeated according to the similarity score and the repetition degree.
According to a third aspect of the present disclosure, there is provided a storage medium having stored thereon a program that implements the short text relevance determination method of the first aspect described above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising a processor and a memory in which instructions are stored, the instructions, when executed by the processor, implementing the short text relevance determination method of the first aspect.
The technical scheme of the disclosure has at least the following technical effect: by combining the similarity score and the repetition degree of the two short texts and making a three-level judgment on their relevance, the relevance determination method becomes more reasonable and accurate, and its range of application scenarios becomes wider.
Drawings
The disclosure may be better understood by reference to the following description taken in conjunction with the accompanying drawings, which are incorporated in and form a part of this specification, along with the following detailed description. In the drawings:
FIG. 1 is a schematic flow diagram of a short text relevance determination method according to one embodiment of the present disclosure;
FIG. 2 is a schematic flow diagram of obtaining a similarity score for two short texts, according to one embodiment of the present disclosure;
FIG. 3 is a schematic flow chart diagram of separately obtaining similarity score vectors for two short texts, according to one embodiment of the present disclosure;
FIG. 4 is a schematic flow diagram of normalized IDF weighting of each similarity score vector according to one embodiment of the present disclosure;
FIG. 5 is a schematic flow chart diagram for obtaining the repetition degree of two short texts according to one embodiment of the present disclosure;
FIG. 6 is a schematic flow diagram of extracting keywords of short text according to one embodiment of the present disclosure;
FIG. 7 is a schematic flow diagram of determining the relevance between two short texts based on the similarity score and the repetition degree according to one embodiment of the present disclosure;
FIG. 8 is a block diagram of a short text relevance determination apparatus according to an embodiment of the present disclosure; and
FIG. 9 is a schematic structural diagram of an electronic device according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual embodiment are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions may be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another.
Here, it should be further noted that, in order to avoid obscuring the present disclosure with unnecessary details, only the device structure closely related to the scheme according to the present disclosure is shown in the drawings, and other details not so related to the present disclosure are omitted.
It is to be understood that the disclosure, described below with reference to the drawings, is not limited to the described embodiments. Where feasible, embodiments may be combined with each other, features may be replaced or borrowed between different embodiments, and one or more features may be omitted from an embodiment.
According to one embodiment of the disclosure, the similarity score and the repeatability of the two short texts are combined, and the correlation of the two short texts is judged in three levels, so that the correlation judging method is more reasonable and accurate, and the application scene is wider.
Specifically, fig. 1 is a schematic flow chart diagram of a short text relevance determination method according to an embodiment of the present disclosure. Referring to fig. 1, the method for determining short text relevance includes:
S10, obtaining a similarity score for the two short texts;
S11, obtaining the repetition degree of the two short texts; and
S12, determining the relevance between the two short texts as one of irrelevant, similar, and repeated according to the similarity score and the repetition degree.
Here, the repetition degree of two short texts refers to the degree to which words are repeated between them. In this embodiment, the similarity score and the repetition degree of the two short texts are combined, and their relevance is given a three-level judgment: irrelevant, similar, or repeated, in order of increasing relevance. Irrelevant denotes the weakest relevance, meaning that the semantics expressed by the two short texts differ considerably. Repeated denotes the strongest relevance, meaning that the semantics of the two short texts do not differ, i.e., they describe the same event, state the same fact, express the same viewpoint, and so on. Similar denotes two short texts whose degree of relevance lies between irrelevant and repeated.
In a complex business scenario, when a user poses a new question, it is necessary to determine whether a semantically duplicate question already exists in the question bank, i.e., whether the two short texts are repeated. If the user's question duplicates one in the question bank, the user can be guided either to view the existing duplicate question directly or to state their own question more clearly. In addition, the existing question bank can be de-duplicated, i.e., only one of any set of repeated questions is retained, which effectively improves the quality of the questions in the bank.
In addition, short texts judged to be similar are a very important data source for a recommendation system: by recommending similar questions, as many high-quality questions and answers as possible can be exposed, providing a better experience for the user.
Fig. 2 is a schematic flow diagram of obtaining a similarity score of two short texts according to one embodiment of the present disclosure. Referring to fig. 2, obtaining the similarity score of two short texts may include, for example:
S20, respectively obtaining similarity score vectors of the two short texts;
S21, carrying out normalized IDF (inverse document frequency) weighting on each similarity score vector to obtain a weighted value; and
S22, calculating the average of all the weighted values as the similarity score of the two short texts.
Fig. 3 is a schematic flow chart of respectively obtaining similarity score vectors of two short texts according to an embodiment of the present disclosure. Referring to fig. 3, obtaining similarity score vectors of two short texts respectively may include, for example:
S30, performing word segmentation preprocessing on each of the two short texts to obtain segmentation sequences terms1 and terms2, of sequence lengths M and N respectively, where M and N are natural numbers greater than or equal to 1;
S31, looking up the word vector corresponding to each word in terms1 and terms2 in a word-vector lookup table, which may be obtained by training on the full corpus with the open-source word2vec algorithm. For example, each word in terms1 corresponds to a 1xD word vector vecs1, and all word vectors of terms1 form a matrix of dimension MxD; likewise, each word in terms2 corresponds to a 1xD word vector vecs2, and all word vectors of terms2 form a matrix of dimension NxD. In this embodiment D is 256, but the disclosure is not limited thereto;
S32, calculating the cosine similarity score between each word vector of terms1 and each word vector of terms2 to obtain a similarity matrix of dimension MxN;
The cosine similarity score s_ij can be calculated by the following formula (1):
s_ij = dot_product(v_i, v_j) / (||v_i|| * ||v_j||)    (1)
In formula (1), v_i denotes the word vector of the ith word in terms1, v_j denotes the word vector of the jth word in terms2, dot_product denotes the dot product, and s_ij is the cosine similarity score between v_i and v_j;
S33, sampling the similarity matrix by row maxima and by column maxima to obtain the similarity score vectors of the two short texts. Max sampling by row means taking the maximum of each row of the similarity matrix, yielding the similarity score vector similarity1 corresponding to terms1; max sampling by column means taking the maximum of each column, yielding the similarity score vector similarity2 corresponding to terms2.
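Steps S31 to S33 can be sketched with NumPy as follows. This is an illustrative sketch, not the patent's implementation; vecs1 and vecs2 stand for the MxD and NxD word-vector matrices described above:

```python
import numpy as np

def similarity_score_vectors(vecs1, vecs2):
    """Steps S32-S33: build the MxN cosine-similarity matrix between the
    word vectors of the two segmentation sequences, then take row maxima
    and column maxima as the two similarity score vectors."""
    # Unit-normalize each word vector so a plain dot product equals the
    # cosine similarity of formula (1).
    n1 = vecs1 / np.linalg.norm(vecs1, axis=1, keepdims=True)
    n2 = vecs2 / np.linalg.norm(vecs2, axis=1, keepdims=True)
    sim = n1 @ n2.T             # similarity matrix, dimension M x N
    similarity1 = sim.max(axis=1)   # max by row    -> vector for terms1 (length M)
    similarity2 = sim.max(axis=0)   # max by column -> vector for terms2 (length N)
    return similarity1, similarity2
```

Each entry of similarity1 is thus the best match that a word of terms1 finds anywhere in terms2, and symmetrically for similarity2.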
Fig. 4 is a schematic flow diagram of normalized IDF weighting of each similarity score vector according to one embodiment of the present disclosure. Referring to fig. 4, the normalized IDF weighting is performed on each similarity score vector to obtain a weighted value, which may include, for example:
s40, solving the IDF value of each word in the word segmentation sequence to obtain an IDF vector;
s41, carrying out normalization processing on the IDF vector to obtain a normalized IDF vector;
s42, performing a dot product on the similarity score vector and the normalized IDF vector to obtain a weighted value.
For example, for a word t in a word segmentation sequence, the IDF value of the word t can be calculated by the following formula (2):
IDF=log(N_total/N_df) (2)
In formula (2), N_total denotes the total number of articles and N_df denotes the number of articles containing the word t, where the articles come from a specified website such as Zhihu.
By the definition of the IDF of a word t, the larger the IDF, the less likely the word is a common word and the more likely it is a word carrying the main meaning of the sentence. IDF weighting therefore fully accounts for the importance of topic words and better represents the semantic information of the short text. The IDF of the word sequence is normalized during weighting, restricting the final score to the interval 0 to 1.0, so that different scores are more directly comparable.
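Formula (2) and steps S40 to S42 can be sketched as follows. This is an illustrative sketch; it assumes that normalization means scaling the IDF vector so its entries sum to 1, which keeps the final score in the 0 to 1.0 interval mentioned above when the cosine scores themselves lie in [0, 1]:

```python
import math

def idf(n_total, n_df):
    """Formula (2): IDF = log(N_total / N_df)."""
    return math.log(n_total / n_df)

def idf_weighted_score(sim_vector, idf_vector):
    """Steps S40-S42: normalize the IDF vector to sum to 1, then take
    the dot product with the similarity score vector, yielding the
    weighted value for one segmentation sequence."""
    total = sum(idf_vector)
    weights = [v / total for v in idf_vector]
    return sum(s * w for s, w in zip(sim_vector, weights))
```

Per step S22, the similarity score of the two short texts is then the average of the two weighted values obtained from similarity1 and similarity2.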
Fig. 5 is a schematic flowchart of obtaining the repetition degrees of two short texts according to an embodiment of the present disclosure. Referring to fig. 5, obtaining the repetition degrees of two short texts may include, for example:
S50, obtaining the keyword repetition degree and the editing repetition degree of the two short texts; and
S51, accumulating the keyword repetition degree and the editing repetition degree to obtain the repetition degree of the two short texts.
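The accumulation in step S51 can be sketched minimally: since the keyword repetition degree and the editing repetition degree are each 0 or 1 (as defined below in the description), their sum takes a value in {0, 1, 2}:

```python
def repetition_degree(keyword_rep, edit_rep):
    """Step S51: accumulate the keyword repetition degree and the
    editing repetition degree (each 0 or 1) into the overall
    repetition degree of the two short texts."""
    return keyword_rep + edit_rep
```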
The obtaining of the keyword repetition degrees of the two short texts may include, for example:
obtaining the keyword coverage rate of the two short texts; if the keyword coverage rate is greater than or equal to a first set threshold, the keyword repetition degree is 1; otherwise it is 0. Preferably, the first set threshold is 0.5.
The keyword coverage rate of two short texts can be defined as the ratio of the number of shared keywords to the number of distinct keywords, where the number of shared keywords is the size of the intersection of the two texts' keyword sets and the number of distinct keywords is the size of their union. For example, if the keywords of one short text are t_1 and t_4 and those of the other are t_1 and t_2, then the shared keyword is t_1 and the distinct keywords are t_1, t_2, and t_4; the number of shared keywords is therefore 1 and the number of distinct keywords is 3.
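The keyword coverage rate and the first-threshold test can be sketched directly from the definitions above:

```python
def keyword_coverage(kws1, kws2):
    """Ratio of shared keywords (intersection) to distinct keywords
    (union) across the two short texts' keyword sets."""
    union = set(kws1) | set(kws2)
    if not union:
        return 0.0
    return len(set(kws1) & set(kws2)) / len(union)

def keyword_repetition(kws1, kws2, first_threshold=0.5):
    """Keyword repetition degree: 1 if the coverage rate reaches the
    first set threshold (preferably 0.5), else 0."""
    return 1 if keyword_coverage(kws1, kws2) >= first_threshold else 0
```

With the worked example from the text, keywords {t_1, t_4} and {t_1, t_2} give a coverage rate of 1/3, below 0.5, so the keyword repetition degree is 0.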
Further, before the step of obtaining the keyword coverage rate of the two short texts, the method may further include: respectively extracting the keywords of the two short texts.
Fig. 6 is a schematic flow diagram of extracting keywords of a short text according to one embodiment of the present disclosure. Referring to fig. 6, extracting the keywords of the short text may include, for example:
S60, performing word segmentation on the short text to obtain a segmentation sequence;
S61, removing stop words from the segmentation sequence to obtain a processed segmentation sequence. Stop words include function words such as modal particles, adverbs, prepositions, and conjunctions. Specifically, each word in the segmentation sequence is matched against a stop-word list, compiled in advance from all stop words, and any word that matches an entry in the list is removed from the sequence;
S62, calculating the IDF value of each word in the processed segmentation sequence, as described above with reference to formula (2);
S63, taking a set percentage of the maximum IDF value as the extraction threshold. The set percentage may be, for example, 80%, but is not limited thereto; it is chosen so that the retained words express closely similar or repeated semantics;
S64, selecting the words whose IDF values exceed the extraction threshold as keywords.
In summary, the keyword extraction above first removes stop words from the segmentation sequence of the short text and then weights the remaining words by their IDF values. This accounts for the characteristics of short texts: they mostly contain about 10 words, the topic or subject words of a question are rarely repeated, and most repeated words are stop words or otherwise meaningless. Removing stop words and computing IDF values therefore minimizes the interference of meaningless words while retaining the main topic words. Moreover, using a set percentage of the maximum IDF value as the extraction threshold also reflects the small word count of short texts and the fact that a short text describes a single, specific topic or subject; such a percentile-of-maximum threshold finds accurate keywords more reliably.
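Steps S60 to S64 can be sketched as follows. This is an illustrative sketch: `idf_lookup` is a hypothetical precomputed word-to-IDF mapping (obtained via formula (2)), and `stop_words` a precompiled stop-word list; neither name comes from the patent itself:

```python
def extract_keywords(terms, stop_words, idf_lookup, set_percentage=0.8):
    """Steps S60-S64: drop stop words, look up the IDF of each remaining
    word, and keep the words whose IDF exceeds set_percentage of the
    maximum IDF among them."""
    kept = [t for t in terms if t not in stop_words]   # S61
    if not kept:
        return []
    idfs = [idf_lookup.get(t, 0.0) for t in kept]      # S62
    threshold = set_percentage * max(idfs)             # S63
    return [t for t, v in zip(kept, idfs) if v > threshold]  # S64
```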
Acquiring the editing repetition degrees of the two short texts may include, for example:
and if the editing distance coefficient of the two short texts is larger than a second set threshold, the editing repetition degree is 1, otherwise, the editing repetition degree is 0.
The edit distance coefficients of the two short texts can be calculated by the following formula (3):
score=1-editLength/max(length1,length2) (3)
in the formula (3), score is an edit distance coefficient of two short texts, length1 is the length of a word segmentation sequence corresponding to one short text, length2 is the length of a word segmentation sequence corresponding to the other short text, and editLength is an edit distance of two short texts.
Preferably, the second set threshold may be 0.65. If the edit distance coefficient of two short texts is 0.65, at most 35% of the words in the two short texts need to be edited, where an editing operation is replacing one word with another, inserting a word, or deleting a word; i.e., at least 65% of the words in the two short texts are identical.
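Formula (3) and the second-threshold test can be sketched as follows. The text above does not define the edit distance itself, so the standard word-level Levenshtein distance (substitution, insertion, deletion, matching the editing operations just listed) is assumed:

```python
def edit_distance(seq1, seq2):
    """Word-level Levenshtein distance: minimum number of substitutions,
    insertions, and deletions turning one segmentation sequence into the
    other, computed with a single rolling row."""
    m, n = len(seq1), len(seq2)
    dp = list(range(n + 1))          # distances from the empty prefix of seq1
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                        # delete from seq1
                        dp[j - 1] + 1,                    # insert into seq1
                        prev + (seq1[i - 1] != seq2[j - 1]))  # substitute
            prev = cur
    return dp[n]

def edit_repetition(seq1, seq2, second_threshold=0.65):
    """Formula (3): score = 1 - editLength / max(length1, length2);
    the editing repetition degree is 1 when score exceeds the second
    set threshold (preferably 0.65), else 0."""
    score = 1 - edit_distance(seq1, seq2) / max(len(seq1), len(seq2))
    return 1 if score > second_threshold else 0
```

For example, two three-word sequences differing in one word have an edit distance coefficient of about 0.667, just above 0.65, so their editing repetition degree is 1.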
FIG. 7 is a schematic flow diagram of determining the relevance between two short texts based on the similarity score and the repetition degree according to one embodiment of the present disclosure. Referring to fig. 7, after the similarity score and the repetition degree are obtained, determining the relevance between the two short texts as one of irrelevant, similar, and repeated comprises:
If the similarity score is smaller than a third set threshold value, judging the correlation between the two short texts as irrelevant;
if the similarity score is greater than or equal to a third set threshold and less than a fourth set threshold, or the similarity score is greater than or equal to the fourth set threshold and the repetition degree is equal to 0, judging the correlation between the two short texts to be similar; wherein the fourth set threshold is greater than the third set threshold;
and if the similarity score is greater than or equal to a fourth set threshold and the repetition degree is not equal to 0, judging the correlation between the two short texts as repetition.
Preferably, the third set threshold may be 0.5 and the fourth set threshold may be 0.8. In this embodiment, two short texts whose similarity score is below the third set threshold are judged irrelevant; two short texts whose similarity score is at or above the third set threshold are further judged, using the fourth set threshold and the repetition degree, to be either similar or repeated. This relevance determination method is more reasonable and accurate, and its application scenarios are wider.
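The three-level decision, with the preferred threshold values 0.5 and 0.8, can be sketched as:

```python
def judge_relevance(similarity_score, repetition_degree,
                    third_threshold=0.5, fourth_threshold=0.8):
    """Three-level decision of Fig. 7.  repetition_degree is the sum of
    the keyword repetition degree and the editing repetition degree."""
    if similarity_score < third_threshold:
        return "irrelevant"
    if similarity_score < fourth_threshold or repetition_degree == 0:
        return "similar"
    return "repeated"
```

A pair scoring 0.9 with a nonzero repetition degree is judged repeated, while the same score with a repetition degree of 0 is only similar, which is exactly the distinction the two-signal design is meant to capture.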
In a second aspect of the present disclosure, a device for determining relevance of a short text is also provided. Fig. 8 is a block diagram of a short text relevance determination apparatus in an embodiment provided according to the present disclosure. Referring to fig. 8, the short text relevance discriminating apparatus includes: a first acquisition unit 80, a second acquisition unit 81, and a determination unit 82.
A first obtaining unit 80, configured to obtain similarity scores of two short texts;
a second obtaining unit 81, configured to obtain a repetition degree of the two short texts; and
and the determination unit 82 is configured to determine the relevance between the two short texts as one of irrelevant, similar, and repeated according to the similarity score and the repetition degree.
Here, the repetition degree of two short texts refers to the degree to which words are repeated between them. In this embodiment, the similarity score and the repetition degree of the two short texts are combined, and their relevance is given a three-level judgment: irrelevant, similar, or repeated, in order of increasing relevance. Irrelevant denotes the weakest relevance, meaning that the semantics expressed by the two short texts differ considerably. Repeated denotes the strongest relevance, meaning that the semantics of the two short texts do not differ, i.e., they describe the same event, state the same fact, express the same viewpoint, and so on. Similar denotes two short texts whose degree of relevance lies between irrelevant and repeated.
In a complex business scenario, when a user poses a new question, it is necessary to determine whether a semantically duplicate question already exists in the question bank, i.e., whether the two short texts are repeated. If the user's question duplicates one in the question bank, the user can be guided either to view the existing duplicate question directly or to state their own question more clearly. In addition, the existing question bank can be de-duplicated, i.e., only one of any set of repeated questions is retained, which effectively improves the quality of the questions in the bank.
In addition, short texts judged to be similar are a very important data source for a recommendation system: by recommending similar questions, as many high-quality questions and answers as possible can be exposed, providing a better experience for the user.
In a third aspect of the present disclosure, an electronic device is provided, comprising one or more processors; a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to perform the method for short text relevance discrimination of the first aspect. Fig. 9 is a schematic structural diagram of an electronic device 900 according to one embodiment of the present disclosure. As shown in fig. 9, the electronic apparatus 900 includes a Central Processing Unit (CPU)901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the electronic apparatus are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
The following components are connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is mounted into the storage section 908 as necessary.
In a fourth aspect of the present disclosure, there is provided a computer readable medium having stored thereon executable instructions, which when executed by a processor, cause the processor to execute the short text relevance determination method of the first aspect. In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer-readable medium bearing instructions that, in such embodiments, may be downloaded and installed from a network via the communication section 909, and/or installed from the removable medium 911. The various method steps described in the present invention are performed when the instructions are executed by a Central Processing Unit (CPU) 901.
Although example embodiments have been described, it will be apparent to those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the inventive concept. Accordingly, it should be understood that the above-described exemplary embodiments are not limiting, but illustrative.

Claims (8)

1. A short text relevance judging method comprises the following steps:
obtaining similarity scores of the two short texts;
acquiring the repetition degrees of the two short texts; and
judging the correlation between the two short texts as one of irrelevant, similar and repeated according to the similarity score and the repetition degree;
the obtaining of the repetition degrees of the two short texts comprises:
acquiring the keyword repetition degree and the editing repetition degree of the two short texts; and
accumulating the keyword repetition degree and the editing repetition degree to obtain the repetition degrees of the two short texts;
the discriminating the correlation between the two short texts as one of irrelevant, similar and repeated according to the similarity score and the repetition degree comprises:
if the similarity score is smaller than a third set threshold value, judging the correlation between the two short texts as irrelevant;
if the similarity score is greater than or equal to the third set threshold and less than a fourth set threshold, or the similarity score is greater than or equal to the fourth set threshold and the repetition degree is equal to 0, judging the correlation between the two short texts to be similar; wherein the fourth set threshold is greater than the third set threshold;
and if the similarity score is greater than or equal to the fourth set threshold and the repetition degree is not equal to 0, judging the correlation between the two short texts as repetition.
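The three-way rule of claim 1 can be sketched in a few lines of Python. The threshold values 0.4 and 0.8 and the function name are illustrative assumptions; the claim only requires that the fourth set threshold exceed the third.

```python
def classify(similarity_score, repetition,
             third_threshold=0.4, fourth_threshold=0.8):
    """Classify a pair of short texts as irrelevant, similar, or repeated.

    Thresholds are placeholders; the claim only fixes their ordering
    (fourth_threshold > third_threshold).
    """
    if similarity_score < third_threshold:
        return "irrelevant"
    # Either mid-range score, or high score with zero repetition degree.
    if similarity_score < fourth_threshold or repetition == 0:
        return "similar"
    # High score and nonzero repetition degree.
    return "repeated"
```

Note that a high similarity score alone is not enough to call two texts repeated: the repetition degree (keyword plus editing repetition) must also be nonzero, which guards against paraphrases being flagged as duplicates.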
2. The short text relevance judging method according to claim 1, wherein obtaining the similarity score of the two short texts comprises:
obtaining a similarity score vector for each of the two short texts;
applying normalized IDF weighting to each similarity score vector to obtain weighted values; and
calculating the average of all the weighted values as the similarity score of the two short texts.
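A hedged sketch of claim 2's scoring step: the claim does not specify how the similarity score vectors are produced, so this sketch simply takes per-word similarity entries and their IDF values as inputs, normalizes the IDFs into weights, and averages the weighted values.

```python
def similarity_score(word_scores, idf_values):
    """Average of IDF-weighted similarity entries (sketch of claim 2).

    word_scores: one similarity entry per word (origin unspecified
    in the claim); idf_values: the corresponding IDF values.
    """
    total_idf = sum(idf_values)
    # Normalize IDF values so the weights sum to 1.
    weights = [v / total_idf for v in idf_values]
    weighted = [s * w for s, w in zip(word_scores, weights)]
    # The claim averages all weighted values.
    return sum(weighted) / len(weighted)
```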
3. The short text relevance judging method according to claim 1, wherein obtaining the keyword repetition degree of the two short texts comprises:
obtaining the keyword coverage of the two short texts; and
setting the keyword repetition degree to 1 if the keyword coverage is greater than or equal to a first set threshold, and to 0 otherwise.
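Claim 3's binarization can be sketched as follows. The definition of keyword coverage (intersection size over the smaller keyword set) and the first set threshold of 0.5 are assumptions, since the claim leaves both unspecified.

```python
def keyword_repetition(keywords_a, keywords_b, first_threshold=0.5):
    """Binary keyword repetition degree (sketch of claim 3).

    Coverage is assumed to be |intersection| / |smaller keyword set|;
    the claim does not define it.
    """
    set_a, set_b = set(keywords_a), set(keywords_b)
    smaller = min(len(set_a), len(set_b))
    coverage = len(set_a & set_b) / max(1, smaller)
    return 1 if coverage >= first_threshold else 0
```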
4. The short text relevance judging method according to claim 3, wherein before obtaining the keyword coverage of the two short texts, the method further comprises:
extracting keywords from each of the two short texts, including:
performing word segmentation on each of the two short texts to obtain a word segmentation sequence;
removing stop words from the word segmentation sequence to obtain a processed word segmentation sequence;
calculating the IDF value of each word in the processed word segmentation sequence;
taking a set percentage of the maximum of the IDF values as an extraction threshold; and
selecting the words whose IDF values are greater than the extraction threshold as the keywords.
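The keyword extraction steps of claim 4 can be sketched as below, assuming the text has already been segmented, the IDF values are precomputed, and the set percentage is 0.5 (all three are assumptions; the claim fixes none of them).

```python
def extract_keywords(tokens, idf, stopwords, pct=0.5):
    """Extract keywords from a word segmentation sequence (claim 4 sketch).

    tokens: segmented words; idf: precomputed word -> IDF mapping;
    pct: the "set percentage" of the maximum IDF (assumed 0.5 here).
    """
    # Remove stop words from the segmentation sequence.
    filtered = [w for w in tokens if w not in stopwords]
    # Look up each remaining word's IDF value.
    idfs = {w: idf.get(w, 0.0) for w in filtered}
    # Extraction threshold: a set percentage of the maximum IDF.
    threshold = pct * max(idfs.values())
    # Keep words strictly above the threshold.
    return [w for w in filtered if idfs[w] > threshold]
```

Because the threshold is relative to the maximum IDF within the pair of texts rather than absolute, the rarest words in each text survive even when the overall vocabulary is common.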
5. The short text relevance judging method according to claim 1, wherein obtaining the editing repetition degree of the two short texts comprises:
setting the editing repetition degree to 1 if the edit distance coefficient of the two short texts is greater than a second set threshold, and to 0 otherwise.
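Claim 5 leaves the edit distance coefficient undefined; a common normalization, assumed here, is one minus the Levenshtein distance divided by the longer text's length, with a second set threshold of 0.8 chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # prev holds dp[j-1] from the previous row (substitution cost).
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def edit_repetition(a, b, second_threshold=0.8):
    """Binary editing repetition degree (sketch of claim 5).

    The coefficient 1 - distance / max(len) is an assumed definition;
    the claim does not specify it.
    """
    coeff = 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)
    return 1 if coeff > second_threshold else 0
```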
6. A short text relevance judging apparatus, comprising:
a first acquisition unit configured to obtain a similarity score of two short texts;
a second acquisition unit configured to obtain a repetition degree of the two short texts; and
a determination unit configured to determine the correlation between the two short texts as one of irrelevant, similar, and repeated according to the similarity score and the repetition degree;
wherein the second acquisition unit is further configured to obtain a keyword repetition degree and an editing repetition degree of the two short texts, and to sum the keyword repetition degree and the editing repetition degree to obtain the repetition degree of the two short texts;
and the determination unit is further configured to: determine the correlation between the two short texts to be irrelevant if the similarity score is less than a third set threshold; determine the correlation between the two short texts to be similar if the similarity score is greater than or equal to the third set threshold and less than a fourth set threshold, or if the similarity score is greater than or equal to the fourth set threshold and the repetition degree is equal to 0, wherein the fourth set threshold is greater than the third set threshold; and determine the correlation between the two short texts to be repeated if the similarity score is greater than or equal to the fourth set threshold and the repetition degree is not equal to 0.
7. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-5.
8. A computer readable medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 5.
CN201910653618.0A 2019-07-19 2019-07-19 Short text relevance judging method Active CN110348539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910653618.0A CN110348539B (en) 2019-07-19 2019-07-19 Short text relevance judging method


Publications (2)

Publication Number Publication Date
CN110348539A CN110348539A (en) 2019-10-18
CN110348539B true CN110348539B (en) 2021-05-07

Family

ID=68179183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910653618.0A Active CN110348539B (en) 2019-07-19 2019-07-19 Short text relevance judging method

Country Status (1)

Country Link
CN (1) CN110348539B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507771A (en) * 2020-04-21 2020-08-07 北京思特奇信息技术股份有限公司 Content income prediction method and system
CN112860846A (en) * 2021-01-19 2021-05-28 平安科技(深圳)有限公司 Problem vocabulary recommendation method and device, computer equipment and storage medium
CN113392184A (en) * 2021-06-09 2021-09-14 平安科技(深圳)有限公司 Method and device for determining similar texts, terminal equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955451A (en) * 2014-05-15 2014-07-30 北京优捷信达信息科技有限公司 Method for judging emotional tendentiousness of short text
CN106569989A (en) * 2016-10-20 2017-04-19 北京智能管家科技有限公司 De-weighting method and apparatus for short text
CN109947934A (en) * 2018-07-17 2019-06-28 中国银联股份有限公司 For the data digging method and system of short text

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184169B (en) * 2011-04-20 2013-06-19 北京百度网讯科技有限公司 Method, device and equipment used for determining similarity information among character string information
CN104281606B (en) * 2013-07-08 2021-06-25 腾讯科技(北京)有限公司 Method and device for displaying microblog comments
CN103544255B (en) * 2013-10-15 2017-01-11 常州大学 Text semantic relativity based network public opinion information analysis method
CN106407195B (en) * 2015-07-28 2021-02-26 北京京东尚科信息技术有限公司 Method and system for web page duplication elimination
CN106528581B (en) * 2015-09-15 2019-05-07 阿里巴巴集团控股有限公司 Method for text detection and device
CN107644010B (en) * 2016-07-20 2021-05-25 阿里巴巴集团控股有限公司 Text similarity calculation method and device
CN106528508A (en) * 2016-10-27 2017-03-22 乐视控股(北京)有限公司 Repeated text judgment method and apparatus
CN106708966B (en) * 2016-11-29 2023-04-25 中国计量大学 Junk comment detection method based on similarity calculation
CN106649871B (en) * 2017-01-03 2019-10-25 广州爱九游信息技术有限公司 Detection method, device and the calculating equipment of article multiplicity
CN108304371B (en) * 2017-07-14 2021-07-13 腾讯科技(深圳)有限公司 Method and device for mining hot content, computer equipment and storage medium
CN107729489A (en) * 2017-10-17 2018-02-23 北京京东尚科信息技术有限公司 Advertisement text recognition methods and device
CN108073708A (en) * 2017-12-20 2018-05-25 北京百度网讯科技有限公司 Information output method and device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant