WO2019136993A1 - Text similarity calculation method and device, computer apparatus, and storage medium - Google Patents

Text similarity calculation method and device, computer apparatus, and storage medium

Info

Publication number
WO2019136993A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
matched
target
word
similarity
Application number
PCT/CN2018/099994
Other languages
English (en)
Chinese (zh)
Inventor
艾明
Original Assignee
深圳壹账通智能科技有限公司
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2019136993A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Definitions

  • The present application relates to a text similarity calculation method, apparatus, computer device, and storage medium.
  • The edit distance, also known as the Levenshtein distance, is the minimum number of edit operations required to convert one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. The larger the edit distance, the lower the similarity between the strings.
  • The traditional edit distance algorithm usually operates on single characters. Because it computes the distance between character sequences, the result reflects only the surface form of the text, so the calculated text similarity has low accuracy.
  • A text similarity calculation method, apparatus, computer device, and storage medium capable of improving the accuracy of text similarity calculation are provided.
  • A text similarity calculation method includes: acquiring a character sequence to be matched and a target character sequence; preprocessing the character sequence to be matched and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and target word sequence; calculating a first similarity, by a first similarity algorithm, from the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence; extracting all the to-be-matched words to form a to-be-matched word set, and extracting all the target words to form a target word set; calculating a second similarity from the to-be-matched word set and the target word set by a second similarity algorithm; and calculating, according to the first similarity and the second similarity, the text similarity of the character sequence to be matched and the target character sequence.
  • A text similarity calculation device includes: a character sequence acquisition module, configured to acquire a character sequence to be matched and a target character sequence; a word sequence acquisition module, configured to preprocess the two character sequences respectively to obtain a corresponding to-be-matched word sequence and target word sequence; a first similarity calculation module, configured to calculate a first similarity from the to-be-matched words and the target words by the first similarity algorithm; a word set forming module, configured to extract all the to-be-matched words to form a to-be-matched word set and all the target words to form a target word set; a second similarity calculation module, configured to calculate a second similarity from the two word sets by a second similarity algorithm; and a text similarity calculation module, configured to calculate, according to the first similarity and the second similarity, the text similarity of the character sequence to be matched and the target character sequence.
  • A computer device includes a memory and one or more processors. The memory stores computer-readable instructions that, when executed by the one or more processors, cause them to perform the steps of the text similarity calculation method described above.
  • One or more non-transitory computer-readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the text similarity calculation method described above.
  • FIG. 1 is an application scenario diagram of a text similarity calculation method in accordance with one or more embodiments.
  • FIG. 2 is a flow diagram of a text similarity calculation method in accordance with one or more embodiments.
  • FIG. 3A is a schematic diagram of a word tree derived from physical matter in accordance with one or more embodiments.
  • FIG. 3B is a schematic diagram of a word tree derived from a virtual event in accordance with one or more embodiments.
  • FIG. 4 is a flow diagram of a text similarity calculation method in accordance with another embodiment.
  • FIG. 5 is a block diagram showing the structure of a text similarity calculation apparatus according to one or more embodiments.
  • FIG. 6 is a diagram showing the internal structure of a computer device in accordance with one or more embodiments.
  • The terms "first" and "second" only distinguish one element from another. For example, the first similarity may be referred to as the second similarity without departing from the scope of the present application, and similarly the second similarity may be referred to as the first similarity. Both are similarities, but they are not the same similarity.
  • The terminal 102 communicates with the server 104 over a network.
  • The server 104 can receive a character sequence to be matched sent by the terminal 102.
  • The terminal 102 can be, but is not limited to, a personal computer, notebook computer, smartphone, tablet, or portable wearable device, and the server 104 can be implemented as a stand-alone server or as a cluster of servers.
  • A text similarity calculation method is provided. Taking its application to the server 104 in FIG. 1 as an example, the method includes the following steps:
  • Step 202: Acquire a character sequence to be matched and a target character sequence.
  • The character sequence to be matched is the character sequence that needs to be matched. The target character sequence is a character sequence preset in the database against which the character sequence to be matched is matched.
  • A character sequence is a sequence of characters arranged in order. It may be any combination of letters, Arabic numerals, Chinese characters, and punctuation marks.
  • Step 204: Preprocess the character sequence to be matched and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and target word sequence.
  • Preprocessing refers to converting, reducing, splitting, or similarly processing at least one of the character sequence to be matched and the target character sequence.
  • A word sequence is a sequence of words arranged in order. The to-be-matched word sequence is the word sequence obtained by preprocessing the character sequence to be matched, and is formed in order from to-be-matched words. The target word sequence is the word sequence obtained by preprocessing the target character sequence, and is formed in order from target words.
  • A to-be-matched word or a target word may be a simple word composed of one or more characters, or a compound word composed of two or more simple words.
  • Step 204 includes: deleting the irrelevant characters contained in the character sequence to be matched and in the target character sequence; and performing word segmentation on each sequence after the irrelevant characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
  • Irrelevant characters are characters that do not affect the calculation of text similarity, including but not limited to punctuation marks and stop characters.
  • Word segmentation is the process of converting a character sequence into a word sequence according to certain rules. After the irrelevant characters are deleted, the character sequence to be matched and the target character sequence can be segmented by a method based on string matching, a method based on understanding, or a method based on statistics.
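  • The two parts of step 204 can be sketched as follows. This is a minimal illustration, not the method of the application: the stop-word list `STOP_WORDS` is hypothetical, and whitespace splitting stands in for a real word segmenter, which Chinese text would require.

```python
import re

# Hypothetical stop-word list; the text only says such a library "can be preset".
STOP_WORDS = {"is", "the", "a", "of"}


def preprocess(text: str) -> list:
    """Delete irrelevant characters, then split the text into a word sequence."""
    # Replace every character that is neither a word character nor whitespace
    # (i.e. punctuation) with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    # Whitespace splitting is a stand-in for a real word segmentation method.
    words = cleaned.split()
    # Drop stop words while keeping the original word order.
    return [w for w in words if w not in STOP_WORDS]


print(preprocess("How much is the cost of 3 computers?"))
```

  • The order of the surviving words is preserved because the first similarity algorithm described in step 206 depends on it.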
  • Step 206: Calculate a first similarity from the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence by a first similarity algorithm.
  • The first similarity algorithm is an algorithm that compares two word sequences word by word, in word order, and then calculates their similarity.
  • The to-be-matched word sequence and the target word sequence are each treated as a one-dimensional word sequence, and the first similarity is calculated by the first similarity algorithm according to the order of the to-be-matched words and the order of the target words.
  • Performing the similarity calculation on the one-dimensional form of the two word sequences can save system storage space and reduce time complexity.
  • Step 206 includes: calculating the edit distance between the to-be-matched word sequence and the target word sequence from their words by an edit distance formula; obtaining a first quantity, the number of to-be-matched words contained in the to-be-matched word sequence, and a second quantity, the number of target words contained in the target word sequence; and calculating the first similarity from the edit distance, the first quantity, and the second quantity.
  • Here the edit distance is the minimum number of edit operations required to convert one word sequence into the other. Calculating the edit distance in units of words can reduce the influence of word-sequence semantics on the edit distance and improve the accuracy of the calculated similarity of the word sequences.
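  • The to-be-matched word sequence S contains $|S|$ to-be-matched words, and the target word sequence T contains $|T|$ target words.
  • The edit distance $\mathrm{lev}_{S,T}(i,j)$ between the to-be-matched word sequence S and the target word sequence T can be calculated by the formula
    $$\mathrm{lev}_{S,T}(i,j)=\begin{cases}\max(i,j), & \min(i,j)=0\\ \min\left\{\mathrm{lev}_{S,T}(i,j-1)+1,\ \mathrm{lev}_{S,T}(i-1,j)+1,\ \mathrm{lev}_{S,T}(i-1,j-1)+1_{(S_i\neq T_j)}\right\}, & \text{otherwise}\end{cases}$$
    where i denotes the i-th to-be-matched word in S and j denotes the j-th target word in T.
  • When $\min(i,j)=0$, the edit distance $\mathrm{lev}_{S,T}(i,j)$ takes the maximum of i and j; otherwise it takes the minimum of $\mathrm{lev}_{S,T}(i,j-1)+1$, $\mathrm{lev}_{S,T}(i-1,j)+1$, and $\mathrm{lev}_{S,T}(i-1,j-1)+1_{(S_i\neq T_j)}$, where the indicator $1_{(S_i\neq T_j)}$ is 0 when the two words are the same and 1 when they differ.
  • The first similarity $\mathrm{sim1}_{S,T}(i,j)$ is then calculated by a similarity formula from the edit distance and $\mathrm{Max}(|S|,|T|)$, where $\mathrm{Max}(|S|,|T|)$ represents the maximum value in $|S|$ and $|T|$.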
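  • The word-level recurrence can be sketched as a small dynamic program. The normalization `1 - distance / max(|S|, |T|)` used for the first similarity is an assumption (a common choice), because the similarity formula itself is not legible in this text; the target word sequence in the usage lines is likewise hypothetical.

```python
def word_edit_distance(s: list, t: list) -> int:
    """Levenshtein distance where the unit of editing is a whole word."""
    # dp[i][j] is the distance between the first i words of s and the first j of t.
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        for j in range(len(t) + 1):
            if min(i, j) == 0:
                dp[i][j] = max(i, j)
            else:
                # Substitution costs 1 only when the two words differ.
                cost = 0 if s[i - 1] == t[j - 1] else 1
                dp[i][j] = min(
                    dp[i][j - 1] + 1,        # insert a word
                    dp[i - 1][j] + 1,        # delete a word
                    dp[i - 1][j - 1] + cost,  # replace (or keep) a word
                )
    return dp[len(s)][len(t)]


def first_similarity(s: list, t: list) -> float:
    """Assumed normalization: 1 - distance / max(|S|, |T|)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - word_edit_distance(s, t) / longest


# Hypothetical word sequences for illustration only.
S = ["3", "台", "computer", "how much", "money"]
T = ["3", "台", "computer", "price"]
print(word_edit_distance(S, T), first_similarity(S, T))
```

  • With these two sequences the distance is 2 (one replacement and one deletion), so the assumed normalization gives 1 - 2/5.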
  • Step 208: Extract all the to-be-matched words to form a to-be-matched word set, and extract all the target words to form a target word set.
  • The to-be-matched word set is the set of all to-be-matched words contained in the to-be-matched word sequence, and the target word set is the set of all target words contained in the target word sequence.
  • The to-be-matched words in the to-be-matched word set are unordered, as are the target words in the target word set.
  • A number written out in words can also be converted to Arabic numerals; for example, "thirty-three" can be converted to "33". Unifying numbers as Arabic numerals makes them easier to match and improves the accuracy of the text similarity.
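  • A minimal sketch of this number normalization, assuming English number words from 0 to 99 only; the tables `UNITS` and `TENS` and the function name are illustrative and not part of the application, whose examples are Chinese in the original.

```python
# Illustrative lookup tables for English number words 0-99.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_digits(token: str) -> str:
    """Return the Arabic-numeral form of a number word, or the token unchanged."""
    parts = token.lower().replace("-", " ").split()
    if len(parts) == 1 and parts[0] in UNITS:
        return str(UNITS[parts[0]])
    if len(parts) == 1 and parts[0] in TENS:
        return str(TENS[parts[0]])
    # A tens word followed by a unit from 1 to 9, e.g. "thirty-three".
    if len(parts) == 2 and parts[0] in TENS and 1 <= UNITS.get(parts[1], 0) <= 9:
        return str(TENS[parts[0]] + UNITS[parts[1]])
    return token  # not a recognised number word


print(words_to_digits("Thirty-three"))
```

  • Tokens that are not number words pass through unchanged, so the function can be applied to every word before the sets are formed.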
  • Step 210: Calculate a second similarity from the to-be-matched word set and the target word set by a second similarity algorithm.
  • The second similarity algorithm is a similarity algorithm that compares all the to-be-matched words and all the target words as a whole. It includes, but is not limited to, word similarity algorithms based on a semantic dictionary and word similarity algorithms based on corpus statistics.
  • Step 210 includes: matching the to-be-matched word set against the target word set and counting the number of matches between to-be-matched words and target words; counting the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set; and calculating the second similarity from the number of matches, the number of to-be-matched words, and the number of target words.
  • Step 212: Calculate the text similarity of the character sequence to be matched and the target character sequence according to the first similarity and the second similarity.
  • The text similarity is the similarity between the character sequence to be matched and the target character sequence.
  • The first similarity may be multiplied by the second similarity to give the text similarity.
  • Alternatively, a first weight corresponding to the first similarity and a second weight corresponding to the second similarity may be preset, and the text similarity obtained as the weighted sum of the first similarity and the second similarity.
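  • Both ways of combining the two similarities fit in a few lines. The default weights of 0.5 are placeholders chosen for illustration; the text only says the weights may be preset.

```python
def combine_product(sim1: float, sim2: float) -> float:
    """Text similarity as the product of the two similarities."""
    return sim1 * sim2


def combine_weighted(sim1: float, sim2: float, w1: float = 0.5, w2: float = 0.5) -> float:
    """Text similarity as a weighted sum; w1 and w2 are placeholder weights."""
    return sim1 * w1 + sim2 * w2


print(combine_product(0.6, 0.5), combine_weighted(0.6, 0.5))
```

  • The product penalizes a pair that scores low on either measure, whereas the weighted sum lets one measure compensate for the other; which behaviour is preferable depends on the application.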
  • In the above text similarity calculation method, the character sequence to be matched and the target character sequence are preprocessed to obtain a to-be-matched word sequence and a target word sequence, each formed in order in units of words.
  • The first similarity is calculated by a first similarity algorithm that takes word order into account. The to-be-matched word set and the target word set are then formed from the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence.
  • The second similarity is calculated by a second similarity algorithm that does not consider word order, and the first similarity and the second similarity are then combined to calculate the text similarity between the character sequence to be matched and the target character sequence.
  • Performing the similarity calculation in units of words and combining two similarity algorithms reduces the error caused by relying on a single feature and a single similarity algorithm, and improves the accuracy of the text similarity calculation.
  • The irrelevant characters include stop characters and identical characters. Deleting the irrelevant characters contained in the character sequence to be matched and in the target character sequence includes: deleting the stop characters contained in both sequences; determining whether identical characters exist in the two sequences after the stop characters are deleted, an identical character being the same character at the same position in both sequences; and if so, deleting the identical characters from both sequences before obtaining the corresponding to-be-matched word sequence and target word sequence.
  • Stop characters are characters or words that, as in information retrieval, are filtered out before a character sequence is processed in order to save storage space and improve search efficiency.
  • A stop character library can be preset for filtering stop characters.
  • Chinese stop characters include, but are not limited to, modal particles, conjunctions, and adversative words, such as "ah", "ba", "of", "further", and "but".
  • Because the characters in the character sequence to be matched and the characters in the target character sequence are ordered, the two sequences can be compared position by position. A character that is the same at the same position in both sequences is an identical character, and the identical characters are deleted from each sequence.
  • For example, the character sequence to be matched is "how to optimize this algorithm", and the target character sequence is "how to do the optimization algorithm". The two characters rendered here as "calculation" and "method", which together form the word "algorithm", occupy the same positions in both sequences, so they can be deleted.
  • After the identical characters are deleted, the character sequence to be matched is "how to optimize this" and the target character sequence is "optimize what to do".
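  • The position-by-position deletion of identical characters can be sketched as below. How positions beyond the shorter sequence are handled is not stated in the text; this sketch assumes they are simply kept, and the sample strings are hypothetical.

```python
def delete_same_position_chars(a: str, b: str) -> tuple:
    """Drop characters that are equal at the same index in both strings."""
    keep_a, keep_b = [], []
    shared = min(len(a), len(b))
    for i in range(shared):
        if a[i] != b[i]:
            keep_a.append(a[i])
            keep_b.append(b[i])
    # Assumption: the tail of the longer string has no counterpart, so keep it.
    keep_a.extend(a[shared:])
    keep_b.extend(b[shared:])
    return "".join(keep_a), "".join(keep_b)


# Hypothetical inputs: "XY" sits at indices 2-3 in both strings.
print(delete_same_position_chars("abXYcd", "efXYgh"))
```

  • Only an exact positional coincidence counts; the same character at different indices in the two sequences is left in place.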
  • Deleting irrelevant characters shortens the word sequences that take part in the text similarity calculation, which saves calculation time, reduces the memory occupied by the calculation, and improves the efficiency of the text similarity calculation.
  • The irrelevant characters contained in the character sequence to be matched or the target character sequence may instead be replaced with a preset character, and all the preset characters contained in the sequence cleared after the replacement.
  • For example, with the space character as the preset character, a character sequence to be matched S2 containing space characters is obtained, and all the space characters contained in S2 are cleared to obtain a character sequence to be matched S3 that contains no space characters.
  • Word segmentation can then be performed on the character sequence to be matched S3 to obtain a to-be-matched word sequence S4.
  • A word tree can be constructed from the hypernym-hyponym hierarchy of the words in a semantic dictionary, as shown in FIG. 3A and FIG. 3B. FIG. 3A shows a word tree derived from physical matter, and FIG. 3B shows a word tree derived from virtual events.
  • The word at a parent node and the words at its child nodes stand in a hypernym-hyponym relationship.
  • The semantic distance between words can be calculated from the word tree: the higher the level, the larger the path parameter, and the lower the level, the smaller the path parameter. The greater the distance, the lower the similarity.
  • Once the path length between word A and word B in the word tree, that is, the semantic distance d, has been calculated, the similarity between word A and word B can be calculated by a formula in d and an adjustable parameter.
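  • The word-tree similarity can be sketched with a parent map. Everything specific here is an assumption: the tree contents, the unit weight on every edge (the text describes level-dependent path parameters), the form `alpha / (d + alpha)`, which is a commonly used distance-to-similarity mapping, and the value of `alpha`. The formula in the application is not legible in this text.

```python
# Hypothetical word tree stored as child -> parent; the root has parent None.
PARENT = {
    "entity": None,
    "device": "entity",
    "computer": "device",
    "phone": "device",
    "laptop": "computer",
}


def path_to_root(word: str) -> list:
    """List the word and its ancestors, from the word up to the root."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path


def semantic_distance(a: str, b: str) -> int:
    """Number of edges on the path between a and b (every edge weighs 1 here)."""
    depth_from_a = {w: steps for steps, w in enumerate(path_to_root(a))}
    for steps_from_b, w in enumerate(path_to_root(b)):
        if w in depth_from_a:  # first common ancestor on the way up from b
            return depth_from_a[w] + steps_from_b
    raise ValueError("words do not share a root")


def word_similarity(a: str, b: str, alpha: float = 1.6) -> float:
    """Assumed mapping from distance to similarity: alpha / (d + alpha)."""
    return alpha / (semantic_distance(a, b) + alpha)


print(semantic_distance("laptop", "phone"), word_similarity("laptop", "phone"))
```

  • Under this mapping identical words have distance 0 and similarity 1, and the similarity falls toward 0 as the path grows, which agrees with the statement that a greater distance means a lower similarity.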
  • The to-be-matched word set and the target word set may be matched by using the second similarity algorithm to calculate a second sub-similarity between each to-be-matched word in the to-be-matched word set and each target word in the target word set.
  • The second similarity can be calculated from all the calculated second sub-similarities.
  • The second sub-similarities greater than a preset sub-similarity threshold are counted. The number of matched to-be-matched words is Q(S, T), and the number of to-be-matched words and the number of target words are also counted.
  • The second similarity sim2 can then be calculated by a formula from Q(S, T) and these counts; the formula contains a term Max(…).
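  • A sketch of the threshold-based matching. The final ratio `Q / max(|S|, |T|)` is an assumption suggested by the surviving "Max(" fragment, not a formula confirmed by this text; the threshold value and the toy sub-similarity table are also hypothetical.

```python
def second_similarity(s_words, t_words, sub_similarity, threshold: float = 0.8) -> float:
    """Count to-be-matched words with a close enough target word, then normalise."""
    s_set, t_set = set(s_words), set(t_words)
    if not s_set or not t_set:
        return 0.0
    # Q(S, T): to-be-matched words whose sub-similarity with at least one
    # target word exceeds the preset threshold.
    matches = sum(
        1 for s in s_set
        if any(sub_similarity(s, t) > threshold for t in t_set)
    )
    # Assumed normalisation: divide by the size of the larger set.
    return matches / max(len(s_set), len(t_set))


# Toy sub-similarity: identical words score 1.0, one hand-picked pair scores 0.9.
TOY = {("money", "price"): 0.9}


def toy_sub_similarity(a: str, b: str) -> float:
    return 1.0 if a == b else TOY.get((a, b), 0.0)


print(second_similarity(["computer", "how much", "money"],
                        ["computer", "price"], toy_sub_similarity))
```

  • Because the inputs are converted to sets, word order and repeated words have no effect on the result, in line with the description of the second similarity algorithm.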
  • The method further includes: acquiring a to-be-matched pinyin sequence corresponding to the character sequence to be matched and a target pinyin sequence corresponding to the target character sequence; and calculating a third similarity, by the first similarity algorithm, from the to-be-matched pinyin contained in the to-be-matched pinyin sequence and the target pinyin contained in the target pinyin sequence.
  • Calculating the text similarity according to the first similarity and the second similarity then includes: calculating the text similarity of the character sequence to be matched and the target character sequence according to the first similarity, the second similarity, and the third similarity.
  • The to-be-matched pinyin sequence is the sequence of pinyin corresponding to the characters in the character sequence to be matched, and the target pinyin sequence is the sequence of pinyin corresponding to the characters in the target character sequence.
  • The to-be-matched pinyin sequence may be generated from the pinyin that the user enters for each character during an input operation.
  • The target pinyin sequence may be a sequence preset in the database that corresponds to the target character sequence.
  • The third similarity may be calculated from the to-be-matched pinyin sequence and the target pinyin sequence by the first similarity algorithm, in units of the pinyin corresponding to each character.
  • For example, the to-be-matched pinyin sequence corresponding to the character sequence "Your name" is "ni ming zi ao kou", and the target pinyin sequence corresponding to the target character sequence "You are too stubborn" is "ni tai zhi niu".
  • Both the character sequence to be matched and the target character sequence contain the character "拗", whose pinyin in the to-be-matched pinyin sequence and in the target pinyin sequence is "ao" and "niu" respectively. Calculating the text similarity with the to-be-matched pinyin sequence and the target pinyin sequence therefore reduces the error caused by the polyphonic character "拗".
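  • The effect of the polyphonic character can be seen by running one edit distance on both representations. The pinyin sequences come from the example above; the two Chinese strings are an assumed reconstruction from that pinyin and may differ from the original filing.

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance over any two sequences, using one rolling row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                cur[j - 1] + 1,            # insertion
                prev[j] + 1,               # deletion
                prev[j - 1] + (x != y),    # substitution, free when equal
            ))
        prev = cur
    return prev[-1]


# Assumed reconstruction of the example's character sequences.
chars_s, chars_t = list("你名字拗口"), list("你太执拗")
# Pinyin sequences as given in the example.
pinyin_s = "ni ming zi ao kou".split()
pinyin_t = "ni tai zhi niu".split()

print(edit_distance(chars_s, chars_t))    # characters: "拗" counts as a match
print(edit_distance(pinyin_s, pinyin_t))  # pinyin: "ao" and "niu" differ
```

  • At the character level the shared "拗" lowers the distance even though it is read differently in the two sequences; at the pinyin level it no longer matches, which is the correction a third similarity is meant to contribute.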
  • Acquiring the character sequence to be matched and the target character sequence includes: receiving the character sequence to be matched sent by the terminal; and acquiring a plurality of target character sequences from the database according to the character sequence to be matched. After the text similarity of the character sequence to be matched and the target character sequence is calculated from the first similarity and the second similarity, the method further includes: querying the related resources corresponding to the target character sequences whose text similarity is greater than a preset similarity threshold; and sending the related resources to the terminal.
  • The text similarity is calculated between the character sequence to be matched and each of the plurality of target character sequences, and the target character sequence with the highest text similarity to the character sequence to be matched can also be determined.
  • A target character sequence can be associated with text, images, links, audio, video, and other related resources.
  • The character sequence to be matched may be a question that the user sends through the terminal.
  • The target character sequence may be associated with the corresponding answer text. After the target character sequence with the highest similarity to the character sequence to be matched is determined, the answer text associated with that target character sequence may be sent to the terminal.
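  • The retrieval step can be sketched as scoring every stored template and returning the resources above the threshold. The function `overlap_similarity` is only a stand-in for the combined text similarity, and the templates, answers, and threshold are hypothetical.

```python
def overlap_similarity(a, b) -> float:
    """Stand-in text similarity: shared distinct words over the larger set size."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / max(len(sa), len(sb))


def related_resources(query_words, templates, similarity=overlap_similarity,
                      threshold: float = 0.5) -> list:
    """Return resources of templates scoring above the threshold, best first."""
    scored = [(similarity(query_words, words), resource)
              for words, resource in templates]
    hits = [pair for pair in scored if pair[0] > threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [resource for _, resource in hits]


# Hypothetical question templates paired with their associated answers.
TEMPLATES = [
    (["computer", "repair"], "answer B"),
    (["computer", "price", "list"], "answer C"),
    (["3", "computer", "price"], "answer A"),
]
print(related_resources(["3", "computer", "price"], TEMPLATES))
```

  • The first element of the returned list corresponds to the target character sequence with the highest similarity, which covers the case where only the best answer is sent to the terminal.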
  • As shown in FIG. 4, another text similarity calculation method is provided, including the following steps:
  • Step 402: Acquire a character sequence to be matched and a target character sequence.
  • The character sequence to be matched and the target character sequence may each be any combination of letters, Arabic numerals, Chinese characters, and punctuation marks.
  • The character sequence to be matched may be a question that the user sends through the terminal, for example "How much is the cost of 3 computers?".
  • The target character sequence may be the character sequence of a question template pre-stored in the database, for example "3 computer prices?".
  • Step 404: Delete the irrelevant characters contained in the character sequence to be matched and the irrelevant characters contained in the target character sequence.
  • Irrelevant characters include, but are not limited to, punctuation marks and stop characters. A stop character library can be preset for filtering stop characters.
  • Chinese stop characters include, but are not limited to, modal particles, conjunctions, and adversative words, such as "ah", "ba", "of", "further", and "but".
  • For example, the character sequence to be matched "How much is the cost of 3 computers?" contains the punctuation mark "?" and the stop character "is". Deleting these irrelevant characters gives "3 computers how much money".
  • Step 406: Perform word segmentation on the character sequence to be matched and the target character sequence after the irrelevant characters are deleted, to obtain a corresponding to-be-matched word sequence and target word sequence.
  • Word segmentation is the process of converting a character sequence into a word sequence according to certain rules. The to-be-matched word sequence is formed in order from to-be-matched words, and the target word sequence is formed in order from target words.
  • For example, the to-be-matched word sequence may be obtained as “3, 台, computer, how much, money”. It contains the five to-be-matched words “3”, “台” (a measure word), “computer”, “how much”, and “money”.
  • Step 408: Calculate the edit distance between the to-be-matched word sequence and the target word sequence, by an edit distance formula, from the to-be-matched words contained in the to-be-matched word sequence and the target words contained in the target word sequence.
  • The edit distance formula calculates the minimum number of edit operations required to convert one word sequence into another. This minimum number of edit operations is the edit distance.
  • The permitted edit operations are replacing one word with another, inserting a word, and deleting a word.
  • For example, the to-be-matched word sequence is “3, 台, computer, how much, money” and the target word sequence is “3 …”.
  • Step 410: Obtain a first quantity, the number of to-be-matched words contained in the to-be-matched word sequence, and a second quantity, the number of target words contained in the target word sequence.
  • Step 412: Calculate the first similarity from the edit distance, the first quantity, and the second quantity.
  • The to-be-matched word sequence S contains $|S|$ to-be-matched words, and the target word sequence T contains $|T|$ target words.
  • The edit distance $\mathrm{lev}_{S,T}(i,j)$ between the to-be-matched word sequence S and the target word sequence T can be calculated by the formula
    $$\mathrm{lev}_{S,T}(i,j)=\begin{cases}\max(i,j), & \min(i,j)=0\\ \min\left\{\mathrm{lev}_{S,T}(i,j-1)+1,\ \mathrm{lev}_{S,T}(i-1,j)+1,\ \mathrm{lev}_{S,T}(i-1,j-1)+1_{(S_i\neq T_j)}\right\}, & \text{otherwise}\end{cases}$$
    where i denotes the i-th to-be-matched word in S and j denotes the j-th target word in T.
  • When $\min(i,j)=0$, the edit distance takes the maximum of i and j; otherwise it takes the minimum of the three terms, where the indicator $1_{(S_i\neq T_j)}$ is 0 when the two words are the same and 1 when they differ.
  • The first similarity $\mathrm{sim1}_{S,T}(i,j)$ is then calculated by a similarity formula from the edit distance and $\mathrm{Max}(|S|,|T|)$, where $\mathrm{Max}(|S|,|T|)$ represents the maximum value in $|S|$ and $|T|$.
  • Step 414: Extract all the to-be-matched words to form a to-be-matched word set, and extract all the target words to form a target word set.
  • For example, the to-be-matched word sequence is “3, 台, computer, how much, money”.
  • In the to-be-matched word set, the five to-be-matched words “3”, “台”, “computer”, “how much”, and “money” stand side by side and have no order.
  • Step 416: Match the to-be-matched word set against the target word set, and count the number of matches between to-be-matched words and target words.
  • A word tree can be constructed from the hypernym-hyponym hierarchy of the words in a semantic dictionary, and the second sub-similarity between a to-be-matched word and a target word calculated from the path distance between the two words in the word tree.
  • A to-be-matched word whose second sub-similarity with a target word is greater than the preset sub-similarity threshold is determined to match that target word, and the number of matches between the to-be-matched words in the to-be-matched word set and the target words in the target word set is counted.
  • Step 418: Count the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set.
  • Step 420: Calculate the second similarity from the number of matches, the number of to-be-matched words, and the number of target words.
  • for example, suppose the to-be-matched word set is {"computer", "how much", "money"} and the target word set is {"computer", "price"}.
  • the second sub-similarity can be calculated for every pair of words: sim_11 between "computer" and "computer", sim_12 between "computer" and "price", sim_21 between "how much" and "computer", sim_22 between "how much" and "price", sim_31 between "money" and "computer", and sim_32 between "money" and "price". For each to-be-matched word, the maximum second sub-similarity over the target words of the target word set is taken, and these maxima are multiplied together to obtain the second similarity sim2.
  • if the maximum second sub-similarities corresponding to "computer", "how much", and "money" are sim_11, sim_22, and sim_32 respectively, then these three values are the ones multiplied together to obtain sim2.
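The example above, in which each to-be-matched word contributes its maximum sub-similarity and the maxima are multiplied, can be sketched as follows. The function name, the handling of empty sets, and the numeric sub-similarity values are hypothetical; in practice the sub-similarities would come from the word tree.

```python
import math
from typing import Callable, Dict, Iterable, Tuple


def second_similarity_max_product(
    to_match: Iterable[str],
    targets: Iterable[str],
    sub_sim: Callable[[str, str], float],
) -> float:
    """Multiply, over the to-be-matched words, the best sub-similarity against any target."""
    to_match = list(to_match)
    targets = list(targets)
    if not to_match or not targets:
        return 0.0  # ASSUMPTION: nothing to compare means no similarity
    return math.prod(max(sub_sim(w, t) for t in targets) for w in to_match)


# Hypothetical sub-similarity table standing in for sim_11 ... sim_32.
TABLE: Dict[Tuple[str, str], float] = {
    ("computer", "computer"): 1.0, ("computer", "price"): 0.2,
    ("how much", "computer"): 0.1, ("how much", "price"): 0.8,
    ("money", "computer"): 0.2, ("money", "price"): 0.5,
}

sim2 = second_similarity_max_product(
    ["computer", "how much", "money"],
    ["computer", "price"],
    lambda w, t: TABLE[(w, t)],
)
```

With these made-up values the maxima are sim_11 = 1.0, sim_22 = 0.8 and sim_32 = 0.5, so `sim2` is their product.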
  • Step 422 Calculate the text similarity between the character sequence to be matched and the target character sequence from the first similarity and the second similarity.
  • the first similarity sim1 and the second similarity sim2 may be multiplied by a corresponding first weight w1 and second weight w2 respectively and then summed.
  • the text similarity is then sim(S, T) = sim1 × w1 + sim2 × w2.
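The weighted combination in Step 422 can be sketched as follows. The default weights of 0.5 each are an assumption made only so the example runs; the text does not state how w1 and w2 are chosen.

```python
def text_similarity(sim1: float, sim2: float, w1: float = 0.5, w2: float = 0.5) -> float:
    """sim(S, T) = sim1 * w1 + sim2 * w2; the default weights are placeholders."""
    return sim1 * w1 + sim2 * w2


# Example: weight the order-aware score more heavily than the set-based score.
combined = text_similarity(0.6, 0.4, w1=0.7, w2=0.3)
```

If the two weights sum to 1 and both inputs lie between 0 and 1, the combined score also lies between 0 and 1, which keeps it comparable with a fixed similarity threshold.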
  • in the above text similarity calculation method, the to-be-matched word sequence and the target word sequence, each formed in order from word units, are obtained.
  • the first similarity is calculated by a first similarity algorithm that takes word order into account; a to-be-matched word set and a target word set are then formed from the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence, respectively.
  • the second similarity is calculated by a second similarity algorithm that does not consider word order, and the first similarity and the second similarity are then combined to calculate the text similarity between the character sequence to be matched and the target character sequence.
  • because similarity is calculated at the word level and two similarity algorithms are combined, the error that arises when a single similarity algorithm captures only a single characteristic is reduced, and the accuracy of the text similarity calculation is improved.
  • a text similarity calculation device 500 includes: a character sequence acquisition module 502, configured to acquire a character sequence to be matched and a target character sequence;
  • a word sequence acquisition module 504, configured to preprocess the character sequence to be matched and the target character sequence respectively, to obtain a corresponding to-be-matched word sequence and target word sequence;
  • a first similarity calculation module 506, configured to perform a calculation, by a first similarity algorithm, on the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence, to obtain a first similarity;
  • a word set forming module 508, configured to extract all the to-be-matched words to form a to-be-matched word set, and extract all the target words to form a target word set;
  • a second similarity calculation module 510, configured to perform a calculation on the to-be-matched word set and the target word set by a second similarity algorithm, to obtain a second similarity;
  • and a text similarity calculation module 512, configured to calculate, from the first similarity and the second similarity, the text similarity between the character sequence to be matched and the target character sequence.
  • the word sequence acquisition module 504 is further configured to delete the unrelated characters included in the character sequence to be matched and the unrelated characters included in the target character sequence, and to perform word segmentation separately on the character sequence to be matched and the target character sequence after the unrelated characters are deleted, to obtain the corresponding to-be-matched word sequence and target word sequence.
  • the unrelated characters include deactivated characters (stop characters) and identical characters. The word sequence acquisition module 504 is further configured to: delete the deactivated characters included in the character sequence to be matched and the deactivated characters included in the target character sequence; detect whether identical characters exist in the character sequence to be matched and the target character sequence after the deactivated characters are deleted, where identical characters are characters that are the same and occupy the same position in the two sequences after the deactivated characters are deleted; and, if so, delete the identical characters from both sequences and obtain the corresponding to-be-matched word sequence and target word sequence.
  • the first similarity calculation module 506 is further configured to: perform a calculation, by an edit distance formula, on the to-be-matched words included in the to-be-matched word sequence and the target words included in the target word sequence, to obtain the edit distance between the to-be-matched word sequence and the target word sequence; count a first number of to-be-matched words included in the to-be-matched word sequence and a second number of target words included in the target word sequence; and calculate the first similarity from the edit distance, the first number, and the second number.
  • the second similarity calculation module 510 is further configured to: match the to-be-matched word set against the target word set and count the number of matches between the to-be-matched words and the target words; count the number of to-be-matched words in the to-be-matched word set and the number of target words in the target word set; and calculate the second similarity from the number of matches, the number of to-be-matched words, and the number of target words.
  • the device further includes a third similarity calculation module 514, configured to acquire a to-be-matched pinyin sequence corresponding to the character sequence to be matched and a target pinyin sequence corresponding to the target character sequence, and to perform a calculation, by the first similarity algorithm, on the to-be-matched pinyin included in the to-be-matched pinyin sequence and the target pinyin included in the target pinyin sequence, to obtain a third similarity.
  • in this case, calculating the text similarity between the character sequence to be matched and the target character sequence from the first similarity and the second similarity includes: calculating from the first similarity, the second similarity, and the third similarity to obtain the text similarity between the character sequence to be matched and the target character sequence.
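The pinyin-based third similarity can be sketched as follows. The character-to-pinyin table is a tiny hypothetical stand-in for a real pinyin converter, the normalisation of the edit distance is the same assumed form used for an order-aware similarity (one minus distance over the longer length), and the three-way weighted sum is also an assumption, since the text only says the three similarities are combined.

```python
from typing import Dict, List, Sequence

# Hypothetical toy table; a real system would use a full pinyin dictionary.
TOY_PINYIN: Dict[str, str] = {
    "电": "dian", "脑": "nao", "店": "dian", "恼": "nao", "价": "jia", "格": "ge",
}


def to_pinyin_sequence(text: str) -> List[str]:
    """Map each character to its pinyin; unknown characters pass through."""
    return [TOY_PINYIN.get(ch, ch) for ch in text]


def edit_distance(s: Sequence[str], t: Sequence[str]) -> int:
    """Levenshtein distance over sequence elements (here, pinyin syllables)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def third_similarity(to_match: str, target: str) -> float:
    """ASSUMED normalisation: 1 - lev / max(len) over pinyin sequences."""
    p, q = to_pinyin_sequence(to_match), to_pinyin_sequence(target)
    longest = max(len(p), len(q))
    return 1.0 if longest == 0 else 1.0 - edit_distance(p, q) / longest


def text_similarity3(sim1: float, sim2: float, sim3: float,
                     w1: float, w2: float, w3: float) -> float:
    """ASSUMED three-way weighted sum of the first, second and third similarities."""
    return sim1 * w1 + sim2 * w2 + sim3 * w3
```

Comparing pinyin rather than characters lets homophone typos score highly: in this toy table "电脑" and "店恼" map to the same syllables, so their third similarity is 1.0 even though no character matches.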
  • the character sequence acquisition module 502 is further configured to receive a character sequence to be matched sent by a terminal, and to obtain a plurality of target character sequences from a database according to the character sequence to be matched; the device further includes a related resource sending module, configured to query the related resources corresponding to the target character sequences whose text similarity is greater than a preset similarity threshold, and to send the related resources to the terminal.
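The retrieval flow described above can be sketched as follows. The in-memory dictionary standing in for the database, the function names, and the word-overlap similarity used in the example are all hypothetical; sorting the results by score is an added convenience that the text does not require.

```python
from typing import Callable, Dict, List


def find_related_resources(
    query: str,
    database: Dict[str, str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Return resources whose target sequence scores above the threshold, best first."""
    scored = [(similarity(query, target), resource)
              for target, resource in database.items()]
    kept = [pair for pair in scored if pair[0] > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [resource for _, resource in kept]


def toy_similarity(a: str, b: str) -> float:
    """Stand-in similarity for the example: word-overlap (Jaccard) ratio."""
    x, y = set(a.split()), set(b.split())
    return len(x & y) / len(x | y) if x | y else 1.0


# Hypothetical database mapping target character sequences to related resources.
DATABASE = {
    "computer price": "answer-1",
    "phone price": "answer-2",
    "weather today": "answer-3",
}

results = find_related_resources("computer price", DATABASE, toy_similarity, 0.3)
```

Passing the similarity function as a parameter keeps the retrieval step independent of how the text similarity is computed, so the combined first and second (and optionally third) similarity can be plugged in without changing this code.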
  • Each module of the above text similarity calculation device may be implemented in whole or in part by software, hardware, or a combination thereof.
  • Each of the above modules may be embedded in or independent of the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
  • a computer device which may be a server, and its internal structure diagram may be as shown in FIG. 6.
  • the computer device includes a processor, memory, network interface, and database connected by a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-transitory computer readable storage medium and an internal memory.
  • the non-transitory computer readable storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of an operating system and computer readable instructions in a non-transitory computer readable storage medium.
  • the database of the computer device is used to store a sequence of target characters.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • the computer readable instructions are executed by the processor to implement a text similarity calculation method.
  • FIG. 6 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than those shown in the figure, combine some components, or have a different arrangement of components.
  • a computer device is provided, comprising a memory and one or more processors, the memory storing computer readable instructions that, when executed by the one or more processors, implement the steps of the text similarity calculation method provided in any one of the embodiments of the present application.
  • one or more non-transitory computer readable storage media storing computer readable instructions are provided; when executed by one or more processors, the instructions cause the one or more processors to implement the steps of the text similarity calculation method provided in any one of the embodiments of the present application.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a text similarity calculation method comprising the steps of: obtaining a character sequence to be matched and a target character sequence; preprocessing the character sequence to be matched and the target character sequence respectively to obtain a corresponding to-be-matched word sequence and a corresponding target word sequence; performing a calculation, by means of a first similarity algorithm, on a to-be-matched word contained in the to-be-matched word sequence and a target word contained in the target word sequence, so as to obtain a first similarity; extracting all the to-be-matched words so as to form a to-be-matched word set, and extracting all the target words to form a target word set; performing a calculation on the to-be-matched word set and the target word set by means of a second similarity algorithm, so as to obtain a second similarity; and calculating, from the first similarity and the second similarity, a text similarity between the character sequence to be matched and the target character sequence.
PCT/CN2018/099994 2018-01-12 2018-08-10 Text similarity calculation method and device, computer apparatus, and storage medium WO2019136993A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810031770.0A CN108304378B (zh) 2018-01-12 2018-01-12 Text similarity calculation method and apparatus, computer device and storage medium
CN201810031770.0 2018-01-12

Publications (1)

Publication Number Publication Date
WO2019136993A1 true WO2019136993A1 (fr) 2019-07-18

Family

ID=62868820

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/099994 WO2019136993A1 (fr) 2018-01-12 2018-08-10 Text similarity calculation method and device, computer apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN108304378B (fr)
WO (1) WO2019136993A1 (fr)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
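CN108304378B (zh) * 2018-01-12 2019-09-24 深圳壹账通智能科技有限公司 Text similarity calculation method and apparatus, computer device and storage medium
CN109189907A (zh) * 2018-08-22 2019-01-11 山东浪潮通软信息科技有限公司 Retrieval method and apparatus based on semantic matching
CN110083834B (zh) * 2019-04-24 2023-05-09 北京百度网讯科技有限公司 Semantic matching model training method and apparatus, electronic device and storage medium
CN110287286B (zh) * 2019-06-13 2022-03-08 北京百度网讯科技有限公司 Method and apparatus for determining short text similarity, and storage medium
CN110633356B (zh) * 2019-09-04 2022-05-20 广州市巴图鲁信息科技有限公司 Word similarity calculation method and apparatus, and storage medium
CN110717158B (zh) * 2019-09-06 2024-03-01 冉维印 Information verification method, apparatus, device and computer-readable storage medium
CN112825090B (zh) * 2019-11-21 2024-01-05 腾讯科技(深圳)有限公司 Method, apparatus, device and medium for determining points of interest
CN111159339A (zh) * 2019-12-24 2020-05-15 北京亚信数据有限公司 Text matching processing method and apparatus
CN111382563B (zh) * 2020-03-20 2023-09-08 腾讯科技(深圳)有限公司 Method and apparatus for determining text relevance
CN111898376B (zh) * 2020-07-01 2024-04-26 拉扎斯网络科技(上海)有限公司 Name data processing method and apparatus, storage medium and computer device
CN112765962B (zh) * 2021-01-15 2022-08-30 上海微盟企业发展有限公司 Text error correction method, apparatus and medium
CN113032519A (zh) * 2021-01-22 2021-06-25 中国平安人寿保险股份有限公司 Sentence similarity determination method and apparatus, computer device and storage medium
CN113420234B (zh) * 2021-07-02 2022-08-02 青海师范大学 Microblog data collection method and system
CN113627722B (zh) * 2021-07-02 2024-04-02 湖北美和易思教育科技有限公司 Short-answer question scoring method based on keyword segmentation, terminal and readable storage medium
CN113569036A (zh) * 2021-07-20 2021-10-29 上海明略人工智能(集团)有限公司 Media information recommendation method and apparatus, and electronic device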

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090175545A1 (en) * 2008-01-04 2009-07-09 Xerox Corporation Method for computing similarity between text spans using factored word sequence kernels
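CN103176962A (zh) * 2013-03-08 2013-06-26 深圳先进技术研究院 Statistical method and system for text similarity
CN107491425A (zh) * 2017-07-26 2017-12-19 合肥美的智能科技有限公司 Determination method, determination apparatus, computer apparatus and computer-readable storage medium
CN108304378A (zh) * 2018-01-12 2018-07-20 深圳壹账通智能科技有限公司 Text similarity calculation method and apparatus, computer device and storage medium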

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103123618B (zh) * 2011-11-21 2016-09-14 北京新媒传信科技有限公司 Text similarity acquisition method and apparatus
CN103838789A (zh) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Text similarity calculation method
US9535899B2 (en) * 2013-02-20 2017-01-03 International Business Machines Corporation Automatic semantic rating and abstraction of literature
CN104216968A (zh) * 2014-08-25 2014-12-17 华中科技大学 Deduplication method and system based on file similarity

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
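CN110738202A (zh) * 2019-09-06 2020-01-31 平安科技(深圳)有限公司 Character recognition method and apparatus, and computer-readable storage medium
CN110765767A (zh) * 2019-09-19 2020-02-07 平安科技(深圳)有限公司 Method and apparatus for extracting locally optimized keywords, server and storage medium
CN110765767B (zh) * 2019-09-19 2024-01-19 平安科技(深圳)有限公司 Method and apparatus for extracting locally optimized keywords, server and storage medium
CN111274366A (zh) * 2020-03-25 2020-06-12 联想(北京)有限公司 Search recommendation method and apparatus, device, and storage medium
CN113779183A (zh) * 2020-06-08 2021-12-10 北京沃东天骏信息技术有限公司 Text matching method, apparatus, device and storage medium
CN113779183B (zh) * 2020-06-08 2024-05-24 北京沃东天骏信息技术有限公司 Text matching method, apparatus, device and storage medium
CN111767706A (zh) * 2020-06-19 2020-10-13 北京工业大学 Text similarity calculation method and apparatus, electronic device and medium
CN112149414A (zh) * 2020-09-23 2020-12-29 腾讯科技(深圳)有限公司 Text similarity determination method, apparatus, device and storage medium
CN112149414B (zh) * 2020-09-23 2023-06-23 腾讯科技(深圳)有限公司 Text similarity determination method, apparatus, device and storage medium
CN112287657A (zh) * 2020-11-19 2021-01-29 每日互动股份有限公司 Information matching system based on text similarity
CN112287657B (zh) * 2020-11-19 2024-01-30 每日互动股份有限公司 Information matching system based on text similarity
CN113011178A (zh) * 2021-03-29 2021-06-22 广州博冠信息科技有限公司 Text generation method, text generation apparatus, electronic device and storage medium
CN113011178B (zh) * 2021-03-29 2023-05-16 广州博冠信息科技有限公司 Text generation method, text generation apparatus, electronic device and storage medium
CN113076748A (zh) * 2021-04-16 2021-07-06 平安国际智慧城市科技股份有限公司 Method, apparatus, device and storage medium for processing sensitive words in bullet-screen comments
CN113076748B (zh) * 2021-04-16 2024-01-19 平安国际智慧城市科技股份有限公司 Method, apparatus, device and storage medium for processing sensitive words in bullet-screen comments
CN113268972A (zh) * 2021-05-14 2021-08-17 东莞理工学院城市学院 Intelligent calculation method, system, device and medium for the appearance similarity of two English words
CN113821587A (zh) * 2021-06-02 2021-12-21 腾讯科技(深圳)有限公司 Text relevance determination method, model training method, apparatus and storage medium
CN113821587B (zh) * 2021-06-02 2024-05-17 腾讯科技(深圳)有限公司 Text relevance determination method, model training method, apparatus and storage medium
CN116136839A (zh) * 2023-04-17 2023-05-19 湖南正宇软件技术开发有限公司 Method and system for generating a revision-marked draft of regulatory documents, and related device
CN116881437B (zh) * 2023-09-08 2023-12-01 北京睿企信息科技有限公司 Data processing system for acquiring a text set
CN116881437A (zh) * 2023-09-08 2023-10-13 北京睿企信息科技有限公司 Data processing system for acquiring a text set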

Also Published As

Publication number Publication date
CN108304378B (zh) 2019-09-24
CN108304378A (zh) 2018-07-20

Similar Documents

Publication Publication Date Title
WO2019136993A1 (fr) Text similarity calculation method and device, computer apparatus, and storage medium
US11301637B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
US11093854B2 (en) Emoji recommendation method and device thereof
US11544459B2 (en) Method and apparatus for determining feature words and server
WO2019153551A1 (fr) Article classification method and apparatus, computer device and storage medium
CN109670163B (zh) Information recognition method, information recommendation method, template construction method and computing device
WO2019105432A1 (fr) Text recommendation method and apparatus, and electronic device
WO2021114810A1 (fr) Graph structure-based official document recommendation method, apparatus, computer device and medium
WO2020057022A1 (fr) Associative recommendation method and apparatus, computer device and storage medium
JP2020123318A (ja) Method, apparatus, electronic device, computer-readable storage medium and computer program for determining text relevance
US11816138B2 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
US11157816B2 (en) Systems and methods for selecting and generating log parsers using neural networks
WO2022095374A1 (fr) Keyword extraction method and apparatus, terminal device and storage medium
WO2022142027A1 (fr) Knowledge graph-based fuzzy matching method and apparatus, computer device and storage medium
CN110162771B (zh) Event trigger word recognition method and apparatus, and electronic device
US10482146B2 (en) Systems and methods for automatic customization of content filtering
CN110134965B (zh) Method, apparatus, device and computer-readable storage medium for information processing
WO2021175005A1 (fr) Vector-based document retrieval method and apparatus, computer device, and storage medium
CN112651236B (zh) Method and apparatus for extracting text information, computer device and storage medium
CN111291177A (zh) Information processing method and apparatus, and computer storage medium
CN111325030A (zh) Text label construction method and apparatus, computer device and storage medium
WO2022141872A1 (fr) Document summary generation method and apparatus, computer device and storage medium
CN116383412B (zh) Knowledge graph-based function point expansion method and system
WO2020132933A1 (fr) Short text filtering method and apparatus, medium and computer device
WO2021244424A1 (fr) Headword extraction method and apparatus, device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18899704

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.11.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18899704

Country of ref document: EP

Kind code of ref document: A1