WO2017092122A1 - Similarity determination method, apparatus, and terminal - Google Patents
Similarity determination method, apparatus, and terminal
- Publication number
- WO2017092122A1 (PCT/CN2015/099523)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cost
- sequence
- character string
- edit distance
- determining
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/26—Techniques for post-processing, e.g. correcting the recognition result
- G06V30/262—Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
- G06V30/274—Syntactic or semantic context, e.g. balancing
Definitions
- the present disclosure relates to the field of natural language processing, and in particular, to a similarity determining method, apparatus, and terminal.
- in the related art, the similarity between two character strings is determined by calculating the edit distance between them.
- first, the two strings are each split into characters; then one string is converted into the other by deleting, inserting, or replacing characters; next, the minimum number of operations required for the conversion is calculated and used as the edit distance between the two strings; finally, the similarity between the two strings is calculated from the edit distance.
- the present disclosure provides a similarity determination method, apparatus, and terminal.
- a similarity determining method comprising:
- a similarity determining apparatus comprising:
- a word segmentation module configured to respectively segment the first character string and the second character string to obtain a first sequence and a second sequence, the first sequence and the second sequence respectively including at least one word;
- a first determining module configured to determine an edit distance between the first character string and the second character string according to a predefined edit distance algorithm and the first sequence and the second sequence;
- a second determining module configured to determine the similarity between the first character string and the second character string according to the edit distance and information about each operation performed when the first sequence is transformed into the second sequence.
- a terminal comprising:
- a memory for storing processor executable instructions
- processor is configured to:
- because each word in a string may include at least one character, the similarity determined from the edit distance takes into account the correlation between the characters in the string, which makes the determined similarity more accurate.
- FIG. 1 is a flow chart showing a similarity determination method according to an exemplary embodiment.
- FIG. 2 is a flow chart showing a similarity determination method according to an exemplary embodiment.
- FIG. 3 is a block diagram of a similarity determining apparatus, according to an exemplary embodiment.
- FIG. 4 is a block diagram of a second determining module, according to an exemplary embodiment.
- FIG. 5 is a block diagram of a second determining unit, according to an exemplary embodiment.
- FIG. 6 is a block diagram of a second determining unit, according to an exemplary embodiment.
- FIG. 7 is a block diagram of a similarity determining apparatus, according to an exemplary embodiment.
- FIG. 8 is a block diagram of a similarity determining apparatus, according to an exemplary embodiment.
- FIG. 9 is a block diagram of a similarity determining apparatus, according to an exemplary embodiment.
- FIG. 10 is a block diagram of a terminal, according to an exemplary embodiment.
- FIG. 11 is a block diagram of a server, according to an exemplary embodiment.
- FIG. 1 is a flowchart of a similarity determining method according to an exemplary embodiment.
- the similarity determining method provided by the embodiment of the present disclosure may be used in a terminal. As shown in FIG. 1 , the similarity determining method provided by the embodiment of the present disclosure includes the following steps.
- in step S101, the first character string and the second character string are each segmented into words to obtain a first sequence and a second sequence, where the first sequence and the second sequence each include at least one word.
- in step S102, an edit distance between the first character string and the second character string is determined according to the predefined edit distance algorithm, the first sequence, and the second sequence.
- in step S103, the similarity between the first character string and the second character string is determined according to the edit distance and information about each operation performed when the first sequence is transformed into the second sequence.
- with the method provided by the embodiment of the present disclosure, the first character string and the second character string are segmented into the first sequence and the second sequence, respectively, so that the edit distance for converting the first character string into the second character string is computed over the words in the first sequence and the second sequence rather than over the individual characters in the two strings. Because each word may include at least one character, the similarity determined from the edit distance takes into account the correlation between the characters in the string, which makes the determined similarity more accurate.
- determining the similarity between the first character string and the second character string according to the edit distance and the information about each operation performed when the first sequence is transformed into the second sequence includes:
- determining the similarity between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence.
- the operations include a replacement operation and an exchange operation, and determining the similarity includes:
- the similarity between the first string and the second string is determined according to the normalized result.
- the operations include at least one of a replacement operation, an exchange operation, an insertion operation, and a deletion operation, and determining the similarity includes:
- the method further includes:
- determining the operation cost of the insertion operation, the operation cost of the deletion operation, and the operation cost of the replacement operation.
- the method further includes:
- determining the operation costs such that the operation cost of the insertion operation + the operation cost of the deletion operation > the operation cost of the replacement operation.
- the method further includes:
- determining an edit distance between the first character string and the second character string according to the predefined edit distance algorithm, the first sequence, and the second sequence includes:
- the edit distance between the first string and the second string is determined by the following formula 1:
- determining the minimum semantic edit distance between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation includes:
- the minimum semantic edit distance between the first string and the second string is determined by the following Equation 2:
- S1 and S2 are the first character string and the second character string, respectively, minCost(S1, S2) is the minimum semantic edit distance, d is the edit distance, p is the number of pairs, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T) - cost(J) > 0.
- determining the first semantic edit distance between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation includes:
- the first semantic edit distance between the first character string and the second character string is determined by the following Equation 3 according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation:
- S1 and S2 are the first character string and the second character string, respectively, minCost(S1, S2) is the first semantic edit distance, d is the edit distance, p is the number of pairs, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T) - cost(J) > 0.
- determining the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence includes:
- the second semantic edit distance between the first character string and the second character string is determined by the following Equation 4:
- costM = cost(C), if n < m;
- costM = cost(S), if n > m
- in Equation 4, normFact(S1, S2) is the second semantic edit distance, n is the number of words in the first sequence, m is the number of words in the second sequence, cost(T) is the operation cost of the replacement operation, cost(S) is the operation cost of the deletion operation, and cost(C) is the operation cost of the insertion operation.
- determining the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance includes:
- the similarity between the first character string and the second character string is determined by the following formula 5:
- sim(S1, S2) is the similarity between the first string and the second string
- minCost(S1, S2) is the first semantic edit distance
- normFact(S1, S2) is the second semantic edit distance.
- FIG. 2 is a flowchart of a similarity determining method according to an exemplary embodiment, and the similarity determining method may be applied to a terminal.
- the similarity determining method provided by the embodiment of the present disclosure includes the following steps.
- in step S201, the first character string and the second character string are each segmented into words to obtain a first sequence and a second sequence.
- unlike the related art, the embodiment of the present disclosure does not split the two character strings into individual characters. Instead, each character string is segmented into words, and each segmented string includes at least one word.
- the embodiment of the present disclosure defines the two character strings whose similarity is to be determined as a first character string and a second character string, respectively. After the first character string is segmented into words, the first sequence is obtained; after the second character string is segmented into words, the second sequence is obtained.
- the first sequence and the second sequence each comprise at least one word
- the first sequence and the second sequence are (S11, S12, S13, ..., S1n) and (S21, S22, S23, ..., S2m), respectively.
- the number of words in S1 is n
- the number of words in S2 is m.
- the embodiment of the present disclosure is not specifically limited.
- the first string and the second string may both be Chinese, or both are English.
- the first character string and the second character string may each be a sentence.
- for example, the first string is “Today I am going to Xiangshan” and the second string is “I am going to Xiangshan today”.
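As a minimal sketch of the segmentation in step S201, the following Python fragment splits the two example sentences into word sequences. The function name `segment` is hypothetical, and whitespace splitting is an assumption that only suits space-delimited text; Chinese input would need a dedicated word segmenter, which this text does not name.

```python
def segment(text: str) -> list[str]:
    """Split a sentence into a word sequence.

    Assumption: words are separated by whitespace. This stands in for
    whatever word segmenter an implementation actually uses.
    """
    return text.split()


# The two example strings from the description.
first_sequence = segment("Today I am going to Xiangshan")
second_sequence = segment("I am going to Xiangshan today")

# n and m are the word counts of the first and second sequences.
n, m = len(first_sequence), len(second_sequence)
```

Because later steps compare whole words, any case folding or punctuation stripping would also have to happen at this stage; the sketch leaves the words untouched.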
- in step S202, the operation cost of the replacement operation and the operation cost of the exchange operation are determined according to the relationship between the replacement operation and the exchange operation, and the operation cost of the insertion operation, the operation cost of the deletion operation, and the operation cost of the replacement operation are determined according to the relationship between the replacement operation and the insertion and deletion operations.
- the traditional method of determining the similarity between strings usually uses three editing operations when converting one string into another, namely an insertion operation, a deletion operation, and a replacement operation, and the operation costs of the three operations are the same.
- however, some components can appear in different parts of a string without changing its overall meaning. For example, in “Today I am going to Xiangshan”, “I plan to go to Xiangshan today”, and “I am going to Xiangshan today”, the words are in different positions, but the three strings have the same meaning. Therefore, in the embodiment of the present disclosure, an exchange operation is newly defined on top of the conventional insertion, deletion, and replacement operations, and different operation costs are defined for different operations according to the relationships between the operations.
- the embodiment of the present disclosure can determine the operation cost of the replacement operation and the operation cost of the exchange operation according to the relationship between the two operations. For example, the costs defined in the embodiment of the present disclosure satisfy: 2 × the operation cost of the replacement operation > the operation cost of the exchange operation, that is:
- 2cost(T) - cost(J) > 0
- where cost(T) is the operation cost of the replacement operation and cost(J) is the operation cost of the exchange operation.
- the embodiment of the present disclosure can determine the insertion operation cost and the deletion operation cost according to the relationship between the replacement operation and the insertion and deletion operations. For example, the costs defined in the embodiment of the present disclosure satisfy: the operation cost of the insertion operation + the operation cost of the deletion operation > the operation cost of the replacement operation.
- in addition, the operation cost of the replacement operation is greater than the larger of the operation cost of the insertion operation and the operation cost of the deletion operation.
- such a relationship can be expressed as the following formula:
- max(cost(C), cost(S)) < cost(T) < cost(C) + cost(S)
- where cost(S) is the operation cost of the deletion operation
- cost(C) is the operation cost of the insertion operation
- the insertion operation cost and the deletion operation cost may further be determined according to the relationship between the insertion operation and the deletion operation.
- the insertion operation cost and the deletion operation cost may be different or the same, and the embodiment of the present disclosure does not specifically limit this.
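The cost relationships described for step S202 can be checked mechanically. The Python sketch below is illustrative only: the class name `OperationCosts`, its field names, and the numeric values are assumptions, not values given in the disclosure. It encodes the three stated relationships, namely 2cost(T) > cost(J), cost(C) + cost(S) > cost(T), and cost(T) greater than the larger of cost(C) and cost(S).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationCosts:
    insert: float   # cost(C)
    delete: float   # cost(S)
    replace: float  # cost(T)
    swap: float     # cost(J), the newly defined exchange operation

    def satisfies_constraints(self) -> bool:
        """Return True if the costs obey the relationships in step S202."""
        return (
            2 * self.replace > self.swap                        # two replacements cost more than one exchange
            and self.insert + self.delete > self.replace        # insert + delete costs more than one replacement
            and self.replace > max(self.insert, self.delete)    # a replacement costs more than a single insert or delete
        )


# Example values chosen only to satisfy the constraints (assumed, not from the source).
costs = OperationCosts(insert=1.0, delete=1.0, replace=1.5, swap=1.0)

# The traditional scheme with identical costs for every operation.
uniform = OperationCosts(insert=1.0, delete=1.0, replace=1.0, swap=1.0)
```

Under this reading, the uniform costs of the traditional method fail the third relationship, because a replacement is not strictly more expensive than a single insertion or deletion.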
- in step S203, a predefined edit distance algorithm is generated according to the operation cost of the replacement operation, the operation cost of the deletion operation, and the operation cost of the insertion operation.
- the predefined edit distance algorithm can be as shown in Equation 1 below.
- the edit distance algorithm predefined in the embodiment of the present disclosure is a dynamic programming algorithm
- the predefined edit distance algorithm is obtained from the operation cost of the deletion operation, the operation cost of the insertion operation, and the operation cost of the replacement operation that are predefined in the embodiment of the present disclosure.
- step S202 and step S203 need to be performed before the similarity is determined, but they do not need to be repeated each time the similarity between two strings is determined. It is sufficient that the operation costs of the various operations and the predefined edit distance algorithm have been determined before the similarity is determined, so that they can be used directly.
- in step S204, an edit distance between the first character string and the second character string is determined according to the predefined edit distance algorithm, the first sequence, and the second sequence.
- the edit distance between two strings refers to the minimum number of edit operations required to convert one of the strings into the other. Because each edit operation has an operation cost, the total operation cost of the transformation can be taken as the edit distance.
- the editing operations that can be performed include a replacement operation, an insertion operation, a deletion operation, and an exchange operation.
- the embodiment of the present disclosure may determine the edit distance between the first character string and the second character string according to the predefined edit distance algorithm, the first sequence, and the second sequence.
- specifically, the edit distance between the first character string and the second character string is calculated by Equation 1 above according to the predefined edit distance algorithm, the first sequence, and the second sequence.
- the principle of calculating the edit distance by Equation 1 is the same as that of calculating the edit distance with the existing dynamic programming algorithm, and the embodiment of the present disclosure does not elaborate on it.
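Equation 1 is not reproduced in this text. As a rough illustration only, the Python sketch below uses the standard weighted Levenshtein recurrence over word sequences, which matches the description of a dynamic programming algorithm built from insertion, deletion, and replacement costs. The function name `edit_distance`, the parameter names, and the default cost values are assumptions.

```python
def edit_distance(first, second, cost_insert=1.0, cost_delete=1.0, cost_replace=1.5):
    """Weighted edit distance between two word sequences.

    Standard dynamic programming recurrence; a stand-in for Equation 1,
    not a reproduction of it.
    """
    n, m = len(first), len(second)
    # dist[i][j] is the cheapest cost of turning first[:i] into second[:j].
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * cost_delete      # delete every word of first[:i]
    for j in range(1, m + 1):
        dist[0][j] = j * cost_insert      # insert every word of second[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = first[i - 1] == second[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + cost_delete,                             # delete first[i-1]
                dist[i][j - 1] + cost_insert,                             # insert second[j-1]
                dist[i - 1][j - 1] + (0.0 if same else cost_replace),     # keep or replace
            )
    return dist[n][m]
```

Comparison is done on whole words, so two words that differ by a single character still count as one full replacement; that is the word-level behaviour the description contrasts with character-level edit distance.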
- in step S205, the similarity between the first character string and the second character string is determined according to the edit distance and the information about each operation performed when the first sequence is transformed into the second sequence.
- the information about each operation performed when the first sequence is transformed into the second sequence includes the type of operation, the number of operations of each type, and the operation cost of each type of operation.
- because different operations are predefined with different operation costs in the embodiment of the present disclosure, and given the definition of the edit distance between two strings, the operation costs of the various operations required to convert the first character string into the second character string directly affect the edit distance. Therefore, determining the similarity between the first character string and the second character string requires both the edit distance and the information about the operations performed in obtaining it. This operation information includes the operation costs, and the operation costs of the various operations used in obtaining the edit distance are those preset in step S202.
- for example, the operations performed when converting the first character string into the second character string may include two insertion operations, one deletion operation, one exchange operation, and one replacement operation. Determining the similarity between the first character string and the second character string can be implemented through steps S2051 to S2053:
- in step S2051, the replacement operation information is acquired from the information about the operations performed in converting the first sequence into the second sequence when the edit distance was obtained.
- the replacement operation is to replace one word in the first string with another word.
- the information of each replacement operation performed in the conversion process is counted and recorded in a specified set.
- the information of a replacement operation includes the replaced word and its position in the sequence, so the data recorded in the specified set includes each replaced word and its position in the first sequence. For example, if the first string is "I plan to go to Xiangshan today", the first sequence is "I-Plan-Today-Go-Xiangshan", and the words "Xiangshan" and "Plan" are replaced, then the replacement operation information recorded in the specified set includes "Plan-2, Xiangshan-5". The replacement operation information can therefore be obtained from the specified set, specifically the word replaced by each replacement operation and the position of each replaced word in the first sequence.
- the embodiment of the present disclosure newly defines the exchange operation according to the relationship between the replacement operation and the exchange operation, and defines 2cost(T) - cost(J) > 0 in advance, so the cost of two replacement operations is greater than the cost of one exchange operation. Therefore, where part of the conversion from the first string to the second string can be achieved by one exchange operation, it is not implemented by two replacement operations. To this end, in addition to recording the replaced words in the first sequence and the position of each replaced word in the first sequence, it is further determined whether any two words in the specified set also exist in the second sequence. If so, the two words and the position of each word in the second sequence are also recorded in the specified set.
- for example, the first string is "I am going to Xiangshan today", the first sequence is "I-Plan-Today-Go-Xiangshan", and the words "Xiangshan" and "Plan" are replaced. The second string is "Today I am going to Xiangshan", and the second sequence is "Today-I-Plan-Go-Xiangshan". Because the replaced words "Xiangshan" and "Plan" exist in both the first sequence and the second sequence, the data recorded in the specified set can be "Plan-S12, Xiangshan-S15; Plan-S23, Xiangshan-S25".
- in this case, "Xiangshan" and "Plan" are each defined as a matching word between the first character string and the second character string.
- the matching words refer to any two words that exist in both the first sequence and the second sequence.
- in step S2052, the number of pairs is determined based on the replacement operation information.
- the number of pairs refers to the number of matching words in the first sequence and the second sequence, that is, the number of words that exist in both the first sequence and the second sequence. Following the above explanation of the data recorded in the specified set, the number of pairs can be determined from that data.
- for the data recorded in the example above, the number of pairs can be determined to be 2.
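One way to read steps S2051 and S2052 is sketched below in Python. The function name `count_pairs` and the tuple layout of the records are hypothetical, and the sketch assumes that the replaced words and their positions in the first sequence have already been collected while the edit distance was computed. It counts each replaced word that also occurs in the second sequence as one matching word, following the description's statement that the number of pairs is the number of matching words.

```python
def count_pairs(replaced, second_sequence):
    """Find replaced words that also occur in the second sequence.

    replaced: list of (word, position_in_first_sequence) records, 1-based,
              standing in for the "specified set" of the description.
    Returns the matched records and the number of pairs p.
    """
    matches = []
    for word, pos_first in replaced:
        if word in second_sequence:
            # 1-based position, mirroring the S21..S2m notation.
            pos_second = second_sequence.index(word) + 1
            matches.append((word, pos_first, pos_second))
    return matches, len(matches)


# Records from the example: "Plan-S12, Xiangshan-S15".
replaced = [("Plan", 2), ("Xiangshan", 5)]
second_sequence = ["Today", "I", "Plan", "Go", "Xiangshan"]
matches, p = count_pairs(replaced, second_sequence)
```

For the example records this yields positions 3 and 5 in the second sequence, matching "Plan-S23, Xiangshan-S25". Whether one exchange should count as one pair or as two matching words is not settled by this text, so the counting rule here is an assumption.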
- in step S2053, the similarity between the first character string and the second character string is determined according to the edit distance, the number of pairs, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence.
- when the embodiment of the present disclosure determines the similarity between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence, the operations may include a replacement operation and an exchange operation.
- in this case, step S2053 can be implemented by the following steps S20531 to S20533:
- in step S20531, the minimum semantic edit distance between the first character string and the second character string is determined according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation.
- according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation, the minimum semantic edit distance between the first string and the second string may be determined by Equation 2 below:
- S1 and S2 are the first character string and the second character string, respectively, minCost(S1, S2) is the minimum semantic edit distance, d is the edit distance, p is the number of pairs, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T) - cost(J) > 0.
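Equation 2 itself is not reproduced in this text. One reading consistent with the listed symbols and with the condition 2cost(T) - cost(J) > 0 is that each pair lets two replacement operations be traded for one exchange operation, giving minCost(S1, S2) = d - p(2cost(T) - cost(J)). The Python sketch below implements that assumed form; the function and parameter names are hypothetical.

```python
def min_semantic_cost(d, p, cost_replace, cost_swap):
    """Assumed form of Equation 2: d - p * (2*cost(T) - cost(J)).

    d: edit distance, p: number of pairs,
    cost_replace: cost(T), cost_swap: cost(J).
    """
    saving = 2 * cost_replace - cost_swap
    if saving <= 0:
        # The description requires 2cost(T) - cost(J) > 0.
        raise ValueError("expected 2*cost(T) - cost(J) > 0")
    return d - p * saving
```

With d = 3.0 from two replacements at cost 1.5 each, one pair, and cost(J) = 1.0, the result is 1.0, the cost of a single exchange, which is the behaviour the description motivates.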
- in step S20532, the minimum semantic edit distance is normalized to obtain a normalized result.
- the minimum semantic edit distance can be normalized by the maximum semantic edit distance between the first string and the second string.
- the maximum semantic edit distance can be expressed as the following Equation 4:
- costM = cost(C), if n < m;
- costM = cost(S), if n > m
- in Equation 4, normFact(S1, S2) represents the maximum semantic edit distance, n represents the number of words in the first sequence, and m represents the number of words in the second sequence.
- the normalized result obtained by normalizing the minimum semantic edit distance minCost(S1, S2) is minCost(S1, S2)/normFact(S1, S2).
- in this way, minCost(S1, S2)/normFact(S1, S2) is mapped to a value between 0 and 1, which makes the similarity easier to judge intuitively.
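Only the costM cases of Equation 4 survive in this text. A form consistent with those cases and with the description of normFact as a maximum semantic edit distance is to replace every word of the shorter sequence and then insert or delete the surplus words: normFact(S1, S2) = min(n, m) × cost(T) + |n - m| × costM. The Python sketch below implements this assumed form; the function and parameter names are hypothetical.

```python
def norm_factor(n, m, cost_insert, cost_delete, cost_replace):
    """Assumed form of Equation 4 (maximum semantic edit distance).

    n, m: word counts of the first and second sequences.
    costM is cost(C) when n < m and cost(S) when n > m; when n == m the
    surplus term is zero, so the choice of costM does not matter.
    """
    cost_m = cost_insert if n < m else cost_delete
    return min(n, m) * cost_replace + abs(n - m) * cost_m
```

For example, with cost(C) = 1.0, cost(S) = 2.0, and cost(T) = 1.5, a first sequence of 2 words and a second sequence of 4 words give 2 × 1.5 + 2 × 1.0 = 5.0.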
- in step S20533, the similarity between the first character string and the second character string is determined based on the normalized result.
- the similarity between the first string and the second string can be determined from the normalized result by the following Equation 5:
- sim(S1, S2) is the similarity between the first string and the second string
- minCost(S1, S2) is the minimum semantic edit distance
- normFact(S1, S2) is the maximum semantic edit distance
- minCost(S1, S2)/normFact(S1, S2) is the normalized result.
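Equation 5 is not reproduced in this text. Since the normalized result is a distance mapped between 0 and 1, a natural assumed form is sim(S1, S2) = 1 - minCost(S1, S2)/normFact(S1, S2), so that identical strings score 1 and maximally different strings score 0. The Python sketch below implements that assumption; the function and parameter names are hypothetical.

```python
def similarity(min_cost, norm_fact):
    """Assumed form of Equation 5: 1 - minCost / normFact.

    min_cost: minimum semantic edit distance, minCost(S1, S2).
    norm_fact: maximum semantic edit distance, normFact(S1, S2).
    """
    if norm_fact <= 0:
        raise ValueError("normFact must be positive")
    return 1.0 - min_cost / norm_fact
```

A minimum semantic edit distance of 1.0 against a normalization factor of 4.0 would give a similarity of 0.75 under this form.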
- when the embodiment of the present disclosure determines the similarity between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence, the operations may be at least one of a replacement operation, an exchange operation, an insertion operation, and a deletion operation.
- in this case, step S2053 can be implemented by the following steps S20534 to S20536:
- in step S20534, the first semantic edit distance between the first character string and the second character string is determined according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation.
- the first semantic edit distance may be a minimum semantic edit distance between the first character string and the second character string.
- in Equation 3, S1 and S2 represent the first character string and the second character string, respectively, minCost(S1, S2) represents the first semantic edit distance, d represents the edit distance, p represents the number of pairs, and cost(J) represents the operation cost of the exchange operation.
- because the first semantic edit distance may be the minimum semantic edit distance between the first character string and the second character string, Equation 3 differs from Equation 2 only in the meaning represented by minCost(S1, S2).
- the first string is determined according to one of an operation cost of the insert operation and an operation cost of the delete operation, an operation cost of the replacement operation, and the number of words in the first sequence and the number of words in the second sequence.
- a second semantic edit distance between the second string is determined according to one of an operation cost of the insert operation and an operation cost of the delete operation, an operation cost of the replacement operation, and the number of words in the first sequence and the number of words in the second sequence.
- the second semantic edit distance may be a maximum semantic edit distance between the first character string and the second character string.
- Determining the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence includes, but is not limited to, using Equation 4 below.
- Equation 4: normFact(S1, S2) = min(n, m) × cost(T) + (max(n, m) - min(n, m)) × costM, where
- costM = cost(C), if n < m;
- costM = cost(S), if n > m.
- In Equation 4, normFact(S1, S2) represents the second semantic edit distance, n represents the number of words in the first sequence, m represents the number of words in the second sequence, cost(T) represents the operation cost of the replacement operation, cost(S) represents the operation cost of the deletion operation, and cost(C) represents the operation cost of the insertion operation.
- normFact(S1, S2) is a normalization factor, which is used to map minCost(S1, S2)/normFact(S1, S2) to between 0 and 1, so as to facilitate the intuitive determination of similarity.
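The normalization factor normFact(S1, S2) = min(n, m) × cost(T) + (max(n, m) - min(n, m)) × costM can be sketched as below. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the handling of n = m (where the length-difference term vanishes) is an implementation choice, since the text only defines costM for n < m and n > m.

```python
def norm_fact(n: int, m: int, cost_t: float, cost_c: float, cost_s: float) -> float:
    """Second semantic edit distance: the cost of replacing every word of the
    shorter sequence and inserting or deleting the remaining words.

    n, m   -- number of words in the first and second sequence
    cost_t -- replacement cost, cost(T)
    cost_c -- insertion cost, cost(C)
    cost_s -- deletion cost, cost(S)
    """
    # costM is the insertion cost when the first sequence is shorter (n < m)
    # and the deletion cost when it is longer (n > m). For n == m the factor
    # multiplying costM is zero, so either choice gives the same result.
    cost_m = cost_c if n < m else cost_s
    return min(n, m) * cost_t + (max(n, m) - min(n, m)) * cost_m


# Hypothetical costs: 3 words versus 5 words.
print(norm_fact(n=3, m=5, cost_t=1.0, cost_c=0.5, cost_s=0.75))
```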
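- In step S20536, the similarity between the first character string and the second character string is determined according to the first semantic edit distance and the second semantic edit distance, including, but not limited to, by using Equation 5 below.
- Equation 5: sim(S1, S2) = 1 - minCost(S1, S2) / normFact(S1, S2)
- In Equation 5, sim(S1, S2) represents the similarity between the first character string and the second character string, minCost(S1, S2) represents the first semantic edit distance, and normFact(S1, S2) represents the second semantic edit distance.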
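As a small worked illustration of sim(S1, S2) = 1 - minCost(S1, S2) / normFact(S1, S2), the following Python sketch computes the similarity from the two semantic edit distances. The function name and the sample values are assumptions for illustration only.

```python
def similarity(min_cost: float, norm: float) -> float:
    """Return 1 - minCost / normFact.

    min_cost -- first semantic edit distance, minCost(S1, S2)
    norm     -- second semantic edit distance, normFact(S1, S2)
    """
    if norm <= 0:
        raise ValueError("normFact must be positive")
    return 1.0 - min_cost / norm


# Hypothetical values: a first semantic edit distance of 1.0 against a
# normalization factor of 4.0.
print(similarity(min_cost=1.0, norm=4.0))
```

Identical sequences have a first semantic edit distance of 0 and therefore a similarity of 1; the closer minCost gets to normFact, the closer the similarity gets to 0.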
- In the method provided by the embodiment of the present disclosure, the first character string and the second character string are segmented into the first sequence and the second sequence, respectively, so that the edit distance for converting the first character string into the second character string is determined based on each word in the first sequence and the second sequence rather than on each character in the first character string and the second character string. Because each word in a string may include at least one character, the similarity determined according to the edit distance takes into account the correlation between the individual characters in the string, making the determined similarity more accurate.
- FIG. 3 is a block diagram of a similarity determining apparatus, according to an exemplary embodiment.
- The similarity determining apparatus includes a word segmentation module 301, a first determining module 302, and a second determining module 303, wherein:
- a word segmentation module 301 configured to respectively segment the first character string and the second character string to obtain a first sequence and a second sequence, the first sequence and the second sequence respectively including at least one word;
- the first determining module 302 is configured to determine an edit distance between the first character string and the second character string according to the predefined edit distance algorithm and the first sequence and the second sequence;
- the second determining module 303 is configured to determine the similarity between the first character string and the second character string according to the edit distance and the information about each operation performed in transforming the first sequence into the second sequence.
- By segmenting the first character string and the second character string into the first sequence and the second sequence, the apparatus determines the edit distance based on the words in the strings rather than on the characters in the strings. Because each word may include at least one character, the similarity determined according to the edit distance takes into account the correlation between the characters in the string, so that the determined similarity is more accurate.
- the second determining module 303 includes:
- the obtaining unit 3031 is configured to acquire replacement operation information in each operation information performed when the first sequence is converted to the second sequence;
- a first determining unit 3032 configured to determine a number of pairs according to each replacement operation information, where the number of pairs refers to the number of two words that exist in the first sequence and the second sequence at the same time;
- the second determining unit 3033 is configured to determine the similarity between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence.
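One possible way to derive the number of pairs from replacement operation information is sketched below in Python. It rests on an assumed reading of the definition above: a pair is taken to be two replacements that swap the same two words (a replaced by b, and b replaced by a), so that both words exist in both sequences. The function name and the representation of replacement information as (old word, new word) tuples are hypothetical.

```python
from collections import Counter


def count_pairs(replacements: list[tuple[str, str]]) -> int:
    """Count pairs of replacements (a -> b) and (b -> a).

    Assumption: each replacement is recorded as (old_word, new_word), and a
    pair consists of two replacements that swap the same two words.
    """
    remaining = Counter(replacements)
    pairs = 0
    for old, new in list(remaining):
        if old >= new:
            # Visit each unordered word pair once; also skips old == new.
            continue
        # Each (a -> b) can be matched with at most one (b -> a).
        pairs += min(remaining[(old, new)], remaining[(new, old)])
    return pairs


# Hypothetical replacement information for "A B C" -> "B A D".
print(count_pairs([("A", "B"), ("B", "A"), ("C", "D")]))
```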
- each operation includes a replacement operation and an exchange operation
- the second determining unit 3033 includes:
- a first determining sub-unit 30331 configured to determine, according to an edit distance, a pairing number, an operation cost of the replacement operation, and an operation cost of the exchange operation, a minimum semantic edit distance between the first character string and the second character string;
- the normalization sub-unit 30332 is configured to normalize the minimum semantic edit distance to obtain a normalized result
- the second determining subunit 30333 is configured to determine the similarity between the first character string and the second character string according to the normalization result.
- each operation includes at least one of a replacement operation, an exchange operation, an insertion operation, and a deletion operation
- the second determining unit 3033 includes:
- a third determining sub-unit 30334 configured to determine, according to an edit distance, a pairing number, an operation cost of the replacement operation, and an operation cost of the exchange operation, a first semantic edit distance between the first character string and the second character string;
- the fourth determining subunit 30335 is configured to determine a second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence;
- the fifth determining subunit 30336 is configured to determine the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance.
- the apparatus further includes:
- the third determining module 304 is configured to determine an operation cost of the replacement operation and an operation cost of the exchange operation according to the relationship between the replacement operation and the exchange operation;
- the fourth determining module 305 is configured to determine an operation cost of the insert operation, an operation cost of the delete operation, and an operation cost of the replacement operation according to the relationship between the replacement operation and the insert operation and the delete operation.
- the apparatus further includes:
- a fifth determining module 306, configured to determine, according to the relationship between the replacement operation and the exchange operation, that 2 × the operation cost of the replacement operation > the operation cost of the exchange operation;
- the sixth determining module 307 is configured to determine, according to the relationship between the replacement operation and the insertion and deletion operations, that the operation cost of the insertion operation + the operation cost of the deletion operation > the operation cost of the replacement operation.
- the apparatus further includes:
- the seventh determining module 308 is configured to determine, according to the relationship between the insertion operation and the deletion operation, that the operation cost of the insertion operation is equal to the operation cost of the deletion operation.
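The three cost relationships described for the fifth, sixth, and seventh determining modules can be checked with a short Python sketch. The class name, field names, and sample values are assumptions; the source states only the inequalities, not how the costs are stored or chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationCosts:
    replace: float   # cost(T), replacement operation
    exchange: float  # cost(J), exchange operation
    insert: float    # cost(C), insertion operation
    delete: float    # cost(S), deletion operation

    def violations(self) -> list[str]:
        """Return the cost relationships that these values do not satisfy."""
        problems = []
        if not 2 * self.replace > self.exchange:
            problems.append("2*cost(T) > cost(J)")
        if not self.insert + self.delete > self.replace:
            problems.append("cost(C) + cost(S) > cost(T)")
        if self.insert != self.delete:
            problems.append("cost(C) == cost(S)")
        return problems


# Hypothetical cost assignment that satisfies all three relationships.
print(OperationCosts(replace=1.0, exchange=1.0, insert=0.75, delete=0.75).violations())
```

The first relationship makes one exchange cheaper than the two replacements it stands for, and the second makes one replacement cheaper than a deletion followed by an insertion.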
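- The first determining module 302 is configured to determine the edit distance between the first character string and the second character string according to the predefined edit distance algorithm, the first sequence, and the second sequence by using Equation 1: minCost[i, j] = min(minCost[i-1, j] + cost(S), minCost[i, j-1] + cost(C), minCost[i-1, j-1] + cost(T)), where i denotes the i-th word in the first sequence, j denotes the j-th word in the second sequence, cost(S) is the operation cost of the deletion operation, cost(C) is the operation cost of the insertion operation, and cost(T) is the operation cost of the replacement operation.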
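The recurrence minCost[i, j] = min(minCost[i-1, j] + cost(S), minCost[i, j-1] + cost(C), minCost[i-1, j-1] + cost(T)) can be sketched as a word-level dynamic program in Python. Two details are assumptions not stated in the recurrence itself: the boundary values (deleting or inserting every word of a prefix) and a diagonal cost of zero when the two words are identical. The function name and the default costs are also hypothetical.

```python
def word_edit_distance(seq1: list[str], seq2: list[str],
                       cost_s: float = 1.0,   # deletion, cost(S)
                       cost_c: float = 1.0,   # insertion, cost(C)
                       cost_t: float = 1.0) -> float:  # replacement, cost(T)
    """Edit distance between two word sequences (not character strings)."""
    n, m = len(seq1), len(seq2)
    # min_cost[i][j]: cost of turning the first i words of seq1
    # into the first j words of seq2.
    min_cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        min_cost[i][0] = i * cost_s      # assumed boundary: delete i words
    for j in range(1, m + 1):
        min_cost[0][j] = j * cost_c      # assumed boundary: insert j words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumption: identical words cost nothing on the diagonal.
            diagonal = 0.0 if seq1[i - 1] == seq2[j - 1] else cost_t
            min_cost[i][j] = min(
                min_cost[i - 1][j] + cost_s,
                min_cost[i][j - 1] + cost_c,
                min_cost[i - 1][j - 1] + diagonal,
            )
    return min_cost[n][m]


# Two swapped words are counted as two replacements at this stage.
print(word_edit_distance(["today", "weather", "good"], ["weather", "today", "good"]))
```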
- The first determining subunit 30331 is configured to determine the minimum semantic edit distance between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation by using Equation 2: minCost(S1, S2) = d - p × (2cost(T) - cost(J)).
- In Equation 2, S1 and S2 are the first character string and the second character string, respectively, minCost(S1, S2) is the minimum semantic edit distance, d is the edit distance, p is the number of pairs, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T) - cost(J) > 0.
- The third determining subunit 30334 is configured to determine the first semantic edit distance between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation by using Equation 3: minCost(S1, S2) = d - p × (2cost(T) - cost(J)).
- In Equation 3, S1 and S2 are the first character string and the second character string, respectively, minCost(S1, S2) is the first semantic edit distance, d is the edit distance, p is the number of pairs, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T) - cost(J) > 0.
- The fourth determining subunit 30335 is configured to determine the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence by using Equation 4: normFact(S1, S2) = min(n, m) × cost(T) + (max(n, m) - min(n, m)) × costM, where
- costM = cost(C), if n < m;
- costM = cost(S), if n > m.
- In Equation 4, normFact(S1, S2) is the second semantic edit distance, n is the number of words in the first sequence, m is the number of words in the second sequence, cost(T) is the operation cost of the replacement operation, cost(S) is the operation cost of the deletion operation, and cost(C) is the operation cost of the insertion operation.
- The fifth determining subunit 30336 is configured to determine the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance by using Equation 5: sim(S1, S2) = 1 - minCost(S1, S2) / normFact(S1, S2).
- In Equation 5, sim(S1, S2) is the similarity between the first character string and the second character string, minCost(S1, S2) is the first semantic edit distance, and normFact(S1, S2) is the second semantic edit distance.
- The similarity determining apparatus provided in the embodiments corresponding to FIG. 3 to FIG. 9 may be used to perform the similarity determining method provided by the embodiment corresponding to FIG. 1 or FIG. 2. The specific manner in which each module performs its operations has been described in detail in the embodiments relating to the method and will not be explained in detail herein.
- FIG. 10 is a block diagram of a terminal 600, which may be used to perform the similarity determination method provided by the embodiment corresponding to FIG. 1 or FIG. 2, according to an exemplary embodiment.
- terminal 600 can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
- terminal 600 may include one or more of the following components: processing component 602, memory 604, power component 606, multimedia component 608, audio component 610, I/O (Input/Output) interface 612, Sensor component 614, and communication component 616.
- Processing component 602 typically controls the overall operations of terminal 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
- Processing component 602 can include one or more processors 620 to execute instructions to perform all or part of the steps of the above described methods.
- processing component 602 can include one or more modules to facilitate interaction between component 602 and other components.
- processing component 602 can include a multimedia module to facilitate interaction between multimedia component 608 and processing component 602.
- Memory 604 is configured to store various types of data to support operation at terminal 600. Examples of such data include instructions for any application or method operating on terminal 600, contact data, phone book data, messages, pictures, videos, and the like.
- The memory 604 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as SRAM (Static Random Access Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM (Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), ROM (Read-Only Memory), magnetic memory, flash memory, a magnetic disk, or an optical disk.
- Power component 606 provides power to various components of terminal 600.
- Power component 606 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for terminal 600.
- The multimedia component 608 includes a screen that provides an output interface between the terminal 600 and the user.
- the screen may include an LCD (Liquid Crystal Display) and a TP (Touch Panel). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user.
- the touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor can sense not only the boundaries of the touch or sliding action, but also the duration and pressure associated with the touch or slide operation.
- the multimedia component 608 includes a front camera and/or a rear camera. When the terminal 600 is in an operation mode such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
- the audio component 610 is configured to output and/or input an audio signal.
- the audio component 610 includes a MIC (Microphone) that is configured to receive an external audio signal when the terminal 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode.
- the received audio signal may be further stored in memory 604 or transmitted via communication component 616.
- audio component 610 also includes a speaker for outputting an audio signal.
- the I/O interface 612 provides an interface between the processing component 602 and the peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
- Sensor assembly 614 includes one or more sensors for providing terminal 600 with various aspects of status assessment.
- Sensor component 614 can detect an open/closed state of terminal 600 and the relative positioning of components, for example the display and the keypad of the terminal 600.
- The sensor component 614 can also detect a change in the position of the terminal 600 or of a component of the terminal 600, the presence or absence of contact between the user and the terminal 600, the orientation or acceleration/deceleration of the terminal 600, and a change in the temperature of the terminal 600.
- Sensor assembly 614 can include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
- Sensor assembly 614 may also include a light sensor, such as a CMOS (Complementary Metal Oxide Semiconductor) or CCD (Charge-coupled Device) image sensor for use in imaging applications.
- the sensor component 614 can also include an acceleration sensor, a gyro sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
- Communication component 616 is configured to facilitate wired or wireless communication between terminal 600 and other devices.
- the terminal 600 can access a wireless network based on a communication standard such as WiFi, 2G or 3G, or a combination thereof.
- communication component 616 receives broadcast signals or broadcast associated information from an external broadcast management system via a broadcast channel.
- the communication component 616 also includes an NFC (Near Field Communication) module to facilitate short range communication.
- The NFC module can be implemented based on RFID (Radio Frequency Identification) technology, IrDA (Infra-red Data Association) technology, UWB (Ultra Wideband) technology, BT (Bluetooth) technology, and other technologies.
- The terminal 600 may be implemented by one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable Logic Devices), FPGAs (Field Programmable Gate Arrays), controllers, microcontrollers, microprocessors, or other electronic components, for performing the similarity determination method provided by the embodiment corresponding to FIG. 1 or FIG. 2.
- A non-transitory computer readable storage medium comprising instructions is also provided, such as the memory 604 comprising instructions that are executable by processor 620 of terminal 600 to perform the similarity determination method described above.
- The non-transitory computer readable storage medium may be a ROM, a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, etc.
- With the non-transitory computer readable storage medium, the first character string and the second character string are segmented into a first sequence and a second sequence, respectively, so that the edit distance for converting the first character string into the second character string is determined based on each word in the first sequence and the second sequence rather than on each character in the first character string and the second character string. Because each word in a string may include at least one character, the similarity determined according to the edit distance takes into account the correlation between the individual characters in the string, making the determined similarity more accurate.
- FIG. 11 is a block diagram of a server, according to an exemplary embodiment, which may perform the similarity determination method provided by the embodiment corresponding to FIG. 1 or FIG. 2.
- Server 700 includes a processing component 722, which further includes one or more processors, and memory resources represented by memory 732 for storing instructions executable by processing component 722, such as an application.
- An application stored in memory 732 can include one or more modules each corresponding to a set of instructions.
- the processing component 722 is configured to execute instructions to perform the similarity determination method provided by the embodiment corresponding to FIG. 1 or FIG. 2 above.
- Server 700 may also include a power component 726 configured to perform power management of server 700, a wired or wireless network interface 750 configured to connect server 700 to the network, and an input/output (I/O) interface 758.
- Server 700 can operate based on the operating system stored in memory 732, for example, Windows Server TM, Mac OS X TM , Unix TM, Linux TM, FreeBSD TM or similar.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
Description
Claims (25)
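- A similarity determination method, characterized in that the method comprises: segmenting a first character string and a second character string into words, respectively, to obtain a first sequence and a second sequence, the first sequence and the second sequence each including at least one word; determining an edit distance between the first character string and the second character string according to a predefined edit distance algorithm, the first sequence, and the second sequence; and determining the similarity between the first character string and the second character string according to the edit distance and information about each operation performed in transforming the first sequence into the second sequence.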
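- The method according to claim 1, characterized in that determining the similarity between the first character string and the second character string according to the edit distance and information about each operation performed in transforming the first sequence into the second sequence comprises: obtaining replacement operation information from the information about the operations performed when the first sequence is transformed into the second sequence; determining a number of pairs according to the replacement operation information, wherein the number of pairs refers to the number of two words that exist in the first sequence and the second sequence at the same time; and determining the similarity between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence.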
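- The method according to claim 2, characterized in that the operations include a replacement operation and an exchange operation, and determining the similarity between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence comprises: determining a minimum semantic edit distance between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation; normalizing the minimum semantic edit distance to obtain a normalization result; and determining the similarity between the first character string and the second character string according to the normalization result.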
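- The method according to claim 2, characterized in that the operations include at least one of a replacement operation, an exchange operation, an insertion operation, and a deletion operation, and determining the similarity between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence comprises: determining a first semantic edit distance between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation; determining a second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence; and determining the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance.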
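- The method according to any one of claims 2 to 4, characterized in that the method further comprises: determining the operation cost of the replacement operation and the operation cost of the exchange operation according to the relationship between the replacement operation and the exchange operation; and determining the operation cost of the insertion operation, the operation cost of the deletion operation, and the operation cost of the replacement operation according to the relationship between the replacement operation and the insertion and deletion operations.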
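- The method according to claim 5, characterized in that the method further comprises: determining, according to the relationship between the replacement operation and the exchange operation, that 2 × the operation cost of the replacement operation > the operation cost of the exchange operation; and determining, according to the relationship between the replacement operation and the insertion and deletion operations, that the operation cost of the insertion operation + the operation cost of the deletion operation > the operation cost of the replacement operation.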
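- The method according to claim 5, characterized in that the method further comprises: determining, according to the relationship between the insertion operation and the deletion operation, that the operation cost of the insertion operation is equal to the operation cost of the deletion operation.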
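- The method according to claim 2, characterized in that determining the edit distance between the first character string and the second character string according to the predefined edit distance algorithm, the first sequence, and the second sequence comprises: determining the edit distance between the first character string and the second character string according to the predefined edit distance algorithm, the first sequence, and the second sequence by using Equation 1: minCost[i, j] = min(minCost[i-1, j] + cost(S), minCost[i, j-1] + cost(C), minCost[i-1, j-1] + cost(T)); in Equation 1, i denotes the i-th word in the first sequence; j denotes the j-th word in the second sequence; cost(S) is the operation cost of the deletion operation, cost(C) is the operation cost of the insertion operation, and cost(T) is the operation cost of the replacement operation.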
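- The method according to claim 3, characterized in that determining the minimum semantic edit distance between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation comprises: determining the minimum semantic edit distance between the first character string and the second character string by using Equation 2: minCost(S1, S2) = d - p × (2cost(T) - cost(J)); in Equation 2, S1 and S2 are the first character string and the second character string, respectively, minCost(S1, S2) is the minimum semantic edit distance, d is the edit distance, p is the number of pairs, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T) - cost(J) > 0.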
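- The method according to claim 4, characterized in that determining the first semantic edit distance between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation comprises: determining the first semantic edit distance between the first character string and the second character string by using Equation 3: minCost(S1, S2) = d - p × (2cost(T) - cost(J)); in Equation 3, S1 and S2 are the first character string and the second character string, respectively, minCost(S1, S2) is the first semantic edit distance, d is the edit distance, p is the number of pairs, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T) - cost(J) > 0.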
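- The method according to claim 4, characterized in that determining the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence comprises: determining the second semantic edit distance between the first character string and the second character string by using Equation 4: normFact(S1, S2) = min(n, m) × cost(T) + (max(n, m) - min(n, m)) × costM, where costM = cost(C) if n < m, and costM = cost(S) if n > m; in Equation 4, normFact(S1, S2) is the second semantic edit distance, n is the number of words in the first sequence, m is the number of words in the second sequence, cost(T) is the operation cost of the replacement operation, cost(S) is the operation cost of the deletion operation, and cost(C) is the operation cost of the insertion operation.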
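- The method according to claim 4, characterized in that determining the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance comprises: determining the similarity between the first character string and the second character string by using Equation 5: sim(S1, S2) = 1 - minCost(S1, S2) / normFact(S1, S2); in Equation 5, sim(S1, S2) is the similarity between the first character string and the second character string, minCost(S1, S2) is the first semantic edit distance, and normFact(S1, S2) is the second semantic edit distance.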
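- A similarity determining apparatus, characterized in that the apparatus comprises: a word segmentation module, configured to segment a first character string and a second character string into words, respectively, to obtain a first sequence and a second sequence, the first sequence and the second sequence each including at least one word; a first determining module, configured to determine an edit distance between the first character string and the second character string according to a predefined edit distance algorithm, the first sequence, and the second sequence; and a second determining module, configured to determine the similarity between the first character string and the second character string according to the edit distance and information about each operation performed in transforming the first sequence into the second sequence.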
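- The apparatus according to claim 13, characterized in that the second determining module comprises: an obtaining unit, configured to obtain replacement operation information from the information about the operations performed when the first sequence is transformed into the second sequence; a first determining unit, configured to determine a number of pairs according to the replacement operation information, wherein the number of pairs refers to the number of two words that exist in the first sequence and the second sequence at the same time; and a second determining unit, configured to determine the similarity between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence.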
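- The apparatus according to claim 14, characterized in that the operations include a replacement operation and an exchange operation, and the second determining unit comprises: a first determining subunit, configured to determine a minimum semantic edit distance between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation; a normalization subunit, configured to normalize the minimum semantic edit distance to obtain a normalization result; and a second determining subunit, configured to determine the similarity between the first character string and the second character string according to the normalization result.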
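- The apparatus according to claim 14, characterized in that the operations include at least one of a replacement operation, an exchange operation, an insertion operation, and a deletion operation, and the second determining unit comprises: a third determining subunit, configured to determine a first semantic edit distance between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation; a fourth determining subunit, configured to determine a second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence; and a fifth determining subunit, configured to determine the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance.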
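- The apparatus according to any one of claims 14 to 16, characterized in that the apparatus further comprises: a third determining module, configured to determine the operation cost of the replacement operation and the operation cost of the exchange operation according to the relationship between the replacement operation and the exchange operation; and a fourth determining module, configured to determine the operation cost of the insertion operation, the operation cost of the deletion operation, and the operation cost of the replacement operation according to the relationship between the replacement operation and the insertion and deletion operations.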
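- The apparatus according to claim 17, characterized in that the apparatus further comprises: a fifth determining module, configured to determine, according to the relationship between the replacement operation and the exchange operation, that 2 × the operation cost of the replacement operation > the operation cost of the exchange operation; and a sixth determining module, configured to determine, according to the relationship between the replacement operation and the insertion and deletion operations, that the operation cost of the insertion operation + the operation cost of the deletion operation > the operation cost of the replacement operation.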
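- The apparatus according to claim 17, characterized in that the apparatus further comprises: a seventh determining module, configured to determine, according to the relationship between the insertion operation and the deletion operation, that the operation cost of the insertion operation is equal to the operation cost of the deletion operation.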
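- The apparatus according to claim 14, characterized in that the first determining module is configured to determine the edit distance between the first character string and the second character string according to the predefined edit distance algorithm, the first sequence, and the second sequence by using Equation 1: minCost[i, j] = min(minCost[i-1, j] + cost(S), minCost[i, j-1] + cost(C), minCost[i-1, j-1] + cost(T)); in Equation 1, i denotes the i-th word in the first sequence; j denotes the j-th word in the second sequence; cost(S) is the operation cost of the deletion operation, cost(C) is the operation cost of the insertion operation, and cost(T) is the operation cost of the replacement operation.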
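- The apparatus according to claim 15, characterized in that the first determining subunit is configured to determine the minimum semantic edit distance between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation by using Equation 2: minCost(S1, S2) = d - p × (2cost(T) - cost(J)); in Equation 2, S1 and S2 are the first character string and the second character string, respectively, minCost(S1, S2) is the minimum semantic edit distance, d is the edit distance, p is the number of pairs, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T) - cost(J) > 0.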
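- The apparatus according to claim 16, characterized in that the third determining subunit is configured to determine the first semantic edit distance between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation by using Equation 3: minCost(S1, S2) = d - p × (2cost(T) - cost(J)); in Equation 3, S1 and S2 are the first character string and the second character string, respectively, minCost(S1, S2) is the first semantic edit distance, d is the edit distance, p is the number of pairs, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T) - cost(J) > 0.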
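- The apparatus according to claim 16, characterized in that the fourth determining subunit is configured to determine the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence by using Equation 4: normFact(S1, S2) = min(n, m) × cost(T) + (max(n, m) - min(n, m)) × costM, where costM = cost(C) if n < m, and costM = cost(S) if n > m; in Equation 4, normFact(S1, S2) is the second semantic edit distance, n is the number of words in the first sequence, m is the number of words in the second sequence, cost(T) is the operation cost of the replacement operation, cost(S) is the operation cost of the deletion operation, and cost(C) is the operation cost of the insertion operation.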
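- The apparatus according to claim 16, characterized in that the fifth determining subunit is configured to determine the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance by using Equation 5: sim(S1, S2) = 1 - minCost(S1, S2) / normFact(S1, S2); in Equation 5, sim(S1, S2) is the similarity between the first character string and the second character string, minCost(S1, S2) is the first semantic edit distance, and normFact(S1, S2) is the second semantic edit distance.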
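- A terminal, characterized in that the terminal comprises: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to: segment a first character string and a second character string into words, respectively, to obtain a first sequence and a second sequence, the first sequence and the second sequence each including at least one word; determine an edit distance between the first character string and the second character string according to a predefined edit distance algorithm, the first sequence, and the second sequence; and determine the similarity between the first character string and the second character string according to the edit distance and information about each operation performed in transforming the first sequence into the second sequence.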
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020167006741A KR101782923B1 (ko) | 2015-12-03 | 2015-12-29 | Similarity determination method, apparatus, terminal, program, and storage medium |
RU2016118758A RU2664002C2 (ru) | 2015-12-03 | 2015-12-29 | Method and apparatus for determining similarity, and terminal |
MX2016005489A MX365897B (es) | 2015-12-03 | 2015-12-29 | Method and apparatus for determining similarity, and terminal |
JP2017553299A JP6321306B2 (ja) | 2015-12-03 | 2015-12-29 | Similarity determination method, apparatus, terminal, program, and recording medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510882468.2A CN105446957B (zh) | 2015-12-03 | 2015-12-03 | Similarity determination method, apparatus, and terminal |
CN201510882468.2 | 2015-12-03 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017092122A1 (zh) | 2017-06-08 |
Family
ID=55557172
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2015/099523 WO2017092122A1 (zh) | Similarity determination method, apparatus, and terminal | 2015-12-03 | 2015-12-29 |
Country Status (8)
Country | Link |
---|---|
US (1) | US10089301B2 (zh) |
EP (1) | EP3179379A1 (zh) |
JP (1) | JP6321306B2 (zh) |
KR (1) | KR101782923B1 (zh) |
CN (1) | CN105446957B (zh) |
MX (1) | MX365897B (zh) |
RU (1) | RU2664002C2 (zh) |
WO (1) | WO2017092122A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111352549A (zh) * | 2020-02-25 | 2020-06-30 | 腾讯科技(深圳)有限公司 | Data object display method, apparatus, device, and storage medium |
CN114757153A (zh) * | 2022-05-12 | 2022-07-15 | 阿里巴巴(中国)有限公司 | String and string-set processing method, computer device, and storage medium |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10296788B1 (en) * | 2016-12-19 | 2019-05-21 | Matrox Electronic Systems Ltd. | Method and system for processing candidate strings detected in an image to identify a match of a model string in the image |
US10853457B2 (en) * | 2018-02-06 | 2020-12-01 | Didi Research America, Llc | System and method for program security protection |
US10515149B2 (en) * | 2018-03-30 | 2019-12-24 | BlackBoiler, LLC | Method and system for suggesting revisions to an electronic document |
WO2020061910A1 (zh) * | 2018-09-27 | 2020-04-02 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating information |
SG10201904554TA (en) * | 2019-05-21 | 2019-09-27 | Alibaba Group Holding Ltd | Methods and devices for quantifying text similarity |
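CN110750615B (zh) * | 2019-09-30 | 2020-07-24 | 贝壳找房(北京)科技有限公司 | Text duplication determination method and apparatus, electronic device, and storage medium |
CN110909161B (zh) * | 2019-11-12 | 2022-04-08 | 西安电子科技大学 | English word classification method based on density clustering and visual similarity |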
US11776529B2 (en) * | 2020-04-28 | 2023-10-03 | Samsung Electronics Co., Ltd. | Method and apparatus with speech processing |
KR20210132855A (ko) * | 2020-04-28 | 2021-11-05 | 삼성전자주식회사 | Speech processing method and apparatus |
CN111967270B (zh) * | 2020-08-16 | 2023-11-21 | 云知声智能科技股份有限公司 | Method and device based on fusion of characters and semantics |
US11681864B2 (en) | 2021-01-04 | 2023-06-20 | Blackboiler, Inc. | Editing parameters |
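CN112597313B (zh) * | 2021-03-03 | 2021-06-29 | 北京沃丰时代数据科技有限公司 | Short text clustering method, apparatus, electronic device, and storage medium |
KR102517661B1 (ko) * | 2022-07-15 | 2023-04-04 | 주식회사 액션파워 | Method for identifying a word corresponding to a target word in text information |
CN116564414B (zh) * | 2023-07-07 | 2024-03-26 | 腾讯科技(深圳)有限公司 | Molecular sequence alignment method, apparatus, electronic device, storage medium, and product |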
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040141354A1 (en) * | 2003-01-18 | 2004-07-22 | Carnahan John M. | Query string matching method and apparatus |
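CN101561813A (zh) * | 2009-05-27 | 2009-10-21 | 东北大学 | Method for analyzing string similarity in a Web environment |
CN101751430A (zh) * | 2008-12-12 | 2010-06-23 | 汉王科技股份有限公司 | Fuzzy retrieval method for an electronic dictionary |
CN102622338A (zh) * | 2012-02-24 | 2012-08-01 | 北京工业大学 | Computer-aided method for calculating the semantic distance between short texts |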
Family Cites Families (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5757959A (en) * | 1995-04-05 | 1998-05-26 | Panasonic Technologies, Inc. | System and method for handwriting matching using edit distance computation in a systolic array processor |
NO983175L (no) | 1998-07-10 | 2000-01-11 | Fast Search & Transfer Asa | Search system for data retrieval |
JP2001291060A (ja) * | 2000-04-04 | 2001-10-19 | Toshiba Corp | Word string matching device and word string matching method |
US7107204B1 (en) * | 2000-04-24 | 2006-09-12 | Microsoft Corporation | Computer-aided writing system and method with cross-language writing wizard |
US6810376B1 (en) * | 2000-07-11 | 2004-10-26 | Nusuara Technologies Sdn Bhd | System and methods for determining semantic similarity of sentences |
EP1668541A1 (en) * | 2003-09-30 | 2006-06-14 | British Telecommunications Public Limited Company | Information retrieval |
JP2005352888A (ja) * | 2004-06-11 | 2005-12-22 | Hitachi Ltd | System for creating a dictionary that handles notation variants |
US8077984B2 (en) * | 2008-01-04 | 2011-12-13 | Xerox Corporation | Method for computing similarity between text spans using factored word sequence kernels |
US8775441B2 (en) | 2008-01-16 | 2014-07-08 | Ab Initio Technology Llc | Managing an archive for approximate string matching |
US8812493B2 (en) * | 2008-04-11 | 2014-08-19 | Microsoft Corporation | Search results ranking using editing distance and document information |
US8170969B2 (en) * | 2008-08-13 | 2012-05-01 | Siemens Aktiengesellschaft | Automated computation of semantic similarity of pairs of named entity phrases using electronic document corpora as background knowledge |
US8219583B2 (en) * | 2008-11-10 | 2012-07-10 | Nbcuniversal Media, Llc | Methods and systems for mining websites |
US8290989B2 (en) * | 2008-11-12 | 2012-10-16 | Sap Ag | Data model optimization |
CN101957828B (zh) | 2009-07-20 | 2013-03-06 | 阿里巴巴集团控股有限公司 | Method and apparatus for ranking search results |
WO2014136173A1 (ja) * | 2013-03-04 | 2014-09-12 | 三菱電機株式会社 | Search device |
CN103399907A (zh) * | 2013-07-31 | 2013-11-20 | 深圳市华傲数据技术有限公司 | Method and apparatus for calculating Chinese string similarity based on edit distance |
CA2861469A1 (en) * | 2013-08-14 | 2015-02-14 | National Research Council Of Canada | Method and apparatus to construct program for assisting in reviewing |
JP6143638B2 (ja) * | 2013-10-17 | 2017-06-07 | 株式会社日立ソリューションズ東日本 | Data processing device and data processing method |
US9430463B2 (en) * | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
US9672206B2 (en) * | 2015-06-01 | 2017-06-06 | Information Extraction Systems, Inc. | Apparatus, system and method for application-specific and customizable semantic similarity measurement |
- 2015
  - 2015-12-03 CN CN201510882468.2A patent/CN105446957B/zh active Active
  - 2015-12-29 WO PCT/CN2015/099523 patent/WO2017092122A1/zh active Application Filing
  - 2015-12-29 JP JP2017553299A patent/JP6321306B2/ja active Active
  - 2015-12-29 RU RU2016118758A patent/RU2664002C2/ru active
  - 2015-12-29 KR KR1020167006741A patent/KR101782923B1/ko active IP Right Grant
  - 2015-12-29 MX MX2016005489A patent/MX365897B/es active IP Right Grant
- 2016
  - 2016-09-26 EP EP16190672.2A patent/EP3179379A1/en not_active Ceased
  - 2016-11-10 US US15/348,697 patent/US10089301B2/en active Active
Non-Patent Citations (1)
Title |
---|
13 April 2015 (2015-04-13), XP055598524, Retrieved from the Internet <URL:http://wenku.baidu.com/view/df531365227916888586d702.html> * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
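CN111352549A (zh) * | 2020-02-25 | 2020-06-30 | 腾讯科技(深圳)有限公司 | Data object display method, apparatus, device, and storage medium |
CN111352549B (zh) * | 2020-02-25 | 2022-01-07 | 腾讯科技(深圳)有限公司 | Data object display method, apparatus, device, and storage medium |
CN114757153A (zh) * | 2022-05-12 | 2022-07-15 | 阿里巴巴(中国)有限公司 | Method for processing character strings and character string sets, computer device, and storage medium |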
Also Published As
Publication number | Publication date |
---|---|
US10089301B2 (en) | 2018-10-02 |
KR101782923B1 (ko) | 2017-09-28 |
CN105446957A (zh) | 2016-03-30 |
MX2016005489A (es) | 2017-11-30 |
MX365897B (es) | 2019-06-19 |
JP6321306B2 (ja) | 2018-05-09 |
CN105446957B (zh) | 2018-07-20 |
RU2664002C2 (ru) | 2018-08-14 |
US20170161260A1 (en) | 2017-06-08 |
RU2016118758A (ru) | 2017-11-20 |
EP3179379A1 (en) | 2017-06-14 |
JP2018501597A (ja) | 2018-01-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2017092122A1 (zh) | Similarity determination method, apparatus, and terminal | |
EP3079082B1 (en) | Method and apparatus for album display | |
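WO2020029966A1 (zh) | Video processing method and apparatus, electronic device, and storage medium | |
WO2021027343A1 (zh) | Face image recognition method and apparatus, electronic device, and storage medium | |
TW202105199A (zh) | Data update method, electronic device, and storage medium | |
WO2017028416A1 (zh) | Classifier training method, type recognition method, and apparatus | |
WO2021036382A1 (zh) | Image processing method and apparatus, electronic device, and storage medium | |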
EP3176709A1 (en) | Video categorization method and apparatus, computer program and recording medium | |
US9661133B2 (en) | Electronic device and method for extracting incoming/outgoing information and managing contacts | |
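WO2019165832A1 (zh) | Text information processing method, apparatus, and terminal | |
WO2017088247A1 (zh) | Input processing method, apparatus, and device | |
WO2016082461A1 (zh) | Method for obtaining recommendation information, terminal, and server | |
CN106777016B (zh) | Method and apparatus for information recommendation based on instant messaging | |
TW202117707A (zh) | Data processing method, electronic device, and computer-readable storage medium | |
CN111160047A (zh) | Data processing method and apparatus, and apparatus for data processing | |
WO2023078414A1 (zh) | Related article search method and apparatus, electronic device, and storage medium | |
CN106911706B (zh) | Method and apparatus for adding a call background | |
CN104268151A (zh) | Contact grouping method and apparatus | |
WO2016197549A1 (zh) | Method and apparatus for performing a search | |
CN108241438B (zh) | Input method and apparatus, and apparatus for input | |
WO2015188589A1 (zh) | User data update method and apparatus | |
CN113435205A (zh) | Semantic parsing method and apparatus | |
WO2019196527A1 (zh) | Data processing method and apparatus, and electronic device | |
TWI739633B (zh) | Storage and reading method, electronic device, and computer-readable storage medium | |
CN115374075A (zh) | File type identification method and apparatus |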
Legal Events
Date | Code | Title | Description |
---|---|---|---|
ENP | Entry into the national phase | Ref document number: 2017553299; Country of ref document: JP; Kind code of ref document: A |
WWE | WIPO information: entry into national phase | Ref document number: 1020167006741; Country of ref document: KR |
WWE | WIPO information: entry into national phase | Ref document number: MX/A/2016/005489; Country of ref document: MX |
ENP | Entry into the national phase | Ref document number: 2016118758; Country of ref document: RU; Kind code of ref document: A |
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 15909636; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | EP: PCT application non-entry in European phase | Ref document number: 15909636; Country of ref document: EP; Kind code of ref document: A1 |