WO2017092122A1 - Similarity determination method, apparatus, and terminal - Google Patents

Similarity determination method, apparatus, and terminal

Info

Publication number
WO2017092122A1
WO2017092122A1 PCT/CN2015/099523 CN2015099523W WO2017092122A1 WO 2017092122 A1 WO2017092122 A1 WO 2017092122A1 CN 2015099523 W CN2015099523 W CN 2015099523W WO 2017092122 A1 WO2017092122 A1 WO 2017092122A1
Authority
WO
WIPO (PCT)
Prior art keywords
cost
sequence
character string
edit distance
determining
Prior art date
Application number
PCT/CN2015/099523
Other languages
English (en)
French (fr)
Inventor
汪平仄
张涛
龙飞
Original Assignee
小米科技有限责任公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 小米科技有限责任公司
Priority to KR1020167006741A priority Critical patent/KR101782923B1/ko
Priority to RU2016118758A priority patent/RU2664002C2/ru
Priority to MX2016005489A priority patent/MX365897B/es
Priority to JP2017553299A priority patent/JP6321306B2/ja
Publication of WO2017092122A1 publication Critical patent/WO2017092122A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/26 Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262 Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V30/274 Syntactic or semantic context, e.g. balancing

Definitions

  • the present disclosure relates to the field of natural language processing, and in particular, to a similarity determining method, apparatus, and terminal.
  • In the related art, the similarity between two character strings can be determined by calculating the edit distance between them.
  • Specifically, the two strings are each split into individual characters; one string is then converted into the other by deleting, inserting, or replacing characters; the minimum number of operations required for the conversion is taken as the edit distance between the two strings; finally, the similarity between the two strings is calculated from the edit distance.
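  • The character-level approach of the related art can be sketched as follows. This is a minimal illustration of the classic unit-cost edit distance, not code from the disclosure; the function name `levenshtein` is an assumption made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    replacements needed to turn string a into string b (unit costs)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

  • Because every character is treated independently, this baseline ignores which characters belong to the same word, which is the limitation the disclosure addresses.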
  • the present disclosure provides a similarity determination method, apparatus, and terminal.
  • a similarity determining method comprising:
  • a similarity determining apparatus comprising:
  • a word segmentation module configured to respectively segment the first character string and the second character string to obtain a first sequence and a second sequence, the first sequence and the second sequence respectively including at least one word;
  • a first determining module configured to determine an edit distance between the first character string and the second character string according to a predefined edit distance algorithm and the first sequence and the second sequence;
  • a second determining module configured to determine the similarity between the first character string and the second character string according to the edit distance and information about each operation performed in converting the first sequence into the second sequence.
  • a terminal comprising:
  • a memory for storing processor executable instructions
  • processor is configured to:
  • Because each word in a string may include at least one character, the similarity determined according to the edit distance takes into account the correlation between the characters in the string, making the determined similarity more accurate.
  • FIG. 1 is a flow chart showing a similarity determination method according to an exemplary embodiment.
  • FIG. 2 is a flow chart showing a similarity determination method according to an exemplary embodiment.
  • FIG. 3 is a block diagram of a similarity determining apparatus, according to an exemplary embodiment.
  • FIG. 4 is a block diagram of a second determining module, according to an exemplary embodiment.
  • FIG. 5 is a block diagram of a second determining unit, according to an exemplary embodiment.
  • FIG. 6 is a block diagram of a second determining unit, according to an exemplary embodiment.
  • FIG. 7 is a block diagram of a similarity determining apparatus, according to an exemplary embodiment.
  • FIG. 8 is a block diagram of a similarity determining apparatus, according to an exemplary embodiment.
  • FIG. 9 is a block diagram of a similarity determining apparatus, according to an exemplary embodiment.
  • FIG. 10 is a block diagram of a terminal, according to an exemplary embodiment.
  • FIG. 11 is a block diagram of a server, according to an exemplary embodiment.
  • FIG. 1 is a flowchart of a similarity determining method according to an exemplary embodiment.
  • the similarity determining method provided by the embodiment of the present disclosure may be used in a terminal. As shown in FIG. 1 , the similarity determining method provided by the embodiment of the present disclosure includes the following steps.
  • In step S101, the first character string and the second character string are each segmented to obtain a first sequence and a second sequence, wherein the first sequence and the second sequence each include at least one word.
  • In step S102, an edit distance between the first character string and the second character string is determined according to a predefined edit distance algorithm, the first sequence, and the second sequence.
  • In step S103, the similarity between the first character string and the second character string is determined according to the edit distance and information about each operation performed in converting the first sequence into the second sequence.
  • In the method provided by the embodiment of the present disclosure, the first character string and the second character string are segmented into the first sequence and the second sequence, so that the edit distance for converting the first character string into the second character string is determined based on the words in the first sequence and the second sequence rather than on the individual characters in the two strings. Because each word may include at least one character, the similarity determined according to the edit distance takes into account the correlation between the characters in the string, making the determined similarity more accurate.
  • determining the similarity between the first character string and the second character string according to the edit distance and information about each operation performed in converting the first sequence into the second sequence, including:
  • the similarity between the first character string and the second character string is determined according to the edit distance, the number of pairs, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence.
  • the operations include a replacement operation and an exchange operation, including:
  • the similarity between the first string and the second string is determined according to the normalized result.
  • the operations include at least one of a replacement operation, an exchange operation, an insertion operation, and a deletion operation, including:
  • the method further includes:
  • the operation cost of the insertion operation, the operation cost of the deletion operation, and the operation cost of the replacement operation are determined.
  • the method further includes:
  • the operation costs are determined such that the operation cost of the insertion operation + the operation cost of the deletion operation > the operation cost of the replacement operation.
  • the method further includes:
  • determining an edit distance between the first character string and the second character string according to the predefined edit distance algorithm and the first sequence and the second sequence including:
  • the edit distance between the first string and the second string is determined by the following formula 1:
  • the minimum semantic edit distance between the first character string and the second character string is determined according to the edit distance, the number of pairs, and the operation cost of the replacement operation, and the operation cost of the exchange operation, including:
  • the minimum semantic edit distance between the first string and the second string is determined by the following formula 2:
  • S1 and S2 are the first character string and the second character string, respectively, minCost(S1, S2) is the minimum semantic edit distance, d is the edit distance, p is the number of pairs, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T)-cost(J)>0.
  • the first semantic edit distance between the first character string and the second character string is determined according to the edit distance, the number of pairs, and the operation cost of the replacement operation, and the operation cost of the exchange operation, including:
  • the first semantic edit distance between the first character string and the second character string is determined by the following formula 3 according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation:
  • S1 and S2 are the first character string and the second character string, respectively, minCost(S1, S2) is the first semantic edit distance, d is the edit distance, p is the number of pairs, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T)-cost(J)>0.
  • determining the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence, including:
  • the second semantic edit distance between the first character string and the second character string is determined by the following formula 4, where:
  • $\mathrm{costM} = \mathrm{cost}(C)$, if $n < m$;
  • $\mathrm{costM} = \mathrm{cost}(S)$, if $n > m$.
  • In Equation 4, normFact(S1, S2) is the second semantic edit distance, n is the number of words in the first sequence, m is the number of words in the second sequence, cost(T) is the operation cost of the replacement operation, cost(S) is the operation cost of the deletion operation, and cost(C) is the operation cost of the insertion operation.
  • determining the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance, including:
  • the similarity between the first character string and the second character string is determined by the following formula 5:
  • sim(S1, S2) is the similarity between the first string and the second string
  • minCost(S1, S2) is the first semantic edit distance
  • normFact(S1, S2) is the second semantic edit distance
  • FIG. 2 is a flowchart of a similarity determining method according to an exemplary embodiment, and the similarity determining method may be applied to a terminal.
  • the similarity determining method provided by the embodiment of the present disclosure includes the following steps.
  • In step S201, the first character string and the second character string are each segmented to obtain a first sequence and a second sequence.
  • The embodiment of the present disclosure does not split the two strings into characters. Instead, it segments each of the two character strings into words, so that each segmented string includes at least one word.
  • The embodiment of the present disclosure defines the two character strings whose similarity is to be determined as a first character string and a second character string, respectively. After the first character string is segmented into individual words, the first sequence is obtained; after the second character string is segmented into individual words, the second sequence is obtained.
  • The first sequence and the second sequence each comprise at least one word.
  • The first sequence and the second sequence are (S11, S12, S13, ..., S1n) and (S21, S22, S23, ..., S2m), respectively.
  • The number of words in S1 is n.
  • The number of words in S2 is m.
  • The embodiment of the present disclosure does not specifically limit the language of the first character string and the second character string.
  • For example, the first string and the second string may both be Chinese, or both be English.
  • The first character string and the second character string may each be a sentence.
  • For example, the first string is “Today I am going to Xiangshan” and the second string is “I am going to Xiangshan today”.
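  • The segmentation step can be illustrated with a small sketch. It assumes whitespace-delimited English text; the disclosure does not name a segmentation algorithm, and Chinese input would need a dedicated word segmenter. The helper `segment` and the punctuation set are hypothetical choices made for this example.

```python
from typing import List

# Assumption: punctuation to trim from the edges of each token.
_PUNCTUATION = ".,!?;:\"'"


def segment(text: str) -> List[str]:
    """Split a sentence into lowercase word tokens on whitespace."""
    tokens = []
    for raw in text.split():
        word = raw.strip(_PUNCTUATION).lower()
        if word:  # skip tokens that were pure punctuation
            tokens.append(word)
    return tokens


first_sequence = segment("Today I am going to Xiangshan")
second_sequence = segment("I am going to Xiangshan today")

if __name__ == "__main__":
    print(first_sequence)
    print(second_sequence)
```

  • The two sequences contain the same words in a different order, which is the situation the word-level method is meant to handle.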
  • In step S202, the operation cost of the replacement operation and the operation cost of the exchange operation are determined according to the relationship between the replacement operation and the exchange operation, and the operation cost of the insertion operation, the operation cost of the deletion operation, and the operation cost of the replacement operation are determined according to the relationship between the replacement operation and the insertion and deletion operations.
  • The traditional method of determining the similarity between strings, when converting one string into another, typically uses three editing operations, namely an insertion operation, a deletion operation, and a replacement operation, and the operation costs of the three operations are the same.
  • However, some components can appear in different parts of a string without changing its overall meaning. For example, “Today I am going to Xiangshan”, “I plan to go to Xiangshan today”, and “I am going to Xiangshan today” place the words in different positions, yet the three strings have the same meaning. Therefore, in the embodiment of the present disclosure, an exchange operation is newly defined in addition to the traditional insertion, deletion, and replacement operations, and different operation costs are defined for different operations according to the relationships between the operations.
  • The embodiment of the present disclosure can determine the replacement operation cost and the exchange operation cost according to the relationship between the replacement operation and the exchange operation.
  • For example, the replacement operation cost and the exchange operation cost defined in the embodiment of the present disclosure satisfy: 2 × the operation cost of the replacement operation > the operation cost of the exchange operation, that is, $2\,\mathrm{cost}(T) - \mathrm{cost}(J) > 0$,
  • where cost(T) is the operation cost of the replacement operation and cost(J) is the operation cost of the exchange operation.
  • The embodiment of the present disclosure can determine the insertion operation cost and the deletion operation cost according to the relationship between the replacement operation and the insertion and deletion operations.
  • For example, the replacement operation cost, the insertion operation cost, and the deletion operation cost defined in the embodiment of the present disclosure satisfy: the operation cost of the insertion operation + the operation cost of the deletion operation > the operation cost of the replacement operation.
  • In addition, the operation cost of the replacement operation is greater than the maximum of the operation cost of the insertion operation and the operation cost of the deletion operation.
  • Such a relationship can be expressed as: $\mathrm{cost}(C) + \mathrm{cost}(S) > \mathrm{cost}(T) > \max(\mathrm{cost}(C), \mathrm{cost}(S))$,
  • where cost(S) is the operation cost of the deletion operation,
  • cost(C) is the operation cost of the insertion operation, and cost(T) is the operation cost of the replacement operation.
  • The insertion operation cost and the deletion operation cost are determined according to the relationship between the insertion operation and the deletion operation.
  • the insertion operation cost and the deletion operation cost may be different or the same, and the embodiment of the present disclosure does not specifically limit this.
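  • The cost relationships described above can be checked mechanically. The sketch below is illustrative only: the class name `OpCosts`, its field names, and the sample values are assumptions, while the three inequalities are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpCosts:
    insert: float   # cost(C)
    delete: float   # cost(S)
    replace: float  # cost(T)
    swap: float     # cost(J), the exchange operation

    def is_valid(self) -> bool:
        """Return True if the costs satisfy the relationships in the text."""
        return (
            2 * self.replace > self.swap                         # 2cost(T) - cost(J) > 0
            and self.insert + self.delete > self.replace         # cost(C) + cost(S) > cost(T)
            and self.replace > max(self.insert, self.delete)     # cost(T) > max(cost(C), cost(S))
        )


if __name__ == "__main__":
    print(OpCosts(insert=1.0, delete=1.0, replace=1.5, swap=1.0).is_valid())
    print(OpCosts(insert=1.0, delete=1.0, replace=1.0, swap=1.0).is_valid())
```

  • Note that the traditional setting with identical costs for all operations does not satisfy the third relationship, since the replacement cost is then equal to, not greater than, the insertion and deletion costs.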
  • In step S203, a predefined edit distance algorithm is generated according to the operation cost of the replacement operation, the operation cost of the deletion operation, and the operation cost of the insertion operation.
  • the predefined edit distance algorithm can be as shown in Equation 1 below.
  • The edit distance algorithm predefined in the embodiment of the present disclosure is a dynamic programming algorithm.
  • It is obtained from the operation cost of the deletion operation, the operation cost of the insertion operation, and the operation cost of the replacement operation that are predefined in the embodiment of the present disclosure.
  • Step S202 and step S203 need to be performed before the similarity is determined, but they do not need to be performed each time the similarity between two strings is determined. It is sufficient that the operation costs of the various operations and the predefined edit distance algorithm have been determined before the similarity is determined.
  • Once determined, the operation costs of the various operations and the predefined edit distance algorithm can be reused.
  • In step S204, an edit distance between the first character string and the second character string is determined according to the predefined edit distance algorithm, the first sequence, and the second sequence.
  • The edit distance between two strings refers to the minimum number of edit operations required to convert one of the strings into the other. Because each edit operation corresponds to an operation cost, the total operation cost of the conversion can be taken as the edit distance.
  • the editing operations that can be performed include a replacement operation, an insertion operation, a deletion operation, and an exchange operation.
  • the embodiment of the present disclosure may determine the edit distance between the first character string and the second character string according to the predefined edit distance algorithm and the first sequence and the second sequence.
  • the edit distance between the first character string and the second character string is calculated by the above formula 1 according to the pre-defined edit distance algorithm and the first sequence and the second sequence.
  • The principle of calculating the edit distance by Equation 1 is the same as that of calculating the edit distance with an existing dynamic programming algorithm, and the embodiment of the present disclosure does not elaborate on it.
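  • Since Equation 1 is not reproduced in this text, the following is only a sketch of what a word-level, cost-weighted dynamic programming edit distance could look like. It is the standard weighted edit distance applied to word sequences, not necessarily the exact recurrence of Equation 1; the function name and default costs are assumptions.

```python
from typing import Sequence


def weighted_edit_distance(
    src: Sequence[str],
    dst: Sequence[str],
    cost_insert: float = 1.0,   # cost(C), assumed value
    cost_delete: float = 1.0,   # cost(S), assumed value
    cost_replace: float = 1.5,  # cost(T), assumed value
) -> float:
    """Minimum total cost of converting word sequence src into dst using
    word insertions, deletions, and replacements."""
    n, m = len(src), len(dst)
    # dp[i][j] is the cost of converting src[:i] into dst[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + cost_delete
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + cost_insert
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = src[i - 1] == dst[j - 1]
            dp[i][j] = min(
                dp[i - 1][j] + cost_delete,                          # delete src word
                dp[i][j - 1] + cost_insert,                          # insert dst word
                dp[i - 1][j - 1] + (0.0 if same else cost_replace),  # keep or replace
            )
    return dp[n][m]


if __name__ == "__main__":
    print(weighted_edit_distance(["i", "plan", "go"], ["i", "intend", "go"]))
```

  • Words are compared as whole units, so a single changed word costs one replacement regardless of how many characters differ.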
  • In step S205, the similarity between the first character string and the second character string is determined according to the edit distance and information about each operation performed in converting the first sequence into the second sequence.
  • The information about each operation performed in converting the first sequence into the second sequence includes the type of operation, the number of operations of each type, and the operation cost of each type of operation.
  • Because different operations are predefined with different operation costs in the embodiment of the present disclosure, and given the definition of the edit distance between two strings, the operation costs of the operations required to convert the first character string into the second character string directly affect the edit distance.
  • Therefore, the similarity between the first character string and the second character string needs to be determined according to the edit distance and the information about the operations performed when the edit distance was obtained. This information includes the operation costs.
  • The operation costs of the various operations performed when the edit distance was obtained are preset in step S202.
  • For example, the operations performed when converting from the first character string to the second character string may include two insertion operations, one deletion operation, one exchange operation, and one replacement operation.
  • Determining the similarity between the first character string and the second character string can be implemented by the following steps S2051 to S2053:
  • In step S2051, the replacement operation information is acquired from the information about the operations performed in converting the first sequence into the second sequence when the edit distance was obtained.
  • the replacement operation is to replace one word in the first string with another word.
  • During the conversion, the information of each replacement operation performed is collected and recorded in a specified set.
  • The information of a replacement operation includes the replaced word and the position of the replaced word in the sequence, so the data recorded in the specified set includes each replaced word and its position in the first sequence. For example, if the first string is "I plan to go to Xiangshan today", the first sequence is "I-Plan-Today-Go-Xiangshan", and the words "Xiangshan" and "Plan" are replaced, then the replacement operation information recorded in the specified set includes "Plan-2, Xiangshan-5".
  • Therefore, the replacement operation information among the operations performed in converting the first character string into the second character string when the edit distance was obtained, specifically the replaced word of each replacement operation and the position of each replaced word in the first sequence, can be obtained from the specified set.
  • The embodiment of the present disclosure newly defines the exchange operation according to the relationship between the replacement operation and the exchange operation, and defines 2cost(T)-cost(J)>0 in advance, so the cost of two replacement operations is greater than the cost of one exchange operation.
  • Therefore, if part of the conversion from the first string to the second string can be implemented by one exchange operation, it is not implemented by two replacement operations. Accordingly, in addition to recording the replaced words in the first sequence and the position of each replaced word in the first sequence, it is further determined whether any two words in the specified set also exist in the second sequence. If they do, the two words and the position of each word in the second sequence are also recorded in the specified set.
  • For example, the first string is "I am going to Xiangshan today", the first sequence is "I-Plan-Today-Go-Xiangshan", and the words "Xiangshan" and "Plan" are replaced.
  • The second string is "Today I am going to Xiangshan" and the second sequence is "Today-I-Plan-Go-Xiangshan". Because the replaced words "Xiangshan" and "Plan" exist in both the first sequence and the second sequence,
  • the data recorded in the specified set can be "Plan-S12, Xiangshan-S15; Plan-S23, Xiangshan-S25".
  • Here, "Xiangshan" and "Plan" are defined as matching words between the first character string and the second character string.
  • Matching words are any two words that exist in both the first sequence and the second sequence.
  • In step S2052, the number of pairs is determined based on the replacement operation information.
  • The number of pairs refers to the number of matching words in the first sequence and the second sequence, that is, the number of words that exist in both the first sequence and the second sequence. Given the above description of the data recorded in the specified set, the number of pairs can be determined from that data.
  • In the example above, the number of pairs can be determined as 2.
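  • One possible reading of steps S2051 and S2052 is sketched below: the specified set is modelled as a list of (word, position in the first sequence) records, and a recorded word counts toward the number of pairs if it also occurs in the second sequence. The function name `match_replaced_words`, the record layout, and this counting rule are assumptions made for illustration; the disclosure does not give code for this step.

```python
from typing import Dict, List, Sequence, Tuple


def match_replaced_words(
    replaced: Sequence[Tuple[str, int]],
    second_sequence: Sequence[str],
) -> List[Tuple[str, int, int]]:
    """For each replaced word that also occurs in the second sequence, return
    (word, 1-based position in first sequence, 1-based position in second)."""
    first_pos_in_second: Dict[str, int] = {}
    for idx, word in enumerate(second_sequence, start=1):
        first_pos_in_second.setdefault(word, idx)  # keep the first occurrence
    return [
        (word, pos, first_pos_in_second[word])
        for word, pos in replaced
        if word in first_pos_in_second
    ]


# Records corresponding to "Plan-2, Xiangshan-5" in the text's example.
replaced_words = [("plan", 2), ("xiangshan", 5)]
second = ["today", "i", "plan", "go", "xiangshan"]

matches = match_replaced_words(replaced_words, second)
number_of_pairs = len(matches)

if __name__ == "__main__":
    print(matches, number_of_pairs)
```

  • With the example records, both replaced words occur in the second sequence, so the number of pairs is 2, in line with the value stated in the text.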
  • In step S2053, the similarity between the first character string and the second character string is determined according to the edit distance, the number of pairs, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence.
  • When the embodiment of the present disclosure determines the similarity between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence,
  • the operations may include a replacement operation and an exchange operation.
  • In this case, step S2053 can be implemented by the following steps S20531 to S20533:
  • In step S20531, the minimum semantic edit distance between the first character string and the second character string is determined according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation.
  • Specifically, based on the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation,
  • the minimum semantic edit distance between the first string and the second string is determined by Equation 2 below:
  • S1 and S2 are the first character string and the second character string, respectively, minCost(S1, S2) is the minimum semantic edit distance, d is the edit distance, p is the number of pairs, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T)-cost(J)>0.
  • In step S20532, the minimum semantic edit distance is normalized to obtain a normalized result.
  • The minimum semantic edit distance can be normalized by the maximum semantic edit distance between the first string and the second string.
  • The maximum semantic edit distance can be expressed by Equation 4 below, where:
  • $\mathrm{costM} = \mathrm{cost}(C)$, if $n < m$;
  • $\mathrm{costM} = \mathrm{cost}(S)$, if $n > m$.
  • In Equation 4, normFact(S1, S2) represents the maximum semantic edit distance, n represents the number of words in the first sequence, and m represents the number of words in the second sequence.
  • the normalized result obtained by normalizing the minimum semantic edit distance minCost(S1, S2) is minCost(S1, S2)/normFact(S1, S2).
  • In this way, minCost(S1, S2)/normFact(S1, S2) is mapped to a value between 0 and 1, which makes the similarity easy to judge intuitively.
  • In step S20533, the similarity between the first character string and the second character string is determined based on the normalized result.
  • The similarity between the first string and the second string can be determined according to the normalized result by the following formula 5:
  • sim(S1, S2) is the similarity between the first string and the second string
  • minCost(S1, S2) is the minimum semantic edit distance
  • normFact(S1, S2) is the maximum semantic edit distance
  • minCost(S1, S2)/normFact(S1, S2) is the normalized result.
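  • The normalization and similarity steps can be sketched as follows. Equations 4 and 5 are not reproduced in this text, so both formulas in the code are assumptions: normFact is taken to be the cost of replacing min(n, m) words and inserting or deleting the remaining |n - m| words at costM, and the similarity is taken to be 1 minus the normalized result. Only the choice of costM follows the text directly.

```python
def norm_fact(n: int, m: int, cost_insert: float, cost_delete: float,
              cost_replace: float) -> float:
    """Assumed form of the maximum semantic edit distance normFact(S1, S2)."""
    # costM = cost(C) if n < m, cost(S) if n > m; irrelevant when n == m.
    cost_m = cost_insert if n < m else cost_delete
    return min(n, m) * cost_replace + abs(n - m) * cost_m


def similarity(min_cost: float, n: int, m: int, cost_insert: float,
               cost_delete: float, cost_replace: float) -> float:
    """Assumed form of sim(S1, S2): 1 - minCost / normFact."""
    factor = norm_fact(n, m, cost_insert, cost_delete, cost_replace)
    if factor == 0:
        return 1.0  # both sequences are empty, treated here as identical
    return 1.0 - min_cost / factor


if __name__ == "__main__":
    # Hypothetical values: minCost = 2.5 for sequences of 2 and 4 words.
    print(similarity(2.5, 2, 4, cost_insert=1.0, cost_delete=1.2, cost_replace=1.5))
```

  • Under these assumptions, identical strings (minCost of 0) give a similarity of 1, and a minCost equal to normFact gives a similarity of 0.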
  • When the embodiment of the present disclosure determines the similarity between the first character string and the second character string according to the edit distance, the number of pairs, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence,
  • the operations may alternatively include at least one of a replacement operation, an exchange operation, an insertion operation, and a deletion operation.
  • In this case, the determination can be implemented by the following steps S20534 to S20536:
  • In step S20534, the first semantic edit distance between the first character string and the second character string is determined according to the edit distance, the number of pairs, the operation cost of the replacement operation, and the operation cost of the exchange operation.
  • The first semantic edit distance may be a minimum semantic edit distance between the first character string and the second character string.
  • In Equation 3, S1 and S2 represent the first character string and the second character string, respectively, minCost(S1, S2) represents the first semantic edit distance, d represents the edit distance, p represents the number of pairs, and cost(J) represents the operation cost of the exchange operation.
  • Because the first semantic edit distance may be the minimum semantic edit distance between the first character string and the second character string, Equation 3 differs from Equation 2 only in the meaning represented by minCost(S1, S2).
  • In step S20535, a second semantic edit distance between the first character string and the second character string is determined according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence.
  • the second semantic edit distance may be a maximum semantic edit distance between the first character string and the second character string.
  • Determining the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence includes, but is not limited to, using Equation 4 below, where:
  • $\mathrm{costM} = \mathrm{cost}(C)$, if $n < m$;
  • $\mathrm{costM} = \mathrm{cost}(S)$, if $n > m$.
  • In Equation 4, normFact(S1, S2) represents the second semantic edit distance, n represents the number of words in the first sequence, and m represents the number of words in the second sequence.
  • normFact(S1, S2) is a normalization factor, which is used to map minCost(S1, S2)/normFact(S1, S2) to between 0 and 1, so as to facilitate the intuitive determination of similarity.
  • In step S20536, the similarity between the first character string and the second character string is determined according to the first semantic edit distance and the second semantic edit distance.
  • The similarity between the first string and the second string may be determined by Equation 5.
  • sim(S1, S2) represents the similarity between the first string and the second string.
  • In the method provided by the embodiment of the present disclosure, the first character string and the second character string are segmented into the first sequence and the second sequence, so that the edit distance for converting the first character string into the second character string is determined based on the words in the first sequence and the second sequence rather than on the individual characters in the first character string and the second character string. Because each word in a string may include at least one character, the similarity determined according to the edit distance takes into account the correlation between the characters in the string, making the determined similarity more accurate.
  • FIG. 3 is a block diagram of a similarity determining apparatus, according to an exemplary embodiment.
  • Referring to FIG. 3, the similarity determining apparatus includes a word segmentation module 301, a first determining module 302, and a second determining module 303, wherein:
  • a word segmentation module 301 configured to respectively segment the first character string and the second character string to obtain a first sequence and a second sequence, the first sequence and the second sequence respectively including at least one word;
  • the first determining module 302 is configured to determine an edit distance between the first character string and the second character string according to the predefined edit distance algorithm and the first sequence and the second sequence;
  • the second determining module 303 is configured to determine the similarity between the first character string and the second character string according to the edit distance and the information about each operation performed in transforming the first sequence into the second sequence.
  • In the apparatus provided by the embodiment of the present disclosure, the first character string and the second character string are segmented into the first sequence and the second sequence, so that the edit distance is determined based on the words in the strings rather than on the characters in the strings. Because each word in a string may include at least one character, the similarity determined according to the edit distance takes into account the correlation between the characters in the string, making the determined similarity more accurate.
  • In another embodiment, referring to FIG. 4, the second determining module 303 includes:
  • an obtaining unit 3031, configured to obtain the replacement operation information from the information about the operations performed when the first sequence is transformed into the second sequence;
  • a first determining unit 3032, configured to determine a pairing number according to the replacement operation information, where the pairing number refers to the number of pairs of two words that exist in both the first sequence and the second sequence;
  • a second determining unit 3033, configured to determine the similarity between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence.
  • In another embodiment, referring to FIG. 5, the operations include a replacement operation and an exchange operation, and the second determining unit 3033 includes:
  • a first determining sub-unit 30331 configured to determine, according to an edit distance, a pairing number, an operation cost of the replacement operation, and an operation cost of the exchange operation, a minimum semantic edit distance between the first character string and the second character string;
  • the normalization sub-unit 30332 is configured to normalize the minimum semantic edit distance to obtain a normalized result
  • the second determining subunit 30333 is configured to determine the similarity between the first character string and the second character string according to the normalization result.
  • In another embodiment, referring to FIG. 6, the operations include at least one of a replacement operation, an exchange operation, an insertion operation, and a deletion operation, and the second determining unit 3033 includes:
  • a third determining subunit 30334, configured to determine a first semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the exchange operation;
  • a fourth determining subunit 30335, configured to determine a second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence;
  • a fifth determining subunit 30336, configured to determine the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance.
  • In another embodiment, referring to FIG. 7, the apparatus further includes:
  • a third determining module 304, configured to determine the operation cost of the replacement operation and the operation cost of the exchange operation according to the relationship between the replacement operation and the exchange operation;
  • a fourth determining module 305, configured to determine the operation cost of the insertion operation, the operation cost of the deletion operation, and the operation cost of the replacement operation according to the relationship between the replacement operation and the insertion and deletion operations.
  • In another embodiment, referring to FIG. 8, the apparatus further includes:
  • a fifth determining module 306, configured to determine, according to the relationship between the replacement operation and the exchange operation, that 2 × the operation cost of the replacement operation > the operation cost of the exchange operation;
  • a sixth determining module 307, configured to determine, according to the relationship between the replacement operation and the insertion and deletion operations, that the operation cost of the insertion operation + the operation cost of the deletion operation > the operation cost of the replacement operation.
  • In another embodiment, referring to FIG. 9, the apparatus further includes:
  • a seventh determining module 308, configured to determine, according to the relationship between the insertion operation and the deletion operation, that the operation cost of the insertion operation is equal to the operation cost of the deletion operation.
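  • the first determining module 302 is configured to determine the edit distance between the first character string and the second character string according to the predefined edit distance algorithm, the first sequence, and the second sequence by using Equation 1 below:
  • Equation 1: minCost[i, j] = min(minCost[i-1, j] + cost(S), minCost[i, j-1] + cost(C), minCost[i-1, j-1] + cost(T))
  • In Equation 1, i denotes the i-th word in the first sequence, j denotes the j-th word in the second sequence, cost(S) is the operation cost of the deletion operation, cost(C) is the operation cost of the insertion operation, and cost(T) is the operation cost of the replacement operation.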
  • the first determining subunit 30331 is configured to determine the minimum semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the exchange operation by using Equation 2 below:
  • Equation 2: minCost(S1, S2) = d - p(2cost(T) - cost(J))
  • In Equation 2, S1 and S2 are the first character string and the second character string, respectively, minCost(S1, S2) is the minimum semantic edit distance, d is the edit distance, p is the pairing number, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T) - cost(J) > 0.
  • the third determining subunit 30334 is configured to determine the first semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the exchange operation by using Equation 3 below:
  • Equation 3: minCost(S1, S2) = d - p(2cost(T) - cost(J))
  • In Equation 3, S1 and S2 are the first character string and the second character string, respectively, minCost(S1, S2) is the first semantic edit distance, d is the edit distance, p is the pairing number, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T) - cost(J) > 0.
  • the fourth determining subunit 30335 is configured to determine the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence by using Equation 4 below:
  • Equation 4: normFact(S1, S2) = min(n, m) × cost(T) + (max(n, m) - min(n, m)) × costM
  • costM = cost(C), if n < m;
  • costM = cost(S), if n > m
  • In Equation 4, normFact(S1, S2) is the second semantic edit distance, n is the number of words in the first sequence, m is the number of words in the second sequence, cost(T) is the operation cost of the replacement operation, cost(S) is the operation cost of the deletion operation, and cost(C) is the operation cost of the insertion operation.
  • the fifth determining subunit 30336 is configured to determine the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance by using Equation 5 below:
  • Equation 5: sim(S1, S2) = 1 - minCost(S1, S2)/normFact(S1, S2)
  • In Equation 5, sim(S1, S2) is the similarity between the first character string and the second character string, minCost(S1, S2) is the first semantic edit distance, and normFact(S1, S2) is the second semantic edit distance.
  • the similarity determining apparatus provided in the embodiments corresponding to FIG. 3 to FIG. 9 may be used to perform the similarity determination method provided by the embodiment corresponding to FIG. 1 or FIG. 2. The specific manner in which each module performs its operations has been described in detail in the method embodiments and is not repeated here.
  • FIG. 10 is a block diagram of a terminal 600, which may be used to perform the similarity determination method provided by the embodiment corresponding to FIG. 1 or FIG. 2, according to an exemplary embodiment.
  • terminal 600 can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
  • terminal 600 may include one or more of the following components: processing component 602, memory 604, power component 606, multimedia component 608, audio component 610, I/O (Input/Output) interface 612, Sensor component 614, and communication component 616.
  • Processing component 602 typically controls the overall operations of terminal 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • Processing component 602 can include one or more processors 620 to execute instructions to perform all or part of the steps of the above described methods.
  • processing component 602 can include one or more modules to facilitate interaction between component 602 and other components.
  • processing component 602 can include a multimedia module to facilitate interaction between multimedia component 608 and processing component 602.
  • Memory 604 is configured to store various types of data to support operation at terminal 600. Examples of such data include instructions for any application or method operating on terminal 600, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 604 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as SRAM (Static Random Access Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM (Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), ROM (Read-Only Memory), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • Power component 606 provides power to various components of terminal 600.
  • Power component 606 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for terminal 600.
  • the multimedia component 608 includes a screen that provides an output interface between the terminal 600 and the user.
  • the screen may include an LCD (Liquid Crystal Display) and a TP (Touch Panel). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor can sense not only the boundaries of the touch or sliding action, but also the duration and pressure associated with the touch or slide operation.
  • the multimedia component 608 includes a front camera and/or a rear camera. When the terminal 600 is in an operation mode such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • the audio component 610 is configured to output and/or input an audio signal.
  • the audio component 610 includes a MIC (Microphone) that is configured to receive an external audio signal when the terminal 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode.
  • the received audio signal may be further stored in memory 604 or transmitted via communication component 616.
  • audio component 610 also includes a speaker for outputting an audio signal.
  • the I/O interface 612 provides an interface between the processing component 602 and the peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
  • Sensor assembly 614 includes one or more sensors for providing terminal 600 with various aspects of status assessment.
  • sensor component 614 can detect an open/closed state of terminal 600 and the relative positioning of components, for example the display and the keypad of the terminal 600.
  • the sensor component 614 can also detect a change in the position of the terminal 600 or of a component of the terminal 600, the presence or absence of user contact with the terminal 600, the orientation or acceleration/deceleration of the terminal 600, and a change in the temperature of the terminal 600.
  • Sensor assembly 614 can include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • Sensor assembly 614 may also include a light sensor, such as a CMOS (Complementary Metal Oxide Semiconductor) or CCD (Charge-coupled Device) image sensor for use in imaging applications.
  • the sensor component 614 can also include an acceleration sensor, a gyro sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 616 is configured to facilitate wired or wireless communication between terminal 600 and other devices.
  • the terminal 600 can access a wireless network based on a communication standard such as WiFi, 2G or 3G, or a combination thereof.
  • communication component 616 receives broadcast signals or broadcast associated information from an external broadcast management system via a broadcast channel.
  • the communication component 616 also includes an NFC (Near Field Communication) module to facilitate short range communication.
  • the NFC module can be implemented based on RFID (Radio Frequency Identification) technology, IrDA (Infra-red Data Association) technology, UWB (Ultra Wideband) technology, BT (Bluetooth) technology, and other technologies.
  • the terminal 600 may be implemented by one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable Logic Devices), FPGAs (Field Programmable Gate Arrays), controllers, microcontrollers, microprocessors, or other electronic components, for performing the similarity determination method provided by the embodiment corresponding to FIG. 1 or FIG. 2.
  • A non-transitory computer readable storage medium comprising instructions is also provided, such as a memory 604 comprising instructions executable by processor 620 of terminal 600 to perform the similarity determination method described above.
  • the non-transitory computer readable storage medium may be a ROM, a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • With the non-transitory computer readable storage medium, the first character string and the second character string are segmented into a first sequence and a second sequence, respectively, so that the edit distance for converting the first character string into the second character string is determined based on the words in the first sequence and the second sequence rather than on the individual characters in the first character string and the second character string. Because each word in a string may include at least one character, the similarity determined according to the edit distance takes into account the correlation between the characters in the string, making the determined similarity more accurate.
  • FIG. 11 is a block diagram of a server, according to an exemplary embodiment, which may perform the similarity determination method provided by the embodiment corresponding to FIG. 1 or FIG. 2.
  • server 700 includes a processing component 722, which further includes one or more processors, and memory resources, represented by memory 732, for storing instructions executable by the processing component 722, such as an application.
  • An application stored in memory 732 can include one or more modules each corresponding to a set of instructions.
  • the processing component 722 is configured to execute instructions to perform the similarity determination method provided by the embodiment corresponding to FIG. 1 or FIG. 2 above.
  • Server 700 may also include a power component 726 configured to perform power management of server 700, a wired or wireless network interface 750 configured to connect server 700 to the network, and an input/output (I/O) interface 758.
  • Server 700 can operate based on an operating system stored in memory 732, for example Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

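A similarity determination method, apparatus, and terminal, belonging to the field of natural language processing. The method includes: performing word segmentation on a first character string and a second character string respectively to obtain a first sequence and a second sequence, each including at least one word (S101); determining an edit distance between the first character string and the second character string according to a predefined edit distance algorithm, the first sequence, and the second sequence (S102); and determining the similarity between the first character string and the second character string according to the edit distance and information about the operations performed in transforming the first sequence into the second sequence (S103). Because the first character string and the second character string are segmented into the first sequence and the second sequence, the edit distance is determined based on the words in the strings rather than on the characters in the strings. Since each word in a string may include at least one character, the similarity determined according to the edit distance takes into account the correlation between the characters in the string, making the determined similarity more accurate.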

Description

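Similarity determination method, apparatus, and terminal
This application is based on and claims priority to Chinese Patent Application No. 201510882468.2, filed on December 3, 2015, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of natural language processing, and in particular to a similarity determination method, apparatus, and terminal.
Background
In the field of natural language processing, determining the similarity between character strings is a fundamental problem with many applications, such as text clustering and information retrieval. How to determine the similarity between character strings has therefore received wide attention from researchers.
In the related art, the similarity between two character strings can be determined by calculating the edit distance between them. Specifically, each of the two strings is split into individual characters; one string is then transformed into the other by deleting, inserting, or replacing characters; next, the minimum number of operations needed to transform one string into the other is calculated and taken as the edit distance between the two strings; finally, the similarity between the two strings is calculated from the edit distance.
Summary
The present disclosure provides a similarity determination method, apparatus, and terminal.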
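According to a first aspect of the embodiments of the present disclosure, a similarity determination method is provided, the method including:
performing word segmentation on a first character string and a second character string respectively to obtain a first sequence and a second sequence, the first sequence and the second sequence each including at least one word;
determining an edit distance between the first character string and the second character string according to a predefined edit distance algorithm, the first sequence, and the second sequence;
determining the similarity between the first character string and the second character string according to the edit distance and information about the operations performed in transforming the first sequence into the second sequence.
According to a second aspect of the present disclosure, a similarity determination apparatus is provided, the apparatus including:
a word segmentation module, configured to perform word segmentation on a first character string and a second character string respectively to obtain a first sequence and a second sequence, the first sequence and the second sequence each including at least one word;
a first determining module, configured to determine an edit distance between the first character string and the second character string according to a predefined edit distance algorithm, the first sequence, and the second sequence;
a second determining module, configured to determine the similarity between the first character string and the second character string according to the edit distance and information about the operations performed in transforming the first sequence into the second sequence.
According to a third aspect of the present disclosure, a terminal is provided, the terminal including:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
perform word segmentation on a first character string and a second character string respectively to obtain a first sequence and a second sequence, the first sequence and the second sequence each including at least one word;
determine an edit distance between the first character string and the second character string according to a predefined edit distance algorithm, the first sequence, and the second sequence;
determine the similarity between the first character string and the second character string according to the edit distance and information about the operations performed in transforming the first sequence into the second sequence.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects:
The first character string and the second character string are segmented into the first sequence and the second sequence, so that the edit distance for transforming the first character string into the second character string is determined based on the words in the first sequence and the second sequence rather than on the individual characters in the first character string and the second character string. Because each word in a string may include at least one character, the similarity determined according to the edit distance takes into account the correlation between the characters in the string, making the determined similarity more accurate.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.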
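Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the specification, serve to explain the principles of the present invention.
FIG. 1 is a flowchart of a similarity determination method according to an exemplary embodiment.
FIG. 2 is a flowchart of a similarity determination method according to an exemplary embodiment.
FIG. 3 is a block diagram of a similarity determination apparatus according to an exemplary embodiment.
FIG. 4 is a block diagram of a second determining module according to an exemplary embodiment.
FIG. 5 is a block diagram of a second determining unit according to an exemplary embodiment.
FIG. 6 is a block diagram of a second determining unit according to an exemplary embodiment.
FIG. 7 is a block diagram of a similarity determination apparatus according to an exemplary embodiment.
FIG. 8 is a block diagram of a similarity determination apparatus according to an exemplary embodiment.
FIG. 9 is a block diagram of a similarity determination apparatus according to an exemplary embodiment.
FIG. 10 is a block diagram of a terminal according to an exemplary embodiment.
FIG. 11 is a block diagram of a server according to an exemplary embodiment.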
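Detailed Description
Exemplary embodiments are described in detail here, with examples shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present invention as detailed in the appended claims.
In the field of natural language processing, determining the similarity between character strings is a fundamental problem with many applications, such as text clustering and information retrieval. To make the determined similarity between two character strings more accurate, the embodiments of the present disclosure provide a similarity determination method. FIG. 1 is a flowchart of a similarity determination method according to an exemplary embodiment; the method may be used in a terminal. As shown in FIG. 1, the similarity determination method provided by the embodiments of the present disclosure includes the following steps.
In step S101, word segmentation is performed on the first character string and the second character string respectively to obtain a first sequence and a second sequence, where the first sequence and the second sequence each include at least one word.
In step S102, the edit distance between the first character string and the second character string is determined according to the predefined edit distance algorithm, the first sequence, and the second sequence.
In step S103, the similarity between the first character string and the second character string is determined according to the edit distance and information about the operations performed in transforming the first sequence into the second sequence.
In the method provided by the embodiments of the present disclosure, the first character string and the second character string are segmented into the first sequence and the second sequence, so that the edit distance for transforming the first character string into the second character string is determined based on the words in the first sequence and the second sequence rather than on the individual characters in the first character string and the second character string. Because each word in a string may include at least one character, the similarity determined according to the edit distance takes into account the correlation between the characters in the string, making the determined similarity more accurate.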
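In another embodiment, determining the similarity between the first character string and the second character string according to the edit distance and information about the operations performed in transforming the first sequence into the second sequence includes:
obtaining the replacement operation information from the information about the operations performed when the first sequence is transformed into the second sequence;
determining a pairing number according to the replacement operation information, where the pairing number refers to the number of pairs of two words that exist in both the first sequence and the second sequence;
determining the similarity between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence.
In another embodiment, the operations include a replacement operation and an exchange operation, and determining the similarity between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence includes:
determining the minimum semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the exchange operation;
normalizing the minimum semantic edit distance to obtain a normalization result;
determining the similarity between the first character string and the second character string according to the normalization result.
In another embodiment, the operations include at least one of a replacement operation, an exchange operation, an insertion operation, and a deletion operation, and determining the similarity between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence includes:
determining a first semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the exchange operation;
determining a second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence;
determining the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance.
In another embodiment, the method further includes:
determining the operation cost of the replacement operation and the operation cost of the exchange operation according to the relationship between the replacement operation and the exchange operation;
determining the operation cost of the insertion operation, the operation cost of the deletion operation, and the operation cost of the replacement operation according to the relationship between the replacement operation and the insertion and deletion operations.
In another embodiment, the method further includes:
determining, according to the relationship between the replacement operation and the exchange operation, that 2 × the operation cost of the replacement operation > the operation cost of the exchange operation;
determining, according to the relationship between the replacement operation and the insertion and deletion operations, that the operation cost of the insertion operation + the operation cost of the deletion operation > the operation cost of the replacement operation.
In another embodiment, the method further includes:
determining, according to the relationship between the insertion operation and the deletion operation, that the operation cost of the insertion operation is equal to the operation cost of the deletion operation.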
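In another embodiment, determining the edit distance between the first character string and the second character string according to the predefined edit distance algorithm, the first sequence, and the second sequence includes:
determining the edit distance between the first character string and the second character string according to the predefined edit distance algorithm, the first sequence, and the second sequence by using Equation 1 below:
Equation 1:
minCost[i,j]=min(
minCost[i-1,j]+cost(S),
minCost[i,j-1]+cost(C),
minCost[i-1,j-1]+cost(T))
In Equation 1, i denotes the i-th word in the first sequence; j denotes the j-th word in the second sequence; cost(S) is the operation cost of the deletion operation, cost(C) is the operation cost of the insertion operation, and cost(T) is the operation cost of the replacement operation.
In another embodiment, determining the minimum semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the exchange operation includes:
determining the minimum semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the exchange operation by using Equation 2 below:
Equation 2: minCost(S1,S2)=d-p(2cost(T)-cost(J));
In Equation 2, S1 and S2 are the first character string and the second character string, respectively, minCost(S1,S2) is the minimum semantic edit distance, d is the edit distance, p is the pairing number, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T)-cost(J)>0.
In another embodiment, determining the first semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the exchange operation includes:
determining the first semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the exchange operation by using Equation 3 below:
Equation 3:
minCost(S1,S2)=d-p(2cost(T)-cost(J));
In Equation 3, S1 and S2 are the first character string and the second character string, respectively, minCost(S1,S2) is the first semantic edit distance, d is the edit distance, p is the pairing number, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T)-cost(J)>0.
In another embodiment, determining the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence includes:
determining the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence by using Equation 4 below:
Equation 4:
normFact(S1,S2)=min(n,m)cost(T)+(max(n,m)-min(n,m))×costM
costM=cost(C),if n<m;
costM=cost(S),if n>m
In Equation 4, normFact(S1,S2) is the second semantic edit distance, n is the number of words in the first sequence, m is the number of words in the second sequence, cost(T) is the operation cost of the replacement operation, cost(S) is the operation cost of the deletion operation, and cost(C) is the operation cost of the insertion operation.
In another embodiment, determining the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance includes:
determining the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance by using Equation 5 below:
Equation 5:
sim(S1,S2)=1-minCost(S1,S2)/normFact(S1,S2);
In Equation 5, sim(S1,S2) is the similarity between the first character string and the second character string, minCost(S1,S2) is the first semantic edit distance, and normFact(S1,S2) is the second semantic edit distance.
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present invention, and are not described one by one here.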
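With reference to the embodiment corresponding to FIG. 1, FIG. 2 is a flowchart of a similarity determination method according to an exemplary embodiment; the method may be applied in a terminal. As shown in FIG. 2, the similarity determination method provided by the embodiments of the present disclosure includes the following steps.
In step S201, word segmentation is performed on the first character string and the second character string respectively to obtain a first sequence and a second sequence.
The characters in a string are not completely independent of one another and may be correlated; some adjacent characters may form an indivisible whole. For example, in "今天我去爬香山" ("Today I am going to climb Fragrant Hills"), "今天" ("today") and "香山" ("Fragrant Hills") are each an indivisible whole. Therefore, when determining the similarity between two strings, the embodiments of the present disclosure do not split the two strings into individual characters; instead, word segmentation is performed on the two strings so that each string is divided into words, and each segmented string includes at least one word. For ease of description, the two strings whose similarity is to be determined are defined as the first character string and the second character string; segmenting the first character string into words gives the first sequence, and segmenting the second character string into words gives the second sequence. The first sequence and the second sequence each include at least one word.
For example, when the first character string and the second character string are S1 and S2, the first sequence and the second sequence are (S11,S12,S13,…,S1n) and (S21,S22,S23,…,S2m), respectively, where the number of words in S1 is n and the number of words in S2 is m.
The embodiments of the present disclosure do not specifically limit the language of the first character string and the second character string. For example, both may be Chinese, or both may be English. The first character string and the second character string may each be a sentence. For example, the first character string is "今天我打算去香山" ("Today I plan to go to Fragrant Hills") and the second character string is "我打算今天去香山" ("I plan to go to Fragrant Hills today").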
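The source does not say which word segmentation algorithm is used. Purely as an illustration, the sketch below uses forward maximum matching against a small hand-written vocabulary to turn the two example sentences into word sequences. The function name `segment`, the vocabulary, and the `max_len` parameter are all assumptions made for this example; a real system would more likely rely on a dedicated segmenter.

```python
from typing import List, Set


def segment(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: at each position take the longest vocabulary word,
    falling back to a single character when nothing in the vocabulary matches."""
    words: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy vocabulary covering the example sentences.
VOCAB = {"今天", "我", "打算", "去", "香山"}

first_sequence = segment("今天我打算去香山", VOCAB)   # first character string S1
second_sequence = segment("我打算今天去香山", VOCAB)  # second character string S2
```

Here both sequences contain five words (n = m = 5); the words "今天" and "香山" stay whole instead of being split into characters.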
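In step S202, the operation cost of the replacement operation and the operation cost of the exchange operation are determined according to the relationship between the replacement operation and the exchange operation, and the operation cost of the insertion operation, the operation cost of the deletion operation, and the operation cost of the replacement operation are determined according to the relationship between the replacement operation and the insertion and deletion operations.
Conventional methods for determining the similarity between strings usually involve three edit operations when transforming one string into another, namely insertion, deletion, and replacement, and the three operations have the same operation cost. In a string, however, some components can appear at different positions without changing the overall meaning of the string. For example, in "今天我打算去香山", "我打算今天去香山", and "我今天打算去香山", the words occupy different positions, but the three strings express the same meaning. Therefore, in addition to the conventional insertion, deletion, and replacement operations, the embodiments of the present disclosure define a new exchange operation and, according to the relationships among the operations, define different operation costs for different operations.
The embodiments of the present disclosure do not specifically limit the values of the operation costs assigned to the operations. In a specific implementation, however, because an exchange operation can be decomposed into two replacement operations, the replacement cost and the exchange cost can be determined according to the relationship between the replacement operation and the exchange operation. For example, the relationship between the replacement cost and the exchange cost defined in the embodiments of the present disclosure satisfies: 2 × the operation cost of the replacement operation > the operation cost of the exchange operation, that is:
2cost(T)-cost(J)>0;
where cost(T) is the operation cost of the replacement operation and cost(J) is the operation cost of the exchange operation.
Furthermore, because a replacement operation can be decomposed into one deletion operation and one insertion operation, the insertion cost and the deletion cost can be determined according to the relationship between the replacement operation and the insertion and deletion operations. For example, the relationship among the replacement cost, the insertion cost, and the deletion cost defined in the embodiments of the present disclosure satisfies: the operation cost of the insertion operation + the operation cost of the deletion operation > the operation cost of the replacement operation. Further, the operation cost of the replacement operation may be determined to be greater than the larger of the operation cost of the insertion operation and the operation cost of the deletion operation. For example, this relationship can be expressed as:
max(cost(C),cost(S))<cost(T)<cost(C)+cost(S);
where cost(S) is the operation cost of the deletion operation and cost(C) is the operation cost of the insertion operation.
In addition, if the similarity is symmetric, that is, inserting a word into the first character string is equivalent to deleting a word from the second character string, the insertion cost can be determined to be equal to the deletion cost according to the relationship between the insertion operation and the deletion operation. Where the similarity is asymmetric, the insertion cost and the deletion cost may be defined as either different or the same; the embodiments of the present disclosure do not specifically limit this.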
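The cost relations above can be checked mechanically before any distance is computed. The following is a minimal sketch, assuming the costs are plain floating-point numbers; the function name `validate_costs`, its parameters, and the example values are hypothetical and do not come from the source.

```python
from typing import List


def validate_costs(cost_t: float, cost_j: float, cost_c: float, cost_s: float,
                   symmetric: bool = True) -> List[str]:
    """Return a list of violated cost relations (an empty list means all relations hold).

    cost_t: replacement, cost_j: exchange, cost_c: insertion, cost_s: deletion.
    """
    problems: List[str] = []
    # An exchange must be cheaper than the two replacements it could be decomposed into.
    if not 2 * cost_t - cost_j > 0:
        problems.append("2*cost(T) - cost(J) > 0 is violated")
    # max(cost(C), cost(S)) < cost(T) < cost(C) + cost(S)
    if not max(cost_c, cost_s) < cost_t < cost_c + cost_s:
        problems.append("max(cost(C), cost(S)) < cost(T) < cost(C) + cost(S) is violated")
    # For a symmetric similarity, insertion and deletion cost the same.
    if symmetric and cost_c != cost_s:
        problems.append("cost(C) == cost(S) is violated for a symmetric similarity")
    return problems
```

For instance, the hypothetical assignment cost(T) = 1.5, cost(J) = 2, cost(C) = cost(S) = 1 satisfies all three relations, whereas giving all four operations the same cost 1 with cost(J) = 2 does not.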
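In step S203, the predefined edit distance algorithm is generated according to the operation cost of the replacement operation, the operation cost of the deletion operation, and the operation cost of the insertion operation.
For example, the predefined edit distance algorithm may be as given in Equation 1 below.
Equation 1:
minCost[i,j]=min(
minCost[i-1,j]+cost(S),
minCost[i,j-1]+cost(C),
minCost[i-1,j-1]+cost(T))
In Equation 1, i denotes the i-th word in the first sequence; j denotes the j-th word in the second sequence; cost(S) is the operation cost of the deletion operation, cost(C) is the operation cost of the insertion operation, and cost(T) is the operation cost of the replacement operation.
As Equation 1 shows, the predefined edit distance algorithm in the embodiments of the present disclosure is a dynamic programming algorithm, derived from the predefined operation costs of the deletion, insertion, and replacement operations.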
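The recurrence above can be sketched as a word-level dynamic program. This is a sketch under stated assumptions rather than the patent's implementation: the function name `edit_distance` and the default cost values are hypothetical, and the sketch assumes, as in the standard edit distance, that aligning two identical words costs nothing (the recurrence as printed does not spell this case out).

```python
from typing import List, Sequence


def edit_distance(seq1: Sequence[str], seq2: Sequence[str],
                  cost_s: float = 1.0,    # deletion cost, cost(S)
                  cost_c: float = 1.0,    # insertion cost, cost(C)
                  cost_t: float = 1.5) -> float:  # replacement cost, cost(T)
    """Weighted edit distance between two word sequences (words, not characters)."""
    n, m = len(seq1), len(seq2)
    # min_cost[i][j]: cheapest way to turn the first i words of seq1 into the first j of seq2.
    min_cost: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        min_cost[i][0] = min_cost[i - 1][0] + cost_s
    for j in range(1, m + 1):
        min_cost[0][j] = min_cost[0][j - 1] + cost_c
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumption: identical words are aligned at zero cost.
            replace = 0.0 if seq1[i - 1] == seq2[j - 1] else cost_t
            min_cost[i][j] = min(
                min_cost[i - 1][j] + cost_s,        # delete seq1[i-1]
                min_cost[i][j - 1] + cost_c,        # insert seq2[j-1]
                min_cost[i - 1][j - 1] + replace,   # replace or keep
            )
    return min_cost[n][m]
```

Because the table is indexed by words, a multi-character word such as "香山" is inserted, deleted, or replaced as a single unit.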
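It should be noted that steps S202 and S203 are performed before similarity is determined; they do not need to be performed every time the similarity between two strings is determined. It is sufficient that the operation costs and the predefined edit distance algorithm have been determined before the similarity is determined.
In step S204, the edit distance between the first character string and the second character string is determined according to the predefined edit distance algorithm, the first sequence, and the second sequence.
The edit distance between two strings refers to the minimum number of edit operations required to transform one string into the other. Each kind of edit operation corresponds to an operation cost, so the total operation cost of the transformation can be taken as the edit distance. In the embodiments of the present disclosure, the edit operations available when transforming between strings include the replacement, insertion, deletion, and exchange operations.
With the predefined edit distance algorithm of Equation 1, the edit distance between the first character string and the second character string can be calculated recursively from the first sequence and the second sequence by means of Equation 1. The principle of calculating the edit distance by Equation 1 is the same as that of existing dynamic programming approaches to edit distance and is not elaborated here.
In step S205, the similarity between the first character string and the second character string is determined according to the edit distance and information about the operations performed in transforming the first sequence into the second sequence.
In the embodiments of the present disclosure, the information about the operations performed in transforming the first sequence into the second sequence includes the types of the operations, the number of operations of each type, and the operation cost of each type of operation.
Because different operations are predefined to have different operation costs, and given the definition of the edit distance between two strings, the operation costs of the operations needed to transform the first character string into the second character string directly affect the edit distance. Therefore, determining the similarity between the first character string and the second character string relies on the edit distance and on the information about the operations performed to obtain the edit distance. This operation information includes the operation costs, which were set in advance in step S202.
For example, if the operations performed in transforming the first character string into the second character string include two insertion operations, one deletion operation, one exchange operation, and one replacement operation, the edit distance d between the first character string and the second character string is d=2cost(C)+cost(S)+cost(T)+cost(J). In this case, the similarity between the first character string and the second character string is determined according to the edit distance and the operation costs of the insertion, deletion, exchange, and replacement operations.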
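The arithmetic of this example can be reproduced by summing the cost of each recorded operation. The sketch below is illustrative only: the single-letter operation codes follow the cost(C), cost(S), cost(T), cost(J) notation of the text, while the function name `total_cost` and the numeric cost values are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def total_cost(operations: Iterable[str], costs: Dict[str, float]) -> float:
    """Sum the operation costs over a recorded list of edit operations.

    Codes: "C" = insertion, "S" = deletion, "T" = replacement, "J" = exchange.
    """
    counts = Counter(operations)
    return sum(costs[op] * times for op, times in counts.items())


# Hypothetical cost assignment satisfying 2*cost(T) > cost(J) and cost(C) + cost(S) > cost(T).
COSTS = {"C": 1.0, "S": 1.0, "T": 1.5, "J": 2.0}

# Two insertions, one deletion, one exchange and one replacement, as in the example above.
d = total_cost(["C", "C", "S", "J", "T"], COSTS)
```

With these assumed values, d = 2 × 1 + 1 + 1.5 + 2 = 6.5.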
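For example, determining the similarity between the first character string and the second character string according to the edit distance and information about the operations performed in transforming the first sequence into the second sequence includes, but is not limited to, the following steps S2051 to S2053:
In step S2051, the replacement operation information is obtained from the information about the operations performed in transforming the first sequence into the second sequence when the edit distance was obtained.
A replacement operation replaces a word in the first character string with another word. While determining the edit distance, the embodiments of the present disclosure collect information about each replacement operation performed during the transformation and record it in a designated set. The information about a replacement operation includes the replaced word and its position in the sequence, so the data recorded in the designated set include each replaced word and its position in the first sequence. For example, if the first character string is "我打算今天去香山", the first sequence is "我-打算-今天-去-香山", and the replaced words are "香山" and "打算", the replacement operation information recorded in the designated set includes "打算-2,香山-5". The replacement operation information can therefore be obtained from the designated set; specifically, the replaced word of each replacement operation and the position of each replaced word in the first sequence can be obtained.
In addition, the embodiments of the present disclosure define a new exchange operation based on the relationship between the replacement operation and the exchange operation, with 2cost(T)-cost(J)>0 predefined, so the cost of two replacement operations is greater than the cost of one exchange operation. Therefore, if part of the transformation from the first character string to the second character string can be achieved by one exchange operation, it is not achieved by two replacement operations. Accordingly, besides recording the replaced words in the first sequence and the position of each replaced word in the first sequence, it can further be determined whether any two words in the designated set exist in the second sequence; if they do, the two words and the position of each in the second sequence are also recorded in the designated set.
For example, if the first character string is "我打算今天去香山", the first sequence is "我-打算-今天-去-香山", the replaced words are "香山" and "打算", the second character string is "今天我打算去香山", and the second sequence is "今天-我-打算-去-香山", then because the replaced words "香山" and "打算" exist in both the first sequence and the second sequence, the data recorded in the designated set may be "打算-S12,香山-S15;打算-S23,香山-S25".
In the embodiments of the present disclosure, "香山" and "打算" are defined as one matched word between the first character string and the second character string. As the example above shows, a matched word refers to any two words that exist in both the first sequence and the second sequence.
In step S2052, the pairing number is determined according to the replacement operation information.
The pairing number refers to the number of matched words in the first sequence and the second sequence, that is, the number of pairs of two words that exist in both the first sequence and the second sequence. Given the above explanation of the data recorded in the designated set, the pairing number can be determined from those data.
For example, if the data recorded in the designated set are "打算-S12,香山-S15;打算-S23,香山-S25;我-S11,去-S14;我-S21,去-S24", the pairing number can be determined to be 2.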
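The text leaves some room for interpretation in how the pairing number is derived from the recorded data. One possible reading, sketched below, counts the replaced words of the first sequence that also occur in the second sequence and treats every two of them as one pair (one potential exchange). The function name `pairing_number` and this counting rule are assumptions, chosen because they reproduce the example above.

```python
from typing import Iterable, Sequence


def pairing_number(replaced_words: Iterable[str], second_sequence: Sequence[str]) -> int:
    """Count matched pairs: every two replaced words that also appear in the second sequence.

    replaced_words  -- words of the first sequence recorded as replaced
    second_sequence -- the words of the second sequence
    """
    present = set(second_sequence)
    matched = [word for word in replaced_words if word in present]
    # Assumption: two such words form one pair, i.e. one exchange instead of two replacements.
    return len(matched) // 2


second = ["今天", "我", "打算", "去", "香山"]
p = pairing_number(["打算", "香山", "我", "去"], second)
```

For the recorded data in the example (four replaced words, all present in the second sequence), this reading yields p = 2.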
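In step S2053, the similarity between the first character string and the second character string is determined according to the edit distance, the pairing number, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence.
For example, when the similarity is determined in this way, the operations may include a replacement operation and an exchange operation.
Depending on the types of the operations, step S2053 may be implemented by the following steps S20531 to S20533:
In step S20531, the minimum semantic edit distance between the first character string and the second character string is determined according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the exchange operation.
For example, the minimum semantic edit distance between the first character string and the second character string may be determined according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the exchange operation by using Equation 2 below:
Equation 2: minCost(S1,S2)=d-p(2cost(T)-cost(J));
In Equation 2, S1 and S2 are the first character string and the second character string, respectively, minCost(S1,S2) is the minimum semantic edit distance, d is the edit distance, p is the pairing number, cost(J) is the operation cost of the exchange operation, cost(T) is the operation cost of the replacement operation, and 2cost(T)-cost(J)>0.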
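The correction applied to the edit distance can be written as a one-line function: each matched pair stands for two replacements that could have been a single exchange, so the saving 2cost(T) - cost(J) is subtracted once per pair. The function name `min_semantic_cost`, the explicit check, and the example values are assumptions for illustration.

```python
def min_semantic_cost(d: float, p: int, cost_t: float, cost_j: float) -> float:
    """minCost(S1, S2) = d - p * (2*cost(T) - cost(J)).

    d      -- edit distance computed over the word sequences
    p      -- pairing number
    cost_t -- operation cost of a replacement
    cost_j -- operation cost of an exchange
    """
    saving = 2 * cost_t - cost_j
    if saving <= 0:
        # The text requires 2*cost(T) - cost(J) > 0.
        raise ValueError("2*cost(T) - cost(J) must be positive")
    return d - p * saving
```

For example, with assumed values d = 4.5, p = 1, cost(T) = 1.5 and cost(J) = 2, the minimum semantic edit distance is 4.5 - 1 × (3 - 2) = 3.5.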
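In step S20532, the minimum semantic edit distance is normalized to obtain a normalization result.
For example, the minimum semantic edit distance may be normalized by the maximum semantic edit distance between the first character string and the second character string. The maximum semantic edit distance may be expressed as Equation 4 below:
Equation 4:
normFact(S1,S2)=min(n,m)cost(T)+(max(n,m)-min(n,m))×costM
costM=cost(C),if n<m;
costM=cost(S),if n>m
In Equation 4, normFact(S1,S2) denotes the maximum semantic edit distance, n denotes the number of words in the first sequence, and m denotes the number of words in the second sequence.
Normalizing the minimum semantic edit distance minCost(S1,S2) gives the normalization result minCost(S1,S2)/normFact(S1,S2). Normalization maps minCost(S1,S2)/normFact(S1,S2) into the range 0 to 1, which makes the similarity easy to read off directly.
In step S20533, the similarity between the first character string and the second character string is determined according to the normalization result.
For example, the similarity between the first character string and the second character string may be determined according to the normalization result by using Equation 5 below:
Equation 5:
sim(S1,S2)=1-minCost(S1,S2)/normFact(S1,S2);
In Equation 5, sim(S1,S2) is the similarity between the first character string and the second character string, minCost(S1,S2) is the minimum semantic edit distance, normFact(S1,S2) is the maximum semantic edit distance, and minCost(S1,S2)/normFact(S1,S2) is the normalization result.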
更具体地,本公开实施例在根据编辑距离、配对数及各操作的操作代价、第一序列中的词个数、第二序列中的词个数,确定第一字符串与第二字符串之间的相似性时,各操作还可以为替换操作、交换操作、插入操作、删除操作中的至少其中之一。在此基础上,可以通过如下步骤S20534至步骤S20536来实现:
在步骤S20534中,根据编辑距离、配对数及替换操作的操作代价和交换操作的操作代价,确定第一字符串与第二字符串之间的第一语义编辑距离。
其中,第一语义编辑距离可以是第一字符串与第二字符串之间的最小语义编辑距离。
示例地,在根据编辑距离、配对数及替换操作的操作代价和交换操作的操作代价,确 定第一字符串与第二字符串之间的第一语义编辑距离时,包括但不限于通过如下公式三来实现:
公式三:
minCost(S1,S2)=d-p(2cost(T)-cost(J));
公式三中,S1和S2分别表示第一字符串和第二字符串,minCost(S1,S2)表示第一语义编辑距离,d表示编辑距离,p表示配对数,cost(J)表示交换操作代价。
由公式三和公式二可得,第一语义编辑距离可以是第一字符串与第二字符串之间的最小语义编辑距离,公式二与公式三仅minCost(S1,S2)表示的意义不同。
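In step S20535, a second semantic edit distance between the first character string and the second character string is determined according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence.
The second semantic edit distance may be the maximum semantic edit distance between the first character string and the second character string.
Illustratively, determining the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence may be implemented through, but is not limited to, Formula 4 below.
Formula 4:
normFact(S1,S2)=min(n,m)cost(T)+(max(n,m)-min(n,m))×costM
costM=cost(C),if n<m;
costM=cost(S),if n>m
In Formula 4, normFact(S1,S2) denotes the second semantic edit distance, n denotes the number of words in the first sequence, and m denotes the number of words in the second sequence.
Here, normFact(S1,S2) is a normalization factor whose role is to map minCost(S1,S2)/normFact(S1,S2) to the range 0 to 1, which makes the similarity easier to read off directly.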
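In step S20536, the similarity between the first character string and the second character string is determined according to the first semantic edit distance and the second semantic edit distance.
Illustratively, the similarity between the first character string and the second character string may be determined through Formula 5 below:
Formula 5:
sim(S1,S2)=1-minCost(S1,S2)/normFact(S1,S2).
In Formula 5, sim(S1,S2) denotes the similarity between the first character string and the second character string.
For example, when minCost(S1,S2) is 1.5 and normFact(S1,S2) is 2.5, the similarity between S1 and S2 is 1-1.5/2.5=0.4.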
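The worked example above can be reproduced with a small sketch of Formula 5. The function name `similarity` and the guard against a non-positive normalization factor are assumptions added for illustration; the disclosure itself only gives the formula.

```python
def similarity(min_cost: float, norm_fact: float) -> float:
    """Formula 5: sim(S1,S2) = 1 - minCost(S1,S2) / normFact(S1,S2)."""
    # Assumption: both sequences contain at least one word and cost(T) > 0,
    # so the normalization factor is strictly positive.
    if norm_fact <= 0:
        raise ValueError("normFact(S1,S2) must be positive")
    return 1 - min_cost / norm_fact


# Values taken from the example in the text: minCost = 1.5, normFact = 2.5.
example = similarity(1.5, 2.5)
```

Here `example` evaluates to 0.4 up to floating-point rounding, matching 1-1.5/2.5=0.4. A value closer to 1 indicates that fewer or cheaper edits are needed relative to the worst case.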
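The method provided by the embodiments of the present disclosure segments the first character string and the second character string into the first sequence and the second sequence respectively, so that the edit distance for transforming the first character string into the second character string is determined on the basis of the words in the first sequence and the second sequence rather than the individual characters of the two character strings. Because each word in a character string may include at least one character, the similarity determined from the edit distance takes into account the correlation between the characters in the character string, which makes the determined similarity more accurate.
FIG. 3 is a block diagram of a similarity determination device according to an exemplary embodiment. Referring to FIG. 3, the similarity determination device includes a word segmentation module 301, a first determining module 302, and a second determining module 303, where:
the word segmentation module 301 is configured to perform word segmentation on the first character string and the second character string respectively to obtain the first sequence and the second sequence, the first sequence and the second sequence each including at least one word;
the first determining module 302 is configured to determine the edit distance between the first character string and the second character string according to a predefined edit distance algorithm, the first sequence, and the second sequence;
the second determining module 303 is configured to determine the similarity between the first character string and the second character string according to the edit distance and information on each operation performed in transforming the first sequence into the second sequence.
The device provided by the embodiments of the present disclosure segments the first character string and the second character string into the first sequence and the second sequence, so that the edit distance is determined on the basis of the words in the character strings rather than the characters in the character strings. Because each word in a character string may include at least one character, the similarity determined from the edit distance takes into account the correlation between the characters in the character string, which makes the determined similarity more accurate.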
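The three-module structure of FIG. 3 can be pictured as a small pipeline. The sketch below is purely illustrative: the class name `SimilarityDevice`, the field names, and the callables plugged in under it are hypothetical stand-ins, not the segmentation, edit distance, or similarity logic of the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SimilarityDevice:
    # Stand-in for word segmentation module 301.
    segment: Callable[[str], List[str]]
    # Stand-in for first determining module 302 (edit distance over word sequences).
    edit_distance: Callable[[Sequence[str], Sequence[str]], float]
    # Stand-in for second determining module 303 (edit distance to similarity).
    to_similarity: Callable[[float, Sequence[str], Sequence[str]], float]

    def run(self, s1: str, s2: str) -> float:
        seq1, seq2 = self.segment(s1), self.segment(s2)
        d = self.edit_distance(seq1, seq2)
        return self.to_similarity(d, seq1, seq2)


# Toy stand-ins, assumed only for demonstration: whitespace segmentation,
# a length-difference "distance", and normalization by the longer sequence.
device = SimilarityDevice(
    segment=str.split,
    edit_distance=lambda a, b: float(abs(len(a) - len(b))),
    to_similarity=lambda d, a, b: 1 - d / max(len(a), len(b)),
)
```

Keeping the three stages as separate callables mirrors the module boundaries in FIG. 3 and lets each stage be replaced independently, for example by a real word segmenter.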
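In another embodiment, referring to FIG. 4, the second determining module 303 includes:
an obtaining unit 3031, configured to obtain the replacement operation information from the information on the operations performed in transforming the first sequence into the second sequence;
a first determining unit 3032, configured to determine the pairing number according to the replacement operation information, where the pairing number is the number of pairs of two words that are present in both the first sequence and the second sequence;
a second determining unit 3033, configured to determine the similarity between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence.
In another embodiment, referring to FIG. 5, the operations include a replacement operation and a swap operation, and the second determining unit 3033 includes:
a first determining subunit 30331, configured to determine the minimum semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the swap operation;
a normalization subunit 30332, configured to normalize the minimum semantic edit distance to obtain a normalization result;
a second determining subunit 30333, configured to determine the similarity between the first character string and the second character string according to the normalization result.
In another embodiment, referring to FIG. 6, the operations include at least one of a replacement operation, a swap operation, an insertion operation, and a deletion operation, and the second determining unit 3033 includes:
a third determining subunit 30334, configured to determine the first semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the swap operation;
a fourth determining subunit 30335, configured to determine the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence;
a fifth determining subunit 30336, configured to determine the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance.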
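In another embodiment, referring to FIG. 7, the device further includes:
a third determining module 304, configured to determine the operation cost of the replacement operation and the operation cost of the swap operation according to the relationship between the replacement operation and the swap operation;
a fourth determining module 305, configured to determine the operation cost of the insertion operation, the operation cost of the deletion operation, and the operation cost of the replacement operation according to the relationship between the replacement operation and the insertion and deletion operations.
In another embodiment, referring to FIG. 8, the device further includes:
a fifth determining module 306, configured to determine, according to the relationship between the replacement operation and the swap operation, that 2 × the operation cost of the replacement operation > the operation cost of the swap operation;
a sixth determining module 307, configured to determine, according to the relationship between the replacement operation and the insertion and deletion operations, that the operation cost of the insertion operation + the operation cost of the deletion operation > the operation cost of the replacement operation.
In another embodiment, referring to FIG. 9, the device further includes:
a seventh determining module 308, configured to determine, according to the relationship between the insertion operation and the deletion operation, that the operation cost of the insertion operation is equal to the operation cost of the deletion operation.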
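The three cost relationships stated for modules 306 to 308 can be checked together. The sketch below is a hypothetical validator: the function name, the parameter names, and the use of a tolerance for the equality check are assumptions, not part of the disclosure.

```python
import math


def costs_are_consistent(cost_t: float, cost_j: float,
                         cost_c: float, cost_s: float) -> bool:
    """Check the cost relationships described for modules 306 to 308.

    cost_t -- replacement cost, cost_j -- swap cost,
    cost_c -- insertion cost,   cost_s -- deletion cost.
    """
    swap_cheaper_than_two_replacements = 2 * cost_t > cost_j          # module 306
    replace_cheaper_than_insert_delete = cost_c + cost_s > cost_t      # module 307
    insert_equals_delete = math.isclose(cost_c, cost_s)                # module 308
    return (swap_cheaper_than_two_replacements
            and replace_cheaper_than_insert_delete
            and insert_equals_delete)
```

For instance, cost(T) = 1.0, cost(J) = 0.5, and cost(C) = cost(S) = 1.0 satisfy all three relationships, whereas raising cost(J) to 2.0 violates the first one.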
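In another embodiment, the first determining module 302 is configured to determine the edit distance between the first character string and the second character string according to the predefined edit distance algorithm, the first sequence, and the second sequence through Formula 1 below:
Formula 1:
minCost[i,j]=min(
minCost[i-1,j]+cost(S),
minCost[i,j-1]+cost(C),
minCost[i-1,j-1]+cost(T))
In Formula 1, i denotes the i-th word in the first sequence; j denotes the j-th word in the second sequence; cost(S) is the operation cost of the deletion operation, cost(C) is the operation cost of the insertion operation, and cost(T) is the operation cost of the replacement operation.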
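The recurrence of Formula 1 can be sketched as a dynamic program over word sequences. Two points in the sketch are assumptions that this passage does not state: the boundary values (i deletions for an empty second prefix, j insertions for an empty first prefix), and the convention that the diagonal step costs nothing when the two words are identical. The function name and the default costs of 1.0 are likewise hypothetical.

```python
from typing import List, Sequence


def word_edit_distance(seq1: Sequence[str], seq2: Sequence[str],
                       cost_s: float = 1.0,   # deletion cost, cost(S)
                       cost_c: float = 1.0,   # insertion cost, cost(C)
                       cost_t: float = 1.0    # replacement cost, cost(T)
                       ) -> float:
    n, m = len(seq1), len(seq2)
    # min_cost[i][j]: cost of turning the first i words of seq1
    # into the first j words of seq2.
    min_cost: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        min_cost[i][0] = i * cost_s           # assumed boundary: delete i words
    for j in range(1, m + 1):
        min_cost[0][j] = j * cost_c           # assumed boundary: insert j words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumption: identical words are kept at no cost.
            replace = 0.0 if seq1[i - 1] == seq2[j - 1] else cost_t
            min_cost[i][j] = min(
                min_cost[i - 1][j] + cost_s,
                min_cost[i][j - 1] + cost_c,
                min_cost[i - 1][j - 1] + replace,
            )
    return min_cost[n][m]
```

With unit costs, transforming the word sequence ["我", "打算", "去", "香山"] into ["我", "去", "香山"] costs 1.0, a single deletion of 打算. Because the units are words rather than characters, a multi-character word counts as one edit.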
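In another embodiment, the first determining subunit 30331 is configured to determine the minimum semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the swap operation through Formula 2 below:
Formula 2: minCost(S1,S2)=d-p(2cost(T)-cost(J));
In Formula 2, S1 and S2 are the first character string and the second character string respectively, minCost(S1,S2) is the minimum semantic edit distance, d is the edit distance, p is the pairing number, cost(J) is the operation cost of the swap operation, cost(T) is the operation cost of the replacement operation, and 2cost(T)-cost(J)>0.
In another embodiment, the third determining subunit 30334 is configured to determine the first semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the swap operation through Formula 3 below:
Formula 3:
minCost(S1,S2)=d-p(2cost(T)-cost(J));
In Formula 3, S1 and S2 are the first character string and the second character string respectively, minCost(S1,S2) is the first semantic edit distance, d is the edit distance, p is the pairing number, cost(J) is the operation cost of the swap operation, cost(T) is the operation cost of the replacement operation, and 2cost(T)-cost(J)>0.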
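In another embodiment, the fourth determining subunit 30335 is configured to determine the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence through Formula 4 below:
Formula 4:
normFact(S1,S2)=min(n,m)cost(T)+(max(n,m)-min(n,m))×costM
costM=cost(C),if n<m;
costM=cost(S),if n>m
In Formula 4, normFact(S1,S2) is the second semantic edit distance, n is the number of words in the first sequence, m is the number of words in the second sequence, cost(T) is the operation cost of the replacement operation, cost(S) is the operation cost of the deletion operation, and cost(C) is the operation cost of the insertion operation.
In another embodiment, the fifth determining subunit 30336 is configured to determine the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance through Formula 5 below:
Formula 5:
sim(S1,S2)=1-minCost(S1,S2)/normFact(S1,S2);
In Formula 5, sim(S1,S2) is the similarity between the first character string and the second character string, minCost(S1,S2) is the first semantic edit distance, and normFact(S1,S2) is the second semantic edit distance.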
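All of the optional technical solutions above may be combined in any manner to form optional embodiments of the present invention, which are not described here one by one.
The similarity determination device provided in the embodiments corresponding to FIG. 3 to FIG. 9 may be used to perform the similarity determination method provided in the embodiments corresponding to FIG. 1 or FIG. 2. The specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.
FIG. 10 is a block diagram of a terminal 600 according to an exemplary embodiment. The terminal may be used to perform the similarity determination method provided in the embodiments corresponding to FIG. 1 or FIG. 2. For example, the terminal 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to FIG. 10, the terminal 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an I/O (Input/Output) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 typically controls the overall operation of the terminal 600, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 602 may include one or more processors 620 to execute instructions so as to complete all or some of the steps of the method described above. In addition, the processing component 602 may include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.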
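The memory 604 is configured to store various types of data to support operation on the terminal 600. Examples of such data include instructions for any application or method operating on the terminal 600, contact data, phone book data, messages, pictures, videos, and so on. The memory 604 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as SRAM (Static Random Access Memory), EEPROM (Electrically-Erasable Programmable Read-Only Memory), EPROM (Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), ROM (Read-Only Memory), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 606 supplies power to the various components of the terminal 600. The power component 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the terminal 600.
The multimedia component 608 includes a screen that provides an output interface between the terminal 600 and the user. In some embodiments, the screen may include an LCD (Liquid Crystal Display) and a TP (Touch Panel). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. A touch sensor may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. When the terminal 600 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a MIC (Microphone). When the terminal 600 is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 604 or sent via the communication component 616. In some embodiments, the audio component 610 also includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.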
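The sensor component 614 includes one or more sensors for providing status assessments of various aspects of the terminal 600. For example, the sensor component 614 may detect the open/closed state of the terminal 600 and the relative positioning of components, such as the display and keypad of the terminal 600. The sensor component 614 may also detect a change in position of the terminal 600 or of a component of the terminal 600, the presence or absence of user contact with the terminal 600, the orientation or acceleration/deceleration of the terminal 600, and a change in temperature of the terminal 600. The sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS (Complementary Metal Oxide Semiconductor) or CCD (Charge-coupled Device) image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate wired or wireless communication between the terminal 600 and other devices. The terminal 600 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 also includes an NFC (Near Field Communication) module to facilitate short-range communication. For example, the NFC module may be implemented based on RFID (Radio Frequency Identification) technology, IrDA (Infra-red Data Association) technology, UWB (Ultra Wideband) technology, BT (Bluetooth) technology, and other technologies.
In an exemplary embodiment, the terminal 600 may be implemented by one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable Logic Devices), FPGAs (Field Programmable Gate Arrays), controllers, microcontrollers, microprocessors, or other electronic components, for performing the similarity determination method provided in the embodiments corresponding to FIG. 1 or FIG. 2.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 604 including instructions, where the instructions are executable by the processor 620 of the terminal 600 to complete the similarity determination method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, or the like.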
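The non-transitory computer-readable storage medium provided by the embodiments of the present disclosure segments the first character string and the second character string into the first sequence and the second sequence respectively, so that the edit distance for transforming the first character string into the second character string is determined on the basis of the words in the first sequence and the second sequence rather than the individual characters of the two character strings. Because each word in a character string may include at least one character, the similarity determined from the edit distance takes into account the correlation between the characters in the character string, which makes the determined similarity more accurate.
Of course, the similarity determination method provided in the embodiments corresponding to FIG. 1 or FIG. 2 may also be performed by a server. FIG. 11 is a block diagram of a server according to an exemplary embodiment; the server may perform the similarity determination method provided in the embodiments corresponding to FIG. 1 or FIG. 2. Referring to FIG. 11, the server 700 includes a processing component 722, which further includes one or more processors, and memory resources represented by a memory 732 for storing instructions executable by the processing component 722, such as application programs. The application programs stored in the memory 732 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 722 is configured to execute the instructions so as to perform the similarity determination method provided in the embodiments corresponding to FIG. 1 or FIG. 2.
The server 700 may also include a power component 726 configured to perform power management of the server 700, a wired or wireless network interface 750 configured to connect the server 700 to a network, and an input/output (I/O) interface 758. The server 700 may operate based on an operating system stored in the memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in this disclosure. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the invention indicated by the following claims.
It should be understood that the invention is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.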

Claims (25)

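  1. A similarity determination method, characterized in that the method comprises:
    performing word segmentation on a first character string and a second character string respectively to obtain a first sequence and a second sequence, the first sequence and the second sequence each including at least one word;
    determining an edit distance between the first character string and the second character string according to a predefined edit distance algorithm, the first sequence, and the second sequence;
    determining a similarity between the first character string and the second character string according to the edit distance and information on each operation performed in transforming the first sequence into the second sequence.
  2. The method according to claim 1, characterized in that the determining of the similarity between the first character string and the second character string according to the edit distance and information on each operation performed in transforming the first sequence into the second sequence comprises:
    obtaining replacement operation information from the information on the operations performed in transforming the first sequence into the second sequence;
    determining a pairing number according to the replacement operation information, wherein the pairing number is the number of pairs of two words that are present in both the first sequence and the second sequence;
    determining the similarity between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence.
  3. The method according to claim 2, characterized in that the operations include a replacement operation and a swap operation, and the determining of the similarity between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence comprises:
    determining a minimum semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the swap operation;
    normalizing the minimum semantic edit distance to obtain a normalization result;
    determining the similarity between the first character string and the second character string according to the normalization result.
  4. The method according to claim 2, characterized in that the operations include at least one of a replacement operation, a swap operation, an insertion operation, and a deletion operation, and the determining of the similarity between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence comprises:
    determining a first semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the swap operation;
    determining a second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence;
    determining the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance.
  5. The method according to any one of claims 2 to 4, characterized in that the method further comprises:
    determining the operation cost of the replacement operation and the operation cost of the swap operation according to the relationship between the replacement operation and the swap operation;
    determining the operation cost of the insertion operation, the operation cost of the deletion operation, and the operation cost of the replacement operation according to the relationship between the replacement operation and the insertion and deletion operations.
  6. The method according to claim 5, characterized in that the method further comprises:
    determining, according to the relationship between the replacement operation and the swap operation, that 2 × the operation cost of the replacement operation > the operation cost of the swap operation;
    determining, according to the relationship between the replacement operation and the insertion and deletion operations, that the operation cost of the insertion operation + the operation cost of the deletion operation > the operation cost of the replacement operation.
  7. The method according to claim 5, characterized in that the method further comprises:
    determining, according to the relationship between the insertion operation and the deletion operation, that the operation cost of the insertion operation is equal to the operation cost of the deletion operation.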
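  8. The method according to claim 2, characterized in that the determining of the edit distance between the first character string and the second character string according to the predefined edit distance algorithm, the first sequence, and the second sequence comprises:
    determining the edit distance between the first character string and the second character string according to the predefined edit distance algorithm, the first sequence, and the second sequence through Formula 1 below:
    Formula 1:
    minCost[i,j]=min(
    minCost[i-1,j]+cost(S),
    minCost[i,j-1]+cost(C),
    minCost[i-1,j-1]+cost(T))
    In Formula 1, i denotes the i-th word in the first sequence; j denotes the j-th word in the second sequence; cost(S) is the operation cost of the deletion operation, cost(C) is the operation cost of the insertion operation, and cost(T) is the operation cost of the replacement operation.
  9. The method according to claim 3, characterized in that the determining of the minimum semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the swap operation comprises:
    determining the minimum semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the swap operation through Formula 2 below:
    Formula 2: minCost(S1,S2)=d-p(2cost(T)-cost(J));
    In Formula 2, S1 and S2 are the first character string and the second character string respectively, minCost(S1,S2) is the minimum semantic edit distance, d is the edit distance, p is the pairing number, cost(J) is the operation cost of the swap operation, cost(T) is the operation cost of the replacement operation, and 2cost(T)-cost(J)>0.
  10. The method according to claim 4, characterized in that the determining of the first semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the swap operation comprises:
    determining the first semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the swap operation through Formula 3 below:
    Formula 3:
    minCost(S1,S2)=d-p(2cost(T)-cost(J));
    In Formula 3, S1 and S2 are the first character string and the second character string respectively, minCost(S1,S2) is the first semantic edit distance, d is the edit distance, p is the pairing number, cost(J) is the operation cost of the swap operation, cost(T) is the operation cost of the replacement operation, and 2cost(T)-cost(J)>0.
  11. The method according to claim 4, characterized in that the determining of the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence comprises:
    determining the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence through Formula 4 below:
    Formula 4:
    normFact(S1,S2)=min(n,m)cost(T)+(max(n,m)-min(n,m))×costM
    costM=cost(C),if n<m;
    costM=cost(S),if n>m
    In Formula 4, normFact(S1,S2) is the second semantic edit distance, n is the number of words in the first sequence, m is the number of words in the second sequence, cost(T) is the operation cost of the replacement operation, cost(S) is the operation cost of the deletion operation, and cost(C) is the operation cost of the insertion operation.
  12. The method according to claim 4, characterized in that the determining of the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance comprises:
    determining the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance through Formula 5 below:
    Formula 5:
    sim(S1,S2)=1-minCost(S1,S2)/normFact(S1,S2);
    In Formula 5, sim(S1,S2) is the similarity between the first character string and the second character string, minCost(S1,S2) is the first semantic edit distance, and normFact(S1,S2) is the second semantic edit distance.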
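  13. A similarity determination device, characterized in that the device comprises:
    a word segmentation module, configured to perform word segmentation on a first character string and a second character string respectively to obtain a first sequence and a second sequence, the first sequence and the second sequence each including at least one word;
    a first determining module, configured to determine an edit distance between the first character string and the second character string according to a predefined edit distance algorithm, the first sequence, and the second sequence;
    a second determining module, configured to determine a similarity between the first character string and the second character string according to the edit distance and information on each operation performed in transforming the first sequence into the second sequence.
  14. The device according to claim 13, characterized in that the second determining module comprises:
    an obtaining unit, configured to obtain replacement operation information from the information on the operations performed in transforming the first sequence into the second sequence;
    a first determining unit, configured to determine a pairing number according to the replacement operation information, wherein the pairing number is the number of pairs of two words that are present in both the first sequence and the second sequence;
    a second determining unit, configured to determine the similarity between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of each operation, the number of words in the first sequence, and the number of words in the second sequence.
  15. The device according to claim 14, characterized in that the operations include a replacement operation and a swap operation, and the second determining unit comprises:
    a first determining subunit, configured to determine a minimum semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the swap operation;
    a normalization subunit, configured to normalize the minimum semantic edit distance to obtain a normalization result;
    a second determining subunit, configured to determine the similarity between the first character string and the second character string according to the normalization result.
  16. The device according to claim 14, characterized in that the operations include at least one of a replacement operation, a swap operation, an insertion operation, and a deletion operation, and the second determining unit comprises:
    a third determining subunit, configured to determine a first semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the swap operation;
    a fourth determining subunit, configured to determine a second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence;
    a fifth determining subunit, configured to determine the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance.
  17. The device according to any one of claims 14 to 16, characterized in that the device further comprises:
    a third determining module, configured to determine the operation cost of the replacement operation and the operation cost of the swap operation according to the relationship between the replacement operation and the swap operation;
    a fourth determining module, configured to determine the operation cost of the insertion operation, the operation cost of the deletion operation, and the operation cost of the replacement operation according to the relationship between the replacement operation and the insertion and deletion operations.
  18. The device according to claim 17, characterized in that the device further comprises:
    a fifth determining module, configured to determine, according to the relationship between the replacement operation and the swap operation, that 2 × the operation cost of the replacement operation > the operation cost of the swap operation;
    a sixth determining module, configured to determine, according to the relationship between the replacement operation and the insertion and deletion operations, that the operation cost of the insertion operation + the operation cost of the deletion operation > the operation cost of the replacement operation.
  19. The device according to claim 17, characterized in that the device further comprises:
    a seventh determining module, configured to determine, according to the relationship between the insertion operation and the deletion operation, that the operation cost of the insertion operation is equal to the operation cost of the deletion operation.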
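  20. The device according to claim 14, characterized in that the first determining module is configured to determine the edit distance between the first character string and the second character string according to the predefined edit distance algorithm, the first sequence, and the second sequence through Formula 1 below:
    Formula 1:
    minCost[i,j]=min(
    minCost[i-1,j]+cost(S),
    minCost[i,j-1]+cost(C),
    minCost[i-1,j-1]+cost(T))
    In Formula 1, i denotes the i-th word in the first sequence; j denotes the j-th word in the second sequence; cost(S) is the operation cost of the deletion operation, cost(C) is the operation cost of the insertion operation, and cost(T) is the operation cost of the replacement operation.
  21. The device according to claim 15, characterized in that the first determining subunit is configured to determine the minimum semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the swap operation through Formula 2 below:
    Formula 2: minCost(S1,S2)=d-p(2cost(T)-cost(J));
    In Formula 2, S1 and S2 are the first character string and the second character string respectively, minCost(S1,S2) is the minimum semantic edit distance, d is the edit distance, p is the pairing number, cost(J) is the operation cost of the swap operation, cost(T) is the operation cost of the replacement operation, and 2cost(T)-cost(J)>0.
  22. The device according to claim 16, characterized in that the third determining subunit is configured to determine the first semantic edit distance between the first character string and the second character string according to the edit distance, the pairing number, the operation cost of the replacement operation, and the operation cost of the swap operation through Formula 3 below:
    Formula 3:
    minCost(S1,S2)=d-p(2cost(T)-cost(J));
    In Formula 3, S1 and S2 are the first character string and the second character string respectively, minCost(S1,S2) is the first semantic edit distance, d is the edit distance, p is the pairing number, cost(J) is the operation cost of the swap operation, cost(T) is the operation cost of the replacement operation, and 2cost(T)-cost(J)>0.
  23. The device according to claim 16, characterized in that the fourth determining subunit is configured to determine the second semantic edit distance between the first character string and the second character string according to one of the operation cost of the insertion operation and the operation cost of the deletion operation, the operation cost of the replacement operation, the number of words in the first sequence, and the number of words in the second sequence through Formula 4 below:
    Formula 4:
    normFact(S1,S2)=min(n,m)cost(T)+(max(n,m)-min(n,m))×costM
    costM=cost(C),if n<m;
    costM=cost(S),if n>m
    In Formula 4, normFact(S1,S2) is the second semantic edit distance, n is the number of words in the first sequence, m is the number of words in the second sequence, cost(T) is the operation cost of the replacement operation, cost(S) is the operation cost of the deletion operation, and cost(C) is the operation cost of the insertion operation.
  24. The device according to claim 16, characterized in that the fifth determining subunit is configured to determine the similarity between the first character string and the second character string according to the first semantic edit distance and the second semantic edit distance through Formula 5 below:
    Formula 5:
    sim(S1,S2)=1-minCost(S1,S2)/normFact(S1,S2);
    In Formula 5, sim(S1,S2) is the similarity between the first character string and the second character string, minCost(S1,S2) is the first semantic edit distance, and normFact(S1,S2) is the second semantic edit distance.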
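  25. A terminal, characterized in that the terminal comprises:
    a processor;
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to:
    perform word segmentation on a first character string and a second character string respectively to obtain a first sequence and a second sequence, the first sequence and the second sequence each including at least one word;
    determine an edit distance between the first character string and the second character string according to a predefined edit distance algorithm, the first sequence, and the second sequence;
    determine a similarity between the first character string and the second character string according to the edit distance and information on each operation performed in transforming the first sequence into the second sequence.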
PCT/CN2015/099523 2015-12-03 2015-12-29 Similarity determination method, device, and terminal WO2017092122A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
KR1020167006741A KR101782923B1 (ko) 2015-12-03 2015-12-29 Similarity determination method, device, terminal, program, and storage medium
RU2016118758A RU2664002C2 (ru) 2015-12-03 2015-12-29 Method and device for determining similarity, and terminal
MX2016005489A MX365897B (es) 2015-12-03 2015-12-29 Method and apparatus for determining similarity, and terminal
JP2017553299A JP6321306B2 (ja) 2015-12-03 2015-12-29 Similarity identification method, device, terminal, program, and recording medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510882468.2A CN105446957B (zh) 2015-12-03 2015-12-03 Similarity determination method, device, and terminal
CN201510882468.2 2015-12-03

Publications (1)

Publication Number Publication Date
WO2017092122A1 true WO2017092122A1 (zh) 2017-06-08

Family

ID=55557172

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/099523 WO2017092122A1 (zh) 2015-12-03 2015-12-29 Similarity determination method, device, and terminal

Country Status (8)

Country Link
US (1) US10089301B2 (zh)
EP (1) EP3179379A1 (zh)
JP (1) JP6321306B2 (zh)
KR (1) KR101782923B1 (zh)
CN (1) CN105446957B (zh)
MX (1) MX365897B (zh)
RU (1) RU2664002C2 (zh)
WO (1) WO2017092122A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
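CN111352549A (zh) * 2020-02-25 2020-06-30 腾讯科技(深圳)有限公司 Data object display method, apparatus, device, and storage medium
CN114757153A (zh) * 2022-05-12 2022-07-15 阿里巴巴(中国)有限公司 Character string and character string set processing method, computer device, and storage medium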

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10296788B1 (en) * 2016-12-19 2019-05-21 Matrox Electronic Systems Ltd. Method and system for processing candidate strings detected in an image to identify a match of a model string in the image
US10853457B2 (en) * 2018-02-06 2020-12-01 Didi Research America, Llc System and method for program security protection
US10515149B2 (en) * 2018-03-30 2019-12-24 BlackBoiler, LLC Method and system for suggesting revisions to an electronic document
WO2020061910A1 (zh) * 2018-09-27 2020-04-02 北京字节跳动网络技术有限公司 Method and apparatus for generating information
SG10201904554TA (en) * 2019-05-21 2019-09-27 Alibaba Group Holding Ltd Methods and devices for quantifying text similarity
CN110750615B (zh) * 2019-09-30 2020-07-24 贝壳找房(北京)科技有限公司 Text repetition determination method and apparatus, electronic device, and storage medium
CN110909161B (zh) * 2019-11-12 2022-04-08 西安电子科技大学 English word classification method based on density clustering and visual similarity
US11776529B2 (en) * 2020-04-28 2023-10-03 Samsung Electronics Co., Ltd. Method and apparatus with speech processing
KR20210132855A (ko) * 2020-04-28 2021-11-05 삼성전자주식회사 Speech processing method and apparatus
CN111967270B (zh) * 2020-08-16 2023-11-21 云知声智能科技股份有限公司 Method and device based on fusion of characters and semantics
US11681864B2 (en) 2021-01-04 2023-06-20 Blackboiler, Inc. Editing parameters
CN112597313B (zh) * 2021-03-03 2021-06-29 北京沃丰时代数据科技有限公司 Short text clustering method and apparatus, electronic device, and storage medium
KR102517661B1 (ko) * 2022-07-15 2023-04-04 주식회사 액션파워 Method for identifying a word corresponding to a target word in text information
CN116564414B (zh) * 2023-07-07 2024-03-26 腾讯科技(深圳)有限公司 Molecular sequence alignment method and apparatus, electronic device, storage medium, and product

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040141354A1 (en) * 2003-01-18 2004-07-22 Carnahan John M. Query string matching method and apparatus
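CN101561813A (zh) * 2009-05-27 2009-10-21 东北大学 Method for analyzing character string similarity in a Web environment
CN101751430A (zh) * 2008-12-12 2010-06-23 汉王科技股份有限公司 Fuzzy retrieval method for an electronic dictionary
CN102622338A (zh) * 2012-02-24 2012-08-01 北京工业大学 Computer-aided method for calculating the semantic distance between short texts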

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5757959A (en) * 1995-04-05 1998-05-26 Panasonic Technologies, Inc. System and method for handwriting matching using edit distance computation in a systolic array processor
NO983175L (no) 1998-07-10 2000-01-11 Fast Search & Transfer Asa Search system for data retrieval
JP2001291060A (ja) * 2000-04-04 2001-10-19 Toshiba Corp Word string matching device and word string matching method
US7107204B1 (en) * 2000-04-24 2006-09-12 Microsoft Corporation Computer-aided writing system and method with cross-language writing wizard
US6810376B1 (en) * 2000-07-11 2004-10-26 Nusuara Technologies Sdn Bhd System and methods for determining semantic similarity of sentences
EP1668541A1 (en) * 2003-09-30 2006-06-14 British Telecommunications Public Limited Company Information retrieval
JP2005352888A (ja) * 2004-06-11 2005-12-22 Hitachi Ltd System for creating a dictionary that handles notation variants
US8077984B2 (en) * 2008-01-04 2011-12-13 Xerox Corporation Method for computing similarity between text spans using factored word sequence kernels
US8775441B2 (en) 2008-01-16 2014-07-08 Ab Initio Technology Llc Managing an archive for approximate string matching
US8812493B2 (en) * 2008-04-11 2014-08-19 Microsoft Corporation Search results ranking using editing distance and document information
US8170969B2 (en) * 2008-08-13 2012-05-01 Siemens Aktiengesellschaft Automated computation of semantic similarity of pairs of named entity phrases using electronic document corpora as background knowledge
US8219583B2 (en) * 2008-11-10 2012-07-10 Nbcuniversal Media, Llc Methods and systems for mining websites
US8290989B2 (en) * 2008-11-12 2012-10-16 Sap Ag Data model optimization
CN101957828B (zh) 2009-07-20 2013-03-06 阿里巴巴集团控股有限公司 Method and device for ranking search results
WO2014136173A1 (ja) * 2013-03-04 2014-09-12 三菱電機株式会社 Search device
CN103399907A (zh) * 2013-07-31 2013-11-20 深圳市华傲数据技术有限公司 Method and device for calculating Chinese character string similarity based on edit distance
CA2861469A1 (en) * 2013-08-14 2015-02-14 National Research Council Of Canada Method and apparatus to construct program for assisting in reviewing
JP6143638B2 (ja) * 2013-10-17 2017-06-07 株式会社日立ソリューションズ東日本 Data processing device and data processing method
US9430463B2 (en) * 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US9672206B2 (en) * 2015-06-01 2017-06-06 Information Extraction Systems, Inc. Apparatus, system and method for application-specific and customizable semantic similarity measurement

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
13 April 2015 (2015-04-13), XP055598524, Retrieved from the Internet <URL:http://wenku.baidu.com/view/df531365227916888586d702.html> *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
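CN111352549A (zh) * 2020-02-25 2020-06-30 腾讯科技(深圳)有限公司 Data object display method, apparatus, device, and storage medium
CN111352549B (zh) * 2020-02-25 2022-01-07 腾讯科技(深圳)有限公司 Data object display method, apparatus, device, and storage medium
CN114757153A (zh) * 2022-05-12 2022-07-15 阿里巴巴(中国)有限公司 Character string and character string set processing method, computer device, and storage medium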

Also Published As

Publication number Publication date
US10089301B2 (en) 2018-10-02
KR101782923B1 (ko) 2017-09-28
CN105446957A (zh) 2016-03-30
MX2016005489A (es) 2017-11-30
MX365897B (es) 2019-06-19
JP6321306B2 (ja) 2018-05-09
CN105446957B (zh) 2018-07-20
RU2664002C2 (ru) 2018-08-14
US20170161260A1 (en) 2017-06-08
RU2016118758A (ru) 2017-11-20
EP3179379A1 (en) 2017-06-14
JP2018501597A (ja) 2018-01-18

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2017553299

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 1020167006741

Country of ref document: KR

WWE Wipo information: entry into national phase

Ref document number: MX/A/2016/005489

Country of ref document: MX

ENP Entry into the national phase

Ref document number: 2016118758

Country of ref document: RU

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15909636

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15909636

Country of ref document: EP

Kind code of ref document: A1