WO2020061910A1 - Method and apparatus for generating information - Google Patents

Method and apparatus for generating information

Info

Publication number
WO2020061910A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
word
determining
similarity
target word
Prior art date
Application number
PCT/CN2018/107990
Other languages
English (en)
French (fr)
Inventor
乔超
李航
牛艺霖
Original Assignee
北京字节跳动网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司
Priority to PCT/CN2018/107990
Publication of WO2020061910A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri

Definitions

  • Embodiments of the present application relate to the field of computer technology, and in particular, to a method and an apparatus for generating information.
  • a method such as a bag-of-words model is usually used for text similarity calculation.
  • the embodiments of the present application provide a method and device for generating information.
  • an embodiment of the present application provides a method for generating information.
  • The method includes: using a dynamic programming algorithm to determine a minimum editing distance for converting a first text into a second text by performing editing operations on the first text, where the minimum editing distance is determined based on the cost of the editing operations, the cost of an editing operation is determined based on the semantic similarity between the target word in the first text and the target word in the second text, the target word is the word involved in the editing operation, and the editing operations are divided into a delete word operation, an insert word operation, and a replacement word operation; and normalizing the minimum editing distance and determining the normalized value as the similarity between the first text and the second text.
  • The semantic similarity between the target word in the first text and the target word in the second text is determined by the following semantic similarity determination step: determining whether the target word in the first text and the target word in the second text are the same; if not, determining whether each of the target word in the first text and the target word in the second text is an unregistered word; and in response to determining that the target word in the first text and/or the target word in the second text is an unregistered word, determining the first preset value as the semantic similarity between the target word in the first text and the target word in the second text.
  • The semantic similarity determination step further includes: in response to determining that the target word in the first text is not an unregistered word and the target word in the second text is not an unregistered word, performing the following steps: determining the cosine similarity between the word vector of the target word in the first text and the word vector of the target word in the second text; determining the product of the cosine similarity and the first preset parameter; and inputting the sum of the product and the second preset parameter to the objective function, and determining the value of the objective function as the semantic similarity between the target word in the first text and the target word in the second text.
  • The semantic similarity determination step further includes: in response to determining that the target word in the first text is the same as the target word in the second text, determining a second preset value as the semantic similarity between the target word in the first text and the target word in the second text.
  • The cost of the replacement word operation is determined by the following steps: determining the word to be replaced in the first text as the target word in the first text; determining the word in the second text used to replace the word to be replaced as the target word in the second text; determining the semantic similarity between the target word in the first text and the target word in the second text; and determining the difference between the third preset value and the semantic similarity as the cost of the replacement word operation.
  • The cost of the delete word operation is determined by the following steps: taking the word to be deleted in the first text as the target word in the first text, and determining, one by one, the semantic similarity between each word in the second text and the target word in the first text; determining the word in the second text corresponding to the maximum semantic similarity as the target word in the second text; determining the product of the maximum semantic similarity and the third preset parameter; and determining the difference between the fourth preset parameter and the product as the cost of the delete word operation.
  • The cost of the insert word operation is determined by the following steps: taking the word in the second text to be inserted into the first text as the target word in the second text, and determining, one by one, the semantic similarity between each word in the first text and the target word in the second text; determining the word in the first text corresponding to the maximum semantic similarity as the target word in the first text; determining the product of the maximum semantic similarity and the third preset parameter; and determining the difference between the fourth preset parameter and the product as the cost of the insert word operation.
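The three cost rules above can be illustrated with a minimal Python sketch. The function names, the type alias `Similarity`, the toy `exact_match` similarity, and the default parameter values (1.0, 0.5, 1.0) are all hypothetical choices made for illustration; the source only says that these are preset values and parameters.

```python
from typing import Callable, Sequence

# Hypothetical alias: any function returning a word-to-word semantic similarity.
Similarity = Callable[[str, str], float]


def replace_cost(a_word: str, b_word: str, sim: Similarity,
                 third_value: float = 1.0) -> float:
    """Replacement cost: third preset value minus the semantic similarity."""
    return third_value - sim(a_word, b_word)


def delete_cost(a_word: str, b_words: Sequence[str], sim: Similarity,
                third_param: float = 0.5, fourth_param: float = 1.0) -> float:
    """Deletion cost: fourth preset parameter minus
    (maximum similarity to any word of the second text * third preset parameter)."""
    best = max(sim(a_word, b) for b in b_words)
    return fourth_param - third_param * best


def insert_cost(b_word: str, a_words: Sequence[str], sim: Similarity,
                third_param: float = 0.5, fourth_param: float = 1.0) -> float:
    """Insertion cost: same rule as deletion with the roles of the texts swapped."""
    best = max(sim(a, b_word) for a in a_words)
    return fourth_param - third_param * best


def exact_match(a: str, b: str) -> float:
    """Toy similarity used only for this illustration (assumption)."""
    return 1.0 if a == b else 0.0
```

With these assumed defaults, deleting a word that has a close counterpart in the other text is cheaper than deleting one that has none, which is the effect the cost rules describe.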
  • Normalizing the minimum editing distance and determining the normalized value as the similarity between the first text and the second text includes: determining the numbers of words in the word sequences constituting the first text and the second text as the first number and the second number, respectively; and determining the similarity between the first text and the second text based on the minimum editing distance, the first number, the second number, and a comparison between the fourth preset parameter and a preset threshold.
  • Determining the similarity between the first text and the second text based on the minimum editing distance, the first number, the second number, and the comparison between the fourth preset parameter and the preset threshold includes: in response to determining that the fourth preset parameter is less than the preset threshold, performing the following steps: determining the sum of the first number and the second number as the first intermediate value; determining the product of the first intermediate value and the fourth preset parameter as the second intermediate value; determining the ratio between the minimum editing distance and the second intermediate value; and determining the difference between the fourth preset value and the ratio as the similarity between the first text and the second text.
  • Determining the similarity between the first text and the second text based on the minimum editing distance, the first number, the second number, and the comparison between the fourth preset parameter and the preset threshold includes: in response to determining that the fourth preset parameter is not less than the preset threshold, performing the following steps: determining the difference between the second number and the first number as the third intermediate value; determining the product of the third intermediate value and the fourth preset parameter as the fourth intermediate value; determining the sum of the fourth intermediate value and the first number as the fifth intermediate value; determining the ratio between the minimum editing distance and the fifth intermediate value; and determining the difference between the fourth preset value and the ratio as the similarity between the first text and the second text.
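The two normalization branches above can be sketched in Python as follows. This is only an illustration of the arithmetic as stated: the function name, the argument names, and the default values for the preset threshold (0.5) and the fourth preset value (1.0) are assumptions, and the guard against a non-positive denominator is an addition that the source does not mention.

```python
def normalized_similarity(wed: float, n: int, m: int, fourth_param: float,
                          threshold: float = 0.5,
                          fourth_value: float = 1.0) -> float:
    """Turn a minimum editing distance into a similarity score.

    wed          -- minimum editing distance between the two texts
    n, m         -- number of words in the first and second text
    fourth_param -- the "fourth preset parameter" of the cost rules
    threshold    -- the "preset threshold" (0.5 is a hypothetical value)
    fourth_value -- the "fourth preset value" (1.0 is a hypothetical value)
    """
    if fourth_param < threshold:
        # Branch 1: (n + m) * fourth preset parameter
        denominator = (n + m) * fourth_param
    else:
        # Branch 2: (m - n) * fourth preset parameter + n
        denominator = (m - n) * fourth_param + n
    if denominator <= 0:
        # Not covered by the source; raised here to avoid a meaningless ratio.
        raise ValueError("normalization denominator must be positive")
    return fourth_value - wed / denominator
```

For example, with a distance of 2.0, texts of 2 and 4 words, and a fourth preset parameter of 1.0, the second branch gives a denominator of 4 and a similarity of 0.5.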
  • The method further includes: displaying a similarity calculation result including the similarity; or, in response to determining that the similarity is greater than a preset similarity threshold, establishing a correspondence between the first text and the second text and storing correspondence information representing the correspondence.
  • an embodiment of the present application provides an apparatus for generating information.
  • The apparatus includes: a first determining unit configured to use a dynamic programming algorithm to determine a minimum editing distance for converting a first text into a second text by performing editing operations on the first text, where the minimum editing distance is determined based on the cost of the editing operations, the cost of an editing operation is determined based on the semantic similarity between the target word in the first text and the target word in the second text, and the editing operations are divided into a delete word operation, an insert word operation, and a replacement word operation; and a second determining unit configured to normalize the minimum editing distance and determine the normalized value as the similarity between the first text and the second text.
  • The first determining unit is further configured to perform the following semantic similarity determination step: determining whether the target word in the first text is the same as the target word in the second text; if not, determining whether each of the target word in the first text and the target word in the second text is an unregistered word; and in response to determining that the target word in the first text and/or the target word in the second text is an unregistered word, determining the first preset value as the semantic similarity between the target word in the first text and the target word in the second text.
  • The semantic similarity determination step further includes: in response to determining that the target word in the first text is not an unregistered word and the target word in the second text is not an unregistered word, performing the following steps: determining the cosine similarity between the word vector of the target word in the first text and the word vector of the target word in the second text; determining the product of the cosine similarity and the first preset parameter; and inputting the sum of the product and the second preset parameter to the objective function, and determining the value of the objective function as the semantic similarity between the target word in the first text and the target word in the second text.
  • The semantic similarity determination step further includes: in response to determining that the target word in the first text is the same as the target word in the second text, determining a second preset value as the semantic similarity between the target word in the first text and the target word in the second text.
  • The first determining unit is further configured to perform the following steps: determining the word to be replaced in the first text as the target word in the first text; determining the word in the second text used to replace the word to be replaced as the target word in the second text; determining the semantic similarity between the target word in the first text and the target word in the second text; and determining the difference between the third preset value and the semantic similarity as the cost of the replacement word operation.
  • The first determining unit is further configured to perform the following steps: taking the word to be deleted in the first text as the target word in the first text, and determining, one by one, the semantic similarity between each word in the second text and the target word in the first text; determining the word in the second text corresponding to the maximum semantic similarity as the target word in the second text; determining the product of the maximum semantic similarity and the third preset parameter; and determining the difference between the fourth preset parameter and the product as the cost of the delete word operation.
  • The first determining unit is further configured to perform the following steps: taking the word in the second text to be inserted into the first text as the target word in the second text, and determining, one by one, the semantic similarity between each word in the first text and the target word in the second text; determining the word in the first text corresponding to the maximum semantic similarity as the target word in the first text; determining the product of the maximum semantic similarity and the third preset parameter; and determining the difference between the fourth preset parameter and the product as the cost of the insert word operation.
  • The second determining unit includes: a first determining module configured to determine the numbers of words in the word sequences constituting the first text and the second text as the first number and the second number, respectively; and a second determining module configured to determine the similarity between the first text and the second text based on the minimum editing distance, the first number, the second number, and a comparison between the fourth preset parameter and a preset threshold.
  • The second determining module is further configured to: in response to determining that the fourth preset parameter is smaller than the preset threshold, perform the following steps: determine the sum of the first number and the second number as a first intermediate value; determine the product of the first value and the second number as the second intermediate value; determine the ratio between the minimum editing distance and the second intermediate value; and determine the difference between the third preset value and the ratio as the similarity between the first text and the second text.
  • The second determining module is further configured to: in response to determining that the fourth preset parameter is not less than the preset threshold, perform the following steps: determine the difference between the second number and the first number as a third intermediate value; determine the product of the third intermediate value and the fourth preset parameter as the fourth intermediate value; determine the sum of the fourth intermediate value and the first number as the fifth intermediate value; determine the ratio between the minimum editing distance and the fifth intermediate value; and determine the difference between the fourth preset value and the ratio as the similarity between the first text and the second text.
  • The apparatus further includes: a display unit configured to display a similarity calculation result including the similarity; or a storage unit configured to, in response to determining that the similarity is greater than a preset similarity threshold, establish a correspondence between the first text and the second text and store correspondence information used to characterize the correspondence.
  • An embodiment of the present application provides an electronic device including: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to: use a dynamic programming algorithm to determine the minimum editing distance for converting the first text into the second text by performing editing operations on the first text, where the minimum editing distance is determined based on the cost of the editing operations, the cost of an editing operation is determined based on the semantic similarity between the target word in the first text and the target word in the second text, the target word is the word involved in the editing operation, and the editing operations are divided into a delete word operation, an insert word operation, and a replacement word operation; and normalize the minimum editing distance and determine the normalized value as the similarity between the first text and the second text.
  • An embodiment of the present application provides a computer-readable medium having a computer program stored thereon which, when executed by a processor, causes the processor to: use a dynamic programming algorithm to determine the minimum editing distance for converting the first text into the second text by performing editing operations on the first text, where the minimum editing distance is determined based on the cost of the editing operations, the cost of an editing operation is determined based on the semantic similarity between the target word in the first text and the target word in the second text, the target word is the word involved in the editing operation, and the editing operations are divided into a delete word operation, an insert word operation, and a replacement word operation; and normalize the minimum editing distance and determine the normalized value as the similarity between the first text and the second text.
  • The method and device for generating information determine, through a dynamic programming algorithm, a minimum editing distance for converting the first text into the second text by performing editing operations on the first text, then normalize the minimum editing distance and determine the normalized value as the similarity between the first text and the second text.
  • editing operations are divided into insert word operations, delete word operations, and replace word operations.
  • the cost of the editing operation is determined based on the semantic similarity between the target word in the first text and the target word in the second text. Therefore, the order of the words in the text and the word correspondence similarity can be considered at the same time, and the accuracy of the text similarity calculation is improved.
  • FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of a method for generating information according to the present application
  • FIG. 3 is a schematic diagram of an application scenario of a method for generating information according to the present application.
  • FIG. 4 is a flowchart of still another embodiment of a method for generating information according to the present application.
  • FIG. 5 is a schematic structural diagram of an embodiment of an apparatus for generating information according to the present application.
  • FIG. 6 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
  • FIG. 1 illustrates an exemplary system architecture 100 to which the method for generating information or the apparatus for generating information of the present application can be applied.
  • the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105.
  • the network 104 is a medium for providing a communication link between the terminal devices 101, 102, 103 and the server 105.
  • The network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications can be installed on the terminal devices 101, 102, and 103, such as text editing applications, news browsing applications, search applications, instant messaging tools, email clients, social platform software, and the like.
  • the terminal devices 101, 102, and 103 may be hardware or software.
  • When the terminal devices 101, 102, and 103 are hardware, they can be various electronic devices capable of network communication, including but not limited to smart phones, tablet computers, e-book readers, laptop computers, and desktop computers.
  • When the terminal devices 101, 102, and 103 are software, they can be installed in the electronic devices listed above. They can be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module. No specific limitation is made here.
  • the server 105 may be a server that provides various services, such as a background server that processes text uploaded by the terminal devices 101, 102, and 103.
  • the background server can analyze and process the text and generate processing results (such as similarity).
  • the server may be hardware or software.
  • When the server is hardware, it can be implemented as a distributed server cluster consisting of multiple servers or as a single server.
  • When the server is software, it can be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module. No specific limitation is made here.
  • the method for generating information provided by the embodiments of the present application is generally executed by the server 105, and accordingly, the apparatus for generating information is generally set in the server 105.
  • The numbers of terminal devices, networks, and servers in FIG. 1 are merely exemplary. According to implementation needs, there can be any number of terminal devices, networks, and servers.
  • FIG. 2 shows a flow 200 of one embodiment of a method for generating information according to the present application.
  • the method for generating information includes the following steps:
  • Step 201: use a dynamic programming algorithm to determine a minimum editing distance for converting the first text into the second text by performing an editing operation on the first text.
  • The execution subject of the method may be, for example, the server 105 shown in FIG. 1.
  • the first text and the second text may be texts to be subjected to similarity calculation.
  • the first text and the second text may each be composed of a sequence of words.
  • the first text may be represented as A.
  • the second text can be represented as B.
  • The sequence of words that make up the first text can be expressed as $w^A_1, w^A_2, \ldots, w^A_n$.
  • The sequence of words that make up the second text can be expressed as $w^B_1, w^B_2, \ldots, w^B_m$, where w represents a word in the text.
  • n may be the number of words constituting the first text.
  • m may be the number of words constituting the second text. Both n and m are integers not less than 1.
  • The first text and the second text may be stored locally in the execution body in advance. In this case, the execution body may retrieve the first text and the second text directly from local storage.
  • the first text and the second text may be sent by a terminal (for example, the terminal devices 101, 102, and 103 shown in FIG. 1) to the execution subject through a wired connection or a wireless connection.
  • The above wireless connection methods may include, but are not limited to, 3G/4G connections, WiFi connections, Bluetooth connections, WiMAX connections, Zigbee connections, UWB (ultra wideband) connections, and other wireless connection methods now known or developed in the future.
  • one of the first text and the second text may be stored in the execution body in advance. Another text may be sent by the terminal to the above-mentioned execution subject.
  • the execution body may use a dynamic programming algorithm to determine a minimum editing distance (which may be represented by WED) for converting the first text into the second text by performing an editing operation on the first text.
  • the edit distance (also called edit cost) can be a quantitative measurement of the degree of difference between two texts.
  • the edit distance can be used to characterize the cost of converting one text (or string) to another text (another string).
  • the minimum editing distance is the minimum value of the editing distance, which is the minimum cost of converting one text into another.
  • the cost can be understood as the degree of processing of the text, and can be expressed by a numerical value. The greater the degree of processing of the text, the greater the cost; the less the degree of processing of the text, the less the cost.
  • the minimum editing distance may refer to a minimum cost of converting the first text into the second text. Converting the first text to the second text usually requires one or more editing operations. For each editing operation, the execution body may determine the cost of the editing operation. The minimum editing distance can be determined based on the cost of each editing operation.
  • A single insert word operation may be an operation of inserting one word into the first text.
  • A single delete word operation may be an operation of deleting one word from the first text.
  • A single replacement word operation may be an operation of replacing one word in the first text with one word in the second text.
  • the execution body may use a dynamic programming algorithm to determine the minimum editing distance for converting the first text into the second text based on the cost of each editing operation on the first text.
  • Dynamic programming (DP) is a branch of operations research and a mathematical method for optimizing the decision process. The basic idea is to decompose the problem to be solved into several sub-problems, first solve the sub-problems, and then obtain the solution of the original problem from the solutions of these sub-problems.
  • The state transition equation used by the dynamic programming algorithm may adopt the following formula:
  • $$f_{i,j} = \min\left( f_{i-1,j} + \mathrm{del}(w^A_i),\; f_{i,j-1} + \mathrm{ins}(w^B_j),\; f_{i-1,j-1} + \mathrm{sub}(w^A_i, w^B_j) \right)$$
  • i is an integer not less than 1 and not more than n.
  • j is an integer not less than 1 and not more than m.
  • $\mathrm{del}(w^A_i)$ denotes here the cost of deleting the word $w^A_i$ from the first text.
  • $\mathrm{ins}(w^B_j)$ denotes here the cost of inserting the word $w^B_j$ into the first text.
  • $\mathrm{sub}(w^A_i, w^B_j)$ denotes here the cost of replacing the word $w^A_i$ in the first text with the word $w^B_j$ in the second text.
  • $f_{i,j}$ represents the minimum cost of converting the first i words in the first text to the first j words in the second text.
  • $f_{i-1,j-1}$ represents the minimum cost of converting the first i-1 words in the first text to the first j-1 words in the second text.
  • $f_{i-1,j}$ represents the minimum cost of converting the first i-1 words in the first text to the first j words in the second text.
  • $f_{i,j-1}$ represents the minimum cost of converting the first i words in the first text to the first j-1 words in the second text.
  • min represents the minimum value.
  • Based on the cost of each editing operation, the execution body can use the dynamic programming algorithm to calculate the final $f_{n,m}$, which is the minimum cost of converting the first text into the second text, that is, the minimum editing distance WED.
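The recurrence over f described above can be sketched in Python as a standard table-filling dynamic program. The function name and the three cost callbacks are hypothetical, and the boundary rows (converting to or from an empty prefix by pure insertion or deletion) are an assumption, since the text only gives the recurrence for i and j of at least 1.

```python
from typing import Callable, List, Sequence


def weighted_edit_distance(
    a: Sequence[str],
    b: Sequence[str],
    del_cost: Callable[[str], float],
    ins_cost: Callable[[str], float],
    sub_cost: Callable[[str, str], float],
) -> float:
    """Minimum cost of converting word sequence a into word sequence b."""
    n, m = len(a), len(b)
    # f[i][j]: minimum cost of converting the first i words of a
    # into the first j words of b.
    f: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary conditions: delete everything / insert everything.
    for i in range(1, n + 1):
        f[i][0] = f[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, m + 1):
        f[0][j] = f[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = min(
                f[i - 1][j] + del_cost(a[i - 1]),                 # delete a[i-1]
                f[i][j - 1] + ins_cost(b[j - 1]),                 # insert b[j-1]
                f[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),   # replace
            )
    return f[n][m]
```

With unit deletion and insertion costs and a replacement cost of 0 for identical words and 1 otherwise, this reduces to the ordinary word-level edit distance; plugging in similarity-based costs gives the weighted variant the text describes.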
  • the cost of the editing operation may be determined based on the semantic similarity between the target word in the first text and the target word in the second text.
  • the target word may be a word involved in an editing operation.
  • When the editing operation is a delete word operation, the target word in the first text may be the word to be deleted in the first text; the target word in the second text may be the word in the second text with the greatest semantic similarity to the word to be deleted.
  • When the editing operation is an insert word operation, the target word in the second text may be the word in the second text to be inserted into the first text; the target word in the first text may be the word in the first text with the greatest semantic similarity to the target word in the second text.
  • When the editing operation is a replacement word operation, the target word in the first text may be the word to be replaced in the first text, and the target word in the second text may be the word in the second text used to replace it.
  • the execution body may determine the cost of the editing operation based on the calculation result of the semantic similarity between the target word in the first text and the target word in the second text.
  • the execution subject may preset a correspondence relationship between the semantic similarity and the cost of the editing operation, such as a correspondence relationship table, a formula, and the like.
  • The execution body may directly substitute the calculated semantic similarity between the target word in the first text and the target word in the second text into the correspondence for the editing operation to obtain the cost of the editing operation.
  • For different editing operations, the same or different correspondences can be set in advance.
  • For example, the correspondence between the semantic similarity and the cost of the delete word operation and the correspondence between the semantic similarity and the cost of the insert word operation may use the same correspondence table or formula.
  • the correspondence between the semantic similarity and the cost of the word deletion operation, and the correspondence between the semantic similarity and the cost of the replacement word operation may use different correspondence tables or formulas. It is not limited here.
  • the execution subject may determine the semantic similarity between the target word in the first text and the target word in the second text in various ways.
  • the word vectors of the target word in the first text and the target word of the second text may be determined separately.
  • the word vector may be an embedded representation of a word obtained by using a word embedding technique.
  • The execution subject may obtain the word vectors using various existing word vector calculation methods (for example, principal component analysis of a word-text co-occurrence matrix), or using existing word vector calculation tools or models (for example, the word2vec, GloVe, or ELMo models). No limitation is made here.
  • word vectors can contain semantic features of words.
  • The similarity calculation may be performed using various similarity calculation methods, for example, Euclidean distance or cosine similarity.
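The two measures named above can be written in a few lines of plain Python. This is a generic sketch, not the patent's implementation; the function names are hypothetical, and returning 0.0 for a zero-length vector is an assumption made to avoid division by zero.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two word vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between two word vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```

Note that cosine similarity grows with closeness while Euclidean distance shrinks, so a distance would need to be converted before being used as a similarity score.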
  • The semantic similarity between the target word in the first text and the target word in the second text may be determined by the following semantic similarity determination steps. In the first step, it is determined whether the target word in the first text (denoted here as $w_i^A$) and the target word in the second text (denoted here as $w_j^B$) are the same.
  • In the second step, in response to determining that the target word in the first text is not the same as the target word in the second text, it may be determined separately whether the target word in the first text and the target word in the second text are unregistered words (out-of-vocabulary words, OOV).
  • The unregistered words may be words that are not included in the word segmentation vocabulary but must still be segmented.
  • In response to determining that the target word in the first text and/or the target word in the second text is an unregistered word, a first preset value (for example, 0) may be determined as the semantic similarity between the target word in the first text and the target word in the second text (denoted here as $\mathrm{Sim}(w_i^A, w_j^B)$).
  • Since unregistered words are not included in the vocabulary, their word vectors usually cannot be obtained, so the semantic similarity cannot be determined from word vectors. This implementation takes the existence of unregistered words into account.
  • When an unregistered word is present, the semantic similarity is set to the first preset value (for example, 0), so that a semantic similarity between the two target words can still be obtained. The words in the text are therefore considered more comprehensively, which improves the accuracy of the text similarity calculation.
  • In the third step, in response to determining that neither the target word in the first text nor the target word in the second text is an unregistered word, the execution subject may perform the following steps. First, determine the cosine similarity (denoted here as $\cos(v_i^A, v_j^B)$) between the word vector of the target word in the first text (denoted here as $v_i^A$) and the word vector of the target word in the second text (denoted here as $v_j^B$).
  • The word vector may be determined using various existing word vector calculation methods, or may be determined using an existing word vector calculation tool or model.
  • Then, a product of the cosine similarity and a first preset parameter (denoted here as $\alpha$) may be determined.
  • Finally, the sum of the product and a second preset parameter (denoted here as $\beta$) can be input to an objective function (such as a sigmoid function, denoted here as $\sigma$), and the value of the objective function is determined as the semantic similarity $\mathrm{Sim}(w_i^A, w_j^B)$ between the target word $w_i^A$ in the first text and the target word $w_j^B$ in the second text, that is:
    $$\mathrm{Sim}(w_i^A, w_j^B) = \sigma\left(\alpha \cdot \cos(v_i^A, v_j^B) + \beta\right)$$
  • Together, $\alpha$, $\beta$, and the function $\sigma$ map the cosine similarity to a specified numerical interval (for example, [0,1]).
  • The values of $\alpha$ and $\beta$ can be set as required. In practice, $\alpha$ can be set to a number greater than 0.
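  • In response to determining that the target word in the first text is the same as the target word in the second text, the execution subject may determine a second preset value (for example, 1) as the semantic similarity between the target word in the first text and the target word in the second text. Therefore, when the target words in the two texts are the same, the semantic similarity no longer needs to be calculated from word vectors and can be directly determined as the second preset value, which improves data processing efficiency.
  • For example, when the first preset value is 0 and the second preset value is 1, the semantic similarity between the target word $w_i^A$ in the first text and the target word $w_j^B$ in the second text can be determined according to the following formula, where $\sigma$ is the objective function, $\alpha$ and $\beta$ are the first and second preset parameters, and $v_i^A$ and $v_j^B$ are the word vectors of the two target words:
    $$\mathrm{Sim}(w_i^A, w_j^B) = \begin{cases} 1, & w_i^A \text{ and } w_j^B \text{ are the same} \\ 0, & w_i^A \text{ and/or } w_j^B \text{ is an unregistered word} \\ \sigma\left(\alpha \cdot \cos(v_i^A, v_j^B) + \beta\right), & \text{otherwise} \end{cases}$$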
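  • The three-step determination above can be sketched as follows. This is a minimal sketch under stated assumptions: the `vectors` dictionary stands in for whatever word vector model is used (a word missing from it is treated as unregistered), and the default values `alpha=4.0` and `beta=-2.0` for the first and second preset parameters are hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_similarity(word_a, word_b, vectors, alpha=4.0, beta=-2.0,
                        first_preset_value=0.0, second_preset_value=1.0):
    """Semantic similarity of two target words, following the steps in the text."""
    # Step 1: identical words get the second preset value; no vectors needed.
    if word_a == word_b:
        return second_preset_value
    # Step 2: if either word is unregistered (has no vector), use the first preset value.
    if word_a not in vectors or word_b not in vectors:
        return first_preset_value
    # Step 3: sigmoid of (alpha * cosine + beta).
    c = cosine(vectors[word_a], vectors[word_b])
    return 1.0 / (1.0 + math.exp(-(alpha * c + beta)))


# Hypothetical 2-dimensional vectors, for illustration only.
toy_vectors = {"cat": [1.0, 0.0], "kitten": [1.0, 0.0], "dog": [0.0, 1.0]}
print(semantic_similarity("cat", "kitten", toy_vectors))
print(semantic_similarity("cat", "zebra", toy_vectors))
```

  • Because the identity check comes first, two identical unregistered words still receive the second preset value, which matches the order of the steps described above.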
  • The cost of the replacement word operation can be determined by the following steps (taking as an example the cost $S(w_i^A, w_j^B)$ of replacing the word $w_i^A$ in the first text with the word $w_j^B$ in the second text):
  • The first step is to determine the word to be replaced in the first text, $w_i^A$, as the target word in the first text.
  • The second step is to determine the word in the second text that replaces the word to be replaced, $w_j^B$, as the target word in the second text.
  • The third step is to determine the semantic similarity $\mathrm{Sim}(w_i^A, w_j^B)$ between the target word in the first text and the target word in the second text.
  • The fourth step is to determine the difference between a third preset value (for example, 1) and the semantic similarity as the cost of the replacement word operation. For example, when the third preset value is 1, the cost of the replacement word operation is determined according to the following formula:
    $$S(w_i^A, w_j^B) = 1 - \mathrm{Sim}(w_i^A, w_j^B)$$
  • The cost of the word deletion operation can be determined by the following steps (taking as an example the cost $D(w_i^A)$ of deleting the word $w_i^A$ in the first text):
  • The first step is to take the word to be deleted in the first text, $w_i^A$, as the target word in the first text and to determine, one by one, the semantic similarity between each word in the second text and the target word in the first text. That is, determine $\mathrm{Sim}(w_i^A, w^B)$, where $w^B$ is a word in the second text.
  • The second step is to determine the word in the second text corresponding to the maximum semantic similarity (denoted here as $\max_{w^B} \mathrm{Sim}(w_i^A, w^B)$) as the target word in the second text, determine the product of the maximum similarity and a third preset parameter (denoted here as $\lambda_2$), and determine the difference between a fourth preset parameter (denoted here as $\lambda_1$) and the product as the cost of the word deletion operation, that is:
    $$D(w_i^A) = \lambda_1 - \lambda_2 \cdot \max_{w^B} \mathrm{Sim}(w_i^A, w^B)$$
  • Here, max represents the maximum value.
  • The cost of the word insertion operation can be determined by the following steps (taking as an example the cost $I(w_j^B)$ of inserting the word $w_j^B$ into the first text):
  • The first step is to take the word in the second text that is to be inserted into the first text, $w_j^B$, as the target word in the second text and to determine, one by one, the semantic similarity between each word in the first text and the target word in the second text. That is, determine $\mathrm{Sim}(w^A, w_j^B)$, where $w^A$ is a word in the first text.
  • The second step is to determine the word in the first text corresponding to the maximum semantic similarity (denoted here as $\max_{w^A} \mathrm{Sim}(w^A, w_j^B)$) as the target word in the first text, determine the product of the maximum similarity and the third preset parameter ($\lambda_2$), and determine the difference between the fourth preset parameter ($\lambda_1$) and the product as the cost of the word insertion operation, that is:
    $$I(w_j^B) = \lambda_1 - \lambda_2 \cdot \max_{w^A} \mathrm{Sim}(w^A, w_j^B)$$
  • Here, max represents the maximum value.
  • $\lambda_1$ (the fourth preset parameter) can be used to adjust the relative size of the cost of the word deletion operation or the cost of the word insertion operation.
  • $\lambda_1$ can be set to a value not less than 0.
  • $\lambda_2$ (the third preset parameter) can be used to adjust the degree of influence of the similarity.
  • $\lambda_2$ can be set to a value not less than 0 and not more than 1. When $\lambda_2$ is equal to 1, words that are the same but appear at different positions incur no cost.
  • values of the above parameters can be set in advance as needed, or can be set in advance based on a large amount of data statistics and experiments.
  • the specific values are not limited here.
  • The calculation formulas for the cost of the replacement word operation, the cost of the word deletion operation, and the cost of the word insertion operation are not limited to those listed in the above implementation, and can be set to other formulas that use the semantic similarity of the target words as a variable. This is not limited here.
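  • The three cost rules described above can be sketched as follows. This is an illustrative sketch, not the application's code: `lam1` and `lam2` are assumed names for the fourth and third preset parameters, `toy_sim` is a hypothetical stand-in for the semantic similarity function, and returning a maximum of 0.0 for an empty text is an assumption.

```python
def replace_cost(word_a, word_b, sim, third_preset_value=1.0):
    """Replacement cost: third preset value minus the semantic similarity."""
    return third_preset_value - sim(word_a, word_b)


def delete_cost(word_a, text_b, sim, lam1=1.0, lam2=1.0):
    """Deletion cost: lam1 minus lam2 times the best similarity to any word in text_b."""
    best = max((sim(word_a, w_b) for w_b in text_b), default=0.0)
    return lam1 - lam2 * best


def insert_cost(word_b, text_a, sim, lam1=1.0, lam2=1.0):
    """Insertion cost: lam1 minus lam2 times the best similarity to any word in text_a."""
    best = max((sim(w_a, word_b) for w_a in text_a), default=0.0)
    return lam1 - lam2 * best


# Hypothetical similarity table, for illustration only.
_TOY_TABLE = {("movie", "film"): 0.9}


def toy_sim(a, b):
    if a == b:
        return 1.0
    return _TOY_TABLE.get((a, b), _TOY_TABLE.get((b, a), 0.0))


print(replace_cost("movie", "film", toy_sim))
print(delete_cost("movie", ["a", "film"], toy_sim, lam1=1.0, lam2=0.5))
```

  • Passing the similarity function as an argument keeps the cost rules independent of how the word vectors are obtained.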
  • Step 202 Normalize the minimum editing distance, and determine the normalized value as the similarity between the first text and the second text.
  • the minimum editing distance is normalized, and the normalized value is determined as the similarity between the first text and the second text.
  • normalization refers to limiting the data to be processed to a specified range (through some algorithm). For example, convert a value to a value in the range [0,1]. Normalizing the minimum editing distance can facilitate data comparison and subsequent processing.
  • various existing normalization functions or formulas established in advance can be used to normalize the minimum editing distance obtained in step 201.
  • the number n of words in the word sequence constituting the first text may be determined first.
  • the number m of words in the word sequence constituting the second text can be determined.
  • the sum of the number of words in the word sequence constituting the two texts can be determined.
  • the ratio of the minimum editing distance to the sum of the quantities can be determined as the similarity between the first text and the second text.
  • The execution subject may first determine the number of words in the word sequences constituting the first text and the second text as the first number (n) and the second number (m), respectively.
  • Then, the execution subject may determine the similarity between the first text and the second text (denoted here as sim) based on the minimum edit distance, the first number, the second number, and a comparison between the fourth preset parameter (denoted here as $\lambda_1$) and a preset threshold (for example, 0.5).
  • In response to determining that the fourth preset parameter $\lambda_1$ is smaller than the preset threshold, the execution subject may perform the following steps: First, the sum of the first number n and the second number m may be determined as a first intermediate value. Then, the product of the first intermediate value and the fourth preset parameter $\lambda_1$ may be determined as a second intermediate value. After that, the ratio of the minimum editing distance (WED) to the second intermediate value may be determined. Finally, the difference between a fourth preset value (for example, 1) and the ratio may be determined as the similarity between the first text and the second text.
  • The specific value of the fourth preset value may be determined based on actual requirements, and is not limited herein.
  • In response to determining that the fourth preset parameter $\lambda_1$ is not less than the preset threshold, the execution subject may perform the following steps: First, the difference between the second number m and the first number n may be determined as a third intermediate value. Then, the product of the third intermediate value and the fourth preset parameter $\lambda_1$ may be determined as a fourth intermediate value. After that, the sum of the fourth intermediate value and the first number n may be determined as a fifth intermediate value. Then, the ratio of the minimum editing distance (WED) to the fifth intermediate value may be determined. Finally, the difference between the fourth preset value (for example, 1) and the ratio may be determined as the similarity between the first text and the second text.
  • For example, when the preset threshold is 0.5 and the fourth preset value is 1, the similarity between the first text and the second text may be determined with reference to the following formula:
    $$\mathrm{sim} = \begin{cases} 1 - \dfrac{\mathrm{WED}}{\lambda_1 (n + m)}, & \lambda_1 < 0.5 \\ 1 - \dfrac{\mathrm{WED}}{n + \lambda_1 (m - n)}, & \lambda_1 \ge 0.5 \end{cases}$$
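  • The two normalization branches described above can be sketched as follows. This is a minimal sketch: `lam1` is an assumed name for the fourth preset parameter, and the explicit error for a non-positive denominator is an added safeguard that the text does not describe.

```python
def normalized_similarity(wed, n, m, lam1, threshold=0.5, fourth_preset_value=1.0):
    """Turn a minimum edit distance into a similarity, following the two branches."""
    if n < 1 or m < 1:
        raise ValueError("both texts must contain at least one word")
    if lam1 < threshold:
        # Second intermediate value: lam1 * (n + m).
        denominator = lam1 * (n + m)
    else:
        # Fifth intermediate value: n + lam1 * (m - n).
        denominator = n + lam1 * (m - n)
    if denominator <= 0:
        # Assumption: such parameter combinations are rejected rather than normalized.
        raise ValueError("normalizer is not positive for these parameters")
    return fourth_preset_value - wed / denominator


print(normalized_similarity(wed=2.0, n=3, m=5, lam1=1.0))
print(normalized_similarity(wed=1.0, n=3, m=5, lam1=0.25))
```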
  • The execution subject may further display a similarity calculation result including the similarity.
  • In response to determining that the similarity is greater than a preset similarity threshold, a correspondence between the first text and the second text may be established, and correspondence information used to characterize the correspondence may be stored.
  • The first text or the second text may also be pushed to a specified user, or the like.
  • FIG. 3 is a schematic diagram of an application scenario of the method for generating information according to this embodiment.
  • the user first sends a similarity calculation request to the server 302 by using the terminal device 301, and the similarity calculation request includes a first text 303 and a second text 304 to be subjected to similarity calculation.
  • the server 302 determines a minimum editing distance for converting the first text into the second text by using a dynamic programming algorithm. Then, the minimum editing distance is normalized, and the normalized value is determined as the similarity 305 between the first text and the second text.
  • the server sends a similarity calculation result 306 including the similarity 305 to the terminal device.
  • The method provided by the foregoing embodiments of the present application uses a dynamic programming algorithm to determine a minimum editing distance for converting the first text into the second text by performing an editing operation on the first text, then normalizes the minimum editing distance and determines the normalized value as the similarity between the first text and the second text.
  • the editing operation is divided into an insertion word operation, a deletion word operation, and a replacement word operation.
  • the cost of the editing operation is determined based on the semantic similarity between the target word in the first text and the target word in the second text. Therefore, the order of the words in the text, the similarity of the word correspondence, and the alignment of the words can be considered at the same time, which improves the accuracy of the text similarity calculation.
  • a flowchart 400 of yet another embodiment of a method for generating information is shown.
  • the process 400 of the method for generating information includes the following steps:
  • Step 401 Use a dynamic programming algorithm to determine a minimum editing distance for converting a first text into a second text by performing an editing operation on the first text.
  • An execution subject of the method for generating information may use a dynamic programming algorithm to determine the minimum edit distance (denoted as WED) for converting the first text (denoted as A) into the second text (denoted as B) by performing editing operations on the first text.
  • The number of words constituting the first text can be expressed as n, and the number of words constituting the second text can be expressed as m. Both n and m are integers not less than 1.
  • the above editing operation can be divided into an insertion word operation, a deletion word operation, and a replacement word operation.
  • the cost of the editing operation can be divided into the cost of the delete word operation (can be represented by D), the cost of the insert word operation (can be represented by I), and the cost of the replacement word operation (can be represented by S).
  • the execution body may use a dynamic programming algorithm to determine the minimum editing distance for converting the first text into the second text based on the cost of each editing operation on the first text.
  • The state transition equation used can be as follows, where $w_i^A$ denotes the i-th word in the first text and $w_j^B$ denotes the j-th word in the second text:
    $$f_{i,j} = \min\left\{ f_{i-1,j} + D(w_i^A),\; f_{i,j-1} + I(w_j^B),\; f_{i-1,j-1} + S(w_i^A, w_j^B) \right\}$$
  • Here, i is an integer not less than 1 and not more than n.
  • j is an integer not less than 1 and not more than m.
  • $D(w_i^A)$ is the cost of deleting the word $w_i^A$ in the first text.
  • $I(w_j^B)$ is the cost of inserting the word $w_j^B$ into the first text.
  • $S(w_i^A, w_j^B)$ is the cost of replacing the word $w_i^A$ in the first text with the word $w_j^B$ in the second text.
  • $f_{i,j}$ represents the minimum cost of converting the first i (that is, the first to the i-th) words in the first text into the first j (that is, the first to the j-th) words in the second text.
  • $f_{i-1,j-1}$ represents the minimum cost of converting the first i-1 words in the first text into the first j-1 words in the second text.
  • $f_{i-1,j}$ represents the minimum cost of converting the first i-1 words in the first text into the first j words in the second text.
  • $f_{i,j-1}$ represents the minimum cost of converting the first i words in the first text into the first j-1 words in the second text.
  • min represents the minimum value.
  • The execution subject can use the dynamic programming algorithm to calculate the values of the state transition equation one by one based on the cost of each editing operation. The final value, $f_{n,m}$, is the minimum cost of converting the first text into the second text, that is, the minimum editing distance WED.
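  • The recurrence above can be sketched as a table-filling routine. This is a minimal sketch, not the application's code: the cost functions are passed in as callables, and the boundary rows (deleting or inserting every word when the other prefix is empty) are an assumption, since the text does not state them.

```python
def weighted_edit_distance(text_a, text_b, delete_cost, insert_cost, replace_cost):
    """Minimum total cost of turning word list text_a into word list text_b."""
    n, m = len(text_a), len(text_b)
    # f[i][j]: minimum cost of converting the first i words of A into the first j words of B.
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary conditions: delete all of A's prefix, or insert all of B's prefix.
    for i in range(1, n + 1):
        f[i][0] = f[i - 1][0] + delete_cost(text_a[i - 1])
    for j in range(1, m + 1):
        f[0][j] = f[0][j - 1] + insert_cost(text_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = min(
                f[i - 1][j] + delete_cost(text_a[i - 1]),
                f[i][j - 1] + insert_cost(text_b[j - 1]),
                f[i - 1][j - 1] + replace_cost(text_a[i - 1], text_b[j - 1]),
            )
    return f[n][m]


# With unit costs the routine reduces to the ordinary word-level edit distance.
def unit(_word):
    return 1.0


def mismatch(a, b):
    return 0.0 if a == b else 1.0


print(weighted_edit_distance(["the", "cat", "sat"], ["a", "cat", "sat", "down"],
                             unit, unit, mismatch))
```

  • Swapping `unit` and `mismatch` for similarity-based costs changes only the arguments, not the table-filling logic.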
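  • The cost of the editing operation may be determined based on the semantic similarity between the target word in the first text and the target word in the second text. Specifically, the costs may be determined according to the following formulas, where $\mathrm{Sim}(\cdot,\cdot)$ denotes the semantic similarity between two words, $\lambda_1$ denotes the fourth preset parameter, and $\lambda_2$ denotes the third preset parameter:
    $$S(w_i^A, w_j^B) = 1 - \mathrm{Sim}(w_i^A, w_j^B)$$
    $$D(w_i^A) = \lambda_1 - \lambda_2 \cdot \max_{w^B} \mathrm{Sim}(w_i^A, w^B)$$
    $$I(w_j^B) = \lambda_1 - \lambda_2 \cdot \max_{w^A} \mathrm{Sim}(w^A, w_j^B)$$
  • $\lambda_1$ can be used to adjust the relative size of the cost of the word deletion operation or the cost of the word insertion operation.
  • $\lambda_1$ can be set to a value not less than 0.
  • $\lambda_2$ can be used to adjust the degree of influence of the similarity.
  • $\lambda_2$ can be set to a value not less than 0 and not more than 1.
  • When $\lambda_2$ is equal to 1, words that are the same but appear at different positions incur no cost.
  • The values of $\lambda_1$ and $\lambda_2$ can be set in advance as required, or can be set in advance based on a large amount of data statistics and experiments. The specific values are not limited here.
  • $\max_{w^B} \mathrm{Sim}(w_i^A, w^B)$ is the maximum value of the semantic similarity between the target word $w_i^A$ in the first text and each word $w^B$ in the second text.
  • $\max_{w^A} \mathrm{Sim}(w^A, w_j^B)$ is the maximum value of the semantic similarity between the target word $w_j^B$ in the second text and each word $w^A$ in the first text.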
  • The semantic similarity $\mathrm{Sim}(w_i^A, w_j^B)$ between the target word $w_i^A$ in the first text and the target word $w_j^B$ in the second text can be determined through the following semantic similarity determination steps:
  • The first step is to determine whether $w_i^A$ and $w_j^B$ are the same. In response to determining that they are the same, the second preset value (for example, 1) can be determined as the semantic similarity between $w_i^A$ and $w_j^B$. Therefore, when the target words in the two texts are the same, the semantic similarity no longer needs to be calculated from word vectors and can be directly determined as the second preset value, which improves data processing efficiency.
  • The second step is, in response to determining that $w_i^A$ and $w_j^B$ are not the same, to determine separately whether $w_i^A$ and $w_j^B$ are unregistered words.
  • In response to determining that $w_i^A$ and/or $w_j^B$ is an unregistered word, the first preset value (such as 0) can be determined as the semantic similarity between $w_i^A$ and $w_j^B$. Since unregistered words are not included in the vocabulary, their word vectors usually cannot be obtained, so the semantic similarity cannot be determined from word vectors. This implementation takes the existence of unregistered words into account.
  • When an unregistered word is present, the semantic similarity is set to the first preset value (for example, 0), so that a semantic similarity between the two target words can still be obtained. The words in the text are therefore considered more comprehensively, which improves the accuracy of the text similarity calculation.
  • The third step is, in response to determining that neither $w_i^A$ nor $w_j^B$ is an unregistered word, to first determine the cosine similarity $\cos(v_i^A, v_j^B)$ between the word vector $v_i^A$ of $w_i^A$ and the word vector $v_j^B$ of $w_j^B$, and then to determine the semantic similarity between $w_i^A$ and $w_j^B$ according to the following formula:
    $$\mathrm{Sim}(w_i^A, w_j^B) = \sigma\left(\alpha \cdot \cos(v_i^A, v_j^B) + \beta\right)$$
  • Here, $\sigma$ is the objective function (for example, a sigmoid function), and $\alpha$ and $\beta$ are the first and second preset parameters. Together, they map the cosine similarity to a specified numerical interval (for example, [0,1]).
  • The values of $\alpha$ and $\beta$ can be set as required. In practice, $\alpha$ can be set to a number greater than 0.
  • The semantic similarity between the target word $w_i^A$ in the first text and each word $w^B$ in the second text can be determined with reference to the calculation method for $\mathrm{Sim}(w_i^A, w_j^B)$.
  • The semantic similarity between the target word $w_j^B$ in the second text and each word $w^A$ in the first text can also be determined with reference to the calculation method for $\mathrm{Sim}(w_i^A, w_j^B)$. This is not repeated here.
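  • As a worked illustration of how the sigmoid maps a cosine similarity into [0,1], write the first and second preset parameters as $\alpha$ and $\beta$ and take the hypothetical values $\alpha = 4$ and $\beta = -2$ (both the symbols and the values are assumptions made for this example, not values given in the application):

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}

\cos = 1:   \quad \sigma(4 \cdot 1 - 2)   = \sigma(2)  \approx 0.881
\cos = 0.5: \quad \sigma(4 \cdot 0.5 - 2) = \sigma(0)  = 0.5
\cos = 0:   \quad \sigma(4 \cdot 0 - 2)   = \sigma(-2) \approx 0.119
```

  • With these values, a larger $\alpha$ would push the outputs further toward 0 and 1, while $\beta$ shifts the cosine value that maps to 0.5.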
  • Step 402 Determine the number of words in the word sequence constituting the first text and the second text as the first number and the second number, respectively.
  • the above-mentioned execution subject may respectively determine the number of words in the word sequence constituting the first text and the second text as the first number (represented as n) and the second number (represented as m).
  • Step 403 Determine the similarity between the first text and the second text based on the comparison between the minimum edit distance, the first number, the second number, the fourth preset parameter and the preset threshold.
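  • The similarity between the first text and the second text (denoted here as sim) may be determined with reference to the following formula, where WED is the minimum edit distance, n and m are the first number and the second number, $\lambda_1$ is the fourth preset parameter, and the preset threshold and the fourth preset value are taken as 0.5 and 1, respectively, as an example:
    $$\mathrm{sim} = \begin{cases} 1 - \dfrac{\mathrm{WED}}{\lambda_1 (n + m)}, & \lambda_1 < 0.5 \\ 1 - \dfrac{\mathrm{WED}}{n + \lambda_1 (m - n)}, & \lambda_1 \ge 0.5 \end{cases}$$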
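  • As a worked illustration of this step, take hypothetical values n = 3 and m = 5, write the fourth preset parameter as $\lambda_1$ (an assumed symbol), and assume a preset threshold of 0.5 and a fourth preset value of 1:

```latex
\lambda_1 = 1 \ (\ge 0.5),\ \mathrm{WED} = 2: \quad
\mathrm{sim} = 1 - \frac{2}{3 + 1 \cdot (5 - 3)} = 1 - \frac{2}{5} = 0.6

\lambda_1 = 0.25 \ (< 0.5),\ \mathrm{WED} = 1: \quad
\mathrm{sim} = 1 - \frac{1}{0.25 \cdot (3 + 5)} = 1 - \frac{1}{2} = 0.5
```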
  • Step 404 Display a similarity calculation result including the similarity.
  • the execution subject may display a similarity calculation result including the above-mentioned similarity, so as to present the similarity calculation result to a user for the user to view.
  • The process 400 of the method for generating information in this embodiment presents the calculation process for the minimum editing distance and for determining the text similarity based on the minimum editing distance. Therefore, the solution described in this embodiment can simultaneously consider the order of words in the text and the similarity of the corresponding words, thereby improving the accuracy of the text similarity calculation. At the same time, the parameters can be flexibly adjusted according to the task, so that the order of words in the text and the semantic similarity can be used to different degrees, which improves the flexibility of the text similarity calculation.
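  • The effect of adjusting the parameters can be sketched end to end as follows. This is an illustrative sketch under stated assumptions: `lam1` and `lam2` are assumed names for the fourth and third preset parameters, `exact_match_sim` is a stand-in for the word-vector-based semantic similarity, the boundary rows of the table are assumed, and the threshold and preset values are the example values 0.5 and 1.

```python
def exact_match_sim(a, b):
    """Stand-in similarity (an assumption): 1.0 for identical words, else 0.0."""
    return 1.0 if a == b else 0.0


def text_similarity(words_a, words_b, sim=exact_match_sim,
                    lam1=1.0, lam2=1.0, threshold=0.5):
    n, m = len(words_a), len(words_b)
    if n < 1 or m < 1:
        raise ValueError("both texts must contain at least one word")
    # Deletion and insertion costs: lam1 - lam2 * (best similarity in the other text).
    d = [lam1 - lam2 * max(sim(a, b) for b in words_b) for a in words_a]
    ins = [lam1 - lam2 * max(sim(a, b) for a in words_a) for b in words_b]
    # Dynamic programming table; the boundary rows are an assumption.
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        f[i][0] = f[i - 1][0] + d[i - 1]
    for j in range(1, m + 1):
        f[0][j] = f[0][j - 1] + ins[j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1.0 - sim(words_a[i - 1], words_b[j - 1])  # replacement cost
            f[i][j] = min(f[i - 1][j] + d[i - 1],
                          f[i][j - 1] + ins[j - 1],
                          f[i - 1][j - 1] + s)
    wed = f[n][m]
    denominator = lam1 * (n + m) if lam1 < threshold else n + lam1 * (m - n)
    if denominator <= 0:
        raise ValueError("normalizer is not positive for these parameters")
    return 1.0 - wed / denominator


a, b = "red apple".split(), "apple red".split()
print(text_similarity(a, b, lam2=1.0))  # word order is ignored when lam2 = 1
print(text_similarity(a, b, lam2=0.0))  # word order is fully penalized when lam2 = 0
```

  • In this toy setting, intermediate values of `lam2` trade off sensitivity to word order against sensitivity to which words occur at all.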
  • this application provides an embodiment of an apparatus for generating information.
  • the apparatus embodiment corresponds to the method embodiment shown in FIG. 2.
  • the device can be specifically applied to various electronic devices.
  • The apparatus 500 for generating information includes: a first determining unit 501 configured to use a dynamic programming algorithm to determine the minimum editing distance for converting the first text into the second text by performing an editing operation on the first text, where the minimum editing distance is determined based on the cost of the editing operation, the cost of the editing operation is determined based on the semantic similarity between the target word in the first text and the target word in the second text, the target word is a word involved in the editing operation, and the editing operation is divided into a word deletion operation, a word insertion operation, and a word replacement operation; and a second determining unit 502 configured to normalize the minimum editing distance and determine the normalized value as the similarity between the first text and the second text.
  • The first determining unit 501 may be configured to perform the following semantic similarity determining steps: determining whether the target word in the first text and the target word in the second text are the same; if not, determining separately whether the target word in the first text and the target word in the second text are unregistered words; and in response to determining that the target word in the first text and/or the target word in the second text is an unregistered word, determining the first preset value as the semantic similarity between the target word in the first text and the target word in the second text.
  • The semantic similarity determining steps may further include: in response to determining that the target word in the first text is not an unregistered word and the target word in the second text is not an unregistered word, performing the following steps: determining the cosine similarity between the word vector of the target word in the first text and the word vector of the target word in the second text; determining the product of the cosine similarity and the first preset parameter; and inputting the sum of the product and the second preset parameter to an objective function, and determining the value of the objective function as the semantic similarity between the target word in the first text and the target word in the second text.
  • The semantic similarity determining steps may further include: in response to determining that the target word in the first text is the same as the target word in the second text, determining the second preset value as the semantic similarity between the target word in the first text and the target word in the second text.
  • The first determining unit 501 may be further configured to perform the following steps to determine the cost of the replacement word operation: determining the word to be replaced in the first text as the target word in the first text; determining the word in the second text that replaces the word to be replaced as the target word in the second text; determining the semantic similarity between the target word in the first text and the target word in the second text; and determining the difference between the third preset value and the semantic similarity as the cost of the replacement word operation.
  • The first determining unit 501 may be further configured to perform the following steps to determine the cost of the word deletion operation: taking the word to be deleted in the first text as the target word in the first text, and determining, one by one, the semantic similarity between each word in the second text and the target word in the first text; and determining the word in the second text corresponding to the maximum semantic similarity as the target word in the second text, determining the product of the maximum similarity and the third preset parameter, and determining the difference between the fourth preset parameter and the product as the cost of the word deletion operation.
  • The first determining unit 501 may be further configured to perform the following steps to determine the cost of the word insertion operation: taking the word in the second text that is to be inserted into the first text as the target word in the second text, and determining, one by one, the semantic similarity between each word in the first text and the target word in the second text; and determining the word in the first text corresponding to the maximum semantic similarity as the target word in the first text, determining the product of the maximum similarity and the third preset parameter, and determining the difference between the fourth preset parameter and the product as the cost of the word insertion operation.
  • the foregoing second determination unit 502 may include a first determination module and a second determination module (not shown in the figure).
  • the first determining module may be configured to determine the number of words in the word sequence constituting the first text and the second text as the first number and the second number, respectively.
  • The second determining module may be configured to determine the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and a comparison between the fourth preset parameter and a preset threshold.
  • The second determining module may be further configured to: in response to determining that the fourth preset parameter is smaller than the preset threshold, perform the following steps: determining the sum of the first number and the second number as a first intermediate value; determining the product of the first intermediate value and the fourth preset parameter as a second intermediate value; determining the ratio of the minimum edit distance to the second intermediate value; and determining the difference between the fourth preset value and the ratio as the similarity between the first text and the second text.
  • The second determining module may be further configured to: in response to determining that the fourth preset parameter is not less than the preset threshold, perform the following steps: determining the difference between the second number and the first number as a third intermediate value; determining the product of the third intermediate value and the fourth preset parameter as a fourth intermediate value; determining the sum of the fourth intermediate value and the first number as a fifth intermediate value; determining the ratio of the minimum editing distance to the fifth intermediate value; and determining the difference between the fourth preset value and the ratio as the similarity between the first text and the second text.
  • the device may further include a display unit or a storage unit (not shown in the figure).
  • the display unit may be configured to display a similarity calculation result including the similarity.
  • the storage unit may be configured to, in response to determining that the similarity is greater than a preset similarity threshold, establish a correspondence between the first text and the second text, and store correspondence information used to characterize the correspondence.
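  • The storage unit's behavior can be sketched as follows. This is a minimal sketch: the in-memory dictionary, the function name, and the default threshold of 0.8 are all hypothetical, since the text does not specify how or where the correspondence information is stored.

```python
def store_if_similar(store, text_a, text_b, similarity, similarity_threshold=0.8):
    """Record a correspondence between two texts when they are similar enough."""
    if similarity > similarity_threshold:
        # The stored value is the correspondence information for this pair.
        store[(text_a, text_b)] = {"similarity": similarity}
        return True
    return False


correspondences = {}
store_if_similar(correspondences, "good film", "great movie", 0.91)
store_if_similar(correspondences, "good film", "bad weather", 0.12)
print(correspondences)
```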
  • The first determining unit 501 uses a dynamic programming algorithm to determine a minimum editing distance for converting the first text into the second text by performing an editing operation on the first text, and the second determining unit 502 then normalizes the minimum editing distance and determines the normalized value as the similarity between the first text and the second text.
  • the editing operation is divided into an insertion word operation, a deletion word operation, and a replacement word operation.
  • the cost of the editing operation is determined based on the semantic similarity between the target word in the first text and the target word in the second text. Therefore, the order of the words in the text, the similarity of the word correspondence, and the alignment of the words can be considered at the same time, which improves the accuracy of the text similarity calculation.
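  • The division of the apparatus into two units can be sketched as a small composition of callables. This is an architectural sketch only: the class name is hypothetical, and the two toy units below are deliberately simple stand-ins (a position-wise mismatch count and a length-based normalizer), not the dynamic programming and normalization described in the text.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class InformationGeneratingApparatus:
    # First determining unit: (words_a, words_b) -> minimum editing distance.
    first_determining_unit: Callable[[Sequence[str], Sequence[str]], float]
    # Second determining unit: (distance, n, m) -> normalized similarity.
    second_determining_unit: Callable[[float, int, int], float]

    def generate(self, words_a: Sequence[str], words_b: Sequence[str]) -> float:
        distance = self.first_determining_unit(words_a, words_b)
        return self.second_determining_unit(distance, len(words_a), len(words_b))


def toy_distance(a, b):
    """Stand-in distance: position-wise mismatches plus the length difference."""
    return float(abs(len(a) - len(b)) + sum(x != y for x, y in zip(a, b)))


def toy_normalizer(distance, n, m):
    """Stand-in normalizer: one minus distance over the longer length."""
    return 1.0 - distance / max(n, m)


apparatus = InformationGeneratingApparatus(toy_distance, toy_normalizer)
print(apparatus.generate(["a", "b"], ["a", "c"]))
```

  • Keeping the two units as separate callables mirrors the text's point that each unit can be implemented in software or hardware independently of the other.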
  • FIG. 6 illustrates a schematic structural diagram of a computer system 600 suitable for implementing an electronic device according to an embodiment of the present application.
  • the electronic device shown in FIG. 6 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
  • The computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603.
  • In the RAM 603, various programs and data required for the operation of the system 600 are also stored.
  • the CPU 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
  • An input / output (I / O) interface 605 is also connected to the bus 604.
  • The following components are connected to the I / O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage portion 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, and the like.
  • the communication section 609 performs communication processing via a network such as the Internet.
  • A drive 610 is also connected to the I / O interface 605 as necessary.
  • a removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 610 as needed, so that a computer program read therefrom is installed into the storage section 608 as needed.
  • the process described above with reference to the flowchart may be implemented as a computer software program.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing a method shown in a flowchart.
  • the computer program may be downloaded and installed from a network through the communication portion 609, and / or installed from a removable medium 611.
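  • When the computer program is executed by the central processing unit (CPU) 601, the functions defined in the method of the present application are performed.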
  • the computer-readable medium described in this application may be a computer-readable signal medium or a computer-readable storage medium or any combination of the foregoing.
  • The computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal that is included in baseband or propagated as part of a carrier wave, and which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and the computer-readable medium may send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Each block in the flowchart or block diagram may represent a module, a program segment, or a part of code, which contains one or more executable instructions for implementing a specified logical function.
  • the functions noted in the blocks may also occur in a different order than those marked in the drawings. For example, two successively represented boxes may actually be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved.
  • Each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified function or operation, or by a combination of dedicated hardware and computer instructions.
  • the units described in the embodiments of the present application may be implemented by software or hardware.
  • the described unit may also be provided in a processor, for example, it may be described as: a processor includes a first determining unit and a second determining unit. Among them, the names of these units do not constitute a limitation on the unit itself in some cases.
  • For example, the first determining unit may also be described as "a unit that uses a dynamic programming algorithm to determine the minimum edit distance for converting the first text into the second text."
  • the present application also provides a computer-readable medium, which may be included in the device described in the foregoing embodiments; or may exist alone without being assembled into the device.
  • The computer-readable medium carries one or more programs.
  • When the one or more programs are executed by the device, the device is caused to: use a dynamic programming algorithm to determine the minimum edit distance for converting a first text into a second text by performing editing operations on the first text, wherein the minimum edit distance is determined based on the costs of the editing operations, the cost of an editing operation is determined based on the semantic similarity between the target word in the first text and the target word in the second text, a target word is a word involved in an editing operation, and the editing operations are divided into word deletion operations, word insertion operations, and word replacement operations; and normalize the minimum edit distance and determine the normalized value as the similarity between the first text and the second text.
  • The semantic similarity between the target word in the first text and the target word in the second text may be determined by the following semantic similarity determination steps: determining whether the target word in the first text is the same as the target word in the second text; if not, determining, for each of the two target words, whether it is an out-of-vocabulary word; and in response to determining that the target word in the first text and/or the target word in the second text is an out-of-vocabulary word, determining a first preset value as the semantic similarity between the two target words.
  • The semantic similarity determination steps may further include: in response to determining that neither the target word in the first text nor the target word in the second text is an out-of-vocabulary word, performing the following steps: determining the cosine similarity between the word vector of the target word in the first text and the word vector of the target word in the second text; determining the product of the cosine similarity and a first preset parameter; and inputting the sum of the product and a second preset parameter into a target function, and determining the value of the target function as the semantic similarity between the two target words.
  • The semantic similarity determination steps may further include: in response to determining that the target word in the first text is the same as the target word in the second text, determining a second preset value as the semantic similarity between the two target words.
  • The cost of a word replacement operation may be determined by the following steps: determining the word to be replaced in the first text as the target word in the first text; determining the word in the second text that is used to replace the word to be replaced as the target word in the second text; determining the semantic similarity between the two target words; and determining the difference between a third preset value and the semantic similarity as the cost of the word replacement operation.
  • The cost of a word deletion operation may be determined by the following steps: taking the word to be deleted from the first text as the target word in the first text, and determining, one by one, the semantic similarity between each word in the second text and the target word in the first text; determining the word in the second text that corresponds to the maximum semantic similarity as the target word in the second text; determining the product of the maximum similarity and a third preset parameter; and determining the difference between a fourth preset parameter and the product as the cost of the word deletion operation.
  • The cost of a word insertion operation may be determined by the following steps: taking the word of the second text that is to be inserted into the first text as the target word in the second text, and determining, one by one, the semantic similarity between each word in the first text and the target word in the second text; determining the word in the first text that corresponds to the maximum semantic similarity as the target word in the first text; determining the product of the maximum similarity and the third preset parameter; and determining the difference between the fourth preset parameter and the product as the cost of the word insertion operation.
  • Normalizing the minimum edit distance and determining the normalized value as the similarity between the first text and the second text may include: determining the numbers of words in the word sequences constituting the first text and the second text as a first number and a second number, respectively; and determining the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and a comparison of the fourth preset parameter with a preset threshold.
  • Determining the similarity in this way may include: in response to determining that the fourth preset parameter is less than the preset threshold, performing the following steps: determining the sum of the first number and the second number as a first intermediate value; determining the product of the first intermediate value and the fourth preset parameter as a second intermediate value; determining the ratio of the minimum edit distance to the second intermediate value; and determining the difference between a fourth preset value and the ratio as the similarity between the first text and the second text.
  • Determining the similarity in this way may also include: in response to determining that the fourth preset parameter is not less than the preset threshold, performing the following steps: determining the difference between the second number and the first number as a third intermediate value; determining the product of the third intermediate value and the fourth preset parameter as a fourth intermediate value; determining the sum of the fourth intermediate value and the first number as a fifth intermediate value; determining the ratio of the minimum edit distance to the fifth intermediate value; and determining the difference between the fourth preset value and the ratio as the similarity between the first text and the second text.
  • After the similarity between the first text and the second text is determined, a similarity calculation result containing the similarity may also be displayed; or, in response to determining that the similarity is greater than a preset similarity threshold, a correspondence between the first text and the second text may be established, and correspondence information representing the correspondence may be stored.

Abstract

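A method and an apparatus for generating information, which improve the accuracy of text similarity calculation. The method includes: using a dynamic programming algorithm to determine the minimum edit distance for converting a first text into a second text by performing editing operations on the first text (201), wherein the minimum edit distance is determined based on the costs of the editing operations, the cost of an editing operation is determined based on the semantic similarity between a target word in the first text and a target word in the second text, a target word is a word involved in an editing operation, and the editing operations are divided into word deletion operations, word insertion operations, and word replacement operations; and normalizing the minimum edit distance and determining the normalized value as the similarity between the first text and the second text (202).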

Description

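Method and apparatus for generating information
Technical Field
Embodiments of the present application relate to the field of computer technology, and in particular to a method and an apparatus for generating information.
Background
When sufficient labeled data is lacking, supervised text similarity calculation methods cannot be applied. In practice, most tasks suffer from a lack of labeled data. Unsupervised text similarity calculation methods are therefore commonly used to calculate the similarity between texts.
In related approaches, methods such as the bag-of-words model are usually used for text similarity calculation.
Summary
The embodiments of the present application propose a method and an apparatus for generating information.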
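In a first aspect, an embodiment of the present application provides a method for generating information. The method includes: using a dynamic programming algorithm to determine the minimum edit distance for converting a first text into a second text by performing editing operations on the first text, wherein the minimum edit distance is determined based on the costs of the editing operations, the cost of an editing operation is determined based on the semantic similarity between a target word in the first text and a target word in the second text, a target word is a word involved in an editing operation, and the editing operations are divided into word deletion operations, word insertion operations, and word replacement operations; and normalizing the minimum edit distance and determining the normalized value as the similarity between the first text and the second text.
In some embodiments, the semantic similarity between the target word in the first text and the target word in the second text is determined by the following semantic similarity determination steps: determining whether the target word in the first text is the same as the target word in the second text; if not, determining, for each of the two target words, whether it is an out-of-vocabulary word; and in response to determining that the target word in the first text and/or the target word in the second text is an out-of-vocabulary word, determining a first preset value as the semantic similarity between the two target words.
In some embodiments, the semantic similarity determination steps further include: in response to determining that neither the target word in the first text nor the target word in the second text is an out-of-vocabulary word, performing the following steps: determining the cosine similarity between the word vector of the target word in the first text and the word vector of the target word in the second text; determining the product of the cosine similarity and a first preset parameter; and inputting the sum of the product and a second preset parameter into a target function, and determining the value of the target function as the semantic similarity between the two target words.
In some embodiments, the semantic similarity determination steps further include: in response to determining that the target word in the first text is the same as the target word in the second text, determining a second preset value as the semantic similarity between the two target words.
In some embodiments, the cost of a word replacement operation is determined by the following steps: determining the word to be replaced in the first text as the target word in the first text; determining the word in the second text that is used to replace the word to be replaced as the target word in the second text; determining the semantic similarity between the two target words; and determining the difference between a third preset value and the semantic similarity as the cost of the word replacement operation.
In some embodiments, the cost of a word deletion operation is determined by the following steps: taking the word to be deleted from the first text as the target word in the first text, and determining, one by one, the semantic similarity between each word in the second text and the target word in the first text; determining the word in the second text that corresponds to the maximum semantic similarity as the target word in the second text; determining the product of the maximum similarity and a third preset parameter; and determining the difference between a fourth preset parameter and the product as the cost of the word deletion operation.
In some embodiments, the cost of a word insertion operation is determined by the following steps: taking the word of the second text that is to be inserted into the first text as the target word in the second text, and determining, one by one, the semantic similarity between each word in the first text and the target word in the second text; determining the word in the first text that corresponds to the maximum semantic similarity as the target word in the first text; determining the product of the maximum similarity and the third preset parameter; and determining the difference between the fourth preset parameter and the product as the cost of the word insertion operation.
In some embodiments, normalizing the minimum edit distance and determining the normalized value as the similarity between the first text and the second text includes: determining the numbers of words in the word sequences constituting the first text and the second text as a first number and a second number, respectively; and determining the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and a comparison of the fourth preset parameter with a preset threshold.
In some embodiments, determining the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and the comparison of the fourth preset parameter with the preset threshold includes: in response to determining that the fourth preset parameter is less than the preset threshold, performing the following steps: determining the sum of the first number and the second number as a first intermediate value; determining the product of the first intermediate value and the fourth preset parameter as a second intermediate value; determining the ratio of the minimum edit distance to the second intermediate value; and determining the difference between a fourth preset value and the ratio as the similarity between the first text and the second text.
In some embodiments, determining the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and the comparison of the fourth preset parameter with the preset threshold includes: in response to determining that the fourth preset parameter is not less than the preset threshold, performing the following steps: determining the difference between the second number and the first number as a third intermediate value; determining the product of the third intermediate value and the fourth preset parameter as a fourth intermediate value; determining the sum of the fourth intermediate value and the first number as a fifth intermediate value; determining the ratio of the minimum edit distance to the fifth intermediate value; and determining the difference between the fourth preset value and the ratio as the similarity between the first text and the second text.
In some embodiments, the method further includes: displaying a similarity calculation result containing the similarity; or, in response to determining that the similarity is greater than a preset similarity threshold, establishing a correspondence between the first text and the second text, and storing correspondence information representing the correspondence.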
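In a second aspect, an embodiment of the present application provides an apparatus for generating information. The apparatus includes: a first determining unit configured to use a dynamic programming algorithm to determine the minimum edit distance for converting a first text into a second text by performing editing operations on the first text, wherein the minimum edit distance is determined based on the costs of the editing operations, the cost of an editing operation is determined based on the semantic similarity between a target word in the first text and a target word in the second text, a target word is a word involved in an editing operation, and the editing operations are divided into word deletion operations, word insertion operations, and word replacement operations; and a second determining unit configured to normalize the minimum edit distance and determine the normalized value as the similarity between the first text and the second text.
In some embodiments, the first determining unit is further configured to perform the following semantic similarity determination steps: determining whether the target word in the first text is the same as the target word in the second text; if not, determining, for each of the two target words, whether it is an out-of-vocabulary word; and in response to determining that the target word in the first text and/or the target word in the second text is an out-of-vocabulary word, determining a first preset value as the semantic similarity between the two target words.
In some embodiments, the semantic similarity determination steps further include: in response to determining that neither the target word in the first text nor the target word in the second text is an out-of-vocabulary word, performing the following steps: determining the cosine similarity between the word vector of the target word in the first text and the word vector of the target word in the second text; determining the product of the cosine similarity and a first preset parameter; and inputting the sum of the product and a second preset parameter into a target function, and determining the value of the target function as the semantic similarity between the two target words.
In some embodiments, the semantic similarity determination steps further include: in response to determining that the target word in the first text is the same as the target word in the second text, determining a second preset value as the semantic similarity between the two target words.
In some embodiments, the first determining unit is further configured to perform the following steps: determining the word to be replaced in the first text as the target word in the first text; determining the word in the second text that is used to replace the word to be replaced as the target word in the second text; determining the semantic similarity between the two target words; and determining the difference between a third preset value and the semantic similarity as the cost of the word replacement operation.
In some embodiments, the first determining unit is further configured to perform the following steps: taking the word to be deleted from the first text as the target word in the first text, and determining, one by one, the semantic similarity between each word in the second text and the target word in the first text; determining the word in the second text that corresponds to the maximum semantic similarity as the target word in the second text; determining the product of the maximum similarity and a third preset parameter; and determining the difference between a fourth preset parameter and the product as the cost of the word deletion operation.
In some embodiments, the first determining unit is further configured to perform the following steps: taking the word of the second text that is to be inserted into the first text as the target word in the second text, and determining, one by one, the semantic similarity between each word in the first text and the target word in the second text; determining the word in the first text that corresponds to the maximum semantic similarity as the target word in the first text; determining the product of the maximum similarity and the third preset parameter; and determining the difference between the fourth preset parameter and the product as the cost of the word insertion operation.
In some embodiments, the second determining unit includes: a first determining module configured to determine the numbers of words in the word sequences constituting the first text and the second text as a first number and a second number, respectively; and a second determining module configured to determine the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and a comparison of the fourth preset parameter with a preset threshold.
In some embodiments, the second determining module is further configured to: in response to determining that the fourth preset parameter is less than the preset threshold, perform the following steps: determining the sum of the first number and the second number as a first intermediate value; determining the product of the first intermediate value and the fourth preset parameter as a second intermediate value; determining the ratio of the minimum edit distance to the second intermediate value; and determining the difference between a fourth preset value and the ratio as the similarity between the first text and the second text.
In some embodiments, the second determining module is further configured to: in response to determining that the fourth preset parameter is not less than the preset threshold, perform the following steps: determining the difference between the second number and the first number as a third intermediate value; determining the product of the third intermediate value and the fourth preset parameter as a fourth intermediate value; determining the sum of the fourth intermediate value and the first number as a fifth intermediate value; determining the ratio of the minimum edit distance to the fifth intermediate value; and determining the difference between the fourth preset value and the ratio as the similarity between the first text and the second text.
In some embodiments, the apparatus further includes: a display unit configured to display a similarity calculation result containing the similarity; or a storage unit configured to, in response to determining that the similarity is greater than a preset similarity threshold, establish a correspondence between the first text and the second text and store correspondence information representing the correspondence.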
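In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to: use a dynamic programming algorithm to determine the minimum edit distance for converting a first text into a second text by performing editing operations on the first text, wherein the minimum edit distance is determined based on the costs of the editing operations, the cost of an editing operation is determined based on the semantic similarity between a target word in the first text and a target word in the second text, a target word is a word involved in an editing operation, and the editing operations are divided into word deletion operations, word insertion operations, and word replacement operations; and normalize the minimum edit distance and determine the normalized value as the similarity between the first text and the second text.
In a fourth aspect, an embodiment of the present application provides a computer-readable medium storing a computer program which, when executed by a processor, causes the processor to: use a dynamic programming algorithm to determine the minimum edit distance for converting a first text into a second text by performing editing operations on the first text, wherein the minimum edit distance is determined based on the costs of the editing operations, the cost of an editing operation is determined based on the semantic similarity between a target word in the first text and a target word in the second text, a target word is a word involved in an editing operation, and the editing operations are divided into word deletion operations, word insertion operations, and word replacement operations; and normalize the minimum edit distance and determine the normalized value as the similarity between the first text and the second text.
The method and apparatus for generating information provided by the embodiments of the present application use a dynamic programming algorithm to determine the minimum edit distance for converting a first text into a second text by performing editing operations on the first text, normalize the minimum edit distance, and determine the normalized value as the similarity between the first text and the second text. The editing operations are divided into word insertion, word deletion, and word replacement operations, and the cost of an editing operation is determined based on the semantic similarity between the target word in the first text and the target word in the second text. Both the order of the words in the texts and the similarity of corresponding words are thereby taken into account, which improves the accuracy of text similarity calculation.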
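Brief Description of the Drawings
Other features, objects, and advantages of the present application will become more apparent upon reading the detailed description of non-limiting embodiments given with reference to the following drawings:
Fig. 1 is a diagram of an exemplary system architecture to which an embodiment of the present application may be applied;
Fig. 2 is a flowchart of an embodiment of the method for generating information according to the present application;
Fig. 3 is a schematic diagram of an application scenario of the method for generating information according to the present application;
Fig. 4 is a flowchart of another embodiment of the method for generating information according to the present application;
Fig. 5 is a schematic structural diagram of an embodiment of the apparatus for generating information according to the present application;
Fig. 6 is a schematic structural diagram of a computer system suitable for implementing an electronic device of the embodiments of the present application.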
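Detailed Description
The present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention.
It should be noted that, where no conflict arises, the embodiments of the present application and the features of the embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which the method for generating information or the apparatus for generating information of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, for example to receive or send messages. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as text editing applications, news browsing applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices capable of network communication, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, and desktop computers. When they are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server that provides various services, for example a back-end server that processes the texts uploaded by the terminal devices 101, 102, 103. The back-end server may analyze and otherwise process the texts and generate a processing result (for example, a similarity).
It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
It should be noted that the method for generating information provided by the embodiments of the present application is generally executed by the server 105; accordingly, the apparatus for generating information is generally provided in the server 105.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as required by the implementation.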
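Continuing with Fig. 2, a flow 200 of an embodiment of the method for generating information according to the present application is shown. The method for generating information includes the following steps:
Step 201: using a dynamic programming algorithm, determine the minimum edit distance for converting the first text into the second text by performing editing operations on the first text.
In this embodiment, the executing entity of the method for generating information (for example, the server 105 shown in Fig. 1) may obtain the first text and the second text in advance. The first text and the second text may be the texts whose similarity is to be calculated, and each of them may consist of a word sequence.
Here, the first text may be denoted by A and the second text by B. The word sequence constituting the first text may be written as $w_1^A, w_2^A, \ldots, w_n^A$, and the word sequence constituting the second text as $w_1^B, w_2^B, \ldots, w_m^B$, where $w$ denotes a word in a text. $w_1^A$, $w_2^A$, and $w_n^A$ denote the first, second, and n-th words of the first text, and $w_1^B$, $w_2^B$, and $w_m^B$ denote the first, second, and m-th words of the second text. Here, n is the number of words constituting the first text and m is the number of words constituting the second text; n and m are both positive integers not less than 1.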
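In one scenario, the first text and the second text may be stored locally on the executing entity in advance, in which case the executing entity may retrieve them directly from local storage.
In another scenario, the first text and the second text may be sent to the executing entity by a terminal (for example, the terminal devices 101, 102, 103 shown in Fig. 1) over a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G, WiFi, Bluetooth, WiMAX, Zigbee, UWB (ultra wideband), and other wireless connection methods now known or developed in the future.
In yet another scenario, one of the first text and the second text may be stored in the executing entity in advance, and the other may be sent to the executing entity by a terminal.
In this embodiment, the executing entity may use a dynamic programming algorithm to determine the minimum edit distance (which may be denoted by WED) for converting the first text into the second text by performing editing operations on the first text.
In practice, the edit distance (also called the edit cost) is a quantitative measure of how different two texts are. It characterizes the cost of converting one text (or string) into another text (or string). The minimum edit distance is the minimum of the edit distance, that is, the minimum cost of converting one text into the other. Here, cost can be understood as the amount of processing applied to a text and can be expressed as a number: the more processing, the higher the cost; the less processing, the lower the cost.
In this embodiment, the minimum edit distance may refer to the minimum cost of converting the first text into the second text. Converting the first text into the second text usually requires one or more editing operations. For each editing operation, the executing entity may determine its cost, and the minimum edit distance may be determined based on the costs of the individual editing operations.
Here, the editing operations may be divided into word insertion operations, word deletion operations, and word replacement operations. One word insertion operation inserts one word into the first text; one word deletion operation deletes one word from the first text; and one word replacement operation replaces one word of the first text with one word of the second text. Accordingly, the cost of an editing operation may be divided into the cost of a word deletion operation (which may be denoted by D), the cost of a word insertion operation (which may be denoted by I), and the cost of a word replacement operation (which may be denoted by S).
Specifically, the executing entity may use a dynamic programming algorithm to determine, based on the costs of the editing operations performed on the first text, the minimum edit distance for converting the first text into the second text. In practice, dynamic programming (DP) is a branch of operations research and a mathematical method for optimizing decision processes. The basic idea is to break the problem into several subproblems, solve the subproblems first, and then obtain the solution of the original problem from their solutions.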
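As a baseline illustration of the dynamic programming idea only, the sketch below computes the classic unit-cost edit distance over word sequences. It is not the weighted distance of this application: every insertion, deletion, and replacement is assumed to cost exactly 1, and the function name `word_edit_distance` is hypothetical.

```python
from typing import Sequence


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Classic unit-cost edit distance between two word sequences."""
    n, m = len(a), len(b)
    # f[i][j] = minimum cost of turning the first i words of a
    # into the first j words of b.
    f = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        f[i][0] = i  # delete all i words
    for j in range(1, m + 1):
        f[0][j] = j  # insert all j words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            replace = 0 if a[i - 1] == b[j - 1] else 1
            f[i][j] = min(
                f[i - 1][j] + 1,            # delete a[i-1]
                f[i][j - 1] + 1,            # insert b[j-1]
                f[i - 1][j - 1] + replace,  # replace (or keep) a[i-1]
            )
    return f[n][m]
```

Each table cell reuses the solutions of three smaller subproblems, which is the decomposition the paragraph above describes.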
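In this embodiment, the state transition equation used by the dynamic programming algorithm may take the following form:
$$f_{i,j} = \min\left\{ f_{i-1,j} + D(w_i^A),\; f_{i,j-1} + I(w_j^B),\; f_{i-1,j-1} + S(w_i^A, w_j^B) \right\}$$
where $w_i^A$ is the i-th word in the word sequence constituting the first text and $w_j^B$ is the j-th word in the word sequence constituting the second text; i is an integer not less than 1 and not greater than n, and j is an integer not less than 1 and not greater than m. $D(w_i^A)$ is the cost of deleting the word $w_i^A$ from the first text, $I(w_j^B)$ is the cost of inserting the word $w_j^B$ into the first text, and $S(w_i^A, w_j^B)$ is the cost of replacing the word $w_i^A$ in the first text with the word $w_j^B$ in the second text. $f_{i,j}$ denotes the minimum cost of converting the first i words of the first text into the first j words of the second text; $f_{i-1,j-1}$ denotes the minimum cost of converting the first i-1 words of the first text into the first j-1 words of the second text; $f_{i-1,j}$ denotes the minimum cost of converting the first i-1 words of the first text into the first j words of the second text; and $f_{i,j-1}$ denotes the minimum cost of converting the first i words of the first text into the first j-1 words of the second text. min denotes the minimum.
Using the above state transition equation, the executing entity may, based on the costs of the individual editing operations, use the dynamic programming algorithm to successively compute the final value $f_{n,m}$, which is the minimum cost of converting the first text into the second text, that is, the minimum edit distance WED.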
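The recurrence described above can be sketched as a table-filling routine with pluggable cost functions. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, and the boundary rows (converting to or from an empty prefix by pure deletions or pure insertions) are an assumption, since the text does not spell them out.

```python
from typing import Callable, Sequence


def weighted_edit_distance(
    a: Sequence[str],
    b: Sequence[str],
    delete_cost: Callable[[str], float],
    insert_cost: Callable[[str], float],
    replace_cost: Callable[[str, str], float],
) -> float:
    """Fill f[i][j] with the state transition equation; return f[n][m]."""
    n, m = len(a), len(b)
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary conditions: an empty prefix is reached by
    # deleting every word so far, or built by inserting every word.
    for i in range(1, n + 1):
        f[i][0] = f[i - 1][0] + delete_cost(a[i - 1])
    for j in range(1, m + 1):
        f[0][j] = f[0][j - 1] + insert_cost(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = min(
                f[i - 1][j] + delete_cost(a[i - 1]),
                f[i][j - 1] + insert_cost(b[j - 1]),
                f[i - 1][j - 1] + replace_cost(a[i - 1], b[j - 1]),
            )
    return f[n][m]
```

Passing the costs in as callables keeps the table-filling logic independent of how D, I, and S are derived from semantic similarity.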
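In this embodiment, the cost of an editing operation may be determined based on the semantic similarity between the target word in the first text and the target word in the second text, where a target word is a word involved in the editing operation. As an example, when the editing operation is a word deletion operation, the target word in the first text may be the word to be deleted from the first text, and the target word in the second text may be the word with the greatest semantic similarity to the word to be deleted. As another example, when the editing operation is a word insertion operation, the target word in the second text may be the word of the second text that is to be inserted into the first text, and the target word in the first text may be the word with the greatest semantic similarity to the target word in the second text. As a further example, when the editing operation is a word replacement operation, the target word in the first text may be the word to be replaced in the first text, and the target word in the second text may be the word of the second text used to replace it.
For a given editing operation, the executing entity may determine the cost of that operation from the computed semantic similarity between the target word in the first text and the target word in the second text. Here, the executing entity may set in advance a correspondence between semantic similarity and editing-operation cost, for example a lookup table or a formula. When determining the cost of a given editing operation, the executing entity may substitute the computed semantic similarity of the two target words directly into the correspondence for that operation to obtain its cost.
It can be understood that the same or different correspondences may be preset for different editing operations. For example, the correspondence between semantic similarity and the cost of a word deletion operation and the correspondence between semantic similarity and the cost of a word insertion operation may use the same table or formula, while the correspondences for word deletion and for word replacement may use different tables or formulas. No limitation is imposed here.
It should be noted that the executing entity may determine the semantic similarity between the target word in the first text and the target word in the second text in various ways. As an example, the word vectors of the two target words may first be determined. Here, a word vector may be the embedded representation of a word obtained with word embedding techniques. The executing entity may determine it with any existing word vector calculation method (for example, principal component analysis of a word-text co-occurrence matrix) or with an existing word vector tool or model (for example, the word2vec, GloVe, or ELMo model). No limitation is imposed here. Next, the semantic similarity may be computed from the two word vectors. In practice, a word vector may contain the semantic features of the word, so computing the similarity of the word vectors yields the semantic similarity of the two target words. Various similarity measures may be used here, for example Euclidean distance or cosine similarity.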
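For illustration, cosine similarity between two word vectors can be computed as below. The vectors are assumed to be plain sequences of floats of equal length; how they are obtained (word2vec, GloVe, or another model) is outside this sketch, and returning 0 for a zero vector is an assumption rather than something the text specifies.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector carries no direction, so report 0.
        return 0.0
    return dot / (norm_u * norm_v)
```

The result lies in [-1, 1], which is why the optional implementation described later rescales it before using it as a similarity.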
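In some optional implementations of this embodiment, the semantic similarity between the target word in the first text and the target word in the second text may be determined by the following semantic similarity determination steps. Step one: determine whether the target word in the first text (here denoted by $w_i^A$) and the target word in the second text (here denoted by $w_j^B$) are the same. Step two: in response to determining that the two target words are not the same, determine, for each of them, whether it is an out-of-vocabulary (OOV) word. Here, an out-of-vocabulary word may be a word that is not included in the word segmentation vocabulary but must nevertheless be segmented out, for example proper nouns of all kinds (names of people, places, companies, and so on), abbreviations, and newly coined words. Step three: in response to determining that the target word in the first text and/or the target word in the second text is an out-of-vocabulary word, a first preset value (for example 0) may be determined as the semantic similarity between the two target words (here denoted by $\mathrm{sim}(w_i^A, w_j^B)$).
Because an out-of-vocabulary word is not included in the vocabulary, its word vector usually cannot be obtained, so the semantic similarity cannot be computed from word vectors. This implementation takes the existence of out-of-vocabulary words into account: when the target word in the first text and/or the target word in the second text is an out-of-vocabulary word and the two target words differ, the semantic similarity is set to the first preset value (for example 0), so a semantic similarity between the two target words can still be obtained. The words in the texts are thereby considered more comprehensively, which improves the accuracy of text similarity calculation.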
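In some optional implementations of this embodiment, after step two of the semantic similarity determination steps, in response to determining that neither the target word in the first text nor the target word in the second text is an out-of-vocabulary word, the executing entity may perform the following steps. First, determine the cosine similarity (here denoted by $\cos(v_i^A, v_j^B)$) between the word vector of the target word in the first text (here denoted by $v_i^A$) and the word vector of the target word in the second text (here denoted by $v_j^B$). Here, the word vectors may be determined with any existing word vector calculation method or with an existing word vector tool or model. Next, determine the product of the cosine similarity and a first preset parameter (here denoted by α). Then, input the sum of the product and a second preset parameter (here denoted by β) into a target function (for example the sigmoid function, here denoted by σ), and determine the value of the target function as the semantic similarity $\mathrm{sim}(w_i^A, w_j^B)$ between the two target words, that is:
$$\mathrm{sim}(w_i^A, w_j^B) = \sigma\left(\alpha \cdot \cos(v_i^A, v_j^B) + \beta\right)$$
Here, α, β, and the σ function map the cosine similarity to a specified numerical interval (for example [0,1]). The values of α and β may be set as needed. In practice, α may be set to a number greater than 0.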
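In some optional implementations of this embodiment, after step one of the semantic similarity determination steps, in response to determining that the target word in the first text is the same as the target word in the second text, the executing entity may determine a second preset value (for example 1) as the semantic similarity between the two target words. Thus, when the target words in the two texts are identical, the semantic similarity no longer needs to be computed from word vectors and can be set directly to the second preset value, which improves data processing efficiency.
In some optional implementations of this embodiment, for a given editing operation, the similarity between the target word in the first text and the target word in the second text involved in that operation may be determined by the following formula (shown here with the example values 0 and 1 for the first and second preset values):
$$\mathrm{sim}(w_i^A, w_j^B) = \begin{cases} 1, & w_i^A = w_j^B \\ 0, & w_i^A \neq w_j^B \text{ and } w_i^A \text{ and/or } w_j^B \text{ is an OOV word} \\ \sigma\left(\alpha \cdot \cos(v_i^A, v_j^B) + \beta\right), & \text{otherwise} \end{cases}$$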
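The three cases of the word-level similarity (identical words, an out-of-vocabulary word, and the sigmoid of the scaled cosine similarity) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the `vectors` dictionary stands in for whatever embedding table is used, a word is treated as out-of-vocabulary simply when it has no entry in that dictionary, and the defaults α = 1, β = 0, first preset value 0, and second preset value 1 are assumptions taken from the examples in the text.

```python
import math
from typing import Dict, Sequence


def word_similarity(
    w_a: str,
    w_b: str,
    vectors: Dict[str, Sequence[float]],
    alpha: float = 1.0,      # first preset parameter (assumed default)
    beta: float = 0.0,       # second preset parameter (assumed default)
    oov_value: float = 0.0,  # first preset value
    same_value: float = 1.0, # second preset value
) -> float:
    if w_a == w_b:
        return same_value    # identical words: no vector lookup needed
    if w_a not in vectors or w_b not in vectors:
        return oov_value     # at least one word has no word vector
    u, v = vectors[w_a], vectors[w_b]
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    cos = dot / norm if norm else 0.0
    # Sigmoid as the target function, mapping the result into (0, 1).
    return 1.0 / (1.0 + math.exp(-(alpha * cos + beta)))
```

Checking equality first means identical out-of-vocabulary words still score the second preset value, which matches the order of the steps in the text.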
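In some optional implementations of this embodiment, the cost of a word replacement operation may be determined by the following steps (taking as an example the cost $S(w_i^A, w_j^B)$ of replacing the word $w_i^A$ in the first text with the word $w_j^B$ in the second text):
Step one: determine the word to be replaced $w_i^A$ in the first text as the target word in the first text.
Step two: determine the word $w_j^B$ in the second text that is used to replace the word to be replaced as the target word in the second text.
Step three: determine the semantic similarity $\mathrm{sim}(w_i^A, w_j^B)$ between the target word in the first text and the target word in the second text.
Step four: determine the difference between a third preset value (for example 1) and the semantic similarity as the cost $S(w_i^A, w_j^B)$ of the word replacement operation. For example, when the third preset value is 1, the cost of the word replacement operation is determined by the following formula:
$$S(w_i^A, w_j^B) = 1 - \mathrm{sim}(w_i^A, w_j^B)$$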
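In some optional implementations of this embodiment, the cost of a word deletion operation may be determined by the following steps (taking as an example the cost $D(w_i^A)$ of deleting the word $w_i^A$ from the first text):
Step one: take the word to be deleted $w_i^A$ in the first text as the target word in the first text, and determine, one by one, the semantic similarity between each word in the second text and the target word in the first text; that is, determine $\mathrm{sim}(w_i^A, w^B)$, where $w^B$ is a word in the second text.
Step two: determine the word in the second text that corresponds to the maximum semantic similarity (here denoted by $\max_{w^B \in B} \mathrm{sim}(w_i^A, w^B)$) as the target word in the second text, determine the product of the maximum similarity and a third preset parameter (here denoted by $\lambda_2$), and determine the difference between a fourth preset parameter (here denoted by $\lambda_1$) and the product as the cost $D(w_i^A)$ of the word deletion operation, that is:
$$D(w_i^A) = \lambda_1 - \lambda_2 \cdot \max_{w^B \in B} \mathrm{sim}(w_i^A, w^B)$$
where $w^B \in B$ and max denotes the maximum.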
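In some optional implementations of this embodiment, the cost of a word insertion operation may be determined by the following steps (taking as an example the cost $I(w_j^B)$ of inserting the word $w_j^B$ into the first text):
Step one: take the word $w_j^B$ of the second text that is to be inserted into the first text as the target word in the second text, and determine, one by one, the semantic similarity between each word in the first text and the target word in the second text; that is, determine $\mathrm{sim}(w^A, w_j^B)$, where $w^A$ is a word in the first text.
Step two: determine the word in the first text that corresponds to the maximum semantic similarity (here denoted by $\max_{w^A \in A} \mathrm{sim}(w^A, w_j^B)$) as the target word in the first text, determine the product of the maximum similarity and the third preset parameter ($\lambda_2$), and determine the difference between the fourth preset parameter ($\lambda_1$) and the product as the cost $I(w_j^B)$ of the word insertion operation, that is:
$$I(w_j^B) = \lambda_1 - \lambda_2 \cdot \max_{w^A \in A} \mathrm{sim}(w^A, w_j^B)$$
where $w^A \in A$ and max denotes the maximum.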
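It should be noted that $\lambda_1$ may be used to adjust the relative magnitude of the cost of a word deletion or word insertion operation. In practice, $\lambda_1$ may be set to a value not less than 0. $\lambda_2$ may be used to adjust how strongly the similarity influences the cost. In practice, $\lambda_2$ may be set to a value not less than 0 and not greater than 1. When $\lambda_2$ equals 1, identical words located at different positions incur no cost.
It should be pointed out that the values of the above parameters (α, β, $\lambda_1$, $\lambda_2$) may be preset as needed, or may be determined in advance from statistics and experiments on large amounts of data; the specific values are not limited here.
It should also be pointed out that the formulas for the costs of the word replacement, word deletion, and word insertion operations are not limited to those listed in the above implementations; other formulas that take the semantic similarity of the target words as a variable may also be used. No limitation is imposed here.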
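The three cost formulas described above can be sketched as small functions that take any word-level similarity function. The names are hypothetical, the third preset value is fixed at 1 as in the example in the text, and the defaults λ1 = λ2 = 1 are assumptions chosen only so that the example is easy to check; falling back to a maximum similarity of 0 for an empty text is also an assumption.

```python
from typing import Callable, Sequence

Similarity = Callable[[str, str], float]


def replacement_cost(w_a: str, w_b: str, sim: Similarity) -> float:
    # Third preset value taken as 1: S = 1 - sim.
    return 1.0 - sim(w_a, w_b)


def deletion_cost(w_a: str, text_b: Sequence[str], sim: Similarity,
                  lambda1: float = 1.0, lambda2: float = 1.0) -> float:
    # D = lambda1 - lambda2 * (best similarity to any word of the second text)
    best = max((sim(w_a, w_b) for w_b in text_b), default=0.0)
    return lambda1 - lambda2 * best


def insertion_cost(w_b: str, text_a: Sequence[str], sim: Similarity,
                   lambda1: float = 1.0, lambda2: float = 1.0) -> float:
    # I = lambda1 - lambda2 * (best similarity to any word of the first text)
    best = max((sim(w_a, w_b) for w_a in text_a), default=0.0)
    return lambda1 - lambda2 * best
```

With these assumed defaults, deleting a word that also appears somewhere in the second text costs 0, which is the behavior the text describes for λ2 equal to 1.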
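Step 202: normalize the minimum edit distance and determine the normalized value as the similarity between the first text and the second text.
In this embodiment, the minimum edit distance is normalized, and the normalized value is determined as the similarity between the first text and the second text. In practice, normalization means processing the data (with some algorithm) so that it is confined to a specified range, for example converting a value into a value in the interval [0,1]. Normalizing the minimum edit distance facilitates comparison and subsequent processing of the data.
Here, the minimum edit distance obtained in step 201 may be normalized with any existing normalization function or with a pre-established formula. As an example, the number n of words in the word sequence constituting the first text may first be determined, together with the number m of words in the word sequence constituting the second text. Then, the sum of the numbers of words in the two word sequences may be determined. Finally, the ratio of the minimum edit distance to this sum may be determined as the similarity between the first text and the second text.
In some optional implementations of this embodiment, the executing entity may first determine the numbers of words in the word sequences constituting the first text and the second text as a first number (n) and a second number (m), respectively. When the costs of the word deletion and word insertion operations are determined according to the optional implementations described in step 201, the executing entity may determine the similarity between the first text and the second text (here denoted by sim) based on the minimum edit distance, the first number, the second number, and a comparison of the fourth preset parameter $\lambda_1$ with a preset threshold.
Optionally, in response to determining that the fourth preset parameter is less than the preset threshold (for example 0.5), the executing entity may perform the following steps. First, the sum of the first number n and the second number m may be determined as a first intermediate value. Then, the product of the first intermediate value and the fourth preset parameter $\lambda_1$ may be determined as a second intermediate value. Next, the ratio of the minimum edit distance WED to the second intermediate value may be determined. Finally, the difference between a fourth preset value (for example 1) and the ratio may be determined as the similarity between the first text and the second text. Here, the specific value of the fourth preset value may be determined according to actual needs and is not limited here.
Optionally, in response to determining that the fourth preset parameter is not less than the preset threshold (for example 0.5), the executing entity may perform the following steps. First, the difference between the second number m and the first number n may be determined as a third intermediate value. Then, the product of the third intermediate value and the fourth preset parameter $\lambda_1$ may be determined as a fourth intermediate value. Next, the sum of the fourth intermediate value and the first number n may be determined as a fifth intermediate value. Then, the ratio of the minimum edit distance WED to the fifth intermediate value may be determined. Finally, the difference between the fourth preset value (for example 1) and the ratio may be determined as the similarity between the first text and the second text.
As an example, in the above implementations, when the preset threshold is 0.5 and the fourth preset value is 1, the similarity between the first text and the second text may be determined with reference to the following formula:
$$\mathrm{sim} = \begin{cases} 1 - \dfrac{WED}{\lambda_1 (n + m)}, & \lambda_1 < 0.5 \\[2ex] 1 - \dfrac{WED}{n + \lambda_1 (m - n)}, & \lambda_1 \geq 0.5 \end{cases}$$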
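The two-branch normalization can be sketched as follows, following the intermediate values exactly as the text states them and taking the fourth preset value as 1. The function name is hypothetical, and rejecting a non-positive denominator is an added guard that the text does not mention.

```python
def normalized_similarity(wed: float, n: int, m: int,
                          lambda1: float, threshold: float = 0.5) -> float:
    """Map a minimum edit distance to a similarity score."""
    if lambda1 < threshold:
        # second intermediate value: lambda1 * (n + m)
        denominator = lambda1 * (n + m)
    else:
        # fifth intermediate value: n + lambda1 * (m - n)
        denominator = n + lambda1 * (m - n)
    if denominator <= 0:
        # Assumed guard: e.g. lambda1 == 0 would divide by zero.
        raise ValueError("normalization denominator must be positive")
    return 1.0 - wed / denominator
```

For example, with n = m = 3, λ1 = 1 and a minimum edit distance of 1.5, the denominator is 3 and the similarity is 0.5.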
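In some optional implementations of this embodiment, after the similarity between the first text and the second text is determined, the executing entity may also display a similarity calculation result containing the similarity. Alternatively, in response to determining that the similarity is greater than a preset similarity threshold, it may establish a correspondence between the first text and the second text and store correspondence information representing the correspondence. Alternatively, it may push the first text or the second text to a specified user, and so on.
Continuing with Fig. 3, Fig. 3 is a schematic diagram of an application scenario of the method for generating information according to this embodiment. In the application scenario of Fig. 3, a user first uses a terminal device 301 to send a similarity calculation request to a server 302; the request contains a first text 303 and a second text 304 whose similarity is to be calculated. The server 302 then uses a dynamic programming algorithm to determine the minimum edit distance for converting the first text into the second text, normalizes the minimum edit distance, and determines the normalized value as the similarity 305 between the first text and the second text. Finally, the server sends a similarity calculation result 306 containing the similarity 305 to the terminal device.
The method provided by the above embodiment of the present application uses a dynamic programming algorithm to determine the minimum edit distance for converting the first text into the second text by performing editing operations on the first text, normalizes the minimum edit distance, and determines the normalized value as the similarity between the first text and the second text. The editing operations are divided into word insertion, word deletion, and word replacement operations, and the cost of an editing operation is determined based on the semantic similarity between the target word in the first text and the target word in the second text. The order of the words in the texts, the similarity of corresponding words, and the alignment between words are thereby all taken into account, which improves the accuracy of text similarity calculation.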
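With further reference to Fig. 4, a flow 400 of another embodiment of the method for generating information is shown. The flow 400 of the method for generating information includes the following steps:
Step 401: using a dynamic programming algorithm, determine the minimum edit distance for converting the first text into the second text by performing editing operations on the first text.
In this embodiment, the executing entity of the method for generating information (for example, the server 105 shown in Fig. 1) may use a dynamic programming algorithm to determine the minimum edit distance (denoted by WED) for converting the first text (denoted by A) into the second text (denoted by B) by performing editing operations on the first text. The number of words constituting the first text may be denoted by n and the number of words constituting the second text by m; n and m are both positive integers not less than 1. Here, the editing operations may be divided into word insertion, word deletion, and word replacement operations, and the cost of an editing operation may be divided into the cost of a word deletion operation (which may be denoted by D), the cost of a word insertion operation (which may be denoted by I), and the cost of a word replacement operation (which may be denoted by S). The executing entity may use the dynamic programming algorithm to determine, based on the costs of the editing operations performed on the first text, the minimum edit distance for converting the first text into the second text. Here, the state transition equation may take the following form:
$$f_{i,j} = \min\left\{ f_{i-1,j} + D(w_i^A),\; f_{i,j-1} + I(w_j^B),\; f_{i-1,j-1} + S(w_i^A, w_j^B) \right\}$$
where $w_i^A$ is the i-th word in the word sequence constituting the first text and $w_j^B$ is the j-th word in the word sequence constituting the second text; i is an integer not less than 1 and not greater than n, and j is an integer not less than 1 and not greater than m. $D(w_i^A)$ is the cost of deleting the word $w_i^A$ from the first text, $I(w_j^B)$ is the cost of inserting the word $w_j^B$ into the first text, and $S(w_i^A, w_j^B)$ is the cost of replacing the word $w_i^A$ in the first text with the word $w_j^B$ in the second text. $f_{i,j}$ denotes the minimum cost of converting the first i words (that is, the 1st to the i-th) of the first text into the first j words (that is, the 1st to the j-th) of the second text; $f_{i-1,j-1}$ denotes the minimum cost of converting the first i-1 words of the first text into the first j-1 words of the second text; $f_{i-1,j}$ denotes the minimum cost of converting the first i-1 words of the first text into the first j words of the second text; and $f_{i,j-1}$ denotes the minimum cost of converting the first i words of the first text into the first j-1 words of the second text. min denotes the minimum.
Using the above state transition equation, the executing entity may, based on the costs of the individual editing operations, use the dynamic programming algorithm to successively compute the final value of the state transition equation, which is the minimum cost of converting the first text into the second text, that is, the minimum edit distance WED.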
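Here, the cost of an editing operation may be determined based on the semantic similarity between the target word in the first text and the target word in the second text.
In the above state transition equation, $S(w_i^A, w_j^B) = 1 - \mathrm{sim}(w_i^A, w_j^B)$, where $\mathrm{sim}(w_i^A, w_j^B)$ is the semantic similarity between the word $w_i^A$ in the first text and the word $w_j^B$ in the second text. Here, the words involved in an editing operation may be taken as target words. Since the word replacement operation involves the word $w_i^A$ in the first text and the word $w_j^B$ in the second text, $w_i^A$ may be taken as the target word in the first text and $w_j^B$ as the target word in the second text.
In the above state transition equation, $D(w_i^A) = \lambda_1 - \lambda_2 \cdot \max_{w^B \in B} \mathrm{sim}(w_i^A, w^B)$, where $\lambda_1$ may be used to adjust the relative magnitude of the cost of a word deletion or word insertion operation. In practice, $\lambda_1$ may be set to a value not less than 0. $\lambda_2$ may be used to adjust how strongly the similarity influences the cost. In practice, $\lambda_2$ may be set to a value not less than 0 and not greater than 1. When $\lambda_2$ equals 1, identical words located at different positions incur no cost. It should be pointed out that the values of $\lambda_1$ and $\lambda_2$ may be preset as needed, or may be determined in advance from statistics and experiments on large amounts of data; the specific values are not limited here. $\max_{w^B \in B} \mathrm{sim}(w_i^A, w^B)$ is the maximum of the semantic similarities between the target word $w_i^A$ in the first text and each word $w^B$ in the second text, that is:
$$\max_{w^B \in B} \mathrm{sim}(w_i^A, w^B) = \max_{1 \leq j \leq m} \mathrm{sim}(w_i^A, w_j^B)$$
In the above state transition equation, $I(w_j^B) = \lambda_1 - \lambda_2 \cdot \max_{w^A \in A} \mathrm{sim}(w^A, w_j^B)$, where $\max_{w^A \in A} \mathrm{sim}(w^A, w_j^B)$ is the maximum of the semantic similarities between the target word $w_j^B$ in the second text and each word $w^A$ in the first text, that is:
$$\max_{w^A \in A} \mathrm{sim}(w^A, w_j^B) = \max_{1 \leq i \leq n} \mathrm{sim}(w_i^A, w_j^B)$$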
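In this embodiment, the semantic similarity $\mathrm{sim}(w_i^A, w_j^B)$ between the target word $w_i^A$ in the first text and the target word $w_j^B$ in the second text may be determined by the following semantic similarity determination steps:
Step one: determine whether $w_i^A$ and $w_j^B$ are the same.
If they are the same, a second preset value (for example 1) may be determined as the semantic similarity between $w_i^A$ and $w_j^B$. Thus, when the target words in the two texts are identical, the semantic similarity no longer needs to be computed from word vectors and can be set directly to the second preset value, which improves data processing efficiency.
If $w_i^A$ and $w_j^B$ differ, step two below may be performed.
Step two: in response to determining that $w_i^A$ and $w_j^B$ are not the same, determine, for each of $w_i^A$ and $w_j^B$, whether it is an out-of-vocabulary word.
If $w_i^A$ and/or $w_j^B$ is an out-of-vocabulary word, a first preset value (for example 0) may be determined as the semantic similarity between $w_i^A$ and $w_j^B$. Because an out-of-vocabulary word is not included in the vocabulary, its word vector usually cannot be obtained, so the semantic similarity cannot be computed from word vectors. This implementation takes the existence of out-of-vocabulary words into account: when the target word in the first text and/or the target word in the second text is an out-of-vocabulary word and the two target words differ, the semantic similarity is set to the first preset value (for example 0), so a semantic similarity between the two target words can still be obtained. The words in the texts are thereby considered more comprehensively, which improves the accuracy of text similarity calculation.
If neither $w_i^A$ nor $w_j^B$ is an out-of-vocabulary word, step three below may be performed.
Step three: in response to determining that neither $w_i^A$ nor $w_j^B$ is an out-of-vocabulary word, first determine the cosine similarity $\cos(v_i^A, v_j^B)$ between the word vector $v_i^A$ of $w_i^A$ and the word vector $v_j^B$ of $w_j^B$. Then determine the semantic similarity between $w_i^A$ and $w_j^B$ by the following formula:
$$\mathrm{sim}(w_i^A, w_j^B) = \sigma\left(\alpha \cdot \cos(v_i^A, v_j^B) + \beta\right)$$
Here, α, β, and the σ function map the cosine similarity to a specified numerical interval (for example [0,1]). The values of α and β may be set as needed. In practice, α may be set to a number greater than 0.
When the first preset value is 0 and the second preset value is 1, the following formula may be used:
$$\mathrm{sim}(w_i^A, w_j^B) = \begin{cases} 1, & w_i^A = w_j^B \\ 0, & w_i^A \neq w_j^B \text{ and } w_i^A \text{ and/or } w_j^B \text{ is an OOV word} \\ \sigma\left(\alpha \cdot \cos(v_i^A, v_j^B) + \beta\right), & \text{otherwise} \end{cases}$$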
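It should be noted that the semantic similarity between the target word $w_i^A$ in the first text and each word $w^B$ in the second text may be determined in the same way as the semantic similarity $\mathrm{sim}(w_i^A, w_j^B)$ between $w_i^A$ and $w_j^B$. The semantic similarity between the target word $w_j^B$ in the second text and each word $w^A$ in the first text may likewise be determined in the same way as $\mathrm{sim}(w_i^A, w_j^B)$. Details are not repeated here.
Step 402: determine the numbers of words in the word sequences constituting the first text and the second text as a first number and a second number, respectively.
In this embodiment, the executing entity may determine the numbers of words in the word sequences constituting the first text and the second text as a first number (denoted by n) and a second number (denoted by m), respectively.
Step 403: determine the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and a comparison of the fourth preset parameter with a preset threshold.
In this embodiment, the similarity between the first text and the second text may be determined with reference to the following formula:
$$\mathrm{sim} = \begin{cases} 1 - \dfrac{WED}{\lambda_1 (n + m)}, & \lambda_1 < 0.5 \\[2ex] 1 - \dfrac{WED}{n + \lambda_1 (m - n)}, & \lambda_1 \geq 0.5 \end{cases}$$
Step 404: display a similarity calculation result containing the similarity.
In this embodiment, the executing entity may display the similarity calculation result containing the similarity, so as to present the result to the user for viewing.
As can be seen from Fig. 4, compared with the embodiment corresponding to Fig. 2, the flow 400 of the method for generating information in this embodiment gives a concrete procedure for computing the minimum edit distance and for determining the text similarity from it. The solution described in this embodiment can therefore take into account both the order of the words in the texts and the similarity of corresponding words, which improves the accuracy of text similarity calculation. In addition, the parameters can be adjusted flexibly according to the task, so that word order and semantic similarity are used to different degrees, which improves the flexibility of text similarity calculation.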
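To show how the parameters shift the balance between word order and word overlap, the following end-to-end sketch chains the steps of flow 400. It is an illustrative sketch under stated assumptions, not the authors' implementation: the function name, the `vectors` dictionary, the treatment of missing dictionary entries as out-of-vocabulary words, the boundary rows of the table, and all default parameter values are assumptions.

```python
import math
from typing import Dict, Sequence


def text_similarity(a: Sequence[str], b: Sequence[str],
                    vectors: Dict[str, Sequence[float]],
                    lambda1: float = 1.0, lambda2: float = 1.0,
                    alpha: float = 1.0, beta: float = 0.0,
                    threshold: float = 0.5) -> float:
    if not a or not b:
        raise ValueError("both texts need at least one word")

    def sim(x: str, y: str) -> float:
        if x == y:
            return 1.0                      # second preset value
        if x not in vectors or y not in vectors:
            return 0.0                      # first preset value (OOV case)
        u, v = vectors[x], vectors[y]
        dot = sum(p * q for p, q in zip(u, v))
        norm = (math.sqrt(sum(p * p for p in u))
                * math.sqrt(sum(q * q for q in v)))
        cos = dot / norm if norm else 0.0
        return 1.0 / (1.0 + math.exp(-(alpha * cos + beta)))

    def delete(x: str) -> float:
        return lambda1 - lambda2 * max(sim(x, y) for y in b)

    def insert(y: str) -> float:
        return lambda1 - lambda2 * max(sim(x, y) for x in a)

    n, m = len(a), len(b)
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):               # assumed boundary: deletions only
        f[i][0] = f[i - 1][0] + delete(a[i - 1])
    for j in range(1, m + 1):               # assumed boundary: insertions only
        f[0][j] = f[0][j - 1] + insert(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = min(
                f[i - 1][j] + delete(a[i - 1]),
                f[i][j - 1] + insert(b[j - 1]),
                f[i - 1][j - 1] + 1.0 - sim(a[i - 1], b[j - 1]),
            )
    wed = f[n][m]
    if lambda1 < threshold:
        denominator = lambda1 * (n + m)
    else:
        denominator = n + lambda1 * (m - n)
    return 1.0 - wed / denominator
```

Under these assumptions, two texts containing the same words in a different order score 1.0 when `lambda2` is 1 (word order is ignored), whereas setting `lambda2` to 0 makes the score fall back to a purely order-sensitive edit distance.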
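With further reference to Fig. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for generating information. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may be applied in various electronic devices.
As shown in Fig. 5, the apparatus 500 for generating information of this embodiment includes: a first determining unit 501 configured to use a dynamic programming algorithm to determine the minimum edit distance for converting a first text into a second text by performing editing operations on the first text, wherein the minimum edit distance is determined based on the costs of the editing operations, the cost of an editing operation is determined based on the semantic similarity between the target word in the first text and the target word in the second text, a target word is a word involved in an editing operation, and the editing operations are divided into word deletion operations, word insertion operations, and word replacement operations; and a second determining unit 502 configured to normalize the minimum edit distance and determine the normalized value as the similarity between the first text and the second text.
In some optional implementations of this embodiment, the first determining unit 501 may be configured to perform the following semantic similarity determination steps: determining whether the target word in the first text is the same as the target word in the second text; if not, determining, for each of the two target words, whether it is an out-of-vocabulary word; and in response to determining that the target word in the first text and/or the target word in the second text is an out-of-vocabulary word, determining a first preset value as the semantic similarity between the two target words.
In some optional implementations of this embodiment, the semantic similarity determination steps may further include: in response to determining that neither the target word in the first text nor the target word in the second text is an out-of-vocabulary word, performing the following steps: determining the cosine similarity between the word vector of the target word in the first text and the word vector of the target word in the second text; determining the product of the cosine similarity and a first preset parameter; and inputting the sum of the product and a second preset parameter into a target function, and determining the value of the target function as the semantic similarity between the two target words.
In some optional implementations of this embodiment, the semantic similarity determination steps may further include: in response to determining that the target word in the first text is the same as the target word in the second text, determining a second preset value as the semantic similarity between the two target words.
In some optional implementations of this embodiment, the first determining unit 501 may be further configured to perform the following steps to determine the cost of a word replacement operation: determining the word to be replaced in the first text as the target word in the first text; determining the word in the second text that is used to replace the word to be replaced as the target word in the second text; determining the semantic similarity between the two target words; and determining the difference between a third preset value and the semantic similarity as the cost of the word replacement operation.
In some optional implementations of this embodiment, the first determining unit 501 may be further configured to perform the following steps to determine the cost of a word deletion operation: taking the word to be deleted from the first text as the target word in the first text, and determining, one by one, the semantic similarity between each word in the second text and the target word in the first text; determining the word in the second text that corresponds to the maximum semantic similarity as the target word in the second text; determining the product of the maximum similarity and a third preset parameter; and determining the difference between a fourth preset parameter and the product as the cost of the word deletion operation.
In some optional implementations of this embodiment, the first determining unit 501 may be further configured to perform the following steps to determine the cost of a word insertion operation: taking the word of the second text that is to be inserted into the first text as the target word in the second text, and determining, one by one, the semantic similarity between each word in the first text and the target word in the second text; determining the word in the first text that corresponds to the maximum semantic similarity as the target word in the first text; determining the product of the maximum similarity and the third preset parameter; and determining the difference between the fourth preset parameter and the product as the cost of the word insertion operation.
In some optional implementations of this embodiment, the second determining unit 502 may include a first determining module and a second determining module (not shown in the figure). The first determining module may be configured to determine the numbers of words in the word sequences constituting the first text and the second text as a first number and a second number, respectively. The second determining module may be configured to determine the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and a comparison of the fourth preset parameter with a preset threshold.
In some optional implementations of this embodiment, the second determining module may be further configured to: in response to determining that the fourth preset parameter is less than the preset threshold, perform the following steps: determining the sum of the first number and the second number as a first intermediate value; determining the product of the first intermediate value and the fourth preset parameter as a second intermediate value; determining the ratio of the minimum edit distance to the second intermediate value; and determining the difference between a fourth preset value and the ratio as the similarity between the first text and the second text.
In some optional implementations of this embodiment, the second determining module may be further configured to: in response to determining that the fourth preset parameter is not less than the preset threshold, perform the following steps: determining the difference between the second number and the first number as a third intermediate value; determining the product of the third intermediate value and the fourth preset parameter as a fourth intermediate value; determining the sum of the fourth intermediate value and the first number as a fifth intermediate value; determining the ratio of the minimum edit distance to the fifth intermediate value; and determining the difference between the fourth preset value and the ratio as the similarity between the first text and the second text.
In some optional implementations of this embodiment, the apparatus may further include a display unit or a storage unit (not shown in the figure). The display unit may be configured to display a similarity calculation result containing the similarity. The storage unit may be configured to, in response to determining that the similarity is greater than a preset similarity threshold, establish a correspondence between the first text and the second text and store correspondence information representing the correspondence.
In the apparatus provided by the above embodiment of the present application, the first determining unit 501 uses a dynamic programming algorithm to determine the minimum edit distance for converting the first text into the second text by performing editing operations on the first text, so that the second determining unit 502 can normalize the minimum edit distance and determine the normalized value as the similarity between the first text and the second text. The editing operations are divided into word insertion, word deletion, and word replacement operations, and the cost of an editing operation is determined based on the semantic similarity between the target word in the first text and the target word in the second text. The order of the words in the texts, the similarity of corresponding words, and the alignment between words are thereby all taken into account, which improves the accuracy of text similarity calculation.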
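Referring now to Fig. 6, a schematic structural diagram of a computer system 600 suitable for implementing an electronic device of the embodiments of the present application is shown. The electronic device shown in Fig. 6 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609 and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above functions defined in the method of the present application are performed. It should be noted that the computer-readable medium described in the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and this computer-readable medium may send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, optical cable, RF, and the like, or any suitable combination of the foregoing.
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, which contains one or more executable instructions for implementing a specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified function or operation, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor including a first determining unit and a second determining unit. The names of these units do not, in some cases, limit the units themselves; for example, the first determining unit may also be described as "a unit that uses a dynamic programming algorithm to determine the minimum edit distance for converting the first text into the second text".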
作为另一方面,本申请还提供了一种计算机可读介质,该计算机可读介质可以是上述实施例中描述的装置中所包含的;也可以是单独存在,而未装配入该装置中。上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被该装置执行时,使得该装置:利用动态规划算法,确定通过对第一文本进行编辑操作,将该第一文本转换为第二文本的最小编辑距离,其中,上述最小编辑距离基于编辑操作的代价确定,编辑操作的代价基于该第一文本中的目标词与该第二文本中的目标词的语义相似度确定,目标词为编辑操作所涉及的词,编辑操作分为删除词操作、插入词操作和替换词操作;将该最小编辑距离进行归一化,将归一化后的数值确定为该第一文本与该第二文本的相似度。
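Optionally, the semantic similarity between the target word in the first text and the target word in the second text may be determined through the following semantic similarity determination steps: determining whether the target word in the first text is the same as the target word in the second text; if not, determining, for each of the two target words, whether it is an out-of-vocabulary word; and in response to determining that the target word in the first text and/or the target word in the second text is an out-of-vocabulary word, determining a first preset value as the semantic similarity between the target word in the first text and the target word in the second text.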
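Optionally, the semantic similarity determination steps may further include: in response to determining that neither the target word in the first text nor the target word in the second text is an out-of-vocabulary word, performing the following steps: determining the cosine similarity between the word vector of the target word in the first text and the word vector of the target word in the second text; determining the product of the cosine similarity and a first preset parameter; and inputting the sum of the product and a second preset parameter into a target function, and determining the value of the target function as the semantic similarity between the target word in the first text and the target word in the second text.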
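Optionally, the semantic similarity determination steps may further include: in response to determining that the target word in the first text is the same as the target word in the second text, determining a second preset value as the semantic similarity between the target word in the first text and the target word in the second text.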
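The three branches of the semantic similarity determination steps can be sketched as below. Several things here are assumptions made for illustration only: this passage does not name the target function, so a sigmoid is used as a stand-in; the default values of the preset values and parameters are hypothetical; and a word is treated as out-of-vocabulary when it has no entry in the `vectors` dictionary.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_similarity(
    word1: str,
    word2: str,
    vectors: Dict[str, Sequence[float]],
    first_value: float = 0.0,    # hypothetical "first preset value" (OOV case)
    second_value: float = 1.0,   # hypothetical "second preset value" (same word)
    first_param: float = 1.0,    # hypothetical "first preset parameter"
    second_param: float = 0.0,   # hypothetical "second preset parameter"
) -> float:
    if word1 == word2:
        return second_value
    if word1 not in vectors or word2 not in vectors:
        return first_value
    scaled = first_param * cosine(vectors[word1], vectors[word2]) + second_param
    # Assumed target function: logistic sigmoid (not specified in this passage).
    return 1.0 / (1.0 + math.exp(-scaled))
```

Because the identity check runs first, two identical words receive the second preset value even when the word has no vector, which follows the order of the steps as written.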
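Optionally, the cost of the word replacement operation may be determined through the following steps: determining the word to be replaced in the first text as the target word in the first text; determining the word in the second text that replaces the word to be replaced as the target word in the second text; determining the semantic similarity between the target word in the first text and the target word in the second text; and determining the difference between a third preset value and the semantic similarity as the cost of the word replacement operation.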
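As a small sketch of the replacement cost, assuming the semantic similarity is available as a callable: the name `replace_cost`, the default of 1.0 for the third preset value, and the toy similarity in the usage lines are all hypothetical.

```python
from typing import Callable


def replace_cost(
    source_word: str,
    replacement_word: str,
    similarity: Callable[[str, str], float],
    third_value: float = 1.0,  # hypothetical "third preset value"
) -> float:
    """Cost of replacing source_word with replacement_word."""
    return third_value - similarity(source_word, replacement_word)


# Toy similarity used only for illustration: identical words score 1.0,
# everything else 0.25.
def toy_similarity(a: str, b: str) -> float:
    return 1.0 if a == b else 0.25


same_cost = replace_cost("good", "good", toy_similarity)
different_cost = replace_cost("good", "fine", toy_similarity)
```

Under these assumed values, replacing a word with itself costs nothing, and the cost grows as the two words become less similar.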
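Optionally, the cost of the word deletion operation may be determined through the following steps: taking the word to be deleted from the first text as the target word in the first text, and determining, one by one, the semantic similarity between each word in the second text and the target word in the first text; determining the word in the second text that corresponds to the maximum semantic similarity as the target word in the second text; determining the product of the maximum similarity and a third preset parameter; and determining the difference between a fourth preset parameter and the product as the cost of the word deletion operation.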
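Optionally, the cost of the word insertion operation may be determined through the following steps: taking the word in the second text that is to be inserted into the first text as the target word in the second text, and determining, one by one, the semantic similarity between each word in the first text and the target word in the second text; determining the word in the first text that corresponds to the maximum semantic similarity as the target word in the first text; determining the product of the maximum similarity and the third preset parameter; and determining the difference between the fourth preset parameter and the product as the cost of the word insertion operation.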
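Deletion and insertion costs follow the same pattern with the roles of the two texts swapped, so one helper can cover both. The sketch below is illustrative: the name `indel_cost`, the default parameter values, and the choice to treat the maximum similarity as 0.0 when the other text is empty are assumptions, not details taken from the application.

```python
from typing import Callable, Sequence


def indel_cost(
    target_word: str,
    other_text: Sequence[str],
    similarity: Callable[[str, str], float],
    third_param: float = 0.5,   # hypothetical "third preset parameter"
    fourth_param: float = 1.0,  # hypothetical "fourth preset parameter"
) -> float:
    """Cost of deleting (or inserting) target_word.

    For a deletion, other_text is the second text; for an insertion,
    other_text is the first text.
    """
    # Assumption: with no words to compare against, the best similarity is 0.0.
    best = max((similarity(target_word, w) for w in other_text), default=0.0)
    return fourth_param - third_param * best


# Toy similarity used only for illustration.
def toy_similarity(a: str, b: str) -> float:
    return 1.0 if a == b else 0.5
```

The effect is that removing or adding a word is cheaper when the other text already contains a semantically close word.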
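Optionally, normalizing the minimum edit distance and determining the normalized value as the similarity between the first text and the second text may include: determining the numbers of words in the word sequences that make up the first text and the second text as a first number and a second number, respectively; and determining the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and a comparison of the fourth preset parameter with a preset threshold.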
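Optionally, determining the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and the comparison of the fourth preset parameter with the preset threshold may include: in response to determining that the fourth preset parameter is less than the preset threshold, performing the following steps: determining the sum of the first number and the second number as a first intermediate value; determining the product of the first intermediate value and the fourth preset parameter as a second intermediate value; determining the ratio of the minimum edit distance to the second intermediate value; and determining the difference between a fourth preset value and the ratio as the similarity between the first text and the second text.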
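Optionally, determining the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and the comparison of the fourth preset parameter with the preset threshold may include: in response to determining that the fourth preset parameter is not less than the preset threshold, performing the following steps: determining the difference between the second number and the first number as a third intermediate value; determining the product of the third intermediate value and the fourth preset parameter as a fourth intermediate value; determining the sum of the fourth intermediate value and the first number as a fifth intermediate value; determining the ratio of the minimum edit distance to the fifth intermediate value; and determining the difference between the fourth preset value and the ratio as the similarity between the first text and the second text.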
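The two normalization branches can be written out as a short sketch. The function name and the default values for the preset threshold and the fourth preset value are hypothetical, and raising an error on a zero denominator is a choice made for this example because the passage does not address that case.

```python
def normalized_similarity(
    distance: float,
    first_count: int,
    second_count: int,
    fourth_param: float,
    threshold: float = 0.5,     # hypothetical "preset threshold"
    fourth_value: float = 1.0,  # hypothetical "fourth preset value"
) -> float:
    """Turn a minimum edit distance into a text similarity score."""
    if fourth_param < threshold:
        # (first number + second number) * fourth preset parameter
        denominator = (first_count + second_count) * fourth_param
    else:
        # (second number - first number) * fourth preset parameter + first number
        denominator = (second_count - first_count) * fourth_param + first_count
    if denominator == 0:
        # Not covered by the source text; handled explicitly here.
        raise ValueError("normalization denominator is zero")
    return fourth_value - distance / denominator
```

For instance, under these assumed defaults, a distance of 1.0 between a 2-word text and a 4-word text with a fourth preset parameter of 1.0 gives a denominator of (4 - 2) * 1.0 + 2 = 4 and a similarity of 0.75.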
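Optionally, after the similarity between the first text and the second text has been determined, a similarity calculation result containing the similarity may also be displayed; or, in response to determining that the similarity is greater than a preset similarity threshold, a correspondence between the first text and the second text may be established, and correspondence information representing the correspondence may be stored.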
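The optional post-processing step might look like the following sketch. The in-memory list used as storage, the record layout, the function name, and the threshold of 0.8 are all assumptions for illustration; the application does not specify how the correspondence information is stored.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, float]]


def record_if_similar(
    first_text: str,
    second_text: str,
    similarity: float,
    store: List[Record],
    similarity_threshold: float = 0.8,  # hypothetical preset similarity threshold
) -> bool:
    """Store a correspondence record when the similarity exceeds the threshold."""
    if similarity > similarity_threshold:
        store.append(
            {"first": first_text, "second": second_text, "similarity": similarity}
        )
        return True
    return False
```

The comparison is strict, matching the wording "greater than a preset similarity threshold".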
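The above description covers only preferred embodiments of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.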

Claims (24)

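  1. A method for generating information, comprising:
    determining, by using a dynamic programming algorithm, a minimum edit distance for converting a first text into a second text by performing editing operations on the first text, wherein the minimum edit distance is determined based on costs of the editing operations, the cost of an editing operation is determined based on a semantic similarity between a target word in the first text and a target word in the second text, a target word is a word involved in the editing operation, and the editing operations are divided into a word deletion operation, a word insertion operation, and a word replacement operation; and
    normalizing the minimum edit distance, and determining the normalized value as the similarity between the first text and the second text.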
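  2. The method for generating information according to claim 1, wherein the semantic similarity between the target word in the first text and the target word in the second text is determined through the following semantic similarity determination steps:
    determining whether the target word in the first text is the same as the target word in the second text;
    if not, determining, for each of the target word in the first text and the target word in the second text, whether it is an out-of-vocabulary word; and
    in response to determining that the target word in the first text and/or the target word in the second text is an out-of-vocabulary word, determining a first preset value as the semantic similarity between the target word in the first text and the target word in the second text.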
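  3. The method for generating information according to claim 2, wherein the semantic similarity determination steps further comprise:
    in response to determining that neither the target word in the first text nor the target word in the second text is an out-of-vocabulary word, performing the following steps:
    determining the cosine similarity between the word vector of the target word in the first text and the word vector of the target word in the second text;
    determining the product of the cosine similarity and a first preset parameter; and
    inputting the sum of the product and a second preset parameter into a target function, and determining the value of the target function as the semantic similarity between the target word in the first text and the target word in the second text.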
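  4. The method for generating information according to claim 2, wherein the semantic similarity determination steps further comprise:
    in response to determining that the target word in the first text is the same as the target word in the second text, determining a second preset value as the semantic similarity between the target word in the first text and the target word in the second text.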
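  5. The method for generating information according to claim 1, wherein the cost of the word replacement operation is determined through the following steps:
    determining the word to be replaced in the first text as the target word in the first text;
    determining the word in the second text that replaces the word to be replaced as the target word in the second text;
    determining the semantic similarity between the target word in the first text and the target word in the second text; and
    determining the difference between a third preset value and the semantic similarity as the cost of the word replacement operation.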
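  6. The method for generating information according to claim 1, wherein the cost of the word deletion operation is determined through the following steps:
    taking the word to be deleted from the first text as the target word in the first text, and determining, one by one, the semantic similarity between each word in the second text and the target word in the first text; and
    determining the word in the second text that corresponds to the maximum semantic similarity as the target word in the second text, determining the product of the maximum similarity and a third preset parameter, and determining the difference between a fourth preset parameter and the product as the cost of the word deletion operation.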
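  7. The method for generating information according to claim 1, wherein the cost of the word insertion operation is determined through the following steps:
    taking the word in the second text that is to be inserted into the first text as the target word in the second text, and determining, one by one, the semantic similarity between each word in the first text and the target word in the second text; and
    determining the word in the first text that corresponds to the maximum semantic similarity as the target word in the first text, determining the product of the maximum similarity and a third preset parameter, and determining the difference between a fourth preset parameter and the product as the cost of the word insertion operation.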
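  8. The method for generating information according to claim 6 or 7, wherein normalizing the minimum edit distance and determining the normalized value as the similarity between the first text and the second text comprises:
    determining the numbers of words in the word sequences that make up the first text and the second text as a first number and a second number, respectively; and
    determining the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and a comparison of the fourth preset parameter with a preset threshold.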
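  9. The method for generating information according to claim 8, wherein determining the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and the comparison of the fourth preset parameter with the preset threshold comprises:
    in response to determining that the fourth preset parameter is less than the preset threshold, performing the following steps:
    determining the sum of the first number and the second number as a first intermediate value;
    determining the product of the first intermediate value and the fourth preset parameter as a second intermediate value;
    determining the ratio of the minimum edit distance to the second intermediate value; and
    determining the difference between a fourth preset value and the ratio as the similarity between the first text and the second text.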
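  10. The method for generating information according to claim 8, wherein determining the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and the comparison of the fourth preset parameter with the preset threshold comprises:
    in response to determining that the fourth preset parameter is not less than the preset threshold, performing the following steps:
    determining the difference between the second number and the first number as a third intermediate value;
    determining the product of the third intermediate value and the fourth preset parameter as a fourth intermediate value;
    determining the sum of the fourth intermediate value and the first number as a fifth intermediate value;
    determining the ratio of the minimum edit distance to the fifth intermediate value; and
    determining the difference between a fourth preset value and the ratio as the similarity between the first text and the second text.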
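  11. The method for generating information according to claim 1, wherein the method further comprises:
    displaying a similarity calculation result containing the similarity; or
    in response to determining that the similarity is greater than a preset similarity threshold, establishing a correspondence between the first text and the second text, and storing correspondence information representing the correspondence.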
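  12. An apparatus for generating information, comprising:
    a first determining unit configured to determine, by using a dynamic programming algorithm, a minimum edit distance for converting a first text into a second text by performing editing operations on the first text, wherein the minimum edit distance is determined based on costs of the editing operations, the cost of an editing operation is determined based on a semantic similarity between a target word in the first text and a target word in the second text, a target word is a word involved in the editing operation, and the editing operations are divided into a word deletion operation, a word insertion operation, and a word replacement operation; and
    a second determining unit configured to normalize the minimum edit distance and determine the normalized value as the similarity between the first text and the second text.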
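  13. The apparatus for generating information according to claim 12, wherein the first determining unit is further configured to perform the following semantic similarity determination steps:
    determining whether the target word in the first text is the same as the target word in the second text;
    if not, determining, for each of the target word in the first text and the target word in the second text, whether it is an out-of-vocabulary word; and
    in response to determining that the target word in the first text and/or the target word in the second text is an out-of-vocabulary word, determining a first preset value as the semantic similarity between the target word in the first text and the target word in the second text.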
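  14. The apparatus for generating information according to claim 13, wherein the semantic similarity determination steps further comprise:
    in response to determining that neither the target word in the first text nor the target word in the second text is an out-of-vocabulary word, performing the following steps:
    determining the cosine similarity between the word vector of the target word in the first text and the word vector of the target word in the second text;
    determining the product of the cosine similarity and a first preset parameter; and
    inputting the sum of the product and a second preset parameter into a target function, and determining the value of the target function as the semantic similarity between the target word in the first text and the target word in the second text.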
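  15. The apparatus for generating information according to claim 13, wherein the semantic similarity determination steps further comprise:
    in response to determining that the target word in the first text is the same as the target word in the second text, determining a second preset value as the semantic similarity between the target word in the first text and the target word in the second text.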
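  16. The apparatus for generating information according to claim 12, wherein the first determining unit is further configured to perform the following steps:
    determining the word to be replaced in the first text as the target word in the first text;
    determining the word in the second text that replaces the word to be replaced as the target word in the second text;
    determining the semantic similarity between the target word in the first text and the target word in the second text; and
    determining the difference between a third preset value and the semantic similarity as the cost of the word replacement operation.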
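  17. The apparatus for generating information according to claim 12, wherein the first determining unit is further configured to perform the following steps:
    taking the word to be deleted from the first text as the target word in the first text, and determining, one by one, the semantic similarity between each word in the second text and the target word in the first text; and
    determining the word in the second text that corresponds to the maximum semantic similarity as the target word in the second text, determining the product of the maximum similarity and a third preset parameter, and determining the difference between a fourth preset parameter and the product as the cost of the word deletion operation.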
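  18. The apparatus for generating information according to claim 12, wherein the first determining unit is further configured to perform the following steps:
    taking the word in the second text that is to be inserted into the first text as the target word in the second text, and determining, one by one, the semantic similarity between each word in the first text and the target word in the second text; and
    determining the word in the first text that corresponds to the maximum semantic similarity as the target word in the first text, determining the product of the maximum similarity and a third preset parameter, and determining the difference between a fourth preset parameter and the product as the cost of the word insertion operation.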
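  19. The apparatus for generating information according to claim 17 or 18, wherein the second determining unit comprises:
    a first determining module configured to determine the numbers of words in the word sequences that make up the first text and the second text as a first number and a second number, respectively; and
    a second determining module configured to determine the similarity between the first text and the second text based on the minimum edit distance, the first number, the second number, and a comparison of the fourth preset parameter with a preset threshold.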
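  20. The apparatus for generating information according to claim 19, wherein the second determining module is further configured to:
    in response to determining that the fourth preset parameter is less than the preset threshold, perform the following steps:
    determining the sum of the first number and the second number as a first intermediate value;
    determining the product of the first intermediate value and the fourth preset parameter as a second intermediate value;
    determining the ratio of the minimum edit distance to the second intermediate value; and
    determining the difference between a fourth preset value and the ratio as the similarity between the first text and the second text.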
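  21. The apparatus for generating information according to claim 19, wherein the second determining module is further configured to:
    in response to determining that the fourth preset parameter is not less than the preset threshold, perform the following steps:
    determining the difference between the second number and the first number as a third intermediate value;
    determining the product of the third intermediate value and the fourth preset parameter as a fourth intermediate value;
    determining the sum of the fourth intermediate value and the first number as a fifth intermediate value;
    determining the ratio of the minimum edit distance to the fifth intermediate value; and
    determining the difference between a fourth preset value and the ratio as the similarity between the first text and the second text.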
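  22. The apparatus for generating information according to claim 12, wherein the apparatus further comprises:
    a display unit configured to display a similarity calculation result containing the similarity; or
    a storage unit configured to, in response to determining that the similarity is greater than a preset similarity threshold, establish a correspondence between the first text and the second text, and store correspondence information representing the correspondence.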
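  23. An electronic device, comprising:
    one or more processors; and
    a storage device storing one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to:
    determine, by using a dynamic programming algorithm, a minimum edit distance for converting a first text into a second text by performing editing operations on the first text, wherein the minimum edit distance is determined based on costs of the editing operations, the cost of an editing operation is determined based on a semantic similarity between a target word in the first text and a target word in the second text, a target word is a word involved in the editing operation, and the editing operations are divided into a word deletion operation, a word insertion operation, and a word replacement operation; and
    normalize the minimum edit distance, and determine the normalized value as the similarity between the first text and the second text.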
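  24. A computer-readable medium storing a computer program, wherein the program, when executed by a processor, causes the processor to:
    determine, by using a dynamic programming algorithm, a minimum edit distance for converting a first text into a second text by performing editing operations on the first text, wherein the minimum edit distance is determined based on costs of the editing operations, the cost of an editing operation is determined based on a semantic similarity between a target word in the first text and a target word in the second text, a target word is a word involved in the editing operation, and the editing operations are divided into a word deletion operation, a word insertion operation, and a word replacement operation; and
    normalize the minimum edit distance, and determine the normalized value as the similarity between the first text and the second text.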
PCT/CN2018/107990 2018-09-27 2018-09-27 Method and apparatus for generating information WO2020061910A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/107990 WO2020061910A1 (zh) 2018-09-27 2018-09-27 Method and apparatus for generating information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/107990 WO2020061910A1 (zh) 2018-09-27 2018-09-27 Method and apparatus for generating information

Publications (1)

Publication Number Publication Date
WO2020061910A1 true WO2020061910A1 (zh) 2020-04-02

Family

ID=69949475

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/107990 WO2020061910A1 (zh) 2018-09-27 2018-09-27 Method and apparatus for generating information

Country Status (1)

Country Link
WO (1) WO2020061910A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120059821A1 (en) * 2008-07-03 2012-03-08 Tsinghua University Method for Efficiently Supporting Interactive, Fuzzy Search on Structured Data
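CN103902597A (zh) * 2012-12-27 2014-07-02 百度在线网络技术(北京)有限公司 Method and device for determining the search relevance category corresponding to a target keyword
CN104090865A (zh) * 2014-07-08 2014-10-08 安一恒通(北京)科技有限公司 Text similarity calculation method and device
CN105446957A (zh) * 2015-12-03 2016-03-30 小米科技有限责任公司 Similarity determination method, device, and terminal
CN106126494A (zh) * 2016-06-16 2016-11-16 上海智臻智能网络科技股份有限公司 Synonym discovery method and device, and data processing method and device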

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.07.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18934904

Country of ref document: EP

Kind code of ref document: A1