WO2020020287A1 - Method, apparatus, and device for obtaining text similarity, and readable storage medium - Google Patents

Method, apparatus, and device for obtaining text similarity, and readable storage medium

Info

Publication number
WO2020020287A1
WO2020020287A1 (PCT/CN2019/097691; CN2019097691W)
Authority
WO
WIPO (PCT)
Prior art keywords
distance
text
text pair
obtaining
word
Prior art date
Application number
PCT/CN2019/097691
Other languages
English (en)
French (fr)
Inventor
李鹏
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2020020287A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization


Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure discloses a method, an apparatus, and a device for obtaining text similarity, and a readable storage medium. The method includes: obtaining numerical features of a text pair according to a data set of text pairs; constructing a sample feature matrix from the numerical features of the text pairs; performing model training according to the sample feature matrix and a prediction vector to obtain a prediction model; and obtaining a target text pair, and obtaining a similarity score of the target text pair according to the sample feature matrix and the prediction model. (Abstract figure: FIG. 1)

Description

Method, apparatus, and device for obtaining text similarity, and readable storage medium

Technical Field

The present disclosure relates to, but is not limited to, the field of communication technologies, and in particular to a method, an apparatus, and a device for obtaining text similarity, and a readable storage medium.
Background

In the era of information explosion, people increasingly need to obtain the content they want quickly and accurately from massive amounts of information. Many applications have emerged to meet this need, such as information retrieval, intelligent question answering, document plagiarism checking, and personalized recommendation. Behind these applications, text similarity computation is one of the key core technologies.

Text similarity is widely discussed in different fields. Because application scenarios differ, its connotation varies, and there is no unified, generally accepted definition. From the perspective of information theory, text similarity is related to the commonality and difference between texts: the greater the commonality and the smaller the difference, the higher the similarity between the texts; conversely, the smaller the commonality and the greater the difference, the lower the similarity.
Summary

The present disclosure provides a method, an apparatus, and a device for obtaining text similarity, and a readable storage medium.

According to one aspect, an embodiment of the present disclosure provides a method for obtaining text similarity, including: obtaining numerical features of a text pair according to a data set of text pairs; constructing a sample feature matrix from the numerical features of the text pairs; performing model training according to the sample feature matrix and a prediction vector to obtain a prediction model; and obtaining a target text pair, and obtaining a similarity score of the target text pair according to the sample feature matrix and the prediction model.

According to another aspect, an embodiment of the present disclosure provides an apparatus for obtaining text similarity, including: a training module configured to obtain numerical features of a text pair according to a data set of text pairs; a matrix construction module configured to construct a sample feature matrix from the numerical features of the text pairs; a prediction module configured to perform model training according to the sample feature matrix and a prediction vector to obtain a prediction model; and an online acquisition module configured to obtain a target text pair and obtain a similarity score of the target text pair according to the sample feature matrix and the prediction model.

According to yet another aspect, an embodiment of the present disclosure provides an electronic device, including a memory, a processor, and at least one application program stored in the memory and configured to be executed by the processor, the application program being configured to perform the method for obtaining text similarity described above.

According to yet another aspect, an embodiment of the present disclosure provides a readable storage medium storing a computer program which, when executed by a processor, implements the method for obtaining text similarity described above.
Brief Description of the Drawings

FIG. 1 is a flowchart of a method for obtaining text similarity according to an embodiment of the present disclosure;

FIG. 2 is a flowchart of step S10 in FIG. 1;

FIG. 3 is a flowchart of step S40 in FIG. 1;

FIG. 4 is an exemplary structural block diagram of an apparatus for obtaining text similarity according to an embodiment of the present disclosure;

FIG. 5 is an exemplary structural block diagram of the training module in FIG. 4; and

FIG. 6 is an exemplary structural block diagram of the online acquisition module in FIG. 4.
Detailed Description

To make the technical problems to be solved, the technical solutions, and the beneficial effects of the present disclosure clearer, the present disclosure is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present disclosure and are not intended to limit it.

An embodiment of the present disclosure provides a method for obtaining text similarity. As shown in FIG. 1, the method includes:

S10. Obtaining numerical features of a text pair according to a data set of text pairs;

S20. Constructing a sample feature matrix from the numerical features of the text pairs;

S30. Performing model training according to the sample feature matrix and a prediction vector to obtain a prediction model; and

S40. Obtaining a target text pair, and obtaining a similarity score of the target text pair according to the sample feature matrix and the prediction model.
In this embodiment, text similarity is judged by obtaining multiple numerical features of a text pair, taking both semantics and syntactic structure into account. The method considers semantic similarity between texts, including features such as word sense, edit distance, and the bag-of-words model, as well as grammatical similarity based on syntactic structure. It combines semantics and syntax and uses a neural network for higher-level feature extraction, with the advantages of trainable weights, little manual intervention, simplicity and speed, ease of implementation, and high accuracy, thereby improving the user experience.

In step S10, a large amount of labeled text pair data is first prepared as a training corpus. Each sample in the training corpus is a text pair and its labeled similarity score, which can be formally expressed as, for example, [text1; text2; score], where text1 and text2 are the text pair used for similarity acquisition and score is the labeled similarity score of text1 and text2. Labeled scores may come from manual annotation or from other a priori information, such as a user's satisfaction with the system's answers in a question answering system, or a user's browsing of the ranked results in a retrieval system. All samples are saved in the file originalData.txt, one training sample per line, with text1, text2, and score separated by tab characters. In an exemplary embodiment, the similarity score is a real number between 0 and 1; a larger value indicates higher similarity between the texts of the pair, and vice versa. A score of 0 indicates that the texts are completely unrelated, and a score of 1 indicates that they are identical. The precision of the score depends on its source: manually labeled scores may be one-digit decimals such as 0.3 or 0.6, while scores from other application systems may be multi-digit decimals such as 0.563 or 0.8192. The training corpus can be used as a standard reference corpus.
In an exemplary embodiment, the file originalData.txt has the following form (the sample rows are shown as a table image in the original publication).
In this embodiment, assuming the training corpus file contains M lines of text pairs and N numerical features are obtained for each text pair in the training corpus, the sample feature matrix extracted from the training corpus can be expressed as X ∈ R^(M×N). Taking the labeled similarity score of each text pair in the training corpus as the prediction value of that sample, a prediction vector y ∈ R^(M×1) can be extracted from the training corpus. The training data set can therefore be expressed as D = [X, y].
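To make the data preparation concrete, here is a minimal Python sketch (not part of the patent) that parses originalData.txt and assembles the training set D = [X, y]; the helper `feature_fn` is a hypothetical stand-in for the four feature computations described below.

```python
import numpy as np

def load_corpus(path="originalData.txt"):
    """Parse tab-separated lines of the form: text1 <TAB> text2 <TAB> score."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            text1, text2, score = line.rstrip("\n").split("\t")
            samples.append((text1, text2, float(score)))
    return samples

def build_dataset(samples, feature_fn):
    """Build X (M x N) and y (M x 1) from labeled text pairs.

    feature_fn(text1, text2) -> list of N numerical features
    (here N = 4: c_A, c_B, c_C, c_D, computed as described below).
    """
    X = np.array([feature_fn(t1, t2) for t1, t2, _ in samples])
    y = np.array([[s] for _, _, s in samples])
    return X, y  # the training set D = [X, y]
```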
In this embodiment, the numerical features include: a semantic feature based on an ordered edit distance, a semantic feature based on an unordered edit distance, a semantic feature based on word sense distance, and a syntactic feature based on dependency relations.

In this embodiment, besides the ordered edit distance, the movement distance of unordered words is also considered, which adapts better to texts that simply reverse word order and can greatly improve the recall of the system. Moreover, the method of this embodiment also obtains syntactic similarity from the number of valid dependency pairs in a sentence, which better measures the core word of a sentence and the number of words that have a dependency relation with it.
As shown in FIG. 2, in this embodiment, step S10 includes:

S11. Obtaining a training corpus file, where the training corpus file includes several text pairs and a similarity score for each text pair;

S12. Obtaining a training data set according to the training corpus file; and

S13. Obtaining a word vector matrix from the training data set.
In this embodiment, for example, the word vector training may use the Word2Vec method, with the following specific steps:

S131. Generate a new training corpus file originalDataForWord2Vec.txt from the file originalData.txt: for each sample line in originalData.txt, take only text1 and text2 and store them on two separate lines. The corpus file originalDataForWord2Vec.txt has the following form:

我想问下在哪里可以购入中兴手机
中兴手机在哪里购买
中兴公司在南京市雨花台区
南京雨花台区的中兴通讯公司
智能问答系统团队又出新成果
智能问答领域日新月异
办理信用卡的渠道有哪些
借记卡申请的方式

S132. Train word vectors with word2vec, where the vector length is denoted d_w (for example, d_w = 400).

S133. Denote the trained word2vec model as a matrix, written here as E ∈ R^(|V|×d_w) (the original symbol appears as an image), where V is the vocabulary formed by all words in the corpus file, |V| is the number of words in the vocabulary, and R^(|V|×d_w) denotes the set of real matrices with |V| rows and d_w columns.

S134. Represent a word w by the word vector obtained from the word2vec model, written here as e_w ∈ R^(1×d_w), i.e., a matrix with 1 row and d_w columns, where w is a variable that can refer to any word, such as "中兴" (ZTE).
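Steps S131 to S134 can be sketched as follows. This is a hedged illustration: the patent names only the word2vec method, not a library, so the gensim implementation, the jieba segmenter, and the window size are assumptions.

```python
import jieba                         # Chinese word segmenter (an assumption; any segmenter works)
from gensim.models import Word2Vec   # one common word2vec implementation (an assumption)

# S131: keep only text1 and text2 of each sample, one sentence per line.
with open("originalData.txt", encoding="utf-8") as src, \
     open("originalDataForWord2Vec.txt", "w", encoding="utf-8") as dst:
    for line in src:
        text1, text2, _ = line.rstrip("\n").split("\t")
        dst.write(text1 + "\n" + text2 + "\n")

# S132: train word vectors with d_w = 400; sentences must be tokenized first.
with open("originalDataForWord2Vec.txt", encoding="utf-8") as f:
    sentences = [jieba.lcut(line.strip()) for line in f if line.strip()]
model = Word2Vec(sentences, vector_size=400, window=5, min_count=1)

# S133/S134: model.wv holds the |V| x d_w matrix E; e_w is one row of it.
e_w = model.wv["中兴"]   # the 1 x d_w word vector for the word "中兴" (ZTE)
```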
Step S10 further includes: S14. Obtaining a first improved edit distance between the text pair according to the word vector matrix and the edit distance, as the semantic feature based on ordered edit distance.

In this embodiment, the edit operations defined in the first improved edit distance c_A include match (Mat), insert (Ins), delete (Del), and substitute (Sub), with corresponding operation costs c_Mat, c_Ins, c_Del, and c_Sub. The specific calculation steps are as follows (a code sketch follows step S145):

S141. Perform word segmentation and stop-word removal on text1 and text2 to obtain word sequences t1 and t2.

For example, text1 is "我想申请内购中兴手机了" ("I want to apply for the in-app purchase of a ZTE phone"); after segmentation it is [我|想|申请|内购|中兴|手机|了], and after removing stop words the word sequence t1 is [申请|内购|中兴|手机]. text2 is "如何申请一下中兴产品的内购呢" ("How do I apply for the in-app purchase of a ZTE product?"); after segmentation it is [如何|申请|一下|中兴|产品|的|内购|呢], and after removing stop words the word sequence t2 is [如何|申请|中兴|产品|内购]. Here, "我", "想", "了", "一下", "的", and "呢" are all stop words.
S142. Use a general method (such as one based on dynamic programming) to compute the edit path Path_A from word sequence t1 to word sequence t2 and the corresponding edit element sequence Elements_A.

For example, a general method computes, for t1 = [申请|内购|中兴|手机] to t2 = [如何|申请|中兴|产品|内购], the edit path Path_A = [Ins, Mat, Sub, Sub, Sub] and the corresponding edit element sequence Elements_A = [如何, 申请, 内购→中兴, 中兴→产品, 手机→内购], where elements without an arrow correspond to Mat, Ins, or Del operations and elements with an arrow correspond to Sub operations.
S143. Obtain the corresponding edit operation cost vector Action_A from the edit path Path_A. Specifically, replace each edit operation with its corresponding operation cost to form the edit operation cost vector. For example, for the edit path Path_A = [Ins, Mat, Sub, Sub, Sub], the corresponding edit operation cost vector is [c_Ins, c_Mat, c_Sub, c_Sub, c_Sub].
S144. Compute an edit element distance for each element in the edit element sequence Elements_A, yielding the edit element distance vector Dis_A. Specifically, the edit element distance of Mat, Ins, and Del operations is 1, and the edit element distance of a Sub operation is sim_cos(w1, w2), the cosine similarity of the words w1 and w2 computed from their word vectors:

sim_cos(w1, w2) = (e_w1 · e_w2) / (‖e_w1‖ ‖e_w2‖).

For example, for the edit element sequence Elements_A = [如何, 申请, 内购→中兴, 中兴→产品, 手机→内购], the corresponding edit element distance vector Dis_A is [1, 1, 0.218, 0.294, 0.511].
S145. Compute the improved edit distance between the two texts from the edit operation cost vector Action_A and the corresponding edit element distance vector Dis_A, as the semantic feature based on ordered edit distance: c_A is the sum over the path of operation cost times element distance (the dot product of Action_A and Dis_A).

For example, with the edit operation cost vector [c_Ins, c_Mat, c_Sub, c_Sub, c_Sub] and the corresponding edit element distance vector [1, 1, 0.218, 0.294, 0.511]:

c_A = 1*c_Ins + 1*c_Mat + 0.218*c_Sub + 0.294*c_Sub + 0.511*c_Sub.
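Steps S141 to S145 can be realized with a standard dynamic-programming edit path plus the word vector cosine similarity. The sketch below is illustrative, not from the patent: the operation costs are left as parameters (the patent does not fix their values), `wv` is a mapping from word to vector such as `model.wv` above, and the backtracking picks one optimal path when several exist.

```python
import numpy as np

def cosine_sim(u, v):
    """sim_cos over two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def edit_path(t1, t2):
    """Compute one optimal edit path (S142) from word sequence t1 to t2
    by standard dynamic programming, as a list of (op, word1, word2)."""
    m, n = len(t1), len(t2)
    dp = np.zeros((m + 1, n + 1))
    dp[:, 0] = np.arange(m + 1)
    dp[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if t1[i - 1] == t2[j - 1]:
                dp[i, j] = dp[i - 1, j - 1]
            else:
                dp[i, j] = 1 + min(dp[i - 1, j - 1], dp[i - 1, j], dp[i, j - 1])
    path, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and t1[i - 1] == t2[j - 1] and dp[i, j] == dp[i - 1, j - 1]:
            path.append(("Mat", t1[i - 1], t2[j - 1])); i -= 1; j -= 1
        elif i > 0 and j > 0 and dp[i, j] == dp[i - 1, j - 1] + 1:
            path.append(("Sub", t1[i - 1], t2[j - 1])); i -= 1; j -= 1
        elif i > 0 and dp[i, j] == dp[i - 1, j] + 1:
            path.append(("Del", t1[i - 1], None)); i -= 1
        else:
            path.append(("Ins", None, t2[j - 1])); j -= 1
    return list(reversed(path))

def c_A(t1, t2, wv, cost):
    """S143-S145: c_A = sum over the path of operation cost * element distance,
    where Mat/Ins/Del elements have distance 1 and Sub uses sim_cos."""
    total = 0.0
    for op, w1, w2 in edit_path(t1, t2):
        dis = cosine_sim(wv[w1], wv[w2]) if op == "Sub" else 1.0
        total += cost[op] * dis
    return total

# Example from the text (the cost values here are illustrative placeholders):
cost = {"Mat": 0.0, "Ins": 1.0, "Del": 1.0, "Sub": 1.0}
t1 = ["申请", "内购", "中兴", "手机"]
t2 = ["如何", "申请", "中兴", "产品", "内购"]
# edit_path(t1, t2) recovers the path from the text:
# [Ins 如何, Mat 申请, Sub 内购→中兴, Sub 中兴→产品, Sub 手机→内购]
```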
Step S10 further includes: S15. Computing a second improved edit distance between the text pair according to the edit distance and a bag-of-words model, as the semantic feature based on unordered edit distance.

In this embodiment, the edit operations defined in the second improved edit distance c_B include match (Mat), insert (Ins), and delete (Del), with corresponding operation costs c_Mat, c_Ins, and c_Del. The specific calculation steps are as follows (a code sketch follows step S154):
S151. Perform word segmentation and stop-word removal on text1 and text2 to obtain word sequences t1 and t2.

S152. Add all distinct words of word sequences t1 and t2 to a set, forming the bag of words BOW. For example, for t1 = [申请|内购|中兴|手机] and t2 = [如何|申请|中兴|产品|内购], the resulting bag of words BOW is [如何|申请|内购|中兴|手机|产品].
S153. Compute the edit distance from t1 to t2 according to the bag of words BOW and t1, t2. In an exemplary computation, for a word w in the bag of words BOW: if the word or one of its synonyms exists in t1 and the word or one of its synonyms exists in t2, perform operation Mat; if the word or one of its synonyms exists in t1 but not in t2, perform operation Del; if the word or one of its synonyms does not exist in t1 but exists in t2, perform operation Ins. After performing these operations for all words in the bag of words BOW in turn, the edit path Path_B is obtained, and from it the corresponding edit operation cost vector Action_B.

For example, for t1 = [申请|内购|中兴|手机] to t2 = [如何|申请|中兴|产品|内购], the edit path Path_B is [Ins, Mat, Mat, Mat, Del, Ins], so the edit operation cost vector Action_B is [c_Ins, c_Mat, c_Mat, c_Mat, c_Del, c_Ins].
S154. Sum all elements of the edit operation cost vector Action_B to obtain the second improved edit distance c_B between the two texts, as the semantic feature based on unordered edit distance. For example, for the edit operation cost vector Action_B = [c_Ins, c_Mat, c_Mat, c_Mat, c_Del, c_Ins], c_B = c_Ins + c_Mat + c_Mat + c_Mat + c_Del + c_Ins.
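A sketch of steps S151 to S154 follows; it is not from the patent, and the synonym dictionary and cost values are placeholders.

```python
def c_B(t1, t2, synonyms=None, c_mat=0.0, c_ins=1.0, c_del=1.0):
    """Second improved edit distance (S151-S154) over the shared bag of words.
    `synonyms` maps a word to a set of synonyms; cost values are placeholders."""
    synonyms = synonyms or {}

    def occurs(word, seq):
        # The word itself, or one of its synonyms, appears in the sequence.
        return word in seq or any(s in seq for s in synonyms.get(word, ()))

    bow = list(dict.fromkeys(t1 + t2))  # S152: all distinct words of t1 and t2
    total = 0.0
    for w in bow:                       # S153: one Mat/Del/Ins decision per word
        if occurs(w, t1) and occurs(w, t2):
            total += c_mat              # Mat: present in both texts
        elif occurs(w, t1):
            total += c_del              # Del: present in t1 only
        else:
            total += c_ins              # Ins: present in t2 only
    return total                        # S154: sum of operation costs

# For the running example this yields 3*c_mat + c_del + 2*c_ins,
# matching Action_B = [c_Ins, c_Mat, c_Mat, c_Mat, c_Del, c_Ins].
```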
Step S10 further includes: S16. Computing a word sense distance between the text pair according to the word vector matrix, as the semantic feature based on word sense distance.

In this step, first, word sequences t1 and t2 are obtained by performing word segmentation and stop-word removal on text1 and text2. Suppose t1 contains the words w^1_1, ..., w^1_m and t2 contains the words w^2_1, ..., w^2_n, where the subscript indexes the position in the sequence (m and n are the total lengths of the t1 and t2 word sequences) and the superscript 1 or 2 indicates whether the word belongs to t1 or t2. Next, the word sense distance between each word w^1_i of t1 and each word w^2_j of t2 is computed from their word vectors; from these pairwise values, the word sense distance between a word w^1_i of t1 and t2 as a whole, and between a word w^2_j of t2 and t1 as a whole, are defined. Finally, the word sense similarity between the two texts is computed from these distances and used as the semantic feature based on word sense distance. (The pairwise, per-word, and aggregate formulas appear only as images in the original publication.)
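Because the exact formulas are shown only as images in the original publication, the following sketch makes one common choice and should be read as an assumption, not as the patent's definition: each word's affinity to the other text is its best cosine similarity against that text's words, and the feature (called c_C below, following the feature list later in the text) averages these values over both directions.

```python
import numpy as np

def c_C(t1, t2, wv):
    """Word sense similarity between word sequences t1 and t2 (assumed form).
    `wv` maps a word to its word vector, e.g. model.wv from the word2vec step."""
    def sim(w1, w2):
        u, v = wv[w1], wv[w2]
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Affinity of each word to the other text: best pairwise similarity (assumption).
    d1 = [max(sim(w, x) for x in t2) for w in t1]
    d2 = [max(sim(w, x) for x in t1) for w in t2]
    # Aggregate over both directions (assumption: simple average).
    return (sum(d1) + sum(d2)) / (len(d1) + len(d2))
```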
Step S10 further includes: S17. Performing dependency syntactic analysis on the text pair and computing a syntactic distance between the text pair, as the syntactic feature based on dependency relations.

In this step, first, word sequences t1 and t2 are obtained by performing word segmentation and stop-word removal on text1 and text2. Next, a general method (such as the StanfordNLP or FNLP tools) is used to perform dependency syntactic analysis on t1 and t2 separately, and the numbers of valid word collocation pairs in t1 and t2 are counted, denoted p_1 and p_2. A valid word collocation pair is a pair consisting of the core word of the sentence and a valid word that directly depends on it. The core word is the unique core term of the whole sentence obtained after dependency parsing; valid words are the nouns, verbs, and adjectives identified after dependency parsing.

For example, for t1 = [申请|内购|中兴|手机], after dependency parsing the core word is "内购", the words directly depending on it are "申请" and "手机", and both are valid words, so the number of valid collocation pairs of t1 is 2. From p_1 and p_2, the syntactic structure distance between the two texts is computed as c_D = |p_1 - p_2|, as the syntactic feature based on dependency relations.
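The valid-collocation count can be computed from any dependency parser's output. The sketch below assumes the parse has already been converted to (word, POS, head index) triples with head index 0 marking the core (root) word, and uses a simplified POS set; both are assumptions rather than the patent's specification.

```python
VALID_POS = {"n", "v", "a"}  # nouns, verbs, adjectives (tag set depends on the parser)

def valid_pairs(parse):
    """Count valid word collocation pairs: words that are nouns/verbs/adjectives
    and depend directly on the core (root) word of the sentence.

    `parse` is a list of (word, pos, head_index) triples with 1-based head
    indices and head 0 for the root; adapting the output of tools such as
    StanfordNLP or FNLP to this shape is left to the caller."""
    root = next(i for i, (_, _, head) in enumerate(parse, start=1) if head == 0)
    return sum(1 for _, pos, head in parse if head == root and pos in VALID_POS)

def c_D(parse1, parse2):
    """Syntactic structure distance c_D = |p1 - p2|."""
    return abs(valid_pairs(parse1) - valid_pairs(parse2))

# Example from the text: for t1 = [申请|内购|中兴|手机] the core word is 内购,
# 申请 and 手机 depend directly on it and are valid words, so p1 = 2.
parse1 = [("申请", "v", 2), ("内购", "v", 0), ("中兴", "n", 4), ("手机", "n", 2)]
assert valid_pairs(parse1) == 2
```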
As shown in FIG. 3, in this embodiment, step S40 includes:

S41. Obtaining a target text pair and obtaining its numerical features to form the feature vector of the target text pair; and

S42. Substituting the feature vector of the target text pair into the prediction model to obtain the similarity score of the target text pair.

In this embodiment, a network structure for training is first built, model training is then performed with the sample feature matrix X and the prediction vector y obtained above, and the model is finally saved for subsequent online acquisition. In an exemplary embodiment, the network structure is a multi-layer perceptron (MLP), and the network is trained with the sample feature matrix X and the prediction vector y using a general method.
After training, the resulting model parameters are denoted W_1*, b_1*, W_2*, and b_2*, where W_1* is the connection weight of the first layer of the MLP, b_1* is the bias of the first layer, W_2* is the connection weight of the second layer, and b_2* is the bias of the second layer. The prediction model can then be expressed as

score = g_2( g_1( x_T · W_1* + b_1* ) · W_2* + b_2* ),

where g_1 is the non-linear activation function of the first layer of the MLP, g_2 is the non-linear activation function of the second layer, and x_T is the feature vector of the target text pair.
In this embodiment, for a target text pair t1 and t2 input to the system, the four numerical features c_A, c_B, c_C, and c_D of the text pair are computed in turn according to the calculation steps above, forming the feature vector of the target text pair, x_T = [c_A, c_B, c_C, c_D]. Substituting this feature vector into the prediction model yields the similarity score of the target text pair t1 and t2:

score(t1, t2) = g_2( g_1( x_T · W_1* + b_1* ) · W_2* + b_2* ).
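Putting the pieces together, the following sketch trains a one-hidden-layer MLP on D = [X, y] and scores a target pair. scikit-learn's MLPRegressor is an assumed stand-in for the generic training method (its output activation g_2 is the identity rather than a squashing function, one of several departures from the formula above), and the hidden size, iteration count, and clipping are illustrative choices.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor  # an assumed generic trainer

def train(X, y):
    """Fit a one-hidden-layer MLP to the sample feature matrix X (M x 4)
    and prediction vector y (M x 1); hidden size and iterations are arbitrary."""
    model = MLPRegressor(hidden_layer_sizes=(16,), activation="tanh", max_iter=2000)
    model.fit(X, y.ravel())
    return model

def score(model, c_a, c_b, c_c, c_d):
    """Online acquisition (S41/S42): build x_T = [c_A, c_B, c_C, c_D] and
    substitute it into the prediction model. With sklearn, W1*, b1*, W2*, b2*
    live in model.coefs_[0], model.intercepts_[0], model.coefs_[1],
    model.intercepts_[1], and g1 is tanh."""
    x_T = np.array([[c_a, c_b, c_c, c_d]])
    return float(np.clip(model.predict(x_T)[0], 0.0, 1.0))  # keep score in [0, 1]
```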
An embodiment of the present disclosure provides an apparatus for obtaining text similarity. As shown in FIG. 4, the apparatus includes:

a training module 10 configured to obtain numerical features of a text pair according to a data set of text pairs;

a matrix construction module 20 configured to construct a sample feature matrix from the numerical features of the text pairs;

a prediction module 30 configured to perform model training according to the sample feature matrix and a prediction vector to obtain a prediction model; and

an online acquisition module 40 configured to obtain a target text pair and obtain a similarity score of the target text pair according to the sample feature matrix and the prediction model.
In this embodiment, text similarity is judged by obtaining multiple numerical features of a text pair, taking both semantics and syntactic structure into account. The apparatus considers semantic similarity between texts, including features such as word sense, edit distance, and the bag-of-words model, as well as grammatical similarity based on syntactic structure. It combines semantics and syntax and uses a neural network for higher-level feature extraction, with the advantages of trainable weights, little manual intervention, simplicity and speed, ease of implementation, and high accuracy, thereby improving the user experience.

In this embodiment, a large amount of labeled text pair data is first prepared as a training corpus. Each sample in the training corpus is a text pair and its labeled similarity score, formally expressed as, for example, [text1; text2; score], where text1 and text2 are the text pair used for similarity calculation and score is the labeled similarity score of text1 and text2. Labeled scores may come from manual annotation or from other a priori information, such as a user's satisfaction with the system's answers in a question answering system, or a user's browsing of the ranked results in a retrieval system. All samples are saved in the file originalData.txt, one training sample per line, with text1, text2, and score separated by tab characters. In an exemplary embodiment, the similarity score is a real number between 0 and 1; a larger value indicates higher similarity, and vice versa. A score of 0 indicates that the texts are completely unrelated, and a score of 1 indicates that they are identical. The precision of the score depends on its source: manually labeled scores may be one-digit decimals such as 0.3 or 0.6, while scores from other application systems may be multi-digit decimals such as 0.563 or 0.8192. The training corpus can be used as a standard reference corpus.
In an exemplary embodiment, the file originalData.txt has the following form (the sample rows are shown as a table image in the original publication).

In this embodiment, assuming the training corpus file contains M lines of text pairs and N numerical features are obtained for each text pair, the sample feature matrix extracted from the training corpus can be expressed as X ∈ R^(M×N). Taking the labeled similarity score of each text pair as the prediction value of that sample, a prediction vector y ∈ R^(M×1) can be extracted from the training corpus. The training data set can therefore be expressed as D = [X, y].
In this embodiment, the numerical features include: a semantic feature based on an ordered edit distance, a semantic feature based on an unordered edit distance, a semantic feature based on word sense distance, and a syntactic feature based on dependency relations.

In this embodiment, besides the ordered edit distance, the movement distance of unordered words is also considered, which adapts better to texts that simply reverse word order and can greatly improve the recall of the system. Moreover, the apparatus of this embodiment also calculates syntactic similarity from the number of valid dependency pairs in a sentence, which better measures the core word of a sentence and the number of words that have a dependency relation with it.
As shown in FIG. 5, in this embodiment, the training module 10 includes:

an obtaining unit 11 configured to obtain a training corpus file, where the training corpus file includes several text pairs and a similarity score for each text pair;

an extraction unit 12 configured to obtain a training data set according to the training corpus file; and

a word vector obtaining unit 13 configured to obtain a word vector matrix from the training data set.
In this embodiment, for example, the word vector training may use the Word2Vec method, with the following specific steps:

S131. Generate a new training corpus file originalDataForWord2Vec.txt from the file originalData.txt: for each sample line in originalData.txt, take only text1 and text2 and store them on two separate lines. The corpus file originalDataForWord2Vec.txt has the following form:

我想问下在哪里可以购入中兴手机
中兴手机在哪里购买
中兴公司在南京市雨花台区
南京雨花台区的中兴通讯公司
智能问答系统团队又出新成果
智能问答领域日新月异
办理信用卡的渠道有哪些
借记卡申请的方式

S132. Train word vectors with word2vec, where the vector length is denoted d_w (for example, d_w = 400).

S133. Denote the trained word2vec model as a matrix, written here as E ∈ R^(|V|×d_w) (the original symbol appears as an image), where V is the vocabulary formed by all words in the corpus file, |V| is the number of words in the vocabulary, and R^(|V|×d_w) denotes the set of real matrices with |V| rows and d_w columns.

S134. Represent a word w by the word vector obtained from the word2vec model, written here as e_w ∈ R^(1×d_w), i.e., a matrix with 1 row and d_w columns, where w is a variable that can refer to any word, such as "中兴" (ZTE).
The training module 10 further includes: an ordered edit distance obtaining unit 14 configured to obtain a first improved edit distance between the text pair according to the word vector matrix and the edit distance, as the semantic feature based on ordered edit distance.

In this embodiment, the edit operations defined in the first improved edit distance c_A include match (Mat), insert (Ins), delete (Del), and substitute (Sub), with corresponding operation costs c_Mat, c_Ins, c_Del, and c_Sub. The specific calculation steps are as follows:

S141. Perform word segmentation and stop-word removal on text1 and text2 to obtain word sequences t1 and t2. For example, text1 is "我想申请内购中兴手机了"; after segmentation it is [我|想|申请|内购|中兴|手机|了], and after removing stop words the word sequence t1 is [申请|内购|中兴|手机]. text2 is "如何申请一下中兴产品的内购呢"; after segmentation it is [如何|申请|一下|中兴|产品|的|内购|呢], and after removing stop words the word sequence t2 is [如何|申请|中兴|产品|内购]. Here, "我", "想", "了", "一下", "的", and "呢" are all stop words.
S142. Use a general method (such as one based on dynamic programming) to compute the edit path Path_A from word sequence t1 to word sequence t2 and the corresponding edit element sequence Elements_A. For example, a general method computes, for t1 = [申请|内购|中兴|手机] to t2 = [如何|申请|中兴|产品|内购], the edit path Path_A = [Ins, Mat, Sub, Sub, Sub] and the corresponding edit element sequence Elements_A = [如何, 申请, 内购→中兴, 中兴→产品, 手机→内购], where elements without an arrow correspond to Mat, Ins, or Del operations and elements with an arrow correspond to Sub operations.
S143. Obtain the corresponding edit operation cost vector Action_A from the edit path Path_A. Specifically, replace each edit operation with its corresponding operation cost to form the edit operation cost vector. For example, for the edit path Path_A = [Ins, Mat, Sub, Sub, Sub], the corresponding edit operation cost vector is [c_Ins, c_Mat, c_Sub, c_Sub, c_Sub].
S144. Compute an edit element distance for each element in the edit element sequence Elements_A, yielding the edit element distance vector Dis_A. Specifically, the edit element distance of Mat, Ins, and Del operations is 1, and the edit element distance of a Sub operation is sim_cos(w1, w2), the cosine similarity of the words w1 and w2 computed from their word vectors:

sim_cos(w1, w2) = (e_w1 · e_w2) / (‖e_w1‖ ‖e_w2‖).

For example, for the edit element sequence Elements_A = [如何, 申请, 内购→中兴, 中兴→产品, 手机→内购], the corresponding edit element distance vector Dis_A is [1, 1, 0.218, 0.294, 0.511].
S145. Compute the improved edit distance between the two texts from the edit operation cost vector Action_A and the corresponding edit element distance vector Dis_A, as the semantic feature based on ordered edit distance: c_A is the sum over the path of operation cost times element distance (the dot product of Action_A and Dis_A). For example, with the edit operation cost vector [c_Ins, c_Mat, c_Sub, c_Sub, c_Sub] and the corresponding edit element distance vector [1, 1, 0.218, 0.294, 0.511]:

c_A = 1*c_Ins + 1*c_Mat + 0.218*c_Sub + 0.294*c_Sub + 0.511*c_Sub.
The training module 10 further includes: an unordered edit distance obtaining unit 15 configured to obtain a second improved edit distance between the text pair according to the edit distance and a bag-of-words model, as the semantic feature based on unordered edit distance.

In this embodiment, the edit operations defined in the second improved edit distance c_B include match (Mat), insert (Ins), and delete (Del), with corresponding operation costs c_Mat, c_Ins, and c_Del. The specific calculation steps are as follows:
S151. Perform word segmentation and stop-word removal on text1 and text2 to obtain word sequences t1 and t2.

S152. Add all distinct words of word sequences t1 and t2 to a set, forming the bag of words BOW. For example, for t1 = [申请|内购|中兴|手机] and t2 = [如何|申请|中兴|产品|内购], the resulting bag of words BOW is [如何|申请|内购|中兴|手机|产品].
S153. Compute the edit distance from t1 to t2 according to the bag of words BOW and t1, t2. In an exemplary computation, for a word w in the bag of words BOW: if the word or one of its synonyms exists in t1 and the word or one of its synonyms exists in t2, perform operation Mat; if the word or one of its synonyms exists in t1 but not in t2, perform operation Del; if the word or one of its synonyms does not exist in t1 but exists in t2, perform operation Ins. After performing these operations for all words in the bag of words BOW in turn, the edit path Path_B is obtained, and from it the corresponding edit operation cost vector Action_B. For example, for t1 = [申请|内购|中兴|手机] to t2 = [如何|申请|中兴|产品|内购], the edit path Path_B is [Ins, Mat, Mat, Mat, Del, Ins], so the edit operation cost vector Action_B is [c_Ins, c_Mat, c_Mat, c_Mat, c_Del, c_Ins].
S154. Sum all elements of the edit operation cost vector Action_B to obtain the second improved edit distance c_B between the two texts, as the semantic feature based on unordered edit distance. For example, for the edit operation cost vector Action_B = [c_Ins, c_Mat, c_Mat, c_Mat, c_Del, c_Ins], c_B = c_Ins + c_Mat + c_Mat + c_Mat + c_Del + c_Ins.
The training module 10 further includes: a word sense distance obtaining unit 16 configured to obtain a word sense distance between the text pair according to the word vector matrix, as the semantic feature based on word sense distance.

In this embodiment, first, word sequences t1 and t2 are obtained by performing word segmentation and stop-word removal on text1 and text2. Suppose t1 contains the words w^1_1, ..., w^1_m and t2 contains the words w^2_1, ..., w^2_n, where the subscript indexes the position in the sequence (m and n are the total lengths of the t1 and t2 word sequences) and the superscript 1 or 2 indicates whether the word belongs to t1 or t2. Next, the word sense distance between each word w^1_i of t1 and each word w^2_j of t2 is computed from their word vectors; from these pairwise values, the word sense distance between a word w^1_i of t1 and t2 as a whole, and between a word w^2_j of t2 and t1 as a whole, are defined. Finally, the word sense similarity between the two texts is computed from these distances and used as the semantic feature based on word sense distance. (The pairwise, per-word, and aggregate formulas appear only as images in the original publication.)
The training module 10 further includes: a syntactic distance obtaining unit 17 configured to perform dependency syntactic analysis on the text pair and obtain the syntactic distance between the text pair, as the syntactic feature based on dependency relations.

In this embodiment, first, word sequences t1 and t2 are obtained by performing word segmentation and stop-word removal on text1 and text2. Next, a general method (such as the StanfordNLP or FNLP tools) is used to perform dependency syntactic analysis on t1 and t2 separately, and the numbers of valid word collocation pairs in t1 and t2 are counted, denoted p_1 and p_2. A valid word collocation pair is a pair consisting of the core word of the sentence and a valid word that directly depends on it. The core word is the unique core term of the whole sentence obtained after dependency parsing; valid words are the nouns, verbs, and adjectives identified after dependency parsing.

For example, for t1 = [申请|内购|中兴|手机], after dependency parsing the core word is "内购", the words directly depending on it are "申请" and "手机", and both are valid words, so the number of valid collocation pairs of t1 is 2. From p_1 and p_2, the syntactic structure distance between the two texts is computed as c_D = |p_1 - p_2|, as the syntactic feature based on dependency relations.
As shown in FIG. 6, in this embodiment, the online acquisition module 40 includes:

a feature vector obtaining unit 41 configured to obtain a target text pair, calculate the numerical features of the target text pair, and form the feature vector of the target text pair; and

a similarity obtaining unit 42 configured to substitute the feature vector of the target text pair into the prediction model to obtain the similarity score of the target text pair.
In this embodiment, a network structure for training is first built, model training is then performed with the sample feature matrix X and the prediction vector y obtained above, and the model is finally saved for subsequent online calculation. In an exemplary embodiment, the network structure is a multi-layer perceptron (MLP), and the network is trained with the sample feature matrix X and the prediction vector y using a general method.

After training, the resulting model parameters are denoted W_1*, b_1*, W_2*, and b_2*, where W_1* is the connection weight of the first layer of the MLP, b_1* is the bias of the first layer, W_2* is the connection weight of the second layer, and b_2* is the bias of the second layer. The prediction model can then be expressed as

score = g_2( g_1( x_T · W_1* + b_1* ) · W_2* + b_2* ),

where g_1 is the non-linear activation function of the first layer of the MLP, g_2 is the non-linear activation function of the second layer, and x_T is the feature vector of the target text pair.
In this embodiment, for a target text pair t1 and t2 input to the system, the four numerical features c_A, c_B, c_C, and c_D of the text pair are computed in turn according to the calculation steps above, forming the feature vector of the target text pair, x_T = [c_A, c_B, c_C, c_D]. Substituting this feature vector into the prediction model yields the similarity score of the target text pair t1 and t2:

score(t1, t2) = g_2( g_1( x_T · W_1* + b_1* ) · W_2* + b_2* ).
An embodiment of the present disclosure provides an electronic device, including a memory, a processor, and at least one application program stored in the memory and configured to be executed by the processor, the application program being configured to perform the method for obtaining text similarity described in the first embodiment.

An embodiment of the present disclosure provides a readable storage medium storing a computer program which, when executed by a processor, implements the method described in any of the method embodiments for obtaining text similarity above.

It should be noted that the above apparatus (device) embodiments and readable storage medium embodiments belong to the same concept as the method embodiments; see the method embodiments for their specific implementation. The technical features of the method embodiments apply correspondingly in the apparatus embodiments and are not repeated here.

Embodiments of the present disclosure provide a method, an apparatus, and a device for obtaining text similarity, and a readable storage medium. The method includes: obtaining numerical features of a text pair according to a data set of text pairs; constructing a sample feature matrix from the numerical features of the text pairs; performing model training according to the sample feature matrix and a prediction vector to obtain a prediction model; and obtaining a target text pair, and obtaining a similarity score of the target text pair according to the sample feature matrix and the prediction model. By obtaining multiple numerical features of a text pair, the method judges text similarity with regard to both semantics and syntactic structure, and has the advantages of trainable weights, little manual intervention, simplicity and speed, ease of implementation, and high accuracy, thereby improving the user experience.
Those of ordinary skill in the art will understand that all or some of the steps of the methods and the functional modules/units of the systems and apparatuses disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof.

In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed cooperatively by several physical components. Some or all physical components may be implemented as software executed by a processor such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

The preferred embodiments of the present disclosure have been described above with reference to the drawings, but they do not limit the scope of the present disclosure. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and essence of the present disclosure shall fall within the scope of the present disclosure.

Claims (10)

  1. A method for obtaining text similarity, comprising:
    obtaining numerical features of a text pair according to a data set of text pairs;
    constructing a sample feature matrix from the numerical features of the text pairs;
    performing model training according to the sample feature matrix and a prediction vector to obtain a prediction model; and
    obtaining a target text pair, and obtaining a similarity score of the target text pair according to the sample feature matrix and the prediction model.
  2. The method according to claim 1, wherein the numerical features comprise: a semantic feature based on an ordered edit distance, a semantic feature based on an unordered edit distance, a semantic feature based on word sense distance, and a syntactic feature based on dependency relations.
  3. The method according to claim 2, wherein the step of obtaining the numerical features of the text pair according to the data set of text pairs comprises:
    obtaining a training corpus file, the training corpus file comprising several text pairs and a similarity score for each text pair;
    obtaining a training data set according to the training corpus file;
    obtaining a word vector matrix from the training data set;
    obtaining a first improved edit distance between the text pair according to the word vector matrix and the edit distance, as a semantic feature based on ordered edit distance;
    obtaining a second improved edit distance between the text pair according to the edit distance and a bag-of-words model, as a semantic feature based on unordered edit distance;
    obtaining a word sense distance between the text pair according to the word vector matrix, as a semantic feature based on word sense distance; and
    performing dependency syntactic analysis on the text pair and obtaining a syntactic distance between the text pair, as a syntactic feature based on dependency relations.
  4. The method according to claim 3, wherein the step of obtaining a target text pair and obtaining a similarity score of the target text pair according to the sample feature matrix and the prediction model comprises:
    obtaining a target text pair and obtaining numerical features of the target text pair to form a feature vector of the target text pair; and
    substituting the feature vector of the target text pair into the prediction model to obtain the similarity score of the target text pair.
  5. An apparatus for obtaining text similarity, comprising:
    a training module configured to obtain numerical features of a text pair according to a data set of text pairs;
    a matrix construction module configured to construct a sample feature matrix from the numerical features of the text pairs;
    a prediction module configured to perform model training according to the sample feature matrix and a prediction vector to obtain a prediction model; and
    an online acquisition module configured to obtain a target text pair and obtain a similarity score of the target text pair according to the sample feature matrix and the prediction model.
  6. The apparatus according to claim 5, wherein the numerical features comprise: a semantic feature based on an ordered edit distance, a semantic feature based on an unordered edit distance, a semantic feature based on word sense distance, and a syntactic feature based on dependency relations.
  7. The apparatus according to claim 6, wherein the training module comprises:
    an obtaining unit configured to obtain a training corpus file, the training corpus file comprising several text pairs and a similarity score for each text pair;
    an extraction unit configured to obtain a training data set according to the training corpus file;
    a word vector obtaining unit configured to obtain a word vector matrix from the training data set;
    an ordered edit distance obtaining unit configured to obtain a first improved edit distance between the text pair according to the word vector matrix and the edit distance, as a semantic feature based on ordered edit distance;
    an unordered edit distance obtaining unit configured to obtain a second improved edit distance between the text pair according to the edit distance and a bag-of-words model, as a semantic feature based on unordered edit distance;
    a word sense distance obtaining unit configured to obtain a word sense distance between the text pair according to the word vector matrix, as a semantic feature based on word sense distance; and
    a syntactic distance obtaining unit configured to perform dependency syntactic analysis on the text pair and obtain a syntactic distance between the text pair, as a syntactic feature based on dependency relations.
  8. The apparatus according to claim 7, wherein the online acquisition module comprises:
    a feature vector obtaining unit configured to obtain a target text pair and obtain numerical features of the target text pair to form a feature vector of the target text pair; and
    a similarity obtaining unit configured to substitute the feature vector of the target text pair into the prediction model to obtain the similarity score of the target text pair.
  9. An electronic device, comprising a memory, a processor, and at least one application program stored in the memory and configured to be executed by the processor, wherein the application program is configured to perform the method for obtaining text similarity according to any one of claims 1-4.
  10. A readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for obtaining text similarity according to any one of claims 1-4.
PCT/CN2019/097691 2018-07-25 2019-07-25 Method, apparatus, and device for obtaining text similarity, and readable storage medium WO2020020287A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810827262.3A CN110852056B (zh) 2018-07-25 2018-07-25 Method, apparatus, and device for obtaining text similarity, and readable storage medium
CN201810827262.3 2018-07-25

Publications (1)

Publication Number Publication Date
WO2020020287A1 (zh) 2020-01-30

Family

ID=69181349


Country Status (2)

Country Link
CN (1) CN110852056B (zh)
WO (1) WO2020020287A1 (zh)


Also Published As

Publication number Publication date
CN110852056B (zh) 2024-09-24
CN110852056A (zh) 2020-02-28


Legal Events

121: The EPO has been informed by WIPO that EP was designated in this application (ref document number 19841130; country of ref document: EP; kind code of ref document: A1)

NENP: Non-entry into the national phase (ref country code: DE)

32PN: Public notification in the EP bulletin as the address of the addressee cannot be established (free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 09.06.2021))

122: PCT application non-entry in European phase (ref document number 19841130; country of ref document: EP; kind code of ref document: A1)