WO2019196314A1 - Text information similarity matching method and apparatus, computer device, and storage medium - Google Patents

Text information similarity matching method and apparatus, computer device, and storage medium

Info

Publication number
WO2019196314A1
WO2019196314A1 (application PCT/CN2018/102855, CN2018102855W)
Authority
WO
WIPO (PCT)
Prior art keywords
text information
word
idf
vector
sentence
Prior art date
Application number
PCT/CN2018/102855
Other languages
French (fr)
Chinese (zh)
Inventor
周涛涛
周宝
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019196314A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing

Definitions

  • The present application relates to the field of text information recognition technology, and in particular to a TF-IDF-based text information similarity matching method and apparatus, as well as a computer device and a storage medium storing computer readable instructions.
  • With the development of intelligent technology, customer service robots and chat bots are becoming increasingly popular. Users can enter text information to consult a customer service robot or to chat with a chat bot.
  • When a robot recognizes text information sent by a user, it must respond based on that information. In general, feedback information can be determined from the text information by either a retrieval approach or a generation approach.
  • The generation approach automatically generates an answer from a model. It requires a large number of annotated question-answer pairs for training, currently yields unsatisfactory results, and remains at the research stage.
  • The retrieval approach is widely used in industry: curated question-answer pairs are stored in advance, and a matching method finds the preset question that best matches the user's question, so that the corresponding preset answer can be retrieved. The accuracy of the text matching used in this retrieval approach still needs improvement.
  • The purpose of the present application is to address at least one of the above technical drawbacks, in particular the drawback of low matching accuracy.
  • To this end, the present application provides a TF-IDF-based text information similarity matching method, comprising the following steps: acquiring text information; segmenting the text information to obtain word segments w_1, w_2, …, w_(n-1), w_n; calculating the word vectors V(w_1), V(w_2), …, V(w_(n-1)), V(w_n) of the word segments using a CBOW model; calculating the TF-IDF values k_1, k_2, …, k_(n-1), k_n of the word segments using the TF-IDF algorithm; obtaining a sentence vector V from the products of each word segment's word vector and its corresponding TF-IDF value; and calculating the cosine similarity between the sentence vector V and the sentence vectors of pre-stored sentences, and determining the pre-stored sentence with the largest cosine similarity.
  • The present application further provides a TF-IDF-based text information similarity matching apparatus, comprising: an acquisition module, configured to acquire text information; a word segmentation module, configured to segment the text information into word segments w_1, w_2, …, w_(n-1), w_n; a word vector calculation module, configured to calculate the word vectors V(w_1), V(w_2), …, V(w_(n-1)), V(w_n) of the word segments using a CBOW model; a TF-IDF value calculation module, configured to calculate the TF-IDF values k_1, k_2, …, k_(n-1), k_n of the word segments using the TF-IDF algorithm; a sentence vector calculation module, configured to obtain a sentence vector V from the products of each word segment's word vector and its corresponding TF-IDF value; and a matching module, configured to calculate the cosine similarity between the sentence vector V and the sentence vectors of pre-stored sentences and determine the pre-stored sentence with the largest cosine similarity.
  • The present application also provides a computer device comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform a TF-IDF-based text information similarity matching method comprising the steps set out above: acquiring text information; segmenting it into word segments; calculating the word vectors of the word segments using the CBOW model; calculating the TF-IDF values of the word segments using the TF-IDF algorithm; obtaining the sentence vector V from the products of the word vectors and the corresponding TF-IDF values; and calculating the cosine similarity between V and the sentence vectors of pre-stored sentences to determine the pre-stored sentence with the largest cosine similarity.
  • The present application also provides a non-volatile storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the same TF-IDF-based text information similarity matching method.
  • Through the above process, the TF-IDF-based text information similarity matching method, apparatus, computer device and storage medium can find the pre-stored sentence most similar to the text information, improving the accuracy of question recognition in robot dialogue, information classification and similar applications, and thereby improving dialogue or classification efficiency.
  • FIG. 1 is a schematic diagram of the internal structure of a computer device in an embodiment;
  • FIG. 2 is a schematic flowchart of a TF-IDF-based text information similarity matching method according to an embodiment;
  • FIG. 3 is a schematic diagram of the modules of a TF-IDF-based text information similarity matching apparatus according to an embodiment.
  • FIG. 1 is a schematic diagram showing the internal structure of a computer device in an embodiment.
  • The computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected by a system bus.
  • The non-volatile storage medium of the computer device stores an operating system, a database, and computer readable instructions.
  • The database may store a sequence of control information.
  • When the computer readable instructions are executed by the processor, they may cause the processor to implement a TF-IDF-based text information similarity matching method.
  • The processor of the computer device provides computing and control capabilities that support the operation of the entire computer device.
  • The memory of the computer device may store computer readable instructions which, when executed by the processor, cause the processor to perform a TF-IDF-based text information similarity matching method.
  • The network interface of the computer device is used to connect to and communicate with a terminal.
  • The TF-IDF-based text information similarity matching method described below can be applied to sentence recognition in robot dialogue, for example a customer service robot (including an online virtual customer service robot) recognizing a customer's enquiry, or a chat robot (including an online virtual chat robot) recognizing a customer's voice or typed text messages. It can also be applied to information classification, which is not elaborated here.
  • To perform text information recognition, a CBOW word vector model must first be generated: a corpus is crawled from the web, the corpus is pre-processed (removing special characters, removing URLs, transcoding, and so on), the corpus is segmented, and the segmented corpus is used for training.
  • The segmented corpus can be trained with the word2vec CBOW model in the Gensim toolkit to generate and save a CBOW word vector model.
  • After the CBOW word vector model has been generated, it can be used to produce word vectors in the subsequent method, as sketched below.
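As an illustration of the offline training step, a minimal Gensim 4.x sketch follows; the toy corpus, hyper-parameters, and file name are assumptions, not values from the patent:

```python
# Train a word2vec CBOW model on a pre-segmented corpus and save it.
from gensim.models import Word2Vec

# Each entry is one pre-processed, segmented sentence from the crawled corpus.
segmented_corpus = [
    ["我", "想", "咨询", "产品"],
    ["这个", "产品", "是", "免费", "的", "吗"],
]

# sg=0 selects the CBOW mode of word2vec (sg=1 would select Skip-Gram).
model = Word2Vec(segmented_corpus, vector_size=100, window=5, min_count=1, sg=0)
model.save("cbow_word_vectors.model")  # persist the CBOW word vector model
```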
  • FIG. 2 is a schematic flow chart of a TF-IDF-based text information similarity matching method according to an embodiment.
  • The present application provides a TF-IDF-based text information similarity matching method, comprising the following steps:
  • Step S100: Acquire text information.
  • The text information may be input directly by the user, or may be text recognized from voice data produced by the user.
  • For example, the user conducts an online consultation by sending a text message to an online customer service robot; the text message received by the robot is the acquired text information.
  • As another example, the user chats online by sending a text message to an online chat robot; the text message received by the chat robot is the acquired text information.
  • The text information may be a single sentence or a paragraph; neither its length nor the language used is limited here.
  • If the user sends a voice message, speech recognition must first be performed on it: the voice message sent by the user is acquired, and speech recognition is performed on it to generate text information. Speech recognition technology is widely used and is not described in detail here; a sketch of one possible toolkit follows.
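As an illustration only (the patent does not prescribe a recognizer), the speech-to-text step could be performed with the open-source SpeechRecognition package for Python; the file name, recognizer backend, and language code are assumptions:

```python
# A minimal sketch of the optional speech-to-text step, assuming a WAV file
# and the SpeechRecognition package (pip install SpeechRecognition).
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("user_message.wav") as source:  # hypothetical input file
    audio = recognizer.record(source)             # read the entire audio file

# Transcribe to text; here Google's free web API with Mandarin as the language.
text_information = recognizer.recognize_google(audio, language="zh-CN")
print(text_information)
```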
  • The above examples refer to online robots, but physical robots are not excluded, such as sweeping robots, children's educational robots, customer service robots, chat robots and other intelligent robots with physical bodies.
  • Step S200: Segment the text information to obtain the word segments w_1, w_2, …, w_(n-1), w_n.
  • Take Chinese word segmentation as an example. Chinese word segmentation refers to splitting a sequence of Chinese characters into individual words; it is the process of recombining a continuous character sequence into a word sequence according to certain rules. w_1, w_2, …, w_(n-1), w_n are the individual words segmented from the text information.
  • In some embodiments, word segmentation algorithms fall into three types: string-matching-based, understanding-based, and statistics-based.
  • String-matching-based word segmentation: this method, also called mechanical word segmentation, matches the Chinese character string to be analysed against entries in a "sufficiently large" machine dictionary according to a given strategy; if a string is found in the dictionary, the match succeeds (a word is recognized). Widely used matching strategies include forward maximum matching (left to right), reverse maximum matching (right to left), minimum segmentation (minimizing the number of words cut from each sentence), and bidirectional maximum matching (scanning both left to right and right to left).
  • Understanding-based word segmentation: this method recognizes words by having the computer simulate human understanding of the sentence.
  • The basic idea is to perform syntactic and semantic analysis alongside segmentation, using syntactic and semantic information to resolve ambiguity. It usually consists of three parts: a word segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Under the coordination of the general control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to judge segmentation ambiguity, simulating the human process of understanding a sentence. This approach requires a large amount of linguistic knowledge and information.
  • Statistics-based word segmentation: given a large number of already-segmented texts, a statistical machine learning model learns the rules of word segmentation (this is called training) so that unknown text can be segmented.
  • Examples include the maximum probability segmentation method and the maximum entropy segmentation method.
  • The main statistical models are the N-gram model, the Hidden Markov Model (HMM), the Maximum Entropy model (ME), and Conditional Random Fields (CRF).
  • In some embodiments, a statistics-based method can be used to segment the text information, for example the jieba word segmentation component.
  • jieba ("stutter") is a Chinese word segmentation component developed in Python by Chinese programmers.
  • In one embodiment, in the process of segmenting the text information into the word segments w_1, w_2, …, w_(n-1), w_n, the stop words of the text information are also removed, as in the sketch below.
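A minimal sketch of step S200 with jieba, assuming an illustrative stop-word list and sample sentence (the patent does not specify either):

```python
# Segment text with jieba and drop stop words.
import jieba

STOPWORDS = {"的", "了", "吗", "是", "这个"}  # assumed stop-word list

def segment(text: str) -> list[str]:
    # jieba.lcut returns the segmentation result as a list of words
    return [w for w in jieba.lcut(text) if w.strip() and w not in STOPWORDS]

words = segment("请问这个产品是免费的吗")  # word segments w_1 ... w_n
print(words)  # e.g. ['请问', '产品', '免费']
```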
  • Step S300: Use the CBOW model to calculate the word vectors V(w_1), V(w_2), …, V(w_(n-1)), V(w_n) of the word segments.
  • The word segments w_1, w_2, …, w_(n-1), w_n correspond to the word vectors V(w_1), V(w_2), …, V(w_(n-1)), V(w_n) respectively.
  • The word vector of each word segment can be calculated with the word2vec CBOW model in the Gensim toolkit.
  • word2vec, also known as word embeddings ("word vectors"), converts the words of natural language into dense vectors that a computer can work with.
  • Before word2vec appeared, natural language processing often represented words as discrete, independent symbols, i.e. one-hot encodings.
  • For example, one-hot encoding city names such as Hangzhou, Shanghai, Ningbo and Beijing gives each city a vector with a single entry equal to 1 and all others 0. Such codes are assigned arbitrarily, the vectors are mutually independent, and no relationship between the cities can be seen.
  • Moreover, the vector dimension depends on the number of words in the corpus; combining the vectors of all city names in the world into one matrix would yield an extremely sparse matrix and cause the curse of dimensionality.
  • Dense vector representations solve this problem effectively: word2vec converts one-hot encodings into low-dimensional continuous values, i.e. dense vectors, and words with similar meanings are mapped to nearby positions in the vector space.
  • word2vec has two main modes: CBOW (Continuous Bag of Words) and Skip-Gram.
  • CBOW predicts the target word from the surrounding context, while Skip-Gram does the opposite, predicting the context from the target word.
  • CBOW suits smaller corpora, while Skip-Gram performs better on large corpora. A sketch of looking up word vectors from a trained CBOW model follows.
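A minimal sketch of step S300, assuming the CBOW model trained earlier was saved as "cbow_word_vectors.model" (an assumed file name), the Gensim 4.x API, and the `words` list from the segmentation sketch above:

```python
# Look up the word vector V(w_i) of each word segment from the saved CBOW model.
from gensim.models import Word2Vec

model = Word2Vec.load("cbow_word_vectors.model")  # assumed file name

# Out-of-vocabulary segments are skipped here for simplicity; a production
# system might instead fall back to a zero vector or subword handling.
word_vectors = [model.wv[w] for w in words if w in model.wv]
```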
  • Step S400: Use the TF-IDF algorithm to calculate the TF-IDF values k_1, k_2, …, k_(n-1), k_n of the word segments.
  • TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique for information retrieval and data mining.
  • TF means Term Frequency and IDF means Inverse Document Frequency.
  • In a given document, the term frequency (TF) is the frequency with which a given word appears in that document, normalized by the document length to avoid a bias towards long documents. For a word t_i in document d_j, the term frequency is

    $$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

    where n_{i,j} is the number of occurrences of the word t_i in document d_j, and the denominator is the total number of occurrences of all words in document d_j.
  • The inverse document frequency (IDF) is a measure of the general importance of a word.
  • The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient:

    $$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

    where |D| is the total number of documents in the corpus and |{j : t_i ∈ d_j}| is the number of documents containing the word t_i (assumed non-zero). If the word does not occur in the corpus this denominator would be zero, so 1 + |{j : t_i ∈ d_j}| can be used instead to keep the denominator non-zero.
  • The TF-IDF values corresponding to the word segments w_1, w_2, …, w_(n-1), w_n are k_1, k_2, …, k_(n-1), k_n respectively, where

    $$k_n = tf_n \times idf_n$$

    tf_n is the frequency (term frequency) with which the word segment w_n appears in the text information, and idf_n is the inverse document frequency of the word segment w_n. A sketch of this computation follows.
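A minimal sketch of step S400 following the formulas above; the reference corpus is an assumed toy example, and `words` comes from the segmentation sketch:

```python
# Compute the TF-IDF value k_i of each word segment in the input text.
import math

def tf_idf(word: str, document: list[str], corpus: list[list[str]]) -> float:
    # tf: occurrences of the word in this document over the document's length
    tf = document.count(word) / len(document)
    # idf: log of corpus size over 1 + number of documents containing the word;
    # the +1 keeps the denominator non-zero, as noted above
    containing = sum(1 for doc in corpus if word in doc)
    idf = math.log(len(corpus) / (1 + containing))
    return tf * idf

corpus = [["产品", "免费"], ["产品", "咨询"], ["聊天", "机器人"]]  # toy corpus
k = [tf_idf(w, words, corpus) for w in words]  # k_1 ... k_n
```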
  • Step S500: Obtain the sentence vector V from the products of each word segment's word vector and its corresponding TF-IDF value. The more important a word is to the text information, the larger its TF-IDF value; the TF-IDF value thus represents the importance of each word and can be understood as a weight.
  • Assuming 1 ≤ m ≤ n, the word vector V(w_m) of the word segment w_m is multiplied by its TF-IDF value k_m to give the product H_m = V(w_m) × k_m; the sentence vector V is obtained from the products H_1, H_2, …, H_(n-1), H_n.
  • In one embodiment, the sentence vector V can be obtained using the following formula:

    $$V = H_1 + H_2 + \dots + H_{n-1} + H_n$$

    that is,

    $$V = k_1 \times V(w_1) + k_2 \times V(w_2) + \dots + k_{n-1} \times V(w_{n-1}) + k_n \times V(w_n)$$
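A minimal sketch of step S500, combining the `word_vectors` and `k` values computed in the sketches above:

```python
# Sentence vector V = k_1*V(w_1) + ... + k_n*V(w_n): the TF-IDF-weighted
# sum of the word vectors of the segments.
import numpy as np

sentence_vector = np.sum(
    [k_i * np.asarray(v_i) for k_i, v_i in zip(k, word_vectors)],
    axis=0,
)
```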
  • Step S600: Calculate the cosine similarity between the sentence vector V and the sentence vectors of the pre-stored sentences, and determine the pre-stored sentence (the characteristic pre-stored sentence) with the largest cosine similarity.
  • Cosine similarity, also called cosine distance, measures the difference between two individuals by the cosine of the angle between their vectors in a vector space. If the vectors of sentence X and sentence Y are (x_1, x_2, …, x_6400) and (y_1, y_2, …, y_6400) respectively, the cosine distance between them can be expressed by the cosine of the angle between them:

    $$\cos\theta = \frac{\sum_{i=1}^{6400} x_i y_i}{\sqrt{\sum_{i=1}^{6400} x_i^2}\,\sqrt{\sum_{i=1}^{6400} y_i^2}}$$

    When the cosine equals 1 the two sentences are identical; when it is close to 1 they are similar; the smaller the cosine, the less related the sentences.
  • By comparing the cosine similarity between the sentence vector V and the sentence vector of each pre-stored sentence, the pre-stored sentence with the greatest cosine similarity can be found, as in the sketch below.
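A minimal sketch of step S600; the pre-stored questions, their vectors, and answers are illustrative stand-ins for a real database:

```python
# Retrieve the answer of the pre-stored question most similar to the input.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = a.b / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative pre-stored questions with pre-computed sentence vectors.
pre_stored = [
    {"question": "是免费的吗", "vector": np.random.rand(100), "answer": "是的"},
    {"question": "怎么注册", "vector": np.random.rand(100), "answer": "点击注册按钮即可"},
]

best = max(pre_stored, key=lambda q: cosine_similarity(sentence_vector, q["vector"]))
print(best["answer"])  # answer of the most similar pre-stored question
```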
  • For example, a user conducts an online consultation by sending a question to an online customer service robot.
  • The robot calculates the sentence vector V of the question, searches the database for the pre-stored question whose sentence vector has the greatest cosine similarity to V, and returns the answer stored for that pre-stored question. For instance, if the user asks "Is it free of charge?" and the pre-stored question with the greatest cosine similarity is "Is it free?", whose stored answer is "Yes", then "Yes" is returned to the customer.
  • Through the above process, the pre-stored sentence (characteristic pre-stored sentence) most similar to the text information can be found, improving the accuracy of question recognition in robot dialogue and information classification, and thereby improving dialogue or classification efficiency.
  • FIG. 3 is a schematic diagram of a TF-IDF-based text information similarity matching device module according to an embodiment.
  • Referring to FIG. 3, the present application further provides a TF-IDF-based text information similarity matching apparatus, comprising: an acquisition module 100, a word segmentation module 200, a word vector calculation module 300, a TF-IDF value calculation module 400, a sentence vector calculation module 500, and a matching module 600.
  • The acquisition module 100 is configured to acquire text information; the word segmentation module 200 is configured to segment the text information into word segments w_1, w_2, …, w_(n-1), w_n; the word vector calculation module 300 is configured to calculate the word vectors V(w_1), V(w_2), …, V(w_(n-1)), V(w_n) of the word segments using the CBOW model; the TF-IDF value calculation module 400 is configured to calculate the TF-IDF values k_1, k_2, …, k_(n-1), k_n of the word segments using the TF-IDF algorithm; the sentence vector calculation module 500 is configured to obtain the sentence vector V from the products of each word segment's word vector and its corresponding TF-IDF value; and the matching module 600 is configured to calculate the cosine similarity between the sentence vector V and the sentence vectors of pre-stored sentences and determine the pre-stored sentence with the largest cosine similarity.
  • The acquisition module 100 acquires text information.
  • The text information may be input directly by the user, or may be text recognized from voice data produced by the user.
  • For example, the user conducts an online consultation by sending a text message to an online customer service robot; the text message received by the robot is the acquired text information.
  • As another example, the user chats online by sending a text message to an online chat robot; the text message received by the chat robot is the acquired text information.
  • The text information may be a single sentence or a paragraph; neither its length nor the language used is limited here.
  • If the user sends a voice message, the acquisition module 100 performs speech recognition on it: it acquires the voice message sent by the user and performs speech recognition on it to generate text information. Speech recognition technology is widely used and is not described here.
  • The above examples refer to online robots, but physical robots are not excluded, such as sweeping robots, children's educational robots, customer service robots, chat robots and other intelligent robots with physical bodies.
  • The word segmentation module 200 segments the text information to obtain the word segments w_1, w_2, …, w_(n-1), w_n.
  • Take Chinese word segmentation as an example. Chinese word segmentation refers to splitting a sequence of Chinese characters into individual words; it is the process of recombining a continuous character sequence into a word sequence according to certain rules. w_1, w_2, …, w_(n-1), w_n are the individual words segmented from the text information.
  • In some embodiments, word segmentation algorithms fall into three types: string-matching-based, understanding-based, and statistics-based.
  • String-matching-based word segmentation: this method, also called mechanical word segmentation, matches the Chinese character string to be analysed against entries in a "sufficiently large" machine dictionary according to a given strategy; if a string is found in the dictionary, the match succeeds (a word is recognized). Widely used matching strategies include forward maximum matching (left to right), reverse maximum matching (right to left), minimum segmentation (minimizing the number of words cut from each sentence), and bidirectional maximum matching (scanning both left to right and right to left).
  • Understanding-based word segmentation: this method recognizes words by having the computer simulate human understanding of the sentence.
  • The basic idea is to perform syntactic and semantic analysis alongside segmentation, using syntactic and semantic information to resolve ambiguity. It usually consists of three parts: a word segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Under the coordination of the general control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to judge segmentation ambiguity, simulating the human process of understanding a sentence. This approach requires a large amount of linguistic knowledge and information.
  • Statistics-based word segmentation: given a large number of already-segmented texts, a statistical machine learning model learns the rules of word segmentation (this is called training) so that unknown text can be segmented.
  • Examples include the maximum probability segmentation method and the maximum entropy segmentation method.
  • The main statistical models are the N-gram model, the Hidden Markov Model (HMM), the Maximum Entropy model (ME), and Conditional Random Fields (CRF).
  • The word segmentation module 200 can segment the text information using a statistics-based method, for example the jieba word segmentation component.
  • jieba ("stutter") is a Chinese word segmentation component developed in Python by Chinese programmers.
  • In one embodiment, in the process of segmenting the text information into the word segments w_1, w_2, …, w_(n-1), w_n, the word segmentation module 200 also removes the stop words of the text information.
  • The word vector calculation module 300 calculates the word vectors V(w_1), V(w_2), …, V(w_(n-1)), V(w_n) of the word segments using the CBOW model.
  • The word segments w_1, w_2, …, w_(n-1), w_n correspond to the word vectors V(w_1), V(w_2), …, V(w_(n-1)), V(w_n) respectively.
  • The word vector calculation module 300 can calculate the word vector of each word segment with the word2vec CBOW model in the Gensim toolkit.
  • word2vec, also known as word embeddings ("word vectors"), converts the words of natural language into dense vectors that a computer can work with. Before word2vec appeared, natural language processing often represented words as discrete, independent symbols, i.e. one-hot encodings.
  • For example, one-hot encoding city names such as Hangzhou, Shanghai, Ningbo and Beijing gives each city a vector with a single entry equal to 1 and all others 0. Such codes are assigned arbitrarily, the vectors are mutually independent, and no relationship between the cities can be seen.
  • Moreover, the vector dimension depends on the number of words in the corpus; combining the vectors of all city names in the world into one matrix would yield an extremely sparse matrix and cause the curse of dimensionality.
  • Dense vector representations solve this problem effectively: word2vec converts one-hot encodings into low-dimensional continuous values, i.e. dense vectors, and words with similar meanings are mapped to nearby positions in the vector space.
  • word2vec has two main modes: CBOW (Continuous Bag of Words) and Skip-Gram.
  • CBOW predicts the target word from the surrounding context, while Skip-Gram does the opposite, predicting the context from the target word.
  • CBOW suits smaller corpora, while Skip-Gram performs better on large corpora.
  • The TF-IDF value calculation module 400 calculates the TF-IDF values k_1, k_2, …, k_(n-1), k_n of the word segments using the TF-IDF algorithm.
  • TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique for information retrieval and data mining.
  • TF means term frequency, and IDF means inverse document frequency.
  • In a given document, the term frequency (TF) is the frequency with which a given word appears in that document, normalized by the document length to avoid a bias towards long documents. For a word t_i in document d_j, the term frequency is

    $$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

    where n_{i,j} is the number of occurrences of the word t_i in document d_j, and the denominator is the total number of occurrences of all words in document d_j.
  • The inverse document frequency (IDF) is a measure of the general importance of a word.
  • The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient:

    $$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

    where |D| is the total number of documents in the corpus and |{j : t_i ∈ d_j}| is the number of documents containing the word t_i (assumed non-zero). If the word does not occur in the corpus this denominator would be zero, so 1 + |{j : t_i ∈ d_j}| can be used instead to keep the denominator non-zero.
  • The TF-IDF values corresponding to the word segments w_1, w_2, …, w_(n-1), w_n are k_1, k_2, …, k_(n-1), k_n respectively, where

    $$k_n = tf_n \times idf_n$$

    tf_n is the frequency (term frequency) with which the word segment w_n appears in the text information, and idf_n is the inverse document frequency of the word segment w_n.
  • The sentence vector calculation module 500 obtains the sentence vector V from the products of each word segment's word vector and its corresponding TF-IDF value.
  • Assuming 1 ≤ m ≤ n, the word vector V(w_m) of the word segment w_m is multiplied by its TF-IDF value k_m to give the product H_m = V(w_m) × k_m; the sentence vector V is obtained from the products H_1, H_2, …, H_(n-1), H_n.
  • In one embodiment, the sentence vector calculation module 500 can obtain the sentence vector V using the following formula:

    $$V = H_1 + H_2 + \dots + H_{n-1} + H_n$$

    that is,

    $$V = k_1 \times V(w_1) + k_2 \times V(w_2) + \dots + k_{n-1} \times V(w_{n-1}) + k_n \times V(w_n)$$
  • The matching module 600 calculates the cosine similarity between the sentence vector V and the sentence vectors of the pre-stored sentences, and determines the pre-stored sentence (the characteristic pre-stored sentence) with the largest cosine similarity.
  • Cosine similarity, also called cosine distance, measures the difference between two individuals by the cosine of the angle between their vectors in a vector space. If the vectors of sentence X and sentence Y are (x_1, x_2, …, x_6400) and (y_1, y_2, …, y_6400) respectively, the cosine distance between them can be expressed by the cosine of the angle between them:

    $$\cos\theta = \frac{\sum_{i=1}^{6400} x_i y_i}{\sqrt{\sum_{i=1}^{6400} x_i^2}\,\sqrt{\sum_{i=1}^{6400} y_i^2}}$$

    When the cosine equals 1 the two sentences are identical; when it is close to 1 they are similar; the smaller the cosine, the less related the sentences.
  • By comparing the cosine similarity between the sentence vector V and the sentence vector of each pre-stored sentence, the matching module 600 can find the pre-stored sentence (characteristic pre-stored sentence) with the greatest cosine similarity.
  • For example, a user conducts an online consultation by sending a question to an online customer service robot.
  • The robot calculates the sentence vector V of the question, searches the database for the pre-stored question whose sentence vector has the greatest cosine similarity to V, and returns the answer stored for that pre-stored question. For instance, if the user asks "Is it free of charge?" and the pre-stored question with the greatest cosine similarity is "Is it free?", whose stored answer is "Yes", then "Yes" is returned to the customer.
  • Through the above process, the pre-stored sentence (characteristic pre-stored sentence) most similar to the text information can be found, improving the accuracy of question recognition in robot dialogue and information classification, and thereby improving dialogue or classification efficiency.
  • The present application also provides a computer device comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the TF-IDF-based text information similarity matching method according to any of the above embodiments.
  • The present application also provides a storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the TF-IDF-based text information similarity matching method according to any of the above embodiments.
  • The TF-IDF-based text information similarity matching method, apparatus, computer device and storage medium described above acquire text information; segment the text information to obtain word segments w_1, w_2, …, w_(n-1), w_n; calculate the word vectors V(w_1), V(w_2), …, V(w_(n-1)), V(w_n) of the word segments using the CBOW model; calculate the TF-IDF values k_1, k_2, …, k_(n-1), k_n of the word segments using the TF-IDF algorithm; obtain the sentence vector V from the products of each word segment's word vector and its corresponding TF-IDF value; and calculate the cosine similarity between the sentence vector V and the sentence vectors of pre-stored sentences, determining the pre-stored sentence with the largest cosine similarity.
  • The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).

Abstract

Provided are a TF-IDF-based text information similarity matching method and apparatus. The method comprises: acquiring text information; carrying out word segmentation on the text information to obtain segmented words w_1, w_2, …, w_(n-1) and w_n; using a CBOW model to calculate word vectors V(w_1), V(w_2), …, V(w_(n-1)) and V(w_n) of the segmented words; using a TF-IDF algorithm to calculate TF-IDF values k_1, k_2, …, k_(n-1) and k_n of the segmented words; obtaining a sentence vector V according to products of the word vectors of the segmented words and the corresponding TF-IDF values; and calculating the cosine similarity between the sentence vector V and sentence vectors of pre-stored statements, and determining a pre-stored statement having the maximum cosine similarity. By means of this process, the pre-stored statement most similar to the text information can be found, and the accuracy of question recognition can be improved in robot conversation, information classification and similar applications, thus improving conversation or classification efficiency. Further provided are a computer device and a storage medium.

Description

Text information similarity matching method, apparatus, computer device and storage medium
This application claims priority to Chinese Patent Application No. 201810314094.8, entitled "Text Information Similarity Matching Method, Apparatus, Computer Device and Storage Medium", filed with the Chinese Patent Office on April 10, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of text information recognition technology, and in particular to a TF-IDF-based text information similarity matching method and apparatus, as well as a computer device and a storage medium storing computer readable instructions.
Background
With the development of intelligent technology, customer service robots and chat bots are becoming increasingly popular. Users can enter text information to consult a customer service robot or to chat with a chat bot.
The inventor realized that when a robot recognizes text information sent by a user, it must respond based on that information. In general, feedback information can be determined from the text information by either a retrieval approach or a generation approach. The generation approach automatically generates an answer from a model; it requires a large number of annotated question-answer pairs for training, currently yields unsatisfactory results, and remains at the research stage. The retrieval approach is widely used in industry: curated question-answer pairs are stored in advance, and a matching method finds the preset question that best matches the user's question, so that the corresponding preset answer can be retrieved. The accuracy of the text matching used in this retrieval approach still needs improvement.
Summary of the Invention
The purpose of the present application is to address at least one of the above technical drawbacks, in particular the drawback of low matching accuracy.
The present application provides a TF-IDF-based text information similarity matching method, comprising the following steps: acquiring text information; segmenting the text information to obtain word segments w_1, w_2, …, w_(n-1), w_n; calculating the word vectors V(w_1), V(w_2), …, V(w_(n-1)), V(w_n) of the word segments using a CBOW model; calculating the TF-IDF values k_1, k_2, …, k_(n-1), k_n of the word segments using the TF-IDF algorithm; obtaining a sentence vector V from the products of each word segment's word vector and its corresponding TF-IDF value; and calculating the cosine similarity between the sentence vector V and the sentence vectors of pre-stored sentences, and determining the pre-stored sentence with the largest cosine similarity.
The present application further provides a TF-IDF-based text information similarity matching apparatus, comprising: an acquisition module, configured to acquire text information; a word segmentation module, configured to segment the text information into word segments w_1, w_2, …, w_(n-1), w_n; a word vector calculation module, configured to calculate the word vectors V(w_1), V(w_2), …, V(w_(n-1)), V(w_n) of the word segments using a CBOW model; a TF-IDF value calculation module, configured to calculate the TF-IDF values k_1, k_2, …, k_(n-1), k_n of the word segments using the TF-IDF algorithm; a sentence vector calculation module, configured to obtain a sentence vector V from the products of each word segment's word vector and its corresponding TF-IDF value; and a matching module, configured to calculate the cosine similarity between the sentence vector V and the sentence vectors of pre-stored sentences and determine the pre-stored sentence with the largest cosine similarity.
The present application also provides a computer device comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform a TF-IDF-based text information similarity matching method comprising the steps described above.
The present application also provides a non-volatile storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform a TF-IDF-based text information similarity matching method comprising the steps described above.
Through the above process, the TF-IDF-based text information similarity matching method, apparatus, computer device and storage medium can find the pre-stored sentence most similar to the text information, improving the accuracy of question recognition in robot dialogue, information classification and similar applications, and thereby improving dialogue or classification efficiency.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the internal structure of a computer device in an embodiment;
FIG. 2 is a schematic flowchart of a TF-IDF-based text information similarity matching method according to an embodiment;
FIG. 3 is a schematic diagram of the modules of a TF-IDF-based text information similarity matching apparatus according to an embodiment.
Detailed Description
FIG. 1 is a schematic diagram of the internal structure of a computer device in an embodiment. As shown in FIG. 1, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected by a system bus. The non-volatile storage medium stores an operating system, a database, and computer readable instructions, and the database may store a sequence of control information; when the computer readable instructions are executed by the processor, they may cause the processor to implement a TF-IDF-based text information similarity matching method. The processor provides computing and control capabilities that support the operation of the entire computer device. The memory may store computer readable instructions which, when executed by the processor, cause the processor to perform a TF-IDF-based text information similarity matching method. The network interface is used to connect to and communicate with a terminal. Those skilled in the art will understand that the structure shown in FIG. 1 is merely a block diagram of the portion of the structure relevant to the present solution and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
The TF-IDF-based text information similarity matching method described below can be applied to sentence recognition in robot dialogue, for example a customer service robot (including an online virtual customer service robot) recognizing a customer's enquiry, or a chat robot (including an online virtual chat robot) recognizing a customer's voice or typed text messages. It can also be applied to information classification, which is not elaborated here.
To perform text information recognition, a CBOW word vector model must first be generated. The specific process is as follows:
1. Crawl a corpus with a web crawler. A Python crawler can be used to crawl a corpus from online encyclopedia sites such as Wikipedia, Google's encyclopedia, Baidu Baike, Sogou Baike, and so on.
2. Pre-process the corpus. Pre-processing includes removing special characters, removing URLs, transcoding, and so on.
3. Segment the corpus. Chinese word segmentation of the corpus can be performed with the jieba word segmentation component.
4. Train on the segmented corpus to generate the CBOW word vector model. The segmented corpus can be trained with the word2vec CBOW model in the Gensim toolkit, and the resulting CBOW word vector model is saved.
After the CBOW word vector model has been generated, it can be used to produce word vectors in the subsequent method.
FIG. 2 is a schematic flowchart of a TF-IDF-based text information similarity matching method according to an embodiment. The present application provides a TF-IDF-based text information similarity matching method, comprising the following steps:
Step S100: Acquire text information. The text information may be input directly by the user, or may be text recognized from voice data produced by the user.
For example, the user conducts an online consultation by sending a text message to an online customer service robot; the text message received by the robot is the acquired text information. As another example, the user chats online by sending a text message to an online chat robot; the text message received by the chat robot is the acquired text information. The text information may be a single sentence or a paragraph; neither its length nor the language used is limited here.
Of course, if the user sends a voice message, speech recognition must be performed on it: the voice message sent by the user is acquired, and speech recognition is performed on it to generate text information. Speech recognition technology is widely used and is not described here.
Of course, the above examples refer to online robots, but physical robots are not excluded, such as sweeping robots, children's educational robots, customer service robots, chat robots and other intelligent robots with physical bodies.
Step S200: Segment the text information to obtain the word segments w_1, w_2, …, w_(n-1), w_n.
Take Chinese word segmentation as an example. Chinese word segmentation refers to splitting a sequence of Chinese characters into individual words; it is the process of recombining a continuous character sequence into a word sequence according to certain rules. w_1, w_2, …, w_(n-1), w_n are the individual words segmented from the text information.
In some embodiments, word segmentation algorithms fall into three types: string-matching-based, understanding-based, and statistics-based.
String-matching-based word segmentation: this method, also called mechanical word segmentation, matches the Chinese character string to be analysed against entries in a "sufficiently large" machine dictionary according to a given strategy; if a string is found in the dictionary, the match succeeds (a word is recognized). Widely used matching strategies include:
1) forward maximum matching (left to right);
2) reverse maximum matching (right to left);
3) minimum segmentation (minimizing the number of words cut from each sentence);
4) bidirectional maximum matching (scanning both left to right and right to left).
Understanding-based word segmentation: this method recognizes words by having the computer simulate human understanding of the sentence. The basic idea is to perform syntactic and semantic analysis alongside segmentation, using syntactic and semantic information to resolve ambiguity. It usually consists of three parts: a word segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Under the coordination of the general control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to judge segmentation ambiguity, simulating the human process of understanding a sentence. This approach requires a large amount of linguistic knowledge and information.
Statistics-based word segmentation: given a large number of already-segmented texts, a statistical machine learning model learns the rules of word segmentation (this is called training) so that unknown text can be segmented. Examples include the maximum probability and maximum entropy segmentation methods. The main statistical models are the N-gram model, the Hidden Markov Model (HMM), the Maximum Entropy model (ME), and Conditional Random Fields (CRF).
In some embodiments, a statistics-based method can be used to segment the text information, for example the jieba word segmentation component. jieba ("stutter") is a Chinese word segmentation component developed in Python by Chinese programmers.
In one embodiment, in the process of segmenting the text information into the word segments w_1, w_2, …, w_(n-1), w_n, the stop words of the text information are also removed.
Step S300: Use the CBOW model to calculate the word vectors V(w_1), V(w_2), …, V(w_(n-1)), V(w_n) of the word segments. The word segments w_1, w_2, …, w_(n-1), w_n correspond to the word vectors V(w_1), V(w_2), …, V(w_(n-1)), V(w_n) respectively.
The word vector of each word segment can be calculated with the word2vec CBOW model in the Gensim toolkit.
word2vec, also known as word embeddings ("word vectors"), converts the words of natural language into dense vectors that a computer can work with. Before word2vec appeared, natural language processing often represented words as discrete, independent symbols, i.e. one-hot encodings:
Hangzhou [0,0,0,0,0,0,0,1,0,…,0,0,0,0,0,0,0]
Shanghai [0,0,0,0,1,0,0,0,0,…,0,0,0,0,0,0,0]
Ningbo [0,0,0,1,0,0,0,0,0,…,0,0,0,0,0,0,0]
Beijing [0,0,0,0,0,0,0,0,0,…,1,0,0,0,0,0,0]
In this example, Hangzhou, Shanghai, Ningbo and Beijing each correspond to a vector in the corpus with a single entry equal to 1 and all others 0. One-hot encoding has the following problems. First, the city codes are assigned arbitrarily and the vectors are mutually independent, so no relationship between cities can be seen. Second, the vector dimension depends on the number of words in the corpus; combining the vectors of all city names in the world into one matrix would yield an extremely sparse matrix and cause the curse of dimensionality.
Dense vector representations solve this problem effectively. word2vec converts one-hot encodings into low-dimensional continuous values, i.e. dense vectors, and words with similar meanings are mapped to nearby positions in the vector space.
word2vec has two main modes: CBOW (Continuous Bag of Words) and Skip-Gram. CBOW predicts the target word from the surrounding context, while Skip-Gram does the opposite, predicting the context from the target word. CBOW suits smaller corpora, while Skip-Gram performs better on large corpora.
Step S400: Use the TF-IDF algorithm to calculate the TF-IDF values k_1, k_2, …, k_(n-1), k_n of the word segments.
TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency and IDF for inverse document frequency.
In a given document, the term frequency (TF) is the frequency with which a given word appears in that document. This number is a normalization of the raw term count, to prevent a bias towards long documents (the same word is likely to have a higher count in a long document than in a short one, regardless of its importance). For a word t_i in a particular document d_j, its term frequency tf_{i,j} can be expressed as:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
In the above formula, n_{i,j} is the number of occurrences of the word t_i in document d_j, and the denominator is the total number of occurrences of all words in document d_j.
The inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient:
$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
其中,对数的分子|D|为语料库中的文件总数,分母|{j:t i∈d i}|为包含词语t i的文件数目(不等于0)。如果该词语不在语料库中,就会导致被除数为零,因此可以使用1+|{j:t i∈d i}|确保分母不为0。 Wherein, the logarithmic numerator |D| is the total number of files in the corpus, and the denominator |{j:t i ∈d i }| is the number of files containing the word t i (not equal to 0). If the word is not in the corpus, it will result in the dividend being zero, so you can use 1+|{j:t i ∈d i }| to ensure that the denominator is not zero.
The TF-IDF value of the word t_i is then TF-IDF = tf_{i,j} × idf_i.
In this embodiment, the word segments w_1, w_2, ..., w_{n-1}, w_n correspond to the TF-IDF values k_1, k_2, ..., k_{n-1}, k_n respectively, where k_n = tf_n × idf_n, tf_n is the frequency (term frequency) with which the segment w_n appears in the text information, and idf_n is the inverse document frequency of the segment w_n.
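As a concrete reading of the formulas above, the sketch below computes the TF-IDF value of one word; the token-list corpus is an illustrative assumption, and the 1 + |{j : t_i ∈ d_j}| smoothing mentioned above is applied in the denominator.

```python
# Minimal TF-IDF sketch following the formulas above (toy corpus is assumed).
import math

def tf_idf(word, doc, corpus):
    # tf: occurrences of the word in this document, normalised by the
    # total number of words in the document.
    tf = doc.count(word) / len(doc)
    # idf: log of the total document count over the number of documents
    # containing the word; 1 is added so the denominator is never zero.
    containing = sum(1 for d in corpus if word in d)
    idf = math.log(len(corpus) / (1 + containing))
    return tf * idf

corpus = [["请问", "是否", "包邮"], ["什么", "时候", "发货"], ["商品", "几天", "发货"]]
print(tf_idf("包邮", corpus[0], corpus))  # TF-IDF of "包邮" in the first document
```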
Step S500: obtain the sentence vector V from the products of the word vectors of the individual segments and their corresponding TF-IDF values. The more important a word is to the text information, the larger its TF-IDF value; the TF-IDF value therefore represents the importance of each word and can be understood as a weight.
Assuming 1 ≤ m ≤ n, the word vector V(w_m) of the segment w_m is multiplied by the TF-IDF value k_m of that segment, giving the product H_m = V(w_m) × k_m. The sentence vector V is then obtained from the products H_1, H_2, ..., H_{n-1}, H_n of the word vectors of the segments w_1, w_2, ..., w_{n-1}, w_n and their corresponding TF-IDF values.
In one embodiment, the sentence vector V can be obtained with the formula:
V = H_1 + H_2 + ... + H_{n-1} + H_n
that is, written out in full:
V = k_1 × V(w_1) + k_2 × V(w_2) + ... + k_{n-1} × V(w_{n-1}) + k_n × V(w_n)
Step S600: calculate the cosine similarity between the sentence vector V and the sentence vectors of the pre-stored sentences, and determine the pre-stored sentence with the largest cosine similarity (the characteristic pre-stored sentence).
The database pre-stores a large number of questions (that is, pre-stored sentences) and their corresponding answers, and each question is stored together with its sentence vector. Once the sentence vector V of the text information has been determined, the characteristic pre-stored sentence whose sentence vector has the largest cosine similarity with V is looked up in the database, and the answer associated with that characteristic pre-stored sentence is determined as the information to be fed back to the user.
Cosine similarity, also called cosine distance, uses the cosine of the angle between two vectors in a vector space as a measure of how much two individuals differ. If the vectors corresponding to sentences X and Y are (x_1, x_2, ..., x_6400) and (y_1, y_2, ..., y_6400) respectively, the cosine distance between them can be expressed by the cosine of the angle between them:
$$\cos\theta = \frac{\sum_{i=1}^{6400} x_i y_i}{\sqrt{\sum_{i=1}^{6400} x_i^{2}}\;\sqrt{\sum_{i=1}^{6400} y_i^{2}}}$$
When the cosine of the angle between two sentence vectors equals 1, the two sentences are identical; when it is close to 1, the two sentences are similar; the smaller the cosine, the less related the two sentences are.
By comparing the cosine similarity between the sentence vector V and the sentence vectors of the pre-stored sentences, the pre-stored sentence with the largest cosine similarity (the characteristic pre-stored sentence) can be found.
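A minimal sketch of this retrieval step follows, assuming the pre-stored sentence vectors are held in a dictionary (the storage layout is an assumption; a production system would query a database as described above):

```python
# Cosine similarity and lookup of the best-matching pre-stored sentence.
import numpy as np

def cosine_similarity(v1, v2):
    # cos(theta) = (v1 . v2) / (|v1| * |v2|), as in the formula above.
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def best_match(query_vector, stored_vectors):
    # stored_vectors: dict mapping each pre-stored sentence to its vector.
    return max(stored_vectors,
               key=lambda s: cosine_similarity(query_vector, stored_vectors[s]))
```

In the customer service scenario described next, best_match would return the closest pre-stored question, and the answer mapped to that question would be fed back to the user.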
For example, a user sends a question to an online customer service robot for consultation. After receiving the question, the robot calculates the sentence vector V of the question, looks up in the database the pre-stored question whose sentence vector has the largest cosine similarity with V, and returns the pre-stored answer associated with that pre-stored question to the user. For instance, the user asks "请问是不是包邮" ("could you tell me whether shipping is included"); the robot finds that the pre-stored question with the largest cosine similarity to the question's sentence vector is "请问是否包邮" ("is shipping included"); the database maps this pre-stored question to the answer "是的" ("yes"), so "yes" is fed back to the customer.
With the above method, the pre-stored sentence most similar to the text information (the characteristic pre-stored sentence) can be found, which improves the accuracy of question recognition in robot dialogue, information classification and similar applications, and thereby improves dialogue efficiency or classification efficiency.
FIG. 3 is a schematic diagram of the modules of a TF-IDF-based text information similarity matching apparatus according to one embodiment. Corresponding to the above TF-IDF-based text information similarity matching method, the present application further provides a TF-IDF-based text information similarity matching apparatus, comprising: an acquisition module 100, a word segmentation module 200, a word vector calculation module 300, a TF-IDF value calculation module 400, a sentence vector calculation module 500, and a matching module 600.
The acquisition module 100 is configured to acquire text information; the word segmentation module 200 is configured to segment the text information into the word segments w_1, w_2, ..., w_{n-1}, w_n; the word vector calculation module 300 is configured to calculate the word vectors V(w_1), V(w_2), ..., V(w_{n-1}), V(w_n) of the segments using the CBOW model; the TF-IDF value calculation module 400 is configured to calculate the TF-IDF values k_1, k_2, ..., k_{n-1}, k_n of the segments using the TF-IDF algorithm; the sentence vector calculation module 500 is configured to obtain the sentence vector V from the products of the word vectors of the segments and their corresponding TF-IDF values; and the matching module 600 is configured to calculate the cosine similarity between the sentence vector V and the sentence vectors of the pre-stored sentences and to determine the pre-stored sentence with the largest cosine similarity. A sketch of how these modules might be wired together follows.
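As an architectural illustration only, the following sketch wires the six modules into one device; the constructor arguments and method names are assumptions made for this sketch (the application defines the modules, not their programming interfaces), and best_match is the helper from the cosine-similarity sketch above.

```python
# Illustrative wiring of modules 100-600 (interfaces are assumed).
class TextSimilarityMatcher:
    def __init__(self, segmenter, w2v_model, tfidf_fn, database):
        self.segmenter = segmenter  # word segmentation module 200
        self.w2v = w2v_model        # word vector calculation module 300
        self.tfidf_fn = tfidf_fn    # TF-IDF value calculation module 400
        self.database = database    # pre-stored sentence -> sentence vector

    def match(self, text):                                    # module 100 input
        words = self.segmenter(text)                          # module 200
        vectors = [self.w2v.wv[w] for w in words]             # module 300
        weights = [self.tfidf_fn(w) for w in words]           # module 400
        v = sum(k * vec for k, vec in zip(weights, vectors))  # module 500
        return best_match(v, self.database)                   # module 600
```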
The acquisition module 100 acquires text information. The text information here may be entered by the user directly, or it may be text recognized from voice data produced by the user.
For example, a user sends a text message to an online customer service robot for consultation, and the text message received by the robot is the acquired text information. As another example, a user sends a text message to an online chat robot for an online chat, and the message received is the acquired text information. The text information may be a single sentence or a passage; neither its length nor the language used is limited here.
Of course, if the user sends a voice message, the acquisition module 100 needs to perform speech recognition on it. Specifically, the acquisition module 100 acquires the voice message sent by the user and performs speech recognition on it to generate the text information. Speech recognition technology is widely applied and is not described in detail here, though a brief sketch follows.
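One possible sketch of this step uses the third-party SpeechRecognition package with Google's recognizer; the package, engine, language code and file name are assumptions for illustration, since the application does not prescribe any particular speech recognition technology.

```python
# Assumed speech-to-text step using the SpeechRecognition package.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("voice_message.wav") as source:  # hypothetical file name
    audio = recognizer.record(source)              # read the full recording
text_information = recognizer.recognize_google(audio, language="zh-CN")
```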
Of course, the above examples use online robots, but physical robots are not excluded, such as sweeping robots, children's education robots, customer service robots, chat robots and other intelligent robots with physical bodies.
The word segmentation module 200 segments the text information into the word segments w_1, w_2, ..., w_{n-1}, w_n.
Take Chinese word segmentation as an example. Chinese word segmentation refers to splitting a sequence of Chinese characters into individual words, that is, recombining a continuous character sequence into a word sequence according to certain rules. w_1, w_2, ..., w_{n-1}, w_n are the individual words segmented from the text information.
In some embodiments, word segmentation algorithms fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics.
Segmentation based on string matching: this method, also called mechanical segmentation, matches the Chinese character string to be analysed against the entries of a "sufficiently large" machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is recognized). Widely used matching strategies include the following:
1) forward maximum matching (scanning from left to right);
2) reverse maximum matching (scanning from right to left);
3) minimum segmentation (minimizing the number of words cut out of each sentence);
4) bidirectional maximum matching (scanning twice, from left to right and from right to left).
Segmentation based on understanding: this method achieves word recognition by having the computer simulate a human's understanding of the sentence. The basic idea is to perform syntactic and semantic analysis at the same time as segmentation, and to use the syntactic and semantic information to resolve ambiguity. Such a system usually comprises three parts: a segmentation subsystem, a syntactic-semantic subsystem, and an overall control part. Under the coordination of the overall control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to judge segmentation ambiguity; that is, it simulates the process by which a human understands a sentence. This segmentation method requires a large amount of linguistic knowledge and information.
Segmentation based on statistics: given a large amount of already-segmented text, a statistical machine learning model is used to learn the rules of word segmentation (this is called training), so that unknown text can then be segmented. Examples include maximum probability segmentation and maximum entropy segmentation. The main statistical models are the N-gram model, the hidden Markov model (HMM), the maximum entropy model (ME), and conditional random fields (CRF).
In some embodiments, the word segmentation module 200 may segment the text information with a statistics-based method, for example with the jieba word segmentation component. jieba is a Chinese word segmentation component developed in Python by Chinese programmers.
In one embodiment, in the course of segmenting the text information into the word segments w_1, w_2, ..., w_{n-1}, w_n, the word segmentation module 200 also removes the stop words from the text information, as the sketch below illustrates.
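A minimal sketch of segmentation with jieba plus stop-word removal; the stop-word list is an illustrative assumption (real systems typically load a much larger list from a file):

```python
# jieba segmentation with a small, assumed stop-word list.
import jieba

STOP_WORDS = {"的", "了", "吗", "呢"}

def segment(text):
    # jieba.lcut returns the segmented words as a list.
    return [w for w in jieba.lcut(text) if w not in STOP_WORDS and w.strip()]

print(segment("请问这个商品是不是包邮"))
# e.g. ['请问', '这个', '商品', '是不是', '包邮']
```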
The word vector calculation module 300 uses the CBOW model to calculate the word vectors V(w_1), V(w_2), ..., V(w_{n-1}), V(w_n) of the segments; the segments w_1, w_2, ..., w_{n-1}, w_n correspond to the word vectors V(w_1), V(w_2), ..., V(w_{n-1}), V(w_n) respectively.
The word vector calculation module 300 may calculate the word vectors of the segments with the word2vec CBOW model in the Gensim toolkit, for example as in the lookup sketch below.
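Building on the training sketch shown earlier, the word vectors V(w_1), ..., V(w_n) of the segmented words could be looked up as below; skipping out-of-vocabulary words is one possible policy, assumed here for illustration.

```python
# Look up V(w) for each segmented word in a trained Gensim model.
def word_vectors_for(words, model):
    # model.wv maps in-vocabulary words to their dense vectors.
    return {w: model.wv[w] for w in words if w in model.wv}
```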
word2vec, also known as word embeddings (in Chinese, "词向量", word vectors), converts the words of natural language into dense vectors that a computer can work with. Before word2vec appeared, natural language processing usually turned words into discrete, independent symbols, that is, One-Hot codes:
Hangzhou [0, 0, 0, 0, 0, 0, 0, 1, 0, ..., 0, 0, 0, 0, 0, 0, 0]
Shanghai [0, 0, 0, 0, 1, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0]
Ningbo [0, 0, 0, 1, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0]
Beijing [0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 1, 0, 0, 0, 0, 0, 0]
As in this example, Hangzhou, Shanghai, Ningbo, and Beijing each correspond to one vector in which a single component is 1 and all other components are 0. One-Hot encoding has two problems. First, the city codes are assigned at random and the vectors are mutually independent, so no relationship that may exist between the cities is visible. Second, the dimensionality of the vectors depends on the number of words in the corpus; if the vectors for every city name in the world were stacked into one matrix, that matrix would be extremely sparse and would cause the curse of dimensionality.
Vector representations solve this problem effectively. Word2Vec converts One-Hot codes into low-dimensional continuous values, that is, dense vectors, in which words with similar meanings are mapped to nearby positions in the vector space.
word2vec has two main modes: CBOW (Continuous Bag of Words) and Skip-Gram. CBOW predicts a target word from its surrounding context, whereas Skip-Gram does the opposite and predicts the context from the target word. CBOW is better suited to smaller corpora, while Skip-Gram performs better on large corpora.
The TF-IDF value calculation module 400 calculates the TF-IDF values k_1, k_2, ..., k_{n-1}, k_n of the individual word segments using the TF-IDF algorithm.
TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency.
In a given document, the term frequency (TF) is the frequency with which a given word appears in that document. The raw term count is normalised to prevent a bias towards long documents (the same word is likely to occur more times in a long document than in a short one, regardless of how important the word actually is). For a word t_i in a particular document d_j, its term frequency tf_{i,j} can be expressed as:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where n_{i,j} is the number of occurrences of the word t_i in document d_j, and the denominator is the sum of the occurrences of all words in document d_j.
The inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents that contain the word, and taking the logarithm of the quotient:
$$\mathrm{idf}_{i} = \log \frac{|D|}{|\{\, j : t_i \in d_j \,\}|}$$
where the numerator |D| is the total number of documents in the corpus, and the denominator |{j : t_i ∈ d_j}| is the number of documents containing the word t_i. If the word does not appear in the corpus at all, this denominator becomes zero, so 1 + |{j : t_i ∈ d_j}| can be used instead to ensure the denominator is never zero.
The TF-IDF value of the word t_i is then TF-IDF = tf_{i,j} × idf_i.
In this embodiment, the word segments w_1, w_2, ..., w_{n-1}, w_n correspond to the TF-IDF values k_1, k_2, ..., k_{n-1}, k_n respectively, where k_n = tf_n × idf_n, tf_n is the frequency (term frequency) with which the segment w_n appears in the text information, and idf_n is the inverse document frequency of the segment w_n.
The sentence vector calculation module 500 obtains the sentence vector V from the products of the word vectors of the individual segments and their corresponding TF-IDF values. The more important a word is to the text information, the larger its TF-IDF value; the TF-IDF value therefore represents the importance of each word and can be understood as a weight.
Assuming 1 ≤ m ≤ n, the word vector V(w_m) of the segment w_m is multiplied by the TF-IDF value k_m of that segment, giving the product H_m = V(w_m) × k_m. The sentence vector V is then obtained from the products H_1, H_2, ..., H_{n-1}, H_n of the word vectors of the segments w_1, w_2, ..., w_{n-1}, w_n and their corresponding TF-IDF values.
In one embodiment, the sentence vector calculation module 500 can obtain the sentence vector V with the formula:
V = H_1 + H_2 + ... + H_{n-1} + H_n
that is, written out in full:
V = k_1 × V(w_1) + k_2 × V(w_2) + ... + k_{n-1} × V(w_{n-1}) + k_n × V(w_n)
The matching module 600 calculates the cosine similarity between the sentence vector V and the sentence vectors of the pre-stored sentences, and determines the pre-stored sentence with the largest cosine similarity (the characteristic pre-stored sentence).
The database pre-stores a large number of questions (that is, pre-stored sentences) and their corresponding answers, and each question is stored together with its sentence vector. Once the sentence vector V of the text information has been determined, the characteristic pre-stored sentence whose sentence vector has the largest cosine similarity with V is looked up in the database, and the answer associated with that characteristic pre-stored sentence is determined as the information to be fed back to the user.
Cosine similarity, also called cosine distance, uses the cosine of the angle between two vectors in a vector space as a measure of how much two individuals differ. If the vectors corresponding to sentences X and Y are (x_1, x_2, ..., x_6400) and (y_1, y_2, ..., y_6400) respectively, the cosine distance between them can be expressed by the cosine of the angle between them:
$$\cos\theta = \frac{\sum_{i=1}^{6400} x_i y_i}{\sqrt{\sum_{i=1}^{6400} x_i^{2}}\;\sqrt{\sum_{i=1}^{6400} y_i^{2}}}$$
When the cosine of the angle between two sentence vectors equals 1, the two sentences are identical; when it is close to 1, the two sentences are similar; the smaller the cosine, the less related the two sentences are.
By comparing the cosine similarity between the sentence vector V and the sentence vectors of the pre-stored sentences, the matching module 600 can find the pre-stored sentence with the largest cosine similarity (the characteristic pre-stored sentence).
For example, a user sends a question to an online customer service robot for consultation. After receiving the question, the robot calculates the sentence vector V of the question, looks up in the database the pre-stored question whose sentence vector has the largest cosine similarity with V, and returns the pre-stored answer associated with that pre-stored question to the user. For instance, the user asks "请问是不是包邮" ("could you tell me whether shipping is included"); the robot finds that the pre-stored question with the largest cosine similarity to the question's sentence vector is "请问是否包邮" ("is shipping included"); the database maps this pre-stored question to the answer "是的" ("yes"), so "yes" is fed back to the customer.
With the above apparatus, the pre-stored sentence most similar to the text information (the characteristic pre-stored sentence) can be found, which improves the accuracy of question recognition in robot dialogue, information classification and similar applications, and thereby improves dialogue efficiency or classification efficiency.
The present application further provides a computer device comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the TF-IDF-based text information similarity matching method of any of the above embodiments.
The present application further provides a storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the TF-IDF-based text information similarity matching method of any of the above embodiments.
In the above TF-IDF-based text information similarity matching method, apparatus, computer device and storage medium, text information is acquired; the text information is segmented into the word segments w_1, w_2, ..., w_{n-1}, w_n; the word vectors V(w_1), V(w_2), ..., V(w_{n-1}), V(w_n) of the segments are calculated with the CBOW model; the TF-IDF values k_1, k_2, ..., k_{n-1}, k_n of the segments are calculated with the TF-IDF algorithm; the sentence vector V is obtained from the products of the word vectors of the segments and their corresponding TF-IDF values; and the cosine similarity between the sentence vector V and the sentence vectors of the pre-stored sentences is calculated to determine the pre-stored sentence with the largest cosine similarity. Through this process, the pre-stored sentence most similar to the text information can be found, which improves the accuracy of question recognition in robot dialogue, information classification and similar applications, and thereby improves dialogue efficiency or classification efficiency.
A person of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a computer readable storage medium and, when executed, may include the flows of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM).

Claims (16)

  1. A TF-IDF-based text information similarity matching method, comprising the following steps:
    acquiring text information;
    segmenting the text information to obtain word segments w_1, w_2, ..., w_{n-1}, w_n;
    calculating word vectors V(w_1), V(w_2), ..., V(w_{n-1}), V(w_n) of the word segments using a CBOW model;
    calculating TF-IDF values k_1, k_2, ..., k_{n-1}, k_n of the word segments using a TF-IDF algorithm;
    obtaining a sentence vector V from the products of the word vectors of the word segments and the corresponding TF-IDF values;
    calculating the cosine similarity between the sentence vector V and the sentence vectors of pre-stored sentences, and determining the pre-stored sentence with the largest cosine similarity.
  2. The TF-IDF-based text information similarity matching method according to claim 1, wherein, in the course of segmenting the text information to obtain the word segments w_1, w_2, ..., w_{n-1}, w_n, stop words are also removed from the text information.
  3. The TF-IDF-based text information similarity matching method according to claim 1, wherein the text information is segmented using the jieba word segmentation component.
  4. The TF-IDF-based text information similarity matching method according to claim 1, wherein the sentence vector V is obtained using the following formula:
    V = k_1 × V(w_1) + k_2 × V(w_2) + ... + k_{n-1} × V(w_{n-1}) + k_n × V(w_n).
  5. A TF-IDF-based text information similarity matching apparatus, comprising:
    an acquisition module, configured to acquire text information;
    a word segmentation module, configured to segment the text information to obtain word segments w_1, w_2, ..., w_{n-1}, w_n;
    a word vector calculation module, configured to calculate word vectors V(w_1), V(w_2), ..., V(w_{n-1}), V(w_n) of the word segments using a CBOW model;
    a TF-IDF value calculation module, configured to calculate TF-IDF values k_1, k_2, ..., k_{n-1}, k_n of the word segments using a TF-IDF algorithm;
    a sentence vector calculation module, configured to obtain a sentence vector V from the products of the word vectors of the word segments and the corresponding TF-IDF values;
    a matching module, configured to calculate the cosine similarity between the sentence vector V and the sentence vectors of pre-stored sentences, and to determine the pre-stored sentence with the largest cosine similarity.
  6. The TF-IDF-based text information similarity matching apparatus according to claim 5, wherein the word segmentation module also removes stop words from the text information.
  7. The TF-IDF-based text information similarity matching apparatus according to claim 5, wherein the word segmentation module segments the text information using the jieba word segmentation component.
  8. The TF-IDF-based text information similarity matching apparatus according to claim 5, wherein the sentence vector calculation module obtains the sentence vector V using the following formula:
    V = k_1 × V(w_1) + k_2 × V(w_2) + ... + k_{n-1} × V(w_{n-1}) + k_n × V(w_n).
  9. A computer device comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform a TF-IDF-based text information similarity matching method comprising the following steps:
    acquiring text information;
    segmenting the text information to obtain word segments w_1, w_2, ..., w_{n-1}, w_n;
    calculating word vectors V(w_1), V(w_2), ..., V(w_{n-1}), V(w_n) of the word segments using a CBOW model;
    calculating TF-IDF values k_1, k_2, ..., k_{n-1}, k_n of the word segments using a TF-IDF algorithm;
    obtaining a sentence vector V from the products of the word vectors of the word segments and the corresponding TF-IDF values;
    calculating the cosine similarity between the sentence vector V and the sentence vectors of pre-stored sentences, and determining the pre-stored sentence with the largest cosine similarity.
  10. The computer device according to claim 9, wherein, in the course of segmenting the text information to obtain the word segments w_1, w_2, ..., w_{n-1}, w_n, stop words are also removed from the text information.
  11. The computer device according to claim 9, wherein the text information is segmented using the jieba word segmentation component.
  12. The computer device according to claim 9, wherein the sentence vector V is obtained using the following formula:
    V = k_1 × V(w_1) + k_2 × V(w_2) + ... + k_{n-1} × V(w_{n-1}) + k_n × V(w_n).
  13. A non-volatile storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform a TF-IDF-based text information similarity matching method comprising the following steps:
    acquiring text information;
    segmenting the text information to obtain word segments w_1, w_2, ..., w_{n-1}, w_n;
    calculating word vectors V(w_1), V(w_2), ..., V(w_{n-1}), V(w_n) of the word segments using a CBOW model;
    calculating TF-IDF values k_1, k_2, ..., k_{n-1}, k_n of the word segments using a TF-IDF algorithm;
    obtaining a sentence vector V from the products of the word vectors of the word segments and the corresponding TF-IDF values;
    calculating the cosine similarity between the sentence vector V and the sentence vectors of pre-stored sentences, and determining the pre-stored sentence with the largest cosine similarity.
  14. The non-volatile storage medium according to claim 13, wherein, in the course of segmenting the text information to obtain the word segments w_1, w_2, ..., w_{n-1}, w_n, stop words are also removed from the text information.
  15. The non-volatile storage medium according to claim 13, wherein the text information is segmented using the jieba word segmentation component.
  16. The non-volatile storage medium according to claim 13, wherein the sentence vector V is obtained using the following formula:
    V = k_1 × V(w_1) + k_2 × V(w_2) + ... + k_{n-1} × V(w_{n-1}) + k_n × V(w_n).
PCT/CN2018/102855 2018-04-10 2018-08-29 Text information similarity matching method and apparatus, computer device, and storage medium WO2019196314A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810314094.8A CN108628825A (en) 2018-04-10 2018-04-10 Text message Similarity Match Method, device, computer equipment and storage medium
CN201810314094.8 2018-04-10

Publications (1)

Publication Number Publication Date
WO2019196314A1 true WO2019196314A1 (en) 2019-10-17

Family

ID=63704921

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102855 WO2019196314A1 (en) 2018-04-10 2018-08-29 Text information similarity matching method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN108628825A (en)
WO (1) WO2019196314A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488429A (en) * 2020-03-19 2020-08-04 杭州叙简科技股份有限公司 Short text clustering system based on search engine and short text clustering method thereof

Families Citing this family (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109511000B (en) * 2018-11-06 2021-10-15 武汉斗鱼网络科技有限公司 Bullet screen category determination method, bullet screen category determination device, bullet screen category determination equipment and storage medium
CN109657232A (en) * 2018-11-16 2019-04-19 北京九狐时代智能科技有限公司 A kind of intension recognizing method
CN109740143B (en) * 2018-11-28 2022-08-23 平安科技(深圳)有限公司 Sentence distance mapping method and device based on machine learning and computer equipment
CN109658938B (en) * 2018-12-07 2020-03-17 百度在线网络技术(北京)有限公司 Method, device and equipment for matching voice and text and computer readable medium
CN109657213B (en) * 2018-12-21 2023-07-28 北京金山安全软件有限公司 Text similarity detection method and device and electronic equipment
CN111382246B (en) * 2018-12-29 2023-03-14 深圳市优必选科技有限公司 Text matching method, matching device, terminal and computer readable storage medium
CN111428486B (en) * 2019-01-08 2023-06-23 北京沃东天骏信息技术有限公司 Article information data processing method, device, medium and electronic equipment
CN109697367B (en) * 2019-01-09 2021-08-24 腾讯科技(深圳)有限公司 Method for displaying blockchain data, blockchain browser, user node and medium
CN109885813B (en) * 2019-02-18 2023-04-28 武汉瓯越网视有限公司 Text similarity operation method and system based on word coverage
CN110083809A (en) * 2019-03-16 2019-08-02 平安城市建设科技(深圳)有限公司 Contract terms similarity calculating method, device, equipment and readable storage medium storing program for executing
CN110096681B (en) * 2019-03-16 2023-11-17 平安科技(深圳)有限公司 Contract term analysis method, apparatus, device and readable storage medium
CN110163478B (en) * 2019-04-18 2024-04-05 平安科技(深圳)有限公司 Risk examination method and device for contract clauses
CN110232914A (en) * 2019-05-20 2019-09-13 平安普惠企业管理有限公司 A kind of method for recognizing semantics, device and relevant device
CN110188180B (en) * 2019-05-31 2021-06-01 腾讯科技(深圳)有限公司 Method and device for determining similar problems, electronic equipment and readable storage medium
CN110471835B (en) * 2019-07-03 2022-07-19 南瑞集团有限公司 Similarity detection method and system based on code files of power information system
CN110516210B (en) * 2019-08-22 2023-06-27 北京影谱科技股份有限公司 Text similarity calculation method and device
CN112445910B (en) * 2019-09-02 2022-12-27 上海哔哩哔哩科技有限公司 Information classification method and system
CN110704621B (en) * 2019-09-25 2023-04-21 北京大米科技有限公司 Text processing method and device, storage medium and electronic equipment
CN110674635B (en) * 2019-09-27 2023-04-25 北京妙笔智能科技有限公司 Method and device for dividing text paragraphs
CN110738059B (en) * 2019-10-21 2023-07-14 支付宝(杭州)信息技术有限公司 Text similarity calculation method and system
CN110825859A (en) * 2019-10-21 2020-02-21 拉扎斯网络科技(上海)有限公司 Retrieval method, retrieval device, readable storage medium and electronic equipment
CN110956031A (en) * 2019-11-13 2020-04-03 广州供电局有限公司 Text similarity matching method, device and system
CN111507085B (en) * 2019-11-25 2023-07-07 江苏艾佳家居用品有限公司 Sentence pattern recognition method
CN111144068A (en) * 2019-11-26 2020-05-12 方正璞华软件(武汉)股份有限公司 Similar arbitration case recommendation method and device
CN111104794B (en) * 2019-12-25 2023-07-04 同方知网数字出版技术股份有限公司 Text similarity matching method based on subject term
CN111192682B (en) * 2019-12-25 2024-04-09 上海联影智能医疗科技有限公司 Image exercise data processing method, system and storage medium
WO2021128342A1 (en) * 2019-12-27 2021-07-01 西门子(中国)有限公司 Document processing method and apparatus
CN113139034A (en) * 2020-01-17 2021-07-20 深圳市优必选科技股份有限公司 Statement matching method, statement matching device and intelligent equipment
CN111310478B (en) * 2020-03-18 2023-09-19 电子科技大学 Similar sentence detection method based on TF-IDF and word vector
CN111476026A (en) * 2020-03-24 2020-07-31 珠海格力电器股份有限公司 Statement vector determination method and device, electronic equipment and storage medium
CN111539196A (en) * 2020-04-15 2020-08-14 京东方科技集团股份有限公司 Text duplicate checking method and device, text management system and electronic equipment
CN111627512A (en) * 2020-05-29 2020-09-04 北京大恒普信医疗技术有限公司 Recommendation method and device for similar medical records, electronic equipment and storage medium
CN113807073B (en) * 2020-06-16 2023-11-14 中国电信股份有限公司 Text content anomaly detection method, device and storage medium
CN111898380A (en) * 2020-08-17 2020-11-06 上海熙满网络科技有限公司 Text matching method and device, electronic equipment and storage medium
CN112002413B (en) * 2020-08-23 2023-09-29 吾征智能技术(北京)有限公司 Intelligent cognitive system, equipment and storage medium for cardiovascular system infection
CN112002415B (en) * 2020-08-23 2024-03-01 吾征智能技术(北京)有限公司 Intelligent cognitive disease system based on human excrement
CN112084791A (en) * 2020-08-31 2020-12-15 北京洛必德科技有限公司 Dialog process intention extraction and utterance prompting method and system and electronic equipment thereof
CN112087448B (en) * 2020-09-08 2023-04-14 南方电网科学研究院有限责任公司 Security log extraction method and device and computer equipment
CN112183111A (en) * 2020-09-28 2021-01-05 亚信科技(中国)有限公司 Long text semantic similarity matching method and device, electronic equipment and storage medium
CN112163071A (en) * 2020-09-28 2021-01-01 广州数鹏通科技有限公司 Unsupervised learning analysis method and system for information correlation degree of emergency
CN112230772B (en) * 2020-10-14 2021-05-28 华中师范大学 Virtual-actual fused teaching aid automatic generation method
CN112257431A (en) * 2020-10-30 2021-01-22 中电万维信息技术有限责任公司 NLP-based short text data processing method
CN112446297B (en) * 2020-10-31 2024-03-26 浙江工业大学 Electronic vision aid and intelligent mobile phone text auxiliary reading method applicable to same
CN112559658B (en) * 2020-12-08 2022-12-30 中国科学技术大学 Address matching method and device
CN112527988A (en) * 2020-12-14 2021-03-19 深圳市优必选科技股份有限公司 Automatic reply generation method and device and intelligent equipment
CN112765950A (en) * 2021-01-08 2021-05-07 首都师范大学 Template library generation method and system based on cosine similarity and storage medium
CN112861536A (en) * 2021-01-28 2021-05-28 张治 Research learning ability portrayal method, device, computing equipment and storage medium
CN113239150B (en) * 2021-05-17 2024-02-27 平安科技(深圳)有限公司 Text matching method, system and equipment
CN113486165A (en) * 2021-07-08 2021-10-08 山东新一代信息产业技术研究院有限公司 FAQ automatic question answering method, equipment and medium for cloud robot
CN113722438B (en) * 2021-08-31 2023-06-23 平安科技(深圳)有限公司 Sentence vector generation method and device based on sentence vector model and computer equipment
CN113688954A (en) * 2021-10-25 2021-11-23 苏州浪潮智能科技有限公司 Method, system, equipment and storage medium for calculating text similarity
CN114970551A (en) * 2022-07-27 2022-08-30 阿里巴巴达摩院(杭州)科技有限公司 Text processing method and device and electronic equipment
CN114996439A (en) * 2022-08-01 2022-09-02 太极计算机股份有限公司 Text search method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744905A (en) * 2013-12-25 2014-04-23 新浪网技术(中国)有限公司 Junk mail judgment method and device
CN105955965A (en) * 2016-06-21 2016-09-21 上海智臻智能网络科技股份有限公司 Question information processing method and device
CN106227722A (en) * 2016-09-12 2016-12-14 中山大学 A kind of extraction method based on listed company's bulletin summary
CN106547734A (en) * 2016-10-21 2017-03-29 上海智臻智能网络科技股份有限公司 A kind of question sentence information processing method and device
CN106599148A (en) * 2016-12-02 2017-04-26 东软集团股份有限公司 Method and device for generating abstract
KR20170120389A (en) * 2016-04-21 2017-10-31 (주)원제로소프트 Method and system for managing total financial information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273350A (en) * 2017-05-16 2017-10-20 广东电网有限责任公司江门供电局 A kind of information processing method and its device for realizing intelligent answer


Also Published As

Publication number Publication date
CN108628825A (en) 2018-10-09

Similar Documents

Publication Publication Date Title
WO2019196314A1 (en) Text information similarity matching method and apparatus, computer device, and storage medium
CN110210029B (en) Method, system, device and medium for correcting error of voice text based on vertical field
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
CN110737758A (en) Method and apparatus for generating a model
CN108681574B (en) Text abstract-based non-fact question-answer selection method and system
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN112069298A (en) Human-computer interaction method, device and medium based on semantic web and intention recognition
KR101224660B1 (en) A searching apparatus and method for similar sentence, a storage means and a service system and method for automatic chatting
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN111859987A (en) Text processing method, and training method and device of target task model
CN109783825B (en) Neural network-based ancient language translation method
CN111191002A (en) Neural code searching method and device based on hierarchical embedding
CN110162630A (en) A kind of method, device and equipment of text duplicate removal
CN108073571B (en) Multi-language text quality evaluation method and system and intelligent text processing system
CN110895559A (en) Model training method, text processing method, device and equipment
CN111428490A (en) Reference resolution weak supervised learning method using language model
CN114298055B (en) Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
CN111160014A (en) Intelligent word segmentation method
CN110377753B (en) Relation extraction method and device based on relation trigger word and GRU model
CN112559711A (en) Synonymous text prompting method and device and electronic equipment
CN112632272A (en) Microblog emotion classification method and system based on syntactic analysis
CN117076636A (en) Information query method, system and equipment for intelligent customer service
CN114943220B (en) Sentence vector generation method and duplicate checking method for scientific research establishment duplicate checking

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18914309

Country of ref document: EP

Kind code of ref document: A1