WO2019196314A1

WO2019196314A1 - Text information similarity matching method and apparatus, computer device, and storage medium

Info

Publication number: WO2019196314A1
Application number: PCT/CN2018/102855
Authority: WO
Inventors: 周涛涛; 周宝; 王健宗; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-04-10
Filing date: 2018-08-29
Publication date: 2019-10-17
Also published as: CN108628825A

Abstract

Provided are a TF-IDF-based text information similarity matching method and apparatus. The method comprises: acquiring text information; carrying out word segmentation on the text information to obtain segmented words w₁, w₂,..., w_n-1 and w_n; using a CBOW model to calculate word vectors V(w₁), V(w₂),..., V(w_n-1) and V(w_n) of the segmented words; using a TF-IDF algorithm to calculate TF-IDF values k₁, k₂,..., k_n-1 and k_n of the segmented words; obtaining a sentence vector V according to products of the word vectors of the segmented words and the corresponding TF-IDF values; and calculating the cosine similarity between the sentence vector V and sentence vectors of pre-stored statements, and determining a pre-stored statement having the maximum cosine similarity. By means of the process, a pre-stored statement that is most similar to text information can be found, and the accuracy of problem recognition can be improved in the aspects of robot conversation, information classification, etc., thus improving the conversation efficiency or the classification efficiency. Further provided are a computer device and a storage medium.

Description

Text information similarity matching method, device, computer equipment and storage medium

This application claims the priority of the Chinese Patent Application entitled "Text Information Similarity Matching Method, Apparatus, Computer Equipment, and Storage Medium" submitted to the Chinese Patent Office on April 10, 2018, application number 201810314094.8, the entire contents of which are incorporated herein by reference. This is incorporated herein by reference.

Technical field

The present application relates to the field of text information recognition technology, and in particular, to a method and apparatus for matching text information similarity based on TF-IDF, and a computer device and a storage medium storing computer readable instructions.

Background technique

With the development of intelligence, customer service robots and chat bots are becoming more and more popular. Users can input text messages to consult customer service robots or chat with chat bots.

The inventor realized that when the robot recognizes the text information sent by the user, it needs to perform feedback based on the text information. Generally, the feedback information can be determined according to the retrieval method or the generation manner according to the text information. The generation method is to automatically generate an answer based on the model. This method requires a large number of question and answer pairs to train. The current effect is not satisfactory and is in the research stage. The retrieval method has been widely adopted by the industry. The edited question and answer pairs are pre-stored, and then the matching problem is found according to the problem to find the most matching preset problem, thereby retrieving the preset answer. At present, the text matching method of the retrieval method needs to be improved in accuracy.

Summary of the invention

The purpose of the present application is to solve at least one of the above technical drawbacks, in particular, technical defects with low precision.

The present application provides a TF-IDF-based text information similarity matching method, including the following steps: acquiring text information; segmenting the text information to obtain each participle w ₁ , w ₂ , ... w _n-1 , w _n ; using the CBOW model to calculate the word vectors V(w ₁ ), V(w ₂ ), ..., V(w _n-1 ), V(w _n ) of each participle; using the TF-IDF algorithm to calculate the TF of each participle -IDF values k ₁ , k ₂ , ... k _n-1 , k _n ; the sentence vector V is obtained from the product of the word vector of each word segment and the corresponding TF-IDF value; the sentence vector V is calculated and the sentence vector of the pre-stored sentence is calculated The cosine similarity between them determines the pre-stored statement with the largest cosine similarity.

The present application further provides a matching device for text information similarity based on TF-IDF, comprising: an obtaining module, configured to obtain text information; and a word segmentation module, configured to perform segmentation on the text information to obtain each participle w ₁ , w ₂ , w _n-1 , w _n ; a word vector calculation module for calculating the word vectors V(w ₁ ), V(w ₂ ), ..., V(w _n-1 ) of each participle using the CBOW model, V(w _n );

a TF-IDF value calculation module for calculating TF-IDF values k ₁ , k ₂ , ... k _n-1 , k _{n of} each word segment using a TF-IDF algorithm; a sentence vector calculation module for words according to each word segment The product of the vector and the corresponding TF-IDF value is obtained as a sentence vector V; a matching module is configured to calculate a cosine similarity between the sentence vector V and a sentence vector of the pre-stored sentence, and determine a pre-stored statement with the largest cosine similarity.

The application also provides a computer device comprising a memory and a processor, the memory storing computer readable instructions, the computer readable instructions being executed by the processor, causing the processor to perform a TF based a method for matching text information similarity of IDF, the method for matching similarity of text information based on TF-IDF includes the following steps: acquiring text information; and segmenting the text information to obtain each participle w ₁ , w ₂ , ... ...w _n-1 , w _n ; use the CBOW model to calculate the word vectors V(w ₁ ), V(w ₂ ), ...V(w _n-1 ), V(w _n ) of each participle; use TF-IDF The algorithm calculates the TF-IDF values k ₁ , k ₂ , ... k _n-1 , k _{n of} each word segment; the sentence vector V is obtained according to the product of the word vector of each word segment and the corresponding TF-IDF value; the sentence vector V is calculated The cosine similarity between the sentence vector of the pre-stored statement determines the pre-stored statement with the largest cosine similarity.

The application also provides a non-volatile storage medium storing computer readable instructions that, when executed by one or more processors, cause one or more processors to perform a TF-ID based The matching method of text information similarity, the TF-IDF-based text information similarity matching method comprises the following steps: acquiring text information; and segmenting the text information to obtain each participle w ₁ , w ₂ , ... w _N-1 , w _n ; use the CBOW model to calculate the word vectors V(w ₁ ), V(w ₂ ), ... V(w _n-1 ), V(w _n ) of each participle; use the TF-IDF algorithm to calculate The TF-IDF values k ₁ , k ₂ , ... k _n-1 , k _{n of} each participle; the sentence vector V is obtained according to the product of the word vector of each word segment and the corresponding TF-IDF value; the sentence vector V is calculated and pre-stored The cosine similarity between the sentence vectors of the statement determines the pre-stored statement with the largest cosine similarity.

The above TF-IDF-based text information similarity matching method, device, computer device and storage medium can obtain the pre-existing statement most similar to the text information through the above process, and can improve the problem identification in the robot dialogue and information classification. Accuracy, which improves dialogue efficiency or classification efficiency.

DRAWINGS

1 is a schematic diagram showing the internal structure of a computer device in an embodiment;

2 is a schematic flow chart of a TF-IDF-based text information similarity matching method according to an embodiment;

FIG. 3 is a schematic diagram of a TF-IDF-based text information similarity matching device module according to an embodiment.

detailed description

FIG. 1 is a schematic diagram showing the internal structure of a computer device in an embodiment. As shown in FIG. 1, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected by a system bus. The non-volatile storage medium of the computer device stores an operating system, a database, and computer readable instructions. The database may store a sequence of control information. When the computer readable instructions are executed by the processor, the processor may implement a processor. A matching method based on TF-IDF for text information similarity. The processor of the computer device is used to provide computing and control capabilities to support the operation of the entire computer device. The computer device can store, in a memory of the computer device, computer readable instructions that, when executed by the processor, cause the processor to perform a TF-IDF based text information similarity matching method. The network interface of the computer device is used to communicate with the terminal connection. It will be understood by those skilled in the art that the structure shown in FIG. 1 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation of the computer device to which the solution of the present application is applied. The specific computer device may It includes more or fewer components than those shown in the figures, or some components are combined, or have different component arrangements.

The TF-IDF-based text information similarity matching method described below can be applied to sentence recognition in a robot dialogue, such as a customer service robot (including an online virtual customer service robot) to identify a customer's consultation, a chat robot (including an online virtual chat robot). Identify the customer's voice or entered text messages. It can also be applied to the information classification method, and will not be described here.

To identify text information, you need to generate a CBOW word vector model. The specific process is as follows:

1. Crawl the corpus through the web crawler. You can use Python crawlers to crawl corpora from web encyclopedias such as Wikipedia, Google Encyclopedia, Baidu Encyclopedia, Sogou Encyclopedia, and more.

2. Pre-processing the corpus in the corpus. Preprocessing includes removing special characters, removing URLs, transcoding, and so on.

3. Segment the corpus in the corpus. Chinese vocabulary can be performed on the corpus by using the jieba participle.

4. Train the corpus of the word segmentation to generate the CBOW word vector model. The vocabulary of the word segmentation can be trained by the word2vec CBOW model in the Gensim toolkit to generate and save the CBOW word vector model.

After generating the CBOW word vector model, the model can be used to generate word vectors in subsequent methods.

FIG. 2 is a schematic flow chart of a TF-IDF-based text information similarity matching method according to an embodiment. The present application provides a TF-IDF-based text information similarity matching method, including the following steps:

Step S100: Acquire text information. The text information here may be input by the user or may be text information recognized according to the voice data output by the user.

For example, the user performs online consultation by sending a text message to the online customer service robot, and the text message received by the online customer service robot acquires the text information. For another example, the user performs online chat by sending a text message to the online chat robot, and the text message received by the online customer service robot acquires the text information. The text information may be a sentence or a paragraph, and the length of the text information and the type of language used are not limited here.

Of course, if the user sends a voice message, the voice message needs to be voice-recognized, specifically: acquiring a voice message sent by the user; performing voice recognition on the voice message to generate text information. Speech recognition technology is widely used and will not be described here.

Of course, the above example is an example of an online robot, but it is not excluded to be a physical robot, such as a sweeping robot, a child educational robot, a customer service robot, a chat robot, and the like, and an intelligent robot having a physical limb.

Step S200: segmentation of the text information to obtain the respective participles w ₁ , w ₂ , ... w _n-1 , w _n .

Take the Chinese word segmentation as an example. Chinese Word Segmentation refers to the division of a sequence of Chinese characters into a single word. Word segmentation is the process of recombining consecutive word sequences into word sequences according to certain specifications. w ₁ , w ₂ , ... w _n-1 , w _n are a single word segmented from the text information.

In some embodiments, the word segmentation algorithm can be divided into three types: a word segmentation method based on string matching, a word segmentation method based on understanding, and a word segmentation method based on statistics.

Word segmentation based on word matching: This method is also called mechanical word segmentation method. It matches the Chinese character string to be analyzed with the term in a "sufficiently large" machine dictionary according to a certain strategy. If found in the dictionary, if it is found in the dictionary. A string, the match is successful (a word is recognized). There are several ways to apply a wide range of matching methods:

1) Forward maximum matching method (from left to right)

2) Reverse maximum matching method (from right to left)

3) Minimal segmentation (minimum number of words cut out in each sentence)

4) Two-way maximum matching method (performed from left to right, right to left twice)

Word-based segmentation method based on understanding: This word segmentation method achieves the effect of identifying words by letting the computer simulate human understanding of the sentence. The basic idea is to perform syntactic and semantic analysis at the same time as word segmentation, and use syntactic information and semantic information to deal with ambiguity. It usually consists of three parts: the word segmentation subsystem, the syntactic and semantic subsystem, and the general control part. Under the coordination of the general control part, the word segmentation subsystem can obtain the syntactic and semantic information about words, sentences, etc. to judge the participle ambiguity, that is, it simulates the process of human understanding of the sentence. This method of word segmentation requires a large amount of linguistic knowledge and information.

Statistical-based word segmentation method: Give a large number of texts that have been segmented, and use the statistical machine learning model to learn the rules of word segmentation (called training), so as to achieve the segmentation of unknown text. For example, the maximum probability word segmentation method and the maximum entropy word segmentation method. The main statistical models are N-gram, Hidden Markov Model (HMM), Maximum Entropy Model (ME), and Conditional Random Fields (CRF).

In some embodiments, the textual information may be segmented using a statistical-based word segmentation method, such as segmentation of textual information using a jieba segmentation component. The stuttering part is a Chinese word segmentation component developed by Chinese programmers in Python.

In one of the embodiments, in the process of segmenting the text information to obtain the respective segmentation words w ₁ , w ₂ , ... w _n-1 , w _n , the stop words of the text information are also subjected to removal processing.

Step S300: Calculate word vectors V(w ₁ ), V(w ₂ ), ..., V(w _n-1 ), V(w _n ) of each participle using the CBOW model. Each of the word parts w ₁ , w ₂ , ... w _n-1 , w _n corresponds to the word vectors V(w ₁ ), V(w ₂ ), ..., V(w _n-1 ), V(w _n ), respectively.

The word vector for each participle can be calculated using the word2vec CBOW model in the Gensim toolkit.

Word2vec is also called word embeddings, the Chinese name "word vector", the role is to convert the words in natural language into a computer-readable Dense Vector. Prior to the advent of word2vec, natural language processing often turned words into discrete, separate symbols, the One-Hot Encoder.

Hangzhou[0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0]

Shanghai[0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0]

Ningbo [0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0]

Beijing[0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0]

For example, in the above example, in the corpus, Hangzhou, Shanghai, Ningbo, and Beijing each correspond to a vector, and only one value in the vector is 1, and the rest are 0. There are the following issues with the One-Hot Encoder. On the one hand, the city code is random, the vectors are independent of each other, and there is no possible relationship between the cities. Second, the size of the vector dimension depends on the number of words in the corpus. If the vectors corresponding to the names of all the cities in the world are combined into one matrix, then this matrix is too sparse and will cause dimensional disaster.

Using Vector Representations can effectively solve this problem. Word2Vec can convert One-Hot Encoder into low-dimensional continuous values, that is, dense vectors, and words with similar meanings will be mapped to similar positions in the vector space.

Word2vec is mainly divided into two modes: CBOW (Continuous Bag of Words) and Skip-Gram. CBOW guesses the target word from the original sentence, and Skip-Gram is the opposite, guessing the original statement from the target word. CBOW is suitable for small databases, while Skip-Gram performs better in large corpora.

Step S400: Calculate the TF-IDF values k ₁ , k ₂ , ... k _n-1 , k _{n of} each participle using the TF-IDF algorithm.

TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique for information retrieval and data mining. TF means Term Frequency and IDF means Inverse Document Frequency.

In a given document, the term frequency (TF) refers to the frequency at which a given word appears in the file. This number is a normalization of the term count to prevent it from biasing towards long files (the same word may have a higher number of words in a long file than a short file, regardless of whether the word is important or not). . For the word t _i in a particular file, its word frequency tf _i,j can be expressed as:

In the above formula, n _i,j is the number of occurrences of the word t _i in the file d _j , and the denominator is the sum of the occurrences of all the words in the file d _j .

The inverse document frequency (IDF) is a measure of the universal importance of a word. The IDF of a particular word can be obtained by dividing the total number of files by the number of files containing the word, and then obtaining the logarithm of the obtained quotient:

Wherein, the logarithmic numerator |D| is the total number of files in the corpus, and the denominator |{j:t _i ∈d _i }| is the number of files containing the word t _i (not equal to 0). If the word is not in the corpus, it will result in the dividend being zero, so you can use 1+|{j:t _i ∈d _i }| to ensure that the denominator is not zero.

Then the word t _i has TF-IDF=tf _i,j ×idf _i .

In the present embodiment, the TF-IDF values corresponding to the respective word segments w ₁ , w ₂ , ... w _n-1 , w _n are k ₁ , k ₂ , ... k _n-1 , k _{n , respectively} . Where k _n = tf _n × idf _n , where tf _n is the frequency (word frequency) in which the participle w _n appears in the text message, and idf _n is the inverse file frequency of the participle w _n .

Step S500: The sentence vector V is obtained according to the product of the word vector of each participle and the corresponding TF-IDF value. The higher the importance of a word to text information, the greater its TF-IDF value, and the TF-IDF value can represent the importance of each word, which can be understood as a weight.

Assuming that 1 ≤ _m ≤ _n , the word vector V(w _m ) of the participle w _m is multiplied by the TF-IDF value k _{m of the} participle w _m to obtain a multiplied value H _m = V(w _m ) × k _m . The sentence vector V is obtained from the product H ₁ , H ₂ , ... H _n-1 , H _n of the word vector of each of the partial words w ₁ , w ₂ , ... w _n-1 , w _n and the corresponding TF-IDF value.

In one of the embodiments, the sentence vector V can be obtained using the following formula:

V=H ₁ +H ₂ +...H _n-1 +H _n

Specifically, the sentence vector V can be obtained using the following formula:

V=k ₁ ×V(w ₁ )+k ₂ ×V(w ₂ )+...+k _n-1 ×V(w _n-1 )+k _n ×V(w _n )

Step S600: Calculating a cosine similarity between the sentence vector V and a sentence vector of a pre-stored sentence, and determining a pre-stored statement (feature pre-stored statement) having the largest cosine similarity.

In the database, there are a large number of problems (ie, pre-stored statements) and corresponding answers, each of which stores its sentence vector. After the sentence vector V of the text information is determined, the feature pre-stored statement with the greatest cosine similarity between the sentence vectors is found in the database, thereby determining that the answer corresponding to the feature pre-stored statement is the information fed back to the user.

Cosine similarity, also known as cosine distance, is a measure of the magnitude of the difference between two individuals using the cosine of the angle between two vectors in vector space. If the corresponding vectors of statement X and statement Y are: (x ₁ , x ₂ , ..., x ₆₄₀₀ ) and (y ₁ , y ₂ , ..., y ₆₄₀₀ ), respectively, the cosine distance between them can be used. The cosine of the angle between them is expressed as:

When the cosine of the two sentence vectors is equal to 1, the two sentences are exactly the same; when the cosine of the angle is close to 1, the two sentences are similar; the smaller the cosine of the angle, the less relevant the two statements are.

By comparing the cosine similarity between the sentence vector V and the sentence vector of the pre-stored statement, the pre-stored statement with the greatest cosine similarity (feature pre-stored statement) can be found.

For example, the user conducts an online consultation by sending a question to the online customer service robot. After receiving the problem, the online customer service robot calculates the sentence vector V of the problem, and then searches the database for the pre-existing problem with the greatest cosine similarity to the problem vector V. And select the pre-stored question and answer feedback corresponding to the pre-stored question to the user. For example, the user issues the question "Is it okay?", the online customer service robot finds the pre-existing problem with the greatest cosine similarity to the sentence vector of the question in the database as "Is it free?", the corresponding "I ask if it is free" in the database. This pre-existing problem map stores the answer "yes" and returns "yes" to the customer.

Through the above method, the pre-existing statement (feature pre-existing statement) which is most similar to the text information can be found, and the accuracy of the problem recognition can be improved in the robot dialogue and information classification, thereby improving the dialogue efficiency or the classification efficiency.

FIG. 3 is a schematic diagram of a TF-IDF-based text information similarity matching device module according to an embodiment. Corresponding to the above-mentioned TF-IDF-based text information similarity matching method, the present application further provides a TF-IDF-based text information similarity matching device, comprising: an obtaining module 100, a word segmentation module 200, and a word vector calculation module 300. The TF-IDF value calculation module 400, the sentence vector calculation module 500, and the matching module 600.

The obtaining module 100 is configured to obtain text information; the word segmentation module 200 is configured to perform segmentation on the text information to obtain each participle w ₁ , w ₂ , . . . w _n-1 , w _n ; the word vector calculation module 300 is configured to use the CBOW model The word vectors V(w ₁ ), V(w ₂ ), . . . , V(w _n-1 ), V(w _n ) of each participle are calculated; the TF-IDF value calculation module 400 is used to calculate using the TF-IDF algorithm. The TF-IDF values k ₁ , k ₂ , ... k _n-1 , k _{n of} each word _segment ; the sentence vector calculation module 500 is configured to obtain the sentence vector V according to the product of the word vector of each word segment and the corresponding TF-IDF value; The module 600 is configured to calculate a cosine similarity between the sentence vector V and a sentence vector of a pre-stored sentence, and determine a pre-stored statement with the largest cosine similarity.

The acquisition module 100 obtains text information. The text information here may be input by the user or may be text information recognized according to the voice data output by the user.

Of course, if the user sends a voice message, the acquiring module 100 needs to perform voice recognition on the voice message. Specifically, the acquiring module 100 acquires a voice message sent by the user, and performs voice recognition on the voice message to generate text information. Speech recognition technology is widely used and will not be described here.

The word segmentation module 200 segments the text information to obtain the respective word segments w ₁ , w ₂ , ... w _n-1 , w _n .

1) Forward maximum matching method (from left to right)

2) Reverse maximum matching method (from right to left)

3) Minimal segmentation (minimum number of words cut out in each sentence)

In some embodiments, the word segmentation module 200 can segment the textual information using a statistical-based segmentation method, such as segmenting the textual information using a jieba segmentation component. The stuttering part is a Chinese word segmentation component developed by Chinese programmers in Python.

In one embodiment, the word segmentation module 200 also removes the stop words of the text information in the process of segmenting the text information to obtain the respective word segments w ₁ , w ₂ , ... w _n-1 , w _n . deal with.

The word vector calculation module 300 calculates the word vectors V(w ₁ ), V(w ₂ ), ..., V(w _n-1 ), V(w _n ) of the respective participles using the CBOW model. Each of the word parts w ₁ , w ₂ , ... w _n-1 , w _n corresponds to the word vectors V(w ₁ ), V(w ₂ ), ..., V(w _n-1 ), V(w _n ), respectively.

The word vector calculation module 300 can calculate the word vector of each participle by the word2vec CBOW model in the Gensim toolkit.

Word2vec is also called word embeddings, the Chinese name "word vector", the role is to convert the words in natural language into a dense vector (Dense Vector) that can be understood by computers. Prior to the advent of word2vec, natural language processing often turned words into discrete, separate symbols, the One-Hot Encoder.

Hangzhou[0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0]

Shanghai[0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0]

Ningbo [0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0]

Beijing[0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0]

The TF-IDF value calculation module 400 calculates the TF-IDF values k ₁ , k ₂ , ... k _n-1 , k _{n of the} respective word segments using the TF-IDF algorithm.

TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique for information retrieval and data mining. TF means Term Frequency, and IDF means Inverse Document Frequency.

Then the word t _i has TF-IDF=tf _i,j ×idf _i .

The sentence vector calculation module 500 obtains the sentence vector V based on the product of the word vector of each participle and the corresponding TF-IDF value. The higher the importance of a word to text information, the greater its TF-IDF value, and the TF-IDF value can represent the importance of each word, which can be understood as a weight.

In one of the embodiments, the sentence vector calculation module 500 can obtain the sentence vector V using the following formula:

V=H ₁ +H ₂ +...H _n-1 +H _n

Specifically, the sentence vector calculation module 500 can obtain the sentence vector V using the following formula:

V=k ₁ ×V(w ₁ )+k ₂ ×V(w ₂ )+...+k _n-1 ×V(w _n-1 )+k _n ×V(w _n )

The matching module 600 calculates a cosine similarity between the sentence vector V and a sentence vector of a pre-stored sentence, and determines a pre-stored statement (feature pre-stored statement) with the largest cosine similarity.

In the database, there are a large number of problems (ie, pre-stored statements) and corresponding answers, each of which stores its sentence vector. After the sentence vector V of the text information is determined, the feature pre-stored statement with the greatest cosine similarity between the sentence vectors is found in the database, thereby determining that the answer corresponding to the feature pre-stored statement is information fed back to the user.

The matching module 600 can find the pre-stored statement (feature pre-stored statement) with the largest cosine similarity by comparing the cosine similarity between the sentence vector V and the sentence vector of the pre-stored sentence.

Through the above device, the pre-existing statement (feature pre-existing statement) which is most similar to the text information can be found, and the accuracy of the problem recognition can be improved in the robot dialogue and information classification, thereby improving the dialogue efficiency or the classification efficiency.

The application also provides a computer device comprising a memory and a processor, the memory storing computer readable instructions, the computer readable instructions being executed by the processor, causing the processor to perform any of the above implementations The steps of the TF-IDF-based text information similarity matching method are described.

The present application also provides a storage medium storing computer readable instructions that, when executed by one or more processors, cause one or more processors to perform TF-based operations according to any of the above embodiments. The steps of the IDF text information similarity matching method.

The TF-IDF-based text information similarity matching method, device, computer device and storage medium are obtained by acquiring text information; and the text information is segmented to obtain each participle w ₁ , w ₂ , ... w _n-1 , w _n ; use the CBOW model to calculate the word vectors V(w ₁ ), V(w ₂ ), ..., V(w _n-1 ), V(w _n ) of each participle; calculate each participle using the TF-IDF algorithm TF-IDF values k ₁ , k ₂ , ... k _n-1 , k _n ; the sentence vector V is obtained from the product of the word vector of each word segment and the corresponding TF-IDF value; calculating the sentence vector V and the pre-stored statement The cosine similarity between the sentence vectors determines the pre-stored statement with the largest cosine similarity. Through the above process, the pre-existing statement which is most similar to the text information can be found, and the accuracy of the problem recognition can be improved in the robot dialogue and information classification, thereby improving the dialogue efficiency or the classification efficiency.

A person skilled in the art can understand that all or part of the process of implementing the above embodiment method can be completed by a computer program to instruct related hardware, and the computer program can be stored in a computer readable storage medium. When executed, the flow of an embodiment of the methods as described above may be included. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Claims

A TF-IDF-based method for matching text information similarity includes the following steps:

Get text information;

Performing word segmentation on the text information to obtain individual word segments w 1 , w 2 , ... w n-1 , w n ;

Using the CBOW model to calculate the word vectors V(w 1 ), V(w 2 ), ... V(w n-1 ), V(w n ) of each participle;

Calculating the TF-IDF values k 1 , k 2 , ... k n-1 , k n of each word segment using the TF-IDF algorithm;

Obtaining a sentence vector V according to the product of the word vector of each participle and the corresponding TF-IDF value;

The cosine similarity between the sentence vector V and the sentence vector of the pre-stored sentence is calculated, and the pre-stored statement with the largest cosine similarity is determined.
The TF-IDF-based text information similarity matching method according to claim 1, wherein in the process of segmenting the text information to obtain each of the word segments w 1 , w 2 , ... w n-1 , w n , The stop words of the text information are also removed.
The TF-IDF-based text information similarity matching method according to claim 1, wherein the text information is segmented by using a stalking word segmentation component.
The TF-IDF-based text information similarity matching method according to claim 1, wherein the sentence vector V is obtained using the following formula:

V = k 1 × V(w 1 ) + k 2 × V(w 2 ) + ... + k n-1 × V(w n-1 ) + k n × V(w n ).
A matching device for text information similarity based on TF-IDF, comprising:

Obtaining a module for obtaining text information;

a word segmentation module, configured to perform word segmentation on the text information to obtain each segmentation word w 1 , w 2 , ... w n-1 , w n ;

a word vector calculation module for calculating a word vector V(w 1 ), V(w 2 ), ..., V(w n-1 ), V(w n ) of each participle using a CBOW model;

a TF-IDF value calculation module for calculating TF-IDF values k 1 , k 2 , ... k n-1 , k n of each word segment using a TF-IDF algorithm;

a sentence vector calculation module, configured to obtain a sentence vector V according to a product of a word vector of each word segment and a corresponding TF-IDF value;

The matching module is configured to calculate a cosine similarity between the sentence vector V and a sentence vector of the pre-stored sentence, and determine a pre-stored statement with the largest cosine similarity.
The TF-IDF-based text information similarity matching apparatus according to claim 5, wherein the word segmentation module further performs a removal process on the stop words of the text information.
The TF-IDF-based text information similarity matching apparatus according to claim 5, wherein the word segmentation module performs segmentation on the text information by using a staging component.
The TF-IDF-based text information similarity matching apparatus according to claim 5, wherein the sentence vector calculation module obtains the sentence vector V using the following formula:

V = k 1 × V(w 1 ) + k 2 × V(w 2 ) + ... + k n-1 × V(w n-1 ) + k n × V(w n ).
A computer apparatus comprising a memory and a processor, the memory storing computer readable instructions, the computer readable instructions being executed by the processor, causing the processor to execute a TF-IDF based text The method for matching information similarity, the method for matching similarity of text information based on TF-IDF includes the following steps:

Get text information;

Performing word segmentation on the text information to obtain individual word segments w 1 , w 2 , ... w n-1 , w n ;

Using the CBOW model to calculate the word vectors V(w 1 ), V(w 2 ), ... V(w n-1 ), V(w n ) of each participle;

Calculating the TF-IDF values k 1 , k 2 , ... k n-1 , k n of each word segment using the TF-IDF algorithm;

Obtaining a sentence vector V according to the product of the word vector of each participle and the corresponding TF-IDF value;

The cosine similarity between the sentence vector V and the sentence vector of the pre-stored sentence is calculated, and the pre-stored statement with the largest cosine similarity is determined.
The computer apparatus according to claim 9, wherein in the process of segmenting the text information to obtain the respective word segments w 1 , w 2 , ... w n-1 , w n , the stop words of the text information are also removed. deal with.
The computer device of claim 9 wherein the textual information is segmented using a staging component.
The computer device according to claim 9, wherein the sentence vector V is obtained using the following formula:

V = k 1 × V(w 1 ) + k 2 × V(w 2 ) + ... + k n-1 × V(w n-1 ) + k n × V(w n ).
A non-volatile storage medium storing computer readable instructions, when executed by one or more processors, causing one or more processors to perform a TF-IDF-based textual message similar The matching method of the TF-IDF-based text information similarity includes the following steps:

Get text information;

Performing word segmentation on the text information to obtain individual word segments w 1 , w 2 , ... w n-1 , w n ;

Using the CBOW model to calculate the word vectors V(w 1 ), V(w 2 ), ... V(w n-1 ), V(w n ) of each participle;

Calculating the TF-IDF values k 1 , k 2 , ... k n-1 , k n of each word segment using the TF-IDF algorithm;

Obtaining a sentence vector V according to the product of the word vector of each participle and the corresponding TF-IDF value;

The cosine similarity between the sentence vector V and the sentence vector of the pre-stored sentence is calculated, and the pre-stored statement with the largest cosine similarity is determined.
The nonvolatile storage medium according to claim 13, wherein in the process of segmenting the text information to obtain the respective word segments w 1 , w 2 , ... w n-1 , w n , the text information is stopped. Use words to remove the processing.
The non-volatile storage medium according to claim 13, wherein the text information is segmented using a staging component.
The nonvolatile storage medium according to claim 13, wherein the sentence vector V is obtained using the following formula:

V = k 1 × V(w 1 ) + k 2 × V(w 2 ) + ... + k n-1 × V(w n-1 ) + k n × V(w n ).