CN112988970A - Text matching algorithm serving intelligent question-answering system - Google Patents


Info

Publication number
CN112988970A
CN112988970A
Authority
CN
China
Prior art keywords
word
model
question
similarity
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110267040.2A
Other languages
Chinese (zh)
Inventor
励建科
许化
顾淼
陈再蝶
朱晓秋
樊伟东
邓明明
周杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Kangxu Technology Co ltd
Original Assignee
Zhejiang Kangxu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Kangxu Technology Co ltd filed Critical Zhejiang Kangxu Technology Co ltd
Priority to CN202110267040.2A priority Critical patent/CN112988970A/en
Publication of CN112988970A publication Critical patent/CN112988970A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text matching algorithm serving an intelligent question-answering system, which comprises a question-answer library text data set, an optimized jieba segmenter, a trained word2vector model and a modified cosine similarity model. In the invention, the optimized jieba segmenter is obtained by combining the advantages of the precise mode and the search mode of jieba word segmentation. After Chinese word segmentation is performed on a "consultation question", word vector embedding is performed through the word2vector model, converting the Chinese segments into computable word vectors. The modified cosine similarity model is then applied to the word vectors, improving the precision of the similarity calculation and realizing text similarity calculation. Finally, the similarity values are sorted, a similarity threshold is given, and the "fixed question" in the question-answer library text data set whose similarity value is highest and exceeds the given threshold, together with its corresponding "fixed answer", is selected as the question-answer pair for the "consultation question".

Description

Text matching algorithm serving intelligent question-answering system
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text matching algorithm serving an intelligent question-answering system.
Background
NLP (Natural Language Processing) is an important direction in the fields of computer science and artificial intelligence. It concerns the interaction between computers and human (natural) languages, and studies the theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science integrating linguistics, computer science and mathematics, so research in this field involves natural language, i.e. the language people use daily; it is closely related to, but significantly different from, the study of linguistics. Natural language processing does not study natural language in general; rather, it develops computer systems, and in particular the software systems within them, that can efficiently realize natural language communication, and is therefore a part of computer science.
The invention provides a text matching algorithm serving an intelligent question-answering system, which performs text similarity calculation through Chinese word segmentation and word vector embedding, and searches the question-answer library for the question-answer pair closest to the input question.
Disclosure of Invention
To solve the above problems in the background art, a text matching algorithm serving an intelligent question-answering system is proposed.
In order to achieve the purpose, the invention adopts the following technical scheme:
A text matching algorithm serving an intelligent question-answering system comprises a question-answer library text data set, an optimized jieba segmenter, a trained word2vector model and a modified cosine similarity model;
the question-answer library text data set comprises a fixed question set and a fixed answer set corresponding to the fixed question set;
the optimized jieba segmenter comprises a precise mode and a search mode;
a trained word2vector model is obtained by training the question-answer library text data set through a CBOW model;
the modified cosine similarity calculation model is as follows:

$$\mathrm{sim}(i,j)=\frac{\sum_{u=1}^{n}\left(R_{u,i}-\bar{R}_{i}\right)\left(R_{u,j}-\bar{R}_{j}\right)}{\sqrt{\sum_{u=1}^{n}\left(R_{u,i}-\bar{R}_{i}\right)^{2}}\sqrt{\sum_{u=1}^{n}\left(R_{u,j}-\bar{R}_{j}\right)^{2}}}$$

wherein i and j are vectors of the same dimension n, $\bar{R}_{i}$ and $\bar{R}_{j}$ denote the vector means of vectors i and j, $R_{u,i}$ denotes the u-th component of the input vector i, and $R_{u,j}$ denotes the u-th component of the input vector j;
the text matching algorithm comprises the following steps:
S1, inputting a "consultation question" into the intelligent question-answering system;
S2, performing Chinese word segmentation on the "consultation question" through the optimized jieba segmenter;
S3, performing word vector embedding on the Chinese segments of step S2 through the trained word2vector model, converting them into word vectors of unified dimension;
S4, performing similarity calculation on the word vectors of step S3 through the modified cosine similarity model, and outputting a plurality of similarity calculation values;
S5, sorting the similarity calculation values in descending order, comparing them with the given similarity threshold of the fixed questions in the fixed question set, and selecting the fixed question whose similarity calculation value is the largest and exceeds the threshold, together with its corresponding fixed answer, as the consultation answer to the "consultation question".
As a further description of the above technical solution:
the given similarity threshold calculation method comprises the following steps:
S1, inputting a "fixed question" into the intelligent question-answering system;
S2, performing Chinese word segmentation on the "fixed question" through the optimized jieba segmenter;
S3, performing word vector embedding on the Chinese segments of step S2 through the trained word2vector model, converting them into word vectors of fixed dimension;
S4, performing similarity calculation on the word vectors of step S3 through the modified cosine similarity model, and outputting a plurality of similarity calculation values;
S5, sorting the similarity calculation values in descending order, and selecting the largest similarity calculation value as the given similarity threshold.
As a further description of the above technical solution:
The word segmentation modes of the unoptimized jieba segmenter include a precise mode.
As a further description of the above technical solution:
The word segmentation modes of the unoptimized jieba segmenter include a full mode.
As a further description of the above technical solution:
The word segmentation modes of the unoptimized jieba segmenter include a search mode.
As a further description of the above technical solution:
The CBOW model expresses the input Chinese segments as 256-dimensional word vectors.
As a further description of the above technical solution:
The unmodified cosine similarity model is:

$$\cos(x,y)=\frac{\sum_{i=1}^{n} x_{i} y_{i}}{\sqrt{\sum_{i=1}^{n} x_{i}^{2}}\sqrt{\sum_{i=1}^{n} y_{i}^{2}}}$$

wherein x and y are the two n-dimensional vectors whose similarity is calculated.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. In the invention, the optimized jieba segmenter is obtained by combining the advantages of the precise mode and the search mode of jieba word segmentation, solving the longest-word-first problem of the precise mode and the information-overlap problem of the search mode. After Chinese word segmentation is performed on a "consultation question", word vector embedding is performed through the word2vector model, converting the Chinese segments into computable word vectors. The modified cosine similarity model is applied to the word vectors, improving the precision of the similarity calculation and realizing text similarity calculation. Finally, the similarity values are sorted, a similarity threshold is given, and the "fixed question" in the question-answer library text data set whose similarity value is highest and exceeds the given threshold, together with its corresponding "fixed answer", is selected as the question-answer pair for the "consultation question".
2. In the invention, the Chinese word segmentation technique combines the advantages of the precise mode and the search mode of the jieba segmenter to obtain the optimized jieba segmenter, while discarding the respective disadvantages of the two modes. It solves the longest-word-first and information-overlap problems and realizes industrially efficient Chinese word segmentation; it is a very good choice when training a dedicated segmentation model is not cost-effective or the corpus data is incomplete.
3. In the invention, the word vector representation uses a word2vector model trained with the CBOW model. By considering the context information around each word, including the influence of surrounding and adjacent words on its meaning and the influence of inter-word relations on sentence semantics, the obtained word vectors can express the words' own semantics.
4. In the invention, a modified cosine similarity model is used, which improves the original cosine similarity model by subtracting the vector mean from each dimension, thereby taking into account the differences in magnitude across dimensions.
Drawings
FIG. 1 is a flow chart illustrating a text matching algorithm serving an intelligent question answering system according to an embodiment of the present invention;
FIG. 2 shows a schematic diagram of a model of a Skip-gram provided according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a CBOW model provided according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of two vectors a, b with high cosine similarity, provided according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of two vectors a, b whose cosine similarity is equal (the vectors coincide), provided according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of two vectors a, b with low cosine similarity, provided according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1-6, the present invention provides a technical solution: a text matching algorithm serving an intelligent question-answering system comprises a question-answer library text data set, an optimized jieba segmenter, a trained word2vector model and a modified cosine similarity model;
the text data set of the question-answer library comprises a fixed question set and a fixed answer set corresponding to the fixed question set;
Specifically, the question-answer library text data set is shown in Table 1 below:
Table 1: question-answer library text training set
(The table itself is reproduced only as an image in the original publication.)
The optimized jieba segmenter comprises a precise mode and a search mode;
the word segmentation modes of the unoptimized jieba segmenter comprise three independent modes, namely a precise mode, a full mode and a search mode;
precise mode: attempts to cut the sentence apart most precisely; suitable for text analysis;
full mode: scans out every substring of the sentence that can form a word; very fast, but cannot resolve ambiguity;
search mode: on the basis of the precise mode, re-segments long words; improves recall and is suitable for search engine word segmentation;
Specifically, assume a scenario in which "fund manager" appears in the question-answer library text data set and "financing manager" does not. If the unoptimized jieba segmenter is used for Chinese word segmentation in this scenario, the following two segmentation results are obtained (the full-mode and search-mode results are similar):
segmentation result 1, precise mode: since the precise mode is longest-word-first, we obtain "fund manager" and "financing manager";
segmentation result 2, full mode and search mode: we obtain "fund", "manager", "fund manager" and "financing", "manager", "financing manager".
In an actual application scenario, because the question-answer library text data set contains the word "fund manager" but not "financing manager", the word "financing manager" is never learned by the word2vector model, so the word2vector model cannot recognize it;
in this case, if we select the precise mode of the jieba segmenter, then because the precise mode is longest-word-first, our word2vector model can obtain the word vector of "fund manager" but cannot obtain a word vector for "financing manager";
if we select the search mode of the jieba segmenter, we obtain the three segments "financing", "manager" and "financing manager". Although the word2vector model cannot recognize "financing manager", it can recognize the word vectors of "financing" and "manager", which substitute for the meaning expressed by "financing manager". For "fund manager", however, the search mode yields the segments "fund", "manager" and "fund manager", all of which the word2vector model can recognize; in this case the two word vectors "fund" and "manager" are redundant, causing information overlap and degrading the word vector representation of the whole sentence;
as shown above, for the two words "fund manager" and "financing manager": if we select the precise mode of the jieba segmenter, our word vector model can process "fund manager" but, because this mode defaults to longest-word-first, it cannot process "financing manager"; if we select the search mode, our word vector model can process "financing manager" but information overlap is caused when processing "fund manager". The invention therefore uses the precise mode and the search mode of the segmenter cooperatively to obtain the optimized jieba segmenter and solve the longest-word-first and information-overlap problems: for each text, we first segment with the precise mode of the jieba segmenter and perform word2vector conversion on the resulting segments; then, for each unrecognizable segment that the word2vector model has not learned, we re-segment that segment with the search mode;
specifically, using the same "fund manager" and "financing manager" example, the optimized jieba segmenter performs Chinese word segmentation and recognition as follows:
for "fund manager", since the word2vector model can recognize it, its word vector is obtained to express its Chinese meaning. For "financing manager", since the word2vector model cannot recognize it, it is re-segmented in search mode into the three segments "financing", "manager" and "financing manager"; "financing manager" still cannot be recognized and is skipped directly, while the word vectors of "financing" and "manager" are obtained to jointly express the Chinese meaning of "financing manager". Through this flexible application, the optimized jieba segmenter greatly improves the segmentation effect, achieving high efficiency without losing accuracy;
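The precise-then-search fallback described above can be sketched as follows; `precise_cut`, `search_cut` and the `known_words` vocabulary check are hypothetical stand-ins for jieba's precise mode, jieba's search mode and the word2vector vocabulary lookup, with English stand-in words for the example:

```python
def optimized_cut(text, precise_cut, search_cut, known_words):
    """Segment `text` with the precise mode first; re-segment any
    out-of-vocabulary segment with the search mode, keeping only the
    sub-segments the word-vector model knows."""
    result = []
    for seg in precise_cut(text):
        if seg in known_words:
            result.append(seg)
        else:
            # fall back to search-mode re-segmentation; skip sub-segments
            # that are still unknown (e.g. "financing manager" itself)
            result.extend(s for s in search_cut(seg) if s in known_words)
    return result

# toy stand-ins for the two jieba modes (assumed behaviour for illustration)
known = {"fund manager", "financing", "manager"}
precise = lambda t: t.split("/")          # precise mode: longest word first
search = lambda seg: seg.split() + [seg]  # search mode: sub-words plus the long word

print(optimized_cut("fund manager/financing manager", precise, search, known))
# → ['fund manager', 'financing', 'manager']
```

"fund manager" is kept whole (no information overlap), while the unknown "financing manager" degrades gracefully into its recognizable sub-words.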
A trained word2vector model is obtained by training the question-answer library text data set through the CBOW model; specifically, the CBOW model expresses the input Chinese segments as 256-dimensional word vectors;
the word2vector model network consists of two weight layers. The hidden layer consists of n neurons, where n is the word vector dimension (unified here to 256); the input and output layers each comprise M neurons, where M is the total number of Chinese words in the vocabulary, and the output layer activation function is the softmax function commonly used in classification problems. On this basis there are two training methods, Skip-gram and CBOW:
as shown in fig. 2, the Skip-gram model predicts the surrounding words from the central word: it takes the one-hot vector of the central word as the input vector and the one-hot vectors of the surrounding words as outputs to construct training sample pairs, and the output layer softmax function computes the probability that an output word is a surrounding word;
as shown in fig. 3, the CBOW model predicts the central word from the surrounding words: it takes the sum of the one-hot vectors of the surrounding words as the input vector and the one-hot vector of the central word as the output to construct training sample pairs, and the output layer softmax function yields the word with the maximum probability as the output;
after the neural network training is finished, the trained network weights can be used to represent semantics: through the conversion of the one-hot word vectors, one row of the weight matrix represents one word in the corpus vocabulary. Since no further training is performed after the word2vector model is trained, the output layer of the network can be ignored and only the input-to-hidden weights are used as the word embedding representation;
the training result of the word2vector model can be tuned through parameters such as the minimum word frequency (min_count), the maximum distance between the current word and the predicted word within a sentence (window), and the dimension of the feature vector (size);
of the two models, Skip-gram is suitable for small corpora and rare terms, while the CBOW model is better suited to common words in everyday scenarios and trains faster.
The modified cosine similarity calculation model is:

$$\mathrm{sim}(i,j)=\frac{\sum_{u=1}^{n}\left(R_{u,i}-\bar{R}_{i}\right)\left(R_{u,j}-\bar{R}_{j}\right)}{\sqrt{\sum_{u=1}^{n}\left(R_{u,i}-\bar{R}_{i}\right)^{2}}\sqrt{\sum_{u=1}^{n}\left(R_{u,j}-\bar{R}_{j}\right)^{2}}}$$

wherein i and j are vectors of the same dimension n, $\bar{R}_{i}$ and $\bar{R}_{j}$ denote the vector means of vectors i and j, $R_{u,i}$ denotes the u-th component of the input vector i, and $R_{u,j}$ denotes the u-th component of the input vector j;
the unmodified cosine similarity model is:

$$\cos(x,y)=\frac{\sum_{i=1}^{n} x_{i} y_{i}}{\sqrt{\sum_{i=1}^{n} x_{i}^{2}}\sqrt{\sum_{i=1}^{n} y_{i}^{2}}}$$

wherein x and y are the two n-dimensional vectors whose similarity is calculated;
Judging whether two texts match amounts to calculating whether the word vectors expressing the semantics of the two texts are similar. In this scenario the cosine similarity model is the most suitable and most widely used method; its principle is to use the cosine of the angle between two vectors in the vector space as the measure of the difference between the two vectors;
as shown in fig. 4, the closer the cosine value is to 1, the closer the included angle is to 0 degrees, i.e. the more similar the two vectors are;
as shown in fig. 5, when the angle between the vectors a and b is small, the two vectors have high similarity; in the extreme case they completely coincide, in which case they are considered equal, i.e. the texts represented by a and b are completely similar or identical;
as shown in fig. 6, when the angle between the vectors a and b is large, or they point in opposite directions, the two vectors have very low similarity, i.e. the texts represented by a and b are basically dissimilar;
Therefore, if the similarity of the Chinese word vectors is calculated with the unmodified cosine similarity model, the similarity in the direction of the two vectors is obtained, which works well in practice and is the common approach at present. However, the basic cosine similarity model has a limitation: it considers only the similarity in the direction of the vectors, not the differences in magnitude across dimensions. For example, for the two vectors x1 = (20, 30) and y1 = (20, 31), the computed similarity is 0.99988695, which meets our needs; however, after scaling x1 = (20, 30) to x2 = (40, 60), the vector x2 is twice as long as x1 but its direction is unchanged, so the similarity of x2 = (40, 60) and y1 = (20, 31) is still 0.99988695, which does not meet the requirements of some scenarios;
in the modified cosine similarity model, the vector magnitude is also taken into consideration: the vector mean is subtracted from each dimension, so that the magnitude differences across dimensions are considered and the effect of the cosine similarity model is improved;
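A minimal pure-Python sketch of both similarity models, reproducing the x1 = (20, 30), y1 = (20, 31) example above (the function names are ours, not the patent's):

```python
import math

def cosine(x, y):
    """Unmodified cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def modified_cosine(i, j):
    """Modified cosine similarity: subtract each vector's mean first."""
    mi, mj = sum(i) / len(i), sum(j) / len(j)
    return cosine([a - mi for a in i], [b - mj for b in j])

x1, x2, y1 = (20, 30), (40, 60), (20, 31)
print(round(cosine(x1, y1), 8))                   # → 0.99988695, as stated in the text
print(abs(cosine(x2, y1) - cosine(x1, y1)) < 1e-9)  # → True: scaling x1 leaves plain cosine unchanged
```

The first print confirms the 0.99988695 figure quoted above, and the second confirms that doubling the vector length does not change the plain cosine value.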
the text matching algorithm comprises the following steps:
S1, inputting a "consultation question" into the intelligent question-answering system;
S2, performing Chinese word segmentation on the "consultation question" through the optimized jieba segmenter;
S3, performing word vector embedding on the Chinese segments of step S2 through the trained word2vector model, converting them into word vectors of unified dimension;
S4, performing similarity calculation on the word vectors of step S3 through the modified cosine similarity model, and outputting a plurality of similarity calculation values;
S5, sorting the similarity calculation values in descending order, comparing them with the given similarity threshold of the fixed questions in the fixed question set, and selecting the fixed question whose similarity calculation value is the largest and exceeds the threshold, together with its corresponding fixed answer, as the consultation answer to the "consultation question".
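Steps S1–S5 can be sketched end to end as follows; the embedding table, the toy question-answer pairs and the averaging of word vectors into a sentence vector are our own simplifying assumptions (the patent does not specify how word vectors are pooled):

```python
import math

def modified_cosine(i, j):
    """Modified cosine similarity: subtract each vector's mean, then cosine."""
    mi, mj = sum(i) / len(i), sum(j) / len(j)
    a = [v - mi for v in i]
    b = [v - mj for v in j]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def sentence_vector(words, embeddings):
    """Average the vectors of the known words (a simplifying assumption)."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def answer(query_words, qa_pairs, embeddings, threshold):
    """S2-S5: embed the segmented query, score every fixed question with
    the modified cosine model, sort descending, apply the threshold."""
    q = sentence_vector(query_words, embeddings)
    scored = sorted(
        ((modified_cosine(q, sentence_vector(fq, embeddings)), ans)
         for fq, ans in qa_pairs),
        reverse=True,
    )
    best_score, best_answer = scored[0]
    return best_answer if best_score >= threshold else None

# toy 3-dimensional embeddings and a two-entry question-answer library (hypothetical)
emb = {"fund": [1.0, 0.2, 0.1], "manager": [0.3, 1.0, 0.2], "financing": [0.1, 0.3, 1.0]}
qa = [(["fund", "manager"], "answer about fund managers"),
      (["financing", "manager"], "answer about financing managers")]
print(answer(["fund", "manager"], qa, emb, threshold=0.9))
# → answer about fund managers
```

With a real system, the query words would come from the optimized jieba segmenter and the embeddings from the trained word2vector model.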
Specifically, the calculation method of the given similarity threshold includes the following steps:
S1, inputting a "fixed question" into the intelligent question-answering system;
S2, performing Chinese word segmentation on the "fixed question" through the optimized jieba segmenter;
S3, performing word vector embedding on the Chinese segments of step S2 through the trained word2vector model, converting them into word vectors of fixed dimension;
S4, performing similarity calculation on the word vectors of step S3 through the modified cosine similarity model, and outputting a plurality of similarity calculation values;
S5, sorting the similarity calculation values in descending order, and selecting the largest similarity calculation value as the given similarity threshold;
the Chinese word segmentation technique used in the invention combines the advantages of the precise mode and the search mode of the jieba segmenter to obtain the optimized jieba segmenter, while discarding the respective disadvantages of the two modes; it solves the longest-word-first and information-overlap problems, realizes industrially efficient Chinese word segmentation, and is a good choice when training a dedicated segmentation model is not cost-effective or the corpus data is incomplete;
the word vector representation uses a word2vector model trained with the CBOW model; by considering the context information around each word, including the influence of surrounding and adjacent words on its meaning and the influence of inter-word relations on sentence semantics, the obtained word vectors can express the words' own semantics;
the invention uses the modified cosine similarity model, which improves the original cosine similarity model by subtracting the vector mean from each dimension, thereby taking into account the differences in magnitude across dimensions.
The above description covers only preferred embodiments of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or change of the technical solutions and inventive concepts of the present invention made by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.

Claims (7)

1. A text matching algorithm serving an intelligent question-answering system, characterized by comprising a question-answer library text data set, an optimized jieba segmenter, a trained word2vector model and a modified cosine similarity model;
the question-answer library text data set comprises a fixed question set and a fixed answer set corresponding to the fixed question set;
the optimized jieba segmenter comprises a precise mode and a search mode;
a trained word2vector model is obtained by training the question-answer library text data set through a CBOW model;
the modified cosine similarity calculation model is as follows:

$$\mathrm{sim}(i,j)=\frac{\sum_{u=1}^{n}\left(R_{u,i}-\bar{R}_{i}\right)\left(R_{u,j}-\bar{R}_{j}\right)}{\sqrt{\sum_{u=1}^{n}\left(R_{u,i}-\bar{R}_{i}\right)^{2}}\sqrt{\sum_{u=1}^{n}\left(R_{u,j}-\bar{R}_{j}\right)^{2}}}$$

wherein i and j are vectors of the same dimension n, $\bar{R}_{i}$ and $\bar{R}_{j}$ denote the vector means of vectors i and j, $R_{u,i}$ denotes the u-th component of the input vector i, and $R_{u,j}$ denotes the u-th component of the input vector j;
the text matching algorithm comprises the following steps:
S1, inputting a "consultation question" into the intelligent question-answering system;
S2, performing Chinese word segmentation on the "consultation question" through the optimized jieba segmenter;
S3, performing word vector embedding on the Chinese segments of step S2 through the trained word2vector model, converting them into word vectors of unified dimension;
S4, performing similarity calculation on the word vectors of step S3 through the modified cosine similarity model, and outputting a plurality of similarity calculation values;
S5, sorting the similarity calculation values in descending order, comparing them with the given similarity threshold of the fixed questions in the fixed question set, and selecting the fixed question whose similarity calculation value is the largest and exceeds the threshold, together with its corresponding fixed answer, as the consultation answer to the "consultation question".
2. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the given similarity threshold is calculated by the following steps:
S1, inputting a fixed question into the intelligent question-answering system;
S2, performing Chinese word segmentation on the fixed question with the optimized jieba word segmenter;
S3, embedding the Chinese segments from step S2 as word vectors through the trained word2vector model, converting them into word vectors of fixed dimension;
S4, computing similarities over the word vectors from step S3 with the modified cosine similarity model and outputting a plurality of similarity values;
S5, sorting the similarity values in descending order and selecting the largest similarity value as the given similarity threshold.
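Claim 2's threshold selection can be condensed to a one-function sketch. The `jaccard` word-overlap scorer here is a toy stand-in for the modified cosine model run over the real segmented and embedded questions.

```python
def jaccard(a, b):
    """Toy word-overlap scorer standing in for the modified cosine model."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def given_threshold(fixed_question, fixed_set, similarity=jaccard):
    """Claim 2, S4-S5: score the fixed question against every other
    fixed question, sort the scores in descending order, and keep the
    largest value as that question's given similarity threshold."""
    scores = sorted((similarity(fixed_question, other)
                     for other in fixed_set if other != fixed_question),
                    reverse=True)
    return scores[0]
```

The effect is that each fixed question's threshold reflects how close its nearest neighbour in the fixed question set already is, so an incoming consultation question must beat the best in-set match.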
3. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the word segmentation modes of the unoptimized jieba word segmenter comprise a precise mode.
4. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the word segmentation modes of the unoptimized jieba word segmenter comprise a full mode.
5. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the word segmentation modes of the unoptimized jieba word segmenter comprise a search engine mode.
6. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the CBOW model represents each input Chinese segment as a 256-dimensional word vector.
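Claim 6's projection step can be illustrated in miniature. CBOW's hidden layer is simply the average of the context words' input vectors, itself a 256-dimensional vector; the vocabulary and random weights below are placeholders for what the trained model would have learned.

```python
import random

DIM = 256   # claim 6: each Chinese segment maps to a 256-dimensional vector

random.seed(0)
# Hypothetical vocabulary; a trained CBOW model would learn these weights.
VOCAB = ["今天", "天气", "怎么样"]
EMBEDDING = {w: [random.uniform(-0.5, 0.5) for _ in range(DIM)]
             for w in VOCAB}

def cbow_hidden(context):
    """CBOW projection: average the context words' input vectors into a
    single hidden vector of the same (256) dimension."""
    vecs = [EMBEDDING[w] for w in context]
    return [sum(c) / len(vecs) for c in zip(*vecs)]
```

In a real system this lookup-and-average would be performed by a trained model (e.g. a word2vec implementation configured for CBOW with 256-dimensional vectors), not by a random table.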
7. The text matching algorithm serving an intelligent question-answering system according to claim 1, wherein the unmodified cosine similarity model is:

$$\cos(\theta)=\frac{\sum_{k=1}^{n}x_{k}y_{k}}{\sqrt{\sum_{k=1}^{n}x_{k}^{2}}\;\sqrt{\sum_{k=1}^{n}y_{k}^{2}}}$$

wherein x and y are the two n-dimensional vectors whose similarity is calculated.
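For contrast with the modified model of claim 1, the unmodified cosine of claim 7 omits the mean-centering step; a direct transcription:

```python
from math import sqrt

def cosine(x, y):
    """Unmodified cosine similarity between two n-dimensional vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0
```

Parallel vectors score 1.0 and orthogonal vectors score 0.0 regardless of magnitude, e.g. `cosine([1, 1], [2, 2])` is 1.0.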
CN202110267040.2A 2021-03-11 2021-03-11 Text matching algorithm serving intelligent question-answering system Pending CN112988970A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110267040.2A CN112988970A (en) 2021-03-11 2021-03-11 Text matching algorithm serving intelligent question-answering system

Publications (1)

Publication Number Publication Date
CN112988970A true CN112988970A (en) 2021-06-18

Family

ID=76334595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110267040.2A Pending CN112988970A (en) 2021-03-11 2021-03-11 Text matching algorithm serving intelligent question-answering system

Country Status (1)

Country Link
CN (1) CN112988970A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text
US20170083507A1 (en) * 2015-09-22 2017-03-23 International Business Machines Corporation Analyzing Concepts Over Time
CN110362684A (en) * 2019-06-27 2019-10-22 腾讯科技(深圳)有限公司 A kind of file classification method, device and computer equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Heyong: "High-Dimensional Data Mining Technology for Big Data", 31 March 2018 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115186679A (en) * 2022-07-15 2022-10-14 广东广信通信服务有限公司 Intelligent response method, device, computer equipment and storage medium
CN114996439A (en) * 2022-08-01 2022-09-02 太极计算机股份有限公司 Text search method and device
CN116820711A (en) * 2023-06-07 2023-09-29 上海幽孚网络科技有限公司 Task driven autonomous agent algorithm
CN116820711B (en) * 2023-06-07 2024-05-28 上海幽孚网络科技有限公司 Task driven autonomous agent method

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
Gao et al. Convolutional neural network based sentiment analysis using Adaboost combination
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
CN108170848B (en) Chinese mobile intelligent customer service-oriented conversation scene classification method
CN112988970A (en) Text matching algorithm serving intelligent question-answering system
CN113204952B (en) Multi-intention and semantic slot joint identification method based on cluster pre-analysis
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN112905795A (en) Text intention classification method, device and readable medium
CN114743020A (en) Food identification method combining tag semantic embedding and attention fusion
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN114330354B (en) Event extraction method and device based on vocabulary enhancement and storage medium
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN114818703B (en) Multi-intention recognition method and system based on BERT language model and TextCNN model
CN111914556A (en) Emotion guiding method and system based on emotion semantic transfer map
CN112232053A (en) Text similarity calculation system, method and storage medium based on multi-keyword pair matching
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN116226785A (en) Target object recognition method, multi-mode recognition model training method and device
CN108536781B (en) Social network emotion focus mining method and system
CN111984780A (en) Multi-intention recognition model training method, multi-intention recognition method and related device
CN114357151A (en) Processing method, device and equipment of text category identification model and storage medium
CN113705238A (en) Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model
CN114417851A (en) Emotion analysis method based on keyword weighted information
Ko et al. Paraphrase bidirectional transformer with multi-task learning
Chan et al. Applying and optimizing NLP model with CARU
CN113065350A (en) Biomedical text word sense disambiguation method based on attention neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210618