CN109359302B - Optimization method of domain word vectors and fusion ordering method based on optimization method - Google Patents

Optimization method of domain word vectors and fusion ordering method based on optimization method

Info

Publication number
CN109359302B
CN109359302B (application number CN201811257850.4A)
Authority
CN
China
Prior art keywords
word
domain
corpus
field
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811257850.4A
Other languages
Chinese (zh)
Other versions
CN109359302A (en)
Inventor
刘慧君
李傲
曾一
乔猛
周明强
邬小燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN201811257850.4A priority Critical patent/CN109359302B/en
Publication of CN109359302A publication Critical patent/CN109359302A/en
Application granted granted Critical
Publication of CN109359302B publication Critical patent/CN109359302B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for optimizing domain word vectors and a fusion ordering method based on the optimization method. The optimization method of the domain word vectors comprises the following steps: S11, training domain-free word vectors; S12, training domain word vectors, obtaining the demand word vectors, and calculating similarity with the RWMD algorithm. S12 comprises the following specific steps: S121, cleaning the domain corpus by removing emoticons and unrecognizable garbled characters, and segmenting the domain corpus with the LTP word segmentation model; S122, calculating the IDF value of each word from its frequency of occurrence in the domain corpus and calculating the IDF_weight value. The optimization method of the domain word vectors and the fusion ordering method based on it solve the problem in the prior art that domain-free word vectors and domain word vectors cannot be fused, so that newly generated domain word vectors cannot be adapted to a question-answering system in a specific vertical domain.

Description

Optimization method of domain word vectors and fusion ordering method based on optimization method
Technical Field
The invention relates to the field of information retrieval, in particular to a domain word vector optimization method and a fusion ordering method based on the domain word vector optimization method.
Background
With the rapid development of the economy, society and the internet, all kinds of affairs and information are stored as data. How to use these data and manage them scientifically and effectively is a very popular research direction in the field of information retrieval. The database of a search engine mixes many fields; for problems in some professional fields, a large-scale search engine returns many useless results, which increases the difficulty of retrieval, and looking for relevant answers in a large amount of useless information increases the burden on the retrieval system and degrades the user experience.
The expert system is an application of information retrieval; in terms of what it actually implements, it can be defined as a natural language processing problem, namely short-text similarity matching. The underlying implementation of an expert system is a question-answering system for a fixed professional field, so the quality of the returned results affects the questioner's experience to a certain extent.
Ranking learning is currently widely used in information retrieval, and the expert system is one of its typical applications. As a form of supervised learning, it differs from a single traditional evaluation model by introducing a mechanism that fuses several traditional models into the ranking learning. At present, ranking learning falls into three main categories: the single-document method (Pointwise Approach), the document-pair method (Pairwise Approach) and the document-list method (Listwise Approach).
Short-text matching retrieves the required information through similar question pairs in an information-retrieval manner, and mainly includes semantic matching and word-sense matching. Semantic matching requires learning a semantic model from a large amount of labeled data, which is engineering-heavy; moreover, the amount of data in a knowledge base is small compared with what a language model needs, so it is difficult to learn an effective model. Matching at the word-sense level is simple and fast: a feature vector is built for each word from TF/IDF or a natural language model to obtain a probabilistic representation of the text sequence; BiGram and TriGram models are built and similarity is computed via Euclidean distance; and the Word2Vec model simplifies the training process and reduces training time.
However, the above methods have the following problem: the generated word vector is influenced only by domain-free word vectors or only by in-domain word vectors. Because domain-free and domain word vectors cannot be fused, the newly generated domain word vector cannot be adapted to a question-answering system in a specific vertical domain, which leads to slow responses during search.
Disclosure of Invention
The invention provides a method for optimizing domain word vectors and a fusion ordering method based on it, which solve the problem in the prior art that domain-free word vectors and domain word vectors cannot be fused, so that newly generated domain word vectors cannot be adapted to a question-answering system in a specific vertical domain.
In order to achieve this purpose, the invention adopts the following technical scheme:
The invention first provides a method for optimizing domain word vectors, comprising the following steps: S11, training domain-free word vectors; S12, training domain word vectors and obtaining the demand word vectors;
S11 comprises the following specific steps:
S111, cleaning the domain-free corpus by removing emoticons and unrecognizable garbled characters, and performing word segmentation on the domain-free corpus;
S112, training the whole corpus with a Word2Vec model to obtain the initial word vectors V_old(w);
S113, setting a weight for each word according to its frequency p(w) in the domain-free corpus, and calculating the domain-free word vectors of the domain-free corpus according to the following formula:
V_undomain(w) = exp(p(w)) × V_old(w)
where V_undomain(w) denotes the domain-free word vector and p(w) is the frequency of the word in the corpus;
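As an illustration of steps S111-S113, the following is a minimal sketch, assuming the initial vectors V_old(w) have already been trained (for example with an external Word2Vec implementation such as gensim) and that p(w) is simply the word's relative frequency in the domain-free corpus; all function and variable names are illustrative and not part of the patent.

```python
from collections import Counter
import numpy as np

def domain_free_vectors(tokenized_corpus, initial_vectors):
    """S111-S113 sketch: weight each initial vector V_old(w) by exp(p(w)),
    where p(w) is the word's frequency in the domain-free corpus."""
    counts = Counter(w for sentence in tokenized_corpus for w in sentence)
    total = sum(counts.values())
    v_undomain = {}
    for w, vec in initial_vectors.items():
        p_w = counts[w] / total                        # frequency p(w) of the word
        v_undomain[w] = np.exp(p_w) * np.asarray(vec)  # V_undomain(w) = exp(p(w)) * V_old(w)
    return v_undomain
```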
S12 comprises the following specific steps:
S121, cleaning the domain corpus by removing emoticons and unrecognizable garbled characters, and segmenting the domain corpus with the word segmentation model of LTP (Language Technology Platform);
S122, calculating the IDF(w) value of each word, where IDF(w) is determined by the probability of the word appearing in the domain corpus, and calculating the IDF_weight value: let the median of the IDF values of all words in the domain corpus be IDF_mo and their mean be IDF_avg; then
IDF_weight = (IDF_mo + IDF_avg) / 2
S123, training word vectors on the domain corpus: Skip-gram is adopted and compared against CBOW, negative sampling is used for optimization with the number of negative samples set according to the specific scenario and test results, downsampling is adopted during model training, and the window size is set according to the specific scenario, yielding the domain word vectors V_old(w)';
S124, performing spatial mapping according to the frequency p(w)' with which each word appears in the domain corpus to obtain V_olddomain(w); the calculation formula is as follows (see the sketch after step S125):
V_olddomain(w) = exp(p(w)') × V_old(w)'
S125, fusing the domain word vectors and the domain-free word vectors of the domain corpus to obtain the demand word vectors V_new(w).
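The domain-side steps S122 and S124 can be sketched in the same way, again under assumed names; the Skip-gram/CBOW training of S123 itself is left to an external Word2Vec implementation, and only its output (a mapping from word to vector V_old(w)') is consumed here.

```python
import math
from statistics import mean, median
from collections import Counter
import numpy as np

def idf_table(domain_docs):
    """S122 sketch: IDF(w) over the domain corpus (one value per word)."""
    n_docs = len(domain_docs)
    df = Counter(w for doc in domain_docs for w in set(doc))
    return {w: math.log(n_docs / df_w) for w, df_w in df.items()}

def idf_weight(idf):
    """S122 sketch: IDF_weight = midpoint of the median and the mean of all IDF values."""
    values = list(idf.values())
    return (median(values) + mean(values)) / 2

def map_domain_vectors(domain_docs, domain_vectors):
    """S124 sketch: V_olddomain(w) = exp(p(w)') * V_old(w)', with p(w)' the in-domain frequency."""
    counts = Counter(w for doc in domain_docs for w in doc)
    total = sum(counts.values())
    return {w: np.exp(counts[w] / total) * np.asarray(vec)
            for w, vec in domain_vectors.items()}
```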
The invention also provides a fusion ordering method for domain word vectors, comprising the following steps:
S21, the fusion model is set to comprise a single-document model and a document-pair model;
S21, performing word segmentation on each piece of data with LTP and removing stop words;
S22, calculating the TF/IDF cosine value, BM25, Word2Vec Euclidean distance, RWMD and HowNet semantic similarity as the features of the ranking learning model;
S23, training the first part of the fusion model: after feature selection, the similarity vector between each text and the original question is mapped and fed into the single-layer neural-network base model of the single-document model for training, with 8 neurons in the middle layer and a batch size of 128;
S24, training the second part of the fusion model: on the basis of the selected correct answer for each piece of data, other incorrect texts are randomly sampled, the similarity vectors between document pairs are mapped and fed into the two-layer neural-network base model of the document-pair model for training, with 8 neurons per layer and a batch size of 128.
Compared with the prior art, the invention has the following beneficial effects:
1) Domain-free word vectors are obtained by establishing and training a domain-free corpus; the domain-free word vectors and the domain word vectors are then fused under the stated conditions, and features describing the correlation between questions and between a question and a document are added as a supplement, so that the expert robot fuses multiple features when ranking answers, which ultimately raises the probability of returning the best answer and improves both the response efficiency of retrieval and the correctness of the returned result;
2) The expert system places high demands on the final ranking result; the document-list model, although more effective, has high time complexity and cannot meet the fast response speed an expert system requires, which is why the present method instead fuses a single-document model with a document-pair model to reduce time complexity without sacrificing the overall effect.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
FIG. 1 is a process diagram of short-text similarity comparison using RWMD with the domain word vectors;
FIG. 2 is a schematic diagram of a fusion ordering model.
Detailed Description
In order to make the technical means, creative features, objectives and effects of the invention clearer and easier to understand, the invention is further explained below with reference to the drawings and the specific embodiments:
the invention provides a method for optimizing a domain word vector, which comprises the following steps: s11, training a domain-free word vector; s12, training field word vectors and obtaining demand word vectors;
s11, the steps are as follows:
s111, cleaning data of a domain-free large-scale corpus, removing emoticons and unrecognizable messy codes, and performing word segmentation processing on the corpus;
S112, training the whole corpus with a Word2Vec model. (Skip-gram is friendly to rare words that occur infrequently and does not drop words, so the CBOW model is used for comparative analysis; two sets of models are trained separately, negative-sampling optimization is used with the number of negative samples adjusted to the specific usage scenario, downsampling is adopted during model training, and the window size is adjusted according to the actual scenario and test results.)
S113, setting a weight for each word according to its frequency in the corpus and mapping the word to a new space according to the corresponding rule. (This first part mainly trains the word vectors of the two sets of models and applies the potential-stop-word weighting to them, i.e. the weight corresponding to each word is computed and combined with the trained initial word vectors to obtain the domain-free word vectors.)
(The second part is built on the first part: after the domain-free word vectors with potential stop words de-emphasized are obtained, word vectors are trained on the corpus of a specific domain and the final word-vector fusion is performed. For vertical-domain corpus training, a Word2Vec model is trained on the corpus of the professional-domain knowledge base, and the resulting in-domain word vectors are fused with the result of the first part.) S12 comprises the following specific steps:
S121, cleaning the domain corpus by removing emoticons and unrecognizable garbled characters, and segmenting the domain corpus with the LTP word segmentation model;
S122, calculating the IDF(w) value of each word, where IDF(w) is determined by the probability of the word appearing in the domain corpus, and calculating the IDF_weight value: let the median of the IDF values of all words in the domain corpus be IDF_mo and their mean be IDF_avg; then
IDF_weight = (IDF_mo + IDF_avg) / 2
S123, training word vectors on the domain corpus: Skip-gram is adopted and compared against CBOW, negative sampling is used for optimization with the number of negative samples set according to the specific scenario and test results, downsampling is adopted during model training, and the window size is set according to the specific scenario, yielding the domain word vectors V_old(w)';
S124, performing spatial mapping according to the frequency p(w)' with which each word appears in the domain corpus to obtain V_olddomain(w); the calculation formula is as follows:
V_olddomain(w) = exp(p(w)') × V_old(w)'
S125, fusing the domain word vectors and the domain-free word vectors of the domain corpus to obtain the demand word vectors V_new(w). (There are three cases. First, when a word appears both in the domain vocabulary and in the domain-free vocabulary, the new domain word vector is the product of the word's IDF value reduced by the IDF_weight value obtained in step S122 and the word's domain word vector, and this expanded domain word vector is fused with the word's domain-free word vector by taking the value midway between them. Second, if the word appears only in the domain vocabulary, its domain word vector is used as its final word vector. Third, if the word appears only in the domain-free vocabulary, its domain-free word vector is used as its final word vector.)
After step S12 is completed, step S13 is performed; step S13 corrects the demand word vectors and specifically comprises the following steps:
S131, using similar question pairs from the professional field, performing word segmentation on each text, and for each word looking up the domain word vector V_old(w)' trained in step S123 and the demand word vector V_new(w) obtained in step S125; the similarity ρ(w) between V_old(w)' and V_new(w) is computed with the RWMD algorithm, whether the similarity ρ(w) of each word is qualified is judged, and the failure rate λ over the similarities ρ(w) of all words is counted;
S132, judging whether the failure rate λ is less than or equal to the threshold; if not, performing step S133; if yes, performing step S134;
S133, adjusting the number of negative samples and the size of the downsampling window in S123, repeating steps S123 to S125 to obtain the demand word vectors V_new(w) again, and then returning to step S131;
S134, ending the calculation. (The purpose is to check the accuracy of the word vectors: if they are not accurate, the number of negative samples and the size of the downsampling window in S123 are adjusted and the word vectors of the domain corpus are trained again; if they are accurate, the word vectors can be used for the in-domain retrieval similarity calculation.)
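A control-flow sketch of the correction loop S131-S134 follows, assuming a helper train_domain_vectors covering S123-S125 and a per-word similarity function word_similarity for the RWMD-style comparison of V_old(w)' and V_new(w); the threshold values and the parameter adjustment schedule are illustrative assumptions.

```python
def correct_demand_vectors(question_pairs, train_domain_vectors, word_similarity,
                           rho_min=0.6, lambda_max=0.1,
                           negative=5, window=5, max_rounds=10):
    """Sketch of S13: retrain with adjusted parameters until the failure rate
    lambda of the per-word similarities rho(w) falls below the threshold."""
    v_new = {}
    for _ in range(max_rounds):
        # S123-S125 (assumed helper): returns the domain vectors V_old(w)' and
        # the fused demand vectors V_new(w) for the current parameters.
        v_old_domain, v_new = train_domain_vectors(negative=negative, window=window)
        words = [w for q1, q2 in question_pairs for w in (q1 + q2)
                 if w in v_new and w in v_old_domain]
        failures = sum(word_similarity(v_old_domain[w], v_new[w]) < rho_min for w in words)
        lam = failures / max(len(words), 1)       # failure rate lambda (S131)
        if lam <= lambda_max:                     # S132 / S134: vectors accepted
            break
        negative += 5                             # S133: adjust negative-sampling count
        window += 1                               #        and the (down)sampling window
    return v_new
```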
To achieve better fusion, step S125 includes the following steps:
S1251, calculating the smoothed domain word vector V_domain(w):
V_domain(w) = (IDF(w) − IDF_weight) × V_olddomain(w)
S1252, let w denote the current word, C_d the domain corpus, and C_ud the domain-free corpus; when w ∈ C_d and w ∈ C_ud, the first fusion mode is executed; when w ∈ C_d and w ∉ C_ud, the second fusion mode is executed; when w ∉ C_d and w ∈ C_ud, the third fusion mode is executed;
The first fusion mode obtains V_new(w) by:
V_new(w) = ( V_domain(w) + V_undomain(w) ) / 2
The second fusion mode obtains V_new(w) by: V_new(w) = V_domain(w);
The third fusion mode obtains V_new(w) by: V_new(w) = V_undomain(w).
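A sketch combining S1251 and the three fusion modes of S1252; idf, idf_w (IDF_weight), v_olddomain and v_undomain are assumed to come from the earlier steps, and the names are illustrative.

```python
import numpy as np

def fuse_vectors(v_olddomain, v_undomain, idf, idf_w):
    """Return V_new(w) for every word in either vocabulary.
    S1251: V_domain(w) = (IDF(w) - IDF_weight) * V_olddomain(w)
    S1252: average / domain-only / domain-free-only, depending on where w occurs."""
    v_new = {}
    for w in set(v_olddomain) | set(v_undomain):
        in_d, in_ud = w in v_olddomain, w in v_undomain
        if in_d:
            v_domain = (idf[w] - idf_w) * np.asarray(v_olddomain[w])  # smoothed domain vector
        if in_d and in_ud:                        # first fusion mode
            v_new[w] = (v_domain + np.asarray(v_undomain[w])) / 2
        elif in_d:                                # second fusion mode
            v_new[w] = v_domain
        else:                                     # third fusion mode
            v_new[w] = np.asarray(v_undomain[w])
    return v_new
```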
For the domain-free word vectors, since the RWMD result is only a similarity between two sentences, it is difficult to give a threshold for deciding whether two sentences are similar. Eliminating potential stop words therefore requires using the magnitude of the gap between similar and irrelevant sentences before and after the optimization as the measure.
For the fusion of domain word vectors, the percentage change in the distance between similar questions before and after optimization is used as the measure. Since the purpose of fusing the word vectors is to bring similar questions closer together after merging, the difference between the final similarity results before and after optimization is used as the measure: when the RWMD result before optimization is larger than the result after optimization, similar questions are closer together after optimization and the optimization is effective. The percentage difference between the distances also demonstrates the effectiveness of the optimization from another angle, as shown in FIG. 1.
This embodiment also provides a fusion ordering method for domain word vectors, comprising the following steps:
S21, the fusion model is set to comprise a single-document model and a document-pair model. (The fusion model is formed by fusing a single-document model and a document-pair model, so both submodels uniformly use a neural network as the base model. The single-document model uses a single-layer neural network with, as a rule, 8 neurons in the middle layer. The document-pair model uses a two-layer neural network, with 8 neurons suggested per layer. The batch size is uniformly suggested to be 128, and the number of iterations depends on the actual scenario and test results.)
S21, performing word segmentation on each piece of data with LTP and removing stop words;
S22, calculating the TF/IDF cosine value, BM25, Word2Vec Euclidean distance, RWMD and HowNet semantic similarity as the features of the ranking learning model;
S23, training the first part of the fusion model: after feature selection, the similarity vector between each text and the original question is mapped and fed into the single-layer neural-network base model of the single-document model for training, with 8 neurons in the middle layer and a batch size of 128;
S24, training the second part of the fusion model: on the basis of the selected correct answer for each piece of data, other incorrect texts are randomly sampled, the similarity vectors between document pairs are mapped and fed into the two-layer neural-network base model of the document-pair model for training, with 8 neurons per layer and a batch size of 128.
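A possible concretization of the two base models of S23/S24 follows (a single hidden layer for the single-document part, two hidden layers for the document-pair part, 8 neurons per layer, batch size 128). The input dimension of 5 corresponds to the five features of S22; the choice of PyTorch and every name below is an assumption, not something prescribed by the patent.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

N_FEATURES = 5  # TF/IDF cosine, BM25, Word2Vec Euclidean distance, RWMD, HowNet similarity

# Single-document base model: one hidden layer with 8 neurons; scores a single
# (question, candidate) feature vector.
pointwise_model = nn.Sequential(
    nn.Linear(N_FEATURES, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

# Document-pair base model: two hidden layers with 8 neurons each; consumes the
# concatenated features of a candidate pair and predicts which one is more similar.
pairwise_model = nn.Sequential(
    nn.Linear(2 * N_FEATURES, 8),
    nn.ReLU(),
    nn.Linear(8, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

def train(model, dataset, epochs=10):
    """Generic training loop using the batch size of 128 suggested in the text."""
    loader = DataLoader(dataset, batch_size=128, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for features, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features).squeeze(-1), labels.float())
            loss.backward()
            optimizer.step()
```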
1. Introduction to the RWMD algorithm in the domain word vector optimization method
The most popular word-vector generation model at present is the Word2Vec model. It is essentially a probability distribution model and can be viewed as a huge high-dimensional sphere centered at the origin, in which every word is a point. The present model improves the Word2Vec training process and its results so that the final word-vector result is better suited to a specific vertical field, and uses RWMD, an optimization of the WMD algorithm, to compute short-text similarity.
For semantic computation at the word-sense level, keywords and stop words are the main influencing factors. For the stop words, a corpus backed by a large database, such as historical news, can be chosen; its advantage is a large vocabulary with no bias toward any field, so balanced word vectors suitable for any field can be trained. To address the problem that rare proper nouns appear too few times, the corpus in the domain knowledge base is used directly to train word vectors.
In a news corpus, because the content is mixed, each word cannot be characterized with a TF/IDF-like method, so the frequency of occurrence of each word is used instead. Potential stop words occur with high frequency, so the model performs spatial mapping with the following formula:
V_undomain(w) = exp(p(w)) × V_old(w)
where p(w) is the frequency of occurrence of the word. The above formula is essentially a vector expansion. Using the property
(exp(x))' = exp(x),
the magnitude of the expansion after mapping is tied to the frequency of the vector, so the distance relations between word vectors of different frequencies can be better distinguished in the mapped space. Points with higher frequency are expanded more and points with lower frequency are expanded less, which reduces to a certain extent the influence of some potential stop words in RWMD: this spatial mapping lowers, on the one hand, the values assigned to the potential stop words in the transition probability matrix T_{i,j}, and on the other hand increases the cost between keywords and potential stop words, so that more attention is focused on the keywords and the effect of RWMD is indirectly improved.
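For reference, below is a sketch of the RWMD (relaxed word mover's distance) computation as it is commonly defined: each word of one text is matched, weighted by its normalized frequency, to the nearest word of the other text by Euclidean distance, and the larger of the two directional costs is taken. Here the vectors would be the optimized demand word vectors V_new(w); the code and names are illustrative.

```python
from collections import Counter
import numpy as np

def rwmd(tokens_a, tokens_b, vectors):
    """Relaxed word mover's distance between two tokenized short texts."""
    def nbow(tokens):
        counts = Counter(t for t in tokens if t in vectors)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def one_side(weights, other):
        # each word travels entirely to its nearest counterpart (minimum-cost match)
        cost = 0.0
        for w, d in weights.items():
            cost += d * min(np.linalg.norm(vectors[w] - vectors[o]) for o in other)
        return cost

    wa, wb = nbow(tokens_a), nbow(tokens_b)
    if not wa or not wb:
        return float("inf")   # no comparable words in one of the texts
    return max(one_side(wa, wb), one_side(wb, wa))
```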
Although potential stop words are weakened in the large-scale corpus, the attention given to the keywords of certain fields in the expert system is still far from sufficient; even in a large-scale corpus, these domain keywords are classified as UNKNOWN, which introduces a considerable error into the word-to-word Euclidean distance calculation in RWMD. For this problem the model trains two sets of word vectors: one trained on a large domain-free corpus, and one obtained by Word2Vec training on the knowledge base of the target domain. In actual use the two can be fused after spatial mapping, so that keywords receive a certain weight. Since the domain corpus consists of a knowledge base, the mapping weight of each word is changed from its frequency of occurrence to its IDF value, so that the importance of the word itself affects, to some extent, its position in the new vector space. However, another problem arises: the weight of certain keywords becomes too high. Because the RWMD algorithm matches each word of one text against the word with the minimum cost in the other text, excessive attention to a single keyword also negatively affects the overall similarity calculation. Therefore, the IDF value of each word is reduced by the IDF_weight value of the whole corpus, where IDF_weight is the value midway between the average IDF value and the median IDF value of the corpus, so that the overall IDF values are smoother. The calculation formula is as follows,
IDF_weight = (IDF_mo + IDF_avg) / 2
With these two sets of weights, the in-domain Word2Vec can be used to influence the overall Word2Vec, yielding a richer in-domain word-vector space so that the RWMD result between similar sentences is better than the non-optimized RWMD result. The fusion formula is as follows,
V_new(w) = ( V_domain(w) + V_undomain(w) ) / 2, if w ∈ C_d and w ∈ C_ud
V_new(w) = V_domain(w), if w ∈ C_d and w ∉ C_ud
V_new(w) = V_undomain(w), if w ∉ C_d and w ∈ C_ud
where C_d denotes the domain corpus, V_domain(w) the weight of a keyword in the domain thesaurus, C_ud the domain-free corpus, and V_undomain(w) the weight of a keyword in the domain-free corpus. Through this mapping, the cost value of the distance between related questions computed with RWMD can be reduced while the distance between unrelated questions is increased.
When comparing short-text similarity, attention must not rest on only a few words: by the nature of RWMD, the minimum cost can concentrate on a single word, so the other additional information is lost and the result deviates heavily. Removing the average from the IDF reduces, to a certain extent, excessively high IDF values, so that secondary keywords can also express their importance in the RWMD algorithm and large deviations in the final result caused by over-attention to the primary keywords are avoided to some degree. After the in-domain word vectors are obtained, they are fused with the domain-free word vectors; concretely, the two are added and the midpoint value is taken. The generated word vector is thus influenced not only by the original word vectors but also by the in-domain word vectors, so the newly generated domain word vectors are better suited to a question-answering system in a specific vertical field.
The method is plug-and-play and can be applied to an expert system with a knowledge base in any field.
2. Introduction to the domain word vector fusion ordering method
The document-list method can, to a certain extent, solve the problems caused by the single-document method and the document-pair method, but because every training iteration of the document-list method must traverse each query, the time complexity of the conversion to a probabilistic model is high, so it is not suitable for an expert system that requires fast responses and high accuracy for question answering in a specific field. Moreover, the choice of K in Top-K may cause the correct answer not to be returned to the user.
Aiming at these problems of the document-list model, a single-document method and a document-pair method can be fused to emulate the document-list method, with corresponding improvements and optimizations for the practical application of the expert system; this reduces the time complexity without affecting the overall effect and increases the probability of returning the expected answer to the user. The overall architecture can be divided into two parts: a document classification part and a document-pair ranking part.
The schematic diagram of the fusion ranking model is shown in FIG. 2; the overall architecture is divided into the document classification part and the document-pair ranking part.
In the first part, single documents are classified to obtain a coarse ordering. In this way the sequence itself no longer has to be considered: the candidates that the single-document model judges highly likely are extracted as a whole, without regard to their internal order. After this processing the more relevant questions have been coarsely extracted, and the result is sent to the second part for document-pair ranking. The coarse extraction of the first-part single-document model is effectively a Top-K-like process that performs an initial selection of entries. Because the number of similar questions in the domain knowledge base is small, a great number of irrelevant questions can be screened out to a large extent.
The second part is the document-pair ranking part. Given the coarse, unordered set of questions relatively related to the query produced by the first part's single-document classification, the document-pair method sorts this result in detail: the candidate similar questions and the original question are converted into vectors and compared pairwise, and a tree-style ranking method can be used for the comparison, finally yielding the most similar question. Starting from the coarse similarity sequence given by the first part, the document-pair part performs a fine extraction on it, extracting the result with the highest similarity coefficient and returning it to the questioner.
During training, the two parts can be trained separately and in parallel, i.e. the conversion from the ranked sequence to vectors is split, so the overall time complexity is only the sum of the two parts.
For the overall model, the first part can be regarded as preprocessing for the second part, and the two are relatively independent by definition. The main function of the single-document model is to compute and classify the similarity between the query and the candidate similar questions, while the document-pair model screens the similar questions once more to obtain the most similar texts.
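At query time the two parts can be chained as sketched below, reusing the pointwise_model and pairwise_model from the earlier sketch; features(question, candidate), the shortlist size and the sign convention of the pairwise output are assumptions.

```python
import torch

def answer(question, candidates, features, pointwise_model, pairwise_model, top_k=10):
    """Two-stage retrieval: coarse classification with the single-document model,
    then pairwise ranking of the shortlist with the document-pair model."""
    # Part 1: coarse extraction, keeping the candidates the pointwise model scores highest.
    scores = [pointwise_model(torch.tensor(features(question, c))).item() for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    shortlist = [candidates[i] for i in order[:top_k]]

    # Part 2: fine ranking, comparing shortlisted candidates pairwise and keeping the winner.
    best = shortlist[0]
    for c in shortlist[1:]:
        pair = torch.tensor(features(question, best) + features(question, c))
        # assumed convention: a negative score means the second candidate is more similar
        if pairwise_model(pair).item() < 0:
            best = c
    return best
```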
Finally, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. A method for optimizing domain word vectors, characterized by comprising the following steps: S11, training domain-free word vectors; S12, training domain word vectors and obtaining the demand word vectors;
S11 comprises the following specific steps:
S111, cleaning the domain-free corpus by removing emoticons and unrecognizable garbled characters, and performing word segmentation on the domain-free corpus;
S112, training the whole corpus with a Word2Vec model to obtain the initial word vectors V_old(w);
S113, setting a weight for each word according to its frequency p(w) in the domain-free corpus, and calculating the domain-free word vectors of the domain-free corpus according to the following formula:
V_undomain(w) = exp(p(w)) × V_old(w)
where V_undomain(w) denotes the domain-free word vector and p(w) is the frequency of the word in the corpus;
S12 comprises the following specific steps:
S121, cleaning the domain corpus by removing emoticons and unrecognizable garbled characters, and segmenting the domain corpus with the LTP word segmentation model;
S122, calculating the IDF(w) value of each word, where IDF(w) is determined by the probability of the word appearing in the domain corpus, and calculating the IDF_weight value: let the median of the IDF values of all words in the domain corpus be IDF_mo and their mean be IDF_avg; then
IDF_weight = (IDF_mo + IDF_avg) / 2
S123, training word vectors on the domain corpus: Skip-gram is adopted and compared against CBOW, negative-sampling optimization is used with the number of negative samples determined by the specific scenario and test results, downsampling is adopted during model training, and the window size is determined by the specific scenario, yielding the domain word vectors V_old(w)';
S124, performing spatial mapping according to the frequency p(w)' with which each word appears in the domain corpus to obtain V_olddomain(w), the calculation formula being:
V_olddomain(w) = exp(p(w)') × V_old(w)'
S125, fusing the domain word vectors and the domain-free word vectors of the domain corpus to obtain the demand word vectors V_new(w).
2. The method as claimed in claim 1, characterized in that step S13 is performed after step S12, step S13 correcting the demand word vectors, wherein step S13 specifically comprises the following steps:
S131, using similar question pairs from the professional field, performing word segmentation on each text, and for each word looking up the domain word vector V_old(w)' trained in step S123 and the demand word vector V_new(w) obtained in step S125; the similarity ρ(w) between V_old(w)' and V_new(w) is computed with the RWMD algorithm, whether the similarity ρ(w) of each word is qualified is judged, and the failure rate λ over the similarities ρ(w) of all words is counted;
S132, judging whether the failure rate λ is less than or equal to the threshold; if not, performing step S133; if yes, performing step S134;
S133, adjusting the number of negative samples and the size of the downsampling window in step S123, repeating steps S123 to S125 to obtain the demand word vectors V_new(w) again, and then returning to step S131;
S134, ending the calculation.
3. The method of claim 2, wherein the step S125 comprises the following steps:
S1251, calculating the smoothed domain word vector V_domain(w):
V_domain(w) = (IDF(w) − IDF_weight) × V_olddomain(w)
S1252, let w denote the current word, C_d the domain corpus, and C_ud the domain-free corpus; when w ∈ C_d and w ∈ C_ud, the first fusion mode is executed; when w ∈ C_d and w ∉ C_ud, the second fusion mode is executed; when w ∉ C_d and w ∈ C_ud, the third fusion mode is executed;
The first fusion mode obtains V_new(w) by:
V_new(w) = ( V_domain(w) + V_undomain(w) ) / 2
The second fusion mode obtains V_new(w) by: V_new(w) = V_domain(w);
The third fusion mode obtains V_new(w) by: V_new(w) = V_undomain(w).
CN201811257850.4A 2018-10-26 2018-10-26 Optimization method of domain word vectors and fusion ordering method based on optimization method Active CN109359302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811257850.4A CN109359302B (en) 2018-10-26 2018-10-26 Optimization method of domain word vectors and fusion ordering method based on optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811257850.4A CN109359302B (en) 2018-10-26 2018-10-26 Optimization method of domain word vectors and fusion ordering method based on optimization method

Publications (2)

Publication Number Publication Date
CN109359302A CN109359302A (en) 2019-02-19
CN109359302B true CN109359302B (en) 2023-04-18

Family

ID=65347004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811257850.4A Active CN109359302B (en) 2018-10-26 2018-10-26 Optimization method of domain word vectors and fusion ordering method based on optimization method

Country Status (1)

Country Link
CN (1) CN109359302B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960724B (en) * 2019-03-13 2021-06-04 北京工业大学 Text summarization method based on TF-IDF
CN110083834B (en) * 2019-04-24 2023-05-09 北京百度网讯科技有限公司 Semantic matching model training method and device, electronic equipment and storage medium
CN111814473B (en) * 2020-09-11 2020-12-22 平安国际智慧城市科技股份有限公司 Word vector increment method and device for specific field and storage medium
CN112185359B (en) * 2020-09-28 2023-08-29 广州秉理科技有限公司 Word coverage rate-based voice training set minimization method
CN112836509A (en) * 2021-02-22 2021-05-25 西安交通大学 Expert system knowledge base construction method and system
CN113221531B (en) * 2021-06-04 2024-08-06 西安邮电大学 Semantic matching method for multi-model dynamic collaboration
CN113449074A (en) * 2021-06-22 2021-09-28 重庆长安汽车股份有限公司 Sentence vector similarity matching optimization method and device containing proper nouns and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101373532A (en) * 2008-07-10 2009-02-25 昆明理工大学 FAQ Chinese request-answering system implementing method in tourism field

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPR155800A0 (en) * 2000-11-17 2000-12-14 Agriculture Victoria Services Pty Ltd Method of enhancing virus-resistance in plants and producing virus-immune plants
US9342991B2 (en) * 2013-03-14 2016-05-17 Canon Kabushiki Kaisha Systems and methods for generating a high-level visual vocabulary
CN104899322B (en) * 2015-06-18 2021-09-17 百度在线网络技术(北京)有限公司 Search engine and implementation method thereof
CA3009758A1 (en) * 2015-12-29 2017-07-06 Mz Ip Holdings, Llc Systems and methods for suggesting emoji
CN106649561B (en) * 2016-11-10 2020-05-26 复旦大学 Intelligent question-answering system for tax consultation service
US10922621B2 (en) * 2016-11-11 2021-02-16 International Business Machines Corporation Facilitating mapping of control policies to regulatory documents
CN106777043A (en) * 2016-12-09 2017-05-31 宁波大学 A kind of academic resources acquisition methods based on LDA
CN106777395A (en) * 2017-03-01 2017-05-31 北京航空航天大学 A kind of topic based on community's text data finds system
CN106997345A (en) * 2017-03-31 2017-08-01 成都数联铭品科技有限公司 The keyword abstraction method of word-based vector sum word statistical information
CN107016439A (en) * 2017-05-09 2017-08-04 重庆大学 Based on CR2The image text dual coding mechanism implementation model of neutral net
CN207354333U (en) * 2017-07-31 2018-05-11 曾一菲 Instrument share management system based on cloud platform
CN108491480B (en) * 2018-03-12 2021-05-11 义语智能科技(上海)有限公司 Rumor detection method and apparatus
CN108363816A (en) * 2018-03-21 2018-08-03 北京理工大学 Open entity relation extraction method based on sentence justice structural model
CN108536781B (en) * 2018-03-29 2022-04-01 武汉大学 Social network emotion focus mining method and system
CN108595440B (en) * 2018-05-11 2022-03-18 厦门市美亚柏科信息股份有限公司 Short text content classification method and system
CN108664637B (en) * 2018-05-15 2021-10-08 惠龙易通国际物流股份有限公司 Retrieval method and system
CN108710611B (en) * 2018-05-17 2021-08-03 南京大学 Short text topic model generation method based on word network and word vector

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101373532A (en) * 2008-07-10 2009-02-25 昆明理工大学 FAQ Chinese request-answering system implementing method in tourism field

Also Published As

Publication number Publication date
CN109359302A (en) 2019-02-19

Similar Documents

Publication Publication Date Title
CN109359302B (en) Optimization method of domain word vectors and fusion ordering method based on optimization method
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
WO2021093755A1 (en) Matching method and apparatus for questions, and reply method and apparatus for questions
CN109271506A (en) A kind of construction method of the field of power communication knowledge mapping question answering system based on deep learning
CN111581519A (en) Item recommendation method and system based on user intention in session
CN101561805A (en) Document classifier generation method and system
CN109829045A (en) A kind of answering method and device
CN112463944B (en) Search type intelligent question-answering method and device based on multi-model fusion
CN111651558A (en) Hyperspherical surface cooperative measurement recommendation device and method based on pre-training semantic model
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN110956044A (en) Attention mechanism-based case input recognition and classification method for judicial scenes
CN114357120A (en) Non-supervision type retrieval method, system and medium based on FAQ
CN115309872B (en) Multi-model entropy weighted retrieval method and system based on Kmeans recall
CN112015760B (en) Automatic question-answering method and device based on candidate answer set reordering and storage medium
CN114332519A (en) Image description generation method based on external triple and abstract relation
CN112632250A (en) Question and answer method and system under multi-document scene
CN114637760A (en) Intelligent question and answer method and system
CN110851584A (en) Accurate recommendation system and method for legal provision
CN115146021A (en) Training method and device for text retrieval matching model, electronic equipment and medium
CN112084307A (en) Data processing method and device, server and computer readable storage medium
CN116992007A (en) Limiting question-answering system based on question intention understanding
CN117807232A (en) Commodity classification method, commodity classification model construction method and device
CN113836269B (en) Chapter-level core event extraction method based on question-answering system
CN117171428B (en) Method for improving accuracy of search and recommendation results
CN113535928A (en) Service discovery method and system of long-term and short-term memory network based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant