CN109359302B - Optimization method of domain word vectors and fusion ordering method based on optimization method - Google Patents

Optimization method of domain word vectors and fusion ordering method based on optimization method

Info

Publication number
CN109359302B
CN109359302B (application number CN201811257850.4A)
Authority
CN
China
Prior art keywords
word
domain
corpus
field
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811257850.4A
Other languages
Chinese (zh)
Other versions
CN109359302A (en)
Inventor
刘慧君
李傲
曾一
乔猛
周明强
邬小燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN201811257850.4A priority Critical patent/CN109359302B/en
Publication of CN109359302A publication Critical patent/CN109359302A/en
Application granted granted Critical
Publication of CN109359302B publication Critical patent/CN109359302B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for optimizing domain word vectors and a fusion ordering method based on the optimization method. The optimization method of the domain word vectors comprises the following steps: S11, training domain-free word vectors; S12, training domain word vectors, obtaining the demand word vectors, and calculating similarity with the RWMD algorithm. S12 comprises the following specific steps: S121, cleaning the domain corpus by removing emoticons and unrecognizable garbled characters, and segmenting the domain corpus with the LTP word segmentation model; S122, calculating the IDF value of each word from its frequency of occurrence in the domain corpus and calculating the IDF_weight value. The optimization method of the domain word vectors and the fusion ordering method based on it solve the problem in the prior art that domain-free word vectors and domain word vectors cannot be fused, so that newly generated domain word vectors cannot be adapted to a question-answering system in a specific vertical domain.

Description

Optimization method of domain word vectors and fusion ordering method based on optimization method
Technical Field
The invention relates to the field of information retrieval, in particular to a domain word vector optimization method and a fusion ordering method based on the domain word vector optimization method.
Background
With the rapid development of the economy, society and the internet, all kinds of affairs and information are stored as data. How to use these data and manage them scientifically and effectively is a very popular research direction in the field of information retrieval. The database of a search engine mixes many fields; for problems in some professional fields, a large-scale search engine returns many useless results, which increases the difficulty of retrieval, and looking for relevant answers in a large amount of useless information increases the burden on the retrieval system and degrades the user experience.
The expert system is an application of information retrieval; in terms of what it actually implements, it can be defined as a natural language processing problem, namely short-text similarity matching. The underlying implementation of an expert system is a question-answering system for a fixed professional field, so the quality of the returned results affects the questioner's experience to a certain extent.
Ranking learning is currently widely used in information retrieval, and the expert system is one of its typical applications. As a form of supervised learning, it differs from a single traditional evaluation model by introducing a mechanism that fuses several traditional models into the ranking learning. At present, ranking learning falls into three main categories: the single-document method (Pointwise Approach), the document-pair method (Pairwise Approach) and the document-list method (Listwise Approach).
Short-text matching retrieves the required information through similar question pairs in an information-retrieval manner, and mainly includes semantic matching and word-sense matching. Semantic matching requires learning a semantic model from a large amount of labeled data, which is engineering-heavy; moreover, the amount of data in a knowledge base is small compared with what a language model needs, so it is difficult to learn an effective model. Matching at the word-sense level is simple and fast: a feature vector is built for each word from TF/IDF or a natural language model to obtain a probabilistic representation of the text sequence; BiGram and TriGram models are built and similarity is computed via Euclidean distance; and the Word2Vec model simplifies the training process and reduces training time.
However, the above methods have the following problem: the generated word vector is influenced only by domain-free word vectors or only by in-domain word vectors. Because domain-free and domain word vectors cannot be fused, the newly generated domain word vector cannot be adapted to a question-answering system in a specific vertical domain, which leads to slow responses during search.
Disclosure of Invention
The invention provides a method for optimizing domain word vectors and a fusion ordering method based on it, which solve the problem in the prior art that domain-free word vectors and domain word vectors cannot be fused, so that newly generated domain word vectors cannot be adapted to a question-answering system in a specific vertical domain.
In order to achieve this purpose, the invention adopts the following technical scheme:
The invention first provides a method for optimizing domain word vectors, comprising the following steps: S11, training domain-free word vectors; S12, training domain word vectors and obtaining the demand word vectors;
S11 comprises the following specific steps:
S111, cleaning the domain-free corpus by removing emoticons and unrecognizable garbled characters, and performing word segmentation on the domain-free corpus;
S112, training the whole corpus with a Word2Vec model to obtain the initial word vectors V_old(w);
S113, setting a weight for each word according to its frequency p(w) in the domain-free corpus, and calculating the domain-free word vectors of the domain-free corpus according to the following formula:
V_undomain(w) = exp(p(w)) × V_old(w)
where V_undomain(w) denotes the domain-free word vector and p(w) is the frequency of the word in the corpus;
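As an illustration of steps S111-S113, the following is a minimal sketch, assuming the initial vectors V_old(w) have already been trained (for example with an external Word2Vec implementation such as gensim) and that p(w) is simply the word's relative frequency in the domain-free corpus; all function and variable names are illustrative and not part of the patent.

```python
from collections import Counter
import numpy as np

def domain_free_vectors(tokenized_corpus, initial_vectors):
    """S111-S113 sketch: weight each initial vector V_old(w) by exp(p(w)),
    where p(w) is the word's frequency in the domain-free corpus."""
    counts = Counter(w for sentence in tokenized_corpus for w in sentence)
    total = sum(counts.values())
    v_undomain = {}
    for w, vec in initial_vectors.items():
        p_w = counts[w] / total                        # frequency p(w) of the word
        v_undomain[w] = np.exp(p_w) * np.asarray(vec)  # V_undomain(w) = exp(p(w)) * V_old(w)
    return v_undomain
```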
S12 comprises the following specific steps:
S121, cleaning the domain corpus by removing emoticons and unrecognizable garbled characters, and segmenting the domain corpus with the word segmentation model of LTP (Language Technology Platform);
S122, calculating the IDF(w) value of each word, where IDF(w) is determined by the probability of the word appearing in the domain corpus, and calculating the IDF_weight value: let the median of the IDF values of all words in the domain corpus be IDF_mo and their mean be IDF_avg; then
IDF_weight = (IDF_mo + IDF_avg) / 2
S123, training word vectors on the domain corpus: Skip-gram is adopted and compared against CBOW, negative sampling is used for optimization with the number of negative samples set according to the specific scenario and test results, downsampling is adopted during model training, and the window size is set according to the specific scenario, yielding the domain word vectors V_old(w)';
S124, performing spatial mapping according to the frequency p(w)' with which each word appears in the domain corpus to obtain V_olddomain(w); the calculation formula is as follows (see the sketch after step S125):
V_olddomain(w) = exp(p(w)') × V_old(w)'
S125, fusing the domain word vectors and the domain-free word vectors of the domain corpus to obtain the demand word vectors V_new(w).
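The domain-side steps S122 and S124 can be sketched in the same way, again under assumed names; the Skip-gram/CBOW training of S123 itself is left to an external Word2Vec implementation, and only its output (a mapping from word to vector V_old(w)') is consumed here.

```python
import math
from statistics import mean, median
from collections import Counter
import numpy as np

def idf_table(domain_docs):
    """S122 sketch: IDF(w) over the domain corpus (one value per word)."""
    n_docs = len(domain_docs)
    df = Counter(w for doc in domain_docs for w in set(doc))
    return {w: math.log(n_docs / df_w) for w, df_w in df.items()}

def idf_weight(idf):
    """S122 sketch: IDF_weight = midpoint of the median and the mean of all IDF values."""
    values = list(idf.values())
    return (median(values) + mean(values)) / 2

def map_domain_vectors(domain_docs, domain_vectors):
    """S124 sketch: V_olddomain(w) = exp(p(w)') * V_old(w)', with p(w)' the in-domain frequency."""
    counts = Counter(w for doc in domain_docs for w in doc)
    total = sum(counts.values())
    return {w: np.exp(counts[w] / total) * np.asarray(vec)
            for w, vec in domain_vectors.items()}
```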
The invention also provides a fusion ordering method for domain word vectors, comprising the following steps:
S21, the fusion model is set to comprise a single-document model and a document-pair model;
S21, performing word segmentation on each piece of data with LTP and removing stop words;
S22, calculating the TF/IDF cosine value, BM25, Word2Vec Euclidean distance, RWMD and HowNet semantic similarity as the features of the ranking learning model;
S23, training the first part of the fusion model: after feature selection, the similarity vector between each text and the original question is mapped and fed into the single-layer neural-network base model of the single-document model for training, with 8 neurons in the middle layer and a batch size of 128;
S24, training the second part of the fusion model: on the basis of the selected correct answer for each piece of data, other incorrect texts are randomly sampled, the similarity vectors between document pairs are mapped and fed into the two-layer neural-network base model of the document-pair model for training, with 8 neurons per layer and a batch size of 128.
Compared with the prior art, the invention has the following beneficial effects:
1) Domain-free word vectors are obtained by establishing and training a domain-free corpus; the domain-free word vectors and the domain word vectors are then fused under the stated conditions, and features describing the correlation between questions and between a question and a document are added as a supplement, so that the expert robot fuses multiple features when ranking answers, which ultimately raises the probability of returning the best answer and improves both the response efficiency of retrieval and the correctness of the returned result;
2) The expert system places high demands on the final ranking result; the document-list model, although more effective, has high time complexity and cannot meet the fast response speed an expert system requires, which is why the present method instead fuses a single-document model with a document-pair model to reduce time complexity without sacrificing the overall effect.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
FIG. 1 is a process diagram of short-text similarity comparison using RWMD with the domain word vectors;
FIG. 2 is a schematic diagram of a fusion ordering model.
Detailed Description
In order to make the technical means, creative features, objectives and effects of the invention clearer and easier to understand, the invention is further explained below with reference to the drawings and the specific embodiments:
the invention provides a method for optimizing a domain word vector, which comprises the following steps: s11, training a domain-free word vector; s12, training field word vectors and obtaining demand word vectors;
s11, the steps are as follows:
s111, cleaning data of a domain-free large-scale corpus, removing emoticons and unrecognizable messy codes, and performing word segmentation processing on the corpus;
S112, training the whole corpus with a Word2Vec model. (Skip-gram is friendly to rare words that occur infrequently and does not drop words, so the CBOW model is used for comparative analysis; two sets of models are trained separately, negative-sampling optimization is used with the number of negative samples adjusted to the specific usage scenario, downsampling is adopted during model training, and the window size is adjusted according to the actual scenario and test results.)
S113, setting a weight for each word according to its frequency in the corpus and mapping the word to a new space according to the corresponding rule. (This first part mainly trains the word vectors of the two sets of models and applies the potential-stop-word weighting to them, i.e. the weight corresponding to each word is computed and combined with the trained initial word vectors to obtain the domain-free word vectors.)
(The second part is built on the first part: after the domain-free word vectors with potential stop words de-emphasized are obtained, word vectors are trained on the corpus of a specific domain and the final word-vector fusion is performed. For vertical-domain corpus training, a Word2Vec model is trained on the corpus of the professional-domain knowledge base, and the resulting in-domain word vectors are fused with the result of the first part.) S12 comprises the following specific steps:
S121, cleaning the domain corpus by removing emoticons and unrecognizable garbled characters, and segmenting the domain corpus with the LTP word segmentation model;
S122, calculating the IDF(w) value of each word, where IDF(w) is determined by the probability of the word appearing in the domain corpus, and calculating the IDF_weight value: let the median of the IDF values of all words in the domain corpus be IDF_mo and their mean be IDF_avg; then
IDF_weight = (IDF_mo + IDF_avg) / 2
S123, training word vectors on the domain corpus: Skip-gram is adopted and compared against CBOW, negative sampling is used for optimization with the number of negative samples set according to the specific scenario and test results, downsampling is adopted during model training, and the window size is set according to the specific scenario, yielding the domain word vectors V_old(w)';
S124, performing spatial mapping according to the frequency p(w)' with which each word appears in the domain corpus to obtain V_olddomain(w); the calculation formula is as follows:
V_olddomain(w) = exp(p(w)') × V_old(w)'
S125, fusing the domain word vectors and the domain-free word vectors of the domain corpus to obtain the demand word vectors V_new(w). (There are three cases. First, when a word appears both in the domain vocabulary and in the domain-free vocabulary, the new domain word vector is the product of the word's IDF value reduced by the IDF_weight value obtained in step S122 and the word's domain word vector, and this expanded domain word vector is fused with the word's domain-free word vector by taking the value midway between them. Second, if the word appears only in the domain vocabulary, its domain word vector is used as its final word vector. Third, if the word appears only in the domain-free vocabulary, its domain-free word vector is used as its final word vector.)
After step S12 is completed, step S13 is performed; step S13 corrects the demand word vectors and specifically comprises the following steps:
S131, using similar question pairs from the professional field, performing word segmentation on each text, and for each word looking up the domain word vector V_old(w)' trained in step S123 and the demand word vector V_new(w) obtained in step S125; the similarity ρ(w) between V_old(w)' and V_new(w) is computed with the RWMD algorithm, whether the similarity ρ(w) of each word is qualified is judged, and the failure rate λ over the similarities ρ(w) of all words is counted;
S132, judging whether the failure rate λ is less than or equal to the threshold; if not, performing step S133; if yes, performing step S134;
S133, adjusting the number of negative samples and the size of the downsampling window in S123, repeating steps S123 to S125 to obtain the demand word vectors V_new(w) again, and then returning to step S131;
S134, ending the calculation. (The purpose is to check the accuracy of the word vectors: if they are not accurate, the number of negative samples and the size of the downsampling window in S123 are adjusted and the word vectors of the domain corpus are trained again; if they are accurate, the word vectors can be used for the in-domain retrieval similarity calculation.)
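A control-flow sketch of the correction loop S131-S134 follows, assuming a helper train_domain_vectors covering S123-S125 and a per-word similarity function word_similarity for the RWMD-style comparison of V_old(w)' and V_new(w); the threshold values and the parameter adjustment schedule are illustrative assumptions.

```python
def correct_demand_vectors(question_pairs, train_domain_vectors, word_similarity,
                           rho_min=0.6, lambda_max=0.1,
                           negative=5, window=5, max_rounds=10):
    """Sketch of S13: retrain with adjusted parameters until the failure rate
    lambda of the per-word similarities rho(w) falls below the threshold."""
    v_new = {}
    for _ in range(max_rounds):
        # S123-S125 (assumed helper): returns the domain vectors V_old(w)' and
        # the fused demand vectors V_new(w) for the current parameters.
        v_old_domain, v_new = train_domain_vectors(negative=negative, window=window)
        words = [w for q1, q2 in question_pairs for w in (q1 + q2)
                 if w in v_new and w in v_old_domain]
        failures = sum(word_similarity(v_old_domain[w], v_new[w]) < rho_min for w in words)
        lam = failures / max(len(words), 1)       # failure rate lambda (S131)
        if lam <= lambda_max:                     # S132 / S134: vectors accepted
            break
        negative += 5                             # S133: adjust negative-sampling count
        window += 1                               #        and the (down)sampling window
    return v_new
```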
To achieve better fusion, step S125 includes the following steps:
S1251, calculating the smoothed domain word vector V_domain(w):
V_domain(w) = (IDF(w) − IDF_weight) × V_olddomain(w)
S1252, let w denote the current word, C_d the domain corpus, and C_ud the domain-free corpus; when w ∈ C_d and w ∈ C_ud, the first fusion mode is executed; when w ∈ C_d and w ∉ C_ud, the second fusion mode is executed; when w ∉ C_d and w ∈ C_ud, the third fusion mode is executed;
The first fusion mode obtains V_new(w) by:
V_new(w) = ( V_domain(w) + V_undomain(w) ) / 2
The second fusion mode obtains V_new(w) by: V_new(w) = V_domain(w);
The third fusion mode obtains V_new(w) by: V_new(w) = V_undomain(w).
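A sketch combining S1251 and the three fusion modes of S1252; idf, idf_w (IDF_weight), v_olddomain and v_undomain are assumed to come from the earlier steps, and the names are illustrative.

```python
import numpy as np

def fuse_vectors(v_olddomain, v_undomain, idf, idf_w):
    """Return V_new(w) for every word in either vocabulary.
    S1251: V_domain(w) = (IDF(w) - IDF_weight) * V_olddomain(w)
    S1252: average / domain-only / domain-free-only, depending on where w occurs."""
    v_new = {}
    for w in set(v_olddomain) | set(v_undomain):
        in_d, in_ud = w in v_olddomain, w in v_undomain
        if in_d:
            v_domain = (idf[w] - idf_w) * np.asarray(v_olddomain[w])  # smoothed domain vector
        if in_d and in_ud:                        # first fusion mode
            v_new[w] = (v_domain + np.asarray(v_undomain[w])) / 2
        elif in_d:                                # second fusion mode
            v_new[w] = v_domain
        else:                                     # third fusion mode
            v_new[w] = np.asarray(v_undomain[w])
    return v_new
```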
For the domain-free word vectors, since the RWMD result is only a similarity between two sentences, it is difficult to give a threshold for deciding whether two sentences are similar. Eliminating potential stop words therefore requires using the magnitude of the gap between similar and irrelevant sentences before and after the optimization as the measure.
For the fusion of domain word vectors, the percentage change in the distance between similar questions before and after optimization is used as the measure. Since the purpose of fusing the word vectors is to bring similar questions closer together after merging, the difference between the final similarity results before and after optimization is used as the measure: when the RWMD result before optimization is larger than the result after optimization, similar questions are closer together after optimization and the optimization is effective. The percentage difference between the distances also demonstrates the effectiveness of the optimization from another angle, as shown in FIG. 1.
This embodiment also provides a fusion ordering method for domain word vectors, comprising the following steps:
S21, the fusion model is set to comprise a single-document model and a document-pair model. (The fusion model is formed by fusing a single-document model and a document-pair model, so both submodels uniformly use a neural network as the base model. The single-document model uses a single-layer neural network with, as a rule, 8 neurons in the middle layer. The document-pair model uses a two-layer neural network, with 8 neurons suggested per layer. The batch size is uniformly suggested to be 128, and the number of iterations depends on the actual scenario and test results.)
S21, performing word segmentation on each piece of data with LTP and removing stop words;
S22, calculating the TF/IDF cosine value, BM25, Word2Vec Euclidean distance, RWMD and HowNet semantic similarity as the features of the ranking learning model;
S23, training the first part of the fusion model: after feature selection, the similarity vector between each text and the original question is mapped and fed into the single-layer neural-network base model of the single-document model for training, with 8 neurons in the middle layer and a batch size of 128;
S24, training the second part of the fusion model: on the basis of the selected correct answer for each piece of data, other incorrect texts are randomly sampled, the similarity vectors between document pairs are mapped and fed into the two-layer neural-network base model of the document-pair model for training, with 8 neurons per layer and a batch size of 128.
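A possible concretization of the two base models of S23/S24 follows (a single hidden layer for the single-document part, two hidden layers for the document-pair part, 8 neurons per layer, batch size 128). The input dimension of 5 corresponds to the five features of S22; the choice of PyTorch and every name below is an assumption, not something prescribed by the patent.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

N_FEATURES = 5  # TF/IDF cosine, BM25, Word2Vec Euclidean distance, RWMD, HowNet similarity

# Single-document base model: one hidden layer with 8 neurons; scores a single
# (question, candidate) feature vector.
pointwise_model = nn.Sequential(
    nn.Linear(N_FEATURES, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

# Document-pair base model: two hidden layers with 8 neurons each; consumes the
# concatenated features of a candidate pair and predicts which one is more similar.
pairwise_model = nn.Sequential(
    nn.Linear(2 * N_FEATURES, 8),
    nn.ReLU(),
    nn.Linear(8, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

def train(model, dataset, epochs=10):
    """Generic training loop using the batch size of 128 suggested in the text."""
    loader = DataLoader(dataset, batch_size=128, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for features, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features).squeeze(-1), labels.float())
            loss.backward()
            optimizer.step()
```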
1. Introduction to the RWMD algorithm in the domain word vector optimization method
The most popular word-vector generation model at present is the Word2Vec model. It is essentially a probability distribution model and can be viewed as a huge high-dimensional sphere centered at the origin, in which every word is a point. The present model improves the Word2Vec training process and its results so that the final word-vector result is better suited to a specific vertical field, and uses RWMD, an optimization of the WMD algorithm, to compute short-text similarity.
For semantic computation at the word-sense level, keywords and stop words are the main influencing factors. For the stop words, a corpus backed by a large database, such as historical news, can be chosen; its advantage is a large vocabulary with no bias toward any field, so balanced word vectors suitable for any field can be trained. To address the problem that rare proper nouns appear too few times, the corpus in the domain knowledge base is used directly to train word vectors.
In a news corpus, because the content is mixed, each word cannot be characterized with a TF/IDF-like method, so the frequency of occurrence of each word is used instead. Potential stop words occur with high frequency, so the model performs spatial mapping with the following formula:
V_undomain(w) = exp(p(w)) × V_old(w)
where p(w) is the frequency of occurrence of the word. The above formula is essentially a vector expansion. Using the property
(exp(x))' = exp(x),
the magnitude of the expansion after mapping is tied to the frequency of the vector, so the distance relations between word vectors of different frequencies can be better distinguished in the mapped space. Points with higher frequency are expanded more and points with lower frequency are expanded less, which reduces to a certain extent the influence of some potential stop words in RWMD: this spatial mapping lowers, on the one hand, the values assigned to the potential stop words in the transition probability matrix T_{i,j}, and on the other hand increases the cost between keywords and potential stop words, so that more attention is focused on the keywords and the effect of RWMD is indirectly improved.
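For reference, below is a sketch of the RWMD (relaxed word mover's distance) computation as it is commonly defined: each word of one text is matched, weighted by its normalized frequency, to the nearest word of the other text by Euclidean distance, and the larger of the two directional costs is taken. Here the vectors would be the optimized demand word vectors V_new(w); the code and names are illustrative.

```python
from collections import Counter
import numpy as np

def rwmd(tokens_a, tokens_b, vectors):
    """Relaxed word mover's distance between two tokenized short texts."""
    def nbow(tokens):
        counts = Counter(t for t in tokens if t in vectors)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def one_side(weights, other):
        # each word travels entirely to its nearest counterpart (minimum-cost match)
        cost = 0.0
        for w, d in weights.items():
            cost += d * min(np.linalg.norm(vectors[w] - vectors[o]) for o in other)
        return cost

    wa, wb = nbow(tokens_a), nbow(tokens_b)
    if not wa or not wb:
        return float("inf")   # no comparable words in one of the texts
    return max(one_side(wa, wb), one_side(wb, wa))
```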
Although potential stop words are weakened in the large-scale corpus, the attention given to the keywords of certain fields in the expert system is still far from sufficient; even in a large-scale corpus, these domain keywords are classified as UNKNOWN, which introduces a considerable error into the word-to-word Euclidean distance calculation in RWMD. For this problem the model trains two sets of word vectors: one trained on a large domain-free corpus, and one obtained by Word2Vec training on the knowledge base of the target domain. In actual use the two can be fused after spatial mapping, so that keywords receive a certain weight. Since the domain corpus consists of a knowledge base, the mapping weight of each word is changed from its frequency of occurrence to its IDF value, so that the importance of the word itself affects, to some extent, its position in the new vector space. However, another problem arises: the weight of certain keywords becomes too high. Because the RWMD algorithm matches each word of one text against the word with the minimum cost in the other text, excessive attention to a single keyword also negatively affects the overall similarity calculation. Therefore, the IDF value of each word is reduced by the IDF_weight value of the whole corpus, where IDF_weight is the value midway between the average IDF value and the median IDF value of the corpus, so that the overall IDF values are smoother. The calculation formula is as follows,
IDF_weight = (IDF_mo + IDF_avg) / 2
With these two sets of weights, the in-domain Word2Vec can be used to influence the overall Word2Vec, yielding a richer in-domain word-vector space so that the RWMD result between similar sentences is better than the non-optimized RWMD result. The fusion formula is as follows,
V_new(w) = ( V_domain(w) + V_undomain(w) ) / 2, if w ∈ C_d and w ∈ C_ud
V_new(w) = V_domain(w), if w ∈ C_d and w ∉ C_ud
V_new(w) = V_undomain(w), if w ∉ C_d and w ∈ C_ud
where C_d denotes the domain corpus, V_domain(w) the weight of a keyword in the domain thesaurus, C_ud the domain-free corpus, and V_undomain(w) the weight of a keyword in the domain-free corpus. Through this mapping, the cost value of the distance between related questions computed with RWMD can be reduced while the distance between unrelated questions is increased.
When comparing short-text similarity, attention must not rest on only a few words: by the nature of RWMD, the minimum cost can concentrate on a single word, so the other additional information is lost and the result deviates heavily. Removing the average from the IDF reduces, to a certain extent, excessively high IDF values, so that secondary keywords can also express their importance in the RWMD algorithm and large deviations in the final result caused by over-attention to the primary keywords are avoided to some degree. After the in-domain word vectors are obtained, they are fused with the domain-free word vectors; concretely, the two are added and the midpoint value is taken. The generated word vector is thus influenced not only by the original word vectors but also by the in-domain word vectors, so the newly generated domain word vectors are better suited to a question-answering system in a specific vertical field.
The method is plug-and-play and can be applied to an expert system with a knowledge base in any field.
2. Introduction to the domain word vector fusion ordering method
The document-list method can, to a certain extent, solve the problems caused by the single-document method and the document-pair method, but because every training iteration of the document-list method must traverse each query, the time complexity of the conversion to a probabilistic model is high, so it is not suitable for an expert system that requires fast responses and high accuracy for question answering in a specific field. Moreover, the choice of K in Top-K may cause the correct answer not to be returned to the user.
Aiming at these problems of the document-list model, a single-document method and a document-pair method can be fused to emulate the document-list method, with corresponding improvements and optimizations for the practical application of the expert system; this reduces the time complexity without affecting the overall effect and increases the probability of returning the expected answer to the user. The overall architecture can be divided into two parts: a document classification part and a document-pair ranking part.
The schematic diagram of the fusion ranking model is shown in FIG. 2; the overall architecture is divided into the document classification part and the document-pair ranking part.
In the first part, single documents are classified to obtain a coarse ordering. In this way the sequence itself no longer has to be considered: the candidates that the single-document model judges highly likely are extracted as a whole, without regard to their internal order. After this processing the more relevant questions have been coarsely extracted, and the result is sent to the second part for document-pair ranking. The coarse extraction of the first-part single-document model is effectively a Top-K-like process that performs an initial selection of entries. Because the number of similar questions in the domain knowledge base is small, a great number of irrelevant questions can be screened out to a large extent.
The second part is the document-pair ranking part. Given the coarse, unordered set of questions relatively related to the query produced by the first part's single-document classification, the document-pair method sorts this result in detail: the candidate similar questions and the original question are converted into vectors and compared pairwise, and a tree-style ranking method can be used for the comparison, finally yielding the most similar question. Starting from the coarse similarity sequence given by the first part, the document-pair part performs a fine extraction on it, extracting the result with the highest similarity coefficient and returning it to the questioner.
During training, the two parts can be trained separately and in parallel, i.e. the conversion from the ranked sequence to vectors is split, so the overall time complexity is only the sum of the two parts.
For the overall model, the first part can be regarded as preprocessing for the second part, and the two are relatively independent by definition. The main function of the single-document model is to compute and classify the similarity between the query and the candidate similar questions, while the document-pair model screens the similar questions once more to obtain the most similar texts.
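At query time the two parts can be chained as sketched below, reusing the pointwise_model and pairwise_model from the earlier sketch; features(question, candidate), the shortlist size and the sign convention of the pairwise output are assumptions.

```python
import torch

def answer(question, candidates, features, pointwise_model, pairwise_model, top_k=10):
    """Two-stage retrieval: coarse classification with the single-document model,
    then pairwise ranking of the shortlist with the document-pair model."""
    # Part 1: coarse extraction, keeping the candidates the pointwise model scores highest.
    scores = [pointwise_model(torch.tensor(features(question, c))).item() for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    shortlist = [candidates[i] for i in order[:top_k]]

    # Part 2: fine ranking, comparing shortlisted candidates pairwise and keeping the winner.
    best = shortlist[0]
    for c in shortlist[1:]:
        pair = torch.tensor(features(question, best) + features(question, c))
        # assumed convention: a negative score means the second candidate is more similar
        if pairwise_model(pair).item() < 0:
            best = c
    return best
```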
Finally, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. A method for optimizing domain word vectors, characterized by comprising the following steps: S11, training domain-free word vectors; S12, training domain word vectors and obtaining the demand word vectors;
S11 comprises the following specific steps:
S111, cleaning the domain-free corpus by removing emoticons and unrecognizable garbled characters, and performing word segmentation on the domain-free corpus;
S112, training the whole corpus with a Word2Vec model to obtain the initial word vectors V_old(w);
S113, setting a weight for each word according to its frequency p(w) in the domain-free corpus, and calculating the domain-free word vectors of the domain-free corpus according to the following formula:
V_undomain(w) = exp(p(w)) × V_old(w)
where V_undomain(w) denotes the domain-free word vector and p(w) is the frequency of the word in the corpus;
S12 comprises the following specific steps:
S121, cleaning the domain corpus by removing emoticons and unrecognizable garbled characters, and segmenting the domain corpus with the LTP word segmentation model;
S122, calculating the IDF(w) value of each word, where IDF(w) is determined by the probability of the word appearing in the domain corpus, and calculating the IDF_weight value: let the median of the IDF values of all words in the domain corpus be IDF_mo and their mean be IDF_avg; then
IDF_weight = (IDF_mo + IDF_avg) / 2
S123, training word vectors on the domain corpus: Skip-gram is adopted and compared against CBOW, negative-sampling optimization is used with the number of negative samples determined by the specific scenario and test results, downsampling is adopted during model training, and the window size is determined by the specific scenario, yielding the domain word vectors V_old(w)';
S124, performing spatial mapping according to the frequency p(w)' with which each word appears in the domain corpus to obtain V_olddomain(w), the calculation formula being:
V_olddomain(w) = exp(p(w)') × V_old(w)'
S125, fusing the domain word vectors and the domain-free word vectors of the domain corpus to obtain the demand word vectors V_new(w).
2. The method as claimed in claim 1, characterized in that step S13 is performed after step S12, step S13 correcting the demand word vectors, wherein step S13 specifically comprises the following steps:
S131, using similar question pairs from the professional field, performing word segmentation on each text, and for each word looking up the domain word vector V_old(w)' trained in step S123 and the demand word vector V_new(w) obtained in step S125; the similarity ρ(w) between V_old(w)' and V_new(w) is computed with the RWMD algorithm, whether the similarity ρ(w) of each word is qualified is judged, and the failure rate λ over the similarities ρ(w) of all words is counted;
S132, judging whether the failure rate λ is less than or equal to the threshold; if not, performing step S133; if yes, performing step S134;
S133, adjusting the number of negative samples and the size of the downsampling window in step S123, repeating steps S123 to S125 to obtain the demand word vectors V_new(w) again, and then returning to step S131;
S134, ending the calculation.
3. The method of claim 2, wherein the step S125 comprises the following steps:
S1251, calculating the smoothed domain word vector V_domain(w):
V_domain(w) = (IDF(w) − IDF_weight) × V_olddomain(w)
S1252, let w denote the current word, C_d the domain corpus, and C_ud the domain-free corpus; when w ∈ C_d and w ∈ C_ud, the first fusion mode is executed; when w ∈ C_d and w ∉ C_ud, the second fusion mode is executed; when w ∉ C_d and w ∈ C_ud, the third fusion mode is executed;
The first fusion mode obtains V_new(w) by:
V_new(w) = ( V_domain(w) + V_undomain(w) ) / 2
The second fusion mode obtains V_new(w) by: V_new(w) = V_domain(w);
The third fusion mode obtains V_new(w) by: V_new(w) = V_undomain(w).
CN201811257850.4A 2018-10-26 2018-10-26 Optimization method of domain word vectors and fusion ordering method based on optimization method Active CN109359302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811257850.4A CN109359302B (en) 2018-10-26 2018-10-26 Optimization method of domain word vectors and fusion ordering method based on optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811257850.4A CN109359302B (en) 2018-10-26 2018-10-26 Optimization method of domain word vectors and fusion ordering method based on optimization method

Publications (2)

Publication Number Publication Date
CN109359302A CN109359302A (en) 2019-02-19
CN109359302B true CN109359302B (en) 2023-04-18

Family

ID=65347004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811257850.4A Active CN109359302B (en) 2018-10-26 2018-10-26 Optimization method of domain word vectors and fusion ordering method based on optimization method

Country Status (1)

Country Link
CN (1) CN109359302B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960724B (en) * 2019-03-13 2021-06-04 北京工业大学 Text summarization method based on TF-IDF
CN110083834B (en) * 2019-04-24 2023-05-09 北京百度网讯科技有限公司 Semantic matching model training method and device, electronic equipment and storage medium
CN111814473B (en) * 2020-09-11 2020-12-22 平安国际智慧城市科技股份有限公司 Word vector increment method and device for specific field and storage medium
CN112185359B (en) * 2020-09-28 2023-08-29 广州秉理科技有限公司 Word coverage rate-based voice training set minimization method
CN112836509A (en) * 2021-02-22 2021-05-25 西安交通大学 Expert system knowledge base construction method and system
CN113221531B (en) * 2021-06-04 2024-08-06 西安邮电大学 Semantic matching method for multi-model dynamic collaboration
CN113449074A (en) * 2021-06-22 2021-09-28 重庆长安汽车股份有限公司 Sentence vector similarity matching optimization method and device containing proper nouns and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101373532A (en) * 2008-07-10 2009-02-25 昆明理工大学 FAQ Chinese request-answering system implementing method in tourism field

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPR155800A0 (en) * 2000-11-17 2000-12-14 Agriculture Victoria Services Pty Ltd Method of enhancing virus-resistance in plants and producing virus-immune plants
US9342991B2 (en) * 2013-03-14 2016-05-17 Canon Kabushiki Kaisha Systems and methods for generating a high-level visual vocabulary
CN104899322B (en) * 2015-06-18 2021-09-17 百度在线网络技术(北京)有限公司 Search engine and implementation method thereof
CA3009758A1 (en) * 2015-12-29 2017-07-06 Mz Ip Holdings, Llc Systems and methods for suggesting emoji
CN106649561B (en) * 2016-11-10 2020-05-26 复旦大学 Intelligent question-answering system for tax consultation service
US10922621B2 (en) * 2016-11-11 2021-02-16 International Business Machines Corporation Facilitating mapping of control policies to regulatory documents
CN106777043A (en) * 2016-12-09 2017-05-31 宁波大学 A kind of academic resources acquisition methods based on LDA
CN106777395A (en) * 2017-03-01 2017-05-31 北京航空航天大学 A kind of topic based on community's text data finds system
CN106997345A (en) * 2017-03-31 2017-08-01 成都数联铭品科技有限公司 The keyword abstraction method of word-based vector sum word statistical information
CN107016439A (en) * 2017-05-09 2017-08-04 重庆大学 Based on CR2The image text dual coding mechanism implementation model of neutral net
CN207354333U (en) * 2017-07-31 2018-05-11 曾一菲 Instrument share management system based on cloud platform
CN108491480B (en) * 2018-03-12 2021-05-11 义语智能科技(上海)有限公司 Rumor detection method and apparatus
CN108363816A (en) * 2018-03-21 2018-08-03 北京理工大学 Open entity relation extraction method based on sentence justice structural model
CN108536781B (en) * 2018-03-29 2022-04-01 武汉大学 Social network emotion focus mining method and system
CN108595440B (en) * 2018-05-11 2022-03-18 厦门市美亚柏科信息股份有限公司 Short text content classification method and system
CN108664637B (en) * 2018-05-15 2021-10-08 惠龙易通国际物流股份有限公司 Retrieval method and system
CN108710611B (en) * 2018-05-17 2021-08-03 南京大学 Short text topic model generation method based on word network and word vector

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101373532A (en) * 2008-07-10 2009-02-25 昆明理工大学 FAQ Chinese request-answering system implementing method in tourism field

Also Published As

Publication number Publication date
CN109359302A (en) 2019-02-19

Similar Documents

Publication Publication Date Title
CN109359302B (en) Optimization method of domain word vectors and fusion ordering method based on optimization method
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
WO2021093755A1 (en) Matching method and apparatus for questions, and reply method and apparatus for questions
CN109271506A (en) A kind of construction method of the field of power communication knowledge mapping question answering system based on deep learning
CN111581519A (en) Item recommendation method and system based on user intention in session
CN101561805A (en) Document classifier generation method and system
CN109829045A (en) A kind of answering method and device
CN112463944B (en) Search type intelligent question-answering method and device based on multi-model fusion
CN111651558A (en) Hyperspherical surface cooperative measurement recommendation device and method based on pre-training semantic model
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN110956044A (en) Attention mechanism-based case input recognition and classification method for judicial scenes
CN114357120A (en) Non-supervision type retrieval method, system and medium based on FAQ
CN115309872B (en) Multi-model entropy weighted retrieval method and system based on Kmeans recall
CN112015760B (en) Automatic question-answering method and device based on candidate answer set reordering and storage medium
CN114332519A (en) Image description generation method based on external triple and abstract relation
CN112632250A (en) Question and answer method and system under multi-document scene
CN114637760A (en) Intelligent question and answer method and system
CN110851584A (en) Accurate recommendation system and method for legal provision
CN115146021A (en) Training method and device for text retrieval matching model, electronic equipment and medium
CN112084307A (en) Data processing method and device, server and computer readable storage medium
CN116992007A (en) Limiting question-answering system based on question intention understanding
CN117807232A (en) Commodity classification method, commodity classification model construction method and device
CN113836269B (en) Chapter-level core event extraction method based on question-answering system
CN117171428B (en) Method for improving accuracy of search and recommendation results
CN113535928A (en) Service discovery method and system of long-term and short-term memory network based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant