CN111401043A - Method, device and equipment for mining similar meaning words and storage medium - Google Patents
- Publication number
- CN111401043A (application number CN202010149502.6A)
- Authority
- CN
- China
- Prior art keywords
- word
- vector
- words
- target
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to the technical field of artificial intelligence, and discloses a method, a device, equipment and a storage medium for near-synonym mining, which are used for improving the accuracy of near-synonym prediction. The method comprises the following steps: acquiring a target word in a question sentence input by a user; preprocessing the target word with one-hot encoding to obtain the one-hot code corresponding to the target word; inputting the one-hot code corresponding to the target word into a preset continuous bag-of-words model for vector representation and intermediate-word prediction to obtain the corresponding intermediate word; representing the intermediate word as a vector with a word embedding model to obtain the corresponding intermediate word vector; calculating the similarity between the intermediate word vector and the word vector corresponding to each word in a preset target-field vocabulary; and sorting the words in the preset target-field vocabulary in descending order of the calculated similarity to obtain the corresponding near-synonym dictionary.
Description
Technical Field
The invention relates to the technical field of artificial intelligence semantic analysis, and in particular to a method, a device, equipment and a storage medium for near-synonym mining.
Background
The existing word2vec algorithm predicts the intermediate word vector by summing the word vectors of the context words, without considering the effect that the order of the context words has on the prediction of the intermediate word vector; as a result, the part of speech of near-synonyms is not taken into account when mining near-synonyms with the existing word2vec algorithm.
Disclosure of Invention
The invention mainly aims to solve the technical problem that near-synonym mining with the existing word2vec algorithm does not consider the part of speech of near-synonyms, resulting in insufficient prediction accuracy.
In order to achieve the above object, a first aspect of the present invention provides a near-synonym mining method, including:
acquiring a target word in a question sentence input by a user;
preprocessing the target word with one-hot encoding to obtain the one-hot code corresponding to the target word;
inputting the one-hot code corresponding to the target word into a preset continuous bag-of-words model for vector representation and intermediate-word prediction to obtain the corresponding intermediate word;
representing the intermediate word as a vector with a word embedding model to obtain the corresponding intermediate word vector;
calculating the similarity between the intermediate word vector and the word vector corresponding to each word in a preset target-field vocabulary, wherein the preset target-field vocabulary is a dictionary of the retrieval language of the target field;
and sorting the words in the preset target-field vocabulary in descending order of the calculated similarity to obtain the corresponding near-synonym dictionary.
Optionally, in another implementation manner of the first aspect of the present invention, before the acquiring of the target word in the question sentence input by the user, the method includes:
acquiring questions to be trained from a question bank of the target field, and performing word segmentation on each question to be trained to obtain the corresponding context words, wherein the questions to be trained include words of different parts of speech;
preprocessing the context words with one-hot encoding to obtain the one-hot code corresponding to each context word;
multiplying the one-hot code corresponding to each context word by an input weight matrix to obtain the corresponding word vector;
concatenating the word vectors corresponding to the context words to obtain the concatenated vector corresponding to the context words;
inputting the concatenated vector corresponding to the context words into a continuous bag-of-words model and outputting a one-hot coded intermediate word;
multiplying the one-hot code corresponding to the intermediate word by an output weight matrix to obtain the corresponding intermediate word vector;
processing the intermediate word vector with an activation function to obtain the corresponding probability distribution, and taking the word corresponding to the maximum probability as the predicted intermediate word;
and calculating the error between the predicted intermediate word and the one-hot coded intermediate word with a preset loss function until the corresponding minimum function value is obtained, thereby obtaining the corresponding preset continuous bag-of-words model.
Optionally, in another implementation manner of the first aspect of the present invention, after the calculating of the error between the predicted intermediate word and the one-hot coded intermediate word with the preset loss function until the corresponding minimum function value is obtained to obtain the corresponding preset continuous bag-of-words model, the method further includes:
updating the input weight matrix and the output weight matrix with a gradient descent algorithm according to the function value calculated by the preset loss function, to obtain an updated input weight matrix and an updated output weight matrix.
Optionally, in another implementation manner of the first aspect of the present invention, the preset loss function is:
wherein L(θ) denotes the loss function value, s denotes the s-th sentence, and T_j denotes the number of target words in the j-th sentence.
Optionally, in another implementation manner of the first aspect of the present invention, the acquiring of a target word in a question sentence input by a user specifically includes:
removing punctuation marks other than the set punctuation marks from the question sentence input by the user, and performing word segmentation on the question sentence to obtain the corresponding target words, wherein a target word is a word whose near-synonyms are to be mined; the set punctuation marks include at least one of punctuation marks expressing the tone of the question sentence and sentence-ending punctuation marks.
Optionally, in another implementation manner of the first aspect of the present invention, the similarity is at least one of a cosine distance, a Manhattan distance, a correlation coefficient, and a Mahalanobis distance.
A second aspect of the present invention provides a near-synonym mining device, including:
the target word acquisition module, used for acquiring a target word in a question sentence input by a user;
the one-hot code acquisition module, used for preprocessing the target word with one-hot encoding to obtain the one-hot code corresponding to the target word;
the intermediate word acquisition module, used for inputting the one-hot code corresponding to the target word into a preset continuous bag-of-words model for vector representation and intermediate-word prediction to obtain the corresponding intermediate word;
the intermediate word vector acquisition module, used for representing the intermediate word as a vector with a word embedding model to obtain the corresponding intermediate word vector;
the similarity calculation module, used for calculating the similarity between the intermediate word vector and the word vector corresponding to each word in a preset target-field vocabulary, wherein the preset target-field vocabulary is a dictionary of the retrieval language of the target field;
and the near-synonym dictionary acquisition module, used for sorting the words in the preset target-field vocabulary in descending order of the calculated similarity to obtain the corresponding near-synonym dictionary.
Optionally, in another implementation manner of the second aspect of the present invention, the near-synonym mining device further includes:
the context word acquisition module, used for acquiring the questions to be trained from the question bank of the target field and performing word segmentation on each question to be trained to obtain the corresponding context words, wherein the questions to be trained include words of different parts of speech;
the context-word one-hot code acquisition module, used for preprocessing the context words with one-hot encoding to obtain the one-hot code corresponding to each context word;
the first word vector acquisition module, used for multiplying the one-hot code corresponding to each context word by an input weight matrix to obtain the corresponding word vector;
the concatenated vector acquisition module, used for concatenating the word vectors corresponding to the context words to obtain the concatenated vector corresponding to the context words;
the one-hot coded intermediate word acquisition module, used for inputting the concatenated vector corresponding to the context words into a continuous bag-of-words model and outputting a one-hot coded intermediate word;
the second word vector acquisition module, used for multiplying the one-hot code corresponding to the intermediate word by the output weight matrix to obtain the corresponding intermediate word vector;
the predicted intermediate word acquisition module, used for processing the intermediate word vector with an activation function to obtain the corresponding probability distribution and taking the word corresponding to the maximum probability as the predicted intermediate word;
and the preset continuous bag-of-words model acquisition module, used for calculating the error between the predicted intermediate word and the one-hot coded intermediate word with a preset loss function until the corresponding minimum function value is obtained, thereby obtaining the corresponding preset continuous bag-of-words model.
Optionally, in another implementation manner of the second aspect of the present invention, the model training module further includes:
the matrix updating unit, used for updating the input weight matrix and the output weight matrix with a gradient descent algorithm according to the function value calculated by the preset loss function, to obtain an updated input weight matrix and an updated output weight matrix.
Optionally, in another implementation manner of the second aspect of the present invention, in the near-synonym mining device, the preset loss function is:
wherein L(θ) denotes the loss function value, s denotes the s-th sentence, and T_j denotes the number of target words in the j-th sentence.
Optionally, in another implementation manner of the second aspect of the present invention, the target word acquisition module is specifically used for:
removing punctuation marks other than the set punctuation marks from the question sentence input by the user, and performing word segmentation on the question sentence to obtain the corresponding target words, wherein a target word is a word whose near-synonyms are to be mined; the set punctuation marks include at least one of punctuation marks expressing the tone of the question sentence and sentence-ending punctuation marks.
Optionally, in another implementation manner of the second aspect of the present invention, in the near-synonym mining device, the similarity is at least one of a cosine distance, a Manhattan distance, a correlation coefficient, and a Mahalanobis distance.
A third aspect of the present invention provides a near-synonym mining apparatus, including: a memory having instructions stored therein and at least one processor, the memory and the at least one processor being interconnected by a line; the at least one processor invokes the instructions in the memory to cause the near-synonym mining apparatus to perform the method of the first aspect described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the method of the first aspect described above.
According to the technical solution provided by the invention, a target word in a question sentence input by a user is acquired; the target word is preprocessed with one-hot encoding to obtain the corresponding one-hot code; the one-hot code is input into a preset continuous bag-of-words model for vector representation and intermediate-word prediction to obtain the corresponding intermediate word; the intermediate word is represented as a vector with a word embedding model to obtain the corresponding intermediate word vector; the similarity between the intermediate word vector and the word vector corresponding to each word in a preset target-field vocabulary is calculated, the preset target-field vocabulary being a dictionary of the retrieval language of the target field; and the words in the preset target-field vocabulary are sorted in descending order of the calculated similarity to obtain the corresponding near-synonym dictionary. Addressing the problem that the existing word2vec algorithm does not consider the part of speech of near-synonyms when mining them, the embodiment of the invention concatenates the context word vectors and inputs the concatenated vector into a continuous bag-of-words model to predict the probability of the intermediate word, thereby training an improved continuous bag-of-words model; the target word of the user's question sentence is then input into the trained continuous bag-of-words model for vector representation and intermediate-word prediction, the similarity between the resulting intermediate word vector and the word vectors of the words in the vocabulary is calculated, and the words in the vocabulary are reordered according to the similarity result, thereby obtaining the near-synonym dictionary corresponding to the target word in the user's question sentence.
Drawings
FIG. 1 is a diagram of an embodiment of the near-synonym mining method according to an embodiment of the present invention;
FIG. 2 is a diagram of another embodiment of the near-synonym mining method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of the near-synonym mining device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of the near-synonym mining device according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of the near-synonym mining apparatus according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a method, a device, equipment and a storage medium for near-synonym mining, which are used for improving the accuracy of near-synonym prediction.
In order to enable those skilled in the art to better understand the solution of the invention, the embodiments of the invention will be described below in conjunction with the accompanying drawings.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, the detailed flow of an embodiment of the present invention is described below. Referring to FIG. 1, an embodiment of the near-synonym mining method according to an embodiment of the present invention includes:
101. Acquire a target word in the question sentence input by the user.
Specifically, the server acquires a target word in the question sentence input by the user. Further, the server may remove the punctuation marks other than the set punctuation marks from the question sentence and perform word segmentation on it to obtain the corresponding target words, where a target word is a word whose near-synonyms are to be mined. The set punctuation marks include at least one of punctuation marks expressing the tone of the question sentence and sentence-ending punctuation marks. In a specific implementation, the user's question may be an insurance question, so the target field of the question is the insurance field. For example, for a user question ending in a question mark, the question mark expressing the interrogative tone is retained, the remaining punctuation is removed, and the sentence is segmented into words that serve as target words. This step preprocesses the question sentence: removing unnecessary punctuation while keeping the punctuation that can express a certain tone improves the consideration given to the part of speech of near-synonyms when the preset continuous bag-of-words model is later applied to the data, and the word segmentation yields words that carry part-of-speech information.
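As an illustrative sketch only, and not part of the patent text, the preprocessing above might be implemented as follows in Python; the jieba segmenter and the specific set of retained punctuation marks are assumptions:

```python
import jieba  # assumed Chinese word segmenter; any segmenter would do

# Punctuation retained because it can express the tone of the question or
# mark the end of the sentence (an assumed, illustrative set).
KEPT_PUNCTUATION = {"?", "？", "!", "！", "。"}

def extract_target_words(question: str) -> list:
    """Drop all punctuation except the kept set, then segment into words."""
    cleaned = "".join(
        ch for ch in question
        if ch.isalnum() or ch.isspace() or ch in KEPT_PUNCTUATION
    )
    # Each non-empty token is a candidate target word for near-synonym mining.
    return [tok for tok in jieba.lcut(cleaned) if tok.strip()]
```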
102. Preprocess the target word with one-hot encoding to obtain the one-hot code corresponding to the target word.
Specifically, the server preprocesses the target word with one-hot encoding to obtain the one-hot code corresponding to the target word. Representing the acquired target word by its one-hot code facilitates the next step of inputting it into the preset continuous bag-of-words model for prediction; besides preprocessing the data format, this also, to a certain extent, expands the features of the target word.
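A minimal sketch of the one-hot encoding, assuming a fixed vocabulary that maps each word to an index (the toy vocabulary below is invented for illustration):

```python
import numpy as np

def one_hot(word: str, vocab: dict) -> np.ndarray:
    """Return the one-hot vector of `word` over a fixed vocabulary."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[vocab[word]] = 1.0
    return vec

vocab = {"保险": 0, "理赔": 1, "保费": 2}  # toy, assumed vocabulary
x = one_hot("理赔", vocab)                 # -> array([0., 1., 0.], dtype=float32)
```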
103. Input the one-hot code corresponding to the target word into the preset continuous bag-of-words model for vector representation and intermediate-word prediction to obtain the corresponding intermediate word.
Specifically, the server inputs the obtained one-hot code into the preset continuous bag-of-words model for vector representation and intermediate-word prediction, obtaining the corresponding intermediate word. In the invention, the one-hot code is input into the preset continuous bag-of-words model to predict the intermediate word.
104. Represent the intermediate word as a vector with the word embedding model to obtain the corresponding intermediate word vector.
Specifically, the server represents the intermediate word as a vector with a word embedding model to obtain the corresponding intermediate word vector. Since the output of the continuous bag-of-words model is still a one-hot coded predicted intermediate word, the invention further converts the one-hot coded predicted intermediate word into its corresponding intermediate word vector through the word embedding technique.
105. Calculate the similarity between the intermediate word vector and the word vector corresponding to each word in the preset target-field vocabulary, where the preset target-field vocabulary is a dictionary of the retrieval language of the target field.
Specifically, the server calculates the similarity between the intermediate word vector and the word vector corresponding to each word in the preset target-field vocabulary; the preset target-field vocabulary is a dictionary of the retrieval language of the target field, such as an insurance vocabulary used in the insurance field in a specific implementation. Further, the similarity is at least one of a cosine distance, a Manhattan distance, a correlation coefficient, and a Mahalanobis distance. Taking the cosine distance as an example, the cosine distance between the intermediate word vector and the word vector corresponding to each word in the preset target-field vocabulary is calculated. After the predicted intermediate word has been represented as a vector with the word embedding model, the similarity between the intermediate word vector and the word vector of each word in the preset target-field vocabulary is calculated, and the words in the vocabulary are reordered according to the calculated similarity, thereby obtaining the near-synonym dictionary corresponding to the input target word.
106. Sort the words in the preset target-field vocabulary in descending order of the calculated similarity to obtain the corresponding near-synonym dictionary.
Specifically, the server sorts the words in the preset target-field vocabulary in descending order of the calculated similarity to obtain the corresponding near-synonym dictionary. Taking the cosine distance as an example, after calculating the cosine distance between the intermediate word vector and the word vector of each word in the preset target-field vocabulary, the server sorts the words in descending order of their cosine distance, reordering the vocabulary to obtain the near-synonym dictionary corresponding to the input target word and completing the near-synonym mining for the target word.
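A sketch of steps 105 and 106 under the cosine-similarity choice; the function and variable names are assumptions for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_near_synonyms(center_vec, field_vectors):
    """Sort the target-field vocabulary by descending similarity.

    `field_vectors` maps each word of the vocabulary to its word vector;
    the returned list is the near-synonym dictionary entry for one target word.
    """
    scored = [(word, cosine_similarity(center_vec, vec))
              for word, vec in field_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```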
As can be seen from the above, this method embodiment inputs the target word of the user's question sentence into the trained preset continuous bag-of-words model for vector representation and intermediate-word prediction, calculates the similarity between the resulting intermediate word vector and the word vectors of the words in the vocabulary, and reorders the words in the vocabulary according to the similarity result, thereby obtaining the near-synonym dictionary corresponding to the target word in the user's question sentence. When the preset continuous bag-of-words model is trained and applied to data, the concatenated word vector is projected, through one feed-forward neural network layer, back to the same dimension as the initialized input word vectors, and this vector is used to predict the probability of occurrence of the intermediate word.
With this near-synonym mining method, the part of speech of the mined near-synonyms is consistent with the part of speech of the target word in the user's question sentence, which improves the semantic matching between the user's question sentence and the relevant standard questions in the question bank and thus improves the accuracy of answering the user's question.
The method therefore mines the near-synonyms of the user's question sentences fully automatically, so that no manual judgment of whether words are similar is needed. The mining also scales: if the target-field vocabulary contains 300,000 words and word segmentation of the user's question yields 5 target words (that is, n = 5 in the top-n word statistics), a near-synonym dictionary of 1.5 million entries can be mined; the size of the near-synonym dictionary obtained by the invention can thus be defined freely, and near-synonym mining at scale is achieved.
Further, referring to FIG. 2, another embodiment of the near-synonym mining method according to an embodiment of the present invention includes:
201. Acquire the questions to be trained from the question bank of the target field, and perform word segmentation on each question to be trained to obtain the corresponding context words, where the questions to be trained include words of different parts of speech.
Specifically, some standard questions are extracted from the question bank or question-answer bank of the target field of the user's question sentences as material for model training. In a specific implementation, the user's questions may be insurance questions, so the target field of the user's question sentences is the insurance field.
202. Preprocess the context words with one-hot encoding to obtain the one-hot code corresponding to each context word.
203. Multiply the one-hot code corresponding to each context word by the input weight matrix to obtain the corresponding word vector. Specifically, the corresponding word vector may be obtained by multiplying the one-hot code of each context word by the shared input weight matrix.
204. Concatenate the word vectors corresponding to the context words to obtain the concatenated vector corresponding to the context words.
205. Input the concatenated vector corresponding to the context words into the continuous bag-of-words model, and output a one-hot coded intermediate word.
Steps 202, 204, and 205 are similar to steps 101 to 103 and are not described again here.
206. Multiply the one-hot code corresponding to the intermediate word by the output weight matrix to obtain the corresponding intermediate word vector.
Specifically, the server multiplies the one-hot code corresponding to the intermediate word by the output weight matrix to obtain the corresponding intermediate word vector.
In a specific implementation, the questions to be trained are segmented and encoded to obtain the word vectors corresponding to context words that carry parts of speech; the obtained word vectors are then concatenated, and the concatenated vector is input into the continuous bag-of-words model for intermediate-word prediction. Since the continuous bag-of-words model still outputs a one-hot coded intermediate word, the server converts it into its corresponding intermediate word vector by multiplying the one-hot coded intermediate word by the output weight matrix.
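The following sketch illustrates steps 203 to 206 numerically; all sizes, initializations, and names (W_in, W_proj, W_out) are assumptions, and the projection layer implements the concatenation-based variant described in this embodiment rather than standard CBOW averaging:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, C = 5000, 128, 4  # vocab size, embedding dim, context words (assumed)

W_in = rng.normal(scale=0.01, size=(V, D))        # input weight matrix
W_proj = rng.normal(scale=0.01, size=(C * D, D))  # feed-forward layer: concat -> D
W_out = rng.normal(scale=0.01, size=(D, V))       # output weight matrix

def forward(context_ids):
    """Score every vocabulary word as a candidate intermediate word."""
    # Multiplying a one-hot code by W_in amounts to selecting a row of W_in.
    vectors = [W_in[i] for i in context_ids]
    # Concatenate rather than sum, so the order of the context words
    # (and hence part-of-speech information) is preserved.
    concat = np.concatenate(vectors)        # shape (C*D,)
    hidden = concat @ W_proj                # project back to dimension D
    scores = hidden @ W_out                 # unnormalized scores over V words
    return concat, hidden, scores
```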
207. Process the intermediate word vector with an activation function to obtain the corresponding probability distribution, and take the word corresponding to the maximum probability as the predicted intermediate word.
Specifically, the server processes the obtained intermediate word vector with the activation function to obtain the corresponding probability distribution, and takes the word corresponding to the maximum probability in that distribution as the predicted intermediate word. The probability of occurrence of each candidate intermediate word is thus predicted from the concatenated context word vectors, and the word with the maximum probability is selected as the predicted intermediate word; the activation function used in the method is not specifically limited.
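Continuing the sketch above with softmax as the (assumed) activation function; the context word ids are invented for illustration:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax turning scores into a probability distribution."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

concat, hidden, scores = forward([10, 42, 7, 99])  # assumed context word ids
probs = softmax(scores)
predicted_id = int(np.argmax(probs))  # the word with the maximum probability
```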
208. Calculate the error between the predicted intermediate word and the one-hot coded intermediate word with a preset loss function until the corresponding minimum function value is obtained, thereby obtaining the corresponding preset continuous bag-of-words model.
Specifically, the server calculates the error between the predicted intermediate word and the one-hot coded intermediate word with the preset loss function until the corresponding minimum function value is obtained, thereby obtaining the corresponding preset continuous bag-of-words model. In a specific implementation, because the intermediate word predicted from the maximum probability value deviates from the one-hot coded intermediate word, the prediction must be compared against the one-hot code of the intermediate word, and the smaller the error the better. The preset loss function measures this error between the predicted intermediate word and the one-hot coded intermediate word; training of the continuous bag-of-words model stops once the corresponding minimum function value is reached, and the resulting model is the optimally trained model, that is, the preset continuous bag-of-words model required by the invention, through which each word of the user's question can be vector-represented and its near-synonyms predicted.
Further, the preset loss function is:
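The formula appears only as an image in the original filing; the following reconstruction is an assumption consistent with the symbols listed below and with the concatenation of context word vectors described in this embodiment, where [ · ; · ] denotes vector concatenation, c the context window, and v_w the input vector of word w (the filing's variable list writes T_j for the number of target words per sentence):

```latex
L(\theta) = -\sum_{s=1}^{S} \sum_{t=1}^{T_s}
  \log p\bigl(w_{s,t} \mid
  [\,v_{w_{s,t-c}};\,\ldots;\,v_{w_{s,t-1}};\,
     v_{w_{s,t+1}};\,\ldots;\,v_{w_{s,t+c}}\,]\bigr)
```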
wherein L(θ) denotes the loss function value, s denotes the s-th sentence, and T_j denotes the number of target words in the j-th sentence.
In a specific implementation, the word vectors corresponding to the context words are concatenated when predicting the probability of the intermediate word, and the concatenated word vector is then used for the prediction. The most distinctive feature of the preset loss function is therefore that the concatenated representation of the context appears in the formula. The concatenated word vector is projected, through one feed-forward neural network layer, back to the same dimension as the initialized word vectors, and this vector is used to predict the probability of occurrence of the intermediate word; because the concatenated word vector preserves the order in which the context words occur, it carries part-of-speech information, that is, the grammar of the language is considered, and part of speech is therefore also taken into account when predicting the intermediate word.
Furthermore, the server updates the input weight matrix and the output weight matrix with a gradient descent algorithm according to the function value calculated by the preset loss function, obtaining an updated input weight matrix and an updated output weight matrix. After model training finishes, the input weight matrix and the output weight matrix used for word embedding in the input and output layers of the preset continuous bag-of-words model need to be updated, that is, the weight matrices are corrected along the gradient direction. The function values calculated by the preset loss function are differentiated through the gradient descent algorithm to update the weight matrices, and stochastic gradient descent yields the updated input weight matrix and updated output weight matrix, which serve as the new weight matrices in the preset continuous bag-of-words model.
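Continuing the same sketch, and again as an assumption rather than the patent's code, one stochastic-gradient update derived from the softmax-with-cross-entropy gradient might look like this (the update of W_in is analogous and omitted for brevity):

```python
import numpy as np

learning_rate = 0.025  # assumed value

def sgd_step(probs, target_id, hidden, concat):
    """One stochastic-gradient update of W_out and W_proj."""
    global W_out, W_proj
    grad_scores = probs.copy()
    grad_scores[target_id] -= 1.0      # gradient of softmax + cross-entropy
    grad_hidden = W_out @ grad_scores  # backpropagate before W_out changes
    W_out -= learning_rate * np.outer(hidden, grad_scores)
    W_proj -= learning_rate * np.outer(concat, grad_hidden)
```

One such step would be run per training sample, looping over the question bank until the loss reaches its minimum.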
209. Acquire a target word in the question sentence input by the user.
210. Preprocess the target word with one-hot encoding to obtain the one-hot code corresponding to the target word.
211. Input the one-hot code corresponding to the target word into the preset continuous bag-of-words model for vector representation and intermediate-word prediction to obtain the corresponding intermediate word.
212. Represent the intermediate word as a vector with the word embedding model to obtain the corresponding intermediate word vector.
213. Calculate the similarity between the intermediate word vector and the word vector corresponding to each word in the preset target-field vocabulary, where the preset target-field vocabulary is a dictionary of the retrieval language of the target field.
214. Sort the words in the preset target-field vocabulary in descending order of the calculated similarity to obtain the corresponding near-synonym dictionary.
Specifically, for the detailed implementation of steps 209 to 214, refer to the description of steps 101 to 106; details are not repeated here.
Having described the near-synonym mining method of the embodiment of the present invention above, the near-synonym mining device of the embodiment of the present invention is described below with reference to FIG. 3. An embodiment of the near-synonym mining device includes:
the target word acquisition module 301, used for acquiring a target word in a question sentence input by a user;
the one-hot code acquisition module 302, used for preprocessing the target word with one-hot encoding to obtain the one-hot code corresponding to the target word;
the intermediate word acquisition module 303, used for inputting the one-hot code corresponding to the target word into a preset continuous bag-of-words model for vector representation and intermediate-word prediction to obtain the corresponding intermediate word;
the intermediate word vector acquisition module 304, used for representing the intermediate word as a vector with a word embedding model to obtain the corresponding intermediate word vector;
the similarity calculation module 305, used for calculating the similarity between the intermediate word vector and the word vector corresponding to each word in a preset target-field vocabulary, where the preset target-field vocabulary is a dictionary of the retrieval language of the target field;
and the near-synonym dictionary acquisition module 306, used for sorting the words in the preset target-field vocabulary in descending order of the calculated similarity to obtain the corresponding near-synonym dictionary.
Optionally, in another implementation manner of the near-synonym mining device of the present invention, as shown in FIG. 4, the near-synonym mining device includes:
the target word acquisition module 301, used for acquiring a target word in a question sentence input by a user;
the one-hot code acquisition module 302, used for preprocessing the target word with one-hot encoding to obtain the one-hot code corresponding to the target word;
the intermediate word acquisition module 303, used for inputting the one-hot code corresponding to the target word into a preset continuous bag-of-words model for vector representation and intermediate-word prediction to obtain the corresponding intermediate word;
the intermediate word vector acquisition module 304, used for representing the intermediate word as a vector with a word embedding model to obtain the corresponding intermediate word vector;
the similarity calculation module 305, used for calculating the similarity between the intermediate word vector and the word vector corresponding to each word in a preset target-field vocabulary, where the preset target-field vocabulary is a dictionary of the retrieval language of the target field;
the near-synonym dictionary acquisition module 306, used for sorting the words in the preset target-field vocabulary in descending order of the calculated similarity to obtain the corresponding near-synonym dictionary;
the context word acquisition module 307, used for acquiring the questions to be trained from the question bank of the target field and performing word segmentation on each question to be trained to obtain the corresponding context words, where the questions to be trained include words of different parts of speech;
the context-word one-hot code acquisition module 308, used for preprocessing the context words with one-hot encoding to obtain the one-hot code corresponding to each context word;
the first word vector acquisition module 309, used for multiplying the one-hot code corresponding to each context word by the input weight matrix to obtain the corresponding word vector;
the concatenated vector acquisition module 310, used for concatenating the word vectors corresponding to the context words to obtain the concatenated vector corresponding to the context words;
the one-hot coded intermediate word acquisition module 311, used for inputting the concatenated vector corresponding to the context words into the continuous bag-of-words model and outputting a one-hot coded intermediate word;
the second word vector acquisition module 312, used for multiplying the one-hot code corresponding to the intermediate word by the output weight matrix to obtain the corresponding intermediate word vector;
the predicted intermediate word acquisition module 313, used for processing the intermediate word vector with an activation function to obtain the corresponding probability distribution and taking the word corresponding to the maximum probability as the predicted intermediate word;
and the preset continuous bag-of-words model acquisition module 314, used for calculating the error between the predicted intermediate word and the one-hot coded intermediate word with a preset loss function until the corresponding minimum function value is obtained, thereby obtaining the corresponding preset continuous bag-of-words model.
Optionally, in another implementation manner of the near-synonym mining device of the present invention, the model training module further includes:
the matrix updating unit, used for updating the input weight matrix and the output weight matrix with a gradient descent algorithm according to the function value calculated by the preset loss function, to obtain an updated input weight matrix and an updated output weight matrix.
Optionally, in another implementation manner of the near-synonym mining device of the present invention, the preset loss function is:
wherein L(θ) denotes the loss function value, s denotes the s-th sentence, and T_j denotes the number of target words in the j-th sentence.
Optionally, in another implementation manner of the near-synonym mining device of the present invention, the target word acquisition module is specifically used for:
removing punctuation marks other than the set punctuation marks from the question sentence input by the user, and performing word segmentation on the question sentence to obtain the corresponding target words, where a target word is a word whose near-synonyms are to be mined; the set punctuation marks include at least one of punctuation marks expressing the tone of the question sentence and sentence-ending punctuation marks.
Optionally, in another implementation manner of the near-synonym mining device of the present invention, the similarity between word vectors is at least one of a cosine distance, a Manhattan distance, a correlation coefficient, and a Mahalanobis distance.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device or system type embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
FIGS. 3 and 4 describe the near-synonym mining device of the embodiment of the present invention in detail from the perspective of modular functional entities; the near-synonym mining apparatus of the embodiment of the present invention is described in detail below from the perspective of hardware processing.
FIG. 5 is a schematic structural diagram of a near-synonym mining apparatus according to an embodiment of the present invention. The near-synonym mining apparatus 500 may vary considerably in configuration or performance and may include one or more processors (CPUs) 501 (e.g., one or more processors), a memory 509, and one or more storage media 508 (e.g., one or more mass storage devices) storing an application 507 or data 506. The memory 509 and the storage medium 508 may be transient or persistent storage. The program stored on the storage medium 508 may include one or more modules (not shown), each of which may include a series of instruction operations. Further, the processor 501 may be configured to communicate with the storage medium 508 and execute the series of instruction operations in the storage medium 508 on the near-synonym mining apparatus 500.
The near-synonym mining apparatus 500 may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input-output interfaces 504, and/or one or more operating systems 505, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will understand that the configuration shown in FIG. 5 does not limit the near-synonym mining apparatus, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A near-synonym mining method, comprising:
acquiring a target word in a question sentence input by a user;
preprocessing the target word with one-hot encoding to obtain the one-hot code corresponding to the target word;
inputting the one-hot code corresponding to the target word into a preset continuous bag-of-words model for vector representation and intermediate-word prediction to obtain the corresponding intermediate word;
representing the intermediate word as a vector with a word embedding model to obtain the corresponding intermediate word vector;
calculating the similarity between the intermediate word vector and the word vector corresponding to each word in a preset target-field vocabulary, wherein the preset target-field vocabulary is a dictionary of the retrieval language of the target field;
and sorting the words in the preset target-field vocabulary in descending order of the calculated similarity to obtain the corresponding near-synonym dictionary.
2. The near-synonym mining method of claim 1, wherein before the acquiring of the target word in the question sentence input by the user, the method comprises:
acquiring questions to be trained from a question bank of the target field, and performing word segmentation on each question to be trained to obtain the corresponding context words, wherein the questions to be trained include words of different parts of speech;
preprocessing the context words with one-hot encoding to obtain the one-hot code corresponding to each context word;
multiplying the one-hot code corresponding to each context word by an input weight matrix to obtain the corresponding word vector;
concatenating the word vectors corresponding to the context words to obtain the concatenated vector corresponding to the context words;
inputting the concatenated vector corresponding to the context words into a continuous bag-of-words model and outputting a one-hot coded intermediate word;
multiplying the one-hot code corresponding to the intermediate word by an output weight matrix to obtain the corresponding intermediate word vector;
processing the intermediate word vector with an activation function to obtain the corresponding probability distribution, and taking the word corresponding to the maximum probability as the predicted intermediate word;
and calculating the error between the predicted intermediate word and the one-hot coded intermediate word with a preset loss function until the corresponding minimum function value is obtained, thereby obtaining the corresponding preset continuous bag-of-words model.
3. The near-synonym mining method of claim 2, wherein after the calculating of the error between the predicted intermediate word and the one-hot coded intermediate word with the preset loss function until the corresponding minimum function value is obtained, resulting in the corresponding preset continuous bag-of-words model, the method further comprises:
updating the input weight matrix and the output weight matrix with a gradient descent algorithm according to the function value calculated by the preset loss function, to obtain an updated input weight matrix and an updated output weight matrix.
4. The near-synonym mining method of claim 2, wherein the preset loss function is:
wherein L(θ) denotes the loss function value, s denotes the s-th sentence, and T_j denotes the number of target words in the j-th sentence.
5. The near-synonym mining method of claim 1, wherein the similarity is at least one of a cosine distance, a Manhattan distance, a correlation coefficient, and a Mahalanobis distance.
6. The near-synonym mining method of any one of claims 1 to 5, wherein the acquiring of target words in the question sentence input by the user specifically comprises:
removing punctuation marks other than the set punctuation marks from the question sentence input by the user, and performing word segmentation on the question sentence to obtain the corresponding target words, wherein a target word is a word whose near-synonyms are to be mined, and the set punctuation marks comprise at least one of punctuation marks expressing the tone of the question sentence and sentence-ending punctuation marks.
7. A near-synonym mining device, comprising:
a target word acquisition module, used for acquiring a target word in a question sentence input by a user;
a one-hot code acquisition module, used for preprocessing the target word with one-hot encoding to obtain the one-hot code corresponding to the target word;
an intermediate word acquisition module, used for inputting the one-hot code corresponding to the target word into a preset continuous bag-of-words model for vector representation and intermediate-word prediction to obtain the corresponding intermediate word;
an intermediate word vector acquisition module, used for representing the intermediate word as a vector with a word embedding model to obtain the corresponding intermediate word vector;
a similarity calculation module, used for calculating the similarity between the intermediate word vector and the word vector corresponding to each word in a preset target-field vocabulary, wherein the preset target-field vocabulary is a dictionary of the retrieval language of the target field;
and a near-synonym dictionary acquisition module, used for sorting the words in the preset target-field vocabulary in descending order of the calculated similarity to obtain the corresponding near-synonym dictionary.
8. The near-synonym mining device of claim 7, further comprising:
a context word acquisition module, used for acquiring the questions to be trained from the question bank of the target field and performing word segmentation on each question to be trained to obtain the corresponding context words, wherein the questions to be trained include words of different parts of speech;
a context-word one-hot code acquisition module, used for preprocessing the context words with one-hot encoding to obtain the one-hot code corresponding to each context word;
a first word vector acquisition module, used for multiplying the one-hot code corresponding to each context word by an input weight matrix to obtain the corresponding word vector;
a concatenated vector acquisition module, used for concatenating the word vectors corresponding to the context words to obtain the concatenated vector corresponding to the context words;
a one-hot coded intermediate word acquisition module, used for inputting the concatenated vector corresponding to the context words into a continuous bag-of-words model and outputting a one-hot coded intermediate word;
a second word vector acquisition module, used for multiplying the one-hot code corresponding to the intermediate word by the output weight matrix to obtain the corresponding intermediate word vector;
a predicted intermediate word acquisition module, used for processing the intermediate word vector with an activation function to obtain the corresponding probability distribution and taking the word corresponding to the maximum probability as the predicted intermediate word;
and a preset continuous bag-of-words model acquisition module, used for calculating the error between the predicted intermediate word and the one-hot coded intermediate word with a preset loss function until the corresponding minimum function value is obtained, thereby obtaining the corresponding preset continuous bag-of-words model.
9. A near-synonym mining apparatus, comprising: a memory having instructions stored therein and at least one processor, the memory and the at least one processor being interconnected by a line;
wherein the at least one processor invokes the instructions in the memory to cause the near-synonym mining apparatus to perform the method of any one of claims 1 to 6.
10. A computer-readable storage medium in which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010149502.6A CN111401043A (en) | 2020-03-06 | 2020-03-06 | Method, device and equipment for mining similar meaning words and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111401043A | 2020-07-10
Family ID: 71430550
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010149502.6A | Method, device and equipment for mining similar meaning words and storage medium | 2020-03-06 | 2020-03-06
Country Status (1)
Country | Link |
---|---|
CN | CN111401043A
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112257419A (en) * | 2020-11-06 | 2021-01-22 | 开普云信息科技股份有限公司 | Intelligent retrieval method and device for calculating patent document similarity based on word frequency and semantics, electronic equipment and storage medium thereof |
CN113642312A (en) * | 2021-08-19 | 2021-11-12 | 平安医疗健康管理股份有限公司 | Physical examination data processing method, physical examination data processing device, physical examination equipment and storage medium |
CN115563933A (en) * | 2022-09-19 | 2023-01-03 | 中国电信股份有限公司 | Word encoding method and device, storage medium and electronic equipment |
CN115563933B (en) * | 2022-09-19 | 2023-12-01 | 中国电信股份有限公司 | Word encoding method, device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |