CN113704416A - Word sense disambiguation method and device, electronic equipment and computer-readable storage medium - Google Patents
- Publication number
- CN113704416A (application CN202111249932.6A)
- Authority
- CN
- China
- Prior art keywords
- word
- vector
- disambiguated
- text
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/274—Converting codes to words; Guess-ahead of partial word inputs
Abstract
The application relates to a word sense disambiguation method, comprising: acquiring a text to be processed; determining, in the text to be processed, a word to be disambiguated, the above text of the word and the below text of the word; searching a preset semantic knowledge base for a plurality of definition interpretations corresponding to the word to be disambiguated and vectorizing the definition interpretations to obtain a definition matrix; constructing a corresponding classifier based on the definition matrix; inputting the above text, the below text and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated; and replacing the word to be disambiguated with the predicted candidate word to obtain the standard text after disambiguation. In addition, the application relates to a corresponding word sense disambiguation apparatus, electronic device and storage medium. The method and apparatus can solve the problem that word sense disambiguation accuracy is not high enough.
Description
Technical Field
The present application relates to the field of text processing, and in particular, to a word sense disambiguation method and apparatus, an electronic device, and a computer-readable storage medium.
Background
A certain number of ambiguous words exist in any language's vocabulary. Although ambiguous words make natural language more flexible, they also create difficulties for its understanding and translation. With the rise of artificial intelligence, word sense disambiguation is applied in more and more emerging fields and has become an important problem to be solved in natural language processing.
Existing word sense disambiguation methods usually rely on machine learning and therefore require a large amount of manually labeled training corpus, which is expensive to produce. Moreover, manual labeling cannot completely and accurately cover uncommon words with multiple senses, so the accuracy of word sense disambiguation is not high enough.
Disclosure of Invention
The application provides a word sense disambiguation method, a word sense disambiguation apparatus, an electronic device and a storage medium, which are used for solving the problem that the accuracy of word sense disambiguation is not high enough.
In a first aspect, the present application provides a word sense disambiguation method comprising:
acquiring a text to be processed, and determining a word to be disambiguated in the text to be processed, the above text of the word to be disambiguated and the below text of the word to be disambiguated;
searching a plurality of definition interpretations corresponding to the word to be disambiguated from a preset semantic knowledge base, and vectorizing the definition interpretations to obtain a definition matrix;
constructing a corresponding classifier based on the definition matrix, and inputting the above text, the below text and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated; and replacing the word to be disambiguated by using the predicted candidate word to obtain the standard text after the disambiguation.
In detail, the inputting the above text, the below text and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated includes:
vectorizing the above text, the below text and the word to be disambiguated respectively to obtain an above vector, an ambiguous word vector and a below vector;
carrying out vector averaging on the above vector, the ambiguous word vector and the below vector to obtain a semantic vector;
and inputting the semantic vector into the classifier to obtain a predicted candidate word of the word to be disambiguated.
In detail, the vectorization processing is performed on the above text, the below text, and the word to be disambiguated, so as to obtain an above text vector, an ambiguous word vector, and a below text vector, including:
respectively carrying out mask processing on the above text, the below text and the word to be disambiguated to obtain a mask data set;
converting the mask data set into a vector data set, and performing matrix conversion processing on the vector data set to obtain a target vector correlation matrix;
and inputting the target vector correlation matrix into a final hidden output layer of a pre-training model to obtain an above vector, an ambiguous word vector and a below vector.
In detail, the masking the above text, the below text, and the word to be disambiguated to obtain a mask data set includes:
screening a plurality of keywords from the above text, the below text and the word to be disambiguated by using a pre-acquired mask probability, and performing mask processing on the plurality of keywords to obtain mask words corresponding to the keywords;
and replacing the keywords with the mask words to obtain a mask data set.
In detail, the performing matrix conversion processing on the vector data set to obtain a target vector correlation matrix includes:
carrying out position coding processing on the vector data set to obtain a position vector set;
converting the position vector set into a position vector matrix, and generating a classification matrix according to the dimension of the position vector matrix;
calculating to obtain an original vector correlation matrix corresponding to the position vector set according to the position vector matrix, the classification matrix and a preset activation function;
and adjusting an iteration weight factor in a pre-constructed feed-forward neural network by using the original vector correlation matrix and the position vector matrix to obtain a target vector correlation matrix.
In detail, the vectorizing the plurality of definition interpretations to obtain a definition matrix includes:
inputting the multiple definition interpretations into a target training model to obtain multiple sentence vectors corresponding to the multiple definition interpretations;
and splicing the sentence vectors to obtain a definition matrix.
In detail, after the text to be processed is obtained, the method further includes:
performing data cleaning on the text to be processed to remove dirty data and obtain a cleaned data set;
splitting the cleaned data set into a plurality of sentences to obtain a sentence-split data set;
and performing word segmentation on the sentence-split data set by using a reference word segmenter to obtain the preprocessed text to be processed.
In a second aspect, the present application provides a word sense disambiguation apparatus, the apparatus comprising:
the text processing module is used for acquiring a text to be processed, and determining a word to be disambiguated in the text to be processed, the above text of the word to be disambiguated and the below text of the word to be disambiguated;
the vectorization module is used for searching a plurality of definition interpretations corresponding to the word to be disambiguated from a preset semantic knowledge base, and vectorizing the definition interpretations to obtain a definition matrix;
the candidate word prediction module is used for constructing a corresponding classifier based on the definition matrix, and inputting the above text, the below text and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated; and the ambiguity elimination module is used for replacing the word to be disambiguated with the predicted candidate word to obtain the standard text after disambiguation.
In a third aspect, a word sense disambiguation apparatus is provided, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor configured to implement the steps of the word sense disambiguation method described in any of the embodiments of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the word sense disambiguation method as defined in any of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
the embodiment of the invention searches a plurality of definition interpretations corresponding to the word to be disambiguated from the preset semantic knowledge base, because the semantic knowledge base contains large ranges of words and word senses, the performance of most of the uncommon words or word senses is greatly improved, manual labeling is not needed by searching through the semantic knowledge base, and the manual labeling cost is saved. Vectorizing the multiple definition interpretations can convert the definition interpretations into a vector form convenient for computer processing, constructing a corresponding classifier based on the definition matrix, inputting the determined word to be disambiguated in the text to be processed, the context of the word to be disambiguated and the context of the word to be disambiguated into the classifier, obtaining a predicted candidate word of the word to be disambiguated, obtaining a predicted candidate word, improving the accuracy of screening the predicted candidate word, and replacing the word to be disambiguated in the initial text by using the predicted candidate word to obtain a standard text after disambiguation. Therefore, the word sense disambiguation method, the word sense disambiguation device, the electronic device and the computer readable storage medium provided by the invention can solve the problem that the accuracy of word sense disambiguation is not high enough.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a flow chart illustrating a word sense disambiguation method according to an embodiment of the present application;
FIG. 2 is a block diagram of an apparatus for word sense disambiguation according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device for word sense disambiguation according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow chart of a word sense disambiguation method according to an embodiment of the present application. In this embodiment, the word sense disambiguation method includes:
s1, acquiring a text to be processed, and determining a word to be disambiguated in the text to be processed, the context of the word to be disambiguated and the context of the word to be disambiguated.
In the embodiment of the present invention, the text to be processed refers to a complete sentence in any context, where the sentence includes an above text, a word to be disambiguated and a below text, and the sentence may be composed as above text (presence_up) + word to be disambiguated (token) + below text (presence_down).
In an optional embodiment of the present invention, the word to be disambiguated may be determined according to a sentence identifier of the input text to be processed; the text before the word to be disambiguated is determined to be its above text, and the text after it is determined to be its below text.
In another alternative embodiment, the word to be disambiguated may be determined through a preset ambiguous word recognition model; the text before the word to be disambiguated is then determined as its above text, and the text after it as its below text.
Specifically, after the text to be processed is obtained, the method further includes:
performing data cleaning on the text to be processed to remove dirty data and obtain a cleaned data set;
splitting the cleaned data set into a plurality of sentences to obtain a sentence-split data set;
and performing word segmentation on the sentence-split data set by using a reference word segmenter to obtain the preprocessed text to be processed.
In detail, cleaning the text to be processed guarantees the accuracy and cleanness of the resulting cleaned data set. Because the cleaned data set comprises a plurality of complete sentences, it can be split into individual sentences by taking the period as the split node, yielding the sentence-split data set, which is then segmented into words by the reference word segmenter to obtain the preprocessed text to be processed. The reference word segmenter includes, but is not limited to, a segmenter based on dictionary string matching or a segmenter based on a character-labeling machine learning model, such as the Stanford word segmenter.
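A minimal Python sketch of this preprocessing pipeline, using the period as the split node; the default whitespace tokenizer here is a stand-in assumption for a reference word segmenter such as the Stanford segmenter:

```python
import re

def preprocess(text, tokenizer=None):
    """Clean the raw text, split it into sentences on the period
    delimiter, then tokenize each sentence.  The default tokenizer is a
    naive whitespace splitter standing in for a real word segmenter."""
    # Cleaning: collapse runs of whitespace and strip the edges.
    cleaned = re.sub(r"\s+", " ", text).strip()
    # Sentence splitting: the period is the split node.
    sentences = [s.strip() for s in cleaned.split(".") if s.strip()]
    tokenize = tokenizer or (lambda s: s.split())
    return [tokenize(s) for s in sentences]
```

With a real segmenter, only the `tokenizer` argument would change; the cleaning and sentence-splitting steps stay the same.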
S2, searching a plurality of definition interpretations corresponding to the word to be disambiguated from a preset semantic knowledge base, and vectorizing the definition interpretations to obtain a definition matrix.
In the embodiment of the invention, the preset semantic knowledge base may be WordNet, an English lexical database grounded in cognitive linguistics: instead of arranging words alphabetically, it forms a word network according to the meanings of words.
Specifically, with the semantic knowledge base as a reference, searching the semantic knowledge base for a plurality of defined interpretations corresponding to the word to be disambiguated.
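As an illustration of the lookup, a toy in-memory dictionary can stand in for the semantic knowledge base; `TOY_KB` and `lookup_definitions` are hypothetical names, and a real implementation would query WordNet itself:

```python
# TOY_KB is a made-up stand-in for a semantic knowledge base such as
# WordNet; each entry maps a word to its definition interpretations.
TOY_KB = {
    "bank": [
        "sloping land beside a body of water",
        "a financial institution that accepts deposits",
    ],
}

def lookup_definitions(word, kb=TOY_KB):
    """Return every definition interpretation the knowledge base holds
    for the word to be disambiguated (empty list if unknown)."""
    return kb.get(word, [])
```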
Further, vectorizing the plurality of definition interpretations to obtain a definition matrix includes:
inputting the multiple definition interpretations into a target training model to obtain multiple sentence vectors corresponding to the multiple definition interpretations;
and splicing the sentence vectors to obtain a definition matrix.
In detail, each definition interpretation (token definition) is input into the target training model, yielding S sentence vectors: vector_1[CLS], ..., vector_S[CLS]. Splicing the S vectors produces the definition matrix T, where the dimension of T is H × S.
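A minimal sketch of the splicing step, assuming the S sentence vectors of hidden size H have already been produced by the target training model (NumPy is used only for the matrix layout):

```python
import numpy as np

def build_definition_matrix(sentence_vectors):
    """Splice S sentence vectors (each of hidden size H) column-wise
    into an H x S definition matrix."""
    return np.stack(sentence_vectors, axis=1)  # shape (H, S)
```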
S3, constructing a corresponding classifier based on the definition matrix, and inputting the above text, the below text and the word to be disambiguated into the classifier to obtain a prediction candidate word of the word to be disambiguated.
In the embodiment of the present invention, the constructing of the corresponding classifier based on the definition matrix means that the corresponding classifier is constructed by using the definition matrix as a parameter of the classifier. Wherein the classifier comprises a linear classifier and a non-linear classifier.
For example, given the definition matrix T of dimension H × S obtained above, a corresponding classifier of the form y = softmax(T^T (W v + b)) is constructed, where v is the input semantic vector, W and b are updatable parameters, W has dimension H × H, and b has dimension H × 1.
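A sketch of one plausible form of such a classifier, assuming a softmax over T^T(Wv + b); the exact formula in the patent's figure is not reproduced in the text, so this reading is an assumption rather than the patent's verbatim definition:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def classify(definition_matrix, semantic_vector, W, b):
    """Score each of the S candidate definitions: the semantic vector v
    is projected by the updatable parameters W (H x H) and b (H,), then
    matched against the definition matrix T (H x S)."""
    T = definition_matrix
    logits = T.T @ (W @ semantic_vector + b)  # shape (S,)
    return softmax(logits)
```

The output is a probability distribution over the S candidate definitions, from which the predicted candidate word is taken.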
Specifically, the inputting the above text, the below text, and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated includes:
vectorizing the above text, the below text and the word to be disambiguated respectively to obtain an above vector, an ambiguous word vector and a below vector;
carrying out vector averaging on the above vector, the ambiguous word vector and the below vector to obtain a semantic vector;
and inputting the semantic vector into the classifier to obtain a predicted candidate word of the word to be disambiguated.
In detail, a pre-training model is used to vectorize the above text, the below text and the word to be disambiguated respectively, obtaining the above vector, the ambiguous word vector and the below vector. The pre-training model may be a BERT (Bidirectional Encoder Representations from Transformers) model or an ELMo (Embeddings from Language Models) model.
In an embodiment of the present invention, the vectorizing the above text, the below text, and the word to be disambiguated to obtain an above text vector, an ambiguous word vector, and a below text vector includes:
respectively carrying out mask processing on the above text, the below text and the word to be disambiguated to obtain a mask data set;
converting the mask data set into a vector data set, and performing matrix conversion processing on the vector data set to obtain a target vector correlation matrix;
and inputting the target vector correlation matrix into a final hidden output layer of a pre-training model to obtain an above vector, an ambiguous word vector and a below vector.
Specifically, the masking the above text, the below text, and the word to be disambiguated to obtain a mask data set includes:
screening a plurality of keywords from the above text, the below text and the word to be disambiguated by using a pre-acquired mask probability, and performing mask processing on the plurality of keywords to obtain mask words corresponding to the keywords;
and replacing the keywords with the mask words to obtain the mask data set.
In detail, the mask probability refers to the ratio of the number of randomly selected keywords to the total number of words in the above text, the below text and the word to be disambiguated. For example, if the mask probability is 30% and the above text, the below text and the word to be disambiguated together contain 100 words, keywords are screened out according to the 30% mask probability; that is, 30 of those words are randomly masked.
The mask processing modes include the MASK mask and the random mask: the MASK mask replaces a keyword with the [MASK] symbol, while the random mask replaces a keyword with another word.
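The keyword screening and MASK-mask replacement described above can be sketched as follows; `mask_tokens` is a hypothetical helper, and a seeded `random.Random` stands in for the pre-acquired selection process:

```python
import random

def mask_tokens(tokens, mask_prob, rng=None):
    """Screen out keywords with the pre-acquired mask probability and
    replace each with the [MASK] symbol (the MASK-mask variant; a
    random mask would substitute another word instead).  Returns the
    masked token list and the sorted masked indices."""
    rng = rng or random.Random(0)
    n_mask = int(len(tokens) * mask_prob)
    idx = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    for i in idx:
        masked[i] = "[MASK]"
    return masked, sorted(idx)
```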
In particular, the Word2vec algorithm may be employed to convert the mask data set into a vector data set.
Further, the performing matrix conversion processing on the vector data set to obtain a target vector correlation matrix includes:
carrying out position coding processing on the vector data set to obtain a position vector set;
converting the position vector set into a position vector matrix, and generating a classification matrix according to the dimension of the position vector matrix;
calculating to obtain an original vector correlation matrix corresponding to the position vector set according to the position vector matrix, the classification matrix and a preset activation function;
and adjusting an iteration weight factor in a pre-constructed feed-forward neural network by using the original vector correlation matrix and the position vector matrix to obtain the target vector correlation matrix.
In detail, the dimension of the classification matrix is consistent with the dimension of the position vector matrix: if the position vector matrix has dimension n × d, then the classification matrix also has dimension n × d.
Specifically, the calculating to obtain an original vector correlation matrix corresponding to the position vector set according to the position vector matrix, the classification matrix, and a preset activation function includes:
splitting the classification matrix according to a preset splitting rule to obtain a central matrix, an association matrix and a weight matrix;
performing point multiplication on the position vector matrix and the central matrix, the association matrix and the weight matrix respectively to obtain a central vector matrix, an association vector matrix and a weight vector matrix;
and taking the central vector matrix, the associated vector matrix and the weight vector matrix as input parameters of the activation function to obtain the original vector correlation matrix.
In detail, the classification matrix is split into a central matrix, an association matrix and a weight matrix by using the preset splitting rule, each sub-matrix taking a preset share of the classification matrix so that its dimension matches the point multiplication in the following step.
Preferably, the activation function is a softmax function.
Further, the obtaining the original vector correlation matrix by using the central vector matrix, the associated vector matrix and the weight vector matrix as input parameters of the activation function includes:
A = softmax(Q K^T / √d_k) V
where Q is the central vector matrix, K is the associated vector matrix, d_k is the dimension of the associated vector matrix, V is the weight vector matrix, and A is the original vector correlation matrix.
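Assuming the scaled dot-product form reconstructed above, the computation of the original vector correlation matrix from the central, associated and weight vector matrices can be sketched as:

```python
import numpy as np

def scaled_dot_product(Q, K, V):
    """Combine the central vector matrix Q, associated vector matrix K
    and weight vector matrix V as softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns the scores into correlation weights.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = e / e.sum(axis=-1, keepdims=True)
    return attn @ V
```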
Specifically, the target vector correlation matrix is input into the final hidden output layer of the pre-training model to obtain the above vector, the ambiguous word vector and the below vector.
In detail, in the embodiment of the present invention, the pretraining model adopts a BERT model, wherein the BERT model has the advantages of being more efficient and capable of capturing more dependencies.
In another embodiment of the present invention, before the vectorizing the above text, the below text, and the word to be disambiguated to obtain the above text vector, the ambiguous word vector, and the below text vector, the method further includes:
acquiring a preset number of specified texts from a preset corpus, and performing sentence segmentation and word segmentation on the specified texts to obtain a specified word segmentation set;
and training a bidirectional long-short term memory model by using the specified word segmentation set to obtain a pre-training model.
In detail, the Bi-directional long-short term memory model, i.e., the Bi-LSTM model, is trained to obtain a pre-trained model, and the pre-trained model can infer a word vector corresponding to each word according to the context, so that the ambiguous word can be understood according to the context.
Specifically, the training of the bidirectional long-short term memory model by using the specified word segmentation set to obtain a pre-training model includes:
calculating the forward probability and the backward probability of the specified word segmentation set by utilizing the bidirectional long-short term memory network;
constructing a maximized log-likelihood function based on the forward probability and the backward probability, and calculating a function value of the maximized log-likelihood function;
and when the function value is greater than or equal to a preset threshold value, outputting the bidirectional long-short term memory model as a pre-training model.
Further, the calculating of the forward probability and the backward probability of the specified word segmentation set by using the bidirectional long-short term memory network comprises:
calculating the forward probability of the specified word segmentation set by using the following formula:
p(t_1, t_2, ..., t_N) = ∏_{k=1}^{N} p(t_k | t_1, ..., t_{k-1})
where p is the forward probability, t_k refers to the k-th segmented word in the specified word segmentation set, and N is the number of segmented words in the specified word segmentation set.
Calculating the backward probability of the specified word segmentation set by using the following formula:
p(t_1, t_2, ..., t_N) = ∏_{k=1}^{N} p(t_k | t_{k+1}, ..., t_N)
Specifically, the constructing of a maximized log-likelihood function based on the forward probability and the backward probability includes:
L = ∑_{k=1}^{N} [ log p(t_k | t_1, ..., t_{k-1}; Θ_x, Θ_LSTM, Θ_s) + log p(t_k | t_{k+1}, ..., t_N; Θ_x, Θ_LSTM, Θ_s) ]
where L is the function value, Θ_x is the word vector parameter, Θ_s is the parameter of the softmax layer, and Θ_LSTM is the parameter of the bidirectional long-short term memory network.
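As a toy illustration of this objective, given per-word forward and backward probabilities already produced by the two LSTM directions, the log-likelihood value is the sum of their logarithms (the probability values below are made-up placeholders, not model outputs):

```python
import math

def bidirectional_log_likelihood(fwd_probs, bwd_probs):
    """Sum over all N segmented words of log forward probability plus
    log backward probability, as in an ELMo-style objective.
    fwd_probs[k] stands for p(t_k | t_1..t_{k-1}) and bwd_probs[k] for
    p(t_k | t_{k+1}..t_N)."""
    return sum(math.log(f) + math.log(b)
               for f, b in zip(fwd_probs, bwd_probs))
```

Training then adjusts the parameters so this value grows; when it reaches the preset threshold the model is output as the pre-training model.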
Further, the vector averaging the above vector, the ambiguous word vector, and the below vector to obtain a semantic vector includes:
vector averaging the above vector, the ambiguous word vector and the below vector using the following formula:
v = (v_above + v_token + v_below) / 3
where v is the semantic vector, v_above is the above vector, v_token is the ambiguous word vector, and v_below is the below vector.
In detail, vector averaging is performed on the above vector, the ambiguous word vector and the below vector, so that the finally obtained semantic vector can link the above information and the below information, and the semantic information contained in the semantic vector is enriched.
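The averaging step can be sketched in a few lines, assuming the three vectors share one dimension:

```python
def semantic_vector(above_vec, token_vec, below_vec):
    """Element-wise average of the above vector, ambiguous word vector
    and below vector, linking the above and below information into a
    single semantic vector."""
    return [(a + t + b) / 3.0
            for a, t, b in zip(above_vec, token_vec, below_vec)]
```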
Specifically, before the inputting the above text, the below text and the word to be disambiguated into the classifier, the method further comprises:
obtaining a plurality of candidate words, inputting the candidate words into the classifier, and obtaining the probability corresponding to the candidate words;
selecting a label corresponding to the candidate word with the probability greater than or equal to a preset probability threshold value as a prediction label;
calculating a loss value between the predicted tag and a preset real tag by using a preset minimum loss function;
and when the loss value is greater than or equal to a preset loss threshold value, performing parameter adjustment on the classifier, and outputting the classifier as a standard classifier until the loss value is less than the loss threshold value.
Further, the calculating a loss value between the predicted tag and a preset real tag by using a preset minimum loss function includes:
the preset minimization loss function is:
L = -(1/M) ∑_{m=1}^{M} ∑_{s=1}^{S} y_{m,s} · log(p_{m,s})
where L is the loss value, M is the number of words to be disambiguated, S is the number of candidate definitions, y_{m,s} is the true label of the s-th candidate definition of the m-th word to be disambiguated, and p_{m,s} is the probability score of the s-th candidate definition of the m-th word to be disambiguated.
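Assuming the standard cross-entropy reading of this loss (the patent's own formula image is not reproduced in the text), a minimal sketch:

```python
import math

def disambiguation_loss(true_labels, probs):
    """Average over the M words to be disambiguated of
    -sum_s y_{m,s} * log(p_{m,s}), where y_{m,s} is the true label of
    the s-th candidate definition and p_{m,s} its probability score.
    Terms with y == 0 contribute nothing and are skipped."""
    M = len(true_labels)
    total = 0.0
    for y_row, p_row in zip(true_labels, probs):
        total += -sum(y * math.log(p)
                      for y, p in zip(y_row, p_row) if y > 0)
    return total / M
```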
In detail, the classifier is used for carrying out classification prediction on the semantic vectors to obtain prediction candidate words.
And S4, replacing the word to be disambiguated by using the predicted candidate word to obtain the standard text after the disambiguation.
In the embodiment of the invention, the predicted candidate words are words finally obtained after classification and prediction are carried out by the classifier, and the predicted candidate words are replaced by the words to be disambiguated in the initial text to obtain the standard text after disambiguation.
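A minimal sketch of the replacement step; the naive whole-token match is an illustrative simplification, since the embodiment does not specify how occurrences of the word are located in the initial text:

```python
def disambiguate(text, word, predicted_candidate):
    """Replace the word to be disambiguated in the initial text with
    the predicted candidate word, yielding the standard text after
    disambiguation."""
    tokens = text.split()
    return " ".join(predicted_candidate if t == word else t
                    for t in tokens)
```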
The embodiment of the invention searches a preset semantic knowledge base for a plurality of definition interpretations corresponding to the word to be disambiguated. Because the semantic knowledge base covers a wide range of words and word senses, performance on most uncommon words or senses is greatly improved; and because the search replaces manual labeling, the cost of manual labeling is saved. Vectorizing the definition interpretations converts them into a vector form convenient for computer processing. A corresponding classifier is constructed based on the definition matrix, and the word to be disambiguated determined in the text to be processed, together with its above text and below text, is input into the classifier to obtain a predicted candidate word, which improves the accuracy of screening the predicted candidate word. Finally, the predicted candidate word replaces the word to be disambiguated in the initial text, yielding the standard text after disambiguation. Therefore, the word sense disambiguation method provided by the invention can solve the problem that the accuracy of word sense disambiguation is not high enough.
As shown in fig. 2, the present embodiment provides a module schematic diagram of a word sense disambiguation apparatus 10, where the word sense disambiguation apparatus 10 includes: the text processing module 11, the vectorization module 12, the candidate word prediction module 13, and the disambiguation module 14.
The text processing module 11 is configured to obtain a text to be processed, and determine a word to be disambiguated in the text to be processed, the preceding context of the word to be disambiguated, and the following context of the word to be disambiguated;
the vectorization module 12 is configured to search a preset semantic knowledge base for a plurality of definition interpretations corresponding to the word to be disambiguated, and perform vectorization on the plurality of definition interpretations to obtain a definition matrix;
the candidate word prediction module 13 is configured to construct a corresponding classifier based on the definition matrix, and input the preceding context, the following context, and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated;
the disambiguation module 14 is configured to replace the word to be disambiguated with the predicted candidate word to obtain a standard text after disambiguation.
In detail, each module of the word sense disambiguation apparatus 10 in the embodiment of the present application adopts the same technical means as the word sense disambiguation method described in fig. 1 above and can produce the same technical effects, which are not described herein again.
As shown in fig. 3, an electronic device provided in the embodiment of the present application includes a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 communicate with each other through the communication bus 114;
a memory 113 for storing a computer program;
in an embodiment of the present application, the processor 111, when executing the program stored in the memory 113, is configured to implement the word sense disambiguation method provided in any of the foregoing method embodiments, including:
acquiring a text to be processed, and determining a word to be disambiguated in the text to be processed, the preceding context of the word to be disambiguated, and the following context of the word to be disambiguated;
searching a plurality of definition interpretations corresponding to the word to be disambiguated from a preset semantic knowledge base, and vectorizing the definition interpretations to obtain a definition matrix;
constructing a corresponding classifier based on the definition matrix, and inputting the preceding context, the following context, and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated;
and replacing the word to be disambiguated with the predicted candidate word to obtain the disambiguated standard text.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the word sense disambiguation method as provided in any of the preceding method embodiments.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method of word sense disambiguation, the method comprising:
acquiring a text to be processed, and determining a word to be disambiguated in the text to be processed, the preceding context of the word to be disambiguated, and the following context of the word to be disambiguated;
searching a plurality of definition interpretations corresponding to the word to be disambiguated from a preset semantic knowledge base, and vectorizing the definition interpretations to obtain a definition matrix;
constructing a corresponding classifier based on the definition matrix, and inputting the preceding context, the following context, and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated;
and replacing the word to be disambiguated with the predicted candidate word to obtain the disambiguated standard text.
2. The word sense disambiguation method of claim 1, wherein the inputting the preceding context, the following context, and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated comprises:
vectorizing the preceding context, the following context, and the word to be disambiguated respectively to obtain a preceding-context vector, an ambiguous word vector, and a following-context vector;
performing vector averaging on the preceding-context vector, the ambiguous word vector, and the following-context vector to obtain a semantic vector;
and inputting the semantic vector into the classifier to obtain the predicted candidate word of the word to be disambiguated.
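A minimal sketch of claim 2's averaging-and-classification step, assuming the classifier scores the semantic vector against each row of the definition matrix with a softmax over dot products (the scoring rule and function names are assumptions, not stated in the claim):

```python
import numpy as np

def predict_candidate(above_vec, word_vec, below_vec, definition_matrix, candidates):
    """Average the preceding-context, ambiguous-word, and following-context
    vectors into one semantic vector, then return the best-scoring
    candidate definition."""
    semantic = (np.asarray(above_vec, dtype=float)
                + np.asarray(word_vec, dtype=float)
                + np.asarray(below_vec, dtype=float)) / 3.0  # vector averaging
    scores = np.asarray(definition_matrix, dtype=float) @ semantic  # one score per definition
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                     # softmax probability scores
    return candidates[int(np.argmax(probs))]
```

With definition vectors [1, 0] and [0, 1] and all context vectors equal to [1, 0], the first candidate wins.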
3. The word sense disambiguation method according to claim 2, wherein the vectorizing the preceding context, the following context, and the word to be disambiguated to obtain a preceding-context vector, an ambiguous word vector, and a following-context vector comprises:
performing mask processing on the preceding context, the following context, and the word to be disambiguated respectively to obtain a mask data set;
converting the mask data set into a vector data set, and performing matrix conversion processing on the vector data set to obtain a target vector correlation matrix;
and inputting the target vector correlation matrix into the final hidden output layer of a pre-training model to obtain the preceding-context vector, the ambiguous word vector, and the following-context vector.
4. The word sense disambiguation method according to claim 3, wherein the performing mask processing on the preceding context, the following context, and the word to be disambiguated respectively to obtain a mask data set comprises:
screening a plurality of keywords from the preceding context, the following context, and the word to be disambiguated by using a pre-acquired mask probability, and performing mask processing on the plurality of keywords to obtain mask words corresponding to the keywords;
and replacing the keywords with the mask words to obtain the mask data set.
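Claim 4's keyword screening can be sketched as BERT-style random masking; the default mask probability, the "[MASK]" token, and the fixed seed are assumptions made for illustration:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Screen keywords with a pre-acquired mask probability and replace
    each screened keyword with the mask word."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    return [mask_token if rng.random() < mask_prob else tok for tok in tokens]
```

Setting the probability to 1.0 masks every token, and 0.0 leaves the input unchanged, which makes the policy easy to verify.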
5. The word sense disambiguation method of claim 3, wherein the performing matrix conversion processing on the vector data set to obtain a target vector correlation matrix comprises:
performing position coding processing on the vector data set to obtain a position vector set;
converting the position vector set into a position vector matrix, and generating a classification matrix according to the dimension of the position vector matrix;
calculating an original vector correlation matrix corresponding to the position vector set according to the position vector matrix, the classification matrix, and a preset activation function;
and adjusting an iteration weight factor in a pre-constructed feed-forward neural network by using the original vector correlation matrix and the position vector matrix to obtain the target vector correlation matrix.
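Claim 5's matrix conversion reads like Transformer-style self-attention; the sketch below assumes the "classification matrices" are Q/K/V projection matrices sized to the position-vector dimension and that the preset activation function is a scaled softmax (both are assumptions, not stated in the claim):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def correlation_matrix(P, Wq, Wk, Wv):
    """Compute an original vector correlation matrix from the position
    vector matrix P (rows = positions) and projections Wq/Wk/Wv."""
    Q, K, V = P @ Wq, P @ Wk, P @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # preset activation function
    return attn @ V
```

With identity projections the result is just the attention matrix itself, whose rows are valid probability distributions.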
6. The word sense disambiguation method according to any one of claims 1 to 4, wherein the vectorizing the plurality of definition interpretations to obtain a definition matrix comprises:
inputting the plurality of definition interpretations into a target training model to obtain a plurality of sentence vectors corresponding to the plurality of definition interpretations;
and splicing the plurality of sentence vectors to obtain the definition matrix.
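A sketch of claim 6, assuming the target training model yields one vector per definition via mean pooling of its token vectors (the pooling choice is an assumption) and that "splicing" stacks the sentence vectors row-wise:

```python
import numpy as np

def sentence_vector(token_vectors):
    """Mean-pool a definition's token vectors into one sentence vector."""
    return np.mean(np.asarray(token_vectors, dtype=float), axis=0)

def build_definition_matrix(definitions_token_vectors):
    """Splice the sentence vectors of all definition interpretations into
    a definition matrix, one row per candidate definition."""
    return np.vstack([sentence_vector(tv) for tv in definitions_token_vectors])
```

Each row of the resulting matrix then serves as the classifier weight for one candidate definition.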
7. The word sense disambiguation method according to any one of claims 1 to 4, further comprising, after the obtaining of the text to be processed:
performing data cleaning on the text to be processed to obtain a cleaned data set;
splitting the cleaned data set into a plurality of sentences to obtain a sentence-split data set;
and performing word segmentation processing on the sentence-split data set by using a reference word segmenter to obtain a preprocessed text to be processed.
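Claim 7's preprocessing can be sketched as below; the cleaning regex, the sentence delimiters, and whitespace word segmentation (standing in for the claimed reference word segmenter) are all illustrative assumptions:

```python
import re

def preprocess(text):
    """Clean the text to be processed, split it into sentences, and
    word-segment each sentence."""
    cleaned = re.sub(r"[^\w\s.!?]", " ", text)                    # data cleaning
    sentences = [s.strip() for s in re.split(r"[.!?]+", cleaned) if s.strip()]
    return [s.split() for s in sentences]                          # word segmentation
```

For example, punctuation is stripped, then each sentence becomes a token list ready for the disambiguation pipeline.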
8. A word sense disambiguating apparatus, the apparatus comprising:
the text processing module is used for acquiring a text to be processed, and determining a word to be disambiguated in the text to be processed, the preceding context of the word to be disambiguated, and the following context of the word to be disambiguated;
the vectorization module is used for searching a plurality of definition interpretations corresponding to the word to be disambiguated from a preset semantic knowledge base, and vectorizing the definition interpretations to obtain a definition matrix;
the candidate word prediction module is used for constructing a corresponding classifier based on the definition matrix, and inputting the preceding context, the following context, and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated;
and the disambiguation module is used for replacing the word to be disambiguated with the predicted candidate word to obtain the disambiguated standard text.
9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the word sense disambiguation method of any one of claims 1-7 when executing the program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the word sense disambiguation method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111249932.6A CN113704416B (en) | 2021-10-26 | 2021-10-26 | Word sense disambiguation method and device, electronic equipment and computer-readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113704416A true CN113704416A (en) | 2021-11-26 |
CN113704416B CN113704416B (en) | 2022-03-04 |
Family
ID=78647043
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111249932.6A Active CN113704416B (en) | 2021-10-26 | 2021-10-26 | Word sense disambiguation method and device, electronic equipment and computer-readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113704416B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114707489A (en) * | 2022-03-29 | 2022-07-05 | 马上消费金融股份有限公司 | Method and device for acquiring marked data set, electronic equipment and storage medium |
CN114818736A (en) * | 2022-05-31 | 2022-07-29 | 北京百度网讯科技有限公司 | Text processing method, chain finger method and device for short text and storage medium |
WO2023098013A1 (en) * | 2021-11-30 | 2023-06-08 | 青岛海尔科技有限公司 | Semantic recognition method and apparatus and electronic device |
WO2024051516A1 (en) * | 2022-09-07 | 2024-03-14 | 马上消费金融股份有限公司 | Method and apparatus for eliminating dialogue intent ambiguity, and electronic device and non-transitory computer-readable storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107102989A (en) * | 2017-05-24 | 2017-08-29 | 南京大学 | A kind of entity disambiguation method based on term vector, convolutional neural networks |
WO2019085640A1 (en) * | 2017-10-31 | 2019-05-09 | 株式会社Ntt都科摩 | Word meaning disambiguation method and device, word meaning expansion method, apparatus and device, and computer-readable storage medium |
CN110555208A (en) * | 2018-06-04 | 2019-12-10 | 北京三快在线科技有限公司 | ambiguity elimination method and device in information query and electronic equipment |
US20200073996A1 (en) * | 2018-08-28 | 2020-03-05 | Stitched.IO Limited | Methods and Systems for Domain-Specific Disambiguation of Acronyms or Homonyms |
US10733383B1 (en) * | 2018-05-24 | 2020-08-04 | Workday, Inc. | Fast entity linking in noisy text environments |
CN112784604A (en) * | 2021-02-08 | 2021-05-11 | 哈尔滨工业大学 | Entity linking method based on entity boundary network |
CN112906397A (en) * | 2021-04-06 | 2021-06-04 | 南通大学 | Short text entity disambiguation method |
CN113158687A (en) * | 2021-04-29 | 2021-07-23 | 新声科技(深圳)有限公司 | Semantic disambiguation method and device, storage medium and electronic device |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | GR01 | Patent grant | 