CN113704416A - Word sense disambiguation method and device, electronic equipment and computer-readable storage medium - Google Patents

Word sense disambiguation method and device, electronic equipment and computer-readable storage medium

Info

Publication number
CN113704416A
CN113704416A
Authority
CN
China
Prior art keywords
word
vector
disambiguated
text
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111249932.6A
Other languages
Chinese (zh)
Other versions
CN113704416B (en)
Inventor
张剑
杨大明
黄石磊
蒋志燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd
Priority to CN202111249932.6A
Publication of CN113704416A
Application granted
Publication of CN113704416B
Legal status: Active

Classifications

    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/274 Converting codes to words; Guess-ahead of partial word inputs
    (All within G Physics; G06 Computing; Calculating or Counting; G06F Electric digital data processing.)

Abstract

The application relates to a word sense disambiguation method, comprising: acquiring a text to be processed and determining a word to be disambiguated in the text, together with the above text and the below text of that word; searching a preset semantic knowledge base for a plurality of definition interpretations corresponding to the word to be disambiguated and vectorizing them to obtain a definition matrix; constructing a corresponding classifier based on the definition matrix and inputting the above text, the below text and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated; and replacing the word to be disambiguated with the predicted candidate word to obtain the standard text after disambiguation. In addition, the application relates to a corresponding apparatus, electronic device and storage medium. The method and device address the problem that word sense disambiguation accuracy is not high enough.

Description

Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
Technical Field
The present application relates to the field of text processing, and in particular, to a word sense disambiguation method and apparatus, an electronic device, and a computer-readable storage medium.
Background
A certain number of ambiguous words exist in the vocabulary of any language. Although ambiguous words make natural language more flexible, they also create difficulties for the understanding and translation of natural language. With the rise of artificial intelligence, word sense disambiguation is applied in more and more fields and has become an important problem to be solved in natural language processing.
Existing word sense disambiguation methods usually rely on machine learning and therefore need a large amount of manually labeled training corpora, which is costly. Moreover, manual labeling cannot completely and accurately cover uncommon words with multiple meanings, so the accuracy of word sense disambiguation is not high enough.
Disclosure of Invention
The application provides a word sense disambiguation method, a word sense disambiguation device, electronic equipment and a storage medium, which are used for solving the problem that the accuracy of word sense disambiguation is not high enough.
In a first aspect, the present application provides a word sense disambiguation method comprising:
acquiring a text to be processed, and determining a word to be disambiguated in the text to be processed, the above text of the word to be disambiguated and the below text of the word to be disambiguated;
searching a plurality of definition interpretations corresponding to the word to be disambiguated from a preset semantic knowledge base, and vectorizing the definition interpretations to obtain a definition matrix;
constructing a corresponding classifier based on the definition matrix, and inputting the above text, the below text and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated; and replacing the word to be disambiguated by using the predicted candidate word to obtain the standard text after the disambiguation.
In detail, the inputting the above text, the below text and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated includes:
vectorizing the above text, the below text and the word to be disambiguated respectively to obtain an above vector, an ambiguous word vector and a below vector;
carrying out vector averaging on the upper vector, the ambiguous word vector and the lower vector to obtain a semantic vector;
and inputting the semantic vector into the classifier to obtain a predicted candidate word of the word to be disambiguated.
In detail, the vectorization processing is performed on the above text, the below text, and the word to be disambiguated, so as to obtain an above text vector, an ambiguous word vector, and a below text vector, including:
respectively carrying out mask processing on the upper text, the lower text and the word to be disambiguated to obtain a mask data set;
converting the mask data set into a vector data set, and performing matrix conversion processing on the vector data set to obtain a target vector correlation matrix;
and inputting the target vector correlation matrix into a final hidden output layer of a pre-training model to obtain an upper vector, an ambiguous word vector and a lower vector.
In detail, the masking the above text, the below text, and the word to be disambiguated to obtain a mask data set includes:
screening a plurality of keywords from the upper text, the lower text and the words to be disambiguated by using pre-acquired mask probability, and performing mask processing on the plurality of keywords to obtain mask words corresponding to the keywords;
and replacing the key words by the mask words to obtain a mask data set.
In detail, the performing matrix conversion processing on the vector data set to obtain a target vector correlation matrix includes:
carrying out position coding processing on the vector data set to obtain a position vector set;
converting the position vector set into a position vector matrix, and generating a classification matrix according to the dimension of the position vector matrix;
calculating to obtain an original vector correlation matrix corresponding to the position vector set according to the position vector matrix, the classification matrix and a preset activation function;
and adjusting an iteration weight factor in a pre-constructed feed-forward neural network by using the original vector correlation matrix and the position vector matrix to obtain a target vector correlation matrix.
In detail, the vectorizing the plurality of definition interpretations to obtain a definition matrix includes:
inputting the multiple definition interpretations into a target training model to obtain multiple sentence vectors corresponding to the multiple definition interpretations;
and splicing the sentence vectors to obtain a definition matrix.
In detail, after the text to be processed is obtained, the method further includes:
performing cleaning and dirty-data removal on the text to be processed to obtain a cleaning data set;
splitting the cleaning data set into a plurality of sentences to obtain a sentence splitting data set;
and performing word segmentation processing on the sentence segmentation data set by using a reference word segmentation device to obtain a preprocessed text to be processed.
In a second aspect, the present application provides a word sense disambiguation apparatus, the apparatus comprising:
the text processing module is used for acquiring a text to be processed, and determining a word to be disambiguated in the text to be processed, the above text of the word to be disambiguated and the below text of the word to be disambiguated;
the vectorization module is used for searching a plurality of definition interpretations corresponding to the word to be disambiguated from a preset semantic knowledge base, and vectorizing the definition interpretations to obtain a definition matrix;
the candidate word prediction module is used for constructing a corresponding classifier based on the definition matrix, and inputting the above text, the below text and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated; and the ambiguity elimination module is used for replacing the word to be disambiguated with the predicted candidate word to obtain the standard text after disambiguation.
In a third aspect, a word sense disambiguation apparatus is provided, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor configured to implement the steps of the word sense disambiguation method described in any of the embodiments of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the word sense disambiguation method as defined in any of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
the embodiment of the invention searches a plurality of definition interpretations corresponding to the word to be disambiguated from the preset semantic knowledge base, because the semantic knowledge base contains large ranges of words and word senses, the performance of most of the uncommon words or word senses is greatly improved, manual labeling is not needed by searching through the semantic knowledge base, and the manual labeling cost is saved. Vectorizing the multiple definition interpretations can convert the definition interpretations into a vector form convenient for computer processing, constructing a corresponding classifier based on the definition matrix, inputting the determined word to be disambiguated in the text to be processed, the context of the word to be disambiguated and the context of the word to be disambiguated into the classifier, obtaining a predicted candidate word of the word to be disambiguated, obtaining a predicted candidate word, improving the accuracy of screening the predicted candidate word, and replacing the word to be disambiguated in the initial text by using the predicted candidate word to obtain a standard text after disambiguation. Therefore, the word sense disambiguation method, the word sense disambiguation device, the electronic device and the computer readable storage medium provided by the invention can solve the problem that the accuracy of word sense disambiguation is not high enough.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a flow chart illustrating a word sense disambiguation method according to an embodiment of the present application;
FIG. 2 is a block diagram of an apparatus for word sense disambiguation according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device for word sense disambiguation according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow chart of a word sense disambiguation method according to an embodiment of the present application. In this embodiment, the word sense disambiguation method includes:
s1, acquiring a text to be processed, and determining a word to be disambiguated in the text to be processed, the context of the word to be disambiguated and the context of the word to be disambiguated.
In the embodiment of the present invention, the text to be processed refers to a complete sentence in any context, where the sentence comprises an above text, a word to be disambiguated and a below text, so that its composition may be the above text (sentence_up) + the word to be disambiguated (token) + the below text (sentence_down).
In an optional embodiment of the present invention, the word to be disambiguated may be determined according to the sentence identifier of the input text to be processed; the text before the word to be disambiguated is determined to be its above text, and the text after it to be its below text.
In another alternative embodiment, the word to be disambiguated may be determined through a preset ambiguous word recognition model; the text before the word to be disambiguated is determined as its above text, and the text after it as its below text.
Specifically, after the text to be processed is obtained, the method further includes:
performing cleaning and dirty-data removal on the text to be processed to obtain a cleaning data set;
splitting the cleaning data set into a plurality of sentences to obtain a sentence splitting data set;
and performing word segmentation processing on the sentence segmentation data set by using a reference word segmentation device to obtain a preprocessed text to be processed.
In detail, cleaning and dirty-data removal guarantee the accuracy and cleanness of the resulting cleaning data set. Since the cleaning data set comprises a plurality of complete sentences, it can be split into individual sentences, taking the sentence-ending period as the split node, to obtain a sentence-division data set; the sentence-division data set is then segmented by a reference word segmenter to obtain the preprocessed text to be processed. The reference word segmenter includes, but is not limited to, a segmenter based on dictionary string matching or a machine-learning segmenter based on character labeling, such as the Stanford word segmenter.
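The preprocessing steps above (cleaning, splitting sentences at sentence-ending punctuation, word segmentation) can be sketched as follows. The concrete cleaning rules and the whitespace tokenizer standing in for a reference word segmenter are illustrative assumptions; the patent does not fix them:

```python
import re

def preprocess(text):
    """Clean, split into sentences, and segment words (illustrative sketch)."""
    # Cleaning and dirty-data removal: drop control characters, collapse spaces.
    cleaned = re.sub(r"[\x00-\x1f]+", " ", text)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Sentence splitting: sentence-ending punctuation serves as the split node.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", cleaned) if s.strip()]
    # Word segmentation: whitespace tokenization stands in for a real segmenter.
    return [sentence.split() for sentence in sentences]

tokens = preprocess("The bank\x07 raised rates.  She sat on the river bank.")
```

A real pipeline would replace the last step with a dictionary-matching or character-labeling segmenter as the text describes.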
S2, searching a plurality of definition interpretations corresponding to the word to be disambiguated from a preset semantic knowledge base, and vectorizing the definition interpretations to obtain a definition matrix.
In the embodiment of the invention, the preset semantic knowledge base may be WordNet, an English lexical dictionary based on cognitive linguistics; instead of arranging words alphabetically, it forms a word network according to the meanings of the words.
Specifically, with the semantic knowledge base as a reference, searching the semantic knowledge base for a plurality of defined interpretations corresponding to the word to be disambiguated.
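The lookup of definition interpretations can be illustrated with a toy stand-in for the semantic knowledge base. In practice WordNet could play this role, as the text suggests; `SEMANTIC_KB` and its glosses here are invented placeholders:

```python
# Toy stand-in for the preset semantic knowledge base; the entries are invented.
SEMANTIC_KB = {
    "bank": [
        "a financial institution that accepts deposits",
        "sloping land beside a body of water",
    ],
}

def definition_interpretations(word):
    """Return all definition interpretations of the word to be disambiguated."""
    return SEMANTIC_KB.get(word, [])

glosses = definition_interpretations("bank")  # one gloss per word sense
```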
Further, vectorizing the plurality of definition interpretations to obtain a definition matrix includes:
inputting the multiple definition interpretations into a target training model to obtain multiple sentence vectors corresponding to the multiple definition interpretations;
and splicing the sentence vectors to obtain a definition matrix.
In detail, each definition interpretation (token definition) is input into the target training model, obtaining S vectors $\text{vector}_1^{[CLS]}, \ldots, \text{vector}_S^{[CLS]}$; splicing the S vectors column by column yields the definition matrix

$$D = \left[\text{vector}_1^{[CLS]}, \ldots, \text{vector}_S^{[CLS]}\right]$$

wherein the dimension of the definition matrix is H × S.
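A minimal sketch of splicing the S sentence vectors into the H × S definition matrix, with toy dimensions (H = 8, S = 3) and random vectors standing in for model outputs:

```python
import numpy as np

H, S = 8, 3  # hidden size H, number of candidate definitions S (toy values)
rng = np.random.default_rng(0)

# One [CLS] sentence vector per definition interpretation, e.g. from BERT.
cls_vectors = [rng.standard_normal(H) for _ in range(S)]

# Splice (stack) the S column vectors into an H x S definition matrix.
definition_matrix = np.stack(cls_vectors, axis=1)
```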
S3, constructing a corresponding classifier based on the definition matrix, and inputting the above text, the below text and the word to be disambiguated into the classifier to obtain a prediction candidate word of the word to be disambiguated.
In the embodiment of the present invention, the constructing of the corresponding classifier based on the definition matrix means that the corresponding classifier is constructed by using the definition matrix as a parameter of the classifier. Wherein the classifier comprises a linear classifier and a non-linear classifier.
For example, given a definition matrix $D$ of dimension H × S, a corresponding classifier is constructed from the definition matrix, of a form such as

$$\hat{y} = \operatorname{softmax}\!\left(D^{\top}\left(W v + b\right)\right)$$

wherein $v$ is the input semantic vector, W and b are updatable parameters, the dimension of W is H × H, and the dimension of b is H × 1.
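A sketch of such a classifier. The exact formula is not reproduced in the patent text, so the form softmax(D^T (W v + b)) is an assumption chosen only to be consistent with the stated dimensions (D: H × S, W: H × H, b: H × 1):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def classify(v, D, W, b):
    # Score each of the S candidate definitions for the semantic vector v.
    # Assumed form: softmax(D^T (W v + b)); b is kept as a length-H vector.
    scores = D.T @ (W @ v + b)   # shape (S,): one score per candidate definition
    return softmax(scores)

H, S = 8, 3
rng = np.random.default_rng(1)
probs = classify(rng.standard_normal(H),
                 rng.standard_normal((H, S)),
                 rng.standard_normal((H, H)),
                 rng.standard_normal(H))
```

The candidate definition with the largest probability then determines the predicted candidate word.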
Specifically, the inputting the above text, the below text, and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated includes:
vectorizing the above words, the below words and the words to be disambiguated respectively to obtain an above vector, an ambiguous word vector and a below vector;
carrying out vector averaging on the upper vector, the ambiguous word vector and the lower vector to obtain a semantic vector;
and inputting the semantic vector into the classifier to obtain a predicted candidate word of the word to be disambiguated.
In detail, a pre-training model is used to vectorize the above text, the below text and the word to be disambiguated respectively, obtaining an above vector, an ambiguous word vector and a below vector. The pre-training model may be a BERT (Bidirectional Encoder Representations from Transformers) model or an ELMo (Embeddings from Language Models) bidirectional language model.
In an embodiment of the present invention, the vectorizing the above text, the below text, and the word to be disambiguated to obtain an above text vector, an ambiguous word vector, and a below text vector includes:
respectively carrying out mask processing on the upper text, the lower text and the word to be disambiguated to obtain a mask data set;
converting the mask data set into a vector data set, and performing matrix conversion processing on the vector data set to obtain a target vector correlation matrix;
and inputting the target vector correlation matrix into a final hidden output layer of a pre-training model to obtain an upper vector, an ambiguous word vector and a lower vector.
Specifically, the masking the above text, the below text, and the word to be disambiguated to obtain a mask data set includes:
screening a plurality of keywords from the upper text, the lower text and the words to be disambiguated by using pre-acquired mask probability, and performing mask processing on the plurality of keywords to obtain mask words corresponding to the keywords;
and replacing the key words by the mask words to obtain a mask data set.
In detail, the mask probability is the ratio of the number of randomly selected keywords to the total number of words in the above text, the below text and the word to be disambiguated. For example, with a mask probability of 30%, if the above text, the below text and the word to be disambiguated together contain 100 words, then keywords are screened out at that 30% probability, that is, 30 of those words are randomly masked.
The MASK processing mode comprises a MASK MASK and a random MASK, wherein the MASK MASK refers to the MASK of the key words by MASK symbols, and the random MASK refers to the MASK of the key words by other words.
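The masking procedure can be sketched as below; the 80/20 split between MASK masking and random masking is an illustrative assumption, not a figure from the patent:

```python
import random

def mask_tokens(tokens, mask_probability, seed=0):
    """Screen keywords by the mask probability and mask them.

    A selected keyword becomes the [MASK] symbol (MASK mask) or, some of the
    time, another token drawn from the sequence (random mask). The 80/20
    split between the two modes is an illustrative assumption.
    """
    rng = random.Random(seed)
    masked = []
    for token in tokens:
        if rng.random() < mask_probability:        # keyword screened out
            if rng.random() < 0.8:
                masked.append("[MASK]")            # MASK mask
            else:
                masked.append(rng.choice(tokens))  # random mask
        else:
            masked.append(token)
    return masked
```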
In particular, the Word2vec algorithm may be employed to convert the mask data set into a vector data set.
Further, the performing matrix conversion processing on the vector data set to obtain a target vector correlation matrix includes:
carrying out position coding processing on the vector data set to obtain a position vector set;
converting the position vector set into a position vector matrix, and generating a classification matrix according to the dimension of the position vector matrix;
calculating to obtain an original vector correlation matrix corresponding to the position vector set according to the position vector matrix, the classification matrix and a preset activation function;
and adjusting an iteration weight factor in a pre-constructed feed-forward neural network by using the original vector correlation matrix and the position vector matrix to obtain a target vector correlation matrix.
In detail, the dimension of the classification matrix is consistent with the dimension of the position vector matrix: if the dimension of the position vector matrix is $n \times d$, then the dimension of the classification matrix is also $n \times d$.
Specifically, the calculating to obtain an original vector correlation matrix corresponding to the position vector set according to the position vector matrix, the classification matrix, and a preset activation function includes:
splitting the classification matrix according to a preset splitting rule to obtain a central matrix, an incidence matrix and a weight matrix;
performing point multiplication on the position vector matrix and the central matrix, the association matrix and the weight matrix respectively to obtain a central vector matrix, an association vector matrix and a weight vector matrix;
and taking the central vector matrix, the associated vector matrix and the weight vector matrix as input parameters of the activation function to obtain the original vector correlation matrix.
In detail, the classification matrix is split into a central matrix, an incidence matrix and a weight matrix by using the preset splitting rule; that is, the classification matrix $W$ is partitioned into $W^{Q}$, $W^{K}$ and $W^{V}$, obtaining a central matrix $W^{Q}$, an incidence matrix $W^{K}$ and a weight matrix $W^{V}$ of equal dimensions.
Preferably, the activation function is a softmax function.
Further, taking the central vector matrix, the associated vector matrix and the weight vector matrix as the input parameters of the activation function gives the original vector correlation matrix:

$$Z = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

wherein $Q$ is the central vector matrix, $K$ is the associated vector matrix, $d_k$ is the dimension of the associated vector matrix, $V$ is the weight vector matrix, and $Z$ is the original vector correlation matrix.
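This computation matches scaled dot-product attention, and can be sketched with NumPy (toy dimensions; the Q/K/V names follow the reading of the central, associated and weight vector matrices above):

```python
import numpy as np

def scaled_dot_product(Q, K, V):
    """Original vector correlation matrix: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

n, d = 4, 8  # toy: sequence length n, vector dimension d
rng = np.random.default_rng(2)
X = rng.standard_normal((n, d))                  # position vector matrix
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
# Point-multiply the position vector matrix with the three split matrices.
Q, K, V = X @ Wq, X @ Wk, X @ Wv
Z = scaled_dot_product(Q, K, V)                  # original vector correlation matrix
```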
Specifically, the target vector correlation matrix is input into a final hidden output layer of a pre-training model, and an upper vector, an ambiguous word vector and a lower vector are obtained.
In detail, in the embodiment of the present invention the pre-training model adopts a BERT model, which has the advantages of being more efficient and of capturing more dependencies.
In another embodiment of the present invention, before the vectorizing the above text, the below text, and the word to be disambiguated to obtain the above text vector, the ambiguous word vector, and the below text vector, the method further includes:
acquiring a preset number of appointed texts in a preset corpus, and performing sentence segmentation and word segmentation on the appointed texts to obtain an appointed word segmentation set;
and training a bidirectional long-short term memory model by using the specified word segmentation set to obtain a pre-training model.
In detail, the Bi-directional long-short term memory model, i.e., the Bi-LSTM model, is trained to obtain a pre-trained model, and the pre-trained model can infer a word vector corresponding to each word according to the context, so that the ambiguous word can be understood according to the context.
Specifically, the training of the bidirectional long-short term memory model by using the specified word segmentation set to obtain a pre-training model includes:
calculating the forward probability and the backward probability of the appointed participle set by utilizing the bidirectional long-short term memory network;
constructing a maximized log-likelihood function based on the forward probability and the backward probability, and calculating a function value of the maximized log-likelihood function;
and when the function value is greater than or equal to a preset threshold value, outputting the bidirectional long-short term memory model as a pre-training model.
Further, the calculating the forward probability and the backward probability of the specified participle set by using the bidirectional long-short term memory network comprises:
calculating the forward probability of the appointed participle set by using the following calculation formula:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p\left(t_k \mid t_1, t_2, \ldots, t_{k-1}\right)$$

wherein the left-hand side is the forward probability, $t_k$ refers to the $k$-th participle in the specified participle set, and $N$ is the number of participles in the specified participle set.
Calculating the backward probability of the appointed participle set by using the following calculation formula:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p\left(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N\right)$$

wherein the left-hand side is the backward probability.
Specifically, the maximized log-likelihood function constructed based on the forward probability and the backward probability is:

$$L = \sum_{k=1}^{N} \left( \log p\left(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s\right) + \log p\left(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s\right) \right)$$

wherein $L$ is the function value, $\Theta_x$ is the word-vector parameter, $\Theta_s$ is the parameter of the softmax layer, and $\overrightarrow{\Theta}_{LSTM}$ and $\overleftarrow{\Theta}_{LSTM}$ are the parameters of the bidirectional long-short term memory network.
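A minimal numeric sketch of evaluating this bidirectional objective, assuming the per-token conditional probabilities have already been produced by the forward and backward LSTMs (the toy values below are invented):

```python
import numpy as np

def bidirectional_log_likelihood(forward_probs, backward_probs):
    # Sum of log p(t_k | t_1..t_{k-1}) plus sum of log p(t_k | t_{k+1}..t_N).
    return float(np.sum(np.log(forward_probs)) + np.sum(np.log(backward_probs)))

# Invented per-token conditional probabilities for a 3-token participle set.
ll = bidirectional_log_likelihood([0.5, 0.25, 0.5], [0.5, 0.5, 0.25])
```

Training maximizes this quantity over the word-vector, LSTM and softmax parameters.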
Further, the vector averaging the above vector, the ambiguous word vector, and the below vector to obtain a semantic vector includes:
vector averaging the above vector, the ambiguous word vector and the below vector using the following calculation:

$$v = \frac{1}{3}\left(v_{\text{up}} + v_{\text{token}} + v_{\text{down}}\right)$$

wherein $v$ is the semantic vector, $v_{\text{up}}$ is the above vector, $v_{\text{token}}$ is the ambiguous word vector, and $v_{\text{down}}$ is the below vector.
In detail, vector averaging is performed on the above vector, the ambiguous word vector and the below vector, so that the finally obtained semantic vector can link the above information and the below information, and the semantic information contained in the semantic vector is enriched.
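The averaging step itself is a one-line computation; a NumPy sketch with toy vectors standing in for the pre-training model's outputs:

```python
import numpy as np

H = 8
rng = np.random.default_rng(3)
v_above, v_token, v_below = (rng.standard_normal(H) for _ in range(3))

# Element-wise average links the above, ambiguous-word and below information.
semantic_vector = (v_above + v_token + v_below) / 3.0
```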
Specifically, before the inputting the above text, the below text and the word to be disambiguated into the classifier, the method further comprises:
obtaining a plurality of candidate words, inputting the candidate words into the classifier, and obtaining the probability corresponding to the candidate words;
selecting a label corresponding to the candidate word with the probability greater than or equal to a preset probability threshold value as a prediction label;
calculating a loss value between the predicted tag and a preset real tag by using a preset minimum loss function;
and when the loss value is greater than or equal to a preset loss threshold, adjusting the parameters of the classifier, and outputting the classifier as the standard classifier once the loss value falls below the loss threshold.
Further, the calculating of the loss value between the predicted label and the preset real label by using the preset minimization loss function includes:
the preset minimization loss function is:

Loss = −Σ_{m=1}^{M} Σ_{s=1}^{S} y_{m,s} · log(p_{m,s})

wherein Loss is the loss value, M is the number of words to be disambiguated, S is the number of candidate definitions, y_{m,s} is the true label of the s-th candidate definition of the m-th word to be disambiguated, and p_{m,s} is the probability score of the s-th candidate definition of the m-th word to be disambiguated.
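This minimization loss is a cross-entropy summed over M words to be disambiguated and S candidate definitions per word. A small sketch (the example labels and probability scores are made up):

```python
import math

def minimization_loss(true_labels, prob_scores):
    # Loss = -sum_m sum_s y[m][s] * log(p[m][s]); only the candidate
    # definition whose true label is 1 contributes for each word.
    total = 0.0
    for y_row, p_row in zip(true_labels, prob_scores):
        for y, p in zip(y_row, p_row):
            if y:
                total -= math.log(p)
    return total

# Two words to disambiguate, three candidate definitions each.
y_true = [[1, 0, 0], [0, 1, 0]]
p_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(round(minimization_loss(y_true, p_pred), 3))  # 0.58
```

The loss shrinks as the probability score assigned to each word's true candidate definition approaches 1, which is what drives the parameter adjustment described above.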
In detail, the classifier performs classification prediction on the semantic vector to obtain the predicted candidate word.
And S4, replacing the word to be disambiguated by using the predicted candidate word to obtain the standard text after the disambiguation.
In the embodiment of the invention, the predicted candidate word is the word finally obtained after the classifier performs classification and prediction, and the word to be disambiguated in the initial text is replaced with the predicted candidate word to obtain the standard text after disambiguation.
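Step S4 amounts to a string substitution. A minimal sketch (the sentence and the predicted sense are invented examples):

```python
def replace_with_prediction(initial_text, word_to_disambiguate, predicted_candidate):
    # Replace the word to be disambiguated in the initial text with the
    # predicted candidate word, yielding the standard (disambiguated) text.
    return initial_text.replace(word_to_disambiguate, predicted_candidate)

standard = replace_with_prediction(
    "she sat by the bank and watched the water",
    "bank",
    "river bank",
)
print(standard)  # she sat by the river bank and watched the water
```

A production implementation would replace only the occurrence located in step S1 (e.g. by character offset) rather than every match, since the same surface form may appear more than once with different senses.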
The embodiment of the invention searches a preset semantic knowledge base for a plurality of definition interpretations corresponding to the word to be disambiguated. Because the semantic knowledge base covers a wide range of words and word senses, performance on most uncommon words or senses is greatly improved; moreover, since the search relies on the knowledge base, no manual labeling is required, which saves labeling cost. Vectorizing the plurality of definition interpretations converts them into a vector form convenient for computer processing. A corresponding classifier is constructed based on the definition matrix, and the word to be disambiguated determined in the text to be processed, together with its above text and below text, is input into the classifier to obtain a predicted candidate word of the word to be disambiguated, which improves the accuracy of screening the predicted candidate word. Finally, the predicted candidate word replaces the word to be disambiguated in the initial text to obtain a standard text after disambiguation. Therefore, the word sense disambiguation method provided by the invention can solve the problem that the accuracy of word sense disambiguation is not high enough.
As shown in fig. 2, the present embodiment provides a module schematic diagram of a word sense disambiguation apparatus 10, where the word sense disambiguation apparatus 10 includes: the text processing module 11, the vectorization module 12, the candidate word prediction module 13, and the disambiguation module 14.
The text processing module 11 is configured to obtain a text to be processed, and determine a word to be disambiguated in the text to be processed, the above text of the word to be disambiguated, and the below text of the word to be disambiguated;
the vectorization module 12 is configured to search a preset semantic knowledge base for a plurality of definition interpretations corresponding to the word to be disambiguated, and perform vectorization on the plurality of definition interpretations to obtain a definition matrix;
the candidate word prediction module 13 is configured to construct a corresponding classifier based on the definition matrix, and input the above text, the below text and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated;
the disambiguation module 14 is configured to replace the word to be disambiguated with the predicted candidate word to obtain a standard text after disambiguation.
In detail, each module in the word sense disambiguation apparatus 10 in the embodiment of the present application adopts the same technical means as the word sense disambiguation method described in fig. 1 above, and can produce the same technical effect, and is not described herein again.
As shown in fig. 3, an electronic device provided in the embodiment of the present application includes a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 complete communication with each other through the communication bus 114;
a memory 113 for storing a computer program;
in an embodiment of the present application, the processor 111, when executing the program stored in the memory 113, is configured to implement the word sense disambiguation method provided in any of the foregoing method embodiments, including:
acquiring a text to be processed, and determining a word to be disambiguated in the text to be processed, the above text of the word to be disambiguated and the below text of the word to be disambiguated;
searching a plurality of definition interpretations corresponding to the word to be disambiguated from a preset semantic knowledge base, and vectorizing the definition interpretations to obtain a definition matrix;
constructing a corresponding classifier based on the definition matrix, and inputting the above text, the below text and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated;
and replacing the word to be disambiguated by using the predicted candidate word to obtain the standard text after the disambiguation.
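Taken together, these four steps can be sketched end to end. The knowledge base, the toy classifier (a naive token-overlap heuristic) and the example sentence are all invented stand-ins for the patent's components:

```python
def disambiguate(text, word, above, below, knowledge_base, classifier):
    # S1: the word to be disambiguated, its above text and its below
    #     text are assumed to have been located already.
    # S2: look up the plurality of candidate definition interpretations.
    senses = knowledge_base.get(word, [])
    if not senses:
        return text  # nothing to disambiguate
    # S3: the classifier picks a candidate given the context.
    predicted = classifier(above, word, below, senses)
    # S4: replace the ambiguous word with the predicted candidate word.
    return text.replace(word, predicted)

# Toy stand-ins: each sense is (replacement word, gloss); the "classifier"
# naively prefers the sense whose gloss shares tokens with the context.
kb = {"bank": [("banking institution", "financial institution for money deposits"),
               ("riverbank", "sloping land beside a river")]}

def toy_classifier(above, word, below, senses):
    context = set((above + " " + below).lower().split())
    best = max(senses, key=lambda s: len(context & set(s[1].split())))
    return best[0]

result = disambiguate("the boat drifted toward the bank of the river", "bank",
                      "the boat drifted toward the", "of the river",
                      kb, toy_classifier)
print(result)  # → the boat drifted toward the riverbank of the river
```

The patent's classifier replaces the overlap heuristic with the trained model over BERT-style context vectors, but the data flow, from context and knowledge-base lookup to candidate prediction to substitution, is the same.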
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the word sense disambiguation method as provided in any of the preceding method embodiments.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of word sense disambiguation, the method comprising:
acquiring a text to be processed, and determining a word to be disambiguated in the text to be processed, the above text of the word to be disambiguated and the below text of the word to be disambiguated;
searching a plurality of definition interpretations corresponding to the word to be disambiguated from a preset semantic knowledge base, and vectorizing the definition interpretations to obtain a definition matrix;
constructing a corresponding classifier based on the definition matrix, and inputting the above text, the below text and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated;
and replacing the word to be disambiguated by using the predicted candidate word to obtain the standard text after the disambiguation.
2. The word sense disambiguation method of claim 1, wherein the inputting of the above text, the below text and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated comprises:
vectorizing the above text, the word to be disambiguated and the below text respectively to obtain an above vector, an ambiguous word vector and a below vector;
carrying out vector averaging on the above vector, the ambiguous word vector and the below vector to obtain a semantic vector;
and inputting the semantic vector into the classifier to obtain a predicted candidate word of the word to be disambiguated.
3. The word sense disambiguation method according to claim 2, wherein the vectorizing the above text, the word to be disambiguated and the below text to obtain an above vector, an ambiguous word vector and a below vector comprises:
respectively carrying out mask processing on the above text, the below text and the word to be disambiguated to obtain a mask data set;
converting the mask data set into a vector data set, and performing matrix conversion processing on the vector data set to obtain a target vector correlation matrix;
and inputting the target vector correlation matrix into the final hidden output layer of a pre-training model to obtain the above vector, the ambiguous word vector and the below vector.
4. The word sense disambiguation method according to claim 3, wherein the respectively performing mask processing on the above text, the below text and the word to be disambiguated to obtain a mask data set comprises:
screening a plurality of keywords from the above text, the below text and the word to be disambiguated by using a pre-acquired mask probability, and performing mask processing on the plurality of keywords to obtain mask words corresponding to the keywords;
and replacing the key words by the mask words to obtain a mask data set.
5. The word sense disambiguation method of claim 3, wherein said performing a matrix transformation process on the vector data set to obtain a target vector correlation matrix comprises:
carrying out position coding processing on the vector data set to obtain a position vector set;
converting the position vector set into a position vector matrix, and generating a classification matrix according to the dimension of the position vector matrix;
calculating to obtain an original vector correlation matrix corresponding to the position vector set according to the position vector matrix, the classification matrix and a preset activation function;
and adjusting an iteration weight factor in a pre-constructed feed-forward neural network by using the original vector correlation matrix and the position vector matrix to obtain a target vector correlation matrix.
6. The word sense disambiguation method according to any one of claims 1 to 4, wherein the vectorizing the plurality of definition interpretations to obtain a definition matrix comprises:
inputting the multiple definition interpretations into a target training model to obtain multiple sentence vectors corresponding to the multiple definition interpretations;
and splicing the sentence vectors to obtain a definition matrix.
7. The word sense disambiguation method according to any one of claims 1 to 4, further comprising, after the obtaining of the text to be processed:
performing cleaning and de-noising on the text to be processed to obtain a cleaning data set;
splitting the cleaning data set into a plurality of sentences to obtain a sentence splitting data set;
and performing word segmentation processing on the sentence segmentation data set by using a reference word segmentation device to obtain a preprocessed text to be processed.
8. A word sense disambiguating apparatus, the apparatus comprising:
the text processing module is used for acquiring a text to be processed, and determining a word to be disambiguated in the text to be processed, the above text of the word to be disambiguated and the below text of the word to be disambiguated;
the vectorization module is used for searching a plurality of definition interpretations corresponding to the word to be disambiguated from a preset semantic knowledge base, and vectorizing the definition interpretations to obtain a definition matrix;
the candidate word prediction module is used for constructing a corresponding classifier based on the definition matrix, and inputting the above text, the below text and the word to be disambiguated into the classifier to obtain a predicted candidate word of the word to be disambiguated;
and the ambiguity elimination module is used for replacing the word to be disambiguated by using the predicted candidate word to obtain the standard text after disambiguation.
9. An electronic device is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the word sense disambiguation method of any one of claims 1-7 when executing the program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the word sense disambiguation method of any one of claims 1-7.
CN202111249932.6A 2021-10-26 2021-10-26 Word sense disambiguation method and device, electronic equipment and computer-readable storage medium Active CN113704416B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111249932.6A CN113704416B (en) 2021-10-26 2021-10-26 Word sense disambiguation method and device, electronic equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111249932.6A CN113704416B (en) 2021-10-26 2021-10-26 Word sense disambiguation method and device, electronic equipment and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN113704416A true CN113704416A (en) 2021-11-26
CN113704416B CN113704416B (en) 2022-03-04

Family

ID=78647043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111249932.6A Active CN113704416B (en) 2021-10-26 2021-10-26 Word sense disambiguation method and device, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN113704416B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114707489A (en) * 2022-03-29 2022-07-05 马上消费金融股份有限公司 Method and device for acquiring marked data set, electronic equipment and storage medium
CN114818736A (en) * 2022-05-31 2022-07-29 北京百度网讯科技有限公司 Text processing method, chain finger method and device for short text and storage medium
WO2023098013A1 (en) * 2021-11-30 2023-06-08 青岛海尔科技有限公司 Semantic recognition method and apparatus and electronic device
WO2024051516A1 (en) * 2022-09-07 2024-03-14 马上消费金融股份有限公司 Method and apparatus for eliminating dialogue intent ambiguity, and electronic device and non-transitory computer-readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
WO2019085640A1 (en) * 2017-10-31 2019-05-09 株式会社Ntt都科摩 Word meaning disambiguation method and device, word meaning expansion method, apparatus and device, and computer-readable storage medium
CN110555208A (en) * 2018-06-04 2019-12-10 北京三快在线科技有限公司 ambiguity elimination method and device in information query and electronic equipment
US20200073996A1 (en) * 2018-08-28 2020-03-05 Stitched.IO Limited Methods and Systems for Domain-Specific Disambiguation of Acronyms or Homonyms
US10733383B1 (en) * 2018-05-24 2020-08-04 Workday, Inc. Fast entity linking in noisy text environments
CN112784604A (en) * 2021-02-08 2021-05-11 哈尔滨工业大学 Entity linking method based on entity boundary network
CN112906397A (en) * 2021-04-06 2021-06-04 南通大学 Short text entity disambiguation method
CN113158687A (en) * 2021-04-29 2021-07-23 新声科技(深圳)有限公司 Semantic disambiguation method and device, storage medium and electronic device



Also Published As

Publication number Publication date
CN113704416B (en) 2022-03-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant