CN114925698A - Abbreviation disambiguation method, apparatus, computer device and storage medium - Google Patents

Abbreviation disambiguation method, apparatus, computer device and storage medium

Info

Publication number
CN114925698A
CN114925698A
Authority
CN
China
Prior art keywords
sentence
combination
abbreviation
paraphrase
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210361778.XA
Other languages
Chinese (zh)
Other versions
CN114925698B (en)
Inventor
罗雪山
欧丽珍
陈洪辉
蔡飞
王思远
胡涛
陈佩佩
李新梦
陈翀昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210361778.XA priority Critical patent/CN114925698B/en
Publication of CN114925698A publication Critical patent/CN114925698A/en
Application granted granted Critical
Publication of CN114925698B publication Critical patent/CN114925698B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to an abbreviation disambiguation method, apparatus, computer device and storage medium. The method comprises the following steps: acquiring an abbreviation and an original sentence containing the abbreviation from the text information to be disambiguated, and determining a plurality of candidate paraphrases of the abbreviation in a dictionary according to the abbreviation; obtaining a paraphrase combination according to the candidate paraphrases and the original sentence; obtaining a sentence combination according to the original sentence and a new sentence formed by replacing the abbreviation in the original sentence with a candidate paraphrase; inputting the paraphrase combination and the sentence combination into a trained twin neural network model respectively to obtain the score of the word dimension and the score of the sentence dimension; and obtaining the prediction result of the model according to the score of the word dimension and the score of the sentence dimension. The method can evaluate any number of candidate paraphrases of an abbreviation, is more robust, and improves the accuracy of abbreviation disambiguation.

Description

Abbreviation disambiguation method, apparatus, computer device and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for disambiguating an abbreviation, a computer device, and a storage medium.
Background
With the dramatic advances in natural language processing, neural network models have replaced many traditional models and are widely deployed in the physical world, including in many safety-critical areas. However, deep learning is very sensitive to model input, and slight changes in the input may cause errors in the model's judgment. Although acronyms are simple to construct, they are inherently ambiguous and often require contextual information to determine the correct interpretation. In tasks such as question answering, machine reading comprehension, information extraction, sensitive-word detection and retrieval, the expanded form of an acronym is often needed. Recognizing and understanding acronyms is therefore a challenging task for computers.
Abbreviation disambiguation can assist machine understanding, information retrieval, and other downstream natural language processing tasks. In the general flow, two sequences are usually input: one is a candidate paraphrase and the other is a text containing the abbreviation. The two sequences are encoded by a representation layer and then passed to an evaluation layer to obtain an evaluation score; the evaluation layer may be a semantic similarity function, a classifier, or the like, and the specific evaluation mode differs across modeling methods. Finally, the evaluation scores are ranked and the highest score is usually taken as the predicted answer, which is very similar to the structure of a twin network. In the AD challenge of AAAI-SDU, the baseline model GAD given by Veyseh et al. obtains sentence encodings through a BiLSTM, obtains context encodings by means of a syntactic structure (e.g., a dependency tree) and a GCN (graph convolutional network) model, and finally concatenates the abbreviation and sentence encodings under the two encodings as the input of an evaluation layer, predicting the definition of the abbreviation through a two-layer feed-forward classifier. The number of neurons in the last classifier layer equals the number of candidate definitions of the abbreviation in the dictionary, which means that when the number of definitions of an abbreviation in the dictionary grows, the model structure must change significantly.
Another existing idea is to combine a single abbreviation definition with the original sentence, exploiting BERT's ability to process two sentences simultaneously. The input format of two sentences is processed in the BERT manner: the [CLS] identifier, a candidate paraphrase (drawn from the dictionary), the [SEP] identifier and the original sentence are spliced together as the model input, and pseudo labels are constructed. Pseudo-labeling is essentially a semi-supervised machine learning technique: high-confidence network predictions are labeled, these labels are mixed with the training set to generate a new data set, and the new data set is then used to train a new binary classification model. This approach is more robust and can handle longer dictionaries. However, it does not take into account the degree of match and correlation between the candidate definitions and the original context, which makes abbreviation disambiguation less accurate.
Disclosure of Invention
In view of the foregoing, it is necessary to provide an abbreviation disambiguation method, apparatus, computer device and storage medium for the above technical problem.
A method of abbreviation disambiguation, the method comprising:
the method comprises the steps of obtaining an abbreviation and an original sentence containing the abbreviation from text information to be disambiguated, and determining a plurality of candidate paraphrases of the abbreviation in a dictionary according to the abbreviation.
Obtaining a paraphrase combination according to the candidate paraphrases and the original sentence; and obtaining a sentence combination according to the original sentence and a new sentence formed by replacing the abbreviation in the original sentence with the candidate paraphrase.
And respectively inputting the paraphrase combination and the sentence combination into a trained twin neural network model, and respectively scoring the codes of sentence pairs in the paraphrase combination and the sentence combination to obtain the score of word dimension and the score of sentence dimension.
And obtaining a prediction result of the model according to the score of the word dimension and the score of the sentence dimension.
In one embodiment, inputting the paraphrase combination and the sentence combination into a trained twin neural network model, respectively, and scoring the codes of sentence pairs in the paraphrase combination and the sentence combination to obtain a score of word dimension and a score of sentence dimension, respectively, includes:
and inputting each sentence pair in the paraphrase combination into a trained twin neural network model, and scoring the codes of the sentence pairs by using a vector similarity evaluation method to obtain the score of the word dimensionality.
And inputting each sentence pair in the sentence combination into a trained twin neural network model, and scoring the codes of the sentence pairs by using a vector similarity evaluation method to obtain the score of sentence dimensionality.
In one embodiment, inputting the paraphrase combination and the sentence combination into a trained twin neural network model, respectively, and scoring the codes of sentence pairs in the paraphrase combination and the sentence combination to obtain a score of word dimension and a score of sentence dimension, respectively, includes:
inputting each sentence pair in the paraphrase combination into a trained twin neural network model, coding the sentence pairs, and processing the coding result through a pooling layer to obtain two sentence vectors of the same dimension; passing the two sentence vectors through a softmax classifier to obtain paraphrase-combination prediction labels, and using the probability value of the prediction label being 1 as the score of the corresponding candidate paraphrase, to obtain the score of the word dimension.
And inputting each sentence pair in the sentence combination into the trained twin neural network model to obtain the score of the sentence dimension.
In one embodiment, the scoring of the word dimension is a row vector with dimensions equal to the number of candidate paraphrases; the sentence dimension is scored as a row vector with dimensions equal to the number of candidate paraphrases.
Obtaining a prediction result of a model according to the scoring of the word dimension and the scoring of the sentence dimension, wherein the method comprises the following steps:
and carrying out weighted summation on the scores of the word dimensions and the scores of the sentence dimensions to obtain a final score vector, sequencing elements in the final score vector, and taking the candidate paraphrases corresponding to the sequenced maximum values as the prediction results of the model.
In one embodiment, the paraphrase combination and the sentence combination are respectively input into a trained twin neural network model, codes of sentence pairs in the paraphrase combination and the sentence combination are respectively scored, scoring of word dimensions and scoring of sentence dimensions are obtained, and in the step, the trained twin neural network model is the trained BERT twin neural network model.
A method of abbreviation disambiguation, the method comprising:
the method comprises the steps of obtaining an abbreviation and an original sentence containing the abbreviation from text information to be disambiguated, and determining a plurality of candidate paraphrases of the abbreviation in a dictionary according to the abbreviation.
And forming a new sentence by replacing the abbreviation in the original sentence with the candidate paraphrase.
And obtaining a new sentence set according to the sentence pair formed by the new sentence and the original sentence.
And inputting the new sentence set into the trained twin neural network model, and scoring the codes of the sentence pairs in the new sentence set to obtain the scores of all the sentence pairs in the new sentence set.
And sorting according to the scores of all sentence pairs, and taking the candidate paraphrase corresponding to the sentence pair with the highest score as a model prediction result.
In one embodiment, the new sentence set is input into a trained twin neural network model, codes of sentence pairs in the new sentence set are scored, scores of all sentence pairs in the new sentence set are obtained, and the trained twin neural network model is a trained BERT twin neural network.
An abbreviation disambiguation apparatus, the apparatus comprising:
and the candidate paraphrase determining module is used for acquiring the abbreviation and the original sentence containing the abbreviation from the text information to be disambiguated, and determining a plurality of candidate paraphrases of the abbreviation in a dictionary according to the abbreviation.
The input data processing module is used for obtaining a paraphrase combination according to the candidate paraphrases and the original sentence; and obtaining a sentence combination according to the original sentence and a new sentence formed by replacing the abbreviation in the original sentence with the candidate paraphrase.
And the sentence pair similarity evaluation module is used for respectively inputting the paraphrase combination and the sentence combination into the trained twin neural network model, and respectively scoring the codes of the sentence pairs in the paraphrase combination and the sentence combination to obtain the score of the word dimension and the score of the sentence dimension.
And the abbreviation disambiguation module is used for obtaining a prediction result of the model according to the score of the word dimension and the score of the sentence dimension.
A computer device comprising a memory storing a computer program and a processor implementing the steps of any of the above-described abbreviation disambiguation methods when said computer program is executed by said processor.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of any of the above-mentioned abbreviation disambiguation methods.
The abbreviation disambiguation method, apparatus, computer device and storage medium described above comprise: acquiring an abbreviation and an original sentence containing the abbreviation from the text information to be disambiguated, and determining a plurality of candidate paraphrases of the abbreviation in a dictionary according to the abbreviation; obtaining a paraphrase combination according to the candidate paraphrases and the original sentence; obtaining a sentence combination according to the original sentence and a new sentence formed by replacing the abbreviation in the original sentence with the candidate paraphrase; inputting the paraphrase combination and the sentence combination into a trained twin neural network model respectively, and scoring the codes of the sentence pairs in the paraphrase combination and the sentence combination respectively, to obtain the score of the word dimension and the score of the sentence dimension; and obtaining the prediction result of the model according to the score of the word dimension and the score of the sentence dimension. The method can evaluate any number of candidate paraphrases of an abbreviation, is more robust, and improves the accuracy of abbreviation disambiguation.
Drawings
FIG. 1 is a schematic flow chart diagram of a method for disambiguation of abbreviations in one embodiment;
FIG. 2 is a twin network model base structure in one embodiment;
FIG. 3 is a BERT twin network structure in one embodiment, where (a) is the network structure of a classification target and (b) is the network structure of a regression target;
FIG. 4 is a schematic flow chart diagram illustrating a method for disambiguating abbreviations in another embodiment;
FIG. 5 is a diagram illustrating a distribution of the number of each abbreviation definition in a dictionary in accordance with one embodiment;
FIG. 6 is a block diagram of an abbreviation disambiguation apparatus according to one embodiment;
FIG. 7 is a diagram of the internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Abbreviation disambiguation in the present invention means that the given text contains only the abbreviation and not its expansion; studying dictionary-based global abbreviation disambiguation can help existing natural language processing systems better understand abbreviations.
In one embodiment, as shown in FIG. 1, there is provided an abbreviation disambiguation method comprising the steps of:
step 100: and acquiring the abbreviation and an original sentence containing the abbreviation from the text information to be disambiguated, and determining a plurality of candidate paraphrases of the abbreviation in a dictionary according to the abbreviation.
Abbreviations are a common language phenomenon in modern languages: short words formed by directly or indirectly extracting the main components of relatively complex, stable expressions, and representing the same meaning as the original expressions.
For example: the abbreviation CNN;
the original sentence containing the abbreviation is: CNN is a kind of feed-forward neural network.
The candidate definitions for abbreviations in the dictionary are:
Convolutional Neural Network;
Condensed Nearest Neighbor;
Complicated Neural Networks;
Citation Nearest Neighbour.
step 102: obtaining a paraphrase combination according to the candidate paraphrases and the original sentences; and obtaining a sentence combination according to a new sentence and the original sentence, wherein the new sentence is formed by replacing the abbreviation in the original sentence with the candidate paraphrase.
Specifically, each candidate paraphrase forms a sentence pair with the original sentence, and all such sentence pairs constitute the paraphrase combination; that is, the paraphrase combination is composed of sentence pairs built from each candidate paraphrase and the original sentence containing the abbreviation. Likewise, each new sentence, formed by replacing the abbreviation in the original sentence with a candidate paraphrase, forms a sentence pair with the original sentence, and all such sentence pairs constitute the sentence combination. The sentence pairs in the paraphrase combination and in the sentence combination are illustrated below, taking the candidate paraphrase "Convolutional Neural Network" of the abbreviation CNN as an example.
Sentence pair in the paraphrase combination: ("Convolutional Neural Network", "CNN is a kind of feed-forward neural network.")
Sentence pair in the sentence combination: ("Convolutional Neural Network is a kind of feed-forward neural network.", "CNN is a kind of feed-forward neural network.")
For each candidate paraphrase of an abbreviation in a sentence pair of the paraphrase combination, there will be a corresponding sentence pair in the sentence combination.
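As a concrete illustration, the following Python sketch builds both combinations for the CNN example above; the variable and list names are illustrative choices, not taken from the patent.

    abbreviation = "CNN"
    original = "CNN is a kind of feed-forward neural network."
    candidates = [
        "Convolutional Neural Network",
        "Condensed Nearest Neighbor",
        "Complicated Neural Networks",
        "Citation Nearest Neighbour",
    ]

    # Paraphrase combination: (candidate paraphrase, original sentence) pairs.
    paraphrase_combination = [(c, original) for c in candidates]

    # Sentence combination: (new sentence with the abbreviation replaced, original sentence) pairs.
    sentence_combination = [(original.replace(abbreviation, c), original) for c in candidates]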
Step 104: and respectively inputting the paraphrase combination and the sentence combination into a trained twin neural network model, and respectively scoring the codes of the sentence pairs in the paraphrase combination and the sentence combination to obtain the score of the word dimension and the score of the sentence dimension.
Specifically, a twin neural network (also called a Siamese network) differs from a traditional neural network model in that it consists of two networks sharing weights, and it performs classification or similarity prediction by mapping the two inputs into high-dimensional vectors and letting their features interact. The basic structure of the twin neural network is shown in FIG. 2. A twin network can directly measure the degree of correlation between two inputs; network 1 and network 2 can be two identical network models, such as a CNN, an LSTM, a Transformer, or an attention network.
Twin networks generally use a contrastive loss function (Contrastive Loss), calculated as:

L = (1 - Y) * (1/2) * D_W^2 + Y * (1/2) * [max(0, m - D_W)]^2

where D_W denotes the Euclidean distance between the two twin-network outputs, and Y takes the value 1 or 0: 0 if the two inputs are similar, otherwise 1. m denotes a margin value: for a dissimilar pair, a loss is incurred only when the distance between the two inputs is less than the margin, otherwise that term is 0.
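For clarity, a minimal PyTorch sketch of this contrastive loss follows, using the same convention (Y = 0 for similar pairs); the batch size, embedding dimension and margin value are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(d_w, y, margin=1.0):
        # y = 0 for similar pairs, y = 1 for dissimilar pairs (the convention above);
        # d_w is the Euclidean distance between the two twin-network outputs.
        similar_term = (1 - y) * 0.5 * d_w.pow(2)
        dissimilar_term = y * 0.5 * torch.clamp(margin - d_w, min=0).pow(2)
        return (similar_term + dissimilar_term).mean()

    u = torch.randn(8, 768)                 # encodings of input 1 (batch of 8)
    v = torch.randn(8, 768)                 # encodings of input 2
    y = torch.randint(0, 2, (8,)).float()   # 0 = similar, 1 = dissimilar
    loss = contrastive_loss(F.pairwise_distance(u, v), y)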
Unlike the existing approach of directly comparing a candidate paraphrase with the original text, this patent replaces the abbreviation in the original text with the candidate paraphrase, thereby constructing a new sentence. Using the twin-network approach, the correlation between each abbreviation definition and the original sentence, and between each candidate definition and the context of the corresponding abbreviation, is calculated. The framework can evaluate any number of candidate definitions for an abbreviation.
Step 106: and obtaining a prediction result of the model according to the scoring of the word dimension and the scoring of the sentence dimension.
The twin neural network separately evaluates the correlation between the candidate words of the abbreviation and the original text, and between the new sentences formed by replacing the abbreviation with the candidate words and the original text; the candidate definitions are ranked according to these two correlations, and the candidate definition with the highest correlation is taken as the model's prediction.
In the abbreviation disambiguation method described above, the method comprises: acquiring an abbreviation and an original sentence containing the abbreviation from the text information to be disambiguated, and determining a plurality of candidate paraphrases of the abbreviation in a dictionary according to the abbreviation; obtaining a paraphrase combination according to the candidate paraphrases and the original sentence; obtaining a sentence combination according to the original sentence and a new sentence formed by replacing the abbreviation in the original sentence with the candidate paraphrase; inputting the paraphrase combination and the sentence combination into a trained twin neural network model respectively, and scoring the codes of the sentence pairs in the paraphrase combination and the sentence combination respectively, to obtain the score of the word dimension and the score of the sentence dimension; and obtaining the prediction result of the model according to the score of the word dimension and the score of the sentence dimension. The method can evaluate any number of candidate paraphrases of an abbreviation, is more robust, and improves the accuracy of abbreviation disambiguation.
In one embodiment, step 104 includes: inputting each sentence pair in the paraphrase combination into a trained twin neural network model, and scoring the codes of the sentence pairs by a vector similarity evaluation method to obtain the score of the word dimension; and inputting each sentence pair in the sentence combination into the trained twin neural network model, and scoring the codes of the sentence pairs by a vector similarity evaluation method to obtain the score of the sentence dimension. Preferably, the vector similarity evaluation method may be cosine similarity, covariance similarity, Pearson correlation coefficient similarity, or the like.
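A brief sketch of such vector similarity scoring functions follows, assuming the sentence encodings are PyTorch tensors; note that the Pearson correlation coefficient is simply the cosine similarity of mean-centered vectors.

    import torch
    import torch.nn.functional as F

    def cosine_score(u, v):
        return F.cosine_similarity(u, v, dim=-1)

    def pearson_score(u, v):
        # The Pearson correlation coefficient equals the cosine similarity
        # of mean-centered vectors.
        u_c = u - u.mean(dim=-1, keepdim=True)
        v_c = v - v.mean(dim=-1, keepdim=True)
        return F.cosine_similarity(u_c, v_c, dim=-1)

    u, v = torch.randn(4, 768), torch.randn(4, 768)   # illustrative sentence encodings
    print(cosine_score(u, v), pearson_score(u, v))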
Specifically, the structure of the BERT twin network is shown in FIG. 3, in which (a) is the network structure for the classification target and (b) is the network structure for the regression target. The BERT in FIG. 3 includes BERT and its variants. With this method, the relations between the abbreviation and its definitions, and between the context of the abbreviation and its definitions, can be considered at the same time; robustness is also ensured, since the model structure is independent of the length of the abbreviation dictionary and can handle a candidate dictionary of any length.
The twin network model adopts the network structure shown in FIG. 3(b). In the experiments herein, no special characters for directing the model's attention are added; the two inputs are fed directly into the model. In addition, manual annotations are set for these data: an input containing the correct definition of the target abbreviation is labeled 1, and the remaining inputs are labeled 0.
The new sentence and the original sentence are input into the BERT twin network at the same time, so that the two inputs are directly encoded, and the model is trained with the contrastive loss: if the label indicates a similar pair, the two encodings are pushed as close together as possible; otherwise they are pushed as far apart as possible. At verification time, the encodings of all input pairs of the test set are obtained through the trained model, their semantic similarity is evaluated by cosine similarity, and the candidate paraphrase most similar to the original text information is taken as the correct paraphrase.
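The following sketch illustrates this inference procedure with the Hugging Face transformers library, reusing sentence_combination and candidates from the construction sketch above; the choice of bert-base-uncased and of mean pooling are assumptions, since the text only specifies a BERT twin network with a pooling layer.

    import torch
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")  # one encoder, shared weights

    def encode(sentences):
        batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**batch).last_hidden_state           # (batch, seq_len, 768)
        mask = batch["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)       # mean pooling over real tokens

    # Score each (new sentence, original sentence) pair by cosine similarity and
    # take the best-matching candidate paraphrase as the prediction.
    new_sents = [new for new, _ in sentence_combination]
    orig_sents = [orig for _, orig in sentence_combination]
    scores = F.cosine_similarity(encode(new_sents), encode(orig_sents))
    prediction = candidates[scores.argmax().item()]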
In one embodiment, step 104 includes: inputting each sentence pair in the paraphrase combination into a trained twin neural network model, coding the sentence pairs, and processing the coding result through a pooling layer to obtain two sentence vectors of the same dimension; passing the two sentence vectors through a softmax classifier to obtain paraphrase-combination prediction labels, and using the probability value of the prediction label being 1 as the score of the corresponding candidate paraphrase, to obtain the score of the word dimension; and inputting each sentence pair in the sentence combination into the trained twin neural network model to obtain the score of the sentence dimension.
Specifically, the twin network model adopts the network structure shown in FIG. 3(a): the two sentences of a sentence pair are input into the BERT model for encoding, the sentence encodings are unified in dimension through a pooling layer (Pooling) to obtain sentence vectors u and v of the same dimension, and a label is finally obtained through a softmax classifier. The classifier evaluates the sentence pairs of the two combinations, and the probability value of the predicted label being 1 is taken as the score of the corresponding paraphrase in the combination.
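A minimal sketch of such a classification head follows; concatenating (u, v, |u - v|) follows the common Sentence-BERT recipe and is an assumption here, since the text only states that u and v pass through a softmax classifier.

    import torch
    import torch.nn as nn

    class PairClassifier(nn.Module):
        # Softmax classifier over a sentence-vector pair (u, v); the (u, v, |u - v|)
        # feature concatenation is an assumption borrowed from Sentence-BERT.
        def __init__(self, dim=768, num_labels=2):
            super().__init__()
            self.fc = nn.Linear(3 * dim, num_labels)

        def forward(self, u, v):
            features = torch.cat([u, v, (u - v).abs()], dim=-1)
            return torch.softmax(self.fc(features), dim=-1)

    clf = PairClassifier()
    u, v = torch.randn(4, 768), torch.randn(4, 768)
    word_scores = clf(u, v)[:, 1]   # probability of label 1 -> word-dimension scores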
In one embodiment, the score of the word dimension is a row vector with dimension equal to the number of candidate paraphrases, and the score of the sentence dimension is likewise a row vector with dimension equal to the number of candidate paraphrases. Step 106 includes: performing weighted summation on the score of the word dimension and the score of the sentence dimension to obtain a final score vector, sorting the elements in the final score vector, and taking the candidate paraphrase corresponding to the maximum value as the prediction result of the model.
Specifically, the sentence combination and the paraphrase combination, each containing s sentence pairs (where s is the number of candidate paraphrases of the target abbreviation in the dictionary), are respectively input into the twin network, and the codes of all sentence pairs are scored by a vector similarity evaluation method. The score of the paraphrase combination, i.e., the score of the word dimension, is

p^w = (p_1^w, p_2^w, ..., p_s^w)

and the score of the sentence combination, i.e., the score of the sentence dimension, is

p^s = (p_1^s, p_2^s, ..., p_s^s)

where p_i^w and p_i^s are respectively the scores of the sentence pair in the paraphrase combination and the sentence pair in the sentence combination corresponding to the i-th candidate paraphrase, i = 1, 2, ..., s. The weighted sum of the score of the word dimension and the score of the sentence dimension is the final score

p = alpha * p^w + (1 - alpha) * p^s

where alpha denotes the weight. Writing p_m = max(p), i.e., m = argmax(p), the m-th candidate definition corresponding to p_m is the prediction result of the model.
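The score fusion itself reduces to a few lines; the example below is illustrative, with alpha = 0.1 chosen to mirror the experiment in which the sentence-matching weight is 0.9 and the paraphrase weight is 0.1.

    import torch

    def fuse_scores(p_word, p_sent, alpha=0.1):
        # Final score p = alpha * p^w + (1 - alpha) * p^s.
        p = alpha * p_word + (1 - alpha) * p_sent
        return p, p.argmax().item()

    p_word = torch.tensor([0.31, 0.22, 0.25, 0.22])   # illustrative word-dimension scores
    p_sent = torch.tensor([0.62, 0.11, 0.15, 0.12])   # illustrative sentence-dimension scores
    p, m = fuse_scores(p_word, p_sent)                # candidates[m] is the predicted paraphrase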
In one embodiment, the twin neural network model trained in step 104 is a trained BERT twin neural network model.
Preferably, the twin sub-networks in the trained twin neural network model may also be a CNN, an LSTM, a Transformer, an attention network, or the like.
In one embodiment, as shown in FIG. 4, there is provided an abbreviation disambiguation method comprising the following steps:
step 400: the method comprises the steps of obtaining an abbreviation and an original sentence containing the abbreviation from text information to be disambiguated, and determining a plurality of candidate paraphrases of the abbreviation in a dictionary according to the abbreviation.
Specifically, unlike the conventional method of directly using the candidate definitions and the original text for similarity evaluation, this method also incorporates the degree of match between the candidate definitions and the context of the abbreviation: the candidate words of the abbreviation replace the abbreviation in the original text to form a new sentence set, and the new sentences and the original text are then jointly input into a twin network to evaluate the similarity of the two sentences.
Step 402: and forming a new sentence by replacing the abbreviation in the original sentence with the candidate paraphrase.
Step 404: and obtaining a new sentence set according to the sentence pair formed by the new sentence and the original sentence.
Step 406: and inputting the new sentence set into the trained twin neural network model, and scoring the codes of the sentence pairs in the new sentence set to obtain the scores of all the sentence pairs in the new sentence set.
Step 408: and sorting according to the scores of all sentence pairs, and taking the candidate paraphrase corresponding to the sentence pair with the highest score as a model prediction result.
In one embodiment, the twin neural network model trained in step 406 is a trained BERT twin neural network.
It should be understood that although the steps in the flowcharts of FIG. 1 and FIG. 4 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in FIG. 1 and FIG. 4 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily executed sequentially, and may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
In a validation embodiment, the AD data set of the SDU challenge at AAAI 2021 was chosen for the experimental demonstration. The raw data comprise 50034 training samples, 6189 validation samples and 6218 test samples. However, since the labels of the test samples are not public and the challenge task has closed, 10% of the data were randomly extracted from the 50034 training samples as the validation set, and the original validation set was used as the test set; the number of training samples in our experiments is therefore 45031, the number of validation samples is 5003, and the number of test samples is 6189. In the matched dictionary, the number of definitions per abbreviation is mostly within 5; 4 abbreviations have more than 15 definitions, but the samples containing these abbreviations account for only 2.21% of the total number of samples in the test set.
The baseline framework of this embodiment includes the model accompanying the data set and single-paraphrase scoring, compared against sentence matching and double scoring. From the model perspective, BERT-base and BERT-large are each selected for the experiments, providing references for subsequent studies.
(1) Results of the experiment
In the experiments, the PyTorch framework is used for the implementation, and the pre-trained models are mainly taken from Hugging Face (https://huggingface.co/models). At the same time, no preprocessing was performed on the initial data, in order to test the robustness of the model to unprocessed data. The results of the experiment are shown in Table 1.
TABLE 1 abbreviation disambiguation test results
As can be seen from Table 1, even with the most basic BERT model, sentence matching performs significantly better than matching the paraphrase directly against the original document. In the double scoring, the sentence-matching weight is 0.9 and the single-paraphrase weight is 0.1. The experiments show that BERT-base sentence similarity evaluation is 7.7% higher in F1 value than direct paraphrase evaluation and 27.738% higher than the data set baseline model, and the combined similarity is 7.9% higher than direct paraphrase evaluation.
(2) Robustness testing
A statistical analysis of the prepared abbreviation dictionary shows that the dictionary contains 732 abbreviations in total; the average number of paraphrases per abbreviation is about 3, the maximum is 20 and the minimum is 2. The number of abbreviations with fewer than 5 definitions is 660, accounting for 90.16% of the total; the number with between 5 and 10 definitions is 55, accounting for 7.51%; the number with between 10 and 15 definitions is 13, only 1.78% of the total; and only 4 abbreviations have more than 15 definitions, a mere 0.55% of the total. Analysis of the test set shows that the numbers of samples containing these four abbreviations, "CA", "CS", "CC" and "SC", are 44, 40, 35 and 18 respectively, accounting for only 2.21% of the total number of samples; for most samples, the abbreviation has no more than 5 candidate definitions. The specific distribution is shown in FIG. 5. In fact, machines have the advantage of being able to process more information and data than humans. This embodiment therefore proposes a new model robustness test: the existing dictionary is extended according to the AcronymFinder website, and the robustness of the model is verified experimentally.
To ensure the validity of the experiment, the training hyper-parameters and evaluation indexes of the model are kept consistent, and only the number of paraphrases per abbreviation in the dictionary is expanded. The expansion obtains the 30 most frequently used definitions from abbreviation-query websites such as AcronymFinder as an expansion definition set and adds them to the original dictionary. To keep an effective contrast with the original dictionary, each abbreviation is given at most 20 definitions, consistent with the original dictionary. In this experiment only true expansions of the acronyms are retained; for example, "Position" will not be considered an acronym definition for "POS". Compared with the original dictionary, the number of paraphrases in the expanded dictionary is larger, and it is harder for a human to select the accurate paraphrase. For the BERT twin network based approach, however, increasing the number of dictionary definitions essentially expands the data set by increasing the number of negative samples, and a modest increase in negative samples is essentially beneficial to the performance of deep learning.
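A sketch of this dictionary-expansion step, under the assumption that the dictionary and the scraped high-frequency definitions are both plain Python mappings from an abbreviation to a list of definition strings:

    def expand_dictionary(dictionary, extra_definitions, cap=20):
        # Add high-frequency definitions (e.g., scraped from AcronymFinder) to each
        # abbreviation's list, keeping at most `cap` definitions per abbreviation
        # so the expanded dictionary stays comparable with the original one.
        expanded = {}
        for abbr, defs in dictionary.items():
            new_defs = [d for d in extra_definitions.get(abbr, []) if d not in defs]
            expanded[abbr] = (defs + new_defs)[:cap]
        return expanded

    toy_dict = {"CNN": ["Convolutional Neural Network", "Condensed Nearest Neighbor"]}
    toy_extra = {"CNN": ["Complicated Neural Networks", "Citation Nearest Neighbour"]}
    print(expand_dictionary(toy_dict, toy_extra))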
In one embodiment, as shown in fig. 6, there is provided an abbreviation disambiguation apparatus comprising: a candidate paraphrase determining module, an input data processing module, a sentence pair similarity evaluation module, and an abbreviation disambiguation module, wherein:
and the candidate paraphrase determining module is used for acquiring the abbreviation and the original sentence containing the abbreviation from the text information to be disambiguated, and determining a plurality of candidate paraphrases of the abbreviation in the dictionary according to the abbreviation.
The input data processing module is used for obtaining paraphrase combinations according to the candidate paraphrases and the original sentences; and obtaining a sentence combination according to a new sentence and the original sentence, wherein the new sentence is formed by replacing the abbreviation in the original sentence with the candidate paraphrase.
And the sentence pair similarity evaluation module is used for respectively inputting the paraphrase combination and the sentence combination into the trained twin neural network model, and respectively scoring the codes of the sentence pairs in the paraphrase combination and the sentence combination to obtain the score of the word dimension and the score of the sentence dimension.
And the abbreviation disambiguation module is used for obtaining a prediction result of the model according to the scoring of the word dimension and the scoring of the sentence dimension.
In one embodiment, the sentence pair similarity evaluation module is further configured to input each sentence pair in the paraphrase combination into the trained twin neural network model, and score the codes of the sentence pairs by using a vector similarity evaluation method to obtain a score of a word dimension; and inputting each sentence pair in the sentence combination into the trained twin neural network model, and scoring the codes of the sentence pairs by using a vector similarity evaluation method to obtain the score of the sentence dimensionality.
In one embodiment, the similarity evaluation module of the sentence pairs is further configured to input each sentence pair in the paraphrase combination into a trained twin neural network model, encode the sentence pair, and process the encoding result through a pooling layer to obtain two sentence vectors with the same dimension; sentence vectors with two same dimensions pass through a softmax classifier to obtain paraphrase combined prediction labels, and the probability value of the paraphrase combined prediction labels being 1 is used as the score of corresponding candidate paraphrases to obtain the score of word dimension; and inputting each sentence pair in the sentence combination into the trained twin neural network model to obtain the score of the sentence dimension.
In one embodiment, the score of the word dimension is a row vector with dimension equal to the number of candidate paraphrases, and the score of the sentence dimension is likewise a row vector with dimension equal to the number of candidate paraphrases; and the abbreviation disambiguation module is further configured to perform weighted summation on the score of the word dimension and the score of the sentence dimension to obtain a final score vector, sort the elements in the final score vector, and take the candidate paraphrase corresponding to the maximum value as the prediction result of the model.
In one embodiment, the trained twin neural network model in the similarity evaluation module for sentence pairs is a trained BERT twin neural network model.
For specific limitations of the abbreviation disambiguation apparatus, reference may be made to the above limitations of the abbreviation disambiguation method, which will not be described herein again. The modules in the abbreviation disambiguation apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 7. The computer device comprises a processor, a memory, a network interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program when executed by a processor implements a method of disambiguation of abbreviations. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not to be understood as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent application shall be subject to the appended claims.

Claims (10)

1. A method of disambiguating abbreviations, the method comprising:
acquiring an abbreviation and an original sentence containing the abbreviation from text information to be disambiguated, and determining a plurality of candidate paraphrases of the abbreviation in a dictionary according to the abbreviation;
obtaining a paraphrase combination according to the candidate paraphrases and the original sentence; obtaining a sentence combination according to the original sentence and a new sentence formed by replacing the abbreviation in the original sentence with the candidate paraphrase;
inputting the paraphrase combination and the sentence combination into a trained twin neural network model respectively, and scoring codes of sentence pairs in the paraphrase combination and the sentence combination respectively to obtain a score of word dimension and a score of sentence dimension;
and obtaining a prediction result of the model according to the score of the word dimension and the score of the sentence dimension.
2. The method of claim 1, wherein inputting the paraphrase combination and the sentence combination into a trained twin neural network model, respectively, and scoring the encoding of sentence pairs in the paraphrase combination and the sentence combination, respectively, to obtain a score for a word dimension and a score for a sentence dimension, comprises:
inputting each sentence pair in the paraphrase combination into a trained twin neural network model, and scoring the codes of the sentence pairs by using a vector similarity evaluation method to obtain a score of word dimensionality;
and inputting each sentence pair in the sentence combination into a trained twin neural network model, and scoring the codes of the sentence pairs by using a vector similarity evaluation method to obtain the score of sentence dimensionality.
3. The method of claim 1, wherein inputting the paraphrase combination and the sentence combination into a trained twin neural network model, respectively, and scoring the encoding of sentence pairs in the paraphrase combination and the sentence combination, respectively, to obtain a score for a word dimension and a score for a sentence dimension, comprises:
inputting each sentence pair in the paraphrase combination into a trained twin neural network model, coding the sentence pairs, and processing the coding result through a pooling layer to obtain two sentence vectors of the same dimension; passing the two sentence vectors through a softmax classifier to obtain paraphrase-combination prediction labels, and using the probability value of the prediction label being 1 as the score of the corresponding candidate paraphrase, to obtain the score of the word dimension;
and inputting each sentence pair in the sentence combination into the trained twin neural network model to obtain the score of the sentence dimension.
4. The method of claim 1, wherein the score of the word dimension is a row vector with dimension equal to the number of candidate paraphrases, and the score of the sentence dimension is a row vector with dimension equal to the number of candidate paraphrases;
obtaining a prediction result of a model according to the score of the word dimension and the score of the sentence dimension, wherein the prediction result comprises the following steps:
and performing weighted summation on the score of the word dimension and the score of the sentence dimension to obtain a final score vector, sorting the elements in the final score vector, and taking the candidate paraphrase corresponding to the maximum value as the prediction result of the model.
5. The method of claim 1, wherein the paraphrase combination and the sentence combination are input into a trained twin neural network model, respectively, and the encoding of sentence pairs in the paraphrase combination and the sentence combination is scored, respectively, to obtain a score for word dimensions and a score for sentence dimensions, wherein the trained twin neural network model is a trained BERT twin neural network model.
6. A method of disambiguating abbreviations, the method comprising:
acquiring an abbreviation and an original sentence containing the abbreviation from text information to be disambiguated, and determining a plurality of candidate paraphrases of the abbreviation in a dictionary according to the abbreviation;
forming a new sentence by replacing the abbreviation in the original sentence with the candidate paraphrase;
obtaining a new sentence set according to a sentence pair formed by the new sentence and the original sentence;
inputting the new sentence set into a trained twin neural network model, and scoring the codes of the sentence pairs in the new sentence set to obtain the scores of all the sentence pairs in the new sentence set;
and sorting according to the scores of all sentence pairs, and taking the candidate paraphrase corresponding to the sentence pair with the highest score as a model prediction result.
7. The method of claim 6, wherein the new set of sentences is input into a trained twin neural network model, and wherein the coding of sentence pairs in the new set of sentences is scored to obtain scores for all sentence pairs in the new set of sentences, and wherein the trained twin neural network model is a trained BERT twin neural network.
8. An abbreviation disambiguating apparatus, the apparatus comprising:
the candidate paraphrase determining module is used for acquiring the abbreviation and the original sentence containing the abbreviation from the text information to be disambiguated, and determining a plurality of candidate paraphrases of the abbreviation in a dictionary according to the abbreviation;
the input data processing module is used for obtaining a paraphrase combination according to the candidate paraphrases and the original sentence; and obtaining a sentence combination according to the original sentence and a new sentence formed by replacing the abbreviation in the original sentence with the candidate paraphrase;
the sentence pair similarity evaluation module is used for respectively inputting the paraphrase combination and the sentence combination into a trained twin neural network model, and respectively scoring the codes of the sentence pairs in the paraphrase combination and the sentence combination to obtain a score of a word dimension and a score of a sentence dimension;
and the abbreviation disambiguation module is used for obtaining a prediction result of the model according to the scoring of the word dimension and the scoring of the sentence dimension.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program performs the steps of the method according to any of claims 1 to 5.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
CN202210361778.XA 2022-04-07 2022-04-07 Abbreviation disambiguation method, apparatus, computer device, and storage medium Active CN114925698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210361778.XA CN114925698B (en) 2022-04-07 2022-04-07 Abbreviation disambiguation method, apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210361778.XA CN114925698B (en) 2022-04-07 2022-04-07 Abbreviation disambiguation method, apparatus, computer device, and storage medium

Publications (2)

Publication Number Publication Date
CN114925698A true CN114925698A (en) 2022-08-19
CN114925698B CN114925698B (en) 2024-08-20

Family

ID=82805551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210361778.XA Active CN114925698B (en) 2022-04-07 2022-04-07 Abbreviation disambiguation method, apparatus, computer device, and storage medium

Country Status (1)

Country Link
CN (1) CN114925698B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117555995A (en) * 2024-01-11 2024-02-13 北京领初医药科技有限公司 Hierarchical abbreviation sentence matching confirmation method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523312A (en) * 2020-04-22 2020-08-11 南京贝湾信息科技有限公司 Paraphrase disambiguation-based query display method and device and computing equipment
WO2021057884A1 (en) * 2019-09-27 2021-04-01 华为技术有限公司 Sentence paraphrasing method, and method and apparatus for training sentence paraphrasing model
US20220067294A1 (en) * 2020-09-02 2022-03-03 Jpmorgan Chase Bank, N.A. Systems and methods for generalized structured data discovery utilizing contextual metadata disambiguation via machine learning techniques

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021057884A1 (en) * 2019-09-27 2021-04-01 华为技术有限公司 Sentence paraphrasing method, and method and apparatus for training sentence paraphrasing model
CN111523312A (en) * 2020-04-22 2020-08-11 南京贝湾信息科技有限公司 Paraphrase disambiguation-based query display method and device and computing equipment
US20220067294A1 (en) * 2020-09-02 2022-03-03 Jpmorgan Chase Bank, N.A. Systems and methods for generalized structured data discovery utilizing contextual metadata disambiguation via machine learning techniques

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李国佳; 赵莹地; 郭鸿奇: "A Word Sense Disambiguation Method Based on Polysemous Word Vector Representation" (一种基于多义词向量表示的词义消歧方法), Intelligent Computer and Applications (智能计算机与应用), no. 04, 28 August 2018 (2018-08-28) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117555995A (en) * 2024-01-11 2024-02-13 北京领初医药科技有限公司 Hierarchical abbreviation sentence matching confirmation method and system
CN117555995B (en) * 2024-01-11 2024-04-12 北京领初医药科技有限公司 Hierarchical abbreviation sentence matching confirmation method and system

Also Published As

Publication number Publication date
CN114925698B (en) 2024-08-20

Similar Documents

Publication Publication Date Title
KR102490752B1 (en) Deep context-based grammatical error correction using artificial neural networks
CN110196982B (en) Method and device for extracting upper-lower relation and computer equipment
Isa et al. Indobert for indonesian fake news detection
CN111460820A (en) Network space security domain named entity recognition method and device based on pre-training model BERT
CN113282713B (en) Event trigger detection method based on difference neural representation model
CN112966068A (en) Resume identification method and device based on webpage information
Ciosici et al. Unsupervised Abbreviation Disambiguation: Contextual disambiguation using word embeddings
Choi et al. Analyzing zero-shot cross-lingual transfer in supervised NLP tasks
Xu et al. Sentence segmentation for classical Chinese based on LSTM with radical embedding
CN111523312B (en) Word searching display method and device based on paraphrasing disambiguation and computing equipment
CN113806493A (en) Entity relationship joint extraction method and device for Internet text data
Nehar et al. Rational kernels for Arabic root extraction and text classification
CN113221553A (en) Text processing method, device and equipment and readable storage medium
CN114925698B (en) Abbreviation disambiguation method, apparatus, computer device, and storage medium
Hangaragi et al. Accuracy Comparison of Neural Models for Spelling Correction in Handwriting OCR Data
Khan et al. A Roman Urdu Corpus for sentiment analysis
Choi et al. How to generate data for acronym detection and expansion
CN116089605A (en) Text emotion analysis method based on transfer learning and improved word bag model
Rao et al. Language Detection Using Natural Language Processing
Mi et al. Recurrent neural network based loanwords identification in Uyghur
CN112948536A (en) Information extraction method and device for web resume page
Yao ADHN: Sentiment Analysis of Reviews for MOOCs of Dilated Convolution Neural Network and Hierarchical Attention Network Based on ALBERT
Helmy A multilingual encoding method for text classification and dialect identification using convolutional neural network
Sampath et al. Hybrid Tamil spell checker with combined character splitting
Biondi et al. Defining classification ambiguity to discover a potential bias applied to emotion recognition data sets

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant