CN111738001A - Training method of synonym recognition model, synonym determination method and equipment


Info

Publication number: CN111738001A
Application number: CN202010781406.3A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN111738001B (granted)
Inventors: 高文龙, 张子恒, 陈曦, 文瑞, 管冲, 向玥佳, 刘博, 孙继超
Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Legal status: Granted; Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods


Abstract

The embodiment of the application provides a training method of a synonym recognition model, a synonym determination method and equipment, and relates to the technical field of machine learning and computers. The method comprises the following steps: acquiring a plurality of words; acquiring multi-source characteristic information of a word, wherein the multi-source characteristic information comprises semantic characteristic information and character characteristic information; determining a plurality of training samples based on the plurality of words; determining a synonym prediction result and a correlation prediction result of the training sample based on multi-source characteristic information of two words in the training sample through the synonym recognition model, wherein the correlation prediction result refers to a prediction result of the correlation between the two words in the training sample; calculating a loss function value of the synonym recognition model based on the synonym prediction result and the correlation prediction result of the training sample; and training the synonym recognition model according to the loss function value. According to the technical scheme provided by the embodiment of the application, the accuracy of synonym recognition can be improved.

Description

Training method of synonym recognition model, synonym determination method and equipment
Technical Field
The embodiment of the application relates to the technical field of machine learning and computers, in particular to a training method of a synonym recognition model, a synonym determination method and equipment.
Background
With the development of computer technology, machine learning techniques oriented to artificial intelligence are increasingly applied in natural language analysis scenarios, such as recognizing synonyms.
In the related art, the edit distance refers to the minimum number of editing operations required to transform one character string into another, and it can represent the degree of difference between two character strings. Whether two words are synonyms is determined according to the edit distance between them: when the edit distance between the two words is smaller than or equal to a preset value, the two words are determined to be synonyms; when the edit distance between the two words is larger than the preset value, the two words are determined to be non-synonyms.
In the above related art, since there exist non-synonym pairs with small edit distances, the recognition accuracy of synonyms is low.
Disclosure of Invention
The embodiment of the application provides a training method of a synonym recognition model, a synonym determination method and equipment, and the synonym recognition accuracy can be improved. The technical scheme is as follows.
According to an aspect of an embodiment of the present application, there is provided a method for training a synonym recognition model, the method including:
acquiring a plurality of words;
acquiring multi-source characteristic information of the words, wherein the multi-source characteristic information comprises semantic characteristic information and character characteristic information, the semantic characteristic information is used for representing semantic characteristics of the words, and the character characteristic information is used for representing word characteristics of the words;
determining a plurality of training samples based on the plurality of words, the plurality of training samples including at least one positive sample and at least one negative sample, the positive samples being synonym pairs and the negative samples being non-synonym pairs;
determining a synonym prediction result and a correlation prediction result of the training sample based on multi-source characteristic information of two words in the training sample through a synonym recognition model, wherein the synonym prediction result is a prediction result of whether the two words in the training sample are synonyms, and the correlation prediction result is a prediction result of correlation between the two words in the training sample;
calculating a loss function value of the synonym recognition model based on the synonym prediction result and the correlation prediction result of the training sample;
and training the synonym recognition model according to the loss function value.
According to an aspect of an embodiment of the present application, there is provided a synonym determination method, including:
acquiring a target word pair, wherein the target word pair comprises a first word and a second word;
acquiring multi-source characteristic information of the first word and multi-source characteristic information of the second word, wherein the multi-source characteristic information comprises semantic characteristic information and character characteristic information, the semantic characteristic information is used for representing semantic characteristics of the words, and the character characteristic information is used for representing word characteristics of the words;
determining a synonym prediction result of the target word pair based on multi-source characteristic information of the target word pair through a synonym recognition model, wherein the synonym prediction result is a prediction result of whether the first word and the second word are synonyms or not.
According to an aspect of an embodiment of the present application, there is provided a training apparatus for a synonym recognition model, the apparatus including:
the word acquisition module is used for acquiring a plurality of words;
the information acquisition module is used for acquiring multi-source characteristic information of the words, wherein the multi-source characteristic information comprises semantic characteristic information and character characteristic information, the semantic characteristic information is used for representing semantic characteristics of the words, and the character characteristic information is used for representing word characteristics of the words;
a sample determination module, configured to determine a plurality of training samples based on the plurality of words, where the plurality of training samples include at least one positive sample and at least one negative sample, the positive sample is a synonym pair, and the negative sample is a non-synonym pair;
the result prediction module is used for determining a synonym prediction result and a correlation prediction result of the training sample based on multi-source characteristic information of two words in the training sample through a synonym recognition model, wherein the synonym prediction result is a prediction result of whether the two words in the training sample are synonyms, and the correlation prediction result is a prediction result of correlation between the two words in the training sample;
the loss calculation module is used for calculating a loss function value of the synonym recognition model based on the synonym prediction result and the correlation prediction result of the training sample;
and the model training module is used for training the synonym recognition model according to the loss function value.
According to an aspect of an embodiment of the present application, there is provided a synonym determination apparatus, including:
the word pair obtaining module is used for obtaining a target word pair, and the target word pair comprises a first word and a second word;
the information acquisition module is used for acquiring multi-source characteristic information of the first word and multi-source characteristic information of the second word, wherein the multi-source characteristic information comprises semantic characteristic information and character characteristic information, the semantic characteristic information is used for representing semantic characteristics of the words, and the character characteristic information is used for representing word characteristics of the words;
and the result determining module is used for determining a synonym prediction result of the target word pair based on the multi-source characteristic information of the target word pair through a synonym recognition model, wherein the synonym prediction result refers to a prediction result of whether the first word and the second word are synonyms.
According to an aspect of the embodiments of the present application, there is provided a computer device, including a processor and a memory, where at least one instruction, at least one program, a code set, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the training method of the synonym recognition model or to implement the synonym determination method.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having at least one instruction, at least one program, a set of codes, or a set of instructions stored therein, which is loaded and executed by a processor to implement the method for training a synonym recognition model described above, or to implement the method for determining synonyms described above.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the training method of the synonym recognition model or executes the synonym determination method.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
the synonym recognition model is trained by using the multi-source characteristic information of the words: the loss function value of the synonym recognition model is calculated based on the synonym prediction result and the correlation prediction result obtained through multi-task learning, and the synonym recognition model is trained according to the loss function value, so that the accuracy of synonym recognition can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of a method for training a synonym recognition model provided in one embodiment of the present application;
FIG. 2 is a flow chart of a method for training a synonym recognition model according to another embodiment of the present application;
FIG. 3 is a schematic diagram of a body part query tree as provided by one embodiment of the present application;
FIG. 4 is an architecture diagram of a synonym recognition model provided in one embodiment of the present application;
FIG. 5 is a flow chart of a synonym determination method provided in one embodiment of the present application;
FIG. 6 is a schematic diagram of a method for training a synonym recognition model according to an embodiment of the present application;
FIG. 7 is a block diagram of a training apparatus for a synonym recognition model according to an embodiment of the present application;
FIG. 8 is a block diagram of a training apparatus for a synonym recognition model according to another embodiment of the present application;
FIG. 9 is a block diagram of a synonym determination apparatus provided in one embodiment of the present application;
FIG. 10 is a block diagram of a synonym determination apparatus provided in another embodiment of the present application;
FIG. 11 is a block diagram of a computer device provided by another embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described below do not represent all embodiments consistent with the present application; rather, they are merely examples of methods consistent with aspects of the present application, as detailed in the appended claims.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive subject that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field therefore involves natural language, i.e. the language that people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specially studies how a computer can simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiment of the application relates to artificial intelligence natural language processing and machine learning technologies, for example, synonyms are determined by using the natural language processing technology, and synonym recognition models are trained by using the machine learning technology.
According to the method provided by the embodiment of the application, the execution main body of each step can be a computer device, and the computer device refers to an electronic device with data calculation, processing and storage capabilities. The computer device may be a terminal such as a PC (personal computer), a tablet, a smartphone, a wearable device, a smart robot, or the like; or may be a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services.
Referring to fig. 1, a flowchart of a method for training a synonym recognition model according to an embodiment of the present application is shown. The method can comprise the following steps (101-106).
Step 101, obtaining a plurality of words.
In some embodiments, the plurality of words are obtained from a plurality of channels, such as a knowledge graph, business data, and network retrieval data. A knowledge graph combines theories and methods from subjects such as mathematics, graphics, information visualization technology and information science with methods such as citation analysis and co-occurrence analysis, and uses visual graphs to vividly display an integral knowledge framework. The business data refers to data generated during the actual operation of a computer program product, from which words provided by users can be acquired. Optionally, the plurality of words are words related to the same field, such as the medical field, the botanical field, or the chemical field. When the plurality of words are related words in the same field, the accuracy of synonym determination can be improved. Optionally, the plurality of words are words that have undergone cleaning operations such as correction of wrongly written characters and removal of duplicate words.
And 102, acquiring multi-source characteristic information of the words.
The multi-source characteristic information comprises semantic characteristic information and character characteristic information, wherein the semantic characteristic information is used for representing semantic characteristics of words, and the character characteristic information is used for representing literal characteristics of the words. In some embodiments, the semantic feature information comprises a word vector. word2vec is a neural network language model that can output words as word vectors in a distributed representation, to facilitate downstream natural language processing tasks. The word vector contains rich semantic information of the word and is high-dimensional; for example, the word vector may be a vector of hundreds of dimensions. In some embodiments, the character feature information is represented in vector form. Optionally, the character feature information includes an edit distance (e.g., a text edit distance, a pinyin edit distance), a subsequence, a number of character components, and the like. Edit distance, also known as the Levenshtein distance, quantifies the degree of difference between two strings as the minimum number of editing operations required to change one string into the other. In the embodiment of the present application, the edit distance is a quantitative measure of the degree of difference between two words. The multi-source characteristic information may also include other information, which is not limited in the embodiments of the present application.
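As a concrete illustration of the character feature information described above, below is a minimal Python sketch; the exact feature set and the to_pinyin transliteration helper are illustrative assumptions, and the semantic feature information (the word vectors) would come from a separately trained word2vec model.

```python
import numpy as np

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def char_features(w1: str, w2: str, to_pinyin) -> np.ndarray:
    """Character feature information for a word pair; `to_pinyin` is a
    hypothetical str -> str transliteration helper, and the chosen
    features are an illustrative subset of those named in the text."""
    return np.array([
        edit_distance(w1, w2),                        # text edit distance
        edit_distance(to_pinyin(w1), to_pinyin(w2)),  # pinyin edit distance
        len(set(w1) & set(w2)),                       # shared characters
    ], dtype=np.float32)
```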
Step 103, determining a plurality of training samples based on the plurality of words.
The training samples comprise at least one positive sample and at least one negative sample, the positive sample is a synonym pair, and the negative sample is a non-synonym pair. Each word pair comprises two words, and when the two words in the word pair are synonyms, the word pair is a positive sample; when two words in a word pair are non-synonyms, the word pair is a negative example.
In some embodiments, this step 103 includes the following sub-steps:
1. according to semantic similarity among semantic feature information of a plurality of words, dividing the plurality of words into a plurality of word sets, wherein the semantic similarity among the semantic feature information of the words in the same word set is larger than a threshold value, and the semantic similarity among the semantic feature information of the words in different word sets is smaller than the threshold value;
2. selecting two words from the same word set to construct a positive sample;
3. two words are selected from different word sets to construct a negative sample.
In some embodiments, the semantic feature information includes word vectors, and the cosine similarity between the word vectors corresponding to two words is taken as their semantic similarity. The semantic similarity between every pair of word vectors corresponding to the plurality of words is calculated; words whose semantic similarity is greater than the threshold value are divided into the same word set, and words whose semantic similarity is smaller than the threshold value are divided into different word sets. The semantic similarity between words in the same word set may also be equal to the threshold value. In some examples, when the semantic similarity between the word vector of a word and the word vectors of all other words is smaller than the threshold value, the word is placed in a word set of its own, that is, the word set contains only that word. Thus, the words in the same word set can be regarded as synonym pairs, and two words selected from the same word set can be constructed as a positive sample; the words in different word sets are non-synonym pairs, and two words selected from different word sets can be constructed as a negative sample. Optionally, the specific value of the threshold is set by a related technician according to the actual situation, which is not limited in the embodiment of the present application.
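A minimal sketch of this sample-construction step follows, assuming the word vectors are already available as a dict; the greedy single-pass grouping and the 0.8 threshold are illustrative, since the text fixes only the thresholding rule, not the grouping algorithm.

```python
import numpy as np
from itertools import combinations

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def build_training_samples(word_vectors: dict, threshold: float = 0.8):
    """Divide words into word sets by semantic similarity, then build
    positive samples within a set and negative samples across sets."""
    word_sets = []
    for word, vec in word_vectors.items():
        for ws in word_sets:
            # join the first set whose members are all similar enough
            if all(cosine_similarity(vec, word_vectors[w]) > threshold for w in ws):
                ws.append(word)
                break
        else:
            word_sets.append([word])  # no similar set found: open a new one

    positives = [(a, b, 1) for ws in word_sets
                 for a, b in combinations(ws, 2)]
    negatives = [(s1[0], s2[0], 0)  # one representative pair per set pair
                 for s1, s2 in combinations(word_sets, 2)]
    return positives + negatives
```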
And step 104, determining a synonym prediction result and a correlation prediction result of the training sample based on the multi-source characteristic information of the two words in the training sample through the synonym recognition model.
The synonym prediction result refers to a prediction result of whether two words in the training sample are synonyms, and the correlation prediction result refers to a prediction result of the correlation between the two words in the training sample. In some embodiments, the multi-source word feature information in the training sample is processed and fused through the synonym recognition model to obtain fused feature information corresponding to each word, and then the synonym prediction result and the correlation prediction result of the training sample are obtained through analyzing the relationship between the fused feature information corresponding to two words in the training sample.
And 105, calculating a loss function value of the synonym recognition model based on the synonym prediction result and the correlation prediction result of the training sample.
In some embodiments, after obtaining the synonym prediction result and the correlation prediction result of the training sample, the loss function value of the synonym recognition model can be obtained based on the synonym prediction result, the correlation prediction result, the label information of the training sample, and the loss function calculation formula of the training sample. In some examples, a plurality of training samples in a round of training cycles corresponds to one loss function value; in other examples, there is one loss function value for each training sample in each training round.
And 106, training the synonym recognition model according to the loss function value.
Optionally, the synonym recognition model adjusts its model parameters according to the loss function value and performs the next round of training, so as to reduce the loss function value as much as possible and improve the recognition accuracy of the synonym recognition model. When the loss function value meets a condition, the training of the synonym recognition model is stopped. In some embodiments, the condition includes, but is not limited to, at least one of the following: the loss function value is less than or equal to a third threshold value, the loss function value is less than or equal to a fourth threshold value for n consecutive times, and the loss function value does not drop for m consecutive times. Specific values of the third threshold and the fourth threshold are set by a related technician according to the actual situation, which is not limited in the embodiment of the present application. m and n are positive integers; m can take values such as 5, 10, 20 and 28, and n can take values such as 5, 10, 20 and 28, with the specific values of m and n set by related technicians according to the actual situation, which is not limited in the embodiment of the present application. In other embodiments, the condition for the synonym recognition model to stop training further comprises: the accuracy of the synonym recognition model is greater than or equal to an accuracy threshold, the recall of the synonym recognition model is greater than or equal to a recall threshold, the F1 score of the synonym recognition model is greater than or equal to an F1 score threshold, and so forth. Optionally, the accuracy threshold, the precision threshold, the recall threshold, and the F1 score threshold are set by a relevant technician according to the actual situation, which is not limited in the embodiment of the present application.
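A minimal sketch of such a stopping check is given below, covering the "loss at or below a threshold" and "loss does not drop for m consecutive times" conditions; the threshold and m values are illustrative.

```python
def should_stop(loss_history: list, third_threshold: float = 0.01, m: int = 10) -> bool:
    """Stop training when the latest loss function value is at or below
    the threshold, or the loss has not dropped for m consecutive rounds."""
    if loss_history and loss_history[-1] <= third_threshold:
        return True
    recent = loss_history[-(m + 1):]          # m non-drops need m + 1 values
    return len(recent) == m + 1 and all(later >= earlier
                                        for earlier, later in zip(recent, recent[1:]))
```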
In summary, according to the technical scheme provided by the embodiment of the application, the synonym recognition model is trained by using the multi-source characteristic information of the words, the synonym prediction result and the correlation prediction result obtained by the multi-task learning are used, the loss function value of the synonym recognition model is calculated, and the synonym recognition model is trained according to the loss function value.
In addition, the embodiment of the application can improve the iteration efficiency of the parameters in the synonym recognition model through multi-task learning, so that the synonym recognition model with high performance can be obtained in a short time, and the training efficiency of the synonym recognition model is improved.
In addition, according to the embodiment of the application, a plurality of training samples are determined through semantic similarity among semantic feature information of a plurality of words, so that initial positive samples and initial negative samples are automatically obtained, and the training efficiency of the synonym recognition model is improved.
Please refer to fig. 2, which shows a flowchart of a training method of a synonym recognition model according to another embodiment of the present application. The method can comprise the following steps (201-208).
Step 201, obtaining a plurality of words.
Step 202, obtaining multi-source characteristic information of the words.
Step 203, determining a plurality of training samples based on the plurality of words.
The specific contents of the steps 201 to 203 can refer to the steps 101 to 103 in the embodiment of fig. 1, which are not described herein again.
Step 204, determining the matching words of the first word and the second word in the training sample respectively, to obtain the matching word determination result of the first word and that of the second word.
In some embodiments, the matching word determination result is either the matching word corresponding to a word or an indication that no matching word exists for the word. The matching degrees between the first word and the candidate matching words are calculated respectively to obtain a plurality of matching degrees corresponding to the first word. When the matching degrees corresponding to the first word are all smaller than a matching degree threshold, the first word has no matching word, and the matching word determination result of the first word is that no matching word exists; when there is a matching degree greater than or equal to the matching degree threshold among the plurality of matching degrees corresponding to the first word, the candidate matching word with the highest matching degree is determined as the matching word of the first word. Optionally, the specific value of the matching degree threshold is set by a related technician according to the actual situation, which is not limited in the embodiment of the present application. The matching word of the second word is determined in the same way as that of the first word, which is not repeated here.
In some embodiments, the plurality of words are medically related words and the matching words are part descriptors for representing corresponding body parts. Correspondingly, the matching word determination result of the first word comprises a first part descriptor corresponding to the first word, and the matching word determination result of the second word comprises a second part descriptor corresponding to the second word. This step 204 includes the following substeps:
1. acquiring a first part descriptor corresponding to a first term from the body part query tree based on the matching degree between the first term and the part descriptor in the body part query tree;
2. and acquiring a second part descriptor corresponding to the second term from the body part query tree based on the matching degree between the second term and the part descriptor in the body part query tree.
The body part query tree records the part descriptors (i.e., the candidate matching words) of a plurality of body parts and the relationship between the plurality of body parts. In some embodiments, a degree of match between the first term and each term in the body part query tree is determined by calculating an edit distance between the first term and each part descriptor in the body part query tree, the smaller the edit distance, the greater the degree of match; the larger the edit distance, the smaller the degree of matching. And determining the part descriptor with the maximum matching degree with the first word in the body part query tree as the first part descriptor corresponding to the first word.
Step 205, determining second label information of the training sample based on the matching word determination result of the first word and the matching word determination result of the second word.
The second label information refers to the label information of the correlation between the two words in the training sample. Optionally, the correlation between two words is used to indicate the correlation between the body parts to which the two words correspond. The correlation label information includes three types of labels: the body parts are the same or related, the body parts are unrelated, and the correlation is uncertain. Based on the part descriptor of the first word and the part descriptor of the second word, it can be determined whether the second label information of the training sample is that the body parts are the same or related, or that the body parts are unrelated. When at least one of the first word and the second word has no part descriptor, the second label information of the training sample is determined to be uncertain.
In some embodiments, based on medical knowledge acquired from various sources, the body is divided into thirteen systems, such as the skin system, the muscle system, the nervous system, the skeletal system, the respiratory system, the digestive system, the urinary system, the reproductive system, the cardiovascular system, the lymphatic system, the endocrine system, and the body trunk, and the body parts are subdivided layer by layer based on these systems to obtain the body part query tree. Referring to fig. 3, a schematic diagram of a body part query tree according to an embodiment of the present application is shown. As shown in fig. 3, the body part query tree 30 includes a plurality of nodes, each node including at least one part descriptor for representing a type of body part. Optionally, each node uses as many part descriptors as possible to represent the body part corresponding to the node; the part descriptors may include standard terms, and may also include spoken forms or aliases of the body part, which is not limited in this embodiment of the present application. Illustratively, for the node 31, the upper node 32 connected to it is the parent node of the node 31, and the lower nodes 33, 34, and 35 connected to it are child nodes of the node 31. In one example, when the first part descriptor and the second part descriptor are located at the same node, the second label information of the training sample is that the body parts are the same or related; when they are located at different nodes, the second label information of the training sample is that the body parts are unrelated. In another example, when the first part descriptor and the second part descriptor are located at the same node, or the node of the first part descriptor and the node of the second part descriptor are in a parent-child relationship, the second label information of the training sample is that the body parts are the same or related; when they are located at different nodes and the two nodes are not in a parent-child relationship, the second label information of the training sample is that the body parts are unrelated. The rule for determining the second label information may also be set in other manners, specifically by a related technician according to the actual situation, which is not limited in the embodiment of the present application.
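A minimal sketch of the body part query tree and the second-label rule described above follows; the node structure, the edit-distance-based matching degree (reusing the edit_distance helper sketched earlier), and the matching threshold are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PartNode:
    """One node of the body part query tree: a body part together with
    all of its part descriptors (standard terms, spoken forms, aliases)."""
    descriptors: set
    parent: Optional["PartNode"] = None
    children: list = field(default_factory=list)

SAME_OR_RELATED, UNRELATED, UNCERTAIN = 0, 1, 2

def match_node(word: str, nodes: list, max_dist: int = 2) -> Optional[PartNode]:
    """Look up the node whose descriptor best matches the word; a smaller
    edit distance means a greater matching degree. Returns None when no
    descriptor is close enough (i.e., no matching word exists)."""
    dist, best = min(((edit_distance(word, d), n)
                      for n in nodes for d in n.descriptors),
                     key=lambda t: t[0], default=(max_dist + 1, None))
    return best if dist <= max_dist else None

def second_label(n1: Optional[PartNode], n2: Optional[PartNode]) -> int:
    """Second label rule: same node or parent/child nodes count as 'same
    or related'; a missing descriptor on either side means 'uncertain'."""
    if n1 is None or n2 is None:
        return UNCERTAIN
    if n1 is n2 or n1.parent is n2 or n2.parent is n1:
        return SAME_OR_RELATED
    return UNRELATED
```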
And step 206, determining a synonym prediction result and a correlation prediction result of the training sample based on the multi-source characteristic information of the two words in the training sample through the synonym recognition model.
Part of the content of step 206 may refer to step 104 in the embodiment of fig. 1, and is not described herein again.
As shown in FIG. 4, the synonym recognition model 40 includes twin first and second networks 41, 42, a primary task output layer 43, and a secondary task output layer 44. In some embodiments, this step 206 may include the following substeps (2061-2064).
Step 2061, after performing dimension reduction processing on the semantic feature information of the first word in the training sample through the first network 41, performing fusion processing on the semantic feature information of the first word and the character feature information of the first word to obtain fusion feature information of the first word.
Optionally, the first network 41 includes multiple fully-connected layers 45 and a fusion layer 46. The semantic feature information of the first word is subjected to multiple rounds of dimension reduction through the fully-connected layers 45 to obtain the reduced-dimension semantic feature information of the first word, and the reduced-dimension semantic feature information is fused with the character feature information at the fusion layer 46 to obtain the fusion feature information of the first word. In some embodiments, the semantic feature information and the character feature information of the first word are in vector form, that is, the word vector and the character feature vector of the first word; the fusion feature vector of the first word (that is, the fusion feature information of the first word) is obtained by connecting the elements of the character feature vector of the first word after the elements of the word vector of the first word. In one example, the reduced-dimension word vector of the first word is [0, 1, 1, 0, 2, 1, 0]^T and the character feature vector of the first word is [1, 1, 0, 1, 2]^T; the fusion feature vector of the first word is then [0, 1, 1, 0, 2, 1, 0, 1, 1, 0, 1, 2]^T.
Step 2062, after performing dimension reduction processing on the semantic feature information of the second word in the training sample through the second network 42, performing fusion processing on the semantic feature information of the second word and the character feature information of the second word to obtain fusion feature information of the second word.
Step 2062 may refer to step 2061, which is not described herein again.
Step 2063, determining the synonym prediction result of the training sample based on the fusion characteristic information of the first term and the fusion characteristic information of the second term through the main task output layer 43.
In some embodiments, the main task output layer 43 obtains the fusion feature information of the first word from the first network 41 and the fusion feature information of the second word from the second network 42, and performs a similarity calculation on the two to obtain the synonym prediction result of the training sample. In one example, the fusion feature information is a fusion feature vector; the Euclidean distance between the fusion feature vector of the first word and that of the second word is calculated as the Euclidean distance corresponding to the training sample. The smaller the Euclidean distance corresponding to the training sample, the greater the similarity between the first word and the second word; the larger the Euclidean distance, the smaller the similarity. When the Euclidean distance corresponding to the training sample is smaller than or equal to the Euclidean distance threshold, the synonym prediction result is that the two words in the training sample are synonyms; when the Euclidean distance corresponding to the training sample is larger than the Euclidean distance threshold, the synonym prediction result is that the two words in the training sample are non-synonyms. The Euclidean distance threshold may be 0.4, 0.5, 0.55, 0.7, etc.; the specific value of the Euclidean distance threshold is set by related technical personnel according to the actual situation, which is not limited in the embodiment of the present application.
At step 2064, the secondary task output layer 44 determines the correlation prediction result of the training sample based on the fusion characteristic information of the first word and the fusion characteristic information of the second word.
In some embodiments, the secondary task output layer 44 obtains the fusion feature information of the first word from the first network 41 and the fusion feature information of the second word from the second network 42, and compares the fusion feature information of the first word with the fusion feature information of the second word to obtain the correlation prediction result of the training sample.
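Putting steps 2061 to 2064 together, below is a minimal PyTorch sketch of the twin-network architecture of FIG. 4. The layer sizes, the weight-shared branch, and the three-way auxiliary classifier are illustrative assumptions; only the overall structure (dimension reduction, fusion by concatenation, a distance-based main head and a classification-based auxiliary head) follows the description.

```python
import torch
import torch.nn as nn

class SynonymRecognitionModel(nn.Module):
    """Sketch of FIG. 4: twin branches with shared weights reduce each
    word vector, concatenate the character features, and feed two task
    heads; all layer sizes here are illustrative."""
    def __init__(self, word_dim: int = 300, char_dim: int = 5, hidden: int = 64):
        super().__init__()
        # one shared branch realises the "twin" first/second networks 41, 42
        self.branch = nn.Sequential(
            nn.Linear(word_dim, 128), nn.ReLU(),   # fully-connected layers 45
            nn.Linear(128, hidden), nn.ReLU(),
        )
        # secondary task output layer 44: same/related, unrelated, uncertain
        self.aux_head = nn.Linear(2 * (hidden + char_dim), 3)

    def fuse(self, word_vec: torch.Tensor, char_vec: torch.Tensor) -> torch.Tensor:
        # fusion layer 46: reduce dimension, then concatenate character features
        return torch.cat([self.branch(word_vec), char_vec], dim=-1)

    def forward(self, w1, c1, w2, c2):
        f1, f2 = self.fuse(w1, c1), self.fuse(w2, c2)
        distance = (f1 - f2).norm(p=2, dim=-1)                # main task output layer 43
        aux_logits = self.aux_head(torch.cat([f1, f2], dim=-1))
        return distance, aux_logits
```

Per step 2063, a pair would then be predicted as synonyms when the returned distance does not exceed the Euclidean distance threshold.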
And step 207, calculating a loss function value of the synonym recognition model based on the synonym prediction result and the correlation prediction result of the training sample.
For part of the content of step 207, reference may be made to step 105 in the embodiment of fig. 1, which is not described herein again.
In some embodiments, step 207 includes the following sub-steps:
1. calculating the loss of the main task according to the synonym prediction result of the training sample and the first label information; the first label information is label information indicating whether two words in the training sample are synonyms or not;
2. according to the correlation prediction result of the training sample and the second label information, calculating the subtask loss; the second label information refers to label information of correlation between two words in the training sample;
3. and calculating a loss function value of the synonym recognition model according to the main task loss, the weight corresponding to the main task loss, the auxiliary task loss and the weight corresponding to the auxiliary task loss.
Based on the first label information and the synonym prediction result of the training sample, the main task loss can be calculated using the main task loss function; based on the second label information and the correlation prediction result of the training sample, the auxiliary task loss can be calculated using the auxiliary task loss function. Corresponding weights are given to the main task loss and the auxiliary task loss, and the loss function value of the synonym recognition model is obtained by combining the two. The main task loss takes the form of a margin-based contrastive loss; the loss function value of the synonym recognition model is calculated according to the following formula one:

Formula one:

$$L = \frac{1}{N}\sum_{i=1}^{N}\Big[\lambda_1\big(y_i d_i^2 + (1-y_i)\max(margin - d_i,\, 0)^2\big) + \lambda_2\,\ell(k_i, \hat{k}_i)\Big]$$

where $L$ is the loss function value of the synonym recognition model, $N$ is the number of training samples, $i$ indexes the training samples, $\lambda_1$ is the weight corresponding to the main task loss, $\lambda_2$ is the weight corresponding to the auxiliary task loss, $y_i$ is the value corresponding to the first label information of the $i$-th training sample, $margin$ is the Euclidean distance threshold, $d_i$ is the Euclidean distance corresponding to the $i$-th training sample, $k_i$ is the value corresponding to the second label information of the $i$-th training sample, $\hat{k}_i$ is the value corresponding to the correlation prediction result of the $i$-th training sample, and $\ell(k_i, \hat{k}_i)$ is the auxiliary task loss term computed from $k_i$ and $\hat{k}_i$.

Optionally, when the first label information is synonym, $y_i$ is 1; when the first label information is non-synonym, $y_i$ is 0. Optionally, when the second label information is that the body parts are the same or related, $k_i$ is 0; when the second label information is that the body parts are unrelated, $k_i$ is 1; when the second label information is uncertain, $k_i$ is 2. Likewise, when the correlation prediction result of the $i$-th training sample is that the body parts are the same or related, $\hat{k}_i$ is 0; when the prediction result is that the body parts are unrelated, $\hat{k}_i$ is 1; when it is uncertain, $\hat{k}_i$ is 2. In addition, the values of $y_i$, $k_i$ and $\hat{k}_i$ can be set by the related technical personnel according to the actual situation, which is not limited in the embodiments of the present application.
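A sketch of computing the loss function value of formula one follows; treating the auxiliary task loss $\ell$ as a cross-entropy over the three correlation labels is an assumption, as are the margin and weight values shown.

```python
import torch
import torch.nn.functional as F

def synonym_model_loss(distance: torch.Tensor, aux_logits: torch.Tensor,
                       y: torch.Tensor, k: torch.Tensor,
                       margin: float = 0.5, lam1: float = 1.0,
                       lam2: float = 0.5) -> torch.Tensor:
    """Formula one: a margin-based contrastive main-task loss plus a
    weighted auxiliary-task loss. `y` holds the first label values
    (1 = synonym, 0 = non-synonym) and `k` the second label class
    indices; the cross-entropy auxiliary term is an assumption."""
    main = y * distance.pow(2) + (1 - y) * F.relu(margin - distance).pow(2)
    aux = F.cross_entropy(aux_logits, k, reduction="none")
    return (lam1 * main + lam2 * aux).mean()
```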
And step 208, training the synonym recognition model according to the loss function value.
Step 208 may refer to step 106 of the embodiment of fig. 1, which is not described herein again.
In some embodiments, active learning is used to train the synonym recognition model, including the following steps:
1. selecting a target training sample with a synonym prediction result meeting conditions, wherein the conditions comprise that the similarity of the synonym prediction result is greater than a first threshold and smaller than a second threshold;
2. and obtaining label information obtained by manually marking the target training sample.
The target training samples are used for the next round of training of the synonym recognition model. In some embodiments, training samples whose Euclidean distances lie near the Euclidean distance threshold are selected as target training samples, the correct label information of the target training samples is labeled manually, and the target training samples are used for the next round of training of the synonym recognition model. In some embodiments, the label information is the first label information. In one example, if the Euclidean distance threshold is 0.5, the training samples whose Euclidean distances lie between 0.45 and 0.55 are determined to be the target training samples. Optionally, the specific values of the first threshold and the second threshold are set by a related technician according to the actual situation, which is not limited in the embodiment of the present application.
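A minimal sketch of this active-learning selection step, using the illustrative 0.45/0.55 interval from the example above; the sample and distance containers are assumed shapes.

```python
def select_target_samples(samples: list, distances: list,
                          lower: float = 0.45, upper: float = 0.55) -> list:
    """Keep the training samples whose Euclidean distance lies near the
    decision threshold (the hard-to-separate pairs), so they can be
    manually labeled and used in the next training round."""
    return [s for s, d in zip(samples, distances) if lower < d < upper]
```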
In summary, according to the technical scheme provided by the embodiment of the application, the training samples whose Euclidean distances lie near the Euclidean distance threshold are used as target training samples for the next round of training of the synonym recognition model; that is, the training samples that are hard to distinguish are used to train the model. A high-performance synonym recognition model can thus be obtained with fewer training samples, which reduces the number of required training samples and the training time of the synonym recognition model, and improves the training efficiency.
Please refer to fig. 5, which illustrates a flowchart of a synonym determination method according to an embodiment of the present application. Some contents of the steps in this embodiment may refer to the above embodiments, and are not described in detail below. The method can comprise the following steps (501-505).
Step 501, obtaining a target word pair.
Wherein the target word pair comprises a first word and a second word.
Step 502, determining the matching words of the first word and the second word respectively to obtain a matching word determination result of the first word and a matching word determination result of the second word.
Optionally, the matching word determination result of the first word includes a first part descriptor corresponding to the first word, and the matching word determination result of the second word includes a second part descriptor corresponding to the second word.
In some embodiments, the matching word is a part descriptor, and step 502 further includes the following sub-steps:
1. acquiring a first part descriptor corresponding to a first term from the body part query tree based on the matching degree between the first term and the part descriptor in the body part query tree;
2. and acquiring a second part descriptor corresponding to the second term from the body part query tree based on the matching degree between the second term and the part descriptor in the body part query tree.
Wherein, the body part query tree records the part descriptors of a plurality of body parts and the relationship among the plurality of body parts.
Step 503, determining the correlation prediction result of the target word pair based on the matching word determination result of the first word and the matching word determination result of the second word.
In some embodiments, if the correlation prediction result meets a condition, the following step 504 is performed; if the correlation prediction result does not meet the condition, the target word pair is determined to be a non-synonym pair. The correlation prediction result meets the condition when: the correlation prediction result is that the body parts corresponding to the first word and the second word are the same or related, or it is uncertain whether the body parts corresponding to the first word and the second word are related. The correlation prediction result does not meet the condition when: the correlation prediction result is that the body parts corresponding to the first word and the second word are unrelated. When the body parts corresponding to the first word and the second word are unrelated, the first word and the second word are definitely non-synonyms and no synonym determination is needed, which improves the efficiency of synonym determination and saves the operation cost of the computer device.
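A minimal sketch of this shortcut, reusing the hypothetical match_node and second_label helpers sketched earlier; model_predict stands for a call into the trained synonym recognition model.

```python
def determine_synonym(word1: str, word2: str, tree_nodes: list, model_predict) -> bool:
    """Inference with the correlation shortcut: word pairs whose body
    parts are unrelated are ruled out as non-synonyms without running
    the synonym recognition model."""
    n1, n2 = match_node(word1, tree_nodes), match_node(word2, tree_nodes)
    if second_label(n1, n2) == UNRELATED:
        return False                      # definitely non-synonyms
    return model_predict(word1, word2)    # same/related or uncertain: run the model
```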
Step 504, multi-source characteristic information of the first term and multi-source characteristic information of the second term are obtained.
And 505, determining a synonym prediction result of the target word pair based on the multi-source characteristic information of the target word pair through the synonym recognition model.
In some embodiments, the synonym recognition model includes a twin first network and a second network, step 505 including the sub-steps of:
1. after the semantic feature information of the first word in the target word pair is subjected to dimensionality reduction processing through a first network, the semantic feature information of the first word and the character feature information of the first word are subjected to fusion processing to obtain fusion feature information of the first word;
2. after the semantic feature information of a second word in the target word pair is subjected to dimensionality reduction processing through a second network, the semantic feature information of the second word is subjected to fusion processing with the character feature information of the second word to obtain fusion feature information of the second word;
3. determining the similarity between the fusion characteristic information of the first word and the fusion characteristic information of the second word;
4. and determining a synonym prediction result of the target word pair according to the similarity.
In summary, in the technical scheme provided in the embodiment of the present application, when the body parts corresponding to the first word and the second word are not related, it is directly determined that the first word and the second word are determined to be non-synonyms, and no further synonym determination is performed on the first word and the second word, so that the efficiency of synonym determination is improved, and the operation cost of the computer device is saved.
Next, the method provided in this embodiment will be described with reference to fig. 6, which illustrates a synonym determination scheme for a symptom in the medical field. As shown in FIG. 6, the method includes the following steps (steps 61 to 67).
Step 61, a plurality of symptom words are obtained.
Here, a symptom word is a word indicating the state of a body part, such as "headache", "soreness of the waist", and "herpetic childhood pain".
Step 62, word vectors of a plurality of symptom words are obtained.
And step 63, determining training samples based on the word vectors of the plurality of symptom words.
In some embodiments, semantic similarity between every two word vectors of the plurality of symptom words is calculated, the plurality of symptom words are divided into a plurality of word sets, the semantic similarity between the word vectors of the symptom words in the same word set is greater than or equal to a threshold value, and the semantic similarity between the word vectors of the symptom words in different word sets is smaller than the threshold value. The training samples comprise positive samples and negative samples, two symptom words selected from the same word set are constructed into a positive sample, and two symptom words selected from different word sets are constructed into a negative sample.
And step 64, acquiring character feature vectors of a plurality of symptom words.
In some embodiments, the character feature vector is generated based on at least one of: edit distance, pinyin edit distance, subsequence, number of character components, and dictionary tools.
And 65, training the synonym recognition model based on the training samples.
In some embodiments, the synonym recognition model includes a twin first network and second network, each of which includes a plurality of fully connected (dense) layers. The word vectors of the two symptom words in the training sample are subjected to dimension reduction through the fully connected layers to obtain the reduced-dimension word vectors of the two symptom words, and the output result of the synonym recognition model is obtained based on the reduced-dimension word vectors and the character feature vectors of the two symptom words. For the rest of the training process of the synonym recognition model, reference may be made to the above embodiments, which are not described herein again.
Step 66, the synonym symptom words output by the synonym recognition model are obtained.
Step 67, the synonyms output by the synonym recognition model are manually checked, and the word pairs whose manual checking result is "synonym" are stored into the synonym database.
The technical scheme provided by the embodiment of the application can be applied to various scenes.
For example, when the technical scheme provided by the embodiment of the present application is applied to a medical inquiry scene, a non-standard symptom word provided by a user (e.g., a colloquial symptom word) can be mapped to a synonymous symptom word whose expression is relatively standard, which makes it convenient to map the symptom the user wants to describe onto the medical knowledge graph; automatic inquiry can then be realized through man-machine conversation according to the medical knowledge graph.
For another example, with the technical scheme provided by the embodiment of the present application, synonymous symptom words of the symptom words contained in an existing medical knowledge graph can be mined and supplemented into the graph, thereby enriching its content and enhancing its expression capacity.
In addition, the technical solution of "standardizing words through synonym recognition" provided in the present application may also be applied to other scenarios, such as the collection and arrangement of information about diseases and drugs (e.g., Chinese herbal medicines), medical examination, and the like, which is not particularly limited in the embodiments of the present application.
In order to compare the technical scheme provided by the embodiment of the present application with other technical schemes, a comparison experiment based on five-fold cross validation was also designed. A high-confidence synonym set is selected as the data set and divided into five equal parts; each time, one part is selected, without repetition, as the test set, and the other four parts are used as the training set to train and validate the synonym recognition model; the result of each run is recorded, and the average of the five results is taken as the evaluation result of the model.
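A minimal sketch of this protocol; `make_model` and `score` are hypothetical callables standing in for the synonym recognition model and the chosen evaluation index:

```python
import numpy as np
from sklearn.model_selection import KFold

def five_fold_evaluate(features: np.ndarray, labels: np.ndarray, make_model, score) -> float:
    # Each of the five equal parts serves exactly once as the test set;
    # the remaining four parts are used for training, and the average of
    # the five run results is the model's evaluation result.
    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(features):
        model = make_model()
        model.fit(features[train_idx], labels[train_idx])
        scores.append(score(model, features[test_idx], labels[test_idx]))
    return float(np.mean(scores))
```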
Four models were designed as follows:
Baseline 1, a word-vector-based model: training and testing with an SVM (Support Vector Machine) as the classifier on the word vectors (see the sketch after this list);
Baseline 2, a character-feature-vector-based model: training and testing with the SVM as the classifier on the character feature vectors;
Twin network, single-task learning model: a twin network model without the subtask output layer;
Twin network, multi-task learning model: a twin network model with the subtask output layer introduced.
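A minimal sketch of the two SVM baselines above, assuming each word pair is represented by concatenating the two words' vectors (Baseline 1) or their character feature vectors (Baseline 2); the concatenation and the RBF kernel are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def train_svm_baseline(pair_features: np.ndarray, labels: np.ndarray) -> SVC:
    # pair_features: one row per word pair, e.g. the two word vectors
    # (Baseline 1) or the two character feature vectors (Baseline 2),
    # concatenated; labels: 1 for synonym pairs, 0 for non-synonym pairs.
    clf = SVC(kernel="rbf")
    clf.fit(pair_features, labels)
    return clf
```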
The experimental results of the above four models are shown in Table 1 below:
Table 1
[The table is provided as an image in the original document; it reports the accuracy, precision, recall and F1-score of the four models.]
Here, accuracy, precision, recall and F1-score are the evaluation indexes.
From the above experimental results, it can be seen that the evaluation indexes of the two baseline models are far below those of the twin network models, and their recall rates are only about 10% higher than random guessing. Owing to the introduction of multi-source features and its own loss function design, the twin network model predicts synonyms much better, with an F1 value more than 20% higher than that of the two baseline models. In addition, with multi-task learning added, the twin network model exceeds the single-task learning model on accuracy, precision and F1-score, with precision improved by 5%. It can be seen that introducing multi-task learning, in particular with body part correlation prediction as the auxiliary task, effectively improves the performance of the synonym recognition model.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 7, a block diagram of a training apparatus for a synonym recognition model according to an embodiment of the present application is shown. The device has the function of realizing the training method of the synonym recognition model, and the function can be realized by hardware or by hardware executing corresponding software. The device can be computer equipment, and can also be arranged in the computer equipment. The apparatus 700 may include: a word acquisition module 710, an information acquisition module 720, a sample determination module 730, a result prediction module 740, a loss calculation module 750, and a model training module 760.
The word obtaining module 710 is configured to obtain a plurality of words.
The information obtaining module 720 is configured to obtain multi-source feature information of the word, where the multi-source feature information includes semantic feature information and text feature information, the semantic feature information is used to represent semantic features of the word, and the text feature information is used to represent word features of the word.
The sample determining module 730 is configured to determine a plurality of training samples based on the plurality of words, where the plurality of training samples include at least one positive sample and at least one negative sample, the positive sample is a synonym pair, and the negative sample is a non-synonym pair.
The result prediction module 740 is configured to determine, by using a synonym recognition model, a synonym prediction result and a correlation prediction result of the training sample based on multi-source feature information of two words in the training sample, where the synonym prediction result is a prediction result of whether two words in the training sample are synonyms, and the correlation prediction result is a prediction result of correlation between two words in the training sample.
The loss calculating module 750 is configured to calculate a loss function value of the synonym recognition model based on the synonym prediction result and the correlation prediction result of the training sample.
The model training module 760 is configured to train the synonym recognition model according to the loss function value.
In summary, in the technical scheme provided by the embodiment of the present application, the synonym recognition model is trained with the multi-source feature information of words; the loss function value of the synonym recognition model is calculated from the synonym prediction result and the correlation prediction result obtained through multi-task learning, and the model is trained according to the loss function value, so that the accuracy of synonym recognition can be improved.
In some embodiments, the synonym recognition model includes the twin first network and second network, a main task output layer, and a subtask output layer; the result prediction module 740 is configured to:
after the semantic feature information of a first word in the training sample is subjected to dimensionality reduction processing through the first network, the semantic feature information of the first word is subjected to fusion processing with the character feature information of the first word to obtain fusion feature information of the first word;
after the semantic feature information of a second word in the training sample is subjected to dimensionality reduction processing through the second network, the semantic feature information of the second word is subjected to fusion processing with the character feature information of the second word to obtain fusion feature information of the second word;
determining, by the main task output layer, a synonym prediction result of the training sample based on the fusion feature information of the first term and the fusion feature information of the second term;
determining, by the subtask output layer, a relevance prediction result for the training sample based on the fusion feature information of the first term and the fusion feature information of the second term.
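A minimal sketch of the two output layers described above, continuing the assumptions of the earlier branch sketch; the fused dimension (64 reduced dimensions plus 2 character features) and the sigmoid-activated linear heads are illustrative choices, not specified by this passage:

```python
import torch
import torch.nn as nn

class TaskHeads(nn.Module):
    # Main task output layer (synonym probability) and subtask output
    # layer (correlation probability), both fed the pair of fusion vectors.
    def __init__(self, fused_dim: int = 66):
        super().__init__()
        self.main_head = nn.Linear(2 * fused_dim, 1)
        self.sub_head = nn.Linear(2 * fused_dim, 1)

    def forward(self, fused_a: torch.Tensor, fused_b: torch.Tensor):
        pair = torch.cat([fused_a, fused_b], dim=-1)
        return torch.sigmoid(self.main_head(pair)), torch.sigmoid(self.sub_head(pair))
```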
In some embodiments, as shown in fig. 8, the apparatus 700 further comprises: a matching word determination module 770 and a tag determination module 780.
The matching word determination module 770 is configured to determine the matching words of the first word and the second word in the training sample, respectively, to obtain a matching word determination result of the first word and a matching word determination result of the second word.
The label determining module 780 is configured to determine second label information of the training sample based on the matching word determination result of the first word and the matching word determination result of the second word, where the second label information refers to label information of a correlation between two words in the training sample.
In some embodiments, the match determination for the first word comprises a first location descriptor corresponding to the first word, and the match determination for the second word comprises a second location descriptor corresponding to the second word; the tag determination module 780 is configured to:
acquiring a first part descriptor corresponding to the first term from a body part query tree based on the matching degree between the first term and the part descriptor in the body part query tree;
acquiring second part descriptors corresponding to the second terms from the body part query tree based on the matching degree between the second terms and the part descriptors in the body part query tree;
wherein the body part query tree records the part descriptors of a plurality of body parts and the relations between the plurality of body parts.

In some embodiments, the sample determination module 730 is configured to:
dividing the plurality of words into a plurality of word sets according to semantic similarity among the semantic feature information of the plurality of words, wherein the semantic similarity among the semantic feature information of the words in the same word set is larger than a threshold value, and the semantic similarity among the semantic feature information of the words in different word sets is smaller than the threshold value;
selecting two words from the same word set to construct the positive sample;
and selecting two words from different word sets to construct the negative sample.
In some embodiments, the loss calculation module 750 is configured to:
calculating the loss of the main task according to the synonym prediction result of the training sample and the first label information; the first label information refers to label information of whether two words in the training sample are synonyms or not;
according to the correlation prediction result of the training sample and the second label information, calculating the auxiliary task loss; the second label information refers to label information of the correlation between two words in the training sample;
and calculating a loss function value of the synonym recognition model according to the main task loss, the weight corresponding to the main task loss, the auxiliary task loss and the weight corresponding to the auxiliary task loss.
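A minimal sketch of this weighted combination, assuming binary cross-entropy for both tasks and illustrative weights; the passage fixes only that the total loss is a weighted sum of the main task loss and the auxiliary task loss:

```python
import torch
import torch.nn.functional as F

def total_loss(syn_pred, syn_label, rel_pred, rel_label,
               w_main: float = 1.0, w_aux: float = 0.5) -> torch.Tensor:
    # Main task loss: synonym prediction vs. the first label information.
    main_loss = F.binary_cross_entropy(syn_pred, syn_label)
    # Auxiliary task loss: correlation prediction vs. the second label information.
    aux_loss = F.binary_cross_entropy(rel_pred, rel_label)
    # Loss function value of the synonym recognition model.
    return w_main * main_loss + w_aux * aux_loss
```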
In some embodiments, as shown in fig. 8, the apparatus 700 further comprises: a sample selection module 790 and a label acquisition module 795.
The sample selecting module 790 is configured to select a target training sample for which the synonym prediction result meets a condition, where the condition includes that the similarity of the synonym prediction result is greater than a first threshold and smaller than a second threshold.
The label obtaining module 795 is configured to obtain label information obtained by manually labeling the target training sample.
And the target training sample is used for carrying out the next round of training on the synonym recognition model.
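A minimal sketch of this selection rule; the two threshold values are assumptions, since only their existence (a first threshold and a larger second threshold) is specified:

```python
def select_for_labeling(pairs, similarities, first: float = 0.4, second: float = 0.6):
    # Keep the word pairs whose predicted similarity lies between the two
    # thresholds; the model is least certain about them, so they are sent
    # for manual labeling and reused in the next round of training.
    return [p for p, s in zip(pairs, similarities) if first < s < second]
```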
Referring to fig. 9, a block diagram of a synonym determination apparatus according to an embodiment of the present application is shown. The device has the function of realizing the synonym determination method, and the function can be realized by hardware or by hardware executing corresponding software. The device can be computer equipment, and can also be arranged in the computer equipment. The apparatus 900 may include: a word pair obtaining module 910, an information obtaining module 920, and a result determining module 930.
The word pair obtaining module 910 is configured to obtain a target word pair, where the target word pair includes a first word and a second word.
The information obtaining module 920 is configured to obtain multi-source feature information of the first word and multi-source feature information of the second word, where the multi-source feature information includes semantic feature information and text feature information, the semantic feature information is used to represent semantic features of the words, and the text feature information is used to represent word features of the words.
The result determining module 930 is configured to determine, by using a synonym recognition model, a synonym prediction result of the target word pair based on the multi-source feature information of the target word pair, where the synonym prediction result refers to a prediction result of whether the first word and the second word are synonyms.
In summary, in the technical scheme provided in the embodiment of the present application, when the body parts corresponding to the first word and the second word are not related, the first word and the second word are directly determined to be non-synonyms, and no further synonym determination is performed on them, which improves the efficiency of synonym determination and saves the operation cost of the computer device.
In some embodiments, the synonym recognition model includes a twin first network and a second network; the result determination module 930 is configured to:
after the semantic feature information of the first word in the target word pair is subjected to dimensionality reduction processing through the first network, the semantic feature information and the character feature information of the first word are subjected to fusion processing to obtain fusion feature information of the first word;
after the semantic feature information of a second word in the target word pair is subjected to dimensionality reduction processing through the second network, the semantic feature information and the character feature information of the second word are subjected to fusion processing to obtain fusion feature information of the second word;
determining similarity between the fusion characteristic information of the first word and the fusion characteristic information of the second word;
and determining a synonym prediction result of the target word pair according to the similarity.
In some embodiments, as shown in fig. 10, the apparatus 900 further comprises: a matching word determination module 940, a step loop module 950, and a non-synonym determination module 960.
The matching word determining module 940 is configured to determine matching words of the first word and the second word respectively, and obtain a matching word determining result of the first word and a matching word determining result of the second word.
The result determining module 930 is further configured to determine a relevance prediction result of the target word pair based on the matching word determination result of the first word and the matching word determination result of the second word, where the relevance prediction result refers to a prediction result of relevance between the first word and the second word.
The step loop module 950 is configured to, if the correlation prediction result meets the condition, start execution again from the step of acquiring the multi-source feature information of the first word and the multi-source feature information of the second word.
The non-synonym determining module 960 is configured to determine that the target word pair is a non-synonym if the correlation prediction result does not meet the condition.
In some embodiments, the match determination for the first word comprises a first location descriptor corresponding to the first word, and the match determination for the second word comprises a second location descriptor corresponding to the second word; the matching word determination module 940 is configured to:
acquiring a first part descriptor corresponding to the first term from a body part query tree based on the matching degree between the first term and the part descriptor in the body part query tree;
acquiring second part descriptors corresponding to the second terms from the body part query tree based on the matching degree between the second terms and the part descriptors in the body part query tree;
wherein the body part query tree records part descriptors of a plurality of body parts and relations between the plurality of body parts.
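For illustration only, a toy body part query tree in Python; substring matching stands in for the matching degree computation, which this passage does not restate:

```python
from typing import List, Optional

class PartNode:
    # One node of the body part query tree: a part descriptor plus the
    # descriptors of its sub-parts as children.
    def __init__(self, descriptor: str, children: Optional[List["PartNode"]] = None):
        self.descriptor = descriptor
        self.children = children or []

def find_part(node: PartNode, term: str) -> Optional[PartNode]:
    # Return the deepest node whose descriptor occurs in the term, i.e.
    # the best-matching part descriptor for the word.
    hit = node if node.descriptor in term else None
    for child in node.children:
        deeper = find_part(child, term)
        if deeper is not None:
            hit = deeper
    return hit

# Toy usage: a two-level tree ("body" -> "head", "waist").
tree = PartNode("body", [PartNode("head"), PartNode("waist")])
print(find_part(tree, "headache").descriptor)  # -> "head"
```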
Referring to fig. 11, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be used to implement the training method of the synonym recognition model or the synonym determination method described above. Specifically:
the computer apparatus 1100 includes a Central Processing Unit (CPU) 1101, a system Memory 1104 including a Random Access Memory (RAM) 1102 and a Read Only Memory (ROM) 1103, and a system bus 1105 connecting the system Memory 1104 and the Central Processing Unit 1101. The computer device 1100 also includes a basic Input/Output (I/O) system 1106, which facilitates transfer of information between devices within the computer, and a mass storage device 1107 for storing an operating system 1113, application programs 1114, and other program modules 1115.
The basic input/output system 1106 includes a display 1108 for displaying information and an input device 1109 such as a mouse, keyboard, etc. for user input of information. Wherein the display 1108 and the input device 1109 are connected to the central processing unit 1101 through an input output controller 1110 connected to the system bus 1105. The basic input/output system 1106 may also include an input/output controller 1110 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 1110 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1107 is connected to the central processing unit 1101 through a mass storage controller (not shown) that is connected to the system bus 1105. The mass storage device 1107 and its associated computer-readable media provide non-volatile storage for the computer device 1100. That is, the mass storage device 1107 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact disk Read-Only Memory) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1104 and mass storage device 1107 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 1100 may also be connected, through a network such as the Internet, to a remote computer on the network for operation. That is, the computer device 1100 may connect to the network 1112 through the network interface unit 1111 connected to the system bus 1105, or may connect to other types of networks or remote computer systems (not shown) using the network interface unit 1111.
The memory further stores a computer program configured to be executed by one or more processors to implement the above training method of the synonym recognition model, or to implement the above synonym determination method.
In some embodiments, there is further provided a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which when executed by a processor, implement the above-described method of training a synonym recognition model.
In some embodiments, there is also provided a computer readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions which, when executed by a processor, implement the synonym determination method described above.
Optionally, the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random Access Memory), SSD (Solid State Drive), an optical disk, or the like. The random access memory may include ReRAM (Resistive Random Access Memory) and DRAM (Dynamic Random Access Memory).
In some embodiments, there is also provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer readable storage medium by a processor of a computer device, and the processor executes the computer instructions to enable the computer device to execute the training method of the synonym recognition model.
In some embodiments, there is also provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the synonym determination method.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. In addition, the step numbers described herein only exemplarily show one possible execution sequence among the steps, and in some other embodiments, the steps may also be executed out of the numbering sequence, for example, two steps with different numbers are executed simultaneously, or two steps with different numbers are executed in a reverse order to the order shown in the figure, which is not limited by the embodiment of the present application.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A training method of a synonym recognition model, characterized in that the method comprises:
acquiring a plurality of words;
acquiring multi-source characteristic information of the words, wherein the multi-source characteristic information comprises semantic characteristic information and character characteristic information, the semantic characteristic information is used for representing semantic characteristics of the words, and the character characteristic information is used for representing word characteristics of the words;
determining a plurality of training samples based on the plurality of words, the plurality of training samples including at least one positive sample and at least one negative sample, the positive sample being a synonym pair and the negative sample being a non-synonym pair;
determining a synonym prediction result and a correlation prediction result of the training sample based on multi-source characteristic information of two words in the training sample through a synonym recognition model, wherein the synonym prediction result is a prediction result of whether the two words in the training sample are synonyms, and the correlation prediction result is a prediction result of correlation between the two words in the training sample;
calculating a loss function value of the synonym recognition model based on the synonym prediction result and the correlation prediction result of the training sample;
and training the synonym recognition model according to the loss function value.
2. The method of claim 1, wherein the synonym recognition model includes twin first and second networks, a main task output layer, and a subtask output layer;
the determining, by the synonym recognition model, a synonym prediction result and a correlation prediction result of the training sample based on multi-source feature information of two words in the training sample includes:
after the semantic feature information of a first word in the training sample is subjected to dimensionality reduction processing through the first network, the semantic feature information of the first word is subjected to fusion processing with the character feature information of the first word to obtain fusion feature information of the first word;
after the semantic feature information of a second word in the training sample is subjected to dimensionality reduction processing through the second network, the semantic feature information of the second word is subjected to fusion processing with the character feature information of the second word to obtain fusion feature information of the second word;
determining, by the main task output layer, a synonym prediction result of the training sample based on the fusion feature information of the first term and the fusion feature information of the second term;
determining, by the subtask output layer, a relevance prediction result for the training sample based on the fusion feature information of the first term and the fusion feature information of the second term.
3. The method of claim 1, wherein after determining a plurality of training samples based on the plurality of words, further comprising:
respectively determining the matching words of a first word and a second word in the training sample to obtain a matching word determination result of the first word and a matching word determination result of the second word;
and determining second label information of the training sample based on the matching word determination result of the first word and the matching word determination result of the second word, wherein the second label information refers to label information of correlation between two words in the training sample.
4. The method of claim 3, wherein the matching-word determination result for the first word comprises a first part descriptor corresponding to the first word, and wherein the matching-word determination result for the second word comprises a second part descriptor corresponding to the second word;
the determining the matching words of the first word and the second word in the training sample respectively to obtain the determining result of the matching words of the first word and the determining result of the matching words of the second word includes:
acquiring a first part descriptor corresponding to the first term from a body part query tree based on the matching degree between the first term and the part descriptor in the body part query tree;
acquiring second part descriptors corresponding to the second terms from the body part query tree based on the matching degree between the second terms and the part descriptors in the body part query tree;
wherein the body part query tree records part descriptors of a plurality of body parts and relations between the plurality of body parts.
5. The method of claim 1, wherein determining a plurality of training samples based on the plurality of words comprises:
dividing the plurality of words into a plurality of word sets according to semantic similarity among the semantic feature information of the plurality of words, wherein the semantic similarity among the semantic feature information of the words in the same word set is larger than a threshold value, and the semantic similarity among the semantic feature information of the words in different word sets is smaller than the threshold value;
selecting two words from the same word set to construct the positive sample;
and selecting two words from different word sets to construct the negative sample.
6. The method of claim 1, wherein calculating the loss function value of the synonym recognition model based on the synonym prediction result and the correlation prediction result of the training sample comprises:
calculating the loss of the main task according to the synonym prediction result of the training sample and the first label information; the first label information refers to label information of whether two words in the training sample are synonyms or not;
according to the correlation prediction result of the training sample and the second label information, calculating the auxiliary task loss; the second label information refers to label information of correlation between two words in the training sample;
and calculating a loss function value of the synonym recognition model according to the main task loss, the weight corresponding to the main task loss, the auxiliary task loss and the weight corresponding to the auxiliary task loss.
7. The method according to any one of claims 1 to 6, further comprising:
selecting a target training sample with the synonym prediction result meeting conditions, wherein the conditions comprise that the similarity of the synonym prediction result is greater than a first threshold and smaller than a second threshold;
acquiring label information obtained by manually marking the target training sample;
and the target training sample is used for carrying out the next round of training on the synonym recognition model.
8. A method for synonym determination, the method comprising:
acquiring a target word pair, wherein the target word pair comprises a first word and a second word;
acquiring multi-source characteristic information of the first word and multi-source characteristic information of the second word, wherein the multi-source characteristic information comprises semantic characteristic information and character characteristic information, the semantic characteristic information is used for representing semantic characteristics of the words, and the character characteristic information is used for representing word characteristics of the words;
determining a synonym prediction result of the target word pair based on multi-source characteristic information of the target word pair through a synonym recognition model, wherein the synonym prediction result is a prediction result of whether the first word and the second word are synonyms or not.
9. The method of claim 8, wherein the synonym recognition model includes a twin first network and a second network;
determining a synonym prediction result of the target word pair based on the multi-source characteristic information of the target word pair through the synonym recognition model, wherein the determining comprises the following steps:
after the semantic feature information of the first word in the target word pair is subjected to dimensionality reduction processing through the first network, the semantic feature information and the character feature information of the first word are subjected to fusion processing to obtain fusion feature information of the first word;
after the semantic feature information of a second word in the target word pair is subjected to dimensionality reduction processing through the second network, the semantic feature information and the character feature information of the second word are subjected to fusion processing to obtain fusion feature information of the second word;
determining similarity between the fusion characteristic information of the first word and the fusion characteristic information of the second word;
and determining a synonym prediction result of the target word pair according to the similarity.
10. The method of claim 8, wherein after obtaining the target word pair, further comprising:
respectively determining the matching words of the first word and the second word to obtain a matching word determination result of the first word and a matching word determination result of the second word;
determining a correlation prediction result of the target word pair based on a matching word determination result of the first word and a matching word determination result of the second word, wherein the correlation prediction result is a prediction result of the correlation between the first word and the second word;
if the correlation prediction result meets the condition, starting to execute the step of acquiring the multi-source characteristic information of the first term and the multi-source characteristic information of the second term;
and if the correlation prediction result does not meet the condition, determining that the target word pair is a non-synonym.
11. The method of claim 10, wherein the matching-word determination result for the first word comprises a first part descriptor corresponding to the first word, and wherein the matching-word determination result for the second word comprises a second part descriptor corresponding to the second word;
the determining the matching words of the first word and the second word respectively to obtain a matching word determination result of the first word and a matching word determination result of the second word includes:
acquiring a first part descriptor corresponding to the first term from a body part query tree based on the matching degree between the first term and the part descriptor in the body part query tree;
acquiring second part descriptors corresponding to the second terms from the body part query tree based on the matching degree between the second terms and the part descriptors in the body part query tree;
wherein the body part query tree records part descriptors of a plurality of body parts and relations between the plurality of body parts.
12. An apparatus for training a synonym recognition model, the apparatus comprising:
the word acquisition module is used for acquiring a plurality of words;
the information acquisition module is used for acquiring multi-source characteristic information of the words, wherein the multi-source characteristic information comprises semantic characteristic information and character characteristic information, the semantic characteristic information is used for representing semantic characteristics of the words, and the character characteristic information is used for representing word characteristics of the words;
a sample determination module, configured to determine a plurality of training samples based on the plurality of words, where the plurality of training samples include at least one positive sample and at least one negative sample, the positive sample is a synonym pair, and the negative sample is a non-synonym pair;
the result prediction module is used for determining a synonym prediction result and a correlation prediction result of the training sample based on multi-source characteristic information of two words in the training sample through a synonym recognition model, wherein the synonym prediction result is a prediction result of whether the two words in the training sample are synonyms, and the correlation prediction result is a prediction result of correlation between the two words in the training sample;
the loss calculation module is used for calculating a loss function value of the synonym recognition model based on the synonym prediction result and the correlation prediction result of the training sample;
and the model training module is used for training the synonym recognition model according to the loss function value.
13. A synonym determination apparatus, characterized in that the apparatus comprises:
the word pair obtaining module is used for obtaining a target word pair, and the target word pair comprises a first word and a second word;
the information acquisition module is used for acquiring multi-source characteristic information of the first word and multi-source characteristic information of the second word, wherein the multi-source characteristic information comprises semantic characteristic information and character characteristic information, the semantic characteristic information is used for representing semantic characteristics of the words, and the character characteristic information is used for representing word characteristics of the words;
and the result determining module is used for determining a synonym prediction result of the target word pair based on the multi-source characteristic information of the target word pair through a synonym recognition model, wherein the synonym prediction result refers to a prediction result of whether the first word and the second word are synonyms.
14. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement a method of training a synonym recognition model according to any one of the preceding claims 1 to 7, or to implement a method of synonym determination according to any one of the preceding claims 8 to 11.
15. A computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement a method of training a synonym recognition model according to any one of the preceding claims 1-7, or to implement a method of synonym determination according to any one of the preceding claims 8-11.
CN202010781406.3A 2020-08-06 2020-08-06 Training method of synonym recognition model, synonym determination method and equipment Active CN111738001B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010781406.3A CN111738001B (en) 2020-08-06 2020-08-06 Training method of synonym recognition model, synonym determination method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010781406.3A CN111738001B (en) 2020-08-06 2020-08-06 Training method of synonym recognition model, synonym determination method and equipment

Publications (2)

Publication Number Publication Date
CN111738001A true CN111738001A (en) 2020-10-02
CN111738001B CN111738001B (en) 2020-12-01

Family

ID=72658145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010781406.3A Active CN111738001B (en) 2020-08-06 2020-08-06 Training method of synonym recognition model, synonym determination method and equipment

Country Status (1)

Country Link
CN (1) CN111738001B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978356A (en) * 2014-04-10 2015-10-14 阿里巴巴集团控股有限公司 Synonym identification method and device
CN105095204A (en) * 2014-04-17 2015-11-25 阿里巴巴集团控股有限公司 Method and device for obtaining synonym
CN106844571A (en) * 2017-01-03 2017-06-13 北京齐尔布莱特科技有限公司 Recognize method, device and the computing device of synonym
CN110287337A (en) * 2019-06-19 2019-09-27 上海交通大学 The system and method for medicine synonym is obtained based on deep learning and knowledge mapping
CN110852082A (en) * 2019-10-23 2020-02-28 北京明略软件系统有限公司 Synonym determination method and device
CN110991168A (en) * 2019-12-05 2020-04-10 京东方科技集团股份有限公司 Synonym mining method, synonym mining device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Junfeng et al., "Synonym Recognition in the Patent Field", Journal of Chinese Computer Systems *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112269858B (en) * 2020-10-22 2024-04-19 中国平安人寿保险股份有限公司 Optimization method, device, equipment and storage medium of synonymous dictionary
CN112269858A (en) * 2020-10-22 2021-01-26 中国平安人寿保险股份有限公司 Optimization method, device and equipment of synonym dictionary and storage medium
CN112417147A (en) * 2020-11-05 2021-02-26 腾讯科技(深圳)有限公司 Method and device for selecting training samples
CN113392651A (en) * 2020-11-09 2021-09-14 腾讯科技(深圳)有限公司 Training word weight model, and method, device, equipment and medium for extracting core words
CN113392651B (en) * 2020-11-09 2024-05-14 腾讯科技(深圳)有限公司 Method, device, equipment and medium for training word weight model and extracting core words
CN112989837A (en) * 2021-05-11 2021-06-18 北京明略软件系统有限公司 Entity alias discovery method and device based on co-occurrence graph
CN112989837B (en) * 2021-05-11 2021-09-10 北京明略软件系统有限公司 Entity alias discovery method and device based on co-occurrence graph
CN113377921A (en) * 2021-06-25 2021-09-10 北京百度网讯科技有限公司 Method, apparatus, electronic device, and medium for matching information
CN113377921B (en) * 2021-06-25 2023-07-21 北京百度网讯科技有限公司 Method, device, electronic equipment and medium for matching information
CN113836901A (en) * 2021-09-14 2021-12-24 灵犀量子(北京)医疗科技有限公司 Chinese and English medicine synonym data cleaning method and system
CN113836901B (en) * 2021-09-14 2023-11-14 灵犀量子(北京)医疗科技有限公司 Method and system for cleaning Chinese and English medical synonym data
CN117009532B (en) * 2023-09-21 2023-12-19 腾讯科技(深圳)有限公司 Semantic type recognition method and device, computer readable medium and electronic equipment
CN117009532A (en) * 2023-09-21 2023-11-07 腾讯科技(深圳)有限公司 Semantic type recognition method and device, computer readable medium and electronic equipment
CN118052221A (en) * 2024-04-16 2024-05-17 腾讯科技(深圳)有限公司 Text processing method, device, equipment, storage medium and product
CN118052221B (en) * 2024-04-16 2024-06-21 腾讯科技(深圳)有限公司 Text processing method, device, equipment, storage medium and product

Also Published As

Publication number Publication date
CN111738001B (en) 2020-12-01

Similar Documents

Publication Publication Date Title
CN111738001B (en) Training method of synonym recognition model, synonym determination method and equipment
CN111444709B (en) Text classification method, device, storage medium and equipment
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN112633419B (en) Small sample learning method and device, electronic equipment and storage medium
CN112015859A (en) Text knowledge hierarchy extraction method and device, computer equipment and readable medium
CN113704460B (en) Text classification method and device, electronic equipment and storage medium
CN111782826A (en) Knowledge graph information processing method, device, equipment and storage medium
CN113722474A (en) Text classification method, device, equipment and storage medium
Li et al. Intention understanding in human–robot interaction based on visual-NLP semantics
Li et al. Dynamic key-value memory enhanced multi-step graph reasoning for knowledge-based visual question answering
Lin et al. Automatic sorting system for industrial robot with 3D visual perception and natural language interaction
CN113705191A (en) Method, device and equipment for generating sample statement and storage medium
CN111611796A (en) Hypernym determination method and device for hyponym, electronic device and storage medium
CN114519397A (en) Entity link model training method, device and equipment based on comparative learning
CN113536784A (en) Text processing method and device, computer equipment and storage medium
CN117009516A (en) Converter station fault strategy model training method, pushing method and device
CN110362734A (en) Text recognition method, device, equipment and computer readable storage medium
CN114662496A (en) Information identification method, device, equipment, storage medium and product
Wang et al. Backtracing: Retrieving the Cause of the Query
Huang et al. Learning emotion recognition and response generation for a service robot
CN114330297A (en) Language model pre-training method, language text processing method and device
CN114155957A (en) Text determination method and device, storage medium and electronic equipment
CN113822439A (en) Task prediction method, device, equipment and storage medium
CN111444338A (en) Text processing device, storage medium and equipment
CN115146716B (en) Labeling method, labeling device, labeling apparatus, labeling storage medium and labeling program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40030744

Country of ref document: HK