CN112579794B - Method and system for predicting semantic tree for Chinese and English word pairs - Google Patents


Info

Publication number
CN112579794B
Authority
CN
China
Prior art keywords
semantic
word
tree
node
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011565924.8A
Other languages
Chinese (zh)
Other versions
CN112579794A (en
Inventor
李涓子
刘宝巨
侯磊
张鹏
唐杰
许斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202011565924.8A priority Critical patent/CN112579794B/en
Publication of CN112579794A publication Critical patent/CN112579794A/en
Application granted granted Critical
Publication of CN112579794B publication Critical patent/CN112579794B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the invention provide a method and a system for predicting a sememe tree for Chinese-English word pairs. The method comprises: acquiring a word pair to be predicted and the category sememe corresponding to the word pair to be predicted; and, based on a known preset sememe set and semantic relation set and on that category sememe, generating a sememe tree for the word pair to be predicted by means of a preset sememe tree generation algorithm. Given the category sememe of a word pair and a known sememe knowledge base, the embodiments predict the sememe tree for the word pair automatically, which saves a large amount of time and cost compared with manual sememe tree annotation and offers higher efficiency and higher accuracy.

Description

Method and system for predicting semantic tree for Chinese and English word pairs
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method and a system for predicting a semantic tree for Chinese and English word pairs.
Background
Sentences are composed of words, and different words share commonalities and show differences. HowNet is a widely used, manually annotated database that describes the semantics of words. It annotates each word as a structure composed of a series of sememes, which are indivisible semantic units finer-grained than words and express more basic meanings than words do. HowNet and its annotated sememe information can be used in natural language processing tasks such as word sense disambiguation, sentiment analysis, cross-lingual lexical similarity and word vector generation.
Although sememes play an important role in natural language analysis and processing, annotating them manually is time-consuming and labor-intensive, and deviations such as inconsistency are hard to avoid. With the development of information technology, new words emerge constantly, new Chinese and English words in particular are increasing rapidly, and the meanings of existing words keep changing. At present there is no good method for annotating sememes for Chinese-English word pairs.
Disclosure of Invention
The embodiments of the invention provide a method and a system for predicting a sememe tree for Chinese-English word pairs, in order to overcome the defect of the prior art that sememe trees can only be annotated manually.
In a first aspect, an embodiment of the present invention provides a method for predicting a sememe tree for Chinese-English word pairs, including:
acquiring a word pair to be predicted and the category sememe corresponding to the word pair to be predicted;
and generating a sememe tree for the word pair to be predicted by means of a preset sememe tree generation algorithm, based on the known preset sememe set and semantic relation set and on the category sememe corresponding to the word pair to be predicted.
Further, the preset sememe tree generation algorithm comprises a path generation algorithm or a label propagation algorithm.
Further, the path generation algorithm specifically includes:
constructing a sememe tree edge generator, which obtains the edges that take each node of the sememe tree as head node;
constructing a sememe tree node generator, which obtains the tail nodes reached from a given head node in the sememe tree;
and constructing a tree generator, which obtains the whole sememe tree.
Further, constructing the sememe tree edge generator to obtain the edges that take each node as head node specifically includes:
taking the path from the root node to the current node, the word sense of the word pair and the category sememe as the input of the sememe tree edge generator;
modeling the path information with a recurrent neural network (RNN), representing each sememe and each semantic relation as a one-hot vector, and concatenating a pre-trained English word vector and a pre-trained Chinese word vector to represent the word sense;
acquiring a first preset classifier, inputting the last state of the RNN, the current node and the word sense into the first preset classifier, and outputting scores for all semantic relations;
normalizing the scores of all semantic relations by the L1 norm to obtain normalized semantic relation scores;
and sorting the normalized semantic relation scores in descending order and traversing them, stopping the traversal once the cumulative score of the top (first preset number + 1) semantic relations exceeds a first preset threshold, and taking the first preset number of top-ranked semantic relations as a first output result.
Further, constructing the sememe tree node generator to obtain the tail nodes reached from a given head node specifically includes:
taking the path from the root node to the current edge, the word sense of the word pair and the category sememe as the input of the sememe tree node generator;
modeling the path information with a recurrent neural network (RNN), representing each sememe and each semantic relation as a one-hot vector, and concatenating a pre-trained English word vector and a pre-trained Chinese word vector to represent the word sense;
acquiring a second preset classifier, inputting the last state of the RNN, the current node and the word sense into the second preset classifier, and outputting scores for the sememes;
normalizing the sememe scores by the L1 norm to obtain normalized sememe scores;
and sorting the normalized sememe scores in descending order and traversing them, stopping the traversal once the cumulative score of the top (second preset number + 1) sememes exceeds a second preset threshold, and taking the second preset number of top-ranked sememes as a second output result.
Further, constructing the tree generator to obtain the whole sememe tree specifically includes:
using a recursive algorithm whose inputs to the tree generator are the word sense, the root node and the path from the root node to a given node, so as to generate the whole sememe tree.
Further, the label propagation algorithm specifically includes:
constructing a word sense graph by connecting the word pair to be predicted with word pairs whose sememe tree information is known, thereby obtaining a plurality of connecting edges;
parsing the sememe tree of each known word pair into a known triple set, and expressing the known triple set as a single multi-hot label vector with one dimension per triple;
separately calculating the English semantic similarity between the English word of the known word pair and the English word of the word pair to be predicted, and the Chinese semantic similarity between the Chinese word of the known word pair and the Chinese word of the word pair to be predicted, and combining the English and Chinese semantic similarities through a preset similarity function to obtain the weight coefficient of each connecting edge;
acquiring a preset activation function, and obtaining the multi-hot label vector of the word pair to be predicted from the single multi-hot label vectors and the weight coefficients;
normalizing the multi-hot label vector of the word pair to be predicted by the L1 norm to obtain normalized label scores;
sorting the normalized label scores in descending order and traversing them, stopping the traversal once the cumulative score of the top (third preset number + 1) triples exceeds a third preset threshold, and taking the third preset number of top-ranked triples as a third output result;
and converting the third output result, together with the category sememe, into the whole sememe tree, which is output.
In a second aspect, an embodiment of the present invention further provides a system for predicting a sememe tree for Chinese-English word pairs, including:
an acquisition module, configured to acquire a word pair to be predicted and the category sememe corresponding to the word pair to be predicted;
and a processing module, configured to generate a sememe tree for the word pair to be predicted by means of a preset sememe tree generation algorithm, based on the known preset sememe set and semantic relation set and on the category sememe corresponding to the word pair to be predicted.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the steps of any of the above methods for predicting a sememe tree for Chinese-English word pairs.
In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any of the above methods for predicting a sememe tree for Chinese-English word pairs.
In the method and system for predicting a sememe tree for Chinese-English word pairs provided by the embodiments of the invention, the category sememe of a word pair is given, a known sememe knowledge base is used, and a sememe tree is predicted for the given word pair, so that sememe tree prediction is automated.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of a sememe tree according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for predicting a semantic tree for Chinese and English word pairs according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an edge generator provided by an embodiment of the invention;
FIG. 4 is a schematic diagram of a node generator provided by an embodiment of the present invention;
FIG. 5 is a flow chart of a tree generator algorithm provided by an embodiment of the present invention;
FIG. 6 is a flowchart of an algorithm for generating a sememe tree from a triple set according to an embodiment of the present invention;
FIG. 7 is a block diagram of a system for predicting a semantic tree for Chinese and English word pairs according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a schematic diagram of a sememe tree provided by an embodiment of the present invention. As shown in FIG. 1, the following concepts are involved:
word sense (sense): the meaning of a word, i.e. the generalized understanding of the things, phenomena and relationships that the word refers to; sen denotes one word sense;
word: a word (a Chinese word $w_z$ or an English word $w_e$) has one or more word senses;
word pair (denoted $\{w_e | w_z\}$): statistically, about 98.09% of word pairs are unambiguous; following prior work, the embodiments of the present invention assume that a Chinese-English word pair has a single unambiguous word sense, $sen = \{w_e | w_z\}$;
sememe: a sememe is the smallest indivisible, unambiguous semantic unit in human language; $S = \{s_1, s_2, s_3, \ldots, s_n\}$ denotes the sememe set, with members such as { buy | buy } and { few | little }; in the known HowNet library, $n = 2214$;
category sememe: the category sememe
Figure BDA0002861735870000061
is the basic class to which a word sense belongs and is a basic, important component in constructing the sense knowledge of a word; for example, the category sememe of "outlet | retail store" is "InstitutePlace | place";
semantic relation: relations describe the semantic relationships between word senses and sememes, $R = \{r_1, r_2, \ldots, r_m\}$, for example domain and agent. So that every edge in the sememe tree points away from the root node, FIG. 1 uses (InstitutePlace | place, -agent, sell) in place of (sell, agent, InstitutePlace | place); inverse relations such as "-agent" therefore have to be introduced. In addition, a null relation is introduced as the marker that ends sememe tree generation. The final relation set has 233 elements;
sememe tree: each word sense has a sememe tree that describes its semantic information; the root node of each sememe tree is the category sememe of the word sense
Figure BDA0002861735870000062
A sememe tree can be parsed into a triple set $T = \{(h, r, t) \mid h, t \in S;\ r \in R\}$, where h is the head node, r is the semantic relation, t is the tail node, S is the sememe set and R is the semantic relation set;
sememe tree prediction: the symbols
Figure BDA0002861735870000063
denote, respectively, the set of word pairs in HowNet and the set of word pairs to be predicted. Word pairs in both sets have category sememes; word pairs in HowNet have sememe tree information, whereas word pairs in the set to be predicted do not.
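The triple representation and the inverse-relation convention defined above can be illustrated with a minimal Python sketch. The class name `SememeNode` and the helpers `invert` and `tree_to_triples` are hypothetical names introduced only for this illustration; the example tree contains just the single edge mentioned in the text for FIG. 1.

```python
from dataclasses import dataclass, field


@dataclass
class SememeNode:
    sememe: str
    # Each child is a (relation, SememeNode) pair; edges point away from the root.
    children: list = field(default_factory=list)


def invert(triple):
    """Rewrite (head, r, tail) as (tail, -r, head) so the edge leaves the other end."""
    head, relation, tail = triple
    inverse = relation[1:] if relation.startswith("-") else "-" + relation
    return (tail, inverse, head)


def tree_to_triples(root):
    """Collect every (head, relation, tail) edge of the tree rooted at `root`."""
    triples = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for relation, child in node.children:
            triples.add((node.sememe, relation, child.sememe))
            stack.append(child)
    return triples


# The root is the category sememe; the only edge is the one quoted in the text.
root = SememeNode("InstitutePlace", [("-agent", SememeNode("sell"))])
print(tree_to_triples(root))
print(invert(("sell", "agent", "InstitutePlace")))
```

Both calls yield the same triple, which is the point of the inverse relation: the edge is stored so that it leaves the root-side node.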
Sememe knowledge can be applied to a variety of natural language processing tasks, but manually annotating sememe trees is time-consuming and costly. The embodiments of the present invention therefore provide a time-saving and highly accurate method for predicting sememe trees for Chinese-English word pairs that have category sememes. FIG. 2 is a flowchart of the method for predicting a sememe tree for Chinese-English word pairs provided by an embodiment of the present invention. As shown in FIG. 2, the method includes:
S1, acquiring a word pair to be predicted and the category sememe corresponding to the word pair to be predicted;
S2, based on the known preset sememe set and semantic relation set and on the category sememe corresponding to the word pair to be predicted, generating a sememe tree for the word pair to be predicted by means of a preset sememe tree generation algorithm.
Specifically, given the concepts defined above, for a word pair to be predicted $sen \in D_s$, the preset sememe tree generation algorithm provided by the embodiment of the invention uses the known sememe knowledge set and the category sememe of sen to predict a sememe tree for sen.
Given the category sememe of a word pair and a known sememe knowledge base, the embodiment of the invention predicts the sememe tree for the word pair automatically, which saves a large amount of time and cost compared with manual sememe tree annotation and offers higher efficiency and higher accuracy.
Based on the above embodiment, the preset sememe tree generation algorithm includes a path generation algorithm or a label propagation algorithm.
Specifically, the embodiment of the invention provides two sememe tree generation algorithms: a path generation algorithm and a label propagation algorithm.
In the path generation algorithm, the sememe tree is regarded as composed of nodes and edges; a node generator and an edge generator are constructed, and the whole sememe tree is generated step by step in depth-first order, starting from the root node of the sememe tree, i.e. the category sememe;
The label propagation algorithm rests on the observation that words with similar word vectors tend to have similar semantics, and words with similar semantics tend to have similar sememe trees. An idea similar to collaborative filtering is therefore adopted: for a word sense $sen_i$ without sememe tree information, the word senses most similar to $sen_i$ are found in the existing sememe knowledge base, and the sememe tree information of those word senses is propagated to $sen_i$.
Based on any of the above embodiments, the path generation algorithm specifically includes:
constructing a sememe tree edge generator, which obtains the edges that take each node of the sememe tree as head node;
constructing a sememe tree node generator, which obtains the tail nodes reached from a given head node in the sememe tree;
and constructing a tree generator, which obtains the whole sememe tree.
Constructing the sememe tree edge generator to obtain the edges that take each node as head node specifically includes:
taking the path from the root node to the current node, the word sense of the word pair and the category sememe as the input of the sememe tree edge generator;
modeling the path information with a recurrent neural network (RNN), representing each sememe and each semantic relation as a one-hot vector, and concatenating a pre-trained English word vector and a pre-trained Chinese word vector to represent the word sense;
acquiring a first preset classifier, inputting the last state of the RNN, the current node and the word sense into the first preset classifier, and outputting scores for all semantic relations;
normalizing the scores of all semantic relations by the L1 norm to obtain normalized semantic relation scores;
and sorting the normalized semantic relation scores in descending order and traversing them, stopping the traversal once the cumulative score of the top (first preset number + 1) semantic relations exceeds a first preset threshold, and taking the first preset number of top-ranked semantic relations as a first output result.
Constructing the sememe tree node generator to obtain the tail nodes reached from a given head node specifically includes:
taking the path from the root node to the current edge, the word sense of the word pair and the category sememe as the input of the sememe tree node generator;
modeling the path information with a recurrent neural network (RNN), representing each sememe and each semantic relation as a one-hot vector, and concatenating a pre-trained English word vector and a pre-trained Chinese word vector to represent the word sense;
acquiring a second preset classifier, inputting the last state of the RNN, the current node and the word sense into the second preset classifier, and outputting scores for the sememes;
normalizing the sememe scores by the L1 norm to obtain normalized sememe scores;
and sorting the normalized sememe scores in descending order and traversing them, stopping the traversal once the cumulative score of the top (second preset number + 1) sememes exceeds a second preset threshold, and taking the second preset number of top-ranked sememes as a second output result.
Constructing the tree generator to obtain the whole sememe tree specifically includes:
using a recursive algorithm whose inputs to the tree generator are the word sense, the root node and the path from the root node to a given node, so as to generate the whole sememe tree.
Specifically, an edge generator is constructed first. For each node in the sememe tree, the edge generator must produce all edges that take that node as head node. For a node $s_i$, the edge generator takes as input the path from the root node to the current node, $P_{s_i} = [s_1, r_1, s_2, r_2, \ldots, s_j, r_j, \ldots, s_{i-1}, r_{i-1}, s_i]$, the word sense sen of the word pair, and the category sememe. The edge generator is shown in FIG. 3.
An RNN is then used to model the path information. Each sememe $s$ and each relation $r$ is represented as a one-hot vector, $s_j \in \mathbb{R}^n$, $r_j \in \mathbb{R}^m$. The word sense sen is represented by concatenating the pre-trained English word vector $w_e$ and the pre-trained Chinese word vector $w_z$, $sen \in \mathbb{R}^o$.
$state^e_{i-1} = RNN_e([s_1, r_1, s_2, r_2, \ldots, s_j, r_j, \ldots, s_{i-1}, r_{i-1}])$
$state^e_{i-1}$ is the last state produced by $RNN_e$. A classifier then takes this last state, the current node $s_i$ and the word sense sen as input and outputs scores for all semantic relations:
$\hat{y}_i^e = \mathrm{softmax}(W_e \cdot [state^e_{i-1}; s_i; sen] + b_e)$
where the symbol ";" denotes concatenation, and
$state^e_{i-1} \in \mathbb{R}^m$, $\hat{y}_i^e \in \mathbb{R}^m$, $b_e \in \mathbb{R}^m$, $sen \in \mathbb{R}^o$, $W_e \in \mathbb{R}^{(m+n+o) \times m}$.
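The classifier step above (concatenate the RNN state, the one-hot current node and the word sense vector, apply a linear layer, then softmax) can be sketched in plain Python. This is only an illustration under stated assumptions: the RNN itself is not implemented and `state` is taken as given, the helper names are hypothetical, and the weight matrix is stored with one row per relation, i.e. as the transpose of the $(m+n+o) \times m$ shape given in the text.

```python
import math


def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relation_scores(state, node_index, sen, weights, bias, n_sememes):
    """Score every relation from [state; s_i; sen].

    `weights` holds one row per relation (assumption: transposed layout),
    each row of length len(state) + n_sememes + len(sen).
    """
    features = state + one_hot(node_index, n_sememes) + sen
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Toy sizes: m = 2 relations, n = 3 sememes, o = 2 word-sense dimensions.
scores = relation_scores(
    state=[0.1, 0.2], node_index=1, sen=[0.3, 0.4],
    weights=[[0.0] * 7, [0.0] * 7], bias=[1.0, 0.0], n_sememes=3)
print(scores)
```

With all-zero weights the scores depend only on the bias, which makes the toy call easy to check by hand; a trained model would of course have learned weights.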
After the scores of all relations are obtained, they are normalized by the L1 norm and the relations are sorted by score in descending order. The sorted list is then traversed: when the cumulative score of the first k+1 relations exceeds a first preset threshold $m_e$, traversal stops and the first k relations are taken as the output result.
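The selection rule just described (L1 normalization, descending sort, stop when the running total of the first k+1 items exceeds the threshold, keep the first k) can be written as a short helper. The function name is hypothetical and the rule is implemented literally as stated in the text, so a single dominant score that already exceeds the threshold yields an empty selection.

```python
def select_by_cumulative_threshold(scores, threshold):
    """Return indices of the top-k items chosen by the cumulative-score rule."""
    total = sum(abs(s) for s in scores)  # L1 norm of the score vector
    if total == 0:
        return []
    normalized = [s / total for s in scores]
    order = sorted(range(len(scores)), key=lambda i: normalized[i], reverse=True)
    selected = []
    cumulative = 0.0
    for index in order:
        cumulative += normalized[index]
        if cumulative > threshold:  # the first k+1 items exceed the threshold
            break                   # so only the first k are kept
        selected.append(index)
    return selected


print(select_by_cumulative_threshold([5, 3, 1, 1], threshold=0.85))
```

In this toy call the normalized scores are 0.5, 0.3, 0.1 and 0.1; the running total passes 0.85 at the third item, so the first two indices are returned.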
Here, the loss function of the model is:
$L_e = \lVert \hat{y}_i^e - y_i^e \rVert_2$
where $\hat{y}_i^e$ is the score vector output by the classifier and $y_i^e$ is the correct target. Since this is a multi-class problem whose correct answer is a set of relations, the correct target $y_i^e$ is a multi-hot vector: the dimensions corresponding to correct relations have value 1, and the dimensions corresponding to incorrect relations have value 0.
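A small sketch of the multi-hot target and the distance-based loss follows. Whether the norm in the loss is squared cannot be recovered from the text, so the plain L2 norm is used here as an assumption; the function names are hypothetical.

```python
import math


def multi_hot(correct_indices, size):
    """Target vector with 1 at every correct index and 0 elsewhere."""
    vec = [0.0] * size
    for index in correct_indices:
        vec[index] = 1.0
    return vec


def l2_loss(predicted, target):
    """Euclidean distance between predicted scores and the multi-hot target."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)))


target = multi_hot([0, 2], size=4)        # relations 0 and 2 are correct
print(target)
print(l2_loss([0.5, 0.0, 0.5, 0.0], target))
```

Because the classifier output is a softmax distribution and the target may contain several ones, the loss is generally non-zero even for a good prediction; the threshold-based selection is what turns the scores into a set.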
Next, a node generator is constructed. Given a head node and an edge leaving that head node, there may be several correct tail nodes, for example (human | people, hostOf, wisdom) and (human | people, hostOf, name | names). For an edge $r_i$, the node generator takes as input the path from the root node to the current edge, $P_{r_i} = [s_1, r_1, s_2, r_2, \ldots, s_j, r_j, \ldots, s_i, r_i]$, the word sense sen of the word pair, and the category sememe. The node generator is shown in FIG. 4.
$state^n_i = RNN_n([s_1, r_1, s_2, r_2, \ldots, s_j, r_j, \ldots, s_i, r_i])$
$state^n_i$ is the last state produced by $RNN_n$. A classifier then takes this last state and the word sense sen as input and outputs scores for the sememes:
$\hat{y}_i^n = \mathrm{softmax}(W_n \cdot [state^n_i; sen] + b_n)$
where the symbol ";" denotes concatenation, and
$state^n_i \in \mathbb{R}^n$, $\hat{y}_i^n \in \mathbb{R}^n$, $b_n \in \mathbb{R}^n$, $sen \in \mathbb{R}^o$, $W_n \in \mathbb{R}^{(m+n+o) \times n}$.
After the sememe scores are obtained, they are normalized by the L1 norm and the sememes are sorted by score in descending order. When the cumulative score of the first k+1 sememes exceeds a second preset threshold $m_n$, traversal stops and the first k sememes are taken as the output result.
Similarly, the loss function of the model here is:
$L_n = \lVert \hat{y}_i^n - y_i^n \rVert_2$
where $\hat{y}_i^n$ is the score vector output by the classifier and $y_i^n$ is the correct target. Since this is a multi-class problem whose correct answer is a set of sememes, the correct target $y_i^n$ is a multi-hot vector: the dimensions corresponding to correct sememes have value 1, and the dimensions corresponding to incorrect sememes have value 0.
Finally, a tree generator is constructed. The embodiment of the present invention designs a recursive algorithm whose inputs are the word sense sen, the current node $s_i$ and the path $P_{s_i}$; the algorithm generates the part of the sememe tree below node $s_i$. Consequently, if the word sense sen, the root node
Figure BDA0002861735870000101
and
Figure BDA0002861735870000102
are input, the algorithm generates the entire sememe tree.
The algorithm flow is shown in FIG. 5. The path operations used are $P_{s_i} + r_i = [s_1, r_1, \ldots, s_{i-1}, r_{i-1}, s_i, r_i]$ and $P_{s_i} + r_i + s_{i+1} = [s_1, r_1, \ldots, s_{i-1}, r_{i-1}, s_i, r_i, s_{i+1}]$.
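The recursion can be sketched as follows. FIG. 5 is not reproduced in this text, so this is only one plausible reading of the description: the two generators are replaced by hypothetical lookup tables, the `"null"` label for the terminating relation and the `max_depth` safety bound are assumptions of this sketch, and the tree is returned as nested dictionaries.

```python
NULL_RELATION = "null"  # assumed label for the relation that ends generation


def generate_subtree(sen, node, path, edge_generator, node_generator, max_depth=8):
    """Return the subtree below `node`; `path` is [s1, r1, ..., node]."""
    subtree = {"sememe": node, "children": []}
    if max_depth == 0:  # safety bound, not part of the described algorithm
        return subtree
    for relation in edge_generator(sen, node, path):
        if relation == NULL_RELATION:
            continue
        edge_path = path + [relation]            # P_si + r_i
        for tail in node_generator(sen, edge_path):
            child = generate_subtree(
                sen, tail, edge_path + [tail],   # P_si + r_i + s_(i+1)
                edge_generator, node_generator, max_depth - 1)
            subtree["children"].append((relation, child))
    return subtree


# Hypothetical stand-ins for the trained generators (lookup tables, not models).
EDGES = {"InstitutePlace": ["-agent"], "sell": [NULL_RELATION]}
TAILS = {("InstitutePlace", "-agent"): ["sell"]}


def toy_edge_generator(sen, node, path):
    return EDGES.get(node, [NULL_RELATION])


def toy_node_generator(sen, edge_path):
    head, relation = edge_path[-2], edge_path[-1]
    return TAILS.get((head, relation), [])


tree = generate_subtree("outlet|retail store", "InstitutePlace", ["InstitutePlace"],
                        toy_edge_generator, toy_node_generator)
print(tree)
```

Calling the function with the category sememe as both the current node and the one-element path corresponds to generating the entire tree, as described above.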
Further, the label propagation algorithm specifically includes:
constructing a word sense graph by connecting the word pair to be predicted with word pairs whose sememe tree information is known, thereby obtaining a plurality of connecting edges;
parsing the sememe tree of each known word pair into a known triple set, and expressing the known triple set as a single multi-hot label vector with one dimension per triple;
separately calculating the English semantic similarity between the English word of the known word pair and the English word of the word pair to be predicted, and the Chinese semantic similarity between the Chinese word of the known word pair and the Chinese word of the word pair to be predicted, and combining the English and Chinese semantic similarities through a preset similarity function to obtain the weight coefficient of each connecting edge;
acquiring a preset activation function, and obtaining the multi-hot label vector of the word pair to be predicted from the single multi-hot label vectors and the weight coefficients;
normalizing the multi-hot label vector of the word pair to be predicted by the L1 norm to obtain normalized label scores;
sorting the normalized label scores in descending order and traversing them, stopping the traversal once the cumulative score of the top (third preset number + 1) triples exceeds a third preset threshold, and taking the third preset number of top-ranked triples as a third output result;
and converting the third output result, together with the category sememe, into the whole sememe tree, which is output.
Specifically, to express the similarity between word senses and prepare for the subsequent label propagation, a word sense graph is constructed first. In this graph, semantically similar word senses are connected to each other, and every edge carries a weight representing the similarity of the two word senses it joins. Based on the word sense graph, a propagation method is then designed to generate a sememe tree for each word sense that lacks one.
The word sense graph is defined as follows. For a word sense $sen_i$ in the graph, $SEN_i$ denotes the set of word senses in $H_s$ that have the same category sememe as $sen_i$. Since a word sense consists of a Chinese word and an English word, $\cos(w_i^e, w_j^e) + \cos(w_i^z, w_j^z)$ is used to represent the semantic similarity between $sen_i$ and $sen_j$. For each word sense $sen_i$, the $N_k$ word senses in $SEN_i$ with the highest similarity are selected and each is connected to $sen_i$, with the semantic similarity between the two used as the weight of the edge.
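The graph construction above can be sketched in a few lines of Python. The dictionary layout (`name`, `category`, `en`, `zh`) and the function names are assumptions made for this illustration, and the two-dimensional vectors are toy values rather than real word embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pair_similarity(a, b):
    """cos of the English vectors plus cos of the Chinese vectors."""
    return cosine(a["en"], b["en"]) + cosine(a["zh"], b["zh"])


def neighbor_edges(target, known, top_k):
    """Connect `target` to its top_k most similar known senses of the same category."""
    candidates = [k for k in known if k["category"] == target["category"]]
    ranked = sorted(candidates, key=lambda k: pair_similarity(target, k), reverse=True)
    return [(k["name"], pair_similarity(target, k)) for k in ranked[:top_k]]


target = {"name": "T", "category": "c1", "en": [1.0, 0.0], "zh": [0.0, 1.0]}
known = [
    {"name": "A", "category": "c1", "en": [1.0, 0.0], "zh": [0.0, 1.0]},
    {"name": "B", "category": "c1", "en": [0.0, 1.0], "zh": [0.0, 1.0]},
    {"name": "C", "category": "c2", "en": [1.0, 0.0], "zh": [0.0, 1.0]},
]
print(neighbor_edges(target, known, top_k=1))
```

Entry C is never considered because its category sememe differs from the target's, mirroring the restriction to $SEN_i$.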
The label propagation algorithm is then applied. The sememe tree of each word sense $sen_j$ can be parsed into a triple set $T_{sen_j}$. Each triple corresponds to one dimension, so the triple set of $sen_j$ can be expressed as a vector $l_j$. For a word sense $sen_i$ to be predicted, $N_i$ denotes its set of neighbor nodes.
Figure BDA0002861735870000121
Figure BDA0002861735870000122
Here the function f computes the weight with which $sen_j$ contributes to $sen_i$, and g is an activation function, for which the embodiment of the present invention uses the tanh function. The variable of the model is a, and the loss function is:
Figure BDA0002861735870000123
Figure BDA0002861735870000124
is the predicted triple vector and $l_i$ is the true triple vector. After the triple vector predicted by the model is obtained, it is normalized by the L1 norm and all triples are sorted by score in descending order. When the cumulative score of the first k+1 triples exceeds a third preset threshold $m_t$, the first k triples are selected as the output result.
Here, the multi-hot label vector used is a vector whose components take only the values 1 and 0, and more than one component may be 1.
Finally, after the triple set is obtained, an algorithm is designed to convert the triple set into a tree. The algorithm takes a triple set and a category semantic as input and outputs a semantic tree; the flow of the algorithm is shown in Fig. 6.
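Since the flow of Fig. 6 is not reproduced in the text, the following is only one plausible sketch of such a conversion, under stated assumptions: the category semantic is taken as the root, triples are (head, relation, tail) tuples, children are attached recursively, and triples unreachable from the root or leading back to an already placed node are dropped. The labels are hypothetical.

```python
def triples_to_tree(triples, root):
    """Build a nested (node, [(relation, subtree), ...]) tree rooted at `root`."""
    by_head = {}
    for head, relation, tail in sorted(triples):
        by_head.setdefault(head, []).append((relation, tail))

    visited = set()

    def build(node):
        visited.add(node)
        children = []
        for relation, tail in by_head.get(node, []):
            if tail in visited:
                continue  # skip edges that would create a cycle or a duplicate node
            children.append((relation, build(tail)))
        return (node, children)

    return build(root)

# Hypothetical predicted triples; ("x", "r", "y") is not reachable from the root.
predicted = {
    ("human", "HostOf", "Occupation"),
    ("human", "agent", "teach"),
    ("teach", "domain", "education"),
    ("x", "r", "y"),
}
tree = triples_to_tree(predicted, root="human")
```

Dropping unreachable triples is a design choice of this sketch: it guarantees a connected tree, at the cost of silently discarding predictions that cannot be attached.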
Here, the parameters of the model are updated using a gradient descent method.
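The formulas for the propagation step and the loss are images that are not reproduced in the text, so the sketch below rests on explicit assumptions: f is taken to be the edge weight scaled by the single scalar variable a, g is tanh as stated, and the loss is taken to be a squared error between the predicted and true vectors. It only illustrates how a gradient descent update of a could look under those assumptions.

```python
import math

def propagate(a, neighbors):
    """neighbors: list of (edge_weight, label_vector).
    Assumed form: prediction = tanh(a * sum_j weight_j * l_j)."""
    dim = len(neighbors[0][1])
    z = [sum(w * vec[d] for w, vec in neighbors) for d in range(dim)]
    pred = [math.tanh(a * zd) for zd in z]
    return pred, z

def squared_error(a, samples):
    """Assumed loss: mean over samples of 0.5 * ||prediction - truth||^2."""
    total = 0.0
    for neighbors, truth in samples:
        pred, _ = propagate(a, neighbors)
        total += 0.5 * sum((p - t) ** 2 for p, t in zip(pred, truth))
    return total / len(samples)

def train(samples, a=0.1, lr=0.05, epochs=200):
    """Plain gradient descent on the scalar a."""
    for _ in range(epochs):
        grad = 0.0
        for neighbors, truth in samples:
            pred, z = propagate(a, neighbors)
            # d/da of 0.5 * (tanh(a*z) - t)^2 is (p - t) * (1 - p^2) * z
            grad += sum((p - t) * (1.0 - p * p) * zd
                        for p, t, zd in zip(pred, truth, z))
        a -= lr * grad / len(samples)
    return a

# Hypothetical training sample: two neighbors with edge weights 1.8 and 0.9.
samples = [([(1.8, [1, 0, 1]), (0.9, [1, 1, 0])], [1, 0, 1])]
a_trained = train(samples)
```

The initial value, learning rate and epoch count are arbitrary choices for the toy example and would need tuning on real data.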
In order to verify the effect of the embodiment of the invention, word pairs in HowNet are used for testing. The word pairs are randomly divided into a training set, a validation set and a test set according to the proportion of 8; evaluation uses the F1 metric computed over triples. The F1 value of the path generation method is 0.558, and the F1 value of the label propagation method is 0.840.
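For reference, a triple-level F1 for a single word pair can be computed as in the sketch below. The text does not say how scores are aggregated across the test set, so the per-pair formulation and the example triples are assumptions.

```python
def triple_f1(predicted, gold):
    """F1 between a predicted and a gold set of (head, relation, tail) triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: two of the three gold triples are predicted, with no errors.
gold = {("human", "HostOf", "Occupation"),
        ("human", "agent", "teach"),
        ("teach", "domain", "education")}
predicted = {("human", "HostOf", "Occupation"),
             ("human", "agent", "teach")}
score = triple_f1(predicted, gold)
```

Here precision is 1 and recall is 2/3, so the score is 0.8.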
The following describes a system for predicting a semantic tree for Chinese and English word pairs according to an embodiment of the present invention; the system described below and the method for predicting a semantic tree for Chinese and English word pairs described above may be referred to in correspondence with each other.
Fig. 7 is a schematic structural diagram of a system for predicting a semantic tree for Chinese and English word pairs according to an embodiment of the present invention. As shown in Fig. 7, the system includes an acquisition module 71 and a processing module 72, wherein:
the acquisition module 71 is configured to acquire a word pair to be predicted and a category semantic corresponding to the word pair to be predicted; the processing module 72 is configured to generate a semantic tree for the word pair to be predicted by using a preset semantic tree generation algorithm, based on the known preset semantic set and semantic relation set and the category semantic corresponding to the word pair to be predicted.
Given the category semantic information of a single word pair from a known semantic knowledge base, the embodiment of the invention predicts a semantic tree for the given word pair and thereby realizes automatic semantic tree prediction. Compared with manual semantic tree labeling, which consumes a large amount of time and cost, it is more efficient and more accurate.
Fig. 8 illustrates the physical structure of an electronic device. As shown in Fig. 8, the electronic device may include a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a method of predicting a semantic tree for Chinese and English word pairs, the method comprising: acquiring a word pair to be predicted and a category semantic corresponding to the word pair to be predicted; and generating a semantic tree for the word pair to be predicted by adopting a preset semantic tree generation algorithm, based on the known preset semantic set and semantic relation set and the category semantic corresponding to the word pair to be predicted.
In addition, the logic instructions in the memory 830 can be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, or the part of it that contributes over the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other media capable of storing program code.
In another aspect, embodiments of the present invention further provide a computer program product, where the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is capable of executing the method for predicting a semantic tree for Chinese and English word pairs provided by the above method embodiments, where the method includes: acquiring a word pair to be predicted and a category semantic corresponding to the word pair to be predicted; and generating a semantic tree for the word pair to be predicted by adopting a preset semantic tree generation algorithm, based on the known preset semantic set and semantic relation set and the category semantic corresponding to the word pair to be predicted.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the method for predicting a semantic tree for Chinese and English word pairs provided in the foregoing embodiments, where the method includes: acquiring a word pair to be predicted and a category semantic corresponding to the word pair to be predicted; and generating a semantic tree for the word pair to be predicted by adopting a preset semantic tree generation algorithm, based on the known preset semantic set and semantic relation set and the category semantic corresponding to the word pair to be predicted.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (4)

1. A method for predicting a semantic tree for Chinese and English word pairs, comprising:
acquiring a word pair to be predicted and a category semantic corresponding to the word pair to be predicted;
generating a semantic tree for the word pair to be predicted by adopting a preset semantic tree generation algorithm based on a known preset semantic set, a known semantic relation set and the category semantic corresponding to the word pair to be predicted;
the preset semantic tree generation algorithm comprises a path generation algorithm or a label propagation algorithm;
the path generation algorithm specifically includes:
constructing an edge generator of the semantic tree, and acquiring an edge taking each node as a head node in the semantic tree;
constructing a semantic tree node generator, and acquiring tail nodes starting from a given head node in a semantic tree;
constructing a tree generator to obtain a whole semantic tree;
the constructing of the edge generator of the semantic tree to obtain the edge of the semantic tree with each node as a head node specifically comprises:
taking the path from the root node to the current node, the word sense of the word pair and the category semantic as the input of the semantic tree edge generator;
modeling the path information with an RNN, modeling the semantics and semantic relations as one-hot vectors, and concatenating a pre-trained English word vector and a pre-trained Chinese word vector to represent the word sense;
acquiring a first preset classifier, inputting the last state of the RNN, any node and the word sense into the first preset classifier, and outputting scores for all semantic relations;
normalizing the scores of all semantic relations through an L1 norm to obtain normalized scores of all semantic relations;
arranging the normalized scores of all semantic relations in descending order, stopping the traversal when the cumulative score of the first preset number plus 1 of them exceeds a first preset threshold, and taking the first preset number of semantic relation normalized scores as a first output result;
the constructing of the semantic tree node generator for obtaining the tail node starting from the given head node in the semantic tree specifically comprises:
taking the path from the root node to any edge, the word sense of the word pair and the category semantic as the input of the semantic tree node generator;
modeling the node information with an RNN, modeling the semantics and semantic relations as one-hot vectors, and concatenating a pre-trained English word vector and a pre-trained Chinese word vector to represent the word sense;
acquiring a second preset classifier, inputting the last state of the RNN, any node and the word sense into the second preset classifier, and outputting semantic scores;
normalizing the semantic scores through an L1 norm to obtain normalized semantic scores;
arranging the normalized semantic scores in descending order, stopping the traversal when the cumulative score of the second preset number plus 1 of them exceeds a second preset threshold, and taking the second preset number of semantic relation normalization scores as a second output result;
the constructing of the tree generator to obtain the whole semantic tree specifically comprises:
inputting the word sense and the path from the root node to any node into the tree generator, and generating the whole semantic tree by a recursive algorithm;
the label propagation algorithm specifically comprises:
constructing a word sense graph, and connecting the word pair to be predicted with word pairs whose semantic tree information is known to obtain a plurality of connecting edges;
parsing the semantic tree of a known word pair into a known triple set, and expressing the triples in the known triple set as a single multi-hot label vector;
respectively calculating the English semantic similarity between the English word in the known word pair and the English word in the word pair to be predicted and the Chinese semantic similarity between the Chinese word in the known word pair and the Chinese word in the word pair to be predicted, and calculating the correlation between the English semantic similarity and the Chinese semantic similarity through a preset similarity function to obtain the weight coefficients of the connecting edges;
acquiring a preset activation function, and obtaining a multi-hot label vector of the predicted word pair based on the single multi-hot label vector and the weight coefficients;
normalizing the multi-hot label vector of the predicted word pair through the L1 norm to obtain multi-hot label vector normalization scores;
arranging the multi-hot label vector normalization scores in descending order, stopping the traversal when the cumulative score of the third preset number plus 1 of them exceeds a third preset threshold, and taking the third preset number of semantic relation normalization scores as a third output result;
and converting the third output result and the single category semantic into the whole semantic tree and outputting it.
2. A system for predicting a semantic tree for Chinese and English word pairs, comprising:
an acquisition module, configured to acquire a word pair to be predicted and a category semantic corresponding to the word pair to be predicted;
a processing module, configured to generate a semantic tree for the word pair to be predicted by adopting a preset semantic tree generation algorithm based on a known preset semantic set, a known semantic relation set and the category semantic corresponding to the word pair to be predicted;
the preset semantic tree generation algorithm in the processing module comprises a path generation algorithm or a label propagation algorithm;
the path generation algorithm specifically includes:
constructing an edge generator of the semantic tree, and acquiring an edge taking each node as a head node in the semantic tree;
constructing a semantic tree node generator, and acquiring tail nodes starting from a given head node in a semantic tree;
constructing a tree generator to obtain a whole semantic tree;
the constructing of the edge generator of the semantic tree to obtain the edge of the semantic tree with each node as a head node specifically comprises:
taking the path from the root node to the current node, the word sense of the word pair and the category semantic as the input of the semantic tree edge generator;
modeling the path information with an RNN, modeling the semantics and semantic relations as one-hot vectors, and concatenating a pre-trained English word vector and a pre-trained Chinese word vector to represent the word sense;
acquiring a first preset classifier, inputting the last state of the RNN, any node and the word sense into the first preset classifier, and outputting scores for all semantic relations;
normalizing the scores of all semantic relations through an L1 norm to obtain normalized scores of all semantic relations;
arranging the normalized scores of all semantic relations in descending order, stopping the traversal when the cumulative score of the first preset number plus 1 of them exceeds a first preset threshold, and taking the first preset number of semantic relation normalized scores as a first output result;
the constructing of the semantic tree node generator for obtaining the tail node starting from the given head node in the semantic tree specifically comprises:
taking the path from the root node to any edge, the word sense of the word pair and the category semantic as the input of the semantic tree node generator;
modeling the node information with an RNN, modeling the semantics and semantic relations as one-hot vectors, and concatenating a pre-trained English word vector and a pre-trained Chinese word vector to represent the word sense;
acquiring a second preset classifier, inputting the last state of the RNN, any node and the word sense into the second preset classifier, and outputting semantic scores;
normalizing the semantic scores through an L1 norm to obtain normalized semantic scores;
arranging the normalized semantic scores in descending order, stopping the traversal when the cumulative score of the second preset number plus 1 of them exceeds a second preset threshold, and taking the second preset number of semantic relation normalization scores as a second output result;
the constructing of the tree generator to obtain the whole semantic tree specifically comprises:
inputting the word sense and the path from the root node to any node into the tree generator, and generating the whole semantic tree by a recursive algorithm;
the label propagation algorithm specifically comprises:
constructing a word sense graph, and connecting the word pair to be predicted with word pairs whose semantic tree information is known to obtain a plurality of connecting edges;
parsing the semantic tree of a known word pair into a known triple set, and expressing the triples in the known triple set as a single multi-hot label vector;
respectively calculating the English semantic similarity between the English word in the known word pair and the English word in the word pair to be predicted and the Chinese semantic similarity between the Chinese word in the known word pair and the Chinese word in the word pair to be predicted, and calculating the correlation between the English semantic similarity and the Chinese semantic similarity through a preset similarity function to obtain the weight coefficients of the connecting edges;
acquiring a preset activation function, and obtaining a multi-hot label vector of the predicted word pair based on the single multi-hot label vector and the weight coefficients;
normalizing the multi-hot label vector of the predicted word pair through the L1 norm to obtain multi-hot label vector normalization scores;
arranging the multi-hot label vector normalization scores in descending order, stopping the traversal when the cumulative score of the third preset number plus 1 of them exceeds a third preset threshold, and taking the third preset number of semantic relation normalization scores as a third output result;
and converting the third output result and the single category semantic into the whole semantic tree and outputting it.
3. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the method for predicting a semantic tree for Chinese and English word pairs as recited in claim 1.
4. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for predicting a semantic tree for Chinese and English word pairs as recited in claim 1.
CN202011565924.8A 2020-12-25 2020-12-25 Method and system for predicting semantic tree for Chinese and English word pairs Active CN112579794B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011565924.8A CN112579794B (en) 2020-12-25 2020-12-25 Method and system for predicting semantic tree for Chinese and English word pairs

Publications (2)

Publication Number Publication Date
CN112579794A CN112579794A (en) 2021-03-30
CN112579794B true CN112579794B (en) 2022-11-11

Family

ID=75139760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011565924.8A Active CN112579794B (en) 2020-12-25 2020-12-25 Method and system for predicting semantic tree for Chinese and English word pairs

Country Status (1)

Country Link
CN (1) CN112579794B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant