Background
Graph embedding (also called network embedding) is the process of mapping graph data (usually a high-dimensional dense matrix) into low-dimensional dense vectors. Graphs, i.e., collections of nodes and edges, are ubiquitous in the real world: person-to-person connections in social networks, protein interactions in living organisms, communication between IP addresses in communication networks, and so on. Even the most common data, such as an image or a sentence, can be abstractly regarded as a graph structure. By analyzing such structures we can understand social organization, language and different communication patterns, which is why graphs have long been a research hotspot in academia.
In the field of natural language processing, existing approaches to the new word discovery task generally rely on statistical learning, the basic idea being information entropy. However, such simple methods use only shallow semantic information in the corpus and often introduce many low-quality new words. If deeper embedded information, such as graph embeddings, can be introduced, new words of higher quality can be extracted.
Disclosure of Invention
In view of the above, the present invention provides a graph-embedding-based new word discovery method, system, device and medium, which at least partially solve the problems in the prior art. The method first obtains a new word candidate set from the corpus to be processed, then constructs a graph network from that corpus, computes the graph network with a graph attention network to obtain graph embeddings, and finally screens the graph embeddings of the words in the new word candidate set against the graph embeddings of the words in an original general dictionary, so as to obtain general or domain-specific new words of higher quality and reliability.
Specifically, the invention comprises the following:
A graph-embedding-based new word discovery method, comprising the following steps:
cutting N-gram character strings out of the corpus to be processed with a sliding window, calculating statistics for each string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into a new word candidate set; the window size of the sliding window is generally set to 2-7, i.e., each slide of the window cuts out a string of 2 to 7 characters; the corpus to be processed includes forum articles, crawled web content, personally edited documents, and the like;
performing word segmentation on the corpus to be processed and constructing a graph network based on the segmentation result; this is a process of constructing triples, each triple comprising an entity, a relation and an entity, where the entities are words and the relation is the link between two words; that is, the words are connected according to their relations to form a graph network;
computing the graph network with a graph attention network to obtain the graph embeddings of the words of the corpus to be processed; this is the graph-embedding training process: after the graph attention network processes the graph network, every word contained in the graph network is converted into a matrix representation, and each such matrix is the graph embedding of the corresponding word;
finding, among the graph embeddings of the words of the corpus to be processed, the graph embeddings of the words contained in the new word candidate set, screening them against the graph embeddings of the words contained in a general dictionary, and taking the words corresponding to the graph embeddings retained by the screening as candidate new words; the candidate new words obtained in this way are general or domain-specific new words of higher quality and reliability; the general dictionary is not an ordinary Chinese dictionary but a dictionary formed from the general new words, domain-specific new words and the like contained in tools such as Jieba, HanLP, Jiagu and Ansj; discovering new words based on the general and domain-specific words contained in these cutting-edge tools effectively ensures the accuracy of the discovered new words.
Further, calculating statistics for each character string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into the new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF*AMI*(2*El*Er/(El+Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting, according to the scores, the character strings whose scores are larger than a specified threshold and writing them into the new word candidate set;
the AMI differs from the mutual information (MI) used in conventional methods in that it is the MI averaged over the length of the character string, i.e., AMI = MI/length, which yields a more stable value; 2*El*Er/(El+Er) is a weighted (harmonic) average of the left and right entropies, which, unlike the common practice of taking the minimum of the left and right entropies, gives a more objective and stable result and is also suitable for small corpora.
Further, after the strings whose scores meet the requirement are written into the new word candidate set, the method further comprises: adding the new word candidate set to the general dictionary;
performing word segmentation on the corpus to be processed and constructing a graph network based on the segmentation result then specifically comprises:
segmenting the corpus to be processed by dictionary-based maximum-probability word segmentation, using the dictionary obtained by adding the new word candidate set to the general dictionary, and constructing the graph network by taking the adjacent words obtained from the segmentation as nodes.
Further, screening the graph embeddings of the words contained in the new word candidate set against the graph embeddings of the words contained in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them by their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of them according to the sorting, and keeping those of the selected embeddings whose similarity to a general-dictionary embedding meets a specified threshold; generally, the embeddings with the highest similarity are selected, the specified number is 3 to 6, and the embeddings whose similarity exceeds the specified threshold are kept, where a higher threshold gives a better screening result, for example a value of 0.9.
A graph-embedding-based new word discovery system, comprising:
a new word candidate set construction module, configured to cut N-gram character strings out of the corpus to be processed with a sliding window, calculate statistics for each string, score each string according to the statistics, and write the strings whose scores meet the requirement into a new word candidate set; the window size of the sliding window is generally set to 2-7, i.e., each slide of the window cuts out a string of 2 to 7 characters; the corpus to be processed includes forum articles, crawled web content, personally edited documents, and the like;
a graph embedding training module, configured to perform word segmentation on the corpus to be processed, construct a graph network based on the segmentation result, and compute the graph network with a graph attention network to obtain the graph embeddings of the words of the corpus to be processed; after the graph attention network processes the graph network, every word contained in the graph network is converted into a matrix representation, and each such matrix is the graph embedding of the corresponding word;
a new word screening module, configured to find, among the graph embeddings of the words of the corpus to be processed, the graph embeddings of the words contained in the new word candidate set, screen them against the graph embeddings of the words contained in the general dictionary, and take the words corresponding to the graph embeddings retained by the screening as candidate new words; the candidate new words obtained in this way are general or domain-specific new words of higher quality and reliability; the general dictionary is not an ordinary Chinese dictionary but a dictionary formed from the general new words, domain-specific new words and the like contained in tools such as Jieba, HanLP, Jiagu and Ansj; discovering new words based on the general and domain-specific words contained in these cutting-edge tools effectively ensures the accuracy of the discovered new words.
Further, calculating statistics for each character string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into the new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF*AMI*(2*El*Er/(El+Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting, according to the scores, the character strings whose scores are larger than a specified threshold and writing them into the new word candidate set;
the AMI differs from the mutual information (MI) used in conventional methods in that it is the MI averaged over the length of the character string, i.e., AMI = MI/length, which yields a more stable value; 2*El*Er/(El+Er) is a weighted (harmonic) average of the left and right entropies, which, unlike the common practice of taking the minimum of the left and right entropies, gives a more objective and stable result and is also suitable for small corpora.
Further, after the strings whose scores meet the requirement are written into the new word candidate set, the new word candidate set construction module is further configured to add the new word candidate set to the general dictionary;
performing word segmentation on the corpus to be processed and constructing a graph network based on the segmentation result then specifically comprises:
segmenting the corpus to be processed by dictionary-based maximum-probability word segmentation, using the dictionary obtained by adding the new word candidate set to the general dictionary, and constructing the graph network by taking the adjacent words obtained from the segmentation as nodes.
Further, screening the graph embeddings of the words contained in the new word candidate set against the graph embeddings of the words contained in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them by their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of them according to the sorting, and keeping those of the selected embeddings whose similarity to a general-dictionary embedding meets a specified threshold; generally, the embeddings with the highest similarity are selected, the specified number is 3 to 6, and the embeddings whose similarity exceeds the specified threshold are kept, where a higher threshold gives a better screening result, for example a value of 0.9.
An electronic device, comprising a housing, a processor, a memory, a circuit board and a power supply circuit, wherein the circuit board is arranged inside a space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power supply circuit is configured to supply power to each circuit or component of the electronic device; the memory is configured to store executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to execute the above-described graph-embedding-based new word discovery method.
A computer-readable storage medium storing one or more programs which are executable by one or more processors to implement the above-described graph-embedding-based new word discovery method.
The invention has the beneficial effects that:
Compared with the prior art, in which new words are constructed by statistical learning based on information entropy, the invention uses graph embedding to effectively filter out low-quality candidates during new word discovery, and thereby obtains general or domain-specific new words of higher quality and reliability. The invention discovers new words based on the general and domain-specific words contained in cutting-edge tools, which effectively ensures the accuracy of the discovered new words. When computing the statistics and scores of the corpus strings, average mutual information (AMI) is used, which yields a more stable result than the mutual information (MI) used in conventional methods; at the same time, a weighted average of the left and right entropies is used instead of the common practice of taking their minimum, making the result more objective and stable and suitable for small corpora as well, which ensures the accuracy of the new word candidate set and, in turn, the accuracy of the new word discovery result.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
It should be noted that, in the case of no conflict, the features in the following embodiments and examples may be combined with each other; moreover, all other embodiments that can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort fall within the scope of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
As shown in fig. 1, an embodiment of a method for discovering new words based on graph embedding according to the present invention includes:
S11: cutting N-gram character strings out of the corpus to be processed with a sliding window, calculating statistics for each string, and scoring each string according to the statistics;
S12: writing the strings whose scores meet the requirement into a new word candidate set;
the window size of the sliding window is generally set to 2-7, i.e., each slide of the window cuts out a string of 2 to 7 characters; the corpus to be processed includes forum articles, crawled web content, personally edited documents, and the like.
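As an illustration only, steps S11 and S12 could be sketched in Python roughly as follows; the helper name cut_ngrams and the toy corpus are assumptions of the sketch, and the statistics and scoring are detailed further below:

from collections import Counter

def cut_ngrams(text, min_n=2, max_n=7):
    # Slide a window over the corpus and count every string of 2 to 7 characters.
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts

# Toy usage; a real corpus would be forum articles, crawled web pages,
# user-edited documents and so on (typically Chinese text, so the window
# operates on characters rather than on space-separated tokens).
corpus = "new word discovery new word discovery"
print(cut_ngrams(corpus).most_common(3))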
S13: performing word segmentation on the corpus to be processed and constructing a graph network based on the segmentation result;
this is a process of constructing triples, each triple comprising an entity, a relation and an entity, where the entities are words and the relation is the link between two words; that is, the words are connected according to their relations to form a graph network.
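A minimal illustrative sketch of step S13, assuming the networkx library and a corpus that has already been segmented into words (the helper name build_word_graph is an assumption of the example, not part of the claimed method):

import networkx as nx

def build_word_graph(segmented_sentences):
    # Connect adjacent words of each segmented sentence to form a graph network.
    # Each (word_i, adjacency relation, word_j) pair plays the role of an
    # entity-relation-entity triple; edge weights count co-occurrences.
    graph = nx.Graph()
    for words in segmented_sentences:
        for left, right in zip(words, words[1:]):
            if graph.has_edge(left, right):
                graph[left][right]["weight"] += 1
            else:
                graph.add_edge(left, right, weight=1)
    return graph

# Toy usage: two already-segmented sentences (segmentation itself is described below).
sentences = [["graph", "embedding", "method"], ["graph", "attention", "network"]]
g = build_word_graph(sentences)
print(g.edges(data=True))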
S14: computing the graph network with a graph attention network to obtain the graph embeddings of the words of the corpus to be processed;
this is the graph-embedding training process: after the graph attention network processes the graph network, every word contained in the graph network is converted into a matrix representation, and each such matrix is the graph embedding of the corresponding word.
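Step S14 can be illustrated with a simplified single-head graph attention forward pass in NumPy; in practice the parameters w and a would be learned by training (for example with an off-the-shelf graph attention layer), and the shapes and toy inputs below are arbitrary assumptions of the sketch:

import numpy as np

def gat_layer(features, adjacency, w, a, alpha=0.2):
    # features:  (N, F) input matrix, one row per word node
    # adjacency: (N, N) 0/1 matrix of the word graph, self-loops included
    # w:         (F, F_out) projection matrix (learned in a real system)
    # a:         (2 * F_out,) attention vector (learned in a real system)
    h = features @ w
    n = h.shape[0]
    e = np.full((n, n), -np.inf)          # attention logits, -inf for non-edges
    for i in range(n):
        for j in range(n):
            if adjacency[i, j]:
                s = np.concatenate([h[i], h[j]]) @ a
                e[i, j] = s if s > 0 else alpha * s   # LeakyReLU
    e = e - e.max(axis=1, keepdims=True)  # stabilized softmax over neighbours
    att = np.exp(e)
    att = att / att.sum(axis=1, keepdims=True)
    return att @ h                        # one embedding row per word node

# Toy usage: 3 word nodes, 4 input features, 2-dimensional graph embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
adj = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
emb = gat_layer(x, adj, rng.normal(size=(4, 2)), rng.normal(size=(4,)))
print(emb.shape)  # (3, 2): the graph embedding of each word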
S15: finding, among the graph embeddings of the words of the corpus to be processed, the graph embeddings of the words contained in the new word candidate set, screening them against the graph embeddings of the words contained in the general dictionary, and taking the words corresponding to the graph embeddings retained by the screening as candidate new words.
The candidate new words obtained in this way are general or domain-specific new words of higher quality and reliability; the general dictionary is not an ordinary Chinese dictionary but a dictionary formed from the general new words, domain-specific new words and the like contained in tools such as Jieba, HanLP, Jiagu and Ansj; discovering new words based on the general and domain-specific words contained in these cutting-edge tools effectively ensures the accuracy of the discovered new words.
Preferably, calculating statistics for each character string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into the new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF*AMI*(2*El*Er/(El+Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting, according to the scores, the character strings whose scores are larger than a specified threshold and writing them into the new word candidate set;
the AMI differs from the mutual information (MI) used in conventional methods in that it is the MI averaged over the length of the character string, i.e., AMI = MI/length, which yields a more stable value; 2*El*Er/(El+Er) is a weighted (harmonic) average of the left and right entropies, which, unlike the common practice of taking the minimum of the left and right entropies, gives a more objective and stable result and is also suitable for small corpora.
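A self-contained Python sketch of this preferred scoring follows; the concrete definition of MI used here (pointwise mutual information at the weakest binary split of the string) and the toy corpus are assumptions made for illustration, while TF, AMI, El, Er and the combined score follow the description above:

import math
from collections import Counter

def entropy(counter):
    # Shannon entropy of a neighbour-character distribution.
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values()) if total else 0.0

def score_string(s, text):
    # Assumes len(s) >= 2, since the sliding window cuts strings of 2-7 characters.
    tf = text.count(s)                     # TF: raw frequency in the corpus
    if tf == 0:
        return 0.0
    def p(w):
        return text.count(w) / len(text)
    # MI: pointwise mutual information at the weakest binary split (assumed choice).
    mi = min(math.log(p(s) / (p(s[:k]) * p(s[k:]))) for k in range(1, len(s)))
    ami = mi / len(s)                      # average mutual information
    # Left / right entropy of the characters adjacent to each occurrence of s.
    left, right = Counter(), Counter()
    start = text.find(s)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        if start + len(s) < len(text):
            right[text[start + len(s)]] += 1
        start = text.find(s, start + 1)
    el, er = entropy(left), entropy(right)
    if el + er == 0:
        return 0.0
    return tf * ami * (2 * el * er / (el + er))   # TF * AMI * harmonic mean of El, Er

# Toy usage with a hypothetical string and corpus.
corpus = "abxabyabzab"
print(score_string("ab", corpus))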
Preferably, after the strings whose scores meet the requirement are written into the new word candidate set, the method further comprises: adding the new word candidate set to the general dictionary;
performing word segmentation on the corpus to be processed and constructing a graph network based on the segmentation result then specifically comprises:
segmenting the corpus to be processed by dictionary-based maximum-probability word segmentation, using the dictionary obtained by adding the new word candidate set to the general dictionary, and constructing the graph network by taking the adjacent words obtained from the segmentation as nodes;
Dictionary-based maximum-probability word segmentation means segmenting the corpus according to the probabilities, in the dictionary, of the candidate words contained in the corpus. For example, suppose the corpus is a sentence such as "On Teacher's Day the students give flowers to the teacher". The span corresponding to "Teacher's Day" can be segmented either as the single word "Teacher's Day" or as the two words "teacher" and "day". The dictionary probabilities of "teacher" and "Teacher's Day" are then calculated and compared: if the probability of "Teacher's Day" is greater than that of "teacher", the span is kept as the single word "Teacher's Day" in the segmentation result; otherwise it is split into "teacher" and "day".
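A compact sketch of dictionary-based maximum-probability word segmentation is given below; the toy dictionary probabilities and the English stand-in words are invented for illustration, whereas a real system would use the probabilities of the dictionary obtained by merging the general dictionary with the new word candidate set:

import math

def max_prob_segment(text, word_probs):
    # Dynamic programming: best[i] holds (log-probability, segmentation) of text[:i].
    max_len = max(map(len, word_probs))            # longest dictionary entry
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in word_probs and best[j][1] is not None:
                score = best[j][0] + math.log(word_probs[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[len(text)][1]

# Toy usage: the span "teachersday" can be cut as one word or as "teachers" + "day";
# the reading with the higher dictionary probability wins.
probs = {"teachersday": 0.002, "teachers": 0.01, "day": 0.02, "give": 0.05, "flowers": 0.01}
print(max_prob_segment("teachersdaygiveflowers", probs))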
Preferably, screening the graph embeddings of the words contained in the new word candidate set against the graph embeddings of the words contained in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them by their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of them according to the sorting, and keeping those of the selected embeddings whose similarity to a general-dictionary embedding meets a specified threshold; generally, the embeddings with the highest similarity are selected, the specified number is 3 to 6, and the embeddings whose similarity exceeds the specified threshold are kept, where a higher threshold gives a better screening result, for example a value of 0.9;
the similarity may be calculated with algorithms such as MS1, cosine similarity or WMD (word mover's distance).
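A minimal sketch of this screening step, using cosine similarity as the similarity measure (the embedding values, the word lists and the helper names below are assumptions of the example; WMD or another measure could be substituted as noted above):

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def screen_candidates(candidate_emb, dictionary_emb, top_k=5, threshold=0.9):
    # For each candidate word, take its best similarity to any general-dictionary
    # word, keep the top_k candidates by that similarity, then apply the threshold.
    scored = []
    for word, vec in candidate_emb.items():
        best = max(cosine(vec, dvec) for dvec in dictionary_emb.values())
        scored.append((best, word))
    scored.sort(reverse=True)
    return [word for sim, word in scored[:top_k] if sim >= threshold]

# Toy usage with made-up 3-dimensional graph embeddings.
dictionary_emb = {"network": np.array([1.0, 0.1, 0.0]), "embedding": np.array([0.0, 1.0, 0.2])}
candidate_emb = {"graphnet": np.array([0.9, 0.2, 0.0]), "zzz": np.array([0.0, 0.0, 1.0])}
print(screen_candidates(candidate_emb, dictionary_emb, top_k=3, threshold=0.9))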
A graph-embedding-based new word discovery system, comprising:
a new word candidate set construction module 21, configured to cut N-gram character strings out of the corpus to be processed with a sliding window, calculate statistics for each string, score each string according to the statistics, and write the strings whose scores meet the requirement into a new word candidate set; the window size of the sliding window is generally set to 2-7, i.e., each slide of the window cuts out a string of 2 to 7 characters; the corpus to be processed includes forum articles, crawled web content, personally edited documents, and the like;
a graph embedding training module 22, configured to perform word segmentation on the corpus to be processed, construct a graph network based on the segmentation result, and compute the graph network with a graph attention network to obtain the graph embeddings of the words of the corpus to be processed; after the graph attention network processes the graph network, every word contained in the graph network is converted into a matrix representation, and each such matrix is the graph embedding of the corresponding word;
and a new word screening module 23, configured to find, among the graph embeddings of the words of the corpus to be processed, the graph embeddings of the words contained in the new word candidate set, screen them against the graph embeddings of the words contained in the general dictionary, and take the words corresponding to the graph embeddings retained by the screening as candidate new words.
Preferably, calculating statistics for each character string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into the new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF*AMI*(2*El*Er/(El+Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting, according to the scores, the character strings whose scores are larger than a specified threshold and writing them into the new word candidate set;
the AMI differs from the mutual information (MI) used in conventional methods in that it is the MI averaged over the length of the character string, i.e., AMI = MI/length, which yields a more stable value; 2*El*Er/(El+Er) is a weighted (harmonic) average of the left and right entropies, which, unlike the common practice of taking the minimum of the left and right entropies, gives a more objective and stable result and is also suitable for small corpora.
Preferably, after the strings whose scores meet the requirement are written into the new word candidate set, the new word candidate set construction module 21 is further configured to add the new word candidate set to the general dictionary;
performing word segmentation on the corpus to be processed and constructing a graph network based on the segmentation result then specifically comprises:
segmenting the corpus to be processed by dictionary-based maximum-probability word segmentation, using the dictionary obtained by adding the new word candidate set to the general dictionary, and constructing the graph network by taking the adjacent words obtained from the segmentation as nodes.
Preferably, screening the graph embeddings of the words contained in the new word candidate set against the graph embeddings of the words contained in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them by their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of them according to the sorting, and keeping those of the selected embeddings whose similarity to a general-dictionary embedding meets a specified threshold; generally, the embeddings with the highest similarity are selected, the specified number is 3 to 6, and the embeddings whose similarity exceeds the specified threshold are kept, where a higher threshold gives a better screening result, for example a value of 0.9.
Because the processing in the system embodiment is similar to that in the method embodiment, its description is relatively brief; for the corresponding details, refer to the description of the method embodiment.
An embodiment of the present invention further provides an electronic device that can carry out the process of the embodiment shown in fig. 1 of the present invention. As shown in fig. 3, the electronic device may comprise: a housing 31, a processor 32, a memory 33, a circuit board 34 and a power supply circuit 35, wherein the circuit board 34 is arranged inside a space enclosed by the housing 31, and the processor 32 and the memory 33 are arranged on the circuit board 34; the power supply circuit 35 is configured to supply power to each circuit or component of the electronic device; the memory 33 is configured to store executable program code; and the processor 32 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 33, so as to execute the method described in any of the foregoing embodiments.
For the specific process by which the processor 32 executes the above steps, and for the further steps the processor 32 executes by running the executable program code, refer to the description of the embodiment shown in fig. 1 of the present invention; the details are not repeated here.
The electronic device exists in a variety of forms, including but not limited to:
(1) a server: a device that provides computing services; a server comprises a processor, a hard disk, a memory, a system bus and the like, and its architecture is similar to that of a general-purpose computer, but since it must provide highly reliable services, the requirements on processing capacity, stability, reliability, security, scalability, manageability and the like are higher;
(2) other electronic devices with data interaction functions.
Embodiments of the present invention also provide a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the graph-embedding-based new word discovery method described above.
Compared with the prior art, in which new words are constructed by statistical learning based on information entropy, the invention uses graph embedding to effectively filter out low-quality candidates during new word discovery, and thereby obtains general or domain-specific new words of higher quality and reliability. The invention discovers new words based on the general and domain-specific words contained in cutting-edge tools, which effectively ensures the accuracy of the discovered new words. When computing the statistics and scores of the corpus strings, average mutual information (AMI) is used, which yields a more stable result than the mutual information (MI) used in conventional methods; at the same time, a weighted average of the left and right entropies is used instead of the common practice of taking their minimum, making the result more objective and stable and suitable for small corpora as well, which ensures the accuracy of the new word candidate set and, in turn, the accuracy of the new word discovery result.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.