Background
Graph embedding (also called network embedding) is the process of mapping graph data (usually a high-dimensional dense matrix) into low-dimensional dense vectors. Graphs, i.e., collections of nodes and edges, are ubiquitous in the real world: person-to-person connections in social networks, protein interactions in living organisms, communication between IP addresses in communication networks, and so on. Even the most common data, such as an image or a sentence, can be abstractly regarded as a graph structure. By analyzing such structures we can understand social organization, language and different communication patterns, which is why graphs have long been a research hotspot in academia.
In the field of natural language processing, existing approaches to the new word discovery task generally rely on statistical learning, the basic idea being information entropy. However, such simple methods use only shallow semantic information in the corpus and often introduce many low-quality new words. If deeper embedded information, such as graph embeddings, can be introduced, new words of higher quality can be extracted.
Disclosure of Invention
In view of the above, the present invention provides a graph-embedding-based new word discovery method, system, device and medium, which at least partially solve the problems in the prior art. The method first obtains a new word candidate set from the corpus to be processed, then constructs a graph network from that corpus, computes the graph network with a graph attention network to obtain graph embeddings, and finally screens the graph embeddings of the words in the new word candidate set against the graph embeddings of the words in an original general dictionary, so as to obtain general or domain-specific new words of higher quality and reliability.
Specifically, the invention comprises the following:
A graph-embedding-based new word discovery method, comprising the following steps:
cutting N-gram character strings out of the corpus to be processed with a sliding window, calculating statistics for each string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into a new word candidate set; the window size of the sliding window is generally set to 2-7, i.e., each slide of the window cuts out a string of 2 to 7 characters; the corpus to be processed includes forum articles, crawled web content, personally edited documents, and the like;
performing word segmentation on the corpus to be processed and constructing a graph network based on the segmentation result; this is a process of constructing triples, each triple comprising an entity, a relation and an entity, where the entities are words and the relation is the link between two words; that is, the words are connected according to their relations to form a graph network;
computing the graph network with a graph attention network to obtain the graph embeddings of the words of the corpus to be processed; this is the graph-embedding training process: after the graph attention network processes the graph network, every word contained in the graph network is converted into a matrix representation, and each such matrix is the graph embedding of the corresponding word;
finding, among the graph embeddings of the words of the corpus to be processed, the graph embeddings of the words contained in the new word candidate set, screening them against the graph embeddings of the words contained in a general dictionary, and taking the words corresponding to the graph embeddings retained by the screening as candidate new words; the candidate new words obtained in this way are general or domain-specific new words of higher quality and reliability; the general dictionary is not an ordinary Chinese dictionary but a dictionary formed from the general new words, domain-specific new words and the like contained in tools such as Jieba, HanLP, Jiagu and Ansj; discovering new words based on the general and domain-specific words contained in these cutting-edge tools effectively ensures the accuracy of the discovered new words.
Further, calculating statistics for each character string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into the new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF*AMI*(2*El*Er/(El+Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting, according to the scores, the character strings whose scores are larger than a specified threshold and writing them into the new word candidate set;
the AMI differs from the mutual information (MI) used in conventional methods in that it is the MI averaged over the length of the character string, i.e., AMI = MI/length, which yields a more stable value; 2*El*Er/(El+Er) is a weighted (harmonic) average of the left and right entropies, which, unlike the common practice of taking the minimum of the left and right entropies, gives a more objective and stable result and is also suitable for small corpora.
Further, after the strings whose scores meet the requirement are written into the new word candidate set, the method further comprises: adding the new word candidate set to the general dictionary;
performing word segmentation on the corpus to be processed and constructing a graph network based on the segmentation result then specifically comprises:
segmenting the corpus to be processed by dictionary-based maximum-probability word segmentation, using the dictionary obtained by adding the new word candidate set to the general dictionary, and constructing the graph network by taking the adjacent words obtained from the segmentation as nodes.
Further, screening the graph embeddings of the words contained in the new word candidate set against the graph embeddings of the words contained in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them by their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of them according to the sorting, and keeping those of the selected embeddings whose similarity to a general-dictionary embedding meets a specified threshold; generally, the embeddings with the highest similarity are selected, the specified number is 3 to 6, and the embeddings whose similarity exceeds the specified threshold are kept, where a higher threshold gives a better screening result, for example a value of 0.9.
A graph-embedding-based new word discovery system, comprising:
a new word candidate set construction module, configured to cut N-gram character strings out of the corpus to be processed with a sliding window, calculate statistics for each string, score each string according to the statistics, and write the strings whose scores meet the requirement into a new word candidate set; the window size of the sliding window is generally set to 2-7, i.e., each slide of the window cuts out a string of 2 to 7 characters; the corpus to be processed includes forum articles, crawled web content, personally edited documents, and the like;
a graph embedding training module, configured to perform word segmentation on the corpus to be processed, construct a graph network based on the segmentation result, and compute the graph network with a graph attention network to obtain the graph embeddings of the words of the corpus to be processed; after the graph attention network processes the graph network, every word contained in the graph network is converted into a matrix representation, and each such matrix is the graph embedding of the corresponding word;
a new word screening module, configured to find, among the graph embeddings of the words of the corpus to be processed, the graph embeddings of the words contained in the new word candidate set, screen them against the graph embeddings of the words contained in the general dictionary, and take the words corresponding to the graph embeddings retained by the screening as candidate new words; the candidate new words obtained in this way are general or domain-specific new words of higher quality and reliability; the general dictionary is not an ordinary Chinese dictionary but a dictionary formed from the general new words, domain-specific new words and the like contained in tools such as Jieba, HanLP, Jiagu and Ansj; discovering new words based on the general and domain-specific words contained in these cutting-edge tools effectively ensures the accuracy of the discovered new words.
Further, calculating statistics for each character string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into the new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF*AMI*(2*El*Er/(El+Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting, according to the scores, the character strings whose scores are larger than a specified threshold and writing them into the new word candidate set;
the AMI differs from the mutual information (MI) used in conventional methods in that it is the MI averaged over the length of the character string, i.e., AMI = MI/length, which yields a more stable value; 2*El*Er/(El+Er) is a weighted (harmonic) average of the left and right entropies, which, unlike the common practice of taking the minimum of the left and right entropies, gives a more objective and stable result and is also suitable for small corpora.
Further, after the strings whose scores meet the requirement are written into the new word candidate set, the new word candidate set construction module is further configured to add the new word candidate set to the general dictionary;
performing word segmentation on the corpus to be processed and constructing a graph network based on the segmentation result then specifically comprises:
segmenting the corpus to be processed by dictionary-based maximum-probability word segmentation, using the dictionary obtained by adding the new word candidate set to the general dictionary, and constructing the graph network by taking the adjacent words obtained from the segmentation as nodes.
Further, screening the graph embeddings of the words contained in the new word candidate set against the graph embeddings of the words contained in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them by their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of them according to the sorting, and keeping those of the selected embeddings whose similarity to a general-dictionary embedding meets a specified threshold; generally, the embeddings with the highest similarity are selected, the specified number is 3 to 6, and the embeddings whose similarity exceeds the specified threshold are kept, where a higher threshold gives a better screening result, for example a value of 0.9.
An electronic device, comprising a housing, a processor, a memory, a circuit board and a power supply circuit, wherein the circuit board is arranged inside a space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power supply circuit is configured to supply power to each circuit or component of the electronic device; the memory is configured to store executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to execute the above-described graph-embedding-based new word discovery method.
A computer-readable storage medium storing one or more programs which are executable by one or more processors to implement the above-described graph-embedding-based new word discovery method.
The invention has the beneficial effects that:
Compared with the prior art, in which new words are constructed by statistical learning based on information entropy, the invention uses graph embedding to effectively filter out low-quality candidates during new word discovery, and thereby obtains general or domain-specific new words of higher quality and reliability. The invention discovers new words based on the general and domain-specific words contained in cutting-edge tools, which effectively ensures the accuracy of the discovered new words. When computing the statistics and scores of the corpus strings, average mutual information (AMI) is used, which yields a more stable result than the mutual information (MI) used in conventional methods; at the same time, a weighted average of the left and right entropies is used instead of the common practice of taking their minimum, making the result more objective and stable and suitable for small corpora as well, which ensures the accuracy of the new word candidate set and, in turn, the accuracy of the new word discovery result.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
It should be noted that, in the case of no conflict, the features in the following embodiments and examples may be combined with each other; moreover, all other embodiments that can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort fall within the scope of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
As shown in fig. 1, an embodiment of a method for discovering new words based on graph embedding according to the present invention includes:
S11: cutting N-gram character strings out of the corpus to be processed with a sliding window, calculating statistics for each string, and scoring each string according to the statistics;
S12: writing the strings whose scores meet the requirement into a new word candidate set;
the window size of the sliding window is generally set to 2-7, i.e., each slide of the window cuts out a string of 2 to 7 characters; the corpus to be processed includes forum articles, crawled web content, personally edited documents, and the like.
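As an illustration only, steps S11 and S12 could be sketched in Python roughly as follows; the helper name cut_ngrams and the toy corpus are assumptions of the sketch, and the statistics and scoring are detailed further below:

from collections import Counter

def cut_ngrams(text, min_n=2, max_n=7):
    # Slide a window over the corpus and count every string of 2 to 7 characters.
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts

# Toy usage; a real corpus would be forum articles, crawled web pages,
# user-edited documents and so on (typically Chinese text, so the window
# operates on characters rather than on space-separated tokens).
corpus = "new word discovery new word discovery"
print(cut_ngrams(corpus).most_common(3))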
S13: performing word segmentation on the corpus to be processed and constructing a graph network based on the segmentation result;
this is a process of constructing triples, each triple comprising an entity, a relation and an entity, where the entities are words and the relation is the link between two words; that is, the words are connected according to their relations to form a graph network.
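A minimal illustrative sketch of step S13, assuming the networkx library and a corpus that has already been segmented into words (the helper name build_word_graph is an assumption of the example, not part of the claimed method):

import networkx as nx

def build_word_graph(segmented_sentences):
    # Connect adjacent words of each segmented sentence to form a graph network.
    # Each (word_i, adjacency relation, word_j) pair plays the role of an
    # entity-relation-entity triple; edge weights count co-occurrences.
    graph = nx.Graph()
    for words in segmented_sentences:
        for left, right in zip(words, words[1:]):
            if graph.has_edge(left, right):
                graph[left][right]["weight"] += 1
            else:
                graph.add_edge(left, right, weight=1)
    return graph

# Toy usage: two already-segmented sentences (segmentation itself is described below).
sentences = [["graph", "embedding", "method"], ["graph", "attention", "network"]]
g = build_word_graph(sentences)
print(g.edges(data=True))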
S14: computing the graph network with a graph attention network to obtain the graph embeddings of the words of the corpus to be processed;
this is the graph-embedding training process: after the graph attention network processes the graph network, every word contained in the graph network is converted into a matrix representation, and each such matrix is the graph embedding of the corresponding word.
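Step S14 can be illustrated with a simplified single-head graph attention forward pass in NumPy; in practice the parameters w and a would be learned by training (for example with an off-the-shelf graph attention layer), and the shapes and toy inputs below are arbitrary assumptions of the sketch:

import numpy as np

def gat_layer(features, adjacency, w, a, alpha=0.2):
    # features:  (N, F) input matrix, one row per word node
    # adjacency: (N, N) 0/1 matrix of the word graph, self-loops included
    # w:         (F, F_out) projection matrix (learned in a real system)
    # a:         (2 * F_out,) attention vector (learned in a real system)
    h = features @ w
    n = h.shape[0]
    e = np.full((n, n), -np.inf)          # attention logits, -inf for non-edges
    for i in range(n):
        for j in range(n):
            if adjacency[i, j]:
                s = np.concatenate([h[i], h[j]]) @ a
                e[i, j] = s if s > 0 else alpha * s   # LeakyReLU
    e = e - e.max(axis=1, keepdims=True)  # stabilized softmax over neighbours
    att = np.exp(e)
    att = att / att.sum(axis=1, keepdims=True)
    return att @ h                        # one embedding row per word node

# Toy usage: 3 word nodes, 4 input features, 2-dimensional graph embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
adj = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
emb = gat_layer(x, adj, rng.normal(size=(4, 2)), rng.normal(size=(4,)))
print(emb.shape)  # (3, 2): the graph embedding of each word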
S15: finding, among the graph embeddings of the words of the corpus to be processed, the graph embeddings of the words contained in the new word candidate set, screening them against the graph embeddings of the words contained in the general dictionary, and taking the words corresponding to the graph embeddings retained by the screening as candidate new words.
The candidate new words obtained in this way are general or domain-specific new words of higher quality and reliability; the general dictionary is not an ordinary Chinese dictionary but a dictionary formed from the general new words, domain-specific new words and the like contained in tools such as Jieba, HanLP, Jiagu and Ansj; discovering new words based on the general and domain-specific words contained in these cutting-edge tools effectively ensures the accuracy of the discovered new words.
Preferably, calculating statistics for each character string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into the new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF*AMI*(2*El*Er/(El+Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting, according to the scores, the character strings whose scores are larger than a specified threshold and writing them into the new word candidate set;
the AMI differs from the mutual information (MI) used in conventional methods in that it is the MI averaged over the length of the character string, i.e., AMI = MI/length, which yields a more stable value; 2*El*Er/(El+Er) is a weighted (harmonic) average of the left and right entropies, which, unlike the common practice of taking the minimum of the left and right entropies, gives a more objective and stable result and is also suitable for small corpora.
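A self-contained Python sketch of this preferred scoring follows; the concrete definition of MI used here (pointwise mutual information at the weakest binary split of the string) and the toy corpus are assumptions made for illustration, while TF, AMI, El, Er and the combined score follow the description above:

import math
from collections import Counter

def entropy(counter):
    # Shannon entropy of a neighbour-character distribution.
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values()) if total else 0.0

def score_string(s, text):
    # Assumes len(s) >= 2, since the sliding window cuts strings of 2-7 characters.
    tf = text.count(s)                     # TF: raw frequency in the corpus
    if tf == 0:
        return 0.0
    def p(w):
        return text.count(w) / len(text)
    # MI: pointwise mutual information at the weakest binary split (assumed choice).
    mi = min(math.log(p(s) / (p(s[:k]) * p(s[k:]))) for k in range(1, len(s)))
    ami = mi / len(s)                      # average mutual information
    # Left / right entropy of the characters adjacent to each occurrence of s.
    left, right = Counter(), Counter()
    start = text.find(s)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        if start + len(s) < len(text):
            right[text[start + len(s)]] += 1
        start = text.find(s, start + 1)
    el, er = entropy(left), entropy(right)
    if el + er == 0:
        return 0.0
    return tf * ami * (2 * el * er / (el + er))   # TF * AMI * harmonic mean of El, Er

# Toy usage with a hypothetical string and corpus.
corpus = "abxabyabzab"
print(score_string("ab", corpus))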
Preferably, after the strings whose scores meet the requirement are written into the new word candidate set, the method further comprises: adding the new word candidate set to the general dictionary;
performing word segmentation on the corpus to be processed and constructing a graph network based on the segmentation result then specifically comprises:
segmenting the corpus to be processed by dictionary-based maximum-probability word segmentation, using the dictionary obtained by adding the new word candidate set to the general dictionary, and constructing the graph network by taking the adjacent words obtained from the segmentation as nodes;
Dictionary-based maximum-probability word segmentation means segmenting the corpus according to the probabilities, in the dictionary, of the candidate words contained in the corpus. For example, suppose the corpus is a sentence such as "On Teacher's Day the students give flowers to the teacher". The span corresponding to "Teacher's Day" can be segmented either as the single word "Teacher's Day" or as the two words "teacher" and "day". The dictionary probabilities of "teacher" and "Teacher's Day" are then calculated and compared: if the probability of "Teacher's Day" is greater than that of "teacher", the span is kept as the single word "Teacher's Day" in the segmentation result; otherwise it is split into "teacher" and "day".
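A compact sketch of dictionary-based maximum-probability word segmentation is given below; the toy dictionary probabilities and the English stand-in words are invented for illustration, whereas a real system would use the probabilities of the dictionary obtained by merging the general dictionary with the new word candidate set:

import math

def max_prob_segment(text, word_probs):
    # Dynamic programming: best[i] holds (log-probability, segmentation) of text[:i].
    max_len = max(map(len, word_probs))            # longest dictionary entry
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in word_probs and best[j][1] is not None:
                score = best[j][0] + math.log(word_probs[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[len(text)][1]

# Toy usage: the span "teachersday" can be cut as one word or as "teachers" + "day";
# the reading with the higher dictionary probability wins.
probs = {"teachersday": 0.002, "teachers": 0.01, "day": 0.02, "give": 0.05, "flowers": 0.01}
print(max_prob_segment("teachersdaygiveflowers", probs))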
Preferably, screening the graph embeddings of the words contained in the new word candidate set against the graph embeddings of the words contained in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them by their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of them according to the sorting, and keeping those of the selected embeddings whose similarity to a general-dictionary embedding meets a specified threshold; generally, the embeddings with the highest similarity are selected, the specified number is 3 to 6, and the embeddings whose similarity exceeds the specified threshold are kept, where a higher threshold gives a better screening result, for example a value of 0.9;
the similarity may be calculated with algorithms such as MS1, cosine similarity or WMD (word mover's distance).
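A minimal sketch of this screening step, using cosine similarity as the similarity measure (the embedding values, the word lists and the helper names below are assumptions of the example; WMD or another measure could be substituted as noted above):

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def screen_candidates(candidate_emb, dictionary_emb, top_k=5, threshold=0.9):
    # For each candidate word, take its best similarity to any general-dictionary
    # word, keep the top_k candidates by that similarity, then apply the threshold.
    scored = []
    for word, vec in candidate_emb.items():
        best = max(cosine(vec, dvec) for dvec in dictionary_emb.values())
        scored.append((best, word))
    scored.sort(reverse=True)
    return [word for sim, word in scored[:top_k] if sim >= threshold]

# Toy usage with made-up 3-dimensional graph embeddings.
dictionary_emb = {"network": np.array([1.0, 0.1, 0.0]), "embedding": np.array([0.0, 1.0, 0.2])}
candidate_emb = {"graphnet": np.array([0.9, 0.2, 0.0]), "zzz": np.array([0.0, 0.0, 1.0])}
print(screen_candidates(candidate_emb, dictionary_emb, top_k=3, threshold=0.9))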
A graph-embedding-based new word discovery system, comprising:
a new word candidate set construction module 21, configured to cut N-gram character strings out of the corpus to be processed with a sliding window, calculate statistics for each string, score each string according to the statistics, and write the strings whose scores meet the requirement into a new word candidate set; the window size of the sliding window is generally set to 2-7, i.e., each slide of the window cuts out a string of 2 to 7 characters; the corpus to be processed includes forum articles, crawled web content, personally edited documents, and the like;
a graph embedding training module 22, configured to perform word segmentation on the corpus to be processed, construct a graph network based on the segmentation result, and compute the graph network with a graph attention network to obtain the graph embeddings of the words of the corpus to be processed; after the graph attention network processes the graph network, every word contained in the graph network is converted into a matrix representation, and each such matrix is the graph embedding of the corresponding word;
and a new word screening module 23, configured to find, among the graph embeddings of the words of the corpus to be processed, the graph embeddings of the words contained in the new word candidate set, screen them against the graph embeddings of the words contained in the general dictionary, and take the words corresponding to the graph embeddings retained by the screening as candidate new words.
Preferably, calculating statistics for each character string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into the new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF*AMI*(2*El*Er/(El+Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting, according to the scores, the character strings whose scores are larger than a specified threshold and writing them into the new word candidate set;
the AMI differs from the mutual information (MI) used in conventional methods in that it is the MI averaged over the length of the character string, i.e., AMI = MI/length, which yields a more stable value; 2*El*Er/(El+Er) is a weighted (harmonic) average of the left and right entropies, which, unlike the common practice of taking the minimum of the left and right entropies, gives a more objective and stable result and is also suitable for small corpora.
Preferably, after the strings whose scores meet the requirement are written into the new word candidate set, the new word candidate set construction module 21 is further configured to add the new word candidate set to the general dictionary;
performing word segmentation on the corpus to be processed and constructing a graph network based on the segmentation result then specifically comprises:
segmenting the corpus to be processed by dictionary-based maximum-probability word segmentation, using the dictionary obtained by adding the new word candidate set to the general dictionary, and constructing the graph network by taking the adjacent words obtained from the segmentation as nodes.
Preferably, screening the graph embeddings of the words contained in the new word candidate set against the graph embeddings of the words contained in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them by their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of them according to the sorting, and keeping those of the selected embeddings whose similarity to a general-dictionary embedding meets a specified threshold; generally, the embeddings with the highest similarity are selected, the specified number is 3 to 6, and the embeddings whose similarity exceeds the specified threshold are kept, where a higher threshold gives a better screening result, for example a value of 0.9.
Because the processing in the system embodiment is similar to that in the method embodiment, its description is relatively brief; for the corresponding details, refer to the description of the method embodiment.
An embodiment of the present invention further provides an electronic device that can carry out the process of the embodiment shown in fig. 1 of the present invention. As shown in fig. 3, the electronic device may comprise: a housing 31, a processor 32, a memory 33, a circuit board 34 and a power supply circuit 35, wherein the circuit board 34 is arranged inside a space enclosed by the housing 31, and the processor 32 and the memory 33 are arranged on the circuit board 34; the power supply circuit 35 is configured to supply power to each circuit or component of the electronic device; the memory 33 is configured to store executable program code; and the processor 32 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 33, so as to execute the method described in any of the foregoing embodiments.
For the specific process by which the processor 32 executes the above steps, and for the further steps the processor 32 executes by running the executable program code, refer to the description of the embodiment shown in fig. 1 of the present invention; the details are not repeated here.
The electronic device exists in a variety of forms, including but not limited to:
(1) a server: a device that provides computing services; a server comprises a processor, a hard disk, a memory, a system bus and the like, and its architecture is similar to that of a general-purpose computer, but since it must provide highly reliable services, the requirements on processing capacity, stability, reliability, security, scalability, manageability and the like are higher;
(2) other electronic devices with data interaction functions.
Embodiments of the present invention also provide a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the graph-embedding-based new word discovery method described above.
Compared with the prior art, in which new words are constructed by statistical learning based on information entropy, the invention uses graph embedding to effectively filter out low-quality candidates during new word discovery, and thereby obtains general or domain-specific new words of higher quality and reliability. The invention discovers new words based on the general and domain-specific words contained in cutting-edge tools, which effectively ensures the accuracy of the discovered new words. When computing the statistics and scores of the corpus strings, average mutual information (AMI) is used, which yields a more stable result than the mutual information (MI) used in conventional methods; at the same time, a weighted average of the left and right entropies is used instead of the common practice of taking their minimum, making the result more objective and stable and suitable for small corpora as well, which ensures the accuracy of the new word candidate set and, in turn, the accuracy of the new word discovery result.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.