CN112232077A - New word discovery method, system, equipment and medium based on graph embedding - Google Patents
- Publication number
- CN112232077A CN112232077A CN202011060498.2A CN202011060498A CN112232077A CN 112232077 A CN112232077 A CN 112232077A CN 202011060498 A CN202011060498 A CN 202011060498A CN 112232077 A CN112232077 A CN 112232077A
- Authority
- CN
- China
- Prior art keywords
- words
- graph
- candidate set
- new word
- embedding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a graph-embedding-based method, system, device and medium for new word discovery, comprising the following steps: extracting N-gram strings from the corpus to be processed with a sliding window, computing statistics for each string, scoring each string from those statistics, and writing strings whose scores meet the requirement into a new word candidate set; segmenting the corpus into words and constructing a graph network from the segmentation result; computing over the graph network with a graph attention network to obtain graph embeddings for the words of the corpus; and screening the graph embeddings of the words in the new word candidate set against the graph embeddings of the words in a general dictionary, taking the words corresponding to the embeddings that pass the screening as candidate new words. Built on graph embedding, the invention effectively filters out low-quality candidates during new word discovery, and thus yields higher-quality, more reliable general or domain-specific new words.
Description
Technical Field
The invention relates to the field of natural language processing, and in particular to a graph-embedding-based new word discovery method, system, device and medium.
Background
Graph embedding (also called network embedding) is the process of mapping graph data (usually a high-dimensional dense matrix) into low-dimensional dense vectors. A graph, i.e. a collection of nodes and edges, appears widely in real-world scenarios: connections between people in social networks, protein interactions in organisms, communication between IP addresses in networks, and so on. Even the most familiar objects, an image or a sentence, can be abstracted as graph structures; graphs are, one might say, ubiquitous. By analyzing them we can understand social structure, language and different patterns of communication, which is why graphs have long been a focus of academic research.
In natural language processing, existing approaches to the new word discovery task generally construct new words with statistical learning, the basic idea being information-entropy measures. Such methods, however, use only shallow semantic information from the corpus and often admit many low-quality new words. Introducing deeper representation information, such as graph embeddings, therefore makes it possible to extract new words of higher quality.
Disclosure of Invention
In view of the above, the present invention provides a method, system, device and medium for graph-embedding-based new word discovery, which at least partially solves the problems in the prior art. The method first obtains a new word candidate set from the corpus to be processed, then constructs a graph network from that corpus and computes over it with a graph attention network to obtain graph embeddings, and finally screens the embeddings of the words in the new word candidate set against the embeddings of the words in an existing general dictionary, yielding higher-quality, more reliable general or domain-specific new words.
The invention specifically comprises the following steps:
a graph embedding-based new word discovery method comprises the following steps:
extracting N-gram strings from the corpus to be processed with a sliding window, computing statistics for each string, scoring each string from those statistics, and writing strings whose scores meet the requirement into a new word candidate set; the window size of the sliding window is generally set to 2-7, i.e. each slide of the window cuts out a string of 2-7 characters; the corpus to be processed includes forum articles, crawled web content, personally edited documents, and the like;
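The sliding-window cutting step above can be sketched as follows (a minimal illustration; the function and parameter names are not from the patent):

```python
def ngram_candidates(text, min_n=2, max_n=7):
    """Slide a window over the text and collect every substring of
    2-7 characters as a candidate string."""
    grams = []
    for n in range(min_n, max_n + 1):       # window sizes 2..7
        for i in range(len(text) - n + 1):  # slide one character at a time
            grams.append(text[i:i + n])
    return grams

print(ngram_candidates("abcd", 2, 3))  # ['ab', 'bc', 'cd', 'abc', 'bcd']
```

In practice the same string will be cut out many times, which is exactly what the subsequent frequency statistics rely on.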
segmenting the corpus to be processed into words and constructing a graph network from the segmentation result; this is a process of constructing triples of the form (entity, relation, entity), in which each entity is a word and each relation is an edge between words, i.e. the words are connected according to their relations to form a graph network;
computing over the graph network with a graph attention network to obtain graph embeddings for the words of the corpus; this is the graph-embedding training process: after the graph attention network processes the graph network, every word it contains is converted into a matrix representation, and each such matrix is that word's graph embedding;
finding, among the graph embeddings of the words of the corpus, the embeddings of the words in the new word candidate set, screening those embeddings against the graph embeddings of the words in a general dictionary, and taking the words corresponding to the embeddings that pass the screening as candidate new words; the candidate new words obtained in this way are the higher-quality, more reliable general or domain-specific new words sought; the general dictionary differs from an ordinary Chinese dictionary: it is built from the general and domain-specific new words contained in tools such as Jieba, HanLP, Jiagu and Ansj; because new words are discovered against the general and domain-specific words already covered by these mainstream tools, the accuracy of the discovered new words can be effectively guaranteed.
Further, computing statistics for each string, scoring each string from those statistics, and writing strings whose scores meet the requirement into the new word candidate set specifically comprises:
computing statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string with the scoring formula:
TF * AMI * (2*El*Er / (El + Er));
where TF is the word frequency, AMI the average mutual information, El the left entropy and Er the right entropy;
selecting, by score, the strings whose scores exceed a specified threshold and writing them into the new word candidate set;
AMI differs from the mutual information (MI) used in conventional methods: it is the mutual information divided by the string length, i.e. AMI = MI / length, which yields a more stable value; 2*El*Er/(El+Er) is the harmonic mean of the left and right entropies, which differs from the common practice of taking the minimum of the two, making the result more objective and stable and suitable for small corpora as well.
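The statistics and the score can be sketched as follows, assuming raw counts for TF, pointwise mutual information over the string's characters divided by string length for AMI, Shannon entropy of the left/right neighbour characters, and the harmonic-mean combination of the two entropies; all names and smoothing choices here are illustrative, not from the patent:

```python
import math
from collections import Counter

def entropy(counter):
    """Shannon entropy of a neighbour-character distribution."""
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values())

def score_string(corpus, s):
    """Score one candidate string as TF * AMI * harmonic mean of the
    left and right entropies (a sketch; the patent does not fix the
    exact MI estimator or any smoothing)."""
    n = len(s)
    tf = corpus.count(s)
    if tf == 0:
        return 0.0
    total = len(corpus)
    # MI of the string against its individual characters, then AMI = MI / length
    p_s = tf / total
    p_chars = 1.0
    for ch in s:
        p_chars *= corpus.count(ch) / total
    ami = math.log(p_s / p_chars) / n
    # distributions of the characters immediately left/right of each occurrence
    left = Counter(corpus[i - 1] for i in range(1, total - n + 1) if corpus[i:i + n] == s)
    right = Counter(corpus[i + n] for i in range(total - n) if corpus[i:i + n] == s)
    el, er = entropy(left), entropy(right)
    harmonic = 2 * el * er / (el + er) if el + er > 0 else 0.0
    return tf * ami * harmonic
```

A string that recurs with varied neighbours on both sides (high left and right entropy) scores high, while a string frozen inside a longer word scores near zero, since one of its boundary entropies collapses.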
Further, after the strings whose scores meet the requirement are written into the new word candidate set, the method further comprises: adding the new word candidate set to the general dictionary;
segmenting the corpus to be processed into words and constructing a graph network from the segmentation result then specifically comprises:
segmenting the corpus by dictionary maximum-probability segmentation, using the dictionary obtained by adding the new word candidate set to the general dictionary, and constructing the graph network with adjacent segmented words as nodes.
Further, screening the graph embeddings of the words in the new word candidate set against the graph embeddings of the words in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, ranking each one by its similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of the most similar dictionary embeddings, and keeping the candidates whose similarities meet a specified threshold; in general the top-ranked embeddings are selected, the specified number is 3-6, and embeddings with similarity above the threshold are retained, where a higher threshold gives a better screening result, e.g. a value of 0.9.
A graph-embedding based new word discovery system comprising:
the new word candidate set construction module, used for extracting N-gram strings from the corpus to be processed with a sliding window, computing statistics for each string, scoring each string from those statistics, and writing strings whose scores meet the requirement into a new word candidate set; the window size of the sliding window is generally set to 2-7, i.e. each slide of the window cuts out a string of 2-7 characters; the corpus to be processed includes forum articles, crawled web content, personally edited documents, and the like;
the graph embedding training module, used for segmenting the corpus to be processed into words, constructing a graph network from the segmentation result, and computing over the graph network with a graph attention network to obtain graph embeddings for the words of the corpus; after the graph attention network processes the graph network, every word it contains is converted into a matrix representation, and each such matrix is that word's graph embedding;
the new word screening module, used for finding, among the graph embeddings of the words of the corpus, the embeddings of the words in the new word candidate set, screening those embeddings against the graph embeddings of the words in a general dictionary, and taking the words corresponding to the embeddings that pass the screening as candidate new words; the candidate new words obtained in this way are the higher-quality, more reliable general or domain-specific new words sought; the general dictionary differs from an ordinary Chinese dictionary: it is built from the general and domain-specific new words contained in tools such as Jieba, HanLP, Jiagu and Ansj; because new words are discovered against the general and domain-specific words already covered by these mainstream tools, the accuracy of the discovered new words can be effectively guaranteed.
Further, computing statistics for each string, scoring each string from those statistics, and writing strings whose scores meet the requirement into the new word candidate set specifically comprises:
computing statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string with the scoring formula:
TF * AMI * (2*El*Er / (El + Er));
where TF is the word frequency, AMI the average mutual information, El the left entropy and Er the right entropy;
selecting, by score, the strings whose scores exceed a specified threshold and writing them into the new word candidate set;
AMI differs from the mutual information (MI) used in conventional methods: it is the mutual information divided by the string length, i.e. AMI = MI / length, which yields a more stable value; 2*El*Er/(El+Er) is the harmonic mean of the left and right entropies, which differs from the common practice of taking the minimum of the two, making the result more objective and stable and suitable for small corpora as well.
Further, after the strings whose scores meet the requirement are written into the new word candidate set, the new word candidate set construction module is further configured to: add the new word candidate set to the general dictionary;
segmenting the corpus to be processed into words and constructing a graph network from the segmentation result then specifically comprises:
segmenting the corpus by dictionary maximum-probability segmentation, using the dictionary obtained by adding the new word candidate set to the general dictionary, and constructing the graph network with adjacent segmented words as nodes.
Further, screening the graph embeddings of the words in the new word candidate set against the graph embeddings of the words in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, ranking each one by its similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of the most similar dictionary embeddings, and keeping the candidates whose similarities meet a specified threshold; in general the top-ranked embeddings are selected, the specified number is 3-6, and embeddings with similarity above the threshold are retained, where a higher threshold gives a better screening result, e.g. a value of 0.9.
An electronic device, comprising: a housing, a processor, a memory, a circuit board and a power supply circuit, wherein the circuit board is arranged inside the space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power supply circuit supplies power to each circuit or device of the electronic apparatus; the memory stores executable program code; and the processor, by reading the executable program code stored in the memory, runs the program corresponding to that code to execute the graph-embedding-based new word discovery method described above.
A computer-readable storage medium storing one or more programs executable by one or more processors to implement the graph-embedding-based new word discovery method described above.
The invention has the beneficial effects that:
Compared with the prior art, in which statistical learning based on information entropy is used to construct new words, the invention's use of graph embedding effectively filters out low-quality candidates during new word discovery, yielding higher-quality, more reliable general or domain-specific new words. Because new words are discovered against the general and domain-specific words already covered by mainstream tools, the accuracy of the discovered new words can be effectively guaranteed. When the string statistics and scores are computed, average mutual information (AMI) is used, which gives a more stable result than the mutual information (MI) of conventional methods; at the same time the harmonic mean of the left and right entropies is used instead of the usual minimum of the two, which makes the result more objective and stable and suitable for small corpora as well, ensuring the accuracy of the new word candidate set and hence of the final new word discovery result.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart of a method for discovering new words based on graph embedding according to an embodiment of the present invention;
FIG. 2 is a diagram of a graph-based embedding new word discovery system according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
It should be noted that, in the case of no conflict, the features in the following embodiments and examples may be combined with each other; moreover, all other embodiments that can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort fall within the scope of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
As shown in fig. 1, an embodiment of a method for discovering new words based on graph embedding according to the present invention includes:
S11: extracting N-gram strings from the corpus to be processed with a sliding window, computing statistics for each string, and scoring each string from those statistics;
S12: writing the strings whose scores meet the requirement into a new word candidate set;
the window size of the sliding window is generally set to 2-7, i.e. each slide of the window cuts out a string of 2-7 characters; the corpus to be processed includes forum articles, crawled web content, personally edited documents, and the like;
S13: segmenting the corpus to be processed into words and constructing a graph network from the segmentation result;
this is a process of constructing triples of the form (entity, relation, entity), in which each entity is a word and each relation is an edge between words, i.e. the words are connected according to their relations to form a graph network;
S14: computing over the graph network with a graph attention network to obtain graph embeddings for the words of the corpus;
this is the graph-embedding training process: after the graph attention network processes the graph network, every word it contains is converted into a matrix representation, and each such matrix is that word's graph embedding;
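A single graph-attention aggregation step, of the kind such a network applies to the word graph, can be sketched with NumPy as below. This is a toy, untrained layer: the weight matrix `W` and attention vector `a` would normally be learned, and a real model would stack multi-head layers; it only illustrates how each node's embedding is recomputed as an attention-weighted sum of its neighbours' transformed features:

```python
import numpy as np

def gat_layer(H, A, W, a):
    """One graph-attention aggregation step: for each node, score its
    neighbours with a shared attention vector, softmax the scores, and
    average the neighbours' transformed features with those weights."""
    Z = H @ W                           # (N, F') transformed node features
    out = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        nbrs = np.where(A[i] > 0)[0]    # neighbours (self-loops included in A)
        # unnormalised logits e_ij = LeakyReLU(a^T [z_i || z_j])
        e = np.array([np.concatenate([Z[i], Z[j]]) @ a for j in nbrs])
        e = np.where(e > 0, e, 0.2 * e)                     # LeakyReLU
        alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # softmax
        out[i] = (alpha[:, None] * Z[nbrs]).sum(axis=0)     # weighted sum
    return out
```

Stacking such layers and training them on the word graph yields, for each word node, the dense representation the text calls that word's graph embedding.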
S15: finding, among the graph embeddings of the words of the corpus, the embeddings of the words in the new word candidate set, screening those embeddings against the graph embeddings of the words in a general dictionary, and taking the words corresponding to the embeddings that pass the screening as candidate new words.
The candidate new words obtained in this way are the higher-quality, more reliable general or domain-specific new words sought; the general dictionary differs from an ordinary Chinese dictionary: it is built from the general and domain-specific new words contained in tools such as Jieba, HanLP, Jiagu and Ansj; because new words are discovered against the general and domain-specific words already covered by these mainstream tools, the accuracy of the discovered new words can be effectively guaranteed.
Preferably, computing statistics for each string, scoring each string from those statistics, and writing strings whose scores meet the requirement into the new word candidate set specifically comprises:
computing statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string with the scoring formula:
TF * AMI * (2*El*Er / (El + Er));
where TF is the word frequency, AMI the average mutual information, El the left entropy and Er the right entropy;
selecting, by score, the strings whose scores exceed a specified threshold and writing them into the new word candidate set;
AMI differs from the mutual information (MI) used in conventional methods: it is the mutual information divided by the string length, i.e. AMI = MI / length, which yields a more stable value; 2*El*Er/(El+Er) is the harmonic mean of the left and right entropies, which differs from the common practice of taking the minimum of the two, making the result more objective and stable and suitable for small corpora as well.
Preferably, after the strings whose scores meet the requirement are written into the new word candidate set, the method further comprises: adding the new word candidate set to the general dictionary;
segmenting the corpus to be processed into words and constructing a graph network from the segmentation result then specifically comprises:
segmenting the corpus by dictionary maximum-probability segmentation, using the dictionary obtained by adding the new word candidate set to the general dictionary, and constructing the graph network with adjacent segmented words as nodes;
Dictionary maximum-probability segmentation cuts the corpus according to the probabilities, recorded in the dictionary, of the words it contains, for example with a corpus such as:
教师节学生给老师送花 ("On Teacher's Day the students give the teacher flowers")
Here 教师节 ("Teacher's Day") can be cut either as the single word 教师节 or as the two words 教师 ("teacher") and 节 ("festival"); the probabilities of the two readings in the dictionary are computed and compared. If 教师节 is the more probable, the segmentation result is:
教师节 / 学生 / 给 / 老师 / 送花
and if 教师 / 节 is the more probable, the segmentation result is:
教师 / 节 / 学生 / 给 / 老师 / 送花
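Dictionary maximum-probability segmentation of this kind can be sketched as a dynamic program over the text: among all ways of covering the text with dictionary words, pick the segmentation whose product of word probabilities is largest. The probability table below is illustrative, not from the patent:

```python
def max_prob_segment(text, probs):
    """Maximum-probability dictionary segmentation via dynamic programming.
    probs maps each dictionary word to its probability; unknown spans
    get probability 0 and are never chosen."""
    n = len(text)
    best = [0.0] * (n + 1)   # best[i]: best probability for text[:i]
    best[0] = 1.0
    back = [0] * (n + 1)     # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - 7), i):   # candidate words up to 7 chars
            p = probs.get(text[j:i], 0.0)
            if p and best[j] * p > best[i]:
                best[i], back[i] = best[j] * p, j
    words, i = [], n                        # recover the path right-to-left
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]
```

With a toy dictionary where P(教师节) exceeds P(教师)·P(节), the sentence above is cut with 教师节 kept as one word, matching the comparison described in the text.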
Preferably, screening the graph embeddings of the words in the new word candidate set against the graph embeddings of the words in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, ranking each one by its similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of the most similar dictionary embeddings, and keeping the candidates whose similarities meet a specified threshold; in general the top-ranked embeddings are selected, the specified number is 3-6, and embeddings with similarity above the threshold are retained, where a higher threshold gives a better screening result, e.g. a value of 0.9;
the similarity may be computed with algorithms such as MS1, cosine similarity or WMD.
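The screening step can be sketched with cosine similarity (one of the options listed); in this sketch a candidate word is kept when its best match among the dictionary embeddings clears the threshold, and all names are illustrative:

```python
import numpy as np

def screen_candidates(cand_emb, dict_emb, top_k=5, threshold=0.9):
    """Keep a candidate word if the cosine similarity between its graph
    embedding and its nearest dictionary-word embeddings clears the
    threshold (embeddings given here as flat vectors for simplicity)."""
    kept = []
    for word, v in cand_emb.items():
        # cosine similarity to every dictionary embedding, best first
        sims = sorted(
            (float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
             for u in dict_emb.values()),
            reverse=True)[:top_k]
        if sims and sims[0] >= threshold:
            kept.append(word)
    return kept
```

Candidates whose embeddings sit far from every dictionary word's embedding are discarded as low-quality, which is the filtering effect the screening step is after.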
A graph-embedding based new word discovery system comprising:
the new word candidate set construction module 21, which extracts N-gram strings from the corpus to be processed with a sliding window, computes statistics for each string, scores each string from those statistics, and writes strings whose scores meet the requirement into a new word candidate set; the window size of the sliding window is generally set to 2-7, i.e. each slide of the window cuts out a string of 2-7 characters; the corpus to be processed includes forum articles, crawled web content, personally edited documents, and the like;
the graph embedding training module 22, configured to segment the corpus to be processed into words, construct a graph network from the segmentation result, and compute over the graph network with a graph attention network to obtain graph embeddings for the words of the corpus; after the graph attention network processes the graph network, every word it contains is converted into a matrix representation, and each such matrix is that word's graph embedding;
the new word screening module 23, configured to find, among the graph embeddings of the words of the corpus, the embeddings of the words in the new word candidate set, screen those embeddings against the graph embeddings of the words in a general dictionary, and take the words corresponding to the embeddings that pass the screening as candidate new words.
Preferably, computing statistics for each string, scoring each string from those statistics, and writing strings whose scores meet the requirement into the new word candidate set specifically comprises:
computing statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string with the scoring formula:
TF * AMI * (2*El*Er / (El + Er));
where TF is the word frequency, AMI the average mutual information, El the left entropy and Er the right entropy;
selecting, by score, the strings whose scores exceed a specified threshold and writing them into the new word candidate set;
AMI differs from the mutual information (MI) used in conventional methods: it is the mutual information divided by the string length, i.e. AMI = MI / length, which yields a more stable value; 2*El*Er/(El+Er) is the harmonic mean of the left and right entropies, which differs from the common practice of taking the minimum of the two, making the result more objective and stable and suitable for small corpora as well.
Preferably, after selecting a character string with a score meeting the requirement and writing the character string into the new word candidate set, the new word candidate set constructing module 21 is further configured to: adding the new word candidate set to the general dictionary;
the word segmentation is carried out on the linguistic data to be calculated, and a graph network is constructed based on word segmentation results, and the method specifically comprises the following steps:
and on the basis of the dictionary obtained by adding the new word candidate set into the general dictionary, adopting maximum probability word segmentation of the dictionary for the corpus to be calculated, and constructing a graph network by taking adjacent words after word segmentation as nodes.
Preferably, the screening of the graph embeddings of the words contained in the new word candidate set based on the graph embeddings of the words contained in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them according to their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of graph embeddings of words in the new word candidate set according to the sorting, and screening, from the selected graph embeddings, those whose similarity to a graph embedding of a word in the general dictionary meets a specified threshold. In general, the graph embeddings with the highest similarity are selected according to the sorting, the specified number is 3 to 6, and graph embeddings with a similarity greater than the specified threshold are retained; the higher the threshold, the better the screening result, for example, a value of 0.9.
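The screening just described might be sketched as follows, under the assumption that similarity means cosine similarity between embedding vectors (the patent does not fix the measure) and that a candidate is kept when the best of its nearest general-dictionary embeddings exceeds the threshold; all names are illustrative:

```python
import math

def cosine(u, v):
    # Cosine similarity of two embedding vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def screen_candidates(cand_embs, dict_embs, top_k=5, threshold=0.9):
    # For each candidate word, look at its top_k most similar
    # general-dictionary embeddings and keep the word if the best of
    # them exceeds the threshold.
    kept = []
    for word, vec in cand_embs.items():
        sims = sorted((cosine(vec, d) for d in dict_embs.values()), reverse=True)[:top_k]
        if sims and sims[0] > threshold:
            kept.append(word)
    return kept
```

With threshold=0.9, a candidate whose embedding is nearly collinear with some dictionary embedding passes, while an orthogonal one is rejected.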
Since the process of the system embodiment of the invention is similar to that of the method embodiment, the system embodiment is described more briefly; for the corresponding details, refer to the method embodiment.
An embodiment of the present invention further provides an electronic device, as shown in fig. 3, which can implement the process of the embodiment shown in fig. 1 of the present invention, and as shown in fig. 3, the electronic device may include: the device comprises a shell 31, a processor 32, a memory 33, a circuit board 34 and a power circuit 35, wherein the circuit board 34 is arranged inside a space enclosed by the shell 31, and the processor 32 and the memory 33 are arranged on the circuit board 34; a power supply circuit 35 for supplying power to each circuit or device of the electronic apparatus; the memory 33 is used for storing executable program codes; the processor 32 executes a program corresponding to the executable program code by reading the executable program code stored in the memory 33, for executing the method described in any of the foregoing embodiments.
The specific execution process of the above steps by the processor 32 and the steps further executed by the processor 32 by running the executable program code may refer to the description of the embodiment shown in fig. 1 of the present invention, and are not described herein again.
The electronic device exists in a variety of forms, including but not limited to:
(1) a server: a device that provides computing services. A server comprises a processor, a hard disk, a memory, a system bus, and the like, and is similar in architecture to a general-purpose computer; however, because it must provide highly reliable services, it has high requirements on processing capacity, stability, reliability, security, scalability, manageability, and the like;
(2) and other electronic equipment with data interaction function.
Embodiments of the present invention also provide a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the graph-embedding-based new word discovery method described in any of the foregoing embodiments.
Compared with the prior art, in which new words are constructed by statistical learning based on information entropy, the present invention can effectively filter out low-quality candidate new words during new word discovery by means of the graph embedding technology, thereby obtaining higher-quality and more reliable general or domain-specific new words. The invention discovers new words on the basis of the general words and domain words contained in leading tools, which effectively ensures the accuracy of new word discovery. When calculating the statistics and scores of corpus character strings, average mutual information (AMI) is used, which gives a more stable result than the mutual information (MI) used in conventional methods; meanwhile, a weighted average of the left and right entropies is used instead of the common practice of taking their minimum, which makes the result more objective and stable and keeps the method applicable to small corpora, thus ensuring the accuracy of the new word candidate set and, in turn, of the new word discovery results.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A new word discovery method based on graph embedding is characterized by comprising the following steps:
cutting N-GRAM character strings of the linguistic data to be calculated by using a sliding window, calculating the statistic of each character string, scoring each character string according to the statistic, and selecting character strings with scores meeting the requirements to write into a new word candidate set;
performing word segmentation on the linguistic data to be calculated, and constructing a graph network based on word segmentation results;
calculating the graph network based on the graph attention network to obtain graph embedding of words of the linguistic data to be calculated;
finding, among the graph embeddings of the words of the linguistic data to be calculated, the graph embeddings of the words contained in the new word candidate set, screening the graph embeddings of the words contained in the new word candidate set based on the graph embeddings of the words contained in the general dictionary, and taking the words corresponding to the graph embeddings obtained by the screening as candidate new words.
2. The method according to claim 1, wherein the calculating statistics of each character string, scoring each character string according to the statistics, selecting a character string with a score meeting requirements to write into a new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF*AMI*(2*(El+Er)/(El*Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
and selecting the character strings with the scores larger than a specified threshold value according to the scores of the character strings, and writing the character strings into a new word candidate set.
3. The method of claim 2, wherein after selecting the character strings with the scores meeting the requirement to write into the new word candidate set, the method further comprises: adding the new word candidate set to the general dictionary;
the word segmentation is carried out on the linguistic data to be calculated, and a graph network is constructed based on word segmentation results, and the method specifically comprises the following steps:
on the basis of the dictionary obtained by adding the new word candidate set into the general dictionary, applying dictionary-based maximum-probability word segmentation to the corpus to be calculated, and constructing a graph network with the adjacent words after word segmentation as nodes.
4. The method according to claim 3, wherein the screening of the graph embeddings of the words contained in the new word candidate set based on the graph embeddings of the words contained in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them according to their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of graph embeddings of words in the new word candidate set according to the sorting, and screening, from the selected graph embeddings, those whose similarity to a graph embedding of a word in the general dictionary meets a specified threshold.
5. A graph embedding-based new word discovery system, comprising:
the new word candidate set building module is used for cutting N-GRAM character strings of the linguistic data to be calculated by using a sliding window, calculating the statistic of each character string, scoring each character string according to the statistic, and selecting the character string with the score meeting the requirement to write into a new word candidate set;
the graph embedding training module is used for carrying out word segmentation on the linguistic data to be calculated, constructing a graph network based on word segmentation results, and calculating the graph network based on a graph attention network to obtain the graph embedding of the words of the linguistic data to be calculated;
and the new word screening module is used for finding, among the graph embeddings of the words of the linguistic data to be calculated, the graph embeddings of the words contained in the new word candidate set, screening the graph embeddings of the words contained in the new word candidate set based on the graph embeddings of the words contained in the general dictionary, and taking the words corresponding to the graph embeddings obtained by the screening as candidate new words.
6. The system according to claim 5, wherein the calculating statistics of each character string, scoring each character string according to the statistics, selecting a character string with a score meeting requirements to write into a new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF*AMI*(2*(El+Er)/(El*Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
and selecting the character strings with the scores larger than a specified threshold value according to the scores of the character strings, and writing the character strings into a new word candidate set.
7. The system of claim 6, wherein after selecting a character string with a score meeting a requirement to write into the new word candidate set, the new word candidate set construction module is further configured to: adding the new word candidate set to the general dictionary;
the word segmentation is carried out on the linguistic data to be calculated, and a graph network is constructed based on word segmentation results, and the method specifically comprises the following steps:
on the basis of the dictionary obtained by adding the new word candidate set into the general dictionary, applying dictionary-based maximum-probability word segmentation to the corpus to be calculated, and constructing a graph network with the adjacent words after word segmentation as nodes.
8. The system according to claim 7, wherein the screening of the graph embeddings of the words contained in the new word candidate set based on the graph embeddings of the words contained in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them according to their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of graph embeddings of words in the new word candidate set according to the sorting, and screening, from the selected graph embeddings, those whose similarity to a graph embedding of a word in the general dictionary meets a specified threshold.
9. An electronic device, characterized in that the electronic device comprises: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory for performing the method according to any one of claims 1 to 4.
10. A computer-readable storage medium, having one or more programs stored thereon, the one or more programs being executable by one or more processors to perform the method of any of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011060498.2A CN112232077B (en) | 2020-09-30 | 2020-09-30 | New word discovery method, system, equipment and medium based on graph embedding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112232077A true CN112232077A (en) | 2021-01-15 |
CN112232077B CN112232077B (en) | 2021-10-29 |
Family
ID=74119881
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011060498.2A Active CN112232077B (en) | 2020-09-30 | 2020-09-30 | New word discovery method, system, equipment and medium based on graph embedding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112232077B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105138510A (en) * | 2015-08-10 | 2015-12-09 | 昆明理工大学 | Microblog-based neologism emotional tendency judgment method |
CN106502984A (en) * | 2016-10-19 | 2017-03-15 | 上海智臻智能网络科技股份有限公司 | A kind of method and device of field new word discovery |
CN106649849A (en) * | 2016-12-30 | 2017-05-10 | 上海智臻智能网络科技股份有限公司 | Text information base building method and device and searching method, device and system |
CN107622051A (en) * | 2017-09-14 | 2018-01-23 | 马上消费金融股份有限公司 | New word screening method and device |
US20190155918A1 (en) * | 2017-11-20 | 2019-05-23 | Colossio, Inc. | Real-time classification of evolving dictionaries |
Non-Patent Citations (1)
Title |
---|
DU YANG: "Research on New Word Discovery and Named Entity Recognition Based on Word Vector Representation", China Masters' Theses Full-text Database (Information Science and Technology) * |
Also Published As
Publication number | Publication date |
---|---|
CN112232077B (en) | 2021-10-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||