CN112232077A - New word discovery method, system, equipment and medium based on graph embedding - Google Patents


Info

Publication number
CN112232077A
CN112232077A (application CN202011060498.2A)
Authority
CN
China
Prior art keywords
words
graph
candidate set
new word
embedding
Prior art date
Legal status
Granted
Application number
CN202011060498.2A
Other languages
Chinese (zh)
Other versions
CN112232077B (en)
Inventor
莫永卓
赵顺峰
练睿
肖杰
Current Assignee
Workway Shenzhen Information Technology Co ltd
Original Assignee
Workway Shenzhen Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Workway Shenzhen Information Technology Co ltd
Priority to CN202011060498.2A
Publication of CN112232077A
Application granted
Publication of CN112232077B
Legal status: Active
Anticipated expiration

Classifications

    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F16/951 Indexing; Web crawling techniques
    • G06F40/216 Parsing using statistical methods
    • G06F40/242 Dictionaries
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates


Abstract

The invention relates to a graph-embedding-based new word discovery method, system, device, and medium, comprising the following steps: cutting N-gram character strings from the corpus to be processed with a sliding window, computing statistics for each string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into a new word candidate set; segmenting the corpus into words and constructing a graph network from the segmentation results; computing the graph network with a graph attention network to obtain graph embeddings for the words of the corpus; and screening the graph embeddings of the words in the new word candidate set against the graph embeddings of the words in a general dictionary, taking the words corresponding to the embeddings that pass screening as candidate new words. Based on graph embedding, the invention can effectively filter out low-quality candidates during new word discovery and thereby obtain higher-quality, more reliable general or domain-specific new words.

Description

New word discovery method, system, equipment and medium based on graph embedding
Technical Field
The invention relates to the field of natural language processing, and in particular to a graph-embedding-based new word discovery method, system, device, and medium.
Background
Graph embedding (also called network embedding) is the process of mapping graph data (usually a high-dimensional dense matrix) into low-dimensional dense vectors. Graphs, that is, collections of nodes and edges, are ubiquitous in the real world: connections between people in social networks, protein interactions in organisms, communication between IP addresses in networks, and so on. Moreover, even the most common images and sentences can be abstractly regarded as graph structures; graph structure can be said to be everywhere. By analyzing graphs we can understand social structure, language, and different patterns of communication, which is why graphs have long been a research hotspot in academia.
In natural language processing, existing approaches to the new word discovery task generally construct new words with statistical learning methods whose basic idea is information entropy. Such methods, however, use only shallow semantic information in the corpus and often introduce many low-quality new words. If deeper embedding information, such as graph embeddings, can be introduced, new words of higher quality can be extracted.
Disclosure of Invention
In view of the above, the present invention provides a graph-embedding-based new word discovery method, system, device, and medium that at least partially solve the problems in the prior art. The method first obtains a new word candidate set from the corpus to be processed, then constructs a graph network from the corpus and computes it with a graph attention network to obtain graph embeddings, and finally screens the graph embeddings of the words in the new word candidate set against the graph embeddings of the words in an existing general dictionary to obtain higher-quality, more reliable general or domain-specific new words.
The invention specifically comprises the following steps:
a graph embedding-based new word discovery method comprises the following steps:
cutting N-gram character strings from the corpus to be processed with a sliding window, computing statistics for each string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into a new word candidate set; the window size of the sliding window is generally set to 2-7, i.e., each slide of the window cuts out a string containing 2 to 7 characters; the corpus to be processed may include forum articles, crawled web content, personally edited documents, and the like;
segmenting the corpus into words and constructing a graph network from the segmentation results; this is a process of constructing triples of the form (entity, relation, entity), where the entities are words and the relations are the links between them, i.e., words are connected according to their relations to form a graph network;
computing the graph network with a graph attention network to obtain graph embeddings for the words of the corpus; this is the graph embedding training process: after the graph attention network processes the graph network, each word contained in it is converted into a matrix representation, and each matrix is the graph embedding of that word;
finding, among the graph embeddings of the corpus words, the embeddings of the words in the new word candidate set, screening them against the graph embeddings of the words in a general dictionary, and taking the words corresponding to the embeddings that pass screening as candidate new words; the candidates obtained in this way are the higher-quality, more reliable general or domain-specific new words; the general dictionary differs from an ordinary Chinese dictionary in that it is built from the general and domain-specific words contained in tools such as Jieba, HanLP, Jiagu, and Ansj; discovering new words on the basis of the words contained in these mainstream tools effectively ensures the accuracy of the discovered new words.
Further, calculating the statistics of each string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into a new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: term frequency, average mutual information, left entropy, and right entropy;
scoring each string with the scoring formula:
TF * AMI * (2 * El * Er / (El + Er));
where TF is the term frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting the strings whose scores exceed a specified threshold and writing them into the new word candidate set;
AMI differs from the mutual information (MI) used in conventional methods: it is the mutual information averaged over the string length, i.e., AMI = MI / length, which yields a more stable value; 2 * El * Er / (El + Er) is the harmonic mean (a weighted average) of the left and right entropies, which, unlike the common practice of taking the minimum of the two, makes the result more objective and stable and also suits small corpora.
Further, after the qualifying strings are written into the new word candidate set, the method further comprises: adding the new word candidate set to the general dictionary;
segmenting the corpus and constructing the graph network based on the segmentation results then specifically comprises:
applying dictionary-based maximum-probability word segmentation to the corpus, using the dictionary obtained by adding the new word candidate set to the general dictionary, and constructing the graph network with the adjacent segmented words as nodes.
Further, screening the graph embeddings of the words in the new word candidate set against the graph embeddings of the words in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them by their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of the top-ranked embeddings, and keeping those whose similarity to a general-dictionary embedding meets a specified threshold; generally, the embeddings with the highest similarity are selected, the specified number is 3 to 6, and the embeddings whose similarity exceeds the specified threshold are kept, where a higher threshold gives a better screening result, e.g., 0.9.
A graph-embedding based new word discovery system comprising:
a new word candidate set building module, configured to cut N-gram character strings from the corpus to be processed with a sliding window, compute statistics for each string, score each string according to the statistics, and write the strings whose scores meet the requirement into a new word candidate set; the window size of the sliding window is generally set to 2-7, i.e., each slide of the window cuts out a string containing 2 to 7 characters; the corpus to be processed may include forum articles, crawled web content, personally edited documents, and the like;
a graph embedding training module, configured to segment the corpus into words, construct a graph network from the segmentation results, and compute the graph network with a graph attention network to obtain graph embeddings for the words of the corpus; after the graph attention network processes the graph network, each word contained in it is converted into a matrix representation, and each matrix is the graph embedding of that word;
a new word screening module, configured to find, among the graph embeddings of the corpus words, the embeddings of the words in the new word candidate set, screen them against the graph embeddings of the words in the general dictionary, and take the words corresponding to the embeddings that pass screening as candidate new words; the candidates obtained in this way are the higher-quality, more reliable general or domain-specific new words; the general dictionary differs from an ordinary Chinese dictionary in that it is built from the general and domain-specific words contained in tools such as Jieba, HanLP, Jiagu, and Ansj; discovering new words on the basis of the words contained in these mainstream tools effectively ensures the accuracy of the discovered new words.
Further, calculating the statistics of each string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into a new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: term frequency, average mutual information, left entropy, and right entropy;
scoring each string with the scoring formula:
TF * AMI * (2 * El * Er / (El + Er));
where TF is the term frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting the strings whose scores exceed a specified threshold and writing them into the new word candidate set;
AMI differs from the mutual information (MI) used in conventional methods: it is the mutual information averaged over the string length, i.e., AMI = MI / length, which yields a more stable value; 2 * El * Er / (El + Er) is the harmonic mean (a weighted average) of the left and right entropies, which, unlike the common practice of taking the minimum of the two, makes the result more objective and stable and also suits small corpora.
Further, after the qualifying strings are written into the new word candidate set, the new word candidate set building module is further configured to: add the new word candidate set to the general dictionary;
segmenting the corpus and constructing the graph network based on the segmentation results then specifically comprises:
applying dictionary-based maximum-probability word segmentation to the corpus, using the dictionary obtained by adding the new word candidate set to the general dictionary, and constructing the graph network with the adjacent segmented words as nodes.
Further, screening the graph embeddings of the words in the new word candidate set against the graph embeddings of the words in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them by their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of the top-ranked embeddings, and keeping those whose similarity to a general-dictionary embedding meets a specified threshold; generally, the embeddings with the highest similarity are selected, the specified number is 3 to 6, and the embeddings whose similarity exceeds the specified threshold are kept, where a higher threshold gives a better screening result, e.g., 0.9.
An electronic device, comprising: a housing, a processor, a memory, a circuit board, and a power supply circuit, wherein the circuit board is arranged inside the space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power supply circuit supplies power to each circuit or component of the electronic device; the memory stores executable program code; and the processor runs the program corresponding to the executable program code by reading it from the memory, so as to execute the graph-embedding-based new word discovery method described above.
A computer-readable storage medium storing one or more programs executable by one or more processors to implement the graph-embedding-based new word discovery method described above.
The invention has the following beneficial effects:
Compared with the prior art, in which new words are constructed with statistical learning methods based on information entropy, the invention uses graph embedding to effectively filter out low-quality candidates during new word discovery, and thus obtains higher-quality, more reliable general or domain-specific new words. The invention discovers new words on the basis of the general and domain-specific words contained in mainstream tools, which effectively ensures the accuracy of the discovered new words. When computing the statistics and scores of the corpus strings, average mutual information (AMI) is used, which yields a more stable result than the mutual information (MI) of conventional methods; meanwhile, the harmonic mean of the left and right entropies is used instead of the common practice of taking their minimum, which makes the result more objective and stable and also suits small corpora, ensuring the accuracy of the new word candidate set and, in turn, of the final new word discovery results.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for discovering new words based on graph embedding according to an embodiment of the present invention;
FIG. 2 is a diagram of a graph-based embedding new word discovery system according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
It should be noted that, in the case of no conflict, the features in the following embodiments and examples may be combined with each other; moreover, all other embodiments that can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort fall within the scope of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
As shown in fig. 1, an embodiment of the graph-embedding-based new word discovery method according to the present invention includes:
S11: cutting N-gram character strings from the corpus to be processed with a sliding window, computing statistics for each string, and scoring each string according to the statistics;
S12: writing the strings whose scores meet the requirement into a new word candidate set;
the window size of the sliding window is generally set to 2-7, i.e., each slide of the window cuts out a string containing 2 to 7 characters; the corpus to be processed may include forum articles, crawled web content, personally edited documents, and the like;
S13: segmenting the corpus into words and constructing a graph network from the segmentation results;
this is a process of constructing triples of the form (entity, relation, entity), where the entities are words and the relations are the links between them, i.e., words are connected according to their relations to form a graph network;
s14: calculating the graph network based on the graph attention network to obtain graph embedding of words of the linguistic data to be calculated;
the process is a training graph embedding process, and after the graph attention network calculates the graph network, each word contained in the graph network is converted into matrix representation, and each matrix is graph embedding of each word;
s15: finding out the graph embedding of the words in the new word candidate set containing the words in the graph embedding of the words of the linguistic data to be calculated, screening the graph embedding of the words in the new word candidate set containing the words based on the graph embedding of the words in the general dictionary, and taking the corresponding words of the graph embedding obtained by screening as the candidate new words.
The candidates obtained in this step are the higher-quality, more reliable general or domain-specific new words; the general dictionary differs from an ordinary Chinese dictionary in that it is built from the general and domain-specific words contained in tools such as Jieba, HanLP, Jiagu, and Ansj; discovering new words on the basis of the words contained in these mainstream tools effectively ensures the accuracy of the discovered new words.
Preferably, calculating the statistics of each string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into a new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: term frequency, average mutual information, left entropy, and right entropy;
scoring each string with the scoring formula:
TF * AMI * (2 * El * Er / (El + Er));
where TF is the term frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting the strings whose scores exceed a specified threshold and writing them into the new word candidate set;
AMI differs from the mutual information (MI) used in conventional methods: it is the mutual information averaged over the string length, i.e., AMI = MI / length, which yields a more stable value; 2 * El * Er / (El + Er) is the harmonic mean (a weighted average) of the left and right entropies, which, unlike the common practice of taking the minimum of the two, makes the result more objective and stable and also suits small corpora.
Preferably, after the qualifying strings are written into the new word candidate set, the method further comprises: adding the new word candidate set to the general dictionary;
segmenting the corpus and constructing the graph network based on the segmentation results then specifically comprises:
applying dictionary-based maximum-probability word segmentation to the corpus, using the dictionary obtained by adding the new word candidate set to the general dictionary, and constructing the graph network with the adjacent segmented words as nodes;
dictionary maximum-probability word segmentation segments the corpus according to the probabilities, in the dictionary, of the candidate words it contains; for example, take the corpus:
教师节学生给老师送花 ("On Teachers' Day, the students give the teacher flowers")
Here the fragment 教师节 ("Teachers' Day") can be segmented either as the two words 教师 ("teacher") and 节 ("festival") or as the single word 教师节. The probabilities of 教师 and 教师节 in the dictionary are computed and compared. If the probability of 教师 is greater than that of 教师节, the segmentation result is:
教师 / 节 / 学生 / 给 / 老师 / 送花
If the probability of 教师 is less than that of 教师节, the segmentation result is:
教师节 / 学生 / 给 / 老师 / 送花
Preferably, screening the graph embeddings of the words in the new word candidate set against the graph embeddings of the words in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them by their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of the top-ranked embeddings, and keeping those whose similarity to a general-dictionary embedding meets a specified threshold; generally, the embeddings with the highest similarity are selected, the specified number is 3 to 6, and the embeddings whose similarity exceeds the specified threshold are kept, where a higher threshold gives a better screening result, e.g., 0.9;
the similarity may be calculated with algorithms such as MS1, cosine similarity, or WMD (Word Mover's Distance).
A graph-embedding based new word discovery system comprising:
the new word candidate set building module 21 is configured to cut N-gram character strings from the corpus to be processed with a sliding window, compute statistics for each string, score each string according to the statistics, and write the strings whose scores meet the requirement into a new word candidate set; the window size of the sliding window is generally set to 2-7, i.e., each slide of the window cuts out a string containing 2 to 7 characters; the corpus to be processed may include forum articles, crawled web content, personally edited documents, and the like;
the graph embedding training module 22 is configured to segment the corpus into words, construct a graph network from the segmentation results, and compute the graph network with a graph attention network to obtain graph embeddings for the words of the corpus; after the graph attention network processes the graph network, each word contained in it is converted into a matrix representation, and each matrix is the graph embedding of that word;
the new word screening module 23 is configured to find, among the graph embeddings of the corpus words, the embeddings of the words in the new word candidate set, screen them against the graph embeddings of the words in the general dictionary, and take the words corresponding to the embeddings that pass screening as candidate new words.
Preferably, the calculating statistics of each character string, scoring each character string according to the statistics, and selecting a character string with a score meeting the requirement to write into the new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF*AMI*(2*(EI+Er)/(El*Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting character strings with the scores larger than a specified threshold value according to the scores of the character strings, and writing the character strings into a new word candidate set;
The AMI differs from the mutual information (MI) used in conventional methods: it is the mutual information divided by the length of the character string, that is, AMI = MI/length, which yields a more stable value. The term 2*(El+Er)/(El*Er) is a weighted average of the left and right entropies; unlike the common practice of taking the minimum of the left and right entropies, it makes the result more objective and stable, and keeps the method applicable to small corpora.
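The statistics and the score TF*AMI*(2*(El+Er)/(El*Er)) can be sketched as below. The patent does not fix how MI is obtained for a multi-character string, so this sketch assumes the common choice of the minimum pointwise mutual information over all binary split points, with naive frequency estimates for the probabilities; all names are illustrative:

```python
import math
from collections import Counter

def neighbour_entropy(counts):
    """Shannon entropy of a neighbour-character distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counts.values())

def score_candidate(s, corpus):
    """Score a candidate string with TF * AMI * (2*(El+Er)/(El*Er))."""
    n = len(corpus)

    def p(x):  # naive substring-probability estimate
        return max(corpus.count(x), 1) / n

    tf = corpus.count(s)  # word frequency
    # MI assumed here: minimum pointwise MI over binary split points.
    mi = min(math.log(p(s) / (p(s[:i]) * p(s[i:]))) for i in range(1, len(s)))
    ami = mi / len(s)  # average mutual information
    # Collect the left/right neighbour characters of every occurrence.
    left, right = Counter(), Counter()
    i = corpus.find(s)
    while i != -1:
        if i > 0:
            left[corpus[i - 1]] += 1
        if i + len(s) < len(corpus):
            right[corpus[i + len(s)]] += 1
        i = corpus.find(s, i + 1)
    el, er = neighbour_entropy(left), neighbour_entropy(right)
    if el == 0.0 or er == 0.0:
        return 0.0  # no boundary variety on one side: reject
    return tf * ami * (2 * (el + er) / (el * er))
```

A string with varied neighbours on both sides scores positively, while a string like "ab" in "ababab", whose boundaries never vary, is rejected outright.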
Preferably, after selecting a character string with a score meeting the requirement and writing the character string into the new word candidate set, the new word candidate set constructing module 21 is further configured to: adding the new word candidate set to the general dictionary;
Performing word segmentation on the corpus to be calculated and constructing a graph network based on the word segmentation result specifically comprises:
on the basis of the dictionary obtained by adding the new word candidate set to the general dictionary, applying maximum-probability dictionary-based word segmentation to the corpus to be calculated, and constructing a graph network by taking the adjacent words after segmentation as nodes.
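The graph construction from adjacent segmented words, followed by one graph-attention aggregation, might look like the following sketch. The single-head attention layer is a minimal stand-in for a graph attention network, not the patent's training procedure; the weight matrix `W` and attention vector `a` are placeholders that a real system would learn:

```python
import numpy as np

def build_adjacency(sentences, vocab):
    """Adjacency matrix linking words that are adjacent after
    segmentation; self-loops let each node attend to itself."""
    idx = {w: i for i, w in enumerate(vocab)}
    adj = np.eye(len(vocab))
    for tokens in sentences:
        for a, b in zip(tokens, tokens[1:]):
            adj[idx[a], idx[b]] = adj[idx[b], idx[a]] = 1.0
    return adj

def gat_layer(h, adj, W, a, slope=0.2):
    """Single-head graph-attention forward pass: project node
    features, score each edge with LeakyReLU attention, softmax
    over neighbours, and aggregate."""
    z = h @ W                       # projected node features
    n = z.shape[0]
    e = np.full((n, n), -1e9)      # non-edges stay masked out
    for i in range(n):
        for j in range(n):
            if adj[i, j] > 0:
                s = a @ np.concatenate([z[i], z[j]])
                e[i, j] = s if s > 0 else slope * s
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ z               # each row: one node's graph embedding

# Hypothetical usage with random (untrained) parameters:
vocab = ["new", "word", "discovery"]
adj = build_adjacency([["new", "word", "discovery"]], vocab)
rng = np.random.default_rng(0)
emb = gat_layer(rng.normal(size=(3, 4)), adj,
                rng.normal(size=(4, 2)), rng.normal(size=4))
```

Each row of `emb` is then the matrix (here flattened to a vector) representation of one word, which the screening step compares against the general-dictionary embeddings.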
Preferably, the screening of the graph embeddings of words contained in the new word candidate set based on the graph embeddings of words contained in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them according to their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of them according to the sorting, and screening out, from the selected graph embeddings, those whose similarity to the graph embeddings of the words in the general dictionary meets a specified threshold. Generally, the graph embeddings with the highest similarity are selected according to the sorting, the specified number is 3-6, and graph embeddings whose similarity exceeds the specified threshold are retained; the higher the threshold, the better the screening result, a typical value being 0.9.
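The similarity-based screening can be sketched as follows. The exact procedure is left open by the text, so this assumes cosine similarity and keeps a candidate only when one of its top-k most similar general-dictionary embeddings reaches the threshold; the k of 5 and threshold of 0.9 follow the 3-6 range and example value suggested above, and all names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def screen_candidates(cand_emb, dict_emb, top_k=5, threshold=0.9):
    """Keep a candidate word if any of its top_k most similar
    general-dictionary embeddings reaches the threshold."""
    kept = []
    for word, vec in cand_emb.items():
        sims = sorted((cosine(vec, dv) for dv in dict_emb.values()),
                      reverse=True)
        if any(s >= threshold for s in sims[:top_k]):
            kept.append(word)
    return kept
```

A candidate whose embedding lies close to an established dictionary word's embedding survives; one far from every dictionary embedding is filtered out as a low-quality candidate.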
The processing of the system embodiment of the invention is similar to that of the method embodiment, so the description of the system embodiment is kept brief; for the corresponding details, refer to the method embodiment.
An embodiment of the present invention further provides an electronic device which, as shown in fig. 3, can implement the process of the embodiment shown in fig. 1 of the present invention. As shown in fig. 3, the electronic device may include: a shell 31, a processor 32, a memory 33, a circuit board 34 and a power circuit 35, wherein the circuit board 34 is arranged inside the space enclosed by the shell 31, and the processor 32 and the memory 33 are arranged on the circuit board 34; the power circuit 35 supplies power to each circuit or device of the electronic apparatus; the memory 33 stores executable program code; and the processor 32, by reading the executable program code stored in the memory 33, runs a program corresponding to that code so as to perform the method described in any of the foregoing embodiments.
The specific execution process of the above steps by the processor 32 and the steps further executed by the processor 32 by running the executable program code may refer to the description of the embodiment shown in fig. 1 of the present invention, and are not described herein again.
The electronic device exists in a variety of forms, including but not limited to:
(1) a server: a device that provides computing services. A server comprises a processor, a hard disk, a memory, a system bus and the like, and is similar in architecture to a general-purpose computer; however, because it must provide highly reliable services, it has higher requirements for processing capacity, stability, reliability, security, scalability, manageability and the like;
(2) and other electronic equipment with data interaction function.
Embodiments of the present invention also provide a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the aforementioned graph-embedding-based new word discovery method.
Compared with the prior art, in which new words are constructed by statistical learning based on information entropy, the invention uses the graph embedding technique to effectively filter out low-quality candidates during new word discovery, and thus obtains higher-quality, more reliable general or domain-specific new words. Because the discovery builds on the general and domain new words already covered by cutting-edge tools, the accuracy of the discovered new words is effectively ensured. When computing the statistics and scores of corpus character strings, average mutual information (AMI) is used instead of the mutual information (MI) of traditional methods, which yields a more stable result; likewise, a weighted average of the left and right entropies is used instead of the common practice of taking their minimum, which makes the result more objective and stable and keeps the method applicable to small corpora. Together these choices ensure the accuracy of the new word candidate set and, in turn, the accuracy of the new word discovery result.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A new word discovery method based on graph embedding is characterized by comprising the following steps:
cutting N-gram character strings out of a corpus to be calculated using a sliding window, calculating the statistics of each character string, scoring each character string according to the statistics, and writing character strings whose scores meet a requirement into a new word candidate set;
performing word segmentation on the corpus to be calculated, and constructing a graph network based on the word segmentation result;
running a graph attention network over the graph network to obtain graph embeddings of the words of the corpus to be calculated;
finding, among the graph embeddings of the words of the corpus to be calculated, the graph embeddings of words contained in the new word candidate set, screening those graph embeddings based on the graph embeddings of words contained in a general dictionary, and taking the words corresponding to the graph embeddings obtained by the screening as candidate new words.
2. The method according to claim 1, wherein the calculating statistics of each character string, scoring each character string according to the statistics, selecting a character string with a score meeting requirements to write into a new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF*AMI*(2*(El+Er)/(El*Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
and selecting the character strings with the scores larger than a specified threshold value according to the scores of the character strings, and writing the character strings into a new word candidate set.
3. The method of claim 2, wherein after selecting the character strings with the scores meeting the requirement to write into the new word candidate set, the method further comprises: adding the new word candidate set to the general dictionary;
the word segmentation is carried out on the linguistic data to be calculated, and a graph network is constructed based on word segmentation results, and the method specifically comprises the following steps:
and on the basis of the dictionary obtained by adding the new word candidate set into the general dictionary, adopting maximum probability word segmentation of the dictionary for the corpus to be calculated, and constructing a graph network by taking adjacent words after word segmentation as nodes.
4. The method according to claim 3, wherein the screening of the graph embeddings of words contained in the new word candidate set based on the graph embeddings of words contained in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them according to their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of graph embeddings of words in the new word candidate set according to the sorting, and screening out, from the selected graph embeddings, those whose similarity to the graph embeddings of the words in the general dictionary meets a specified threshold.
5. A graph embedding-based new word discovery system, comprising:
the new word candidate set construction module, configured to cut N-gram character strings out of a corpus to be calculated using a sliding window, calculate the statistics of each character string, score each character string according to the statistics, and write character strings whose scores meet a requirement into a new word candidate set;
the graph embedding training module, configured to perform word segmentation on the corpus to be calculated, construct a graph network based on the word segmentation result, and run a graph attention network over the graph network to obtain graph embeddings of the words of the corpus to be calculated;
and the new word screening module, configured to find, among the graph embeddings of the words of the corpus to be calculated, the graph embeddings of words contained in the new word candidate set, screen those graph embeddings based on the graph embeddings of words contained in the general dictionary, and take the words corresponding to the graph embeddings obtained by the screening as candidate new words.
6. The system according to claim 5, wherein the calculating statistics of each character string, scoring each character string according to the statistics, selecting a character string with a score meeting requirements to write into a new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF*AMI*(2*(El+Er)/(El*Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
and selecting the character strings with the scores larger than a specified threshold value according to the scores of the character strings, and writing the character strings into a new word candidate set.
7. The system of claim 6, wherein after selecting a character string with a score meeting a requirement to write into the new word candidate set, the new word candidate set construction module is further configured to: adding the new word candidate set to the general dictionary;
the word segmentation is carried out on the linguistic data to be calculated, and a graph network is constructed based on word segmentation results, and the method specifically comprises the following steps:
and on the basis of the dictionary obtained by adding the new word candidate set into the general dictionary, adopting maximum probability word segmentation of the dictionary for the corpus to be calculated, and constructing a graph network by taking adjacent words after word segmentation as nodes.
8. The system according to claim 7, wherein the screening of the graph embeddings of words contained in the new word candidate set based on the graph embeddings of words contained in the general dictionary specifically comprises:
traversing the graph embeddings of the words in the new word candidate set, sorting them according to their similarity to the graph embeddings of the words in the general dictionary, selecting a specified number of graph embeddings of words in the new word candidate set according to the sorting, and screening out, from the selected graph embeddings, those whose similarity to the graph embeddings of the words in the general dictionary meets a specified threshold.
9. An electronic device, characterized in that the electronic device comprises: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory for performing the method according to any one of claims 1 to 4.
10. A computer-readable storage medium, having one or more programs stored thereon, the one or more programs being executable by one or more processors to perform the method of any of claims 1-4.
CN202011060498.2A 2020-09-30 2020-09-30 New word discovery method, system, equipment and medium based on graph embedding Active CN112232077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011060498.2A CN112232077B (en) 2020-09-30 2020-09-30 New word discovery method, system, equipment and medium based on graph embedding


Publications (2)

Publication Number Publication Date
CN112232077A true CN112232077A (en) 2021-01-15
CN112232077B CN112232077B (en) 2021-10-29

Family

ID=74119881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011060498.2A Active CN112232077B (en) 2020-09-30 2020-09-30 New word discovery method, system, equipment and medium based on graph embedding

Country Status (1)

Country Link
CN (1) CN112232077B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138510A (en) * 2015-08-10 2015-12-09 昆明理工大学 Microblog-based neologism emotional tendency judgment method
CN106502984A (en) * 2016-10-19 2017-03-15 上海智臻智能网络科技股份有限公司 A kind of method and device of field new word discovery
CN106649849A (en) * 2016-12-30 2017-05-10 上海智臻智能网络科技股份有限公司 Text information base building method and device and searching method, device and system
CN107622051A (en) * 2017-09-14 2018-01-23 马上消费金融股份有限公司 A kind of neologisms screening technique and device
US20190155918A1 (en) * 2017-11-20 2019-05-23 Colossio, Inc. Real-time classification of evolving dictionaries


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杜洋: "基于词向量表征的新词发现及命名实体识别研究", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 *

Also Published As

Publication number Publication date
CN112232077B (en) 2021-10-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant