CN112232077A - A new word discovery method, system, device and medium based on graph embedding - Google Patents


Info

Publication number
CN112232077A
CN112232077A (application CN202011060498.2A)
Authority
CN
China
Prior art keywords
graph
words
candidate set
new word
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011060498.2A
Other languages
Chinese (zh)
Other versions
CN112232077B (en)
Inventor
莫永卓
赵顺峰
练睿
肖杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huzhou Tongmu Intellectual Property Co ltd
Original Assignee
Workway Shenzhen Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Workway Shenzhen Information Technology Co ltd filed Critical Workway Shenzhen Information Technology Co ltd
Priority to CN202011060498.2A priority Critical patent/CN112232077B/en
Publication of CN112232077A publication Critical patent/CN112232077A/en
Application granted granted Critical
Publication of CN112232077B publication Critical patent/CN112232077B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING
        • G06F40/00 Handling natural language data › G06F40/20 Natural language analysis
            • G06F40/279 Recognition of textual entities › G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
            • G06F40/279 Recognition of textual entities › G06F40/284 Lexical analysis, e.g. tokenisation or collocates
            • G06F40/205 Parsing › G06F40/216 Parsing using statistical methods
            • G06F40/237 Lexical tools › G06F40/242 Dictionaries
        • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor › G06F16/90 Details of database functions independent of the retrieved data types › G06F16/95 Retrieval from the web › G06F16/951 Indexing; Web crawling techniques


Abstract

The invention relates to a graph-embedding-based new word discovery method, system, device and medium, comprising the following steps: cutting N-gram strings out of the corpus to be processed with a sliding window, computing statistics for each string, scoring each string from those statistics, and writing the strings whose scores meet the requirement into a new-word candidate set; segmenting the corpus into words and constructing a graph network from the segmentation result; processing the graph network with a graph attention network to obtain a graph embedding for each word of the corpus; and filtering the embeddings of the candidate-set words against the embeddings of the general-dictionary words, taking the words whose embeddings pass the filtering as candidate new words. Built on graph-embedding technology, the invention effectively filters out low-quality candidates during new word discovery and thus obtains higher-quality, more reliable general or domain-specific new words.

Description

New word discovery method, system, equipment and medium based on graph embedding
Technical Field
The invention relates to the field of natural language processing, and in particular to a graph-embedding-based new word discovery method, system, device and medium.
Background
Graph embedding (also called network embedding) is the process of mapping graph data (usually a high-dimensional dense matrix) into low-dimensional dense vectors. Graphs, i.e. collections of nodes and edges, are ubiquitous in the real world: connections between people in social networks, protein interactions in organisms, communication between IP addresses in communication networks, and so on. Beyond these, even an ordinary picture or sentence can be abstracted as a graph-model structure; graph structure can fairly be called ubiquitous. By analysing such graphs we can understand social structures, language and different modes of communication, which is why graphs have long been a research hotspot in academia.
In natural language processing, existing approaches to the new word discovery task generally construct new words with statistical learning, the basic idea being information entropy. Such simple methods, however, use only shallow semantic information from the corpus and often introduce many low-quality new words. If deeper embedding information, such as graph embeddings, could be introduced, new words of higher quality could be extracted.
Disclosure of Invention
In view of the above, the present invention provides a graph-embedding-based new word discovery method, system, device and medium that at least partially solve the problems of the prior art. The method first obtains a new-word candidate set from the corpus to be processed; it then constructs a graph network from that corpus and computes graph embeddings over it with a graph attention network; finally it filters the embeddings of the candidate-set words against the embeddings of the words in the original general dictionary, yielding higher-quality, more reliable general or domain-specific new words.
The invention specifically comprises the following steps:
a graph embedding-based new word discovery method comprises the following steps:
cutting N-GRAM character strings of the linguistic data to be calculated by using a sliding window, calculating the statistic of each character string, scoring each character string according to the statistic, and selecting character strings with scores meeting the requirements to write into a new word candidate set; the window size of the sliding window is generally set to be 2-7, namely, a character string containing 2-7 characters is cut out every time the sliding window slides; the linguistic data to be calculated comprises forum articles, network crawling contents, documents edited by individuals and the like;
performing word segmentation on the linguistic data to be calculated, and constructing a graph network based on word segmentation results; the process is a process of constructing a triple, wherein the triple comprises an entity, a relation and an entity, the entity is a word, the relation is a connection line between words, namely, the words are connected according to the word relation to form a graph network;
calculating the graph network based on the graph attention network to obtain graph embedding of words of the linguistic data to be calculated; the process is a training graph embedding process, and after the graph attention network calculates the graph network, each word contained in the graph network is converted into matrix representation, and each matrix is graph embedding of each word;
finding image embedding containing words in a new word candidate set in the image embedding of the words of the linguistic data to be calculated, screening the image embedding containing the words in the new word candidate set based on the image embedding containing the words in a general dictionary, and embedding corresponding words in the screened image as candidate new words; the candidate new words obtained in the process are the obtained general new words or field new words with higher quality and more reliability; the general dictionary is different from a common Chinese dictionary and is a dictionary formed by general new words, field new words and the like contained in tools such as Jieba, HanLP, Jiagu, Ansi and the like; the new words are discovered based on the common new words and the field new words contained in the front-edge tool, and the accuracy of discovering the new words can be effectively ensured.
Further, computing the statistics of each string, scoring each string from those statistics, and writing the strings whose scores meet the requirement into the new-word candidate set specifically comprises:
computing the statistics of each string, namely the term frequency (TF), the average mutual information (AMI), the left entropy (El) and the right entropy (Er);
scoring each string with the formula:
TF * AMI * (2 * El * Er / (El + Er));
selecting the strings whose score exceeds a specified threshold and writing them into the new-word candidate set.
AMI differs from the mutual information (MI) used in conventional methods in that it is MI averaged over the string length, i.e. AMI = MI / length, which yields a more stable value. The factor 2 * El * Er / (El + Er) is the harmonic mean of the left and right entropies, a weighted average that, unlike the common practice of taking the minimum of the two entropies, makes the result more objective and stable and remains suitable for small corpora.
Further, after the strings whose scores meet the requirement have been written into the new-word candidate set, the method further comprises: adding the new-word candidate set to the general dictionary.
Segmenting the corpus into words and constructing a graph network from the segmentation result then specifically comprises:
using the dictionary obtained by merging the new-word candidate set into the general dictionary, applying dictionary maximum-probability segmentation to the corpus, and building the graph network with adjacent segmented words as nodes.
Further, filtering the graph embeddings of the words in the new-word candidate set against the graph embeddings of the words in the general dictionary specifically comprises:
traversing the graph embeddings of the candidate-set words, sorting them by their similarity to the graph embeddings of the dictionary words, selecting a specified number of candidate-set embeddings according to that ranking, and keeping those among them whose similarity to a dictionary-word embedding meets a specified threshold. Generally the top-ranked embeddings are selected, the specified number is 3 to 6, and the embeddings whose similarity exceeds the threshold are kept; the higher the threshold, the better the filtering result, a typical value being 0.9.
A graph-embedding-based new word discovery system comprises:
a new-word candidate set construction module, which cuts N-gram strings out of the corpus to be processed with a sliding window, computes statistics for each string, scores each string from those statistics, and writes the strings whose scores meet the requirement into a new-word candidate set; the window size of the sliding window is generally set between 2 and 7, i.e. each slide of the window cuts out a string of 2 to 7 characters, and the corpus to be processed may comprise forum articles, web-crawled content, personally edited documents and the like;
a graph-embedding training module, configured to segment the corpus into words, construct a graph network from the segmentation result, and process the graph network with a graph attention network to obtain a graph embedding for each word of the corpus; once the graph attention network has processed the graph network, every word in it is converted to a matrix representation, and each matrix is that word's graph embedding;
a new-word filtering module, configured to find, among the graph embeddings of the corpus words, those that belong to words in the new-word candidate set, filter them against the graph embeddings of the words in a general dictionary, and take the words whose embeddings pass the filtering as candidate new words. The candidates obtained this way are the higher-quality, more reliable general or domain-specific new words. The general dictionary here differs from an ordinary Chinese dictionary: it is built from the general and domain-specific words shipped with tools such as Jieba, HanLP, Jiagu and Ansj; discovering new words against the vocabulary of these state-of-the-art tools effectively ensures the accuracy of the results.
Further, computing the statistics of each string, scoring each string from those statistics, and writing the strings whose scores meet the requirement into the new-word candidate set specifically comprises:
computing the statistics of each string, namely the term frequency (TF), the average mutual information (AMI), the left entropy (El) and the right entropy (Er);
scoring each string with the formula:
TF * AMI * (2 * El * Er / (El + Er));
selecting the strings whose score exceeds a specified threshold and writing them into the new-word candidate set.
AMI differs from the mutual information (MI) used in conventional methods in that it is MI averaged over the string length, i.e. AMI = MI / length, which yields a more stable value. The factor 2 * El * Er / (El + Er) is the harmonic mean of the left and right entropies, a weighted average that, unlike the common practice of taking the minimum of the two entropies, makes the result more objective and stable and remains suitable for small corpora.
Further, after the strings whose scores meet the requirement have been written into the new-word candidate set, the new-word candidate set construction module is further configured to add the candidate set to the general dictionary.
Segmenting the corpus into words and constructing a graph network from the segmentation result then specifically comprises:
using the dictionary obtained by merging the new-word candidate set into the general dictionary, applying dictionary maximum-probability segmentation to the corpus, and building the graph network with adjacent segmented words as nodes.
Further, filtering the graph embeddings of the words in the new-word candidate set against the graph embeddings of the words in the general dictionary specifically comprises:
traversing the graph embeddings of the candidate-set words, sorting them by their similarity to the graph embeddings of the dictionary words, selecting a specified number of candidate-set embeddings according to that ranking, and keeping those among them whose similarity to a dictionary-word embedding meets a specified threshold. Generally the top-ranked embeddings are selected, the specified number is 3 to 6, and the embeddings whose similarity exceeds the threshold are kept; the higher the threshold, the better the filtering result, a typical value being 0.9.
An electronic device, comprising: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for executing the above-described graph-embedding-based new word discovery method.
A computer-readable storage medium stores one or more programs executable by one or more processors to implement the graph-embedding-based new word discovery method described above.
The invention has the beneficial effects that:
compared with the prior art that a statistical learning method is used for constructing new words based on information entropy, the method can effectively filter low-quality candidate new words in the new word discovery process based on the graph embedding technology, so that higher-quality and more reliable general new words or field new words can be obtained. The invention discovers the new words based on the common new words and the field new words contained in the front-edge tool, and can effectively ensure the accuracy of discovering the new words. When the statistics and the scoring of the corpus character strings are calculated, Average Mutual Information (AMI) is used, compared with the method using Mutual Information (MI) in the traditional method, a more stable calculation result can be obtained, and meanwhile, the weighted average of left and right entropies is utilized, which is different from the method of taking the minimum value of the left and right entropies in the common method, so that the calculation result is more objective and stable, the method is also suitable for a small corpus, the accuracy of the new word candidate set is ensured, and the accuracy of a new word discovery result is further ensured.
Drawings
To illustrate the technical solutions of the embodiments more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described here are obviously only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for discovering new words based on graph embedding according to an embodiment of the present invention;
FIG. 2 is a diagram of a graph-based embedding new word discovery system according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
It should be noted that, in the case of no conflict, the features in the following embodiments and examples may be combined with each other; moreover, all other embodiments that can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort fall within the scope of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
As shown in fig. 1, an embodiment of the graph-embedding-based new word discovery method of the present invention comprises:
S11: cutting N-gram strings out of the corpus to be processed with a sliding window, computing the statistics of each string, and scoring each string from those statistics;
S12: writing the strings whose scores meet the requirement into a new-word candidate set.
The window size of the sliding window is generally set between 2 and 7, i.e. each slide of the window cuts out a string of 2 to 7 characters; the corpus to be processed may comprise forum articles, web-crawled content, personally edited documents and the like.
s13: performing word segmentation on the linguistic data to be calculated, and constructing a graph network based on word segmentation results;
the process is a process of constructing a triple, wherein the triple comprises an entity, a relation and an entity, the entity is a word, the relation is a connection line between words, namely, the words are connected according to the word relation to form a graph network;
s14: calculating the graph network based on the graph attention network to obtain graph embedding of words of the linguistic data to be calculated;
the process is a training graph embedding process, and after the graph attention network calculates the graph network, each word contained in the graph network is converted into matrix representation, and each matrix is graph embedding of each word;
s15: finding out the graph embedding of the words in the new word candidate set containing the words in the graph embedding of the words of the linguistic data to be calculated, screening the graph embedding of the words in the new word candidate set containing the words based on the graph embedding of the words in the general dictionary, and taking the corresponding words of the graph embedding obtained by screening as the candidate new words.
The candidate new words obtained in the process are the obtained general new words or field new words with higher quality and more reliability; the general dictionary is different from a common Chinese dictionary and is a dictionary formed by general new words, field new words and the like contained in tools such as Jieba, HanLP, Jiagu, Ansi and the like; the new words are discovered based on the common new words and the field new words contained in the front-edge tool, and the accuracy of discovering the new words can be effectively ensured.
Preferably, computing the statistics of each string, scoring each string from those statistics, and writing the strings whose scores meet the requirement into the new-word candidate set specifically comprises:
computing the statistics of each string, namely the term frequency (TF), the average mutual information (AMI), the left entropy (El) and the right entropy (Er);
scoring each string with the formula:
TF * AMI * (2 * El * Er / (El + Er));
selecting the strings whose score exceeds a specified threshold and writing them into the new-word candidate set.
AMI differs from the mutual information (MI) used in conventional methods in that it is MI averaged over the string length, i.e. AMI = MI / length, which yields a more stable value. The factor 2 * El * Er / (El + Er) is the harmonic mean of the left and right entropies, a weighted average that, unlike the common practice of taking the minimum of the two entropies, makes the result more objective and stable and remains suitable for small corpora.
Preferably, after the strings whose scores meet the requirement have been written into the new-word candidate set, the method further comprises: adding the new-word candidate set to the general dictionary.
Segmenting the corpus into words and constructing a graph network from the segmentation result then specifically comprises:
using the dictionary obtained by merging the new-word candidate set into the general dictionary, applying dictionary maximum-probability segmentation to the corpus, and building the graph network with adjacent segmented words as nodes.
Dictionary maximum-probability segmentation splits the corpus according to the probabilities, in the dictionary, of the words the corpus contains. For example, take the corpus 教师节学生给老师送花 ("on Teachers' Day the students give the teacher flowers"). The span 教师节 ("Teachers' Day") can be segmented either as the single word 教师节 or as 教师 ("teacher") plus 节. The dictionary probabilities of 教师节 and 教师 are computed and compared: if the probability of 教师 is smaller than that of 教师节, the segmentation keeps 教师节 as one word, giving 教师节 / 学生 / 给 / 老师 / 送花; if the probability of 教师 is greater, the span is split instead, giving 教师 / 节 / 学生 / 给 / 老师 / 送花.
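Maximum-probability segmentation of this kind can be sketched with dynamic programming. This is an illustrative implementation with toy word probabilities standing in for a real dictionary; the function name and the 7-character word-length cap are assumptions:

```python
import math

def max_prob_segment(text, word_probs, max_len=7):
    """Among all ways of splitting `text` into dictionary words, return the
    split whose words have the largest product of dictionary probabilities."""
    n = len(text)
    best = [(-math.inf, [])] * (n + 1)   # best[i]: (log-prob, words) for text[:i]
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            w = text[j:i]
            if w in word_probs and best[j][0] > -math.inf:
                s = best[j][0] + math.log(word_probs[w])
                if s > best[i][0]:
                    best[i] = (s, best[j][1] + [w])
    return best[n][1]

# Toy dictionary: keeping "ab" whole beats splitting it into "a" + "b".
probs = {"a": 0.4, "b": 0.4, "c": 0.3, "ab": 0.5, "abc": 0.05}
segmented = max_prob_segment("abc", probs)
```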
Preferably, filtering the graph embeddings of the words in the new-word candidate set against the graph embeddings of the words in the general dictionary specifically comprises:
traversing the graph embeddings of the candidate-set words, sorting them by their similarity to the graph embeddings of the dictionary words, selecting a specified number of candidate-set embeddings according to that ranking, and keeping those among them whose similarity to a dictionary-word embedding meets a specified threshold. Generally the top-ranked embeddings are selected, the specified number is 3 to 6, and the embeddings whose similarity exceeds the threshold are kept; the higher the threshold, the better the filtering result, a typical value being 0.9.
The similarity may be calculated with algorithms such as MS1, cosine similarity, WMD (Word Mover's Distance), and the like.
A graph-embedding-based new word discovery system comprises:
a new-word candidate set construction module 21, which cuts N-gram strings out of the corpus to be processed with a sliding window, computes statistics for each string, scores each string from those statistics, and writes the strings whose scores meet the requirement into a new-word candidate set; the window size of the sliding window is generally set between 2 and 7, i.e. each slide of the window cuts out a string of 2 to 7 characters, and the corpus to be processed may comprise forum articles, web-crawled content, personally edited documents and the like;
a graph-embedding training module 22, configured to segment the corpus into words, construct a graph network from the segmentation result, and process the graph network with a graph attention network to obtain a graph embedding for each word of the corpus; once the graph attention network has processed the graph network, every word in it is converted to a matrix representation, and each matrix is that word's graph embedding;
a new-word filtering module 23, configured to find, among the graph embeddings of the corpus words, those that belong to words in the new-word candidate set, filter them against the graph embeddings of the words in the general dictionary, and take the words whose embeddings pass the filtering as candidate new words.
Preferably, the calculating statistics of each character string, scoring each character string according to the statistics, and selecting a character string with a score meeting the requirement to write into the new word candidate set specifically comprises:
calculating statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF*AMI*(2*(EI+Er)/(El*Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting, according to each string's score, the strings whose score is greater than a prescribed threshold and writing them into the new word candidate set;
unlike the mutual information (MI) used in conventional methods, the AMI is the mutual information averaged over the length of the string, that is, AMI = MI / length, which yields a more stable value; the factor 2 * El * Er / (El + Er) is the harmonic mean, a form of weighted average, of the left and right entropies, and unlike the common practice of taking the minimum of the two entropies, it makes the result more objective and stable and keeps the method applicable to small corpora.
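A rough sketch of the candidate-extraction step follows. The `score` function encodes the harmonic-mean reading of the entropy factor discussed above; the epsilon guard and the simplified statistics interfaces are assumptions for illustration, not the patent's code:

```python
import math
from collections import Counter

def extract_ngrams(text, n_min=2, n_max=7):
    """Slide windows of width n_min..n_max over the text, counting every substring."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def boundary_entropy(neighbor_counts):
    """Shannon entropy of the characters adjacent to a candidate (left or right side)."""
    total = sum(neighbor_counts.values())
    return -sum(c / total * math.log2(c / total) for c in neighbor_counts.values())

def score(tf, ami, el, er, eps=1e-9):
    """TF * AMI * harmonic mean of left/right entropy; eps avoids division by zero."""
    return tf * ami * (2 * el * er / (el + er + eps))
```

Candidates whose `score` exceeds the prescribed threshold would be written into the new word candidate set.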
Preferably, after the strings whose scores meet the requirement are written into the new word candidate set, the new word candidate set constructing module 21 is further configured to add the new word candidate set to the general dictionary;
segmenting the corpus to be calculated into words and constructing a graph network based on the segmentation result specifically comprises:
based on the dictionary obtained by adding the new word candidate set to the general dictionary, segmenting the corpus to be calculated with dictionary-based maximum-probability word segmentation, and constructing the graph network with adjacent segmented words as nodes.
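The graph construction can be sketched as follows: given already-segmented sentences (the maximum-probability segmenter itself is not shown here), each pair of adjacent words becomes a counted edge. Treating adjacency as undirected is an assumption; the patent only states that adjacent words serve as nodes:

```python
from collections import defaultdict, Counter

def build_word_graph(segmented_sentences):
    """Adjacency map: graph[w1][w2] counts how often w1 and w2 appear side by side."""
    graph = defaultdict(Counter)
    for tokens in segmented_sentences:
        for a, b in zip(tokens, tokens[1:]):
            graph[a][b] += 1
            graph[b][a] += 1   # undirected adjacency (assumption)
    return graph
```

The resulting adjacency map is what the graph attention network would then consume to produce per-word embeddings.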
Preferably, screening the graph embeddings of words contained in the new word candidate set based on the graph embeddings of words contained in the general dictionary specifically comprises:
traversing the graph embeddings of words in the new word candidate set, sorting them according to their similarity to the graph embeddings of words in the general dictionary, selecting a prescribed number of candidate embeddings according to the sorting, and, from the selected embeddings, keeping those whose similarity to a dictionary embedding meets a prescribed threshold. Generally, the embeddings with the highest similarity are selected, the prescribed number is 3-6, and embeddings whose similarity is greater than the prescribed threshold are kept; the higher the threshold, the better the screening result, for example, a value of 0.9.
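The screening step can be sketched with cosine similarity, a common but here assumed choice of similarity measure; the `top_k` and `threshold` defaults mirror the 3-6 and 0.9 figures mentioned above:

```python
import numpy as np

def filter_candidates(cand_embeddings, dict_embeddings, top_k=5, threshold=0.9):
    """Keep a candidate word if any of its top_k most similar dictionary
    embeddings reaches the cosine-similarity threshold."""
    # normalise dictionary embeddings once so a dot product is cosine similarity
    D = dict_embeddings / np.linalg.norm(dict_embeddings, axis=1, keepdims=True)
    kept = []
    for word, vec in cand_embeddings.items():
        sims = D @ (vec / np.linalg.norm(vec))   # similarity to every dictionary word
        top = np.sort(sims)[-top_k:]             # the top_k most similar
        if top.max() >= threshold:
            kept.append(word)
    return kept
```

Words that survive this filter would be the module's candidate new words.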
The processing of the system embodiment of the invention is similar to that of the method embodiment, so the description of the system embodiment is kept brief; for the corresponding parts, refer to the method embodiment.
An embodiment of the present invention further provides an electronic device, as shown in fig. 3, which can implement the process of the embodiment shown in fig. 1 of the present invention. As shown in fig. 3, the electronic device may include: a housing 31, a processor 32, a memory 33, a circuit board 34, and a power supply circuit 35, wherein the circuit board 34 is arranged inside the space enclosed by the housing 31, and the processor 32 and the memory 33 are arranged on the circuit board 34. The power supply circuit 35 supplies power to each circuit or device of the electronic device; the memory 33 stores executable program code; and the processor 32, by reading the executable program code stored in the memory 33, runs the program corresponding to that code to execute the method described in any of the foregoing embodiments.
The specific execution process of the above steps by the processor 32 and the steps further executed by the processor 32 by running the executable program code may refer to the description of the embodiment shown in fig. 1 of the present invention, and are not described herein again.
The electronic device exists in a variety of forms, including but not limited to:
(1) a server: a device that provides computing services. A server comprises a processor, a hard disk, memory, a system bus, and so on, and is similar in architecture to a general-purpose computer; however, because it must provide highly reliable services, it has higher requirements on processing capacity, stability, reliability, security, scalability, manageability, and the like;
(2) other electronic equipment with a data interaction function.
Embodiments of the present invention also provide a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the graph-embedding-based new word discovery method described above.
Compared with the prior art, in which new words are constructed by statistical learning based on information entropy, the present invention uses graph embedding to effectively filter out low-quality candidate new words during new word discovery, so that higher-quality and more reliable general or domain-specific new words are obtained. Discovering new words on the basis of the common and domain words contained in leading tools effectively ensures the accuracy of new word discovery. When computing the statistics and scores of corpus strings, average mutual information (AMI) is used, which yields a more stable result than the mutual information (MI) of conventional methods; meanwhile, the weighted average of the left and right entropies is used instead of the usual minimum of the two, which makes the result more objective and stable and keeps the method applicable to small corpora, thereby ensuring the accuracy of the new word candidate set and, in turn, of the new word discovery result.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A new word discovery method based on graph embedding, characterized by comprising:
using a sliding window to cut out N-gram character strings of a corpus to be calculated, computing statistics for each string, scoring each string according to the statistics, and writing the strings whose scores meet a requirement into a new word candidate set;
segmenting the corpus to be calculated into words, and constructing a graph network based on the segmentation result;
computing over the graph network with a graph attention network to obtain graph embeddings of the words of the corpus to be calculated;
finding, among the graph embeddings of the words of the corpus to be calculated, the graph embeddings of words contained in the new word candidate set, screening those graph embeddings based on the graph embeddings of words contained in a general dictionary, and taking the words corresponding to the graph embeddings obtained by the screening as candidate new words.

2. The method according to claim 1, characterized in that computing statistics for each string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into the new word candidate set specifically comprises:
computing statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy, and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF * AMI * (2 * El * Er / (El + Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting, according to each string's score, the strings whose score is greater than a prescribed threshold and writing them into the new word candidate set.

3. The method according to claim 2, characterized in that after the strings whose scores meet the requirement are written into the new word candidate set, the method further comprises: adding the new word candidate set to the general dictionary;
and segmenting the corpus to be calculated and constructing a graph network based on the segmentation result specifically comprises:
based on the dictionary obtained by adding the new word candidate set to the general dictionary, segmenting the corpus to be calculated with dictionary-based maximum-probability word segmentation, and constructing the graph network with adjacent segmented words as nodes.

4. The method according to claim 3, characterized in that screening the graph embeddings of words contained in the new word candidate set based on the graph embeddings of words contained in the general dictionary specifically comprises:
traversing the graph embeddings of words contained in the new word candidate set, sorting them according to their similarity to the graph embeddings of words contained in the general dictionary, selecting a prescribed number of the candidate graph embeddings according to the sorting, and, from the selected prescribed number, filtering out the graph embeddings whose similarity to the graph embeddings of words contained in the general dictionary meets a prescribed threshold.

5. A new word discovery system based on graph embedding, characterized by comprising:
a new word candidate set building module, which uses a sliding window to cut out N-gram character strings of a corpus to be calculated, computes statistics for each string, scores each string according to the statistics, and writes the strings whose scores meet a requirement into a new word candidate set;
a graph embedding training module, configured to segment the corpus to be calculated into words, construct a graph network based on the segmentation result, and compute over the graph network with a graph attention network to obtain graph embeddings of the words of the corpus to be calculated;
a new word screening module, configured to find, among the graph embeddings of the words of the corpus to be calculated, the graph embeddings of words contained in the new word candidate set, screen those graph embeddings based on the graph embeddings of words contained in a general dictionary, and take the words corresponding to the graph embeddings obtained by the screening as candidate new words.

6. The system according to claim 5, characterized in that computing statistics for each string, scoring each string according to the statistics, and writing the strings whose scores meet the requirement into the new word candidate set specifically comprises:
computing statistics for each string, the statistics comprising: word frequency, average mutual information, left entropy, and right entropy;
scoring each string based on a scoring formula, the scoring formula being:
TF * AMI * (2 * El * Er / (El + Er));
wherein TF is the word frequency, AMI is the average mutual information, El is the left entropy, and Er is the right entropy;
selecting, according to each string's score, the strings whose score is greater than a prescribed threshold and writing them into the new word candidate set.

7. The system according to claim 6, characterized in that after the strings whose scores meet the requirement are written into the new word candidate set, the new word candidate set building module is further configured to add the new word candidate set to the general dictionary;
and segmenting the corpus to be calculated and constructing a graph network based on the segmentation result specifically comprises:
based on the dictionary obtained by adding the new word candidate set to the general dictionary, segmenting the corpus to be calculated with dictionary-based maximum-probability word segmentation, and constructing the graph network with adjacent segmented words as nodes.

8. The system according to claim 7, characterized in that screening the graph embeddings of words contained in the new word candidate set based on the graph embeddings of words contained in the general dictionary specifically comprises:
traversing the graph embeddings of words contained in the new word candidate set, sorting them according to their similarity to the graph embeddings of words contained in the general dictionary, selecting a prescribed number of the candidate graph embeddings according to the sorting, and, from the selected prescribed number, filtering out the graph embeddings whose similarity to the graph embeddings of words contained in the general dictionary meets a prescribed threshold.

9. An electronic device, characterized in that the electronic device comprises: a housing, a processor, a memory, a circuit board, and a power supply circuit, wherein the circuit board is arranged inside the space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power supply circuit is configured to supply power to each circuit or device of the electronic device; the memory is configured to store executable program code; and the processor, by reading the executable program code stored in the memory, runs a program corresponding to the executable program code to execute the method according to any one of claims 1-4.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs, the one or more programs being executable by one or more processors to implement the method according to any one of claims 1-4.
CN202011060498.2A 2020-09-30 2020-09-30 A new word discovery method, system, device and medium based on graph embedding Active CN112232077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011060498.2A CN112232077B (en) 2020-09-30 2020-09-30 A new word discovery method, system, device and medium based on graph embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011060498.2A CN112232077B (en) 2020-09-30 2020-09-30 A new word discovery method, system, device and medium based on graph embedding

Publications (2)

Publication Number Publication Date
CN112232077A true CN112232077A (en) 2021-01-15
CN112232077B CN112232077B (en) 2021-10-29

Family

ID=74119881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011060498.2A Active CN112232077B (en) 2020-09-30 2020-09-30 A new word discovery method, system, device and medium based on graph embedding

Country Status (1)

Country Link
CN (1) CN112232077B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138510A (en) * 2015-08-10 2015-12-09 昆明理工大学 Microblog-based neologism emotional tendency judgment method
CN106502984A (en) * 2016-10-19 2017-03-15 上海智臻智能网络科技股份有限公司 A kind of method and device of field new word discovery
CN106649849A (en) * 2016-12-30 2017-05-10 上海智臻智能网络科技股份有限公司 Text information base building method and device and searching method, device and system
CN107622051A (en) * 2017-09-14 2018-01-23 马上消费金融股份有限公司 New word screening method and device
US20190155918A1 (en) * 2017-11-20 2019-05-23 Colossio, Inc. Real-time classification of evolving dictionaries


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杜洋 (Du Yang): "Research on New Word Discovery and Named Entity Recognition Based on Word Vector Representations", China Masters' Theses Full-text Database (Information Science and Technology) *

Also Published As

Publication number Publication date
CN112232077B (en) 2021-10-29

Similar Documents

Publication Publication Date Title
US10725836B2 (en) Intent-based organisation of APIs
US10956464B2 (en) Natural language question answering method and apparatus
CN105378732B (en) Method and system for thematic analysis of tabular data
CN108701161B (en) Providing images for search queries
CN111460083A (en) Document title tree construction method and device, electronic equipment and storage medium
CN110162630A (en) A kind of method, device and equipment of text duplicate removal
CN103970733B (en) A kind of Chinese new word identification method based on graph structure
CN108681557A (en) Based on the short text motif discovery method and system indicated from expansion with similar two-way constraint
CN108132927A (en) A kind of fusion graph structure and the associated keyword extracting method of node
WO2022134779A1 (en) Method, apparatus and device for extracting character action related data, and storage medium
CN109614626A (en) Automatic keyword extraction method based on gravitational model
CN112580331A (en) Method and system for establishing knowledge graph of policy text
CN112784009A (en) Subject term mining method and device, electronic equipment and storage medium
CN107679031A (en) Based on the advertisement blog article recognition methods for stacking the self-editing ink recorder of noise reduction
CN116822491A (en) Log analysis method and device, equipment and storage medium
CN113554172B (en) Method and system for extracting knowledge of adjudication rules based on case text
CN115062600A (en) Code plagiarism detection method based on weighted abstract syntax tree
CN114692594A (en) Text similarity recognition method, device, electronic device and readable storage medium
CN112232077B (en) A new word discovery method, system, device and medium based on graph embedding
WO2015125209A1 (en) Information structuring system and information structuring method
CN119025758A (en) Retrieval method, device and electronic device based on large language model and multi-way recall
CN118797050A (en) Abstract generation method, device, electronic device and storage medium
CN113868508B (en) Writing material query method and device, electronic equipment and storage medium
JP5184195B2 (en) Language processing apparatus and program
CN115905885A (en) Data identification method, device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20250613

Address after: Room 401, Building 4, No. 470 Yunxiu North Road, Fuxi Street, Deqing County, Huzhou City, Zhejiang Province, China 313200

Patentee after: Huzhou Tongmu Intellectual Property Co.,Ltd.

Country or region after: China

Address before: 518040 Guangdong city of Shenzhen province Futian District Shatou Street Tairan Industrial Park Cangsong building room 1301

Patentee before: WORKWAY (SHENZHEN) INFORMATION TECHNOLOGY CO.,LTD.

Country or region before: China