CN111460825A - Data enhancement method, device, equipment and storage medium - Google Patents


Publication number
CN111460825A
Authority
CN
China
Prior art keywords
entity
target
corpus
entities
knowledge graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010245498.3A
Other languages
Chinese (zh)
Inventor
何莹
李直旭
陈志刚
邓宸博
李茂龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iflytek Suzhou Technology Co Ltd
Original Assignee
Iflytek Suzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iflytek Suzhou Technology Co Ltd filed Critical Iflytek Suzhou Technology Co Ltd
Priority to CN202010245498.3A priority Critical patent/CN111460825A/en
Publication of CN111460825A publication Critical patent/CN111460825A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data enhancement method, apparatus, device and storage medium. The method comprises: determining a target entity from a knowledge graph, the target entity being the entity that corresponds to an entity reference in a target corpus; determining each related entity of the target entity from the knowledge graph; and replacing the entity reference in the target corpus with the related entities of the target entity, the resulting replaced corpora forming the extended corpus of the target corpus. By executing this scheme, a large amount of extended corpora corresponding to the target corpus can be obtained at one time, and the purpose of data enhancement is quickly achieved.

Description

Data enhancement method, device, equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data enhancement method, apparatus, device, and storage medium.
Background
Data enhancement is a data preprocessing method that is widely used in computer vision, where image samples are, for example, rotated, cropped, flipped or translated. Data enhancement can effectively improve the generalization capability of a model and thereby reduce the demand for labeled data. In natural language processing, however, there are few data enhancement methods, and different methods need to be designed for different tasks.
Entity recognition is one of the underlying tasks of natural language processing and is the basis for various higher-level applications. At present, entity recognition models are mostly based on deep learning, and such models need a large amount of labeled corpora to achieve the desired accuracy and recall. Performing data enhancement on a limited quantity of labeled entity recognition corpora, that is, obtaining extended corpora of the labeled corpora so as to enlarge the amount of corpus data, is therefore a practical requirement in the field of entity recognition.
Disclosure of Invention
Based on the above requirements, the present application provides a data enhancement method, apparatus, device and storage medium, which can obtain the expanded corpus of the corpus, thereby achieving the purpose of enhancing corpus data.
A data enhancement method, comprising:
determining a target entity from a knowledge graph, wherein the target entity is the entity corresponding to an entity reference in a target corpus;
determining each related entity of the target entity from the knowledge graph;
and replacing the entity reference in the target corpus with the related entities of the target entity, wherein the resulting replaced corpora form the extended corpus of the target corpus.
A data enhancement apparatus, comprising:
a target entity determining unit, configured to determine a target entity from a knowledge graph, wherein the target entity is the entity corresponding to an entity reference in a target corpus;
a related entity determining unit, configured to determine, from the knowledge graph, each related entity of the target entity;
and an entity replacing unit, configured to replace the entity reference in the target corpus with the related entities of the target entity, wherein the resulting replaced corpora form the extended corpus of the target corpus.
A data enhancement device, comprising:
a memory and a processor;
wherein the memory is connected with the processor and used for storing programs;
the processor is used for implementing the data enhancement method by running the program in the memory.
A storage medium having stored thereon a computer program which, when executed by a processor, implements the data enhancement method described above.
When the data enhancement method is applied to data enhancement of an entity recognition corpus, a target entity corresponding to an entity reference in the target corpus is first determined from the knowledge graph, and the related entities of the target entity are then determined from the knowledge graph; the entity reference in the target corpus is replaced with the related entities of the target entity, and the resulting replaced corpora form the extended corpus of the target corpus. In this process, the related entities of the target entity corresponding to the entity reference are retrieved from the knowledge graph and used as replacement content for the entity reference; replacing the entity reference with this content yields new corpora and thereby expands the corpus. Moreover, because of the massive data volume of the knowledge graph, a large number of related entities can be obtained in this way, so a large amount of extended corpora corresponding to the target corpus can be obtained at one time, and the purpose of data enhancement is quickly achieved.
Furthermore, because the knowledge graph fully records the relationships among entities as well as information such as the attributes and categories of each entity, selecting the related entities of the target entity from the knowledge graph as replacement content ensures that the replacement content is more comprehensively related to the entity reference. The new corpora obtained by replacing the entity reference in the original target corpus with this content can therefore more fully retain the semantic information of the original target corpus.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are only embodiments of the present application; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a schematic flowchart of a data enhancement method provided in an embodiment of the present application;
Fig. 2 is a schematic diagram of an entity graph network structure provided in an embodiment of the present application;
Fig. 3 is a schematic flowchart of another data enhancement method provided in an embodiment of the present application;
Fig. 4 is a schematic diagram of entity information in a knowledge graph provided in an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a data enhancement apparatus provided in an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a data enhancement device provided in an embodiment of the present application.
Detailed Description
The technical scheme of the embodiment of the application is suitable for an application scene for performing data enhancement processing on the corpus data, and particularly suitable for an application scene for performing data enhancement processing on the entity identification corpus. By adopting the technical scheme of the embodiment of the application, the existing corpus can be expanded to obtain the expanded corpus, so that the aim of data enhancement is fulfilled.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
An embodiment of the present application provides a data enhancement method, and as shown in fig. 1, the method includes:
S101, determining a target entity from the knowledge graph, wherein the target entity is the entity corresponding to an entity reference in the target corpus.
The aforementioned knowledge graph describes various concepts (e.g., people, places, etc.) and attributes of entities existing in the real world, and relationships between them. Through the knowledge graph, the information such as attributes, categories, labels and the like of any entity can be inquired and obtained, and the relation between the entities can be determined based on the network structure of the knowledge graph. The structure, content and the like of the knowledge graph can refer to a knowledge graph commonly known in the prior art, and the embodiment of the application is not described in detail.
As an alternative implementation manner, the knowledge graph may be a dedicated knowledge graph constructed for a certain field, for example, a corresponding knowledge graph may be constructed for a natural field, a social field, a sports field, and the like. The knowledge graph adopted in the embodiment of the application can be a knowledge graph applicable to the whole field, and can also be a knowledge graph belonging to the same field as the target corpus.
The target corpus is an entity recognition corpus in which the entity references have been labeled. An entity recognition corpus is a corpus used for entity recognition, and entity recognition means identifying entity names in a corpus.
For example, assume that the target corpus is "The population of Suzhou Gusu District is 950,000." After entity recognition and labeling, the labeled result is "The population of <GPE>Suzhou Gusu District</GPE> is <MISC>950,000</MISC>.", where "Suzhou Gusu District" and "950,000" are two entity references whose types are administrative location (GPE) and miscellaneous (MISC, e.g. time or number), respectively.
It is understood that after the target corpus is obtained, the entity names in the target corpus may be determined based on the known target corpus entity recognition result.
The entity names in the knowledge graph are canonical entity names, whereas an entity name in the target corpus may be canonical, colloquial or irregular. For example, an entity named "book" in the knowledge graph may be written as "book" in the entity recognition corpus; although both correspond to the same thing, the former is the canonical name and the latter is a more colloquial name.
Based on the above reality, in the embodiment of the present application, a word or a phrase corresponding to a certain entity in the target corpus is referred to as an entity reference, and an entity in the knowledge graph is regarded as a canonical entity.
Further, after the entity reference in the target corpus is determined, the embodiment of the present application determines the entity corresponding to the entity reference from the knowledge graph as the target entity. The entity corresponding to an entity reference is the entity in the knowledge graph that denotes the same thing as the entity reference. For example, if the target corpus contains the entity reference "book" and the corresponding entity in the knowledge graph is determined to be "book", then the entity "book" in the knowledge graph is taken as the target entity of the entity reference "book".
The target entity corresponding to the entity reference may be an entity with the same name as the entity reference or an entity with a different name from the entity reference.
As an exemplary implementation manner, the above-mentioned determining the target entity from the knowledge graph can be implemented by comparing the entity reference with the entity in the knowledge graph by means of semantic recognition, similarity comparison, and the like.
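The similarity comparison mentioned above can be sketched as follows. This is only a minimal illustration: the function name `link_entity`, the optional `aliases` table and the `min_ratio` cutoff are hypothetical, and character-level string similarity from Python's standard `difflib` stands in for whatever semantic recognition or similarity model an actual implementation would use.

```python
from difflib import SequenceMatcher


def link_entity(mention, kg_entities, aliases=None, min_ratio=0.6):
    """Return the knowledge-graph entity that best matches an entity reference.

    `aliases` (hypothetical) maps colloquial names to canonical entity names.
    Returns None when no candidate reaches `min_ratio` (an assumed cutoff).
    """
    aliases = aliases or {}
    # An alias hit handles colloquial names that share few characters
    # with the canonical name.
    if mention in aliases:
        return aliases[mention]
    best, best_ratio = None, 0.0
    for entity in kg_entities:
        ratio = SequenceMatcher(None, mention.lower(), entity.lower()).ratio()
        if ratio > best_ratio:
            best, best_ratio = entity, ratio
    return best if best_ratio >= min_ratio else None


candidates = ["Suzhou Gusu District", "Shanghai Pudong District"]
target_entity = link_entity("Gusu District", candidates)
```

The returned entity may have the same name as the entity reference or a different one, which matches the two cases described above.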
S102, determining each related entity of the target entity from the knowledge graph.
Specifically, after the target entity corresponding to the entity name in the target corpus is determined, the embodiment of the present application further searches for the related entity of the target entity from the knowledge graph.
The related entities of the target entity are entities related to the target entity, including but not limited to: entities whose attributes, categories, etc. are related or similar to those of the target entity; entities that, when collocated with the target entity, form recognized or canonical phrases or sentences; and entities located in the same or a similar network structure in the knowledge graph.
Depending on the criterion used to define related entities, the entities related to the target entity are retrieved from the knowledge graph by the search method that corresponds to that criterion, which yields the related entities of the target entity.
For example, the entity reference in the target corpus "The population of Suzhou Gusu District is 950,000." is "Suzhou Gusu District". Assuming that its corresponding target entity is also "Suzhou Gusu District", the related entities of the target entity "Suzhou Gusu District" are determined from the knowledge graph.
Assume that related entities are entities of the same category as the target entity. Since "Suzhou Gusu District" is the name of an entity representing a city district, entities representing city districts are retrieved from the knowledge graph when determining its related entities. For example, entities such as "Shanghai Pudong District" and "Beijing Dong District" may be found, and these are the related entities of the target entity "Suzhou Gusu District".
Based on the mass data volume of the knowledge graph, in a general situation, a plurality of related entities can be determined from the knowledge graph corresponding to one target entity.
S103, replacing the entity references in the target corpus by using the related entities of the target entity, and forming the extended corpus of the target corpus by the obtained replaced corpora.
Specifically, corresponding to the entity reference in the target corpus, each related entity of the target entity corresponding to the entity reference is used to replace the entity reference, so as to obtain each new corpus after replacement, and each new corpus constitutes the extended corpus of the target corpus.
When the target corpus has a plurality of entity names, the extended corpus of the target corpus is obtained by the above-mentioned related entity replacement processing corresponding to each entity name.
For example, the entity reference in the target corpus "The population of Suzhou Gusu District is 950,000." is "Suzhou Gusu District". The related entities of its corresponding target entity, namely "Shanghai Pudong District", "Beijing Dong District", etc., are used to replace the entity reference "Suzhou Gusu District" in the target corpus, which yields the replaced corpora "The population of Shanghai Pudong District is 950,000.", "The population of Beijing Dong District is 950,000.", etc. These replaced corpora are the extended corpus of the target corpus "The population of Suzhou Gusu District is 950,000."
Similarly, for the entity reference "950,000" in the target corpus "The population of Suzhou Gusu District is 950,000.", related entities can also be retrieved from the knowledge graph and used for replacement according to the above scheme, yielding further replaced corpora.
Replacing each entity reference in the target corpus with its related entities according to the above scheme yields the extended corpus of the target corpus, thereby realizing data enhancement of the target corpus.
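Step S103 can be illustrated with a short sketch. The function name `expand_corpus` and the dictionary layout are hypothetical, and the sketch works on plain strings; a real labeled corpus would also carry each entity reference's type label over to the replaced span.

```python
def expand_corpus(sentence, related_by_mention):
    """Build the extended corpus of one sentence.

    `related_by_mention` maps each entity reference in the sentence to the
    related entities of its target entity. Each related entity replaces the
    reference once, so one new sentence is produced per related entity.
    """
    expanded = []
    for mention, related_entities in related_by_mention.items():
        if mention not in sentence:
            continue  # skip references that do not occur in this sentence
        for entity in related_entities:
            expanded.append(sentence.replace(mention, entity, 1))
    return expanded


target_corpus = "The population of Suzhou Gusu District is 950,000."
extended = expand_corpus(
    target_corpus,
    {"Suzhou Gusu District": ["Shanghai Pudong District", "Beijing Dong District"]},
)
# extended holds one replaced sentence per related entity.
```

Here each entity reference is replaced independently; combining replacements of several references in one sentence would be a further design choice that the text above does not prescribe.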
As can be seen from the above description, when the data enhancement method provided by the present application is applied to data enhancement of an entity recognition corpus, a target entity corresponding to an entity reference in the target corpus is first determined from the knowledge graph, and the related entities of the target entity are then determined from the knowledge graph; the entity reference in the target corpus is replaced with the related entities of the target entity, and the resulting replaced corpora form the extended corpus of the target corpus. In this process, the related entities of the target entity corresponding to the entity reference are retrieved from the knowledge graph and used as replacement content for the entity reference; replacing the entity reference with this content yields new corpora and thereby expands the corpus. Moreover, because of the massive data volume of the knowledge graph, a large number of related entities can be obtained in this way, so a large amount of extended corpora corresponding to the target corpus can be obtained at one time, and the purpose of data enhancement is quickly achieved.
Furthermore, because the knowledge graph fully records the relationships among entities as well as information such as the attributes and categories of each entity, selecting the related entities of the target entity from the knowledge graph as replacement content ensures that the replacement content is more comprehensively related to the entity reference. The new corpora obtained by replacing the entity reference in the original target corpus with this content can therefore more fully retain the semantic information of the original target corpus.
As an optional implementation, when determining each related entity of the target entity from the knowledge graph, the embodiment of the present application specifically determines each related entity of the target entity from the knowledge graph according to the category labels of the target entity and the target corpus.
The category label of the target entity is a label that is included in the target entity and indicates a category of the target entity. In a knowledge graph, the class label of each entity is explicitly documented, and an entity typically has multiple class labels. Therefore, by querying the relevant information of the target entity from the knowledge graph, the various class labels possessed by the target entity can be determined.
Based on the category labels of the target entity, the embodiment of the present application selects entities having the same or similar category labels as the target entity from the knowledge graph as the related entities of the target entity.
Since the target entity usually has several category labels, the target corpus may embody only the semantic information related to one of them. Therefore, when entities are screened from the knowledge graph according to the category labels of the target entity, the category label related to the semantics of the target corpus should be determined first, and the entities having that category label should then be screened from the knowledge graph as the related entities of the target entity. In this way, when the screened related entities are later used to replace the entity reference in the target corpus, the replaced entities will not mismatch the semantics of the original corpus.
Therefore, when selecting the related entities from the knowledge graph based on the category label of the target entity, the semantic information of the target corpus needs to be referred to, that is, each related entity of the target entity is determined from the knowledge graph according to the category label of the target entity and the semantic information of the target corpus.
For example, determining each related entity of the target entity from the knowledge graph according to the category labels of the target entity and the target corpus specifically includes:
firstly, according to the category labels of the target entity and the target corpus, screening out, from the category labels of the target entity, the category label with the maximum similarity to the target corpus as the target category label.
In order to facilitate comparison of similarity between the category label of the target entity and the target corpus, in the embodiment of the present application, the similarity between the category label and the target corpus is measured by using the feature vector of the category label of the target entity and the feature vector of the target corpus.
The feature vector of the target corpus is represented by an average value of vectors of all words contained in the target corpus, and the feature vector can also be used for representing semantic information of the target corpus.
Let $s$ denote the feature vector of the target corpus and $\mathrm{Cate}(e_0)$ denote the set of category labels of the target entity $e_0$. The similarity between the feature vector of each category label of the target entity and the feature vector of the target corpus is calculated according to the following formula, and the category label with the maximum similarity is screened out from the category labels of the target entity as the target category label $c_{sim}$:
$$c_{sim} = \arg\max_{c_i} \mathrm{CosSim}(c_i, s), \quad c_i \in \mathrm{Cate}(e_0)$$
Here, $\mathrm{CosSim}(\cdot, \cdot)$ denotes the cosine similarity between two feature vectors; cosine similarity is only one example of how the similarity between two feature vectors may be calculated. After the similarity between the feature vector of each category label of the target entity and the feature vector of the target corpus has been calculated with $\mathrm{CosSim}(\cdot, \cdot)$, the $\arg\max_{c_i}$ operation selects, from the category labels of the target entity, the category label that maximizes the function value as the target category label $c_{sim}$.
In practical implementation of the technical solution of the embodiment of the present application, other similarity measurement methods may also be used to measure the similarity between the feature vector of the category tag of the target entity and the feature vector of the target corpus, or other similarity measurement methods may also be used to measure the similarity between the category tag of the target entity and the target corpus.
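The selection of the target category label can be sketched in a few lines. The helper names and the toy two-dimensional vectors below are hypothetical; the sketch assumes that word vectors and category-label vectors of equal dimension are already available, averages the word vectors to obtain the corpus feature vector, and uses cosine similarity as in the formula above.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cos_sim(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def target_category_label(label_vectors, word_vectors):
    """Pick the category label whose vector is most similar to the corpus vector."""
    s = mean_vector(word_vectors)  # feature vector of the target corpus
    return max(label_vectors, key=lambda label: cos_sim(label_vectors[label], s))


# Toy vectors, invented purely for illustration.
label_vectors = {"city district": [1.0, 0.0], "tourist destination": [0.0, 1.0]}
word_vectors = [[0.9, 0.1], [0.7, 0.3]]
c_sim = target_category_label(label_vectors, word_vectors)
```

Swapping `cos_sim` for another measure changes only the `key` function, which reflects the remark that other similarity measures may be used.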
Then, screening out the entity with the target category label from the knowledge graph as the related entity of the target entity.
The set of all related entities of the target entity $e_0$ is then given by $\mathrm{Ent}(c_{sim})$, where $\mathrm{Ent}(c_{sim})$ denotes the set of entities in the knowledge graph that have the category label $c_{sim}$.
Illustratively, using the target category label as a retrieval condition, the entity having the target category label is retrieved from the knowledge graph as a related entity to the target entity.
It will be appreciated that the entity set obtained by carrying out the above scheme contains all related entities of the target entity. Because the process of obtaining the related entities fully considers the semantic association between the category labels of the target entity and the target corpus, the related entities determined in this way are all related to the semantics of the target corpus; replacing the entity reference in the target corpus with these related entities therefore ensures that the replaced corpora remain semantically related to the target corpus.
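Retrieving the entities that carry the target category label can be sketched as a simple filter. The in-memory dictionary `kg_labels`, the function name and the example labels are hypothetical stand-ins for a real knowledge-graph query; excluding the target entity itself from the result is an assumption made here so that replacement never reproduces the original sentence.

```python
def related_entities_by_label(kg_labels, target_entity, target_label):
    """Return the entities that have `target_label`, excluding the target entity.

    `kg_labels` maps each entity name to its set of category labels.
    """
    return sorted(
        entity
        for entity, labels in kg_labels.items()
        if target_label in labels and entity != target_entity
    )


# Toy knowledge-graph labels, invented for illustration.
kg_labels = {
    "Suzhou Gusu District": {"city district", "tourist destination"},
    "Shanghai Pudong District": {"city district"},
    "West Lake": {"tourist destination"},
}
related = related_entities_by_label(kg_labels, "Suzhou Gusu District", "city district")
```

With "city district" as the target category label, "West Lake" is not returned even though it shares another label with the target entity, which is the mismatch the label selection step is meant to avoid.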
As another optional implementation manner, when the related entity of the target entity is determined from the knowledge graph, the related entity of the target entity may also be determined from the knowledge graph according to the network structure of the knowledge graph.
In the knowledge graph, entities are nodes in a graph network structure, and the edges between nodes represent information such as the attributes of the entities or the relationships between them. The inventors of the present application have found in practice that, in the network structure of the knowledge graph, the more similar the network structures around two entities are, the smaller the difference between the two entities; the greater the difference between the network structures around two entities, the greater the difference between the two entities. For example, in the extreme case of two identical entities, the network structures around them are the same, so the difference between the surrounding network structures is the smallest and the difference between the two entities is also the smallest.
Therefore, screening entities with the same or similar network structures from the knowledge graph, according to the network structure of the knowledge graph, can be used to find similar entities in the knowledge graph, that is, to determine the related entities of the target entity.
Based on the above thought, the embodiment of the application determines the graph network structure of each entity in the knowledge graph according to the network structure of the knowledge graph, and then determines the related entities of the target entity from the knowledge graph according to the graph network structure of the entity.
Specifically, firstly, based on the network structure of the knowledge graph, the graph network feature vector of each entity in the knowledge graph is respectively determined.
The graph network feature vector of an entity is a feature vector that captures the features of the entity's graph network structure, and it can be obtained by extracting features from that structure. The graph network structure of an entity is the network structure formed by the entity and its neighboring entities in the knowledge graph.
For example, a feature vector of information such as attributes or relationships on edges between entity nodes included in the graph network structure of the entity may be calculated as the graph network feature vector of the entity.
After the graph network feature vector of each entity in the knowledge graph has been determined, the graph network feature vector of the target entity is also known. The graph network feature vector of the target entity is then compared for similarity with the graph network feature vectors of the other entities in the knowledge graph, so that the related entities of the target entity are determined from the knowledge graph.
For example, as shown in fig. 2, the solid nodes in the graph represent entity 1, entity 2, and entity 3, the hollow nodes around each solid node represent the adjacent entities of that entity, and the edges between entity nodes represent the relationships between entities. When the network structures around two entities are very similar, that is, when the graph network structures of the two entities are very similar, the similarity between the graph network feature vectors of the two entities is very high. In fig. 2, the network structures around node 1 and node 3 are more similar, so the similarity between node 1 and node 3 is higher. Therefore, comparing the similarity of the graph network feature vectors of entities amounts to comparing the similarity of their graph network structures, that is, comparing the similarity of the entities themselves.
For example, the similarity comparison can be implemented by calculating the Manhattan distance between the graph network feature vector of the target entity and the graph network feature vector of each other entity in the knowledge graph. The smaller the Manhattan distance, the greater the similarity between the two graph network feature vectors; the larger the distance, the smaller the similarity.
By comparing the similarity of the graph network feature vector of the target entity with the graph network feature vectors of the other entities in the knowledge graph, the embodiment of the present application finds, from the knowledge graph, the entities whose graph network feature vector similarity with the target entity is greater than a set similarity threshold, and takes them as the related entities of the target entity.
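The threshold-based comparison can be sketched as follows. This is a minimal illustration, assuming the graph network feature vectors are already available as plain lists; the function names, the `max_distance` parameter, and the toy vectors are hypothetical. Because a smaller Manhattan distance means a higher similarity, "similarity greater than a threshold" is expressed here as "distance not greater than a bound".

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def related_by_structure(target, vectors, max_distance):
    """Return (entity, distance) pairs within max_distance of the target, nearest first."""
    anchor = vectors[target]
    hits = [
        (name, manhattan(anchor, vec))
        for name, vec in vectors.items()
        if name != target
    ]
    return sorted(
        (pair for pair in hits if pair[1] <= max_distance),
        key=lambda pair: pair[1],
    )


# Toy graph network feature vectors (hypothetical values).
vectors = {
    "entity_1": [0.9, 0.1, 0.4],
    "entity_2": [0.1, 0.8, 0.9],
    "entity_3": [0.8, 0.2, 0.5],
}
related = related_by_structure("entity_1", vectors, max_distance=0.5)
```

With these toy vectors, only `entity_3` falls within the bound, mirroring the fig. 2 case in which node 1 and node 3 have similar surrounding structures.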
As an exemplary implementation manner, an embodiment of the present application further discloses that determining, based on the network structure of the knowledge graph, the graph network feature vector of each entity in the knowledge graph specifically includes:
First, the feature vector of each entity in the knowledge graph is initialized to obtain an entity feature matrix of the knowledge graph.
For example, in the embodiment of the present application, the feature vector of each entity node in the knowledge graph is initialized with a vector drawn from a random normal distribution, so as to obtain the entity feature matrix H of the knowledge graph.
The entity feature matrix H is an N × d matrix representing node features, where N is the number of entities in the entire knowledge graph network and d is the feature dimension of the entities (in this application, specifically 16).
Then, the entity feature matrix of the knowledge graph is processed by a pre-trained graph convolutional neural network, which converts the feature vectors of the entities in the knowledge graph into graph network feature vectors, thereby determining the graph network feature vector of each entity in the knowledge graph.
Specifically, the embodiment of the application trains the graph convolutional neural network in advance, so that it can process the entity feature matrix of the knowledge graph to obtain a new feature representation of each entity in the knowledge graph, that is, the graph network feature vector of the entity.
Illustratively, the graph convolutional neural network described above is trained to process the feature vectors of the entities of the knowledge graph into embedding vectors, which serve as the graph network feature vectors of the entities.
Based on this function of the graph convolutional neural network, the embodiment of the present application trains the graph convolutional neural network on the network structure of the knowledge graph, so that it processes the feature vectors of the entity nodes in the network to obtain an embedding vector for each entity node, which is used as the graph network feature vector of that entity in the knowledge graph.
The feature processing based on the graph convolutional neural network is as follows:
The graph convolutional neural network processes the entity feature matrix of the knowledge graph according to the following formula:
$$H^{l+1} = \sigma(A H^{l} W^{l})$$
wherein $H^{l}$ and $H^{l+1}$ are the entity feature matrices of the knowledge graph at layer $l$ and layer $l+1$, respectively; $A$ is the adjacency matrix of the knowledge graph network; $W^{l}$ is the network weight of layer $l$; $H^{0}$ is the initialized entity feature matrix of the knowledge graph; and $\sigma$ is the operation function with which the graph convolutional neural network calculates the entity feature matrix of layer $l+1$ from the entity feature matrix of layer $l$.
By iterating this formula, the feature vectors of the entities of the knowledge graph gradually converge toward the embedding vectors.
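The layer-wise propagation can be sketched in plain Python as below. This is only an illustrative forward pass under stated assumptions: ReLU is assumed as the operation function, the adjacency matrix includes self-loops, the weights are random rather than trained, and all names (`matmul`, `relu`, `gcn_forward`) are hypothetical.

```python
import random


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def relu(m):
    """Element-wise ReLU, assumed here as the operation function."""
    return [[max(v, 0.0) for v in row] for row in m]


def gcn_forward(adj, h0, weights):
    """Apply H_next = relu(A @ H @ W) once per weight matrix."""
    h = h0
    for w in weights:
        h = relu(matmul(matmul(adj, h), w))
    return h


rng = random.Random(0)
n, d = 3, 16  # N entities, 16-dimensional features as in the text

# Initial matrix: random normal initialisation of every entity's feature vector.
h0 = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

# Adjacency matrix of a 3-node path graph with self-loops (toy example).
adj = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]

# Two layers of untrained d x d weights; in practice these are learned.
weights = [
    [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(d)]
    for _ in range(2)
]

h_out = gcn_forward(adj, h0, weights)
```

Each row of `h_out` plays the role of one entity's graph network feature vector; because every layer mixes in the features of adjacent nodes through `adj`, entities with similar neighbourhoods tend to receive similar rows after training.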
During the operation of the graph convolutional neural network, the embodiment of the application uses negative sampling to pair a node e in the knowledge graph network with a node e' other than e, and uses a margin-based function as the loss function when carrying out the feature processing of the graph convolutional neural network.
The formula of the above-mentioned Margin-based function is as follows:
L=max(d(e,e')-d(e,e)+γ,0)
wherein e represents the feature vector of a single entity, which can be obtained from H; d(·, ·) denotes the Manhattan distance; and γ is an operational parameter.
The goal of the above training process is to minimize the loss: the more similar the network structures around two entities, the smaller the distance between them and the smaller the loss. Therefore, after training, the obtained feature vector of an entity is related to the structure of the graph network in which the entity is located; that is, the graph network feature vector of the entity is obtained.
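A minimal sketch of a margin-based loss with Manhattan distance, assuming the common triplet form in which an anchor entity is compared against one structurally similar (positive) entity and one negatively sampled entity. The positive/negative split, the function names, and the toy vectors are assumptions for illustration and are not taken from the text.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def margin_loss(anchor, positive, negative, gamma):
    """Hinge loss: zero once the negative is at least gamma farther away than the positive."""
    return max(manhattan(anchor, positive) - manhattan(anchor, negative) + gamma, 0.0)


e = [0.0, 0.0]
e_pos = [0.5, 0.0]   # distance 0.5 from e
e_far = [3.0, 0.0]   # negative sample at distance 3.0
e_near = [1.0, 0.0]  # negative sample at distance 1.0

loss_far = margin_loss(e, e_pos, e_far, gamma=1.0)    # margin satisfied, loss is zero
loss_near = margin_loss(e, e_pos, e_near, gamma=1.0)  # margin violated, loss is positive
```

Under this assumed form, minimizing the loss pulls structurally similar entities together and pushes negatively sampled entities at least γ farther away.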
The training process of the graph convolutional neural network and the loss function calculation described above follow the conventional training process and loss function of graph convolutional neural networks. The embodiment of the present application only summarizes them briefly and does not strictly limit the processing, which may be adapted when the technical solution is actually implemented. In addition, the conversion of entity features may be achieved in other ways.
According to the above scheme, when the related entities of the target entity $e_0$ are determined from the knowledge graph according to the network structure of the knowledge graph, these related entities can be grouped together to obtain the set
Figure BDA0002433875150000111
It should be noted that, the above embodiments of the present application introduce two schemes for obtaining related entities of a target entity, where the two schemes start from different angles, and respectively obtain a set of related entities of the target entity.
The above two schemes for determining the related entities of the target entity are not contradictory, and the two schemes can be used alternatively or in combination.
As a preferred implementation manner, when determining the related entities of the target entity from the knowledge graph, the embodiment of the present application combines the two methods described above: determining related entities according to the category label of the target entity and the target corpus, and determining related entities according to the network structure of the knowledge graph. That is, related entities of the target entity are determined from the knowledge graph by each method separately, and the results are then merged to serve together as the finally determined related entities of the target entity.
For example, for the above target entity $e_0$, when the related entities of the target entity are determined from the knowledge graph according to the category label of the target entity and the target corpus, a related entity set is obtained
Figure BDA0002433875150000121
and when the related entities of the target entity are determined from the knowledge graph according to the network structure of the knowledge graph, a related entity set is obtained
Figure BDA0002433875150000122
Then, the finally determined set of related entities of the target entity $e_0$ is
Figure BDA0002433875150000123
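Merging the two results can be illustrated with a set union, assuming each scheme returns a set of entity identifiers. The variable names and entity identifiers below are hypothetical placeholders.

```python
# Related entities found via category label and target corpus (hypothetical).
related_by_category = {"entity_a", "entity_b", "entity_c"}

# Related entities found via the knowledge graph's network structure (hypothetical).
related_by_structure = {"entity_b", "entity_d"}

# Final related entity set: every entity found by either scheme, without duplicates.
related_final = related_by_category | related_by_structure
```

Using a set means that an entity found by both schemes, such as `entity_b` here, appears only once in the final result.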
According to the embodiment of the present application, the related entities of the target entity corresponding to an entity designation in the target corpus can be determined from the knowledge graph, and, given the massive data volume of the knowledge graph, the number of related entities obtained by the above scheme is large.
When the obtained related entities are used to replace the entity designation in the target corpus, in order to ensure semantic similarity between the replaced corpus and the original target corpus, the embodiment of the application screens the obtained related entities and selects the more suitable ones to replace the entity designation in the target corpus.
Referring to fig. 3, an embodiment of the present application provides that replacing the entity designation in the target corpus with the related entities of the target entity, with the resulting replaced corpora forming the extended corpus of the target corpus, specifically includes:
S303, selecting a related entity from the related entities of the target entity as a replacement entity.
For example, a replacement entity may be a single related entity, or a related entity group formed by combining two or more related entities.
The selection of related entities from the related entities of the target entity may be performed according to their category and attributes, their relationship with the entity designation in the target corpus, and the like. At a minimum, the selection should ensure semantic similarity between the target corpus and the replaced corpus obtained by replacing the entity designation in the target corpus with the selected related entities.
S304, replacing the entity designation in the target corpus with the replacement entity, the resulting replaced corpora forming the extended corpus of the target corpus.
When the above-mentioned replacement entity is used to replace the entity names in the target corpus, a corresponding replacement mode needs to be adopted according to the number of the entity names in the target corpus.
For example, if there is only one entity designation in the target corpus, then after the related entities of the corresponding target entity are determined and the replacement entities are selected from them, each replacement entity is used to replace the entity designation in the target corpus, and each resulting replaced corpus is used as an extended corpus of the target corpus.
If the target corpus has a plurality of entity designations, and each determined replacement entity is likewise composed of a plurality of related entities (the number of related entities contained in the replacement entity being the same as the number of entity designations in the target corpus), each entity designation in the target corpus is replaced with the corresponding related entity contained in the replacement entity to obtain a replaced corpus. The replaced corpora obtained in this way for every replacement entity are all used as extended corpora of the target corpus.
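The two replacement modes can be sketched with one helper, assuming the entity designations appear in the corpus as literal strings. The function names and example sentences are hypothetical; all designations are substituted in a single pass so that a newly inserted entity is never itself replaced.

```python
import re


def replace_mentions(corpus, mapping):
    """Replace every designation in one pass using a designation -> entity mapping."""
    # Longer designations first, so a short one never shadows a longer one.
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(m) for m in ordered))
    return pattern.sub(lambda match: mapping[match.group(0)], corpus)


def expand_corpus(corpus, mentions, replacement_groups):
    """Build one replaced corpus per replacement entity (single entity or group)."""
    expanded = []
    for group in replacement_groups:
        if len(group) != len(mentions):
            continue  # a group must supply exactly one entity per designation
        expanded.append(replace_mentions(corpus, dict(zip(mentions, group))))
    return expanded


# Single designation: each replacement entity is a one-element group.
single = expand_corpus(
    "Paris is a large city.", ["Paris"], [("Berlin",), ("Rome",)]
)

# Two designations: each replacement entity is a group of two related entities.
multi = expand_corpus(
    "Paris is the capital of France.",
    ["Paris", "France"],
    [("Berlin", "Germany"), ("Rome", "Italy")],
)
```

Treating the single-designation case as a group of size one keeps both modes on the same code path.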
Steps S301 and S302 in the embodiment shown in fig. 3 correspond to steps S101 and S102 in the method embodiment shown in fig. 1, respectively, and for details, reference is made to the contents of the method embodiment shown in fig. 1, which is not described herein again.
As an exemplary implementation manner, when selecting a related entity from the related entities of the target entity as a replacement entity, the embodiment of the present application adopts different selection schemes according to different numbers of entity designations in the target corpus.
On the one hand, when the number of entity designations in the target corpus is 1, the replacement entities are selected directly from the related entities of the target entity.
For example, the number of the entity references in the target corpus is 1, that is, the target corpus is a single entity corpus, and at this time, when selecting the alternative entity, the similarity or association between the selected alternative entity and the entity reference is mainly considered.
Therefore, when the embodiment of the present application selects the related entities from the related entities of the target entity corresponding to the entity designation in the target corpus, the set number of related entities with the highest similarity to the target entity is selected as the replacement entity.
For example, after obtaining, for the target entity corresponding to the entity designation in the target corpus, the related entity set
Figure BDA0002433875150000141
a set number of related entities are screened from the related entity set
Figure BDA0002433875150000142
according to their similarity with the target entity, forming a replacement entity set
Figure BDA0002433875150000143
Each replacement entity in the replacement entity set
Figure BDA0002433875150000144
can then be used directly to replace the entity designation in the target corpus.
Assume that the above-mentioned set number is $n_{max}$. The above-mentioned replacement entity set
Figure BDA0002433875150000145
can then be obtained according to the following formula:
Figure BDA0002433875150000146
wherein
Figure BDA0002433875150000147
represents the number of related entities in the related entity set;
Figure BDA0002433875150000148
represents obtaining, from the related entity set
Figure BDA0002433875150000149
the top $n_{max}$ entities most similar to the entity $e_0$. If the related entity set
Figure BDA00024338751500001410
contains fewer than $n_{max}$ entities, then all related entities in the related entity set
Figure BDA00024338751500001411
serve as replacement entities. When measuring similarity for the entities in the related entity set, the cosine similarity of the entities' feature vectors can again be used.
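The top-$n_{max}$ selection can be sketched as follows, assuming each entity already has a feature vector stored as a plain list. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_replacements(target_vec, related_vecs, n_max):
    """Keep the n_max related entities most similar to the target, or all if there are few."""
    if len(related_vecs) <= n_max:
        return list(related_vecs)
    ranked = sorted(
        related_vecs,
        key=lambda name: cosine(target_vec, related_vecs[name]),
        reverse=True,
    )
    return ranked[:n_max]


target_vec = [1.0, 0.0]
related_vecs = {
    "a": [1.0, 0.1],   # almost parallel to the target
    "b": [0.0, 1.0],   # orthogonal
    "c": [1.0, 1.0],   # 45 degrees away
    "d": [-1.0, 0.0],  # opposite direction
}

top_two = select_replacements(target_vec, related_vecs, n_max=2)
everything = select_replacements(target_vec, related_vecs, n_max=10)
```

When the related entity set is no larger than `n_max`, the function returns every related entity unchanged, matching the fallback described above.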
In general, the feature vector of an entity is obtained by concatenating the vectors of the first 3 sentences of the entity's description text in the knowledge graph; descriptions with fewer than 3 sentences are padded with zero vectors, and descriptions with more than 3 sentences are truncated. In addition, the set number can be flexibly adjusted according to the actual situation.
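The pad-or-truncate concatenation can be sketched as below, assuming each description sentence has already been encoded into a fixed-length vector by some sentence encoder that the text does not specify. The function name and the 2-dimensional toy vectors are hypothetical.

```python
def entity_feature(sentence_vectors, dim, max_sentences=3):
    """Concatenate the first max_sentences sentence vectors, zero-padding if there are fewer."""
    kept = sentence_vectors[:max_sentences]                      # truncate extra sentences
    missing = max_sentences - len(kept)
    padded = kept + [[0.0] * dim for _ in range(missing)]        # pad with zero vectors
    return [value for vec in padded for value in vec]            # splice into one vector


# One-sentence description: two zero vectors are appended.
short = entity_feature([[1.0, 2.0]], dim=2)

# Four-sentence description: the fourth sentence is dropped.
long = entity_feature([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]], dim=2)
```

Either way the result has length `3 * dim`, so the feature vectors of all entities are directly comparable.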
Further, in order to make the relevance between the selected replacement entity and the entity reference of the target corpus stronger, and thus ensure that the semantic similarity between the corpus after the entity reference is replaced by the replacement entity and the target corpus is higher, in the embodiment of the present application, after the replacement entity is selected according to the above description, the replacement entity is further screened.
For example, in the embodiment of the present application, a replacement entity with a target attribute is selected from the above replacement entities, and is used to replace the entity name in the target corpus.
The target attribute refers to an attribute included in the target corpus among attributes of the target entity corresponding to the entity name in the target corpus.
It will be understood that any entity may have many different attributes, but a target corpus usually does not contain all the attribute information of an entity, only one or some of its attributes. The same is true for the target entity corresponding to the entity designation in the target corpus: the target corpus usually contains only part of the attribute information of the target entity. Therefore, when selecting replacement entities from the related entities of the target entity, related entities whose attributes match the attribute information contained in the target corpus should be selected. This ensures that the attributes of the replacement entity match the attribute information contained in the target corpus, and thus ensures semantic similarity between the replaced corpus and the target corpus.
Based on the above idea, in the embodiment of the present application, when selecting replacement entities with the target attribute from the replacement entities, the attribute labels of the target entity are first searched for in the target corpus, and the attributes of the target entity that are contained in the target corpus are determined as the target attributes.
Illustratively, the attribute information in the target corpus can be searched through methods such as text retrieval or semantic recognition.
Then, from the above alternative entities, the alternative entity with the above target attribute is selected as the alternative entity to be finally used for replacing the entity designation in the target corpus.
Specifically, the attribute tags of the entities are stored in the knowledge graph, so that the alternative entities with the target attributes can be selected from the alternative entities by acquiring the attribute tags of the target entities from the knowledge graph, acquiring the attribute tags of the alternative entities and comparing the attribute tags.
To more clearly describe the process of screening alternative entities described above in the present application, the following description is provided by way of example:
Assume that the entity designation in the target corpus is "Baghdad" (the corresponding target entity in the knowledge graph is also "Baghdad"), and that its information in the knowledge graph is as shown in fig. 4.
As can be seen in fig. 4, the attribute labels of the entity "Baghdad" include "country", "province", "area", "population", and so on. Each attribute label of the entity "Baghdad" is then matched as a character string against the target corpus. If the attribute label "population" is ultimately matched, it can be determined that the "population" attribute of the entity "Baghdad" is contained in the target corpus, and "population" is then taken as the target attribute.
Then, from the replacement entity set corresponding to the entity designation "Baghdad" (each replacement entity in this set having been screened from the related entities of the target entity corresponding to the entity designation "Baghdad"), the entities having the "population" attribute are selected as the finally determined replacement entities for replacing the entity designation "Baghdad". For example, if the entity "Colorado" in the replacement entity set corresponding to the entity designation "Baghdad" has the "population" attribute, "Colorado" can be used as a finally determined entity for replacing the entity designation "Baghdad" in the target corpus.
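The attribute-based screening in this example can be sketched as follows, assuming attribute labels are matched against the corpus by simple substring search and that the knowledge graph is available as a dictionary from entity to attribute labels. The function names, the example sentence, and the second candidate entity are hypothetical.

```python
def target_attributes(corpus, attribute_labels):
    """Attribute labels of the target entity that literally occur in the corpus."""
    return {label for label in attribute_labels if label in corpus}


def filter_by_attributes(candidates, kg_attributes, wanted):
    """Keep candidates whose knowledge-graph attributes include every wanted attribute."""
    return [c for c in candidates if wanted <= set(kg_attributes.get(c, ()))]


# Toy knowledge graph: entity -> attribute labels (hypothetical contents).
kg_attributes = {
    "Baghdad": ["country", "province", "area", "population"],
    "Colorado": ["country", "area", "population"],
    "entity_x": ["country"],
}

corpus = "Baghdad has a large population."
wanted = target_attributes(corpus, kg_attributes["Baghdad"])
final = filter_by_attributes(["Colorado", "entity_x"], kg_attributes, wanted)
```

Here only "population" is matched in the corpus, so "Colorado" is kept and `entity_x`, which lacks that attribute, is discarded. If no attribute label is matched, `wanted` is empty and every candidate passes the filter.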
It should be noted that, in the above-mentioned processes of selecting a related entity with the greatest similarity to the target entity from the related entities of the target entity and selecting a replacement entity with the target attribute from the replacement entities, the related entities of the target entity are screened from different angles.
In order to improve the relevance between the selected replacement entity and the target entity, as a preferred implementation manner, the embodiment of the present application combines and applies the entity screening schemes from the two angles, so that the selected replacement entity is both an entity sufficiently similar to the target entity and an entity matched with the attribute information included in the target corpus, and thus, after the entity in the target corpus is replaced by the replacement entity, the semantic similarity between the obtained replaced corpus and the target corpus is higher.
In practical implementation of the technical solution of the embodiment of the present application, any one of the entity screening schemes based on different angles may be used to determine a replacement entity, for example, only the related entity with the greatest similarity to the target entity is selected from the related entities of the target entity to serve as the replacement entity, or only the related entity with the target attribute is selected from the related entities of the target entity to serve as the replacement entity, so that the semantic similarity between the replaced corpus and the target corpus, which is obtained by replacing the entity in the target corpus with the selected replacement entity, can be ensured to a certain extent, and the implementation of the technical solution of the embodiment of the present application is not affected.
On the other hand, when the number of the entity names in the target corpus is greater than 1, determining each related entity group corresponding to the entity name group consisting of the entity names in the target corpus as a replacement entity.
Wherein the related entity group corresponding to the entity designation group includes one related entity of the target entity corresponding to each entity designation in the entity designation group; the relationship between the entity designations included in the entity designation group is the same as the relationship between the entities included in the related entity groups corresponding to the entity designation group.
Specifically, when the target corpus includes a plurality of entity names, the entity names usually have a specific relationship, and the relationship between the entity names together form the content of the target corpus.
After determining the related entities of the target entity corresponding to each entity designation in the target corpus, the embodiment of the present application selects one related entity from the related entities of the target entity corresponding to each entity designation, and combines the selected related entities to obtain a related entity group, so that a plurality of related entity groups can be obtained. Then, from each related entity group, a related entity group is selected, as a related entity group corresponding to the entity designation group, where the relationship between related entities included in the related entity group is the same as the relationship between entity designations included in the target corpus, and each related entity group is used as a replacement entity.
When entity designation replacement is carried out, all related entities contained in a related entity group are used to replace, simultaneously and respectively, the entity designations in the target corpus, yielding a replaced corpus. This replacement processing is performed for each related entity group serving as a replacement entity, so that each replaced corpus is obtained.
Generally, when an entity identification corpus contains multiple entity designations, it specifically contains two, and the two entity designations together with the relation between them can form a triple. A triple is generally represented in the form (subject, predicate, object); specifically, a relation triple is (entity, relation, entity), and an attribute triple is (entity, attribute, value). For an entity identification corpus having two entity designations, the two entity designations and their relation can form a relation triple.
When the target corpus includes two entity names, determining each related entity group corresponding to the entity name group formed by the entity names in the target corpus as a replacement entity, including:
firstly, the related entities of the target entity corresponding to the first entity designation included in the entity designation group and the related entities of the target entity corresponding to the second entity designation included in the entity designation group are combined in pairs to obtain each related entity group.
Exemplarily, all relations between the first entity designation h and the second entity designation t in the target corpus are found to obtain a triple set $T_0$. Each triple $t_i$ in the triple set $T_0$ consists of the first entity designation h, the second entity designation t, and one relation between them.
Meanwhile, related entities of the target entity corresponding to the first entity designation h are respectively determined, and a related entity set is obtained
Figure BDA0002433875150000171
And determining related entities of the target entity corresponding to the second entity designation t to obtain a related entity set
Figure BDA0002433875150000172
The entities in the related entity set
Figure BDA0002433875150000173
and the related entity set
Figure BDA0002433875150000174
are combined pairwise to obtain the related entity groups. It will be appreciated that, since each related entity group includes two entities and there must be some relation between the two entities, each related entity group can also form a triple.
The above process can also be expressed as the following formula:
Figure BDA0002433875150000181
wherein $T_0$ represents a triple set in which each element t is a triple formed by a related entity group and the relation between the related entities it contains; τ represents all triples in the knowledge graph; and $r_i$ may be any triple relation in the knowledge graph.
It can be understood that the entities contained in each triple of the triple set calculated by the above formula form an entity group, that is, a related entity group in the embodiment of the present application.
Then, from the related entity groups, selecting the related entity groups with the same relationship with the entity names contained in the entity name group as the replacement entities.
It can be understood that, in the above process of obtaining the related entity groups, the related entities corresponding to the two entity designations are combined in pairs, so the relation between the related entities contained in each related entity group may be any relation. In fact, only when the relation between the related entities in a related entity group is the same as the relation between the entity designations in the entity designation group does replacing the entity designation group with that related entity group yield a replaced corpus whose semantic similarity with the target corpus is ensured.
Therefore, in the embodiment of the present application, after the related entity groups are obtained, those related entity groups in which the relation between the contained related entities is the same as the relation between the entity designations contained in the entity designation group are further selected as the replacement entities.
For example, for each relation triple $t_i \in T_0$, where $t_i = (h, r, t)$, the corresponding triple set and the formula for calculating it are as follows:
Figure BDA0002433875150000182
Figure BDA0002433875150000183
wherein, for each triple t in the triple set
Figure BDA0002433875150000184
the relation p between its two entities s and o is the same as the relation r between the entity designations h and t in the relation triple $t_i$.
Synthesizing the triple sets corresponding to the relation triples to obtain a replacement triple set corresponding to the target corpus, and extracting the related entity groups contained in each triple in the replacement triple set respectively to obtain a related entity group set
Figure BDA0002433875150000185
It will be appreciated that, in the related entity group set, the relation between the related entities contained in each related entity group is the same as the relation between the entity designations contained in the entity designation group in the target corpus, so each related entity group in the set can be used as a replacement entity to replace the entity designation group in the target corpus.
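The relation-preserving selection of related entity groups can be sketched as a filter over knowledge-graph triples, assuming triples are stored as (subject, relation, object) tuples. The function name, relation names, and toy triples are hypothetical.

```python
def replacement_groups(kg_triples, relation, related_h, related_t):
    """Pairs (s, o) linked by `relation`, with s related to h and o related to t."""
    return [
        (s, o)
        for (s, p, o) in kg_triples
        if p == relation and s in related_h and o in related_t
    ]


# Toy knowledge-graph triples (hypothetical contents).
kg_triples = [
    ("Berlin", "capital_of", "Germany"),
    ("Rome", "capital_of", "Italy"),
    ("Rome", "trades_with", "Germany"),   # right entities, wrong relation
    ("Madrid", "capital_of", "Spain"),    # right relation, entities not related
]

# Related entities of the two target entities (hypothetical).
related_h = {"Berlin", "Rome"}
related_t = {"Germany", "Italy"}

groups = replacement_groups(kg_triples, "capital_of", related_h, related_t)
```

Filtering existing triples, rather than forming every pairwise combination first, gives the same result here because a pair without the required relation would be discarded anyway.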
Corresponding to the above data enhancement method, an embodiment of the present application further provides a data enhancement apparatus, as shown in fig. 5, the apparatus includes:
a target entity determining unit 100, configured to determine a target entity from the knowledge graph, where the target entity is an entity corresponding to an entity designation in the target corpus;
a related entity determining unit 110, configured to determine, from the knowledge graph, each related entity of the target entity;
and an entity replacing unit 120, configured to replace the entity name in the target corpus with the related entity of the target entity, where the obtained replaced corpus forms an extended corpus of the target corpus.
When the data enhancement device provided by the embodiment of the application is applied to data enhancement of entity identification corpora, a target entity corresponding to an entity designation in a target corpus is first determined from a knowledge graph, and the related entities of the target entity are then determined from the knowledge graph. The entity designation in the target corpus is replaced with the related entities of the target entity, and the resulting replaced corpora form the extended corpus of the target corpus. In this process, the related entities of the target entity corresponding to the entity designation in the target corpus are retrieved from the knowledge graph and used as replacement content for that entity designation, yielding new corpora and thereby achieving corpus expansion. Moreover, given the massive data volume of the knowledge graph, this processing yields a large number of related entities, so a single execution of the scheme produces a large number of extended corpora for the target corpus and quickly achieves the purpose of data enhancement.
Furthermore, because the knowledge graph thoroughly records the relations among entities as well as information such as the attributes and categories of each entity, selecting the related entities of the target entity from the knowledge graph as replacement content for the entity designation ensures a more comprehensive correlation between the replacement content and the entity designation, so that the new corpus obtained by replacing the entity designation in the original target corpus retains the semantic information of the original target corpus more fully.
As an optional implementation manner, when determining, from the knowledge graph, each related entity of the target entities, the related entity determining unit 110 is specifically configured to:
and determining each related entity of the target entity from the knowledge graph according to the category label of the target entity and the target corpus.
As an optional implementation manner, the determining, according to the category label of the target entity and the target corpus, each related entity of the target entity from the knowledge graph includes:
according to the category label of the target entity and the target corpus, screening out a category label with the maximum similarity with the target corpus from the category labels of the target entity as a target category label;
and screening out the entity with the target category label from the knowledge graph as a related entity of the target entity.
As an optional implementation manner, the screening, according to the category label of the target entity and the target corpus, a category label with the maximum similarity to the target corpus from the category labels of the target entity as a target category label includes:
respectively calculating the similarity between the feature vector of each category label of the target entity and the feature vector of the target corpus;
and screening out, from the category labels of the target entity, the category label whose feature vector has the maximum similarity with the feature vector of the target corpus, and taking it as the target category label.
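The category-label route can be sketched as below, assuming cosine similarity as the similarity measure and assuming that feature vectors for the category labels and the target corpus are already available. The function names, label names, vectors, and entity names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_target_label(label_vecs, corpus_vec):
    """Category label whose feature vector is most similar to the corpus vector."""
    return max(label_vecs, key=lambda label: cosine(label_vecs[label], corpus_vec))


def entities_with_label(kg_labels, label, exclude):
    """All entities other than `exclude` that carry the given category label."""
    return sorted(e for e, labels in kg_labels.items() if label in labels and e != exclude)


# Feature vectors of the target entity's category labels and of the corpus (toy values).
label_vecs = {"city": [0.9, 0.1], "capital": [0.2, 0.9]}
corpus_vec = [0.1, 1.0]

# Toy knowledge graph: entity -> category labels.
kg_labels = {
    "Baghdad": {"city", "capital"},
    "Cairo": {"city", "capital"},
    "Basra": {"city"},
}

target_label = pick_target_label(label_vecs, corpus_vec)
related = entities_with_label(kg_labels, target_label, exclude="Baghdad")
```

Picking a single best-matching label narrows the candidates to entities of the category that the corpus is actually about, instead of every entity sharing any label with the target entity.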
As an optional implementation manner, when determining the related entities of the target entity from the knowledge graph, the related entity determining unit 110 is specifically configured to:
determine the related entities of the target entity from the knowledge graph according to the network structure of the knowledge graph.
As an optional implementation manner, the determining, according to the network structure of the knowledge graph, a relevant entity of the target entity from the knowledge graph includes:
respectively determining a graph network feature vector of each entity in the knowledge graph based on the network structure of the knowledge graph; the graph network feature vector of an entity refers to a feature vector containing the features of the graph network structure of the entity, and the graph network structure of an entity refers to the network structure formed by the entity and its adjacent entities in the knowledge graph;
and determining related entities of the target entity from the knowledge graph by comparing the similarity between the graph network feature vector of the target entity and the graph network feature vectors of the other entities in the knowledge graph.
As an optional implementation manner, the separately determining a graph network feature vector of each entity in the knowledge graph based on the network structure of the knowledge graph includes:
initializing the feature vector of each entity in the knowledge graph to obtain an entity feature matrix of the knowledge graph;
processing the entity feature matrix of the knowledge graph by using a pre-trained graph convolution neural network, and determining a graph network feature vector of each entity in the knowledge graph;
wherein the graph convolution neural network is obtained by training with the objective of optimizing the feature vectors of entity samples into graph network feature vectors.
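The steps above can be sketched in a few lines, as an illustration only. The sketch assumes a single graph convolution layer of the common form ReLU(Â·H·W) with a symmetrically normalized adjacency matrix, one-hot initial feature vectors, an identity weight matrix standing in for pre-trained weights, and cosine similarity for the comparison. None of these choices is specified in the text, and all function and variable names are hypothetical.

```python
import math

def normalized_adjacency(edges, n):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected graph with n nodes."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    degree = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]

def matmul(x, y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def gcn_layer(a_hat, features, weights):
    """One graph convolution layer: ReLU(A_hat . H . W)."""
    z = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in z]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(vectors, target, k):
    """Indices of the k entities whose vectors are closest to the target's."""
    scored = [(cosine(vectors[target], vec), idx)
              for idx, vec in enumerate(vectors) if idx != target]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy graph (hypothetical): A and B are both linked to C; D is isolated.
names = ["A", "B", "C", "D"]
edges = [(0, 2), (1, 2)]
n = len(names)
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

entity_feature_matrix = identity   # initialized entity feature vectors
weights = identity                 # stand-in for pre-trained GCN weights
graph_vectors = gcn_layer(normalized_adjacency(edges, n),
                          entity_feature_matrix, weights)
related = [names[i] for i in most_similar(graph_vectors, 0, 2)]
print(related)
```

With real data the weight matrix would come from training rather than being the identity, and several layers might be stacked so that each vector reflects a wider neighbourhood of the entity.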
As an optional implementation manner, when replacing the entity designation in the target corpus with the related entities of the target entity so that the obtained replaced corpora form the extended corpus of the target corpus, the entity replacing unit 120 is specifically configured to:
selecting a related entity from related entities of the target entity as a replacement entity;
and replacing the entity designation in the target corpus with the replacement entity, the obtained replaced corpora forming the extended corpus of the target corpus.
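As a minimal sketch of the replacement step, assuming the entity designation appears in the target corpus as a literal substring and that the replacement entities have already been selected, the extended corpus can be built as below. The function name and the example sentence are hypothetical.

```python
def build_extended_corpus(target_corpus, designation, replacement_entities):
    """Return one replaced sentence per replacement entity.

    str.replace substitutes every occurrence of the designation; this is a
    simplification, since a real system would replace only the recognised span.
    """
    return [target_corpus.replace(designation, entity)
            for entity in replacement_entities]

target_corpus = "Apple released a new phone."
extended = build_extended_corpus(target_corpus, "Apple", ["Microsoft", "Google"])
print(extended)
```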
As an optional implementation manner, the selecting a related entity from related entities of the target entity, as a replacement entity, includes:
when the number of the entity designations in the target corpus is 1, selecting a related entity from related entities of the target entity as a replacement entity;
when the number of the entity names in the target corpus is more than 1, determining each related entity group corresponding to the entity name group consisting of the entity names in the target corpus as a replacement entity;
wherein the related entity group corresponding to the entity designation group includes one related entity of the target entity corresponding to each entity designation in the entity designation group; the relationship between the entity designations included in the entity designation group is the same as the relationship between the entities included in the related entity groups corresponding to the entity designation group.
As an optional implementation manner, the selecting a related entity from related entities of the target entity, as a replacement entity, includes:
selecting, from the related entities of the target entity, a set number of related entities with the highest similarity to the target entity as replacement entities.
As an optional implementation manner, the selecting a related entity from related entities of the target entity, as a replacement entity, further includes:
selecting a replacement entity with a target attribute from the replacement entities;
wherein the target attribute is an attribute which is possessed by the target entity and is contained in the target corpus.
As an optional implementation manner, the selecting, from the replacement entities, a replacement entity having a target attribute includes:
determining the attribute of the target entity contained in the target corpus as a target attribute by searching the attribute tag of the target entity from the target corpus;
selecting a replacement entity having the target attribute from the respective replacement entities.
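The attribute-based selection can be illustrated as follows. The sketch assumes that attribute tags are plain strings matched by substring search in the target corpus and that each candidate's attributes are available as a set. The data layout, the function names, and the example entities are hypothetical.

```python
def find_target_attributes(target_corpus, target_entity_attributes):
    """Attributes of the target entity whose tag appears in the target corpus."""
    return [attr for attr in target_entity_attributes if attr in target_corpus]

def filter_by_attributes(candidates, entity_attributes, target_attributes):
    """Keep only candidates that possess every target attribute."""
    return [candidate for candidate in candidates
            if all(attr in entity_attributes.get(candidate, set())
                   for attr in target_attributes)]

target_corpus = "Who is the founder of Apple?"
apple_attributes = ["founder", "headquarters", "stock code"]
candidates = ["Microsoft", "Google", "OpenStreetMap"]
entity_attributes = {
    "Microsoft": {"founder", "headquarters"},
    "Google": {"founder"},
    "OpenStreetMap": {"headquarters"},
}

target_attributes = find_target_attributes(target_corpus, apple_attributes)
replacements = filter_by_attributes(candidates, entity_attributes, target_attributes)
print(target_attributes, replacements)
```

Filtering this way avoids producing a replaced sentence that asks about an attribute the replacement entity does not have.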
As an optional implementation manner, when the target corpus includes two entity names, the determining each related entity group corresponding to an entity name group composed of the entity names in the target corpus includes, as a replacement entity:
respectively combining the related entities of the target entity corresponding to the first entity designation contained in the entity designation group and the related entities of the target entity corresponding to the second entity designation contained in the entity designation group in pairs to obtain each related entity group;
and selecting, from the related entity groups, each related entity group whose entities have the same relation as the entity designations contained in the entity designation group, as a replacement entity.
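For the two-designation case, the pairwise combination and relation check can be sketched as below. The sketch assumes the knowledge graph is available as directed (head, relation, tail) triples and that "the same relation" means an identical, non-empty set of relations between the two entities. Both are assumptions, and all names are hypothetical.

```python
from itertools import product

def relations_between(triples, head, tail):
    """Set of relation names that link head to tail in the triple list."""
    return {rel for h, rel, t in triples if h == head and t == tail}

def related_entity_groups(triples, target_pair, related_first, related_second):
    """Pair up related entities and keep pairs sharing the target pair's relation."""
    wanted = relations_between(triples, *target_pair)
    if not wanted:
        return []
    return [(a, b) for a, b in product(related_first, related_second)
            if relations_between(triples, a, b) == wanted]

triples = [
    ("Steve Jobs", "founder_of", "Apple"),
    ("Bill Gates", "founder_of", "Microsoft"),
    ("Larry Page", "founder_of", "Google"),
]
groups = related_entity_groups(
    triples,
    target_pair=("Steve Jobs", "Apple"),
    related_first=["Bill Gates", "Larry Page"],
    related_second=["Microsoft", "Google"],
)
print(groups)
```

Of the four possible pairs, only the two that preserve the founder relation survive, so a sentence such as "Steve Jobs founded Apple" is not expanded into a factually wrong pairing.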
For the specific operation of each unit of the data enhancement device, refer to the above method embodiments; the details are not repeated here.
Another embodiment of the present application further discloses a data enhancement device, as shown in fig. 6, the device includes:
a memory 200 and a processor 210;
wherein, the memory 200 is connected to the processor 210 for storing programs;
the processor 210 is configured to implement the data enhancement method disclosed in any of the above embodiments by running the program stored in the memory 200.
Specifically, the data enhancement device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are connected to each other through a bus. Wherein:
a bus may include a path that transfers information between components of a computer system.
The processor 210 may be a general-purpose processor, such as a general-purpose Central Processing Unit (CPU) or a microprocessor; an Application-Specific Integrated Circuit (ASIC); or one or more integrated circuits for controlling the execution of programs in accordance with the present invention. It may also be a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The processor 210 may include a main processor and may also include a baseband chip, modem, and the like.
The memory 200 stores programs for executing the technical solution of the present invention, and may also store an operating system and other key services. In particular, the program may include program code including computer operating instructions. More specifically, memory 200 may include a read-only memory (ROM), other types of static storage devices that may store static information and instructions, a Random Access Memory (RAM), other types of dynamic storage devices that may store information and instructions, a disk storage, a flash, and so forth.
The input device 230 may include a means for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 240 may include equipment that allows output of information to a user, such as a display screen, a printer, speakers, and the like.
Communication interface 220 may include any transceiver-like device for communicating with other devices or communication networks, such as Ethernet, a Radio Access Network (RAN), or a Wireless Local Area Network (WLAN).
The processor 210 executes the programs stored in the memory 200 and invokes the other devices, which may be used to implement the steps of the data enhancement methods provided by the embodiments of the present application.
Another embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the data enhancement method provided in any of the above embodiments.
The detailed working contents of the above-mentioned parts of the data enhancement device and the detailed processing contents of the computer program in the storage medium when executed by the processor can be referred to the contents of the above-mentioned method embodiments of the present application, and are not repeated here.
While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or a combination of acts, it will be appreciated by those skilled in the art that the present application is not limited by the order of the acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, similar parts of the embodiments may be referred to one another, and technical features described in the embodiments may be replaced or combined with each other. Because the device embodiments are basically similar to the method embodiments, they are described briefly; for the relevant points, refer to the corresponding description of the method embodiments.
The steps in the method of the embodiments of the present application may be sequentially adjusted, combined, and deleted according to actual needs.
The modules and sub-modules in the device and the terminal in the embodiments of the application can be combined, divided and deleted according to actual needs.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of a module or a sub-module is only one logical division, and there may be other divisions when the terminal is actually implemented, for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules or sub-modules described as separate parts may or may not be physically separate, and parts that are modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed over a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, each functional module or sub-module in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated into one module. The integrated modules or sub-modules may be implemented in the form of hardware, or may be implemented in the form of software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of data enhancement, comprising:
determining a target entity from the knowledge graph, wherein the target entity is an entity corresponding to the entity designation in the target corpus;
determining each related entity of the target entity from the knowledge graph;
and replacing the entity reference in the target corpus by using the related entity of the target entity, wherein the obtained replaced corpora form the extended corpus of the target corpus.
2. The method of claim 1, wherein the determining, from the knowledge-graph, each related entity of the target entity comprises:
respectively calculating the similarity between the feature vector of each category label of the target entity and the feature vector of the target corpus;
screening out, from the category labels of the target entity, the category label whose feature vector has the maximum similarity with the feature vector of the target corpus, and taking that category label as a target category label;
and screening out the entity with the target category label from the knowledge graph as a related entity of the target entity.
3. The method of claim 1, wherein the determining the related entities of the target entity from the knowledge-graph comprises:
respectively determining a graph network feature vector of each entity in the knowledge graph based on the network structure of the knowledge graph; the graph network feature vector of an entity refers to a feature vector containing the features of the graph network structure of the entity, and the graph network structure of an entity refers to the network structure formed by the entity and its adjacent entities in the knowledge graph;
and determining related entities of the target entity from the knowledge graph by comparing the similarity between the graph network feature vector of the target entity and the graph network feature vectors of the other entities in the knowledge graph.
4. The method according to claim 1, wherein said replacing entity references in said target corpus with related entities of said target entity to obtain replaced corpora constituting said augmented corpus of said target corpus, comprises:
selecting a related entity from related entities of the target entity as a replacement entity;
and replacing the entity reference in the target corpus with the replacement entity, the obtained replaced corpora forming the extended corpus of the target corpus.
5. The method of claim 4, wherein the selecting a related entity from the related entities of the target entity as a replacement entity comprises:
when the number of the entity designations in the target corpus is 1, selecting a related entity from related entities of the target entity as a replacement entity;
when the number of the entity names in the target corpus is more than 1, determining each related entity group corresponding to the entity name group consisting of the entity names in the target corpus as a replacement entity;
wherein the related entity group corresponding to the entity designation group includes one related entity of the target entity corresponding to each entity designation in the entity designation group; the relationship between the entity designations included in the entity designation group is the same as the relationship between the entities included in the related entity groups corresponding to the entity designation group.
6. The method of claim 5, wherein the selecting a related entity from the related entities of the target entity as a replacement entity comprises:
determining the attribute of the target entity contained in the target corpus as a target attribute by searching the attribute tag of the target entity from the target corpus;
selecting a related entity with the target attribute from related entities of the target entity as a replacement entity;
wherein the target attribute is an attribute which is possessed by the target entity and is contained in the target corpus.
7. The method according to claim 5, wherein when the target corpus includes two entity references, said determining each related entity group corresponding to an entity reference group consisting of the entity references in the target corpus as a replacement entity comprises:
respectively combining the related entities of the target entity corresponding to the first entity designation contained in the entity designation group and the related entities of the target entity corresponding to the second entity designation contained in the entity designation group in pairs to obtain each related entity group;
and selecting, from the related entity groups, each related entity group whose entities have the same relation as the entity designations contained in the entity designation group, as a replacement entity.
8. A data enhancement apparatus, comprising:
a target entity determining unit, configured to determine a target entity from the knowledge graph, wherein the target entity is an entity corresponding to the entity designation in the target corpus;
a related entity determining unit, configured to determine, from the knowledge graph, each related entity of the target entity;
and an entity replacing unit, configured to replace the entity references in the target corpus with the related entities of the target entity, the obtained replaced corpora forming the extended corpus of the target corpus.
9. A data enhancement device, comprising:
a memory and a processor;
wherein the memory is connected with the processor and used for storing programs;
the processor is configured to implement the data enhancement method according to any one of claims 1 to 7 by executing the program in the memory.
10. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, implements the data enhancement method of any one of claims 1 to 7.
CN202010245498.3A 2020-03-31 2020-03-31 Data enhancement method, device, equipment and storage medium Pending CN111460825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010245498.3A CN111460825A (en) 2020-03-31 2020-03-31 Data enhancement method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010245498.3A CN111460825A (en) 2020-03-31 2020-03-31 Data enhancement method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111460825A true CN111460825A (en) 2020-07-28

Family

ID=71685238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010245498.3A Pending CN111460825A (en) 2020-03-31 2020-03-31 Data enhancement method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111460825A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398943A (en) * 2021-12-09 2022-04-26 北京百度网讯科技有限公司 Sample enhancement method and device thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537065A (en) * 2014-12-29 2015-04-22 北京奇虎科技有限公司 Search result pushing method and system
CN109062939A (en) * 2018-06-20 2018-12-21 广东外语外贸大学 A kind of intelligence towards Chinese international education leads method
CN109726269A (en) * 2017-10-30 2019-05-07 松下知识产权经营株式会社 Information processing method, information processing unit and program
CN110263338A (en) * 2019-06-18 2019-09-20 北京明略软件系统有限公司 Replace entity name method, apparatus, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200728