CN111708891B - Food material entity linking method and device between multi-source food material data - Google Patents

Info

Publication number
CN111708891B
Authority
CN
China
Prior art keywords
food material
entity
food
data
names
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910156613.7A
Other languages
Chinese (zh)
Other versions
CN111708891A (en)
Inventor
朱泽春
钟敬德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Joyoung Co Ltd
Original Assignee
Joyoung Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Joyoung Co Ltd filed Critical Joyoung Co Ltd
Priority to CN201910156613.7A priority Critical patent/CN111708891B/en
Publication of CN111708891A publication Critical patent/CN111708891A/en
Application granted granted Critical
Publication of CN111708891B publication Critical patent/CN111708891B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a food material entity linking method and device between multi-source food material data, wherein the method comprises the following steps: acquiring a candidate food material entity set for food material entity linking; and matching first text data to be entity-linked, acquired from any food material data source, against the candidate food material entity set, and establishing a link between a first food material entity in the first text data and the successfully matched second food material entity in the candidate food material entity set. By this embodiment, text information describing food materials in different food material data sources is effectively entity-linked with the food material entities in the created knowledge graph, and the accuracy of food material entity linking is improved.

Description

Food material entity linking method and device between multi-source food material data
Technical Field
The embodiment of the invention relates to a technology for constructing a food material knowledge graph, in particular to a method and a device for linking food material entities among multi-source food material data.
Background
With the development of knowledge graph technology, applications built on knowledge graphs, such as semantic search and intelligent question answering, have become more convenient and efficient. When a knowledge graph for nutrition and health is constructed, food material entities must be established. However, people name and describe the same food material entity in different ways, so one food material commonly has several different names. Concretely, different nutrition and food websites maintain different food material classifications and different food material names for the same food material entity, which makes constructing a nutrition and health knowledge graph challenging.
Disclosure of Invention
The embodiment of the invention provides a food material entity linking method and device between multi-source food material data, which can efficiently link text information describing food materials from different food material data sources with food material entities in a created knowledge graph (or knowledge base) and improve the accuracy of the food material entity linking.
In order to achieve the purpose of the embodiments of the present invention, the embodiments of the present invention provide a method for linking food material entities between multi-source food material data, where the method may include:
acquiring a candidate food material entity set for food material entity linking;
and matching first text data to be entity-linked, acquired from any food material data source, against the candidate food material entity set, and establishing a link between a first food material entity in the first text data and the successfully matched second food material entity in the candidate food material entity set.
In an exemplary embodiment of the present invention, the acquiring the candidate food entity set for food entity linking includes:
establishing a knowledge base of food material entities according to the food material data in the first food material data source;
establishing a synonym library of food material entities according to food material data in the second food material data source;
Acquiring text data about food material entities from any food material data source, and forming a food material entity text data set through the text data;
acquiring the candidate food entity set according to the food entity text data set, the knowledge base and the synonym base;
wherein the first food material data source includes: a food material encyclopedia;
the second food material data source includes: the Chinese Food Composition Table and/or Wikipedia.
In an exemplary embodiment of the present invention, the obtaining the candidate food entity set according to the food entity text data set, the knowledge base, and the synonym bank includes:
word segmentation is carried out on each text data segment in the food entity text data set;
calculating, for each word in the word segmentation result, a first term frequency-inverse document frequency (TF-IDF) value in the knowledge base and a second TF-IDF value in the synonym library;
and comparing the first TF-IDF value and the second TF-IDF value with a preset TF-IDF threshold, and taking as the candidate food material entity set the set of food material entities corresponding to words for which the first TF-IDF value is larger than the TF-IDF threshold and/or the second TF-IDF value is larger than the TF-IDF threshold.
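The candidate selection described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes the text has already been segmented into tokens, that each entity in the knowledge base and in the synonym library is represented by a token list, and that a smoothed IDF is acceptable. The names `tfidf`, `candidate_entities`, `kb_docs` and `syn_docs` are hypothetical.

```python
import math
from collections import Counter


def tfidf(word, doc_tokens, all_docs):
    """TF-IDF of `word` in one tokenized document, relative to `all_docs`."""
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    df = sum(1 for d in all_docs if word in d)
    # Smoothed IDF; the exact IDF variant is an assumption of this sketch.
    idf = math.log(len(all_docs) / (1 + df)) + 1.0
    return tf * idf


def candidate_entities(query_tokens, kb_docs, syn_docs, threshold):
    """Return IDs of entities for which some query word exceeds the threshold
    in the knowledge base and/or in the synonym library."""
    candidates = set()
    for docs in (kb_docs, syn_docs):
        all_docs = list(docs.values())
        for entity_id, tokens in docs.items():
            if any(tfidf(w, tokens, all_docs) > threshold for w in query_tokens):
                candidates.add(entity_id)
    return candidates


# Toy data: entity ID -> tokens describing that entity.
kb_docs = {"E1": ["potato", "tuber", "starch"], "E2": ["tomato", "fruit", "red"]}
syn_docs = {"E1": ["potato", "spud"], "E3": ["yam"]}
print(candidate_entities(["spud", "fried"], kb_docs, syn_docs, threshold=0.2))
```

Here the word "spud" only occurs in the synonym library, which is enough to make entity E1 a candidate under the "and/or" condition.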
In an exemplary embodiment of the present invention, the establishing a knowledge base of food material entities from food material data in the first food material data source includes:
extracting food material information from the first food material data source; the food material information includes: class labels of all food materials and various attributes of pages corresponding to each food material entity in each class label;
cleaning the food material information to obtain descriptions meeting preset requirements about each food material entity;
expert auditing is carried out on the cleaned food material information so as to modify error data and nonstandard data in the description about the food material entity;
and storing the descriptions about the food material entities, which are subjected to expert verification, into a preset food material knowledge graph to serve as the knowledge base.
In an exemplary embodiment of the present invention, when the second food material data source is "chinese food composition table", the establishing a synonym dictionary of food material entities according to food material data in the second food material data source includes:
extracting food material names from the Chinese Food Composition Table document according to a preset rule by using a text read-write tool to form a food material name set; extracting one or more aliases corresponding to each food material name by using a regular expression to form a food material alias set;
traversing the food material name set and the food material alias set, and aligning them with the food material names and aliases in the knowledge base by using a character string matching rule;
and when the alignment is successful, recording the mapping relation between the identity (ID) of the corresponding food material entity in the knowledge base and the food material names of the same food material entity in the Chinese Food Composition Table to form the synonym library.
In an exemplary embodiment of the present invention, when the second food material data source is the wikipedia, the establishing the synonym dictionary of food material entities according to the food material data in the second food material data source includes:
traversing all food material names in the knowledge base, searching the food material names in the wikipedia by utilizing a crawler technology, and reserving webpage data of a search result;
when the food material name of the search result is the same as the searched food material name in the knowledge base, extracting the aliases of the food material corresponding to the current food material name, and taking the extracted aliases as synonyms of the currently searched food material in the knowledge base; wherein the aliases in a Wikipedia entry are located in one or more bold text spans in the first or second paragraph;
when the food material name of the search result is different from the searched food material name in the knowledge base, extracting one or more food material names from the search result, and taking them as synonyms of the currently searched food material in the knowledge base;
and after expert auditing, forming the extracted synonym data into the synonym library or merging it into a previously constructed synonym library, and removing repeated data.
In an exemplary embodiment of the present invention, the matching the first text data to be linked with the candidate food material entity set, which is acquired from any food material data source, and the building a link between a first food material entity in the first text data and a second food material entity in the candidate food material entity set, which is successfully matched, includes:
the first text data is segmented, stop words are removed, a first context of the first text data is obtained, and class labels, food names and description words of each candidate food entity in the candidate food entity set are segmented to serve as a second context of the candidate food entity;
calculating semantic similarity of the first context and the second context according to a preset similarity algorithm;
Selecting the food material entity in the second context with the maximum similarity with the first context, and calculating whether the value of the maximum similarity is larger than a preset similarity threshold value;
and when the maximum similarity value is larger than the preset similarity threshold value, connecting the first food material entity in the first context to the second food material entity with the maximum similarity with the first context.
In an exemplary embodiment of the present invention, the preset similarity algorithm includes:
wherein x represents the food material name to be entity-linked in the first context, e represents the candidate food material entity, n represents the number of words in the first context, k represents the number of words in the second context, xc_i represents the i-th word in the first context, ec_j represents the j-th word in the second context, v(x) represents the word vector of x, v(xc_i) represents the word vector of xc_i, and v(ec_j) represents the word vector of ec_j, the word vectors being generated using a Skip-gram model; sim(v(x), v(xc_i)) represents the semantic similarity of x and xc_i, calculated as the cosine similarity of their word vectors, and serves as the weight of the word xc_i; and sim(v(xc_i), v(ec_j)) represents the semantic similarity of xc_i and ec_j, calculated as the cosine similarity of their word vectors.
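The formula itself is not reproduced in this text, so the following is only a sketch of one plausible reading of the symbol definitions above: each first-context word xc_i is weighted by its cosine similarity to x, multiplied by its cosine similarity to each second-context word ec_j, and the products are averaged over the n × k word pairs. The averaging, the function names `cosine` and `context_similarity`, and the toy vectors are assumptions; in practice the vectors would come from a trained Skip-gram model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def context_similarity(x, first_context, second_context, vectors):
    """Weighted similarity between the context of mention x and a candidate's context."""
    n, k = len(first_context), len(second_context)
    if n == 0 or k == 0:
        return 0.0
    total = 0.0
    for xc in first_context:
        weight = cosine(vectors[x], vectors[xc])  # sim(v(x), v(xc_i))
        for ec in second_context:
            total += weight * cosine(vectors[xc], vectors[ec])  # sim(v(xc_i), v(ec_j))
    return total / (n * k)  # normalization by n*k is an assumption


# Toy 2-dimensional vectors standing in for Skip-gram embeddings.
vectors = {"potato": [1.0, 0.0], "fried": [1.0, 0.0], "spud": [1.0, 0.0], "red": [0.0, 1.0]}
print(context_similarity("potato", ["fried"], ["spud"], vectors))
print(context_similarity("potato", ["fried"], ["red"], vectors))
```

The candidate whose context yields the highest score would then be compared against the preset similarity threshold.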
In an exemplary embodiment of the invention, the method further comprises:
outputting related entity information of the second food material entity in the knowledge base after the first food material entity is linked to the second food material entity; and/or,
outputting a null symbol (NULL) when no second food material entity in the candidate food material entity set matches the first food material entity in the first text data, and supplementing the first food material entity in the first text data into the knowledge base.
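The two output cases above can be illustrated with a short sketch. It assumes the similarity scores for all candidates have already been computed, represents the knowledge base as a plain dictionary, and uses Python's `None` for the NULL symbol; the function name `link_entity` and the `new:` ID scheme for supplemented entities are hypothetical.

```python
def link_entity(mention, scores, threshold, knowledge_base):
    """Link `mention` to the best-scoring candidate, or return None (NULL).

    scores: dict mapping candidate entity ID -> similarity with the mention's context.
    knowledge_base: dict mapping entity ID -> entity information.
    """
    if scores:
        best_id = max(scores, key=scores.get)
        if scores[best_id] > threshold:
            # Linked: output the related entity information from the knowledge base.
            return knowledge_base[best_id]
    # No match: supplement the mention into the knowledge base and output NULL.
    knowledge_base.setdefault(f"new:{mention}", {"name": mention})
    return None


kb = {"E1": {"name": "potato", "category": "tuber"}}
print(link_entity("spud", {"E1": 0.9}, 0.5, kb))
print(link_entity("dragon fruit", {"E1": 0.1}, 0.5, kb))
```

In a real system the supplemented entity would presumably go through the same expert auditing as other knowledge base entries before being trusted.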
The embodiment of the invention also provides a food material entity linking device between the multi-source food material data, which comprises a processor and a computer readable storage medium, wherein the computer readable storage medium stores instructions.
The beneficial effects of the embodiment of the invention can include:
1. The food material entity linking method between multi-source food material data of the embodiment of the invention may include: acquiring a candidate food material entity set for food material entity linking; and matching first text data to be entity-linked, acquired from any food material data source, against the candidate food material entity set, and establishing a link between a first food material entity in the first text data and the successfully matched second food material entity in the candidate food material entity set. This embodiment entity-links text information describing food materials in different food material data sources with the food material entities in the created knowledge graph (or knowledge base), and improves the accuracy of food material entity linking.
2. The obtaining the candidate food material entity set for food material entity link according to the embodiment of the invention may include: establishing a knowledge base of food material entities according to the food material data in the first food material data source; establishing a synonym library of food material entities according to food material data in the second food material data source; acquiring text data about food material entities from any food material data source, and forming a food material entity text data set through the text data; and acquiring the candidate food material entity set according to the food material entity text data set, the knowledge base and the synonym base. According to the embodiment, the method and the device for constructing the set of the current existing food material entities (namely the candidate food material entity set) in a large data mode through multiple food material data sources are achieved, so that the data of the candidate food material entity set are more accurate, and the accuracy of the link is increased.
3. The first food material data source of the embodiment of the invention includes: a food material encyclopedia; the second food material data source includes: the Chinese Food Composition Table and/or Wikipedia. In this embodiment, the food material encyclopedia, the Chinese Food Composition Table and Wikipedia are mature sources of food material information with rich content and complete information, which helps ensure the comprehensiveness and accuracy of the constructed knowledge base and synonym library.
4. The obtaining the candidate food entity set according to the food entity text data set, the knowledge base and the synonym bank according to the embodiment of the invention comprises the following steps: word segmentation is carried out on each text data segment in the food entity text data set; respectively calculating a first word frequency-inverse text frequency index TF-IDF value of words in the word segmentation result in the knowledge base and a second TF-IDF value in the synonym base; and respectively comparing the first TF-IDF value and the second TF-IDF value with a preset TF-IDF threshold value, and taking a set of food entity compositions corresponding to words meeting that the first TF-IDF value is larger than the TF-IDF threshold value and/or the second TF-IDF value is larger than the TF-IDF threshold value as the candidate food entity set. According to the embodiment, the TF-IDF value is used as a screening standard of food material entities, so that the accuracy of the knowledge base is further improved.
5. In the embodiment of the present invention, when the second food material data source is the Chinese Food Composition Table, establishing the synonym library of food material entities according to the food material data in the second food material data source includes: extracting food material names from the Chinese Food Composition Table document according to a preset rule by using a text read-write tool to form a food material name set; extracting one or more aliases corresponding to each food material name by using a regular expression to form a food material alias set; traversing the food material name set and the food material alias set, and aligning them with the food material names and aliases in the knowledge base by using a character string matching rule; and when the alignment is successful, recording the mapping relation between the identity (ID) of the corresponding food material entity in the knowledge base and the food material names of the same food material entity in the Chinese Food Composition Table to form the synonym library.
When the second food material data source is Wikipedia, establishing the synonym library of food material entities according to the food material data in the second food material data source includes: traversing all food material names in the knowledge base, searching for the food material names in Wikipedia by using a crawler, and retaining the webpage data of the search results; when the food material name of the search result is the same as the searched food material name in the knowledge base, extracting the aliases of the food material corresponding to the current food material name, and taking the extracted aliases as synonyms of the currently searched food material in the knowledge base, wherein the aliases in a Wikipedia entry are located in one or more bold text spans in the first or second paragraph; when the food material name of the search result is different from the searched food material name in the knowledge base, extracting one or more food material names from the search result, and taking them as synonyms of the currently searched food material in the knowledge base; and after expert auditing, forming the extracted synonym data into the synonym library or merging it into a previously constructed synonym library, and removing repeated data. Through this embodiment, the synonym library can be established from different food material data sources, which gives technicians several technical options, and synonym libraries established from different food material data sources can supplement one another, which helps ensure the comprehensiveness of the synonym library.
6. In the embodiment of the present invention, matching the first text data to be linked with the candidate food material entity set, which is acquired from any food material data source, and establishing a link between the first food material entity in the first text data and the second food material entity in the candidate food material entity set, where the matching is successful, may include: the first text data is segmented, stop words are removed, a first context of the first text data is obtained, and class labels, food names and description words of each candidate food entity in the candidate food entity set are segmented to serve as a second context of the candidate food entity; calculating semantic similarity of the first context and the second context according to a preset similarity algorithm; selecting the food material entity in the second context with the maximum similarity with the first context, and calculating whether the value of the maximum similarity is larger than a preset similarity threshold value; and when the maximum similarity value is larger than the preset similarity threshold value, connecting the first food material entity in the first context to the second food material entity with the maximum similarity with the first context. By the scheme, the type name of the food material entity and the context of the candidate food material entity are represented by using the word vector, the semantic similarity between the type name of the food material entity and the context of the candidate food material entity is calculated by the word vector, and the matching efficiency and accuracy of the first text data and the candidate food material entity set are improved.
Additional features and advantages of embodiments of the application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the application. The objectives and other advantages of embodiments of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings provide a further understanding of the technical solutions of the embodiments of the application and constitute a part of this specification. Together with the embodiments, they serve to explain the technical solutions of the embodiments of the application and do not limit them.
FIG. 1 is a flow chart of a method for linking food entities between multi-source food data according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for obtaining a candidate food entity set for food entity linking according to an embodiment of the application;
FIG. 3 is a flowchart of a method for creating a knowledge base of food material entities based on food material data in a first food material data source, according to an embodiment of the application;
FIG. 4 is a flowchart of a method for establishing a synonym library of food material entities according to the food material data in the second food material data source when the second food material data source is the Chinese food composition table according to the embodiment of the present application;
FIG. 5 is a flowchart of a method for establishing a synonym library of food material entities based on food material data in a second food material data source, according to an embodiment of the present disclosure, when the second food material data source is the Wikipedia;
FIG. 6 is a flowchart of a method for obtaining the candidate food entity set according to the food entity text data set, the knowledge base and the synonym base according to the embodiment of the present disclosure;
FIG. 7 is a flowchart of a method, according to an embodiment of the present application, for matching first text data to be entity-linked, acquired from any food material data source, against the candidate food material entity set, and establishing a link between a first food material entity in the first text data and the successfully matched second food material entity in the candidate food material entity set;
FIG. 8 is a block diagram of a food entity linking device between multi-source food data according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a food material entity linking method between multi-source food material data according to an embodiment of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in detail hereinafter with reference to the accompanying drawings. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be arbitrarily combined with each other.
The steps illustrated in the flowchart of the figures may be performed in a computer system, such as a set of computer-executable instructions. Also, while a logical order is depicted in the flowchart, in some cases, the steps depicted or described may be performed in a different order than presented herein.
In order to achieve the purpose of the embodiments of the present invention, the embodiments of the present invention provide a method for linking food material entities between multi-source food material data, as shown in fig. 1 and fig. 9, the method may include S101-S102:
S101, acquiring a candidate food material entity set for food material entity linking.
In an exemplary embodiment of the present invention, as shown in fig. 2, the acquiring the candidate food entity set for the food entity link may include S201-S204:
S201, establishing a knowledge base of food material entities according to the food material data in the first food material data source.
In an exemplary embodiment of the present invention, the knowledge base of food material entities may be a nutrition and health knowledge graph denoted as G = (E, R, P), where E represents the set of food material entity nodes in the graph, R represents the set of relationships between the food material entities in the graph, and P represents the set of attributes of the food material entities in the graph. An object of an embodiment of the present invention is to link food material text information from multiple food material data sources (e.g., websites) to the food material entities in E in the knowledge graph.
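As a rough illustration of the triple (E, R, P), the graph can be held in a small in-memory structure. This is only a sketch: the class name `FoodKnowledgeGraph`, its methods, and the example relation name `similar_to` are hypothetical, and a production knowledge graph system would typically use a graph database instead.

```python
from dataclasses import dataclass, field


@dataclass
class FoodKnowledgeGraph:
    entities: set = field(default_factory=set)      # E: entity IDs
    relations: set = field(default_factory=set)     # R: (head, relation, tail) triples
    properties: dict = field(default_factory=dict)  # P: entity ID -> {attribute: value}

    def add_entity(self, entity_id, **attrs):
        """Add an entity to E and record its attributes in P."""
        self.entities.add(entity_id)
        self.properties.setdefault(entity_id, {}).update(attrs)

    def add_relation(self, head, relation, tail):
        """Add a relation to R; both endpoints must already be in E."""
        if head not in self.entities or tail not in self.entities:
            raise KeyError("both endpoints must be entities in E")
        self.relations.add((head, relation, tail))


graph = FoodKnowledgeGraph()
graph.add_entity("E1", name="potato", category="tuber")
graph.add_entity("E2", name="sweet potato", category="tuber")
graph.add_relation("E1", "similar_to", "E2")
print(graph.properties["E1"])
```

Entity linking then amounts to mapping a food material mention in external text to one of the IDs in `entities`.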
In an exemplary embodiment of the present invention, the first food material data source may include, but is not limited to: a food material encyclopedia.
In an exemplary embodiment of the present invention, as shown in fig. 3, the establishing a knowledge base of food material entities according to food material data in the first food material data source may include S301-S304:
S301, extracting food material information from the first food material data source; the food material information includes: class labels of all food materials and various attributes of pages corresponding to each food material entity in each class label;
S302, cleaning the food material information to obtain descriptions meeting preset requirements about each food material entity;
S303, performing expert auditing on the cleaned food material information to modify error data and nonstandard data in the description about the food material entity;
S304, storing the descriptions about the food material entities, which are subjected to expert verification, into a preset food material knowledge graph to serve as the knowledge base.
In an exemplary embodiment of the present invention, a knowledge base may be created based on the food material encyclopedia data, and the data may be used as the data of the food material entity set E of the initial knowledge graph.
In an exemplary embodiment of the present invention, as shown in FIG. 3, related information such as the class labels of all food materials in the food material encyclopedia and the various attributes of the page corresponding to each food material entity may be extracted first. Then, the extracted description from the page where each food material entity is located may be cleaned to obtain an accurate description of the corresponding food material. Finally, the preprocessed (e.g., cleaned) food material entity data may be audited by nutrition and health experts, who mark and correct erroneous and nonstandard data; the audited food material entity information is then stored in a knowledge graph system to serve as the knowledge base of the embodiment of the invention.
S202, establishing a synonym library of food material entities according to food material data in the second food material data source.
In the exemplary embodiment of the present invention, there is no fixed order between this step and step S201; the two steps may be performed simultaneously.
In an exemplary embodiment of the present invention, the second food material data source may include, but is not limited to: the Chinese Food Composition Table and/or Wikipedia.
In the exemplary embodiment of the invention, a synonym library can be created by one second food material data source, or a plurality of synonym libraries can be created by a plurality of second food material data sources, and the synonym libraries are subjected to de-duplication processing to obtain a more perfect synonym library.
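Merging several synonym libraries with de-duplication can be sketched as below, assuming each library is a dictionary from a knowledge base entity ID to a list of names. The function name `merge_synonym_libraries` and the data layout are assumptions made for illustration.

```python
def merge_synonym_libraries(*libraries):
    """Merge synonym libraries (entity ID -> list of names), dropping duplicates
    while keeping the order in which names were first seen."""
    merged = {}
    for library in libraries:
        for entity_id, names in library.items():
            bucket = merged.setdefault(entity_id, [])
            for name in names:
                if name not in bucket:
                    bucket.append(name)
    return merged


from_table = {"E1": ["土豆", "洋芋"]}
from_wiki = {"E1": ["洋芋", "马铃薯"], "E2": ["番茄"]}
print(merge_synonym_libraries(from_table, from_wiki))
```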
In an exemplary embodiment of the present invention, as shown in fig. 4, when the second food material data source is "chinese food composition table", the establishing a synonym word library of food material entities according to the food material data in the second food material data source may include S401-S403:
S401, extracting food material names from the Chinese Food Composition Table document according to preset rules by using a text read-write tool to form a food material name set; and extracting one or more aliases corresponding to each food material name by using a regular expression to form a food material alias set.
In an exemplary embodiment of the present invention, the text read-write tool may be the POI tool, which can read and write office documents. Apache POI is an open source library of the Apache Software Foundation that provides Java programs with an API for reading and writing files in Microsoft Office formats.
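The regular-expression step of S401 might look like the following sketch. It is written in Python rather than Java and starts from text lines that have already been read out of the document, so it does not use Apache POI. The line format (a primary name followed by optional aliases in square brackets, separated by commas) is an assumption for illustration, as are the names `ENTRY` and `parse_entry`.

```python
import re

# Assumed line format: primary name, then optional aliases in square brackets,
# e.g. "马铃薯[土豆，洋芋]". The real document layout may differ.
ENTRY = re.compile(r"^\s*(?P<name>[^\[\]\s]+)\s*(?:\[(?P<aliases>[^\]]*)\])?\s*$")


def parse_entry(line):
    """Return (name, aliases) for one line, or None if the line does not match."""
    match = ENTRY.match(line)
    if not match:
        return None
    raw = match.group("aliases") or ""
    # Split on ASCII and full-width separators and drop empty pieces.
    aliases = [a.strip() for a in re.split(r"[,，、]", raw) if a.strip()]
    return match.group("name"), aliases


print(parse_entry("马铃薯[土豆，洋芋]"))
print(parse_entry("稻米"))
```

Applying `parse_entry` to every extracted line yields both the food material name set and the alias set in one pass.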
S402, traversing the food material name set and the food material unique name set, and aligning the food material names and the food material unique names in the knowledge base by using a character string matching rule.
In an exemplary embodiment of the present invention, the set of food material names extracted from the "chinese food composition table" and the corresponding set of food material names may be traversed, a character string matching method (e.g., character string perfect matching) is used to match the names and names of the food materials in the knowledge base, if the current food material name matches the unique food material name or name in the knowledge base, or if the current food material name matches the unique food material name or name in the knowledge base, the alignment is called, i.e., there is no ambiguity, step S403 may be continuously performed; otherwise, the current food material name and the current food material name are not matched with any food material name or name in the knowledge base, and the current food material name are determined to be the unaligned food material name and name, and the unaligned food material name and name can be added into the knowledge base or the synonymous word base after being subjected to the marking and auditing of a nutrition expert, or the S402 is executed again to perform the realignment operation, so that erroneous judgment is prevented.
And S403, when the alignment is successful, recording the mapping relation between the identity ID of the corresponding food material entity in the knowledge base and the food material names of the same food material entity in the Chinese food ingredient list to form the synonymous word library.
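Steps S401 to S403 can be illustrated with a minimal Python sketch. It is not the patented implementation: the entry format `name [alias1, alias2]`, the entity IDs, and the helper names `parse_entry` and `align` are assumptions made only for illustration, and plain strings stand in for the text that the read-write tool would extract from the document.

```python
import re

# Hypothetical knowledge base: entity ID -> canonical name and known aliases.
knowledge_base = {
    "F001": {"name": "马铃薯", "aliases": {"土豆"}},
    "F002": {"name": "番茄", "aliases": {"西红柿"}},
}

# Assumed entry layout "name [alias1, alias2]"; the real table layout may differ.
ENTRY = re.compile(r"^(?P<name>[^\[\]]+?)\s*(?:\[(?P<aliases>[^\]]*)\])?$")


def parse_entry(line):
    """S401: split one table entry into a name and a list of aliases."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None, []
    raw = match.group("aliases") or ""
    aliases = [a.strip() for a in re.split(r"[,,、]", raw) if a.strip()]
    return match.group("name"), aliases


def align(lines, kb):
    """S402/S403: exact string matching against knowledge-base names and aliases."""
    index = {}  # surface form -> set of entity IDs using that form
    for entity_id, record in kb.items():
        for form in {record["name"], *record["aliases"]}:
            index.setdefault(form, set()).add(entity_id)

    synonyms, unaligned = {}, []
    for line in lines:
        name, aliases = parse_entry(line)
        if name is None:
            continue
        forms = [name, *aliases]
        hits = set()
        for form in forms:
            hits |= index.get(form, set())
        if len(hits) == 1:  # unambiguous match: record the ID -> names mapping
            synonyms.setdefault(next(iter(hits)), set()).update(forms)
        else:  # no match or ambiguous match: leave for expert review
            unaligned.append(forms)
    return synonyms, unaligned
```

Entries that end up in `unaligned` correspond to the names that the text sends to a nutrition expert for labeling and review.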
In an exemplary embodiment of the present invention, as shown in fig. 5, when the second food material data source is Wikipedia, the establishing a synonym library of food material entities according to food material data in the second food material data source may include S501-S504:
S501, traversing all food material names in the knowledge base, searching for each food material name in Wikipedia by using a crawler, and retaining the webpage data of the search result.
S502, when the food material name of the search result is the same as the searched food material name from the knowledge base, extracting the aliases of the food material corresponding to the current food material name, and taking the extracted aliases as synonyms of the currently searched food material in the knowledge base; wherein the aliases of a Wikipedia entry are located in one or more pieces of bold text in the first or second paragraph.
S503, when the food material name of the search result differs from the searched food material name from the knowledge base, extracting one or more food material names from the search result, and taking the extracted food material names as synonyms of the currently searched food material in the knowledge base.
S504, forming the synonym library from the extracted synonym data after expert review, or merging the data into the previously constructed synonym library, and removing duplicate data.
In an exemplary embodiment of the present invention, all food material names in the knowledge base may be traversed and searched for in the Simplified Chinese version of Wikipedia by using a crawler; whether the search result is a specific entry page or a redirect page, the webpage data may be retained. When the search result is a specific entry page and its food material name is the same as the searched name from the knowledge base, Document Object Model (DOM) parsing (a DOM parser converts an Extensible Markup Language (XML) document into objects that can be accessed from JavaScript) may be used to obtain the aliases of the current food material. The aliases of a Wikipedia entry are generally located in one or more pieces of bold text in the first or second paragraph, and the extracted aliases may be used as synonyms of the currently searched food material in the knowledge base. When the food material name of the search result differs from the searched name, the webpage data is a redirect page; the food material name and aliases of the search result may be extracted (filtering out the searched name from the knowledge base) and used as synonyms of the currently searched food material in the knowledge base. The extracted synonyms may be reviewed by nutrition experts to correct erroneous data. The reviewed synonym data are then merged into the previously constructed synonym library, and duplicate data are removed.
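The alias extraction of S502 can be sketched with Python's standard `html.parser` module. This is a sketch under stated assumptions: the class `BoldInLeadParser` and the function `extract_aliases` are hypothetical names, the page is assumed to be already downloaded, and real Wikipedia markup is more complex than the `<p>` and `<b>` tags handled here.

```python
from html.parser import HTMLParser


class BoldInLeadParser(HTMLParser):
    """Collects text inside <b> tags within the first `max_paragraphs` <p> elements."""

    def __init__(self, max_paragraphs=2):
        super().__init__()
        self.max_paragraphs = max_paragraphs
        self.paragraphs_seen = 0
        self.in_paragraph = False
        self.bold_depth = 0
        self.terms = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraphs_seen += 1
            self.in_paragraph = self.paragraphs_seen <= self.max_paragraphs
        elif tag == "b" and self.in_paragraph:
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag == "p":
            # Leaving a paragraph discards any unfinished bold run.
            self.in_paragraph = False
            self.bold_depth = 0
            self._buffer = []
        elif tag == "b" and self.bold_depth > 0:
            self.bold_depth -= 1
            if self.bold_depth == 0:
                term = "".join(self._buffer).strip()
                self._buffer = []
                if term:
                    self.terms.append(term)

    def handle_data(self, data):
        if self.bold_depth > 0:
            self._buffer.append(data)


def extract_aliases(html, queried_name):
    """Return bold terms from the lead paragraphs, minus the queried name, deduplicated."""
    parser = BoldInLeadParser()
    parser.feed(html)
    parser.close()
    seen, aliases = set(), []
    for term in parser.terms:
        if term != queried_name and term not in seen:
            seen.add(term)
            aliases.append(term)
    return aliases
```

Filtering out `queried_name` mirrors the step in which the searched knowledge-base name is removed before the remaining names are kept as synonyms.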
S203, acquiring text data about food material entities from any food material data source, and forming a food material entity text data set through the text data.
In an exemplary embodiment of the present invention, a crawler may be used to crawl a large amount of webpage data about food material entities from any food material data source, and DOM parsing may be used to obtain the text data in the webpage data. Data cleaning of the food material text (i.e., the text data) in its various formats may include, but is not limited to: removing invisible special characters (such as line feeds, tabs, and carriage returns), deleting spaces between non-English characters, and deleting webpage tags that were not fully parsed out of the text. After data cleaning, the cleaned multi-source food material text and the corresponding data, such as website information and page URL (uniform resource locator), may be stored in a temporary storage medium (such as MySQL, MongoDB, or another common storage medium).
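The cleaning operations listed above can be sketched in Python with regular expressions. The function name `clean_text` and the exact character classes are assumptions; in particular, "non-English character" is approximated here as any non-ASCII character.

```python
import re

# Invisible characters named in the text, plus zero-width space and BOM (assumed).
CONTROL_CHARS = re.compile(r"[\r\n\t\u200b\ufeff]+")
# Leftover markup such as <br/> or </div> that DOM parsing did not strip.
LEFTOVER_TAGS = re.compile(r"</?[A-Za-z][^<>]*>")
# Whitespace whose two neighbours are both non-ASCII characters.
SPACE_BETWEEN_NON_ASCII = re.compile(r"(?<=[^\x00-\x7f])\s+(?=[^\x00-\x7f])")


def clean_text(raw):
    """Apply the three cleaning steps and normalise remaining spaces."""
    text = LEFTOVER_TAGS.sub("", raw)
    text = CONTROL_CHARS.sub(" ", text)
    text = SPACE_BETWEEN_NON_ASCII.sub("", text)
    text = re.sub(r" {2,}", " ", text)
    return text.strip()
```

Spaces between English words are kept, so mixed Chinese and English text stays readable after cleaning.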
S204, acquiring the candidate food material entity set according to the food material entity text data set, the knowledge base and the synonym base.
In an exemplary embodiment of the present invention, as shown in fig. 6, the obtaining the candidate food entity set according to the food entity text data set, the knowledge base, and the synonym base may include S601-S603:
S601, word segmentation is carried out on each piece of text data in the food entity text data set.
In an exemplary embodiment of the present invention, the text data of the food material entities may be segmented using ANSJ, an open-source Chinese word segmentation tool based on n-Gram + CRF + HMM.
S602, respectively calculating a first term frequency-inverse document frequency (TF-IDF) value of each word of the word segmentation result in the knowledge base and a second TF-IDF value in the synonym library.
S603, comparing the first TF-IDF value and the second TF-IDF value with a preset TF-IDF threshold value respectively, and taking a set composed of food entities corresponding to words meeting the condition that the first TF-IDF value is larger than the TF-IDF threshold value and/or the second TF-IDF value is larger than the TF-IDF threshold value as the candidate food entity set.
In an exemplary embodiment of the present invention, the TF-IDF value calculated in S602 may be compared with a preset TF-IDF threshold S. If the TF-IDF value is greater than the threshold S, the corresponding food material entity may be retained as a candidate food material entity and, after deduplication, added to the candidate food material entity set and stored in the temporary storage medium MongoDB. If the TF-IDF value is less than the TF-IDF threshold S, the food material entity may be discarded.
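Steps S602 and S603 can be sketched as follows. The text does not spell out how TF-IDF is computed "in the knowledge base" and "in the synonym library", so this sketch makes explicit assumptions: term frequency is taken from the segmented text, inverse document frequency is taken from a corpus of token lists (one per knowledge-base or synonym-library entry) with add-one smoothing, and `name_to_entity` is a hypothetical lookup from a word to an entity ID.

```python
import math
from collections import Counter


def tf_idf(tokens, corpus):
    """TF from `tokens`; smoothed IDF from `corpus`, a list of token lists."""
    counts = Counter(tokens)
    total = len(tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        doc_freq = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[word] = (count / total) * idf
    return scores


def candidate_entities(tokens, kb_corpus, syn_corpus, name_to_entity, threshold):
    """Keep entities whose word exceeds the threshold in either resource (and/or)."""
    first = tf_idf(tokens, kb_corpus)    # first TF-IDF value (knowledge base)
    second = tf_idf(tokens, syn_corpus)  # second TF-IDF value (synonym library)
    candidates = set()                   # a set deduplicates automatically
    for word in set(tokens):
        if word not in name_to_entity:
            continue
        if first[word] > threshold or second[word] > threshold:
            candidates.add(name_to_entity[word])
    return candidates
```

Raising the threshold shrinks the candidate set, which trades recall for a cheaper similarity computation in the later linking step.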
S102, matching first text data to be entity-linked, acquired from any food material data source, against the candidate food material entity set, and establishing a link between a first food material entity in the first text data and the successfully matched second food material entity in the candidate food material entity set.
In an exemplary embodiment of the present invention, after the candidate food material entity set is created through the above steps, an entity link may be established, based on the candidate food material entity set, for any text data to be entity-linked (e.g., the first text data) obtained from any food material data source.
In an exemplary embodiment of the present invention, as shown in fig. 7, the matching of the first text data to be entity-linked, acquired from any food material data source, against the candidate food material entity set, and the establishing of a link between the first food material entity in the first text data and the successfully matched second food material entity in the candidate food material entity set, may include S701-S704:
S701, segmenting the first text data and removing stop words to obtain a first context of the first text data, and segmenting the class labels, food material names, and description words of each candidate food material entity in the candidate food material entity set to serve as a second context of that candidate food material entity.
In an exemplary embodiment of the present invention, the context of the first text data to be entity-linked and the context of each candidate food material entity may be segmented separately using ANSJ, and each may be represented as a set of its component words.
S702, calculating semantic similarity of the first context and the second context according to a preset similarity algorithm.
In an exemplary embodiment of the present invention, all candidate food material entities may be traversed, and the semantic similarity sim(x, e) between the food material name x to be entity-linked in the first context and each candidate food material entity e may be calculated according to a preset similarity algorithm.
In an exemplary embodiment of the present invention, the preset similarity algorithm may include:
wherein x represents the food material name to be entity-linked in the first context, e represents the candidate food material entity, n represents the number of words in the first context, k represents the number of words in the second context, xc_i represents the i-th word in the first context, ec_j represents the j-th word in the second context, v(x) represents the word vector of x, v(xc_i) represents the word vector of xc_i, and v(ec_j) represents the word vector of ec_j, the word vectors being generated using a Skip-gram model; sim(v(x), v(xc_i)) represents the semantic similarity of x and xc_i, calculated as the cosine similarity of the word vectors of x and xc_i; and sim(v(xc_i), v(ec_j)) represents the semantic similarity of xc_i and ec_j, calculated as the cosine similarity of the word vectors of xc_i and ec_j and used as the weight of the xc_i term.
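In an exemplary embodiment of the present invention, the semantic similarity of x and xc_i may be calculated as the cosine similarity of the vectors v(x) and v(xc_i), i.e.,
$$\mathrm{sim}(v(x), v(xc_i)) = \frac{v(x) \cdot v(xc_i)}{\lVert v(x) \rVert \, \lVert v(xc_i) \rVert}$$
In an exemplary embodiment of the present invention, the semantic similarity of xc_i and ec_j may be calculated as the cosine similarity of the vectors v(xc_i) and v(ec_j), i.e.,
$$\mathrm{sim}(v(xc_i), v(ec_j)) = \frac{v(xc_i) \cdot v(ec_j)}{\lVert v(xc_i) \rVert \, \lVert v(ec_j) \rVert}$$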
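The two cosine similarities defined above can be combined into a context-level score, as in the following Python sketch. The exact aggregation formula is not reproduced in this text, so the combination used here is an assumption: the product sim(v(x), v(xc_i)) · sim(v(xc_i), v(ec_j)) is averaged over all n · k word pairs. The function names and the toy two-dimensional vectors are also hypothetical; in practice the vectors would come from a trained Skip-gram model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_similarity(x, first_context, second_context, vectors):
    """Assumed aggregate: mean over (i, j) of sim(x, xc_i) * sim(xc_i, ec_j)."""
    n, k = len(first_context), len(second_context)
    if n == 0 or k == 0:
        return 0.0
    total = 0.0
    for xc in first_context:
        sim_x_xc = cosine(vectors[x], vectors[xc])
        for ec in second_context:
            total += sim_x_xc * cosine(vectors[xc], vectors[ec])
    return total / (n * k)
```

With this form, a context word that is unrelated to the mention x contributes little to the score, however similar it is to the candidate entity's description.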
S703, selecting the food material entity whose second context has the maximum similarity with the first context, and determining whether the maximum similarity is larger than a preset similarity threshold.
In an exemplary embodiment of the present invention, all candidate food material entities are traversed, the semantic similarity between the food material name to be entity-linked and each candidate food material entity is calculated, and the candidate food material entity with the maximum semantic similarity is found. For that candidate, it is determined whether the calculated maximum similarity is greater than a set similarity threshold S_0, and the subsequent processing steps are determined according to the result.
S704, when the maximum similarity is larger than the preset similarity threshold, linking the first food material entity in the first context to the second food material entity having the maximum similarity with the first context.
In an exemplary embodiment of the present invention, the method may further include: outputting the relevant entity information of the second food material entity in the knowledge base after the first food material entity is linked to the second food material entity.
In an exemplary embodiment of the present invention, when no second food material entity in the candidate food material entity set matches the first food material entity in the first text data, a null symbol (NULL) is output, and the first food material entity in the first text data is supplemented into the knowledge base.
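The decision logic of S703 and S704, together with the NULL case, can be sketched as below. The function name `link_entity`, the `NEW-` ID scheme, and the record layout are hypothetical; the similarity scores are assumed to have been computed beforehand for each candidate.

```python
def link_entity(mention, scores, threshold, knowledge_base):
    """Link `mention` to the best-scoring candidate, or return (None, None).

    scores: candidate entity ID -> similarity with the mention's context.
    """
    if scores:
        best_id = max(scores, key=scores.get)
        if scores[best_id] > threshold:
            # Linked: also return the entity information from the knowledge base.
            return best_id, knowledge_base[best_id]
    # No candidate passes the threshold: output NULL (None here) and add the
    # mention to the knowledge base under a hypothetical new ID.
    new_id = f"NEW-{len(knowledge_base) + 1}"
    knowledge_base[new_id] = {"name": mention}
    return None, None
```

A caller can treat a `None` result as the NULL output and route the newly added entry to whatever review process the knowledge base uses.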
An embodiment of the present invention further provides a food material entity linking device 1 between multi-source food material data which, as shown in fig. 8, may include a processor 11 and a computer readable storage medium 12, where the computer readable storage medium 12 stores instructions that, when executed by the processor 11, implement the food material entity linking method between multi-source food material data.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.

Claims (10)

1. A method of food material entity linking between multi-source food material data, the method comprising:
establishing a knowledge base of food material entities according to the food material data in the first food material data source;
establishing a synonym library of food material entities according to food material data in the second food material data source;
acquiring text data about food material entities from any food material data source, and forming a food material entity text data set through the text data;
acquiring a candidate food material entity set for food material entity link according to the food material entity text data set, the knowledge base and the synonym base;
and matching first text data to be entity-linked, acquired from any food material data source, against the candidate food material entity set, and establishing a link between a first food material entity in the first text data and the successfully matched second food material entity in the candidate food material entity set.
2. The method of claim 1, wherein the obtaining a candidate food entity set for food entity linking comprises:
establishing a knowledge base of food material entities according to the food material data in the first food material data source;
establishing a synonym library of food material entities according to food material data in the second food material data source;
acquiring text data about food material entities from any food material data source, and forming a food material entity text data set through the text data;
acquiring the candidate food entity set according to the food entity text data set, the knowledge base and the synonym base;
wherein, the first food material data source includes: centella asiatica;
the second food material data source comprises: the "Chinese Food Composition Table" and/or Wikipedia.
3. The method of claim 2, wherein the obtaining the candidate food entity set according to the food entity text data set, the knowledge base, and the synonym library comprises:
performing word segmentation on each piece of text data in the food material entity text data set;
respectively calculating a first term frequency-inverse document frequency (TF-IDF) value of each word of the word segmentation result in the knowledge base and a second TF-IDF value in the synonym library;
and respectively comparing the first TF-IDF value and the second TF-IDF value with a preset TF-IDF threshold, and taking the set composed of the food material entities corresponding to the words for which the first TF-IDF value is larger than the TF-IDF threshold and/or the second TF-IDF value is larger than the TF-IDF threshold as the candidate food material entity set.
4. The method of claim 2, wherein the establishing a knowledge base of food material entities from the food material data in the first food material data source comprises:
extracting food material information from the first food material data source; the food material information includes: class labels of all food materials and various attributes of pages corresponding to each food material entity in each class label;
cleaning the food material information to obtain descriptions meeting preset requirements about each food material entity;
expert auditing is carried out on the cleaned food material information so as to modify error data and nonstandard data in the description about the food material entity;
and storing the descriptions about the food material entities, which are subjected to expert verification, into a preset food material knowledge graph to serve as the knowledge base.
5. The method of claim 2, wherein when the second food material data source is the "Chinese Food Composition Table", the establishing a synonym library of food material entities from food material data in the second food material data source comprises:
extracting food material names from the "Chinese Food Composition Table" document according to preset rules by using a text read-write tool to form a food material name set; extracting one or more aliases corresponding to each food material name by using a regular expression to form a food material alias set;
traversing the food material name set and the food material alias set, and aligning them with the food material names and aliases in the knowledge base by using a character string matching rule;
and when the alignment succeeds, recording the mapping between the identity (ID) of the corresponding food material entity in the knowledge base and the food material names of the same food material entity in the "Chinese Food Composition Table" to form the synonym library.
6. The method of claim 2 or 5, wherein when the second food material data source is Wikipedia, the establishing a synonym library of food material entities from food material data in the second food material data source comprises:
traversing all food material names in the knowledge base, searching for each food material name in Wikipedia by using a crawler, and retaining the webpage data of the search result;
when the food material name of the search result is the same as the searched food material name from the knowledge base, extracting the aliases of the food material corresponding to the current food material name, and taking the extracted aliases as synonyms of the currently searched food material in the knowledge base; wherein the aliases of a Wikipedia entry are located in one or more pieces of bold text in the first or second paragraph;
when the food material name of the search result differs from the searched food material name from the knowledge base, extracting one or more food material names from the search result, and taking them as synonyms of the currently searched food material in the knowledge base;
and forming the synonym library from the extracted synonym data after expert review, or merging the data into the previously constructed synonym library, and removing duplicate data.
7. The method of claim 2, wherein the matching of the first text data to be entity-linked, acquired from any food material data source, against the candidate food material entity set, and the establishing of a link between the first food material entity in the first text data and the successfully matched second food material entity in the candidate food material entity set, comprises:
segmenting the first text data and removing stop words to obtain a first context of the first text data, and segmenting the class labels, food material names, and description words of each candidate food material entity in the candidate food material entity set to serve as a second context of that candidate food material entity;
calculating the semantic similarity of the first context and the second context according to a preset similarity algorithm;
selecting the food material entity whose second context has the maximum similarity with the first context, and determining whether the maximum similarity is larger than a preset similarity threshold;
and when the maximum similarity is larger than the preset similarity threshold, linking the first food material entity in the first context to the second food material entity having the maximum similarity with the first context.
8. The method of claim 7, wherein the predetermined similarity algorithm comprises:
wherein x represents the food material name to be entity-linked in the first context, e represents the candidate food material entity, n represents the number of words in the first context, k represents the number of words in the second context, xc_i represents the i-th word in the first context, ec_j represents the j-th word in the second context, v(x) represents the word vector of x, v(xc_i) represents the word vector of xc_i, and v(ec_j) represents the word vector of ec_j, the word vectors being generated using a Skip-gram model; sim(v(x), v(xc_i)) represents the semantic similarity of x and xc_i, calculated as the cosine similarity of the word vectors of x and xc_i; and sim(v(xc_i), v(ec_j)) represents the cosine similarity of the word vectors of xc_i and ec_j, used as the weight of the xc_i term, to calculate the semantic similarity of xc_i and ec_j.
9. The method of claim 7, further comprising:
outputting the relevant entity information of the second food material entity in the knowledge base after the first food material entity is linked to the second food material entity; and/or,
outputting a null symbol (NULL) when no second food material entity in the candidate food material entity set matches the first food material entity in the first text data, and supplementing the first food material entity in the first text data into the knowledge base.
10. A food material entity linking apparatus between multi-source food material data, comprising a processor and a computer readable storage medium having instructions stored therein, wherein the instructions, when executed by the processor, implement the food material entity linking method between multi-source food material data as claimed in any one of claims 1-9.
CN201910156613.7A 2019-03-01 2019-03-01 Food material entity linking method and device between multi-source food material data Active CN111708891B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910156613.7A CN111708891B (en) 2019-03-01 2019-03-01 Food material entity linking method and device between multi-source food material data


Publications (2)

Publication Number Publication Date
CN111708891A CN111708891A (en) 2020-09-25
CN111708891B true CN111708891B (en) 2023-12-08

Family

ID=72536055


Country Status (1)

Country Link
CN (1) CN111708891B (en)




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant