CN111708891A - Food material entity linking method and device among multi-source food material data - Google Patents

Food material entity linking method and device among multi-source food material data

Info

Publication number
CN111708891A
Authority
CN
China
Prior art keywords
food material
entity
data
food
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910156613.7A
Other languages
Chinese (zh)
Other versions
CN111708891B (en)
Inventor
朱泽春
钟敬德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Joyoung Co Ltd
Original Assignee
Joyoung Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Joyoung Co Ltd
Priority to CN201910156613.7A
Publication of CN111708891A
Application granted
Publication of CN111708891B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Abstract

The embodiment of the invention discloses a method and a device for linking food material entities among multi-source food material data, wherein the method comprises the following steps: acquiring a candidate food material entity set used for food material entity linkage; matching first text data to be entity-linked, which is acquired from any food material data source, with the candidate food material entity set, and establishing a link between a first food material entity in the first text data and a second food material entity in the candidate food material entity set which is successfully matched. By the embodiment, the text information for describing food materials from different food material data sources and the food material entities in the created knowledge graph are efficiently subjected to entity link, and the accuracy of food material entity link is improved.

Description

Food material entity linking method and device among multi-source food material data
Technical Field
The embodiment of the invention relates to a construction technology of a food material knowledge graph, in particular to a method and a device for food material entity linkage among multi-source food material data.
Background
With the development of knowledge graph technology, knowledge-graph-based applications such as semantic search and intelligent question answering have become more convenient and efficient. When food material entities are built during the construction of a nutrition and health knowledge graph, however, people name and describe the same food material entity in completely different ways. It is common for one food material to have several different names: different nutrition and food websites maintain different food material classifications and different food material names for the same food material entity. This poses a challenge to the construction of a nutrition and health knowledge graph.
Disclosure of Invention
The embodiment of the invention provides a food material entity linking method and device among multi-source food material data, which can efficiently perform entity linking on text information for describing food materials in different food material data sources and food material entities in a created knowledge graph (or knowledge base), and improve the accuracy of food material entity linking.
To achieve the object of the embodiment of the present invention, an embodiment of the present invention provides a method for linking food material entities among multi-source food material data, where the method may include:
acquiring a candidate food material entity set used for food material entity linkage;
matching first text data to be entity-linked, which is acquired from any food material data source, with the candidate food material entity set, and establishing a link between a first food material entity in the first text data and a second food material entity in the candidate food material entity set which is successfully matched.
In an exemplary embodiment of the present invention, the obtaining of the candidate food material entity set for food material entity link includes:
establishing a knowledge base of food material entities according to food material data in a first food material data source;
establishing a synonym library of food material entities according to the food material data in the second food material data source;
acquiring text data about food material entities from any food material data source, and forming a food material entity text data set through the text data;
acquiring the candidate food material entity set according to the food material entity text data set, the knowledge base and the synonym base;
wherein the first food material data source comprises: gourmet food material encyclopedia;
the second food material data source comprises: the Chinese Food Composition Table and/or Wikipedia.
In an exemplary embodiment of the present invention, the obtaining the candidate food material entity set according to the food material entity text data set, the knowledge base and the synonym base includes:
segmenting each section of text data in the food material entity text data set;
respectively calculating a first term frequency-inverse document frequency (TF-IDF) value of each word in the word segmentation result in the knowledge base and a second TF-IDF value of the word in the synonym base;
and respectively comparing the first TF-IDF value and the second TF-IDF value with a preset TF-IDF threshold value, and taking a set of food material entities corresponding to words meeting the condition that the first TF-IDF value is larger than the TF-IDF threshold value and/or the second TF-IDF value is larger than the TF-IDF threshold value as the candidate food material entity set.
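The thresholding step above can be sketched as follows. This is a minimal illustration only: it assumes that each knowledge-base or synonym-base entry is one tokenized document keyed by a food material entity ID, and it uses one common TF-IDF variant, since the text does not specify the exact formula. The function names, the toy entries and the threshold value are hypothetical.

```python
import math
from collections import Counter

def tf_idf(word, doc_tokens, all_docs):
    """One common TF-IDF variant: relative term frequency times a smoothed log IDF."""
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    df = sum(1 for doc in all_docs if word in doc)
    idf = math.log(len(all_docs) / (1 + df)) + 1.0
    return tf * idf

def candidate_entities(words, knowledge_base, synonym_base, threshold):
    """Return IDs of entities for which some word exceeds the threshold
    in the knowledge base and/or in the synonym base."""
    candidates = set()
    for library in (knowledge_base, synonym_base):
        docs = list(library.values())
        for entity_id, tokens in library.items():
            if any(tf_idf(w, tokens, docs) > threshold for w in words):
                candidates.add(entity_id)
    return candidates

# Hypothetical toy data: entity ID -> tokens describing that entity.
kb = {"E1": ["tomato", "vegetable", "red"], "E2": ["potato", "vegetable", "tuber"]}
syn = {"E1": ["tomato", "love-apple"], "E2": ["potato", "spud"]}
print(candidate_entities(["tomato", "fresh"], kb, syn, threshold=0.25))
```

A word such as "vegetable" that appears in every entry receives a low IDF and so does not, on its own, pull every entity into the candidate set.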
In an exemplary embodiment of the present invention, the establishing a knowledge base of food material entities from food material data in a first food material data source comprises:
extracting food material information from the first food material data source; the food material information includes: the category labels of all food materials and various attributes of the corresponding pages of each food material entity in each category label;
cleaning the food material information to obtain descriptions which are related to each food material entity and meet preset requirements;
performing expert review on the cleaned food material information to modify error data and non-standard data in the description of the food material entity;
and storing the description about the food material entity after being audited by the expert into a preset food material knowledge graph as the knowledge base.
In an exemplary embodiment of the present invention, when the second food material data source is the "Chinese Food Composition Table", the establishing a synonym library of food material entities according to the food material data in the second food material data source includes:
extracting food material names in the Chinese food component table document according to preset rules by using a text reading and writing tool to form a food material name set; extracting one or more food material alternative names corresponding to the food material names by using a regular expression to form a food material alternative name set;
traversing the food material name set and the food material alternative name set, and aligning the food material names and the food material alternative names in the knowledge base by using a character string matching method;
and when the alignment is successful, recording the mapping relation between the identity ID of the corresponding food material entity in the knowledge base and the food material name and food material alternative name of the same food material entity in the Chinese food component table to form the synonym library.
In an exemplary embodiment of the present invention, when the second food material data source is Wikipedia, the establishing a synonym library of food material entities according to the food material data in the second food material data source includes:
traversing all food material names in the knowledge base, retrieving the food material names in Wikipedia by utilizing a crawler technology, and retaining the webpage data of each retrieval result;
when the food material name of the retrieval result is the same as the retrieved food material name in the knowledge base, extracting the alternative names of the food material corresponding to the current food material name, and taking the extracted alternative names as synonyms of the currently retrieved food material in the knowledge base; wherein the alternative names in a Wikipedia entry are located in one or more bold text spans of the first or second paragraph;
when the food material name of the retrieval result is different from the retrieved food material name in the knowledge base, extracting one or more food material names in the retrieval result, and taking the one or more food material names as synonyms of the currently retrieved food material in the knowledge base;
and having the extracted synonym data reviewed by experts, and then forming the synonym library from it or merging it into a previously constructed synonym library, with duplicate data removed.
In an exemplary embodiment of the present invention, the matching a first text data to be entity-linked, which is obtained from an arbitrary food material data source, with the candidate food material entity set, and establishing a link between a first food material entity in the first text data and a second food material entity in the candidate food material entity set that is successfully matched includes:
segmenting the first text data, removing stop words to obtain a first context of the first text data, and segmenting class labels, food names and description words of each candidate food material entity in the candidate food material entity set to serve as a second context of the candidate food material entity;
calculating semantic similarity of the first context and the second context according to a preset similarity algorithm;
selecting the food material entity in the second context with the maximum similarity with the first context, and determining whether the value of the maximum similarity is greater than a preset similarity threshold value;
and when the maximum similarity value is larger than the preset similarity threshold value, connecting the first food material entity in the first context to the selected second food material entity with the maximum similarity to the first context.
In an exemplary embodiment of the present invention, the preset similarity algorithm includes:
[Formula image not reproduced in this text: Figure BDA0001983132480000041]
wherein x represents the food material name to be entity-linked in the first context, e represents the candidate food material entity, n represents the number of words in the first context, k represents the number of words in the second context, xc_i represents the i-th word in the first context, ec_j represents the j-th word in the second context, v(x) represents the word vector of x, v(xc_i) represents the word vector of xc_i, and v(ec_j) represents the word vector of ec_j, the word vectors being generated using a Skip-gram model; sim(v(x), v(xc_i)) represents the semantic similarity of x and xc_i, computed as the cosine similarity of the word vectors of x and xc_i and used as the weight of the word xc_i; and sim(v(xc_i), v(ec_j)) represents the semantic similarity of xc_i and ec_j, computed as the cosine similarity of the word vectors of xc_i and ec_j.
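Because the formula image itself is not reproduced in this text, the sketch below shows only one plausible reading of the symbol definitions: a double sum over the words of both contexts, in which each pairwise cosine similarity sim(v(xc_i), v(ec_j)) is weighted by sim(v(x), v(xc_i)) and the total is normalized by n·k. The normalization and the function names are assumptions, not the patent's confirmed formula, and the word vectors are passed in as plain lists rather than produced by an actual Skip-gram model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def context_similarity(x_vec, first_ctx_vecs, second_ctx_vecs):
    """Assumed form: (1 / (n * k)) * sum_i sum_j sim(x, xc_i) * sim(xc_i, ec_j)."""
    n, k = len(first_ctx_vecs), len(second_ctx_vecs)
    if n == 0 or k == 0:
        return 0.0
    total = 0.0
    for xc in first_ctx_vecs:
        weight = cosine(x_vec, xc)  # how relevant this context word is to x
        for ec in second_ctx_vecs:
            total += weight * cosine(xc, ec)
    return total / (n * k)
```

Under this reading, context words that are semantically distant from the food material name x contribute little to the score, even if they happen to resemble words in the candidate entity's description.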
In an exemplary embodiment of the invention, the method further comprises:
after the first food material entity is connected to the second food material entity, outputting related entity information of the second food material entity in the knowledge base; and/or,
when no second food material entity in the candidate food material entity set matches the first food material entity in the first text data, outputting a null indicator NULL, and adding the first food material entity in the first text data to the knowledge base.
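The selection, threshold check and NULL fallback described above can be sketched as follows. The function name, the dictionary-based knowledge base and the way a new entity ID is generated are hypothetical; Python's `None` stands in for the NULL indicator.

```python
def link_or_add(mention, scores, threshold, knowledge_base):
    """Link a mention to its best-scoring candidate, or record it as a new entity.

    scores: dict mapping candidate entity ID -> similarity with the mention's context.
    Returns the linked entity's information, or None (the NULL indicator).
    """
    best_id = max(scores, key=scores.get) if scores else None
    if best_id is not None and scores[best_id] > threshold:
        return knowledge_base[best_id]  # related entity information
    # No candidate matched: add the mention to the knowledge base.
    new_id = f"E{len(knowledge_base) + 1}"  # hypothetical ID scheme
    knowledge_base[new_id] = {"name": mention}
    return None

kb = {"E1": {"name": "tomato"}}
print(link_or_add("love apple", {"E1": 0.9}, 0.6, kb))    # linked to E1
print(link_or_add("dragon fruit", {"E1": 0.1}, 0.6, kb))  # no match, added to kb
```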
The embodiment of the invention also provides a food material entity linking device among multi-source food material data, which comprises a processor and a computer-readable storage medium, wherein instructions are stored in the computer-readable storage medium, and the food material entity linking device is characterized in that when the instructions are executed by the processor, the food material entity linking method among the multi-source food material data is realized.
The beneficial effects of the embodiment of the invention can include:
1. the food material entity linking method among multi-source food material data of the embodiment of the invention can comprise the following steps: acquiring a candidate food material entity set used for food material entity linkage; matching first text data to be entity-linked, which is acquired from any food material data source, with the candidate food material entity set, and establishing a link between a first food material entity in the first text data and a second food material entity in the candidate food material entity set which is successfully matched. Through the scheme of the embodiment, the text information for describing food materials from different food material data sources and the food material entities in the created knowledge graph (or knowledge base) are efficiently subjected to entity link, and the accuracy of food material entity link is improved.
2. The obtaining of the candidate food material entity set for food material entity linking according to the embodiment of the present invention may include: establishing a knowledge base of food material entities according to food material data in a first food material data source; establishing a synonym library of food material entities according to the food material data in the second food material data source; acquiring text data about food material entities from any food material data source, and forming a food material entity text data set through the text data; and acquiring the candidate food material entity set according to the food material entity text data set, the knowledge base and the synonym base. Through the embodiment scheme, the existing food material entity set (namely the candidate food material entity set) is constructed in a big data mode through various food material data sources, so that the data of the candidate food material entity set is more accurate, and the linking accuracy is increased.
3. The first food material data source of the embodiment of the invention comprises: gourmet food material encyclopedia; the second food material data source comprises: the Chinese Food Composition Table and/or Wikipedia. In the scheme of the embodiment, the gourmet food material encyclopedia, the Chinese Food Composition Table, Wikipedia and the like are mature food material information sources with rich content and complete information, which helps ensure the comprehensiveness and accuracy of the constructed knowledge base and synonym base.
4. The obtaining of the candidate food material entity set according to the food material entity text data set, the knowledge base and the synonym base in the embodiment of the present invention includes: segmenting each section of text data in the food material entity text data set; respectively calculating a first term frequency-inverse document frequency (TF-IDF) value of each word in the word segmentation result in the knowledge base and a second TF-IDF value of the word in the synonym base; and respectively comparing the first TF-IDF value and the second TF-IDF value with a preset TF-IDF threshold value, and taking a set of food material entities corresponding to words meeting the condition that the first TF-IDF value is larger than the TF-IDF threshold value and/or the second TF-IDF value is larger than the TF-IDF threshold value as the candidate food material entity set. Through the scheme of the embodiment, the TF-IDF value is used as the screening standard of the food material entity, so that the accuracy of the knowledge base is further improved.
5. In the embodiment of the present invention, when the second food material data source is the "Chinese Food Composition Table", the establishing a synonym library of food material entities according to the food material data in the second food material data source includes: extracting food material names in the Chinese Food Composition Table document according to preset rules by using a text reading and writing tool to form a food material name set; extracting one or more food material alternative names corresponding to the food material names by using a regular expression to form a food material alternative name set; traversing the food material name set and the food material alternative name set, and aligning them with the food material names and the food material alternative names in the knowledge base by using a character string matching method; and when the alignment is successful, recording the mapping relation between the identity ID of the corresponding food material entity in the knowledge base and the food material name and food material alternative names of the same food material entity in the Chinese Food Composition Table to form the synonym library.
When the second food material data source is Wikipedia, the establishing a synonym library of food material entities according to the food material data in the second food material data source includes: traversing all food material names in the knowledge base, retrieving the food material names in Wikipedia by utilizing a crawler technology, and retaining the webpage data of each retrieval result; when the food material name of the retrieval result is the same as the retrieved food material name in the knowledge base, extracting the alternative names of the food material corresponding to the current food material name, and taking the extracted alternative names as synonyms of the currently retrieved food material in the knowledge base; wherein the alternative names in a Wikipedia entry are located in one or more bold text spans of the first or second paragraph; when the food material name of the retrieval result is different from the retrieved food material name in the knowledge base, extracting one or more food material names in the retrieval result, and taking the one or more food material names as synonyms of the currently retrieved food material in the knowledge base; and having the extracted synonym data reviewed by experts, and then forming the synonym library from it or merging it into a previously constructed synonym library, with duplicate data removed. According to the embodiment, the synonym libraries can be established through different food material data sources, which gives technical staff several technical schemes to choose from; the synonym libraries established through different food material data sources can supplement each other, which helps ensure the comprehensiveness of the synonym library.
6. The matching of the first text data to be entity-linked, which is acquired from any food material data source, with the candidate food material entity set, and the establishing of the link between the first food material entity in the first text data and the second food material entity in the candidate food material entity set successfully matched in the embodiment of the present invention may include: segmenting the first text data, removing stop words to obtain a first context of the first text data, and segmenting class labels, food names and description words of each candidate food material entity in the candidate food material entity set to serve as a second context of the candidate food material entity; calculating semantic similarity of the first context and the second context according to a preset similarity algorithm; selecting the food material entity in the second context with the maximum similarity with the first context, and calculating whether the value of the maximum similarity is greater than a preset similarity threshold value; and when the maximum similarity value is larger than the preset similarity threshold value, connecting the first food material entity in the first context to the selected second food material entity with the maximum similarity to the first context. By means of the embodiment, the type names of the food material entities and the contexts of the candidate food material entities are represented by the word vectors, semantic similarity between the word vectors and the candidate food material entities is calculated through the word vectors, and matching efficiency and accuracy of the first text data and the candidate food material entity set are improved.
Additional features and advantages of embodiments of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the embodiments of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification; together with the examples of the application, they illustrate the embodiments of the invention and do not constitute a limitation of the embodiments of the invention.
Fig. 1 is a flowchart of a method for linking food material entities among multi-source food material data according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for obtaining a candidate food material entity set for food material entity linking according to an embodiment of the present invention;
fig. 3 is a flowchart of a method for establishing a knowledge base of food material entities according to food material data in a first food material data source according to an embodiment of the invention;
fig. 4 is a flowchart of a method for establishing a synonym library of food material entities according to food material data in a second food material data source when the second food material data source is the "Chinese Food Composition Table" according to the embodiment of the present invention;
fig. 5 is a flowchart of a method for establishing a synonym library of food material entities according to food material data in a second food material data source when the second food material data source is Wikipedia according to the embodiment of the present invention;
fig. 6 is a flowchart of a method for obtaining the candidate food material entity set according to the food material entity text data set, the knowledge base and the synonym base in the embodiment of the present invention;
fig. 7 is a flowchart of a method for matching a first text data to be entity-linked, which is obtained from an arbitrary food material data source, with the candidate food material entity set, and establishing a link between a first food material entity in the first text data and a second food material entity in the candidate food material entity set successfully matched according to the embodiment of the present invention;
fig. 8 is a block diagram illustrating an apparatus for linking food material entities among multi-source food material data according to an embodiment of the present invention;
fig. 9 is a schematic view illustrating a food material entity linking method between multi-source food material data according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
The steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different from that shown here.
To achieve the object of the embodiment of the present invention, an embodiment of the present invention provides a food material entity linking method between multi-source food material data, as shown in fig. 1 and 9, the method may include S101-S102:
S101, obtaining a candidate food material entity set used for food material entity linking.
In an exemplary embodiment of the present invention, as shown in fig. 2, the acquiring a candidate food material entity set for food material entity link may include S201 to S204:
S201, establishing a knowledge base of food material entities according to food material data in a first food material data source.
In an exemplary embodiment of the invention, the knowledge base of food material entities may be a nutritional health knowledge graph denoted G = (E, R, P), where E represents the set of food material entity nodes in the graph, R represents the set of relationships between food material entities in the graph, and P represents the set of attributes of food material entities in the graph. Embodiments of the present invention aim to link food material textual information from multiple food material data sources (such as websites) to the food material entities E in the knowledge graph.
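As a rough illustration of the triple (E, R, P), the following sketch models the graph as a small in-memory structure. The class name, method names and sample attributes are hypothetical; the text does not say which knowledge graph system or storage format is used.

```python
from dataclasses import dataclass, field

@dataclass
class FoodKnowledgeGraph:
    entities: set = field(default_factory=set)      # E: food material entity IDs
    relations: set = field(default_factory=set)     # R: (head, relation, tail) triples
    properties: dict = field(default_factory=dict)  # P: entity ID -> {attribute: value}

    def add_entity(self, entity_id, **attrs):
        """Add an entity node and merge its attributes into P."""
        self.entities.add(entity_id)
        self.properties.setdefault(entity_id, {}).update(attrs)

    def add_relation(self, head, relation, tail):
        """Add a relation between two entities that already exist in E."""
        if head not in self.entities or tail not in self.entities:
            raise KeyError("both endpoints must already be entities")
        self.relations.add((head, relation, tail))

graph = FoodKnowledgeGraph()
graph.add_entity("E1", name="tomato", category="vegetable")
graph.add_entity("E2", name="cherry tomato", category="vegetable")
graph.add_relation("E2", "variety_of", "E1")
```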
In an exemplary embodiment of the present invention, the first food material data source may include, but is not limited to: gourmet food material encyclopedia.
In an exemplary embodiment of the present invention, as shown in fig. 3, the establishing a knowledge base of food material entities according to food material data in a first food material data source may include S301-S304:
S301, extracting food material information from the first food material data source; the food material information includes: the category labels of all food materials and various attributes of the corresponding pages of each food material entity in each category label;
S302, cleaning the food material information to obtain descriptions which are about each food material entity and meet preset requirements;
S303, carrying out expert audit on the cleaned food material information so as to modify error data and non-standard data in the description of the food material entity;
S304, storing the description about the food material entity which is audited by the expert into a preset food material knowledge graph as the knowledge base.
In an exemplary embodiment of the present invention, a knowledge base may be created based on gourmet food material encyclopedia data, and at the same time, the data may be used as data of a food material entity set E of an initial knowledge graph.
In an exemplary embodiment of the present invention, as shown in fig. 3, the category labels of all food materials in the gourmet food material encyclopedia, together with related information such as the attributes on the page corresponding to each food material entity, may be extracted first. Then the extracted descriptions of the food material entities may be cleaned to obtain accurate descriptions of the corresponding food materials. Finally, the preprocessed (e.g., cleaned) food material entity data may be reviewed by a nutrition and health expert, who labels and corrects erroneous and non-standard data, and the reviewed food material entity information is stored in a knowledge graph system to serve as the knowledge base of the embodiment of the invention.
S202, establishing a synonym library of the food material entities according to the food material data in the second food material data source.
In an exemplary embodiment of the present invention, step S202 has no required order relative to step S201, and the two steps may be performed simultaneously.
In an exemplary embodiment of the present invention, the second food material data source may include, but is not limited to: the Chinese Food Composition Table and/or Wikipedia.
In the exemplary embodiment of the present invention, the synonym library may be created by one kind of second food material data source, or multiple synonym libraries may be created by multiple kinds of second food material data sources, and the multiple synonym libraries are subjected to deduplication processing, so as to obtain a more complete synonym library.
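Merging several synonym libraries with deduplication can be sketched as below, assuming each library maps a food material entity ID to a list of names. The function name and the sample data are hypothetical.

```python
def merge_synonym_libraries(*libraries):
    """Merge libraries of the form {entity_id: [names]} and drop duplicate names,
    keeping the order in which names were first seen."""
    merged = {}
    for library in libraries:
        for entity_id, names in library.items():
            bucket = merged.setdefault(entity_id, [])
            for name in names:
                if name not in bucket:
                    bucket.append(name)
    return merged

from_table = {"E1": ["tomato", "love apple"]}
from_wiki = {"E1": ["love apple", "tomate"], "E2": ["spud"]}
print(merge_synonym_libraries(from_table, from_wiki))
```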
In an exemplary embodiment of the present invention, as shown in fig. 4, when the second food material data source is "chinese food composition table", the establishing a synonym library of food material entities according to the food material data in the second food material data source may include S401 to S403:
S401, extracting food material names in the Chinese Food Composition Table document according to preset rules by using a text reading and writing tool to form a food material name set; and extracting one or more food material alternative names corresponding to the food material names by using a regular expression to form a food material alternative name set.
In an exemplary embodiment of the present invention, the text reading and writing tool may be the POI tool, which can read and write office documents. Apache POI is an open-source library of the Apache Software Foundation that provides Java programs with an API (application programming interface) for reading and writing files in Microsoft Office formats.
S402, traversing the food material name set and the food material alternative name set, and aligning the food material names and the food material alternative names in the knowledge base by using a character string matching rule.
In an exemplary embodiment of the present invention, the food material name set extracted from the Chinese Food Composition Table and the corresponding food material alternative name set may be traversed, and a string matching method (e.g., exact string matching) may be used to match them against the food material names and alternative names in the knowledge base. If the current food material name matches a unique food material name or alternative name in the knowledge base, or the current food material alternative name matches a unique food material name or alternative name in the knowledge base, the entry is said to be aligned, that is, there is no ambiguity, and step S403 may then be executed. Otherwise, the entry is not aligned (the current food material name or alternative name does not match any food material name or alternative name in the knowledge base), and the current food material name and alternative name may be recorded as unaligned. The unaligned names and alternative names may be added to the knowledge base or the synonym base after being labeled and checked by a nutrition expert, or S402 may be executed again to retry the alignment and guard against misjudgment.
S403, when the alignment is successful, recording the mapping between the ID of the corresponding food material entity in the knowledge base and the food material name and alternative names of the same food material entity in the Chinese Food Composition Table to form the synonym library.
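Steps S402 and S403 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes the knowledge base has already been indexed as a dictionary from each name or alternative name to the set of entity IDs that carry it, and it treats an entry as aligned only when all of its surface forms together point to exactly one entity. The names `kb_index` and `build_synonym_library` are hypothetical.

```python
def build_synonym_library(kb_index, table_entries):
    """Align composition-table entries with knowledge-base entities.

    kb_index: dict mapping a name or alternative name to a set of entity IDs
              (assumed pre-built from the knowledge base).
    table_entries: iterable of (name, [alternative names]) pairs.
    Returns (synonyms, unaligned), where synonyms maps entity ID -> set of names.
    """
    synonyms = {}
    unaligned = []
    for name, aliases in table_entries:
        forms = [name, *aliases]
        matched_ids = set()
        for form in forms:
            # Exact string matching against knowledge-base names and aliases.
            matched_ids |= kb_index.get(form, set())
        if len(matched_ids) == 1:
            # Unambiguous: record the ID -> names mapping (S403).
            entity_id = next(iter(matched_ids))
            synonyms.setdefault(entity_id, set()).update(forms)
        else:
            # No match or ambiguous match: leave for expert review.
            unaligned.append((name, aliases))
    return synonyms, unaligned
```

Entries returned in `unaligned` correspond to the names that would be passed to a nutrition expert for labeling and checking.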
In an exemplary embodiment of the present invention, as shown in fig. 5, when the second food material data source is Wikipedia, the establishing a synonym library of food material entities according to the food material data in the second food material data source may include S501-S504:
S501, traversing all food material names in the knowledge base, searching for the food material names in Wikipedia by using crawler technology, and retaining the web page data of the search results.
S502, when the food material name of the search result is the same as the searched food material name from the knowledge base, extracting the alternative names of the food material corresponding to the current food material name, and taking the extracted alternative names as synonyms of the food material currently being searched in the knowledge base; wherein the alternative names in a Wikipedia entry are located in one or more bold passages in the first or second paragraph.
S503, when the food material name of the retrieval result is different from the food material name in the retrieved knowledge base, extracting one or more food material names in the retrieval result, and taking the one or more food material names as synonyms of the current retrieved food material in the knowledge base.
S504, the extracted synonym data is audited by experts to form the synonym library or is combined into the synonym library which is constructed before, and repeated data is removed.
In the exemplary embodiment of the present invention, all food material names in the knowledge base can be traversed and searched for in the Simplified Chinese version of Wikipedia by using crawler technology. The search result is either a specific entry page or a redirect page, and in both cases the web page data can be retained. When the search result is a specific entry page and the food material name of the search result is the same as the searched food material name from the knowledge base, a Document Object Model (DOM) parser can be used to convert the Extensible Markup Language (XML) into objects accessible from JavaScript so as to obtain the text of the alternative names of the current food material. The alternative names in a Wikipedia entry are generally set in bold in the first or second paragraph, and the extracted alternative names can be used as synonyms of the food material currently being searched in the knowledge base. When the food material name of the search result differs from the searched food material name from the knowledge base, the result is redirected web page data; the food material name and alternative names of the search result are extracted (after filtering out the searched food material name from the knowledge base) and are used as synonyms of the food material currently being searched. The extracted synonyms can be audited by a nutrition expert to correct erroneous data. The audited synonym data can be merged into the previously constructed synonym library, and duplicate data can be removed.
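The bold-text extraction described for S502 can be sketched with Python's standard `html.parser` module. This is a simplified illustration under stated assumptions: the page is treated as plain HTML in which alternative names appear inside `<b>` tags within the first two `<p>` elements, which may not hold for every real entry page (for example, pages with paragraphs inside info boxes). The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class BoldAliasParser(HTMLParser):
    """Collect text inside <b> tags found in the first `max_paragraphs` <p> elements."""

    def __init__(self, max_paragraphs: int = 2):
        super().__init__()
        self.max_paragraphs = max_paragraphs
        self.paragraphs_seen = 0
        self.in_paragraph = False
        self.bold_depth = 0
        self.aliases: list[str] = []
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraphs_seen += 1
            self.in_paragraph = self.paragraphs_seen <= self.max_paragraphs
        elif tag == "b" and self.in_paragraph:
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False
        elif tag == "b" and self.bold_depth > 0:
            self.bold_depth -= 1
            if self.bold_depth == 0:
                text = "".join(self._buffer).strip()
                self._buffer.clear()
                if text:
                    self.aliases.append(text)

    def handle_data(self, data):
        if self.bold_depth > 0:
            self._buffer.append(data)


def extract_aliases(html: str, searched_name: str) -> list[str]:
    """Return bold terms from the first two paragraphs, minus the searched name."""
    parser = BoldAliasParser()
    parser.feed(html)
    seen = {searched_name}  # filter out the name being searched and duplicates
    result = []
    for alias in parser.aliases:
        if alias not in seen:
            seen.add(alias)
            result.append(alias)
    return result
```

The returned list would still go through expert audit before being merged into the synonym library, as described above.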
S203, text data about food material entities are obtained from any food material data source, and a food material entity text data set is formed through the text data.
In an exemplary embodiment of the present invention, crawler technology can be used to crawl a large amount of web page data about food material entities from any food material data source, and DOM parsing can be used to obtain the text data in the web page data. The food material text in its various formats (namely the text data) is then cleaned; the cleaning specifically includes but is not limited to: removing invisible special characters (such as line feeds, tabs and carriage returns), removing spaces between English characters, and removing web page tags left in the text by incomplete parsing. After data cleaning, the cleaned multi-source food material text, together with the corresponding website information, page URL (uniform resource locator) and other data, can be stored in a temporary storage medium (for example, common storage systems such as MySQL or MongoDB).
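The cleaning operations listed above can be sketched in Python with regular expressions. The patterns are assumptions chosen for illustration: the set of invisible characters is extended with two zero-width code points that the text does not name, and the tag pattern only covers simple HTML-like tags. The function name `clean_text` is hypothetical.

```python
import re

# Leftover HTML-like tags such as <p>, </div> or <br/> (simplified pattern).
_TAG = re.compile(r"</?[A-Za-z][^<>]*>")
# Line feeds, tabs, carriage returns, plus zero-width characters (an assumption).
_INVISIBLE = re.compile(r"[\r\n\t\u200b\ufeff]")
# Spaces that sit directly between two English letters.
_SPACE_BETWEEN_LATIN = re.compile(r"(?<=[A-Za-z]) +(?=[A-Za-z])")


def clean_text(text: str) -> str:
    """Apply the three cleaning steps in order and trim the result."""
    text = _TAG.sub("", text)
    text = _INVISIBLE.sub("", text)
    text = _SPACE_BETWEEN_LATIN.sub("", text)
    return text.strip()
```

Note that removing spaces between English characters, as the text describes, merges adjacent English words; whether that is desirable depends on how the downstream word segmentation treats English tokens.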
S204, obtaining the candidate food material entity set according to the food material entity text data set, the knowledge base and the synonym base.
In an exemplary embodiment of the present invention, as shown in fig. 6, the obtaining the candidate food material entity set according to the food material entity text data set, the knowledge base and the synonym base may include S601-S603:
S601, performing word segmentation on each piece of text data in the food material entity text data set.
In an exemplary embodiment of the invention, the text data of the food material entity can be segmented by using ANSJ, wherein the ANSJ is a Chinese open source segmentation tool based on n-Gram + CRF + HMM.
S602, respectively calculating a first term frequency-inverse document frequency (TF-IDF) value of each word in the word segmentation result with respect to the knowledge base and a second TF-IDF value of the word with respect to the synonym base.
S603, comparing the first TF-IDF value and the second TF-IDF value with a preset TF-IDF threshold value respectively, and taking a set of food material entities corresponding to words meeting the condition that the first TF-IDF value is larger than the TF-IDF threshold value and/or the second TF-IDF value is larger than the TF-IDF threshold value as the candidate food material entity set.
In an exemplary embodiment of the present invention, each TF-IDF value calculated in S602 may be compared with a preset TF-IDF threshold S. If the TF-IDF value is greater than the threshold S, the corresponding food material entity may be retained as a candidate food material entity; after deduplication, it is added to the candidate food material entity set, which is stored in a temporary storage medium such as MongoDB. If the TF-IDF value is less than the TF-IDF threshold S, the food material entity may be discarded.
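Steps S602 and S603 can be sketched in Python as follows. The text does not say how a word's TF-IDF value "in the knowledge base" is defined, so the sketch makes an explicit assumption: each entity is represented by one token list (its name, alternative names and description), TF is the word's relative frequency in that list, and IDF is the natural logarithm of the number of entities divided by the number of entities containing the word. The function name `tfidf_candidates` and its parameters are hypothetical.

```python
import math
from collections import Counter


def tfidf_candidates(words, entity_docs, threshold):
    """Return IDs of entities for which some word's TF-IDF exceeds the threshold.

    words: tokens from one segmented text.
    entity_docs: dict entity ID -> list of tokens describing that entity
                 (assumed representation of the knowledge base or synonym base).
    """
    n_docs = len(entity_docs)
    doc_freq = Counter()
    for tokens in entity_docs.values():
        doc_freq.update(set(tokens))  # count each word once per entity

    candidates = set()  # a set removes duplicate entities automatically
    for entity_id, tokens in entity_docs.items():
        if not tokens:
            continue
        counts = Counter(tokens)
        for word in set(words):
            if counts[word] == 0:
                continue
            tf = counts[word] / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            if tf * idf > threshold:
                candidates.add(entity_id)
                break
    return candidates
```

Under this reading, the function would be run once against the knowledge base and once against the synonym base, and the union of the two results would form the candidate food material entity set.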
S102, matching first text data to be entity-linked and acquired from any food material data source with the candidate food material entity set, and establishing a link between a first food material entity in the first text data and a second food material entity in the candidate food material entity set which is successfully matched.
In an exemplary embodiment of the present invention, after the candidate food material entity set is created through the above steps, a food material entity link of any text data (such as the first text data) to be entity-linked, which is obtained from any food material data source, can be established based on the candidate food material entity set.
In an exemplary embodiment of the present invention, as shown in fig. 7, the matching of the first text data to be entity-linked, which is acquired from any food material data source, with the candidate food material entity set, and the link establishment between the first food material entity in the first text data and the second food material entity in the candidate food material entity set successfully matched may include S701-S704:
S701, performing word segmentation on the first text data and removing stop words to obtain a first context of the first text data, and performing word segmentation on the category labels, food material names and description words of each candidate food material entity in the candidate food material entity set to serve as a second context of the candidate food material entity.
In an exemplary embodiment of the present invention, the context of the first text data to be entity-linked and the context of each candidate food material entity may be segmented using ANSJ, and each context is expressed as the set of words produced by the segmentation.
S702, calculating the semantic similarity of the first context and the second context according to a preset similarity algorithm.
In an exemplary embodiment of the present invention, all candidate food material entities may be traversed, and the semantic similarity sim(x, e) between the food material name x to be entity-linked in the first context and each candidate food material entity e is calculated according to a preset similarity algorithm.
In an exemplary embodiment of the present invention, the preset similarity algorithm may include:
Figure BDA0001983132480000141
wherein x represents the food material name to be entity-linked in the first context, e represents the candidate food material entity, n represents the number of words in the first context, k represents the number of words in the second context, xc_i represents the i-th word in the first context, ec_j represents the j-th word in the second context, v(x) represents the word vector of x, v(xc_i) represents the word vector of xc_i, and v(ec_j) represents the word vector of ec_j, the word vectors being generated using the Skip-gram model; sim(v(x), v(xc_i)) represents the semantic similarity of x and xc_i, calculated as the cosine similarity of the word vectors of x and xc_i and used as the weight of the word xc_i; and sim(v(xc_i), v(ec_j)) represents the semantic similarity of xc_i and ec_j, calculated as the cosine similarity of the word vectors of xc_i and ec_j.
In the exemplary embodiment of the present invention, the semantic similarity of x and xc_i can be obtained by calculating the cosine similarity of the vectors v(x) and v(xc_i), that is,
$$\mathrm{sim}(v(x), v(xc_i)) = \frac{v(x) \cdot v(xc_i)}{\lVert v(x) \rVert \, \lVert v(xc_i) \rVert}$$
In the exemplary embodiment of the present invention, the semantic similarity of xc_i and ec_j can be obtained by calculating the cosine similarity of the vectors v(xc_i) and v(ec_j), that is,
$$\mathrm{sim}(v(xc_i), v(ec_j)) = \frac{v(xc_i) \cdot v(ec_j)}{\lVert v(xc_i) \rVert \, \lVert v(ec_j) \rVert}$$
Figure BDA0001983132480000144
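The similarity computation of S702 can be sketched in Python as follows. The aggregate formula itself is given only as an image that is not reproduced in this text, so the sketch assumes one form that is consistent with the symbol definitions: the weight sim(v(x), v(xc_i)) is multiplied by sim(v(xc_i), v(ec_j)), summed over all word pairs, and divided by n times k. That normalization, the skipping of words without a vector, and the names `cosine` and `context_similarity` are all assumptions; the `vectors` dictionary stands in for word vectors that would come from a trained Skip-gram model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_similarity(x, first_context, second_context, vectors):
    """Assumed aggregate: weighted mean of pairwise cosine similarities.

    x: food material name to be linked.
    first_context / second_context: lists of segmented words.
    vectors: dict word -> word vector (list of floats).
    """
    # Words without a vector are skipped (an assumption, not stated in the text).
    xs = [w for w in first_context if w in vectors]
    es = [w for w in second_context if w in vectors]
    if x not in vectors or not xs or not es:
        return 0.0
    total = 0.0
    for xc in xs:
        weight = cosine(vectors[x], vectors[xc])  # sim(v(x), v(xc_i))
        for ec in es:
            total += weight * cosine(vectors[xc], vectors[ec])  # sim(v(xc_i), v(ec_j))
    return total / (len(xs) * len(es))
```

Weighting each context word by its similarity to x lets words closely related to the food material name contribute more to the score than incidental words in the same sentence.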
S703, selecting the food material entity in the second context with the maximum similarity to the first context, and calculating whether the maximum similarity value is greater than a preset similarity threshold value.
In an exemplary embodiment of the invention, all candidate food material entities are traversed, and the semantic similarity between the food material name to be entity-linked and each candidate food material entity is calculated, so as to find the candidate food material entity with the largest semantic similarity. For the selected candidate food material entity, whether the maximum similarity value is greater than a set similarity threshold S0 is then judged, and the subsequent processing step is determined according to the judgment result.
S704, when the maximum similarity value is larger than the preset similarity threshold, connecting the first food material entity in the first context to the selected second food material entity with the maximum similarity to the first context.
In an exemplary embodiment of the present invention, the method may further include: and after the first food material entity is connected to the second food material entity, outputting the related entity information of the second food material entity in the knowledge base.
In an exemplary embodiment of the invention, when no second food material entity in the candidate food material entity set matches the first food material entity in the first text data, a null indicator (NULL) is output and the first food material entity in the first text data is added to the knowledge base.
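The decision logic of S703 and S704, together with the null output, can be sketched in a few lines of Python. The sketch assumes the similarity scores have already been computed for every candidate; the function name `link_entity` is hypothetical, and Python's `None` stands in for the NULL indicator.

```python
def link_entity(scores, threshold):
    """Pick the candidate with the highest similarity if it exceeds the threshold.

    scores: dict candidate entity ID -> similarity to the first context.
    Returns the selected entity ID, or None when no candidate qualifies.
    """
    if not scores:
        return None
    best_id = max(scores, key=scores.get)
    return best_id if scores[best_id] > threshold else None
```

A `None` result would signal the caller to add the unlinked food material entity to the knowledge base, as described above.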
An embodiment of the present invention further provides a food material entity linking apparatus 1 between multi-source food material data, as shown in fig. 8, the apparatus may include a processor 11 and a computer-readable storage medium 12, where the computer-readable storage medium 12 stores instructions, and when the instructions are executed by the processor 11, the food material entity linking method between multi-source food material data described in any one of the above is implemented.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, and the functional modules/units of the systems and devices disclosed above, may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Claims (10)

1. A method for linking food material entities among multi-source food material data is characterized by comprising the following steps:
acquiring a candidate food material entity set used for food material entity linkage;
matching first text data to be entity-linked, which is acquired from any food material data source, with the candidate food material entity set, and establishing a link between a first food material entity in the first text data and a second food material entity in the candidate food material entity set which is successfully matched.
2. The method of food material entity linking between multi-source food material data of claim 1, wherein the obtaining of the set of candidate food material entities for food material entity linking comprises:
establishing a knowledge base of food material entities according to food material data in a first food material data source;
establishing a synonym library of food material entities according to the food material data in the second food material data source;
acquiring text data about food material entities from any food material data source, and forming a food material entity text data set through the text data;
acquiring the candidate food material entity set according to the food material entity text data set, the knowledge base and the synonym base;
wherein the first food material data source comprises: gourmet food material encyclopedia;
the second food material data source comprises: the Chinese Food Composition Table and/or Wikipedia.
3. The method of food material entity linking between multi-source food material data of claim 2, wherein the obtaining the set of candidate food material entities from the set of food material entity text data, the knowledge base and the synonym base comprises:
segmenting each section of text data in the food material entity text data set;
respectively calculating a first term frequency-inverse document frequency (TF-IDF) value of each word in the word segmentation result with respect to the knowledge base and a second TF-IDF value of the word with respect to the synonym base;
and respectively comparing the first TF-IDF value and the second TF-IDF value with a preset TF-IDF threshold value, and taking a set of food material entities corresponding to words meeting the condition that the first TF-IDF value is larger than the TF-IDF threshold value and/or the second TF-IDF value is larger than the TF-IDF threshold value as the candidate food material entity set.
4. The method of food material entity linking between multi-source food material data of claim 2, wherein the building of a knowledge base of food material entities from food material data in a first food material data source comprises:
extracting food material information from the first food material data source; the food material information includes: the category labels of all food materials and various attributes of the corresponding pages of each food material entity in each category label;
cleaning the food material information to obtain descriptions which are related to each food material entity and meet preset requirements;
performing expert review on the cleaned food material information to modify error data and non-standard data in the description of the food material entity;
and storing the description about the food material entity after being audited by the expert into a preset food material knowledge graph as the knowledge base.
5. The method of food material entity linking between multi-source food material data of claim 2, wherein when the second food material data source is the Chinese Food Composition Table, the establishing a synonym library of food material entities according to the food material data in the second food material data source comprises:
extracting food material names in the Chinese food component table document according to preset rules by using a text reading and writing tool to form a food material name set; extracting one or more food material alternative names corresponding to the food material names by using a regular expression to form a food material alternative name set;
traversing the food material name set and the food material alternative name set, and aligning the food material names and the food material alternative names in the knowledge base by using a character string matching method;
and when the alignment is successful, recording the mapping relation between the identity ID of the corresponding food material entity in the knowledge base and the food material name and food material alternative name of the same food material entity in the Chinese food component table to form the synonym library.
6. The method of food material entity linking between multi-source food material data of claim 2 or 5, wherein when the second food material data source is Wikipedia, the establishing a synonym library of food material entities according to the food material data in the second food material data source comprises:
traversing all food material names in the knowledge base, retrieving the food material names in the Wikipedia by utilizing a crawler technology, and reserving webpage data of a retrieval result;
when the food material name of the retrieval result is the same as the retrieved food material name from the knowledge base, extracting the alternative names of the food material corresponding to the current food material name, and taking the extracted alternative names as synonyms of the food material currently being retrieved in the knowledge base; wherein the alternative names in a Wikipedia entry are located in one or more bold passages in the first or second paragraph;
when the food material name of the retrieval result is different from the food material name in the retrieved knowledge base, extracting one or more food material names in the retrieval result, and taking the one or more food material names as synonyms of the current retrieved food material in the knowledge base;
and forming the synonym library from the extracted synonym data after expert audit, or merging the audited synonym data into the previously constructed synonym library, and removing duplicate data.
7. The method of claim 2, wherein the matching of a first text datum obtained from any food material data source and to be entity-linked with the candidate food material entity set and the linking of a first food material entity in the first text datum with a second food material entity in the candidate food material entity set that is successfully matched comprises:
segmenting the first text data, removing stop words to obtain a first context of the first text data, and segmenting class labels, food names and description words of each candidate food material entity in the candidate food material entity set to serve as a second context of the candidate food material entity;
calculating semantic similarity of the first context and the second context according to a preset similarity algorithm;
selecting the food material entity in the second context with the maximum similarity with the first context, and calculating whether the value of the maximum similarity is greater than a preset similarity threshold value;
and when the maximum similarity value is larger than the preset similarity threshold value, connecting the first food material entity in the first context to the selected second food material entity with the maximum similarity to the first context.
8. The method of food material entity linking between multi-source food material data of claim 7, wherein the preset similarity algorithm comprises:
Figure FDA0001983132470000041
wherein x represents the food material name to be entity-linked in the first context, e represents the candidate food material entity, n represents the number of words in the first context, k represents the number of words in the second context, xc_i represents the i-th word in the first context, ec_j represents the j-th word in the second context, v(x) represents the word vector of x, v(xc_i) represents the word vector of xc_i, and v(ec_j) represents the word vector of ec_j, the word vectors being generated using the Skip-gram model; sim(v(x), v(xc_i)) represents the semantic similarity of x and xc_i, calculated as the cosine similarity of the word vectors of x and xc_i and used as the weight of the word xc_i; and sim(v(xc_i), v(ec_j)) represents the semantic similarity of xc_i and ec_j, calculated as the cosine similarity of the word vectors of xc_i and ec_j.
9. The method for food material entity linking between multi-source food material data of claim 7, wherein the method further comprises:
after the first food material entity is linked to the second food material entity, outputting related entity information of the second food material entity in the knowledge base; and/or,
when no second food material entity matching the first food material entity in the first text data exists in the candidate food material entity set, outputting a null indicator NULL, and supplementing the first food material entity in the first text data into the knowledge base.
10. An apparatus for linking food material entities among multi-source food material data, comprising a processor and a computer-readable storage medium, wherein instructions are stored in the computer-readable storage medium, and when the instructions are executed by the processor, the method for linking food material entities among multi-source food material data according to any one of claims 1 to 9 is realized.
CN201910156613.7A 2019-03-01 2019-03-01 Food material entity linking method and device between multi-source food material data Active CN111708891B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910156613.7A CN111708891B (en) 2019-03-01 2019-03-01 Food material entity linking method and device between multi-source food material data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910156613.7A CN111708891B (en) 2019-03-01 2019-03-01 Food material entity linking method and device between multi-source food material data

Publications (2)

Publication Number Publication Date
CN111708891A true CN111708891A (en) 2020-09-25
CN111708891B CN111708891B (en) 2023-12-08

Family

ID=72536055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910156613.7A Active CN111708891B (en) 2019-03-01 2019-03-01 Food material entity linking method and device between multi-source food material data

Country Status (1)

Country Link
CN (1) CN111708891B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650821A (en) * 2021-01-20 2021-04-13 济南浪潮高新科技投资发展有限公司 Entity alignment method fusing Wikidata

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202382A (en) * 2016-07-08 2016-12-07 南京缘长信息科技有限公司 Link instance method and system
CN106250412A (en) * 2016-07-22 2016-12-21 浙江大学 The knowledge mapping construction method merged based on many source entities
CN107506486A (en) * 2017-09-21 2017-12-22 北京航空航天大学 A kind of relation extending method based on entity link

Also Published As

Publication number Publication date
CN111708891B (en) 2023-12-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant