CN116501893A

CN116501893A - Construction method of healthy diet knowledge graph

Info

Publication number: CN116501893A
Application number: CN202310523021.0A
Authority: CN
Inventors: 葛海鑫; 王诗童; 牛子越; 张健; 牛广林; 韦星星
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2023-05-10
Filing date: 2023-05-10
Publication date: 2023-07-28

Abstract

The invention discloses a method for constructing a healthy diet knowledge graph and belongs to the technical field of knowledge graph construction. The construction steps comprise: S1, analyzing the HTML structure of a website and capturing the names, images and associated information of the required dishes and food materials; S2, classifying and aligning the dishes and food materials, which comprises: extracting feature vectors from the images of the dishes and food materials by using a CLIP model, obtaining image similarity from the feature vectors, converting the image similarity into a probability distribution, and classifying and aligning the dishes and food materials according to the probability distribution; and S3, extracting the entities and relations from the names and information of the aligned dishes and food materials, converting them into a triplet format, and generating the healthy diet knowledge graph. The invention can improve the accuracy and generalization of entity alignment and yield a high-quality healthy diet knowledge graph, thereby providing users with more accurate and scientific healthy diet knowledge.

Description

Construction method of healthy diet knowledge graph

Technical Field

The invention relates to the technical field of knowledge graph construction, in particular to a construction method of a healthy diet knowledge graph.

Background

With the continuous development of society and the improvement of living standards, more and more people pay attention to whether their diet is scientific and balanced.

At present, dietary recommendations are mainly generated from a knowledge graph. The knowledge graph is a model or system that reflects the deep relations between dishes and food materials and can take multiple factors into account to provide users with comprehensive diet collocation suggestions.

In the diet field, many dishes and food materials are known by several different names, so the corresponding entities need to be fused. The traditional method maps food materials that are expressed differently but have the same meaning onto a single meta-food material by constructing a synonym mapping table of meta-food materials; however, constructing this table manually generalizes poorly.

Therefore, how to design a data-driven, automated entity alignment method with better generalization is a problem that those skilled in the art urgently need to solve.

Disclosure of Invention

In view of the above, the invention provides a method for constructing a healthy diet knowledge graph, and aims to provide a novel method for aligning dishes and food materials so as to solve the problem of low generalization caused by manually constructing meta-food materials in the prior art.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

A construction method of a healthy diet knowledge graph comprises the following steps:

S1, analyzing the HTML structure of a website, and capturing the names, images and associated information of the required dishes and food materials;

S2, classifying and aligning the dishes and food materials, which comprises:

extracting feature vectors from the images of the dishes and food materials by using a CLIP model, obtaining image similarity from the feature vectors, converting the image similarity into a probability distribution, and classifying and aligning the dishes and food materials according to the probability distribution;

and S3, extracting the entities and relations from the names and information of the aligned dishes and food materials, converting them into a triplet format, and generating the healthy diet knowledge graph.

To further optimize the above technical solution, step S1 further includes separating the website URLs from the parsing rules, so that new website sources can be added without modifying the core code.

In order to further optimize the technical scheme, step S1 further includes removing repeated information in the dishes and the food materials, and normalizing the food materials and the names of the dishes.

In order to further optimize the above technical solution, the step of obtaining the image similarity includes:

determining the category names of dishes and food materials, and extracting text feature vectors of the category names by using a CLIP model;

and obtaining the image similarity according to the cosine similarity between the feature vectors of the images of the dishes and food materials and the text feature vector of each category name.

In order to further optimize the technical scheme, the similarity values are converted into a probability distribution by using a Softmax function, which comprises determining the exponential of each similarity value and obtaining the probability distribution from the ratio of each exponential to the sum of all the exponentials.

In order to further optimize the technical scheme, in step S2, a confidence threshold is set for the probability distribution, and when the probability of a classification result is lower than the set confidence threshold, the classification result is manually corrected.

In order to further optimize the technical scheme, in step S3, after converting into the triplet format, entities and relations of the same type in the triples are combined first, and repeated triples are eliminated.

According to the technical scheme, the invention discloses a construction method of a healthy diet knowledge graph, and compared with the prior art, in the construction method, after dish and food material information is acquired, firstly, a list is established and converted into a tensor format, then images of the dish and the food material are converted into vectors by using a CLIP model, similarity between the images is acquired, the images are converted into probability distribution, the dish and the food material are classified and aligned according to the probability distribution, and finally, the aligned dish and food material are converted into triples, so that the healthy diet knowledge graph is established. Compared with the existing knowledge graph construction method, the method can improve the accuracy and generalization of entity alignment, and obtain the high-quality healthy diet knowledge graph, so that more accurate and scientific healthy diet knowledge is provided for users.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for constructing a knowledge graph of a healthy diet of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The embodiment of the invention discloses a construction method of a healthy diet knowledge graph. The method mainly addresses the problem that one dish or food material may be known by several synonymous names, and uses a CLIP model to classify and align food material images so as to ensure the accuracy of the food materials in the knowledge graph.

The method comprises the following specific steps:

S1, separating the website URLs from the parsing rules, and capturing the names, images and associated information of the required dishes and food materials by adopting a data capturing strategy. Depending on the structure of each website, a different capturing strategy, based on XPath or CSS selectors, is used to extract the required information.

Further, the invention mainly uses Python crawler technology to crawl dish and food material information from a plurality of websites. The websites include Meishi Tianxia (美食天下, https://www.meishichina.com/) and other websites that provide rich dish and food material information.

In order to acquire the required data, a Python crawler program is written to analyze the HTML structure of a website and extract the target information. Specifically, a crawler suited to each recipe website is written according to the structural characteristics of that website, which ensures the accuracy of the crawled data. To achieve multi-site adaptation, the website URLs and the parsing rules are separated, so that new website sources can be added without modifying the core code.

Specifically, for each website, the HTML structure is analyzed to locate the node containing the target information, and then the required dish information is accurately captured through XPath or CSS selector and other technologies.
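The separation of URLs and parsing rules from the core code can be sketched as follows. This is a minimal illustration, not the crawler of the embodiment: the site key, the URL, the XPath expressions and the sample page are hypothetical, and the standard-library ElementTree parser is used only because the sample markup is well formed (real pages usually need a tolerant HTML parser).

```python
import xml.etree.ElementTree as ET

# Per-site configuration: the URL and the parsing rules live in data, not code.
# The site key, URL and XPath expressions are hypothetical placeholders.
SITE_RULES = {
    "example-site": {
        "url": "https://recipes.example.com/dish/1",
        "fields": {
            # field name -> (XPath, attribute to read, or None for element text)
            "name": (".//h1[@class='title']", None),
            "image": (".//img[@class='cover']", "src"),
            "taste": (".//span[@class='taste']", None),
        },
    },
}

def parse_page(site: str, html: str) -> dict:
    """Core parser: applies whatever rules are registered for the given site."""
    root = ET.fromstring(html)
    record = {"source": SITE_RULES[site]["url"]}
    for field, (xpath, attr) in SITE_RULES[site]["fields"].items():
        node = root.find(xpath)
        if node is None:
            record[field] = None          # rule matched nothing on this page
        elif attr is None:
            record[field] = (node.text or "").strip()
        else:
            record[field] = node.get(attr)
    return record

# Well-formed sample page standing in for a downloaded dish page.
SAMPLE = """<html><body>
<h1 class="title"> Braised meat </h1>
<img class="cover" src="/img/braised.jpg"/>
<span class="taste">salty-fresh</span>
</body></html>"""

record = parse_page("example-site", SAMPLE)
print(record)
```

Under this layout, supporting another website only requires a new entry in SITE_RULES; parse_page itself stays unchanged.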

It should be noted that the crawling process ensures that all relevant information on the dishes and food materials is acquired. The dish information comprises dish names, dish images, dish attributes, tastes, contained food materials, related efficacy and the like; the food material information comprises food material names, food material images, food material attributes, nutritional ingredients, suitability and contraindication relations, efficacy and the like. This supports comprehensive construction of the subsequent knowledge graph.

Because step S2 classifies and aligns the dishes and food materials by applying the CLIP model to their images, so as to resolve the synonymy problem, the image links on the websites are also parsed in this step, and high-quality food material and dish images are obtained and downloaded to local storage.

In one embodiment, step S1 further includes removing duplicate information on the dishes and food materials and normalizing the food material and dish names, which includes unifying letter case, removing redundant spaces and symbols, and unifying abbreviations and full names, so that the data is clean and ordered. In this application, a Python data processing library such as Pandas is used to perform the deduplication, screening and formatting operations on the data.
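The normalization and deduplication described above might look like the sketch below. It uses only the Python standard library so that it runs anywhere; the same steps map directly onto Pandas string operations and drop_duplicates. The abbreviation table and the record fields are hypothetical, since the source does not list concrete entries.

```python
import re
import unicodedata

# Hypothetical abbreviation-to-full-name table (not taken from the source).
FULL_NAMES = {"msg": "monosodium glutamate"}

def normalize_name(name: str) -> str:
    """Unify width and case, drop symbols, collapse spaces, expand abbreviations."""
    text = unicodedata.normalize("NFKC", name)  # full-width -> half-width forms
    text = text.lower()                          # unify letter case
    text = re.sub(r"[^\w\s]", "", text)          # remove punctuation and symbols
    text = re.sub(r"\s+", " ", text).strip()     # remove redundant spaces
    return FULL_NAMES.get(text, text)            # abbreviation -> full name

def deduplicate(records):
    """Keep the first record for each (normalized name, kind) pair."""
    seen = set()
    unique = []
    for rec in records:
        key = (normalize_name(rec["name"]), rec.get("kind"))
        if key not in seen:
            seen.add(key)
            unique.append({**rec, "name": key[0]})
    return unique

raw = [
    {"name": "  Tomato!! ", "kind": "food material"},
    {"name": "tomato", "kind": "food material"},
    {"name": "MSG", "kind": "food material"},
]
clean = deduplicate(raw)
print(clean)
```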

S2, classifying and aligning dishes and food materials, wherein in the step, the method for performing multi-mode entity alignment by combining the image features of the food materials is adopted, and specifically comprises the following steps:

converting images of dishes and food materials into vectors by using a CLIP model, acquiring the similarity between the images and the categories of the dishes and the food materials according to the vectors, converting the similarity into probability distribution by using a Softmax function, and classifying and aligning the dishes and the food materials according to the probability distribution;

the method for obtaining the image similarity comprises the following steps:

The food material image and the food material category names are each encoded into feature vectors by using the existing ViT-B/32 version of the CLIP model. The preprocessed picture is input into the CLIP model, and an image feature vector is extracted; the category names are converted into a text tensor format suitable for the CLIP model, and text feature vectors are extracted. The cosine similarity between the image feature vector and each category text feature vector is then calculated to obtain the similarity between the input image and each category.
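The similarity computation itself can be illustrated without loading a model. In the sketch below, the three-dimensional vectors are toy stand-ins for the image and text features that CLIP would produce (real ViT-B/32 features have 512 dimensions); only the cosine similarity step is shown.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for CLIP features (assumed values, for illustration).
image_vec = [0.9, 0.1, 0.2]
text_vecs = {
    "potato": [1.0, 0.0, 0.1],
    "carrot": [0.1, 1.0, 0.0],
    "tomato": [0.0, 0.2, 1.0],
}

# One similarity value per category name.
similarities = {
    name: cosine_similarity(image_vec, vec) for name, vec in text_vecs.items()
}
best = max(similarities, key=similarities.get)
print(similarities, best)
```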

The method for converting the similarity into the probability distribution comprises the following steps:

The invention converts the similarity values into a probability distribution by using a Softmax function. The Softmax function maps each similarity value to a value between 0 and 1 and ensures that all the values sum to 1, thus forming a probability distribution. Specifically, the exponential of each similarity value is calculated first, and each exponential is then divided by the sum of all the exponentials to obtain the probability distribution.
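Written as a formula, with s_i denoting the similarity value for category i and K the number of categories (symbols introduced here only for illustration), the conversion reads:

```latex
p_i = \frac{\exp(s_i)}{\sum_{j=1}^{K} \exp(s_j)}, \qquad i = 1, \dots, K
```

Each p_i lies strictly between 0 and 1, and the p_i sum to 1 because every term shares the same denominator.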

Assume that there are 3 food material categories, "potato", "carrot" and "tomato", and one food material picture to be classified. First, the category names are converted into text tensors, and text feature vectors are extracted by using the CLIP model. The food material picture to be classified is then preprocessed, and its image feature vector is extracted by using the CLIP model. Next, the cosine similarity between the image feature vector and each category text feature vector is calculated, giving a list of similarity values [0.7, 0.2, 0.1]. The similarity values are then converted into a probability distribution by using the Softmax function, the result being approximately [0.46, 0.28, 0.25]. The category with the highest probability is selected as the output according to the probability distribution. In this embodiment, the category with the highest probability is "potato", so the output is "potato".
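The worked example can be reproduced with a few lines of Python. This sketch assumes that the Softmax function is applied directly to the similarity values, with no temperature or logit scaling, since the text mentions none; multiplying the similarities by a positive scale factor first would sharpen the distribution without changing the winning category.

```python
import math

def softmax(values):
    """Exponentiate each value and divide by the sum of the exponentials."""
    # Subtracting the maximum first avoids overflow and leaves the result unchanged.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

categories = ["potato", "carrot", "tomato"]
similarities = [0.7, 0.2, 0.1]      # similarity values from the example

probs = softmax(similarities)
predicted = categories[probs.index(max(probs))]   # highest-probability category
print([round(p, 2) for p in probs], predicted)
```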

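When pictures of the same food material published under three different names (for example, the Chinese synonyms 土豆, 马铃薯 and 洋芋, all of which denote the potato) are input, the output in each case is "potato"; the three differently named but synonymous food materials are thus aligned and classified as "potato".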
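Once every scraped name carries a predicted category, the alignment can be recorded as a simple lookup table. The sketch below is one possible way to do this; the names and predictions are illustrative values, not data from the embodiment.

```python
from collections import defaultdict

# (name as scraped, category predicted from its image); illustrative values.
predictions = [
    ("土豆", "potato"),
    ("马铃薯", "potato"),
    ("洋芋", "potato"),
    ("胡萝卜", "carrot"),
]

def build_alignment(predictions):
    """Return (name -> canonical category, canonical category -> list of names)."""
    canonical_of = {}
    aliases = defaultdict(list)
    for name, category in predictions:
        canonical_of[name] = category
        aliases[category].append(name)
    return canonical_of, dict(aliases)

canonical_of, aliases = build_alignment(predictions)
print(canonical_of["洋芋"], aliases["potato"])
```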

In one embodiment, the aligned data is populated into a tensor format list containing dishes and food materials.

CLIP is a cross-modal learning technology. After pretraining on a large number of image-text pairs, it maps images and texts into the same representation space, so that image and text information can be processed together to accomplish tasks such as image classification and description generation.

However, although the model identifies food materials accurately in most cases, it may misclassify edge cases or easily confused items. In order to ensure the accuracy of the food material entities in the knowledge graph, the classification results are additionally screened and corrected.

In one embodiment, a confidence threshold is set for the probability distribution to screen out potentially problematic classification results; when the confidence of a result is lower than the threshold, the classification result is manually corrected. Specifically, the confidence threshold may be set to 0.9; for samples whose confidence output by the CLIP model is lower than 0.9, the classification result is considered unreliable, and manual inspection is required. During the inspection, the classification results of these samples can be corrected by referring to relevant data and expert opinion.
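The screening step can be sketched as a split of the classification results into an accepted list and a manual-review queue. The 0.9 threshold comes from the text; the record layout and the sample probabilities are assumptions made for illustration.

```python
CONFIDENCE_THRESHOLD = 0.9   # value given in the embodiment

def screen(results, threshold=CONFIDENCE_THRESHOLD):
    """Split results into accepted ones and ones that need manual inspection."""
    accepted, needs_review = [], []
    for item in results:
        label = max(item["probs"], key=item["probs"].get)
        confidence = item["probs"][label]
        entry = {"image": item["image"], "label": label, "confidence": confidence}
        # Results below the threshold are considered unreliable.
        (accepted if confidence >= threshold else needs_review).append(entry)
    return accepted, needs_review

# Hypothetical classifier outputs.
results = [
    {"image": "a.jpg", "probs": {"potato": 0.95, "carrot": 0.03, "tomato": 0.02}},
    {"image": "b.jpg", "probs": {"potato": 0.50, "carrot": 0.30, "tomato": 0.20}},
]
accepted, needs_review = screen(results)
print(accepted, needs_review)
```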

In another embodiment, the classification results are fully reviewed to discover and correct erroneous classifications that the model may miss.

Therefore, the accuracy of entity alignment is further improved, the quality of food material entities in the knowledge graph is ensured, and more reliable data support is provided for subsequent application.

After the entity alignment is completed, the aligned data is converted into a triplet format to facilitate construction of the knowledge graph.

First, the entities and relations to be represented (such as main materials, auxiliary materials, tastes, efficacy and the like) are extracted from the aligned dish and food material information. This involves analyzing the original data and determining the types of the entities and relations and how they are expressed in the data.

After extraction, the data is converted into the triplet format. When complex relationships are processed, intermediate entities can be introduced, or attribute values can be expressed as part of a triplet. For example, when a dish has multiple tastes, a separate triplet can be created for each taste, such as (braised meat, taste, salty-fresh) and (braised meat, taste, slightly spicy).
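The conversion of one dish record into triples, with one triplet per value of a multi-valued attribute, can be sketched as follows. The record layout and the list of relation names are assumptions chosen to match the examples in the text.

```python
# Relations to export; the names follow the examples given in the text.
RELATIONS = ("main material", "auxiliary material", "taste", "efficacy")

def record_to_triples(dish):
    """Emit one (head, relation, tail) triplet per attribute value."""
    triples = []
    for relation in RELATIONS:
        values = dish.get(relation, [])
        if isinstance(values, str):      # allow a single value or a list
            values = [values]
        for value in values:
            triples.append((dish["name"], relation, value))
    return triples

# Hypothetical aligned record for one dish.
dish = {
    "name": "braised meat",
    "main material": "pork",
    "taste": ["salty-fresh", "slightly spicy"],
}
triples = record_to_triples(dish)
print(triples)
```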

In one embodiment, after the data is converted into the triplet format, entities and relations of the same type in the triples are first merged and duplicate triples are eliminated, which reduces the storage space of the knowledge graph and improves query efficiency.
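A minimal sketch of the merging and deduplication step is given below. The table that maps relation-name variants onto one canonical relation is hypothetical; the text does not specify how equivalent relations are identified.

```python
# Hypothetical mapping of relation-name variants to a canonical relation.
RELATION_ALIASES = {"flavor": "taste", "flavour": "taste"}

def merge_and_deduplicate(triples):
    """Canonicalize relation names, then drop repeated triples (order kept)."""
    merged = []
    for head, relation, tail in triples:
        relation = relation.strip().lower()
        relation = RELATION_ALIASES.get(relation, relation)
        merged.append((head.strip(), relation, tail.strip()))
    # dict.fromkeys keeps the first occurrence of each triplet, in input order.
    return list(dict.fromkeys(merged))

raw_triples = [
    ("braised meat", "taste", "salty-fresh"),
    ("braised meat", "Flavor", "salty-fresh"),
    ("braised meat", "taste", "slightly spicy"),
    ("braised meat", "taste", "salty-fresh"),
]
unique_triples = merge_and_deduplicate(raw_triples)
print(unique_triples)
```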

The integrated and de-duplicated triples can be directly used for constructing a knowledge graph to represent the relationship between entities, and provide abundant data support for subsequent applications.

In one embodiment, the triplet data is stored and managed using a Neo4j graph database. Neo4j is a high-performance graph database management system that is well suited to storing and querying complex entity relations.

The specific steps for generating the healthy diet knowledge graph comprise:

First, the knowledge system of the healthy diet field is analyzed, and a hierarchical ontology model is constructed.

Then, the Neo4j graph database is installed, and a database instance is created; the py2neo library for Python is installed, and the URL, user name and password of the Neo4j graph database are obtained so that the py2neo library can connect to the Neo4j graph database.

Finally, the Node and Relationship classes of the py2neo library are used to create nodes and relationships from the triplet data, which are then committed and uploaded to the Neo4j graph database. To increase the import speed, a Transaction object is created first; each triplet is then processed in a loop, converted into nodes and relationships, and added to the Transaction; after all the triples have been processed, the Transaction is committed so that the data is imported into the database in one batch.
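The batch import described above might be sketched as follows. Several points are assumptions: the py2neo 2021.x API is assumed, in which a transaction is committed with graph.commit(tx) (older releases use tx.commit() instead); the node label "Entity" is a placeholder for the labels of the real ontology; and the connection URL and credentials in the usage comment are placeholders.

```python
def import_triples(graph, triples):
    """Create nodes and relationships for all triples in a single transaction."""
    # Imported lazily so that this helper can be defined without the driver installed.
    from py2neo import Node, Relationship

    nodes = {}  # entity name -> Node, so that each entity is created only once

    def node_for(name):
        if name not in nodes:
            # "Entity" is a placeholder label (assumption); a real ontology
            # would distinguish labels such as dish and food material.
            nodes[name] = Node("Entity", name=name)
        return nodes[name]

    tx = graph.begin()                        # one transaction for the whole batch
    for head, relation, tail in triples:
        tx.create(Relationship(node_for(head), relation, node_for(tail)))
    graph.commit(tx)                          # py2neo 2021.x style commit
    return len(nodes)

# Usage (requires a running Neo4j instance; URL and credentials are placeholders):
#   from py2neo import Graph
#   graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
#   import_triples(graph, [("braised meat", "taste", "salty-fresh")])
```

Committing once after the loop, rather than once per triplet, is what makes the import a single batch.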

After the data is imported into the Neo4j graph database, a healthy diet knowledge graph is obtained that contains concepts such as dishes, food materials, cuisines, main materials, auxiliary materials, ingredients, tastes, preparation steps, nutrient elements, efficacy and dish pictures. Through the knowledge graph, the relations between dishes can be easily queried and analyzed, which lays a foundation for subsequent research and application.

In one embodiment, the contraindication information and the efficacy information of the dishes are integrated into the healthy diet knowledge graph, so as to better meet users' needs in food material selection and diet planning and to provide personalized diet suggestions, allowing users to easily look up the food materials and dishes that suit their physical condition and health requirements.

The knowledge graph constructed by the invention can be applied to the following scenes:

1. It can be applied in a healthy diet app to provide users with scientific and reasonable diet collocation suggestions and menu recommendations;

2. It can be applied in the research and development of nutritional health care products to help develop more scientific and effective products;

3. It can be applied in medical institutions to give doctors more comprehensive nutritional guidance and help patients better manage their diseases;

4. It can be applied in the catering industry to help chefs better understand how to combine food materials scientifically and develop healthier, more nutritious dishes.

In this specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. As for the device disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. The method for constructing the healthy diet knowledge graph is characterized by comprising the following steps of:

2. The method according to claim 1, wherein step S1 further includes separating the website URLs from the parsing rules.

3. The method according to claim 1, wherein in step S1, the repeated information in the dishes and the food materials is removed, and the food materials and the names of the dishes are normalized.

4. The method for constructing a healthy diet knowledge graph according to claim 1, wherein the step of obtaining the image similarity comprises:

5. The method for constructing a healthy diet knowledge graph according to claim 1, wherein converting the similarity values into a probability distribution by using a Softmax function includes determining the exponential of each similarity value, and obtaining the probability distribution according to the ratio of each exponential to the sum of all the exponentials.

6. The method according to claim 1, wherein in step S2, a confidence threshold is set for the probability distribution, and when the probability of a classification result is lower than the set confidence threshold, the classification result is manually corrected.

7. The method according to claim 1, wherein in step S3, after converting to the triplet format, entities and relations of the same type in the triples are combined first, and duplicate triples are eliminated.