WO2021147786A1 - Knowledge graph construction method and apparatus, storage medium, and electronic device

Info

Publication number: WO2021147786A1
Application number: PCT/CN2021/072241
Authority: WO
Inventors: 李慧; 许蕾; 郝吉芳; 杨卓士; 商晓健; 王炳乾
Original assignee: 京东方科技集团股份有限公司
Priority date: 2020-01-20
Filing date: 2021-01-15
Publication date: 2021-07-29
Also published as: CN111241212A; CN111241212B

Abstract

An art field knowledge graph construction method and apparatus, a storage medium, and an electronic device. The method comprises: performing first preprocessing on structured data in an internal art data source and an external art data source to generate first structured data (S110); performing second preprocessing on unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data (S120); performing fusion processing on the first structured data and the second structured data to generate fused art data (S130), wherein the fused art data comprises an art entity and an art relationship corresponding to the art entity; and generating an art triad according to the art entity and the art relationship, and generating an art domain knowledge graph according to the art triad (S140).

Description

Method and device for constructing knowledge graph, storage medium and electronic equipment

Cross reference

This disclosure claims priority to Chinese patent application No. 202010066621.5, filed on January 20, 2020 and titled "Knowledge Graph Construction Method and Apparatus, Storage Medium, and Electronic Equipment", the entire content of which is incorporated herein by reference.

Technical field

The present disclosure relates to the technical field of knowledge graph construction, and in particular to a method for constructing a knowledge graph in the art field, a device for constructing a knowledge graph in the art field, a computer-readable storage medium, and an electronic device.

Background

A knowledge graph is also called a scientific knowledge graph. A knowledge graph uses visualization technology to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the interrelationships among them. It is a family of graphics that show the development process and structural relationships of knowledge, and it provides a better way to organize, manage, and understand the massive amount of information on the Internet. The knowledge graph is also the prototype for building next-generation search engines, making search more semantic and intelligent. At present, there are two types of knowledge graphs: general knowledge graphs and domain knowledge graphs. The domain knowledge graph, also called the industry knowledge graph or the vertical knowledge graph, is usually oriented to a specific field and is equivalent to an industry knowledge base based on semantic technology. Since the domain knowledge graph is constructed from industry data, it has a more rigorous and richer data model, and it also places higher requirements on the depth and accuracy of domain knowledge.

However, existing methods for constructing domain knowledge graphs have significant defects.

It should be noted that the information disclosed in the above background section is only used to enhance the understanding of the background of the present disclosure, and therefore may include information that does not constitute the prior art known to those of ordinary skill in the art.

Summary

The purpose of the present disclosure is to overcome the above-mentioned shortcomings of the prior art, and to provide a method for constructing a knowledge graph in the art field, a device for constructing a knowledge graph in the art field, a computer-readable storage medium, and an electronic device.

According to a first aspect of the present disclosure, there is provided a method for constructing a knowledge graph in the art field, the method comprising: performing first preprocessing on structured data in an internal art data source and an external art data source to generate first structured data; performing second preprocessing on unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data; performing fusion processing on the first structured data and the second structured data to generate fused art data, wherein the fused art data includes an art entity and an art relationship corresponding to the art entity; and generating an art triad according to the art entity and the art relationship, and generating an art domain knowledge graph according to the art triad.

In an exemplary embodiment of the present disclosure, performing the first preprocessing on the structured data in the internal art data source and the external art data source to generate the first structured data includes: performing data cleaning on the structured data in the internal art data source and the external art data source; performing a repeatability check on the data cleaning result of the structured data to generate repeatability inspection data; and generating a data dictionary and an error correction dictionary based on the repeatability inspection data, and generating the first structured data based on the data dictionary.

In an exemplary embodiment of the present disclosure, performing data cleaning on the structured data in the internal art data source and the external art data source includes: performing single-valued attribute determination processing on the structured data in the internal art data source and the external art data source to obtain single-valued structured data; acquiring a first structured entity and a first structured relationship in the single-valued structured data, and tallying the results of the single-valued attribute determination processing to obtain a multi-valued data table; when the multi-valued data table does not contain multi-valued data, using the first structured entity and the first structured relationship as the data cleaning result; and when the multi-valued data table contains multi-valued data, obtaining a second structured entity and a second structured relationship according to the multi-valued data table as the data cleaning result.

In an exemplary embodiment of the present disclosure, obtaining the second structured entity and the second structured relationship according to the multi-valued data table as the data cleaning result includes: updating a data dictionary or an error correction dictionary according to the multi-valued data table; and obtaining the second structured entity and the second structured relationship as the data cleaning result according to the update result of the data dictionary or the error correction dictionary.

In an exemplary embodiment of the present disclosure, performing the repeatability check on the data cleaning result of the structured data to generate repeatability inspection data includes: performing an artwork entity repeatability check on the data cleaning result of the structured data to generate an artwork repeatability check result; when the artwork repeatability check result is the same, performing an artist entity repeatability check on the data cleaning result to generate an artist repeatability check result; when the artist repeatability check result is the same, performing a creation time entity repeatability check on the data cleaning result to generate a creation time repeatability check result; when the creation time repeatability check result is the same, determining that the artwork entity is a duplicate artwork; and performing fusion processing on the duplicate artwork, and generating the repeatability inspection data according to the approved fusion processing result.

In an exemplary embodiment of the present disclosure, the method further includes: when the artist repeatability check result is different or the creation time repeatability check result is different, determining that the artwork entity is an artwork with the same name; and performing de-duplication processing on the artwork with the same name, and generating the repeatability inspection data according to the result of the de-duplication processing.

In an exemplary embodiment of the present disclosure, the first structured data includes target artwork data, target artist data, and target art institution data; and performing fusion processing on the first structured data and the second structured data to generate fused art data includes: performing fusion processing on reference artist data in the second structured data and the target artist data to generate fused artist data; performing fusion processing on reference artwork data in the second structured data and the target artwork data to generate fused artwork data; and performing fusion processing on reference art institution data in the second structured data and the target art institution data to generate fused art institution data.

In an exemplary embodiment of the present disclosure, performing fusion processing on the reference artist data in the second structured data and the target artist data to generate fused artist data includes: performing vector conversion on the reference artist data in the second structured data and the target artist data according to a word vector model to obtain artist word vector sequences; calculating an artist similarity vector between the artist word vector sequences, and performing a weighted calculation according to a first weight of the artist similarity vector; obtaining an artist similarity according to the weighted calculation result, and determining whether the artist similarity is greater than a first threshold; and fusing the reference artist data corresponding to an artist similarity greater than the first threshold with the target artist data to generate the fused artist data.

In an exemplary embodiment of the present disclosure, performing fusion processing on the reference artwork data in the second structured data and the target artwork data to generate fused artwork data includes: performing vector conversion on the reference artwork data in the second structured data and the target artwork data according to a word vector model to obtain artwork word vector sequences; calculating an artwork similarity vector between the artwork word vector sequences, and performing a weighted calculation according to a second weight of the artwork similarity vector; obtaining an artwork similarity according to the weighted calculation result, and determining whether the artwork similarity is greater than a second threshold; and fusing the reference artwork data corresponding to an artwork similarity greater than the second threshold with the target artwork data to generate the fused artwork data.

In an exemplary embodiment of the present disclosure, performing fusion processing on the reference art institution data in the second structured data and the target art institution data to generate fused art institution data includes: performing vector conversion on the reference art institution data in the second structured data and the target art institution data according to a word vector model to obtain art institution word vector sequences; calculating an art institution similarity vector between the art institution word vector sequences, and performing a weighted calculation according to a third weight of the art institution similarity vector; obtaining an art institution similarity according to the weighted calculation result, and determining whether the art institution similarity is greater than a third threshold; and fusing the reference art institution data corresponding to an art institution similarity greater than the third threshold with the target art institution data to generate the fused art institution data.
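The artist, artwork, and art institution fusion steps described above share one pattern: convert attributes to vectors, compute a similarity vector, weight it, and compare the result with a threshold. The following Python sketch illustrates that pattern only. It assumes cosine similarity per attribute (the text does not name the similarity measure) and uses hand-written toy vectors in place of a trained word vector model; the function names, weights, and thresholds are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_similarity(ref_vectors, target_vectors, weights):
    """One vector per attribute, same order on both sides.

    Builds the similarity vector attribute by attribute, then
    reduces it to a single score with the given weights.
    """
    sim_vector = [cosine(r, t) for r, t in zip(ref_vectors, target_vectors)]
    return sum(w * s for w, s in zip(weights, sim_vector))


def should_fuse(ref_vectors, target_vectors, weights, threshold):
    """Fuse only when the weighted similarity is greater than the threshold."""
    return weighted_similarity(ref_vectors, target_vectors, weights) > threshold


# Toy vectors (assumption): attribute 0 matches exactly, attribute 1 is orthogonal.
reference = [[1.0, 0.0], [0.0, 1.0]]
target = [[1.0, 0.0], [1.0, 0.0]]
weights = [0.7, 0.3]  # hypothetical weights for the two attributes

score = weighted_similarity(reference, target, weights)
```

With these toy inputs the similarity vector is `[1.0, 0.0]`, so the score equals the weight of the first attribute, and the pair is fused for any threshold below that value.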

According to a second aspect of the present disclosure, there is provided a device for constructing a knowledge graph in the art field, the device comprising: a data processing module configured to perform first preprocessing on structured data in an internal art data source and an external art data source to generate first structured data; a data analysis module configured to perform second preprocessing on unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data; a data fusion module configured to perform fusion processing on the first structured data and the second structured data to generate fused art data, wherein the fused art data includes an art entity and an art relationship corresponding to the art entity; and a graph generation module configured to generate an art triad according to the art entity and the art relationship, and to form an art domain knowledge graph according to the art triad.

According to a third aspect of the present disclosure, there is provided an electronic device, including: a processor and a memory; wherein computer-readable instructions are stored in the memory, and the computer-readable instructions, when executed by the processor, implement the method for constructing a knowledge graph in the art field of any of the foregoing exemplary embodiments.

According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having a computer program stored thereon, and the computer program, when executed by a processor, implements the method for constructing an art domain knowledge graph in any of the above exemplary embodiments.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present disclosure.

Description of the drawings

The drawings herein are incorporated into the specification and constitute a part of the specification, show embodiments consistent with the disclosure, and are used together with the specification to explain the principle of the disclosure. Obviously, the drawings in the following description are only some embodiments of the present disclosure. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.

Fig. 1 schematically shows a flowchart of a method for constructing a knowledge graph in the art field in an exemplary embodiment of the present disclosure;

Fig. 2 schematically shows a flow chart of a method for generating first structured data in an exemplary embodiment of the present disclosure;

Fig. 3 schematically shows a flow chart of a method for data cleaning in an exemplary embodiment of the present disclosure;

FIG. 4 schematically shows a flow chart of a method for obtaining a data cleaning result in an exemplary embodiment of the present disclosure;

Fig. 5 schematically shows a flow chart of a method for generating repetitive inspection data in an exemplary embodiment of the present disclosure;

Fig. 6 schematically shows a flow chart of another method for generating repetitive inspection data in an exemplary embodiment of the present disclosure;

Fig. 7 schematically shows a flow chart of a method for generating fused art data in an exemplary embodiment of the present disclosure;

Fig. 8 schematically shows a flow chart of a method for obtaining fusion artist data in an exemplary embodiment of the present disclosure;

FIG. 9 schematically shows a flow chart of a method for obtaining fused artwork data in an exemplary embodiment of the present disclosure;

FIG. 10 schematically shows a flow chart of a method for obtaining fusion art institution data in an exemplary embodiment of the present disclosure;

FIG. 11 schematically shows a flow chart of a method for constructing an art domain knowledge graph of an application scenario in an exemplary embodiment of the present disclosure;

FIG. 12 schematically shows a flow chart of a method for first preprocessing of data in an application scenario in an exemplary embodiment of the present disclosure;

FIG. 13 schematically shows a flow chart of a method for data cleaning in an application scenario in an exemplary embodiment of the present disclosure;

FIG. 14 schematically shows a flowchart of a processing method when painting is repeated in an application scenario in an exemplary embodiment of the present disclosure;

Fig. 15 schematically shows a flow chart of a method for generating fusion art data in an application scenario in an exemplary embodiment of the present disclosure;

FIG. 16 schematically shows an interface diagram of a visualized art domain knowledge graph in an application scenario in an exemplary embodiment of the present disclosure;

FIG. 17 schematically shows a scene diagram of the application of the art domain knowledge graph in the art encyclopedia in an exemplary embodiment of the present disclosure;

FIG. 18 schematically shows a schematic diagram of a scene in which the knowledge graph of the art field is applied to the knowledge graph in an exemplary embodiment of the present disclosure;

FIG. 19 schematically shows a scene diagram of the application of the art domain knowledge graph in the art knowledge question and answer in an exemplary embodiment of the present disclosure;

FIG. 20 schematically shows a schematic diagram of a scene in which the art domain knowledge graph is applied to an overview of art knowledge in an exemplary embodiment of the present disclosure;

FIG. 21 schematically shows a structure diagram of an apparatus for constructing a knowledge graph in the art field in an exemplary embodiment of the present disclosure;

FIG. 22 schematically shows an electronic device for implementing a method for constructing a knowledge graph in the art field in an exemplary embodiment of the present disclosure;

FIG. 23 schematically illustrates a computer-readable storage medium used to implement a method for constructing a knowledge graph in the art field in an exemplary embodiment of the present disclosure.

Detailed description

Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in various forms and should not be construed as being limited to the examples set forth herein; on the contrary, these embodiments are provided so that the present disclosure will be more comprehensive and complete, and so that the concept of the example embodiments will be fully conveyed to those skilled in the art. The described features, structures, or characteristics can be combined in one or more embodiments in any suitable way. In the following description, many specific details are provided to give a sufficient understanding of the embodiments of the present disclosure. However, those skilled in the art will realize that the technical solutions of the present disclosure can be practiced without one or more of the specific details, or that other methods, components, devices, steps, etc. can be used. In other cases, well-known technical solutions are not shown or described in detail, to avoid obscuring aspects of the present disclosure.

In this specification, the terms "a", "an", "the" and "said" are used to indicate the presence of one or more elements/components/etc.; the terms "including" and "having" are used in an open-ended, inclusive sense and mean that, in addition to the listed elements/components/etc., there may be other elements/components/etc.; the terms "first", "second", etc. are used only as labels and do not limit the number of their objects.

In addition, the drawings are only schematic illustrations of the present disclosure, and are not necessarily drawn to scale. The same reference numerals in the figures denote the same or similar parts, and thus their repeated description will be omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities.

Existing methods of constructing domain knowledge graphs have several defects. Specifically, methods for constructing knowledge graphs of English-language professional domains are not fully applicable to constructing knowledge graphs of Chinese-language professional domains. In addition, existing methods of constructing professional domain knowledge graphs cannot achieve both scale and accuracy of professional knowledge, and it is difficult to integrate domain knowledge acquired from multiple data sources.

In view of the problems in the related art, the present disclosure proposes a method for constructing a knowledge graph in the art field. Fig. 1 shows a flowchart of the method. As shown in Fig. 1, the method for constructing a knowledge graph in the art field includes at least the following steps:

Step S110. Perform first preprocessing on the structured data in the internal art data source and the external art data source to generate first structured data.

Step S120. Perform a second preprocessing on the unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data.

Step S130. Perform fusion processing on the first structured data and the second structured data to generate fused art data; wherein, the fused art data includes an art entity and an art relationship corresponding to the art entity.

Step S140. Generate an art triad according to the art entity and the art relationship, and generate an art domain knowledge graph according to the art triad.
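Step S140 can be pictured with a short Python sketch. It is illustrative only: it assumes each fused record pairs one art entity with a mapping of relationship names to related entities, and it represents the knowledge graph as a plain adjacency mapping. The `Triad` type, the `build_graph` helper, and the record layout are hypothetical and do not come from the disclosure.

```python
from collections import defaultdict
from typing import NamedTuple


class Triad(NamedTuple):
    """One (head entity, relationship, tail entity) fact."""
    head: str
    relation: str
    tail: str


def build_graph(triads):
    """Index triads by head entity so its relationships can be looked up directly."""
    graph = defaultdict(list)
    for t in triads:
        graph[t.head].append((t.relation, t.tail))
    return dict(graph)


# Hypothetical fused art data: an entity plus its relationships.
fused_art_data = [
    {"entity": "Mona Lisa", "relations": {"created_by": "Leonardo da Vinci"}},
    {"entity": "Leonardo da Vinci", "relations": {"nationality": "Italy"}},
]

# One triad per (entity, relationship, related entity) combination.
triads = [
    Triad(record["entity"], relation, tail)
    for record in fused_art_data
    for relation, tail in record["relations"].items()
]
graph = build_graph(triads)
```

A production system would more likely load the triads into a graph database; the adjacency mapping is used here only to keep the sketch self-contained.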

In the exemplary embodiment of the present disclosure, on the one hand, data from external data sources is fused with standardized data, which greatly increases the scale of entity knowledge in the art field and improves the acquisition of knowledge in the art field. On the other hand, generating the art domain knowledge graph based on art triads helps to improve the relevance of entities in the knowledge graph and the comprehensiveness of knowledge graph search, to understand query intentions more accurately, and to improve the accuracy of retrieval.

The steps of the method for constructing the knowledge graph in the art field will be described in detail below.

In step S110, first preprocessing is performed on the structured data in the internal art data source and the external art data source to generate the first structured data.

In the exemplary embodiment of the present disclosure, the internal art data source and the external art data source are distinguished by the source of the art data. For example, the data in the internal art data source may be mainly manually processed structured data, while the data in the external art data source may be crawled from public data on the Internet and is mainly semi-structured data. However, the internal art data source may also contain unstructured data and semi-structured data, and the external art data source may also contain structured data and unstructured data. Therefore, structured data can be obtained from both the internal data source and the external data source.

After the structured data is obtained, the first preprocessing can be performed on the structured data to obtain the first structured data. In an optional embodiment, FIG. 2 shows a schematic flowchart of the method for generating the first structured data. As shown in FIG. 2, the method includes at least the following steps: In step S210, data cleaning is performed on the structured data in the internal art data source and the external art data source.

Specifically, in an optional embodiment, FIG. 3 shows a schematic flowchart of a method for data cleaning of structured data. As shown in FIG. 3, the method includes at least the following steps: In step S310, single-valued attribute determination processing is performed on the structured data in the internal art data source and the external art data source to obtain single-valued structured data. A single-valued attribute is an attribute that has only one specific value. For example, the single-valued attribute determination may check whether a painting has only one author. When a painting has two recorded authors, for example two variant forms of the name Van Gogh, no single-valued structured data is obtained; when the only author recorded for a painting is Van Gogh, the single-valued structured data for the author of the painting is Van Gogh. In addition, the single-valued structured data may also include paintings, creation time, genres, nationalities, etc., which are not specifically limited in this exemplary embodiment.

In step S320, the first structured entity and the first structured relationship in the single-valued structured data are obtained, and the results of the single-valued attribute determination processing are tallied to obtain a multi-valued data table. From the obtained single-valued structured data, the corresponding first structured entity and first structured relationship can be extracted. For example, the first structured entity may include an artist entity, an artwork entity, a creation time entity, a genre entity, a nationality entity, etc.; for an artist entity, the corresponding structured relationships may include the relationships to the artwork entities created, the creation times of those artworks, the genre formed, and the nationality. In addition, the structured data that failed the single-valued attribute determination can be tallied to obtain the multi-valued data table.
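Steps S310 and S320 can be sketched in Python as follows. This is an illustration under assumptions: rows are dictionaries with hypothetical `painting` and `author` fields, and the second author spelling is an invented stand-in for the alias in the example above (the text gives both variants as "Van Gogh"). The function name `split_single_valued` is also hypothetical.

```python
from collections import defaultdict


def split_single_valued(rows, key="painting", attr="author"):
    """Group the attribute values per key.

    Returns (single_valued, multi_valued_table):
    keys with exactly one distinct value pass the determination,
    keys with several values are collected in the multi-valued table.
    """
    values = defaultdict(set)
    for row in rows:
        values[row[key]].add(row[attr])
    single = {k: next(iter(v)) for k, v in values.items() if len(v) == 1}
    multi = {k: sorted(v) for k, v in values.items() if len(v) > 1}
    return single, multi


# Hypothetical structured rows merged from internal and external sources.
rows = [
    {"painting": "The Starry Night", "author": "Van Gogh"},
    {"painting": "The Starry Night", "author": "Vincent van Gogh"},
    {"painting": "Mona Lisa", "author": "Leonardo da Vinci"},
]
single, multi = split_single_valued(rows)
```

Here `single` holds the paintings whose author passed the determination, and `multi` plays the role of the multi-valued data table that is passed on for review.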

In step S330, when the multi-valued data table does not contain multi-valued data, the first structured entity and the first structured relationship are used as the data cleaning result. When no multi-valued data has been tallied, or when all multi-valued data in the multi-valued data table has been updated, a further review can be carried out. When the review is a manual review and the manual review is passed, the obtained first structured entity and first structured relationship are directly determined as the data cleaning result. Alternatively, the review can be automated according to preset rules, to shorten the review process, save labor costs, and improve the accuracy of the review work.

It should be noted that the review step mentioned in the embodiment of the present invention can be an automatic review according to a custom set rule, or it can be a direct review manually. Both manual review and automatic review are interchangeable.

In step S340, when the multi-valued data table contains multi-valued data, the second structured entity and the second structured relationship are obtained according to the multi-valued data table as the result of data cleaning.

Further, in an optional embodiment, FIG. 4 shows a schematic flowchart of a method for obtaining data cleaning results. As shown in FIG. 4, the method includes at least the following steps: In step S410, the data dictionary or the error correction dictionary is updated according to the multi-valued data table. For example, the multi-valued data table for the author of a painting may include two values that are both variant forms of the name Van Gogh. During further manual review, it is found that one value is an alias of the other, so the alias can be replaced with the standard name, and a corresponding error correction dictionary entry is generated for the update. Alternatively, the review can be automated according to preset rules, to shorten the review process, save labor costs, and improve the accuracy of the review work. In addition, the error correction dictionary may be a database that stores information on operations such as adding and modifying data, including the data source, relationship with other data, usage, and format; correspondingly, the data dictionary may be a database that stores normative data information such as format and content, including the data source, relationship with other data, usage, and format.

In step S420, the second structured entity and the second structured relationship are obtained as the data cleaning result according to the update result of the data dictionary or the error correction dictionary. When the multi-valued data table is not empty, the update results of the data dictionary and the error correction dictionary can be judged further, until all multi-valued data in the multi-valued data table has been updated, and the second structured entity and the second structured relationship are obtained as the data cleaning result.
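The update loop of steps S410 and S420 can be sketched in Python. The sketch assumes the error correction dictionary is a simple alias-to-standard-name mapping and models the review step as a callback; `review`, `KNOWN_ALIASES`, the row fields, and the alias spelling are all hypothetical stand-ins, not part of the disclosure.

```python
def multi_valued(rows, key="painting", attr="author"):
    """Return the multi-valued table: keys that still have several distinct values."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], set()).add(row[attr])
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


def apply_corrections(rows, corrections, attr="author"):
    """Replace every known alias with its standard value."""
    return [{**row, attr: corrections.get(row[attr], row[attr])} for row in rows]


def clean_until_single_valued(rows, review):
    """Update the error correction dictionary until the multi-valued table is empty."""
    corrections = {}
    table = multi_valued(rows)
    while table:
        decisions = review(table)  # alias -> standard value, from manual or rule-based review
        new = {a: c for a, c in decisions.items() if corrections.get(a) != c}
        if not new:
            break  # the review cannot resolve the remaining entries
        corrections.update(new)
        rows = apply_corrections(rows, corrections)
        table = multi_valued(rows)
    return rows, corrections, table


# Stand-in for the review step (assumption): a fixed list of confirmed aliases.
KNOWN_ALIASES = {"Vincent van Gogh": "Van Gogh"}


def review(table):
    return {v: KNOWN_ALIASES[v] for vals in table.values() for v in vals if v in KNOWN_ALIASES}


rows = [
    {"painting": "The Starry Night", "author": "Van Gogh"},
    {"painting": "The Starry Night", "author": "Vincent van Gogh"},
]
cleaned, corrections, remaining = clean_until_single_valued(rows, review)
```

The early `break` keeps the loop from running forever when the review yields no new decisions; the entries left in `remaining` would then need further review.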

In this exemplary embodiment, data cleaning results are generated for the structured entities and structured relationships in the multi-valued data table, which facilitates the update of knowledge in the art field, and the update method is simple and accurate.

In step S220, a repeatability check is performed on the data cleaning results of the structured data in the internal art data source and the external art data source, and repeatability inspection data is generated. In an optional embodiment, the structured entities include artwork entities, artist entities, and creation time entities. FIG. 5 shows a schematic flowchart of a method for generating repeatability inspection data. As shown in FIG. 5, the method includes at least the following steps: In step S510, an artwork entity repeatability check is performed on the data cleaning results of the structured data in the internal art data source and the external art data source, and an artwork repeatability check result is generated. The check of the artwork entity may obtain the names of the artworks, determine whether the names are consistent, and generate the corresponding artwork repeatability check result. For example, when two paintings that are both named "Mona Lisa" are checked, the repeatability check result is the same; when a painting named "Mona Lisa" and a painting named "Girl with a Pearl Earring" are checked, the repeatability check result is different. It should be noted that the repeatability check determines whether two entities are substantially the same. For example, if the author of an artwork is recorded under both a full name and an abbreviated name, both refer to the same author, and the result of the repeatability check should be the same.

In step S520, when the artwork repeatability check result is the same, an artist entity repeatability check is performed on the data cleaning result to generate an artist repeatability check result. For example, when the names of two paintings are both "Mona Lisa", it can be determined that the artwork repeatability check result is the same. However, the two paintings may have been reworked by later artists, may come from different museums, or may be different paintings for other reasons, so a further judgment can be made. Specifically, a repeatability check may be performed on the artist entity that created the artwork.

In step S530, when the artist repeatability check results are the same, the creation time entity repeatability check is performed on the data cleaning result, and the creation time repeatability test result is generated. For example, when the two paintings whose names are both "Mona Lisa" correspond to the same artist, it can be determined that the result of the repeatability test of the artist is the same. Furthermore, the time of creation can also be tested for repeatability.

In step S540, when the creation time repeatability check result is "same", the artwork entity is determined to be a duplicate artwork. For example, when two paintings named "Mona Lisa" correspond to the same artist and the same creation time, the creation time repeatability check result is "same". The artwork can therefore be determined to be a duplicate artwork based on the repeatability check results of these three dimensions.

In step S550, fusion processing is performed on the duplicate artwork, and repeatability check data is generated according to the fusion processing result that has passed review. When two artwork entities are found to be duplicates, the two entities can be fused, and the result of the fusion processing can be manually reviewed. The manual review can further determine whether the repeatability check results of other dimensions are also "same". When the manual review is passed, the repeatability check data of the artwork can be generated. In addition, the data dictionary can be updated based on the repeatability check data.

In this exemplary embodiment, entity fusion can be performed on duplicate paintings through a judgment along three dimensions, and the data dictionary can be updated accordingly. This refines the data dictionary more accurately, keeps the knowledge in the data dictionary up to date, and reduces the workload of repeatedly judging the same data dictionary entries.

In addition to generating repeatability check data based on duplicate artwork, it is also possible to generate repeatability check data based on artwork with the same name. In an alternative embodiment, FIG. 6 shows a schematic flowchart of another method for generating repeatability check data. As shown in FIG. 6, the method includes at least the following steps: In step S610, when the artist repeatability check result is "different" or the creation time repeatability check result is "different", the artwork entity is determined to be a same-name artwork. When the artwork repeatability check result is "same", it can be further determined whether the artist repeatability check result and the creation time repeatability check result are also "same". For example, when two paintings are both named "Self-Portrait" but were created by two different painters, the artist repeatability check result is "different". The two "Self-Portrait" paintings are therefore regarded as same-name paintings, that is, distinct works that share a name. In addition, when the artwork repeatability check result is "same", the repeatability check can also be performed on the creation time to determine whether the artwork is a same-name artwork.

In step S620, deduplication processing is performed on the same-name artwork, and repeatability check data is generated according to the deduplication processing result. For example, when two "Self-Portrait" paintings are same-name paintings, they can be deduplicated, that is, the two paintings are recorded as two separate data dictionary entries. To ensure that the data dictionary is updated accurately, a manual review can be carried out. Only same-name artwork that has passed the review is used to generate repeatability check data and update the data dictionary. Alternatively, the review can be automated according to preset rules to shorten the review process, save labor costs, and improve the accuracy of the review work.

In this exemplary embodiment, the other two dimensions are judged for artworks that share the same name, entity deduplication is performed on the same-name artworks, and the data dictionary is updated. This avoids inaccurate knowledge judgments caused by using too few dimensions, prevents the data dictionary from becoming obsolete, and ensures the accuracy of the data dictionary.
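
The cascade of checks in steps S510 to S620 can be sketched as follows. This is a minimal illustration only: the record layout, the field names, the label strings, and the creation-time values are hypothetical, and plain string equality stands in for the check of whether two values are "substantially the same" (for example, a full name and its abbreviation would first have to be mapped to one form through the data dictionary).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtworkRecord:
    # Hypothetical record layout; the field names are illustrative only.
    name: str
    artist: str
    created: str


def classify_pair(a: ArtworkRecord, b: ArtworkRecord) -> str:
    """Cascade the three checks: artwork name, then artist, then creation time."""
    if a.name != b.name:
        return "different"   # S510: the artwork names differ
    if a.artist != b.artist:
        return "same_name"   # S610: same name, different artist
    if a.created != b.created:
        return "same_name"   # S610: same name, different creation time
    return "duplicate"       # S540: all three dimensions agree


# Illustrative values only; they are not taken from any data source.
mona_a = ArtworkRecord("Mona Lisa", "Leonardo da Vinci", "1503")
mona_b = ArtworkRecord("Mona Lisa", "Leonardo da Vinci", "1503")
self_1 = ArtworkRecord("Self-Portrait", "Painter A", "1889")
self_2 = ArtworkRecord("Self-Portrait", "Painter B", "1660")

print(classify_pair(mona_a, mona_b))  # duplicate
print(classify_pair(self_1, self_2))  # same_name
```

Pairs labelled "duplicate" would go on to fusion and review (S550), and pairs labelled "same_name" to deduplication and review (S620).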

In step S230, a data dictionary and an error correction dictionary are generated according to the repeatability check data, and the first structured data is obtained based on the data dictionary. Specifically, the first structured data includes the data dictionary as well as some attribute data that is not included in the data dictionary. After the repeatability check data is generated, artwork names or other information may still be incorrect, so a further manual review can be performed. After the manual review is passed, the corresponding data dictionary or error correction dictionary is generated. Alternatively, the review can be automated according to preset rules to shorten the review process, save labor costs, and improve the accuracy of the review work. Before the target art data is generated from the data dictionary or error correction dictionary, formatting questions may also remain, such as whether the separator within a foreign name is "·" or "-", or whether the separator within a creation time is "." or "-". Therefore, the conformity of these data to the specification can also be checked: a data dictionary or error correction dictionary that meets the storage specification of the art field database is used as the target art data, and one that does not conform to the storage specification is corrected and then used as the target art data.
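
As a rough illustration of the standardization check, the sketch below normalizes the two kinds of separators mentioned above. The storage specification is not given in the text, so the choices made here ("·" inside foreign names, "-" inside creation times) and both function names are assumptions. A real rule set would also need to protect names that legitimately contain a hyphen.

```python
import re


def normalize_name(name: str) -> str:
    """Rewrite "-" or "·" between name parts to "·" (assumed specification)."""
    return re.sub(r"\s*[-·]\s*", "·", name.strip())


def normalize_creation_time(value: str) -> str:
    """Rewrite "." or "-" between date parts to "-" (assumed specification)."""
    return re.sub(r"\s*[.\-]\s*", "-", value.strip())


print(normalize_name("Leonardo - da Vinci"))  # Leonardo·da Vinci
print(normalize_creation_time("1503.10"))     # 1503-10
```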

In this exemplary embodiment, the corresponding target art data can be generated through the first preprocessing of the structured data. The processing method is simple and accurate, reduces the manual workload, and is highly practical.

In step S120, second preprocessing is performed on the unstructured data and semi-structured data in the internal art data source and the external art data source to obtain the second structured data.

The second preprocessing may be a data cleaning step. Specifically, it may include a consistency check of the art data, missing value processing, invalid value processing, duplicate data judgment, and so on. The cleaning work may also be carried out through configuration or through embedded custom code, according to the processing requirements of the art data.

In the exemplary embodiment of the present disclosure, since the completeness of the data in the internal art data source may be only about 60%, the data in the external art data source can be merged in and used to fill gaps, which expands the data in the internal art data source and improves its completeness. Specifically, semi-structured data can be crawled from public Internet data to obtain second structured data that can be used for filling.

Semi-structured data may be parsed using preset rules and preset regular expressions. For example, from "Da Vinci's works include the Mona Lisa, the Last Supper, Our Lady of the Rocks, etc.", the rule "the works of "author" include "work"" can be constructed; from "Da Vinci's masterpiece Mona Lisa embodies his exquisite artistic attainments", the rule ""author"'s masterpiece "work"" can be constructed; from "Mona Lisa is an oil painting created by Italian painter Leonardo da Vinci", the rule ""work" is created by "author"" can be constructed; and from "Mona Lisa represents Leonardo's highest artistic achievement", the rule ""work" represents "author"" can be constructed. Through these manually constructed preset rules, the semi-structured data can be parsed to obtain the target structured data. In addition, a regular expression can be constructed from a sentence such as "Da Vinci, born in April 1452, Italian, representative paintings such as the Mona Lisa, the Last Supper, Our Lady of the Rocks, etc.": the content before the second comma is filled into the date of birth, the content between the second and third commas is filled into the nationality, and the content after the third comma is filled into the representative works. The second structured data can therefore also be obtained by performing the second preprocessing on the semi-structured data with the constructed regular expression.
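
The rule-based and regular-expression-based parsing described above can be sketched as follows. The patterns are hypothetical English stand-ins written against the example sentences; the relation label "created", the function names, and the output field names are likewise assumptions made for illustration.

```python
import re

# Hypothetical stand-ins for two of the preset rules; each captures author and work.
RULE_PATTERNS = [
    re.compile(r"^(?P<author>.+?)'s masterpiece (?P<work>.+?) embodies"),
    re.compile(
        r"^(?P<work>.+?) is an oil painting created by "
        r"(?:\w+ painter )?(?P<author>.+?)\.?$"
    ),
]

# Comma-delimited profile sentence: name, birth date, nationality, works.
PROFILE_PATTERN = re.compile(
    r"^(?P<name>[^,]+), born in (?P<birth>[^,]+), (?P<nationality>[^,]+), "
    r"representative paintings such as (?P<works>.+?)(?:, etc\.)?$"
)


def extract_authorship(sentence):
    """Return an (author, relation, work) tuple, or None if no rule matches."""
    for pattern in RULE_PATTERNS:
        match = pattern.search(sentence)
        if match:
            return (match.group("author"), "created", match.group("work"))
    return None


def parse_profile(sentence):
    """Fill birth date, nationality, and representative works from one sentence."""
    match = PROFILE_PATTERN.match(sentence)
    if match is None:
        return None
    fields = match.groupdict()
    fields["works"] = fields["works"].split(", ")
    return fields


print(extract_authorship(
    "Mona Lisa is an oil painting created by Italian painter Leonardo da Vinci"
))
print(parse_profile(
    "Da Vinci, born in April 1452, Italian, representative paintings such as "
    "the Mona Lisa, the Last Supper, Our Lady of the Rocks, etc."
))
```

Sentences that match no rule return None and would be left for a further rule or for manual handling.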

In step S130, the first structured data and the second structured data are fused to generate fused art data; where the fused art data includes an art entity and an art relationship corresponding to the art entity.

In an exemplary embodiment of the present disclosure, the first structured data includes target artwork data, target artist data, and target art institution data. FIG. 7 shows a schematic flowchart of a method for generating fused art data. As shown in FIG. 7, the method includes at least the following steps:

In step S710, the reference artist data and the target artist data in the second structured data are fused to generate fused artist data.

In an optional embodiment, FIG. 8 shows a schematic flowchart of a method for obtaining fused artist data. As shown in FIG. 8, the method includes at least the following steps: In step S810, the reference artist data in the second structured data and the target artist data are vectorized according to a word vector model to obtain artist word vector sequences. The word vector model may be a Word2Vec model. Word2Vec is a tool released by Google in 2013 and can be regarded as an important application of deep learning in the field of natural language processing. Although Word2Vec is a neural network of only three layers, it achieves very good results. Through the Word2Vec model, each segmented word can be expressed as a word vector, so that the text is digitized and can be better processed by a computer, and the resulting vectors also reflect semantic information. To exploit this semantic information, the Word2Vec model can adopt two specific implementations, namely the Continuous Bag-of-Words (CBOW) model and the Skip-gram model. The CBOW model predicts a word given its context, whereas the Skip-gram model predicts the context given an input word. In both cases, the first part is to build the model, and the second part is to obtain the embedded word vectors through the model. Preferably, the Skip-gram model is used for vector conversion of the word sequence. With the Skip-gram model, a 300-dimensional real-valued vector can be used to uniquely represent a word in the word space. The reference artist data and the target artist data are each represented as a matrix whose size is the number of words in the sequence multiplied by 300, which gives the corresponding artist word vector sequence.
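
To make the Skip-gram idea and the resulting matrix shape concrete, the sketch below generates the (input word, context word) pairs that a Skip-gram model is trained on, and then represents a word sequence as a matrix with one 300-dimensional row per word. No model is trained here: the window size, the function names, and the random placeholder vectors are assumptions, and a trained Word2Vec model would supply the actual vectors.

```python
import random

EMBEDDING_DIM = 300  # dimensionality stated in the text


def skipgram_pairs(tokens, window=2):
    """Return (center, context) training pairs for the Skip-gram objective."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def embed_sequence(tokens, table):
    """Look up each token, giving a len(tokens) x EMBEDDING_DIM matrix."""
    return [table[token] for token in tokens]


tokens = ["Italian", "Renaissance", "painter"]
pairs = skipgram_pairs(tokens, window=1)

# Placeholder vectors only: random values carry no semantic information.
rng = random.Random(0)
table = {t: [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)] for t in tokens}
matrix = embed_sequence(tokens, table)

print(len(pairs), len(matrix), len(matrix[0]))  # 4 3 300
```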

In step S820, the artist similarity vectors between the artist word vector sequences are calculated, and a weighted calculation is performed according to the first weights of the artist similarity vectors. Artist similarity may involve similarity vectors in multiple dimensions, such as the artist's nationality and the artist's genre. Therefore, the artist similarity vector between the word vector sequences of each dimension can be calculated first.

For example, the lengths of two artist word vector sequences of the same dimension may be inconsistent, so the two artist word vector sequences can be used as the input of a Siamese long short-term memory (Siamese LSTM) network model, which accommodates variable-length sequence pairs. The Siamese LSTM network model is composed of two identical neural network models, which form the Siamese structure by sharing weights. The reference artist word vector sequence and the target artist word vector sequence are respectively input into the two neural network models, and the artist similarity component between the two input sequences is evaluated by calculating the distance between the two vector sequences. The distance between two vector sequences is mainly calculated as the Manhattan distance. In addition, the artist similarity vector can also be calculated by other algorithms, which is not particularly limited in this exemplary embodiment.
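
The following sketch shows how a Manhattan distance between two encoded sequences of different lengths might be turned into a similarity component. It does not implement an LSTM: a mean-pooling function stands in for the shared encoder, which is an assumption made to keep the example short. The mapping exp(-d) from distance to similarity is also an assumption, since the text states only that the Manhattan distance is used, and the scale of the resulting score would differ under another mapping.

```python
import math


def encode(sequence):
    """Stand-in for the shared encoder: mean-pool the word vectors."""
    count = len(sequence)
    return [sum(column) / count for column in zip(*sequence)]


def manhattan_similarity(seq_a, seq_b):
    """Encode both sequences with the same function, then compare by L1 distance."""
    h_a = encode(seq_a)  # the same encoder is applied to both inputs,
    h_b = encode(seq_b)  # which mirrors the weight sharing of the two branches
    distance = sum(abs(x - y) for x, y in zip(h_a, h_b))
    return math.exp(-distance)  # assumed mapping: 1.0 when the encodings coincide


seq_a = [[1.0, 0.0], [0.0, 1.0]]  # two word vectors
seq_b = [[0.5, 0.5]]              # one word vector, same pooled encoding
seq_c = [[1.0, 0.5]]

print(manhattan_similarity(seq_a, seq_b))  # 1.0
print(manhattan_similarity(seq_a, seq_c))
```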

After the artist similarity components of each dimension are obtained, a weighted calculation can be performed on the artist similarity components according to the preset first weights to obtain a weighted calculation result. For example, the dimensions of the artist similarity component may include genre and nationality. The first weight for the genre can be set to 0.4 and the first weight for the nationality can be set to 0.6, so that the genre component of the artist similarity is multiplied by 0.4, the nationality component is multiplied by 0.6, and the two products are summed to obtain the corresponding calculation result.

In step S830, the artist similarity is obtained according to the weighted calculation result, and it is determined whether the artist similarity is greater than the first threshold. The artist similarity is the result of the weighted calculation over the artist similarity components of each dimension. Therefore, the first threshold can be set according to the overall value range of the artist similarity, and it can be judged whether the artist similarity is greater than the first threshold. For example, the first threshold can be set to 1. When the weighted calculation result is 0.8, the artist similarity is less than the first threshold because 0.8 is less than 1; when the weighted calculation result is 1.2, the artist similarity is greater than the first threshold because 1.2 is greater than 1.

In step S840, the reference artist data and the target artist data corresponding to the artist similarity greater than the first threshold are fused to generate fused artist data. When the determination result is that the artist similarity is greater than the first threshold, it can be determined that the reference artist data and the target artist data point to the same artist, so the reference artist data and the target artist data are fused to obtain the fused artist data.
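
Steps S820 to S840 can be sketched as a weighted sum followed by a threshold test and a merge. The weights 0.4 and 0.6 and the threshold 1 come from the example in the text; the component values, the field names, and the merge policy (filling the target's missing fields from the reference) are assumptions chosen for illustration.

```python
FIRST_WEIGHTS = {"genre": 0.4, "nationality": 0.6}  # weights from the example
FIRST_THRESHOLD = 1.0                               # threshold from the example


def weighted_similarity(components, weights):
    """Sum each per-dimension similarity component times its weight."""
    return sum(weights[dim] * components[dim] for dim in weights)


def should_fuse(components, weights, threshold):
    """Fuse only when the weighted similarity is greater than the threshold."""
    return weighted_similarity(components, weights) > threshold


def fuse(reference, target):
    """Assumed merge policy: keep target values, fill gaps from the reference."""
    merged = dict(target)
    for key, value in reference.items():
        if merged.get(key) in (None, ""):
            merged[key] = value
    return merged


# Hypothetical component values giving weighted results near 1.2 and 0.8.
high = {"genre": 1.5, "nationality": 1.0}
low = {"genre": 0.5, "nationality": 1.0}

target = {"name": "Leonardo da Vinci", "nationality": ""}
reference = {"name": "Da Vinci", "nationality": "Italian"}

if should_fuse(high, FIRST_WEIGHTS, FIRST_THRESHOLD):
    print(fuse(reference, target))
print(should_fuse(low, FIRST_WEIGHTS, FIRST_THRESHOLD))
```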

In this exemplary embodiment, by calculating the similarity vectors of the respective dimensions corresponding to the reference artist data and the target artist data, the reference artist data and the target artist data that meet the preset conditions can be fused to obtain the fused artist data. The calculation method is simple, the fusion accuracy is high, and the accuracy of the artist data acquisition is improved.

In step S720, the reference artwork data in the second structured data and the target artwork data are fused to generate fused artwork data.

In an alternative embodiment, FIG. 9 shows a schematic flowchart of a method for obtaining fused artwork data. As shown in FIG. 9, the method includes at least the following steps: In step S910, the reference artwork data in the second structured data and the target artwork data are vectorized according to a word vector model to obtain artwork word vector sequences. The word vector model may be a Word2Vec model. Through the Word2Vec model, each segmented word can be expressed as a word vector, so that the text is digitized and can be better processed by a computer, and the resulting vectors also reflect semantic information. To exploit this semantic information, the Word2Vec model can adopt two specific implementations, namely the Continuous Bag-of-Words (CBOW) model and the Skip-gram model. Preferably, the Skip-gram model is used for vector conversion of the word sequence. With the Skip-gram model, a 300-dimensional real-valued vector can be used to uniquely represent a word in the word space. The reference artwork data and the target artwork data are each represented as a matrix whose size is the number of words in the sequence multiplied by 300, which gives the corresponding artwork word vector sequence.

In step S920, the artwork similarity vector between the artwork word vector sequences is calculated, and a weighted calculation is performed according to the second weights of the artwork similarity vector. Artwork similarity may involve similarity vectors in multiple dimensions, such as the genre to which the artwork belongs, the creation time of the artwork, and the art institution where the artwork is preserved. Therefore, the artwork similarity vector between the word vector sequences of each dimension of the artwork can be calculated first.

For example, the lengths of two artwork word vector sequences of the same dimension may be inconsistent, so the two artwork word vector sequences can be used as the input of a Siamese long short-term memory (Siamese LSTM) network model, which accommodates variable-length sequence pairs. The reference artwork word vector sequence and the target artwork word vector sequence are respectively input into the two neural network models, and the artwork similarity component between the two input sequences is evaluated by calculating the distance between the two vector sequences. The distance between two vector sequences is mainly calculated as the Manhattan distance. In addition, the artwork similarity vector can also be calculated through other algorithms, which is not particularly limited in this exemplary embodiment.

After the artwork similarity components of each dimension are obtained, a weighted calculation can be performed on the artwork similarity components according to the preset second weights to obtain a weighted calculation result. For example, the second weight for the genre can be set to 0.4, the second weight for the creation time can be set to 0.3, and the second weight for the preserving art institution can also be set to 0.3. The genre component of the artwork similarity is then multiplied by 0.4, the creation time component is multiplied by 0.3, and the art institution component is multiplied by 0.3, and the products are summed to obtain the corresponding calculation result.

In step S930, the artwork similarity is obtained according to the weighted calculation result, and it is judged whether the artwork similarity is greater than the second threshold. The artwork similarity is the result of the weighted calculation over the artwork similarity components of each dimension. Therefore, the second threshold can be set according to the overall value range of the artwork similarity, and it can be judged whether the artwork similarity is greater than the second threshold. For example, the second threshold can be set to 2. When the weighted calculation result is 0.8, the artwork similarity is less than the second threshold because 0.8 is less than 2; when the weighted calculation result is 3.2, the artwork similarity is greater than the second threshold because 3.2 is greater than 2.
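
Written as a formula, the weighted calculation and the threshold test for artworks take the following form. The symbols are introduced here for illustration and do not appear in the original text: the s terms denote the per-dimension similarity components, S denotes the artwork similarity, and T_2 denotes the second threshold. The weights and the threshold value are those of the example above.

```latex
S_{\text{artwork}}
  = 0.4\, s_{\text{genre}}
  + 0.3\, s_{\text{creation time}}
  + 0.3\, s_{\text{institution}},
\qquad
\text{fuse} \iff S_{\text{artwork}} > T_2, \quad T_2 = 2 .
```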

In step S940, the reference artwork data and the target artwork data whose artwork similarity is greater than the second threshold are fused to generate fused artwork data. When the judgment result is that the artwork similarity is greater than the second threshold, it can be determined that the reference artwork data and the target artwork data point to the same artwork, so the reference artwork data and the target artwork data are fused to obtain the fused artwork data.

In this exemplary embodiment, by calculating the similarity vectors of the respective dimensions corresponding to the reference artwork data and the target artwork data, the reference artwork data and the target artwork data that meet the preset conditions can be fused to obtain the fused artwork data. The calculation method is simple, the fusion accuracy is high, and the accuracy of the artwork data acquisition is improved.

In step S730, the reference art institution data in the second structured data and the target art institution data are fused to generate fused art institution data.

In an alternative embodiment, FIG. 10 shows a schematic flowchart of a method for obtaining fused art institution data. As shown in FIG. 10, the method includes at least the following steps: In step S1010, the reference art institution data in the second structured data and the target art institution data are vectorized according to a word vector model to obtain art institution word vector sequences. The word vector model may be a Word2Vec model. Through the Word2Vec model, each segmented word can be expressed as a word vector, so that the text is digitized and can be better processed by a computer, and the resulting vectors also reflect semantic information. To exploit this semantic information, the Word2Vec model can adopt two specific implementations, namely the Continuous Bag-of-Words (CBOW) model and the Skip-gram model. Preferably, the Skip-gram model is used for vector conversion of the word sequence. With the Skip-gram model, a 300-dimensional real-valued vector can be used to uniquely represent a word in the word space. The reference art institution data and the target art institution data are each represented as a matrix whose size is the number of words in the sequence multiplied by 300, which gives the corresponding art institution word vector sequence.

In step S1020, the art institution similarity vector between the art institution word vector sequences is calculated, and a weighted calculation is performed according to the third weights of the art institution similarity vector. Art institution similarity may involve similarity vectors in multiple dimensions, such as the country where the art institution is located, the establishment time of the art institution, and the number of works in the art institution's collection. Therefore, the art institution similarity vector between the word vector sequences of each dimension of the art institution can be calculated first.

For example, the lengths of two art institution word vector sequences of the same dimension may be inconsistent, so the two art institution word vector sequences can be used as the input of a Siamese long short-term memory (Siamese LSTM) network model, which accommodates variable-length sequence pairs. The reference art institution word vector sequence and the target art institution word vector sequence are respectively input into the two neural network models, and the art institution similarity component between the two input sequences is evaluated by calculating the distance between the two vector sequences. The distance between two vector sequences is mainly calculated as the Manhattan distance. In addition, the art institution similarity vector can also be calculated by other algorithms, which is not specifically limited in this exemplary embodiment.

After the art institution similarity components of each dimension are obtained, a weighted calculation can be performed on them according to the preset third weights to obtain a weighted calculation result. For example, the dimensions of the art institution similarity component can include the country where the art institution is located, the time when the art institution was established, and the number of works in the art institution's collection. The third weight for the country can be set to 0.5, the third weight for the establishment time can be set to 0.2, and the third weight for the number of works in the collection can be set to 0.3. The country component of the art institution similarity is then multiplied by 0.5, the establishment time component is multiplied by 0.2, and the collection size component is multiplied by 0.3, and the products are summed to obtain the corresponding calculation result.

In step S1030, the art institution similarity is obtained according to the weighted calculation result, and it is judged whether the art institution similarity is greater than the third threshold. The art institution similarity is the result of the weighted calculation over the art institution similarity components of each dimension. Therefore, the third threshold can be set according to the overall value range of the art institution similarity, and it can be judged whether the art institution similarity is greater than the third threshold. For example, the third threshold can be set to 3. When the weighted calculation result is 0.8, the art institution similarity is less than the third threshold because 0.8 is less than 3; when the weighted calculation result is 3.2, the art institution similarity is greater than the third threshold because 3.2 is greater than 3.

In step S1040, the reference art institution data and the target art institution data whose art institution similarity is greater than the third threshold are fused to generate fused art institution data. When the judgment result is that the art institution similarity is greater than the third threshold, it can be determined that the reference art institution data and the target art institution data point to the same art institution. Therefore, the reference art institution data and the target art institution data are fused to obtain the fused art institution data.

In this exemplary embodiment, by calculating the similarity vectors of the respective dimensions corresponding to the reference art institution data and the target art institution data, the reference art institution data and the target art institution data that meet the preset conditions can be fused to obtain the fused art institution data. The calculation method is simple, the fusion accuracy is high, and the accuracy of the art institution data acquisition is improved.

In step S140, an art triad is generated according to the art entity and the art relationship, and an art domain knowledge graph is generated according to the art triad.

In the exemplary embodiment of the present disclosure, the art entities that can be extracted from the fused art data may include artists, artworks, and art institutions. It is worth noting that when the fused art data also includes other art entities, these may also be used in generating the art domain knowledge graph.

The knowledge graph, also known as a scientific knowledge map, is a series of graphs that show the development process and structural relationships of knowledge. It uses visualization technology to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the interrelationships within it. It combines the theories and methods of applied mathematics, graphics, information visualization technology, information science, and other disciplines with methods such as bibliometric citation analysis and co-occurrence analysis, and uses visualized maps to vividly display the core structure, development history, frontier fields, and overall knowledge architecture of a discipline, thereby achieving multidisciplinary integration and providing practical and valuable references for disciplinary research. The knowledge graph is a structured semantic knowledge base used to describe concepts and their relationships in the physical world in symbolic form. Its basic units are entity-relation-entity triples and the attribute-value pairs of entities; entities are connected to each other through relationships, forming a networked knowledge structure.

Therefore, after the artist, artwork, and art institution entities and the entity relationships among the three are extracted, the association model of the art domain knowledge graph can be constructed, and a visual art domain knowledge graph can also be drawn through a drawing program.
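
As a minimal sketch of how triples might be generated from fused art data and assembled into a graph structure, consider the following. The record layout, the relation names "created" and "collected_by", and the adjacency-list representation are all assumptions; an actual implementation could equally store the triples in a graph database.

```python
from collections import defaultdict


def build_triples(fused_records):
    """Turn each fused record into entity-relation-entity triples."""
    triples = []
    for record in fused_records:
        triples.append((record["artist"], "created", record["artwork"]))
        triples.append((record["artwork"], "collected_by", record["institution"]))
    return triples


def build_graph(triples):
    """Index triples by head entity, giving a simple adjacency list."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)


fused = [
    {
        "artist": "Leonardo da Vinci",
        "artwork": "Mona Lisa",
        "institution": "Louvre Museum",
    }
]

triples = build_triples(fused)
graph = build_graph(triples)
print(triples)
print(graph["Mona Lisa"])
```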

The method for constructing the art domain knowledge graph in the embodiments of the present disclosure will be described in detail below in conjunction with an application scenario.

FIG. 11 shows a schematic flowchart of a method for constructing an art domain knowledge graph in an application scenario. As shown in FIG. 11, step S1110 is the internal data source, that is, the original structured data. In addition, structured data can also be loaded from external data sources to enrich the original structured data.

In step S1111, data cleaning and error correction are performed. Specifically, FIG. 12 shows a schematic flowchart of a method for the first preprocessing of data in an application scenario. As shown in FIG. 12, in step S1210, the original data is loaded. Specifically, the structured data in the internal data source and the external data source is loaded as the original structured data.

In step S1211, data processing (splitting and cleaning) is performed. Specifically, FIG. 13 shows a schematic flowchart of a method for performing data cleaning on the original structured data in an application scenario. As shown in FIG. 13, in step S1310, the original data is loaded. Specifically, the structured data in the internal data source and the external data source is loaded as the original structured data.

In step S1311, data processing (single-valued attributes) is performed. That is, single-valued attribute judgment processing is performed on the original structured data.

In step S1312, the entity table (with attribute values) and the relation table are obtained, and the error information table is compiled. The first structured entity and the first structured relationship in the single-valued structured data are obtained according to the result of the single-valued attribute judgment processing, and the multi-valued data that does not satisfy the single-valued attribute is collected to obtain the corresponding multi-valued data table.

In step S1313, it is determined whether the error information table is empty. Specifically, it is determined whether the multi-valued data table still contains multi-valued data that has not been updated. When the multi-valued data table is empty, the first structured entity and the first structured relationship are output; when the multi-valued data table is not empty, the multi-valued data can be reviewed to update the error correction dictionary or the data dictionary.
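
The cleaning loop of steps S1311 to S1313 can be sketched as follows. The representation is hypothetical: a value is treated as multi-valued when it contains a ";" separator, the error correction dictionary maps a raw value to its reviewed replacement, and the relation table is omitted for brevity.

```python
def clean(rows, single_valued, corrections):
    """Split rows into an entity table and an error table of multi-valued rows."""
    entity_table, error_table = [], []
    for row in rows:
        # Apply the error correction dictionary before judging the attributes.
        fixed = {key: corrections.get(value, value) for key, value in row.items()}
        bad = sorted(key for key in single_valued if ";" in fixed.get(key, ""))
        if bad:
            error_table.append((row, bad))  # kept for review
        else:
            entity_table.append(fixed)
    return entity_table, error_table


rows = [
    {"name": "Mona Lisa", "artist": "Leonardo da Vinci"},
    {"name": "Self-Portrait", "artist": "Painter A;Painter B"},
]
single_valued = {"name", "artist"}

entities, errors = clean(rows, single_valued, corrections={})
print(len(entities), len(errors))  # 1 1

# After review, the error correction dictionary is updated and cleaning is rerun.
reviewed = {"Painter A;Painter B": "Painter A"}
entities, errors = clean(rows, single_valued, corrections=reviewed)
print(len(entities), len(errors))  # 2 0
```

Once the error table is empty, the entity table would be output as the first structured entities.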

After data cleaning is performed on the original structured data, in step S1212, a repeatability judgment is performed. Specifically, repeatability checks can be performed on the results of the data cleaning. FIG. 14 shows a schematic flowchart of the processing method used when the artwork repeatability check result is "same", that is, when the paintings have the same name. As shown in FIG. 14, in step S1410, the paintings with the same name are obtained. Specifically, the paintings whose artwork repeatability check results are "same" are obtained.

In step S1411, it is judged whether the authors are the same. Specifically, the artist repeatability check is performed, and the artist repeatability check result is generated.

In step S1412, it is determined that the creation time is the same. Specifically, when the artist repeatability test results are the same, the creation time repeatability test is further performed to generate the creation time repeatability test result.

In step S1413, the painting judgment is repeated. Specifically, when the result of the repeatability test of creation time is the same, the two paintings are determined as duplicate paintings.

In step S1414, the data is fused. Specifically, data fusion processing is performed on two repeated paintings, and the corresponding data fusion processing result is obtained.

In step S1415, it is reviewed. Specifically, the data fusion processing result is manually reviewed, and the manual review result is obtained.

In step S1416, the data/dictionary is updated. Specifically, when the manual review result is approved, a data dictionary is generated for updating.

In step S1417, the paintings are determined to share a name. When the artist or the creation time of the paintings with the same name is determined to be different, the two paintings are determined to be distinct paintings of the same name.

In step S1418, a review is performed. Specifically, the two paintings with the same name are de-duplicated, and the de-duplication result is manually reviewed to determine its accuracy.

In step S1213, entities are deduplicated/fused. Specifically, the results of the above de-duplication or fusion processing, as well as the data cleaning results that contain no duplicates, are manually checked, and incorrect information such as painting names and artist names is manually verified.
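By way of a minimal sketch, the decision chain of FIG. 14 (same name, then same artist, then same creation time) might be written as below. The record fields name, artist, and created, and the helper names, are hypothetical; the manual review steps are not modeled.

```python
from itertools import combinations

def classify_pair(a, b):
    """Classify two painting records.

    Returns "different" (names differ), "duplicate" (name, artist and
    creation time all match) or "same_name" (name matches, but the artist
    or the creation time differs).
    """
    if a["name"] != b["name"]:
        return "different"
    if a["artist"] == b["artist"] and a["created"] == b["created"]:
        return "duplicate"
    return "same_name"

def find_pairs(paintings):
    """Return (duplicates, same_name) lists of index pairs for later review."""
    duplicates, same_name = [], []
    for (i, a), (j, b) in combinations(enumerate(paintings), 2):
        kind = classify_pair(a, b)
        if kind == "duplicate":
            duplicates.append((i, j))
        elif kind == "same_name":
            same_name.append((i, j))
    return duplicates, same_name
```

Pairs reported as duplicates would proceed to fusion and review (steps S1414 to S1416), while same-name pairs would proceed to the same-name branch (steps S1417 and S1418).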

In step S1214, a dictionary is generated. Specifically, a corresponding data dictionary or error correction dictionary can be generated according to the result of the error checking process.

In step S1215, it is judged whether the data specification is correct. Specifically, the naming convention of the generated data dictionary or error correction dictionary may be inconsistent with the storage specification of the database, in which case a further data normalization step can be performed, and the normalized entries are added to the data dictionary or error correction dictionary.

In step S1112, the updated data is obtained. Specifically, the target art data, that is, the updated data, can be generated according to the generated data dictionary or error correction dictionary.
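The following sketch illustrates, under stated assumptions, how a normalized error correction dictionary might be built and applied to produce the updated data. The storage convention (Unicode NFKC form with collapsed whitespace) and all function names are assumptions of this example and are not specified by the present disclosure.

```python
import unicodedata

def normalize_name(name):
    """Normalise a name to an assumed storage convention:
    NFKC form, collapsed inner whitespace, no leading/trailing blanks."""
    return " ".join(unicodedata.normalize("NFKC", name).split())

def build_correction_dict(reviewed_pairs):
    """Map each reviewed wrong spelling to its normalised correct form."""
    return {normalize_name(wrong): normalize_name(right)
            for wrong, right in reviewed_pairs}

def apply_dict(records, field, correction_dict):
    """Return updated copies of the records with `field` normalised and corrected."""
    updated = []
    for record in records:
        value = normalize_name(record[field])
        updated.append({**record, field: correction_dict.get(value, value)})
    return updated
```

For example, a reviewed pair such as ("Leonardo daVinci", "Leonardo da Vinci") would cause later records carrying the wrong spelling to be rewritten to the corrected form.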

In step S1113, the data is fused. Specifically, data fusion processing can be performed on the first structured data and the second structured data. The second structured data may be structured data obtained by crawling semi-structured data from an external data source, processing it, and then storing it in a MySQL database. FIG. 15 shows a schematic flowchart of a method for generating fused art data in an application scenario. As shown in FIG. 15, in step S1510, external data is crawled. Specifically, semi-structured data is crawled from external data sources. The external data source may be a public data source on the Internet or another data source, which is not particularly limited in this exemplary embodiment.

In step S1511, the semi-structured data is analyzed. Specifically, the second preprocessing is performed on the semi-structured data according to preset rules and regular expressions to obtain structured data.
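As a hedged illustration of rule-based parsing with regular expressions, the sketch below converts one line of semi-structured text into a structured record. The pipe-separated line layout and the field names are hypothetical; actual preset rules would depend on the crawled source.

```python
import re

# Hypothetical pattern for a line such as
# "Mona Lisa | Leonardo da Vinci | c. 1503 | oil on poplar panel".
LINE = re.compile(
    r"^\s*(?P<title>[^|]+?)\s*\|\s*(?P<artist>[^|]+?)\s*\|"
    r"\s*(?:c\.\s*)?(?P<year>\d{4})\s*\|\s*(?P<medium>[^|]+?)\s*$"
)

def parse_line(line):
    """Return a structured record, or None when the line does not match."""
    match = LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["year"] = int(record["year"])  # store the year as a number
    return record
```

Lines that do not match any rule return None in this sketch and could be collected for manual inspection.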

In step S1512, structured data is obtained, such as artwork, artist, and art institution data. Further, the obtained structured data can also be standardized to generate the second structured data.

In step S1513, after the artists are grouped according to birth year and month, Word2Vec is used to calculate the similarity of each relationship and attribute, so as to obtain the similarity vector of two artists with the same birth year and month. Specifically, the Word2Vec algorithm is used to calculate the artist similarity vector, over each dimension, between the reference artist data and the target artist data.

In step S1514, the similarity vector of the previous step is weighted to obtain the similarity of the artist. Specifically, the artist similarity vector is weighted and calculated according to the first weight to generate the corresponding artist similarity.

In step S1515, it is determined whether the similarity is higher than the set threshold 1. Specifically, the artist similarity is compared with the first threshold.

In step S1516, the corresponding artists are fused. Specifically, when the artist similarity is greater than the first threshold, the reference artist data and the target artist data are fused to generate fused artist data.
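Steps S1513 to S1516 can be sketched as follows. To keep the example self-contained, a character-bigram count vector is used as a stand-in for a trained Word2Vec embedding; this substitution, the compared fields, the weights, and the threshold value are all assumptions of the example rather than values taken from the present disclosure.

```python
import math
from collections import Counter, defaultdict

def embed(text):
    """Stand-in for a Word2Vec embedding: character-bigram counts."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

FIELDS = ("name", "nationality", "genre")  # assumed dimensions
WEIGHTS = (0.6, 0.2, 0.2)                  # assumed first weight
THRESHOLD_1 = 0.8                          # assumed first threshold

def similarity_vector(ref, tgt):
    """One similarity value per compared attribute."""
    return [cosine(embed(ref[f]), embed(tgt[f])) for f in FIELDS]

def artist_similarity(ref, tgt):
    """Weighted sum of the per-attribute similarities."""
    return sum(w * s for w, s in zip(WEIGHTS, similarity_vector(ref, tgt)))

def match_artists(reference, target):
    """Compare only artists sharing a birth year, then apply the threshold."""
    by_year = defaultdict(list)
    for t in target:
        by_year[t["birth_year"]].append(t)
    pairs = []
    for r in reference:
        for t in by_year.get(r["birth_year"], []):
            if artist_similarity(r, t) > THRESHOLD_1:
                pairs.append((r["name"], t["name"]))
    return pairs
```

Grouping by birth year before comparison limits the number of pairwise comparisons. The same structure would apply to artworks grouped by author with the second weight and second threshold, and to art institutions with the third weight and third threshold.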

In step S1517, after the artworks are grouped according to author, Word2Vec is used to calculate the similarity of each relationship and attribute, and the similarity vector of two artworks by the same author is obtained. Specifically, the Word2Vec algorithm is used to calculate the artwork similarity vector, over each dimension, between the reference artwork data and the target artwork data.

In step S1518, the similarity vector of the previous step is weighted to obtain the similarity of the artwork. Specifically, the artwork similarity vector is weighted according to the second weight to generate the corresponding artwork similarity.

In step S1519, it is determined whether the similarity is higher than the set threshold 2. Specifically, the artwork similarity is compared with the second threshold.

In step S1520, the corresponding artworks are fused. Specifically, when the artwork similarity is greater than the second threshold, the reference artwork data and the target artwork data are fused to generate fused artwork data.

In step S1521, for the art institutions, Word2Vec is used to calculate the similarity of each relationship and attribute, and the similarity vector of a pair of art institutions is obtained. Specifically, the Word2Vec algorithm is used to calculate the art institution similarity vector, over each dimension, between the reference art institution data and the target art institution data.

In step S1522, the similarity vector of the previous step is weighted to obtain the similarity of the art institution. Specifically, the art institution similarity vector is weighted according to the third weight to generate the corresponding art institution similarity.

In step S1523, it is determined whether the similarity is higher than the set threshold 3. Specifically, the art institution similarity is compared with the third threshold.

In step S1524, the corresponding art institutions are fused. Specifically, when the art institution similarity is greater than the third threshold, the reference art institution data and the target art institution data are fused to generate fused art institution data.

In step S1114, the data is merged. Specifically, the fused artist data, fused artwork data, and fused art institution data are obtained after fusion.
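The present disclosure does not fix a particular rule for merging a matched pair of records. As one possible sketch, the target record could be kept as the base, gaps could be filled from the reference record, and a differing reference name could be retained as an alias. The function name fuse_records and the aliases field are hypothetical.

```python
def fuse_records(reference, target):
    """Fuse a matched pair: keep target values, fill gaps from the reference,
    and record a differing reference name as an alias."""
    fused = dict(target)
    for key, value in reference.items():
        if key != "aliases" and fused.get(key) in (None, ""):
            fused[key] = value  # fill a missing or empty target value
    aliases = set(target.get("aliases", [])) | set(reference.get("aliases", []))
    if reference.get("name") and reference["name"] != target.get("name"):
        aliases.add(reference["name"])
    fused["aliases"] = sorted(aliases)
    return fused
```

Other policies, such as preferring the more recent source or flagging conflicts for manual review, would fit the same interface.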

In addition, after the fused data is obtained, the data quality can also be evaluated in step S1115. Specifically, art data from external data sources that was not used for matching is extracted to evaluate the fused data. In this solution, the main evaluation indicators include the accuracy and completeness of the fused data.
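The present disclosure names accuracy and completeness as indicators without defining them. The sketch below adopts two simple definitions as assumptions: completeness as the share of non-empty attribute cells, and accuracy as the share of compared cells that agree with held-out external data.

```python
def completeness(records, fields):
    """Share of (record, field) cells that hold a non-empty value."""
    cells = [r.get(f) for r in records for f in fields]
    if not cells:
        return 0.0
    return sum(v not in (None, "") for v in cells) / len(cells)

def accuracy(records, gold, key, fields):
    """Share of compared cells whose value equals the held-out value."""
    gold_by_key = {g[key]: g for g in gold}
    compared = correct = 0
    for r in records:
        g = gold_by_key.get(r[key])
        if g is None:
            continue  # no held-out record to compare against
        for f in fields:
            if f in g:
                compared += 1
                correct += r.get(f) == g[f]
    return correct / compared if compared else 0.0
```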

In step S1116, the entity relationship is extracted. Specifically, the art entity in the fused art data and the art relationship corresponding to the art entity are extracted to realize the schema design of the database. The schema contains schema objects, which can be tables, columns, data types, views, stored procedures, relationships, primary keys, foreign keys, and the like. The database model can be represented by a visual diagram, which shows the art entities and their relationships with each other.
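For illustration only, a small relational schema with tables, typed columns, primary keys, and a foreign key might look as follows. The example uses the sqlite3 module of the Python standard library so that it is self-contained; this is a substitution for the MySQL database mentioned above, and the table and column names are hypothetical.

```python
import sqlite3

# Two entity tables linked by a foreign key (artwork -> artist).
SCHEMA = """
CREATE TABLE artist (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    birth_year INTEGER
);
CREATE TABLE artwork (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    created_year INTEGER,
    artist_id INTEGER NOT NULL REFERENCES artist(id)
);
"""

def create_schema(connection):
    """Create the example tables on an open SQLite connection."""
    connection.execute("PRAGMA foreign_keys = ON")
    connection.executescript(SCHEMA)

connection = sqlite3.connect(":memory:")
create_schema(connection)
```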

In step S1117, the KG_neo4j database (MySQL) is generated. Specifically, the generated art domain knowledge graph composed of art triads of artists, artworks, and art institutions is obtained and stored as a whole in a graph database, such as Neo4j. FIG. 16 shows a schematic interface diagram of a visualized knowledge graph of the art field. As shown in FIG. 16, the art entities include artists, artworks, and art institutions. The entities related to the artist may include nationality, death year, birthplace, birth year, genre, and the like, and the attributes corresponding to the artist include English names and aliases; the entities related to the artwork may include creation year, creation medium, category, theme, and the like, and the attributes corresponding to the artwork include a unique identifier (ID), aliases, and dimensions; the attributes corresponding to the art institution include English names.
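As a non-limiting sketch, art triads can be derived from a fused record and rendered as parameterized Cypher statements for a graph database such as Neo4j. The Entity label, the REL relationship type, and the function names are assumptions of this example; the code only builds the query text and parameters and does not connect to a database.

```python
def to_triples(entity):
    """Turn one fused record into (head, relation, tail) triples."""
    head = entity["name"]
    return [(head, relation, tail)
            for relation, tail in entity.items()
            if relation != "name" and tail not in (None, "")]

def to_cypher(triple):
    """Render a triple as a parameterised Cypher MERGE statement.

    The :Entity label and the REL relationship type are hypothetical; the
    relation name is stored as a property so it can be passed as a parameter.
    """
    head, relation, tail = triple
    query = ("MERGE (h:Entity {name: $head}) "
             "MERGE (t:Entity {name: $tail}) "
             "MERGE (h)-[:REL {type: $relation}]->(t)")
    return query, {"head": head, "relation": relation, "tail": tail}
```

Using MERGE rather than CREATE in this sketch avoids inserting the same node or relationship twice when triads from different records share an entity.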

The art domain knowledge graph can be applied to an art encyclopedia, an art graph, art knowledge question answering, and an art knowledge overview. FIG. 17 shows a schematic diagram of a scene applied in the art encyclopedia. As shown in FIG. 17, after the user initiates a search, the query can be recognized as Da Vinci through art entity recognition, the thulac word segmentation package, and the data dictionary, and the knowledge related to Da Vinci is displayed. FIG. 18 shows a schematic diagram of a scene applied in the knowledge graph. As shown in FIG. 18, the knowledge graph drawn by the drawing component E-charts is displayed visually. FIG. 19 shows a schematic diagram of a scene applied in art knowledge question answering. As shown in FIG. 19, the user's question is segmented by the thulac word segmentation package, and a visual knowledge graph corresponding to the art question is generated from the matching results of preset rules or regular expressions. FIG. 20 shows a schematic diagram of a scene applied in the art knowledge overview. As shown in FIG. 20, a data dictionary can be used to generate the corresponding art knowledge overview.
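The question answering scene can be sketched with regular-expression rules that map a question to a relation and an entity, followed by a lookup in the triads. The rules, the alias dictionary, and the function name answer are hypothetical, and word segmentation with the thulac package is omitted from this simplified example.

```python
import re

# Hypothetical rules: each pattern captures an entity and implies a relation.
RULES = [
    (re.compile(r"^who (?:painted|created) (?P<entity>.+?)\??$", re.I), "artist"),
    (re.compile(r"^when was (?P<entity>.+?) (?:painted|created)\??$", re.I), "year"),
]

def answer(question, triples, aliases=None):
    """Return the tail of the first triple matching the question, else None."""
    aliases = aliases or {}
    question = question.strip()
    for pattern, relation in RULES:
        match = pattern.match(question)
        if match is None:
            continue
        entity = match.group("entity")
        entity = aliases.get(entity.lower(), entity)  # data dictionary lookup
        for head, rel, tail in triples:
            if head.lower() == entity.lower() and rel == relation:
                return tail
    return None
```

In this sketch the alias dictionary plays the role of the data dictionary, resolving alternative names to the canonical entity name before the lookup.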

In the exemplary embodiment of the present disclosure, on the one hand, the present disclosure uses data in an external data source and standardized data to perform data fusion processing, which greatly increases the scale of entity knowledge in the art field and improves the acquisition of knowledge in the art field. On the other hand, generating an art domain knowledge graph based on art entities and art relationships helps to improve the relevance of entities in the knowledge graph and the comprehensiveness of knowledge graph search, to understand query intentions more accurately, and to improve retrieval accuracy.

It should be noted that although the above exemplary embodiments describe the steps of the method in the present disclosure in a specific order, this does not require or imply that these steps must be performed in the specific order, or that all steps must be performed in order to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be decomposed into multiple steps for execution, and so on.

In addition, in an exemplary embodiment of the present disclosure, a device for constructing a knowledge graph in the art field is also provided. FIG. 21 shows a schematic diagram of the structure of an art domain knowledge graph construction device. As shown in FIG. 21, the art domain knowledge graph construction device 2100 may include a data processing module 2110, a data analysis module 2120, a data fusion module 2130, and a graph generation module 2140, wherein:

The data processing module 2110 is configured to perform first preprocessing on the structured data in the internal art data source and the external art data source to generate first structured data; the data analysis module 2120 is configured to perform second preprocessing on the unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data; the data fusion module 2130 is configured to perform fusion processing on the first structured data and the second structured data to generate fused art data, wherein the fused art data includes art entities and art relationships corresponding to the art entities; and the graph generation module 2140 is configured to generate art triads according to the art entities and the art relationships, and to generate an art domain knowledge graph according to the art triads.

The specific details of the construction device of the above-mentioned art domain knowledge graph have been described in detail in the corresponding art domain knowledge graph construction method, so it will not be repeated here.

It should be noted that although several modules or units of the apparatus 2100 for constructing the knowledge graph of the art field are mentioned in the above detailed description, this division is not mandatory. In fact, according to the embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied. The modules or units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more modules or units may be integrated into one module or unit. The above-mentioned modules or units can be implemented in the form of hardware or software functional modules or units. The specific hardware can be a CPU, a microprocessor, a GPU, an FPGA, or a single-chip microcomputer.

In addition, although the various steps of the method in the present disclosure are described in a specific order in the drawings, this does not require or imply that these steps must be performed in the specific order, or that all the steps shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be decomposed into multiple steps for execution, etc.

Through the description of the above embodiments, those skilled in the art can easily understand that the example embodiments described here can be implemented by software, or can be implemented by combining software with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) or on a network, and which includes several instructions to make a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) execute the method according to the embodiments of the present disclosure.

In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided. The electronic device includes a processor and a memory for storing executable instructions of the processor; the processor is configured to execute the above-mentioned method for constructing a knowledge graph in the art field by executing the executable instructions.

The electronic device 2200 according to this embodiment of the present invention will be described below with reference to FIG. 22. The electronic device 2200 shown in FIG. 22 is only an example, and should not bring any limitation to the function and application scope of the embodiment of the present invention.

As shown in FIG. 22, the electronic device 2200 is represented in the form of a general-purpose computing device. The components of the electronic device 2200 may include, but are not limited to: the aforementioned at least one processing unit 2210, the aforementioned at least one storage unit 2220, a bus 2230 connecting different system components (including the storage unit 2220 and the processing unit 2210), and a display unit 2240.

The storage unit stores program code, and the program code can be executed by the processing unit 2210, so that the processing unit 2210 executes the steps according to the various exemplary embodiments described in the "Exemplary Method" section of this specification.

The storage unit 2220 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 2221 and/or a cache storage unit 2222, and may further include a read-only storage unit (ROM) 2223.

The storage unit 2220 may also include a program/utility tool 2224 having a set of (at least one) program modules 2225. Such program modules 2225 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data. Each of these examples, or some combination thereof, may include the implementation of a network environment.

The bus 2230 may represent one or more of several types of bus structures, including a storage unit bus or a storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of multiple bus structures.

The electronic device 2200 may also communicate with one or more external devices 2400 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 2200, and/or with any device (such as a router, modem, etc.) that enables the electronic device 2200 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 2250. In addition, the electronic device 2200 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through a network adapter 2260. As shown in the figure, the network adapter 2260 communicates with other modules of the electronic device 2200 through the bus 2230. It should be understood that although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 2200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

Through the description of the above embodiments, those skilled in the art can easily understand that the exemplary embodiments described here can be implemented by software, or can be implemented by combining software with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) or on a network, and which includes several instructions to make a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, a non-volatile computer-readable storage medium is also provided, on which a computer program capable of implementing the above method of this specification is stored. In some possible embodiments, various aspects of the present invention can also be implemented in the form of a program product, which includes program code, and when the program product runs on a terminal device, the program code is used to cause the terminal device to execute the steps according to various exemplary embodiments of the present invention described in the above-mentioned "Exemplary Method" section of this specification.

Referring to FIG. 23, a program product 2300 for implementing the above method according to an embodiment of the present invention is described. It can adopt a portable compact disk read-only memory (CD-ROM), include program code, and run on a terminal device, for example a personal computer. However, the program product of the present invention is not limited to this. In this document, the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.

The program product can use any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the above. More specific examples (a non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.

The computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. The readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.

The program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.

The program code used to perform the operations of the present invention can be written in any combination of one or more programming languages. The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, via the Internet using an Internet service provider).

Those skilled in the art will easily think of other embodiments of the present disclosure after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptive changes of the present disclosure. These variations, uses, or adaptive changes follow the general principles of the present disclosure and include common knowledge or conventional technical means in the technical field that are not disclosed in the present disclosure. The description and the embodiments are only regarded as exemplary, and the true scope and spirit of the present disclosure are pointed out by the claims.

Claims

1. A method for constructing a knowledge graph in the art field, wherein the method comprises:

Performing first preprocessing on the structured data in the internal art data source and the external art data source to generate first structured data;

Performing a second preprocessing on the unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data;

Performing fusion processing on the first structured data and the second structured data to generate fused art data; wherein the fused art data includes an art entity and an art relationship corresponding to the art entity;

An art triad is generated according to the art entity and the art relationship, and an art domain knowledge graph is generated according to the art triad.
2. The method for constructing an art domain knowledge graph according to claim 1, wherein the first preprocessing of the structured data in the internal art data source and the external art data source to generate the first structured data comprises:

Perform data cleaning on the structured data in internal art data sources and external art data sources;

Perform repeatability inspection on the data cleaning results of the structured data in the internal art data source and the external art data source, and generate repeatability inspection data;

A data dictionary and an error correction dictionary are generated according to the repeatability check data, and the first structured data is obtained based on the data dictionary.
3. The method for constructing an art domain knowledge graph according to claim 2, wherein said performing data cleaning on the structured data in the internal art data source and the external art data source comprises:

Perform single-value attribute judgment processing on the structured data in the internal art data source and the external art data source to obtain single-value structured data;

Acquiring the first structured entity and the first structured relationship in the single-valued structured data, and calculating the result of the single-valued attribute determination processing to obtain a multi-valued data table;

When the multi-value data table does not contain multi-value data, use the first structured entity and the first structured relationship as a data cleaning result;

When the multi-valued data table contains multi-valued data, the second structured entity and the second structured relationship are obtained according to the multi-valued data table as the data cleaning result.
4. The method for constructing a knowledge graph in the art field according to claim 3, wherein the obtaining the second structured entity and the second structured relationship according to the multi-valued data table as a result of data cleaning comprises:

Update the data dictionary or the error correction dictionary according to the multi-value data table;

According to the updated result of the updated data dictionary or the error correction dictionary, the second structured entity and the second structured relationship are obtained as the data cleaning result.
5. The method for constructing a knowledge graph in the art domain according to claim 4, wherein said performing repeatability inspection on the data cleaning result of the structured data to generate repeatability inspection data comprises:

Performing the repeatability inspection of the artwork entity on the data cleaning results of the structured data in the internal art data source and the external art data source, and generating the artwork repeatability inspection result;

When the artwork repeatability inspection result is the same, perform the artist entity repeatability inspection on the data cleaning result to generate the artist repeatability inspection result;

When the artist repeatability check results are the same, perform a creation time entity repeatability check on the data cleaning result to generate a creation time repeatability test result;

When the creation time repeatability check result is the same, it is determined that the artwork entity is a duplicate artwork;

Fusion processing is performed on the duplicate artwork, and repeatability inspection data is generated according to the approved fusion processing result.
6. The method for constructing a knowledge graph in the art field according to claim 5, wherein the method further comprises:

When the repeatability test result of the artist is different or the creation time repeatability test result is different, the artwork entity is determined to be an artwork of the same name;

De-duplication processing is performed on the artwork with the same name, and the repeatability inspection data is generated according to the de-duplication processing result.
7. The method for constructing a knowledge graph in the art field according to claim 1, wherein the first structured data includes target artwork data, target artist data, and target art institution data;

The fusion processing of the first structured data and the second structured data to generate fused art data includes:

Performing fusion processing on the reference artist data in the second structured data and the target artist data to generate fused artist data;

Performing fusion processing on the reference artwork data in the second structured data and the target artwork data to generate fused artwork data;

Fusion processing is performed on the reference art institution data in the second structured data and the target art institution data to generate fused art institution data.
8. The method for constructing a knowledge graph in the art domain according to claim 7, wherein said fusing the reference artist data in the second structured data with the target artist data to generate fused artist data includes:

Performing vector conversion on the reference artist data in the second structured data and the target artist data according to the word vector model to obtain an artist word vector sequence;

Calculating artist similarity vectors between the artist word vector sequences, and performing weighted calculation according to the first weight of the artist data similarity vectors;

Obtaining the artist similarity according to the weighted calculation result, and judging whether the artist similarity is greater than the first threshold;

Perform fusion processing on the reference artist data and the target artist data corresponding to the artist similarity greater than the first threshold to generate fused artist data.
9. The method for constructing a knowledge graph in the art domain according to claim 7, wherein said fusing the reference artwork data in the second structured data with the target artwork data to generate fused artwork data comprises:

Performing vector conversion on the reference artwork data in the second structured data and the target artwork data according to the word vector model to obtain an artwork word vector sequence;

Calculate the artwork similarity vector between the artwork word vector sequences, and perform weighted calculation according to the second weight of the artwork similarity vector;

Obtaining artwork similarity according to the weighted calculation result, and judging whether the artwork similarity is greater than a second threshold;

Fusion processing is performed on the reference artwork data and the target artwork data corresponding to the artwork similarity greater than the second threshold to generate fused artwork data.
10. The method for constructing an art domain knowledge graph according to claim 7, wherein said performing fusion processing on the reference art institution data in the second structured data and the target art institution data to generate fused art institution data comprises:

Performing vector conversion on the reference art institution data in the second structured data and the target art institution data according to the word vector model to obtain an art institution word vector sequence;

Calculating the art institution similarity vector between the art institution word vector sequences, and performing weighting calculation according to the third weight of the art institution similarity vector;

Obtaining the similarity of the art institution according to the weighted calculation result, and judging whether the similarity of the art institution is greater than the third threshold;

Fusion processing is performed on the reference art institution data and the target art institution data corresponding to the art institution similarity greater than the third threshold to generate fused art institution data.
11. A device for constructing a knowledge graph in the art field, comprising:

The data processing module is configured to perform the first preprocessing on the structured data in the internal art data source and the external art data source to generate the first structured data;

A data analysis module configured to perform second preprocessing on unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data;

The data fusion module is configured to perform fusion processing on the first structured data and the second structured data to generate fused art data; wherein the fused art data includes an art entity and an art relationship corresponding to the art entity;

The graph generation module is configured to generate an art triad according to the art entity and the art relationship, and generate an art domain knowledge graph according to the art triad.
12. A non-volatile computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for constructing a knowledge graph in the art field according to any one of claims 1-10.
13. An electronic device, comprising:

A processor;

A memory for storing executable instructions of the processor;

Wherein the processor is configured to execute the method for constructing an art domain knowledge graph according to any one of claims 1-10 by executing the executable instructions.