WO2021147786A1 - Knowledge graph construction method and apparatus, storage medium, and electronic device - Google Patents

Knowledge graph construction method and apparatus, storage medium, and electronic device

Info

Publication number
WO2021147786A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
art
structured
artwork
artist
Prior art date
Application number
PCT/CN2021/072241
Other languages
French (fr)
Chinese (zh)
Inventor
李慧
许蕾
郝吉芳
杨卓士
商晓健
王炳乾
Original Assignee
京东方科技集团股份有限公司
Application filed by 京东方科技集团股份有限公司
Publication of WO2021147786A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to the technical field of knowledge graph construction, and in particular to a method for constructing a knowledge graph in the art field, a device for constructing a knowledge graph in the art field, a computer-readable storage medium, and an electronic device.
  • A knowledge graph is also called a scientific knowledge graph.
  • A knowledge graph uses visualization technology to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and its interrelationships. It is a series of different graphics that show the development process and structural relationships of knowledge, and it provides a better way to organize, manage, and understand the massive amount of information on the Internet.
  • the knowledge graph is also the prototype of building a next-generation search engine, making search more semantic and intelligent.
  • Knowledge graphs include general knowledge graphs and domain knowledge graphs.
  • The domain knowledge graph is also called the industry knowledge graph or the vertical knowledge graph. It is usually oriented to a specific field and is equivalent to an industry knowledge base based on semantic technology. Since the domain knowledge graph is constructed from industry data, it has a more rigorous and richer data model, and also places higher requirements on the depth and accuracy of domain knowledge.
  • The purpose of the present disclosure is to overcome the above-mentioned shortcomings of the prior art, and to provide a method for constructing a knowledge graph in the art field, a device for constructing a knowledge graph in the art field, a computer-readable storage medium, and an electronic device.
  • A method for constructing a knowledge graph in the art field, comprising: performing first preprocessing on structured data in an internal art data source and an external art data source to generate first structured data; performing second preprocessing on unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data; performing fusion processing on the first structured data and the second structured data to generate fused art data, wherein the fused art data includes an art entity and an art relationship corresponding to the art entity; and generating art triads according to the art entity and the art relationship, and generating a knowledge graph of the art field according to the art triads.
  • Performing the first preprocessing on the structured data in the internal art data source and the external art data source to generate the first structured data includes: performing data cleaning on the structured data in the internal art data source and the external art data source; performing a repeatability check on the data cleaning results of the structured data to generate repeatability inspection data; and generating a data dictionary and an error correction dictionary based on the repeatability inspection data, and generating the first structured data based on the data dictionary.
  • Performing data cleaning on the structured data in the internal art data source and the external art data source includes: performing single-valued attribute determination processing on the structured data in the internal art data source and the external art data source to obtain single-valued structured data; acquiring the first structured entity and the first structured relationship in the single-valued structured data, and counting the results of the single-valued attribute determination processing to obtain a multi-valued data table; when the multi-valued data table does not contain multi-valued data, using the first structured entity and the first structured relationship as the data cleaning result; and when the multi-valued data table contains multi-valued data, obtaining the second structured entity and the second structured relationship according to the multi-valued data table as the data cleaning result.
  • Obtaining the second structured entity and the second structured relationship according to the multi-valued data table as the data cleaning result includes: updating a data dictionary or an error correction dictionary according to the multi-valued data table; and obtaining, according to the update result of the data dictionary or the error correction dictionary, the second structured entity and the second structured relationship as the data cleaning result.
  • Performing the repeatability check on the data cleaning result of the structured data to generate repeatability inspection data includes: performing a repeatability check of the artwork entity on the data cleaning result of the structured data to generate an artwork repeatability check result; when the artwork repeatability check result is the same, performing a repeatability check of the artist entity on the data cleaning result to generate an artist repeatability check result; when the artist repeatability check result is the same, performing a repeatability check of the creation time entity on the data cleaning result to generate a creation time repeatability check result; when the creation time repeatability check result is the same, determining that the artwork entity is a duplicate artwork; and performing fusion processing on the duplicate artwork, and generating repeatability inspection data according to the approved fusion processing result.
  • The method further includes: when the artist repeatability check result is different or the creation time repeatability check result is different, determining that the artwork entity is a same-name artwork; and performing de-duplication processing on the same-name artwork, and generating the repeatability inspection data according to the result of the de-duplication processing.
  • The first structured data includes target artwork data, target artist data, and target art institution data; and performing fusion processing on the first structured data and the second structured data to generate fused art data includes: performing fusion processing on the reference artist data in the second structured data and the target artist data to generate fused artist data; performing fusion processing on the reference artwork data in the second structured data and the target artwork data to generate fused artwork data; and performing fusion processing on the reference art institution data in the second structured data and the target art institution data to generate fused art institution data.
  • Performing fusion processing on the reference artist data in the second structured data and the target artist data to generate fused artist data includes: performing vector conversion on the reference artist data in the second structured data and the target artist data according to a word vector model to obtain artist word vector sequences; calculating an artist similarity vector between the artist word vector sequences, and performing a weighted calculation according to a first weight of the artist similarity vector; obtaining an artist similarity according to the weighted calculation result, and determining whether the artist similarity is greater than a first threshold; and fusing the reference artist data corresponding to an artist similarity greater than the first threshold with the target artist data to generate fused artist data.
  • Performing fusion processing on the reference artwork data in the second structured data and the target artwork data to generate fused artwork data includes: performing vector conversion on the reference artwork data in the second structured data and the target artwork data according to a word vector model to obtain artwork word vector sequences; calculating an artwork similarity vector between the artwork word vector sequences, and performing a weighted calculation according to a second weight of the artwork similarity vector; obtaining an artwork similarity according to the weighted calculation result, and determining whether the artwork similarity is greater than a second threshold; and fusing the reference artwork data corresponding to an artwork similarity greater than the second threshold with the target artwork data to generate fused artwork data.
  • Performing fusion processing on the reference art institution data in the second structured data and the target art institution data to generate fused art institution data includes: performing vector conversion on the reference art institution data in the second structured data and the target art institution data according to a word vector model to obtain art institution word vector sequences; calculating an art institution similarity vector between the art institution word vector sequences, and performing a weighted calculation according to a third weight of the art institution similarity vector; obtaining an art institution similarity according to the weighted calculation result, and determining whether the art institution similarity is greater than a third threshold; and fusing the reference art institution data corresponding to an art institution similarity greater than the third threshold with the target art institution data to generate fused art institution data.
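  • The three fusion procedures above share one pattern: vectorize both records, compute a similarity vector, weight it, and fuse when the weighted similarity exceeds a threshold. The following is a minimal Python sketch of that pattern under stated assumptions. The character-bigram `embed` function is only a toy stand-in for the word vector model (the disclosure does not name one), the similarity vector is assumed to hold one cosine similarity per compared field, and the field names, weights, and threshold are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a word vector model: character-bigram counts."""
    text = text.lower().strip()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similarity(reference, target, weights):
    """Weighted sum over the per-field similarity vector (assumed layout)."""
    sim_vector = [cosine(embed(reference[f]), embed(target[f])) for f in weights]
    return sum(w * s for w, s in zip(weights.values(), sim_vector))


def fuse(reference, target, weights, threshold):
    """Merge the two records when the similarity exceeds the threshold."""
    if similarity(reference, target, weights) <= threshold:
        return None
    merged = dict(reference)
    # Non-empty target values take precedence in this sketch.
    merged.update({k: v for k, v in target.items() if v})
    return merged
```

  • The same `fuse` function could be called with a first, second, or third weight set and threshold for artists, artworks, and art institutions respectively; only the compared fields would change.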
  • A device for constructing a knowledge graph in the art field, comprising: a data processing module configured to perform first preprocessing on structured data in an internal art data source and an external art data source to generate first structured data; a data analysis module configured to perform second preprocessing on unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data; a data fusion module configured to perform fusion processing on the first structured data and the second structured data to generate fused art data, wherein the fused art data includes an art entity and an art relationship corresponding to the art entity; and a graph generation module configured to generate an art triad according to the art entity and the art relationship, and to form an art domain knowledge graph according to the art triad.
  • An electronic device, including: a processor and a memory; wherein computer-readable instructions are stored in the memory, and the computer-readable instructions, when executed by the processor, implement the method for constructing a knowledge graph in the art field of any of the foregoing exemplary embodiments.
  • A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for constructing an art domain knowledge graph in any of the above exemplary embodiments.
  • Fig. 1 schematically shows a flowchart of a method for constructing a knowledge graph in the art field in an exemplary embodiment of the present disclosure
  • Fig. 2 schematically shows a flow chart of a method for generating first structured data in an exemplary embodiment of the present disclosure
  • FIG. 3 schematically shows a flow chart of a method for data cleaning in an exemplary embodiment of the present disclosure
  • FIG. 4 schematically shows a flow chart of a method for obtaining a data cleaning result in an exemplary embodiment of the present disclosure
  • Fig. 5 schematically shows a flow chart of a method for generating repetitive inspection data in an exemplary embodiment of the present disclosure
  • Fig. 6 schematically shows a flow chart of another method for generating repetitive inspection data in an exemplary embodiment of the present disclosure
  • Fig. 7 schematically shows a flow chart of a method for generating fused art data in an exemplary embodiment of the present disclosure
  • Fig. 8 schematically shows a flow chart of a method for obtaining fusion artist data in an exemplary embodiment of the present disclosure
  • FIG. 9 schematically shows a flow chart of a method for obtaining fused artwork data in an exemplary embodiment of the present disclosure
  • FIG. 10 schematically shows a flow chart of a method for obtaining fusion art institution data in an exemplary embodiment of the present disclosure
  • FIG. 11 schematically shows a flow chart of a method for constructing an art domain knowledge graph of an application scenario in an exemplary embodiment of the present disclosure
  • FIG. 12 schematically shows a flow chart of a method for first preprocessing of data in an application scenario in an exemplary embodiment of the present disclosure
  • FIG. 13 schematically shows a flow chart of a method for data cleaning in an application scenario in an exemplary embodiment of the present disclosure
  • FIG. 14 schematically shows a flowchart of a processing method when painting is repeated in an application scenario in an exemplary embodiment of the present disclosure
  • Fig. 15 schematically shows a flow chart of a method for generating fusion art data in an application scenario in an exemplary embodiment of the present disclosure
  • FIG. 16 schematically shows an interface diagram of a visualized art domain knowledge graph in an application scenario in an exemplary embodiment of the present disclosure
  • FIG. 17 schematically shows a scene diagram of the application of the art domain knowledge graph in the art encyclopedia in an exemplary embodiment of the present disclosure
  • FIG. 18 schematically shows a schematic diagram of a scene in which the knowledge graph of the art field is applied to the knowledge graph in an exemplary embodiment of the present disclosure
  • FIG. 19 schematically shows a scene diagram of the application of the art domain knowledge graph in the art knowledge question and answer in an exemplary embodiment of the present disclosure
  • FIG. 20 schematically shows a schematic diagram of a scene in which the art domain knowledge graph is applied to an overview of art knowledge in an exemplary embodiment of the present disclosure
  • FIG. 21 schematically shows a structure diagram of an apparatus for constructing a knowledge graph in the art field in an exemplary embodiment of the present disclosure
  • FIG. 22 schematically shows an electronic device for implementing a method for constructing a knowledge graph in the art field in an exemplary embodiment of the present disclosure
  • FIG. 23 schematically illustrates a computer-readable storage medium used to implement a method for constructing a knowledge graph in the art field in an exemplary embodiment of the present disclosure.
  • The terms “a”, “an”, “the” and “said” are used to indicate the presence of one or more elements/components/etc.; the terms “including” and “have” are used in an open-ended, inclusive sense and mean that there may be other elements/components/etc. in addition to those listed; the terms “first” and “second” etc. are used only as labels and do not limit the number of their objects.
  • Existing domain knowledge graph construction has corresponding defects. Specifically, methods for constructing knowledge graphs of English professional domains are not fully applicable to constructing knowledge graphs of Chinese professional domains; existing methods for constructing professional-domain knowledge graphs cannot take into account both the scale and the accuracy of professional knowledge; and it is difficult to integrate domain knowledge acquired from multiple data sources.
  • FIG. 1 shows a flow chart of a method for constructing a knowledge graph of the art field.
  • The method for constructing a knowledge graph of the art field includes at least the following steps:
  • Step S110: Perform first preprocessing on the structured data in the internal art data source and the external art data source to generate first structured data.
  • Step S120: Perform second preprocessing on the unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data.
  • Step S130: Perform fusion processing on the first structured data and the second structured data to generate fused art data, wherein the fused art data includes an art entity and an art relationship corresponding to the art entity.
  • Step S140: Generate an art triad according to the art entity and the art relationship, and generate an art domain knowledge graph according to the art triad.
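  • The final step, turning fused entities and relationships into art triads and then into a graph, can be sketched in Python as follows. This is only an illustrative sketch: the record layout (`entity`, `relations`) and the relation names are hypothetical, and a triad is assumed to be a (head, relation, tail) tuple.

```python
from collections import defaultdict


def generate_triads(fused_records):
    """Turn each fused record into (head, relation, tail) triads."""
    triads = []
    for record in fused_records:
        head = record["entity"]
        for relation, tail in record["relations"].items():
            # A relation may point at one tail entity or at several.
            tails = tail if isinstance(tail, list) else [tail]
            triads.extend((head, relation, t) for t in tails)
    return triads


def build_graph(triads):
    """Index triads as an adjacency map: head -> list of (relation, tail)."""
    graph = defaultdict(list)
    for head, relation, tail in triads:
        graph[head].append((relation, tail))
    return dict(graph)
```

  • An adjacency map is used here only because it is the simplest structure that supports lookups by entity; a production system would more likely load the triads into a graph database.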
  • The present disclosure uses data in an external data source and standardized data to perform data fusion processing, which greatly increases the scale of entity knowledge in the art field and improves the acquisition of knowledge in the art field.
  • Generating an art domain knowledge graph based on art triads helps to improve the relevance of entities in the knowledge graph and the comprehensiveness of knowledge graph search, to understand query intentions more accurately, and to improve the accuracy of retrieval.
  • In step S110, first preprocessing is performed on the structured data in the internal art data source and the external art data source to generate the first structured data.
  • The internal art data source and the external art data source may be distinguished according to the source of the art data.
  • The data in the internal art data source may be mainly manually processed structured data.
  • The data in the external art data source can be crawled from public data on the Internet and is mainly semi-structured data.
  • Internal art data sources may also contain unstructured data and semi-structured data.
  • External art data sources may also contain structured data and unstructured data. Therefore, structured data can be obtained from both the internal data sources and the external data sources.
  • FIG. 2 shows a schematic flow chart of the method for generating the first structured data. As shown in FIG. 2, the method includes at least the following steps: In step S210, data cleaning is performed on the structured data in the internal art data source and the external art data source.
  • FIG. 3 shows a schematic flow chart of a method for data cleaning of structured data.
  • The method includes at least the following steps:
  • In step S310, single-valued attribute determination processing is performed on the structured data in the internal art data source and the external art data source to obtain single-valued structured data.
  • a single-valued attribute may be an attribute whose data has only one specific value.
  • The single-valued attribute determination for structured data can be, for example, determining whether a painting has only one author. When a painting has two author values, for example two different renderings of the name Van Gogh, the painting does not pass the single-valued attribute determination.
  • Otherwise, the single-valued structured data for the author of the painting can be obtained as Van Gogh.
  • the single-value structured data may also include paintings, creation time, genres, nationalities, etc., which are not specifically limited in this exemplary embodiment.
  • In step S320, the first structured entity and the first structured relationship in the single-valued structured data are obtained, and the results of the single-valued attribute determination processing are counted to obtain a multi-valued data table.
  • The corresponding first structured entity and first structured relationship can be extracted.
  • The first structured entity may include an artist entity, artwork entity, creation time entity, genre entity, nationality entity, etc.; for an artist entity, the corresponding structured relationships may include the relationships with the artwork entities created, the creation time, the genre formed, and the nationality to which the artist belongs.
  • the structured data that failed the single-value attribute judgment can be counted to obtain a multi-value data table.
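  • The single-valued attribute determination and the multi-valued data table can be sketched in Python as follows. This is a minimal illustration, assuming the structured data arrives as flat rows; the row keys (`artwork`, `author`) and the function name are hypothetical.

```python
from collections import defaultdict


def split_single_and_multi(rows, attribute):
    """Separate artworks whose attribute has one value from those with several."""
    values = defaultdict(set)
    for row in rows:
        values[row["artwork"]].add(row[attribute])
    # Exactly one distinct value: passes the single-valued determination.
    single = {k: next(iter(v)) for k, v in values.items() if len(v) == 1}
    # More than one distinct value: recorded in the multi-valued data table.
    multi = {k: sorted(v) for k, v in values.items() if len(v) > 1}
    return single, multi
```

  • An empty `multi` table corresponds to the case where the first structured entity and relationship can be used directly as the data cleaning result.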
  • In step S330, when the multi-valued data table does not contain multi-valued data, the first structured entity and the first structured relationship are used as the data cleaning result.
  • When no multi-valued data is counted, or when all the multi-valued data in the multi-valued data table has been updated, a further review can be carried out.
  • When the review is a manual review and the manual review is passed, the obtained first structured entity and first structured relationship are directly determined as the data cleaning result.
  • the way of review can also be automated review according to preset rules to save review process and labor costs and improve the accuracy of review work.
  • the review step mentioned in the embodiment of the present invention can be an automatic review according to a custom set rule, or it can be a direct review manually. Both manual review and automatic review are interchangeable.
  • In step S340, when the multi-valued data table contains multi-valued data, the second structured entity and the second structured relationship are obtained according to the multi-valued data table as the data cleaning result.
  • FIG. 4 shows a schematic flow chart of a method for obtaining data cleaning results.
  • The method includes at least the following steps:
  • In step S410, the data dictionary or the error correction dictionary is updated according to the multi-valued data table.
  • For example, the multi-valued data table for the author of a painting may include two values that are both renderings of the name Van Gogh.
  • Since one value is an alias of the other, the alias can be replaced with the standard name, and a corresponding error correction dictionary is generated for the update.
  • The error correction dictionary can be a database that stores information related to operations such as adding and modifying data, including the data source, the relationship with other data, usage, and format; correspondingly, the data dictionary can be a database that stores normative information such as format and content, together with the data source, the relationship with other data, usage, and format.
  • In step S420, the second structured entity and the second structured relationship are obtained as the data cleaning result according to the update result of the data dictionary or the error correction dictionary.
  • The update results of the data dictionary and the error correction dictionary can be further checked until the multi-valued data in all multi-valued data tables has been updated, and the second structured entity and the second structured relationship are obtained as the data cleaning result.
  • data cleaning results are generated for the structured entities and structured relationships in the multi-valued data table, which facilitates the update of knowledge in the art field, and the update method is simple and accurate.
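  • Resolving a multi-valued data table with an error correction dictionary can be sketched as below. This is a hedged illustration rather than the disclosed implementation: the error correction dictionary is assumed to be a simple alias-to-standard-name mapping, and all names and sample values are hypothetical.

```python
def resolve_multi_values(multi_table, correction_dict):
    """Apply alias -> standard-name corrections to a multi-valued data table.

    Returns (resolved, still_multi): entries that collapse to one value
    after correction, and entries that still need a dictionary update.
    """
    resolved, still_multi = {}, {}
    for key, candidates in multi_table.items():
        canonical = {correction_dict.get(c, c) for c in candidates}
        if len(canonical) == 1:
            resolved[key] = canonical.pop()
        else:
            still_multi[key] = sorted(canonical)
    return resolved, still_multi
```

  • Repeating the call after each dictionary update until `still_multi` is empty mirrors the loop described above, in which updates continue until all multi-valued data has been handled.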
  • In step S220, a repeatability check is performed on the data cleaning results of the structured data in the internal art data source and the external art data source, and repeatability inspection data is generated.
  • The structured entities include artwork entities, artist entities, and creation time entities.
  • FIG. 5 shows a schematic flow chart of a method for generating repeatability inspection data. As shown in FIG. 5, the method includes at least the following steps: In step S510, a repeatability check of the artwork entity is performed on the data cleaning results of the structured data in the internal art data source and the external art data source, and an artwork repeatability check result is generated.
  • The check of the artwork entity may be to obtain the names of the artworks, determine whether the names are consistent, and generate the corresponding artwork repeatability check result.
  • For example, when two paintings both named “Mona Lisa” are subjected to the repeatability check of the artwork entity, the result is the same; when a painting named “Mona Lisa” and a painting titled “Girl with a Pearl Earring” are subjected to the repeatability check of the artwork entity, the result is different.
  • The repeatability check determines whether two entities are substantially the same. For example, if the author of an artwork is recorded under both a full name and an abbreviation, both refer to the same author, and the result of the repeatability check should be the same.
  • In step S520, when the artwork repeatability check result is the same, the artist entity repeatability check is performed on the data cleaning result to generate the artist repeatability check result.
  • For example, when the names of two paintings are both "Mona Lisa", the two paintings may have been reworked by later artists, may come from different museums, or may be different paintings for other reasons, so a further determination can be made.
  • Specifically, a repeatability check can be performed on the artist entity that created the artwork.
  • In step S530, when the artist repeatability check result is the same, the creation time entity repeatability check is performed on the data cleaning result, and the creation time repeatability check result is generated. For example, when two paintings both named "Mona Lisa" correspond to the same artist, it can be determined that the artist repeatability check result is the same. The creation time can then also be checked for repeatability.
  • In step S540, when the creation time repeatability check result is the same, it is determined that the artwork entity is a duplicate artwork. For example, when two paintings both named "Mona Lisa" correspond to the same artist and creation time, it can be determined that the creation time repeatability check result is the same. Therefore, the artwork can be determined to be a duplicate artwork based on the repeatability check results of these three dimensions.
  • In step S550, fusion processing is performed on the duplicate artwork, and repeatability inspection data is generated according to the approved fusion processing result.
  • the two artwork entities can be fused, and the result of the fusion processing can be manually reviewed. Manual review can further determine whether the repeatability test results of other dimensions are the same.
  • the repeatability test data of the artwork can be generated.
  • the data dictionary can also be updated based on the repeatability test data.
  • Entity fusion processing can be performed on duplicate paintings through judgments along three dimensions (artwork, artist, and creation time), and the data dictionary can be updated.
  • the data dictionary can be more accurately perfected to ensure that the knowledge of the data dictionary is updated, and the problem is also reduced.
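  • The cascaded check in steps S510 to S550 can be sketched in Python as follows. This is an illustrative sketch only: it uses exact string equality for each dimension, whereas the text notes that full names and abbreviations should also count as the same, so a real comparison would first normalize names. The record keys and the merge policy are hypothetical.

```python
def is_duplicate(a, b):
    """Cascade: artwork name, then artist, then creation time."""
    if a["name"] != b["name"]:        # artwork entity check (S510)
        return False
    if a["artist"] != b["artist"]:    # artist entity check (S520)
        return False
    return a["created"] == b["created"]  # creation time check (S530/S540)


def merge_duplicates(a, b):
    """Fuse two duplicate records, keeping the first non-empty value per field."""
    return {key: a.get(key) or b.get(key) for key in {**a, **b}}
```

  • In the described flow, the output of `merge_duplicates` would still go through manual or rule-based review before being written to the data dictionary.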
  • FIG. 6 shows a schematic flowchart of another method for generating repeatability inspection data.
  • The method includes at least the following steps:
  • In step S610, when the artist repeatability check result is different or the creation time repeatability check result is different, the artwork entity is determined to be a same-name artwork.
  • When the artwork repeatability check results are the same, it can be further determined whether the artist repeatability check results and the creation time repeatability check results are the same.
  • The repeatability check can be further performed on the creation time to determine whether the artwork is a same-name artwork.
  • In step S620, de-duplication processing is performed on the same-name artwork, and repeatability inspection data is generated according to the de-duplication processing result.
  • When two “Self-Portrait” paintings are same-name paintings, they can be de-duplicated, that is, the two paintings can be recorded as two separate data dictionary entries.
  • manual review can be carried out. Only the artwork with the same name that has passed the review can generate repetitive inspection data and update the data dictionary.
  • the way of review can also be automated review according to preset rules to save review process and labor costs and improve the accuracy of review work.
  • the data dictionary can be updated, which can avoid the problem of too few dimensions.
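  • De-duplicating same-name artworks so that each one gets its own data dictionary entry can be sketched as below. This is a minimal sketch that assumes a composite key of name, artist, and creation time is enough to tell same-name artworks apart; the key layout and record fields are hypothetical.

```python
def dedupe_same_name(records):
    """Store same-name artworks under distinct keys; exact repeats collapse."""
    dictionary = {}
    for rec in records:
        key = (rec["name"], rec["artist"], rec["created"])
        # setdefault keeps the first record seen for each composite key.
        dictionary.setdefault(key, rec)
    return dictionary
```

  • Using all three dimensions in the key is what keeps two different “Self-Portrait” paintings apart while still collapsing a record that repeats on every dimension.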
  • a data dictionary and an error correction dictionary are generated according to the repeatability check data, and the first structured data is obtained based on the data dictionary.
  • some attribute data is not included in the data dictionary, and the attribute data also belongs to the first structured data.
  • the standardization of these data can also be judged, and a data dictionary or error correction dictionary that meets the storage specification of the art field database is used as the target art data, or a data dictionary that does not conform to the storage specification Or an error correction dictionary for correction, or it can be used as target art data.
  • the corresponding target art data can be generated through the first preprocessing process of the structured data, the processing method is simple and accurate, the manual workload is reduced, and the practicability is extremely strong.
  • In step S120, second preprocessing is performed on the unstructured data and semi-structured data in the internal art data source and the external art data source to obtain the second structured data.
  • The second preprocessing can be a data cleaning step. Specifically, the cleaning work can include art data consistency checks, missing value processing, invalid value processing, and duplicate data judgment. Custom code can also be configured or embedded according to the processing requirements of the art data.
  • The data in the external art data source can be merged into the data in the internal art data source and used to fill it in, thereby expanding the internal art data source.
  • Semi-structured data can be crawled from public Internet data and processed to obtain second structured data that can be used for filling.
  • The semi-structured data can be parsed using preset rules and preset regular expressions.
  • For example, the rule [author]'s works include [work] can be constructed; from "Da Vinci's masterpiece Mona Lisa embodies his artistic attainments", the rule [author]'s masterpiece [work] can be constructed; from "Mona Lisa is an oil painting created by Italian painter Leonardo da Vinci", the rule [work] is created by [author] can be constructed; and from "Mona Lisa represents Leonardo's highest artistic achievement", the rule [work] represents [author] can be constructed.
  • With these rules, the semi-structured data can be parsed to obtain the target structured data.
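Rule-based parsing of this kind can be sketched with a few sentence templates. The two patterns below are hypothetical and are modeled only on the example sentences quoted above; the relation names `created_by` and `represents` are likewise assumptions, not terms from the source.

```python
import re
from typing import Optional, Tuple

# Hypothetical sentence templates modeled on the examples in the text.
RULES = [
    (
        re.compile(
            r"^(?P<work>.+?) is an? [\w ]+? created by "
            r"(?:[\w ]+? painter )?(?P<author>.+)$"
        ),
        "created_by",
    ),
    (
        re.compile(r"^(?P<work>.+?) represents (?P<author>.+?)'s .+$"),
        "represents",
    ),
]


def extract(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Return (work, relation, author) for the first matching rule, else None."""
    text = sentence.strip()
    for pattern, relation in RULES:
        match = pattern.match(text)
        if match:
            return (match.group("work"), relation, match.group("author"))
    return None
```

A real rule set would need many more templates and some normalization of author names (here "Leonardo" and "Leonardo da Vinci" come out as different strings).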
  • For example, a regular expression can be constructed so that the preceding content is filled into the date of birth, the content between the second comma and the third comma is filled into the nationality, and the content after the third comma is filled into the representative work. The second structured data can therefore also be obtained by performing the second preprocessing on the semi-structured data with the constructed regular expression.
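A comma-delimited regular expression of the kind described can be sketched as follows. The record layout is an assumption: the text names only the date of birth, nationality, and representative work fields, so treating the first field as the artist's name, and all of the group names, are hypothetical.

```python
import re
from typing import Dict, Optional

# Assumed layout: name, birth date, nationality, representative work.
RECORD = re.compile(
    r"^(?P<name>[^,]+),"
    r"(?P<birth_date>[^,]+),"
    r"(?P<nationality>[^,]+),"
    r"(?P<representative_work>.+)$"
)


def parse_artist(line: str) -> Optional[Dict[str, str]]:
    """Split one semi-structured line into named fields, or return None."""
    match = RECORD.match(line.strip())
    if match is None:
        return None
    # Trim the whitespace that usually follows each comma.
    return {field: value.strip() for field, value in match.groupdict().items()}
```

Lines that do not contain at least three commas return `None`, so they can be routed to review instead of being filled in incorrectly.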
  • In step S130, the first structured data and the second structured data are fused to generate fused art data, where the fused art data includes art entities and the art relationships corresponding to the art entities.
  • The first structured data includes target artwork data, target artist data, and target art institution data.
  • FIG. 7 shows a schematic flowchart of a method for generating fused art data. As shown in FIG. 7, the method includes at least the following steps:
  • In step S710, the reference artist data in the second structured data and the target artist data are fused to generate fused artist data.
  • FIG. 8 shows a schematic flowchart of a method for obtaining fused artist data.
  • The method includes at least the following steps:
  • In step S810, the reference artist data in the second structured data and the target artist data are vectorized according to a word vector model to obtain artist word vector sequences.
  • The word vector model may be a Word2Vec model.
  • Word2Vec is a tool released by Google in 2013 and can be regarded as an important application of deep learning in the field of natural language processing.
  • Although Word2Vec has only a three-layer neural network, it achieves very good results.
  • Through the Word2Vec model, each segmented word can be expressed as a word vector, so that text is digitized in a form a computer can process more readily, and the resulting vectors also reflect semantic information.
  • The Word2Vec model has two specific implementations, namely the Continuous Bag-of-Words (CBOW) model and the Skip-gram model.
  • The CBOW model predicts a word from its context.
  • The Skip-gram model predicts the context of a given input word.
  • The model is used in two parts: the first part builds the model, and the second part obtains the embedded word vectors through the model.
  • The Skip-gram model can be used for vector conversion of the word sequence.
  • A 300-dimensional real-valued vector can be used to uniquely represent a word in the word space.
  • The reference artist data and the target artist data are each represented as a matrix whose size is the number of words in the sequence by 300, which gives the corresponding artist word vector sequence.
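To make the Skip-gram idea concrete, the sketch below generates the (input word, context word) training pairs that a Skip-gram model learns from. It illustrates only the pair-generation step, not training itself; the function name and the `window` parameter are assumptions.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Pair every word with each neighbor at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs
```

After training on such pairs, each word is looked up as one 300-dimensional row, so a sequence of n words becomes an n by 300 matrix as described above.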
  • In step S820, the artist similarity vectors between the artist word vector sequences are calculated, and a weighted calculation is performed according to the first weights of the artist similarity vectors.
  • The artist similarity vectors may cover multiple dimensions, such as the artist's nationality and the artist's genre. Therefore, the artist similarity vector between the word vector sequences of each dimension of the artist can be calculated first.
  • The lengths of two artist word vector sequences of the same dimension may be inconsistent, so the two sequences can be used as the input of a Siamese long short-term memory (Siamese LSTM) network model, which accommodates sequence pairs of variable length.
  • The Siamese LSTM network model is composed of two identical neural network models, which achieve the twinning by sharing weights.
  • The reference artist word vector sequence and the target artist word vector sequence are input into the two neural network models respectively, and the artist similarity component between them is evaluated by calculating the distance between the two vector sequences.
  • The distance between the two vector sequences is calculated mainly with the Manhattan distance.
  • The artist similarity vector can also be calculated by other algorithms, which is not particularly limited in this exemplary embodiment.
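The distance step can be sketched on two fixed-length vectors. The text does not state how the Manhattan distance is turned into a similarity, so the mapping `exp(-distance)` below is an assumption (it is one common choice for Siamese LSTMs), and the input vectors stand in for the final outputs of the two weight-sharing networks, which are not modeled here.

```python
import math
from typing import Sequence


def manhattan_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of absolute coordinate differences (L1 distance)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def manhattan_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Map the L1 distance into (0, 1]; identical vectors score 1.0.

    The exp(-d) mapping is an assumption, not taken from the source.
    """
    return math.exp(-manhattan_distance(u, v))
```

With this mapping, each per-dimension similarity component lies between 0 and 1 and decreases as the two encodings move apart.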
  • The dimensions of the artist similarity component may include genre and nationality.
  • For example, the first weight can be set to 0.4 for the genre and 0.6 for the nationality, so that the genre component of the artist similarity component is multiplied by 0.4, the nationality component is multiplied by 0.6, and the two products are summed to obtain the weighted calculation result.
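The weighted calculation can be sketched as a dictionary-driven sum. The weights 0.4 and 0.6 come from the text; the function name, the dictionary keys, and the sample component values are assumptions.

```python
from typing import Dict

# First weights from the text: 0.4 for genre, 0.6 for nationality.
FIRST_WEIGHTS: Dict[str, float] = {"genre": 0.4, "nationality": 0.6}


def weighted_similarity(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Multiply each similarity component by its weight and sum the products."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing similarity components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)
```

For instance, a genre component of 0.5 and a nationality component of 1.0 give 0.4 × 0.5 + 0.6 × 1.0 = 0.8.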
  • In step S830, the artist similarity is obtained according to the weighted calculation result, and it is determined whether the artist similarity is greater than the first threshold.
  • The artist similarity is obtained after the weighted calculation is performed on the artist similarity components of each dimension. Therefore, the first threshold can be set according to the overall value of the artist similarity, and it can be judged whether the artist similarity is greater than the first threshold. For example, the first threshold can be set to 1.
  • When the weighted calculation result is 0.8, the artist similarity is not greater than the first threshold.
  • When the weighted calculation result is 1.2, the artist similarity is greater than the first threshold.
  • In step S840, the reference artist data and the target artist data whose artist similarity is greater than the first threshold are fused to generate fused artist data.
  • When the determination result is that the artist similarity is greater than the first threshold, it can be determined that the reference artist data and the target artist data point to the same artist, so the reference artist data and the target artist data are fused to obtain the fused artist data.
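Steps S830 and S840 can be sketched as a threshold check followed by a record merge. The text does not say how conflicting fields are resolved during fusion, so the policy below (target values are kept, reference values only fill empty fields) is an assumption, as are the function and field names.

```python
from typing import Dict, Optional


def fuse_if_similar(
    target: Dict[str, str],
    reference: Dict[str, str],
    similarity: float,
    threshold: float,
) -> Optional[Dict[str, str]]:
    """Fuse two records only when the similarity exceeds the threshold."""
    if similarity <= threshold:
        return None  # treated as different entities; nothing is fused
    fused = dict(target)
    for field, value in reference.items():
        # Assumed policy: keep target values, fill gaps from the reference.
        if fused.get(field) in (None, ""):
            fused[field] = value
    return fused
```

With the example values from the text and a threshold of 1, a similarity of 1.2 triggers fusion and a similarity of 0.8 does not.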
  • In this way, the reference artist data and the target artist data that meet the preset condition can be fused to obtain the fused artist data.
  • The calculation is simple and the fusion accuracy is high, which improves the accuracy of the acquired artist data.
  • In step S720, the reference artwork data in the second structured data and the target artwork data are fused to generate fused artwork data.
  • FIG. 9 shows a schematic flowchart of a method for obtaining fused artwork data.
  • The method includes at least the following steps:
  • In step S910, the reference artwork data in the second structured data and the target artwork data are vectorized according to the word vector model to obtain artwork word vector sequences.
  • The word vector model may be a Word2Vec model.
  • Through the Word2Vec model, each segmented word can be expressed as a word vector, so that text is digitized in a form a computer can process more readily, and the resulting vectors also reflect semantic information.
  • The Word2Vec model has two specific implementations, namely the Continuous Bag-of-Words (CBOW) model and the Skip-gram model.
  • The CBOW model predicts a word from its context.
  • The Skip-gram model can be used for vector conversion of the word sequence.
  • A 300-dimensional real-valued vector can be used to uniquely represent a word in the word space.
  • The reference artwork data and the target artwork data are each represented as a matrix whose size is the number of words in the sequence by 300, which gives the corresponding artwork word vector sequence.
  • In step S920, the artwork similarity vectors between the artwork word vector sequences are calculated, and a weighted calculation is performed according to the second weights of the artwork similarity vectors.
  • The artwork similarity vectors may cover multiple dimensions, such as the genre to which the artwork belongs, the creation time of the artwork, and the art institution where the artwork is preserved. Therefore, the artwork similarity vector between the word vector sequences of each dimension of the artwork can be calculated first.
  • The lengths of two artwork word vector sequences of the same dimension may be inconsistent, so the two sequences can be used as the input of a Siamese long short-term memory (Siamese LSTM) network model, which accommodates sequence pairs of variable length.
  • The reference artwork word vector sequence and the target artwork word vector sequence are input into the two neural network models respectively, and the artwork similarity component between them is evaluated by calculating the distance between the two vector sequences.
  • The distance between the two vector sequences is calculated mainly with the Manhattan distance.
  • The artwork similarity vector can also be calculated through other algorithms, which is not particularly limited in this exemplary embodiment.
  • For example, the second weight can be set to 0.4 for the genre, 0.3 for the creation time, and 0.3 for the preserving art institution. The genre component of the artwork similarity component is then multiplied by 0.4, the creation time component is multiplied by 0.3, and the art institution component is multiplied by 0.3, and the products are summed to obtain the weighted calculation result.
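Written out as a formula, the weighted calculation for artworks reads as follows. The symbols are assumptions introduced for illustration: $s_{\text{genre}}$, $s_{\text{time}}$, and $s_{\text{inst}}$ denote the genre, creation time, and art institution components, and the sample component values are hypothetical.

```latex
s_{\text{artwork}} = 0.4\, s_{\text{genre}} + 0.3\, s_{\text{time}} + 0.3\, s_{\text{inst}}

% Worked example with hypothetical component values:
% s_genre = 0.9, s_time = 0.5, s_inst = 1.0
s_{\text{artwork}} = 0.4 \times 0.9 + 0.3 \times 0.5 + 0.3 \times 1.0 = 0.36 + 0.15 + 0.30 = 0.81
```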
  • In step S930, the artwork similarity is obtained according to the weighted calculation result, and it is judged whether the artwork similarity is greater than the second threshold.
  • The artwork similarity is obtained after the weighted calculation is performed on the artwork similarity components of each dimension. Therefore, the second threshold can be set according to the overall value of the artwork similarity, and it can be judged whether the artwork similarity is greater than the second threshold. For example, the second threshold can be set to 2.
  • When the weighted calculation result is 0.8, the artwork similarity is not greater than the second threshold.
  • When the weighted calculation result is 3.2, the artwork similarity is greater than the second threshold.
  • In step S940, the reference artwork data and the target artwork data whose artwork similarity is greater than the second threshold are fused to generate fused artwork data.
  • When the judgment result is that the artwork similarity is greater than the second threshold, it can be determined that the reference artwork data and the target artwork data point to the same artwork, so the reference artwork data and the target artwork data are fused to obtain the fused artwork data.
  • In this way, the reference artwork data and the target artwork data that meet the preset condition can be fused to obtain the fused artwork data. The calculation is simple and the fusion accuracy is high, which improves the accuracy of the acquired artwork data.
  • In step S730, the reference art institution data in the second structured data and the target art institution data are fused to generate fused art institution data.
  • FIG. 10 shows a schematic flowchart of a method for obtaining fused art institution data.
  • The method includes at least the following steps:
  • In step S1010, the reference art institution data in the second structured data and the target art institution data are vectorized according to the word vector model to obtain art institution word vector sequences.
  • The word vector model may be a Word2Vec model.
  • Through the Word2Vec model, each segmented word can be expressed as a word vector, so that text is digitized in a form a computer can process more readily, and the resulting vectors also reflect semantic information.
  • The Word2Vec model has two specific implementations, namely the Continuous Bag-of-Words (CBOW) model and the Skip-gram model.
  • The CBOW model predicts a word from its context.
  • The Skip-gram model can be used for vector conversion of the word sequence.
  • A 300-dimensional real-valued vector can be used to uniquely represent a word in the word space.
  • The reference art institution data and the target art institution data are each represented as a matrix whose size is the number of words in the sequence by 300, which gives the corresponding art institution word vector sequence.
  • In step S1020, the art institution similarity vectors between the art institution word vector sequences are calculated, and a weighted calculation is performed according to the third weights of the art institution similarity vectors.
  • The art institution similarity vector between the word vector sequences of each dimension of the art institution can be calculated first.
  • The lengths of two art institution word vector sequences of the same dimension may be inconsistent, so the two sequences can be used as the input of a Siamese long short-term memory (Siamese LSTM) network model, which accommodates sequence pairs of variable length.
  • The reference art institution word vector sequence and the target art institution word vector sequence are input into the two neural network models respectively, and the art institution similarity component between them is evaluated by calculating the distance between the two vector sequences.
  • The distance between the two vector sequences is calculated mainly with the Manhattan distance.
  • The art institution similarity vector can also be calculated by other algorithms, which is not specifically limited in this exemplary embodiment.
  • The art institution similarity components can also be weighted according to the preset third weights to obtain the weighted calculation result.
  • The dimensions of the art institution similarity component can include the country where the art institution is located, the time when the art institution was established, and the number of works in the art institution's collection.
  • For example, the third weight can be set to 0.5 for the country, 0.2 for the establishment time, and 0.3 for the number of works in the collection. The country component of the art institution similarity component is then multiplied by 0.5, the establishment time component is multiplied by 0.2, and the collection size component is multiplied by 0.3, and the products are summed to obtain the weighted calculation result.
  • In step S1030, the art institution similarity is obtained according to the weighted calculation result, and it is judged whether the art institution similarity is greater than the third threshold.
  • A third threshold can be set according to the overall value of the art institution similarity, and it can be judged whether the art institution similarity is greater than the third threshold.
  • For example, the third threshold can be set to 3.
  • In step S1040, the reference art institution data and the target art institution data whose art institution similarity is greater than the third threshold are fused to generate fused art institution data.
  • When the judgment result is that the art institution similarity is greater than the third threshold, it can be determined that the reference art institution data and the target art institution data point to the same art institution. Therefore, the reference art institution data and the target art institution data are fused to obtain the fused art institution data.
  • In this way, the reference art institution data and the target art institution data that meet the preset condition can be fused to obtain the fused art institution data. The calculation is simple and the fusion accuracy is high, which improves the accuracy of the acquired art institution data.
  • In step S140, art triads are generated according to the art entities and the art relationships, and an art domain knowledge graph is generated according to the art triads.
  • The art entities that can be extracted from the fused art data may include artists, artworks, and art institutions. It is worth noting that when the fused art data also includes other art entities, these may likewise be used in generating the art domain knowledge graph.
  • A knowledge map, also known as a scientific knowledge map, is a series of graphs that show the development process and structural relationships of knowledge. It uses visualization technology to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the interrelationships among them. It combines the theories and methods of applied mathematics, graphics, information visualization technology, information science, and other disciplines with methods such as bibliometric citation analysis and co-occurrence analysis, and uses visualized maps to display the core of the subject.
  • The knowledge graph is a structured semantic knowledge base that describes concepts in the physical world and their relationships in symbolic form. Its basic units are entity-relation-entity triples and entities with their related attribute-value pairs. Entities are connected to each other through relationships, forming a networked knowledge structure.
  • The association model of the art domain knowledge graph can thus be constructed, and a visualized art domain knowledge graph can also be drawn with a drawing program.
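The entity-relation-entity structure can be sketched as a list of triples turned into an adjacency map. The relation names (`created`, `preserved_in`, `nationality`) and the helper `build_graph` are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# Illustrative art triads; the relation names are hypothetical.
TRIPLES: List[Triple] = [
    ("Leonardo da Vinci", "created", "Mona Lisa"),
    ("Mona Lisa", "preserved_in", "Louvre"),
    ("Leonardo da Vinci", "nationality", "Italy"),
]


def build_graph(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index triples by head entity so its outgoing relations can be listed."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)
```

Looking up an entity in the resulting map lists its outgoing relations, which is the networked structure the triples are meant to form.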
  • FIG. 11 shows a schematic flowchart of a method for constructing an art domain knowledge graph in an application scenario.
  • The input is an internal data source, that is, the original structured data.
  • FIG. 12 shows a schematic flowchart of a method for the first preprocessing of data in an application scenario.
  • The original data is loaded. Specifically, the structured data in the internal data source and the external data source is loaded as the original structured data.
  • In step S1211, the data is processed (split and cleaned).
  • FIG. 13 shows a schematic flowchart of a method for performing data cleaning on the original structured data in an application scenario.
  • The original data is loaded. Specifically, the structured data in the internal data source and the external data source is loaded as the original structured data.
  • In step S1311, the data is processed (single-valued attributes). That is, single-valued attribute judgment processing is performed on the original structured data.
  • In step S1312, the entity table (attribute values) and the relation table are obtained, and the error information table is compiled.
  • In step S1313, it is judged whether the error information table is empty. Specifically, it is determined whether the multi-value data table still contains multi-value data that has not been updated. When the multi-value data table is empty, the first structured entities and the first structured relationships are output; when the multi-value data table is not empty, the multi-value data can be reviewed to update the error correction dictionary or the data dictionary.
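The single-valued attribute check in steps S1311 to S1313 can be sketched as a split into an entity table and an error table. Everything concrete here is an assumption: the record layout (each attribute maps to a list of observed values), the set `SINGLE_VALUED`, and the shape of the error rows.

```python
from typing import Dict, List, Tuple

# Hypothetical set of attributes that may hold only one value per entity.
SINGLE_VALUED = {"birth_year", "nationality"}

Record = Dict[str, List[str]]


def split_records(records: List[Record]) -> Tuple[List[dict], List[dict]]:
    """Separate clean records from records with conflicting single-valued attributes."""
    entity_table: List[dict] = []
    error_table: List[dict] = []
    for record in records:
        conflicts = sorted(
            name for name in SINGLE_VALUED if len(set(record.get(name, []))) > 1
        )
        if conflicts:
            # Conflicting values wait for review instead of being stored.
            error_table.append({"record": record, "attributes": conflicts})
            continue
        entity_table.append(
            {
                name: (values[0] if name in SINGLE_VALUED and values else values)
                for name, values in record.items()
            }
        )
    return entity_table, error_table
```

An empty error table corresponds to the branch where the structured entities and relationships are output directly; otherwise the error rows are reviewed and the dictionaries are updated.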
  • FIG. 14 shows a schematic flowchart of the processing method when paintings have the same name. As shown in FIG. 14, in step S1410, paintings with the same name are obtained. Specifically, paintings whose repeatability inspection results are the same are obtained.
  • In step S1411, it is judged whether the author is the same. Specifically, the artist repeatability inspection is performed, and the artist repeatability inspection result is generated.
  • In step S1412, it is judged whether the creation time is the same. Specifically, when the artist repeatability inspection results are the same, the creation time repeatability inspection is further performed to generate the creation time repeatability inspection result.
  • In step S1413, the paintings are judged to be duplicates. Specifically, when the creation time repeatability inspection results are the same, the two paintings are determined to be duplicate paintings.
  • In step S1414, the data is fused. Specifically, data fusion processing is performed on the two duplicate paintings, and the corresponding data fusion processing result is obtained.
  • In step S1415, the result is reviewed. Specifically, the data fusion processing result is manually reviewed, and the manual review result is obtained.
  • In step S1416, the data or dictionary is updated. Specifically, when the manual review is passed, a data dictionary is generated for the update.
  • In step S1417, the paintings are judged to be same-name paintings. When the artist or the creation time of the identically named paintings is judged to be different, the two paintings are determined to be different paintings with the same name.
  • In step S1418, review is performed. The two same-name paintings are deduplicated, and the deduplication result is manually reviewed to confirm its accuracy.
  • In step S1213, entities are deduplicated or fused. The deduplication or fusion results from the preceding processing, as well as the data cleaning results for non-duplicate data, are checked manually, and incorrect information such as painting names and artist names is verified manually.
  • In step S1214, a dictionary is generated. Specifically, a corresponding data dictionary or error correction dictionary can be generated according to the result of the error checking process.
  • In step S1215, it is judged whether the data specification is correct.
  • The naming in the generated data dictionary or error correction dictionary may be inconsistent with the storage specification of the database, in which case further data normalization can be performed, and the normalized entries are added to the data dictionary or error correction dictionary.
  • In step S1112, the updated data is obtained.
  • The target art data, that is, the updated data, can be generated according to the generated data dictionary or error correction dictionary.
  • The data is then fused. Specifically, data fusion processing can be performed on the first structured data and the second structured data.
  • The second structured data may be structured data obtained by crawling semi-structured data from an external data source, processing it, and storing it in a MySQL database.
  • FIG. 15 shows a schematic flowchart of a method for generating fused art data in an application scenario.
  • External data is crawled.
  • Specifically, semi-structured data is crawled from external data sources.
  • The external data source may be a public data source on the Internet or another data source, which is not particularly limited in this exemplary embodiment.
  • In step S1511, the semi-structured data is parsed. Specifically, the second preprocessing is performed on the semi-structured data according to preset rules and regular expressions to obtain structured data.
  • In step S1512, structured data such as artworks, artists, and art institutions is obtained. Further, the obtained structured data can also be standardized to generate the second structured data.
  • In step S1513, after the artists are grouped by birth year and month, Word2Vec is used to calculate the similarity of each relationship and attribute, giving the similarity vector of two artists with the same birth year.
  • Specifically, the Word2Vec algorithm is used to calculate the artist similarity vectors for each dimension of the reference artist data and the target artist data.
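Grouping before comparison limits the similarity calculation to records that share a key. A minimal sketch, assuming each record is a dictionary with a hypothetical `birth` field in year-month form:

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Record = Dict[str, str]


def candidate_pairs(
    reference: List[Record], target: List[Record], key: str = "birth"
) -> List[Tuple[Record, Record]]:
    """Pair reference and target records only when they share the grouping key."""
    groups = defaultdict(lambda: ([], []))
    for record in reference:
        groups[record[key]][0].append(record)
    for record in target:
        groups[record[key]][1].append(record)
    pairs: List[Tuple[Record, Record]] = []
    for refs, targets in groups.values():
        # Only records inside the same group are ever compared.
        pairs.extend(product(refs, targets))
    return pairs
```

Only the returned pairs would be passed on to the per-dimension similarity calculation, which avoids comparing every reference record with every target record.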
  • In step S1514, the similarity vector from the previous step is weighted to obtain the artist similarity. Specifically, the artist similarity vector is weighted according to the first weights to generate the corresponding artist similarity.
  • In step S1515, it is judged whether the similarity is higher than the set threshold 1. Specifically, the artist similarity is compared with the first threshold.
  • In step S1516, the corresponding artists are fused. Specifically, when the artist similarity is greater than the first threshold, the reference artist data and the target artist data are fused to generate fused artist data.
  • In step S1517, after the artworks are grouped by author, Word2Vec is used to calculate the similarity of each relationship and attribute, giving the similarity vector of two artworks by the same author.
  • Specifically, the Word2Vec algorithm is used to calculate the artwork similarity vectors for each dimension of the reference artwork data and the target artwork data.
  • In step S1518, the similarity vector from the previous step is weighted to obtain the artwork similarity. Specifically, the artwork similarity vector is weighted according to the second weights to generate the corresponding artwork similarity.
  • In step S1519, it is judged whether the similarity is higher than the set threshold 2. Specifically, the artwork similarity is compared with the second threshold.
  • In step S1520, the corresponding artworks are fused. Specifically, when the artwork similarity is greater than the second threshold, the reference artwork data and the target artwork data are fused to generate fused artwork data.
  • In step S1521, for art institutions, Word2Vec is used to calculate the similarity of each relationship and attribute, giving the similarity vector of a pair of art institutions.
  • Specifically, the Word2Vec algorithm is used to calculate the art institution similarity vectors for each dimension of the reference art institution data and the target art institution data.
  • In step S1522, the similarity vector from the previous step is weighted to obtain the art institution similarity. Specifically, the art institution similarity vector is weighted according to the third weights to generate the corresponding art institution similarity.
  • In step S1523, it is judged whether the similarity is higher than the set threshold 3. Specifically, the art institution similarity is compared with the third threshold.
  • In step S1524, the corresponding art institutions are fused. Specifically, when the art institution similarity is greater than the third threshold, the reference art institution data and the target art institution data are fused to generate fused art institution data.
  • In step S1114, the data is merged. Specifically, the fused artist data, fused artwork data, and fused art institution data are obtained after fusion.
  • The data quality can also be evaluated in step S1115. Specifically, art data from external data sources that was not matched is extracted to evaluate the fused data.
  • The main evaluation indicators include the accuracy and completeness of the fused data.
  • The entity relationships are extracted.
  • Specifically, the art entities in the fused art data and the art relationships corresponding to the art entities are extracted to realize the schema design of the database.
  • The schema contains schema objects, which can be tables, columns, data types, views, stored procedures, relationships, primary keys, foreign keys, and so on.
  • The database model can be represented by a visual diagram that shows the art entities and their relationships with each other.
  • In step S1117, the data is stored in the KG_neo4j database (MySQL). Specifically, the generated art domain knowledge graph, composed of the art triads of artists, artworks, and art institutions, is obtained and stored as a whole in a graph database such as Neo4j.
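Storing a triad in a graph database such as Neo4j can be sketched by generating a parameterized Cypher `MERGE` statement. This sketch only builds the query text and its parameters; executing it would require a Neo4j driver session, which is not shown. The helper name, the node labels, the `name` property, and the relationship type are all assumptions.

```python
from typing import Dict, Tuple


def triple_to_cypher(
    head_label: str, head: str, relation: str, tail_label: str, tail: str
) -> Tuple[str, Dict[str, str]]:
    """Build a Cypher MERGE statement and its parameters for one triple."""
    # Labels and relationship types cannot be passed as Cypher parameters,
    # so they are validated and placed into the query text directly.
    for identifier in (head_label, relation, tail_label):
        if not identifier.isidentifier():
            raise ValueError(f"unsafe identifier: {identifier!r}")
    query = (
        f"MERGE (h:{head_label} {{name: $head}}) "
        f"MERGE (t:{tail_label} {{name: $tail}}) "
        f"MERGE (h)-[:{relation}]->(t)"
    )
    return query, {"head": head, "tail": tail}
```

Using `MERGE` rather than `CREATE` means that loading the same triad twice does not create duplicate nodes or relationships.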
  • FIG. 16 shows a schematic interface diagram of a visualized knowledge graph of the art field.
  • The art entities include artists, artworks, and art institutions. The entities related to an artist can include nationality, death year, birthplace, birth year, genre, and so on.
  • The attributes of an artist include the English name and aliases. The entities related to an artwork can include creation year, creation medium, category, theme, and so on, and the attributes of an artwork include a unique code (identity document, ID for short), aliases, and dimensions. The attributes of an art institution include the English name.
  • FIG. 17 shows a schematic diagram of the application scenario in the Art Encyclopedia. As shown in FIG. 17, after the user initiates a search, the query is recognized as Da Vinci through art entity recognition, the thulac word segmentation package, and the data dictionary, and the knowledge related to Da Vinci is displayed.
  • FIG. 18 shows a schematic diagram of the application scenario in the knowledge graph. As shown in FIG. 18, the knowledge graph drawn by the drawing component E-charts is displayed visually.
  • FIG. 19 shows a schematic diagram of the application scenario in art knowledge question answering. As shown in FIG. 19, the user's question is segmented with the thulac word segmentation package, and a visualized knowledge graph corresponding to the art question is generated from the matching results of preset rules or regular expressions.
  • FIG. 20 shows a schematic diagram of the application scenario in an overview of art knowledge. As shown in FIG. 20, a data dictionary can be used to generate the corresponding overview of art knowledge.
  • The present disclosure performs data fusion processing on data from an external data source and standardized data, which greatly increases the scale of entity knowledge in the art field and improves the acquisition of knowledge in the art field.
  • Generating an art domain knowledge graph based on art entities and art relationships helps to improve the relevance of entities in the knowledge graph and the comprehensiveness of knowledge graph search, to understand query intentions more accurately, and to improve retrieval accuracy.
  • FIG. 21 shows a schematic diagram of the structure of an art domain knowledge graph construction device.
  • the art domain knowledge graph construction device 2100 may include: a data processing module 2110, a data analysis module 2120, a data fusion module 2130, and a graph generation module 2140, wherein:
  • the data processing module 2110 is configured to perform first preprocessing on the structured data in the internal art data source and the external art data source to generate first structured data;
  • the data analysis module 2120 is configured to perform second preprocessing on the unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data;
  • the data fusion module 2130 is configured to perform fusion processing on the first structured data and the second structured data to generate fused art data, wherein the fused art data includes art entities and art relationships corresponding to the art entities;
  • the graph generation module 2140 is configured to generate art triads according to the art entities and art relationships, and to generate the art domain knowledge graph according to the art triads.
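  • The work of the graph generation module can be sketched as turning fused entities and relationships into triads. The function name `build_triads`, the record layout, and the rule of dropping relations with an unknown endpoint are assumptions made for illustration only.

```python
def build_triads(entities, relations):
    """Turn entity attribute records and entity-to-entity relations into triads."""
    triads = []
    # Attribute triads: (entity, attribute name, attribute value).
    for name, attributes in entities.items():
        for key, value in attributes.items():
            triads.append((name, key, value))
    # Relation triads: keep only relations whose two ends are known entities.
    for head, relation, tail in relations:
        if head in entities and tail in entities:
            triads.append((head, relation, tail))
    return triads


# Hypothetical fused art data.
entities = {
    "Leonardo da Vinci": {"nationality": "Italy"},
    "Mona Lisa": {"category": "painting"},
}
relations = [
    ("Leonardo da Vinci", "created", "Mona Lisa"),
    ("Mona Lisa", "collected by", "Unknown Gallery"),  # dangling: dropped
]
triads = build_triads(entities, relations)
print(triads)
```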
  • although several modules or units of the apparatus 2100 for constructing the art domain knowledge graph are mentioned in the above detailed description, this division is not mandatory.
  • the features and functions of two or more modules or units described above may be embodied in one module or unit.
  • the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
  • the modules or units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more modules or units may be integrated into one module or unit.
  • the above-mentioned modules or units can be implemented in the form of hardware or software functional modules or units.
  • the specific hardware can be a CPU, a microprocessor, a GPU, an FPGA, or a single-chip microcomputer.
  • the example embodiments described here can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, USB flash drive, removable hard disk, etc.) or on a network, and which includes several instructions that cause a computing device (which may be a personal computer, a server, a mobile terminal, a network device, etc.) to execute the method according to the embodiments of the present disclosure.
  • an electronic device capable of implementing the above method.
  • the electronic device includes a processor and a memory for storing executable instructions of the processor; the processor is configured to execute the executable instructions to perform the above method for constructing a knowledge graph in the art field.
  • the electronic device 2200 according to this embodiment of the present invention will be described below with reference to FIG. 22.
  • the electronic device 2200 shown in FIG. 22 is only an example, and should not bring any limitation to the function and application scope of the embodiment of the present invention.
  • the electronic device 2200 is represented in the form of a general-purpose computing device.
  • the components of the electronic device 2200 may include, but are not limited to: the aforementioned at least one processing unit 2210, the aforementioned at least one storage unit 2220, a bus 2230 connecting different system components (including the storage unit 2220 and the processing unit 2210), and a display unit 2240.
  • the storage unit stores program code that can be executed by the processing unit 2210, so that the processing unit 2210 executes the steps of the various exemplary embodiments described in the "Exemplary Method" section of this specification.
  • the storage unit 2220 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 2221 and/or a cache storage unit 2222, and may further include a read-only storage unit (ROM) 2223.
  • the storage unit 2220 may also include a program/utility tool 2224 having a set of (at least one) program modules 2225.
  • the program modules 2225 include but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
  • the bus 2230 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
  • the electronic device 2200 may also communicate with one or more external devices 2400 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 2200, and/or with any device (such as a router or modem) that enables the electronic device 2200 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 2250.
  • the electronic device 2200 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through a network adapter 2260.
  • the network adapter 2260 communicates with other modules of the electronic device 2200 through the bus 2230. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 2200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • the exemplary embodiments described here can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, USB flash drive, removable hard disk, etc.) or on a network, and which includes several instructions that cause a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to execute the method according to the embodiments of the present disclosure.
  • a non-volatile computer-readable storage medium is also provided, on which a computer program capable of implementing the above method of this specification is stored.
  • various aspects of the present invention can also be implemented in the form of a program product that includes program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps according to the various exemplary embodiments of the present invention described in the above "Exemplary Method" section of this specification.
  • a program product 2300 for implementing the above method according to an embodiment of the present invention is described. It may take the form of a portable compact disc read-only memory (CD-ROM) containing program code, and may run on a terminal device such as a personal computer.
  • the program product of the present invention is not limited to this.
  • the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.
  • the program product can use any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
  • the program code used to perform the operations of the present invention can be written in any combination of one or more programming languages.
  • the programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • the remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, via the Internet using an Internet service provider).

Abstract

An art field knowledge graph construction method and apparatus, a storage medium, and an electronic device. The method comprises: performing first preprocessing on structured data in an internal art data source and an external art data source to generate first structured data (S110); performing second preprocessing on unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data (S120); performing fusion processing on the first structured data and the second structured data to generate fused art data (S130), wherein the fused art data comprises an art entity and an art relationship corresponding to the art entity; and generating an art triad according to the art entity and the art relationship, and generating an art domain knowledge graph according to the art triad (S140).

Description

Knowledge graph construction method and apparatus, storage medium, and electronic device
Cross-reference
This disclosure claims priority to Chinese patent application No. 202010066621.5, filed on January 20, 2020 and titled "Knowledge graph construction method and apparatus, storage medium, and electronic device", the entire content of which is incorporated herein by reference.
Technical field
The present disclosure relates to the technical field of knowledge graph construction, and in particular to a method for constructing an art domain knowledge graph, a device for constructing an art domain knowledge graph, a computer-readable storage medium, and an electronic device.
Background
A knowledge graph, also called a scientific knowledge graph, uses visualization technology to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the interconnections within it. It is a series of different graphs that show the development process and structural relationships of knowledge, and it provides a better way to organize, manage, and understand the massive amount of information on the Internet. The knowledge graph is also the prototype for building next-generation search engines, making search more semantic and intelligent. At present, knowledge graphs fall into two types: general knowledge graphs and domain knowledge graphs. A domain knowledge graph, also called an industry knowledge graph or a vertical knowledge graph, is usually oriented to a specific field and is equivalent to an industry knowledge base built on semantic technology. Because a domain knowledge graph is constructed from industry data, it has a stricter and richer data schema and places higher demands on the depth and accuracy of domain knowledge.
However, existing approaches to constructing domain knowledge graphs have significant shortcomings.
It should be noted that the information disclosed in the above background section is only intended to enhance understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to those of ordinary skill in the art.
Summary of the disclosure
The purpose of the present disclosure is to overcome the above shortcomings of the prior art and to provide a method for constructing an art domain knowledge graph, a device for constructing an art domain knowledge graph, a computer-readable storage medium, and an electronic device.
According to a first aspect of the present disclosure, a method for constructing an art domain knowledge graph is provided. The method includes: performing first preprocessing on structured data in an internal art data source and an external art data source to generate first structured data; performing second preprocessing on unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data; performing fusion processing on the first structured data and the second structured data to generate fused art data, wherein the fused art data includes art entities and art relationships corresponding to the art entities; and generating art triads according to the art entities and the art relationships, and generating the art domain knowledge graph according to the art triads.
In an exemplary embodiment of the present disclosure, performing the first preprocessing on the structured data in the internal art data source and the external art data source to generate the first structured data includes: performing data cleaning on the structured data in the internal art data source and the external art data source; performing a duplication check on the data cleaning result of the structured data to generate duplication check data; and generating a data dictionary and an error correction dictionary according to the duplication check data, and generating the first structured data based on the data dictionary.
In an exemplary embodiment of the present disclosure, performing data cleaning on the structured data in the internal art data source and the external art data source includes: performing single-value attribute determination processing on the structured data in the internal art data source and the external art data source to obtain single-valued structured data; acquiring the first structured entity and the second structured relationship in the single-valued structured data, and collecting the results of the single-value attribute determination processing into a multi-value data table; when the multi-value data table contains no multi-value data, taking the first structured entity and the second structured relationship as the data cleaning result; and when the multi-value data table contains multi-value data, obtaining a second structured entity and a second structured relationship according to the multi-value data table as the data cleaning result.
In an exemplary embodiment of the present disclosure, obtaining the second structured entity and the second structured relationship according to the multi-value data table as the data cleaning result includes: updating the data dictionary or the error correction dictionary according to the multi-value data table; and obtaining the second structured entity and the second structured relationship as the data cleaning result according to the update result of the data dictionary or the error correction dictionary.
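The cleaning steps above can be pictured with a minimal sketch, offered under stated assumptions rather than as the disclosed implementation. The set `SINGLE_VALUED`, the record layout `(entity, attribute, value)`, and the sample correction entry are hypothetical; the sketch only shows how conflicting values for a single-valued attribute end up in a multi-value table and how an updated error correction dictionary resolves them.

```python
from collections import defaultdict

# Assumed list of attributes that should carry exactly one value per entity.
SINGLE_VALUED = {"birth year", "nationality"}


def split_single_and_multi(records):
    """Split (entity, attribute, value) records into clean single values and
    a multi-value table of conflicts, for single-valued attributes only."""
    values = defaultdict(set)
    for entity, attribute, value in records:
        if attribute in SINGLE_VALUED:
            values[(entity, attribute)].add(value)
    single, multi = {}, {}
    for key, seen in values.items():
        if len(seen) == 1:
            single[key] = next(iter(seen))
        else:
            multi[key] = sorted(seen)  # conflict: goes to the multi-value table
    return single, multi


def apply_corrections(records, corrections):
    """Rewrite values through an error correction dictionary."""
    return [(e, a, corrections.get(v, v)) for e, a, v in records]


records = [
    ("Leonardo da Vinci", "nationality", "Italy"),
    ("Leonardo da Vinci", "nationality", "Italian"),
    ("Leonardo da Vinci", "birth year", "1452"),
]
single, multi = split_single_and_multi(records)
print(multi)

# After the multi-value table is reviewed, the dictionary is updated and the
# determination is repeated on the corrected records.
corrections = {"Italian": "Italy"}
single2, multi2 = split_single_and_multi(apply_corrections(records, corrections))
print(single2, multi2)
```

When the second pass leaves the multi-value table empty, the single-valued result can be used directly as the cleaning result.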
In an exemplary embodiment of the present disclosure, performing the duplication check on the data cleaning result of the structured data to generate the duplication check data includes: performing a duplication check of artwork entities on the data cleaning result of the original structured data to generate an artwork duplication check result; when the artwork duplication check result is "same", performing a duplication check of artist entities on the data cleaning result to generate an artist duplication check result; when the artist duplication check result is "same", performing a duplication check of creation-time entities on the data cleaning result to generate a creation-time duplication check result; when the creation-time duplication check result is "same", determining that the artwork entity is a duplicate artwork; and performing fusion processing on the duplicate artwork, and generating the duplication check data according to the fusion result that passes review.
In an exemplary embodiment of the present disclosure, the method further includes: when the artist duplication check result is "different" or the creation-time duplication check result is "different", determining that the artwork entity is a same-name artwork; and performing deduplication processing on the same-name artwork, and generating the duplication check data according to the deduplication result.
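The cascade of checks described in the two paragraphs above can be sketched as a small decision function. It assumes that each cleaned record is a dictionary with `title`, `artist`, and `year` keys and that each check is plain equality; both are illustrative assumptions, and the "distinct" label for records with different titles is added here for completeness.

```python
def classify_pair(a, b):
    """Classify two cleaned artwork records.

    Returns "duplicate" when title, artist, and creation time all match,
    "same-name" when only the title matches, and "distinct" otherwise.
    """
    if a["title"] != b["title"]:
        return "distinct"
    # Title matches: check the artist, then the creation time.
    if a["artist"] != b["artist"] or a["year"] != b["year"]:
        return "same-name"
    return "duplicate"


first = {"title": "Sunflowers", "artist": "Vincent van Gogh", "year": "1888"}
second = {"title": "Sunflowers", "artist": "Vincent van Gogh", "year": "1889"}

print(classify_pair(first, dict(first)))   # all three fields match
print(classify_pair(first, second))        # same title, different year
```

Duplicates would then go to fusion and review, while same-name artworks are kept apart and disambiguated.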
In an exemplary embodiment of the present disclosure, the first structured data includes target artwork data, target artist data, and target art institution data. Performing fusion processing on the first structured data and the second structured data to generate the fused art data includes: fusing the reference artist data in the second structured data with the target artist data to generate fused artist data; fusing the reference artwork data in the second structured data with the target artwork data to generate fused artwork data; and fusing the reference art institution data in the second structured data with the target art institution data to generate fused art institution data.
In an exemplary embodiment of the present disclosure, fusing the reference artist data in the second structured data with the target artist data to generate the fused artist data includes: performing vector conversion on the reference artist data in the second structured data and the target artist data according to a word vector model to obtain artist word vector sequences; calculating an artist similarity vector between the artist word vector sequences, and performing a weighted calculation according to a first weight of the artist similarity vector; obtaining an artist similarity according to the weighted calculation result, and determining whether the artist similarity is greater than a first threshold; and fusing the reference artist data and the target artist data whose artist similarity is greater than the first threshold to generate the fused artist data.
In an exemplary embodiment of the present disclosure, fusing the reference artwork data in the second structured data with the target artwork data to generate the fused artwork data includes: performing vector conversion on the reference artwork data in the second structured data and the target artwork data according to a word vector model to obtain artwork word vector sequences; calculating an artwork similarity vector between the artwork word vector sequences, and performing a weighted calculation according to a second weight of the artwork similarity vector; obtaining an artwork similarity according to the weighted calculation result, and determining whether the artwork similarity is greater than a second threshold; and fusing the reference artwork data and the target artwork data whose artwork similarity is greater than the second threshold to generate the fused artwork data.
In an exemplary embodiment of the present disclosure, fusing the reference art institution data in the second structured data with the target art institution data to generate the fused art institution data includes: performing vector conversion on the reference art institution data in the second structured data and the target art institution data according to a word vector model to obtain art institution word vector sequences; calculating an art institution similarity vector between the art institution word vector sequences, and performing a weighted calculation according to a third weight of the art institution similarity vector; obtaining an art institution similarity according to the weighted calculation result, and determining whether the art institution similarity is greater than a third threshold; and fusing the reference art institution data and the target art institution data whose art institution similarity is greater than the third threshold to generate the fused art institution data.
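The three fusion procedures above share one pattern: vectorize, compare field by field, weight, and threshold. The sketch below illustrates that pattern only. Character-bigram counts stand in for the word vector model, cosine similarity is assumed as the per-field measure, and the weights and threshold are hypothetical values; none of these choices is stated in the disclosure.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a word vector model: character-bigram counts."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def similarity(reference, target, weights):
    """Weighted sum over the per-field similarity vector."""
    sim_vector = {f: cosine(embed(reference[f]), embed(target[f])) for f in weights}
    return sum(weights[f] * sim_vector[f] for f in weights)


def should_fuse(reference, target, weights, threshold):
    """Fuse only when the weighted similarity exceeds the threshold."""
    return similarity(reference, target, weights) > threshold


weights = {"name": 0.7, "nationality": 0.3}   # hypothetical weights
threshold = 0.8                               # hypothetical threshold
target = {"name": "Leonardo da Vinci", "nationality": "Italy"}
reference = {"name": "Leonardo da Vinci", "nationality": "Italy"}
other = {"name": "Claude Monet", "nationality": "France"}

print(should_fuse(reference, target, weights, threshold))
print(should_fuse(other, target, weights, threshold))
```

The same function serves artists, artworks, and art institutions by changing the field weights and the threshold, which mirrors the first, second, and third weights and thresholds in the text.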
According to a second aspect of the present disclosure, a device for constructing an art domain knowledge graph is provided. The device includes: a data processing module configured to perform first preprocessing on structured data in an internal art data source and an external art data source to generate first structured data; a data analysis module configured to perform second preprocessing on unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data; a data fusion module configured to perform fusion processing on the first structured data and the second structured data to generate fused art data, wherein the fused art data includes art entities and art relationships corresponding to the art entities; and a graph generation module configured to generate art triads according to the art entities and the art relationships, and to generate the art domain knowledge graph according to the art triads.
According to a third aspect of the present disclosure, an electronic device is provided, including a processor and a memory, wherein the memory stores computer-readable instructions that, when executed by the processor, implement the method for constructing an art domain knowledge graph of any of the above exemplary embodiments.
According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the method for constructing an art domain knowledge graph in any of the above exemplary embodiments. It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit the present disclosure.
Description of the drawings
The drawings here are incorporated into and constitute a part of this specification, show embodiments consistent with the present disclosure, and together with the specification serve to explain the principles of the present disclosure. Obviously, the drawings described below are only some embodiments of the present disclosure; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 schematically shows a flowchart of a method for constructing an art domain knowledge graph in an exemplary embodiment of the present disclosure;
FIG. 2 schematically shows a flowchart of a method for generating first structured data in an exemplary embodiment of the present disclosure;
FIG. 3 schematically shows a flowchart of a method for data cleaning in an exemplary embodiment of the present disclosure;
FIG. 4 schematically shows a flowchart of a method for obtaining a data cleaning result in an exemplary embodiment of the present disclosure;
FIG. 5 schematically shows a flowchart of a method for generating duplication check data in an exemplary embodiment of the present disclosure;
FIG. 6 schematically shows a flowchart of another method for generating duplication check data in an exemplary embodiment of the present disclosure;
FIG. 7 schematically shows a flowchart of a method for generating fused art data in an exemplary embodiment of the present disclosure;
FIG. 8 schematically shows a flowchart of a method for obtaining fused artist data in an exemplary embodiment of the present disclosure;
FIG. 9 schematically shows a flowchart of a method for obtaining fused artwork data in an exemplary embodiment of the present disclosure;
FIG. 10 schematically shows a flowchart of a method for obtaining fused art institution data in an exemplary embodiment of the present disclosure;
FIG. 11 schematically shows a flowchart of a method for constructing an art domain knowledge graph in an application scenario in an exemplary embodiment of the present disclosure;
FIG. 12 schematically shows a flowchart of a method for the first preprocessing of data in an application scenario in an exemplary embodiment of the present disclosure;
FIG. 13 schematically shows a flowchart of a method for data cleaning in an application scenario in an exemplary embodiment of the present disclosure;
FIG. 14 schematically shows a flowchart of a method for handling duplicate paintings in an application scenario in an exemplary embodiment of the present disclosure;
FIG. 15 schematically shows a flowchart of a method for generating fused art data in an application scenario in an exemplary embodiment of the present disclosure;
FIG. 16 schematically shows an interface diagram of a visualized art domain knowledge graph in an application scenario in an exemplary embodiment of the present disclosure;
FIG. 17 schematically shows a scene in which the art domain knowledge graph is applied in the art encyclopedia in an exemplary embodiment of the present disclosure;
FIG. 18 schematically shows a scene in which the art domain knowledge graph is applied in the knowledge graph in an exemplary embodiment of the present disclosure;
FIG. 19 schematically shows a scene in which the art domain knowledge graph is applied in art knowledge question answering in an exemplary embodiment of the present disclosure;
FIG. 20 schematically shows a scene in which the art domain knowledge graph is applied in an overview of art knowledge in an exemplary embodiment of the present disclosure;
FIG. 21 schematically shows a structural diagram of a device for constructing an art domain knowledge graph in an exemplary embodiment of the present disclosure;
FIG. 22 schematically shows an electronic device for implementing the method for constructing an art domain knowledge graph in an exemplary embodiment of the present disclosure;
FIG. 23 schematically shows a computer-readable storage medium for implementing the method for constructing an art domain knowledge graph in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, many specific details are provided to give a thorough understanding of the embodiments of the present disclosure. However, those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced with one or more of the specific details omitted, or with other methods, components, apparatuses, steps, and the like. In other instances, well-known technical solutions are not shown or described in detail, so as not to distract from and obscure aspects of the present disclosure.
In this specification, the terms "a", "an", "the", and "said" indicate the presence of one or more elements, components, and the like; the terms "including" and "having" are open-ended and mean that additional elements, components, and the like may be present besides those listed; the terms "first", "second", and the like are used only as labels and do not limit the number of their objects.
In addition, the drawings are only schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the figures denote the same or similar parts, and repeated description of them is therefore omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities.
Existing approaches to constructing domain knowledge graphs have corresponding shortcomings. Specifically, methods for constructing English-language domain knowledge graphs are not fully applicable to constructing Chinese-language domain knowledge graphs. In addition, existing construction methods struggle to balance the scale and the accuracy of the acquired domain knowledge, and have difficulty fusing domain knowledge obtained from multiple data sources.
In view of the problems in the related art, the present disclosure proposes a method for constructing an art domain knowledge graph. FIG. 1 shows a flowchart of the method. As shown in FIG. 1, the method for constructing an art domain knowledge graph includes at least the following steps:
Step S110. Perform first preprocessing on the structured data in the internal art data source and the external art data source to generate first structured data.
Step S120. Perform second preprocessing on the unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data.
Step S130. Fuse the first structured data with the second structured data to generate fused art data, where the fused art data includes art entities and the art relationships corresponding to the art entities.
Step S140. Generate art triples from the art entities and art relationships, and generate the art domain knowledge graph from the art triples.
In the exemplary embodiments of the present disclosure, on the one hand, data from external data sources is fused with already-normalized data, which greatly increases the scale of entity knowledge in the art domain and improves the accuracy of art domain knowledge acquisition; on the other hand, generating the art domain knowledge graph from art triples helps improve the relatedness of entities in the knowledge graph and the comprehensiveness of knowledge graph search, so that query intent is understood more accurately and retrieval accuracy is improved.
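As an illustration of step S140, the following minimal sketch turns fused records into (head, relation, tail) triples and indexes them as a simple adjacency map. The record layout, the relation names (`created_by`, `type`, `nationality`), and the helper names `build_triples` and `build_graph` are hypothetical and are not taken from the disclosure, which does not prescribe a storage format.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str      # art entity
    relation: str  # art relationship
    tail: str      # related entity or attribute value


def build_triples(fused_records):
    """Flatten {entity: {relation: value-or-list}} into a list of triples."""
    triples = []
    for entity, relations in fused_records.items():
        for relation, values in relations.items():
            if not isinstance(values, (list, tuple)):
                values = [values]
            for value in values:
                triples.append(Triple(entity, relation, value))
    return triples


def build_graph(triples):
    """Index triples as head -> [(relation, tail), ...]."""
    graph = {}
    for t in triples:
        graph.setdefault(t.head, []).append((t.relation, t.tail))
    return graph


# Hypothetical fused art data (illustrative values only).
fused = {
    "Mona Lisa": {"created_by": "Leonardo da Vinci", "type": "oil painting"},
    "Leonardo da Vinci": {"nationality": "Italy"},
}
triples = build_triples(fused)
graph = build_graph(triples)
```

A production system would more likely load the triples into a graph database; the adjacency map here only shows how entities and relationships become graph edges.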
Each step of the method for constructing the art domain knowledge graph is described in detail below.
In step S110, first preprocessing is performed on the structured data in the internal art data source and the external art data source to generate the first structured data.
In the exemplary embodiments of the present disclosure, the internal art data source and the external art data source may be distinguished by where the art data is obtained. For example, the data in the internal art data source may mainly be structured data that has been processed manually, while the data in the external art data source may be crawled from public data on the Internet and is mainly semi-structured. However, the internal art data source may also contain unstructured and semi-structured data, and the external art data source may also contain structured and unstructured data. Therefore, the structured data in both the internal and the external data sources can be obtained.
After the structured data is loaded, first preprocessing can be performed on it to obtain the first structured data. In an optional embodiment, FIG. 2 shows a flowchart of a method for generating the first structured data. As shown in FIG. 2, the method includes at least the following steps. In step S210, data cleaning is performed on the structured data in the internal art data source and the external art data source.
Specifically, in an optional embodiment, FIG. 3 shows a flowchart of a method for cleaning structured data. As shown in FIG. 3, the method includes at least the following steps. In step S310, single-value attribute determination is performed on the structured data in the internal art data source and the external art data source to obtain single-valued structured data. A single-valued attribute is an attribute for which a data item has only one specific value. For example, single-value determination may check whether a painting has only one artist. When the artist field of a painting holds two values, 梵高 and 梵谷 (two Chinese renderings of "Van Gogh"), no single-valued structured data is obtained; when the artist of a painting is only 梵高, the single-valued structured data for the artist of that painting is 梵高. In addition, single-valued structured data may also include the painting, creation time, genre, nationality, and so on, which is not specifically limited in this exemplary embodiment.
In step S320, the first structured entity and the first structured relationship in the single-valued structured data are obtained, and the results of the single-value attribute determination are collected into a multi-valued data table. The corresponding first structured entity and first structured relationship can be extracted from the single-valued structured data already obtained. For example, the first structured entity may include an artist entity, an artwork entity, a creation time entity, a genre entity, a nationality entity, and so on; for an artist entity, the corresponding structured relationships may include its relationships to the artwork entities it created, the creation times of those artwork entities, the genre it formed, and the nationality to which it belongs. In addition, the structured data that failed the single-value attribute determination can be collected to obtain the multi-valued data table.
In step S330, when the multi-valued data table contains no multi-valued data, the first structured entity and the first structured relationship are taken as the data cleaning result. When no multi-valued data was collected, or after all multi-valued data in the multi-valued data table has been updated, a further review can be performed. When the review is manual and passes, the obtained first structured entity and first structured relationship are directly determined to be the data cleaning result. Alternatively, the review may be automated according to preset rules, to streamline the review process, save labor costs, and improve review accuracy.
It should be noted that the review step mentioned in the embodiments of the present invention may be an automatic review performed according to user-defined rules or a direct manual review. Manual review and automatic review are interchangeable.
In step S340, when the multi-valued data table contains multi-valued data, a second structured entity and a second structured relationship are obtained from the multi-valued data table as the data cleaning result.
Further, in an optional embodiment, FIG. 4 shows a flowchart of a method for obtaining the data cleaning result. As shown in FIG. 4, the method includes at least the following steps. In step S410, the data dictionary or the error correction dictionary is updated according to the multi-valued data table. For example, the multi-valued data table for the artist of a painting may contain the two values 梵高 and 梵谷. If further manual review finds that 梵谷 is an alias of 梵高, 梵谷 can be replaced with 梵高 and a corresponding error correction dictionary entry is generated for the update. Alternatively, the review may be automated according to preset rules, to streamline the review process, save labor costs, and improve review accuracy. The error correction dictionary may be a database that stores information such as the data source, relationships to other data, purpose, and format associated with operations such as adding or modifying data; correspondingly, the data dictionary may be a database that stores the same kinds of information (data source, relationships to other data, purpose, and format) for normative data such as formats and contents.
In step S420, the second structured entity and the second structured relationship are obtained as the data cleaning result according to the updated data dictionary or error correction dictionary. While the multi-valued data table is not empty, the update results of the data dictionary and the error correction dictionary can be checked further until all multi-valued data in the multi-valued data table has been updated; the resulting second structured entity and second structured relationship are taken as the data cleaning result.
In this exemplary embodiment, generating the data cleaning result from the structured entities and structured relationships in the multi-valued data table facilitates updating knowledge in the art domain, and the update is simple and highly accurate.
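The single-value determination and the multi-valued data table of steps S310 to S340 can be sketched as follows. This is a minimal illustration under stated assumptions: records are `(entity, attribute, value)` tuples, every attribute in the input is expected to be single-valued, and the error correction dictionary is a plain alias-to-canonical mapping. The function name `check_single_valued` and the sample record are hypothetical.

```python
from collections import defaultdict


def check_single_valued(records, correction_dict=None):
    """Split records into single-valued data and a multi-valued data table.

    records: iterable of (entity, attribute, value) tuples.
    correction_dict: assumed alias -> canonical value mapping, applied first.
    """
    correction_dict = correction_dict or {}
    observed = defaultdict(set)
    for entity, attribute, value in records:
        observed[(entity, attribute)].add(correction_dict.get(value, value))
    # Exactly one distinct value: passes the single-value determination.
    single = {k: next(iter(v)) for k, v in observed.items() if len(v) == 1}
    # More than one distinct value: goes to the multi-valued table for review.
    multi = {k: sorted(v) for k, v in observed.items() if len(v) > 1}
    return single, multi


# Illustrative records: the same painting with two renderings of the artist.
records = [
    ("The Starry Night", "artist", "梵高"),
    ("The Starry Night", "artist", "梵谷"),
    ("The Starry Night", "creation_time", "1889"),
]
single, multi = check_single_valued(records)

# After review decides that 梵谷 is an alias of 梵高, the alias is recorded
# in the error correction dictionary and the check is rerun.
corrections = {"梵谷": "梵高"}
single_fixed, multi_fixed = check_single_valued(records, corrections)
```

Once the multi-valued table is empty after the rerun, the remaining single-valued entries correspond to the data cleaning result described above.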
In step S220, a duplicate check is performed on the data cleaning results of the structured data in the internal art data source and the external art data source to generate duplicate-check data. In an optional embodiment, the structured entities include artwork entities, artist entities, and creation time entities. FIG. 5 shows a flowchart of a method for generating duplicate-check data. As shown in FIG. 5, the method includes at least the following steps. In step S510, a duplicate check of artwork entities is performed on the data cleaning results of the structured data in the internal art data source and the external art data source to generate an artwork duplicate-check result. The check may obtain the names of the artworks and decide, according to whether the names match, the corresponding artwork duplicate-check result. For example, when two paintings that are both named "Mona Lisa" are checked, the duplicate-check result is "same"; when a painting named "Mona Lisa" and a painting named "Girl with a Pearl Earring" are checked, the duplicate-check result is "different". It should be noted that the duplicate check determines whether two entities are substantially the same. For example, if the name of an artwork's author appears in both a full form and an abbreviated form, both refer to the same author, and the duplicate-check result should be "same".
In step S520, when the artwork duplicate-check result is "same", a duplicate check of artist entities is performed on the data cleaning result to generate an artist duplicate-check result. For example, when two paintings are both named "Mona Lisa", the artwork duplicate-check result is "same". However, one of the paintings may have been produced by a later artist reworking the original, the two may come from different museums, or they may be different paintings for other reasons, so a further determination can be made. Specifically, a duplicate check can be performed on the artist entity that created each artwork.
In step S530, when the artist duplicate-check result is "same", a duplicate check of creation time entities is performed on the data cleaning result to generate a creation time duplicate-check result. For example, when two paintings named "Mona Lisa" also have the same artist, the artist duplicate-check result is "same". A duplicate check can then be performed on the creation time.
In step S540, when the creation time duplicate-check result is "same", the artwork entity is determined to be a duplicate artwork. For example, when two paintings named "Mona Lisa" also have the same artist and the same creation time, the creation time duplicate-check result is "same". The artwork can therefore be determined to be a duplicate artwork based on the duplicate-check results in these three dimensions.
In step S550, the duplicate artworks are fused, and duplicate-check data is generated from the fusion result that passes review. When two artwork entities are found to be duplicates, they can be fused and the fusion result can be reviewed manually. Manual review can further determine whether the duplicate-check results in other dimensions are also the same; when the manual review passes, duplicate-check data for the artwork can be generated. The data dictionary can also be updated according to the duplicate-check data.
In this exemplary embodiment, the three-dimension determination allows duplicate paintings to be fused at the entity level and the data dictionary to be updated. This refines the data dictionary more precisely, keeps its knowledge up to date, and reduces the workload of repeatedly judging the same data dictionary entry.
In addition to generating duplicate-check data from duplicate artworks, duplicate-check data can also be generated from same-name artworks. In an optional embodiment, FIG. 6 shows a flowchart of another method for generating duplicate-check data. As shown in FIG. 6, the method includes at least the following steps. In step S610, when the artist duplicate-check result is "different" or the creation time duplicate-check result is "different", the artwork entity is determined to be a same-name artwork. When the artwork duplicate-check result is "same", it can be further determined whether the artist duplicate-check result and the creation time duplicate-check result are also "same". For example, when two paintings are both named "Self-Portrait" and are determined to have been created by two different painters, the artist duplicate-check result is "different". The two "Self-Portrait" paintings are therefore same-name paintings. In addition, when the artwork duplicate-check result is "same", a duplicate check can also be performed on the creation time to determine whether the artwork is a same-name artwork.
In step S620, deduplication is performed on the same-name artworks, and duplicate-check data is generated according to the deduplication result. For example, when two "Self-Portrait" paintings are same-name paintings, deduplication can be performed; that is, the two paintings are recorded as two separate data dictionary entries. To ensure that the data dictionary is updated accurately, a manual review can be performed, and only same-name artworks that pass the review generate duplicate-check data and update the data dictionary. Alternatively, the review may be automated according to preset rules, to streamline the review process, save labor costs, and improve review accuracy.
In this exemplary embodiment, artworks with the same name are checked in the two other dimensions, and same-name paintings are deduplicated at the entity level to update the data dictionary. This avoids the inaccurate knowledge determination and stale data dictionary that would result from using too few dimensions, and ensures the accuracy of the data dictionary.
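The cascade of checks in steps S510 to S540 and S610 (name, then artist, then creation time) can be sketched as a small classifier. This is a minimal sketch assuming exact string comparison after cleaning; the `Artwork` record, the function `classify_pair`, the result labels, and the sample creation times are hypothetical, and the review step of S550 and S620 is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artwork:
    name: str
    artist: str
    creation_time: str


def classify_pair(a: Artwork, b: Artwork) -> str:
    """Classify two cleaned artwork records.

    Returns "different" when the names differ, "same_name" when the names
    match but the artist or creation time differs, and "duplicate" when
    all three dimensions match.
    """
    if a.name != b.name:
        return "different"
    if a.artist != b.artist:
        return "same_name"
    if a.creation_time != b.creation_time:
        return "same_name"
    return "duplicate"


# Illustrative records; the creation times are placeholder values.
mona_a = Artwork("Mona Lisa", "Leonardo da Vinci", "1503")
mona_b = Artwork("Mona Lisa", "Leonardo da Vinci", "1503")
portrait_a = Artwork("Self-Portrait", "Vincent van Gogh", "1889")
portrait_b = Artwork("Self-Portrait", "Rembrandt", "1660")
pearl = Artwork("Girl with a Pearl Earring", "Johannes Vermeer", "1665")

results = [
    classify_pair(mona_a, mona_b),
    classify_pair(portrait_a, portrait_b),
    classify_pair(mona_a, pearl),
]
```

Pairs classified as "duplicate" would be candidates for fusion, and pairs classified as "same_name" would be kept as separate entries, each subject to the review described above.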
In step S230, a data dictionary and an error correction dictionary are generated according to the duplicate-check data, and the first structured data is obtained based on the data dictionary. Specifically, besides the data dictionary, the first structured data also includes some attribute data that is not contained in the data dictionary; this attribute data also belongs to the first structured data. After the duplicate-check data is generated, artwork names or other information may still be incorrect, so a further manual review can be performed. After the manual review passes, the corresponding data dictionary or error correction dictionary is generated. Alternatively, the review may be automated according to preset rules, to streamline the review process, save labor costs, and improve review accuracy. Before the target art data is generated from the data dictionary or the error correction dictionary, formatting questions may remain, such as whether the separator within a foreign name is "·" or "-", or whether the separator within a creation time is "." or "-". The conformity of these data can therefore also be checked: a data dictionary or error correction dictionary that conforms to the storage specification of the art domain database is taken as the target art data, and one that does not conform can be corrected and then also taken as the target art data.
In this exemplary embodiment, the first preprocessing of the structured data generates the corresponding target art data. The processing is simple and accurate, reduces manual workload, and is highly practical.
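The separator check described for step S230 can be illustrated with two small normalizers. The disclosure does not say which separator the storage specification requires, so this sketch assumes "·" for foreign names and "-" for creation times; the function names are hypothetical.

```python
import re


def normalize_name(name: str) -> str:
    """Rewrite "-" or "·" between name parts as "·" (assumed canonical form)."""
    return re.sub(r"\s*[-·]\s*", "·", name.strip())


def normalize_creation_time(value: str) -> str:
    """Rewrite "." or "-" between date parts as "-" (assumed canonical form)."""
    return re.sub(r"\s*[.\-]\s*", "-", value.strip())


names = [normalize_name("列奥纳多-达-芬奇"), normalize_name("列奥纳多·达·芬奇")]
times = [normalize_creation_time("1503.10"), normalize_creation_time("1503-10")]
```

Note that this simple rule would also rewrite names that legitimately contain a hyphen, so a real pipeline would likely pair it with the review step or with a list of exceptions.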
In step S120, second preprocessing is performed on the unstructured data and semi-structured data in the internal art data source and the external art data source to obtain the second structured data.
The second preprocessing may be a data cleaning step. Specifically, it may include art data consistency checks, missing value handling, invalid value handling, duplicate data detection, and so on; it may also include cleaning work that is configured, or implemented through embedded custom code, according to the processing requirements for the art data.
In the exemplary embodiments of the present disclosure, because the data in the internal art data source may be only about 60% complete, data from the external art data source can be fused in to fill the gaps, expanding the data in the internal art data source and improving its completeness. Specifically, semi-structured data can be crawled from public data on the Internet to obtain second structured data that can be used for filling.
Semi-structured data may be parsed using preset rules and preset regular expressions. For example, from the sentence "Da Vinci's works include the Mona Lisa, The Last Supper, the Virgin of the Rocks, etc.", the rule "[author]'s works include [work]" can be constructed; from "Da Vinci's masterpiece the Mona Lisa reflects his superb artistic attainment", the rule "[author]'s masterpiece [work]" can be constructed; from "The Mona Lisa is an oil painting created by the Italian painter Da Vinci", the rule "[work] is created by [author]" can be constructed; from "The Mona Lisa represents Da Vinci's highest artistic achievement", the rule "[work] represents [author]" can be constructed; and so on. These manually constructed preset rules can be used to parse semi-structured data into the target structured data. In addition, a regular expression can be constructed from a record such as "Da Vinci, born April 1452, Italian, representative paintings include the Mona Lisa, The Last Supper, the Virgin of the Rocks, etc.", such that the content before the second comma fills the date of birth, the content between the second and third commas fills the nationality, and the content after the third comma fills the representative works. The second preprocessing can therefore also be performed on semi-structured data with the constructed regular expressions to obtain the second structured data.
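The rule-based and regular-expression parsing described above can be sketched with Python's `re` module. The patterns below match the Chinese wording of the example sentences in the original application (with full-width commas and "、" as the list separator), because such rules are language-specific. The pattern names, the relation label `created`, and the output field names are assumptions made for this illustration.

```python
import re

# Rule "[author]的作品有[work]": "[author]'s works include [work]".
WORKS_RULE = re.compile(r"^(?P<author>.+?)的作品有(?P<works>.+?)等?$")

# Comma-separated profile: name, birth, nationality, representative works.
PROFILE_RULE = re.compile(
    r"^(?P<name>[^,]+),(?P<birth>[^,]+),(?P<nationality>[^,]+),"
    r"代表画作有(?P<works>.+?)等?$"
)


def split_works(text):
    """Split a '、'-separated list of work titles."""
    return [w for w in text.split("、") if w]


def parse_works_sentence(sentence):
    """Return (author, 'created', work) tuples, or [] if the rule does not match."""
    m = WORKS_RULE.match(sentence)
    if m is None:
        return []
    return [(m.group("author"), "created", w) for w in split_works(m.group("works"))]


def parse_profile(sentence):
    """Return a structured record, or None if the pattern does not match."""
    m = PROFILE_RULE.match(sentence)
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "birth": m.group("birth"),
        "nationality": m.group("nationality"),
        "representative_works": split_works(m.group("works")),
    }


# "Da Vinci's works include the Mona Lisa, The Last Supper, the Virgin of the Rocks, etc."
triples = parse_works_sentence("达芬奇的作品有蒙娜丽莎、最后的晚餐、岩间圣母等")
# "Da Vinci, born April 1452, Italian, representative paintings include ..."
profile = parse_profile("达芬奇,1452年4月生人,意大利人,代表画作有蒙娜丽莎、最后的晚餐、岩间圣母等")
```

Each additional sentence template would get its own pattern; sentences that match no pattern are simply left unparsed in this sketch.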
In step S130, the first structured data and the second structured data are fused to generate fused art data, where the fused art data includes art entities and the art relationships corresponding to the art entities.
In an exemplary embodiment of the present disclosure, the first structured data includes target artwork data, target artist data, and target art institution data. FIG. 7 shows a flowchart of a method for generating fused art data. As shown in FIG. 7, the method includes at least the following steps:
In step S710, the reference artist data in the second structured data is fused with the target artist data to generate fused artist data.
In an optional embodiment, FIG. 8 shows a schematic flowchart of a method for obtaining the fused artist data. As shown in FIG. 8, the method includes at least the following steps. In step S810, the reference artist data and the target artist data in the second structured data are converted into vectors according to a word vector model to obtain artist word vector sequences. The word vector model may be a Word2Vec model. Word2Vec is a tool released by Google in 2013 and can be regarded as an important application of deep learning in natural language processing. Although Word2Vec uses a neural network of only three layers, it achieves very good results. With the Word2Vec model, each segmented word can be represented as a word vector, so that text is digitized in a form a computer can process and the resulting vectors reflect semantic information. To exploit this semantic information, the Word2Vec model has two specific implementations: the Continuous Bag-of-Words (CBOW) model and the Skip-gram model. The CBOW model predicts the input word from the given context, whereas the Skip-gram model predicts the context from a given input word. In the Skip-gram model, the first part builds the model and the second part obtains the embedded word vectors from it. Preferably, the Skip-gram model is used for the vector conversion of the word sequences. With the Skip-gram model, each word is uniquely represented in the word space by a 300-dimensional real-valued vector, so the reference artist data and the target artist data are each represented by a matrix whose size is the number of words in the sequence by 300, which gives the corresponding artist word vector sequences.
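The difference between the two Word2Vec training schemes described above can be illustrated with a minimal sketch that only builds the training examples, not the neural network. The function names and the `window` parameter are illustrative assumptions and do not come from the disclosure.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram view: each center word is paired with every word in its window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))  # (input word, context word to predict)
    return pairs


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW view: the surrounding context words are used to predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))  # (context words, word to predict)
    return examples
```

In practice a library implementation of Word2Vec would be trained on such examples and each token would then be looked up as a 300-dimensional vector, giving the sequence-length-by-300 matrix mentioned in the text.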
In step S820, the artist similarity vectors between the artist word vector sequences are calculated, and a weighted calculation is performed according to the first weights of the artist similarity vectors. Artist similarity may have several dimensions, such as the artist's nationality and the artist's genre. Therefore, the artist similarity vector between the word vector sequences of each dimension can be calculated first.
For example, two artist word vector sequences of the same dimension may differ in length, so the two sequences can be used as the input of a Siamese long short-term memory (Siamese LSTM) network model, which accommodates variable-length sequence pairs. The Siamese LSTM network model consists of two identical neural network models that share weights, which is what makes them Siamese. The reference artist word vector sequence and the target artist word vector sequence are fed into the two neural network models respectively, and the artist similarity component between the two inputs is evaluated by calculating the distance between the two vector sequences. This distance calculation relies mainly on the Manhattan distance. The artist similarity vector can also be calculated by other algorithms, which is not specifically limited in this exemplary embodiment.
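The distance step can be sketched as follows. The sketch assumes that each branch of the Siamese LSTM has already reduced its variable-length word vector sequence to a fixed-length final hidden state. Mapping the Manhattan distance to a score in (0, 1] with exp(-d) is a common choice for Siamese LSTM models and is an assumption here, since the text only states that the Manhattan distance is used.

```python
import math
from typing import Sequence


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def manhattan_similarity(h_ref: Sequence[float], h_tgt: Sequence[float]) -> float:
    """Assumed mapping exp(-d): 1.0 for identical states, approaching 0 as they diverge."""
    return math.exp(-manhattan_distance(h_ref, h_tgt))
```

Because both branches share weights, identical inputs always yield identical hidden states and therefore a similarity of 1.0 under this mapping.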
After the artist similarity components of each dimension are obtained, a weighted calculation can be performed on them according to the preset first weights to obtain a weighted calculation result. For example, the dimensions of the artist similarity components may include genre and nationality. Further, the first weight for genre can be set to 0.4 and the first weight for nationality to 0.6, so that the genre component is multiplied by 0.4, the nationality component is multiplied by 0.6, and the two products are summed to obtain the calculation result.
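The weighted calculation can be sketched as a weighted sum over named dimensions. The weights 0.4 and 0.6 come from the example in the text; the component values 0.5 and 1.0, the dictionary keys, and the function name are hypothetical.

```python
from typing import Dict


def weighted_similarity(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Multiply each per-dimension similarity component by its weight and sum the products."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing similarity components: {sorted(missing)}")
    return sum(weights[dim] * components[dim] for dim in weights)


# First weights from the example in the text.
FIRST_WEIGHTS = {"genre": 0.4, "nationality": 0.6}

# Hypothetical component values for one reference/target artist pair.
score = weighted_similarity({"genre": 0.5, "nationality": 1.0}, FIRST_WEIGHTS)
# score is 0.4 * 0.5 + 0.6 * 1.0, that is, approximately 0.8
```

The same function applies unchanged to the artwork and art institution cases by passing the second or third weights.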
In step S830, the artist similarity is obtained from the weighted calculation result, and it is determined whether the artist similarity is greater than a first threshold. The artist similarity is the result of the weighted calculation over the artist similarity components of each dimension. The first threshold can therefore be set according to the overall range of the artist similarity, and the artist similarity is compared with it. For example, the first threshold can be set to 1. When the weighted calculation result is 0.8, the artist similarity is determined to be less than the first threshold because 0.8 is less than 1; when the weighted calculation result is 1.2, the artist similarity is determined to be greater than the first threshold because 1.2 is greater than 1.
In step S840, the reference artist data and the target artist data whose artist similarity is greater than the first threshold are fused to generate the fused artist data. When the artist similarity is determined to be greater than the first threshold, the reference artist data and the target artist data can be determined to refer to the same artist, so the two are fused to obtain the fused artist data.
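Steps S830 and S840 can be sketched together as a threshold test followed by a record merge. The disclosure does not specify how two records are fused, so the merge rule below (non-empty reference values take precedence and the target record fills the gaps) is purely an assumption, as are the field names in the usage example.

```python
from typing import Dict, Optional


def fuse_if_same(reference: Dict[str, str], target: Dict[str, str],
                 similarity: float, threshold: float) -> Optional[Dict[str, str]]:
    """Return a fused record when similarity exceeds the threshold, otherwise None."""
    if similarity <= threshold:
        return None  # treated as two different entities, nothing is fused
    fused = dict(target)
    # Assumed rule: non-empty reference values override the target values.
    fused.update({k: v for k, v in reference.items() if v not in (None, "")})
    return fused


# Hypothetical records; the threshold 1 and the scores 1.2 and 0.8 follow the text's example.
ref = {"name": "Artist A", "nationality": "French", "genre": ""}
tgt = {"name": "A. Artist", "genre": "Impressionism", "born": "1840"}
fused = fuse_if_same(ref, tgt, similarity=1.2, threshold=1.0)
not_fused = fuse_if_same(ref, tgt, similarity=0.8, threshold=1.0)
```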
In this exemplary embodiment, by calculating the similarity vectors of each dimension for the reference artist data and the target artist data, the reference artist data and the target artist data that satisfy the preset condition can be fused to obtain the fused artist data. The calculation is simple and the fusion accuracy is high, which improves the accuracy of the acquired artist data.
In step S720, the reference artwork data and the target artwork data in the second structured data are fused to generate fused artwork data.
In an optional embodiment, FIG. 9 shows a schematic flowchart of a method for obtaining the fused artwork data. As shown in FIG. 9, the method includes at least the following steps. In step S910, the reference artwork data and the target artwork data in the second structured data are converted into vectors according to a word vector model to obtain artwork word vector sequences. The word vector model may be a Word2Vec model. With the Word2Vec model, each segmented word can be represented as a word vector, so that text is digitized in a form a computer can process and the resulting vectors reflect semantic information. To exploit this semantic information, the Word2Vec model has two specific implementations: the Continuous Bag-of-Words (CBOW) model and the Skip-gram model. Preferably, the Skip-gram model is used for the vector conversion of the word sequences. With the Skip-gram model, each word is uniquely represented in the word space by a 300-dimensional real-valued vector, so the reference artwork data and the target artwork data are each represented by a matrix whose size is the number of words in the sequence by 300, which gives the corresponding artwork word vector sequences.
In step S920, the artwork similarity vectors between the artwork word vector sequences are calculated, and a weighted calculation is performed according to the second weights of the artwork similarity vectors. Artwork similarity may have several dimensions, such as the genre to which the artwork belongs, the creation time of the artwork, and the art institution that holds the artwork. Therefore, the artwork similarity vector between the word vector sequences of each dimension of the artwork can be calculated first.
For example, two artwork word vector sequences of the same dimension may differ in length, so the two sequences can be used as the input of a Siamese long short-term memory (Siamese LSTM) network model, which accommodates variable-length sequence pairs. The reference artwork word vector sequence and the target artwork word vector sequence are fed into the two neural network models respectively, and the artwork similarity component between the two inputs is evaluated by calculating the distance between the two vector sequences. This distance calculation relies mainly on the Manhattan distance. The artwork similarity vector can also be calculated by other algorithms, which is not specifically limited in this exemplary embodiment.
After the artwork similarity components of each dimension are obtained, a weighted calculation can be performed on them according to the preset second weights to obtain a weighted calculation result. For example, the second weight for genre can be set to 0.4, the second weight for creation time to 0.3, and the second weight for the holding institution to 0.3. Further, the genre component is multiplied by 0.4, the creation time component is multiplied by 0.3, the art institution component is multiplied by 0.3, and the products are summed to obtain the calculation result.
In step S930, the artwork similarity is obtained from the weighted calculation result, and it is determined whether the artwork similarity is greater than a second threshold. The artwork similarity is the result of the weighted calculation over the artwork similarity components of each dimension. The second threshold can therefore be set according to the overall range of the artwork similarity, and the artwork similarity is compared with it. For example, the second threshold can be set to 2. When the weighted calculation result is 0.8, the artwork similarity is determined to be less than the second threshold because 0.8 is less than 2; when the weighted calculation result is 3.2, the artwork similarity is determined to be greater than the second threshold because 3.2 is greater than 2.
In step S940, the reference artwork data and the target artwork data whose artwork similarity is greater than the second threshold are fused to generate the fused artwork data. When the artwork similarity is determined to be greater than the second threshold, the reference artwork data and the target artwork data can be determined to refer to the same artwork, so the two are fused to obtain the fused artwork data.
In this exemplary embodiment, by calculating the similarity vectors of each dimension for the reference artwork data and the target artwork data, the reference artwork data and the target artwork data that satisfy the preset condition can be fused to obtain the fused artwork data. The calculation is simple and the fusion accuracy is high, which improves the accuracy of the acquired artwork data.
In step S730, the reference art institution data and the target art institution data in the second structured data are fused to generate fused art institution data.
In an optional embodiment, FIG. 10 shows a schematic flowchart of a method for obtaining the fused art institution data. As shown in FIG. 10, the method includes at least the following steps. In step S1010, the reference art institution data and the target art institution data in the second structured data are converted into vectors according to a word vector model to obtain art institution word vector sequences. The word vector model may be a Word2Vec model. With the Word2Vec model, each segmented word can be represented as a word vector, so that text is digitized in a form a computer can process and the resulting vectors reflect semantic information. To exploit this semantic information, the Word2Vec model has two specific implementations: the Continuous Bag-of-Words (CBOW) model and the Skip-gram model. Preferably, the Skip-gram model is used for the vector conversion of the word sequences. With the Skip-gram model, each word is uniquely represented in the word space by a 300-dimensional real-valued vector, so the reference art institution data and the target art institution data are each represented by a matrix whose size is the number of words in the sequence by 300, which gives the corresponding art institution word vector sequences.
In step S1020, the art institution similarity vectors between the art institution word vector sequences are calculated, and a weighted calculation is performed according to the third weights of the art institution similarity vectors. Art institution similarity may have several dimensions, such as the country where the art institution is located, the time when the art institution was established, and the number of works in its collection. Therefore, the art institution similarity vector between the word vector sequences of each dimension of the art institution can be calculated first.
For example, two art institution word vector sequences of the same dimension may differ in length, so the two sequences can be used as the input of a Siamese long short-term memory (Siamese LSTM) network model, which accommodates variable-length sequence pairs. The reference art institution word vector sequence and the target art institution word vector sequence are fed into the two neural network models respectively, and the art institution similarity component between the two inputs is evaluated by calculating the distance between the two vector sequences. This distance calculation relies mainly on the Manhattan distance. The art institution similarity vector can also be calculated by other algorithms, which is not specifically limited in this exemplary embodiment.
After the art institution similarity components of each dimension are obtained, a weighted calculation can be performed on them according to the preset third weights to obtain a weighted calculation result. For example, the dimensions of the art institution similarity components may include the country where the art institution is located, the time when it was established, and the number of works in its collection. Further, the third weight for country can be set to 0.5, the third weight for establishment time to 0.2, and the third weight for the number of works in the collection to 0.3, so that the country component is multiplied by 0.5, the establishment time component is multiplied by 0.2, the collection size component is multiplied by 0.3, and the products are summed to obtain the calculation result.
In step S1030, the art institution similarity is obtained from the weighted calculation result, and it is determined whether the art institution similarity is greater than a third threshold. The art institution similarity is the result of the weighted calculation over the art institution similarity components of each dimension. The third threshold can therefore be set according to the overall range of the art institution similarity, and the art institution similarity is compared with it. For example, the third threshold can be set to 3. When the weighted calculation result is 0.8, the art institution similarity is determined to be less than the third threshold because 0.8 is less than 3; when the weighted calculation result is 3.2, the art institution similarity is determined to be greater than the third threshold because 3.2 is greater than 3.
In step S1040, the reference art institution data and the target art institution data whose art institution similarity is greater than the third threshold are fused to generate the fused art institution data. When the art institution similarity is determined to be greater than the third threshold, the reference art institution data and the target art institution data can be determined to refer to the same art institution, so the two are fused to obtain the fused art institution data.
In this exemplary embodiment, by calculating the similarity vectors of each dimension for the reference art institution data and the target art institution data, the reference art institution data and the target art institution data that satisfy the preset condition can be fused to obtain the fused art institution data. The calculation is simple and the fusion accuracy is high, which improves the accuracy of the acquired art institution data.
In step S140, art triples are generated according to the art entities and the art relations, and the art domain knowledge graph is generated according to the art triples.
In an exemplary embodiment of the present disclosure, the art entities that can be extracted from the fused art data may include artists, artworks, art institutions, and the like. It is worth noting that when the fused art data also includes other art entities, these may likewise become part of the generated art domain knowledge graph.
A knowledge graph, also called a scientific knowledge map, is a family of graphs that show the development and structure of knowledge. It uses visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the relationships among its elements. It combines the theories and methods of applied mathematics, graphics, information visualization, information science, and other disciplines with bibliometric methods such as citation analysis and co-occurrence analysis, and uses visual maps to present the core structure, development history, frontier areas, and overall knowledge structure of a discipline, which provides practical and valuable references for research. A knowledge graph is a structured semantic knowledge base that describes concepts of the physical world and their relationships in symbolic form. Its basic units are entity-relation-entity triples and entity-attribute key-value pairs; entities are linked to one another through relations, forming a networked knowledge structure.
Therefore, after the artists, artworks, and art institutions and the entity relations among them have been extracted, an association model of the art domain knowledge graph can be constructed, and a visual art domain knowledge graph can also be drawn by a drawing program.
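A minimal sketch of the triple structure described above is given below. It stores entity-relation-entity triples and indexes them by head entity, which is one simple way to obtain the networked structure. The entity names, relation names, and attribute values are hypothetical placeholders, not data from the disclosure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Triple(NamedTuple):
    head: str      # source entity, for example an artist
    relation: str  # relation name, for example "created"
    tail: str      # destination entity, for example an artwork


def build_graph(triples: Iterable[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index triples by head entity so that outgoing relations can be looked up."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.head].append((t.relation, t.tail))
    return dict(graph)


# Hypothetical art triples.
triples = [
    Triple("Artist A", "created", "Artwork X"),
    Triple("Artwork X", "held_by", "Institution M"),
    Triple("Artist A", "exhibited_at", "Institution M"),
]
graph = build_graph(triples)

# Entity attributes are kept separately as key-value pairs (values are placeholders).
attributes = {"Artist A": {"nationality": "French", "genre": "Impressionism"}}
```

A production system would typically load such triples into a graph database, but the text does not name one, so none is assumed here.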
The method for constructing the art domain knowledge graph in the embodiments of the present disclosure is described in detail below with reference to an application scenario.
FIG. 11 shows a schematic flowchart of the method for constructing the art domain knowledge graph in the application scenario. As shown in FIG. 11, step S1110 is the internal data source, that is, the original structured data. In addition, structured data from external data sources can also be loaded to enrich the sources of the original structured data.
Step S1111 is data cleaning and error correction. Specifically, FIG. 12 shows a schematic flowchart of the method for performing the first preprocessing of data in the application scenario. As shown in FIG. 12, in step S1210, the original data is loaded. Specifically, the structured data in the internal data source and the external data sources is loaded as the original structured data.
Step S1211 is data processing (splitting and cleaning). Specifically, FIG. 13 shows a schematic flowchart of the method for cleaning the original structured data in the application scenario. As shown in FIG. 13, in step S1310, the original data is loaded. Specifically, the structured data in the internal data source and the external data sources is loaded as the original structured data.
Step S1311 is data processing (single-valued attributes), that is, single-valued attribute determination is performed on the original structured data.
In step S1312, the entity table (attribute values) and the relation table are obtained, and the error information table is compiled. According to the result of the single-valued attribute determination, the first structured entities and the first structured relations in the single-valued structured data are obtained, and the multi-valued data that does not satisfy the single-valued attribute is counted to obtain the corresponding multi-valued data table.
Step S1313 determines whether the error information table is empty. Specifically, it is determined whether the multi-valued data table still contains multi-valued data that has not been updated. When the multi-valued data table is empty, the first structured entities and the first structured relations are output; when it is not empty, the multi-valued data can be reviewed in order to update the error correction dictionary or the data dictionary.
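Steps S1311 to S1313 can be sketched as a split of the records into an entity table and an error (multi-valued) table. The sketch assumes that a record carries multiple values as a list and that the set of attributes required to be single-valued is known in advance; the record layout and field names are hypothetical.

```python
from typing import Any, Dict, List, Sequence, Tuple

Record = Dict[str, Any]


def split_single_valued(records: Sequence[Record],
                        single_valued_attrs: Sequence[str]) -> Tuple[List[Record], List[Record]]:
    """Return (entity_table, error_table) after the single-valued attribute check."""
    entity_table: List[Record] = []
    error_table: List[Record] = []
    for rec in records:
        # An attribute violates the check when it holds more than one value.
        multi = {
            attr: rec[attr]
            for attr in single_valued_attrs
            if isinstance(rec.get(attr), (list, tuple, set)) and len(rec[attr]) > 1
        }
        if multi:
            error_table.append({"id": rec["id"], "multi_valued": multi})
        else:
            entity_table.append(rec)
    return entity_table, error_table


# Hypothetical records: the second one has two birth years and needs review.
records = [
    {"id": 1, "name": "Artist A", "birth_year": 1840},
    {"id": 2, "name": "Artist B", "birth_year": [1853, 1854]},
]
entities, errors = split_single_valued(records, ["birth_year"])
ready_to_output = not errors  # an empty error table means the entities can be output
```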
After the original structured data has been cleaned, step S1212 is a duplicate check. Specifically, a duplicate check can be performed on the data cleaning result. For the case where the artwork duplicate check finds a match, FIG. 14 shows a schematic flowchart of the method for handling matching paintings. As shown in FIG. 14, step S1410 is paintings with the same title. Specifically, the paintings that the artwork duplicate check found to match are obtained.
Step S1411 determines whether the authors are the same. Specifically, an artist duplicate check is performed to produce an artist duplicate check result.
Step S1412 determines whether the creation times are the same. Specifically, when the artist duplicate check finds a match, a creation time duplicate check is further performed to produce a creation time duplicate check result.
Step S1413 is the duplicate painting determination. Specifically, when the creation time duplicate check finds a match, the two paintings are determined to be duplicates.
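The cascade of checks in steps S1410 to S1413 can be sketched as below. Exact string equality is used for each check, which is a simplifying assumption, and the field names `title`, `artist`, and `created` are hypothetical.

```python
from typing import Dict


def is_duplicate_painting(a: Dict[str, str], b: Dict[str, str]) -> bool:
    """Apply the checks in order: same title, then same artist, then same creation time."""
    if a["title"] != b["title"]:
        return False  # not same-title paintings, so the cascade does not start
    if a["artist"] != b["artist"]:
        return False  # same title but different authors
    if a["created"] != b["created"]:
        return False  # same title and author but different creation times
    return True       # all three checks match: treated as duplicate paintings


# Hypothetical records.
p1 = {"title": "Sunrise", "artist": "Artist A", "created": "1872"}
p2 = {"title": "Sunrise", "artist": "Artist A", "created": "1872"}
p3 = {"title": "Sunrise", "artist": "Artist B", "created": "1901"}
```

Pairs that pass this cascade would then go on to the fusion and manual review steps that follow.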
Step S1414 is data fusion. Specifically, data fusion is performed on the two duplicate paintings to obtain the corresponding data fusion result.
Step S1415 is review. Specifically, the data fusion result is manually reviewed to obtain a manual review result.
Step S1416 is updating the data or dictionary. Specifically, when the manual review result is a pass, a data dictionary entry is generated for the update.
Step S1417 is paintings with repeated titles. When the same-title paintings are judged to have the same creating artist and the same creation time, the two paintings can be determined to be repeated-title paintings.
Step S1418 is review. The two repeated-title paintings are deduplicated, and the deduplication result is manually reviewed to confirm its accuracy.
Step S1213 is entity deduplication and fusion. The data cleaning results that have been deduplicated or fused, as well as those that were not duplicated, are manually checked for errors, and incorrect information such as painting titles and painter names is manually verified.
In step S1214, a dictionary is generated. Specifically, the corresponding data dictionary or error correction dictionary can be generated according to the result of the error check.
Step S1215 determines whether the data conforms to the specification. Specifically, the naming conventions of the generated data dictionary or error correction dictionary may be inconsistent with the storage conventions of the database, in which case a further data normalization step can be performed. The normalized entries are then added to the data dictionary or the error correction dictionary.
Step S1112 is the updated data. Specifically, the target art data, that is, the updated data, can be generated according to the generated data dictionary or error correction dictionary.
Step S1113 is data fusion. Specifically, data fusion can be performed on the first structured data and the second structured data. The second structured data may be structured data obtained by crawling semi-structured data from an external data source and processing it, and it is stored in a MySQL database. FIG. 15 shows a schematic flowchart of the method for generating the fused art data in the application scenario. As shown in FIG. 15, step S1510 is external data crawling. Specifically, semi-structured data is crawled from an external data source. The external data source may be a public data source on the Internet or another data source, which is not specifically limited in this exemplary embodiment.
In step S1511, the semi-structured data is parsed. Specifically, the second preprocessing is performed on the semi-structured data according to preset rules and regular expressions to obtain structured data.
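Parsing with rules and regular expressions can be sketched as follows. The disclosure does not describe the layout of the crawled pages, so the "Key: value" line format, the field names, and the four-digit year rule are all illustrative assumptions.

```python
import re
from typing import Dict, Optional

# One "Key: value" pair per line; the key ends at the first colon.
FIELD_RE = re.compile(
    r"^[ \t]*(?P<key>[^:\n]+?)[ \t]*:[ \t]*(?P<value>\S.*?)[ \t]*$",
    re.MULTILINE,
)
YEAR_RE = re.compile(r"\b(\d{4})\b")


def parse_fields(text: str) -> Dict[str, str]:
    """Turn semi-structured 'Key: value' lines into a flat dictionary with lowercase keys."""
    return {m.group("key").lower(): m.group("value") for m in FIELD_RE.finditer(text)}


def extract_year(value: str) -> Optional[int]:
    """Rule: take the first standalone four-digit number as the year, if there is one."""
    m = YEAR_RE.search(value)
    return int(m.group(1)) if m else None


# Hypothetical crawled snippet.
raw = "Name: Artist A\nBorn:  14 November 1840 \nNationality: French\n"
fields = parse_fields(raw)
birth_year = extract_year(fields["born"])
```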
Step S1512 is the structured data, such as artworks, artists, and art institutions. Further, the obtained structured data can be normalized to generate the second structured data.
In step S1513, the artists are grouped by date of birth, and Word2Vec is then used to calculate the similarity of each relation and attribute, which gives the similarity vectors of every pair of artists born in the same year. Specifically, the Word2Vec algorithm is used to calculate the artist similarity vector of each dimension for the reference artist data and the target artist data.
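Grouping before comparison restricts the pairwise similarity calculation to artists who share a birth year, which avoids comparing every artist with every other. A minimal sketch of this candidate generation is given below; the field name `birth_year` and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Any, Dict, List, Sequence, Tuple

Record = Dict[str, Any]


def candidate_pairs(artists: Sequence[Record], key: str = "birth_year") -> List[Tuple[Record, Record]]:
    """Group records by `key` and return every unordered pair within each group."""
    groups = defaultdict(list)
    for artist in artists:
        groups[artist[key]].append(artist)
    pairs: List[Tuple[Record, Record]] = []
    for members in groups.values():
        pairs.extend(combinations(members, 2))  # pairs are formed only inside a group
    return pairs


# Hypothetical records: only the two artists born in 1840 form a candidate pair.
artists = [
    {"name": "Artist A", "birth_year": 1840},
    {"name": "A. Artist", "birth_year": 1840},
    {"name": "Artist B", "birth_year": 1853},
]
pairs = candidate_pairs(artists)
```

The same function can group artworks by author for the artwork branch by passing a different `key`.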
In step S1514, the similarity vector from the previous step is weighted to obtain the artist similarity. Specifically, a weighted calculation is performed on the artist similarity vector according to the first weight to generate the corresponding artist similarity.
In step S1515, it is determined whether the similarity is higher than set threshold 1. Specifically, the artist similarity is compared with the first threshold.
In step S1516, the corresponding artists are fused. Specifically, when the artist similarity is greater than the first threshold, the reference artist data and the target artist data are fused to generate fused artist data.
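As a non-limiting illustration, steps S1513 to S1516 might be sketched as follows. The per-field vectors below are made-up stand-ins for Word2Vec output, cosine similarity is an assumed choice of per-dimension similarity, and the field names, weights, threshold, and the gap-filling fusion rule in `fuse` are all hypothetical; the disclosure does not fix any of these.

```python
import math

FIELDS = ["name", "nationality", "genre"]
FIRST_WEIGHT = [0.6, 0.2, 0.2]   # hypothetical "first weight", one entry per field
FIRST_THRESHOLD = 0.9            # hypothetical "first threshold"


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_vector(reference, target):
    """One similarity score per relationship or attribute (step S1513)."""
    return [cosine(reference["vectors"][f], target["vectors"][f]) for f in FIELDS]


def weighted_similarity(sim_vector, weights):
    """Weighted sum of the per-field scores (step S1514)."""
    return sum(s * w for s, w in zip(sim_vector, weights))


def fuse(reference, target):
    """Assumed fusion rule: keep target values, fill gaps from the reference."""
    merged = dict(target["values"])
    for key, value in reference["values"].items():
        if merged.get(key) is None:
            merged[key] = value
    return merged


# Made-up vectors standing in for embeddings of each field value.
reference = {
    "values": {"name": "Leonardo da Vinci", "nationality": "Italian", "genre": "High Renaissance"},
    "vectors": {"name": [1.0, 0.0], "nationality": [0.0, 1.0], "genre": [1.0, 1.0]},
}
target = {
    "values": {"name": "Leonardo da Vinci", "nationality": "Italian", "genre": None},
    "vectors": {"name": [1.0, 0.0], "nationality": [0.0, 1.0], "genre": [1.0, 0.0]},
}

score = weighted_similarity(similarity_vector(reference, target), FIRST_WEIGHT)
# Steps S1515 and S1516: fuse only when the score exceeds the threshold.
fused = fuse(reference, target) if score > FIRST_THRESHOLD else None
```

The same structure would apply to artworks and art institutions, with their own weights and thresholds.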
In step S1517, after the artworks are grouped by author, Word2Vec is used to calculate the similarity of each relationship and attribute, yielding a similarity vector for each pair of artworks by the same author. Specifically, the Word2Vec algorithm is used to calculate, for each dimension, the artwork similarity vector between the reference artwork data and the target artwork data.
In step S1518, the similarity vector from the previous step is weighted to obtain the artwork similarity. Specifically, a weighted calculation is performed on the artwork similarity vector according to the second weight to generate the corresponding artwork similarity.
In step S1519, it is determined whether the similarity is higher than set threshold 2. Specifically, the artwork similarity is compared with the second threshold.
In step S1520, the corresponding artworks are fused. Specifically, when the artwork similarity is greater than the second threshold, the reference artwork data and the target artwork data are fused to generate fused artwork data.
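As a non-limiting illustration, the grouping performed before the pairwise comparison (by author for artworks in step S1517, by birth year and month for artists in step S1513) might be sketched as follows. The sample records and the helper `candidate_pairs` are assumptions made for this example. Grouping first means that only records sharing the grouping key are compared, which reduces the number of pairs to score.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical artwork records.
artworks = [
    {"id": "a1", "author": "Leonardo da Vinci", "title": "Mona Lisa"},
    {"id": "a2", "author": "Leonardo da Vinci", "title": "La Gioconda"},
    {"id": "a3", "author": "Claude Monet", "title": "Water Lilies"},
    {"id": "a4", "author": "Leonardo da Vinci", "title": "The Last Supper"},
]


def candidate_pairs(records, key):
    """Yield every pair of records that share the same value for `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    for group in groups.values():
        # combinations() yields each unordered pair exactly once.
        yield from combinations(group, 2)


pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(artworks, "author")]
```

In this sample, three pairs are produced instead of the six that an ungrouped comparison of four records would require.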
In step S1521, for art institutions, Word2Vec is used to calculate the similarity of each relationship and attribute, yielding a similarity vector for each pair of art institutions. Specifically, the Word2Vec algorithm is used to calculate, for each dimension, the art institution similarity vector between the reference art institution data and the target art institution data.
In step S1522, the similarity vector from the previous step is weighted to obtain the art institution similarity. Specifically, a weighted calculation is performed on the art institution similarity vector according to the third weight to generate the corresponding art institution similarity.
In step S1523, it is determined whether the similarity is higher than set threshold 3. Specifically, the art institution similarity is compared with the third threshold.
In step S1524, the corresponding art institutions are fused. Specifically, when the art institution similarity is greater than the third threshold, the reference art institution data and the target art institution data are fused to generate fused art institution data.
In step S1114, the fused data is obtained. Specifically, the fused artist data, fused artwork data, and fused art institution data are obtained.
In addition, after the fused data is obtained, data quality can be evaluated in step S1115. Specifically, art data is sampled from the unmatched external data sources to evaluate the fused data. In this solution, the main evaluation indicators are the accuracy and completeness of the fused data.
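As a non-limiting illustration, the two indicators of step S1115 might be computed as follows. The disclosure names accuracy and completeness but does not define them, so the definitions here are assumptions: completeness is taken as the share of required fields that are filled, and accuracy as the share of field values that agree with a sampled reference set. The records and function names are hypothetical.

```python
def completeness(records, required_fields):
    """Fraction of required fields that hold a non-empty value."""
    total = len(records) * len(required_fields)
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total if total else 0.0


def accuracy(records, sample, key, fields):
    """Fraction of compared field values that match the sampled reference data."""
    sample_by_key = {row[key]: row for row in sample}
    checked = correct = 0
    for record in records:
        reference = sample_by_key.get(record[key])
        if reference is None:
            continue  # no reference row to compare against
        for field in fields:
            checked += 1
            if record.get(field) == reference.get(field):
                correct += 1
    return correct / checked if checked else 0.0


# Hypothetical fused data and a hypothetical sampled reference set.
fused_data = [
    {"name": "A", "nationality": "Italian", "birth_year": 1452},
    {"name": "B", "nationality": "", "birth_year": 1840},
]
sampled = [
    {"name": "A", "nationality": "Italian", "birth_year": 1452},
    {"name": "B", "nationality": "French", "birth_year": 1841},
]
FIELDS = ["nationality", "birth_year"]

completeness_score = completeness(fused_data, FIELDS)
accuracy_score = accuracy(fused_data, sampled, "name", FIELDS)
```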
In step S1116, entities and relationships are extracted. Specifically, the art entities in the fused art data and the art relationships corresponding to those entities are extracted to carry out the schema design of the database. The schema contains schema objects, which can be tables, columns, data types, views, stored procedures, relationships, primary keys, foreign keys, and so on. A database schema can be represented as a visual diagram that shows the art entities and the relationships among them.
In step S1117, the data is stored in the KG_neo4j database (MySQL). Specifically, the art domain knowledge graph formed from the generated art triples of artists, artworks, and art institutions is obtained and stored as a whole in a graph database, for example Neo4j. Fig. 16 shows a schematic diagram of an interface for visualizing the art domain knowledge graph. As shown in Fig. 16, the art entities include artists, artworks, and art institutions. The entities related to an artist may include nationality, year of death, birthplace, birth year and month, and genre, and the attributes of an artist include an English name and an alias; the entities related to an artwork may include year of creation, medium, category, and subject matter, and the attributes of an artwork include a unique identifier (ID), an alias, and dimensions; the attributes of an art institution include an English name.
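As a non-limiting illustration, the generation of art triples and their preparation for a graph database might be sketched as follows. The record layout, the relation names, the generic `Entity` label, and the helpers `to_triples` and `to_cypher` are assumptions made for this example. The sketch only builds Cypher statement text with parameters; sending the statements to a Neo4j instance is left out.

```python
# Relations that are modeled as edges to other entities (assumed whitelist).
RELATIONS = ("nationality", "birthplace", "genre")


def to_triples(artist):
    """Build (subject, relation, object) triples from one artist record."""
    subject = artist["name"]
    return [
        (subject, relation, artist[relation])
        for relation in RELATIONS
        if artist.get(relation)
    ]


def to_cypher(triple):
    """Return a Cypher MERGE statement and its parameters for one triple."""
    subject, relation, obj = triple
    if relation not in RELATIONS:
        raise ValueError(f"unexpected relation: {relation}")
    # The relationship type is interpolated from the fixed whitelist above;
    # the entity names are passed as query parameters.
    query = (
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        f"MERGE (s)-[:{relation.upper()}]->(o)"
    )
    return query, {"subject": subject, "object": obj}


# Hypothetical fused artist record.
artist = {
    "name": "Leonardo da Vinci",
    "nationality": "Italian",
    "birthplace": "Vinci",
    "genre": None,
}

triples = to_triples(artist)
statements = [to_cypher(t) for t in triples]
```

Using `MERGE` rather than `CREATE` keeps repeated loads from duplicating nodes or relationships that already exist.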
The art domain knowledge graph can be applied to an art encyclopedia, an art graph, art knowledge question answering, and an art knowledge overview. Fig. 17 shows a schematic diagram of the art encyclopedia scenario. As shown in Fig. 17, after a user initiates a search, Da Vinci can be recognized through art entity recognition together with the thulac word segmentation package and the data dictionary, and knowledge related to Da Vinci is displayed. Fig. 18 shows a schematic diagram of the knowledge graph scenario. As shown in Fig. 18, the knowledge graph drawn with the charting component E-charts is displayed visually. Fig. 19 shows a schematic diagram of the art knowledge question answering scenario. As shown in Fig. 19, the user's question is segmented with the thulac word segmentation package, and a visualized knowledge graph corresponding to the art question is generated from the results of matching against preset rules or regular expressions. Fig. 20 shows a schematic diagram of the art knowledge overview scenario. As shown in Fig. 20, the data dictionary can be used to generate the corresponding art knowledge overview.
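As a non-limiting illustration, the rule matching used in the question answering scenario might be sketched as follows. The word segmentation step is omitted, English questions are used for readability, and the triples, the regular expressions in `RULES`, and the function `answer` are all assumptions made for this example rather than the rules of the disclosure.

```python
import re

# Hypothetical triples taken from the knowledge graph.
TRIPLES = [
    ("Leonardo da Vinci", "nationality", "Italian"),
    ("Leonardo da Vinci", "genre", "High Renaissance"),
    ("Mona Lisa", "author", "Leonardo da Vinci"),
]

# Assumed preset rules: each pattern maps a question form to one relation.
RULES = [
    (re.compile(r"^what is the nationality of (?P<entity>.+?)\??$", re.IGNORECASE), "nationality"),
    (re.compile(r"^who (?:painted|created) (?:the )?(?P<entity>.+?)\??$", re.IGNORECASE), "author"),
]


def answer(question):
    """Return the objects of all triples matching the question, or []."""
    text = question.strip()
    for pattern, relation in RULES:
        match = pattern.match(text)
        if match:
            entity = match.group("entity").lower()
            return [
                obj
                for subj, rel, obj in TRIPLES
                if rel == relation and subj.lower() == entity
            ]
    return []


painter = answer("Who painted the Mona Lisa?")
nationality = answer("What is the nationality of Leonardo da Vinci?")
```

The matched triples could then be handed to a charting component to draw the corresponding subgraph.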
In the exemplary embodiments of the present disclosure, on the one hand, performing data fusion on the data from external data sources and the already normalized data greatly increases the scale of entity knowledge in the art domain and improves the accuracy of knowledge acquisition in the art domain; on the other hand, generating the art domain knowledge graph from the art entities and art relationships helps improve the relatedness of entities in the knowledge graph and the comprehensiveness of knowledge graph search, so that query intent is understood more accurately and retrieval accuracy is improved.
It should be noted that although the above exemplary embodiments describe the steps of the method of the present disclosure in a specific order, this does not require or imply that the steps must be performed in that order, or that all of the steps must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and/or one step may be split into multiple steps.
In addition, an exemplary embodiment of the present disclosure further provides an apparatus for constructing an art domain knowledge graph. Fig. 21 shows a schematic structural diagram of the apparatus. As shown in Fig. 21, the apparatus 2100 for constructing an art domain knowledge graph may include a data processing module 2110, a data parsing module 2120, a data fusion module 2130, and a graph generation module 2140, where:
the data processing module 2110 is configured to perform first preprocessing on the structured data in the internal art data source and the external art data source to generate first structured data; the data parsing module 2120 is configured to perform second preprocessing on the unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data; the data fusion module 2130 is configured to fuse the first structured data with the second structured data to generate fused art data, where the fused art data includes art entities and the art relationships corresponding to the art entities; and the graph generation module 2140 is configured to generate art triples according to the art entities and the art relationships, and to generate the art domain knowledge graph according to the art triples.
The specific details of the above apparatus for constructing an art domain knowledge graph have already been described in detail in the corresponding method for constructing an art domain knowledge graph, and are therefore not repeated here.
It should be noted that although several modules or units of the apparatus 2100 for constructing an art domain knowledge graph are mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in a single module or unit. Conversely, the features and functions of one module or unit described above may be further divided among multiple modules or units. The modules or units in the embodiments of the present application may be integrated in one processing unit, may each exist separately as physical units, or two or more of them may be integrated in one module or unit. The above modules or units may be implemented in hardware or as software functional modules or units; the hardware may be a CPU, a microprocessor, a GPU, an FPGA, a single-chip microcomputer, or the like.
Furthermore, although the steps of the method of the present disclosure are shown in the drawings in a specific order, this does not require or imply that the steps must be performed in that order, or that all of the steps shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and/or one step may be split into multiple steps.
From the description of the above implementations, those skilled in the art will readily understand that the example implementations described here can be implemented in software, or in software combined with the necessary hardware. Therefore, the technical solution according to the implementations of the present disclosure can be embodied in the form of a software product, which may be stored on a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, a server, a mobile terminal, or a network device) to perform the method according to the implementations of the present disclosure.
In addition, an exemplary embodiment of the present disclosure further provides an electronic device capable of implementing the above method. The electronic device includes a processor and a memory for storing instructions executable by the processor; the processor is configured to perform the above method for constructing an art domain knowledge graph by executing the executable instructions.
An electronic device 2200 according to this embodiment of the present invention is described below with reference to Fig. 22. The electronic device 2200 shown in Fig. 22 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 22, the electronic device 2200 takes the form of a general-purpose computing device. The components of the electronic device 2200 may include, but are not limited to, the at least one processing unit 2210, the at least one storage unit 2220, a bus 2230 connecting different system components (including the storage unit 2220 and the processing unit 2210), and a display unit 2240.
The storage unit stores program code that can be executed by the processing unit 2210, so that the processing unit 2210 performs the steps according to the various exemplary embodiments of the present invention described in the "Exemplary Method" section above in this specification.
The storage unit 2220 may include a readable medium in the form of a volatile storage unit, such as a random access memory (RAM) unit 2221 and/or a cache storage unit 2222, and may further include a read-only memory (ROM) unit 2223.
The storage unit 2220 may also include a program/utility 2224 having a set of (at least one) program modules 2225. Such program modules 2225 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The bus 2230 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 2200 may also communicate with one or more external devices 2400 (such as a keyboard, a pointing device, or a Bluetooth device), with one or more devices that enable a user to interact with the electronic device 2200, and/or with any device (such as a router or a modem) that enables the electronic device 2200 to communicate with one or more other computing devices. Such communication can take place through an input/output (I/O) interface 2250. The electronic device 2200 may also communicate with one or more networks (for example a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 2260. As shown in the figure, the network adapter 2260 communicates with the other modules of the electronic device 2200 through the bus 2230. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 2200, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
From the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described here can be implemented in software, or in software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which may be stored on a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, a server, a terminal apparatus, or a network device) to perform the method according to the embodiments of the present disclosure.
An exemplary embodiment of the present disclosure further provides a non-volatile computer-readable storage medium on which a computer program capable of implementing the above method of this specification is stored. In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product that includes program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the present invention described in the "Exemplary Method" section above in this specification.
Referring to Fig. 23, a program product 2300 for implementing the above method according to an embodiment of the present invention is described. It may take the form of a portable compact disc read-only memory (CD-ROM) that includes program code, and may run on a terminal device such as a personal computer. However, the program product of the present invention is not limited to this. In this document, a readable storage medium may be any tangible medium that contains or stores a program, where the program can be used by or in combination with an instruction execution system, apparatus, or device.
The program product may use any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
The program code contained on a readable medium may be transmitted over any suitable medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of the above.
The program code for performing the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present disclosure are indicated by the claims.

Claims (13)

  1. A method for constructing an art domain knowledge graph, wherein the method comprises:
    performing first preprocessing on structured data in an internal art data source and an external art data source to generate first structured data;
    performing second preprocessing on unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data;
    fusing the first structured data with the second structured data to generate fused art data, wherein the fused art data comprises art entities and art relationships corresponding to the art entities; and
    generating art triples according to the art entities and the art relationships, and generating the art domain knowledge graph according to the art triples.
  2. The method for constructing an art domain knowledge graph according to claim 1, wherein the performing first preprocessing on structured data in an internal art data source and an external art data source to generate first structured data comprises:
    performing data cleaning on the structured data in the internal art data source and the external art data source;
    performing a duplicate check on the data cleaning result of the structured data in the internal art data source and the external art data source to generate duplicate check data; and
    generating a data dictionary and an error correction dictionary according to the duplicate check data, and obtaining the first structured data based on the data dictionary.
  3. The method for constructing an art domain knowledge graph according to claim 2, wherein the performing data cleaning on the structured data in the internal art data source and the external art data source comprises:
    performing single-value attribute determination on the structured data in the internal art data source and the external art data source to obtain single-valued structured data;
    obtaining a first structured entity and a first structured relationship from the single-valued structured data, and compiling the results of the single-value attribute determination to obtain a multi-value data table;
    when the multi-value data table contains no multi-valued data, using the first structured entity and the first structured relationship as the data cleaning result; and
    when the multi-value data table contains multi-valued data, obtaining a second structured entity and a second structured relationship according to the multi-value data table as the data cleaning result.
  4. The method for constructing an art domain knowledge graph according to claim 3, wherein the obtaining a second structured entity and a second structured relationship according to the multi-value data table as the data cleaning result comprises:
    updating the data dictionary or the error correction dictionary according to the multi-value data table; and
    obtaining the second structured entity and the second structured relationship as the data cleaning result according to the update result of the data dictionary or the error correction dictionary.
  5. The method for constructing an art domain knowledge graph according to claim 4, wherein the performing a duplicate check on the data cleaning result of the structured data to generate duplicate check data comprises:
    performing a duplicate check of artwork entities on the data cleaning result of the structured data in the internal art data source and the external art data source to generate an artwork duplicate check result;
    when the artwork duplicate check result is "same", performing a duplicate check of artist entities on the data cleaning result to generate an artist duplicate check result;
    when the artist duplicate check result is "same", performing a duplicate check of creation time entities on the data cleaning result to generate a creation time duplicate check result;
    when the creation time duplicate check result is "same", determining that the artwork entity is a duplicate artwork; and
    fusing the duplicate artwork, and generating the duplicate check data according to the fusion result that has passed review.
  6. The method for constructing an art domain knowledge graph according to claim 5, wherein the method further comprises:
    when the artist duplicate check result is "different" or the creation time duplicate check result is "different", determining that the artwork entity is a same-name artwork; and
    performing deduplication on the same-name artwork, and generating the duplicate check data according to the deduplication result.
  7. The method for constructing an art domain knowledge graph according to claim 1, wherein the first structured data comprises target artwork data, target artist data, and target art institution data;
    the fusing the first structured data with the second structured data to generate fused art data comprises:
    fusing reference artist data in the second structured data with the target artist data to generate fused artist data;
    fusing reference artwork data in the second structured data with the target artwork data to generate fused artwork data; and
    fusing reference art institution data in the second structured data with the target art institution data to generate fused art institution data.
  8. The method for constructing an art domain knowledge graph according to claim 7, wherein the fusing of the reference artist data in the second structured data with the target artist data to generate fused artist data comprises:
    performing vector conversion on the reference artist data in the second structured data and on the target artist data according to a word vector model, to obtain artist word vector sequences;
    calculating artist similarity vectors between the artist word vector sequences, and performing a weighted calculation according to first weights of the artist similarity vectors;
    obtaining an artist similarity from the weighted calculation result, and judging whether the artist similarity is greater than a first threshold; and
    fusing the reference artist data and the target artist data that correspond to an artist similarity greater than the first threshold, to generate the fused artist data.
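The steps of claim 8 can be sketched as below. The claim does not say how text is turned into vectors, which similarity measure is used, or how the weights are applied, so this sketch makes three assumptions: each attribute is embedded by averaging the word vectors of its tokens, similarity is cosine similarity computed per attribute, and the weighted sum is normalized by the total weight. The function names, attribute names and toy model are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def embed(text: str, model: Dict[str, Vector], dim: int) -> List[float]:
    """Average the word vectors of the known tokens in `text`."""
    vectors = [model[tok] for tok in text.lower().split() if tok in model]
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def artist_similarity(target: Dict[str, str], reference: Dict[str, str],
                      model: Dict[str, Vector], dim: int,
                      weights: Dict[str, float]) -> float:
    """Weighted average of per-attribute cosine similarities."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = 0.0
    for attr, weight in weights.items():
        sim = cosine(embed(target.get(attr, ""), model, dim),
                     embed(reference.get(attr, ""), model, dim))
        score += weight * sim
    return score / total


def should_fuse(similarity: float, first_threshold: float) -> bool:
    """Fuse only when the similarity is strictly greater than the threshold."""
    return similarity > first_threshold
```

In practice the `model` dictionary would be replaced by a trained word vector model; the toy lookup table only keeps the sketch self-contained.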
  9. The method for constructing an art domain knowledge graph according to claim 7, wherein the fusing of the reference artwork data in the second structured data with the target artwork data to generate fused artwork data comprises:
    performing vector conversion on the reference artwork data in the second structured data and on the target artwork data according to a word vector model, to obtain artwork word vector sequences;
    calculating artwork similarity vectors between the artwork word vector sequences, and performing a weighted calculation according to second weights of the artwork similarity vectors;
    obtaining an artwork similarity from the weighted calculation result, and judging whether the artwork similarity is greater than a second threshold; and
    fusing the reference artwork data and the target artwork data that correspond to an artwork similarity greater than the second threshold, to generate the fused artwork data.
  10. The method for constructing an art domain knowledge graph according to claim 7, wherein the fusing of the reference art institution data in the second structured data with the target art institution data to generate fused art institution data comprises:
    performing vector conversion on the reference art institution data in the second structured data and on the target art institution data according to a word vector model, to obtain art institution word vector sequences;
    calculating art institution similarity vectors between the art institution word vector sequences, and performing a weighted calculation according to third weights of the art institution similarity vectors;
    obtaining an art institution similarity from the weighted calculation result, and judging whether the art institution similarity is greater than a third threshold; and
    fusing the reference art institution data and the target art institution data that correspond to an art institution similarity greater than the third threshold, to generate the fused art institution data.
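Claims 8 to 10 apply the same procedure to artists, artworks and art institutions, each with its own weights and threshold. One way to express that, offered only as a sketch, is a single matching routine driven by a per-type configuration. The numeric weights and thresholds below are placeholders, not values from the application, and the per-attribute similarity vectors are assumed to have been computed already.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class MatchConfig:
    weights: Tuple[float, ...]  # one weight per entry of the similarity vector
    threshold: float            # fuse only when the score exceeds this value


# Placeholder values for the first, second and third weights and thresholds.
CONFIGS: Dict[str, MatchConfig] = {
    "artist": MatchConfig(weights=(0.6, 0.4), threshold=0.85),
    "artwork": MatchConfig(weights=(0.5, 0.3, 0.2), threshold=0.80),
    "institution": MatchConfig(weights=(0.7, 0.3), threshold=0.90),
}

Pair = Tuple[str, str]  # (target record id, reference record id)


def weighted_score(sim_vector: Sequence[float], config: MatchConfig) -> float:
    """Weighted average of a similarity vector under the given configuration."""
    if len(sim_vector) != len(config.weights):
        raise ValueError("similarity vector and weights differ in length")
    weighted = sum(w * s for w, s in zip(config.weights, sim_vector))
    return weighted / sum(config.weights)


def pairs_to_fuse(kind: str,
                  candidates: Dict[Pair, Sequence[float]]) -> List[Pair]:
    """Return the candidate pairs whose score is above the type's threshold."""
    config = CONFIGS[kind]
    return [pair for pair, sims in candidates.items()
            if weighted_score(sims, config) > config.threshold]
```

Keeping the weights and thresholds in data rather than in code makes it straightforward to tune each entity type separately.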
  11. An apparatus for constructing an art domain knowledge graph, comprising:
    a data processing module, configured to perform first preprocessing on structured data in an internal art data source and an external art data source to generate first structured data;
    a data parsing module, configured to perform second preprocessing on unstructured data and semi-structured data in the internal art data source and the external art data source to obtain second structured data;
    a data fusion module, configured to fuse the first structured data with the second structured data to generate fused art data, wherein the fused art data comprises art entities and art relationships corresponding to the art entities; and
    a graph generation module, configured to generate art triples according to the art entities and the art relationships, and to generate the art domain knowledge graph according to the art triples.
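The graph generation module's final step, turning art entities and art relationships into triples and then into a graph, can be sketched as follows. The (head, relation, tail) tuple layout, the adjacency-list graph, and the rule of discarding relations that mention unknown entities are assumptions made for illustration; the claim does not prescribe a storage format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def make_triples(entities: Set[str], relations: Iterable[Triple]) -> List[Triple]:
    """Keep relations whose head and tail are known entities; drop repeats."""
    seen: Set[Triple] = set()
    triples: List[Triple] = []
    for head, relation, tail in relations:
        triple = (head, relation, tail)
        if head in entities and tail in entities and triple not in seen:
            seen.add(triple)
            triples.append(triple)
    return triples


def build_graph(triples: Iterable[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index triples by head entity as a simple adjacency list."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)
```

A production system would more likely load the triples into a graph database; the dictionary here only shows the shape of the data.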
  12. A non-volatile computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for constructing an art domain knowledge graph according to any one of claims 1-10.
  13. An electronic device, comprising:
    a processor; and
    a memory for storing executable instructions of the processor;
    wherein the processor is configured to perform, by executing the executable instructions, the method for constructing an art domain knowledge graph according to any one of claims 1-10.
PCT/CN2021/072241 2020-01-20 2021-01-15 Knowledge graph construction method and apparatus, storage medium, and electronic device WO2021147786A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010066621.5A CN111241212B (en) 2020-01-20 2020-01-20 Knowledge graph construction method and device, storage medium and electronic equipment
CN202010066621.5 2020-01-20

Publications (1)

Publication Number Publication Date
WO2021147786A1 true WO2021147786A1 (en) 2021-07-29

Family

ID=70879750

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/072241 WO2021147786A1 (en) 2020-01-20 2021-01-15 Knowledge graph construction method and apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN111241212B (en)
WO (1) WO2021147786A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114055451A (en) * 2021-11-24 2022-02-18 深圳大学 Robot operation skill expression method based on knowledge graph

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241212B (en) * 2020-01-20 2023-10-24 京东方科技集团股份有限公司 Knowledge graph construction method and device, storage medium and electronic equipment
WO2022012687A1 (en) * 2020-07-17 2022-01-20 武汉联影医疗科技有限公司 Medical data processing method and system
CN112287043B (en) * 2020-12-29 2021-06-18 成都数联铭品科技有限公司 Automatic graph code generation method and system based on domain knowledge and electronic equipment
CN113783833B (en) * 2021-07-27 2023-09-01 齐鑫 Method and device for constructing computer security knowledge graph
CN115687622A (en) * 2022-11-09 2023-02-03 易元数字(北京)大数据科技有限公司 Method and device for storing artwork data by using graph database and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268581A (en) * 2017-07-14 2018-07-10 广东神马搜索科技有限公司 The construction method and device of knowledge mapping
CN109815340A (en) * 2019-01-17 2019-05-28 云南师范大学 A kind of construction method of national culture information resources knowledge mapping
US20190220752A1 (en) * 2017-12-08 2019-07-18 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, server, and storage medium for incorporating structured entity
CN110222200A (en) * 2019-06-20 2019-09-10 京东方科技集团股份有限公司 Method and apparatus for entity fusion
CN110704411A (en) * 2019-09-27 2020-01-17 京东方科技集团股份有限公司 Knowledge graph building method and device suitable for art field and electronic equipment
CN111241212A (en) * 2020-01-20 2020-06-05 京东方科技集团股份有限公司 Knowledge graph construction method and device, storage medium and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009053303A (en) * 2007-08-24 2009-03-12 Nippon Telegr & Teleph Corp <Ntt> Discussion knowledge graph constructing method, device, and program, and recording medium with program recorded thereon
CN107358315A (en) * 2017-06-26 2017-11-17 深圳市金立通信设备有限公司 A kind of information forecasting method and terminal
CN110390021A (en) * 2019-06-13 2019-10-29 平安科技(深圳)有限公司 Drug knowledge mapping construction method, device, computer equipment and storage medium
CN110457502B (en) * 2019-08-21 2023-07-18 京东方科技集团股份有限公司 Knowledge graph construction method, man-machine interaction method, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114055451A (en) * 2021-11-24 2022-02-18 深圳大学 Robot operation skill expression method based on knowledge graph
CN114055451B (en) * 2021-11-24 2023-07-07 深圳大学 Robot operation skill expression method based on knowledge graph

Also Published As

Publication number Publication date
CN111241212B (en) 2023-10-24
CN111241212A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
WO2021147786A1 (en) Knowledge graph construction method and apparatus, storage medium, and electronic device
JP7064262B2 (en) Knowledge graph understanding support system based on natural language generation technology
Lin et al. A natural‐language‐based approach to intelligent data retrieval and representation for cloud BIM
CN109657068B (en) Cultural relic knowledge graph generation and visualization method for intelligent museum
Gilson et al. From web data to visualization via ontology mapping
CN107679221B (en) Time-space data acquisition and service combination scheme generation method for disaster reduction task
CN110134724A (en) A kind of the data intelligence extraction and display system and method for Building Information Model
CN111488465A (en) Knowledge graph construction method and related device
CN111274267A (en) Database query method and device and computer readable storage medium
US20120239677A1 (en) Collaborative knowledge management
CN115905553A (en) Construction drawing inspection specification knowledge extraction and knowledge graph construction method and system
CN113190687A (en) Knowledge graph determining method and device, computer equipment and storage medium
CN115203337A (en) Database metadata relation knowledge graph generation method
Prudhomme et al. Automatic Integration of Spatial Data into the Semantic Web.
CN113610626A (en) Bank credit risk identification knowledge graph construction method and device, computer equipment and computer readable storage medium
Yang et al. User story clustering in agile development: a framework and an empirical study
CN111309930B (en) Medical knowledge graph entity alignment method based on representation learning
CN112732969A (en) Image semantic analysis method and device, storage medium and electronic equipment
CN108614821B (en) Geological data interconnection and mutual-checking system
CN115576983A (en) Statement generation method and device, electronic equipment and medium
Wang et al. An ontology-based approach for marine geochemical data interoperation
CN115757735A (en) Intelligent retrieval method and system for power grid digital construction result resources
CN115408532A (en) Open source information-oriented weapon equipment knowledge graph construction method, system, device and storage medium
Bhagat et al. Sparx-Data Preprocessing Module
Wu et al. A summary of the latest research on knowledge graph technology

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21744389

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21744389

Country of ref document: EP

Kind code of ref document: A1
