CN112163127B - Relationship graph construction method and device, electronic equipment and storage medium

Info

Publication number: CN112163127B
Application number: CN202011066029.1A
Authority: CN
Inventors: 蒋维; 万月亮; 程强
Original assignee: Beijing Ruian Technology Co Ltd
Current assignee: Beijing Ruian Technology Co Ltd
Priority date: 2020-09-30
Filing date: 2020-09-30
Publication date: 2023-11-21
Anticipated expiration: 2040-09-30
Also published as: WO2022068348A1; CN112163127A

Abstract

The invention discloses a relationship graph construction method, a relationship graph construction device, electronic equipment and a storage medium. The method comprises the following steps: receiving each original data set, and extracting the original relation data of each original data set according to the extraction strategy corresponding to each original data set; grouping the original relationship data and the historical relationship data according to the attribute key value of the original relationship data and the attribute key value of the historical relationship data to obtain a plurality of groups of intermediate relationship data; merging and de-duplicating each group of intermediate relationship data to obtain target relationship data; and storing the target relation data into a distributed graph database, and constructing a relation map corresponding to the target relation data in the distributed graph database. The method realizes the storage based on the distributed extensible graph structure, solves the problems of large data size and difficult extension of the relational data, and ensures the timeliness of data processing.

Description

Relationship graph construction method and device, electronic equipment and storage medium

Technical Field

The embodiment of the invention relates to the technical field of data processing, in particular to a method and a device for constructing a relationship map, electronic equipment and a storage medium.

Background

With the continuing development of the big data era, data volume is growing exponentially, with hundreds of millions or even billions of new records added every day. Data sources are increasingly numerous (for example, internet data from 4G and 5G mobile networks, Internet of Things data, and the like), and data takes many forms (for example, online chat data, ride-hailing data, shopping data, and the like). How to extract valuable relationship information and build a clear and concise relationship graph has become an urgent problem. Moreover, because the data volume grows exponentially, the complexity of data processing increases accordingly, and a large amount of storage space is required to store the relationship graph.

Disclosure of Invention

The invention provides a method, a device, electronic equipment and a storage medium for constructing a relational graph, which are used for storing relational data in a distributed graph database, solving the problems of large data volume and difficult expansion and ensuring the timeliness of data processing.

In a first aspect, an embodiment of the present invention provides a method for constructing a relationship map, including:

receiving each original data set, and extracting original relation data of each original data set according to an extraction strategy corresponding to each original data set;

grouping the original relationship data and the historical relationship data according to the attribute key value of the original relationship data and the attribute key value of the historical relationship data to obtain a plurality of groups of intermediate relationship data;

merging and de-duplicating each group of the intermediate relationship data to obtain target relationship data;

and storing the target relation data into a distributed graph database, and constructing a relation graph corresponding to the target relation data in the distributed graph database.

In a second aspect, an embodiment of the present invention further provides a relationship map construction apparatus, where the apparatus includes:

the original relation data extraction module is used for receiving each original data set and extracting the original relation data of each original data set according to the extraction strategy corresponding to each original data set;

the intermediate relation data acquisition module is used for grouping the original relation data and the historical relation data according to the attribute key value of the original relation data and the attribute key value of the historical relation data to obtain a plurality of groups of intermediate relation data;

the target relation data acquisition module is used for merging and deduplicating each group of intermediate relation data to obtain target relation data;

and the target relation data storage module is used for storing the target relation data into a distributed graph database, and constructing a relation graph corresponding to the target relation data in the distributed graph database.

In a third aspect, an embodiment of the present invention further provides an electronic device, including:

one or more processors;

storage means for storing one or more programs,

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a relationship graph construction method as provided by embodiments of the present invention.

In a fourth aspect, embodiments of the present invention further provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a relationship graph construction method as provided by embodiments of the present invention.

According to the embodiments of the invention, the original relationship data of each original data set is extracted according to the extraction strategy corresponding to that data set. The original relationship data and the historical relationship data are then grouped according to their attribute key values to obtain multiple groups of intermediate relationship data, and each group is merged and de-duplicated to obtain target relationship data, which yields the valuable relationship information. Finally, the target relationship data is stored in a distributed graph database, and the corresponding relationship graph is constructed there. This provides storage based on a distributed, extensible graph structure, addresses the large volume and poor extensibility of relationship data, and ensures the timeliness of data processing.

Drawings

In order to more clearly illustrate the technical solution of the exemplary embodiments of the present invention, a brief description is given below of the drawings required for describing the embodiments. It is obvious that the drawings presented are only drawings of some of the embodiments of the invention to be described, and not all the drawings, and that other drawings can be made according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a relationship graph construction method according to a first embodiment of the present invention;

FIG. 2 is a schematic flow chart of a relationship graph construction method according to a second embodiment of the present invention;

FIG. 3 is a schematic flow chart of a method for constructing a relationship map according to a third embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a relationship graph construction apparatus according to a fourth embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.

Example 1

Fig. 1 is a flow chart of a relationship graph construction method according to an embodiment of the present invention, where the embodiment is applicable to a case of extracting relationship information of a plurality of data sets and establishing a relationship graph based on a distributed graph database, the method may be performed by a relationship graph construction device, and the device may be implemented by hardware and/or software, and the method specifically includes the following steps:

S110, receiving each original data set, and extracting the original relation data of each original data set according to the extraction strategy corresponding to each original data set.

An original data set is a set of data that contains multiple objects and has not yet undergone relationship extraction. Original data sets can be distinguished by their data form, for example an online chat data set, a ride-hailing data set, a shopping data set, or a terminal operation data set. A corresponding extraction strategy is formulated for the data form of each original data set in order to extract the original relationship data from it. For example, the extraction strategy for a shopping data set extracts data such as the IDs of the buyer and the seller between whom a relationship arises, the type of the relationship, the time of the relationship, or the number of times it occurs. Correspondingly, the original relationship data comprises the relationship data extracted from the original data set under the extraction strategy. Specifically, the target user analyzes the data structure of the original data set and judges whether relationship extraction is needed. If so, the user configures an extraction strategy for the data set; the attribute values in the original data set are mapped through the mappings in the configuration file, and the relationships are extracted according to the configured rules to obtain the original relationship data.

Illustratively, the extraction strategy corresponding to an original data set is configured as follows:

<Param name="const" value="identification card"/>

<Param name="const" value="handset"/>

<Param name="const" value="identification card-handset number-relationship"/>
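
As a rough illustration of how a configured extraction strategy might map raw records onto relationship records, the following Python sketch uses a plain dictionary in place of the XML configuration. The strategy keys, the raw column names (`buyer_id`, `seller_id`, `order_time`), and the function name are hypothetical and are not taken from the patent.

```python
from typing import Dict, Iterable, List

# Hypothetical extraction strategy for a shopping data set. Constants name the
# object categories and the relationship type; "*_field" entries name the raw
# input columns to read. None of these names come from the patent.
SHOPPING_STRATEGY: Dict[str, str] = {
    "a_iden_type": "buyer ID",
    "a_value_field": "buyer_id",
    "b_iden_type": "seller ID",
    "b_value_field": "seller_id",
    "relation_type": "purchase",
    "time_field": "order_time",
}


def extract_relations(records: Iterable[Dict[str, str]],
                      strategy: Dict[str, str]) -> List[Dict[str, object]]:
    """Map raw records to relationship records according to the strategy."""
    relations: List[Dict[str, object]] = []
    for record in records:
        a_value = record.get(strategy["a_value_field"])
        b_value = record.get(strategy["b_value_field"])
        if not a_value or not b_value:
            continue  # a relationship needs both objects, so skip this record
        relations.append({
            "a_iden_type": strategy["a_iden_type"],
            "a_iden_string": a_value,
            "b_iden_type": strategy["b_iden_type"],
            "b_iden_string": b_value,
            "relation_type": strategy["relation_type"],
            "time": record.get(strategy["time_field"]),
            "count": 1,  # each raw record counts as one occurrence
        })
    return relations
```

Records that lack either object identifier are skipped here; the optional standardization step described in the text could instead fill such key fields before extraction.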

Optionally, after the original data sets are received, they are standardized. Standardization may consist of filling in the required entries and key fields of each original data set to obtain a standardized original data set; the original relationship data of each standardized original data set is then extracted according to its corresponding extraction strategy.

S120, grouping the original relationship data and the historical relationship data according to the attribute key value of the original relationship data and the attribute key value of the historical relationship data to obtain a plurality of groups of intermediate relationship data.

The original relationship data is the relationship data extracted from the currently input original data set, and the historical relationship data is the relationship data extracted from previously input original data sets. The attribute key value is generated from key fields of the original relationship data and the historical relationship data and serves as their identifier; the data is grouped according to this identifier, and the grouped relationship data is taken as the intermediate relationship data.

Optionally, the historical relationship data is target relationship data stored in a file system. The file system temporarily stores the extracted target relationship data, for example the target relationship data within a preset time period; the preset time period can be set as required, for example one day, one week, or half a month. When a preset condition is met, the target relationship data stored in the file system is stored in the distributed graph database. The file system can thus be regarded as a temporary landing area for target relationship data, for example HDFS (Hadoop Distributed File System): the relationship data is first stored in the file system after extraction and processing, and when the file system reaches a release condition, the relationship data in it is stored in the distributed graph database. For example, the release condition may be a preset time threshold or data volume threshold, which may be determined according to the data storage speed or the storage space of the file system. It can be understood that the historical relationship data in the present application is relationship data that has already been grouped, merged, and de-duplicated. If no historical relationship data exists, that is, no data is stored in the file system, the original relationship data alone is grouped according to its attribute key values to obtain multiple groups of intermediate relationship data.
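
The release condition described above can be sketched as follows. This is a minimal in-memory stand-in for the temporary file-system store, assuming a time threshold and a record-count threshold; the class name, its methods, and the injected clock are hypothetical, and a real deployment would read and write files (for example on HDFS) rather than a Python list.

```python
import time
from typing import Callable, Dict, List


class StagingArea:
    """In-memory stand-in for the temporary store of target relationship data."""

    def __init__(self, max_age_seconds: float, max_records: int,
                 clock: Callable[[], float] = time.time) -> None:
        self.max_age_seconds = max_age_seconds  # preset time threshold
        self.max_records = max_records          # preset data volume threshold
        self.clock = clock
        self.records: List[Dict] = []
        self.opened_at = clock()

    def replace(self, merged_records: List[Dict]) -> None:
        """Store the latest merged result; it becomes the historical data."""
        self.records = list(merged_records)

    def should_release(self) -> bool:
        """True once either the time or the volume threshold is reached."""
        too_old = self.clock() - self.opened_at >= self.max_age_seconds
        too_big = len(self.records) >= self.max_records
        return too_old or too_big

    def release(self) -> List[Dict]:
        """Hand the staged records over (e.g. to a graph loader) and reset."""
        released, self.records = self.records, []
        self.opened_at = self.clock()
        return released
```

Passing the clock in as a parameter is only a convenience for testing the time threshold without waiting.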

In this embodiment, the processed relationship data is temporarily stored in the file system as historical relationship data, and the original relationship data is grouped together with it. The historical relationship data therefore does not need to be retrieved from the graph database, which relieves the processing load on the server. At the same time, current and historical relationship data in the same group can be merged and de-duplicated together, which yields more accurate relationship data.

Optionally, the original relationship data and the historical relationship data respectively include a first category of the first object, a corresponding value of the first category, a second category of the second object, a corresponding value of the second category, a relationship type between the first object and the second object, a relationship occurrence time, a relationship occurrence number of days, a relationship data source, a relationship source data set type and a reliability coefficient field.

The first category of the first object may refer to the data category of the object that actively initiates the relationship, and the second category of the second object may refer to the data category of the object on the passive side of the relationship; they are stored in the first category field of the first object and the second category field of the second object, respectively. For example, if the first object in the relationship data is a buyer and the second object is a seller, the first category of the first object may be the buyer's Wangwang ID, and the second category of the second object may be the seller's Wangwang ID. Alternatively, the second category of the second object may be the data category of the active object and the first category of the first object the data category of the passive object; the present application does not limit this. Correspondingly, the corresponding value of the first category and the corresponding value of the second category are the specific data of the first category of the first object and of the second category of the second object, stored in the corresponding value field of the first category and the corresponding value field of the second category, respectively, such as the specific Wangwang ID values of the buyer and the seller in the above example.

The relationship type between the first object and the second object refers to the kind of relationship that arises between them and is stored in the corresponding field, such as the purchase relationship between the buyer and the seller in the above example. As further examples, if object A and object B are friends in the relationship data extracted from a chat data set, the relationship type is a friend relationship; if a chat session exists between object A and object C, the relationship type is an interconnection relationship; and if object D takes train D11 in the relationship data extracted from a travel data set, the relationship type between object D and train D11 is a riding relationship.

It can be understood that the relationship occurrence time is the latest time at which the first object and the second object produced a relationship; the relationship occurrence count is the total number of times the first object and the second object produced a relationship; and the relationship occurrence days is the total number of days on which a relationship occurred between them. Each is stored in its corresponding field. For example, if the first object and the second object produce a relationship at three times, 2020/09/27/15:00, 2020/09/27/17:00, and 2020/09/28/17:00, then the relationship occurrence time is the latest of the three, 2020/09/28/17:00, the relationship occurrence count is 3, and the relationship occurrence days is 2.

In this embodiment, the relationship data source refers to the data source in which the relationship data occurred, such as 3G, 4G, or 5G, that is, the data source through which the first object and the second object produced the relationship. The relationship source data set category refers to the category of the data set from which the relationship data was extracted, such as a shopping data set, a travel data set, or a chat data set. The reliability coefficient field stores a reliability coefficient, which is a reliability score calculated from specific fields of the relationship data and characterizes how reliable the relationship data is: the higher the reliability coefficient, the more reliable the relationship data. Because the reliability coefficient field is set in the relationship data, the user can obtain the reliability of the relationship data from this field, judge its accuracy, and quickly locate erroneous relationship data within a large volume of relationship data. Optionally, the original relationship data and the historical relationship data further comprise one or more extension fields for extending the content of the relationship data, so that fields can be conveniently added and development cost is reduced.

Illustratively, the configuration process of the various fields of the raw relationship data and the historical relationship data is as follows:

<DataSet dscode="relation_0001" version="1" chname="relation DataSet" description="relation DataSet field description">

<Field code="F00001" enname="a_iden_type" chname="first category of first object"/>

<Field code="F00002" enname="a_iden_string" chname="corresponding value of first category"/>

<Field code="F00003" enname="b_iden_type" chname="second category of second object"/>

<Field code="F00004" enname="b_iden_string" chname="corresponding value of second category"/>

<Field code="F00005" enname="related_type" chname="relationship type between first object and second object"/>

<Field code="F00006" enname="first_COLLECT_TIME" chname="relationship occurrence time"/>

<Field code="F00007" enname="COUNT" chname="relationship occurrence count"/>

<Field code="F00008" enname="day_count" chname="relationship occurrence days"/>

<Field code="F00009" enname="data_source" chname="relationship data source"/>

<Field code="F00010" enname="from_dataset" chname="relationship source data set category"/>

<Field code="F00011" enname="field_ext1" chname="reliability coefficient"/>

<Field code="F00012" enname="field_ext1" chname="extension field 1"/>

<Field code="F00013" enname="field_ext2" chname="extension field 2"/>

</DataSet>

Accordingly, the attribute key value of the original relationship data is determined based on the first category of the first object, the corresponding value of the first category, the second category of the second object, the corresponding value of the second category, and the relationship type between the first object and the second object.

Specifically, the attribute key value of the original relationship data is uniquely determined by these five parameters, and each piece of relationship data has a corresponding attribute key value. Original relationship data whose five parameters are identical therefore has the same attribute key value and belongs to the same group of intermediate relationship data. For example, the attribute key value of a piece of relationship data is determined from the buyer ID category, the corresponding value of the buyer ID, the seller ID category, the corresponding value of the seller ID, and the purchase relationship; relationship data with the same buyer ID value and the same seller ID value that are all purchase relationships is placed in the same group of intermediate relationship data. It can be understood that the attribute key value of the historical relationship data is determined from the same five parameters: the first category of the first object, the corresponding value of the first category, the second category of the second object, the corresponding value of the second category, and the relationship type between the first object and the second object.
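
The grouping step can be sketched in Python as follows, assuming each relationship record is a dictionary. The patent does not say how the attribute key value is encoded, so using a tuple of the five parameters is an assumption, and the field name `relation_type` and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# The five parameters that determine the attribute key value.
KEY_FIELDS = ("a_iden_type", "a_iden_string",
              "b_iden_type", "b_iden_string", "relation_type")


def attribute_key(relation: Dict) -> Tuple:
    """Build the attribute key value from the five key fields (assumed: a tuple)."""
    return tuple(relation[field] for field in KEY_FIELDS)


def group_relations(original: Iterable[Dict],
                    historical: Iterable[Dict] = ()) -> Dict[Tuple, List[Dict]]:
    """Group current and historical relationship data by attribute key value."""
    groups: Dict[Tuple, List[Dict]] = defaultdict(list)
    for relation in list(historical) + list(original):
        groups[attribute_key(relation)].append(relation)
    return dict(groups)
```

When no historical relationship data exists, `historical` is simply left empty and only the original relationship data is grouped.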

S130, merging and deduplicating each group of intermediate relationship data to obtain target relationship data.

Specifically, the application carries out corresponding merging and de-duplication processing on each field of the intermediate relationship data, thereby obtaining the merged statistical relationship data as target relationship data, realizing the statistical processing on the current and historical relationship data and improving the accuracy of the relationship data.

Optionally, merging and deduplicating each set of the intermediate relationship data to obtain target relationship data, including: determining a relationship occurrence time field value of each relationship data in each group of intermediate relationship data, and performing de-duplication processing on the intermediate relationship data based on the relationship occurrence time field value; counting the relation occurrence time field value after the duplication elimination treatment to determine the relation occurrence days in the target relation data; and determining the maximum relationship occurrence time field value based on the relationship occurrence time field value after the deduplication processing, and determining the relationship occurrence time field value in the target relationship data based on the maximum relationship occurrence time field value.

Specifically, the relationship occurrence times of the multiple pieces of relationship data in each group of intermediate relationship data are collected, and equal relationship occurrence times are de-duplicated to obtain one or more distinct relationship occurrence times. The number of distinct days covered by these times is counted and used as the relationship occurrence days of the target relationship data. For example, if the de-duplicated relationship occurrence times of a group of intermediate relationship data are 2020/09/27/12:00, 2020/09/27/13:00, and 2020/09/28/15:00, the relationship occurrence days of the corresponding target relationship data is 2. The latest relationship occurrence time is then determined from the distinct relationship occurrence times and used as the relationship occurrence time in the target relationship data; for the three times in this example, that is 2020/09/28/15:00.
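
The handling of the time fields can be sketched as follows. The timestamp format is assumed to match the `2020/09/27/12:00` style used in the example, and the function name is hypothetical.

```python
from datetime import datetime
from typing import Iterable, Tuple

TIME_FORMAT = "%Y/%m/%d/%H:%M"  # assumed to match the timestamps in the example


def merge_occurrence_times(times: Iterable[str]) -> Tuple[str, int]:
    """Return (latest occurrence time, number of distinct occurrence days)."""
    # De-duplicate equal occurrence times by collecting them into a set.
    unique = {datetime.strptime(t, TIME_FORMAT) for t in times}
    if not unique:
        raise ValueError("a group must contain at least one occurrence time")
    latest = max(unique)
    # Count distinct calendar days among the de-duplicated times.
    day_count = len({t.date() for t in unique})
    return latest.strftime(TIME_FORMAT), day_count
```

Counting calendar dates rather than timestamps is what makes two occurrences on 2020/09/27 contribute a single day.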

Optionally, merging and deduplicating each group of the intermediate relationship data to obtain target relationship data, and further including: based on the relation occurrence number field value of each relation data in each group of intermediate relation data, accumulating and adding the relation occurrence number field values to obtain the relation occurrence number field value in the target relation data; and performing de-duplication processing on the intermediate relationship data based on the relationship data source and the relationship source data set type of each group of intermediate relationship data to obtain the relationship data source and the relationship source data set type corresponding to the target relationship data.

Specifically, the relationship occurrence counts of the multiple pieces of relationship data in each group of intermediate relationship data are collected and cumulatively added, and the relationship occurrence count of the target relationship data is determined from the sum. For example, if the relationship occurrence counts in a group of intermediate relationship data are 2, 1, and 3, the relationship occurrence count of the target relationship data corresponding to that group is 6.

Specifically, the relationship data sources of a plurality of relationship data in each group of intermediate relationship data are counted, the equal relationship data sources are subjected to de-duplication processing, and one or more relationship data sources which are not equal are obtained and used as the relationship data sources of the target relationship data. For example, if the relationship data sources of a certain set of intermediate relationship data include 3G, 4G, 5G, and 3G, the relationship data sources of the target relationship data corresponding to the set of intermediate relationship data are 3G, 4G, and 5G.

Specifically, the relationship source data set categories of the multiple pieces of relationship data in each group of intermediate relationship data are collected, and equal relationship source data set categories are de-duplicated to obtain one or more distinct relationship source data set categories, which are used as the relationship source data set categories of the target relationship data. For example, if the relationship source data set categories of a group of intermediate relationship data are a shopping data set, a shopping data set, and a travel data set, the relationship source data set categories of the target relationship data corresponding to that group are the shopping data set and the travel data set.
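
The accumulation of occurrence counts and the de-duplication of data sources and source data set categories can be sketched together. The sketch assumes that each record stores its sources and data set categories as lists of strings; the field names and the function name are hypothetical.

```python
from typing import Dict, List


def merge_counts_and_sources(group: List[Dict]) -> Dict[str, object]:
    """Merge the count, data-source and data-set fields of one group."""
    return {
        # Occurrence counts are cumulatively added.
        "count": sum(rel["count"] for rel in group),
        # Equal data sources are de-duplicated (sorted for stable output).
        "data_source": sorted({s for rel in group for s in rel["data_source"]}),
        # Equal source data set categories are de-duplicated likewise.
        "from_dataset": sorted({d for rel in group for d in rel["from_dataset"]}),
    }
```

Because historical records may already carry several sources after an earlier merge, treating the field as a list lets current and historical records be merged by the same function.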

In the embodiment, the relationship occurrence time, the relationship occurrence days, the relationship occurrence times, the relationship data sources and the relationship source data set types of the intermediate relationship data are merged and de-duplicated, so that real-time and accurate target relationship data are obtained.

S140, storing the target relation data into a distributed graph database, and constructing a relation graph corresponding to the target relation data in the distributed graph database.

The distributed graph database is used for storing a large amount of target relationship data and for presenting it graphically, that is, as the relationship graph corresponding to the target relationship data. Examples include distributed graph databases such as JanusGraph, which is based on Apache TinkerPop, and NebulaGraph. A distributed graph database can scale out across machines by adding cluster nodes, which also increases the available cache space; it supports highly concurrent transaction processing and graph operations, and it provides vertex-level queries using vertex-centric indexes to alleviate the super node problem. In the present application, storing the target relationship data in the distributed graph database improves the query speed and storage speed of the relationship data, reduces the storage pressure caused by a large amount of relationship data, and makes the relationship data easy for big data applications to call.
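
Before loading, each piece of target relationship data has to be expressed as graph elements. The following sketch shows one plausible mapping, with each object as a vertex and each relationship as an edge. It is plain Python and deliberately does not use the client API of any particular graph database; the record field names and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

KEY_FIELDS = ("a_iden_type", "a_iden_string",
              "b_iden_type", "b_iden_string", "relation_type")


def to_graph_elements(relations: List[Dict]) -> Tuple[Dict, List[Dict]]:
    """Convert target relationship data into vertex and edge records."""
    vertices: Dict[Tuple[str, str], Dict] = {}
    edges: List[Dict] = []
    for rel in relations:
        a = (rel["a_iden_type"], rel["a_iden_string"])
        b = (rel["b_iden_type"], rel["b_iden_string"])
        # Each distinct (category, value) pair becomes exactly one vertex.
        vertices.setdefault(a, {"type": a[0], "value": a[1]})
        vertices.setdefault(b, {"type": b[0], "value": b[1]})
        # The remaining fields (time, count, sources, ...) become edge properties.
        properties = {k: v for k, v in rel.items() if k not in KEY_FIELDS}
        edges.append({"from": a, "to": b,
                      "label": rel["relation_type"],
                      "properties": properties})
    return vertices, edges
```

A loader for a specific database would then write these vertex and edge records through that database's own interface.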

According to the technical scheme of this embodiment, the original relationship data of each original data set is extracted according to the extraction strategy corresponding to that data set. The original relationship data and the historical relationship data are grouped according to their attribute key values to obtain multiple groups of intermediate relationship data, and each group is merged and de-duplicated to obtain target relationship data, which yields the valuable relationship information. The target relationship data is stored in a distributed graph database, and the corresponding relationship graph is constructed there. This provides storage based on a distributed, extensible graph structure, addresses the large volume and poor extensibility of relationship data, and ensures the timeliness of data processing.

Example two

FIG. 2 is a flow chart of a relationship graph construction method according to a second embodiment of the present invention. On the basis of the above embodiment, this embodiment adds a step of calculating the reliability coefficient of the target relationship data before the target relationship data is stored in the distributed graph database. Explanations of terms that are the same as or correspond to those of the above embodiment are not repeated here.

Referring to fig. 2, the relationship map construction method provided in this embodiment specifically includes the following steps:

S210, receiving each original data set, and extracting the original relation data of each original data set according to the extraction strategy corresponding to each original data set.

S220, grouping the original relationship data and the historical relationship data according to the attribute key value of the original relationship data and the attribute key value of the historical relationship data to obtain a plurality of groups of intermediate relationship data.

S230, merging and deduplicating each group of intermediate relationship data to obtain target relationship data.

S240, calculating the reliability coefficient of the target relationship data, and storing the reliability coefficient in the reliability coefficient field of the target relationship data.

Wherein the reliability coefficient of the target relationship data is used to characterize the accuracy of the target relationship data. Optionally, the reliability coefficient is obtained by weighting calculation based on a statistical reliability value of the target relation data and a reliability value of the data set; the statistical reliability value is obtained by weighting and calculating the number of data sources, the number of data sets and the discovery times of the target relation data, and the data set reliability value is determined based on the weight maximum value of each data set.

Specifically, in calculating the reliability coefficient, the statistical reliability value has a corresponding statistical weight and the data set reliability value has a corresponding data set weight. Both weights lie in the interval [0,1] and can be dynamically configured by the user, provided that the configured statistical weight and data set weight sum to 1. Optionally, the weighted sum of the statistical reliability value and the data set reliability value is multiplied by a maximum reliability value, which is a constant set by the user. The specific calculation formula is as follows:

Reliability = (statisticWeight * statisticScore + datasetWeight * datasetScore) * maxScore, where Reliability is the reliability coefficient, statisticWeight is the statistical weight, statisticScore is the statistical reliability value, datasetWeight is the data set weight, datasetScore is the data set reliability value, and maxScore is the maximum reliability value.
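As a minimal sketch of the weighted combination described above, the following Python function computes the reliability coefficient from the two reliability values. The function name, the parameter names, and the default values (weights of 0.8 and 0.2, a maximum reliability value of 100) are illustrative assumptions and are not fixed by the embodiment.

```python
def reliability_coefficient(statistic_score, dataset_score,
                            statistic_weight=0.8, dataset_weight=0.2,
                            max_score=100.0):
    """Weighted sum of the two reliability values, scaled by max_score.

    The defaults are illustrative assumptions, not values mandated by the text.
    """
    for weight in (statistic_weight, dataset_weight):
        if not 0.0 <= weight <= 1.0:
            raise ValueError("each weight must lie in [0, 1]")
    # The text requires the two configured weights to sum to 1.
    if abs(statistic_weight + dataset_weight - 1.0) > 1e-9:
        raise ValueError("the two weights must sum to 1")
    return (statistic_weight * statistic_score
            + dataset_weight * dataset_score) * max_score
```

Rejecting weight pairs that do not sum to 1 keeps the result within the range from 0 to the maximum reliability value whenever both reliability values lie in [0, 1].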

Specifically, in the calculation of the statistical reliability value, the number of data sources of the target relationship data refers to the number of relationship data sources in the target relationship data, and the data sources have corresponding data source number weights; the number of the data sets refers to the number of the relational source data sets in the target relational data, and the number of the data sets is provided with corresponding data set number weights; the discovery times refer to the occurrence times of the relationships in the target relationship data, and the discovery times have corresponding discovery times weights. Optionally, the statistical reliability value is further obtained based on the relationship occurrence time and the corresponding time weight, and the specific formula is as follows:

statisticScore = dataSourceWeight * log_b1(dataSourceCount + 1) + dataSetWeight * log_b2(datasetCount + 1) + timeWeight * e^(-2a) + countWeight * log_b3(count + 1), where dataSourceWeight is the data source number weight, dataSetWeight is the data set number weight (not to be confused with the data set weight datasetWeight above), timeWeight is the time weight, and countWeight is the discovery times weight. dataSourceWeight, dataSetWeight, timeWeight and countWeight each lie between 0 and 1, sum to 1, and are dynamically configurable. dataSourceCount is the number of data sources, taken as the number of relationship data sources of the target relationship data; datasetCount is the number of data sets, taken as the number of relationship source data sets of the target relationship data; count is the number of discoveries, taken from the relationship occurrence times field of the target relationship data. The parameter a characterizes how credibility decays with time and is computed as (current timestamp in seconds - relationship occurrence time field value) / (number of seconds in 10 years); if the computed value of a is larger than 1, 1 is taken. For example, if the relationship occurrence time is 3 to 4 years before the current timestamp, a is smaller than 0.5 and the time term e^(-2a) is approximately 0.5, so the time-based credibility of the target relationship data is reduced to about half. b1, b2 and b3 are the corresponding logarithm bases, which can be dynamically configured by the user.
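The statistical reliability value can be sketched in Python as follows, under stated assumptions. The function names, the dictionary keys, the 365-day year used for "10 years", and the lower clamp of a at 0 are assumptions made for illustration; the text itself only specifies the upper cap of 1.

```python
import math

# Assumption: "10 years" is taken as 10 * 365 days, ignoring leap days.
SECONDS_IN_10_YEARS = 10 * 365 * 24 * 3600


def time_decay_parameter(now_seconds, occurrence_seconds):
    """Compute a = elapsed seconds / seconds in 10 years, capped at 1."""
    a = (now_seconds - occurrence_seconds) / SECONDS_IN_10_YEARS
    # The cap at 1 comes from the text; the floor at 0 is an added assumption.
    return min(max(a, 0.0), 1.0)


def statistic_score(data_source_count, dataset_count, count, a, weights, bases):
    """Weighted sum of three logarithmic count terms and one time-decay term."""
    return (weights["data_source"] * math.log(data_source_count + 1, bases["b1"])
            + weights["dataset"] * math.log(dataset_count + 1, bases["b2"])
            + weights["time"] * math.exp(-2.0 * a)
            + weights["count"] * math.log(count + 1, bases["b3"]))
```

With bases 3, 5 and 15, counts of 2, 4 and 14 make each logarithmic term equal to 1, so the bases effectively set the count at which each term saturates its weight.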

Specifically, the data set reliability value takes the maximum of all individual data set weights: datasetScore = max(singleDatasetWeight). Individual data set weights can be dynamically configured according to the confidence placed in each data set. The configuration of each weight is exemplified as follows:

<Item RelationType="identification card-phone number-relation" Enable="true" Desc="different reliability coefficient configuration corresponding to different target relation data">
  <Statistic Weight="0.8" Desc="statistical weight">
    <Field Key="F00009" Weight="0.5" Desc="data source count weight"/>
    <Field Key="F00010" Weight="0.3" Desc="source data set number weight"/>
    <Field Key="F00006" Weight="0.2" Desc="source data time weight"/>
    <Field Key="F00007" Weight="0.1" Desc="found times weight"/>
    <Param Key="b1" Weight="3" Desc="base of data source number coefficient"/>
    <Param Key="b2" Weight="5" Desc="base of data set number coefficient"/>
    <Param Key="b3" Weight="15" Desc="base of found number coefficient"/>
  </Statistic>
  <Dataset Weight="0.2" Desc="data set weight">
    <Data Desc="weight for individual dataset">
      <Dataset Code="0001" Name="shopping" Weight="0.5" Desc="shopping table weight is 0.5"/>
      <Dataset Code="0002" Name="getting" Weight="0.4" Desc="getting table weight is 0.4"/>
    </Data>
  </Dataset>
</Item>
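To illustrate how the data set reliability value could be derived from per-data-set weights such as those in the configuration example above, here is a minimal Python sketch. It assumes that only the data sets in which the relationship was found contribute to the maximum and that unknown data set codes fall back to a default weight of 0; neither detail is stated in the text, and the names are hypothetical.

```python
# Individual data set weights, mirroring the two entries of the example above.
DATASET_WEIGHTS = {"0001": 0.5, "0002": 0.4}


def dataset_score(source_dataset_codes, dataset_weights=DATASET_WEIGHTS,
                  default_weight=0.0):
    """Return the largest configured weight among the relationship's source data sets."""
    weights = [dataset_weights.get(code, default_weight)
               for code in source_dataset_codes]
    # max(..., default=...) covers a relationship with no recorded source data set.
    return max(weights, default=default_weight)
```

For a relationship found in both data sets 0001 and 0002, this yields 0.5, the higher of the two weights.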

S250, judging whether the reliability coefficient meets the preset coefficient threshold condition, if so, executing S260; if not, then S270 is performed.

The preset coefficient threshold is set in advance by the user. If the reliability coefficient is not smaller than the preset coefficient threshold, the reliability coefficient is judged to meet the preset coefficient threshold condition. The higher the preset coefficient threshold, the higher the user's requirement on the accuracy of the target relationship data stored in the distributed graph database. For example, the preset coefficient threshold may be set to 52: when the reliability coefficient is not less than 52, S260 is performed; when the reliability coefficient is less than 52, S270 is performed.

S260, storing the target relation data into a distributed graph database, and constructing a relation graph corresponding to the target relation data in the distributed graph database.

Specifically, when the reliability coefficient meets the preset coefficient threshold condition, the corresponding target relationship data is stored in the distributed graph database, and the relationship graph corresponding to the target relationship data is constructed in the distributed graph database. Optionally, before the target relationship data is stored in the distributed graph database, it is stored in the file system as historical relationship data, so that subsequently input original relationship data can be merged and deduplicated together with the historical relationship data in the file system. If no new original relationship data has arrived by the time the file system reaches the release condition, the historical relationship data in the file system is stored in the distributed graph database.

S270, discarding the target relation data.

Specifically, when the reliability coefficient does not meet the preset coefficient threshold condition, the corresponding target relationship data is discarded and not stored in the distributed graph database, so that a disordered relationship graph is prevented from being generated in the distributed graph database.
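The branch in S250 to S270 amounts to a simple filter. The following Python sketch separates records to be stored from records to be discarded; the record layout (a dictionary with a "reliability" entry) and the function name are assumptions, and the default threshold of 52 is taken from the example above.

```python
def split_by_threshold(records, threshold=52.0):
    """Split records into (stored, discarded) by their reliability coefficient."""
    stored, discarded = [], []
    for record in records:
        # "Not smaller than the threshold" satisfies the condition (S260).
        if record["reliability"] >= threshold:
            stored.append(record)
        else:
            discarded.append(record)  # S270: not written to the graph database
    return stored, discarded
```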

According to the technical scheme of this embodiment, the reliability coefficient of the target relationship data is calculated and stored in the reliability relationship field of the target relationship data, and only target relationship data whose reliability coefficient meets the preset coefficient threshold condition is stored in the distributed graph database, where the corresponding relationship graph is constructed. The relationship data in the resulting relationship graph therefore satisfies the reliability coefficient condition, which improves the reliability of the relationship data in the graph and the accuracy of applications built on it, and prevents large numbers of erroneous or entirely unrelated relationships from being connected together.

Example III

Fig. 3 is a flow chart of a relationship graph construction method according to a third embodiment of the present invention; this embodiment provides a preferred implementation on the basis of the foregoing embodiments. Terms that are the same as or correspond to those of the above embodiments are not explained again here. As shown in fig. 3, the method specifically includes the following steps:

S301, receiving each original data set.

S302, extracting the original relation data of each original data set according to the extraction strategy corresponding to each original data set.

S303, judging whether historical relation data exist, if so, executing S304; if not, S305 is performed.

Specifically, whether the historical relationship data exists is checked from the file system, and whether the historical relationship data exists can be judged according to the input file catalogue of the file system.

S304, loading the history relation data, and determining the history relation data and the original relation data as all relation data.

Specifically, the historical relationship data is read from the file system, and the set of the historical relationship data and the original relationship data is used as all relationship data to carry out subsequent grouping, merging and deduplication operations.
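Steps S303 and S304 can be sketched as follows. The patent does not specify the on-disk format of the historical relationship data, so this Python sketch assumes a hypothetical layout: one JSON object per line in `*.jsonl` files under an input directory. The function names are likewise illustrative.

```python
import json
from pathlib import Path


def load_history(input_dir):
    """Return historical records, or an empty list if none exist (S303)."""
    directory = Path(input_dir)
    if not directory.is_dir():
        return []
    records = []
    # Assumed layout: JSON Lines files, one relationship record per line.
    for path in sorted(directory.glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            records.extend(json.loads(line) for line in handle if line.strip())
    return records


def all_relationship_data(original_records, input_dir):
    """Combine historical and original relationship data (S304)."""
    return load_history(input_dir) + list(original_records)
```

When no historical data is found, the result is simply the original relationship data, which matches the branch from S303 directly to S305.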

S305, traversing all the relational data, and generating attribute key values according to the first category of the first object, the corresponding value of the first category, the second category of the second object, the corresponding value of the second category and the relation type between the first object and the second object of all the relational data.

S306, grouping all the relationship data according to the attribute key values to obtain a plurality of groups of intermediate relationship data.

S307, traversing each group of intermediate relationship data and judging whether the traversal is completed; if not, executing S308.

S308, merging and deduplicating each group of intermediate relationship data to obtain target relationship data.

S309, calculating the reliability coefficient of the target relation data, and storing the reliability coefficient in a reliability relation field of the target relation data.

S310, storing the target relation data into a distributed graph database according to a standard format, and constructing a relation graph corresponding to the target relation data in the distributed graph database.

By standardizing the target relationship data according to the standard format, the target relationship data stored in the distributed graph database meets the standard format requirement.
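The key generation and grouping of steps S305 and S306 can be sketched in Python as below. The field names are hypothetical stand-ins for the five key fields named in S305, and using a tuple as the attribute key (rather than a concatenated string) is a design choice made here to avoid separator collisions; the text does not prescribe a key encoding.

```python
from collections import defaultdict

# Hypothetical field names for the five key fields listed in S305.
KEY_FIELDS = ("first_category", "first_value",
              "second_category", "second_value", "relation_type")


def attribute_key(record):
    """Build the attribute key from the five key fields of a record."""
    return tuple(record[field] for field in KEY_FIELDS)


def group_by_attribute_key(records):
    """Group records that share an attribute key into intermediate groups."""
    groups = defaultdict(list)
    for record in records:
        groups[attribute_key(record)].append(record)
    return dict(groups)
```

Records from different data sets that describe the same pair of objects and the same relationship type end up in one group, which is then merged and deduplicated in S308.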

According to the technical scheme of this embodiment, the original relationship data of each original data set is extracted and it is judged whether historical relationship data exists; if so, the historical relationship data and the original relationship data together are determined as all relationship data. All relationship data is traversed to generate the corresponding attribute key values and is grouped according to those key values to obtain multiple groups of intermediate relationship data. Each group of intermediate relationship data is merged and deduplicated to obtain target relationship data, and the reliability coefficient of the target relationship data is calculated and stored in its reliability relationship field, which yields valuable relationship information. The target relationship data is stored in the distributed graph database according to the standard format, and the corresponding relationship graph is constructed there. This realizes storage based on a distributed, extensible graph structure, solves the problems that relationship data is large in volume and difficult to extend, and ensures the timeliness of data processing.

Example IV

Fig. 4 is a schematic structural diagram of a relationship map construction device according to a fourth embodiment of the present invention, where the embodiment is applicable to a situation of extracting relationship information of multiple data sets and establishing a relationship map based on a distributed graph database, and the device specifically includes: an original relationship data extraction module 410, an intermediate relationship data acquisition module 420, a target relationship data acquisition module 430, and a target relationship data storage module 440.

The original relationship data extraction module 410 is configured to receive each original data set, and extract original relationship data of each original data set according to an extraction policy corresponding to each original data set;

the intermediate relationship data obtaining module 420 is configured to group the original relationship data and the historical relationship data according to the attribute key value of the original relationship data and the attribute key value of the historical relationship data, so as to obtain multiple groups of intermediate relationship data;

the target relationship data obtaining module 430 is configured to merge and deduplicate each set of intermediate relationship data to obtain target relationship data;

the target relationship data storage module 440 is configured to store the target relationship data into a distributed graph database, and construct a relationship graph corresponding to the target relationship data in the distributed graph database.

In this embodiment, each original data set is received and its original relationship data is extracted according to the extraction policy corresponding to that data set. The original relationship data and the historical relationship data are grouped according to their attribute key values to obtain multiple groups of intermediate relationship data, and each group is merged and deduplicated to obtain target relationship data, which yields valuable relationship information. The target relationship data is stored in a distributed graph database, where the corresponding relationship graph is constructed. This realizes storage based on a distributed, extensible graph structure, solves the problems that relationship data is large in volume and difficult to extend, and ensures the timeliness of data processing.

On the basis of the device, the device can optionally further comprise a reliability coefficient calculation module, which is used for calculating the reliability coefficient of the target relationship data and storing the reliability coefficient in the reliability relationship field of the target relationship data. Accordingly, the target relationship data storage module 440 is configured to store target relationship data whose reliability coefficient meets a preset coefficient threshold condition to the distributed graph database, and construct a relationship graph corresponding to the target relationship data in the distributed graph database.

Optionally, the reliability coefficient calculation module is configured to weight and calculate a reliability coefficient based on a statistical reliability value of the target relationship data and a reliability value of the data set; the statistical reliability value is obtained by weighting and calculating the number of data sources, the number of data sets and the discovery times of the target relation data, and the data set reliability value is determined based on the weight maximum value of each data set.

Optionally, the target relational data acquiring module 430 is further configured to determine a relational occurrence time field value of each relational data in each set of intermediate relational data, and perform deduplication processing on the intermediate relational data based on the relational occurrence time field value; counting the relation occurrence time field value after the duplication elimination treatment to determine the relation occurrence days in the target relation data; and determining the maximum relationship occurrence time field value based on the relationship occurrence time field value after the deduplication processing, and determining the relationship occurrence time field value in the target relationship data based on the maximum relationship occurrence time field value.

Optionally, the target relationship data obtaining module 430 is further configured to add up the relationship occurrence number field values based on the relationship occurrence number field values of each relationship data in each set of intermediate relationship data, to obtain a relationship occurrence number field value in the target relationship data; and performing de-duplication processing on the intermediate relationship data based on the relationship data source and the relationship source data set type of each group of intermediate relationship data to obtain the relationship data source and the relationship source data set type corresponding to the target relationship data.
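The merging rules in the two preceding paragraphs can be sketched as one Python function over a single group of intermediate relationship data. The field names are hypothetical, and the sketch assumes that the relationship occurrence time is stored at day granularity as an ISO date string (YYYY-MM-DD), so that distinct values count days and the maximum is the latest date; the text does not fix the time format.

```python
def merge_group(group):
    """Merge one group of intermediate records into a target record."""
    if not group:
        raise ValueError("group must contain at least one record")
    # Deduplicate occurrence times; assumed to be ISO dates at day granularity.
    times = {record["occurrence_time"] for record in group}
    merged = dict(group[0])  # key fields are identical within a group
    merged.update({
        "occurrence_time": max(times),   # latest occurrence time
        "occurrence_days": len(times),   # number of distinct days
        "occurrence_count": sum(r["occurrence_count"] for r in group),
        "data_sources": sorted({s for r in group for s in r["data_sources"]}),
        "dataset_types": sorted({t for r in group for t in r["dataset_types"]}),
    })
    return merged
```

Because a historical record has the same shape as a new record under this layout, merged output from an earlier run can be fed back in as one member of a later group.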

The relationship graph construction device provided by the embodiment of the invention can execute the relationship graph construction method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

It should be noted that, the units and modules included in the above system are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the embodiments of the present invention.

Example five

Fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention. Fig. 5 shows a block diagram of an exemplary electronic device 50 suitable for use in implementing the embodiments of the present invention. The electronic device 50 shown in fig. 5 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.

As shown in fig. 5, the electronic device 50 is embodied in the form of a general purpose computing device. Components of electronic device 50 may include, but are not limited to: one or more processors or processing units 501, a system memory 502, and a bus 503 that connects the various system components (including the system memory 502 and processing units 501).

Bus 503 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

Electronic device 50 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by electronic device 50 and includes both volatile and nonvolatile media, removable and non-removable media.

The system memory 502 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 504 and/or cache memory 505. Electronic device 50 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 506 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in fig. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 503 through one or more data medium interfaces. Memory 502 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.

A program/utility 508 having a set (at least one) of program modules 507 may be stored, for example, in memory 502, such program modules 507 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 507 typically perform the functions and/or methods of the described embodiments of the invention.

The electronic device 50 may also communicate with one or more external devices 509 (e.g., keyboard, pointing device, display 510, etc.), one or more devices that enable a user to interact with the electronic device 50, and/or any device (e.g., network card, modem, etc.) that enables the electronic device 50 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 511. Also, the electronic device 50 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through a network adapter 512. As shown, the network adapter 512 communicates with other modules of the electronic device 50 over the bus 503. It should be appreciated that although not shown in fig. 5, other hardware and/or software modules may be used in connection with electronic device 50, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.

The processing unit 501 executes various functional applications and data processing by running a program stored in the system memory 502, for example, to implement a relationship map construction method provided by an embodiment of the present invention, the method includes:

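receiving each original data set, and extracting the original relation data of each original data set according to the extraction strategy corresponding to each original data set;

grouping the original relationship data and the historical relationship data according to the attribute key value of the original relationship data and the attribute key value of the historical relationship data to obtain a plurality of groups of intermediate relationship data;

merging and de-duplicating each group of intermediate relationship data to obtain target relationship data;

and storing the target relation data into a distributed graph database, and constructing a relation map corresponding to the target relation data in the distributed graph database.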

Of course, those skilled in the art will understand that the processor may also implement the technical solution of the relationship map construction method provided in any embodiment of the present invention.

Example six

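The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a relationship graph construction method as provided by any embodiment of the present invention, the method comprising:

receiving each original data set, and extracting the original relation data of each original data set according to the extraction strategy corresponding to each original data set;

grouping the original relationship data and the historical relationship data according to the attribute key value of the original relationship data and the attribute key value of the historical relationship data to obtain a plurality of groups of intermediate relationship data;

merging and de-duplicating each group of intermediate relationship data to obtain target relationship data;

and storing the target relation data into a distributed graph database, and constructing a relation map corresponding to the target relation data in the distributed graph database.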

The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for embodiments of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims

1. A relationship graph construction method, characterized by comprising the following steps:

Storing the target relation data into a distributed graph database, and constructing a relation graph corresponding to the target relation data in the distributed graph database;

the history relation data is corresponding relation data extracted from an original data set input in a history mode;

the attribute key value is generated by key fields in the original relation data and the historical relation data;

the original relation data and the history relation data respectively comprise a first category of a first object, a corresponding value of the first category, a second category of a second object, a corresponding value of the second category, a relation type between the first object and the second object, relation occurrence time, relation occurrence times, relation occurrence days, relation data sources, relation source data set types and reliability coefficient fields;

the attribute key value of the original relationship data is determined based on a first category of the first object, a corresponding value of the first category, a second category of the second object, a corresponding value of the second category, and a relationship type between the first object and the second object.

2. The method of claim 1, further comprising, prior to said storing said target relationship data to a distributed graph database:

Calculating the reliability coefficient of the target relation data, and storing the reliability coefficient in a reliability relation field of the target relation data;

judging whether the reliability coefficient meets a preset coefficient threshold condition or not;

and if not, discarding the target relation data.

3. The method of claim 2, wherein the reliability coefficient is calculated based on a statistical reliability value and a dataset reliability value weighting of the target relationship data; the statistical reliability value is obtained by weighting and calculating the number of data sources, the number of data sets and the discovery times of the target relation data, and the data set reliability value is determined based on the weight maximum value of each data set.

4. The method of claim 1, wherein merging and deduplicating each set of the intermediate relationship data to obtain target relationship data comprises:

determining a relationship occurrence time field value of each relationship data in each group of intermediate relationship data, and performing de-duplication processing on the intermediate relationship data based on the relationship occurrence time field value;

counting the relation occurrence time field value after the duplication elimination treatment to determine the relation occurrence days in the target relation data;

And determining the maximum relationship occurrence time field value based on the relationship occurrence time field value after the deduplication processing, and determining the relationship occurrence time field value in the target relationship data based on the maximum relationship occurrence time field value.

5. The method of claim 4, wherein merging and deduplicating each set of the intermediate relationship data to obtain target relationship data, further comprising:

based on the relation occurrence number field values of each relation data in each group of intermediate relation data, accumulating and adding all relation occurrence number field values to obtain the relation occurrence number field values in the target relation data;

and performing de-duplication processing on the intermediate relationship data based on the relationship data source and the relationship source data set type of each group of intermediate relationship data to obtain the relationship data source and the relationship source data set type corresponding to the target relationship data.

6. A relationship map construction apparatus, comprising:

the target relation data storage module is used for storing the target relation data into a distributed graph database, and constructing a relation graph corresponding to the target relation data in the distributed graph database;

7. An electronic device, the electronic device comprising:

one or more processors;

storage means for storing one or more programs,

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the relationship graph construction method as claimed in any one of claims 1-5.

8. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the relationship graph construction method as claimed in any one of claims 1-5.