CN116010670A - Data catalog recommendation method, device and application based on data blood relationship

Info

Publication number: CN116010670A
Application number: CN202211694228.6A
Authority: CN
Inventors: 郁强; 陶阳; 黄红叶; 赵军辉
Original assignee: CCI China Co Ltd
Current assignee: CCI China Co Ltd
Priority date: 2022-12-28
Filing date: 2022-12-28
Publication date: 2023-04-25

Abstract

The invention provides a data catalog recommendation method, device and application based on data blood relationship (data lineage). The method comprises: acquiring catalog information of at least one data catalog and the data blood relationship among different data catalogs, wherein the catalog information comprises catalog labels, catalog names and catalog fields; selecting a base data catalog from the data catalogs, the unselected data catalogs serving as candidate data catalogs; calculating the catalog similarity between each candidate data catalog and the base data catalog based on the catalog information and the data blood relationship, wherein the catalog similarity is obtained by weighting the label similarity, the name similarity, the field similarity and the blood relationship similarity; and selecting candidate data catalogs as recommended data catalogs in descending order of catalog similarity. The method takes full account of the value of the data blood relationship and is suited to recommending data catalogs that have complex data blood relationships and come from a wide range of sources.

Description

Data catalog recommendation method, device and application based on data blood relationship

Technical Field

The invention relates to the field of data catalog recommendation, in particular to a data catalog recommendation method, device and application based on data blood relationship.

Background

A data catalog is a listing of all data assets in an organization that helps data professionals find the most relevant data for any analysis or business purpose. It provides the information needed to evaluate whether data is suitable for an intended use, and it helps analysts and other data users find the target data they need for a particular purpose. With the development of informatization, each industry has formed data catalogs to standardize data use. Data standards are promoted through the data catalog, and existing data assets can be provided in catalog form, which ensures that data is interconnected and maximizes the value of data use.

At present, data catalogs are mainly recommended in two ways: collaborative filtering recommendation and content-based recommendation. Collaborative filtering recommendation relies on analyzing and processing users' usage habits. In some specific usage scenarios, however, users do not use data catalogs heavily, so the system cannot capture a large amount of user behavior and cannot recommend data catalogs by collaborative filtering. Content-based recommendation matches data catalogs by the similarity of their labels. It can only offer users data catalogs with similar content and cannot take the relevance among data catalogs into account, so its recommendation results are poor.

In summary, existing data catalog recommendation methods do not consider the relevance between data catalogs, which leads to poor recommendation results.

Disclosure of Invention

The invention aims to provide a data catalog recommendation method, device and application based on data blood relationship, in which the data blood relationship of the data catalogs is introduced as a recommendation influence factor. The value of the data blood relationship is fully considered, and the method and device are suited to recommending data catalogs that have complex data blood relationships and come from a wide range of sources.

To achieve the above object, in a first aspect the present solution provides a data catalog recommendation method based on data blood relationship, including: acquiring catalog information of at least one data catalog and the data blood relationship among different data catalogs, wherein the catalog information comprises catalog labels, catalog names and catalog fields; selecting a base data catalog from the data catalogs, the unselected data catalogs serving as candidate data catalogs; calculating the catalog similarity between each candidate data catalog and the base data catalog based on the catalog information and the data blood relationship, wherein the catalog similarity is obtained by weighting the label similarity, the name similarity, the field similarity and the blood relationship similarity; and selecting candidate data catalogs as recommended data catalogs in descending order of catalog similarity.

In a second aspect, the present solution provides a data catalog recommendation device based on data blood relationship, including: a data catalog acquisition unit for acquiring catalog information of at least one data catalog and the data blood relationship among different data catalogs, wherein the catalog information comprises catalog labels, catalog names and catalog fields; a selecting unit for selecting a base data catalog from the data catalogs, the unselected data catalogs serving as candidate data catalogs; a catalog similarity calculating unit for calculating, based on the catalog information and the data blood relationship, the catalog similarity between each candidate data catalog and the base data catalog, wherein the catalog similarity is obtained by weighting the label similarity, the name similarity, the field similarity and the blood relationship similarity; and a recommending unit for selecting candidate data catalogs as recommended data catalogs in descending order of catalog similarity.

A third aspect provides an electronic device comprising a memory and a processor, the memory storing a computer program and the processor being arranged to run the computer program to perform the data catalog recommendation method based on data blood relationship.

A fourth aspect provides a readable storage medium storing a computer program, the computer program comprising program code for controlling a process to execute a process that comprises the data catalog recommendation method based on data blood relationship.

Compared with the prior art, the technical scheme has the following characteristics and beneficial effects:

1. The scheme introduces the data blood relationship among data catalogs as a recommendation influence factor for data catalog recommendation and adds a representation of the data blood relationship. This addresses the problem that traditional data catalog recommendation algorithms do not consider how data flows between catalogs, and improves recommendation accuracy.

2. The data blood relationship generated during data governance and data warehouse construction is stored in a HugeGraph graph database, so that the levels of the data blood relationship and the consistency of the blood relationship can be calculated conveniently and quickly.

3. Unlike a collaborative filtering recommendation algorithm, the scheme calculates the final overall similarity from multi-dimensional recommendation influence factors, namely the catalog name, catalog label, catalog field name and data blood relationship, and displays the catalogs with the highest similarity as recommendations. This takes into account both the content of a data catalog and its data blood relationship, and avoids the cold start problem of collaborative filtering recommendation, which cannot recommend without a large amount of usage data. That is, the recommendation method of the scheme is based entirely on the data of the data catalogs themselves, and lets a user see the data catalogs most relevant to the one currently being viewed.

4. The scheme also introduces a deep neural network to learn the parameters of the data blood relationship and content-based factors, dynamically adjusting the data catalog recommendation algorithm to achieve the expected effect. In other words, a machine learning linear regression algorithm is used to avoid fixed coefficients and to reduce the inaccuracy of manually set coefficients, so that the data catalog recommendation method is more scientific and accurate.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

FIG. 1 is a flow chart of a data catalog recommendation method based on data blood relationship according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a data catalog recommendation method based on data blood relationship according to one embodiment of the present application;

FIG. 3 is a block diagram of a data catalog recommendation device based on data blood relationship according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.

It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.

Example 1

The scheme provides a data catalog recommendation method based on data blood relationship. It is applied to recommending data catalogs with complex relationships among them, and it avoids the cold start defect of collaborative filtering recommendation because it does not need to collect users' usage habits.

Specifically, as shown in fig. 1 and fig. 2, the data catalog recommendation method based on the data blood relationship provided by the present scheme includes the following steps:

acquiring catalog information of at least one data catalog and the data blood relationship among different data catalogs, wherein the catalog information comprises catalog labels, catalog names and catalog fields;

selecting a base data catalog from the data catalogs, the unselected data catalogs serving as candidate data catalogs;

calculating the catalog similarity between each candidate data catalog and the base data catalog based on the catalog information and the data blood relationship, wherein the catalog similarity is obtained by weighting the label similarity, the name similarity, the field similarity and the blood relationship similarity;

and selecting candidate data catalogs as recommended data catalogs in descending order of catalog similarity.

In the step of acquiring the catalog information of at least one data catalog and the data blood relationship among different data catalogs, raw data is acquired from a data source, data governance is performed on the raw data to obtain database tables having the data blood relationship, the database tables are cataloged to obtain the corresponding data catalogs, and the corresponding catalog information is constructed.

In some embodiments, the raw data comes from multiple data sources. In particular, for data catalog recommendation across units and departments, the raw data originates from multiple units and departments and is characterized by a huge data volume and complex data blood relationships.

In some embodiments, data governance of the raw data includes data cleansing, data extraction and data conversion, which convert the raw data into various database tables. It should be noted that, during data warehouse construction, database tables at different levels are converted into one another. The data of one database table often comes from several other database tables, in which case that table is downstream of those tables; and the data of one database table sometimes flows to several other database tables, in which case that table is upstream of those tables. The database tables obtained after data governance therefore have data blood relationships with one another.

The scheme uses a HugeGraph graph database to store the data blood relationship among the database tables. The graph database clearly expresses the levels and associations among the database tables, so that the corresponding data blood relationship can be conveniently retrieved later.

A manager catalogs the database tables to obtain the data catalogs. When cataloging, the manager selects the catalog labels and edits the catalog name and catalog fields of the data catalog, which yields the catalog information of the current data catalog. Because the data items of a data catalog are all generated from database tables after data governance, the data blood relationship of the database tables can be used as the data blood relationship of the data catalogs.

In some embodiments, the catalog names and catalog labels of the data catalogs are configured according to service requirements; data catalogs of the same type share the same catalog label, and each data catalog has at least one catalog label.

A user selects a base data catalog to be published from the data catalogs, or the system selects any one of the data catalogs as the base data catalog; all other, unselected data catalogs serve as candidate data catalogs. The scheme selects recommended data catalogs from the candidate data catalogs using the catalog label, catalog name, catalog field and data blood relationship as recommendation influence factors.

Specifically, in the step of calculating the catalog similarity between each candidate data catalog and the base data catalog, the label similarity, name similarity and field similarity between each candidate data catalog and the base data catalog are calculated based on the catalog information, the blood relationship similarity between each candidate data catalog and the base data catalog is calculated based on the data blood relationship, and the label similarity, name similarity, field similarity and blood relationship similarity are weighted and summed to obtain the catalog similarity of each candidate data catalog.
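As an illustration of the upstream and downstream relationships described above, the following minimal sketch keeps table-level lineage edges in plain in-memory dictionaries. It does not use the HugeGraph API; the table names and the `build_lineage` helper are hypothetical and serve only to show the idea.

```python
from collections import defaultdict

# Hypothetical table-level lineage edges: (source_table, target_table).
# Data flows from the source table into the target table.
edges = [
    ("ods_person", "dwd_person"),
    ("ods_address", "dwd_person"),
    ("dwd_person", "ads_resident"),
    ("dwd_person", "ads_household"),
]


def build_lineage(edges):
    """Index lineage edges so upstream and downstream tables can be looked up."""
    upstream = defaultdict(set)    # table -> tables its data comes from
    downstream = defaultdict(set)  # table -> tables its data flows to
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return upstream, downstream


upstream, downstream = build_lineage(edges)
print(sorted(upstream["dwd_person"]))    # tables that feed dwd_person
print(sorted(downstream["dwd_person"]))  # tables fed by dwd_person
```

Here `dwd_person` is downstream of the two `ods_` tables and upstream of the two `ads_` tables, matching the description of tables that draw from, and flow to, several other tables.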

Specifically, the similarity between all catalog labels of each candidate data catalog and all catalog labels of the base data catalog is calculated to obtain the label similarity, and the calculation formula is as follows:

where $J_1(A_1, B_1)$ is the label similarity, $A_1$ is the set of all catalog labels of the base data catalog, and $B_1$ is the set of all catalog labels of the candidate data catalog.
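The formula itself does not appear in this text. As a sketch only, the set-based notation suggests a Jaccard coefficient (size of the intersection divided by size of the union); that choice is an assumption here, not something the text confirms. The label values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets: treated as no similarity in this sketch
    return len(a & b) / len(union)


# Hypothetical catalog labels.
base_labels = {"population", "public security"}
candidate_labels = {"population", "civil affairs", "public security"}

label_similarity = jaccard(base_labels, candidate_labels)
print(label_similarity)  # 2 shared labels out of 3 distinct labels
```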

Specifically, the similarity between the word segments of the catalog name of the candidate data catalog and the word segments of the catalog name of the base data catalog is calculated to obtain the name similarity, and the calculation formula is as follows:

where $J_2(A_2, B_2)$ is the name similarity, $A_2$ is the set of word segments of the catalog name of the base data catalog, and $B_2$ is the set of word segments of the catalog name of the candidate data catalog.

In the scheme, the catalog name of a data catalog is segmented according to an n-gram language model, where n may be 2 or another value.
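The n-gram segmentation with n = 2 can be sketched as follows. The sketch assumes character-level n-grams and, as above, a Jaccard coefficient over the resulting segment sets; both are assumptions, and the catalog names are hypothetical.

```python
def ngrams(text, n=2):
    """Split a catalog name into its set of character n-grams."""
    text = text.strip()
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def name_similarity(base_name, candidate_name, n=2):
    """Assumed Jaccard coefficient over the n-gram sets of two names."""
    a, b = ngrams(base_name, n), ngrams(candidate_name, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical catalog names.
base_name = "人口基本信息"   # bigrams: 人口, 口基, 基本, 本信, 信息
candidate_name = "人口信息"  # bigrams: 人口, 口信, 信息

print(sorted(ngrams(candidate_name)))
print(name_similarity(base_name, candidate_name))  # 2 shared of 6 distinct
```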

Specifically, the similarity between all catalog fields of the candidate data catalog and all catalog fields of the base data catalog is calculated to obtain the field similarity, and the calculation formula is as follows:

where $J_3(A_3, B_3)$ is the field similarity, $A_3$ is the set of all catalog fields of the base data catalog, and $B_3$ is the set of all catalog fields of the candidate data catalog.

In the scheme, the catalog fields of a data catalog are bilingual, each with a Chinese name and an English name. Because catalog fields may coincide in only one of the two languages, whether two catalog fields are the same cannot be judged from the Chinese name or the English name alone, so the field similarity is determined by combining the Chinese and English names of the catalog fields.
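One way to combine the Chinese and English names is to treat each field as a (Chinese name, English name) pair, so that two fields match only when both parts agree. The sketch below assumes this pairing, assumes case-insensitive English names, and again assumes a Jaccard coefficient; the field names are hypothetical.

```python
def field_key(cn_name, en_name):
    """Combine the Chinese and English names into one comparison key."""
    # Lower-casing the English name is an assumption of this sketch.
    return (cn_name.strip(), en_name.strip().lower())


def field_similarity(base_fields, candidate_fields):
    """Assumed Jaccard coefficient over (Chinese, English) field keys."""
    a = {field_key(cn, en) for cn, en in base_fields}
    b = {field_key(cn, en) for cn, en in candidate_fields}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical catalog fields as (Chinese name, English name) pairs.
base_fields = [("姓名", "name"), ("身份证号", "id_no"), ("地址", "address")]
candidate_fields = [("姓名", "NAME"), ("名称", "name"), ("地址", "address")]

# ("名称", "name") shares only its English name with ("姓名", "name"),
# so it does not count as the same field.
print(field_similarity(base_fields, candidate_fields))
```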

Specifically, the number of candidate data blood relationship fields of the database table corresponding to the candidate data catalog is obtained based on the data blood relationship, the number of base data blood relationship fields of the database table corresponding to the base data catalog is obtained based on the data blood relationship, the interval level between the candidate data catalog and the base data catalog is obtained based on the data blood relationship, and the blood relationship similarity is calculated based on the number of candidate data blood relationship fields, the number of base data blood relationship fields and the interval level.

Specifically, if the interval level between the candidate data directory and the base data directory is i, the similarity distance between the candidate data directory and the base data directory is

In addition, if a data blood relationship field still has a field-level data blood relationship after flowing through i rounds of database tables, the data blood relationship fields are considered identical. The scheme therefore calculates the similarity based on the number of candidate data blood relationship fields and the number of base data blood relationship fields.

Specifically, the formula for calculating the blood relationship similarity between each candidate data catalog and the base data catalog based on the data blood relationship is as follows:

where $J_4(A_4, B_4)$ is the blood relationship similarity between each candidate data catalog and the base data catalog, $i$ is the interval level between the candidate data catalog and the base data catalog, $A_4$ is the number of base data blood relationship fields of the database table corresponding to the base data catalog, and $B_4$ is the number of candidate data blood relationship fields of the database table corresponding to the candidate data catalog.
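The interval level i between two catalogs could be obtained, for instance, by a breadth-first search over the table-level lineage graph. The sketch below assumes that the interval level is the smallest number of lineage hops between the two corresponding tables and that edges may be followed in either direction; neither assumption is stated in the text. It does not compute the blood relationship similarity itself, since that formula does not appear here.

```python
from collections import defaultdict, deque


def interval_level(edges, start, target):
    """Smallest number of lineage hops between two tables, or None if unrelated."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        # Follow lineage both upstream and downstream (assumption).
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return depth
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None


# Hypothetical lineage edges: (source_table, target_table).
lineage_edges = [
    ("ods_person", "dwd_person"),
    ("dwd_person", "ads_resident"),
    ("ods_vehicle", "dwd_vehicle"),
]

print(interval_level(lineage_edges, "ods_person", "ads_resident"))
print(interval_level(lineage_edges, "ods_person", "dwd_vehicle"))
```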

In the step of weighting and summing the label similarity, name similarity, field similarity and blood relationship similarity to obtain the catalog similarity of each candidate data catalog, each of the four similarities is given a corresponding weight, and the four weights sum to 1.

Specifically, in some embodiments the catalog similarity is calculated as follows:

$$J(A, B) = a\,J_1(A_1, B_1) + b\,J_2(A_2, B_2) + c\,J_3(A_3, B_3) + d\,J_4(A_4, B_4)$$

where $a$, $b$, $c$, $d$ are the weights of the label similarity, name similarity, field similarity and blood relationship similarity respectively, with $a + b + c + d = 1$, and $J(A, B)$ is the catalog similarity.

In some embodiments, the weights of the label similarity, name similarity, field similarity and blood relationship similarity are each preset to 0.25.

In the step of selecting candidate data catalogs as recommended data catalogs, the candidate data catalogs are sorted in descending order of catalog similarity and the top-ranked candidate data catalogs are selected as the recommended data catalogs.
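The weighted sum and the ranking step can be sketched as follows, using the preset weights of 0.25. The four component similarities are taken as given inputs; the catalog names, the component values and the `top_k` parameter are hypothetical.

```python
# Preset weights from the embodiment: each of the four factors is 0.25.
WEIGHTS = {"label": 0.25, "name": 0.25, "field": 0.25, "lineage": 0.25}


def catalog_similarity(parts, weights=WEIGHTS):
    """Weighted sum of the four component similarities."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[key] * parts[key] for key in weights)


def recommend(candidates, top_k=2, weights=WEIGHTS):
    """Rank candidate catalogs by catalog similarity, highest first."""
    scored = [
        (name, catalog_similarity(parts, weights))
        for name, parts in candidates.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical component similarities of three candidates against one base catalog.
candidates = {
    "catalog_a": {"label": 1.0, "name": 0.5, "field": 0.5, "lineage": 0.0},
    "catalog_b": {"label": 0.5, "name": 0.5, "field": 0.5, "lineage": 1.0},
    "catalog_c": {"label": 0.0, "name": 0.25, "field": 0.0, "lineage": 0.25},
}

for name, score in recommend(candidates):
    print(name, score)
```

In this example `catalog_b` ranks above `catalog_a` although its labels match less well, because its blood relationship similarity is higher; this is the effect the fourth factor is meant to have.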

In addition, in other embodiments, a deep neural network is introduced to dynamically learn the weights of the label similarity, name similarity, field similarity and blood relationship similarity. Specifically, a machine learning multiple regression algorithm is adopted to establish the relationship between the weights of the label similarity, name similarity, field similarity and blood relationship similarity and the catalog similarity. The weights are dynamically adjusted according to whether recommended data catalogs are actually clicked, so that the data catalog recommendation method is continuously optimized during use.

Specifically, the machine learning multiple regression algorithm constructs a model of the mapping between the weights of the label similarity, name similarity, field similarity and blood relationship similarity and the catalog similarity, and adjusts the function parameters during dynamic learning so that the mapping is fitted as well as possible.

Example 2
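A minimal sketch of learning the four weights from click feedback is given below. It assumes a plain multiple linear regression fitted by batch gradient descent, with the click (1) or no-click (0) outcome as the regression target; clipping negative weights and renormalizing them to sum to 1 is an additional assumption made so that the result can be used in the weighted sum. The training data is hypothetical.

```python
def fit_weights(samples, clicks, lr=0.1, epochs=500):
    """Least-squares fit of four weights by batch gradient descent."""
    w = [0.25, 0.25, 0.25, 0.25]  # start from the preset weights
    n = len(samples)
    for _ in range(epochs):
        grad = [0.0] * 4
        for x, y in zip(samples, clicks):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(4):
                grad[j] += 2 * err * x[j] / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    # Keep weights non-negative and summing to 1 (assumption of this sketch).
    w = [max(wi, 0.0) for wi in w]
    total = sum(w)
    return [wi / total for wi in w] if total > 0 else [0.25] * 4


# Hypothetical feedback: (label, name, field, lineage) similarities of a
# recommended catalog, and whether the user clicked it (1) or not (0).
samples = [
    (1.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 0.0),
    (0.0, 1.0, 0.0, 1.0), (0.0, 1.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 0.0),
]
clicks = [1, 0, 1, 0, 1, 0]

learned = fit_weights(samples, clicks)
print([round(value, 3) for value in learned])
```

In this toy data clicks follow the blood relationship similarity exactly, so the learned weight for that factor dominates the other three.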

Based on the same conception, referring to fig. 3, the present application further provides a data catalog recommendation device based on data blood relationship, including:

a data catalog acquisition unit for acquiring catalog information of at least one data catalog and the data blood relationship among different data catalogs, wherein the catalog information comprises catalog labels, catalog names and catalog fields;

a selecting unit for selecting a base data catalog from the data catalogs, the unselected data catalogs serving as candidate data catalogs;

a catalog similarity calculating unit for calculating, based on the catalog information and the data blood relationship, the catalog similarity between each candidate data catalog and the base data catalog, wherein the catalog similarity is obtained by weighting the label similarity, the name similarity, the field similarity and the blood relationship similarity;

and a recommending unit for selecting candidate data catalogs as recommended data catalogs in descending order of catalog similarity.

The technical matters of the second embodiment that are the same as those of the first embodiment are not repeated here.

Example 3

The present embodiment also provides an electronic device, referring to fig. 4, comprising a memory 304 and a processor 302, the memory 304 having stored therein a computer program, the processor 302 being arranged to run the computer program to perform the steps of any of the embodiments of the data catalog recommendation method based on data blood relationship described above.

In particular, the processor 302 may include a central processing unit (CPU) or an application specific integrated circuit (ASIC), or may be configured as one or more integrated circuits that implement embodiments of the present application.

Memory 304 may include mass storage for data or instructions. By way of example and not limitation, memory 304 may comprise a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 304 may include removable or non-removable (fixed) media, where appropriate. Memory 304 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, memory 304 is non-volatile memory. In particular embodiments, memory 304 includes read-only memory (ROM) and random access memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these. Where appropriate, the RAM may be static random access memory (SRAM) or dynamic random access memory (DRAM), and the DRAM may be fast page mode DRAM (FPMDRAM), extended data output DRAM (EDODRAM), synchronous DRAM (SDRAM), or the like.

Memory 304 may be used to store or cache various data files that need to be processed and/or communicated, as well as possible computer program instructions for execution by processor 302.

The processor 302 implements any of the data catalog recommendation methods based on data blood relationship in the above-described embodiments by reading and executing the computer program instructions stored in the memory 304.

Optionally, the electronic apparatus may further include a transmission device 306 and an input/output device 308, where the transmission device 306 is connected to the processor 302, and the input/output device 308 is connected to the processor 302.

The transmission device 306 may be used to receive or transmit data via a network. Specific examples of the network may include a wired or wireless network provided by a communication provider of the electronic device. In one example, the transmission device 306 includes a network interface controller (NIC) that can connect to other network devices through a base station so as to communicate with the internet. In one example, the transmission device 306 may be a radio frequency (RF) module used to communicate with the internet wirelessly.

The input/output device 308 is used to input or output information. In this embodiment, the input information may be a data catalog, source data, or the like, and the output information may be a recommended data catalog or the like.

Alternatively, in the present embodiment, the above-mentioned processor 302 may be configured to execute the following steps by a computer program:

It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and this embodiment is not repeated herein.

In general, the various embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

Embodiments of the invention may be implemented by computer software executable by a data processor of a mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products) including software routines, applets, and/or macros can be stored in any apparatus-readable data storage medium and they include program instructions for performing particular tasks. The computer program product may include one or more computer-executable components configured to perform embodiments when the program is run. The one or more computer-executable components may be at least one software code or a portion thereof. In addition, in this regard, it should be noted that any blocks of the logic flows as illustrated may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on a physical medium such as a memory chip or memory block implemented within a processor, a magnetic medium such as a hard disk or floppy disk, and an optical medium such as, for example, a DVD and its data variants, a CD, etc. The physical medium is a non-transitory medium.

It should be understood by those skilled in the art that the technical features of the above embodiments may be combined in any manner, and for brevity, all of the possible combinations of the technical features of the above embodiments are not described, however, they should be considered as being within the scope of the description provided herein, as long as there is no contradiction between the combinations of the technical features.

The foregoing examples merely represent several embodiments of the present application, the description of which is more specific and detailed and which should not be construed as limiting the scope of the present application in any way. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims

1. The data catalog recommending method based on the data blood relationship is characterized by comprising the following steps of:

2. The data catalog recommendation method based on data blood relationship according to claim 1, wherein in the step of acquiring catalog information of at least one data catalog and the data blood relationship among different data catalogs, raw data is acquired from a data source, data governance is performed on the raw data to obtain database tables having the data blood relationship, the database tables are cataloged to obtain the corresponding data catalogs, and the corresponding catalog information is constructed.

3. The data catalog recommendation method based on data blood relationship according to claim 1, wherein the similarity between all catalog labels of each candidate data catalog and all catalog labels of the basic data catalog is calculated to obtain label similarity.

4. The data catalog recommendation method based on data blood relationship according to claim 1, wherein the similarity between the word segments of the catalog name of the candidate data catalog and the word segments of the catalog name of the base data catalog is calculated to obtain the name similarity.

5. The data catalog recommendation method based on data blood relationship according to claim 1, wherein the similarity between all catalog fields of the candidate data catalog and all catalog fields of the base data catalog is calculated to obtain field similarity.

6. The data catalog recommendation method based on data blood relationship according to claim 1, wherein the number of candidate data blood relationship fields of the database table corresponding to the candidate data catalog is obtained based on the data blood relationship, the number of base data blood relationship fields of the database table corresponding to the base data catalog is obtained based on the data blood relationship, the interval level between the candidate data catalog and the base data catalog is obtained based on the data blood relationship, and the blood relationship similarity is calculated based on the number of candidate data blood relationship fields, the number of base data blood relationship fields and the interval level.

7. The data catalog recommendation method based on data blood relationship according to claim 1, wherein a machine learning multiple regression algorithm is adopted to establish the relationship between the weights of the label similarity, the name similarity, the field similarity and the blood relationship similarity and the catalog similarity, and the weights of the label similarity, the name similarity, the field similarity and the blood relationship similarity are dynamically adjusted according to whether the recommended data catalog is actually clicked.

8. A data catalog recommendation device based on data blood relationship, comprising:

9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the data catalog recommendation method based on data blood relationship of any one of claims 1 to 7.

10. A readable storage medium, characterized in that the readable storage medium has stored therein a computer program comprising program code for controlling a process to execute a process comprising the data catalog recommendation method based on data blood relationship according to any one of claims 1 to 7.