CN113312396B - Metadata processing method and device based on big data - Google Patents

Metadata processing method and device based on big data

Info

Publication number
CN113312396B
CN113312396B (application CN202110517886.7A)
Authority
CN
China
Prior art keywords
metadata
sample
target
word
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110517886.7A
Other languages
Chinese (zh)
Other versions
CN113312396A (en)
Inventor
程大伟
朱鹏
盛程凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhejin Information Technology Co ltd
Original Assignee
Shanghai Zhejin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zhejin Information Technology Co ltd filed Critical Shanghai Zhejin Information Technology Co ltd
Priority to CN202110517886.7A
Publication of CN113312396A
Application granted
Publication of CN113312396B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2457 Query processing with adaptation to user needs
    • G06F 16/24573 Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval of structured data, e.g. relational data
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a metadata processing method and device based on big data. A training model for judging and determining the relationships among tables of different databases is first determined. Two target metadata to be processed are then acquired, where each target metadata is the table structure information of a table of a database. Word segmentation and word vector training are performed in turn on each target metadata to obtain the target word vector corresponding to each target metadata. The word vectors corresponding to the two target metadata are input into the training model for association analysis, yielding an association result between the two target metadata; the result is either no association or association, and if an association exists, the result further includes the target association relationship between the two target metadata. The table structure information of tables in a database is thereby analyzed so as to reveal the table-to-table relationships of the database.

Description

Metadata processing method and device based on big data
Technical Field
The present application relates to the field of computer technologies, and in particular, to a metadata processing method and apparatus based on big data.
Background
In the prior art, metadata is an important component of a data warehouse and serves as its guide map, playing an important role in data source extraction, data warehouse development, business analysis, data warehouse services, data refinement, reconstruction, and other processes. The broad influence of metadata across the entire data warehouse development and application process can be seen in fig. 1; metadata with strong definition and description capabilities and complete content is, to a large extent, decisive for effectively developing and managing a data warehouse. Furthermore, metadata is an efficient way to describe complex organizational data, encapsulating the semantic nature of a dataset in aggregated form. Metadata plays a central role in keeping data usable over long periods and is critical to data understanding, storage, preservation, management, and discovery for future use; by allowing input data to be searched based on its content and context, it makes complex research applications on the data easier. Beyond better utilization, it also facilitates other data management work, such as managing and exploiting similarities between data sets. However, different research areas have different metadata standards and data management methods, each containing different research-specific data features.
Currently, metadata management is an immature area. From the business perspective, many people are unclear about the purpose of establishing a metadata management and exchange platform and cannot determine how much value centralizing metadata brings to an enterprise. At the current stage of most enterprises, establishing such a platform does not generate much value; it is even unclear, at the initial stage of platform construction, who would use it. From the technical point of view, a unified metadata standard has not really been established. Most common metadata management tools provide metadata exchange functions, but these exchanges cannot handle situations involving interaction with numerous tools. If such tools cannot fully centralize all metadata, faults will occur in the data stream throughout the data warehouse, and a so-called consistent metadata management platform cannot be established. As a result, the association relationships between metadata are missing, and data sharing and interconnection with other data sets cannot be realized.
Disclosure of Invention
The application aims to provide a metadata processing method and device based on big data, which can obtain the association relationships between tables of a database, so that a user gains a clear understanding of the table structure of an unknown database, facilitating subsequent use and utilization of the database.
According to an aspect of the present application, there is provided a metadata processing method based on big data, wherein the method includes:
Determining a training model for judging and determining the relations among tables of different databases;
acquiring two target metadata to be processed, wherein each target metadata is table structure information of a table of a database;
word segmentation processing and word vector training are sequentially carried out on each target metadata respectively to obtain a target word vector corresponding to each target metadata;
inputting word vectors corresponding to the two target metadata into the training model for association relation analysis to obtain association results between the two target metadata, wherein the association results comprise no association and association;
if the association exists, the association result also comprises target association relations corresponding to the two target metadata.
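The steps above can be sketched end to end in Python. This is a minimal illustration only: the bag-of-words vectors, the cosine-similarity stand-in for the trained model, the vocabulary, the column names, and the 0.5 threshold are all assumptions, not the patent's actual word2vec-plus-deep-learning pipeline.

```python
import math

def word_vector(metadata, vocab):
    # Toy bag-of-words vector over the table-structure text
    # (a stand-in for the trained target word vectors).
    tokens = metadata.lower().split()
    return [tokens.count(w) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def analyze_association(meta1, meta2, vocab, threshold=0.5):
    # Stand-in for the trained model: thresholded cosine similarity.
    v1, v2 = word_vector(meta1, vocab), word_vector(meta2, vocab)
    sim = cosine(v1, v2)
    if sim <= threshold:
        return {"associated": False}
    return {"associated": True, "relation": "similar-structure", "similarity": sim}

# Hypothetical table-structure strings for two database tables.
vocab = ["user_id", "varchar", "int", "order_id", "primary", "key"]
t1 = "user_id int primary key name varchar"
t2 = "user_id int order_id int primary key"
result = analyze_association(t1, t2, vocab)
```

Two tables sharing key columns score high and come back associated; an unrelated table falls below the threshold and yields only `{"associated": False}`, mirroring the two-valued association result the method describes.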
Further, in the above method, determining a training model for evaluating and determining relationships between tables of different databases includes:
acquiring a preset number of sample metadata and determining a sample word vector of each sample metadata;
grouping the preset number of sample metadata in all pairwise combinations, with two sample metadata as a unit, to obtain each group of training samples;
respectively splicing sample word vectors of the two sample metadata in each group of training samples to obtain spliced word vectors of each group of training samples;
and performing deep learning and model training on the spliced word vectors of all groups of training samples formed from the preset number of sample metadata to obtain a training model for judging and determining the table relationships of different databases.
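The pairwise grouping and word-vector splicing steps can be sketched as follows. The dictionary-of-lists input shape and the choice of concatenation as the "splicing" operation are assumptions, since the patent does not fix a data layout.

```python
from itertools import combinations

def build_training_pairs(sample_vectors):
    # sample_vectors: {sample name: word vector (list of floats)} - assumed shape.
    # Group every two samples (N*(N-1)/2 groups) and splice (concatenate)
    # their word vectors, as the steps above describe.
    pairs = {}
    for (n1, v1), (n2, v2) in combinations(sample_vectors.items(), 2):
        pairs[(n1, n2)] = v1 + v2  # splicing by concatenation (assumption)
    return pairs

# Three toy sample word vectors -> 3 * 2 / 2 = 3 training groups.
vecs = {"sample1": [0.1, 0.2], "sample2": [0.3, 0.4], "sample3": [0.5, 0.6]}
pairs = build_training_pairs(vecs)
```

`itertools.combinations` yields exactly the "any two-by-two" grouping of the claim, so N sample metadata produce N(N-1)/2 spliced vectors with no duplicates.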
Further, in the above method, the obtaining a preset number of sample metadata and determining a sample word vector of each sample metadata includes:
Acquiring a preset number of sample metadata, wherein each sample metadata is table structure information of a table of a database;
performing word segmentation processing on each sample metadata respectively to obtain word segmentation information of each sample metadata and segmented sample metadata;
and training word vectors based on the word segmentation information of each sample metadata and the segmented sample metadata respectively to obtain sample word vectors of each sample metadata.
Further, in the above method, the training of word vectors based on the word segmentation information of each sample metadata and the segmented sample metadata respectively to obtain a sample word vector of each sample metadata includes:
Counting word segmentation information of each sample metadata respectively, and sequencing all words in each sample metadata according to a word frequency sequence from high to low to obtain an initial word vector of each word in each sample metadata;
And respectively inputting the initial word vectors of all words in each sample metadata and the corresponding segmented sample metadata into a related model for generating word vectors to perform word vector training, so as to obtain sample word vectors of each sample metadata.
Further, in the above method, the performing of deep learning and model training on the spliced word vectors of all groups of training samples formed from the preset number of sample metadata, to obtain a training model for judging and determining the relationships among tables of different databases, includes:
performing deep learning on the spliced word vectors of all groups of training samples formed from the preset number of sample metadata, respectively, so as to obtain a sample characterization vector for each group of training samples;
Calculating the characterization value of each group of training samples based on the sample characterization vector of each group of training samples;
Model training is carried out based on the characterization values of all groups of training samples, and a training model for judging and determining the table relations of different databases is obtained.
Further, in the above method, the training of the model based on the characterization values of all the training samples to obtain a training model for evaluating and determining the relationships between tables of different databases includes:
And respectively carrying out the following operations on each training sample to obtain training models for judging and determining the relations among tables of different databases:
Judging whether the characterization value of each training sample is larger than a preset association characterization threshold value,
If so, associating two sample metadata in the training sample, calculating the similarity between the two associated sample metadata, setting corresponding association relations for the two associated sample metadata based on the similarity, and setting different association relations for different values or value intervals of the similarity;
If not, no correlation exists between the two sample metadata in the training samples.
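A minimal sketch of the threshold judgment and similarity-interval labelling described above. The sigmoid-of-mean characterization value, the 0.5 threshold, and the interval boundaries and relation names are all hypothetical, since the patent leaves the formula, the threshold, and the relation vocabulary unspecified.

```python
import math

def characterization_value(repr_vector):
    # Hypothetical: squash the mean of the sample characterization vector
    # into (0, 1); the patent does not fix a formula.
    mean = sum(repr_vector) / len(repr_vector)
    return 1.0 / (1.0 + math.exp(-mean))

def label_pair(repr_vector, similarity, threshold=0.5):
    # If the characterization value does not exceed the association
    # threshold, the two sample metadata are not associated; otherwise map
    # the similarity onto interval-based relation labels (assumed bins).
    if characterization_value(repr_vector) <= threshold:
        return "no association"
    if similarity >= 0.9:
        return "same entity"
    if similarity >= 0.6:
        return "strongly related"
    return "weakly related"
```

The per-interval labels realize the claim's "different association relations for different values or value intervals of the similarity" with three illustrative bins.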
Further, in the above method, the step of sequentially performing word segmentation and word vector training on each target metadata to obtain a target word vector corresponding to each target metadata includes:
Performing word segmentation processing on each target metadata respectively to obtain target word segmentation information of each target metadata and segmented target metadata;
and training word vectors based on the target word segmentation information of each target metadata and the corresponding segmented target metadata respectively to obtain target word vectors of each target metadata.
Further, in the above method, the training of word vectors based on the target word segmentation information of each target metadata and the corresponding segmented target metadata to obtain target word vectors of each target metadata includes:
Counting the target word segmentation information of each target metadata respectively, and sequencing all target words in each target metadata according to the sequence of word frequency from high to low to obtain an initial target word vector of each target word in each target metadata;
And respectively inputting the initial target word vectors of all target words in each target metadata and the corresponding target metadata after word segmentation into a related model for generating word vectors to carry out word vector training, so as to obtain the target word vector of each target metadata.
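The frequency ranking that produces the initial word vectors, for target and sample metadata alike, can be sketched as follows. Ties in word frequency keep first-appearance order here (Python's `Counter.most_common` sorts stably); the patent does not address tie-breaking, so that detail is an assumption, as are the token names.

```python
from collections import Counter

def initial_word_vectors(tokens):
    # Rank distinct words by frequency, high to low (counting itself
    # deduplicates), then give each word a one-hot initial vector whose
    # dimension equals the deduplicated vocabulary size.
    ranked = [w for w, _ in Counter(tokens).most_common()]
    dim = len(ranked)
    vectors = {}
    for i, w in enumerate(ranked):
        v = [0] * dim
        v[i] = 1
        vectors[w] = v
    return ranked, vectors

# Toy segmented metadata: "id" occurs 3 times, "name" twice, "type" once.
tokens = ["id", "name", "id", "type", "id", "name"]
ranked, vectors = initial_word_vectors(tokens)
```

These one-hot vectors are only the starting point; the text feeds them, together with the segmented metadata, into a word-vector model (e.g. word2vec) to learn the final dense word vectors.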
According to another aspect of the present application, there is also provided a non-volatile storage medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement a metadata processing method based on big data as described above.
According to another aspect of the present application, there is also provided an apparatus for data processing, wherein the apparatus includes:
one or more processors;
A computer readable medium for storing one or more computer readable instructions,
The one or more computer-readable instructions, when executed by the one or more processors, cause the one or more processors to implement the metadata processing method based on big data as described above.
Compared with the prior art, the present application first determines a training model for judging and determining the relationships among tables of different databases. In an actual application scenario, two target metadata to be processed are acquired, where each target metadata is the table structure information of a table of a database; word segmentation and word vector training are performed in turn on each target metadata to obtain the corresponding target word vector; and the word vectors corresponding to the two target metadata are input into the training model for association analysis, yielding an association result between the two target metadata, which is either no association or association. If the two target metadata are associated, the association result further includes the target association relationship between them. In this way, by analyzing the table structure information of tables in the database, the association relationship between two target metadata from different tables is obtained, the table relationships of the database are revealed, the user gains a clear understanding of the table structure of an unknown database, and subsequent use and utilization of the database are facilitated.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
FIG. 1 is a schematic diagram of an impact domain of data warehouse metadata in the prior art;
FIG. 2 illustrates a flow diagram of a metadata processing method based on big data in accordance with an aspect of the present application;
FIG. 3 illustrates a framework diagram of a word segmentation tool in a metadata processing method based on big data in accordance with an aspect of the present application;
FIG. 4 illustrates a training schematic of a training model, for evaluating and determining the table relationships of different databases, in a metadata processing method based on big data in accordance with an aspect of the present application.
The same or similar reference numbers in the drawings refer to the same or similar parts.
Detailed Description
The application is described in further detail below with reference to the accompanying drawings.
In one exemplary configuration of the application, the terminal, the device of the service network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, Random Access Memory (RAM), and/or nonvolatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, Phase-change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
According to an embodiment of an aspect of the present application, a metadata processing method based on big data is provided; the specific interaction flow is shown in fig. 2. The method involves a client and a server of a full-text search engine, where the client of the full-text search engine may preferably be a RestHighLevelClient. The method includes steps S11, S12, S13, and S14 executed by the client, specifically as follows:
Step S11, determining a training model for judging and determining the relations among tables of different databases; the training model is obtained by training according to the table structure information of different tables of different databases, so that the table relationship of the different tables, namely the association relationship among the different tables, can be calculated through the training model in the actual application scene.
Step S12, obtaining two target metadata to be processed, wherein each target metadata is table structure information of a table of a database; here, the two target metadata may be from different tables of the same database or may be from different tables of different databases, so that table structure information of tables of the database to be processed is taken as target metadata to be processed.
And S13, performing word segmentation processing and word vector training on each target metadata in sequence to obtain a target word vector corresponding to each target metadata.
And S14, inputting word vectors corresponding to the two target metadata into the training model for association relation analysis, and obtaining an association result between the two target metadata, wherein the association result comprises no association and association.
If the association exists, the association result also comprises target association relations corresponding to the two target metadata.
Through steps S11 to S14, the association relationship between two target metadata from different tables is obtained by analyzing the table structure information of tables in the database, so that the table relationships of the database are revealed, the user gains a clearer understanding of the table structure of an unknown database, and subsequent use and utilization of the database are facilitated.
For example, before the relationships between database tables are computed, a training Model for evaluating and determining the relationships between tables of different databases is determined in step S11. In an actual application scenario, step S12 obtains two target metadata to be processed, target metadata Data(target1) and target metadata Data(target2), where each target metadata is the table structure information of a table of a database. In step S13, word segmentation and word vector training are performed in turn on Data(target1) and Data(target2), yielding the target word vector V(target1) corresponding to Data(target1) and the target word vector V(target2) corresponding to Data(target2). In step S14, V(target1) and V(target2) are input together into the training Model for association analysis, producing an association result between Data(target1) and Data(target2); the result is either no association or association. The calculation and output of the table relationship between two target metadata is thereby realized in an actual application scenario through the training Model, and when an association exists between Data(target1) and Data(target2), the target association relationship between them is output. Thus, through the training Model, not only can the association result between two target metadata to be processed be calculated and output, but when an association exists, the specific target association relationship between them is also calculated and output, so that the association between the tables corresponding to the two target metadata is better understood, facilitating subsequent use of the corresponding databases.
Following the above embodiment of the present application, the step S11 of determining a training model for evaluating and determining relationships between tables of different databases specifically includes:
acquiring a preset number of sample metadata and determining a sample word vector of each sample metadata;
grouping the preset number of sample metadata in all pairwise combinations, with two sample metadata as a unit, to obtain each group of training samples;
respectively splicing sample word vectors of the two sample metadata in each group of training samples to obtain spliced word vectors of each group of training samples;
and performing deep learning and model training on the spliced word vectors of all groups of training samples formed from the preset number of sample metadata to obtain a training model for judging and determining the table relationships of different databases.
For example, in the process of training the training model, a preset number of sample metadata for model training is first obtained. If the number of sample metadata is N, where N may be a positive integer greater than or equal to 2, the N sample metadata are sample metadata data(sample1), data(sample2), ..., data(sample(N-1)), and data(sampleN), and the sample word vector of each sample metadata is determined: sample word vector V(sample1), V(sample2), ..., V(sample(N-1)), and V(sampleN), respectively. Then the N sample metadata are grouped in all pairwise combinations, for example data(sample1) and data(sample2) as one group, data(sample1) and data(sample3) as one group, data(sample2) and data(sample3) as one group, ..., data(sample1) and data(sampleN) as one group, and data(sample(N-1)) and data(sampleN) as one group, obtaining N(N-1)/2 groups of training samples; that is, any two of the N sample metadata form one training sample. Next, the sample word vectors of the two sample metadata in each group of training samples are spliced to obtain the spliced word vector of each group: for example, the spliced word vector of the group formed by data(sample1) and data(sample2) is V(sample12), that of the group formed by data(sample1) and data(sample3) is V(sample13), ..., and that of the group formed by data(sample(N-1)) and data(sampleN) is V(sample(N-1)N), giving the spliced word vector of every group of training samples formed from the N sample metadata. Finally, deep learning and model training are performed on the spliced word vectors of the N(N-1)/2 groups of training samples to obtain the training Model for judging and determining the relationships among tables of different databases, realizing the training and determination of the training model.
Following the above embodiment of the present application, obtaining a preset number of sample metadata and determining a sample word vector of each sample metadata specifically includes:
Acquiring a preset number of sample metadata, wherein each sample metadata is table structure information of a table of a database. Here, metadata is data about data; by function, metadata may be divided into descriptive metadata, structural metadata, administrative metadata, and the like. Descriptive metadata mainly provides information for searching and locating, used to find and identify specific data; structural metadata mainly records the composition of the data and the interrelationships within it; and administrative metadata provides the information required for data resource management. Therefore, in the model training process, the table structure information of database tables is used as the sample metadata for training the model, realizing the acquisition of sample metadata.
Performing word segmentation processing on each sample metadata respectively to obtain word segmentation information of each sample metadata and segmented sample metadata;
and training word vectors based on the word segmentation information of each sample metadata and the segmented sample metadata respectively to obtain sample word vectors of each sample metadata.
In this embodiment, when each obtained sample metadata is segmented, the jieba word segmentation tool is used; a framework diagram of the jieba tool is shown in fig. 3. The most important component of jieba is its dictionary: if the jieba tool is used with only its generic built-in dictionary table, words may not be segmented as desired in a specific segmentation environment. Modifying the dictionary table involves the following steps. First, the dictionary is loaded to generate a trie-based segmentation model; in a preferred embodiment of the application, taking a Linux system as an example, a cache file is generated when the dictionary is loaded. Because the word segmentation environment of the application uses two jieba segmenters, one for regions and one for keywords, the cache file names would be identical and the two could not be used at the same time unless the related configuration is changed. Second, given a sentence to be segmented (for example, sample metadata to be segmented), continuous Chinese and English characters are extracted, and the sample metadata is split into a list of phrases; for each phrase, a DAG (directed acyclic graph) built by dictionary lookup and dynamic programming are used to obtain the maximum-probability path. Words in the DAG that are not found in the dictionary table are combined into new segment phrases and segmented using a Hidden Markov Model (HMM); that is, the HMM is used to recognize new words outside the dictionary table. Finally, a word generator is created using Python's yield syntax, returning words one by one, so as to realize the word segmentation of each sample metadata.
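In practice the segmentation itself is a call such as `jieba.cut(text)`, with a custom dictionary loaded via `jieba.load_userdict`. As a self-contained illustration of the dictionary-driven idea only, and not of jieba's actual DAG-plus-dynamic-programming algorithm, a forward maximum-match segmenter over a toy dictionary:

```python
def max_match(sentence, dictionary, max_len=4):
    # Forward maximum matching: at each position take the longest
    # dictionary word; characters with no dictionary match fall through
    # as single-character tokens. A toy stand-in for jieba, not its
    # real maximum-probability-path algorithm.
    words, i = [], 0
    while i < len(sentence):
        for L in range(min(max_len, len(sentence) - i), 0, -1):
            cand = sentence[i:i + L]
            if L == 1 or cand in dictionary:
                words.append(cand)
                i += L
                break
    return words

# Toy dictionary of table-related terms (illustrative only).
dic = {"用户", "订单", "表"}
segmented = max_match("用户订单表", dic)
```

This also shows why the text stresses customizing the dictionary table: a term absent from the dictionary degrades into single characters instead of being kept whole.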
For example, during training of the training model, a preset number N of sample metadata for model training is first obtained, where N may be a positive integer greater than or equal to 2; the N sample metadata are denoted Data(sample 1), Data(sample 2), ..., Data(sample N-1) and Data(sample N). Then, word segmentation processing is applied to each of Data(sample 1) through Data(sample N) to obtain, for each sample i, its word segmentation information i and the segmented sample metadata Data'(sample i), thereby realizing word segmentation processing of each sample metadata. Finally, word vector training is performed for each sample i based on its word segmentation information i and the segmented sample metadata Data'(sample i) to obtain the sample word vector V(sample i), thereby realizing word vector training and the determination of the sample word vector of each sample metadata.
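The segmentation-then-training pipeline above can be sketched as follows. This is a minimal illustration: the split on non-alphanumeric characters stands in for the patent's unspecified word-segmentation step, and the sample table-structure strings are hypothetical.

```python
import re
from collections import Counter

def segment(metadata_text):
    """Split one piece of table-structure metadata into words.

    A stand-in for the patent's unspecified segmenter: split on
    any run of non-alphanumeric, non-underscore characters.
    """
    return [w.lower() for w in re.split(r"[^0-9A-Za-z_]+", metadata_text) if w]

# N = 3 hypothetical sample metadata (table-structure descriptions)
samples = [
    "user_id INT PRIMARY KEY, user_name VARCHAR, created_at DATETIME",
    "order_id INT PRIMARY KEY, user_id INT, amount DECIMAL",
    "product_id INT, product_name VARCHAR, price DECIMAL",
]

# Segmented sample metadata Data'(sample i) and its word-segmentation
# information (per-word frequencies), the two inputs to word vector training.
segmented = [segment(s) for s in samples]
seg_info = [Counter(words) for words in segmented]

print(segmented[0])   # tokens of sample 1
print(seg_info[1])    # word frequencies of sample 2
```

The `seg_info` counters are what the later frequency-sorting step consumes; the actual word vector training (word2vec) is shown further below.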
Next, in the above embodiment of the present application, performing word vector training based on the word segmentation information of each sample metadata and the segmented sample metadata to obtain a sample word vector of each sample metadata specifically includes:
Counting word segmentation information of each sample metadata respectively, and sequencing all words in each sample metadata according to a word frequency sequence from high to low to obtain an initial word vector of each word in each sample metadata;
And respectively inputting the initial word vectors of all words in each sample metadata and the corresponding segmented sample metadata into a related model for generating word vectors to perform word vector training, so as to obtain sample word vectors of each sample metadata.
For example, after the word segmentation information of each sample metadata is obtained, a representation of each sample metadata, that is, its sample word vector, needs to be derived. It should be noted that a word vector is a vector obtained by mapping words into a semantic space. word2vec is a common prior-art tool for training word vectors; it is implemented as a neural network and provides two models, CBOW and Skip-gram, which are similar during training. The Skip-gram model takes a word as input and predicts its surrounding context, while the CBOW model takes the context of a word as input and predicts the word itself. The preprocessing for sample word vector training in this embodiment includes the following steps. First, a vocabulary is generated from the text of the word segmentation information of each input sample metadata and the frequency of each word is counted. For each sample metadata, all words are sorted by frequency from high to low to form the vocabulary; sorting by frequency also deduplicates repeated words. Each word then has a one-hot initial word vector whose dimension equals the number of words after deduplication: if the word appears in the vocabulary, the position corresponding to its vocabulary index is 1 and all other positions are 0; if the word does not appear in the vocabulary, the vector is all 0.
Then, a one-hot initial word vector is generated for each word of the text of the word segmentation information of each sample metadata. Because the representation is context-dependent, the one-hot vector encodes each word's position in the frequency ordering. For example, if the sample metadata contains 80 words ordered by frequency from high to low, the one-hot initial word vector of the most frequent word has dimension 80 with a 1 in the first dimension and 0 elsewhere, and the word ranked 30th by frequency has a vector of dimension 80 with a 1 in the 30th dimension and 0 elsewhere; in this way the one-hot initial word vector of every word is obtained. Next, the dimension of the one-hot initial word vector of each word in each sample metadata is determined: it equals the number of deduplicated words contained in that sample metadata. Finally, the initial word vectors of all words in each sample metadata, together with the corresponding segmented sample metadata, are input into a correlation model for generating word vectors, namely the word2vec function, for word vector training, so as to obtain the sample word vector of each sample metadata, thereby realizing the training and generation of the sample word vector of each sample metadata.
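The frequency-sorted vocabulary and one-hot initial word vectors described above can be sketched as follows. The tie-breaking rule for words of equal frequency is an assumption, since the patent does not specify one.

```python
from collections import Counter

def initial_one_hot_vectors(words):
    """Build the one-hot initial word vectors described in the text.

    Words are sorted by frequency from high to low; duplicates collapse
    during sorting, so the vector dimension equals the number of distinct
    words. The word at frequency rank k gets a 1 in dimension k.
    """
    freq = Counter(words)
    # sort by frequency descending; ties broken alphabetically (an assumption)
    vocab = [w for w, _ in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))]
    dim = len(vocab)
    vectors = {}
    for rank, word in enumerate(vocab):
        vec = [0] * dim
        vec[rank] = 1
        vectors[word] = vec
    return vocab, vectors

# hypothetical segmented sample metadata with repeated words
words = ["id", "int", "name", "int", "price", "int", "name"]
vocab, vectors = initial_one_hot_vectors(words)
print(vocab)           # highest-frequency word first
print(vectors["int"])  # 1 in the first dimension, 0 elsewhere
```

These one-hot vectors are only the initial representation; feeding them through word2vec yields the dense sample word vectors used for splicing below.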
In the above embodiment of the present application, the performing deep learning and model training on the concatenated word vectors of all groups of training samples in the preset number of sample data to obtain training models for evaluating and determining relationships between tables of different databases includes:
deep learning is respectively carried out on the spliced word vectors of all groups of training samples in the preset number of sample data, so that sample characterization vectors of each group of training samples are obtained;
Calculating the characterization value of each group of training samples based on the sample characterization vector of each group of training samples;
Model training is carried out based on the characterization values of all groups of training samples, and a training model for judging and determining the table relations of different databases is obtained.
For example, after the sample word vector of each sample metadata is obtained, a model needs to be trained to judge whether a relationship exists between every two sample metadata and, if so, which relationship; Fig. 4 shows the training flowchart for the training model used to evaluate and determine the relationships between tables of different databases. First, the sample word vectors of two sample metadata are spliced to obtain a spliced word vector for each training sample formed by two sample metadata. Then, the spliced word vector of each group of training samples is fed into a multi-layer neural network to learn a deep characterization, followed by an attention mechanism to learn a better characterization, so as to obtain the sample characterization vector of each group of training samples. Finally, model training is performed on the characterization values of the N(N-1)/2 groups of training samples formed by the N sample metadata to obtain the training model for judging and determining the relationships between tables of different databases, thereby realizing the training and determination of that model.
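The pairing and splicing of sample word vectors can be sketched as below. The single random dense layer plus sigmoid is only a placeholder for the patent's multi-layer neural network and attention mechanism; it is included to show where a characterization value in (0, 1) comes from and how the N(N-1)/2 pairs are formed.

```python
import itertools
import math
import random

rng = random.Random(0)

N, dim = 4, 8
# hypothetical sample word vectors V(sample i), one per sample metadata
sample_vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(N)]

# every pairwise grouping of the N sample metadata: N*(N-1)/2 training samples
pairs = list(itertools.combinations(range(N), 2))

# splice (concatenate) the two sample word vectors of each pair
spliced = [sample_vectors[i] + sample_vectors[j] for i, j in pairs]

# Placeholder for the multi-layer network plus attention mechanism:
# one random dense layer followed by a sigmoid, producing a
# characterization value in (0, 1) per training sample.
W = [rng.gauss(0, 1) for _ in range(2 * dim)]

def characterization(x):
    z = sum(w * xi for w, xi in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))

values = [characterization(x) for x in spliced]
print(len(values))  # N*(N-1)/2 characterization values
```

In the patent's actual implementation this scoring head would be the trained deep network (built, per the description, with Keras), not a random projection.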
In the above embodiment of the present application, the training model for evaluating and determining the relationships between tables of different databases is obtained by performing model training based on the characterization values of all the training samples, and includes:
And respectively carrying out the following operations on each training sample to obtain training models for judging and determining the relations among tables of different databases:
Judging whether the characterization value of each training sample is larger than a preset association characterization threshold, wherein the preset association characterization threshold can take any value between 0 and 1, and in a preferred embodiment of the present application, the preset association characterization threshold is preferably 0.5.
If so, associating two sample metadata in the training sample, calculating the similarity between the two associated sample metadata, setting corresponding association relations for the two associated sample metadata based on the similarity, and setting different association relations for different values or value intervals of the similarity;
If not, no correlation exists between the two sample metadata in the training samples.
In a preferred embodiment of the present application, after deep learning yields the characterization value of each group of training samples, the following operations are performed on each training sample to obtain the training model for evaluating and determining the relationships among tables of different databases. The characterization value of each training sample is input into the discriminator shown in Fig. 4, which judges whether it exceeds the preset association characterization threshold of 0.5. If the characterization value is greater than 0.5 and less than or equal to 1, the two sample metadata in the training sample are associated; in that case, the similarity between the two associated sample metadata is calculated and a corresponding association relation is set based on the similarity, with different similarity values or value intervals mapped to different association relations. Otherwise, the characterization value is greater than or equal to zero and less than or equal to 0.5, and the two sample metadata in the training sample are not associated. This realizes the training and determination of the training model for judging and determining the table relations of different databases.
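The discriminator step can be sketched as follows, using the patent's preferred threshold of 0.5. The cosine-similarity measure, the concrete similarity intervals, and the relation names are illustrative assumptions; the patent only requires that different similarity values or value intervals map to different association relations.

```python
import math

ASSOCIATION_THRESHOLD = 0.5  # the patent's preferred threshold

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def discriminate(char_value, vec_a, vec_b):
    """Map a pair's characterization value to an association result.

    Relation names and interval boundaries below are hypothetical
    examples of "different association relations for different
    similarity value intervals".
    """
    if char_value <= ASSOCIATION_THRESHOLD:
        return {"associated": False}
    sim = cosine_similarity(vec_a, vec_b)
    if sim >= 0.9:
        relation = "same-entity"
    elif sim >= 0.6:
        relation = "foreign-key"
    else:
        relation = "weak-reference"
    return {"associated": True, "similarity": sim, "relation": relation}

print(discriminate(0.3, [1, 0], [0, 1]))        # below threshold: no association
print(discriminate(0.8, [1.0, 0.5], [1.0, 0.4]))  # associated, relation chosen by similarity
```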
Next, in the foregoing embodiment of the present application, step S13 performs word segmentation processing and word vector training on each target metadata in sequence to obtain a target word vector corresponding to each target metadata, which specifically includes:
Performing word segmentation processing on each target metadata respectively to obtain target word segmentation information of each target metadata and segmented target metadata;
and training word vectors based on the target word segmentation information of each target metadata and the corresponding segmented target metadata respectively to obtain target word vectors of each target metadata.
In an actual application scenario, after the two target metadata to be processed, Data(target 1) and Data(target 2), are acquired in step S12, step S13 first performs word segmentation processing on Data(target 1) and Data(target 2) to obtain the target word segmentation information and segmented target metadata Data'(target 1) and Data'(target 2), respectively. Then, the target word segmentation information of each target metadata and the corresponding segmented target metadata are input into a correlation model for generating word vectors, namely the word2vec function, for word vector training, so as to obtain the target word vector V(target 1) of Data(target 1) and the target word vector V(target 2) of Data(target 2), thereby realizing the training and generation of the target word vector of each target metadata to be processed.
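The inference flow for two target metadata can be sketched as below. The bag-of-words count vector is a toy stand-in for the trained word2vec representation, and the two table-structure strings are hypothetical; the trained model would consume the spliced vector [v1; v2] to produce the association result.

```python
import re
from collections import Counter

def segment(text):
    """Simple stand-in for the patent's word-segmentation step."""
    return [w.lower() for w in re.split(r"[^0-9A-Za-z_]+", text) if w]

def word_vector(text, vocab):
    """Toy stand-in for the trained word2vec representation:
    a bag-of-words count vector over a shared vocabulary."""
    counts = Counter(segment(text))
    return [counts[w] for w in vocab]

# hypothetical table-structure metadata of two tables to compare
target1 = "user_id INT PRIMARY KEY, user_name VARCHAR"
target2 = "order_id INT, user_id INT, amount DECIMAL"

vocab = sorted(set(segment(target1)) | set(segment(target2)))
v1, v2 = word_vector(target1, vocab), word_vector(target2, vocab)

# the trained model would score the spliced vector v1 + v2; here we just
# surface the column-name overlap that drives the association decision
shared = set(segment(target1)) & set(segment(target2))
print(sorted(shared))
```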
Next, in the above embodiment of the present application, the step S14 performs word vector training based on the target word segmentation information of each target metadata and the corresponding segmented target metadata, to obtain a target word vector of each target metadata, and specifically includes:
Counting the target word segmentation information of each target metadata respectively, and sequencing all target words in each target metadata according to the sequence of word frequency from high to low to obtain an initial target word vector of each target word in each target metadata;
And respectively inputting the initial target word vectors of all target words in each target metadata and the corresponding target metadata after word segmentation into a related model for generating word vectors to carry out word vector training, so as to obtain the target word vector of each target metadata.
For example, after the target word segmentation information of each target metadata is obtained, a representation of each target metadata, that is, its target word vector, needs to be derived. As noted above, a word vector is a vector obtained by mapping words into a semantic space. The preprocessing for target word vector training in this embodiment includes the following steps. First, a vocabulary is generated from the text of the target word segmentation information of each input target metadata and the frequency of each target word is counted. For each target metadata, all target words are sorted by frequency from high to low, with each target word appearing only once in the ordering, to form the vocabulary; sorting by frequency thus deduplicates repeated target words within the target metadata. Each target word then has a one-hot initial target word vector whose dimension equals the number of target words after deduplication: if the target word appears in the vocabulary, the position corresponding to its vocabulary index is 1 and all other positions are 0; if it does not appear, the vector is all 0.
Then, a one-hot initial target word vector is generated for each target word of the text of the target word segmentation information of each target metadata. Next, the dimension of the one-hot initial target word vector of each target word is determined: it equals the number of deduplicated target words contained in that target metadata. Finally, the initial target word vectors of all target words in each target metadata, together with the corresponding segmented target metadata, are input into a correlation model for generating word vectors, namely the word2vec function, for word vector training, so as to obtain the target word vector of each target metadata, thereby realizing the training and generation of the target word vector of each target metadata.
In the above embodiment of the present application, the metadata in the database is intelligently analyzed and processed using the open-source deep learning framework Keras and the programming language Python, and training can be accelerated in parallel across multiple CPUs to achieve the best implementation effect. The application is not limited to the specific embodiments described above; various variations or modifications may be made by those skilled in the art within the scope of the claims without affecting the gist of the application.
According to another aspect of the present application, there is also provided a non-volatile storage medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement a metadata processing method based on big data as described above.
According to another aspect of the present application, there is also provided an apparatus for data processing, wherein the apparatus includes:
one or more processors;
A computer readable medium for storing one or more computer readable instructions,
The one or more computer-readable instructions, when executed by the one or more processors, cause the one or more processors to implement the metadata processing method based on big data as described above.
For details of each embodiment of the apparatus for data processing, reference may be made to the corresponding portion of the embodiment of the metadata processing method based on big data, which is not described herein.
In summary, the training model for judging and determining the relationships among tables of different databases is first determined. In an actual application scenario, two target metadata to be processed are acquired, where each target metadata is the table structure information of a table of a database. Word segmentation processing and word vector training are performed in sequence on each target metadata to obtain a corresponding target word vector. The word vectors corresponding to the two target metadata are then input into the training model for association relation analysis to obtain an association result, which is either no association or association; if the two target metadata are associated, the result further includes the target association relationship between them. In this way, the association relationship between two target metadata from different tables is obtained by analyzing the table structure information of the tables in the database, which reflects the table relations of the database, gives the user a clear understanding of the table structure of an unknown database, and facilitates subsequent use of the database.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, e.g., using Application Specific Integrated Circuits (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software program of the present application may be executed by a processor to perform the steps or functions described above. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
Furthermore, portions of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application by way of operation of the computer. Program instructions for invoking the inventive methods may be stored in fixed or removable recording media and/or transmitted via a data stream in a broadcast or other signal bearing medium and/or stored within a working memory of a computer device operating according to the program instructions. An embodiment according to the application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to operate a method and/or a solution according to the embodiments of the application as described above.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.

Claims (7)

1. A metadata processing method based on big data, wherein the method comprises:
Determining a training model for judging and determining the relations among tables of different databases;
acquiring two target metadata to be processed, wherein each target metadata is table structure information of a table of a database;
word segmentation processing and word vector training are sequentially carried out on each target metadata respectively to obtain a target word vector corresponding to each target metadata;
inputting word vectors corresponding to the two target metadata into the training model for association relation analysis to obtain association results between the two target metadata, wherein the association results comprise no association and association;
if the association exists, the association result also comprises a target association relationship corresponding to the two target metadata;
The determining the training model for judging and determining the relations among tables of different databases comprises the following steps: acquiring a preset number of sample metadata and determining a sample word vector of each sample metadata; grouping the preset number of sample metadata in any two-two combination by taking the two sample metadata as a unit to obtain each group of training samples; respectively splicing sample word vectors of the two sample metadata in each group of training samples to obtain spliced word vectors of each group of training samples; deep learning and model training are carried out on the spliced word vectors of all groups of training samples in the preset number of sample data, so as to obtain training models for judging and determining the table relations of different databases;
The word segmentation processing and word vector training are sequentially performed on each target metadata respectively to obtain a target word vector corresponding to each target metadata, and the word segmentation processing and word vector training comprise the following steps: performing word segmentation processing on each target metadata respectively to obtain target word segmentation information of each target metadata and segmented target metadata; word vector training is carried out on the basis of the target word segmentation information of each target metadata and the corresponding target metadata after word segmentation, so that a target word vector of each target metadata is obtained;
The word vector training is performed on the basis of the target word segmentation information of each target metadata and the corresponding segmented target metadata respectively, so as to obtain a target word vector of each target metadata, and the word vector training method comprises the following steps: counting the target word segmentation information of each target metadata respectively, and sequencing all target words in each target metadata according to the sequence of word frequency from high to low to obtain an initial target word vector of each target word in each target metadata; and respectively inputting the initial target word vectors of all target words in each target metadata and the corresponding target metadata after word segmentation into a related model for generating word vectors to carry out word vector training, so as to obtain the target word vector of each target metadata.
2. The method of claim 1, wherein the obtaining a predetermined number of sample metadata and determining a sample word vector for each of the sample metadata comprises:
Acquiring a preset number of sample metadata, wherein each sample metadata is table structure information of a table of a database;
performing word segmentation processing on each sample metadata respectively to obtain word segmentation information of each sample metadata and segmented sample metadata;
and training word vectors based on the word segmentation information of each sample metadata and the segmented sample metadata respectively to obtain sample word vectors of each sample metadata.
3. The method of claim 2, wherein the training word vectors based on the word segmentation information of each sample metadata and the segmented sample metadata respectively to obtain a sample word vector of each sample metadata comprises:
Counting word segmentation information of each sample metadata respectively, and sequencing all words in each sample metadata according to a word frequency sequence from high to low to obtain an initial word vector of each word in each sample metadata;
And respectively inputting the initial word vectors of all words in each sample metadata and the corresponding segmented sample metadata into a related model for generating word vectors to perform word vector training, so as to obtain sample word vectors of each sample metadata.
4. The method of claim 1, wherein the deep learning and model training of the concatenated word vectors of all the training samples in the preset number of sample data to obtain training models for evaluating and determining the relationships between tables of different databases, comprises:
deep learning is respectively carried out on the spliced word vectors of all groups of training samples in the preset number of sample data, so that sample characterization vectors of each group of training samples are obtained;
Calculating the characterization value of each group of training samples based on the sample characterization vector of each group of training samples;
Model training is carried out based on the characterization values of all groups of training samples, and a training model for judging and determining the table relations of different databases is obtained.
5. The method of claim 4, wherein the model training based on the characterization values of all sets of the training samples results in a training model for evaluating and determining the relationships between tables of different databases, comprising:
And respectively carrying out the following operations on each training sample to obtain training models for judging and determining the relations among tables of different databases:
Judging whether the characterization value of each training sample is larger than a preset association characterization threshold value,
If so, associating two sample metadata in the training sample, calculating the similarity between the two associated sample metadata, setting corresponding association relations for the two associated sample metadata based on the similarity, and setting different association relations for different values or value intervals of the similarity;
If not, no correlation exists between the two sample metadata in the training samples.
6. A non-volatile storage medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement the method of any of claims 1 to 5.
7. An apparatus for data processing, wherein the apparatus comprises:
one or more processors;
a computer readable medium for storing one or more computer readable instructions, wherein the one or more computer readable instructions, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1 to 5.
CN202110517886.7A 2021-05-12 2021-05-12 Metadata processing method and device based on big data Active CN113312396B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110517886.7A CN113312396B (en) 2021-05-12 2021-05-12 Metadata processing method and device based on big data

Publications (2)

Publication Number Publication Date
CN113312396A CN113312396A (en) 2021-08-27
CN113312396B true CN113312396B (en) 2024-04-19

Family

ID=77373043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110517886.7A Active CN113312396B (en) 2021-05-12 2021-05-12 Metadata processing method and device based on big data

Country Status (1)

Country Link
CN (1) CN113312396B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116010349A (en) * 2023-02-17 2023-04-25 广州汇通国信科技有限公司 Metadata-based data checking method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930318A (en) * 2016-04-11 2016-09-07 深圳大学 Word vector training method and system
EP3376400A1 (en) * 2017-03-14 2018-09-19 Fujitsu Limited Dynamic context adjustment in language models
CN109446263A (en) * 2018-11-02 2019-03-08 成都四方伟业软件股份有限公司 A kind of data relationship correlating method and device
CN110941629A (en) * 2019-10-12 2020-03-31 中国平安财产保险股份有限公司 Metadata processing method, device, equipment and computer readable storage medium
CN111061833A (en) * 2019-12-10 2020-04-24 北京明略软件系统有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN111931064A (en) * 2020-08-28 2020-11-13 张坚伟 Information analysis method based on big data and artificial intelligence and cloud service information platform

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Word Vector Model Based on Word Relations; Jiang Zhenchao; Li Lishuang; Huang Degen; Journal of Chinese Information Processing (Issue 3); 30-36 *

Also Published As

Publication number Publication date
CN113312396A (en) 2021-08-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant