CN113312396A - Metadata processing method and equipment based on big data - Google Patents


Info

Publication number
CN113312396A
CN113312396A (application CN202110517886.7A)
Authority
CN
China
Prior art keywords
metadata
sample
target
word
training
Prior art date
Legal status
Granted
Application number
CN202110517886.7A
Other languages
Chinese (zh)
Other versions
CN113312396B (en)
Inventor
程大伟
朱鹏
盛程凯
Current Assignee
Shanghai Zhejin Information Technology Co ltd
Original Assignee
Shanghai Zhejin Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Zhejin Information Technology Co ltd
Priority to CN202110517886.7A
Publication of CN113312396A
Application granted
Publication of CN113312396B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/2457: Query processing with adaptation to user needs
    • G06F 16/24573: Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F 16/283: Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method comprises: determining a training model for evaluating and determining the relationships between tables of different databases; acquiring two target metadata to be processed, where each target metadata is the table structure information of a table in a database; performing word segmentation and word vector training on each target metadata in turn to obtain a target word vector for each target metadata; and inputting the word vectors of the two target metadata into the training model for association analysis to obtain an association result between the two target metadata. The association result is either "no association" or "association exists"; in the latter case, it further includes the target association relationship between the two target metadata. In this way, the table structure information of the tables in a database is analyzed to reveal the inter-table relationships of the database.

Description

Metadata processing method and equipment based on big data
Technical Field
The present application relates to the field of computer technologies, and in particular, to a metadata processing method and device based on big data.
Background
In the prior art, metadata is an important component of a data warehouse: it serves as a guide map of the warehouse and plays an important role in data source extraction, data warehouse development, business analysis, data warehouse services, and data refinement and reconstruction. The broad influence of metadata across data warehouse development and application can be seen in fig. 1; metadata with strong descriptive power and complete content is, to a large extent, decisive for effective development and management of a data warehouse. Metadata is also an efficient way to describe complex structured data, encapsulating the semantic essence of a data set in aggregated form. It plays a central role in keeping data usable over the long term; it is crucial for understanding, storing, preserving, managing, and rediscovering data for future use; and it makes it easier to apply sophisticated analysis to data by allowing input data to be searched by content and context. Beyond better utilization, it also supports other data management tasks, such as managing and exploiting similarities between data sets. However, different research areas use different metadata standards and data management methods, each standard covering different domain-specific data characteristics.
Currently, metadata management is an immature field. From a business perspective, many practitioners are unclear about the purpose of building a metadata management and exchange platform and cannot quantify how much value centralizing metadata would bring to an enterprise, or even resolve basic questions about the platform's initial design and usage. From a technical perspective, no truly unified metadata standard has been established. Most common metadata management tools offer metadata exchange functions, but such exchange cannot cope with situations that involve interaction with numerous tools. If these tools cannot fully centralize all metadata, the data flows across the whole data warehouse become fragmented, and a so-called consistent metadata management platform cannot be built. As a result, the association relationships between metadata are missing, and data sharing and cross-linking with other data sets cannot be achieved.
Disclosure of Invention
An object of the present application is to provide a big-data-based metadata processing method and device that can derive the association relationships between tables of a database, giving a user a clear understanding of the table structure of an unfamiliar database and facilitating the subsequent use of that database.
According to an aspect of the present application, there is provided a metadata processing method based on big data, wherein the method includes:
determining a training model for evaluating and determining the relationships between tables of different databases;
acquiring two target metadata to be processed, where each target metadata is the table structure information of a table in a database;
performing word segmentation and word vector training on each target metadata in turn to obtain a target word vector for each target metadata;
inputting the word vectors of the two target metadata into the training model for association analysis to obtain an association result between the two target metadata, where the association result is either "no association" or "association exists";
and if an association exists, the association result further includes the target association relationship between the two target metadata.
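As an illustrative sketch (not part of the disclosure), the four steps above can be mimicked end to end in Python. The tokenizer, the vocabulary, and the overlap-based stub `analyze_association` are all assumptions standing in for the patent's jieba segmentation, word2vec vectors, and trained deep model:

```python
# Minimal sketch of the four-step pipeline; every name here is a
# hypothetical stand-in for the real components described in the patent.

def tokenize(metadata):
    # Stand-in for word segmentation (the patent uses jieba).
    return metadata.lower().split()

def to_vector(tokens, vocab):
    # Stand-in for word vector training: bag-of-words counts over a vocab.
    return [tokens.count(w) for w in vocab]

def analyze_association(v1, v2, threshold=1):
    # Stand-in for the trained model: associate two tables if they share
    # at least `threshold` vocabulary terms.
    overlap = sum(1 for a, b in zip(v1, v2) if a and b)
    if overlap >= threshold:
        return {"associated": True, "relation": f"shares {overlap} terms"}
    return {"associated": False}

vocab = ["user_id", "order_id", "name", "price"]
meta1 = "user_id name"               # table structure info of table 1
meta2 = "user_id order_id price"     # table structure info of table 2

v1 = to_vector(tokenize(meta1), vocab)
v2 = to_vector(tokenize(meta2), vocab)
result = analyze_association(v1, v2)
print(result)  # associated via the shared user_id column
```

A real implementation would replace the stub with the trained model of step S11; the control flow (segment, vectorize, analyze pairwise) is what the claims describe.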
Further, in the above method, determining a training model for evaluating and determining relationships between tables of different databases includes:
obtaining a preset number of sample metadata and determining a sample word vector for each sample metadata;
grouping the preset number of sample metadata into all pairwise combinations, each pair of sample metadata forming one group of training samples;
splicing the sample word vectors of the two sample metadata in each group of training samples to obtain a spliced word vector for each group;
and performing deep learning and model training on the spliced word vectors of all groups of training samples to obtain a training model for evaluating and determining the relationships between tables of different databases.
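The pairwise grouping and splicing steps can be sketched as follows; the toy two-dimensional vectors are illustrative, and "splicing" is taken to mean concatenation:

```python
# Pair N sample metadata into N(N-1)/2 training samples and concatenate
# (splice) their word vectors. Vector values are made up for the example.
from itertools import combinations

sample_vectors = {
    "sample1": [0.1, 0.2],
    "sample2": [0.3, 0.4],
    "sample3": [0.5, 0.6],
}

training_samples = {
    (a, b): sample_vectors[a] + sample_vectors[b]   # splice = concatenate
    for a, b in combinations(sample_vectors, 2)
}

# N = 3 samples yield 3 * (3 - 1) / 2 = 3 groups of training samples.
print(len(training_samples))                      # 3
print(training_samples[("sample1", "sample2")])   # [0.1, 0.2, 0.3, 0.4]
```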
Further, in the above method, the obtaining a preset number of sample metadata and determining a sample word vector of each sample metadata includes:
obtaining a preset amount of sample metadata, wherein each sample metadata is table structure information of a table of a database;
performing word segmentation processing on each sample metadata to obtain word segmentation information of each sample metadata and the sample metadata after word segmentation;
and performing word vector training respectively based on the word segmentation information of each sample metadata and the segmented sample metadata to obtain a sample word vector of each sample metadata.
Further, in the above method, the performing word vector training based on the word segmentation information of each sample metadata and the segmented sample metadata to obtain a sample word vector of each sample metadata respectively includes:
respectively counting the word segmentation information of each sample metadata and sorting all words in each sample metadata by descending word frequency to obtain an initial word vector for each word in each sample metadata;
and respectively inputting the initial word vectors of all words in each sample metadata and the corresponding sample metadata after word segmentation into a relevant model for generating word vectors to carry out word vector training, so as to obtain the sample word vectors of each sample metadata.
Further, in the above method, performing deep learning and model training on the spliced word vectors of all groups of training samples to obtain a training model for evaluating and determining the relationships between tables of different databases includes:
performing deep learning on the spliced word vectors of all groups of training samples to obtain a sample characterization vector for each group of training samples;
calculating a characterization value for each group of training samples from its sample characterization vector;
and performing model training based on the characterization values of all groups of training samples to obtain a training model for evaluating and determining the relationships between tables of different databases.
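The patent does not fix how the scalar characterization value is computed from the characterization vector; one plausible reading, shown here purely as an assumption, is the cosine similarity between the two halves of a spliced vector:

```python
# Assumed characterization value: cosine similarity between the two table
# vectors that were spliced together. The vector values are illustrative.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

spliced = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0]   # vectors of two tables, spliced
half = len(spliced) // 2
value = cosine(spliced[:half], spliced[half:])
print(round(value, 3))  # 0.5
```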
Further, in the above method, performing model training based on the characterization values of all groups of training samples to obtain a training model for evaluating and determining the relationships between tables of different databases includes performing the following operations on each training sample:
judging whether the characterization value of the training sample exceeds a preset association threshold;
if so, marking the two sample metadata in the training sample as associated, calculating the similarity between the two associated sample metadata, and setting a corresponding association relationship for them based on that similarity, with different similarity values or value intervals mapped to different association relationships;
if not, marking the two sample metadata in the training sample as not associated.
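The labelling rule above can be sketched directly; the threshold, the similarity intervals, and the relation names are illustrative assumptions, since the patent leaves their concrete values open:

```python
# Threshold-then-interval labelling of a training sample. All numeric
# cut-offs and relation names are hypothetical placeholders.

def label_pair(characterization_value, similarity, threshold=0.5):
    if characterization_value <= threshold:
        return "no association"
    if similarity >= 0.9:
        return "same entity"        # e.g. near-duplicate tables
    if similarity >= 0.6:
        return "foreign-key-like"   # e.g. shared key columns
    return "weakly related"

print(label_pair(0.8, 0.95))  # same entity
print(label_pair(0.8, 0.7))   # foreign-key-like
print(label_pair(0.3, 0.95))  # no association
```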
Further, in the above method, the performing word segmentation processing and word vector training on each target metadata in sequence to obtain a target word vector corresponding to each target metadata respectively includes:
performing word segmentation processing on each target metadata to obtain target word segmentation information of each target metadata and the segmented target metadata;
and performing word vector training respectively based on the target word segmentation information of each target metadata and the corresponding segmented target metadata to obtain a target word vector of each target metadata.
Further, in the above method, the performing word vector training based on the target word segmentation information of each target metadata and the corresponding segmented target metadata to obtain the target word vector of each target metadata respectively includes:
respectively counting the target word segmentation information of each target metadata and sorting all target words in each target metadata by descending word frequency to obtain an initial target word vector for each target word in each target metadata;
and respectively inputting the initial target word vectors of all target words in each target metadata and the corresponding segmented target metadata into a relevant model for generating word vectors to perform word vector training, so as to obtain the target word vectors of each target metadata.
According to another aspect of the present application, there is also provided a non-volatile storage medium having computer-readable instructions stored thereon, which, when executed by a processor, cause the processor to implement the big-data based metadata processing method as described above.
According to another aspect of the present application, there is also provided an apparatus for data processing, wherein the apparatus comprises:
one or more processors;
a computer-readable medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement a big data based metadata processing method as described above.
Compared with the prior art, the present application first determines a training model for evaluating and determining the relationships between tables of different databases. In a practical application scenario, two target metadata to be processed are acquired, each being the table structure information of a table in a database. Word segmentation and word vector training are performed on each target metadata in turn to obtain a target word vector for each. The word vectors of the two target metadata are then input into the training model for association analysis to obtain an association result between them, which is either "no association" or "association exists". If the two target metadata are associated, the association result further includes their target association relationship. In this way, the table structure information of database tables is analyzed to obtain the association relationship between two target metadata from different tables, reflecting the inter-table relationships of the database, so that a user gains a clear understanding of the table structure of an unfamiliar database, which facilitates the subsequent use of that database.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a diagram illustrating an influence domain of data warehouse metadata in the prior art;
FIG. 2 illustrates a flow diagram of a big data based metadata processing method in accordance with an aspect of the subject application;
FIG. 3 illustrates a block diagram of a segmentation tool in a big data based metadata processing approach in accordance with an aspect of the subject application;
FIG. 4 illustrates a training diagram of a training model used to evaluate and determine relationships between tables of different databases for training in a big data based metadata processing method according to an aspect of the subject application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
According to an embodiment of one aspect of the present application, a big-data-based metadata processing method is provided; the specific interaction flow is shown in fig. 2. The method involves a client and a server of a full-text search engine, where the client may preferably be a Rest High Level Client. It includes steps S11, S12, S13, and S14 executed by the client, as follows:
Step S11: determine a training model for evaluating and determining the relationships between tables of different databases. The training model is trained on the table structure information of different tables of different databases, so that in a practical application scenario the inter-table relationship of different tables, that is, the association relationship between them, can be computed by the training model.
Step S12, acquiring two target metadata to be processed, wherein each target metadata is table structure information of a table of a database; here, the two target metadata may be from different tables of the same database, or from different tables of different databases, so that the table structure information of the table of the database to be processed is used as the target metadata to be processed.
And step S13, performing word segmentation processing and word vector training on each target metadata in sequence respectively to obtain a target word vector corresponding to each target metadata.
Step S14, inputting the word vectors corresponding to the two target metadata into the training model for association analysis, and obtaining an association result between the two target metadata, where the association result includes no association and existing association.
And if the association exists, the association result also comprises a target association relation corresponding to the two target metadata.
Through the steps S11 to S14, the table structure information of the tables in the database is analyzed to obtain the association relationship between two target metadata from different tables, so that the inter-table relationship of the tables of the database is embodied, a user can clearly know the table structure of an unknown database, and the use and the utilization of the database are facilitated subsequently.
For example, before computing the inter-table relationships of database tables, a training model Model for evaluating and determining the inter-table relationships of different databases is determined in step S11. In a practical application scenario, step S12 acquires two target metadata to be processed, target metadata Data(target 1) and target metadata Data(target 2), each being the table structure information of a table in a database. In step S13, word segmentation and word vector training are performed in turn on Data(target 1) and Data(target 2), yielding the corresponding target word vectors V(target 1) and V(target 2). In step S14, V(target 1) and V(target 2) are input together into the training model Model for association analysis, yielding an association result between Data(target 1) and Data(target 2) that is either "no association" or "association exists"; when an association exists, the target association relationship between Data(target 1) and Data(target 2) is also output. The training model thus computes and outputs, in a practical application scenario, not only the association result between the two target metadata to be processed but also, when they are associated, the specific target association relationship between them, so that a user gains further insight into the table structures of the database tables corresponding to the two target metadata, which facilitates the subsequent use and protection of the databases.
Following the above embodiment of the present application, the step S11 determines a training model for evaluating and determining relationships between tables of different databases, specifically including:
obtaining a preset number of sample metadata and determining a sample word vector of each sample metadata;
with the two sample metadata as a unit, grouping the preset number of sample metadata in any pairwise combination to obtain each group of training samples;
respectively splicing the sample word vectors of the two sample metadata in each group of training samples to obtain spliced word vectors of each group of training samples;
and performing deep learning and model training on spliced word vectors of all groups of training samples in the preset number of sample data to obtain a training model for judging and determining the relationship between tables of different databases.
For example, in the training process of the training model Model, a preset number of sample metadata for model training is first acquired. Suppose N sample metadata are acquired, where N is a positive integer greater than or equal to 2: sample metadata Data(sample 1), Data(sample 2), ..., Data(sample (N-1)), and Data(sample N). A sample word vector is determined for each, giving, in order: V(sample 1), V(sample 2), ..., V(sample (N-1)), and V(sample N). The N sample metadata are then grouped into all pairwise combinations, two sample metadata per group: for example, Data(sample 1) with Data(sample 2) as one group, Data(sample 1) with Data(sample 3) as one group, Data(sample 2) with Data(sample 3) as one group, ..., Data(sample 1) with Data(sample N) as one group, and Data(sample (N-1)) with Data(sample N) as one group, yielding N(N-1)/2 groups of training samples; that is, any two of the N sample metadata form one training sample. Next, the sample word vectors of the two sample metadata in each group are spliced to obtain the group's spliced word vector: for instance, the spliced word vector of the group formed by Data(sample 1) and Data(sample 2) is V(sample 12), that of Data(sample 1) and Data(sample 3) is V(sample 13), ..., and that of Data(sample (N-1)) and Data(sample N) is V(sample (N-1)N), so that a spliced word vector is obtained for each of the groups formed from the N sample metadata. Finally, deep learning and model training are performed on the spliced word vectors of the N(N-1)/2 groups of training samples to obtain the training model Model for evaluating and determining the inter-table relationships of different databases, thereby realizing the training and determination of the training model.
Next, in the above embodiment of the present application, the obtaining of a preset number of sample metadata and determining a sample word vector of each sample metadata specifically includes:
obtaining a preset number of sample metadata, where each sample metadata is the table structure information of a table in a database. Sample metadata is data about data and can be classified by function into descriptive metadata, structural metadata, administrative metadata, and so on: descriptive metadata mainly provides retrieval and positioning information, used for discovering and identifying specific data; structural metadata mainly records the composition of data and the relationships within it; and administrative metadata provides the information needed to manage data resources. In the model training process, the table structure information of database tables is therefore used as the sample metadata for training the model, achieving the acquisition of sample metadata.
Performing word segmentation processing on each sample metadata to obtain word segmentation information of each sample metadata and the sample metadata after word segmentation;
and performing word vector training respectively based on the word segmentation information of each sample metadata and the segmented sample metadata to obtain a sample word vector of each sample metadata.
In this embodiment, word segmentation of each acquired sample metadata is performed with the jieba word segmentation tool, whose skeleton diagram is shown in fig. 3. The most important component of jieba is its dictionary: if jieba's universal dictionary is used unchanged, the desired segmentation cannot be obtained in a specialized segmentation environment, so the dictionary must be customized. The segmentation then proceeds in three steps. First, the dictionary is loaded to generate a trie-based segmentation model; in a preferred embodiment, taking a Linux system as an example, a cache file is generated when the dictionary is loaded, and because two jieba segmenters are used in the word segmentation environment of the present application (one regional, one for keywords), the related configuration must be changed, otherwise the cache file names are identical and the two cannot be used at the same time. Second, given a sentence to be segmented (for example, a sample metadata to be segmented), the contiguous Chinese and English characters are extracted and the sample metadata is cut into a list of phrases; for each phrase, a DAG built by dictionary lookup and dynamic programming are used to find the maximum-probability path, and characters not found in the dictionary within the DAG are combined into new phrase fragments that are segmented with a Hidden Markov Model (HMM), i.e., the HMM recognizes new words outside the dictionary. Finally, a word generator is created with Python's yield syntax, returning the words one by one, thereby completing the word segmentation of each sample metadata.
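The dictionary-DAG plus dynamic-programming step described above can be illustrated with a toy reimplementation. The dictionary and its probabilities are made up for the example, and the HMM handling of out-of-vocabulary words is omitted (unknown characters simply fall back to single-character pieces); real jieba should be used in practice:

```python
# Toy maximum-probability segmentation in the style of jieba's DAG step:
# for each position, list the dictionary words starting there, then pick
# the path with the highest total log-probability by dynamic programming.
import math

DICT = {"data": 0.2, "base": 0.1, "database": 0.4, "table": 0.3}

def segment(text):
    n = len(text)
    # route[i] = (best log-prob from position i to the end, end of first word)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            if word in DICT:
                candidates.append((math.log(DICT[word]) + route[j][0], j))
        if not candidates:  # out-of-vocabulary: fall back to one character
            candidates.append((math.log(1e-6) + route[i + 1][0], i + 1))
        route[i] = max(candidates)
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(text[i:j])
        i = j
    return words

print(segment("databasetable"))  # ['database', 'table']
```

Note how "database" (probability 0.4) beats the competing split "data" + "base", exactly the effect the maximum-probability path is meant to achieve.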
For example, in a training process for training the Model, a preset number of sample metadata for model training is first acquired. If N sample metadata are acquired, where N may be a positive integer greater than or equal to 2, they are denoted sample metadata Data(sample 1), Data(sample 2), …, Data(sample (N-1)) and Data(sample N). Then each sample metadata Data(sample i), for i = 1, …, N, is subjected to word segmentation processing to obtain its word segmentation information i and its segmented sample metadata Data'(sample i), so that word segmentation processing is performed on every sample metadata. Finally, for each sample metadata Data(sample i), word vector training is performed on the basis of its word segmentation information i and its segmented sample metadata Data'(sample i) to obtain its sample word vector V(sample i), thereby training and determining the sample word vector of each sample metadata.
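The per-sample loop just described can be sketched as follows; the whitespace tokenizer and the count-based vector are stand-ins (assumptions) for the jieba segmentation and the word2vec training used elsewhere in this embodiment:

```python
from collections import Counter

# N sample metadata: table structure information of database tables
# (the strings are illustrative)
samples = ["user_id user_name created_at",
           "order_id user_id amount created_at",
           "product_id product_name price"]

def segment(text):
    """Stand-in for the jieba word segmentation step."""
    return text.split()

def train_word_vector(seg_info):
    """Stand-in for word2vec training: the frequency vector over the
    sample's own frequency-sorted vocabulary."""
    return [count for _, count in seg_info.most_common()]

word_vectors = []
for data in samples:               # Data(sample i)
    tokens = segment(data)         # segmented Data'(sample i)
    seg_info = Counter(tokens)     # word segmentation information i
    word_vectors.append(train_word_vector(seg_info))  # V(sample i)

print(word_vectors[0])  # → [1, 1, 1]
```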
Next, in the foregoing embodiment of the present application, the performing word vector training based on the word segmentation information of each sample metadata and the sample metadata after word segmentation, respectively, to obtain a sample word vector of each sample metadata specifically includes:
respectively counting word segmentation information of each sample metadata, and sequencing all words in each sample metadata according to the sequence of word frequency from high to low to obtain an initial word vector of each word in each sample metadata;
and respectively inputting the initial word vectors of all words in each sample metadata and the corresponding sample metadata after word segmentation into a relevant model for generating word vectors to carry out word vector training, so as to obtain the sample word vectors of each sample metadata.
For example, after the word segmentation information of each sample metadata is obtained, a representation of each sample metadata, that is, its sample word vector, needs to be obtained. It should be noted that a word vector is a vector obtained by mapping a word into a semantic space. The related model for generating word vectors, word2vec, is an existing tool for training word vectors; word2vec is implemented with a neural network and, taking the context of the text into account, provides two models, CBOW and Skip-gram, which are trained similarly: the Skip-gram model takes a word as input and predicts its surrounding context, while the CBOW model takes the context of a word as input and predicts the word itself. In this embodiment, the preprocessing for the sample word vector training of each sample metadata comprises the following steps. First, a vocabulary table is generated from the text of the word segmentation information of each input sample metadata and the frequency of each word is counted; for each sample metadata, all words are sorted by word frequency from high to low, each word appearing only once in the sorted sequence, which forms the vocabulary table; sorting all words by word frequency thus implements deduplication of the repeated words. Each word then has a one-hot initial word vector whose dimension equals the number of deduplicated words: if the word appears in the vocabulary table, the component at its position in the vocabulary table is 1 and all other components are 0; if the word does not appear in the vocabulary table, the vector is all 0.
Then, a one-hot initial word vector is generated for each word of the text of the word segmentation information of each input sample metadata, where the position of the 1 in the one-hot initial word vector is determined by the word's position in the frequency ranking. For example, if the sample metadata contains 80 distinct words sorted by word frequency from high to low, the one-hot initial word vector of the word with the highest frequency has dimension 80, with the first component 1 and all other components 0; likewise, the one-hot initial word vector of the word ranked 30th by frequency has dimension 80, with the 30th component 1 and all other components 0, so that the one-hot initial word vector of each word is obtained. Next, the dimension of the one-hot initial word vector of each word in each sample metadata is determined as the number of deduplicated words contained in that sample metadata. Finally, the initial word vectors of all words in each sample metadata, together with the corresponding segmented sample metadata, are input into the related model for generating word vectors, that is, word vector training with the word2vec function, to obtain the sample word vector of each sample metadata, thereby implementing the training and generation of the sample word vector of each sample metadata.
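A minimal sketch of this one-hot construction; the token list is illustrative, and the alphabetical tie-break for equal frequencies is an assumption the embodiment does not specify:

```python
from collections import Counter

def one_hot_vocab(tokens):
    """Sort words by frequency from high to low (deduplicating in the
    process) and give each word a one-hot vector whose dimension is the
    deduplicated vocabulary size and whose 1 sits at the word's rank."""
    freq = Counter(tokens)
    # frequency descending; alphabetical tie-break (an assumption) for determinism
    vocab = sorted(freq, key=lambda w: (-freq[w], w))
    dim = len(vocab)
    vectors = {}
    for rank, word in enumerate(vocab):
        v = [0] * dim
        v[rank] = 1
        vectors[word] = v
    return vocab, vectors

tokens = ["id", "name", "id", "time", "id", "name"]
vocab, vectors = one_hot_vocab(tokens)
print(vocab)            # → ['id', 'name', 'time']
print(vectors["name"])  # → [0, 1, 0]
```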
Next, in the above embodiment of the present application, the deep learning and model training of the spliced word vectors of all the groups of training samples in the preset number of sample data to obtain a training model for evaluating and determining the relationships between tables of different databases includes:
performing deep learning on spliced word vectors of all groups of training samples in the preset number of sample data respectively to obtain sample characterization vectors of each group of training samples;
respectively calculating the characteristic value of each group of training samples based on the sample characteristic vector of each group of training samples;
and performing model training based on the characteristic values of all the groups of training samples to obtain a training model for judging and determining the relationships among the tables of different databases.
For example, after the sample word vector of each sample metadata is obtained, a model needs to be trained to determine whether, and which kind of, relationship exists between every two sample metadata; fig. 4 shows the training flow of the training model for evaluating and determining the relationships between tables of different databases. First, the sample word vectors of two sample metadata are spliced to obtain the spliced word vector of the training sample formed by those two sample metadata; then, the spliced word vector of each group of training samples is fed into a multilayer neural network for deep characterization analysis and learning, and subsequently into an attention mechanism to further refine the characterization, so as to obtain the sample characterization vector of each group of training samples; finally, model training is performed on the characterization values of the N(N-1)/2 groups of training samples formed by the N sample metadata to obtain the training Model for evaluating and determining the inter-table relationships of different databases, thereby training and determining that training model.
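The splice-then-characterize step can be sketched as a single forward pass; the layer sizes, the softmax-style attention over hidden units, and the random (untrained) weights are assumptions for illustration, since the application states only that Keras is used to train such a network:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HIDDEN = 8, 16                       # sizes are illustrative
W1 = rng.normal(size=(HIDDEN, 2 * DIM))   # dense layer on the spliced vector
W2 = rng.normal(size=HIDDEN)              # readout to the characterization value

def characterize(v1, v2):
    """Splice two sample word vectors, run them through a dense layer,
    re-weight the hidden units with a softmax attention, and squash the
    readout to a characterization value in (0, 1)."""
    x = np.concatenate([v1, v2])          # spliced word vector
    h = np.tanh(W1 @ x)                   # deep characterization
    attn = np.exp(h) / np.exp(h).sum()    # attention weights over hidden units
    context = attn * h                    # attention-refined characterization
    return 1.0 / (1.0 + np.exp(-(W2 @ context)))

score = characterize(rng.normal(size=DIM), rng.normal(size=DIM))
print(0.0 < score < 1.0)  # → True
```

In training, the weights would be fitted on the N(N-1)/2 labeled pairs rather than drawn at random as here.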
Next, in the above embodiment of the present application, the performing model training based on the characterization values of all the sets of training samples to obtain a training model for evaluating and determining the relationships between tables of different databases includes:
respectively carrying out the following operations on each training sample to obtain a training model for judging and determining the relationship between tables of different databases:
and judging whether the characteristic value of each training sample is greater than a preset associated characteristic threshold value, wherein the preset associated characteristic threshold value can take any value between 0 and 1, and in a preferred embodiment of the application, the preset associated characteristic threshold value is preferably 0.5.
If so, associating two sample metadata in the training sample, calculating the similarity between the two associated sample metadata, setting a corresponding association relationship for the two associated sample metadata based on the similarity, and correspondingly setting different association relationships for different values or value intervals of the similarity;
if not, the two sample metadata in the training samples are not associated.
In a preferred embodiment of the present application, after the characterization value of each group of training samples is obtained through deep learning, each training sample is subjected to the following operations to obtain a training model for evaluating and determining the relationships between tables of different databases: the characterization value of each training sample is input into the discriminator shown in fig. 4, and it is judged whether the characterization value is greater than the preset associated characteristic threshold of 0.5. If so, the characterization value of the training sample is greater than 0.5 and less than or equal to 1, and the two sample metadata in the training sample are associated; when they are associated, the similarity between the two associated sample metadata is calculated, and a corresponding association relationship is set for them based on that similarity, with different association relationships set for different similarity values or value intervals. If not, the characterization value is greater than or equal to zero and less than or equal to 0.5, and the two sample metadata in the training sample are determined not to be associated, thereby training and determining the training model for evaluating and determining the relationships between tables of different databases.
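A sketch of the discriminator logic; the particular similarity intervals and relation labels below are illustrative assumptions, since the application states only that different similarity values or value intervals map to different association relationships:

```python
def discriminate(characterization, similarity, threshold=0.5):
    """Associate the pair when the characterization value exceeds the
    preset associated characteristic threshold, then pick an association
    relationship from the similarity (labels/intervals are assumed)."""
    if not (0.0 <= characterization <= 1.0):
        raise ValueError("characterization value must lie in [0, 1]")
    if characterization <= threshold:
        return None                      # the two sample metadata are not associated
    if similarity >= 0.9:
        return "equivalent field"
    if similarity >= 0.6:
        return "foreign-key candidate"
    return "weak association"

print(discriminate(0.4, 0.95))  # → None (at or below the 0.5 threshold)
print(discriminate(0.8, 0.95))  # → equivalent field
```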
Next, in the foregoing embodiment of the present application, step S13, performing word segmentation processing and word vector training on each target metadata in sequence to obtain a target word vector corresponding to each target metadata, specifically includes:
performing word segmentation processing on each target metadata to obtain target word segmentation information of each target metadata and the segmented target metadata;
and performing word vector training respectively based on the target word segmentation information of each target metadata and the corresponding segmented target metadata to obtain a target word vector of each target metadata.
In an actual application scenario, after the two target metadata to be processed, target metadata Data(target 1) and target metadata Data(target 2), are acquired in step S12, in step S13 the two target metadata are each subjected to word segmentation processing to obtain the target word segmentation information and the segmented target metadata Data'(target 1) of Data(target 1), and the target word segmentation information and the segmented target metadata Data'(target 2) of Data(target 2); then, the target word segmentation information and segmented target metadata of each target metadata are input into the related model for generating word vectors, that is, word vector training with the word2vec function, to obtain the target word vector V(target 1) corresponding to Data(target 1) and the target word vector V(target 2) corresponding to Data(target 2), thereby implementing the training and generation of the target word vector of each target metadata to be processed.
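The steps S12 to S14 chain together as follows; the tokenizer, the shared-vocabulary frequency vector, and the overlap-score model are placeholders (assumptions) for the jieba, word2vec, and trained Model components of this embodiment:

```python
from collections import Counter

def segment(text):
    """Stand-in for the jieba segmentation of step S13."""
    return text.split()

def word_vector(tokens, vocab):
    """Stand-in for word2vec: frequency vector over a shared vocabulary."""
    freq = Counter(tokens)
    return [freq[w] for w in vocab]

def model(v1, v2):
    """Stand-in for the trained Model: a normalized overlap score."""
    overlap = sum(min(a, b) for a, b in zip(v1, v2))
    total = max(sum(v1), sum(v2))
    return overlap / total if total else 0.0

target1 = "user_id user_name created_at"     # Data(target 1)
target2 = "order_id user_id amount"          # Data(target 2)
t1, t2 = segment(target1), segment(target2)  # Data'(target 1), Data'(target 2)
vocab = sorted(set(t1) | set(t2))
v1, v2 = word_vector(t1, vocab), word_vector(t2, vocab)  # V(target 1), V(target 2)
score = model(v1, v2)
result = "existing association" if score > 0.5 else "no association"
print(result)  # → no association (the overlap score here is about 0.33)
```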
Following the foregoing embodiment of the present application, in step S14, word vector training is performed based on the target word segmentation information of each target metadata and the corresponding segmented target metadata, respectively, to obtain a target word vector of each target metadata, which specifically includes:
respectively counting the target word segmentation information of each target metadata, and sequencing all target words in each target metadata according to the sequence of word frequency from high to low to obtain an initial target word vector of each target word in each target metadata;
and respectively inputting the initial target word vectors of all target words in each target metadata and the corresponding segmented target metadata into a relevant model for generating word vectors to perform word vector training, so as to obtain the target word vectors of each target metadata.
For example, after the word segmentation information of each target metadata is obtained, a representation of each target metadata, that is, its target word vector, needs to be obtained. It should be noted that a word vector is a vector obtained by mapping a word into a semantic space. In this embodiment, the preprocessing for the target word vector training of each target metadata comprises the following steps. First, a vocabulary table is generated from the text of the target word segmentation information of each input target metadata and the frequency of each target word is counted; for each target metadata, all target words are sorted by word frequency from high to low, each target word appearing only once in the sorted sequence, which forms the vocabulary table; sorting all target words by word frequency thus implements deduplication of the repeated target words in the target metadata. Each target word then has a one-hot initial target word vector whose dimension equals the number of deduplicated target words: if the target word appears in the vocabulary table, the component at its position in the vocabulary table is 1 and all other components are 0; if the target word does not appear in the vocabulary table, the vector is all 0.
Then, a one-hot initial target word vector is generated for each target word of the text of the target word segmentation information of each input target metadata; next, the dimension of the one-hot initial target word vector of each target word in each target metadata is determined as the number of deduplicated target words contained in that target metadata; finally, the initial target word vectors of all target words in each target metadata, together with the corresponding segmented target metadata, are input into the related model for generating word vectors, that is, word vector training with the word2vec function, to obtain the target word vector of each target metadata, thereby implementing the training and generation of the target word vector of each target metadata.
In the above embodiments of the present application, the open source deep learning framework Keras and the programming language Python are used to perform intelligent analysis processing on the metadata in the database, and multiple CPUs are used to perform parallel accelerated training to achieve the best implementation effect. The present application is not limited to the specific embodiments described above, and those skilled in the art can make various changes or modifications within the scope of the claims without affecting the essence of the present application.
According to another aspect of the present application, there is also provided a non-volatile storage medium having computer-readable instructions stored thereon, which, when executed by a processor, cause the processor to implement the big-data based metadata processing method as described above.
According to another aspect of the present application, there is also provided an apparatus for data processing, wherein the apparatus comprises:
one or more processors;
a computer-readable medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement a big data based metadata processing method as described above.
Here, for details of each embodiment of the device for data processing, reference may be specifically made to corresponding portions of the above embodiment of the metadata processing method based on big data, and details are not described here again.
In summary, the present application determines a training model for evaluating and determining the relationships between tables of different databases; in an actual application scenario, two target metadata to be processed are acquired, where each target metadata is the table structure information of a table of a database; word segmentation processing and word vector training are performed on each target metadata in sequence to obtain the target word vector corresponding to each target metadata; the word vectors corresponding to the two target metadata are input into the training model for association relation analysis to obtain the association result between the two target metadata, where the association result is either no association or an existing association; if the two target metadata to be processed are associated, the association result further comprises the target association relationship corresponding to the two target metadata. By analyzing the table structure information of the tables in a database to obtain the association relationship between two target metadata from different tables, the inter-table relationships of the tables in the database are made explicit, so that a user can clearly understand the table structure of an unknown database, which facilitates subsequent use of the database.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (10)

1. A big data-based metadata processing method, wherein the method comprises the following steps:
determining a training model for judging and determining the relationship between tables of different databases;
acquiring two target metadata to be processed, wherein each target metadata is table structure information of a table of a database;
performing word segmentation processing and word vector training on each target metadata in sequence to obtain a target word vector corresponding to each target metadata;
inputting word vectors corresponding to the two target metadata into the training model for association relation analysis to obtain an association result between the two target metadata, wherein the association result comprises no association and existing association;
and if the association exists, the association result also comprises a target association relation corresponding to the two target metadata.
2. The method of claim 1, wherein determining a training model for evaluating and determining the inter-table relationships of different databases comprises:
obtaining a preset number of sample metadata and determining a sample word vector of each sample metadata;
with the two sample metadata as a unit, grouping the preset number of sample metadata in any pairwise combination to obtain each group of training samples;
respectively splicing the sample word vectors of the two sample metadata in each group of training samples to obtain spliced word vectors of each group of training samples;
and performing deep learning and model training on spliced word vectors of all groups of training samples in the preset number of sample data to obtain a training model for judging and determining the relationship between tables of different databases.
3. The method of claim 2, wherein the obtaining a preset number of sample metadata and determining a sample word vector for each of the sample metadata comprises:
obtaining a preset amount of sample metadata, wherein each sample metadata is table structure information of a table of a database;
performing word segmentation processing on each sample metadata to obtain word segmentation information of each sample metadata and the sample metadata after word segmentation;
and performing word vector training respectively based on the word segmentation information of each sample metadata and the segmented sample metadata to obtain a sample word vector of each sample metadata.
4. The method of claim 3, wherein the performing word vector training based on the word segmentation information of each sample metadata and the segmented sample metadata to obtain a sample word vector of each sample metadata comprises:
respectively counting word segmentation information of each sample metadata, and sequencing all words in each sample metadata according to the sequence of word frequency from high to low to obtain an initial word vector of each word in each sample metadata;
and respectively inputting the initial word vectors of all words in each sample metadata and the corresponding sample metadata after word segmentation into a relevant model for generating word vectors to carry out word vector training, so as to obtain the sample word vectors of each sample metadata.
5. The method according to claim 2, wherein the deep learning and model training of the concatenated word vectors of all the sets of the training samples in the preset amount of sample data to obtain the training model for evaluating and determining the inter-table relationship of different databases comprises:
performing deep learning on spliced word vectors of all groups of training samples in the preset number of sample data respectively to obtain sample characterization vectors of each group of training samples;
respectively calculating the characteristic value of each group of training samples based on the sample characteristic vector of each group of training samples;
and performing model training based on the characteristic values of all the groups of training samples to obtain a training model for judging and determining the relationships among the tables of different databases.
6. The method of claim 5, wherein the model training based on the characterization values of all the sets of training samples to obtain a training model for evaluating and determining the inter-table relationship of different databases comprises:
respectively carrying out the following operations on each training sample to obtain a training model for judging and determining the relationship between tables of different databases:
judging whether the characteristic value of each training sample is larger than a preset associated characteristic threshold value or not,
if so, associating two sample metadata in the training sample, calculating the similarity between the two associated sample metadata, setting a corresponding association relationship for the two associated sample metadata based on the similarity, and correspondingly setting different association relationships for different values or value intervals of the similarity;
if not, the two sample metadata in the training samples are not associated.
7. The method according to any one of claims 1 to 6, wherein the performing word segmentation processing and word vector training on each of the target metadata in sequence to obtain a target word vector corresponding to each of the target metadata respectively comprises:
performing word segmentation processing on each target metadata to obtain target word segmentation information of each target metadata and the segmented target metadata;
and performing word vector training respectively based on the target word segmentation information of each target metadata and the corresponding segmented target metadata to obtain a target word vector of each target metadata.
8. The method of claim 7, wherein the performing word vector training based on the target word segmentation information of each target metadata and the corresponding segmented target metadata to obtain the target word vector of each target metadata comprises:
respectively counting the target word segmentation information of each target metadata, and sequencing all target words in each target metadata according to the sequence of word frequency from high to low to obtain an initial target word vector of each target word in each target metadata;
and respectively inputting the initial target word vectors of all target words in each target metadata and the corresponding segmented target metadata into a relevant model for generating word vectors to perform word vector training, so as to obtain the target word vectors of each target metadata.
9. A non-transitory storage medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement the method of any one of claims 1 to 8.
10. An apparatus for data processing, wherein the apparatus comprises:
one or more processors;
a computer-readable medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
CN202110517886.7A 2021-05-12 2021-05-12 Metadata processing method and device based on big data Active CN113312396B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110517886.7A CN113312396B (en) 2021-05-12 2021-05-12 Metadata processing method and device based on big data


Publications (2)

Publication Number Publication Date
CN113312396A true CN113312396A (en) 2021-08-27
CN113312396B CN113312396B (en) 2024-04-19


