CN116662434A - Multi-source heterogeneous big data processing system - Google Patents

Publication number
CN116662434A
Authority
CN
China
Prior art keywords
main data
feature
data
features
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310736600.3A
Other languages
Chinese (zh)
Other versions
CN116662434B (en)
Inventor
张晶
董哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Weijia Information Technology Co ltd
Original Assignee
Hebei Weijia Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei Weijia Information Technology Co ltd
Priority to CN202310736600.3A
Publication of CN116662434A
Application granted
Publication of CN116662434B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/252 Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of big data and discloses a multi-source heterogeneous big data processing system comprising: a category feature generation module that generates main data category features from the field names of main data, each field name corresponding to one main data category feature; a data source feature generation module that generates data source features from the original data set linked to the main data; a generated feature extractor that randomly extracts characters and/or words from the original data set to generate unit feature vectors and then combines the unit feature vectors into generated features; a model generation module that generates a main data generation model; and a main data generation module that generates, for an original data set whose main data is to be generated, the main data and the field names corresponding to that main data. The invention can automatically generate main data matching big data of limited source scope, and can structure and unify the big data through the main data.

Description

Multi-source heterogeneous big data processing system
Technical Field
The invention relates to the technical field of big data, in particular to a multi-source heterogeneous big data processing system.
Background
Big data comprises structured, semi-structured and unstructured data, and unstructured data accounts for a growing share of it. For big data with a limited source scope, such as regional government big data, the need to structure and unify the data outweighs the need to mine it. However, structuring and unifying big data through manually extracted features is time-consuming.
Disclosure of Invention
The invention provides a multi-source heterogeneous big data processing system that addresses a technical problem in the related art: structuring and unifying big data through manually extracted features is time-consuming.
The invention provides a multi-source heterogeneous big data processing system, which comprises:
a category feature generation module that generates main data category features from the field names of main data, each field name corresponding to one main data category feature;
a data source feature generation module that generates data source features from the original data set linked to the main data;
a generated feature extractor that randomly extracts characters and/or words from the original data set to generate unit feature vectors and then combines the unit feature vectors into generated features;
a model generation module that generates a main data generation model, wherein the main data generation model comprises a feature synthesis module, a second feature generator, a first neural network and a second neural network; the feature synthesis module synthesizes the main data category features with the generated features to produce basic features; the first neural network takes the basic features as input and outputs first features; the second feature generator randomly selects N pieces of main data from the main data set, generates a main data feature for each selected piece, and synthesizes all of the generated main data features with the main data category features to produce a second feature; the second feature and the first features are input into the second neural network, whose output is mapped to a classification space containing two classification labels, one indicating that the input is a second feature and the other indicating that the input is a first feature;
a main data generation module that passes the field names of main data entered by a user to the category feature generation module to generate main data category features, synthesizes the main data category features with generated features produced from the original data set whose main data is to be generated to obtain basic features, inputs the basic features into the first neural network of the main data generation model, and obtains, from the first features output by the first neural network, the main data of that original data set together with the field names corresponding to the main data.
Further, the original data set linked to the main data refers to the original data set that needs to be associated with the main data.
Further, the first neural network and the second neural network are both multi-layer perceptrons.
Further, the main data feature is spliced after the main data category feature when the main data feature and the main data category feature are synthesized.
Further, when the main data category features are synthesized with the generated features, random feature vectors are spliced after the main data category features.
Further, the first feature and the second feature have the same dimensions, and each is expressed in matrix form as $U = (u_{ji})$, where $u_{1i}$, the i-th element of the first row of the matrix U, represents the i-th main data category feature; $u_{ji}$, the element in the j-th row and i-th column of the matrix U, represents the field of the j-th main data corresponding to the i-th main data category feature; m represents the total number of main data; and n represents the total number of main data category features of one piece of main data.
Further, the second neural network outputs through the softmax layer, and the output value is a probability value.
Further, the first neural network and the second neural network are jointly trained, and the training loss function is:
[loss function equation not recoverable]
The terms of the loss function are: the loss value; the number of training samples in the training set; y, a preset constant; the probability that the second neural network outputs for the classification label corresponding to the second feature when the second feature of the t-th training sample is input; and the probability that the second neural network outputs for the classification label corresponding to the first feature when the g-th first feature of the t-th training sample is input.
Further, the training sample of the joint training is derived from the original data set which has constructed the main data, and the generated feature extractor extracts a plurality of times from one of the original data sets as the training sample to obtain a plurality of generated features, so that a plurality of basic features can be synthesized, and a plurality of first features are generated through the first neural network.
Further, the main data generating module generates a plurality of generating features from an original data set of main data to be generated, synthesizes a plurality of basic features respectively, inputs the synthesized basic features into the first neural network respectively to obtain a plurality of groups of main data, and deletes repeated main data from the plurality of groups of main data to obtain a final main data set.
The invention has the beneficial effects that: the invention can automatically generate the main data matched with the big data with limited source range, and the big data is structured and unified through the main data.
Drawings
FIG. 1 is a schematic block diagram of a multi-source heterogeneous big data processing system of the present invention.
Reference numerals in the figure: 101, category feature generation module; 102, data source feature generation module; 103, generated feature extractor; 104, model generation module; 105, main data generation module.
Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It is to be understood that these embodiments are merely discussed so that those skilled in the art may better understand and implement the subject matter described herein and that changes may be made in the function and arrangement of the elements discussed without departing from the scope of the disclosure herein. Various examples may omit, replace, or add various procedures or components as desired. In addition, features described with respect to some examples may be combined in other examples as well.
As shown in fig. 1, a multi-source heterogeneous big data processing system includes:
a category feature generation module 101 that generates main data category features based on field names of main data, one of which corresponds to each of the main data category features;
a data source feature generation module 102 that generates data source features based on the original data set linked to the main data;
the original data set linked to the main data is the original data set that needs to be associated with the main data; conversely, the information in the main data is derived from that original data set.
A generated feature extractor 103 for randomly extracting characters and/or words from the original dataset to generate unit feature vectors, and then combining the unit feature vectors to obtain generated features;
a model generation module 104 for generating a main data generation model;
the main data generation model comprises a feature synthesis module, a second feature generator, a first neural network and a second neural network, wherein the feature synthesis module is used for synthesizing main data category features and generation features to generate basic features, and the first neural network inputs the basic features and then outputs the first features;
the second feature generator randomly selects N pieces of main data from the main data set, generates a main data feature for each piece of extracted main data, and synthesizes all the generated main data features and main data category features to generate a second feature;
the second feature and the first feature are input into a second neural network, the output of the second neural network is mapped to a classification space, and the classification space contains two classification labels which respectively represent the input as the second feature and the input as the first feature.
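As an illustration of the generated feature extractor 103 described above, the following is a minimal sketch only. The source does not specify how a unit feature vector is computed or how unit vectors are combined, so the hash-based embedding in `unit_feature_vector`, the concatenation in `extract_generated_feature`, the function names, the vector dimension, and the sample text are all assumptions made for this example.

```python
import hashlib
import random
from typing import List, Optional


def unit_feature_vector(token: str, dim: int = 4) -> List[float]:
    """Map a token to a fixed-length vector (hypothetical hash-based embedding)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    # Scale each of the first `dim` bytes of the digest into [0, 1].
    return [digest[i] / 255.0 for i in range(dim)]


def extract_generated_feature(raw_text: str, num_units: int, dim: int = 4,
                              rng: Optional[random.Random] = None) -> List[float]:
    """Randomly sample words from raw_text and concatenate their unit vectors."""
    rng = rng or random.Random()
    words = raw_text.split()
    sampled = [rng.choice(words) for _ in range(num_units)]
    feature: List[float] = []
    for word in sampled:
        feature.extend(unit_feature_vector(word, dim))
    return feature


# Hypothetical raw record standing in for an original data set.
raw = "Permit 2023-118 issued to Li Wei by the district housing bureau"
generated = extract_generated_feature(raw, num_units=3, rng=random.Random(0))
print(len(generated))  # 3 sampled words x 4 dimensions = 12
```

Because the sampling is random, repeated calls on the same original data set yield different generated features, which is what allows several basic features to be synthesized from one data set.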
The first neural network and the second neural network may be conventional neural networks. In one embodiment of the invention, the first neural network and the second neural network are both multi-layer perceptrons;
in one embodiment of the invention, the first neural network and the second neural network are both convolutional neural networks.
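In one embodiment of the invention, features are synthesized by stitching (concatenating) feature vectors; e.g., for two vectors $a = (a_1, \ldots, a_p)$ and $b = (b_1, \ldots, b_q)$, the result after synthesis is $(a_1, \ldots, a_p, b_1, \ldots, b_q)$.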
Splicing the main data characteristics after the main data category characteristics when the main data characteristics and the main data category characteristics are synthesized;
splicing random feature vectors after the main data category features when the main data category features and the generated features are synthesized;
the first feature and the second feature have the same dimensions, and each is expressed in matrix form as $U = (u_{ji})$, where $u_{1i}$, the i-th element of the first row of the matrix U, represents the i-th main data category feature; $u_{ji}$ (j > 1), the element in the j-th row and i-th column of the matrix U, represents the field of the j-th main data corresponding to the i-th main data category feature; m represents the total number of main data; and n represents the total number of main data category features of one piece of main data.
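The stitching and matrix layout described above can be sketched as follows. This is an illustrative example under assumptions: the helper names `stitch` and `build_feature_matrix` are hypothetical, and plain strings stand in for the feature values that the system would actually place in each cell.

```python
def stitch(*vectors):
    """Concatenate feature vectors in the order given."""
    result = []
    for vec in vectors:
        result.extend(vec)
    return result


def build_feature_matrix(category_features, main_data_rows):
    """First row holds the category features; each later row holds one piece of main data."""
    n = len(category_features)
    for row in main_data_rows:
        if len(row) != n:
            raise ValueError("each piece of main data needs one field per category feature")
    return [list(category_features)] + [list(row) for row in main_data_rows]


# Category features come first, and the other feature is spliced after them.
basic_feature = stitch([0.1, 0.2], [0.7, 0.8, 0.9])
print(basic_feature)  # [0.1, 0.2, 0.7, 0.8, 0.9]

# Placeholder strings stand in for the category and field features.
U = build_feature_matrix(
    ["name", "id_number", "district"],
    [["Li Wei", "A-001", "Qiaoxi"], ["Zhang Min", "A-002", "Chang'an"]],
)
print(U[2][1])  # field of the second piece of main data under the second category: A-002
```

Keeping the category features in the first row means that the column index alone ties every field back to its field name.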
For the textual parts of the data (including the field names of the main data), word vectors are obtained with a Skip-Gram model. A main data feature is generated by directly combining the word vectors extracted from the main data with general vectors, where general features are those contents of the main data that can be used directly as vector content.
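For readers unfamiliar with Skip-Gram, the sketch below shows only how its training pairs are formed: each word is paired with the words inside a context window around it, and the model is then trained to predict the context word from the center word. The window size and the sample tokens are assumptions; the source does not state them, and training the embedding itself is out of scope here.

```python
def skip_gram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical tokens taken from field names of main data.
pairs = skip_gram_pairs(["resident", "name", "id", "number"], window=1)
print(len(pairs))  # 6
```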
To keep the dimensions of the generated second features consistent, the scope of the main data may be limited; e.g., all main data in the main data set belong to the same class and are generated from the same main data table;
together, the first neural network and the second neural network form a generative adversarial network.
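To make the roles of the two networks concrete, the following sketch runs one forward pass with the first neural network acting as a generator and the second as a two-label classifier, both as multi-layer perceptrons. The layer sizes, the ReLU activation, the random weights, and the names `mlp_forward` and `random_layer` are assumptions for illustration; the source does not give an architecture.

```python
import random


def random_layer(n_in, n_out, rng):
    """Create one dense layer as (weights, biases) with small random weights."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def mlp_forward(x, layers):
    """Forward pass through a multi-layer perceptron with ReLU on hidden layers."""
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU, hidden layers only
    return x


rng = random.Random(0)
first_network = [random_layer(5, 8, rng), random_layer(8, 6, rng)]   # basic feature -> first feature
second_network = [random_layer(6, 8, rng), random_layer(8, 2, rng)]  # feature -> two label scores

basic_feature = [0.1, 0.2, 0.7, 0.8, 0.9]
first_feature = mlp_forward(basic_feature, first_network)
label_scores = mlp_forward(first_feature, second_network)
print(len(first_feature), len(label_scores))  # 6 2
```

The second network must accept first features and second features interchangeably, which is why the two are required to share the same dimensions.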
In one embodiment of the invention, the first neural network and the second neural network are jointly trained, and the training loss function is:
[loss function equation not recoverable]
The terms of the loss function are: the loss value; the number of training samples in the training set; y, a preset constant; the probability that the second neural network outputs for the classification label corresponding to the second feature when the second feature of the t-th training sample is input; and the probability that the second neural network outputs for the classification label corresponding to the first feature when the g-th first feature of the t-th training sample is input;
the second neural network outputs through a softmax (normalized exponential function) layer, and the output value is a probability value.
The default value of y is 12.
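The sketch below shows how softmax probabilities of the kind described above can feed a training loss. It is not the loss function of this patent, whose exact form is not legible in this text: it substitutes the standard cross-entropy objective commonly used for a two-label discriminator, and it does not model the constant y. The label order (`SECOND`, `FIRST`) and the function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax: turns raw scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label order: index 0 = "input is a second feature", index 1 = "input is a first feature".
SECOND, FIRST = 0, 1


def discriminator_loss(second_logits, first_logits):
    """Average cross-entropy over the training samples (standard form, not the patent's).

    second_logits[t]   : scores for the second feature of the t-th sample
    first_logits[t][g] : scores for the g-th first feature of the t-th sample
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for real, fakes in zip(second_logits, first_logits):
        total -= math.log(softmax(real)[SECOND] + eps)
        total -= sum(math.log(softmax(f)[FIRST] + eps) for f in fakes) / len(fakes)
    return total / len(second_logits)


undecided = discriminator_loss([[0.0, 0.0]], [[[0.0, 0.0], [0.0, 0.0]]])
confident = discriminator_loss([[9.0, -9.0]], [[[-9.0, 9.0], [-9.0, 9.0]]])
print(undecided > confident)  # True
```

A classifier that cannot tell the two kinds of feature apart assigns probability 0.5 to each label and incurs a higher loss than one that labels both kinds correctly with high confidence.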
The training samples of the joint training are derived from the original dataset from which the master data has been constructed. The generated feature extractor 103 extracts a plurality of generated features from one original data set as a training sample a plurality of times, so that a plurality of basic features can be synthesized, and a plurality of first features are generated through a first neural network;
the original data set to be processed and the training samples of the training set are generally derived from the same type of data source, for example, regional government data.
The category feature generation module 101 generates a main data category feature based on field names of main data of an original data set of the training set.
A main data generation module 105 for inputting a field name of main data input by a user into the category feature generation module 101 to generate a main data category feature; and synthesizing the main data category characteristics with the generated characteristics generated from the original data set of the main data to be generated, inputting the basic characteristics into a first neural network of the main data generation model, and obtaining the main data of the original data set of the main data to be generated and the field names corresponding to the main data based on the first characteristics generated by the first neural network.
The fields of the generated main data need to be mapped with the corresponding field names.
The field names obtained from the first features output by the first neural network may differ from the field names of the main data entered by the user.
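The mapping between generated fields and field names can be pictured as pairing each field with the name in the same column. The sketch below is a hypothetical illustration; `label_fields` and the sample values are not from the source.

```python
def label_fields(field_names, field_values):
    """Pair each generated field with the field name in the same position."""
    if len(field_names) != len(field_values):
        raise ValueError("field names and field values must have the same length")
    return dict(zip(field_names, field_values))


record = label_fields(["name", "id_number", "district"], ["Li Wei", "A-001", "Qiaoxi"])
print(record["district"])  # Qiaoxi
```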
In one embodiment of the present invention, the main data generating module 105 generates a plurality of generating features from the original data set of the main data to be generated, synthesizes the plurality of basic features respectively, inputs the synthesized plurality of basic features into the first neural network respectively to obtain a plurality of groups of main data, and deletes the repeated main data from the plurality of groups of main data to obtain a final main data set.
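The final merging step of this embodiment can be sketched as an order-preserving de-duplication across the groups of main data. The function name `merge_main_data` and the sample records are hypothetical, and exact equality of all fields is assumed as the criterion for "repeated" main data, which the source does not define.

```python
def merge_main_data(groups):
    """Merge several groups of main data, keeping the first occurrence of each record."""
    seen = set()
    merged = []
    for group in groups:
        for record in group:
            key = tuple(record)  # records count as duplicates when all fields match
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


groups = [
    [("Li Wei", "A-001"), ("Zhang Min", "A-002")],
    [("Zhang Min", "A-002"), ("Wang Fang", "A-003")],
]
final_set = merge_main_data(groups)
print(final_set)
# [('Li Wei', 'A-001'), ('Zhang Min', 'A-002'), ('Wang Fang', 'A-003')]
```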
The embodiments have been described above, but the invention is not limited to the specific implementations described, which are illustrative rather than restrictive; many variations that those of ordinary skill in the art can make with the benefit of this disclosure fall within its scope.

Claims (10)

1. A multi-source heterogeneous big data processing system, comprising:
a category feature generation module that generates main data category features from the field names of main data, each field name corresponding to one main data category feature;
a data source feature generation module that generates data source features from the original data set linked to the main data;
a generated feature extractor that randomly extracts characters and/or words from the original data set to generate unit feature vectors and then combines the unit feature vectors into generated features;
a model generation module that generates a main data generation model, wherein the main data generation model comprises a feature synthesis module, a second feature generator, a first neural network and a second neural network; the feature synthesis module synthesizes the main data category features with the generated features to produce basic features; the first neural network takes the basic features as input and outputs first features; the second feature generator randomly selects N pieces of main data from the main data set, generates a main data feature for each selected piece, and synthesizes all of the generated main data features with the main data category features to produce a second feature; the second feature and the first features are input into the second neural network, whose output is mapped to a classification space containing two classification labels, one indicating that the input is a second feature and the other indicating that the input is a first feature; and
a main data generation module that passes the field names of main data entered by a user to the category feature generation module to generate main data category features, synthesizes the main data category features with generated features produced from the original data set whose main data is to be generated to obtain basic features, inputs the basic features into the first neural network of the main data generation model, and obtains, from the first features output by the first neural network, the main data of that original data set together with the field names corresponding to the main data.
2. A multi-source heterogeneous big data processing system according to claim 1, wherein the original data set linked to the main data is the original data set required to be associated with the main data.
3. The multi-source heterogeneous big data processing system of claim 1, wherein the first neural network and the second neural network are each a multi-layer perceptron.
4. A multi-source heterogeneous big data processing system according to claim 1, wherein the main data feature is spliced after the main data category feature when the main data feature and the main data category feature are synthesized.
5. The multi-source heterogeneous big data processing system of claim 1, wherein the random feature vector is stitched after the main data class feature when the main data class feature is combined with the generated feature.
6. The multi-source heterogeneous big data processing system of claim 1, wherein the first feature and the second feature have the same dimensions, and each is expressed in matrix form as $U = (u_{ji})$, where $u_{1i}$, the i-th element of the first row of the matrix U, represents the i-th main data category feature; $u_{ji}$, the element in the j-th row and i-th column of the matrix U, represents the field of the j-th main data corresponding to the i-th main data category feature; m represents the total number of main data; and n represents the total number of main data category features of one piece of main data.
7. The multi-source heterogeneous big data processing system of claim 1, wherein the second neural network outputs through the softmax layer, and the output value is a probability value.
8. The multi-source heterogeneous big data processing system of claim 7, wherein the first neural network and the second neural network are jointly trained with a loss function [loss function equation not recoverable] whose terms are: the loss value; the number of training samples in the training set; y, a preset constant; the probability that the second neural network outputs for the classification label corresponding to the second feature when the second feature of the t-th training sample is input; and the probability that the second neural network outputs for the classification label corresponding to the first feature when the g-th first feature of the t-th training sample is input.
9. The multi-source heterogeneous big data processing system of claim 8, wherein the training sample of the joint training is derived from an original data set from which the main data has been constructed, and the generated feature extractor extracts a plurality of times from the original data set as the training sample to obtain a plurality of generated features, thereby synthesizing a plurality of basic features, and generating a plurality of first features through the first neural network.
10. The multi-source heterogeneous big data processing system according to claim 1, wherein the main data generating module generates a plurality of generating features from the original data set of the main data to be generated, synthesizes the plurality of basic features respectively, inputs the synthesized plurality of basic features respectively into the first neural network to obtain a plurality of groups of main data, and deletes the repeated main data from the plurality of groups of main data to obtain the final main data set.
CN202310736600.3A 2023-06-21 2023-06-21 Multi-source heterogeneous big data processing system Active CN116662434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310736600.3A CN116662434B (en) 2023-06-21 2023-06-21 Multi-source heterogeneous big data processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310736600.3A CN116662434B (en) 2023-06-21 2023-06-21 Multi-source heterogeneous big data processing system

Publications (2)

Publication Number Publication Date
CN116662434A 2023-08-29
CN116662434B 2023-10-13

Family

ID=87720639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310736600.3A Active CN116662434B (en) 2023-06-21 2023-06-21 Multi-source heterogeneous big data processing system

Country Status (1)

Country Link
CN (1) CN116662434B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220245490A1 (en) * 2021-02-03 2022-08-04 Royal Bank Of Canada System and method for heterogeneous multi-task learning with expert diversity
CN113157678A (en) * 2021-04-19 2021-07-23 中国人民解放军91977部队 Multi-source heterogeneous data association method
CN113505222A (en) * 2021-06-21 2021-10-15 山东师范大学 Government affair text classification method and system based on text circulation neural network
CN113590818A (en) * 2021-06-30 2021-11-02 中国电子科技集团公司第三十研究所 Government affair text data classification method based on integration of CNN, GRU and KNN
CN113626511A (en) * 2021-08-12 2021-11-09 山东勤成健康科技股份有限公司 Heterogeneous database fusion access system
CN114462603A (en) * 2022-02-09 2022-05-10 中国银行股份有限公司 Knowledge graph generation method and device for data lake
CN114661810A (en) * 2022-05-24 2022-06-24 国网浙江省电力有限公司杭州供电公司 Lightweight multi-source heterogeneous data fusion method and system
CN115908022A (en) * 2022-12-05 2023-04-04 中信银行股份有限公司 Abnormal transaction risk early warning method and system based on network modeling
CN115936624A (en) * 2022-12-26 2023-04-07 中国电信股份有限公司 Basic level data management method and device
CN116226238A (en) * 2023-05-06 2023-06-06 合肥尚创信息技术有限公司 Multi-dimensional heterogeneous big data mining method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
惠国保;: "一种基于深度学习的多源异构数据融合方法", 《现代导航》, no. 03, pages 65 - 70 *
李刚 等: "基于信息融合的电力大数据可视化预处理方法", 《广东电力》, vol. 29, no. 12, pages 10 - 14 *

Also Published As

Publication number Publication date
CN116662434B (en) 2023-10-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant