CN111444162A - Big data initialization method and device, electronic equipment and storage medium - Google Patents

Big data initialization method and device, electronic equipment and storage medium

Info

Publication number
CN111444162A
Authority
CN
China
Prior art keywords
data
data table
table set
historical
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010151374.9A
Other languages
Chinese (zh)
Inventor
李广翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010151374.9A priority Critical patent/CN111444162A/en
Publication of CN111444162A publication Critical patent/CN111444162A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471Distributed queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a big data initialization method and device, electronic equipment and a storage medium. The method can map a historical data set imported into a distributed file system into a data table set, improves the fault tolerance of data, further detects missing values of the data in the data table set according to the historical data set to obtain a standard data table set, ensures the accuracy and the integrity of the data, analyzes the data in the standard data table set through a preset relevance dependency relationship to obtain a key field set of the data, randomly distributes the key field set to the standard data table set to generate a random landing data table set, improves the data processing speed, further merges the random landing data tables in the random landing data table set according to preset conditions to obtain an initialized data table set, and realizes the initialization of big data.

Description

Big data initialization method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a big data initialization method and apparatus, an electronic device, and a storage medium.
Background
Currently, big data is applied in a wide range of application systems, and these systems are being upgraded and converted to big data architectures. For example, for a system architecture upgrade, the development work only needs to replace the data processing module with big data processing technology, but the historical data must undergo corresponding conversion processing.
When a system is upgraded to big data, it usually has to be restructured, and the historical data must be initialized according to the new rules to generate data that conforms to the rules of the new system. Simply stacking statements only causes serious performance problems, and the initialization cannot be completed.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method and an apparatus for initializing big data, an electronic device, and a storage medium, which can implement fast initialization of big data, ensure accuracy and integrity of data, and improve fault tolerance of data.
A big data initialization method, the method comprising:
acquiring a historical data set from a pre-constructed database, and importing the historical data set into a distributed file system;
mapping the imported historical data set into a data table set;
according to the historical data set, missing value detection is carried out on the data in the data table set to obtain a standard data table set;
performing data analysis on the data in the standard data table set through a preset relevance dependency relationship to obtain a key field set of the data;
randomly distributing the key field set to the standard data table set to generate a random landing data table set;
and combining the random landing data tables in the random landing data table set according to preset conditions to obtain an initialized data table set.
According to a preferred embodiment of the present invention, the importing the historical data set into a distributed file system includes:
acquiring an IP address and an SID number of a server where the database is located;
logging in the database according to the IP address and the SID number;
obtaining an absolute path of the historical data set on the distributed file system from the database;
and importing the historical data set into the distributed file system according to the absolute path.
According to a preferred embodiment of the present invention, the importing the historical data set into a distributed file system includes:
acquiring attribute information of the data in the historical data set;
determining the priority of the data in the historical data set according to the attribute information;
and importing the historical data set into a distributed file system according to the priority.
According to a preferred embodiment of the present invention, the mapping the imported historical data set into a data table set includes:
mapping the historical data set into a data table by using a configuration tool;
verifying whether the historical data set is loaded into the data table according to an absolute path of the historical data set on the distributed file system;
when the historical data set is loaded into the data table, the data table set is constructed by using the data in the data table.
According to a preferred embodiment of the present invention, the performing missing value detection on the data in the data table set according to the historical data set to obtain a standard data table set includes:
missing value detection is carried out on the data in the data table set by adopting a missmap function;
when no missing value in the set of data tables is detected, determining the set of data tables as the standard set of data tables; or
And when the missing values in the data table set are detected, filling the missing values by adopting a maximum likelihood estimation algorithm to obtain the standard data table set.
According to the preferred embodiment of the present invention, when the maximum likelihood estimation algorithm is used to fill the missing values to obtain the standard data table set, the following formula is used:
Figure BDA0002402557140000031
wherein L(θ) represents the filled missing value, θ represents the probability parameter corresponding to the missing value, n represents the number of the historical data sets, and p(x_i | θ) represents the probability of the missing value.
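The formula figure itself is not reproduced in this text. For orientation only, the standard likelihood function used in maximum likelihood estimation can be written with the symbols defined above as follows. Whether the figure in the original publication has exactly this form is an assumption, and the estimate \hat{\theta} is notation introduced here for illustration.

```latex
L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta),
\qquad
\hat{\theta} = \arg\max_{\theta} L(\theta)
```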
According to the preferred embodiment of the present invention, the randomly distributing the set of key fields into the set of standard data tables, and the generating a set of random landing data tables includes:
determining the number of key fields in the set of key fields;
generating a plurality of numerical values according to the number of the key fields, wherein the number of the plurality of numerical values is the same as the number of the key fields;
establishing a mapping relation between the plurality of numerical values and the standard data table set;
randomly matching the plurality of numerical values with the key fields to obtain a matching result;
and distributing the key fields into the standard data table set according to the mapping relation and the matching result to obtain the random landing data table set.
A big data initialization apparatus, the apparatus comprising:
the importing unit is used for acquiring a historical data set from a pre-constructed database and importing the historical data set into a distributed file system;
the mapping unit is used for mapping the imported historical data set into a data table set;
the detection unit is used for detecting missing values of the data in the data table set according to the historical data set to obtain a standard data table set;
the analysis unit is used for carrying out data analysis on the data in the standard data table set through a preset relevance dependency relationship to obtain a key field set of the data;
the distribution unit is used for randomly distributing the key field set to the standard data table set to generate a random landing data table set;
and the merging unit is used for merging the random landing data tables in the random landing data table set according to preset conditions to obtain an initialized data table set.
According to a preferred embodiment of the present invention, the importing unit is specifically configured to:
acquiring an IP address and an SID number of a server where the database is located;
logging in the database according to the IP address and the SID number;
obtaining an absolute path of the historical data set on the distributed file system from the database;
and importing the historical data set into the distributed file system according to the absolute path.
According to a preferred embodiment of the invention, the apparatus further comprises:
the acquisition unit is used for acquiring the attribute information of the data in the historical data set when the historical data set is imported into a distributed file system;
the determining unit is used for determining the priority of the data in the historical data set according to the attribute information;
the importing unit is further configured to import the historical data set into a distributed file system according to the priority.
According to a preferred embodiment of the present invention, the mapping unit is specifically configured to:
mapping the historical data set into a data table by using a configuration tool;
verifying whether the historical data set is loaded into the data table according to an absolute path of the historical data set on the distributed file system;
when the historical data set is loaded into the data table, the data table set is constructed by using the data in the data table.
According to a preferred embodiment of the present invention, the detection unit is specifically configured to:
missing value detection is carried out on the data in the data table set by adopting a missmap function;
when no missing value in the set of data tables is detected, determining the set of data tables as the standard set of data tables; or
And when the missing values in the data table set are detected, filling the missing values by adopting a maximum likelihood estimation algorithm to obtain the standard data table set.
According to the preferred embodiment of the present invention, when the detection unit fills the missing values by using a maximum likelihood estimation algorithm to obtain the standard data table set, the following formula is adopted:
Figure BDA0002402557140000051
wherein L(θ) represents the filled missing value, θ represents the probability parameter corresponding to the missing value, n represents the number of the historical data sets, and p(x_i | θ) represents the probability of the missing value.
According to a preferred embodiment of the present invention, the distribution unit is specifically configured to:
determining the number of key fields in the set of key fields;
generating a plurality of numerical values according to the number of the key fields, wherein the number of the plurality of numerical values is the same as the number of the key fields;
establishing a mapping relation between the plurality of numerical values and the standard data table set;
randomly matching the plurality of numerical values with the key fields to obtain a matching result;
and distributing the key fields into the standard data table set according to the mapping relation and the matching result to obtain the random landing data table set.
An electronic device, the electronic device comprising:
a memory storing at least one instruction; and
and the processor executes the instructions stored in the memory to realize the big data initialization method.
A computer-readable storage medium having stored therein at least one instruction for execution by a processor in an electronic device to implement the big data initialization method.
According to the above technical scheme, the historical data set imported into the distributed file system can be mapped into a data table set, which improves the fault tolerance of the data. Missing value detection is then performed on the data in the data table set according to the historical data set to obtain a standard data table set, which ensures the accuracy and integrity of the data. The data in the standard data table set is analyzed through the preset relevance dependency relationship to obtain the key field set of the data, and the key field set is randomly distributed to the standard data table set to generate a random landing data table set, which improves the data processing speed. Finally, the random landing data tables in the random landing data table set are merged according to preset conditions to obtain an initialized data table set, thereby realizing the initialization of big data.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the big data initialization method of the present invention.
FIG. 2 is a functional block diagram of a big data initialization apparatus according to a preferred embodiment of the present invention.
FIG. 3 is a schematic structural diagram of an electronic device implementing a big data initialization method according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flow chart of a preferred embodiment of the big data initialization method according to the present invention. The order of the steps in the flow chart may be changed and some steps may be omitted according to different needs.
The big data initialization method is applied to one or more electronic devices, where the electronic devices are devices capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and hardware of the electronic devices includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The electronic device may be any electronic product capable of performing human-computer interaction with a user, for example, a Personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a game machine, an interactive Internet Protocol Television (IPTV), an intelligent wearable device, and the like.
The electronic device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a cloud based on cloud computing and consisting of a large number of hosts or network servers.
The Network where the electronic device is located includes, but is not limited to, the internet, a wide area Network, a metropolitan area Network, a local area Network, a Virtual Private Network (VPN), and the like.
S10, obtaining the historical data set from the pre-constructed database, and importing the historical data set into the distributed file system.
In at least one embodiment of the present invention, the pre-constructed database is a traditional database, also referred to as a relational database, for handling permanent, stable data.
For example, the database can be an Oracle database, a MySQL database, a graph database, and the like.
In at least one embodiment of the invention, the historical data set is formed by combining data generated from historical behavior of the user.
In at least one embodiment of the invention, the Distributed File System may be a Hadoop Distributed File System (HDFS). The distributed file system is a distributed file system deployed on a cluster, and data transmission needs to be performed through a network.
In at least one embodiment of the invention, the electronic device importing the historical data set into a distributed file system comprises:
the electronic equipment acquires an IP (Internet Protocol) address and a SID (Security Identifiers) number of a server where the database is located, logs in the database according to the IP address and the SID number, acquires an absolute path of the historical data set on the distributed file system from the database, and further imports the historical data set into the distributed file system according to the absolute path.
Specifically, the electronic device may import the historical data set from the database to the HDFS through a Sqoop transport tool.
Sqoop is an open-source tool used mainly for transferring data between Hadoop and traditional databases. It can import data from a relational database into Hadoop's HDFS cluster and can also export data from HDFS into a relational database.
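As an illustration of the import step, the following Python sketch assembles a Sqoop import command from the server address, the SID and a target HDFS path. It only builds the argument list and does not run it. The host, port, SID, user, table name and paths are hypothetical values, and the use of an Oracle thin-driver JDBC URL is an assumption based on the Oracle example above, not something the source specifies.

```python
from typing import List


def build_sqoop_import(host: str, port: int, sid: str, user: str,
                       password_file: str, table: str, target_dir: str,
                       mappers: int = 4) -> List[str]:
    """Assemble (but do not run) a Sqoop import command for one table."""
    # Assumption: an Oracle thin-driver JDBC URL built from the address and SID.
    jdbc_url = f"jdbc:oracle:thin:@{host}:{port}:{sid}"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", user,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,        # absolute HDFS path for the imported data
        "--num-mappers", str(mappers),
    ]


# Hypothetical values, for illustration only.
cmd = build_sqoop_import("10.0.0.5", 1521, "ORCL", "etl_user",
                         "/user/etl/.pw", "ORDERS_HISTORY",
                         "/data/history/orders")
print(" ".join(cmd))
```

The resulting list could be passed to a process launcher on a machine where Sqoop is installed; that step is left out here because it depends on the cluster setup.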
In at least one embodiment of the invention, the electronic device importing the historical data set into a distributed file system, comprising:
the electronic equipment acquires attribute information of the data in the historical data set, determines the priority of the data in the historical data set according to the attribute information, and imports the historical data set into a distributed file system according to the priority.
Wherein the attribute information includes, but is not limited to, one or more of the following:
the size of the table in the historical data set, whether there is a primary key, a field containing a time sequence or a number sequence, etc.
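The source does not state how the attribute information is turned into a priority. The sketch below shows one possible rule, purely as an assumption: tables with a primary key come first, then tables with a time or number sequence field, then smaller tables. The `TableInfo` type and the table names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TableInfo:
    name: str
    size_mb: float
    has_primary_key: bool
    has_sequence_field: bool  # a time-sequence or number-sequence column


def import_order(tables: List[TableInfo]) -> List[str]:
    """Return table names in import order, highest priority first."""
    def priority(t: TableInfo):
        # Assumed rule: primary key first, then sequence field, then smaller size.
        return (not t.has_primary_key, not t.has_sequence_field, t.size_mb)
    return [t.name for t in sorted(tables, key=priority)]


tables = [
    TableInfo("orders", 5000.0, True, True),
    TableInfo("audit_log", 120000.0, False, True),
    TableInfo("customers", 800.0, True, False),
]
order = import_order(tables)
print(order)
```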
S11, mapping the imported historical data set into a data table set.
In at least one embodiment of the present invention, the electronic device mapping the imported historical data set into a data table set includes:
the electronic equipment maps the historical data set into a data table by using a configuration tool, checks whether the historical data set is loaded into the data table according to an absolute path of the historical data set on the distributed file system, and builds the data table set by using data in the data table when the historical data set is loaded into the data table.
Specifically, the configuration tool may be Hive, and the data table may include a Hive data table. Hive is a Hadoop-based data warehouse tool that can map structured data files into database tables and provides a simple SQL (Structured Query Language) query function.
Through the embodiment, the data in the historical data set can be converted into the data table in a mapping mode, and the fault tolerance of the data is improved.
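One way to picture the mapping step is a Hive external table whose LOCATION points at the absolute HDFS path of the imported data. The Python sketch below only generates the HiveQL text. The table name, columns, path and the comma delimiter are assumptions made for illustration, and the source does not say that an external table is used.

```python
from typing import List, Tuple


def hive_external_table_ddl(table: str, columns: List[Tuple[str, str]],
                            hdfs_path: str, delimiter: str = ",") -> str:
    """Build a CREATE EXTERNAL TABLE statement over files already in HDFS."""
    cols = ",\n  ".join(f"`{name}` {ctype}" for name, ctype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_path}'"
    )


# Hypothetical table layout and path.
ddl = hive_external_table_ddl(
    "orders_history",
    [("order_id", "BIGINT"), ("amount", "DOUBLE")],
    "/data/history/orders",
)
print(ddl)
```

After such a statement has run, a simple row count on the table is one way to check that the files under the path are visible through it.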
S12, according to the historical data set, carrying out missing value detection on the data in the data table set to obtain a standard data table set.
It can be understood that the electronic device performs missing value detection on the data in the data table set to obtain a standard data table set because data may go missing as a result of an operation error by a developer and/or a failure of the transmission tool used for the import.
Specifically, the missing values include: missing completely at random, missing at random, and missing not at random.
Missing completely at random means that whether a variable's value is missing is entirely random and does not depend on any other factor; missing at random means that the missingness of a variable is related to other variables but not to the value of the variable itself; missing not at random means that the missingness of a variable is related to the value of the variable itself.
In at least one embodiment of the present invention, the electronic device performs missing value detection on the data in the data table set according to the historical data set, and obtaining a standard data table set includes:
the electronic device performs missing value detection on the data in the data table set by using a missmap function, specifically:
(1) when no missing values in the set of data tables are detected, the electronic device determines the set of data tables as the standard set of data tables.
(2) And when the missing values in the data table set are detected, filling the missing values by the electronic equipment by adopting a maximum likelihood estimation algorithm to obtain the standard data table set.
Further, when the electronic device fills the missing value by using a maximum likelihood estimation algorithm to obtain the standard data table set, the following formula is adopted:
Figure BDA0002402557140000091
wherein L(θ) represents the filled missing value, θ represents the probability parameter corresponding to the missing value, n represents the number of the historical data sets, and p(x_i | θ) represents the probability of the missing value.
Through the embodiment, the method for detecting the missing value ensures the accuracy and the integrity of the data.
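The detection and filling steps can be sketched in plain Python. This is not the missmap function named above, and it assumes, for illustration only, that the observed values of a column are independent draws from a normal distribution, in which case the maximum likelihood estimate of the mean is the sample mean. The function and column names are hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def missing_counts(rows: List[Row]) -> Dict[str, int]:
    """Count missing (None) entries per column."""
    counts: Dict[str, int] = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (1 if value is None else 0)
    return counts


def fill_with_gaussian_mle(rows: List[Row], col: str) -> List[Row]:
    """Fill missing values in `col` with the MLE of a Gaussian mean.

    Assumption: observed values are i.i.d. normal, so the value of the
    mean that maximises the likelihood is the sample mean.
    """
    observed = [r[col] for r in rows if r[col] is not None]
    if not observed:
        return [dict(r) for r in rows]  # nothing to estimate from
    mu_hat = sum(observed) / len(observed)
    return [{**r, col: mu_hat if r[col] is None else r[col]} for r in rows]


rows = [{"amount": 10.0}, {"amount": None}, {"amount": 14.0}]
counts = missing_counts(rows)
filled = fill_with_gaussian_mle(rows, "amount")
print(counts, filled)
```

If `missing_counts` reports zero for every column, the data table set can be used as the standard data table set unchanged, which corresponds to case (1) above.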
S13, performing data analysis on the data in the standard data table set through a preset relevance dependency relationship to obtain a key field set of the data.
In at least one embodiment of the present invention, because a distributed system differs from a relational database, data analysis cannot be performed directly according to an index. The electronic device therefore needs to analyze the data in the standard data table set according to a preset relevance dependency relationship to obtain the key field set that causes data skew.
Data skew means that the numbers of records for different field values differ greatly; for example, a school has 10000 boys and only 100 girls.
The preset relevance dependency relationship comprises: an inequality key association rule and joint detection rules across a plurality of data tables.
Further, the electronic device obtains a distribution condition of keys according to the relevance dependency relationship, so as to obtain the key field set.
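A simple way to look at the distribution of keys is to count how often each key value occurs and flag the values whose share of all records exceeds a threshold. The sketch below uses the boys and girls example from the text. The 0.5 threshold and the function name are assumptions, and the sketch does not implement the association or joint detection rules themselves.

```python
from collections import Counter
from typing import Iterable, List


def skewed_keys(keys: Iterable[str], threshold: float = 0.5) -> List[str]:
    """Return key values whose share of all records exceeds `threshold`."""
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(k for k, c in counts.items() if c / total > threshold)


keys = ["boy"] * 10000 + ["girl"] * 100
skewed = skewed_keys(keys)
print(skewed)
```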
S14, randomly distributing the key field set to the standard data table set to generate a random landing data table set.
In at least one embodiment of the present invention, the electronic device randomly distributes the set of key fields into the set of standard data tables, and generating a set of random landing data tables includes:
the electronic equipment determines the number of key fields in the key field set, generates a plurality of numerical values according to the number of the key fields, the number of the numerical values is the same as the number of the key fields, establishes a mapping relation between the numerical values and the standard data table set, randomly matches the numerical values and the key fields to obtain a matching result, and distributes the key fields into the standard data table set according to the mapping relation and the matching result to obtain the random landing data table set.
Through the implementation mode, the data processing speed can be improved by combining a mode of randomly landing and distributing the key fields.
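The distribution steps described above can be sketched as follows. The round-robin mapping from numerical values to tables, the fixed seed and all names are assumptions chosen to keep the example small and repeatable; the source does not specify how the mapping relation is built.

```python
import random
from typing import Dict, List


def distribute_key_fields(key_fields: List[str], tables: List[str],
                          seed: int = 0) -> Dict[str, List[str]]:
    """Randomly land key fields in tables via an intermediate set of values."""
    if not tables:
        raise ValueError("at least one table is required")
    rng = random.Random(seed)        # seeded only to make the sketch repeatable
    n = len(key_fields)              # number of key fields
    values = list(range(n))          # as many numerical values as key fields
    # Mapping relation (assumed round-robin): each value points at one table.
    value_to_table = {v: tables[v % len(tables)] for v in values}
    # Random matching: shuffle the values and pair them with the key fields.
    shuffled = values[:]
    rng.shuffle(shuffled)
    match = dict(zip(key_fields, shuffled))
    # Distribute each key field to the table its matched value maps to.
    landing: Dict[str, List[str]] = {t: [] for t in tables}
    for field, value in match.items():
        landing[value_to_table[value]].append(field)
    return landing


landing = distribute_key_fields(["k1", "k2", "k3", "k4"], ["t0", "t1"])
print(landing)
```

Because every value is used exactly once, the key fields are spread evenly across the tables in this sketch, which is the property that helps counter data skew.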
S15, combining the random landing data tables in the random landing data table set according to preset conditions to obtain an initialized data table set.
In at least one embodiment of the present invention, the preset condition may be configured by a developer according to different requirements of the random landing data table.
Wherein the preset conditions include, but are not limited to: single table merging, multi-table merging, adjacent table merging, and the like.
Through the implementation mode, the random landing data tables are combined according to the preset conditions, the difficulty of data initialization can be reduced, the data processing speed is indirectly increased, and powerful support is provided for project switching.
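The source names the merge conditions but does not define them. The sketch below shows one possible reading, as an assumption: a multi-table merge concatenates all random landing tables into one, and an adjacent-table merge combines neighbouring tables pairwise. Tables are represented as lists of row dictionaries, and the condition names are hypothetical.

```python
from typing import Dict, List

Table = List[Dict[str, object]]


def merge_tables(tables: List[Table], condition: str = "multi") -> List[Table]:
    """Merge landing tables under an assumed reading of the preset condition."""
    if condition == "multi":
        # Multi-table merge: concatenate every landing table into one.
        return [[row for table in tables for row in table]]
    if condition == "adjacent":
        # Adjacent-table merge: combine neighbours pairwise (0+1, 2+3, ...).
        return [sum(tables[i:i + 2], []) for i in range(0, len(tables), 2)]
    raise ValueError(f"unknown condition: {condition}")


landing_tables = [[{"k": 1}], [{"k": 2}], [{"k": 3}]]
merged_all = merge_tables(landing_tables, "multi")
merged_adjacent = merge_tables(landing_tables, "adjacent")
print(merged_all, merged_adjacent)
```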
According to the above technical scheme, the historical data set imported into the distributed file system can be mapped into a data table set, which improves the fault tolerance of the data. Missing value detection is then performed on the data in the data table set according to the historical data set to obtain a standard data table set, which ensures the accuracy and integrity of the data. The data in the standard data table set is analyzed through the preset relevance dependency relationship to obtain the key field set of the data, and the key field set is randomly distributed to the standard data table set to generate a random landing data table set, which improves the data processing speed. Finally, the random landing data tables in the random landing data table set are merged according to preset conditions to obtain an initialized data table set, thereby realizing the initialization of big data.
Fig. 2 is a functional block diagram of a big data initialization apparatus according to a preferred embodiment of the present invention. The big data initialization apparatus 11 includes an importing unit 110, a mapping unit 111, a detecting unit 112, an analyzing unit 113, a distributing unit 114, a merging unit 115, an obtaining unit 116, and a determining unit 117. The module/unit referred to in the present invention refers to a series of computer program segments that can be executed by the processor 13 and that can perform a fixed function, and that are stored in the memory 12. In the present embodiment, the functions of the modules/units will be described in detail in the following embodiments.
The import unit 110 acquires a history data set from a pre-constructed database and imports the history data set into a distributed file system.
In at least one embodiment of the present invention, the pre-constructed database is a traditional database, also referred to as a relational database, for handling permanent, stable data.
For example, the database can be an Oracle database, a MySQL database, a graph database, and the like.
In at least one embodiment of the invention, the historical data set is formed by combining data generated from historical behavior of the user.
In at least one embodiment of the invention, the Distributed File System may be a Hadoop Distributed File System (HDFS). The distributed file system is a distributed file system deployed on a cluster, and data transmission needs to be performed through a network.
In at least one embodiment of the present invention, the importing unit 110 importing the historical data set into a distributed file system includes:
the importing unit 110 obtains an IP (Internet Protocol) address and a SID (Security identifier) number of a server where the database is located, and logs in the database according to the IP address and the SID number, where the importing unit 110 obtains an absolute path of the historical data set on the distributed file system from the database, and further imports the historical data set into the distributed file system according to the absolute path.
Specifically, the importing unit 110 may import the historical data set from the database to the HDFS through a Sqoop transmission tool.
Sqoop is an open-source tool used mainly for transferring data between Hadoop and traditional databases. It can import data from a relational database into Hadoop's HDFS cluster and can also export data from HDFS into a relational database.
In at least one embodiment of the present invention, the importing unit 110 imports the historical data set into a distributed file system, including:
the obtaining unit 116 obtains attribute information of the data in the historical data set, the determining unit 117 determines a priority of the data in the historical data set according to the attribute information, and the importing unit 110 imports the historical data set into the distributed file system according to the priority.
Wherein the attribute information includes, but is not limited to, one or more of the following:
the size of the table in the historical data set, whether there is a primary key, a field containing a time sequence or a number sequence, etc.
The mapping unit 111 maps the imported historical data set into a data table set.
In at least one embodiment of the present invention, the mapping unit 111 maps the imported historical data set into a data table set, including:
the mapping unit 111 maps the historical data set into a data table by using a configuration tool, and verifies, according to the absolute path of the historical data set on the distributed file system, whether the historical data set has been loaded into the data table; when the historical data set has been loaded into the data table, the mapping unit 111 constructs the data table set from the data in the data table.
Specifically, the configuration tool may be Hive, and the data table may include Hive data tables. Hive is a Hadoop-based data warehouse tool that can map structured data files into database tables and provides a simple SQL (Structured Query Language) query function.
Through this embodiment, the data in the historical data set is converted into data tables by mapping, which improves the fault tolerance of the data.
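The mapping step can be pictured as declaring a Hive external table over the directory into which the historical data was imported. The sketch below only generates the HiveQL text. The table name, column list, comma delimiter and path are assumptions made for the example, and the embodiment does not say whether external tables are used.

```python
from typing import List, Tuple


def hive_external_table_ddl(
    table: str,
    columns: List[Tuple[str, str]],
    hdfs_path: str,
    delimiter: str = ",",  # assumption: comma-delimited text files
) -> str:
    """Return a HiveQL statement that maps the files under hdfs_path to a table."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_path}'"
    )


# Hypothetical table definition, for illustration only.
ddl = hive_external_table_ddl(
    table="hist_orders",
    columns=[("order_id", "BIGINT"), ("amount", "DOUBLE"), ("created_at", "STRING")],
    hdfs_path="/data/history/hist_orders",
)
print(ddl)
```

A row count query against the new table is one simple way to check that the files at the absolute path are actually visible through it.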
The detection unit 112 performs missing value detection on the data in the data table set according to the historical data set to obtain a standard data table set.
It is understood that data may go missing because of an operation error by a developer and/or a failure of the import tool; the detection unit 112 therefore performs missing value detection on the data in the data table set to obtain the standard data table set.
Specifically, the missing values include: values missing completely at random, values missing at random, and values missing not at random.
Missing completely at random means that whether the value of a variable is missing is entirely random and does not depend on any other factor; missing at random means that whether the value of a variable is missing is related to other variables but not to the value of the variable itself; missing not at random means that whether the value of a variable is missing is related to the value of the variable itself.
In at least one embodiment of the present invention, the detecting unit 112 performs missing value detection on the data in the data table set according to the historical data set, and obtaining a standard data table set includes:
the detecting unit 112 performs missing value detection on the data in the data table set by using a missmap function, specifically:
(1) when it is detected that there are no missing values in the set of data tables, the detection unit 112 determines the set of data tables as the standard set of data tables.
(2) When a missing value in the data table set is detected, the detecting unit 112 fills the missing value by using a maximum likelihood estimation algorithm to obtain the standard data table set.
Further, when the detection unit 112 fills the missing values by using a maximum likelihood estimation algorithm to obtain the standard data table set, the following formula is used:
Figure BDA0002402557140000141
where L(θ) represents the filled missing value, θ represents the probability parameter corresponding to the missing value, n represents the number of historical data sets, and p(x_i | θ) represents the probability of the missing value.
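The formula itself survives in this text only as a figure reference. Assuming that it is the standard likelihood function of maximum likelihood estimation, which is consistent with the symbols L(θ), θ, n and p(x_i | θ) listed above but is not confirmed by the text, it would take the following form; the estimate written here as \hat{\theta} is notation introduced only for this illustration:

```latex
L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta),
\qquad
\hat{\theta} = \arg\max_{\theta} L(\theta)
```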
Through this embodiment, missing value detection ensures the accuracy and integrity of the data.
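The two cases (1) and (2) above can be sketched as follows. The sketch replaces the missmap function with a plain scan for empty cells, and it assumes a Gaussian model for each numeric column, under which the maximum likelihood estimate of the mean is the sample mean of the observed values. Both choices, and all names and sample data, are assumptions made for illustration; the embodiment does not specify the probability model.

```python
from statistics import fmean
from typing import Dict, List, Optional

Column = List[Optional[float]]


def missing_positions(table: Dict[str, Column]) -> Dict[str, List[int]]:
    """Return, for each column, the row indices whose value is missing (None)."""
    return {
        name: [i for i, v in enumerate(col) if v is None]
        for name, col in table.items()
    }


def fill_with_gaussian_mle(col: Column) -> List[float]:
    """Fill gaps with the maximum likelihood estimate of a Gaussian mean.

    Under a Gaussian model the MLE of the mean is the sample mean of the
    observed values, so that value replaces every missing entry.
    """
    observed = [v for v in col if v is not None]
    if not observed:
        raise ValueError("column has no observed values to estimate from")
    mu_hat = fmean(observed)
    return [mu_hat if v is None else v for v in col]


def to_standard_table(table: Dict[str, Column]) -> Dict[str, Column]:
    gaps = missing_positions(table)
    if not any(gaps.values()):
        return table  # case (1): no missing values, the table is already standard
    # case (2): fill every column; complete columns come back unchanged
    return {name: fill_with_gaussian_mle(col) for name, col in table.items()}


# Hypothetical data table with one missing cell.
table = {"amount": [10.0, None, 14.0], "qty": [1.0, 2.0, 3.0]}
standard = to_standard_table(table)
print(missing_positions(table))
print(standard["amount"])
```

Mean filling of this kind is only reasonable when values are missing completely at random or at random; values missing not at random would need a model of the missingness itself.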
The analysis unit 113 performs data analysis on the data in the standard data table set through a preset relevance dependency relationship to obtain a key field set of the data.
In at least one embodiment of the present invention, since a distributed system differs from a relational database, the analysis unit 113 may perform data analysis on the data in the standard data table set according to the preset relevance dependency relationship, so as to obtain the key field set that causes data skew.
Data skew means that the numbers of records holding the different values of a field differ greatly in proportion; for example, a school has 10000 boys and only 100 girls.
The preset relevance dependency relationship includes an inequality key association rule and a joint detection rule for multiple data tables.
Further, the analysis unit 113 obtains the distribution of keys according to the relevance dependency relationship, and thereby obtains the key field set.
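A minimal sketch of finding skewed fields from the distribution of their values is given below, using the school example. The ratio test and its threshold of 10 are hypothetical; the embodiment names an inequality key association rule and a multi-table joint detection rule without giving their details, and this sketch does not implement either of them.

```python
from collections import Counter
from typing import Dict, Iterable, List


def skewed_fields(
    rows: Iterable[Dict[str, str]],
    fields: List[str],
    ratio_threshold: float = 10.0,  # hypothetical cut-off
) -> List[str]:
    """Return the fields whose most common value outnumbers the least common
    value by at least ratio_threshold."""
    counters = {f: Counter() for f in fields}
    for row in rows:
        for f in fields:
            counters[f][row[f]] += 1
    result = []
    for f in fields:
        counts = list(counters[f].values())
        if counts and max(counts) / min(counts) >= ratio_threshold:
            result.append(f)
    return result


# 10000 boys and 100 girls, spread evenly over four classes.
rows = [{"gender": "boy", "class_id": str(i % 4)} for i in range(10000)]
rows += [{"gender": "girl", "class_id": str(i % 4)} for i in range(100)]
print(skewed_fields(rows, ["gender", "class_id"]))  # ['gender']
```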
The distribution unit 114 randomly distributes the key field set to the standard data table set to generate a random landing data table set.
In at least one embodiment of the present invention, the distributing unit 114 randomly distributes the set of key fields into the set of standard data tables, and generating a set of random landing data tables includes:
the distribution unit 114 determines the number of key fields in the key field set and generates a plurality of values according to that number, the number of values being the same as the number of key fields; the distribution unit 114 establishes a mapping relationship between the plurality of values and the standard data table set, and randomly matches the plurality of values with the key fields to obtain a matching result; the distribution unit 114 then distributes the key fields into the standard data table set according to the mapping relationship and the matching result to obtain the random landing data table set.
Through this embodiment, distributing the key fields by random landing can improve the data processing speed.
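The distribution steps can be sketched as follows: count the key fields, generate as many values, map each value to a standard data table, match values to key fields at random, and place each key field in the table its value maps to. The modulo rule used for the value-to-table mapping and the field and table names are assumptions; the embodiment does not define how the mapping relationship is built.

```python
import random
from typing import Dict, List, Optional


def random_landing(
    key_fields: List[str],
    standard_tables: List[str],
    seed: Optional[int] = None,
) -> Dict[str, List[str]]:
    """Assign each key field to a standard data table through a random match."""
    if not standard_tables:
        raise ValueError("at least one standard data table is required")
    rng = random.Random(seed)
    n = len(key_fields)      # number of key fields
    values = list(range(n))  # as many values as there are key fields
    # Hypothetical mapping rule: value v belongs to table number v mod table count.
    value_to_table = {v: standard_tables[v % len(standard_tables)] for v in values}
    shuffled = values[:]
    rng.shuffle(shuffled)    # random matching of values to key fields
    matching = dict(zip(key_fields, shuffled))
    landing: Dict[str, List[str]] = {t: [] for t in standard_tables}
    for key, value in matching.items():
        landing[value_to_table[value]].append(key)
    return landing


keys = ["gender", "region", "channel", "product"]
result = random_landing(keys, ["table_0", "table_1"], seed=7)
print(result)
```

Because every value is used exactly once, the key fields are spread as evenly as the mapping allows, while which key field lands in which table is random.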
The merging unit 115 merges the random landing data tables in the random landing data table set according to a preset condition to obtain an initialized data table set.
In at least one embodiment of the present invention, the preset condition may be configured by a developer according to different requirements of the random landing data table.
Wherein the preset conditions include, but are not limited to: single table merging, multi-table merging, adjacent table merging, and the like.
Through this embodiment, merging the random landing data tables according to the preset conditions reduces the difficulty of data initialization, indirectly increases the data processing speed, and provides strong support for project switching.
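The embodiment names the preset conditions without defining them, so the following sketch assigns them hypothetical meanings only to make the merge step concrete: "single" keeps each table as its own unit, "multi" concatenates all tables into one, and "adjacent" concatenates neighbouring tables in pairs. The function name and the row layout are likewise assumptions.

```python
from typing import Dict, List

Row = Dict[str, object]
Table = List[Row]


def merge_tables(tables: List[Table], condition: str) -> List[Table]:
    """Merge random landing tables under a named condition (hypothetical semantics)."""
    if condition == "single":
        # keep every table as its own unit
        return [list(t) for t in tables]
    if condition == "multi":
        # concatenate all tables into one
        return [[row for t in tables for row in t]]
    if condition == "adjacent":
        # concatenate neighbours pairwise: (0, 1), (2, 3), ...
        return [
            [row for t in tables[i:i + 2] for row in t]
            for i in range(0, len(tables), 2)
        ]
    raise ValueError(f"unknown merge condition: {condition}")


landing_tables = [[{"id": 1}], [{"id": 2}], [{"id": 3}]]
print(merge_tables(landing_tables, "multi"))
print(merge_tables(landing_tables, "adjacent"))
```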
According to the above technical solution, the historical data set imported into the distributed file system is mapped into a data table set, which improves the fault tolerance of the data. Missing value detection is then performed on the data in the data table set according to the historical data set to obtain the standard data table set, which ensures the accuracy and integrity of the data. Data analysis is performed on the data in the standard data table set through the preset relevance dependency relationship to obtain the key field set of the data, and the key field set is randomly distributed into the standard data table set to generate the random landing data table set, which improves the data processing speed. Finally, the random landing data tables in the random landing data table set are merged according to preset conditions to obtain the initialized data table set, thereby achieving the initialization of big data.
Fig. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the present invention for implementing a big data initialization method.
The electronic device 1 may comprise a memory 12, a processor 13 and a bus, and may further comprise a computer program, such as a big data initialization program, stored in the memory 12 and executable on the processor 13.
It will be understood by those skilled in the art that the schematic diagram is merely an example of the electronic device 1 and does not constitute a limitation on it. The electronic device 1 may have a bus-type structure or a star-type structure, and may include more or fewer hardware or software components than shown, or a different arrangement of components; for example, the electronic device 1 may further include an input/output device, a network access device, and the like.
It should be noted that the electronic device 1 is only an example; other existing or future electronic products that are applicable to the present invention should also fall within the scope of protection of the present invention and are incorporated herein by reference.
The memory 12 includes at least one type of readable storage medium, which includes flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 12 may in some embodiments be an internal storage unit of the electronic device 1, for example a removable hard disk of the electronic device 1. The memory 12 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the electronic device 1. Further, the memory 12 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 12 may be used not only to store application software installed in the electronic device 1 and various types of data, such as a code of a big data initialization program, etc., but also to temporarily store data that has been output or is to be output.
The processor 13 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 13 is a Control Unit (Control Unit) of the electronic device 1, connects various components of the electronic device 1 by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (for example, executing a big data initialization program and the like) stored in the memory 12 and calling data stored in the memory 12.
The processor 13 executes an operating system of the electronic device 1 and various installed application programs. The processor 13 executes the application program to implement the steps in the above-described various big data initialization method embodiments, such as steps S10, S11, S12, S13, S14, S15 shown in fig. 1.
Alternatively, the processor 13, when executing the computer program, implements the functions of the modules/units in the above device embodiments, for example:
acquiring a historical data set from a pre-constructed database, and importing the historical data set into a distributed file system;
mapping the imported historical data set into a data table set;
performing missing value detection on the data in the data table set according to the historical data set to obtain a standard data table set;
performing data analysis on the data in the standard data table set through a preset relevance dependency relationship to obtain a key field set of the data;
randomly distributing the key field set to the standard data table set to generate a random landing data table set;
and combining the random landing data tables in the random landing data table set according to preset conditions to obtain an initialized data table set.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to accomplish the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program in the electronic device 1. For example, the computer program may be divided into an import unit 110, a mapping unit 111, a detection unit 112, an analysis unit 113, a distribution unit 114, a merging unit 115, an acquisition unit 116, and a determination unit 117.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.
The integrated modules/units of the electronic device 1 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented.
The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, and a Read-Only Memory (ROM).
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one arrow is shown in FIG. 3, but this does not indicate only one bus or one type of bus. The bus is arranged to enable connection communication between the memory 12 and at least one processor 13 or the like.
Although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component. Preferably, the power supply is logically connected to the at least one processor 13 through a power management device, so that charge management, discharge management, power consumption management, and similar functions are implemented through the power management device. The power supply may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and other such components. The electronic device 1 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described here.
Further, the electronic device 1 may include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further include a user interface, which may be a display or an input unit (such as a keyboard); optionally, the user interface may also be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
Fig. 3 only shows the electronic device 1 with components 12-13, and it will be understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
With reference to fig. 1, the memory 12 of the electronic device 1 stores a plurality of instructions to implement a big data initialization method, and the processor 13 can execute the plurality of instructions to implement:
acquiring a historical data set from a pre-constructed database, and importing the historical data set into a distributed file system;
mapping the imported historical data set into a data table set;
performing missing value detection on the data in the data table set according to the historical data set to obtain a standard data table set;
performing data analysis on the data in the standard data table set through a preset relevance dependency relationship to obtain a key field set of the data;
randomly distributing the key field set to the standard data table set to generate a random landing data table set;
and combining the random landing data tables in the random landing data table set according to preset conditions to obtain an initialized data table set.
Specifically, for the implementation of the instructions by the processor 13, reference may be made to the description of the relevant steps in the embodiment corresponding to Fig. 1, which is not repeated here.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into modules is only a division by logical function, and other divisions are possible in an actual implementation.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A big data initialization method, the method comprising:
acquiring a historical data set from a pre-constructed database, and importing the historical data set into a distributed file system;
mapping the imported historical data set into a data table set;
performing missing value detection on the data in the data table set according to the historical data set to obtain a standard data table set;
performing data analysis on the data in the standard data table set through a preset relevance dependency relationship to obtain a key field set of the data;
randomly distributing the key field set to the standard data table set to generate a random landing data table set;
and combining the random landing data tables in the random landing data table set according to preset conditions to obtain an initialized data table set.
2. The big data initialization method of claim 1, wherein the importing the historical data set into a distributed file system comprises:
acquiring an IP address and an SID number of a server where the database is located;
logging in the database according to the IP address and the SID number;
obtaining an absolute path of the historical data set on the distributed file system from the database;
and importing the historical data set into the distributed file system according to the absolute path.
3. The big data initialization method of claim 1, wherein the importing the historical data set into a distributed file system comprises:
acquiring attribute information of the data in the historical data set;
determining the priority of the data in the historical data set according to the attribute information;
and importing the historical data set into a distributed file system according to the priority.
4. The big data initialization method of claim 2, wherein the mapping the imported historical data set into a data table set comprises:
mapping the historical data set into a data table by using a configuration tool;
verifying whether the historical data set is loaded into the data table according to an absolute path of the historical data set on the distributed file system;
when the historical data set is loaded into the data table, the data table set is constructed by using the data in the data table.
5. The big data initialization method according to claim 1, wherein the missing value detection of the data in the data table set according to the historical data set, and obtaining a standard data table set comprises:
missing value detection is carried out on the data in the data table set by adopting a missmap function;
when no missing value is detected in the data table set, determining the data table set as the standard data table set; or
when a missing value is detected in the data table set, filling the missing value by using a maximum likelihood estimation algorithm to obtain the standard data table set.
6. The big data initialization method according to claim 5, wherein when filling the missing values by using a maximum likelihood estimation algorithm to obtain the standard data table set, the following formula is used:
Figure FDA0002402557130000021
where L(θ) represents the filled missing value, θ represents the probability parameter corresponding to the missing value, n represents the number of historical data sets, and p(x_i | θ) represents the probability of the missing value.
7. The big data initialization method of claim 1, wherein the randomly distributing the key field set into the standard data table set to generate a random landing data table set comprises:
determining the number of key fields in the set of key fields;
generating a plurality of numerical values according to the number of the key fields, wherein the number of the plurality of numerical values is the same as the number of the key fields;
establishing a mapping relation between the plurality of numerical values and the standard data table set;
randomly matching the plurality of numerical values with the key fields to obtain a matching result;
and distributing the key fields into the standard data table set according to the mapping relation and the matching result to obtain the random landing data table set.
8. An apparatus for big data initialization, the apparatus comprising:
an importing unit, configured to acquire a historical data set from a pre-constructed database and import the historical data set into a distributed file system;
a mapping unit, configured to map the imported historical data set into a data table set;
a detection unit, configured to perform missing value detection on the data in the data table set according to the historical data set to obtain a standard data table set;
an analysis unit, configured to perform data analysis on the data in the standard data table set through a preset relevance dependency relationship to obtain a key field set of the data;
a distribution unit, configured to randomly distribute the key field set into the standard data table set to generate a random landing data table set;
and a merging unit, configured to merge the random landing data tables in the random landing data table set according to preset conditions to obtain an initialized data table set.
9. An electronic device, characterized in that the electronic device comprises:
a memory storing at least one instruction; and
a processor executing instructions stored in the memory to implement the big data initialization method of any of claims 1 to 7.
10. A computer-readable storage medium characterized by: the computer-readable storage medium stores at least one instruction, and the at least one instruction is executed by a processor in an electronic device to implement the big data initialization method according to any one of claims 1 to 7.
CN202010151374.9A 2020-03-06 2020-03-06 Big data initialization method and device, electronic equipment and storage medium Pending CN111444162A (en)

Publications (1)

Publication Number Publication Date
CN111444162A (en) 2020-07-24

Family

ID=71627349

Country Status (1)

Country Link
CN (1) CN111444162A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination