CN117150085A - Hudi-based index creation method, Hudi-based index creation device, Hudi-based index creation equipment and Hudi-based index creation medium - Google Patents

Hudi-based index creation method, Hudi-based index creation device, Hudi-based index creation equipment and Hudi-based index creation medium

Info

Publication number
CN117150085A
Authority
CN
China
Prior art keywords
data
index
hudi
joint
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311138922.4A
Other languages
Chinese (zh)
Inventor
黄学亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Application filed by Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN202311138922.4A
Publication of CN117150085A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the field of financial technology data processing, and discloses a Hudi-based index creation method, which comprises the following steps: acquiring historical data operation information of a preset Hudi data source, and parsing the historical data operation information to obtain a joint query condition corresponding to each data object in the preset Hudi data source; assembling a joint index field of the corresponding data object according to the joint query condition; extracting data characteristics of the data object corresponding to each joint index field, and determining the index type of the joint index field according to the data characteristics; acquiring address information of each data object in the preset Hudi data source; and generating an index column file of the preset Hudi data source according to the joint index field, the index type and the address information of each data object. The invention also provides a Hudi-based index creation device, electronic equipment and a computer readable storage medium. The invention can improve data query efficiency in the field of financial data.

Description

Hudi-based index creation method, Hudi-based index creation device, Hudi-based index creation equipment and Hudi-based index creation medium
Technical Field
The present invention relates to the field of financial technology data processing, and in particular, to a Hudi-based index creation method, apparatus, electronic device, and computer readable storage medium.
Background
Business modules in financial systems such as banking, securities, trusts, insurance, funds, and financial leasing handle huge data volumes; for example, the daily peak of data such as customer-group placement business data and business settlement data can reach the tens of millions.
Indexes can improve query performance on financial big data, but traditional index creation has drawbacks. Index selection is difficult, and a poorly chosen index or too many indexes may degrade performance and waste resources. When a query involves multi-table joins or complex conditional filtering, the index may not be used efficiently and may even lead the query optimizer to select an unsuitable execution plan, reducing query performance. Therefore, the way indexes are created for financial big data needs further improvement.
Disclosure of Invention
The invention provides an index creation method, an index creation device, electronic equipment and a computer readable storage medium based on Hudi, which mainly aim to improve the data query efficiency in the field of financial data.
In order to achieve the above object, the present invention provides a Hudi-based index creation method, including:
acquiring historical data operation information of a preset Hudi data source, and analyzing the historical data operation information to obtain a joint query condition corresponding to each data object in the preset Hudi data source;
assembling a joint index field of the corresponding data object according to the joint query condition;
extracting data characteristics of the data object corresponding to each joint index field, and determining the index type of the joint index field according to the data characteristics;
acquiring address information of each data object in the preset Hudi data source;
and generating an index column file of the preset Hudi data source according to the joint index field, the index type and the address information of each data object.
Optionally, the analyzing the historical data operation information to obtain the joint query condition corresponding to each data object in the preset Hudi data source includes:
extracting a query condition part in the historical data operation information according to a preset data operation grammar;
screening the data objects with non-unique query conditions from the query condition part, and counting the proportion of the data objects with non-unique query conditions in the historical data operation information;
and selecting the query conditions corresponding to the data objects whose proportion is larger than a preset proportion threshold as the joint query conditions.
Optionally, the assembling the joint index field of the corresponding data object according to the joint query condition includes:
acquiring a field set of a data table where the data object is located;
word segmentation is carried out on the combined query conditions to obtain a conditional word segmentation set;
sequentially calculating the similarity between each conditional word in the conditional word set and each field in the field set;
selecting a field with the similarity larger than a preset similarity threshold as an index field of the corresponding conditional word;
and collecting all index fields of the data object to obtain a joint index field of the data object.
Optionally, the extracting the data feature of the data object corresponding to each joint index field includes:
acquiring value description information of the data object and context information of the data object;
converting the data object into a data word vector, converting the value description information into a value vector, and converting the context information into an associated word vector;
and splicing the data word vector, the value vector and the associated word vector to obtain the data characteristics of the data object.
Optionally, the determining the index type of the joint index field according to the data feature includes:
calculating a relative probability value between the data characteristic and a preset index type label by using a pre-trained activation function;
and calculating the score of each preset index type label according to the relative probability value, and determining the index type label with the highest score as the index type of the data object.
In order to solve the above problems, the present invention also provides a Hudi-based index creation apparatus, the apparatus comprising:
the joint index determining module is used for acquiring historical data operation information of a preset Hudi data source, analyzing the historical data operation information to obtain joint query conditions corresponding to each data object in the preset Hudi data source, and assembling joint index fields of the corresponding data objects according to the joint query conditions;
the index type distribution module is used for extracting the data characteristics of the data objects corresponding to each joint index field and determining the index type of the joint index field according to the data characteristics;
an index address acquisition module, configured to acquire address information of each data object in the preset Hudi data source;
and the index file generation module is used for generating an index column file of the preset Hudi data source according to the joint index field, the index type and the address information of each data object.
Optionally, the joint index determining module obtains the joint query condition corresponding to each data object in the preset Hudi data source by the following method:
extracting a query condition part in the historical data operation information according to a preset data operation grammar;
screening the data objects with non-unique query conditions from the query condition part, and counting the proportion of the data objects with non-unique query conditions in the historical data operation information;
and selecting the query conditions corresponding to the data objects whose proportion is larger than a preset proportion threshold as the joint query conditions.
Optionally, the index type allocation module determines the index type of the joint index field by:
calculating a relative probability value between the data characteristic and a preset index type label by using a pre-trained activation function;
and calculating the score of each preset index type label according to the relative probability value, and determining the index type label with the highest score as the index type of the data object.
In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
a memory storing at least one computer program; and
a processor executing the program stored in the memory to implement the Hudi-based index creation method described above.
In order to solve the above-described problems, the present invention also provides a computer-readable storage medium having stored therein at least one computer program that is executed by a processor in an electronic device to implement the Hudi-based index creation method described above.
According to the invention, the historical data operation information is parsed to obtain the joint index field corresponding to each data object in the preset Hudi data source, and using joint indexes improves the speed and efficiency of searching financial big data. Because the joint index fields are derived from the historical data operation information, the selected joint index fields are consistent with how the financial big data is actually operated on, which further improves search efficiency. Meanwhile, the index type of each joint index field is determined according to the data characteristics of the data objects. In an actual search of financial big data, the corresponding joint index field can be located quickly in the index column file according to the index type, and the corresponding data object can then be obtained from the address information corresponding to that joint index field.
Drawings
FIG. 1 is a flowchart of a Hudi-based index creation method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a detailed implementation of one of the steps of a Hudi-based index creation method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating a detailed implementation of one of the steps of the Hudi-based index creation method according to an embodiment of the present application;
FIG. 4 is a functional block diagram of a Hudi-based index creating apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device implementing the Hudi-based index creating method according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The embodiment of the application provides a Hudi-based index creation method. The execution subject of the Hudi-based index creation method includes, but is not limited to, at least one of a server, a terminal, and other devices that can be configured to execute the method provided by the embodiment of the application. In other words, the Hudi-based index creation method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server may be an independent server, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
Referring to FIG. 1, a flowchart of a Hudi-based index creation method according to an embodiment of the present invention is shown. In this embodiment, the Hudi-based index creation method includes:
S1, acquiring historical data operation information of a preset Hudi data source, and analyzing the historical data operation information to obtain a joint query condition corresponding to each data object in the preset Hudi data source;
In the embodiment of the present invention, the preset Hudi data source refers to a storage area for financial big data that is managed using Hudi technology, where the storage area includes, but is not limited to, a database, a blockchain, a network cache, and the like.
In the embodiment of the present invention, the historical data operation information refers to the set of data operation instructions, such as data query, data addition, data insertion, data modification, and data deletion, that occurred in the preset Hudi data source during a historical period. The historical period may be set according to the data volume of the preset Hudi data source and the actual frequency and number of data operations, for example the most recent six months or three months.
In the embodiment of the present invention, the data object refers to a data field of a final operation corresponding to each data operation instruction in the historical data operation information, for example, a field that needs to be added, a field that needs to be deleted, or a field that needs to be searched.
It can be understood that searching for a data object that can be located with a single query condition is relatively efficient, whereas searching for a data object that can only be located by stacking multiple query conditions and filtering layer by layer is relatively inefficient. For example, consider querying the number of insured years of life insurance for users whose insurance type is life insurance and who have claim records. Here the number of insured years is the data object that ultimately needs to be queried, and the screening conditions corresponding to the data object are that the insurance type must be life insurance and that the number of claim records under that life insurance is greater than 0.
In the embodiment of the invention, the historical data operation information can be parsed syntactically to obtain the joint query condition corresponding to each data object.
In detail, referring to FIG. 2, the parsing the historical data operation information to obtain the joint query condition corresponding to each data object in the preset Hudi data source includes:
S11, extracting a query condition part in the historical data operation information according to a preset data operation grammar;
S12, screening the data objects with non-unique query conditions from the query condition part, and counting the proportion of the data objects with non-unique query conditions in the historical data operation information;
S13, selecting the query conditions corresponding to the data objects whose proportion is larger than the preset proportion threshold as the joint query conditions.
In the embodiment of the present invention, the preset data operation grammar may be determined according to a data operation language adopted by the preset Hudi data source actually, for example, the preset data operation grammar may be an SQL grammar.
In the embodiment of the invention, the preset proportion threshold can be set according to the actual situation of the financial big data. The purpose of selecting as joint query conditions only the query conditions of data objects whose proportion is larger than the preset proportion threshold is to ensure that setting a joint index for the data object is actually necessary.
The embodiment of the invention acquires the joint query condition from the historical data operation information so as to screen out the data object suitable for setting the joint index.
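Steps S11 to S13 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the history is assumed to be a list of simple single-table SQL `SELECT` statements, conditions are assumed to be joined only by `AND`, the "proportion" is interpreted as the share of all historical statements in which a data object is queried under more than one condition, and the names `extract_conditions` and `select_joint_conditions` are hypothetical.

```python
import re
from collections import Counter

# Assumption: only simple "SELECT cols FROM table [WHERE a AND b ...]" statements.
SELECT_RE = re.compile(
    r"\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+\S+"
    r"(?:\s+WHERE\s+(?P<where>.+?))?\s*;?\s*$",
    flags=re.IGNORECASE | re.DOTALL,
)


def extract_conditions(sql):
    """S11: return (queried data objects, list of AND-ed predicates)."""
    match = SELECT_RE.match(sql)
    if match is None:
        return [], []
    cols = [c.strip() for c in match.group("cols").split(",")]
    where = match.group("where") or ""
    preds = [p.strip()
             for p in re.split(r"\s+AND\s+", where, flags=re.IGNORECASE)
             if p.strip()]
    return cols, preds


def select_joint_conditions(history, threshold):
    """S12 and S13: keep data objects whose multi-condition share exceeds threshold."""
    joint_counts = Counter()   # statements querying the object with more than one condition
    seen_conditions = {}       # data object -> Counter of condition tuples
    for sql in history:
        cols, preds = extract_conditions(sql)
        if len(preds) < 2:     # a single condition is "unique", so skip it
            continue
        for col in cols:
            joint_counts[col] += 1
            seen_conditions.setdefault(col, Counter())[tuple(sorted(preds))] += 1
    result = {}
    for col, count in joint_counts.items():
        if count / len(history) > threshold:
            # Use the most frequent condition combination as the joint query condition.
            result[col] = seen_conditions[col].most_common(1)[0][0]
    return result


history = [
    "SELECT insured_years FROM policy WHERE risk_type = 'life' AND claim_count > 0",
    "SELECT insured_years FROM policy WHERE risk_type = 'life' AND claim_count > 0;",
    "SELECT user_id FROM policy WHERE gender = 'male'",
    "SELECT insured_years FROM policy WHERE risk_type = 'life' AND claim_count > 0",
]
print(select_joint_conditions(history, threshold=0.5))
```

A production parser would need a real SQL grammar (subqueries, `OR`, joins); the regular expression here only keeps the sketch short.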
S2, assembling a joint index field of the corresponding data object according to the joint query condition;
It will be appreciated that a joint query condition includes at least two query constraints, that all of the constraints are in an AND relationship, and that no single constraint can determine the final data object.
Illustratively, consider searching for the user IDs of male customers who hold both life insurance and automobile insurance. The data object to be queried is the user ID, and the joint query condition corresponding to the user ID is that the insurance types include both life insurance and automobile insurance and that the gender corresponding to the user ID is male.
With a traditional index setup, a first set of data objects whose index field is life insurance must be found, that set is filtered by the index field automobile insurance to obtain a second set, and finally a third set whose index field is male is selected from the second set. This query mode is inefficient; in this case, setting a joint index field makes it possible to locate the third set of data objects quickly.
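The contrast between layer-by-layer filtering and a joint index can be illustrated with a small in-memory sketch. The record layout and the field names `has_life`, `has_auto`, and `gender` are assumptions made for illustration only; a dictionary keyed by the tuple of joint field values stands in for the joint index.

```python
from collections import defaultdict

rows = [
    {"user_id": "U1", "has_life": True, "has_auto": True, "gender": "male"},
    {"user_id": "U2", "has_life": True, "has_auto": False, "gender": "male"},
    {"user_id": "U3", "has_life": True, "has_auto": True, "gender": "female"},
    {"user_id": "U4", "has_life": True, "has_auto": True, "gender": "male"},
]


def layered_filter(records):
    """Traditional approach: three successive passes over shrinking result sets."""
    first = [r for r in records if r["has_life"]]
    second = [r for r in first if r["has_auto"]]
    third = [r for r in second if r["gender"] == "male"]
    return [r["user_id"] for r in third]


def build_joint_index(records, fields):
    """Joint index: one key per combination of values of the joint index fields."""
    index = defaultdict(list)
    for r in records:
        index[tuple(r[f] for f in fields)].append(r["user_id"])
    return index


joint_index = build_joint_index(rows, ("has_life", "has_auto", "gender"))

# One lookup replaces the three filtering passes.
print(layered_filter(rows))
print(joint_index[(True, True, "male")])
```

Both calls return the same user IDs; the joint index pays the scan cost once at build time, so repeated queries with the same combination of conditions become single lookups.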
In detail, referring to FIG. 3, the assembling the joint index field of the corresponding data object according to the joint query condition includes:
S21, acquiring a field set of a data table where the data object is located;
S22, word segmentation is carried out on the joint query conditions to obtain a conditional word segmentation set;
S23, sequentially calculating the similarity between each conditional word in the conditional word set and each field in the field set;
S24, selecting a field with the similarity larger than a preset similarity threshold value as an index field of the corresponding conditional word;
S25, collecting all index fields of the data object to obtain a joint index field of the data object.
In the embodiment of the invention, the joint query condition is text content expressed by natural language, and the existing word segmentation tool can be utilized to segment words of the joint query condition.
In the embodiment of the invention, the similarity between each conditional word in the conditional word set and each field in the field set can be calculated in a fuzzy matching mode, and the higher the matching degree of fuzzy matching between each conditional word and each field is, the higher the similarity between the corresponding conditional word and the field is.
In another alternative embodiment of the present invention, the similarity between the corresponding conditional word and the field may be determined by calculating the text semantic similarity between each conditional word and each field.
In the embodiment of the invention, the preset similarity threshold can be set according to the actual situation of big financial data.
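Steps S21 to S25 can be sketched with fuzzy matching from Python's standard library. The sketch assumes English condition text, uses a regular expression as a stand-in for a word segmentation tool, and uses `difflib.SequenceMatcher` as one possible fuzzy-matching measure; the function names, the sample field set, and the 0.8 threshold are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokenize(condition_text):
    """S22: stand-in for a word segmentation tool (assumption: English text)."""
    return re.findall(r"[A-Za-z_]+", condition_text.lower())


def similarity(word, field):
    """S23: fuzzy-match ratio in [0, 1] between a condition word and a field name."""
    return SequenceMatcher(None, word.lower(), field.lower()).ratio()


def assemble_joint_index_fields(condition_text, table_fields, threshold=0.8):
    """S24 and S25: collect every field whose similarity to some word exceeds the threshold."""
    joint_fields = []
    for word in tokenize(condition_text):
        for field in table_fields:
            if similarity(word, field) > threshold and field not in joint_fields:
                joint_fields.append(field)
    return joint_fields


# S21: field set of the data table (hypothetical).
fields = ["user_id", "risk_type", "gender", "claim_count", "insured_years"]
print(assemble_joint_index_fields("risktype is life and gender is male", fields))
```

Here the word `risktype` matches the field `risk_type` despite the missing underscore, while value words such as `life` and `male` fall below the threshold. A semantic-similarity model could replace `similarity` without changing the rest of the flow.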
S3, extracting data characteristics of the data objects corresponding to each joint index field, and determining the index type of the joint index field according to the data characteristics;
In the embodiment of the invention, the purpose of setting corresponding index types for different data objects is to improve index retrieval efficiency by choosing a suitable index type. For example, a binary tree index or a B+ tree index may be used for fields subject to data range queries.
In the embodiment of the present invention, the selection of the corresponding index type is performed according to the data feature corresponding to each data object, where the data feature includes, but is not limited to, a data value feature, a data relevance feature, and the like.
In the embodiment of the invention, the data characteristics of the data object corresponding to each joint index field can be extracted by using a convolutional neural network model based on deep learning.
In detail, the extracting the data features of the data object corresponding to each joint index field includes:
acquiring value description information of the data object and context information of the data object;
converting the data object into a data word vector, converting the value description information into a value vector, and converting the context information into an associated word vector;
and splicing the data word vector, the value vector and the associated word vector to obtain the data characteristics of the data object.
In the embodiment of the invention, a word2vec model, an NLP (Natural Language Processing) model, or another model with a word vector conversion function can be used to convert the data object into a data word vector, the value description information into a value vector, and the context information into an associated word vector.
In the embodiment of the invention, the data features exist in a vector form, and the index type corresponding to the data features can be obtained by calculating the vector corresponding to the data features.
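The splicing of the three vectors can be sketched as plain list concatenation. Because a trained word-vector model is outside the scope of a short example, the sketch below substitutes a deterministic hash-based `toy_embed` function; this stand-in, the dimension of 4, and the sample strings are assumptions for illustration and carry no semantic meaning.

```python
import hashlib


def toy_embed(text, dim=4):
    """Stand-in for a word-vector model: maps text to `dim` floats in [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def data_features(data_object, value_description, context, dim=4):
    """Concatenate the data word vector, value vector, and associated word vector."""
    data_vec = toy_embed(data_object, dim)
    value_vec = toy_embed(value_description, dim)
    context_vec = toy_embed(context, dim)
    return data_vec + value_vec + context_vec   # list concatenation = splicing


features = data_features(
    "insured_years",
    "integer, range 0 to 100",
    "queried with risk_type and claim_count",
)
print(len(features))
```

The resulting feature vector has three times the dimension of each component vector, with the components kept in a fixed order so that a downstream classifier sees a consistent layout.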
In detail, the determining the index type of the joint index field according to the data features includes:
calculating a relative probability value between the data characteristic and a preset index type label by using a pre-trained activation function;
and calculating the score of each preset index type label according to the relative probability value, and determining the index type label with the highest score as the index type of the data object.
In the embodiment of the present invention, the pre-trained activation function includes, but is not limited to, a softmax activation function, a sigmoid activation function, and a ReLU activation function, and the preset index type label includes, but is not limited to, a Bloom Filter index label, a binary tree index label, or a B+ tree index label.
In another alternative embodiment of the present invention, the index type of each of the data objects may also be determined by benchmark testing and trial and error.
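One way to read the scoring step is as a linear layer followed by a softmax, with the highest-probability label chosen as the index type. The sketch below follows that reading; the weights and biases are hand-set placeholders rather than trained parameters, and the label names and the function `choose_index_type` are hypothetical.

```python
import math

LABELS = ["bloom_filter", "binary_tree", "b_plus_tree"]


def softmax(logits):
    """Numerically stable softmax: probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def choose_index_type(features, weights, biases):
    """Score each index type label and return the best label with all scores."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    scores = dict(zip(LABELS, softmax(logits)))
    return max(scores, key=scores.get), scores


# Placeholder parameters (assumption): one weight row per label.
weights = [[0.1, 0.0], [0.2, 0.0], [2.0, 0.0]]
biases = [0.0, 0.0, 0.0]

label, scores = choose_index_type([1.0, 0.0], weights, biases)
print(label)
```

In practice the weights would come from training on data objects whose suitable index type is already known, for instance from the benchmark testing mentioned above.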
S4, acquiring address information of each data object in the preset Hudi data source, and generating an index column file of the preset Hudi data source according to the joint index field, the index type and the address information of each data object.
In the embodiment of the invention, the address information of each data object in the preset Hudi data source can be obtained from the part file corresponding to each data object. The part file is a columnar storage format file suited to storing and processing data efficiently in a financial big data environment.
Preferably, the joint index field, the index type, and the address information of each data object may be encoded using the column encoding formats of the corresponding part file, for example Run Length Encoding (RLE), Delta Encoding, and Bit Packing, and assembled into the index column file of the preset Hudi data source.
An index column file generated from the part file uses column encoding to further reduce the storage space of the index file. It also supports adding, deleting, or modifying column definitions of the index column file without damaging existing data, which improves the flexibility and compatibility of the index column file.
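Two of the column encodings named above can be sketched in a few lines of Python. The example assumes that the index type column contains long runs of repeated labels (a good fit for run length encoding) and that the address column holds increasing byte offsets (a good fit for delta encoding); the sample values and function names are hypothetical, and bit packing is omitted for brevity.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]


def delta_encode(numbers):
    """Store the first value, then the difference to each previous value."""
    if not numbers:
        return []
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the original values."""
    restored, total = [], 0
    for d in deltas:
        total += d
        restored.append(total)
    return restored


index_types = ["b_plus_tree"] * 3 + ["bloom_filter"] * 2   # index type column
offsets = [1000, 1016, 1032, 1064]                         # address column

print(run_length_encode(index_types))
print(delta_encode(offsets))
```

Small deltas need fewer bits than absolute offsets, which is where a bit-packing stage would add further savings.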
According to the invention, the historical data operation information is parsed to obtain the joint index field corresponding to each data object in the preset Hudi data source, and using joint indexes improves the speed and efficiency of searching financial big data. Because the joint index fields are derived from the historical data operation information, the selected joint index fields are consistent with how the financial big data is actually operated on, which further improves search efficiency. Meanwhile, the index type of each joint index field is determined according to the data characteristics of the data objects. In an actual search of financial big data, the corresponding joint index field can be located quickly in the index column file according to the index type, and the corresponding data object can then be obtained from the address information corresponding to that joint index field.
Fig. 4 is a functional block diagram of a Hudi-based index creating apparatus according to an embodiment of the present invention.
The Hudi-based index creating apparatus 100 of the present invention may be installed in an electronic device. According to the functions implemented, the Hudi-based index creating apparatus 100 includes a joint index determining module 101, an index type allocation module 102, an index address obtaining module 103, and an index file generating module 104. A module of the invention, which may also be referred to as a unit, is a series of computer program segments that are stored in the memory of the electronic device, can be executed by the processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the joint index determining module 101 is configured to obtain historical data operation information of a preset Hudi data source, analyze the historical data operation information to obtain joint query conditions corresponding to each data object in the preset Hudi data source, and assemble joint index fields of corresponding data objects according to the joint query conditions;
the index type allocation module 102 is configured to extract a data feature of a data object corresponding to each joint index field, and determine an index type of the joint index field according to the data feature;
The index address obtaining module 103 is configured to obtain address information of each data object in the preset Hudi data source;
the index file generating module 104 is configured to generate an index column file of the preset Hudi data source according to the joint index field, the index type and the address information of each data object.
In detail, the specific embodiments of the modules of the Hudi-based index creating apparatus 100 are as follows:
Step one, acquiring historical data operation information of a preset Hudi data source, and analyzing the historical data operation information to obtain a joint query condition corresponding to each data object in the preset Hudi data source;
In the embodiment of the present invention, the preset Hudi data source refers to a storage area for financial big data that is managed using Hudi technology, where the storage area includes, but is not limited to, a database, a blockchain, a network cache, and the like.
In the embodiment of the present invention, the historical data operation information refers to the set of data operation instructions, such as data query, data addition, data insertion, data modification, and data deletion, that occurred in the preset Hudi data source during a historical period. The historical period may be set according to the data volume of the preset Hudi data source and the actual frequency and number of data operations, for example the most recent six months or three months.
In the embodiment of the present invention, the data object refers to a data field of a final operation corresponding to each data operation instruction in the historical data operation information, for example, a field that needs to be added, a field that needs to be deleted, or a field that needs to be searched.
It can be understood that searching for a data object that can be located with a single query condition is relatively efficient, whereas searching for a data object that can only be located by stacking multiple query conditions and filtering layer by layer is relatively inefficient. For example, consider querying the number of insured years of life insurance for users whose insurance type is life insurance and who have claim records. Here the number of insured years is the data object that ultimately needs to be queried, and the screening conditions corresponding to the data object are that the insurance type must be life insurance and that the number of claim records under that life insurance is greater than 0.
In the embodiment of the invention, the historical data operation information can be parsed syntactically to obtain the joint query condition corresponding to each data object.
In detail, the analyzing the historical data operation information to obtain the joint query condition corresponding to each data object in the preset Hudi data source includes:
extracting the query condition part of the historical data operation information according to a preset data operation grammar;
screening, from the query condition part, the data objects whose query conditions are non-unique, and counting the proportion of those data objects in the historical data operation information;
and selecting the query conditions corresponding to the data objects whose proportion is greater than a preset proportion threshold as the joint query conditions.
In the embodiment of the present invention, the preset data operation grammar may be determined according to a data operation language adopted by the preset Hudi data source actually, for example, the preset data operation grammar may be an SQL grammar.
In the embodiment of the invention, the preset proportion threshold can be set according to the actual situation of the financial big data. Selecting only the query conditions of data objects whose proportion is greater than the preset proportion threshold as joint query conditions ensures that setting a joint index for those data objects is actually necessary.
The embodiment of the invention acquires the joint query condition from the historical data operation information so as to screen out the data object suitable for setting the joint index.
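The three sub-steps of step one can be sketched as follows. This is a minimal illustration, assuming the history is a list of simple SQL `SELECT ... FROM ... WHERE ...` strings whose conditions are joined only by `AND`; the function names, the regular expression and the sample table `policy` are hypothetical and do not come from the patent.

```python
import re
from collections import Counter

# Hypothetical pattern for simple statements of the form
# SELECT <object> FROM <table> WHERE <conditions>
SELECT_PATTERN = re.compile(
    r"SELECT\s+(?P<obj>\w+)\s+FROM\s+\w+\s+WHERE\s+(?P<where>.+?)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def split_conditions(where_clause):
    """Naively split a WHERE clause on AND (no OR, no nested parentheses)."""
    parts = re.split(r"\s+AND\s+", where_clause.strip(), flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]


def joint_query_conditions(history, ratio_threshold):
    """Return {data_object: sorted conditions} for objects that are queried
    with more than one condition in a large enough share of the history."""
    multi_counts = Counter()
    conditions = {}
    for statement in history:
        match = SELECT_PATTERN.match(statement.strip())
        if not match:
            continue  # not a query with a condition part
        conds = split_conditions(match.group("where"))
        if len(conds) < 2:
            continue  # a single condition needs no joint index
        obj = match.group("obj")
        multi_counts[obj] += 1
        conditions.setdefault(obj, set()).update(conds)
    total = len(history)
    return {
        obj: sorted(conditions[obj])
        for obj, count in multi_counts.items()
        if total and count / total > ratio_threshold
    }


history = [
    "SELECT insured_years FROM policy WHERE risk_type = 'life' AND claim_count > 0",
    "SELECT insured_years FROM policy WHERE risk_type = 'life' AND claim_count > 0;",
    "SELECT user_id FROM policy WHERE gender = 'male'",
    "INSERT INTO policy VALUES (1)",
]
selected = joint_query_conditions(history, ratio_threshold=0.3)
```

A production implementation would more likely use a real SQL parser than a regular expression; the regular expression is used here only to keep the sketch self-contained.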
Step two, assembling a joint index field of the corresponding data object according to the joint query condition;
It will be appreciated that a joint query condition contains at least two query constraints, all of which are in an AND relationship, so that no single query constraint can determine the final data object.
Illustratively, consider searching for the user IDs of male customers who hold both life insurance and automobile insurance: the data object to be queried is the user ID, and the joint query condition corresponding to the user ID is that the risk types include both life insurance and automobile insurance and that the gender corresponding to the user ID is male.
With a traditional index setting, a first set of data objects whose index field is life insurance must be retrieved first, that set is then filtered by the index field automobile insurance to obtain a second set, and finally a third set whose index field is male is selected from the second set. This query mode is inefficient; by setting a joint index field, the third set of data objects can be located quickly.
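The contrast between layer-by-layer filtering and a joint index can be illustrated with a small in-memory sketch. The record layout and the field names `has_life`, `has_auto` and `gender` are assumptions made for this example only; the joint index is modelled as a plain dictionary keyed by the tuple of field values.

```python
from collections import defaultdict

records = [
    {"user_id": "u1", "has_life": True, "has_auto": True, "gender": "male"},
    {"user_id": "u2", "has_life": True, "has_auto": False, "gender": "male"},
    {"user_id": "u3", "has_life": True, "has_auto": True, "gender": "female"},
    {"user_id": "u4", "has_life": True, "has_auto": True, "gender": "male"},
]


def layered_lookup(rows):
    """Traditional mode: three successive filtering passes."""
    first = [r for r in rows if r["has_life"]]
    second = [r for r in first if r["has_auto"]]
    third = [r for r in second if r["gender"] == "male"]
    return [r["user_id"] for r in third]


def build_joint_index(rows, fields):
    """Joint index: one key built from all condition fields at once."""
    index = defaultdict(list)
    for r in rows:
        index[tuple(r[f] for f in fields)].append(r["user_id"])
    return index


joint_index = build_joint_index(records, ("has_life", "has_auto", "gender"))
# A single lookup replaces the three filtering passes.
joint_result = joint_index[(True, True, "male")]
```

Both approaches return the same user IDs; the joint index pays the scanning cost once, at build time, instead of on every query.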
In detail, the assembling the joint index field of the corresponding data object according to the joint query condition includes:
acquiring a field set of the data table where the data object is located;
performing word segmentation on the joint query condition to obtain a conditional word set;
sequentially calculating the similarity between each conditional word in the conditional word set and each field in the field set;
selecting a field whose similarity is greater than a preset similarity threshold as the index field of the corresponding conditional word;
and collecting all index fields of the data object to obtain the joint index field of the data object.
In the embodiment of the invention, the joint query condition is text content expressed by natural language, and the existing word segmentation tool can be utilized to segment words of the joint query condition.
In the embodiment of the invention, the similarity between each conditional word in the conditional word set and each field in the field set can be calculated in a fuzzy matching mode, and the higher the matching degree of fuzzy matching between each conditional word and each field is, the higher the similarity between the corresponding conditional word and the field is.
In another alternative embodiment of the present invention, the similarity between the corresponding conditional word and the field may be determined by calculating the text semantic similarity between each conditional word and each field.
In the embodiment of the invention, the preset similarity threshold can be set according to the actual situation of big financial data.
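The assembly of a joint index field by fuzzy matching can be sketched as below. The sketch assumes that word segmentation is a simple split into alphabetic tokens and that fuzzy matching is the ratio computed by Python's standard `difflib.SequenceMatcher`; the patent does not name a specific segmentation tool or similarity measure, and the field names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def segment(condition_text):
    """Stand-in for a word segmentation tool: split into alphabetic tokens."""
    return re.findall(r"[A-Za-z_]+", condition_text)


def similarity(word, field):
    """Fuzzy-match score in [0, 1]; higher means more similar."""
    return SequenceMatcher(None, word.lower(), field.lower()).ratio()


def assemble_joint_index_field(condition_text, table_fields, threshold):
    """Collect every table field that matches some conditional word."""
    joint_fields = []
    for word in segment(condition_text):
        for field in table_fields:
            if similarity(word, field) > threshold and field not in joint_fields:
                joint_fields.append(field)
    return joint_fields


table_fields = ["user_id", "risk_type", "gender", "claim_count"]
joint_field = assemble_joint_index_field(
    "risk_type is life and gender is male", table_fields, threshold=0.8
)
```

With a semantic similarity model in place of `similarity`, the same loop would cover the alternative embodiment described above.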
Step three, extracting data characteristics of the data objects corresponding to each joint index field, and determining the index type of the joint index field according to the data characteristics;
In the embodiment of the invention, corresponding index types are set for different data objects in order to improve index retrieval efficiency by choosing a suitable index type. For example, a binary tree index or a B+ tree index may be employed for fields used in data range queries.
In the embodiment of the present invention, the selection of the corresponding index type is performed according to the data feature corresponding to each data object, where the data feature includes, but is not limited to, a data value feature, a data relevance feature, and the like.
In the embodiment of the invention, the data characteristics of the data object corresponding to each joint index field can be extracted by using a convolutional neural network model based on deep learning.
In detail, the extracting the data features of the data object corresponding to each joint index field includes:
acquiring value description information of the data object and context information of the data object;
Converting the data object into a data word vector, converting the value description information into a value vector, and converting the context information into an associated word vector;
and splicing the data word vector, the value vector and the associated word vector to obtain the data characteristics of the data object.
In the embodiment of the invention, a word2vec model, an NLP (Natural Language Processing) model, or another model with a word vector conversion function can be adopted to convert the data object into a data word vector, the value description information into a value vector, and the context information into an associated word vector, respectively.
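The conversion and splicing steps can be sketched as follows. Because a trained word2vec model is outside the scope of a short example, the sketch substitutes a hypothetical `toy_embed` function that derives a deterministic pseudo-vector from a hash; only the splicing (concatenation) of the three vectors reflects the described method, and the sample strings are invented.

```python
import hashlib


def toy_embed(text, dim=8):
    """Hypothetical stand-in for a word-vector model (dim must be <= 32).

    It maps text to a fixed-length vector of floats in [0, 1] using SHA-256,
    so it carries no semantic meaning.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def data_features(data_object, value_description, context, dim=8):
    """Splice the data word vector, value vector and associated word vector."""
    data_vector = toy_embed(data_object, dim)
    value_vector = toy_embed(value_description, dim)
    context_vector = toy_embed(context, dim)
    return data_vector + value_vector + context_vector


features = data_features(
    "insured_years",
    "integer, range 0 to 80, frequently used in range queries",
    "policy table, joined with claim records",
)
```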
In the embodiment of the invention, the data features exist in a vector form, and the index type corresponding to the data features can be obtained by calculating the vector corresponding to the data features.
In detail, the determining the index type of the joint index field according to the data features includes:
calculating a relative probability value between the data characteristic and a preset index type label by using a pre-trained activation function;
and calculating the score of each preset index type label according to the relative probability value, and determining the index type label with the highest score as the index type of the data object.
In the embodiment of the present invention, the pre-trained activation function includes, but is not limited to, a softmax activation function, a sigmoid activation function, and a ReLU activation function, and the preset index type label includes, but is not limited to, a Bloom Filter index label, a binary tree index label, or a B+ tree index label.
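The scoring step can be sketched with a softmax over one score per index type label. The sketch assumes that a linear layer (the hypothetical `weights` and `biases`, which would come from training) produces one raw score per label before the softmax; the patent does not specify the layer that precedes the activation function.

```python
import math

INDEX_TYPE_LABELS = ["bloom_filter", "binary_tree", "b_plus_tree"]


def softmax(logits):
    """Numerically stable softmax: relative probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def choose_index_type(features, weights, biases):
    """Score every label and return the highest-scoring one with all scores."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return INDEX_TYPE_LABELS[best], probabilities


# Hypothetical parameters for a two-dimensional feature vector.
weights = [[0.1, 0.0], [0.2, 0.0], [2.0, 0.0]]
biases = [0.0, 0.0, 0.0]
label, scores = choose_index_type([1.0, 0.0], weights, biases)
```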
In another alternative embodiment of the present invention, the index type of each of the data objects may also be determined by benchmark testing and trial and error.
And step four, acquiring address information of each data object in the preset Hudi data source, and generating an index column file of the preset Hudi data source according to the joint index field, the index type and the address information of each data object.
In the embodiment of the invention, the address information of each data object in the preset Hudi data source can be obtained from the Parquet file corresponding to each data object. A Parquet file is a columnar storage format file and can be used to store and process data efficiently in a financial big data environment.
Preferably, the column encoding format of the Parquet file corresponding to each data object may be used to encode the joint index field, the index type and the address information of the data object, for example with encodings such as Run Length Encoding (RLE), Delta Encoding and Bit Packing, so as to assemble the index column file of the preset Hudi data source.
An index column file generated on the basis of the Parquet file can further reduce the storage space of the index file through column encoding. It also supports adding, deleting or modifying column definitions in the index column file without damaging existing data, which improves the flexibility and compatibility of the index column file.
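Two of the encodings named above can be illustrated generically. This is a plain-Python sketch of run-length encoding and delta encoding applied to hypothetical index columns; it does not reproduce the on-disk layout of any real columnar file format, and the column contents are invented for the example.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


def delta_encode(numbers):
    """Store the first number, then the difference to each predecessor."""
    if not numbers:
        return []
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]


def delta_decode(deltas):
    """Rebuild the original numbers by accumulating the differences."""
    decoded, running = [], 0
    for delta in deltas:
        running += delta
        decoded.append(running)
    return decoded


# Hypothetical index columns: index type per entry and address offsets.
index_type_column = ["bloom", "bloom", "btree"]
address_column = [100, 104, 108, 120]

encoded_types = run_length_encode(index_type_column)
encoded_addresses = delta_encode(address_column)
```

Run-length encoding suits the index type column because neighbouring entries often share a type, while delta encoding suits sorted address offsets because the differences are much smaller than the absolute values.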
According to the invention, the historical data operation information is parsed to obtain the joint index field corresponding to each data object in the preset Hudi data source, and the joint index improves the retrieval speed and query efficiency of financial big data. Because the joint index fields are derived from the historical data operation information, the selected fields are consistent with the actual operations on the financial big data, which further helps query efficiency. Meanwhile, the index type of each joint index field is determined according to the data characteristics of the data objects, so that in an actual financial big data query the corresponding joint index field can be located quickly in the index column file according to the index type, and the corresponding data object is then obtained from the address information associated with that joint index field.
Fig. 5 is a schematic structural diagram of an electronic device implementing a Hudi-based index creation method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program stored in the memory 11 and executable on the processor 10, such as a Hudi-based index creation program.
The memory 11 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as the code of the Hudi-based index creation program, but also for temporarily storing data that has been output or is to be output.
The processor 10 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects the respective components of the entire electronic device using various interfaces and lines, executes programs or modules stored in the memory 11 (for example, index creation based on Hudi, etc.), and invokes data stored in the memory 11 to perform various functions of the electronic device 1 and process data.
The bus may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 11 and at least one processor 10 etc.
Fig. 5 shows only an electronic device with certain components. A person skilled in the art will understand that the structure shown in Fig. 5 does not limit the electronic device 1, which may comprise fewer or more components than shown, combine certain components, or arrange the components differently.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
Further, the electronic device 1 may also comprise a network interface, optionally the network interface may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
The electronic device 1 may optionally further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only, and the scope of the patent application is not limited to this configuration.
The Hudi-based index creation program stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed by the processor 10, may implement:
acquiring historical data operation information of a preset Hudi data source, and analyzing the historical data operation information to obtain a joint query condition corresponding to each data object in the preset Hudi data source;
assembling a joint index field of the corresponding data object according to the joint query condition;
Extracting data characteristics of the data object corresponding to each joint index field, and determining the index type of the joint index field according to the data characteristics;
acquiring address information of each data object in the preset Hudi data source;
and generating an index column file of the preset Hudi data source according to the joint index field, the index type and the address information of each data object.
Further, the modules/units integrated in the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:
Acquiring historical data operation information of a preset Hudi data source, and analyzing the historical data operation information to obtain a joint query condition corresponding to each data object in the preset Hudi data source;
assembling a joint index field of the corresponding data object according to the joint query condition;
extracting data characteristics of the data object corresponding to each joint index field, and determining the index type of the joint index field according to the data characteristics;
acquiring address information of each data object in the preset Hudi data source;
and generating an index column file of the preset Hudi data source according to the joint index field, the index type and the address information of each data object.
In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the system claims can also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. A Hudi-based index creation method, the method comprising:
acquiring historical data operation information of a preset Hudi data source, and analyzing the historical data operation information to obtain a joint query condition corresponding to each data object in the preset Hudi data source;
assembling a joint index field of the corresponding data object according to the joint query condition;
extracting data characteristics of the data object corresponding to each joint index field, and determining the index type of the joint index field according to the data characteristics;
Acquiring address information of each data object in the preset Hudi data source;
and generating an index column file of the preset Hudi data source according to the joint index field, the index type and the address information of each data object.
2. The Hudi-based index creating method of claim 1, wherein the parsing the historical data operation information to obtain the joint query condition corresponding to each data object in the preset Hudi data source includes:
extracting a query condition part in the historical data operation information according to a preset data operation grammar;
screening, from the query condition part, the data objects whose query conditions are non-unique, and counting the proportion of those data objects in the historical data operation information;
and selecting the query conditions corresponding to the data objects whose proportion is greater than a preset proportion threshold as the joint query conditions.
3. The Hudi-based index creation method of claim 1, wherein the assembling the federated index field of the corresponding data object according to the federated query condition comprises:
acquiring a field set of a data table where the data object is located;
performing word segmentation on the joint query condition to obtain a conditional word set;
sequentially calculating the similarity between each conditional word in the conditional word set and each field in the field set;
selecting a field with the similarity larger than a preset similarity threshold as an index field of the corresponding conditional word;
and collecting all index fields of the data object to obtain a joint index field of the data object.
4. The Hudi-based index creating method as claimed in claim 1, wherein said extracting data features of the data object corresponding to each of the joint index fields comprises:
acquiring value description information of the data object and context information of the data object;
converting the data object into a data word vector, converting the value description information into a value vector, and converting the context information into an associated word vector;
and splicing the data word vector, the value vector and the associated word vector to obtain the data characteristics of the data object.
5. The Hudi-based index creation method of claim 1, wherein said determining an index type of the joint index field from the data characteristics comprises:
Calculating a relative probability value between the data characteristic and a preset index type label by using a pre-trained activation function;
and calculating the score of each preset index type label according to the relative probability value, and determining the index type label with the highest score as the index type of the data object.
6. An apparatus for creating an index based on Hudi, the apparatus comprising:
the joint index determining module is used for acquiring historical data operation information of a preset Hudi data source, analyzing the historical data operation information to obtain joint query conditions corresponding to each data object in the preset Hudi data source, and assembling joint index fields of the corresponding data objects according to the joint query conditions;
the index type distribution module is used for extracting the data characteristics of the data objects corresponding to each joint index field and determining the index type of the joint index field according to the data characteristics;
an index address acquisition module, configured to acquire address information of each data object in the preset Hudi data source;
and the index file generation module is used for generating an index column file of the preset Hudi data source according to the joint index field, the index type and the address information of each data object.
7. The Hudi-based index creating device of claim 6, wherein the joint index determining module obtains the joint query condition corresponding to each data object in the preset Hudi data source by:
extracting a query condition part in the historical data operation information according to a preset data operation grammar;
screening, from the query condition part, the data objects whose query conditions are non-unique, and counting the proportion of those data objects in the historical data operation information;
and selecting the query conditions corresponding to the data objects whose proportion is greater than a preset proportion threshold as the joint query conditions.
8. The Hudi-based index creation apparatus of claim 6, wherein the index type assignment module determines the index type of the joint index field by:
calculating a relative probability value between the data characteristic and a preset index type label by using a pre-trained activation function;
and calculating the score of each preset index type label according to the relative probability value, and determining the index type label with the highest score as the index type of the data object.
9. An electronic device, the electronic device comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the Hudi-based index creation method of any of claims 1 to 5.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the Hudi-based index creation method according to any of claims 1 to 5.
CN202311138922.4A 2023-09-05 2023-09-05 Hudi-based index creation method, hudi-based index creation device, hudi-based index creation equipment and Hudi-based index creation medium Pending CN117150085A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311138922.4A CN117150085A (en) 2023-09-05 2023-09-05 Hudi-based index creation method, hudi-based index creation device, hudi-based index creation equipment and Hudi-based index creation medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311138922.4A CN117150085A (en) 2023-09-05 2023-09-05 Hudi-based index creation method, hudi-based index creation device, hudi-based index creation equipment and Hudi-based index creation medium

Publications (1)

Publication Number Publication Date
CN117150085A true CN117150085A (en) 2023-12-01

Family

ID=88886539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311138922.4A Pending CN117150085A (en) 2023-09-05 2023-09-05 Hudi-based index creation method, hudi-based index creation device, hudi-based index creation equipment and Hudi-based index creation medium

Country Status (1)

Country Link
CN (1) CN117150085A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination