CN114860722A - Data fragmentation method, device, equipment and medium based on artificial intelligence - Google Patents

Data fragmentation method, device, equipment and medium based on artificial intelligence

Info

Publication number
CN114860722A
Authority
CN
China
Prior art keywords
data
fragmentation
fragmented
fragment
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210449787.4A
Other languages
Chinese (zh)
Inventor
陈海钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210449787.4A priority Critical patent/CN114860722A/en
Publication of CN114860722A publication Critical patent/CN114860722A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data fragmentation method and device based on artificial intelligence, an electronic device and a storage medium, wherein the data fragmentation method based on artificial intelligence comprises the following steps: acquiring configuration information of data to be fragmented; classifying the data to be fragmented according to service types to obtain classified data, and storing the classified data to a fragmentation table based on the configuration information to serve as first fragmentation data; receiving a data reading request to read target data from the fragment table; counting the number of each key field in the historical target data to obtain calling frequency so as to screen all the key fields to obtain a hot field data set; and calculating the text similarity of the first fragment data and the hotspot field data set to obtain semantic correlation so as to fragment the first fragment data to obtain second fragment data. According to the method and the device, the data to be fragmented can be stored in the sub-table mode according to the called frequency of the data to be fragmented in the database, and therefore the data calling efficiency of the database is improved.

Description

Data fragmentation method, device, equipment and medium based on artificial intelligence
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a data fragmentation method and apparatus, an electronic device, and a storage medium based on artificial intelligence.
Background
Traditional relational databases, such as the commonly used MySQL, have a performance bottleneck: once the data reaches a certain magnitude, database performance drops significantly and read and write operations are affected accordingly. Data fragmentation is a common way to optimize database performance. Because a single table or a single database has a data storage bottleneck, the data to be stored can be distributed across multiple databases and tables.
In the prior art, data is generally fragmented according to a single criterion, such as the size or the date of the data to be fragmented. However, a large amount of data to be fragmented corresponds to multiple different service types, and different data is invoked in the database at different frequencies. A more flexible and effective data fragmentation method is therefore urgently needed to improve the data invocation efficiency of the database.
Disclosure of Invention
In view of the foregoing, there is a need for providing a data fragmentation method, apparatus, electronic device and storage medium based on artificial intelligence, so as to solve the technical problem of how to improve the data call efficiency of a database.
The application provides a data fragmentation method based on artificial intelligence, which comprises the following steps:
acquiring configuration information of data to be fragmented, wherein the configuration information comprises the data volume of the data to be fragmented, a table building statement and a plurality of databases;
classifying the data to be fragmented according to service types to obtain classified data, and storing the classified data to a fragmentation table based on the configuration information to serve as first fragmentation data;
receiving a data reading request to read target data from the fragment table, wherein the data reading request carries a key field of the target data;
counting the number of each key field in the historical target data to obtain calling frequency, and screening all the key fields based on the calling frequency to obtain a hot field data set;
and calculating the text similarity of the first fragment data and each key field in the hotspot field data set according to a text similarity algorithm to obtain semantic relevance, and fragmenting the first fragment data based on the semantic relevance to obtain second fragment data.
In some embodiments, the obtaining configuration information of data to be fragmented, where the configuration information includes a data size of the data to be fragmented, a table building statement, and multiple databases, includes:
acquiring configuration information input by a user through a web interface, wherein the data volume of the data to be fragmented is used to determine the number of pieces of fragment data into which the data to be fragmented is split; the table building statement is used for generating a fragment table into which the data to be fragmented can be written; the databases are used for storing the data to be fragmented.
In some embodiments, the classifying the data to be fragmented according to the service type to obtain classified data, and storing the classified data in a fragmentation table as first fragmentation data based on the configuration information includes:
setting different coding labels for the data to be fragmented of each service type according to a preset mode;
classifying the data to be fragmented based on the coding labels to obtain classified data, storing the classified data of the same category to the same fragmentation table based on the configuration information, storing the classified data of different categories to different fragmentation tables, and taking all the classified data stored in the fragmentation tables as first fragmentation data.
In some embodiments, the receiving a data read request to read target data from the fragmentation table, where the data read request carries a key field of the target data, includes:
receiving a data reading request to obtain a key field of target data carried in the data reading request;
reading a routing address of the target data according to a pre-configured routing rule and the key field, wherein the routing address comprises a fragmentation table and a database of the target data;
and reading and feeding back the target data based on the routing address.
In some embodiments, counting the number of each key field in the historical target data to obtain a calling frequency, and screening all the key fields based on the calling frequency to obtain a hotspot field data set includes:
counting the number of each key field in the historical target data, and taking the ratio of the number of each key field to the total number of all key fields as the calling frequency;
and classifying the calling frequency, and screening all key fields according to a classification result to obtain a hot field data set.
In some embodiments, the classifying the call frequency and screening all the key fields according to the classification result to obtain the hotspot field data set includes:
calculating the error square sum of all the calling frequencies to determine the effective classification number of all the calling frequencies;
classifying all the calling frequencies based on the effective classification number to obtain a plurality of calling frequency classes;
calculating the average value of the calling frequencies contained in each calling frequency category, selecting the calling frequency category corresponding to the maximum average value, and taking the key fields corresponding to all the calling frequencies in the calling frequency category as the hotspot field data set.
In some embodiments, the calculating, according to a text similarity algorithm, a text similarity between the first fragment data and each key field in the hotspot field data set to obtain a semantic relevance, and fragmenting the first fragment data based on the semantic relevance to obtain second fragment data includes:
respectively calculating the text similarity between each classified data in the first fragment data and each key field in the hot field data set according to a text similarity algorithm;
taking the calling frequency corresponding to each key field in the hotspot field data set as a feature weight, and performing weighted summation on the feature weight and the text similarity to obtain the semantic relevancy of each classification data in the first fragment data;
and calculating the average value of the semantic relevance of all classified data in the first fragment data as a fragment threshold, and performing fragmentation on the first fragment data based on the fragment threshold to obtain second fragment data.
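The three sub-steps above can be sketched as follows. This is only an illustration under stated assumptions: the source does not name the text similarity algorithm, so Python's `difflib.SequenceMatcher` is used as a stand-in, and the function names, the sample field names, and the choice to put items at or above the threshold on the "hot" side are all hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Stand-in text similarity in [0, 1]; the patent does not fix an algorithm."""
    return SequenceMatcher(None, a, b).ratio()

def semantic_relevance(text: str, hot_fields: dict) -> float:
    """Weighted sum of similarities, using each hot field's calling frequency as weight."""
    return sum(freq * similarity(text, field) for field, freq in hot_fields.items())

def split_by_threshold(items: list, hot_fields: dict):
    """Use the mean relevance as the fragment threshold and split items around it."""
    if not items:
        raise ValueError("items must not be empty")
    scores = {item: semantic_relevance(item, hot_fields) for item in items}
    threshold = sum(scores.values()) / len(scores)
    # Assumption: items at or above the threshold are treated as hot data.
    hot = [item for item in items if scores[item] >= threshold]
    cold = [item for item in items if scores[item] < threshold]
    return hot, cold, threshold

# Hypothetical hot field data set: key field -> calling frequency.
hot_fields = {"user_id": 0.6, "user_name": 0.4}
hot, cold, threshold = split_by_threshold(["user_id", "user_name", "zzz"], hot_fields)
print(hot, cold)
```

Any other similarity measure that returns a score per pair of texts could be substituted for `similarity` without changing the rest of the sketch.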
The embodiment of the present application further provides a data fragmentation device based on artificial intelligence, the device includes:
an acquisition unit, configured to acquire configuration information of data to be fragmented, wherein the configuration information comprises the data volume of the data to be fragmented, a table building statement and a plurality of databases;
the classification unit is used for classifying the data to be fragmented according to the service type to obtain classified data, and storing the classified data into a fragmentation table based on the configuration information to serve as first fragmentation data;
a reading unit, configured to receive a data reading request to read target data from the fragment table, where the data reading request carries a key field of the target data;
the screening unit is used for counting the number of each key field in the historical target data to obtain calling frequency, and screening all the key fields based on the calling frequency to obtain a hot field data set;
and the calculating unit is used for calculating the text similarity of the first fragment data and each key field in the hotspot field data set according to a text similarity algorithm to obtain semantic relevance, and fragmenting the first fragment data based on the semantic relevance to obtain second fragment data.
An embodiment of the present application further provides an electronic device, where the electronic device includes:
a memory storing at least one instruction;
and the processor executes the instructions stored in the memory to realize the artificial intelligence based data fragmentation method.
The embodiment of the present application further provides a computer-readable storage medium, where at least one instruction is stored in the computer-readable storage medium, and the at least one instruction is executed by a processor in an electronic device to implement the artificial intelligence based data fragmentation method.
According to the method and the device, the number of occurrences of each key field in the historical target data is counted in order to screen all the key fields, the text similarity between the data to be fragmented and the hot field data set is then calculated to obtain the semantic relevance, and the classified data is fragmented based on the semantic relevance. Hot data can thus be screened flexibly, and the hot data in the data to be fragmented can be stored in separate tables, which improves the data calling efficiency of the database.
Drawings
Fig. 1 is a flowchart of a preferred embodiment of the artificial intelligence based data fragmentation method according to the present application.
Fig. 2 is a functional block diagram of a preferred embodiment of the artificial intelligence based data fragmentation apparatus according to the present application.
Fig. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the artificial intelligence based data fragmentation method according to the present application.
Fig. 4 is a graph of SSE as a function of k according to the present application.
Detailed Description
For a clearer understanding of the objects, features and advantages of the present application, the application is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present application and the features of the embodiments may be combined with each other where no conflict arises. Numerous specific details are set forth in the following description to provide a thorough understanding of the present application; the described embodiments are merely some, not all, of the embodiments of the present application.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The embodiment of the present application provides a data fragmentation method based on artificial intelligence, which can be applied to one or more electronic devices. An electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The electronic device may be any electronic product capable of human-computer interaction with a client, for example, a personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a game machine, an Internet Protocol Television (IPTV), an intelligent wearable device, and the like.
The electronic device may also include a network device and/or a client device. The network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a cloud computing based cloud consisting of a large number of hosts or network servers.
The network where the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a Virtual Private Network (VPN), and the like.
Fig. 1 is a flowchart of a preferred embodiment of the artificial intelligence based data fragmentation method according to the present application. The order of the steps in the flow chart may be changed and some steps may be omitted according to different needs.
S10, obtaining configuration information of the data to be fragmented, wherein the configuration information comprises the data volume of the data to be fragmented, a table building statement and a plurality of databases.
In an optional embodiment, the obtaining configuration information of the data to be fragmented, where the configuration information includes a data size of the data to be fragmented, a table building statement, and multiple databases, includes:
acquiring configuration information input by a user through a web interface, wherein the data volume of the data to be fragmented is used to determine the number of pieces of fragment data into which the data to be fragmented is split; the table building statement is used for generating a fragment table into which the data to be fragmented can be written; the databases are used for storing the data to be fragmented.
In this alternative embodiment, a web interface may be provided in advance to obtain the configuration information input by the user. The data volume of the data to be fragmented is a unit size: the data to be fragmented is split into multiple pieces of fragment data, each of that size. For example, when the data to be fragmented is 1000 GB and the data volume configured in advance on the web interface is 50 GB, the data to be fragmented can be split into 20 pieces of fragment data of 50 GB each.
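The arithmetic in this example can be sketched in a few lines. The function name `fragment_count` is hypothetical, and rounding up when the total is not an exact multiple of the unit size is an assumption; the source only covers the evenly divisible case.

```python
import math

def fragment_count(total_gb: float, unit_gb: float) -> int:
    """Number of fragments needed when each fragment holds at most unit_gb."""
    if unit_gb <= 0:
        raise ValueError("unit_gb must be positive")
    # Assumption: a final, partially filled fragment still counts as one fragment.
    return math.ceil(total_gb / unit_gb)

print(fragment_count(1000, 50))  # 20, matching the example in the text
```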
In this optional embodiment, the table building statement input by the user on the web interface is obtained. The table building statement is used to generate fragment tables with a uniform format, and it can also specify the index column used for fragmentation in the fragment table. The index column stores all the index data of the fragment table in column form, and the index data may be a key field of the data to be fragmented. The fragment table is used for writing fragment data, so fragment tables keep a unified format even when they are stored in different databases.
In this alternative embodiment, the multiple databases input by the user on the web interface are obtained. A database is a repository specified by the user for storing the data to be fragmented, and a database stores its objects in the form of tables. Therefore, in the embodiment of the present application, the data to be fragmented is first written into fragment tables, and the fragment tables are then stored in different databases by routing.
Therefore, the configuration information of the data to be fragmented can be quickly acquired based on the web interface, data support is provided for the subsequent process, and meanwhile, the human-computer interaction efficiency is improved.
S11, classifying the data to be fragmented according to the service type to obtain classified data, and storing the classified data to a fragmentation table as first fragmentation data based on the configuration information.
In an optional embodiment, the classifying the data to be fragmented according to service type to obtain classified data, and storing the classified data into a fragment table as first fragment data based on the configuration information includes:
S111, setting different coding labels for the data to be fragmented of each service type according to a preset mode.
S112, classifying the data to be fragmented based on the coding labels to obtain classified data, storing the classified data of the same category into the same fragmentation table based on the configuration information, storing the classified data of different categories into different fragmentation tables, and taking all the classified data stored into the fragmentation tables as first fragmentation data.
In this optional embodiment, data to be fragmented is written continuously while the application program executes its service functions. The data to be fragmented relates to the services run by the application program and is or will be used in subsequent services, so it has significant value and needs to be stored. The commonly used storage mode is database storage, for example common database software such as MySQL and Mycat.
In this optional embodiment, different encoding labels may be set for the data to be fragmented of each service type according to a preset manner, where the encoding labels may be letters, numbers, or special symbols, and this scheme is not particularly limited.
In this optional embodiment, data to be fragmented that has the same coding label is classified into the same category, yielding classified data of multiple categories. The classified data is then divided into multiple pieces of fragment data according to the data volume, and the pieces are stored in the multiple databases. Because a database stores data in tables, the fragment data must be written into tables before it can be stored. The corresponding fragment tables are therefore generated from the table building statement, and the fragment data belonging to the same category is written into the corresponding pre-generated fragment tables to obtain a plurality of target fragment tables. The fragment data is thus stored in the databases in the form of fragment tables, and all the classified data stored in the fragment tables is used as the first fragment data.
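A minimal sketch of this classification step is given below, using in-memory lists in place of real database tables. The label mapping, the record layout, the table naming scheme `frag_<label>_<n>`, and the use of a row count as the unit size are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from service type to coding label (step S111).
LABELS = {"loan": "A", "insurance": "B", "payment": "C"}

def classify(records):
    """Group records by coding label (step S112); returns {label: [records]}."""
    classified = defaultdict(list)
    for record in records:
        classified[LABELS[record["service_type"]]].append(record)
    return dict(classified)

def to_fragment_tables(classified, rows_per_table):
    """Split each category into tables so that one table holds one label only."""
    tables = {}
    for label, rows in classified.items():
        starts = range(0, len(rows), rows_per_table)
        for n, start in enumerate(starts, start=1):
            tables[f"frag_{label}_{n}"] = rows[start:start + rows_per_table]
    return tables

records = [
    {"service_type": "loan", "id": 1},
    {"service_type": "loan", "id": 2},
    {"service_type": "loan", "id": 3},
    {"service_type": "payment", "id": 4},
]
tables = to_fragment_tables(classify(records), rows_per_table=2)
print(sorted(tables))
```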
In this alternative embodiment, the target fragment tables are stored in sequence in one of the multiple databases. If some target fragment tables remain unstored, the remaining target fragment tables are stored in another of the multiple databases, other than those already used, until all the target fragment tables have been stored.
In this alternative embodiment, because a piece of fragment data is usually large, a single fragment table cannot hold a complete piece of fragment data. The same piece of fragment data can therefore be written across multiple fragment tables, which are distinguished by sequence numbers. For example, fragment tables 1 to 5000 are used to write the first piece of fragment data, fragment tables 5001 to 10000 are used to write the second piece of fragment data, and so on. A given fragment table stores only fragment data with the same coding label.
If the fragment tables written with the first piece of fragment data are fragment tables 1 to 5000, then in principle, under the rule of "one database, one fragment", all 5000 fragment tables should be stored in the same database. However, database capacity is limited: if the data volume of the fragment tables to be stored exceeds the capacity of the database, the fragment tables beyond that capacity need to be stored in another database. This is only an optional fallback, since current databases have very large capacity and also support capacity expansion.
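The sequential storage with overflow described above could look roughly like the sketch below. The function name `place_tables`, the uniform per-database capacity, and the error handling are assumptions; the source does not specify how capacity is measured or what happens when all databases are full.

```python
def place_tables(table_sizes, databases, capacity):
    """Fill databases in order, moving to the next one when capacity would be exceeded.

    table_sizes: list of (table_name, size) in storage order.
    databases:   list of database names configured by the user.
    capacity:    hypothetical per-database limit, in the same unit as size.
    """
    placement = {}
    db_index, used = 0, 0
    for name, size in table_sizes:
        if size > capacity:
            raise ValueError(f"table {name} is larger than one database")
        if used + size > capacity:
            # Current database is full: continue with the next configured one.
            db_index += 1
            used = 0
        if db_index >= len(databases):
            raise RuntimeError("not enough database capacity for all tables")
        placement[name] = databases[db_index]
        used += size
    return placement

placement = place_tables([("t1", 40), ("t2", 40), ("t3", 40)], ["db0", "db1"], 100)
print(placement)
```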
Therefore, the data volume of the data to be fragmented can be effectively dispersed, the reading/writing pressure of a single server resource is reduced, and the database has a good horizontal expansion function.
S12, receiving a data reading request to read the target data from the fragment table, wherein the data reading request carries the key field of the target data.
In an optional embodiment, the receiving a data read request to read target data from the fragmentation table, where the data read request carries a key field of the target data, includes:
S121, receiving a data reading request to obtain a key field of target data carried in the data reading request.
S122, reading a routing address of the target data according to a pre-configured routing rule and the key field, wherein the routing address comprises the fragmentation table and the database in which the target data is located.
S123, reading and feeding back the target data based on the routing address.
In this alternative embodiment, when the application program needs to read target data because of a business requirement, it may issue a corresponding data read request, where the target data is a particular item among all the classified data. The data read request carries a key field of the target data, which can be obtained by parsing the request. The key field can be used as an index of the target data, that is, the corresponding target data can be found directly through the index.
In this optional embodiment, the routing address of the target data is read from the key field through a preconfigured routing rule, where the routing address includes the fragmentation table and the database in which the target data is located. A route is analogous to a road, and a routing rule is analogous to a road sign that tells the application program how to reach the target data in the corresponding database and fragmentation table. The corresponding routing rule therefore needs to be set in advance by writing a script command.
This scheme provides a routing rule engine for database cache management; a ThingsBoard rule engine may be used. The main logic unit of the rule engine is the rule node. Rule nodes are related to one another, and a rule node can route a message to the next designated node to complete the data reading work. Therefore, when data is subsequently read, the key field obtained by parsing the request is used with the routing rule to read the routing address of the target data, namely the fragmentation table and the database in which the target data is located, and the target data is read and fed back according to the routing address.
In this alternative embodiment, after the target data is read, it is fed back to the corresponding application program. In a special case, when multiple pieces of target data needed by the application program are stored in different fragmentation tables, the pieces that have been read must be aggregated to obtain the aggregated target data. The embodiment of the present application therefore provides an SQL parsing and aggregation engine, which parses and processes the SQL requests of the application program, splits the processed SQL instructions according to the routing rule engine, processes multiple fragmentation tables and multiple databases concurrently, performs an aggregation operation once the execution results are obtained, and finally returns the data or operation result required by the application program.
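The read path in steps S121 to S123 can be illustrated with the sketch below. It does not reproduce the rule engine or the SQL engine described above: the hash-based routing rule, the database and table names, and the in-memory `storage` dictionary are all hypothetical stand-ins, and the reads run sequentially here rather than concurrently.

```python
import zlib

DATABASES = ["db0", "db1"]   # hypothetical configured databases
TABLES_PER_DB = 4            # hypothetical number of fragment tables per database

def route(key_field: str):
    """Stand-in routing rule: map a key field to a (database, table) address."""
    h = zlib.crc32(key_field.encode("utf-8"))  # stable across runs
    database = DATABASES[h % len(DATABASES)]
    table = f"frag_{(h // len(DATABASES)) % TABLES_PER_DB}"
    return database, table

def read_many(key_fields, storage):
    """Read each key from its routed table and aggregate the results in one list.

    storage: {(database, table): {key_field: value}} standing in for real tables.
    """
    results = []
    for key in key_fields:
        rows = storage.get(route(key), {})
        if key in rows:
            results.append(rows[key])
    return results

# Populate the stand-in storage through the same routing rule, then read back.
storage = {}
for key in ["a", "b", "c"]:
    storage.setdefault(route(key), {})[key] = key.upper()
print(read_many(["a", "c", "x"], storage))
```

Because writes and reads share the same `route` function, a key is always looked up in the table it was written to, which is the property the routing rule has to guarantee.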
Therefore, the target data can be quickly acquired according to the key fields carried in the data reading request, the acquisition efficiency of the target data is improved, and meanwhile, data support can be provided for the analysis of the hotspot data in the subsequent process according to the key fields.
S13, counting the number of each key field in the historical target data to obtain calling frequency, and screening all the key fields based on the calling frequency to obtain a hot field data set.
In an optional embodiment, counting the number of each key field in the historical target data to obtain a calling frequency, and screening all the key fields based on the calling frequency to obtain a hotspot field data set includes:
S131, counting the number of each key field in the historical target data, and taking the ratio of the number of each key field to the total number of all key fields as the calling frequency.
S132, classifying the calling frequency, and screening all the key fields according to the classification result to obtain a hot field data set.
In this optional embodiment, first, the number of each key field in the historical target data acquired by the application program is counted, and the ratio of the number of each key field to the total number of all key fields is used as the calling frequency of the corresponding key field.
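The calling frequency defined here is a simple ratio of counts, sketched below. The function name and the sample key fields are hypothetical.

```python
from collections import Counter

def call_frequencies(requested_keys):
    """Return {key_field: count / total} over a history of requested key fields."""
    counts = Counter(requested_keys)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}

history = ["user_id", "user_id", "order_id", "user_id"]
freqs = call_frequencies(history)
print(freqs)
```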
In this alternative embodiment, the calling frequencies may be classified by a K-means clustering algorithm. K-means is an iterative clustering algorithm. The number of groups K is fixed in advance, K calling frequency data points are randomly selected as the initial cluster centers, the distance between each calling frequency data point and each cluster center is calculated, and each data point is assigned to its nearest cluster center. A cluster center together with all the calling frequency data points assigned to it represents a cluster. Each time a data point is assigned, the center of its cluster is recalculated from all the data points currently in that cluster. This process is repeated until a termination condition is met. The termination condition may be that no (or a minimum number of) data points are reassigned to a different cluster, that no (or a minimum number of) cluster centers change, or that the sum of squared errors reaches a local minimum.
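A minimal one-dimensional K-means over calling frequencies is sketched below. It is an assumption-laden illustration, not the patented implementation: it uses the common batch variant, which recomputes all centers after a full assignment pass rather than after each individual assignment, and it stops when no data point changes cluster. The function name, the seed parameter and the sample values are hypothetical.

```python
import random

def kmeans_1d(values, k, max_iter=100, seed=0):
    """Cluster a list of numbers into k groups; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # k data points as initial cluster centers
    assignment = None
    for _ in range(max_iter):
        # Assign every value to the index of its nearest center.
        new_assignment = [
            min(range(k), key=lambda i: abs(v - centers[i])) for v in values
        ]
        if new_assignment == assignment:
            break  # termination: no data point was reassigned
        assignment = new_assignment
        # Recompute each center as the mean of its members (skip empty clusters).
        for i in range(k):
            members = [v for v, a in zip(values, assignment) if a == i]
            if members:
                centers[i] = sum(members) / len(members)
    clusters = [[v for v, a in zip(values, assignment) if a == i] for i in range(k)]
    return centers, clusters

frequencies = [0.01, 0.02, 0.03, 0.40, 0.42, 0.45]
centers, clusters = kmeans_1d(frequencies, k=2)
print(sorted(sorted(c) for c in clusters))
```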
In this optional embodiment, the K-means clustering algorithm requires the classification number K of the calling frequency data to be specified in advance. As the classification number increases, the calling frequencies are divided more finely, the aggregation degree of each class gradually increases, and the corresponding sum of squared errors (SSE) gradually decreases. The present scheme therefore obtains a reasonable classification number K by calculating the SSE under different classification numbers. The sum of squared errors is calculated as:
$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} \lvert p - m_i \rvert^{2}$$
where SSE is the sum of squared errors, $C_i$ is the i-th calling frequency class, $k$ is the preset number of calling frequency classes, $p$ is a data point in the calling frequency class $C_i$, and $m_i$ indicates the average of all the calling frequency data points in the calling frequency class $C_i$.
For example, assuming that the number k of calling frequency classes is 1 and the corresponding calling frequency data is (10, 20, 30), the average value of all data points is 20, so the sum of squared errors is SSE = $(10-20)^2 + (20-20)^2 + (30-20)^2 = 200$.
In this optional embodiment, when k is smaller than the ideal number of classifications, increasing k greatly increases the aggregation degree of each calling frequency class, so the SSE decreases sharply. Once k reaches the ideal number of classifications, the gain in aggregation degree obtained by increasing k drops quickly, so the decrease in SSE shrinks rapidly and then levels off as k continues to increase. Therefore, by plotting the SSE against k, the value of k at the data point with the highest curvature can be determined, and this value of k is the ideal number of classifications of the calling frequency.
For example, in the plot of SSE against k shown in fig. 4, the curvature of the data point at k = 4 is the largest, so the optimal classification number is k = 4.
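The SSE calculation and the search for the bend in the SSE curve can be sketched as follows. This is an illustration only: the discrete second difference is used here as a simple stand-in for curvature, which is an assumption rather than the method stated in the source, and the SSE values passed to `elbow_k` are invented for the example, not read from fig. 4.

```python
def sse(clusters):
    """Sum of squared errors: squared distance of each point to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        if not cluster:
            continue
        mean = sum(cluster) / len(cluster)
        total += sum((p - mean) ** 2 for p in cluster)
    return total

def elbow_k(sse_by_k):
    """Return the k with the largest second difference of SSE.

    sse_by_k maps consecutive values of k to the SSE obtained with that k.
    """
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = sse_by_k[prev_k] - 2 * sse_by_k[k] + sse_by_k[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

single_class = sse([[10, 20, 30]])  # the k = 1 worked example
# Hypothetical SSE values for k = 1..6, chosen to bend at k = 4.
best = elbow_k({1: 1000, 2: 600, 3: 300, 4: 60, 5: 50, 6: 45})
```

The first and last values of k cannot be selected by this rule, because a second difference needs a neighbor on each side.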
In this optional embodiment, the finally determined classification number K of the ideal calling frequency is used as the effective classification number, and the calling frequency is classified by a K-means clustering algorithm based on the effective classification number to obtain a plurality of calling frequency classes.
In this optional embodiment, an average value of the calling frequencies included in each calling frequency category is calculated, and after the calling frequency category corresponding to the maximum average value is selected, the key fields corresponding to all the calling frequencies in the calling frequency category are used as the hotspot field data set.
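Selecting the hotspot fields from the clustering result can be sketched as below. The representation of a calling frequency class as a dictionary from key field to calling frequency, and the names used, are assumptions made for the example.

```python
def hotspot_fields(frequency_classes):
    """Return the key fields of the class with the highest mean calling frequency."""
    non_empty = [c for c in frequency_classes if c]
    if not non_empty:
        return set()
    hottest = max(non_empty, key=lambda c: sum(c.values()) / len(c))
    return set(hottest)

# Hypothetical clustering result: three calling frequency classes.
classes = [
    {"remark": 0.02, "memo": 0.03},
    {"user_id": 0.40, "order_id": 0.35},
    {"sku": 0.20},
]
hot = hotspot_fields(classes)
```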
Therefore, hot spot data in the classified data can be obtained through screening, and the subsequent process can conveniently carry out sub-table storage according to the hot spot data, so that accurate fragmentation is carried out on the data to be fragmented.
S14, calculating the text similarity of the first fragment data and each key field in the hotspot field data set according to a text similarity algorithm to obtain semantic relevance, and fragmenting the first fragment data based on the semantic relevance to obtain second fragment data.
In an optional embodiment, the calculating, according to a text similarity algorithm, a text similarity between the first sliced data and each key field in the hotspot field data set to obtain a semantic relevance, and slicing the first sliced data based on the semantic relevance to obtain second sliced data includes:
and S141, respectively calculating the text similarity of each classification data in the first fragment data and each key field in the hotspot field data set according to a text similarity algorithm.
And S142, taking the calling frequency corresponding to each key field in the hot spot field data set as a feature weight, and performing weighted summation on the feature weight and the text similarity to obtain the semantic relevancy of each classification data in the first fragment data.
S143, calculating an average value of semantic relevance of all classified data in the first fragmented data as a fragmentation threshold, and fragmenting the first fragmented data based on the fragmentation threshold to obtain second fragmented data.
In this optional embodiment, the text similarity algorithm may use a Vector Space Model (VSM) algorithm, where the VSM forms each data in the first sliced data into one point in space, and provides the point in space in a vector form, and simplifies the processing of the first sliced data into a vector operation in a vector space, thereby reducing the complexity of performing text similarity calculation on classified data in the first sliced data.
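A common way to compare two texts in a vector space model is the cosine of the angle between their term-count vectors. The sketch below assumes that measure, raw term counts as vector components, and a simple lowercase alphanumeric tokenizer; none of these choices is specified by the source.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    va, vb = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(n * n for n in va.values()))
    norm_b = math.sqrt(sum(n * n for n in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same = cosine_similarity("order id", "order_id")
partial = cosine_similarity("user order", "order")
none = cosine_similarity("order id", "user name")
```

Texts with identical terms score close to 1, texts with no shared term score 0, and partially overlapping texts fall in between.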
In this optional embodiment, the text similarity between each classification data in the first fragment data and each key field in the hot spot field data set is calculated by using a VSM algorithm, then the calling frequency corresponding to each key field in the hot spot field data set is used as a feature weight, and the feature weight and the text similarity are weighted and summed to obtain the semantic relevance of each classification data in the first fragment data.
Illustratively, there are 4 key fields A, B, C, D in the hotspot field data set, and the text similarities between the classification data E in the first fragment data and the 4 key fields A, B, C, D, calculated by the VSM algorithm, are 0.6, 0.2, 0.1, and 0.3, respectively. If the calling frequencies corresponding to the 4 key fields A, B, C, D are 0.8, 0.5, 0.6, and 0.2, the semantic relevance of the classification data E in the first fragment data, calculated by weighted summation, is 0.6 × 0.8 + 0.2 × 0.5 + 0.1 × 0.6 + 0.3 × 0.2 = 0.7.
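The weighted summation in this example can be checked with a short sketch; the function name `semantic_relevance` is hypothetical, and the numbers are those of the example.

```python
def semantic_relevance(similarities, calling_freqs):
    """Weighted sum of text similarities, weighted by calling frequency."""
    return sum(similarities[f] * calling_freqs[f] for f in similarities)

similarities = {"A": 0.6, "B": 0.2, "C": 0.1, "D": 0.3}   # E versus each key field
calling_freqs = {"A": 0.8, "B": 0.5, "C": 0.6, "D": 0.2}  # feature weights
relevance_e = semantic_relevance(similarities, calling_freqs)
```

The four products are 0.48, 0.10, 0.06, and 0.06, which add up to 0.70.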
In this optional embodiment, the average value of the semantic relevance of all classification data in the first fragment data is calculated as the fragmentation threshold, and the first fragment data is fragmented based on the fragmentation threshold to obtain the second fragment data. The specific process is as follows: each fragment table is split into a data master table and a data slave table; classification data whose semantic relevance is greater than the fragmentation threshold is written into the data slave table, and classification data whose semantic relevance is not greater than the fragmentation threshold is written into the data master table. When a fragment table is split, the fragment table itself may serve as the data master table, and a fragment table that contains no data is selected as its data slave table. Finally, all the first fragment data written into the data master tables and data slave tables is taken as the second fragment data, which completes the fragmentation of the data to be fragmented.
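The threshold split can be sketched as follows. Tables are modeled as plain Python lists, and the item names and relevance values are hypothetical; a real implementation would write rows to database tables instead.

```python
def split_by_threshold(relevance_by_item):
    """Split items into (master, slave, threshold) using the mean relevance."""
    if not relevance_by_item:
        return [], [], 0.0
    threshold = sum(relevance_by_item.values()) / len(relevance_by_item)
    master, slave = [], []
    for item, relevance in relevance_by_item.items():
        # Items above the threshold (hot data) go to the slave table.
        (slave if relevance > threshold else master).append(item)
    return master, slave, threshold

# Hypothetical semantic relevance of four classification data items.
relevance = {"E": 0.70, "F": 0.10, "G": 0.25, "H": 0.55}
master_table, slave_table, threshold = split_by_threshold(relevance)
```

With these values the threshold is 0.4, so E and H are written to the data slave table and F and G stay in the data master table.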
Therefore, the hot data can be flexibly screened and the hot data in the classified data can be stored in a sub-table mode, and therefore the data calling efficiency of the database is improved.
Referring to fig. 2, fig. 2 is a functional block diagram of a preferred embodiment of the artificial intelligence based data slicing apparatus according to the present application. The artificial intelligence based data slicing device 11 comprises an acquisition unit 110, a classification unit 111, a reading unit 112, a screening unit 113 and a calculation unit 114. A module/unit as referred to herein is a series of computer readable instruction segments capable of being executed by the processor 13 and performing a fixed function, and is stored in the memory 12. In the present embodiment, the functions of the modules/units will be described in detail in the following embodiments.
In an optional embodiment, the obtaining unit 110 is configured to obtain configuration information of the data to be fragmented, where the configuration information includes a data size of the data to be fragmented, a table building statement, and multiple databases.
In an optional embodiment, the obtaining configuration information of the data to be fragmented, where the configuration information includes a data size of the data to be fragmented, a table building statement, and multiple databases, includes:
acquiring configuration information input by a user through a web interface, wherein the data volume of the data to be fragmented is used for indicating the number of fragmented data into which the data to be fragmented is split; the table building statement is used for generating a fragment table which can be written in the data to be fragmented; the databases are used for storing the data to be fragmented.
In this alternative embodiment, a web interface may be provided in advance to obtain the configuration information input by the user through the web interface. The data size of the data to be fragmented is a unit size, and the data to be fragmented needs to be split into a plurality of fragmented data of the size of the data size. For example, when one piece of data to be fragmented is 1000GB, and the data volume of the fragment data configured on the web interface in advance is 50GB, the piece of data to be fragmented can be split into 20 pieces of fragment data with the data volume of 50 GB.
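The number of pieces follows from a ceiling division, as the sketch below shows. The rounding-up behavior for totals that are not an exact multiple of the unit size is an assumption; the source only gives the exact-multiple case.

```python
def fragment_count(total_gb, fragment_gb):
    """Number of pieces of size fragment_gb needed to hold total_gb."""
    if fragment_gb <= 0:
        raise ValueError("fragment size must be positive")
    return -(-total_gb // fragment_gb)  # ceiling division on integers

pieces = fragment_count(1000, 50)
```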
In this optional embodiment, the table building statement input by the user on the web interface is obtained. The table building statement is used for generating fragment tables with a uniform format, and it can also specify the index column used for fragmentation in the fragment table. The index column comprises a plurality of indexes. An index is a separate, physical database structure that sorts the values of one or more columns in a database table; it consists of the set of values of one or more columns in a table together with a list of logical pointers to the data pages in the table that physically hold those values. An index is comparable to the table of contents of a book, from which the required content can be quickly found by page number. In addition, the fragment table is used for writing fragment data, so that fragment tables keep a unified format when data is stored in different databases.
In this alternative embodiment, multiple databases of user inputs at the web interface are obtained. The database is a warehouse specified by a user for storing data to be fragmented, and the storage objects of the database are in the form of tables. Therefore, in the embodiment of the present application, data to be fragmented needs to be written into the fragmentation table, and then the fragmentation table is stored to different databases by routing.
In an optional embodiment, the classifying unit 111 is configured to classify the data to be fragmented according to a service type to obtain classified data, and store the classified data in a fragmentation table based on the configuration information to serve as first fragmentation data.
In an optional embodiment, the classifying the data to be fragmented according to the service type to obtain classified data, and storing the classified data in a fragmentation table as first fragmentation data based on the configuration information includes:
setting different coding labels for the data to be fragmented of each service type according to a preset mode;
classifying the data to be fragmented based on the coding labels to obtain classified data, storing the classified data of the same category to the same fragmentation table based on the configuration information, storing the classified data of different categories to different fragmentation tables, and taking all the classified data stored in the fragmentation tables as first fragmentation data.
In this optional embodiment, the data to be fragmented is written continuously while the application program executes its service functions. The data to be fragmented relates to the services carried out by the application program and is used, or will be used, in subsequent services, so it has significant value and needs to be stored. The commonly used storage mode is database storage, for example with common database software such as MySQL and Mycat.
In this optional embodiment, different encoding labels may be set for the data to be fragmented of each service type according to a preset manner, where the encoding labels may be letters, numbers, or special symbols, and this scheme is not particularly limited.
In this optional embodiment, the data to be fragmented having the same encoding label is classified into the same category, so as to obtain classification data of multiple categories, and then the classified classification data is divided into multiple fragment data according to the data amount, and then the multiple fragment data is stored in multiple databases. In terms of setting, the object of the database for storing data is a table, so that the fragment data needs to be written into the table to perform corresponding storage service. Therefore, the corresponding fragment table can be generated according to the table building statement, and then the fragment data belonging to the same category is respectively written into the corresponding fragment table generated in advance to obtain a plurality of target fragment tables, so that the fragment data is stored in the database in the form of the fragment table, and all the classification data stored in the fragment table is used as the first fragment data.
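Grouping records by coding label can be sketched as below. The label mapping, the record layout, and the use of in-memory lists in place of fragment tables are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical preset mapping from service type to coding label.
CODING_LABELS = {"loan": "A", "insurance": "B", "payment": "C"}

def classify(records):
    """Group records into one list (standing in for a fragment table) per label."""
    tables = defaultdict(list)
    for record in records:
        label = CODING_LABELS[record["service_type"]]
        tables[label].append(record)
    return dict(tables)

records = [
    {"id": 1, "service_type": "loan"},
    {"id": 2, "service_type": "insurance"},
    {"id": 3, "service_type": "loan"},
]
classified = classify(records)
```

Records with the same label always land in the same group, which mirrors the rule that one fragment table only stores data of one category.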
In this alternative embodiment, the plurality of target fragment tables are stored in sequence in one of the plurality of databases. If target fragment tables remain that have not been stored, the remaining target fragment tables are stored in another of the plurality of databases, other than the one already used, and so on until all the target fragment tables are stored.
In this alternative embodiment, since the data size of one piece of fragment data is usually large, a single fragment table cannot hold a complete piece of fragment data. Therefore, the same piece of fragment data can be written using a plurality of fragment tables, which are distinguished by sequence numbers. For example, fragment table 1 to fragment table 5000 are used to write the first piece of fragment data, fragment table 5001 to fragment table 10000 are used to write the second piece of fragment data, and so on. The same fragment table only stores fragment data with the same coding label.
If the fragment tables written for the first piece of fragment data are fragment table 1 to fragment table 5000, then in theory, according to the principle of "one fragment per database", all 5000 fragment tables should be stored in the same database. However, because the carrying capacity of a database is limited, if the data volume of the fragment tables to be stored exceeds the carrying capacity of the database, the fragment tables beyond that capacity need to be stored in another database. This approach is only an optional fallback, since the carrying capacity of current databases is very large and capacity expansion is also supported.
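The fill-then-spill placement can be sketched as follows. Capacity is modeled as a maximum number of tables per database, which is a simplification of the data-volume limit described above; the function and database names are hypothetical.

```python
def place_tables(table_names, databases, capacity):
    """Fill databases in order, moving to the next one when capacity is reached."""
    if not databases or capacity < 1:
        raise ValueError("need at least one database and a positive capacity")
    placement = {db: [] for db in databases}
    index = 0
    for table in table_names:
        if len(placement[databases[index]]) >= capacity:
            index += 1  # current database is full, spill to the next one
            if index == len(databases):
                raise ValueError("not enough database capacity")
        placement[databases[index]].append(table)
    return placement

tables = [f"fragment_table_{n}" for n in range(1, 6)]
placement = place_tables(tables, ["db_a", "db_b"], capacity=3)
```

Here the first three tables stay together in `db_a` and only the overflow goes to `db_b`, so a piece of fragment data is split across databases only when it has to be.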
In an optional embodiment, the reading unit 112 is configured to receive a data read request to read target data from the fragmentation table, where the data read request carries a key field of the target data.
In an optional embodiment, the receiving a data read request to read target data from the fragmentation table, where a key field carrying the target data in the data read request includes:
receiving a data reading request to obtain a key field of target data carried in the data reading request;
reading a routing address of the target data according to a pre-configured routing rule and the key field, wherein the routing address comprises a fragmentation table and a database of the target data;
and reading and feeding back the target data based on the routing address.
In this alternative embodiment, when the application program needs to read the target data currently due to business requirements, a corresponding data read request may be issued, where the target data refers to a certain data in all classified data. The data reading request carries a key field of target data, which can be obtained by analyzing the data reading request, and the key field can be used as an index of the target data, that is, the corresponding target data can be directly found through the index.
In this optional embodiment, based on the key field, the routing address of the target data is read through a preconfigured routing rule, where the routing address comprises the fragment table and the database in which the target data is located. A route is equivalent to a path, and a routing rule is equivalent to a road sign that tells the application program how to obtain the target data from the corresponding database and fragment table. The corresponding routing rule therefore needs to be set in advance by writing a script command.
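The source does not specify the form of the routing rule. As one possible illustration, the sketch below assumes a hash-based rule that maps a key field to a slot and derives the database and fragment table from that slot; the naming scheme and parameters are hypothetical.

```python
import zlib

def route(key_field, databases, tables_per_db):
    """Map a key field to a (database, fragment table) routing address."""
    total_tables = len(databases) * tables_per_db
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    slot = zlib.crc32(key_field.encode("utf-8")) % total_tables
    database = databases[slot // tables_per_db]
    return database, f"fragment_table_{slot + 1}"

address = route("order_id:1001", ["db_a", "db_b"], tables_per_db=4)
```

A deterministic rule of this kind lets the reader recompute the address from the key field alone, without a lookup table; the trade-off is that changing the number of databases or tables changes the mapping.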
The scheme provides a routing rule engine for database cache management; for example, the ThingsBoard rule engine may be used. The main logic unit of the rule engine is the rule node. Rule nodes are related to one another, and a rule node can route a message to the next designated node to complete the data reading work. Therefore, when data is subsequently read, the routing rule is applied to the key field obtained by parsing the request in order to read the routing address of the target data, namely the fragment table and the database in which the target data is located, and the target data is read and fed back according to the routing address.
In this alternative embodiment, after the target data is read, it is fed back to the corresponding application program. In a special case, when a plurality of target data that the application program needs to read are stored in different fragment tables, an aggregation operation needs to be performed on the read target data to obtain aggregated target data. Therefore, an embodiment of the present application provides an SQL parsing and aggregation engine, which is responsible for parsing and processing the SQL requests of the application program, splitting the processed SQL statement according to the routing rule engine, processing multiple fragment tables and multiple databases concurrently, performing an aggregation operation after the execution results are obtained, and finally returning the data or operation result required by the application program.
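The concurrent read followed by aggregation can be sketched with a thread pool. This sketch does not parse SQL: fragment tables are modeled as in-memory lists of rows, and the row layout and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def read_from_table(table, wanted_keys):
    """Return the rows of one fragment table whose key is requested."""
    return [row for row in table if row["key"] in wanted_keys]

def scatter_gather(tables, wanted_keys):
    """Query all fragment tables concurrently and merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(tables))) as pool:
        partials = pool.map(lambda t: read_from_table(t, wanted_keys), tables)
        merged = [row for part in partials for row in part]
    return sorted(merged, key=lambda row: row["key"])

# Two hypothetical fragment tables held in different databases.
table_1 = [{"key": 3, "value": "c"}, {"key": 1, "value": "a"}]
table_2 = [{"key": 2, "value": "b"}, {"key": 9, "value": "z"}]
result = scatter_gather([table_1, table_2], wanted_keys={1, 2, 3})
```

Sorting the merged rows gives the caller a single ordered result, as if the data had been read from one table.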
In an optional embodiment, the screening unit 113 is configured to count the number of each key field in the historical target data to obtain a calling frequency, and screen all the key fields based on the calling frequency to obtain a hot field data set.
In an optional embodiment, counting the number of each key field in the historical target data to obtain a calling frequency, and screening all the key fields based on the calling frequency to obtain a hotspot field data set includes:
counting the number of each key field in the historical target data, and taking the ratio of the number of each key field to the total number of all key fields as the calling frequency;
and classifying the calling frequency, and screening all the key fields according to a classification result to obtain a hot field data set.
In this optional embodiment, first, the number of each key field in the historical target data acquired by the application program is counted, and the ratio of the number of each key field to the total number of all key fields is used as the calling frequency of the corresponding key field.
In this alternative embodiment, the calling frequencies may be classified by a K-means clustering algorithm. K-means is an iterative cluster analysis algorithm: the number of groups K is fixed in advance, K calling frequency data are randomly selected as initial cluster centers, the distance between each calling frequency data and each cluster center is calculated, and each calling frequency data is assigned to the nearest cluster center. A cluster center together with all the calling frequency data assigned to it represents a cluster. Each time a calling frequency data is assigned, the center of its cluster is recalculated from all the calling frequency data currently in that cluster. This process is repeated until a termination condition is met. The termination condition may be that no (or a minimum number of) calling frequency data are reassigned to a different cluster, that no (or a minimum number of) cluster centers change again, or that the sum of squared errors reaches a local minimum.
In this optional embodiment, the K-means clustering algorithm requires the classification number K of the calling frequency data to be specified in advance. As the classification number increases, the calling frequencies are divided more finely, the aggregation degree of each class gradually increases, and the corresponding sum of squared errors (SSE) gradually decreases. The present scheme therefore obtains a reasonable classification number K by calculating the SSE under different classification numbers. The sum of squared errors is calculated as:
$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} \lvert p - m_i \rvert^{2}$$
where SSE is the sum of squared errors, $C_i$ is the i-th calling frequency class, $k$ is the preset number of calling frequency classes, $p$ is a data point in the calling frequency class $C_i$, and $m_i$ indicates the average of all the calling frequency data points in the calling frequency class $C_i$.
For example, assuming that the number k of calling frequency classes is 1 and the corresponding calling frequency data is (10, 20, 30), the average value of all data points is 20, so the sum of squared errors is SSE = $(10-20)^2 + (20-20)^2 + (30-20)^2 = 200$.
In this optional embodiment, when k is smaller than the ideal number of classifications, increasing k greatly increases the aggregation degree of each calling frequency class, so the SSE decreases sharply. Once k reaches the ideal number of classifications, the gain in aggregation degree obtained by increasing k drops quickly, so the decrease in SSE shrinks rapidly and then levels off as k continues to increase. Therefore, by plotting the SSE against k, the value of k at the data point with the highest curvature can be determined, and this value of k is the ideal number of classifications of the calling frequency.
For example, in the plot of SSE against k shown in fig. 4, the curvature of the data point at k = 4 is the largest, so the optimal classification number is k = 4.
In this optional embodiment, the finally determined classification number K of the ideal calling frequency is used as the effective classification number, and the calling frequency is classified by a K-means clustering algorithm based on the effective classification number to obtain a plurality of calling frequency classes.
In this optional embodiment, an average value of the calling frequencies included in each calling frequency category is calculated, and after the calling frequency category corresponding to the maximum average value is selected, the key fields corresponding to all the calling frequencies in the calling frequency category are used as the hotspot field data set.
In an optional embodiment, the calculating unit 114 is configured to calculate a text similarity between the first fragment data and each key field in the hotspot field data set according to a text similarity algorithm to obtain a semantic relevance, and perform fragmentation on the first fragment data based on the semantic relevance to obtain second fragment data.
In an optional embodiment, the calculating, according to a text similarity algorithm, a text similarity between the first sliced data and each key field in the hotspot field data set to obtain a semantic relevance, and slicing the first sliced data based on the semantic relevance to obtain second sliced data includes:
respectively calculating the text similarity of each classified data in the first fragment data and each key field in the hotspot field data set according to a text similarity algorithm;
taking the calling frequency corresponding to each key field in the hotspot field data set as a feature weight, and performing weighted summation on the feature weight and the text similarity to obtain the semantic relevancy of each classification data in the first fragment data;
and calculating the average value of the semantic relevance of all classified data in the first fragment data as a fragment threshold, and performing fragmentation on the first fragment data based on the fragment threshold to obtain second fragment data.
In this optional embodiment, the text similarity algorithm may use a Vector Space Model (VSM) algorithm, where the VSM forms each data in the first sliced data into one point in space, and provides the point in space in a vector form, and simplifies the processing of the first sliced data into a vector operation in a vector space, thereby reducing the complexity of performing text similarity calculation on classified data in the first sliced data.
In this optional embodiment, the text similarity between each classification data in the first fragment data and each key field in the hot spot field data set is calculated by using a VSM algorithm, then the calling frequency corresponding to each key field in the hot spot field data set is used as a feature weight, and the feature weight and the text similarity are weighted and summed to obtain the semantic relevance of each classification data in the first fragment data.
Illustratively, there are 4 key fields A, B, C, D in the hotspot field data set, and the text similarities between the classification data E in the first fragment data and the 4 key fields A, B, C, D, calculated by the VSM algorithm, are 0.6, 0.2, 0.1, and 0.3, respectively. If the calling frequencies corresponding to the 4 key fields A, B, C, D are 0.8, 0.5, 0.6, and 0.2, the semantic relevance of the classification data E in the first fragment data, calculated by weighted summation, is 0.6 × 0.8 + 0.2 × 0.5 + 0.1 × 0.6 + 0.3 × 0.2 = 0.7.
In this optional embodiment, the average value of the semantic relevance of all classification data in the first fragment data is calculated as the fragmentation threshold, and the first fragment data is fragmented based on the fragmentation threshold to obtain the second fragment data. The specific process is as follows: each fragment table is split into a data master table and a data slave table; classification data whose semantic relevance is greater than the fragmentation threshold is written into the data slave table, and classification data whose semantic relevance is not greater than the fragmentation threshold is written into the data master table. When a fragment table is split, the fragment table itself may serve as the data master table, and a fragment table that contains no data is selected as its data slave table. Finally, all the first fragment data written into the data master tables and data slave tables is taken as the second fragment data, which completes the fragmentation of the data to be fragmented.
According to the technical scheme, all key fields can be screened by counting the number of the key fields in historical target data, then the text similarity between the data to be segmented and the hot field data set is calculated to obtain the semantic relevance, the classified data is segmented based on the semantic relevance, the hot data can be flexibly screened, and the hot data in the data to be segmented is stored in a table, so that the data calling efficiency of the database is improved.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. The electronic device 1 comprises a memory 12 and a processor 13. The memory 12 is used for storing computer readable instructions, and the processor 13 is used for executing the computer readable instructions stored in the memory to implement the artificial intelligence based data fragmentation method described in any of the above embodiments.
In an alternative embodiment, the electronic device 1 further comprises a bus and a computer program, such as an artificial intelligence based data fragmentation program, that is stored in the memory 12 and executable on the processor 13.
Fig. 3 shows only the electronic device 1 with the memory 12 and the processor 13, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
In conjunction with fig. 1, the memory 12 in the electronic device 1 stores a plurality of computer-readable instructions to implement an artificial intelligence based data slicing method, and the processor 13 can execute the plurality of instructions to implement:
acquiring configuration information of data to be fragmented, wherein the configuration information comprises the data volume of the data to be fragmented, a table building statement and a plurality of databases;
classifying the data to be fragmented according to service types to obtain classified data, and storing the classified data to a fragmentation table based on the configuration information to serve as first fragmentation data;
receiving a data reading request to read target data from the fragment table, wherein the data reading request carries a key field of the target data;
counting the number of each key field in the historical target data to obtain calling frequency, and screening all the key fields based on the calling frequency to obtain a hot field data set;
and calculating the text similarity of the first fragment data and each key field in the hotspot field data set according to a text similarity algorithm to obtain semantic relevance, and fragmenting the first fragment data based on the semantic relevance to obtain second fragment data.
Specifically, the processor 13 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the instruction, which is not described herein again.
It will be understood by those skilled in the art that the schematic diagram is only an example of the electronic device 1, and does not constitute a limitation to the electronic device 1, the electronic device 1 may have a bus-type structure or a star-shaped structure, the electronic device 1 may further include more or less hardware or software than those shown in the figures, or different component arrangements, for example, the electronic device 1 may further include an input and output device, a network access device, etc.
It should be noted that the electronic device 1 is only an example; other existing or future electronic products that can be adapted to the present application should also be included in the scope of protection of the present application and are incorporated herein by reference.
The memory 12 includes at least one type of readable storage medium, which may be non-volatile or volatile. The readable storage medium includes flash memory, removable hard disks, multimedia cards, card type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. In some embodiments, the memory 12 may be an internal storage unit of the electronic device 1, e.g. a removable hard disk of the electronic device 1. In other embodiments, the memory 12 may be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the electronic device 1. The memory 12 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as code of the artificial intelligence based data fragmentation program, but also for temporarily storing data that has been output or is to be output.
In some embodiments, the processor 13 may be composed of integrated circuits, for example a single packaged integrated circuit, or a plurality of packaged integrated circuits with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 13 is the Control Unit of the electronic device 1: it connects the components of the electronic device 1 through various interfaces and lines, and performs the functions of the electronic device 1 and processes its data by running or executing programs or modules stored in the memory 12 (e.g., the artificial intelligence based data fragmentation program) and calling data stored in the memory 12.
The processor 13 executes an operating system of the electronic device 1 and various installed application programs. The processor 13 executes the application program to implement the steps in each of the artificial intelligence based data fragmentation method embodiments described above, such as the steps shown in fig. 1.
Illustratively, the computer program may be partitioned into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to accomplish the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing certain functions, which are used to describe the execution of the computer program in the electronic device 1. For example, the computer program may be divided into an acquisition unit 110, a classification unit 111, a reading unit 112, a screening unit 113, a calculation unit 114.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor to execute parts of the artificial intelligence based data fragmentation method according to the embodiments of the present application.
The integrated modules/units of the electronic device 1 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the processes in the methods of the embodiments described above may be implemented by a computer program, which may be stored in a computer-readable storage medium and executed by a processor, to implement the steps of the embodiments of the methods described above.
The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, computer memory, Read-Only Memory (ROM), Random-Access Memory (RAM) and other memory, etc.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The blockchain referred to in the present application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
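The linking of data blocks by cryptographic methods described above can be illustrated with a minimal sketch. The block layout, the function names `block_hash`, `build_chain` and `is_valid`, and the use of SHA-256 are assumptions made for illustration only; the present application does not specify them.

```python
import hashlib
import json

GENESIS_PREV_HASH = "0" * 64  # assumed placeholder for the first block's predecessor


def block_hash(index: int, transactions: list, prev_hash: str) -> str:
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "transactions": transactions, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(batches: list) -> list:
    """Create one block per batch of transactions, each linked to its predecessor."""
    chain = []
    prev_hash = GENESIS_PREV_HASH
    for index, transactions in enumerate(batches):
        digest = block_hash(index, transactions, prev_hash)
        chain.append(
            {
                "index": index,
                "transactions": transactions,
                "prev_hash": prev_hash,
                "hash": digest,
            }
        )
        prev_hash = digest
    return chain


def is_valid(chain: list) -> bool:
    """Check that every block's hash matches its contents and its link to the previous block."""
    prev_hash = GENESIS_PREV_HASH
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        expected = block_hash(block["index"], block["transactions"], block["prev_hash"])
        if block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True
```

Because each hash covers the previous block's hash, altering the transactions of any block invalidates that block and every block after it, which is the validity check the paragraph refers to.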
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one arrow is shown in FIG. 3, but this does not indicate only one bus or one type of bus. The bus is arranged to enable communication between the memory 12 and at least one processor 13 or the like.
An embodiment of the present application further provides a computer-readable storage medium (not shown), in which computer-readable instructions are stored, and the computer-readable instructions are executed by a processor in an electronic device to implement the artificial intelligence based data fragmentation method described in any of the above embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative: the division of the modules is only one logical functional division, and other divisions may be used in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the specification may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present application and not for limiting, and although the present application is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application.

Claims (10)

1. An artificial intelligence based data fragmentation method, the method comprising:
acquiring configuration information of data to be fragmented, wherein the configuration information comprises the data volume of the data to be fragmented, a table building statement and a plurality of databases;
classifying the data to be fragmented according to service types to obtain classified data, and storing the classified data to a fragmentation table based on the configuration information to serve as first fragmentation data;
receiving a data reading request to read target data from the fragmentation table, wherein the data reading request carries a key field of the target data;
counting the number of each key field in the historical target data to obtain calling frequency, and screening all the key fields based on the calling frequency to obtain a hotspot field data set;
and calculating the text similarity of the first fragment data and each key field in the hotspot field data set according to a text similarity algorithm to obtain semantic relevance, and fragmenting the first fragment data based on the semantic relevance to obtain second fragment data.
2. The artificial intelligence based data fragmentation method of claim 1, wherein the obtaining configuration information of the data to be fragmented, the configuration information including data volume of the data to be fragmented, table building statements and multiple databases, comprises:
acquiring configuration information input by a user through a web interface, wherein the data volume of the data to be fragmented is used for indicating the number of pieces of fragmented data into which the data to be fragmented is to be fragmented; the table building statement is used for generating a fragmentation table into which the data to be fragmented can be written; and the databases are used for storing the data to be fragmented.
3. The artificial intelligence based data fragmentation method of claim 1, wherein the classifying the data to be fragmented according to service type to obtain classified data, and storing the classified data into a fragmentation table as first fragmentation data based on the configuration information comprises:
setting different coding labels for the data to be fragmented of each service type according to a preset mode;
classifying the data to be fragmented based on the coding labels to obtain classified data, storing the classified data of the same category to the same fragmentation table based on the configuration information, storing the classified data of different categories to different fragmentation tables, and taking all the classified data stored in the fragmentation tables as first fragmentation data.
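The label-based classification of claim 3 can be sketched as follows. This is a minimal illustration only: the service types, the coding labels in `SERVICE_LABELS`, the table naming scheme and the record layout are hypothetical, since the claim states only that labels are set "according to a preset mode".

```python
from collections import defaultdict

# Hypothetical coding labels, one per service type (assumed for illustration).
SERVICE_LABELS = {"loan": "A01", "insurance": "B01", "payment": "C01"}


def classify_into_fragmentation_tables(records):
    """Group records so that each service type lands in its own fragmentation table."""
    tables = defaultdict(list)
    for record in records:
        label = SERVICE_LABELS[record["service_type"]]
        # Same label -> same table; different labels -> different tables.
        tables[f"shard_{label}"].append(record)
    return dict(tables)
```

The returned mapping from table name to records plays the role of the first fragmentation data in this sketch.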
4. The artificial intelligence based data fragmentation method of claim 1, wherein the receiving a data read request to read target data from the fragmentation table, the data read request carrying key fields of the target data comprises:
receiving a data reading request to obtain a key field of target data carried in the data reading request;
reading a routing address of the target data according to a pre-configured routing rule and the key field, wherein the routing address comprises a fragmentation table and a database of the target data;
and reading and feeding back the target data based on the routing address.
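The read path of claim 4 can be sketched as follows. The claim does not say what the pre-configured routing rule is; this sketch assumes a hash-modulo rule, and the names `DATABASES`, `TABLES_PER_DB`, `route` and `read_target`, as well as the in-memory `store`, are hypothetical.

```python
import hashlib

DATABASES = ["db_0", "db_1"]  # hypothetical database names
TABLES_PER_DB = 4             # hypothetical number of fragmentation tables per database


def route(key_field: str):
    """Map a key field to a routing address (database, fragmentation table)."""
    # A stable digest is used so the same key always routes to the same address.
    digest = hashlib.md5(key_field.encode("utf-8")).hexdigest()
    slot = int(digest, 16) % (len(DATABASES) * TABLES_PER_DB)
    database = DATABASES[slot // TABLES_PER_DB]
    table = f"shard_{slot % TABLES_PER_DB}"
    return database, table


def read_target(store: dict, key_field: str):
    """Resolve the routing address for the key field, then read the target data from it."""
    address = route(key_field)
    return store.get(address, {}).get(key_field)
```

Any deterministic rule (range-based, lookup-table based, and so on) could replace the hash-modulo rule without changing the shape of `read_target`.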
5. The artificial intelligence based data fragmentation method of claim 1, wherein the counting the number of each key field in the historical target data to obtain a calling frequency, and the screening all key fields based on the calling frequency to obtain a hotspot field data set comprises:
counting the number of each key field in the historical target data, and taking the ratio of the number of each key field to the total number of all key fields as the calling frequency;
and classifying the calling frequency, and screening all key fields according to a classification result to obtain a hotspot field data set.
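The calling frequency defined in claim 5, i.e. the ratio of each key field's count to the total count of all key fields, can be computed as in the sketch below. The function name `call_frequencies` and the representation of the historical key fields as a list of strings are assumptions.

```python
from collections import Counter


def call_frequencies(historical_key_fields):
    """Return, for each key field, its count divided by the total number of key fields seen."""
    counts = Counter(historical_key_fields)
    total = sum(counts.values())
    # An empty history yields an empty mapping (no division takes place).
    return {field: count / total for field, count in counts.items()}
```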
6. The artificial intelligence based data fragmentation method of claim 5, wherein the classifying the calling frequency and screening all key fields according to a classification result to obtain a hotspot field data set comprises:
calculating the sum of squared errors of all the calling frequencies to determine the effective classification number of all the calling frequencies;
classifying all the calling frequencies based on the effective classification number to obtain a plurality of calling frequency classes;
calculating the average value of the calling frequencies contained in each calling frequency category, selecting the calling frequency category corresponding to the maximum average value, and taking the key fields corresponding to all the calling frequencies in the calling frequency category as the hotspot field data set.
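The three steps of claim 6 can be sketched as follows. The claim does not name a clustering algorithm or say how the sum of squared errors determines the classification number; this sketch assumes one-dimensional k-means and a simple rule that picks the smallest class count whose error falls below a fraction (`ratio`) of the single-class error. All function names and both parameters `max_k` and `ratio` are hypothetical.

```python
def kmeans_1d(values, k, iterations=50):
    """Plain 1-D k-means; returns (labels, centers, sum of squared errors)."""
    ordered = sorted(values)
    # Deterministic initialization: spread the initial centers over the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: (v - centers[c]) ** 2) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum((v - centers[lab]) ** 2 for v, lab in zip(values, labels))
    return labels, centers, sse


def effective_class_count(values, max_k=5, ratio=0.1):
    """Assumed rule: smallest k whose SSE is at most `ratio` times the SSE for k = 1."""
    base = kmeans_1d(values, 1)[2]
    limit = min(max_k, len(set(values)))
    if base == 0:
        return 1
    for k in range(1, limit + 1):
        if kmeans_1d(values, k)[2] <= ratio * base:
            return k
    return limit


def hot_fields(frequencies, max_k=5):
    """Return the key fields in the calling frequency class with the largest mean."""
    if not frequencies:
        return set()
    fields = list(frequencies)
    values = [frequencies[f] for f in fields]
    k = effective_class_count(values, max_k)
    labels, _, _ = kmeans_1d(values, k)
    groups = {}
    for field, value, label in zip(fields, values, labels):
        groups.setdefault(label, []).append((field, value))
    best = max(groups.values(), key=lambda g: sum(v for _, v in g) / len(g))
    return {field for field, _ in best}
```

With frequencies such as 0.40 and 0.38 for two fields and under 0.10 for the rest, the sketch settles on two classes and returns the two high-frequency fields as the hotspot field data set.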
7. The artificial intelligence based data fragmentation method of claim 1, wherein the calculating the text similarity of the first fragment data and each key field in the hotspot field data set according to a text similarity algorithm to obtain a semantic relevance, and fragmenting the first fragment data based on the semantic relevance to obtain second fragment data comprises:
respectively calculating the text similarity of each classified data in the first fragment data and each key field in the hotspot field data set according to a text similarity algorithm;
taking the calling frequency corresponding to each key field in the hotspot field data set as a feature weight, and performing weighted summation on the feature weight and the text similarity to obtain the semantic relevancy of each classification data in the first fragment data;
and calculating the average value of the semantic relevance of all classified data in the first fragment data as a fragment threshold, and performing fragmentation on the first fragment data based on the fragment threshold to obtain second fragment data.
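The weighted scoring and thresholding of claim 7 can be sketched as follows. The claim does not name the text similarity algorithm or say how the threshold divides the data; this sketch substitutes the standard-library `difflib.SequenceMatcher` ratio for the similarity and assumes that classified data scoring at or above the mean goes to one fragment and the rest to another. The function names and the dictionary-based data layout are hypothetical.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the patent's own algorithm is not specified here."""
    return SequenceMatcher(None, a, b).ratio()


def semantic_relevance(text: str, hot_field_weights: dict) -> float:
    """Weighted sum of similarities, using each hotspot field's calling frequency as weight."""
    return sum(
        weight * text_similarity(text, field)
        for field, weight in hot_field_weights.items()
    )


def split_by_threshold(first_fragment: dict, hot_field_weights: dict):
    """Score every piece of classified data, then split on the mean score."""
    if not first_fragment:
        return 0.0, [], []
    scores = {
        name: semantic_relevance(text, hot_field_weights)
        for name, text in first_fragment.items()
    }
    threshold = sum(scores.values()) / len(scores)
    # Assumed split rule: at or above the mean is "hot", below the mean is "cold".
    hot = [name for name, score in scores.items() if score >= threshold]
    cold = [name for name, score in scores.items() if score < threshold]
    return threshold, hot, cold
```

Here `first_fragment` maps a name for each piece of classified data to its text, and `hot_field_weights` maps each hotspot key field to its calling frequency.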
8. An artificial intelligence based data fragmentation apparatus, the apparatus comprising:
an acquisition unit, configured to acquire configuration information of data to be fragmented, wherein the configuration information comprises the data volume of the data to be fragmented, a table building statement and a plurality of databases;
a classification unit, configured to classify the data to be fragmented according to service types to obtain classified data, and store the classified data into a fragmentation table based on the configuration information to serve as first fragmentation data;
a reading unit, configured to receive a data reading request to read target data from the fragmentation table, where the data reading request carries a key field of the target data;
a screening unit, configured to count the number of each key field in the historical target data to obtain calling frequency, and screen all the key fields based on the calling frequency to obtain a hotspot field data set;
and a calculation unit, configured to calculate the text similarity of the first fragment data and each key field in the hotspot field data set according to a text similarity algorithm to obtain semantic relevance, and fragment the first fragment data based on the semantic relevance to obtain second fragment data.
9. An electronic device, characterized in that the electronic device comprises:
a memory storing computer readable instructions; and
a processor executing computer readable instructions stored in the memory to implement the artificial intelligence based data fragmentation method of any one of claims 1 to 7.
10. A computer-readable storage medium having computer-readable instructions stored thereon which, when executed by a processor, implement the artificial intelligence based data fragmentation method of any one of claims 1 to 7.
CN202210449787.4A 2022-04-26 2022-04-26 Data fragmentation method, device, equipment and medium based on artificial intelligence Pending CN114860722A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210449787.4A CN114860722A (en) 2022-04-26 2022-04-26 Data fragmentation method, device, equipment and medium based on artificial intelligence


Publications (1)

Publication Number Publication Date
CN114860722A true CN114860722A (en) 2022-08-05

Family

ID=82633390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210449787.4A Pending CN114860722A (en) 2022-04-26 2022-04-26 Data fragmentation method, device, equipment and medium based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN114860722A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118132610A (en) * 2024-05-07 2024-06-04 江西科技学院 Electronic information collection method, system and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination