CN113568895A - Database data processing method and device and electronic equipment - Google Patents

Publication number
CN113568895A
Authority
CN
China
Prior art keywords
data
data table
database
service
business
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110138158.5A
Other languages
Chinese (zh)
Inventor
赵文
林岳
刘妍
陈守志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110138158.5A priority Critical patent/CN113568895A/en
Publication of CN113568895A publication Critical patent/CN113568895A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating

Abstract

The application provides a data processing method and apparatus for a database, an electronic device, and a computer-readable storage medium, and relates to database technology and artificial intelligence technology. The method comprises: querying a database for a data table that lacks a business domain, and acquiring feature information of the data table; acquiring a feature vector corresponding to the feature information, and mapping the feature vector to probabilities of a plurality of candidate business domains; determining the business domain whose probability satisfies a probability value condition as the business domain to which the data table belongs; and writing the business domain to which the data table belongs into the data table in the database. The application enables intelligent data management of the database and thereby improves the efficiency of data management.

Description

Database data processing method and device and electronic equipment
Technical Field
The present disclosure relates to database technologies, and in particular, to a method and an apparatus for processing data in a database, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) is a comprehensive discipline of computer science that studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making. Artificial intelligence is widely used in internet enterprises, for example, for data management.
However, in the data management of internet enterprises, as business develops and user activity increases, a large amount of valuable data accumulates in data tables. In the related art, the database is managed and maintained manually by database administrators, which consumes unnecessary labor and is inefficient.
The related art lacks an effective scheme for performing automatic data management on a database based on artificial intelligence technology.
Disclosure of Invention
The embodiments of the application provide a data processing method and apparatus for a database, an electronic device, and a computer-readable storage medium, which enable intelligent data management of the database and thereby improve data management efficiency.
The technical solutions of the embodiments of the application are implemented as follows:
An embodiment of the application provides a data processing method for a database, including:
querying a database for a data table that lacks a business domain, and acquiring feature information of the data table;
acquiring a feature vector corresponding to the feature information, and mapping the feature vector to probabilities of a plurality of candidate business domains;
determining the business domain whose probability satisfies a probability value condition as the business domain to which the data table belongs;
and writing the business domain to which the data table belongs into the data table in the database.
An embodiment of the present application provides a data processing apparatus for a database, including:
a query module, configured to query a database for a data table that lacks a business domain;
an obtaining module, configured to acquire feature information of the data table, acquire a feature vector corresponding to the feature information, and map the feature vector to probabilities of a plurality of candidate business domains;
a determining module, configured to determine the business domain whose probability satisfies a probability value condition as the business domain to which the data table belongs;
and a writing module, configured to write the business domain to which the data table belongs into the data table in the database.
In the foregoing solution, the obtaining module is further configured to:
determining an index value corresponding to each word in the feature information;
converting based on the index value to obtain a word vector corresponding to each word;
adding and averaging the word vectors corresponding to each word to obtain an average word vector;
and taking the average word vector as a feature vector corresponding to the feature information.
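The averaging step above can be sketched as follows. This is a minimal illustration in plain Python, assuming each word vector is a list of floats of equal length; the function name `average_word_vector` is hypothetical and does not come from the application.

```python
def average_word_vector(word_vectors):
    """Return the element-wise mean of equally sized word vectors."""
    if not word_vectors:
        raise ValueError("at least one word vector is required")
    dim = len(word_vectors[0])
    if any(len(vector) != dim for vector in word_vectors):
        raise ValueError("all word vectors must have the same dimension")
    count = len(word_vectors)
    # Add the vectors component by component, then divide by the word count.
    return [sum(vector[i] for vector in word_vectors) / count for i in range(dim)]


# Two illustrative word vectors; the average serves as the feature vector.
vectors = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
feature_vector = average_word_vector(vectors)
```

Averaging gives a fixed-length feature vector regardless of how many words the table name and description contain.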
In the foregoing solution, the obtaining module is further configured to:
performing word segmentation processing on the table name and the table description information of the data table, which are included in the feature information, to obtain a plurality of words;
determining the corresponding index value of each word in the index table;
wherein the index table comprises different words and corresponding index values.
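A possible sketch of word segmentation and index lookup is given below. It is illustrative only: splitting on non-alphanumeric characters stands in for a real word segmenter, and the names `segment`, `to_index_values`, and `UNKNOWN_INDEX` (an index reserved for words absent from the index table) are assumptions, not part of the application.

```python
import re

UNKNOWN_INDEX = 0  # hypothetical index reserved for words missing from the index table


def segment(table_name, description):
    """Split the table name and table description into lowercase words."""
    text = f"{table_name} {description}".lower()
    # Assumption: a simple regex split replaces a real word segmenter.
    return [word for word in re.split(r"[^a-z0-9]+", text) if word]


def to_index_values(words, index_table):
    """Look up the index value of each word in the index table."""
    return [index_table.get(word, UNKNOWN_INDEX) for word in words]


index_table = {"order": 1, "items": 2, "e": 3, "commerce": 4, "in": 5, "each": 6}
words = segment("order_items", "Items in each e-commerce order")
indices = to_index_values(words, index_table)
```

A production system would typically use a segmenter suited to the language of the table descriptions instead of the regex split shown here.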
In the foregoing solution, the obtaining module is further configured to:
converting the index value corresponding to each word to obtain a corresponding one-hot vector;
and multiplying the one-hot vector corresponding to each word by the weight matrix to obtain a word vector corresponding to each word.
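To illustrate why multiplying a one-hot vector by the weight matrix yields a word vector, the following sketch performs the multiplication explicitly. It assumes a weight matrix with one row per index value; the names `one_hot` and `embed` and the sample weights are hypothetical.

```python
def one_hot(index, vocab_size):
    """Build a vector of zeros with a single 1.0 at the given index."""
    if not 0 <= index < vocab_size:
        raise IndexError("index value outside the vocabulary")
    vector = [0.0] * vocab_size
    vector[index] = 1.0
    return vector


def embed(index, weights):
    """Multiply the one-hot row vector by the weight matrix (vocab_size x dim)."""
    hot = one_hot(index, len(weights))
    dim = len(weights[0])
    return [
        sum(hot[row] * weights[row][col] for row in range(len(weights)))
        for col in range(dim)
    ]


# Illustrative 3-word vocabulary with 2-dimensional word vectors.
weights = [[0.0, 0.0], [0.5, -1.0], [2.0, 3.0]]
word_vector = embed(2, weights)
```

Because only one entry of the one-hot vector is non-zero, the product equals the matrix row selected by the index value, which is why implementations usually replace the multiplication with a direct row lookup.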
In the foregoing solution, the obtaining module is further configured to:
encoding the feature vector to obtain an encoding result;
and applying an activation to the encoding result to obtain the probabilities of the plurality of candidate business domains.
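One way to picture the encoding and activation steps is a linear layer followed by a softmax. The text above does not name the activation, so the softmax, the function names `encode` and `softmax`, and the sample weights are all assumptions made for this sketch.

```python
import math


def encode(feature_vector, weights, biases):
    """Linear encoding: one score per candidate business domain."""
    return [
        sum(w * x for w, x in zip(row, feature_vector)) + bias
        for row, bias in zip(weights, biases)
    ]


def softmax(scores):
    """Assumed activation: turn scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


feature_vector = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # three candidate domains
biases = [0.0, 0.0, 0.0]
scores = encode(feature_vector, weights, biases)
probabilities = softmax(scores)
```

Each output position corresponds to one candidate business domain, so the list can be paired with domain names before a selection rule is applied.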
In the foregoing solution, the determining module is further configured to:
taking the business domain with the maximum probability as the business domain to which the data table belongs; or
sorting the business domains whose probabilities exceed a probability threshold in descending order of probability, and selecting a plurality of top-ranked business domains as the business domains to which the data table belongs.
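The two selection rules can be sketched as below. The function names, the sample probabilities, and the `limit` parameter (how many top-ranked domains to keep) are illustrative assumptions.

```python
def top_domain(probabilities):
    """Rule 1: pick the business domain with the maximum probability."""
    return max(probabilities, key=probabilities.get)


def domains_above_threshold(probabilities, threshold, limit):
    """Rule 2: keep domains above the threshold, sorted by descending probability."""
    kept = [(domain, p) for domain, p in probabilities.items() if p > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [domain for domain, _ in kept[:limit]]


probabilities = {
    "e-commerce": 0.55,
    "information retrieval": 0.30,
    "CDN service": 0.10,
    "data mining": 0.05,
}
best = top_domain(probabilities)
several = domains_above_threshold(probabilities, threshold=0.08, limit=2)
```

The first rule assigns exactly one business domain; the second allows a data table to carry several business domains when more than one probability is high.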
In the foregoing solution, the query module is further configured to:
screening a plurality of candidate data tables from the logs of the database, wherein the candidate data tables meet at least one of the following conditions: the use frequency is lower than a frequency threshold value, and the last use time is before a preset time;
and determining the data table lacking a business domain from the plurality of candidate data tables.
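A minimal sketch of the log-based screening follows. It assumes the database log has already been summarized into a mapping from table name to a use count and a last-use time; that summary format, the function name, and the sample values are hypothetical.

```python
from datetime import datetime


def screen_candidates(usage, frequency_threshold, cutoff):
    """Keep tables that meet at least one condition: rarely used or not used recently."""
    return sorted(
        name
        for name, (use_count, last_used) in usage.items()
        if use_count < frequency_threshold or last_used < cutoff
    )


# Hypothetical per-table statistics derived from the database log.
usage = {
    "legacy_orders": (2, datetime(2021, 1, 10)),
    "archived_users": (50, datetime(2020, 6, 1)),
    "live_sessions": (80, datetime(2021, 1, 20)),
}
candidates = screen_candidates(
    usage, frequency_threshold=5, cutoff=datetime(2021, 1, 1)
)
```

Screening first narrows the search to tables that are most likely to have been neglected, before checking which of them lack a business domain.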
In the foregoing solution, the query module is further configured to:
determining, among a plurality of nodes corresponding to a distributed database, the data tables that do not overlap between the nodes;
and traversing the non-overlapping data tables stored in the respective nodes to determine the data table lacking a business domain.
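One reading of the non-overlap step is that each data table should be traversed only once even if several nodes hold a copy. The sketch below follows that reading; the assignment strategy (first node in sorted order keeps the table) and all names are assumptions rather than the method prescribed above.

```python
def non_overlapping_assignment(node_tables):
    """Assign each table to the first node (in sorted order) that stores it."""
    seen = set()
    assignment = {}
    for node in sorted(node_tables):
        unique = []
        for table in node_tables[node]:
            if table not in seen:
                seen.add(table)
                unique.append(table)
        assignment[node] = unique
    return assignment


# Hypothetical layout: slave-1 replicates the tables of master-1.
node_tables = {
    "master-1": ["orders", "users"],
    "slave-1": ["orders", "users"],
    "master-2": ["cdn_log"],
}
assignment = non_overlapping_assignment(node_tables)
```

With such an assignment, each node only needs to traverse its own list, and no table is checked twice.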
In the above solution, the data processing apparatus of the database further includes a replacement module, configured to:
periodically traversing data tables in the database to determine data tables in which data changes occur, the type of data change comprising at least one of: adding data, deleting data, changing data;
acquiring feature information of the data table whose data has changed, and determining a new business domain based on that feature information;
and replacing the business domain to which the data table whose data has changed belongs with the new business domain.
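Detecting which data tables changed between two periodic traversals could be done by comparing content fingerprints, as sketched below. The fingerprinting approach, the function names, and the sample rows are assumptions; the text above does not specify how changes are detected.

```python
import hashlib


def fingerprint(rows):
    """Order-insensitive SHA-256 digest of a table's rows."""
    digest = hashlib.sha256()
    for text in sorted(repr(row) for row in rows):
        digest.update(text.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def changed_tables(previous, current):
    """Return tables whose rows were added, deleted, or modified since the last pass."""
    return sorted(
        name
        for name, rows in current.items()
        if previous.get(name) != fingerprint(rows)
    )


# Fingerprints stored by the previous traversal (hypothetical data).
previous = {
    "orders": fingerprint([(1, "book")]),
    "users": fingerprint([(1, "ann")]),
}
current = {
    "orders": [(1, "book"), (2, "pen")],
    "users": [(1, "ann")],
    "coupons": [(1, "10% off")],
}
changed = changed_tables(previous, current)
```

Only the tables reported as changed need their feature information re-read and their business domain re-predicted.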
In the foregoing solution, the determining module is further configured to:
determining, among a plurality of data tables lacking a business domain, a plurality of groups of data tables whose similarity exceeds a similarity threshold, wherein each group of data tables comprises at least two data tables;
selecting one data table from each group of data tables as a representative data table, acquiring feature information of each representative data table, and determining the business domain to which the representative data table belongs based on the feature information;
and taking the business domain to which the representative data table belongs as the business domain to which the other data tables in the same group belong.
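The grouping and propagation steps can be sketched as follows. Jaccard similarity over table-name words, greedy grouping against the first member of each group, and the stand-in predictor are all assumptions chosen for illustration; the text above does not fix a similarity measure.

```python
def jaccard(words_a, words_b):
    """Similarity of two word collections: shared words over all words."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def group_similar(tables, threshold):
    """Greedily group tables; keep only groups with at least two members."""
    groups = []
    for name in sorted(tables):
        for group in groups:
            if jaccard(tables[name], tables[group[0]]) > threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return [group for group in groups if len(group) >= 2]


def propagate(groups, predict):
    """Predict once per group (first member) and reuse the result for the rest."""
    result = {}
    for group in groups:
        domain = predict(group[0])
        for name in group:
            result[name] = domain
    return result


tables = {
    "order_items_2020": ["order", "items", "2020"],
    "order_items_2021": ["order", "items", "2021"],
    "dns_query_log": ["dns", "query", "log"],
}
groups = group_similar(tables, threshold=0.4)
domains = propagate(groups, predict=lambda name: "e-commerce")  # stand-in predictor
```

Predicting once per group saves model calls when many tables are near-duplicates, such as yearly or sharded copies of the same table.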
In the above solution, the data processing apparatus of the database further includes a retrieval module, configured to:
in response to a retrieval request, screening the data tables in the database based on the business domains to which they belong to obtain data tables in a first range;
screening the data tables in the first range based on their table description information to obtain data tables in a second range;
and screening the data tables in the second range based on their table names, and returning the resulting data tables in a target range as the retrieval result.
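The three-stage narrowing can be sketched as successive filters. The matching rules used here (exact match on business domain, case-insensitive substring match on description and table name) and the record layout are assumptions for illustration.

```python
def search_tables(tables, domain, description_keyword, name_keyword):
    """Narrow by business domain, then by description, then by table name."""
    first_range = [t for t in tables if t["business_domain"] == domain]
    second_range = [
        t for t in first_range
        if description_keyword.lower() in t["description"].lower()
    ]
    target_range = [
        t for t in second_range
        if name_keyword.lower() in t["table_name"].lower()
    ]
    return [t["table_name"] for t in target_range]


tables = [
    {"table_name": "order_items", "description": "items in each order",
     "business_domain": "e-commerce"},
    {"table_name": "order_refunds", "description": "refund requests for each order",
     "business_domain": "e-commerce"},
    {"table_name": "shop_banner", "description": "banner images of each shop",
     "business_domain": "e-commerce"},
    {"table_name": "dns_query_log", "description": "order of resolved queries",
     "business_domain": "DNS service"},
]
result = search_tables(tables, "e-commerce", "order", "refund")
```

Filtering on the business domain first removes tables such as `dns_query_log` whose descriptions match the keyword by coincidence, which is why a completed business domain improves retrieval.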
In the foregoing solution, the business domain is determined by a machine learning model, and the data processing apparatus of the database further includes a training module, configured to:
perform, by the machine learning model: acquiring feature information of a data table sample, and determining a predicted business domain to which the data table sample belongs based on the feature information of the data table sample;
determine an error based on the predicted business domain of the data table sample and the labeled business domain of the data table sample;
and back-propagate the error in the machine learning model to update parameters of the machine learning model.
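The training loop can be illustrated with a single linear layer trained by gradient descent on a cross-entropy error. This is a simplified sketch: the one-layer model, the softmax output, the learning rate, and the sample data are assumptions, and a real model would also update its word-vector weights.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict(weights, x):
    """Forward pass: probabilities of each candidate business domain."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def train_step(weights, x, label, lr):
    """One update: compute the error, propagate it back, adjust the weights."""
    probs = predict(weights, x)
    loss = -math.log(probs[label])  # cross-entropy against the labeled domain
    for k, row in enumerate(weights):
        error = probs[k] - (1.0 if k == label else 0.0)
        for i in range(len(row)):
            row[i] -= lr * error * x[i]  # gradient of the loss w.r.t. the weight
    return loss


weights = [[0.0, 0.0], [0.0, 0.0]]  # two candidate domains, 2-dim features
x, label = [1.0, 0.5], 1            # one labeled data table sample
losses = [train_step(weights, x, label, lr=0.5) for _ in range(5)]
```

The recorded loss shrinks from step to step, which is the observable effect of propagating the error back into the parameters.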
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the data processing method of the database provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the application provides a computer-readable storage medium, which stores executable instructions and is used for realizing the data processing method of the database provided by the embodiment of the application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
By predicting the business domain of a data table from the feature information of the data table that lacks a business domain, and automatically writing the predicted business domain into the data table, the business domain of the data table is completed automatically. For services that need to operate and maintain a massive number of data tables, this significantly reduces the operation and maintenance cost of the database and improves data management efficiency.
Drawings
FIG. 1A is a block diagram of an architecture of a data processing system 10 according to an embodiment of the present application;
FIG. 1B is a block diagram illustrating an architecture of data processing system 10 according to an embodiment of the present application;
FIG. 1C is a block diagram illustrating an architecture of data processing system 10 according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a server 200-1 provided in an embodiment of the present application;
FIG. 3A is a schematic flowchart of a data processing method of a database according to an embodiment of the present application;
FIG. 3B is a schematic flowchart of a data processing method of a database according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the business domains of internet services provided by an embodiment of the present application;
FIG. 5A is a diagram of an e-commerce problem table provided by an embodiment of the present application;
FIG. 5B is a diagram of an e-commerce problem table provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a text classification model provided in an embodiment of the present application;
FIG. 7 is an interaction flow diagram of a data processing method of a database according to an embodiment of the present application;
FIG. 8 is a schematic diagram of training and prediction of a text classification model according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first/second/third" are used only to distinguish similar objects and do not denote a particular order. Where permissible, the objects so labeled may be interchanged in order or sequence, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms and expressions used in the embodiments are explained as follows.
1) Metadata: data that describes other data, mainly information about data attributes; it supports functions such as indicating storage location, recording history, resource searching, and file recording.
2) Data table: also known as a table, a collection of two-dimensional arrays used to represent and store relationships between data objects. A table is the basic structure constituting a table space and is composed of extents. It consists of vertical columns and horizontal rows. For example, in a table named authors that holds information about authors, each column contains one type of information for all authors, such as "last name", and each row contains all the information for one author: last name, first name, address, and so on.
3) Distributed database: a database whose data tables are stored on different physical nodes, which may be managed by different database management systems, run on different machines, be supported by different operating systems, and be connected by different communication networks. The physical nodes that store data tables fall into two categories, master nodes and slave nodes, and each master node has at least one slave node synchronized with it. All data tables in the distributed database are distributed across the master nodes; the slave nodes improve the reliability and availability of the database and prevent the whole database from crashing when a master node fails.
4) Structured Query Language (SQL): a special-purpose programming language for database query and programming, used to access data and to query, update, and manage relational database systems.
A database stores data with the table as the unit of organization. In the data management of internet enterprises, as business develops and user activity increases, a large amount of valuable data accumulates in data tables. The business domain, as part of the metadata, categorizes data along the business dimension for ease of use by database administrators. However, if some valuable data tables lack a specific business domain, it is difficult for them to receive management and maintenance from a database administrator, and their use value is greatly reduced.
In the related art, there are two main methods for completing the business domain in the metadata: 1. manual completion, that is, finding the database administrator of each data table and having that administrator fill in all missing information; 2. mandatory entry, that is, requiring a database administrator to fill in the business domain of a data table when creating a new task, otherwise the data table cannot be saved.
Method 1 has two disadvantages. First, because of personnel changes, it is not certain that every data table can be assigned to a specific database administrator, so the business domains of some data tables cannot be completed. Second, manual completion consumes a large amount of labor: in a specific scenario, a database administrator may need to complete hundreds of business domains, and such a workload directly affects the administrator's normal work and may even affect the normal operation of a business. The disadvantage of method 2 is that it can only guarantee that newly added data tables have complete business domains, while a large number of existing data tables still lack business domains, so the use value of those data tables remains seriously affected.
The embodiment of the application provides a data processing method of a database, which realizes intelligent data management on the database, thereby improving the efficiency of data management.
The data processing method of the database provided by the embodiment of the present application may be implemented by various electronic devices, for example, may be implemented by a terminal or a server alone, or may be implemented by a server and a terminal in cooperation. For example, the terminal itself performs a data processing method of a database described below, or the terminal transmits a data processing request to the server, and the server executes the data processing method of the database based on the received data processing request.
The electronic device provided by the embodiment of the application can be various types of terminal devices or servers, wherein the server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, big data and an artificial intelligence platform; the terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application.
Taking a server as an example, a server cluster may be deployed in the cloud to open an artificial intelligence cloud service (AI as a Service, AIaaS) to users. The AIaaS platform splits several types of common AI services and provides them in the cloud independently or as packages. This service mode is similar to an AI-themed marketplace: any user can access one or more of the artificial intelligence services provided by the AIaaS platform through an application programming interface.
For example, one of the artificial intelligence cloud services may be a data processing service, that is, a cloud server encapsulates the data processing program provided in the embodiments of the present application. In response to a processing trigger operation performed by a database administrator on a data table, the terminal of the database administrator sends a data processing request to the cloud server. The cloud server calls the encapsulated data processing program, determines the data table lacking a business domain in the database, determines the business domain corresponding to that data table, writes the business domain into the data table, and returns the business domain to the terminal of the database administrator.
In some embodiments, the data processing method of the database provided in the embodiments of the present application is described by taking as an example a server implementing the method alone. The server periodically traverses the data tables in the database to determine the data tables lacking a business domain and acquires the feature information of those data tables; determines the business domain corresponding to each such data table based on the feature information; and writes the determined business domain into the corresponding data table, replacing the previous business domain where one exists.
In some embodiments, an exemplary data processing system is described by taking as an example a server and a terminal cooperating to implement the data processing method of the database provided in the embodiments of the present application. Referring to FIG. 1A, FIG. 1A is a schematic diagram of an architecture of a data processing system 10 according to an embodiment of the present application. The terminal 400-1 is connected to the server 200-1 and the server 200-2 via the network 300, and the network 300 may be a wide area network, a local area network, or a combination of both. The server 200-1 may be a database server, and the business domain to which a data table belongs may be determined by the database server. For example, the server 200-1 receives a data processing request from the terminal 400-1 of the database administrator via the network 300, determines the business domain corresponding to the data table based on the data processing request, writes the determined business domain into the data table, and returns the result of the write to the terminal 400-1 via the network 300. The server 200-2 in FIG. 1A is a server, deployed for the database server, that is dedicated to determining the business domains missing from data tables. In some possible examples, the business domain to which the data table belongs may instead be determined by the server 200-2: as shown in FIG. 1A, the server 200-1 forwards the data processing request received from the terminal 400-1 to the server 200-2, the server 200-2 determines the business domain, and the server 200-1 then writes the determined business domain into the data table and returns the result of the write to the terminal 400-1.
In other embodiments, a data processing system may be as shown in FIG. 1B. The terminal 400-2 of an internet user is connected to the server 200-1 and the server 200-3 through the network 300, and the server 200-3 is a background server of an internet application. The server 200-3 receives a service request from the internet application client in the terminal 400-2, such as an answer request for a specific business domain, and sends a query request to the server 200-1 to query the database of the server 200-1 for the data tables of that business domain. If a data table lacking a business domain is found during the query, the missing business domain is filled in by the data processing method of the database provided in the embodiments of the present application; the data table corresponding to the specific business domain (such as an answer table) is then determined, and the data processing result (the data table) is returned to the terminal 400-2 of the internet user.
For example, the internet application client is an answer client. After a user logs in to the answer client, the terminal 400-2 receives an answer operation from the user and displays a plurality of business domains for the user to select. After receiving the user's selection of a business domain, the terminal sends a query request for the data tables of the selected business domain to the server 200-1. In response to the query request, the server 200-1 traverses the data tables in the database corresponding to the answer client and obtains the data tables of the business domain selected by the user. At the same time, any data tables found during the traversal to lack a business domain have their business domains filled in. The server 200-1 selects a plurality of questions from the obtained data tables and transmits them to the terminal 400-2, and the terminal displays an answer page after receiving the questions.
The embodiments of the present application can also be implemented using blockchain technology. Referring to FIG. 1C, the server and the terminal described above can both join the blockchain network 300 and become nodes in it. The type of the blockchain network 300 is flexible and may be, for example, any of a public chain, a private chain, or a consortium chain. Taking a public chain as an example, an electronic device of any business entity, such as a server, may access the blockchain network 300 without authorization and serve as an ordinary node of the blockchain network 300. For example, the server 200-3 is mapped to the ordinary node 300-0 in the blockchain network 300, the server 200-1 is mapped to the ordinary node 300-1, and the server 200-2 is mapped to the ordinary node 300-2.
Taking the case where the blockchain network 300 is a consortium chain as an example, the server 200-1, the server 200-2, and the server 200-3 may access the blockchain network 300 and become nodes after being authorized. When a node receives the determined business domain, it may decide whether to write the business domain into the data table by executing a smart contract. When a node determines that the business domain can be written into the data table, it signs the business domain with a digital signature (that is, an endorsement). When the business domain has sufficient endorsements, for example endorsements from more nodes than a number threshold, the business domain is written by all servers into the data tables they respectively maintain. The database in which the data tables reside may run in the blockchain network 300 or the server 200-1, or may be deployed independently of the blockchain network 300 and the server 200-1. Consensus confirmation of the business domain by a plurality of nodes thus further improves the reliability and accuracy of the business domain written into the data table.
If a node finds another data table that already has a business domain and whose description information is similar to that of the data table, the business domain of that other data table can be written directly into the data table as its business domain.
Taking an electronic device implementing the embodiment of the present application as an example of the server 200-1 shown in fig. 1A, a structure of the electronic device provided in the embodiment of the present application is described. Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 200-1 according to an embodiment of the present application, where the server 200-1 shown in fig. 2 includes: at least one processor 210, memory 240, at least one network interface 220. The various components in server 200-1 are coupled together by a bus system 230. It is understood that the bus system 230 is used to enable connected communication between these components. The bus system 230 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 230 in fig. 2.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The memory 240 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 240 optionally includes one or more storage devices physically located remote from processor 210.
The memory 240 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 240 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 240 is capable of storing data, examples of which include programs, modules, and data structures, or subsets or supersets thereof, to support various operations, as exemplified below.
An operating system 241, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a network communication module 242 for communicating to other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), etc.;
in some embodiments, the data processing device of the database provided in the embodiments of the present application may be implemented in software, and fig. 2 illustrates the data processing device 243 of the database stored in the memory 240, which may be software in the form of programs and plug-ins, and includes the following software modules: query module 2431, obtain module 2432, determine module 2433, write module 2434, replace module 2435, retrieve module 2436, and train module 2437, which are logical and thus can be arbitrarily combined or further split depending on the functionality implemented. The functions of the respective modules will be explained below.
The following description takes a server as the execution subject of the data processing method of the database; the method can be implemented by the server running the various computer programs described above. Of course, as will be understood from the following description, the data processing method of the database provided in the embodiments of the present application may also be implemented by a terminal and a server in cooperation.
Referring to fig. 3A, fig. 3A is a schematic flowchart of a data processing method of a database according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3A.
In step 101, a data table missing a service field is queried in a database, and feature information of the data table is obtained.
In some embodiments, the databases include relational databases, non-relational databases, and key-value databases. The relational database may be MySQL, Sybase, etc., the non-relational database may be BigTable, Cassandra, etc., and the key-value database may be Apache Cassandra, Dynamo, etc. The feature information of the data table may include the table name and the table description information of the data table.
Each data table has a corresponding table name, table description information and service field. The table name is the identifier by which a user accesses the data table, and is unique in the database. The service field identifies the service type that the contents of the data table are intended for. As shown in fig. 4, fig. 4 is a schematic view of the service fields of internet services provided in an embodiment of the present application, and internet services may be classified into the internet basic service industry and the internet application service industry. The service fields subdivided from the internet basic service industry can be access service, Domain Name System (DNS) service, Internet Data Center (IDC) service, Content Delivery Network (CDN) service and electronic commerce, and the service fields subdivided from the internet application service industry can be network entertainment, network content, network communication, information retrieval, data mining and emerging applications. The table description information may include the field ID, field name, data type, length, precision, whether null is allowed, default value, whether auto-increment is enabled, primary key, column description, etc.
The table description information is described below by taking the data tables in the database corresponding to an answer client as an example, where the data tables include a question table, a friend table, a tacit value table, and the like. As shown in fig. 5A, the table description information may be presented in the data table, and the table name of the data table in fig. 5A may be "e-commerce question table", where the first column is the field name, the second column is the data type, the third column is the length, the fourth column is the precision, and the fifth column indicates whether null is allowed (i.e., whether the relevant data in the column may be absent). In some embodiments, the business field corresponding to each data table in the database may be queried through an SQL query statement to determine the data tables missing the business field.
In other embodiments, a data table in the database may be accessed infrequently either because the data table itself has low use value or because its business field is missing. Conversely, some data tables lack a business field but can still be located quickly by table name and/or table description information, so their access frequency may remain high. Therefore, when querying the data tables missing the business field, that is, the data tables whose business field needs to be completed, efficiency can be improved by focusing on the less-accessed data tables, so as to discover data tables that may have higher use value and complete their missing business fields.
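The SQL-based lookup mentioned above can be sketched as follows. This is a minimal illustration only: the catalog table `table_catalog`, its columns, and the sample rows are hypothetical and are not defined in the embodiments; an in-memory SQLite database stands in for the real database.

```python
import sqlite3

# Hypothetical catalog: one row per data table, with a nullable business_domain column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_catalog ("
    "table_name TEXT PRIMARY KEY, description TEXT, business_domain TEXT)"
)
conn.executemany(
    "INSERT INTO table_catalog VALUES (?, ?, ?)",
    [
        ("e-commerce question table", "development history of e-commerce", None),
        ("friend table", "friends of a user", "network communication"),
        ("tacit value table", "tacit values between friends", ""),
    ],
)


def tables_missing_domain(connection):
    """Return (table_name, description) for tables whose domain is NULL or empty."""
    rows = connection.execute(
        "SELECT table_name, description FROM table_catalog "
        "WHERE business_domain IS NULL OR business_domain = '' "
        "ORDER BY table_name"
    )
    return rows.fetchall()


missing = tables_missing_domain(conn)
missing_names = [name for name, _ in missing]
```

The returned table names and descriptions are exactly the feature information that the later steps consume.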
In some possible examples, based on information such as the Input/Output (IO) recorded in the database logs, a plurality of candidate data tables whose usage frequency is lower than a frequency threshold and/or whose last usage time is earlier than a preset time are screened out from the database; the data tables missing the service domain are then determined from the plurality of candidate data tables. The usage frequency of a data table counts both accesses (i.e., lookups) by a database administrator or an internet user and modification operations (such as adding, deleting and changing) performed on the data table by a database administrator. The last usage time may be the last time a database administrator or an internet user accessed the data table, or the last time a database administrator performed a modification operation on the data table. The preset time may be an exact time point (e.g. 10/1/2020), or may be defined relative to the current time (e.g. 3 months before the current time), where the current time is the time at which the data tables missing the service domain are being determined.
For example, if the frequency threshold is set to 10 and screening based on the logs shows that the usage frequency of data table H is 3, the usage frequency of data table H is considered low, and data table H is taken as a candidate data table. For another example, if the preset time is 3 months before the current time and data table G was last accessed 4 months ago, which is earlier than the preset time, data table G is considered not to have been used recently and is taken as a candidate data table. After the candidate data tables are determined, the data tables missing the business field can be determined from the candidate data tables through SQL query statements.
Therefore, the data tables missing the service field are determined from a smaller set of data tables, so that the number of data tables that need to be traversed is greatly reduced, and the search efficiency is improved.
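The candidate screening described above can be sketched as follows, reusing the figures from the example (frequency threshold 10, preset time 3 months before the current time). The `TableUsage` record and the function name are hypothetical, and the usage statistics are assumed to have already been extracted from the logs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TableUsage:
    name: str
    use_count: int        # accesses plus modification operations found in the logs
    last_used: datetime   # time of the most recent access or modification


def screen_candidates(usages, freq_threshold, now, max_idle, require_both=False):
    """Select tables that are rarely used and/or have not been used recently."""
    cutoff = now - max_idle
    candidates = []
    for u in usages:
        low_freq = u.use_count < freq_threshold
        stale = u.last_used < cutoff
        selected = (low_freq and stale) if require_both else (low_freq or stale)
        if selected:
            candidates.append(u.name)
    return candidates


now = datetime(2021, 1, 1)
usages = [
    TableUsage("H", 3, now - timedelta(days=5)),     # low frequency
    TableUsage("G", 40, now - timedelta(days=120)),  # last used about 4 months ago
    TableUsage("K", 25, now - timedelta(days=2)),    # frequently and recently used
]
candidates = screen_candidates(usages, 10, now, timedelta(days=90))
strict = screen_candidates(usages, 10, now, timedelta(days=90), require_both=True)
```

The `require_both` flag corresponds to the "and/or" in the text: by default either condition is enough to make a table a candidate.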
In some embodiments, when the database is a distributed database, the mutually non-overlapping data tables stored in the nodes of the distributed database are determined, that is, the data tables in each master node are determined. A master node may be a server that stores data tables. The data tables stored in each master node are traversed to determine the data tables missing the service field in that master node, these data tables are then aggregated, and the feature information of each data table in the aggregated set is obtained. Therefore, when the database is a distributed database, the storage pressure on any single server is reduced, and the mutually non-overlapping data tables can be queried on different nodes at the same time to determine the data tables missing the service field, which saves query time and improves query efficiency.
In step 102, a feature vector corresponding to the feature information is acquired, and the feature vector is mapped to probabilities of a plurality of candidate service domains.
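As a rough sketch of querying several master nodes at the same time and aggregating the results, the following uses a thread pool. The `NODES` dictionary is a hypothetical in-memory stand-in for the master nodes; a real deployment would issue a query to each node instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in: node name -> {table name: business domain or None}.
NODES = {
    "node-1": {"e-commerce question table": None, "friend table": "network communication"},
    "node-2": {"tacit value table": None, "order table": "e-commerce"},
}


def find_missing(node_tables):
    """Return the names of the tables on one node that have no business domain."""
    return sorted(name for name, domain in node_tables.items() if not domain)


def collect_missing(nodes):
    """Query every node concurrently and aggregate the per-node results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        per_node = pool.map(find_missing, nodes.values())
        return {node: tables for node, tables in zip(nodes, per_node)}


missing_by_node = collect_missing(NODES)
```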
In some embodiments, obtaining the feature vector corresponding to the feature information may be implemented by steps 1021 to 1024 as shown in fig. 3B.
In step 1021, an index value corresponding to each word in the feature information is determined.
In some embodiments, as shown in fig. 6, fig. 6 is a schematic structural diagram of a text classification model provided in an embodiment of the present application. The text classification model comprises an input layer, an embedding layer, a fully connected layer and a Softmax layer. In the input layer of the text classification model, word segmentation processing can be performed on the table name and the table description information included in the feature information to obtain a plurality of words. For example, among the data tables corresponding to the answer client, the table name of one data table missing the service field is "e-commerce question table", and its table description information is "development history of e-commerce". In this case, the feature information in the input layer is "e-commerce … … history" as shown in fig. 6, and the words obtained after word segmentation processing of the feature information include "electronic", "business" and the like. Then, the index value corresponding to each word is looked up in an index table, where the index table contains different words and their corresponding index values. When the feature information is in Chinese, each Chinese character or word and its corresponding index value are stored in the index table; for example, the index value corresponding to "electronic" in the index table is 1100, and the index value corresponding to "business" is 0100.
In some possible examples, the feature information may also include english, and at this time, a corresponding english index table may be queried, where the english index table stores index values corresponding to the respective letters. For example, the letter a in the index table corresponds to 0 (corresponding to binary code 0 … … 00, a total of 27 bits), b corresponds to 1 (corresponding to binary code 0 … … 01, a total of 27 bits) … … z corresponds to 25, and the blank corresponds to 26. For any word, the binary code corresponding to each letter can be determined by querying the index table, and the binary codes corresponding to the letters are spliced to determine the index value corresponding to the word.
In step 1022, conversion is performed based on the index value to obtain a word vector corresponding to each word.
In some embodiments, in the embedding layer of the text classification model, the index value corresponding to each word is converted into a one-hot vector. The one-hot vector corresponding to each word is then multiplied by the weight matrix of the embedding layer to obtain the word vector corresponding to that word. In a one-hot vector, exactly one element is 1 and all remaining elements are 0. The embedding layer is used for extracting feature vectors, that is, for extracting the word vector corresponding to each word.
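The following minimal sketch illustrates why multiplying a one-hot vector by the embedding weight matrix amounts to selecting one row of that matrix. The 4-word vocabulary and the matrix values are illustrative assumptions, not values from the embodiments.

```python
def one_hot(index, vocab_size):
    """Build a vector with a single 1 at the given index."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec


def vec_times_matrix(vec, matrix):
    """Multiply a row vector (length V) by a V x D matrix."""
    cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix))) for c in range(cols)]


# Hypothetical embedding matrix: 4 words, 3 dimensions per word vector.
E = [
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0],
]
word_vector = vec_times_matrix(one_hot(1, len(E)), E)
```

Because the product equals row 1 of `E`, practical implementations usually skip the multiplication and index the matrix directly.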
In step 1023, the word vectors corresponding to each word are added and averaged to obtain an average word vector.
For example, if the word vector corresponding to "electronic" is [0, 0, 1] and the word vector corresponding to "business" is [1, 0, 1], the word vectors corresponding to these two words are added and averaged to obtain the average word vector [0.5, 0, 1].
In step 1024, the average word vector is used as the feature vector corresponding to the feature information.
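The averaging in steps 1023 and 1024 can be checked with a few lines of code; the function name is an illustrative choice, and the input values are the two word vectors from the example above.

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Word vectors for "electronic" and "business" from the example.
feature_vector = average_vectors([[0, 0, 1], [1, 0, 1]])
```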
In some embodiments, mapping the feature vector to probabilities of a plurality of candidate service domains may be implemented as follows: the feature vector is encoded to obtain an encoding result; and the encoding result is activated through a softmax function to obtain the probabilities corresponding to the plurality of candidate service domains. The encoding process includes down-sampling, convolution, and linear transformation; encoding the feature vector means performing down-sampling, convolution, and linear transformation on it in sequence. The linear transformation can be implemented by the fully connected layer in the text classification model.
In step 103, the service domain corresponding to the probability satisfying the probability value condition is determined as the service domain to which the data table belongs.
In some embodiments, the probability value condition may include: the probability is the maximum; or the probability exceeds a probability threshold. Correspondingly, the service domain with the highest probability can be taken as the service domain to which the data table belongs; or the one or more service domains whose probabilities exceed the probability threshold are taken as the service domains to which the data table belongs. For example, when a plurality of business fields (e.g., e-commerce, network entertainment and network content) whose probabilities exceed the probability threshold are taken as the business fields to which the data table belongs, they may be stored in the form "e-commerce-network entertainment-network content".
In some possible examples, when the probability value condition is that the probability is the maximum, after the service domain with the maximum probability is determined, it may be further determined whether that maximum probability is greater than a probability threshold. If so, the service domain is taken as the service domain to which the data table belongs and is written into the data table; if not, the determined service domain is considered inaccurate, and writing it into the data table is abandoned.
Therefore, the embodiment of the application can not only determine the business field to which a data table belongs efficiently, but can also, when a plurality of business fields are determined from the feature information of the data table, take all of them as the business fields to which the data table belongs. In this way, when a task such as retrieval is carried out based on the business field, the retrieval range is expanded to cover as many potentially relevant data tables as possible, and omissions in the retrieval result are reduced.
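The two probability value conditions can be sketched as follows. The function names, the probabilities, and the thresholds are illustrative assumptions; only the "-" separated storage form comes from the example above.

```python
def pick_domains(probabilities, threshold, mode="max"):
    """Select service domains from a {domain: probability} mapping.

    mode="max": keep the most probable domain, but only if it exceeds the threshold.
    mode="threshold": keep every domain whose probability exceeds the threshold.
    """
    if mode == "max":
        best = max(probabilities, key=probabilities.get)
        return [best] if probabilities[best] > threshold else []
    return [d for d, p in probabilities.items() if p > threshold]


def join_domains(domains):
    """Store several domains in the '-' separated form used in the example."""
    return "-".join(domains)


probs = {"e-commerce": 0.55, "network entertainment": 0.30, "network content": 0.15}
single = pick_domains(probs, threshold=0.5)
rejected = pick_domains(probs, threshold=0.6)
multi = join_domains(pick_domains(probs, threshold=0.2, mode="threshold"))
```

An empty result in "max" mode corresponds to abandoning the write because the prediction is considered inaccurate.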
In step 104, the business domain to which the data table belongs is written in the data table of the database.
For example, if the business domain to which the data table in fig. 5A belongs is determined to be "e-commerce," a column may be added to the data table to write the business domain into the data table as shown in fig. 5B.
In some embodiments, a human-computer interaction confirmation on the management side may be triggered before the business domain is written to the data table. The server sends a write confirmation request to the terminal, and the terminal displays the table name and the table description information of the data table together with the determined business domain, and displays a confirm button and a reject button. If the database administrator clicks the confirm button, the business domain is written into the data table; if the database administrator clicks the reject button, the business domain is not written into the data table. Therefore, the accuracy of the business domain can be further improved by introducing a manual confirmation link before the business domain is written into the data table.
In some possible examples, requiring the database administrator to confirm every data table missing the business domain before the business domain is written would increase the administrator's workload and reduce efficiency, so confirmation by the database administrator may be required only in the following two cases.
The first case concerns the first data table, or the first batch of data tables (e.g. 10 or 20), missing the business domain: after their corresponding business domains are determined, the business domains are written into the data tables only after confirmation by the database administrator. These data tables can then be used as samples for the machine learning model, which is trained to predict the business domains to which data tables missing the business domain belong.
The second case is to confirm only the data tables at specific locations in the database (for example, the location of a data table is represented by its number). The specific locations may be locations where data tables missing the service domain are densely distributed (for example, on average one in every 5 data tables at such a location is missing the service domain), or locations where data tables whose access amount is higher than the average access amount are densely distributed. Therefore, the data tables that are prone to missing service domains and the data tables that are used more often can be confirmed in a centralized manner, which improves confirmation efficiency. Moreover, the confirmed data tables can further be used as samples for incremental training of the machine learning model, which improves prediction accuracy.
In some embodiments, the business domain is determined by a machine learning model, which may be the text classification model described above, or a fully-connected network model such as FastText, TextCNN, BERT, RoBERTa, or ELECTRA, or an attention network model, a recurrent neural network model, or a convolutional neural network model.
As an example, the training process of the machine learning model is as follows: the feature information of a data table sample is acquired through the machine learning model, and the predicted business field to which the data table sample belongs is determined based on that feature information; an error is determined based on the predicted business field of the data table sample and the labeled business field of the data table sample; and the error is back-propagated through the machine learning model to update the parameters of the machine learning model. In the training process, the parameters (such as weights and biases) of the machine learning model can be updated layer by layer through a gradient descent method, which can be batch gradient descent, stochastic gradient descent, mini-batch gradient descent, momentum gradient descent, or the like.
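A minimal sketch of this training loop is given below under simplifying assumptions: the model is reduced to a single linear layer followed by softmax, the error is the cross-entropy between the predicted and labeled business fields, and parameters are updated by stochastic gradient descent. The toy samples, learning rate, and epoch count are illustrative only.

```python
import math
import random


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def logits_of(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]


def train(samples, num_classes, dim, lr=0.5, epochs=200, seed=0):
    """Stochastic gradient descent on a linear + softmax classifier."""
    rng = random.Random(seed)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, label in data:
            p = softmax(logits_of(W, b, x))
            for k in range(num_classes):
                # Gradient of the cross-entropy error with respect to logit k.
                grad = p[k] - (1.0 if k == label else 0.0)
                for j in range(dim):
                    W[k][j] -= lr * grad * x[j]
                b[k] -= lr * grad
    return W, b


def predict(W, b, x):
    scores = logits_of(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy feature vectors with labeled business fields (0 and 1 are class indices).
samples = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([0.9, 0.1], 0), ([0.1, 0.9], 1)]
W, b = train(samples, num_classes=2, dim=2)
predictions = [predict(W, b, x) for x, _ in samples]
```

In a multi-layer model the same gradient would be propagated backward through each layer in turn; the single layer here keeps the update rule visible.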
In some possible examples, the machine learning model is initially obtained by training on a general training set. In the manual confirmation link, data tables whose predicted business fields are correct and data tables whose predicted business fields are incorrect are obtained; the relevant data of the data tables with incorrect business fields are used as negative samples, and the relevant data of the data tables with correct business fields are used as positive samples, to form a special training set. The machine learning model is incrementally trained with the special training set, periodically or in real time, until the error rate in the subsequent manual confirmation link is lower than a threshold or the error frequency within a time window is lower than a preset frequency, so that the prediction accuracy of the machine learning model can be improved. Incremental training in real time means training after each manual confirmation. If there are insufficient samples (i.e., fewer than the minimum number of samples per round of training) during incremental training (real-time or periodic), sample augmentation may be performed on the samples collected for the database, and/or the special training set may be supplemented with samples from the general training set. As the samples collected in the manual confirmation process increase, the number of samples taken from the general training set can be gradually reduced.
In some embodiments, the data processing system 10 supports a write-back function for the business domain, i.e., after the server automatically writes the business domain into the data table, if a database administrator (e.g., operation and maintenance personnel) subsequently finds the business domain unreasonable, the written business domain can be withdrawn, so as to ensure the traceability of the data table.
In some embodiments, the data tables in the database are periodically traversed to determine the data tables in which a data change has occurred, the type of data change comprising at least one of: adding data, deleting data, changing data. The feature information of each data table in which a data change has occurred is acquired, and a new service field is determined based on that feature information; the business field to which the changed data table belongs is then replaced with the new business field. The periodic traversal may be performed once every month or once every week, which is not limited in the embodiments of the present application.
Therefore, when a user accesses the data table and modifies the data table to change data in the data table, the embodiment of the application can adjust the service field to which the data table belongs in time according to the changed characteristic information in the data table, and the accuracy of the service field is ensured.
In some embodiments, when the number of data tables in the database is very large, determining the service domain for every data table missing the service domain may take a lot of time. Therefore, groups of similar data tables missing the service domain can be identified in the database, so that the service domain does not need to be calculated repeatedly for similar data tables; the service domain of one data table in a group is simply used as the service domain of the other similar data tables. The implementation process is as follows: among the plurality of data tables missing the service domain, a plurality of groups of data tables whose similarity exceeds a similarity threshold are determined, where each group includes at least two data tables; one data table is randomly selected from each group as a representative data table, the feature information of each representative data table is acquired, and the service domain to which the representative data table belongs is determined based on the feature information; and the service domain to which the representative data table belongs is taken as the service domain to which the other data tables in the same group belong.
It can be seen that, for a plurality of similar data tables missing the service field, the service field of only one data table needs to be determined by the data processing method of the database provided in the embodiment of the present application, and that service field is then used as the service field of the other similar data tables, thereby greatly improving the efficiency of completing the service field.
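The grouping and propagation described above might look like the following sketch. The embodiments do not specify how similarity is measured; word-level Jaccard similarity between table names and a greedy grouping pass are assumptions made here purely for illustration, and `classify` is a hypothetical stand-in for the trained model.

```python
import random


def jaccard(a, b):
    """Word-level Jaccard similarity between two table names (an assumed measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def group_similar(names, threshold):
    """Greedily group names whose similarity to a group's first member exceeds the threshold."""
    groups = []
    for name in names:
        for group in groups:
            if jaccard(name, group[0]) > threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return [g for g in groups if len(g) >= 2]  # each group needs at least two tables


def propagate(groups, classify, rng=random):
    """Classify one random representative per group and reuse its domain for the group."""
    result = {}
    for group in groups:
        domain = classify(rng.choice(group))
        for name in group:
            result[name] = domain
    return result


names = ["e-commerce question table 2019", "e-commerce question table 2020", "friend table"]
groups = group_similar(names, threshold=0.5)
domains = propagate(groups, classify=lambda name: "e-commerce")
```

Tables that fall into no group, such as "friend table" here, would still go through the ordinary per-table prediction.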
In some embodiments, after the business domain to which each data table in the database belongs is determined, retrieval may be performed based on information such as the business domain of the data table; the retrieval process is described below. In response to a retrieval request carrying a keyword, the server screens the data tables based on the business domains to which the data tables in the database belong, to obtain the data tables in a first range; screens based on the table description information of the data tables in the first range, to obtain the data tables in a second range; and screens based on the table names of the data tables in the second range, returning the resulting data tables in the target range as the retrieval result.
In some possible examples, in the process of screening the data tables layer by layer, if the number of data tables in the range obtained at any layer is less than or equal to a number threshold, the screening is stopped, and the data tables in that range are returned as the retrieval result. For example, if the number threshold is 50 and the number of data tables in the first range is 45, the data tables in the first range are returned to the terminal as the retrieval result.
Therefore, after the business field corresponding to the data table is determined, the retrieval strategy can be optimized based on the business field, and the retrieval accuracy is improved.
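The layer-by-layer screening with early stopping can be sketched as below. How a keyword is matched against each attribute is not specified in the embodiments; case-insensitive substring matching and the dictionary layout of a table record are assumptions for illustration.

```python
def layered_search(tables, keyword, count_threshold):
    """Screen by business domain, then description, then table name.

    Screening stops as soon as the current range holds no more than
    count_threshold tables.
    """
    keyword = keyword.lower()
    current = tables
    for key in ("domain", "description", "name"):
        current = [t for t in current if keyword in t[key].lower()]
        if len(current) <= count_threshold:
            break
    return current


tables = [
    {"name": "e-commerce question table",
     "description": "development history of e-commerce", "domain": "e-commerce"},
    {"name": "order log",
     "description": "e-commerce order records", "domain": "e-commerce"},
    {"name": "shop ranking",
     "description": "daily ranking of shops", "domain": "e-commerce"},
    {"name": "friend table",
     "description": "friends of a user", "domain": "network communication"},
]
early = [t["name"] for t in layered_search(tables, "e-commerce", count_threshold=2)]
full = [t["name"] for t in layered_search(tables, "e-commerce", count_threshold=0)]
```

With a threshold of 2 the search stops after the description layer, while a threshold of 0 forces all three layers to run.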
It can be seen that, in the embodiment of the present application, the service field of a data table missing the service field is predicted from the feature information of the data table and is automatically filled into the data table, so that the service field of the data table is completed automatically. For services that require operating and maintaining a large number of data tables in a database, this significantly reduces the operation and maintenance cost of the database and improves data management efficiency.
In the following, it is assumed that an answer client runs in the terminal of an internet user. In the answer client, the user can take part in answering or setting questions in various service fields, and can select the service field in which to answer or set questions. For example, when the user chooses to answer questions in the business field of e-commerce, the terminal responds to the user's answer request and sends the answer request to a background server of the internet application to obtain questions in the e-commerce field. An exemplary application in an answering scenario of an internet application is described below in connection with the steps in the interaction flow diagram shown in fig. 7.
In step 201, the terminal sends an answer request to a background server in response to the user's answer request for e-commerce.
The service field of the question to be answered is carried in the answer request, and here, the e-commerce field is taken as an example.
In step 202, the background server sends a query request to the database server in response to the answer request.
The query request is used for querying the questions in the e-commerce field in the database of the database server.
In step 203, the database server traverses the database and, among the data tables other than those missing the business field, obtains the data tables whose business field is e-commerce.
In some embodiments, for each service domain, the database corresponding to the answer client contains multiple types of data tables, and there may be multiple tables of each type. For example, when the business field is e-commerce, the corresponding data tables may include a question table, a friend table, a tacit value table, and the like. The question table stores a plurality of questions in the field of e-commerce, the friend table stores information (such as user names) of the friends associated with a user, and the tacit value table indicates the tacit values between the user and the friends. Because some data tables in the database are missing the business field (the business field may be lost while a data table is being stored into the database), in this step the data tables whose business field is e-commerce are screened only from the data tables other than those missing the business field.
In step 204, the database server determines the corresponding service domain according to the table name and table description information of the data table of the missing service domain.
In some embodiments, step 204 may be implemented by the text classification model shown in fig. 6. The text classification model can also be replaced by a fully-connected network model such as FastText, TextCNN, BERT, RoBERTa or ELECTRA, or by an attention network model, a recurrent neural network model or a convolutional neural network model.
As shown in fig. 8, fig. 8 is a schematic diagram of training and prediction with a text classification model according to an embodiment of the present application. In the process of training the text classification model, the table name and the table description information of a data table are used as sample data, the business field of the data table is used as the label (namely, the labeled business field) to train the model, and the model parameters obtained by training are stored. The text classification model can be trained by using a stochastic gradient descent method, so that its parameters converge to optimal or locally optimal values. In the process of predicting the business field to which a data table belongs, the model parameters are loaded first, then the table name and the table description information are used as the input of the trained text classification model, and the text classification model outputs the business field to which the data table belongs.
The process of predicting the business domain to which the data table belongs will be specifically described below with reference to the structure of the text classification model.
The text classification model comprises an input layer, an embedding layer, a fully connected layer and a Softmax layer. In the input layer, the table name or the table description information is subjected to word segmentation processing, and the corresponding index values are determined. In the word segmentation process, sentences in the text are segmented into words or characters. In determining the index values, a corresponding index value needs to be found for each segmented word or character. For example, if the ith word in the input text is $w_i$, a unique integer $I_i = I(w_i)$ is obtained after indexing, where $I$ denotes the correspondence in the index table between words in the text and integers (i.e., index values).
In the embedding layer, a sentence vector corresponding to the text is obtained by using the index value of each word. Assume the weight matrix of the embedding layer is $E \in \mathbb{R}^{V \times D}$, where $V$ is the total number of all words and $D$ is the dimension of each word vector. To obtain the word vector $e_i$ corresponding to the ith word in the text, the index value $I_i$ of the ith word is first converted into a $V$-dimensional one-hot vector, in which the element at position $I_i$ is 1 and all remaining elements are 0. Then, the one-hot vector corresponding to the ith word is multiplied by the weight matrix $E$ of the embedding layer to obtain the word vector $e_i$ corresponding to the ith word.
After word vectors corresponding to each word are determined, the word vectors corresponding to each word are added to calculate the average to obtain sentence vectors corresponding to the text. Assuming that the text includes L words, the sentence vector s can be expressed as:
$$s = \frac{1}{L} \sum_{i=1}^{L} e_i$$
In the fully connected layer, the sentence vector is transformed by the following formula (1).
$$a = f(Ws + b) \qquad (1)$$
Wherein W is a weight parameter of the fully-connected layer, b is a bias parameter of the fully-connected layer, f is an activation function, a is an output of the fully-connected layer, a is an A-dimensional vector, and A is a total number of candidate service domains.
In the Softmax layer, the probability $p_i$ of each candidate service domain to which the data table may belong is output; the probability calculation is shown in equation (2).
$$p_i = \frac{\exp(a_i)}{\sum_{j=1}^{A} \exp(a_j)} \qquad (2)$$
Wherein, $a_i$ represents the ith element of $a$, and $i$ ranges from 1 to $A$.
Finally, the service field with the highest probability is selected and output as the service field to which the data table belongs. In this way, the service field of every data table in the database that is missing a service field can be determined.
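The fully connected layer, the Softmax layer, and the final selection can be sketched together as below. This is a sketch under stated assumptions: the source does not name the activation function f, so ReLU is used purely for illustration, and the weights, the sentence vector, and the candidate domain names are hypothetical.

```python
import math

def dense(W, s, b, f):
    """Compute a = f(W s + b) for an A x D weight matrix W."""
    return [f(sum(w * x for w, x in zip(row, s)) + bias)
            for row, bias in zip(W, b)]

def softmax(a):
    """Turn the A outputs into probabilities that sum to 1."""
    m = max(a)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - m) for x in a]
    total = sum(exps)
    return [e / total for e in exps]

def relu(x):
    """Assumed activation function; the source leaves f unspecified."""
    return max(0.0, x)

domains = ["e-commerce", "finance", "education"]  # hypothetical candidates
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]          # hypothetical weights
b = [0.0, 0.0, 0.0]
s = [2.0, 1.0]                                    # hypothetical sentence vector

a = dense(W, s, b, relu)
p = softmax(a)
predicted = domains[p.index(max(p))]  # domain with the highest probability
```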
In step 205, the database server obtains, from the data tables that were missing a business field, the data tables whose business field is e-commerce.
After the business field to which each such data table belongs has been determined and written into the corresponding data table, the data tables in the database are screened by business field to find those whose business field is e-commerce.
In step 206, the database server screens question tables out of the e-commerce data tables according to the table name and/or table description information.
Because each business field corresponds to several types of tables, the data tables whose type is question table need to be screened out from the data tables whose business field is e-commerce. A question table may be identified based on the table name (e.g., # problem table), or based on the table description information (e.g., ".
In step 207, the database server randomly selects a plurality of questions from the questions in each question table.
After the database server obtains the question tables of the e-commerce field across the whole database, the questions in those tables are extracted and pooled, and a plurality of (e.g., a specified number of) questions are randomly selected from the pool as the data processing result.
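Steps 205 to 207 can be illustrated with the following sketch. The dictionary layout of a table, the keyword "question", and the function names are assumptions made for the example; the source does not specify how tables or questions are stored.

```python
import random

def screen_question_tables(tables, domain, keyword="question"):
    """Keep tables of the given domain whose name or description mentions the keyword."""
    return [
        t for t in tables
        if t["domain"] == domain
        and (keyword in t["name"] or keyword in t["description"])
    ]

def pick_questions(question_tables, count, seed=None):
    """Pool the questions of all tables and draw up to `count` of them at random."""
    pool = [q for t in question_tables for q in t["questions"]]
    return random.Random(seed).sample(pool, min(count, len(pool)))

# Hypothetical tables; "domain" holds the business field written back earlier.
tables = [
    {"name": "shop_question_bank", "description": "quiz items",
     "domain": "e-commerce", "questions": ["Q1", "Q2", "Q3"]},
    {"name": "shop_orders", "description": "order records",
     "domain": "e-commerce", "questions": []},
    {"name": "bank_quiz", "description": "question list",
     "domain": "finance", "questions": ["Q4"]},
]

question_tables = screen_question_tables(tables, "e-commerce")
selected = pick_questions(question_tables, count=2, seed=0)
```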
In step 208, the database server transmits the plurality of questions of the e-commerce field to the terminal.
The database server transmits the data processing result, i.e., the plurality of questions (and corresponding answers) in the e-commerce field, to the terminal.
In step 209, the terminal displays the plurality of questions.
In step 210, the terminal obtains the user's answers to the plurality of questions and gives corresponding scores.
The terminal determines whether the answer selected or filled in by the user is correct according to the answer corresponding to each question, and gives a corresponding score according to the scoring standard.
Therefore, the embodiment of the present application determines the business field to which a data table belongs based on the table name and/or table description information of the data table, so that all question tables belonging to a specific business field (such as e-commerce) in the database can be identified and the questions in them returned to the user. This increases the number of usable questions in the database, allows more diverse questions to be provided to the user, and improves the user's enthusiasm for answering questions. Moreover, the business field is determined automatically by the machine learning model without human participation, which reduces labor consumption and ensures the completeness of the field information of the data tables.
The following continues with an exemplary structure in which the data processing device 243 of the database provided by the embodiment of the present application is implemented as software modules. In some embodiments, as shown in fig. 2, the software modules of the data processing device 243 of the database stored in the memory 240 may include: a query module 2431, an obtaining module 2432, a determining module 2433, and a writing module 2434.
The query module 2431 is configured to query a data table of a missing service field in the database; an obtaining module 2432, configured to obtain feature information of the data table, and configured to obtain a feature vector corresponding to the feature information, and map the feature vector into probabilities of multiple candidate service domains; a determining module 2433, configured to determine a service domain corresponding to the probability that satisfies the probability value condition as a service domain to which the data table belongs; the writing module 2434 is configured to write the business field to which the data table belongs in the data table of the database.
In some embodiments, the obtaining module 2432 is further configured to determine an index value corresponding to each word in the feature information; converting based on the index value to obtain a word vector corresponding to each word; adding and averaging the word vectors corresponding to each word to obtain an average word vector; and taking the average word vector as a feature vector corresponding to the feature information.
In some embodiments, the obtaining module 2432 is further configured to perform word segmentation on the table name and the table description information of the data table included in the feature information to obtain a plurality of words; determining the corresponding index value of each word in the index table; the index table comprises different words and corresponding index values.
In some embodiments, the obtaining module 2432 is further configured to perform conversion processing based on the index value corresponding to each word to obtain a corresponding one-hot vector; and multiplying the one-hot vector corresponding to each word by the weight matrix to obtain a word vector corresponding to each word.
In some embodiments, the obtaining module 2432 is further configured to perform encoding processing on the feature vector to obtain an encoding processing result; and activating the encoding processing result to obtain the probabilities corresponding to the plurality of candidate service fields.
In some embodiments, the determining module 2433 is further configured to use the service domain with the highest probability as the service domain to which the data table belongs; or sorting the service fields corresponding to the probabilities exceeding the probability threshold value according to the descending order of the probabilities, and selecting a plurality of service fields which are sorted in the front as the service fields to which the data table belongs.
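The two selection strategies can be sketched as one function. The function name, the parameter names, and the sample probabilities are illustrative assumptions; in particular, the number of top-ranked service fields to keep is not fixed by the source.

```python
def select_domains(probabilities, threshold=None, top_k=1):
    """Select service fields from a dict mapping field name to probability.

    With no threshold, return the single most probable field. Otherwise keep
    the fields whose probability exceeds the threshold, sort them in
    descending order of probability, and return the first top_k of them.
    """
    if threshold is None:
        return [max(probabilities, key=probabilities.get)]
    above = [(name, p) for name, p in probabilities.items() if p > threshold]
    above.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in above[:top_k]]

# Hypothetical Softmax output for one data table.
probs = {"finance": 0.3, "e-commerce": 0.6, "education": 0.1}
best = select_domains(probs)
multi = select_domains(probs, threshold=0.2, top_k=2)
```

The thresholded variant may return an empty list when no probability exceeds the threshold, which a caller would need to handle, for example by falling back to the most probable field.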
In some embodiments, the query module 2431 is further configured to screen out a plurality of candidate data tables from the log of the database, where the candidate data tables satisfy at least one of the following conditions: the use frequency is lower than a frequency threshold value, and the last use time is before a preset time; a data table of the missing service domain is determined from the plurality of candidate data tables.
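A minimal sketch of this log-based screening follows. It assumes the log has already been reduced to a per-table usage count and last-use time; the data layout, names, and thresholds are hypothetical.

```python
from datetime import datetime

def screen_candidates(usage_log, freq_threshold, cutoff):
    """Return tables that are rarely used or were last used before the cutoff.

    usage_log maps table name to a (use_count, last_used) pair.
    """
    return sorted(
        name for name, (use_count, last_used) in usage_log.items()
        if use_count < freq_threshold or last_used < cutoff
    )

def tables_missing_domain(candidates, domain_of):
    """Keep only the candidate tables that have no service field recorded."""
    return [name for name in candidates if not domain_of.get(name)]

# Hypothetical usage statistics extracted from the database log.
usage_log = {
    "orders": (500, datetime(2021, 1, 20)),
    "old_quiz": (3, datetime(2019, 6, 1)),
    "tmp_export": (120, datetime(2018, 3, 5)),
}
domain_of = {"orders": "e-commerce", "old_quiz": None, "tmp_export": "finance"}

candidates = screen_candidates(usage_log, freq_threshold=10,
                               cutoff=datetime(2020, 1, 1))
missing = tables_missing_domain(candidates, domain_of)
```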
In some embodiments, the query module 2431 is further configured to determine data tables that are not overlapped with each other in a plurality of nodes corresponding to the distributed database; and traversing the data tables which are respectively stored in the nodes and are not overlapped with each other to determine the data table of the missing service field.
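One way to read this step is that tables replicated on several nodes should be examined only once. The sketch below follows that reading, which is an assumption, and uses a hypothetical mapping from node name to the tables it stores.

```python
def non_overlapping_tables(nodes):
    """Merge per-node table listings, keeping each table name only once."""
    merged = {}
    for node_tables in nodes.values():
        for name, domain in node_tables.items():
            merged.setdefault(name, domain)
    return merged

# Hypothetical cluster: each node maps table name to its recorded service field.
nodes = {
    "node-1": {"orders": "e-commerce", "quiz_bank": None},
    "node-2": {"quiz_bank": None, "audit_log": None},  # quiz_bank is a replica
}

merged = non_overlapping_tables(nodes)
missing = sorted(name for name, domain in merged.items() if domain is None)
```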
In some embodiments, the data processing apparatus of the database further includes a replacement module 2435 configured to periodically traverse the data tables in the database to determine the data tables in which the data change occurs, the type of the data change including at least one of: adding data, deleting data, changing data; acquiring characteristic information of a data table with data change, and determining a new service field based on the characteristic information of the data table with data change; and replacing the business field to which the data table with the changed data belongs with a new business field.
In some embodiments, the determining module 2433 is further configured to determine multiple sets of data tables with similarity exceeding a similarity threshold in multiple data tables of the missing service domain, where each set of data tables includes at least two data tables; selecting one data table from each group of data tables as a representative data table, acquiring characteristic information of each representative data table, and determining the business field to which the representative data table belongs based on the characteristic information; and taking the service field to which the representative data table belongs as the service field to which other data tables in the same group of data tables belong.
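The grouping and propagation can be sketched as follows. The source does not say how similarity is measured, so Jaccard similarity over name tokens and a greedy grouping pass are assumptions, as are the function names and the stand-in classifier.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def group_similar(tables, threshold):
    """Greedily place each table in the first group whose representative is similar enough.

    tables maps table name to its token list; the first member of a group
    acts as its representative.
    """
    groups = []
    for name, tokens in tables.items():
        for group in groups:
            if jaccard(tokens, tables[group[0]]) > threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

def propagate(groups, classify):
    """Classify only each representative and copy its field to the whole group."""
    result = {}
    for group in groups:
        domain = classify(group[0])
        for name in group:
            result[name] = domain
    return result

tables = {
    "order_2020": ["order", "2020"],
    "order_2021": ["order", "2021"],
    "exam_bank": ["exam", "bank"],
}
calls = []

def classify(name):
    """Stand-in for the text classification model (hypothetical rule)."""
    calls.append(name)
    return "e-commerce" if "order" in name else "education"

groups = group_similar(tables, threshold=0.3)
domains = propagate(groups, classify)
```

Here the classifier runs twice for three tables; a table with no similar peer forms a group of one and is classified directly.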
In some embodiments, the data processing apparatus of the database further includes a retrieval module 2436, configured to, in response to the retrieval request, perform screening based on a business domain to which the data tables in the database belong, to obtain data tables within a first range; screening based on the table description information of the data table in the first range to obtain the data table in the second range; and screening based on the table names of the data tables in the second range, and returning the obtained data tables in the target range as a search result.
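The three-stage narrowing can be illustrated with simple substring matching. The matching rule, the table layout, and the parameter names are assumptions; the source only fixes the order of the three screening stages.

```python
def retrieve(tables, domain, description_keyword, name_keyword):
    """Narrow the search by business domain, then description, then table name."""
    first_range = [t for t in tables if t["domain"] == domain]
    second_range = [t for t in first_range
                    if description_keyword in t["description"]]
    return [t["name"] for t in second_range if name_keyword in t["name"]]

# Hypothetical catalogue of data tables with their written-back domains.
tables = [
    {"name": "shop_quiz_2021", "description": "quiz questions", "domain": "e-commerce"},
    {"name": "shop_orders", "description": "order records", "domain": "e-commerce"},
    {"name": "shop_quiz_draft", "description": "draft notes", "domain": "e-commerce"},
    {"name": "bank_quiz_2021", "description": "quiz questions", "domain": "finance"},
]

result = retrieve(tables, "e-commerce", "quiz", "2021")
```

Filtering on the business domain first means the later, more expensive text comparisons only run over the tables of one domain.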
In some embodiments, the business domain is determined by a machine learning model, and the data processing apparatus of the database further includes a training module 2437 for performing the following processes by the machine learning model: acquiring characteristic information of a data table sample, and determining a prediction service field to which the data table sample belongs based on the characteristic information of the data table sample; determining an error based on the prediction business field of the data table sample and the marking business field of the data table sample; the error is propagated back through the machine learning model to update parameters of the machine learning model.
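A single training update can be sketched as below. The source does not name the error function, so cross-entropy is assumed, the activation f is taken as the identity, and only the fully connected layer is updated (the embedding layer is left out for brevity). The learning rate and sample values are hypothetical.

```python
import math

def softmax(a):
    """Turn the outputs into probabilities that sum to 1."""
    m = max(a)
    exps = [math.exp(x - m) for x in a]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(W, b, s, label, lr=0.5):
    """Run one gradient-descent update on W and b and return the loss before it.

    s is the sentence vector of a data table sample and label is the index of
    its labeled service field.
    """
    a = [sum(w * x for w, x in zip(row, s)) + bias for row, bias in zip(W, b)]
    p = softmax(a)
    loss = -math.log(p[label])  # cross-entropy error (assumed)
    for k in range(len(W)):
        grad = p[k] - (1.0 if k == label else 0.0)  # d(loss)/d(a_k)
        for d in range(len(s)):
            W[k][d] -= lr * grad * s[d]
        b[k] -= lr * grad
    return loss

W = [[0.0, 0.0], [0.0, 0.0]]  # two candidate fields, 2-dimensional input
b = [0.0, 0.0]
losses = [train_step(W, b, [1.0, 2.0], label=0) for _ in range(5)]
```

Repeating the update on the same sample drives the recorded loss down, which is the behavior the parameter update is meant to produce.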
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method of the database described above in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which, when executed by a processor, will cause the processor to perform a data processing method of a database provided by embodiments of the present application, for example, the data processing method of the database shown in fig. 3A.
In some embodiments, the storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, e.g., in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
To sum up, in the embodiment of the present application, a data table of a missing service field in a database is first determined; and then determining the business fields to which the data tables belong through a text classification model, and finally writing the business fields into the data tables, so that the business fields of the data tables are automatically supplemented, the labor consumption is reduced, and the data management efficiency is improved. For the data tables of a plurality of similar missing service fields, the service field to which one data table belongs can be determined by the data processing method of the database provided by the embodiment of the application, and the service field is used as the service field to which other similar data tables belong, so that the completion efficiency of the service field is greatly improved. When the data table is changed, the business field to which the data table belongs can be adjusted in time according to the changed characteristic information, and the accuracy of the business field is ensured. In addition, after the business field corresponding to the data table is determined, the retrieval strategy can be optimized based on the business field, and the retrieval accuracy is improved.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A data processing method of a database is characterized by comprising the following steps:
inquiring a data table of a missing service field in a database, and acquiring characteristic information of the data table;
acquiring a feature vector corresponding to the feature information, and mapping the feature vector into probabilities corresponding to a plurality of candidate service fields;
determining the service field corresponding to the probability meeting the probability value condition as the service field to which the data table belongs;
and writing the business field to which the data table belongs in the data table of the database.
2. The method of claim 1, wherein the obtaining a feature vector corresponding to the feature information comprises:
determining an index value corresponding to each word in the feature information;
converting based on the index value to obtain a word vector corresponding to each word;
adding and averaging the word vectors corresponding to each word to obtain an average word vector;
and taking the average word vector as a feature vector corresponding to the feature information.
3. The method of claim 2, wherein the determining an index value corresponding to each word in the feature information comprises:
performing word segmentation processing on the table name and the table description information of the data table, which are included in the feature information, to obtain a plurality of words;
determining the corresponding index value of each word in the index table;
wherein the index table comprises different words and corresponding index values.
4. The method of claim 2, wherein the converting based on the index value to obtain a word vector corresponding to each word comprises:
converting the index value corresponding to each word to obtain a corresponding one-hot vector;
and multiplying the one-hot vector corresponding to each word by the weight matrix to obtain a word vector corresponding to each word.
5. The method of claim 1, wherein the mapping the feature vector to probabilities corresponding to a plurality of candidate traffic domains comprises:
coding the characteristic vector to obtain a coding processing result;
and activating the coding processing result to obtain the probabilities corresponding to the plurality of candidate service fields.
6. The method according to claim 1, wherein the determining the service domain corresponding to the probability satisfying the probability value condition as the service domain to which the data table belongs includes:
taking the service field with the maximum probability as the service field to which the data table belongs; or
sorting the service fields corresponding to the probabilities exceeding the probability threshold value in descending order of the probabilities, and selecting a plurality of top-ranked service fields as the service fields to which the data table belongs.
7. The method of claim 1, wherein querying the database for data tables missing a business domain comprises:
screening a plurality of candidate data tables from the logs of the database, wherein the candidate data tables meet at least one of the following conditions: the use frequency is lower than a frequency threshold value, and the last use time is before a preset time;
determining a data table of the missing service domain from the plurality of candidate data tables.
8. The method of claim 1, wherein when the database is a distributed database, the querying a data table of a missing business domain in the database comprises:
determining data tables which are not overlapped with each other in a plurality of nodes corresponding to the distributed database;
and traversing the data tables which are respectively stored in the nodes and are not overlapped with each other to determine the data table of the missing service field.
9. The method of claim 1, further comprising:
periodically traversing data tables in the database to determine data tables in which data changes occur, the type of data change comprising at least one of: adding data, deleting data, changing data;
acquiring the characteristic information of the data table with the changed data, and determining a new service field based on the characteristic information of the data table with the changed data;
and replacing the business field to which the data table with the changed data belongs with the new business field.
10. The method of claim 1, wherein after querying the data table of the missing service domain in the database, the method further comprises:
determining, among a plurality of data tables of the missing service domain, a plurality of groups of data tables whose similarity exceeds a similarity threshold, wherein each group of data tables comprises at least two data tables;
selecting one data table from each group of data tables as a representative data table, acquiring characteristic information of each representative data table, and determining the business field to which the representative data table belongs based on the characteristic information;
and taking the business field to which the representative data table belongs as the business field to which other data tables in the same group of data tables belong.
11. The method according to claim 1, wherein after writing the business domain to which the data table belongs in the data table of the database, the method further comprises:
responding to a retrieval request, and screening based on the business field to which the data table belongs in the database to obtain the data table in a first range;
screening based on the table description information of the data table in the first range to obtain the data table in a second range;
and screening based on the table names of the data tables in the second range, and returning the obtained data tables in the target range as a search result.
12. The method of claim 1, wherein the business domain is determined by a machine learning model, and wherein before querying the data table of the missing business domain in the database, the method further comprises:
performing, by the machine learning model: acquiring characteristic information of a data table sample, and determining a prediction service field to which the data table sample belongs based on the characteristic information of the data table sample;
determining an error based on the prediction business field of the data table sample and the labeling business field of the data table sample;
back-propagating the error in the machine learning model to update parameters of the machine learning model.
13. A data processing apparatus of a database, comprising:
the query module is used for querying a data table of the missing service field in the database;
the acquisition module is used for acquiring the characteristic information of the data table, acquiring a characteristic vector corresponding to the characteristic information, and mapping the characteristic vector into the probability of a plurality of candidate service fields;
the determining module is used for determining the business field corresponding to the probability meeting the probability value-taking condition as the business field to which the data table belongs;
and the writing module is used for writing the business field to which the data table belongs in the data table of the database.
14. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing a data processing method of the database of any one of claims 1 to 12 when executing executable instructions stored in the memory.
15. A computer-readable storage medium having stored thereon executable instructions for causing a processor to perform a method of data processing of a database as claimed in any one of claims 1 to 12.
CN202110138158.5A 2021-02-01 2021-02-01 Database data processing method and device and electronic equipment Pending CN113568895A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110138158.5A CN113568895A (en) 2021-02-01 2021-02-01 Database data processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110138158.5A CN113568895A (en) 2021-02-01 2021-02-01 Database data processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113568895A true CN113568895A (en) 2021-10-29

Family

ID=78161093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110138158.5A Pending CN113568895A (en) 2021-02-01 2021-02-01 Database data processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113568895A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113918577A (en) * 2021-12-15 2022-01-11 北京新唐思创教育科技有限公司 Data table identification method and device, electronic equipment and storage medium
CN114637736A (en) * 2022-03-09 2022-06-17 北京金堤科技有限公司 Database splitting method and device
CN115408396A (en) * 2022-09-02 2022-11-29 金蝶征信有限公司 Business data storage method and device, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113918577A (en) * 2021-12-15 2022-01-11 北京新唐思创教育科技有限公司 Data table identification method and device, electronic equipment and storage medium
CN113918577B (en) * 2021-12-15 2022-03-11 北京新唐思创教育科技有限公司 Data table identification method and device, electronic equipment and storage medium
CN114637736A (en) * 2022-03-09 2022-06-17 北京金堤科技有限公司 Database splitting method and device
CN115408396A (en) * 2022-09-02 2022-11-29 金蝶征信有限公司 Business data storage method and device, computer equipment and storage medium
CN115408396B (en) * 2022-09-02 2024-04-05 金蝶征信有限公司 Method, device, computer equipment and storage medium for storing business data

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40056463

Country of ref document: HK

SE01 Entry into force of request for substantive examination