CN114138761A - Data query method, device, equipment and storage medium based on python - Google Patents

Info

Publication number
CN114138761A
Authority
CN
China
Prior art keywords
data
database
python
query
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111431187.7A
Other languages
Chinese (zh)
Inventor
韩延福
吴燕平
卓陈朋
张彩平
李晓雄
汤慧
翁泽钏
邱龙根
黄梦如
黄陈海
黄静看
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN202111431187.7A
Publication of CN114138761A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses a python-based data query method, comprising: performing database configuration on an original data set and a preset database by using a python mapping program to obtain an in-memory database; performing data cleaning and data classification on the original data set to obtain a database table; constructing a query main process based on a data query instruction, and setting a timed update sub-process based on a preset python process tool; updating the database table by using the timed update sub-process; and querying the updated database table based on the query main process to obtain a data query result. The invention also relates to blockchain technology: the data query result can be stored in a node of a blockchain. The invention further provides a python-based data query device, an electronic device, and a computer-readable storage medium. The invention can solve the problem of low query efficiency when python is used for data query.

Description

Data query method, device, equipment and storage medium based on python
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data query method and device based on python, electronic equipment and a computer readable storage medium.
Background
Python is a clear and concise programming language with a rich, practical set of third-party libraries; it is increasingly favored by programmers and is among the languages most widely used by developers today. As an interpreted language, however, it runs relatively slowly, and its processing efficiency is not ideal for scenarios with large data volumes or complex computation. In data query, for example, the main thread and the worker threads of a user program all live in one python interpreter process, and the Global Interpreter Lock (GIL) allows only one of them to execute at a time. Even on a multi-core CPU platform, python therefore prevents multiple threads from running in parallel, so the threads behave much like coroutines that take turns. Python's support for multi-CPU devices is correspondingly weak, and data query is slow and inefficient.
Disclosure of Invention
The invention provides a data query method, a data query device, data query equipment and a storage medium based on python, and mainly aims to solve the problem of low query efficiency when the python is used for querying data.
In order to achieve the above object, the present invention provides a data query method based on python, including:
acquiring an original data set, and performing database configuration on the original data set and a preset database by using a preset python mapping program to obtain an in-memory database;
performing data cleaning and data classification on the original data set in the in-memory database to obtain a database table;
receiving a data query instruction, constructing a query main process based on the data query instruction, and setting a timed update sub-process based on a preset python process tool;
and updating the database table by using the timed update sub-process, and querying the updated database table based on the query main process to obtain a data query result.
Optionally, the acquiring an original data set, and performing database configuration on the original data set and a preset database by using a preset python mapping program to obtain an in-memory database includes:
acquiring original data in a plurality of original databases to obtain an original data set;
establishing a mapping relation between original data in the original data set and an original database corresponding to the original data by using the python mapping program, and summarizing the mapping relation of all the original databases to obtain a database configuration file;
performing access configuration according to the preset database to obtain an access configuration file;
and adding the database configuration file and the access configuration file in the preset database to obtain the memory database.
Optionally, the performing data cleaning and data classification on the original data set to obtain a database table includes:
performing data deduplication, outlier removal and missing-value filling on the data in the original data set to obtain a standard data set;
and performing database and table splitting on the standard data set to obtain the database table.
Optionally, the performing data deduplication, outlier removal and missing-value filling on the data in the original data set to obtain a standard data set includes:
calculating a distance value between data in the original data set by using a distance formula, and deduplicating the data in the original data set according to the distance value to obtain a deduplicated data set;
removing outliers from the deduplicated data set by using a one-sided test formula to obtain an outlier-removed data set;
and performing missing-value detection on the data in the outlier-removed data set by using a preset missing-value detection function, and filling the missing values based on a preset filling algorithm to obtain the standard data set.
Optionally, the setting of the timed update sub-process based on the preset python process tool includes:
acquiring a timed execution event, and encapsulating the timed execution event into an execution function;
and creating a sub-process by using the python process tool, and encapsulating the execution function into the sub-process to obtain the timed update sub-process.
Optionally, the updating the database table by using the timed update sub-process includes:
constructing a data initialization process by using the python process tool;
executing the timed execution event in the timed update sub-process to obtain a new original data set;
and performing data initialization processing on the new original data set by using the data initialization process, and filling the processed data into the database table.
Optionally, the querying the updated database table based on the query main process includes:
generating an access file according to the query main process;
parsing the query field of the data query instruction in the query main process;
and, taking the access file as a query entry, searching the updated database table through the query entry by using the query field.
In order to solve the above problem, the present invention further provides a python-based data query apparatus, including:
the database configuration module is used for acquiring an original data set, and performing database configuration on the original data set and a preset database by using a preset python mapping program to obtain an in-memory database;
the data processing module is used for performing data cleaning and data classification on the original data set in the in-memory database to obtain a database table;
the process construction module is used for receiving a data query instruction, constructing a query main process based on the data query instruction, and setting a timed update sub-process based on a preset python process tool;
and the data query module is used for updating the database table by using the timed update sub-process and querying the updated database table based on the query main process to obtain a data query result.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one computer program; and
a processor executing a computer program stored in the memory to implement the python-based data query method described above.
In order to solve the above problem, the present invention also provides a computer-readable storage medium, in which at least one computer program is stored, and the at least one computer program is executed by a processor in an electronic device to implement the python-based data query method described above.
According to the invention, database configuration is performed on the original data set through the python mapping program to obtain an in-memory database. Compared with a traditional relational database or file-based database, the in-memory database reads the same data markedly faster, which reduces the response time of data queries and thereby improves query speed. In addition, data cleaning and data classification further reduce the data volume and subdivide the data, which also improves query speed. Moreover, processes rather than threads are constructed with the python process tool; multi-process processing bypasses the GIL and achieves program parallelism, improving data query efficiency. Therefore, the python-based data query method, device, electronic device and computer-readable storage medium provided by the invention can solve the problem of low query efficiency when python is used for data query.
Drawings
FIG. 1 is a flow chart of a python-based data query method according to an embodiment of the present invention;
FIG. 2 is a functional block diagram of a python-based data query apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device implementing the python-based data query method according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a python-based data query method. The execution subject of the python-based data query method includes, but is not limited to, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiment of the application. In other words, the python-based data query method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server, a cloud server cluster, and the like. The server may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Referring to fig. 1, a flowchart of a data query method based on python according to an embodiment of the present invention is shown. In this embodiment, the python-based data query method includes:
S1, acquiring an original data set, and performing database configuration on the original data set and a preset database by using a preset python mapping program to obtain an in-memory database.
In the embodiment of the invention, the original data set may be products, user information and the like in different fields. For example, in the financial field, the original data set may be the user's stock information, fund information, trust information and the like. The preset python mapping program may be an Object Relational Mapping (ORM) program in python. The preset database may be a sqlite3 in-memory database; compared with a traditional relational database or file-based database, an in-memory database reads the same data markedly faster, which reduces the response time of data queries and thereby improves query speed.
Specifically, the acquiring an original data set, and performing database configuration on the original data set and a preset database by using a preset python mapping program to obtain an in-memory database includes:
acquiring original data in a plurality of original databases to obtain an original data set;
establishing a mapping relation between original data in the original data set and an original database corresponding to the original data by using the python mapping program, and summarizing the mapping relation of all the original databases to obtain a database configuration file;
performing access configuration according to the preset database to obtain an access configuration file;
and adding the database configuration file and the access configuration file in the preset database to obtain the memory database.
In an optional embodiment of the present invention, the original database may be a relational database such as Oracle, SQL Server, DB2 or MySQL. By using an object mapping program in python, the different original databases and their data are mapped, which facilitates management and switching. For example, the original databases can be configured uniformly through the ORM provided by SQLAlchemy.
In an optional embodiment of the present invention, the access configuration may include: 1. client connection access; 2. program API access; 3. HTTP access.
In the embodiment of the invention, the original data set is subjected to database configuration by using the ORM in python, so that the memory database with higher response speed can be used for data query, and the data query speed is improved.
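As a rough illustration of step S1, the following sketch loads rows from several sources into one in-memory SQLite database. It is only a sketch under stated assumptions: it uses the standard-library sqlite3 module directly rather than an ORM such as SQLAlchemy, and the source names, the `raw_data` table layout and the `build_memory_db` helper are hypothetical.

```python
import sqlite3

# Hypothetical stand-ins for rows already fetched from two original databases.
ORIGINAL_SOURCES = {
    "stock_db": [("600000", "stock", 10.5), ("600001", "stock", 8.2)],
    "fund_db": [("F001", "fund", 1.03)],
}


def build_memory_db(sources):
    """Load rows from every source into one in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_data (source TEXT, code TEXT, kind TEXT, value REAL)"
    )
    for source_name, rows in sources.items():
        # Storing the source name keeps the data-to-database mapping queryable.
        conn.executemany(
            "INSERT INTO raw_data VALUES (?, ?, ?, ?)",
            [(source_name, *row) for row in rows],
        )
    conn.commit()
    return conn


conn = build_memory_db(ORIGINAL_SOURCES)
row_count = conn.execute("SELECT COUNT(*) FROM raw_data").fetchone()[0]
```

Keeping the source name as a column is one simple way to preserve the mapping between each record and its original database; an ORM-based configuration would express the same mapping declaratively.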
S2, performing data cleaning and data classification on the original data set in the in-memory database to obtain a database table.
In the embodiment of the invention, the original data set comes from a plurality of databases, and its large volume makes queries disorderly; refining the data through data cleaning and data classification improves the response speed of data queries.
In detail, the performing data cleaning and data classification on the original data set to obtain a database table includes:
performing data deduplication, outlier removal and missing-value filling on the data in the original data set to obtain a standard data set;
and performing database and table splitting on the standard data set to obtain the database table.
In the embodiment of the present invention, the performing data deduplication, outlier removal and missing-value filling on the data in the original data set to obtain a standard data set includes:
calculating a distance value between data in the original data set by using a distance formula, and deduplicating the data in the original data set according to the distance value to obtain a deduplicated data set;
removing outliers from the deduplicated data set by using a one-sided test formula to obtain an outlier-removed data set;
and performing missing-value detection on the data in the outlier-removed data set by using a preset missing-value detection function, and filling the missing values based on a preset filling algorithm to obtain the standard data set.
In an optional embodiment of the present invention, the distance formula may be:
[Equation image BDA0003380211400000061: distance formula, not reproduced in the text]
where d represents the distance value between any two data items in the original data set, and $w_{1j}$ and $w_{2j}$ represent those two data items. When the distance value is smaller than a preset distance value, either one of the two data items is deleted; when the distance value is not smaller than the preset distance value, both data items are kept. Preferably, the preset distance value may be 0.1.
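The deduplication rule can be sketched as follows. The distance formula itself is not reproduced in this text, so the sketch assumes a Euclidean distance over the components indexed by j; the `euclidean` and `deduplicate` helpers and the sample records are hypothetical.

```python
import math


def euclidean(a, b):
    """Assumed distance: Euclidean distance over the components of two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def deduplicate(records, threshold=0.1):
    """Keep a record only if it is at least `threshold` away from every kept record."""
    kept = []
    for rec in records:
        if all(euclidean(rec, k) >= threshold for k in kept):
            kept.append(rec)
    return kept


records = [(1.0, 2.0), (1.0, 2.05), (3.0, 4.0)]
unique = deduplicate(records)
```

With the preferred threshold of 0.1, the second record lies about 0.05 from the first and is dropped, while the third is kept.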
In an optional embodiment of the present invention, the one-sided test includes a minimum one-sided test and a maximum one-sided test.
The test value for the minimum one-sided test is calculated as:
$$G = \frac{\bar{Y} - Y_{\min}}{S}$$
where G represents the test value, S represents the standard deviation of the data in the deduplicated original data set, $\bar{Y}$ represents the mean of the data in the deduplicated original data set, and $Y_{\min}$ represents the smallest data item in the deduplicated original data set. When G is larger than a preset test threshold, the smallest data item is determined to be abnormal data.
The test value for the maximum one-sided test is calculated as:
$$G = \frac{Y_{\max} - \bar{Y}}{S}$$
where G represents the test value, S represents the standard deviation of the data in the deduplicated original data set, $\bar{Y}$ represents the mean of the data in the deduplicated original data set, and $Y_{\max}$ represents the largest data item in the deduplicated original data set. When G is larger than a preset test threshold, the largest data item is determined to be abnormal data.
In an optional embodiment of the present invention, the missing-value detection function may be a mismap missing-value function. If no missing value is detected, no processing is performed; if a missing value is detected, the missing value is filled by the preset filling algorithm of the embodiment of the present invention, where the preset filling algorithm includes:
[Equation image BDA0003380211400000071: preset filling algorithm, not reproduced in the text]
where $L(\theta)$ represents the filled missing value, $x_i$ represents the i-th missing value, $\theta$ represents the probability parameter corresponding to the filled missing value, n represents the number of data items in the outlier-removed data set, and $p(x_i \mid \theta)$ represents the probability of the filled missing value.
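The symbols L(θ) and p(x_i | θ) suggest a likelihood-based fill, but the formula is not reproduced in this text, so the following is only an illustrative assumption: the observed values are modeled as Gaussian, in which case the likelihood-maximizing mean is the sample mean, and each missing entry is filled with it. The `detect_missing` and `fill_missing` helpers are hypothetical.

```python
import math
import statistics


def detect_missing(values):
    """Return the indices of entries that are None or NaN."""
    return [
        i
        for i, v in enumerate(values)
        if v is None or (isinstance(v, float) and math.isnan(v))
    ]


def fill_missing(values):
    """Fill missing entries with the mean of the observed entries."""
    missing = set(detect_missing(values))
    if not missing:
        return list(values)  # nothing detected, so no processing is performed
    observed = [v for i, v in enumerate(values) if i not in missing]
    # Assumption: under a Gaussian model the likelihood-maximizing mean
    # is the sample mean of the observed values.
    theta = statistics.fmean(observed)
    return [theta if i in missing else v for i, v in enumerate(values)]


filled = fill_missing([1.0, None, 3.0, float("nan"), 5.0])
```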
In the embodiment of the present invention, the performing database and table splitting on the standard data set to obtain the database table includes:
dividing the in-memory database into a preset number of data sub-databases, wherein each data sub-database comprises a pre-constructed intermediate table;
classifying the standard data set according to a preset data classification rule to obtain a classified data set;
and filling the classified data set into the intermediate table to obtain the database table.
In an optional embodiment of the present invention, the same number of data sub-databases as original databases may be constructed, and the data is classified and filled into the intermediate tables in the data sub-databases to obtain the database tables. For example, suppose there are two original databases, database 1 and database 2, where database 1 holds stock information and database 2 holds fund information. The in-memory database is divided into two sub-databases, the data is divided by sector (such as new energy, liquor and healthcare), and the divided data is filled into preset blank intermediate tables to obtain the database tables.
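The splitting described above might look like the following sketch, which builds one in-memory SQLite connection per original database and one table per sector. The record layout, sector names and helper functions are hypothetical, and separate connections are just one possible way to model the data sub-databases.

```python
import sqlite3
from collections import defaultdict

# Hypothetical cleaned records, tagged with their original database and sector.
STANDARD_DATA = [
    {"source": "stock_db", "sector": "new_energy", "code": "S1"},
    {"source": "stock_db", "sector": "liquor", "code": "S2"},
    {"source": "fund_db", "sector": "medical", "code": "F1"},
]


def partition(records):
    """Group records by source (sub-database), then by sector (table)."""
    grouped = defaultdict(lambda: defaultdict(list))
    for rec in records:
        grouped[rec["source"]][rec["sector"]].append(rec["code"])
    return grouped


def build_sub_databases(records):
    """Create one in-memory database per source and one table per sector."""
    sub_dbs = {}
    for source, sectors in partition(records).items():
        conn = sqlite3.connect(":memory:")
        for sector, codes in sectors.items():
            # Table names come from a fixed, trusted list in this sketch;
            # user input should never be interpolated into SQL like this.
            conn.execute(f'CREATE TABLE "{sector}" (code TEXT)')
            conn.executemany(
                f'INSERT INTO "{sector}" VALUES (?)', [(c,) for c in codes]
            )
        conn.commit()
        sub_dbs[source] = conn
    return sub_dbs


sub_dbs = build_sub_databases(STANDARD_DATA)
```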
S3, receiving a data query instruction, constructing a query main process based on the data query instruction, and setting a timing update sub-process based on a preset python process tool.
In the embodiment of the present invention, the data query instruction refers to a data query request sent by a user, and includes: user information query request, business information query request, financial information query request, and the like. The preset python process tool may be a multiprocessing tool in python.
In detail, the setting of the timing update sub-process based on the preset python process tool comprises the following steps:
acquiring a timed execution event, and encapsulating the timed execution event into an execution function;
and creating a sub-process by using the python process tool, and encapsulating the execution function into the sub-process to obtain the timed update sub-process.
In an optional embodiment of the present invention, before the acquiring a timed execution event, the method further includes:
acquiring a preset execution time and execution frequency, and encapsulating the execution time and the execution frequency by using a preset timing tool to obtain the timed execution event.
In the embodiment of the invention, the apscheduler tool can be used to create the timed execution task (including the execution time and the execution frequency), while the multiprocessing tool is used to create processes instead of threads for data processing; this bypasses the GIL, achieves program parallelism, and improves parallel data processing capability.
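A minimal sketch of a timed update sub-process built with the standard-library multiprocessing module is shown below. To stay self-contained it replaces the apscheduler tool with a plain sleep loop, which is a deliberate simplification; `fetch_new_rows`, `timed_update` and `start_update_subprocess` are hypothetical names.

```python
import multiprocessing as mp
import time


def fetch_new_rows(run_index):
    """Hypothetical stand-in for the timed execution event."""
    return [("row", run_index)]


def timed_update(out_queue, interval_s, max_runs):
    """Execution function: push refreshed rows to the queue at a fixed interval."""
    for run_index in range(max_runs):
        for row in fetch_new_rows(run_index):
            out_queue.put(row)
        time.sleep(interval_s)


def start_update_subprocess(out_queue, interval_s=0.01, max_runs=3):
    """Wrap the execution function in a separate process, not a thread."""
    proc = mp.Process(
        target=timed_update, args=(out_queue, interval_s, max_runs), daemon=True
    )
    proc.start()
    return proc


if __name__ == "__main__":
    queue = mp.Queue()
    proc = start_update_subprocess(queue)
    # Drain the queue before joining so the child is never blocked on a full pipe.
    rows = [queue.get(timeout=5) for _ in range(3)]
    proc.join()
    print(rows)
```

Because the update runs in its own interpreter process, it has its own GIL and does not compete with the query main process for it. A real scheduler would replace the loop with a job that fires at the configured execution time and frequency.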
S4, updating the database table by using the timed update sub-process, and querying the updated database table based on the query main process to obtain a data query result.
In detail, the updating the database table by using the timing update sub-process includes:
constructing a data initialization process by using the python process tool;
executing the timing execution event in the timing updating subprocess to obtain a new original data set;
and performing data initialization processing on the new original data set by using the data initialization process, and filling the processed data into the database table.
In an optional embodiment of the present invention, the data is likewise updated by processes created with the multiprocessing tool rather than by threads, so the multi-process processing bypasses the GIL and achieves program parallelism.
In an embodiment of the present invention, the querying an updated database table based on the querying main process includes:
generating an access file according to the query main process;
analyzing the query field of the data query instruction in the query main process;
and searching the updated database table from the query entry by using the query field by taking the access file as the query entry.
In the embodiment of the invention, the in-memory database stores its data in the memory of a single process, and different processes do not share the same memory, so the in-memory database must be accessed through a file acting as a medium. An access file is generated when the query main process starts; unlike ordinary files, the access file is equivalent to opening an entry into that memory, and queries are carried out through this entry. For example, in the financial field, a user inputs a financial data query instruction, and a financial data query result is obtained by parsing the query field in the financial data query instruction.
In another optional embodiment of the present invention, before the parsing a query field of a data query instruction in the query host process, the method further includes:
checking whether a memory file exists in the memory database;
if the memory file does not exist in the memory database, no processing is performed;
and if the memory file exists in the memory database, deleting the memory file.
In the embodiment of the invention, when the file-mediated in-memory database is established, a file of size 0 is generated under the main directory and occupies a disk sector. Whether this file exists is therefore checked each time the query main process starts, and the file is deleted if it exists; otherwise the disk sectors occupied would keep growing with every start of the query main process, affecting the data query speed.
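The start-up check can be sketched in a few lines with pathlib. The file name and location are hypothetical; the sketch only shows the check-then-delete step.

```python
import tempfile
from pathlib import Path


def remove_stale_memory_file(path):
    """Delete the leftover file if present; return True when something was removed."""
    file_path = Path(path)
    if file_path.exists():
        file_path.unlink()
        return True
    return False


stale = Path(tempfile.mkdtemp()) / "memdb_placeholder"
stale.touch()  # simulate the zero-byte file left by a previous start
removed_first = remove_stale_memory_file(stale)
removed_second = remove_stale_memory_file(stale)
```

The second call finds nothing to delete and returns False, matching the "no processing" branch.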
According to the invention, database configuration is performed on the original data set through the python mapping program to obtain an in-memory database. Compared with a traditional relational database or file-based database, the in-memory database reads the same data markedly faster, which reduces the response time of data queries and thereby improves query speed. In addition, data cleaning and data classification further reduce the data volume and subdivide the data, which also improves query speed. Moreover, processes rather than threads are constructed with the python process tool; multi-process processing bypasses the GIL and achieves program parallelism, improving data query efficiency. Therefore, the python-based data query method provided by the invention can solve the problem of low query efficiency when python is used for data query.
FIG. 2 is a functional block diagram of a python-based data query apparatus according to an embodiment of the present invention.
The python-based data query device 100 of the present invention can be installed in an electronic device. According to the realized functions, the python-based data query device 100 can comprise a database configuration module 101, a data processing module 102, a process construction module 103 and a data query module 104. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the database configuration module 101 is configured to acquire an original data set, and perform database configuration on the original data set and a preset database by using a preset python mapping program to obtain an in-memory database;
the data processing module 102 is configured to perform data cleaning and data classification on the original data set in the in-memory database to obtain a database table;
the process construction module 103 is configured to receive a data query instruction, construct a query main process based on the data query instruction, and set a timing update sub-process based on a preset python process tool;
the data query module 104 is configured to perform data update on the database table by using the timing update sub-process, and query the updated database table based on the query main process to obtain a data query result.
In detail, the specific implementation of the modules of the python-based data query apparatus 100 is as follows:
Step one, acquiring an original data set, and performing database configuration on the original data set and a preset database by using a preset python mapping program to obtain an in-memory database.
In the embodiment of the invention, the original data set may be products, user information and the like in different fields. For example, in the financial field, the original data set may be the user's stock information, fund information, trust information and the like. The preset python mapping program may be an Object Relational Mapping (ORM) program in python. The preset database may be a sqlite3 in-memory database; compared with a traditional relational database or file-based database, an in-memory database reads the same data markedly faster, which reduces the response time of data queries and thereby improves query speed.
Specifically, the acquiring an original data set, and performing database configuration on the original data set and a preset database by using a preset python mapping program to obtain an in-memory database includes:
acquiring original data in a plurality of original databases to obtain an original data set;
establishing a mapping relation between original data in the original data set and an original database corresponding to the original data by using the python mapping program, and summarizing the mapping relation of all the original databases to obtain a database configuration file;
performing access configuration according to the preset database to obtain an access configuration file;
and adding the database configuration file and the access configuration file in the preset database to obtain the memory database.
In an optional embodiment of the present invention, the original database may be a relational database such as Oracle, SQL Server, DB2, or MySQL. Mapping the different original databases and their data with an ORM program in python facilitates management and switching. For example, SQLAlchemy may be used as the ORM to configure the original databases uniformly.
In an optional embodiment of the present invention, the access configuration may include: (1) client connection access; (2) program API access; and (3) HTTP access.
In the embodiment of the invention, the original data set is subjected to database configuration by using the ORM in python, so that the memory database with higher response speed can be used for data query, and the data query speed is improved.
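As a rough illustration of step one, the sketch below copies rows from several source databases into a sqlite3 in-memory database. It uses only the standard-library `sqlite3` module instead of an ORM such as SQLAlchemy, and the `SOURCES` mapping, the table name `raw_data`, and the column names are hypothetical stand-ins for the database configuration file.

```python
import sqlite3

# Hypothetical mapping configuration: source database name -> rows read from it.
# In practice each entry would be fetched through an ORM session or DB-API connection.
SOURCES = {
    "stock_db": [("600000", "stock", 10.5), ("600001", "stock", 8.2)],
    "fund_db": [("F0001", "fund", 1.32)],
}


def build_memory_database(sources):
    """Copy the rows of every source into one sqlite3 in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_data (source TEXT, code TEXT, kind TEXT, value REAL)"
    )
    for source_name, rows in sources.items():
        # Record which source each row came from, mirroring the mapping relation.
        conn.executemany(
            "INSERT INTO raw_data (source, code, kind, value) VALUES (?, ?, ?, ?)",
            [(source_name, *row) for row in rows],
        )
    conn.commit()
    return conn


conn = build_memory_database(SOURCES)
row_count = conn.execute("SELECT COUNT(*) FROM raw_data").fetchone()[0]
```

Keeping the source name in each row is one simple way to preserve the mapping between a record and its original database; the method itself does not prescribe this layout.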
Step two: perform data cleaning and data classification on the original data set in the in-memory database to obtain a database table.
In the embodiment of the invention, the original data set comes from a plurality of databases, and the large data volume makes queries disorderly; refining the data through data cleaning and data classification can improve the response speed of data queries.
In detail, the performing data cleaning and data classification on the original data set to obtain a database table includes:
performing data duplication removal, data exception removal and data missing value filling on data in the original data set to obtain a standard data set;
and performing database and table dividing processing on the standard data set to obtain the database table.
In the embodiment of the present invention, the performing data deduplication, data anomaly removal and data missing value filling on the data in the original data set to obtain a standard data set includes:
calculating a distance value between data in the original data set by using a distance formula, and deduplicating the data in the original data set according to the distance value to obtain a deduplicated data set;
removing abnormal values from the deduplicated data set by using a one-sided test formula to obtain an anomaly-free data set;
and performing missing value detection on the data in the anomaly-free data set by using a preset missing value detection function, and filling missing values based on a preset filling algorithm to obtain the standard data set.
In an optional embodiment of the present invention, the distance formula may be:
Figure BDA0003380211400000111
where d represents the distance value between any two data in the original data set, and $w_{1j}$ and $w_{2j}$ represent any two data in the original data set. When the distance value is smaller than a preset distance value, either one of the two data is deleted; when the distance value is not smaller than the preset distance value, both data are kept. Preferably, the preset distance value may be 0.1.
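The deduplication rule can be sketched as follows. The distance formula itself is not reproduced in the text, so this sketch assumes a Euclidean distance over the components of two records, which is only one choice consistent with the symbols above; the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Distance between two equal-length numeric records (assumed Euclidean)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def deduplicate(records, threshold=0.1):
    """Keep a record only if it is at least `threshold` away from every kept record."""
    kept = []
    for record in records:
        if all(euclidean_distance(record, other) >= threshold for other in kept):
            kept.append(record)
    return kept


records = [(1.0, 2.0), (1.0, 2.05), (3.0, 4.0)]
unique_records = deduplicate(records)  # the second record is within 0.1 of the first
```

This pairwise comparison is quadratic in the number of records; for large data sets an index or blocking step would likely be needed.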
In an optional embodiment of the present invention, the one-sided test includes a one-sided test for the minimum value and a one-sided test for the maximum value.
The one-sided test for the minimum value is calculated as:
$$G = \frac{\bar{Y} - Y_{\min}}{S}$$
where G represents the test value, S represents the standard deviation of the data in the deduplicated data set, $\bar{Y}$ represents the mean value of the data in the deduplicated data set, and $Y_{\min}$ represents the smallest data in the deduplicated data set. When G is larger than a preset test threshold, the smallest data is determined to be abnormal data.
The one-sided test for the maximum value is calculated as:
$$G = \frac{Y_{\max} - \bar{Y}}{S}$$
where G represents the test value, S represents the standard deviation of the data in the deduplicated data set, $\bar{Y}$ represents the mean value of the data in the deduplicated data set, and $Y_{\max}$ represents the largest data in the deduplicated data set. When G is larger than a preset test threshold, the largest data is determined to be abnormal data.
In an optional embodiment of the present invention, the missing value detection function may be a mismap function. If no data missing value is detected, no processing is performed; if a data missing value is detected, the missing value is filled by the preset filling algorithm, where the preset filling algorithm includes:
$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
where $L(\theta)$ represents the filled data missing value, $x_i$ represents the i-th data missing value, $\theta$ represents the probability parameter corresponding to the filled data missing value, n represents the number of data in the anomaly-free data set, and $p(x_i \mid \theta)$ represents the probability of the filled data missing value.
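One way to read the filling step is as a maximum-likelihood estimate: choose the parameter θ that maximizes the likelihood of the observed data and use it as the fill value. The sketch below assumes a normal model with mean θ and fixed variance, for which the maximizer is the sample mean; the model choice, the use of `None` to mark missing values, and the function names are all assumptions.

```python
import math


def log_likelihood(theta, observed, sigma=1.0):
    """log L(theta) = sum_i log p(x_i | theta) for a normal model with mean theta."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - theta) ** 2 / (2 * sigma ** 2)
        for x in observed
    )


def fill_missing(values):
    """Replace None entries with the theta that maximizes the assumed likelihood."""
    observed = [v for v in values if v is not None]
    if len(observed) == len(values):
        return list(values)  # nothing missing, so no processing is needed
    if not observed:
        raise ValueError("cannot estimate a fill value without observed data")
    # For the assumed normal model, the maximizer of L(theta) is the sample mean.
    theta_hat = sum(observed) / len(observed)
    return [theta_hat if v is None else v for v in values]


filled = fill_missing([1.0, None, 3.0])
```

The helper `log_likelihood` is included only to show that the chosen fill value scores higher than nearby alternatives under the assumed model.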
In the embodiment of the present invention, the performing database and table dividing processing on the standard data set to obtain the database table includes:
dividing the memory database into a preset number of data sub-databases, wherein the data sub-databases comprise pre-constructed intermediate tables;
classifying the standard data set according to a preset data classification rule to obtain a classified data set;
and filling the classified data set into the intermediate table to obtain the database table.
In an optional embodiment of the present invention, as many data sub-databases may be constructed as there are original databases; the data is then classified and divided, and filled into the intermediate tables in the data sub-databases to obtain the database tables. For example, suppose there are two original databases, database 1 and database 2, where the data in database 1 is stock information and the data in database 2 is fund information. The in-memory database is divided into two sub-databases, the data is divided by sector, such as new energy, liquor, and medical treatment, and the divided data is filled into preset blank intermediate tables to obtain the database tables.
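The classification and table-filling steps might look like the sketch below. A single sqlite3 connection stands in for one data sub-database, and the classification rule (grouping by a sector label), the `stock_` table prefix, and the column names are hypothetical.

```python
import sqlite3
from collections import defaultdict


def classify(records):
    """Hypothetical classification rule: group records by their sector label."""
    groups = defaultdict(list)
    for code, sector, price in records:
        groups[sector].append((code, price))
    return groups


def fill_tables(conn, groups):
    """Create one intermediate table per class and fill it with that class's rows."""
    for sector, rows in groups.items():
        # Table names cannot be bound as SQL parameters, so accept only safe labels.
        if not sector.isidentifier():
            raise ValueError(f"unsafe table label: {sector!r}")
        table = f"stock_{sector}"
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (code TEXT PRIMARY KEY, price REAL)"
        )
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()


standard_data = [
    ("600001", "new_energy", 12.5),
    ("600002", "liquor", 180.0),
    ("600003", "new_energy", 9.8),
]
conn = sqlite3.connect(":memory:")  # stands in for one data sub-database
fill_tables(conn, classify(standard_data))
```

A second sub-database, for example for fund information, could be handled the same way with its own connection and table prefix.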
Step three: receive a data query instruction, construct a query main process based on the data query instruction, and set a timing update sub-process based on a preset python process tool.
In the embodiment of the present invention, the data query instruction refers to a data query request sent by a user, and includes: user information query request, business information query request, financial information query request, and the like. The preset python process tool may be a multiprocessing tool in python.
In detail, the setting of the timing update sub-process based on the preset python process tool comprises the following steps:
acquiring a timing execution event, and packaging the timing execution event into an execution function;
and creating a sub-process by using the python process tool, and packaging the execution function into the sub-process to obtain the timing updating sub-process.
In an optional embodiment of the present invention, before the obtaining the timing execution event, the method further includes:
and acquiring preset execution time and execution frequency, and packaging the execution time and the execution frequency by using a preset timing tool to obtain the timing execution event.
In the embodiment of the invention, the apscheduler tool may be used to create the timed execution task (including the execution time and the execution frequency). Meanwhile, the multiprocessing tool is used to create processes rather than threads for data processing, which bypasses the Global Interpreter Lock (GIL), achieves program parallelism, and improves parallel data processing capability.
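The construction of the timing update sub-process can be sketched with the standard-library `multiprocessing` module. To stay self-contained, a plain sleep loop stands in for the apscheduler timing tool, and `fetch_update`, `run_periodically`, and `start_timed_update` are hypothetical names.

```python
import multiprocessing
import time


def fetch_update():
    """Hypothetical timed event; a real one would pull fresh rows from the sources."""
    return {"fetched_at": time.time()}


def run_periodically(task, interval_seconds, max_runs, results):
    """Execution function: call `task` every `interval_seconds`, `max_runs` times."""
    for _ in range(max_runs):
        results.put(task())
        time.sleep(interval_seconds)


def start_timed_update(interval_seconds=1.0, max_runs=3):
    """Wrap the execution function in a child process and start it."""
    results = multiprocessing.Queue()
    process = multiprocessing.Process(
        target=run_periodically,
        args=(fetch_update, interval_seconds, max_runs, results),
        daemon=True,  # do not keep the program alive only for the updater
    )
    process.start()
    return process, results
```

On platforms that start child processes by spawning a fresh interpreter, `start_timed_update()` should be called under an `if __name__ == "__main__":` guard. Because the updater runs in its own process with its own interpreter, it does not compete with the query main process for the GIL.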
Step four: update the data of the database table by using the timing update sub-process, and query the updated database table based on the query main process to obtain a data query result.
In detail, the updating the database table by using the timing update sub-process includes:
constructing a data initialization process by using the python process tool;
executing the timing execution event in the timing updating subprocess to obtain a new original data set;
and performing data initialization processing on the new original data set by using the data initialization process, and filling the processed data into the database table.
In an optional embodiment of the present invention, because the multiprocessing tool is likewise used to create processes rather than threads for updating data, the multi-process processing bypasses the GIL and achieves program parallelism.
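The update itself can be sketched as an initialize-then-fill routine. For brevity the sketch runs in a single process, whereas the method describes running it in separate processes; the `quotes` table, its columns, and the simplified initialization rule (drop incomplete rows and repeated codes) are assumptions.

```python
import sqlite3


def initialize(rows):
    """Initialization stand-in: drop rows with missing fields and repeated codes."""
    seen, cleaned = set(), []
    for code, price in rows:
        if code is None or price is None or code in seen:
            continue
        seen.add(code)
        cleaned.append((code, price))
    return cleaned


def refresh_table(conn, new_rows):
    """Fill initialized rows into the table, overwriting rows with the same code."""
    conn.executemany(
        "INSERT OR REPLACE INTO quotes (code, price) VALUES (?, ?)",
        initialize(new_rows),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quotes (code TEXT PRIMARY KEY, price REAL)")
conn.execute("INSERT INTO quotes VALUES ('600001', 12.5)")
refresh_table(
    conn,
    [("600001", 13.0), ("600002", None), ("600003", 7.1), ("600003", 7.2)],
)
```

Using the record code as a primary key lets `INSERT OR REPLACE` overwrite stale rows instead of accumulating duplicates on every scheduled run.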
In an embodiment of the present invention, the querying an updated database table based on the querying main process includes:
generating an access file according to the query main process;
analyzing the query field of the data query instruction in the query main process;
and taking the access file as a query entry, and searching the updated database table through the query entry by using the query field.
In the embodiment of the invention, the in-memory database stores its data in the memory of a single process, and different processes do not share the same memory, so the in-memory database must be accessed with a file as a medium. An access file is generated when the query main process is started; unlike other files, the access file is equivalent to opening an entry into the memory, and the query is carried out through that entry. For example, in the financial field, a user inputs a financial data query instruction, and a financial data query result is obtained by parsing the query field in the financial data query instruction.
In another optional embodiment of the present invention, before the parsing a query field of a data query instruction in the query main process, the method further includes:
checking whether a memory file exists in the memory database;
if the memory file does not exist in the memory database, no processing is performed;
and if the memory file exists in the memory database, deleting the memory file.
In the embodiment of the invention, when the in-memory database using a file as a medium is established, a file of size 0 is generated under the main directory and occupies a certain number of disk sectors. Therefore, whether the file exists is checked when the query main process is started, and the file is deleted if it exists; otherwise, the disk sectors occupied would keep growing each time the query main process is started, which would affect the data query speed.
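The start-up check can be sketched with `pathlib`. The file name `memory.db` and the helper name are hypothetical; the real location of the leftover file depends on how the in-memory database was created.

```python
import tempfile
from pathlib import Path


def remove_stale_memory_file(path):
    """Delete a leftover database file if present; return True when one was removed."""
    file = Path(path)
    if file.is_file():
        file.unlink()
        return True
    return False  # nothing to clean up, so no processing is performed


stale = Path(tempfile.mkdtemp()) / "memory.db"
stale.touch()  # simulate the zero-size file left by an earlier start
first = remove_stale_memory_file(stale)   # file exists and is deleted
second = remove_stale_memory_file(stale)  # file is already gone
```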
According to the method, the original data set is subjected to database configuration through the python mapping program to obtain the in-memory database; compared with a traditional relational database or a file-based database, the in-memory database reads the same data significantly faster, which reduces the response time of data queries and thereby improves the query speed. Meanwhile, data cleaning and data classification further reduce the data volume and subdivide the data, which also improves the data query speed. Moreover, processes rather than threads are constructed by using the python process tool, and multi-process processing bypasses the GIL to achieve program parallelism, thereby improving the efficiency of data queries. Therefore, the python-based data query apparatus provided by the invention can solve the problem of low query efficiency when python is used to query data.
Fig. 3 is a schematic structural diagram of an electronic device implementing a python-based data query method according to an embodiment of the present invention.
The electronic device may comprise a processor 10, a memory 11, a communication interface 12 and a bus 13, and may further comprise a computer program, such as a python-based data query program, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as codes of a python-based data query program, but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device by running or executing programs or modules (e.g., python-based data query programs, etc.) stored in the memory 11 and calling data stored in the memory 11.
The communication interface 12 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), which are typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
The bus 13 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 13 may be divided into an address bus, a data bus, a control bus, etc. The bus 13 is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 3 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in fig. 3 does not constitute a limitation of the electronic device, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management and the like are realized through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used to establish a communication connection between the electronic device and other electronic devices.
Optionally, the electronic device may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The python-based data query program stored in the memory 11 of the electronic device is a combination of instructions that, when executed in the processor 10, may implement:
acquiring an original data set, and performing database configuration on the original data set and a preset database by using a preset python mapping program to obtain a memory database;
performing data cleaning and data classification on the original data set in the memory database to obtain a database table;
receiving a data query instruction, constructing a query main process based on the data query instruction, and setting a timing update sub-process based on a preset python process tool;
and updating the data of the database table by using the timing update sub-process, and querying the updated database table based on the query main process to obtain a data query result.
Specifically, the specific implementation method of the instruction by the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to the drawings, which is not described herein again.
Further, the electronic device integrated module/unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
acquiring an original data set, and performing database configuration on the original data set and a preset database by using a preset python mapping program to obtain a memory database;
performing data cleaning and data classification on the original data set in the memory database to obtain a database table;
receiving a data query instruction, constructing a query main process based on the data query instruction, and setting a timing update sub-process based on a preset python process tool;
and updating the data of the database table by using the timing update sub-process, and querying the updated database table based on the query main process to obtain a data query result.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A python-based data query method, comprising:
acquiring an original data set, and performing database configuration on the original data set and a preset database by using a preset python mapping program to obtain a memory database;
performing data cleaning and data classification on the original data set in the memory database to obtain a database table;
receiving a data query instruction, constructing a query main process based on the data query instruction, and setting a timing update sub-process based on a preset python process tool;
and updating the data of the database table by using the timing update sub-process, and querying the updated database table based on the query main process to obtain a data query result.
2. The python-based data query method according to claim 1, wherein the acquiring an original data set, and performing database configuration on the original data set and a preset database by using a preset python mapping program to obtain the in-memory database comprises:
acquiring original data in a plurality of original databases to obtain an original data set;
establishing a mapping relation between original data in the original data set and an original database corresponding to the original data by using the python mapping program, and summarizing the mapping relation of all the original databases to obtain a database configuration file;
performing access configuration according to the preset database to obtain an access configuration file;
and adding the database configuration file and the access configuration file in the preset database to obtain the memory database.
3. The python-based data query method as claimed in claim 1, wherein said performing data cleansing and data classification on said original data set to obtain a database table comprises:
performing data duplication removal, data exception removal and data missing value filling on data in the original data set to obtain a standard data set;
and performing database and table dividing processing on the standard data set to obtain the database table.
4. The python-based data query method as claimed in claim 1, wherein the performing data deduplication, data anomaly removal and data missing value filling on the data in the original data set to obtain a standard data set comprises:
calculating a distance value between data in the original data set by using a distance formula, and deduplicating the data in the original data set according to the distance value to obtain a deduplicated data set;
removing abnormal values from the deduplicated data set by using a one-sided test formula to obtain an anomaly-free data set;
and performing missing value detection on the data in the anomaly-free data set by using a preset missing value detection function, and filling missing values based on a preset filling algorithm to obtain the standard data set.
5. The python-based data query method as claimed in claim 1, wherein said setting a timing update sub-process based on a preset python process tool comprises:
acquiring a timing execution event, and packaging the timing execution event into an execution function;
and creating a sub-process by using the python process tool, and packaging the execution function into the sub-process to obtain the timing updating sub-process.
6. The python-based data query method as claimed in claim 5, wherein said updating the database table with the timed update sub-process comprises:
constructing a data initialization process by using the python process tool;
executing the timing execution event in the timing updating subprocess to obtain a new original data set;
and performing data initialization processing on the new original data set by using the data initialization process, and filling the processed data into the database table.
7. The python-based data query method of claim 1, wherein querying the updated database table based on the querying primary process comprises:
generating an access file according to the query main process;
analyzing the query field of the data query instruction in the query main process;
and searching the updated database table from the query entry by using the query field by taking the access file as the query entry.
8. An apparatus for data query based on python, the apparatus comprising:
the database configuration module is used for acquiring an original data set, and performing database configuration on the original data set and a preset database by using a preset python mapping program to obtain a memory database;
the data processing module is used for carrying out data cleaning and data classification on the original data set in the memory database to obtain a database table;
the process construction module is used for receiving a data query instruction, constructing a query main process based on the data query instruction, and setting a timing update sub-process based on a preset python process tool;
and the data query module is used for updating the database table by utilizing the timing update sub-process and querying the updated database table based on the query main process to obtain a data query result.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform a python-based data query method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements the python-based data query method as claimed in any one of claims 1 to 7.
CN202111431187.7A 2021-11-29 2021-11-29 Data query method, device, equipment and storage medium based on python Pending CN114138761A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111431187.7A CN114138761A (en) 2021-11-29 2021-11-29 Data query method, device, equipment and storage medium based on python

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111431187.7A CN114138761A (en) 2021-11-29 2021-11-29 Data query method, device, equipment and storage medium based on python

Publications (1)

Publication Number Publication Date
CN114138761A (en) 2022-03-04

Family

ID=80389098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111431187.7A Pending CN114138761A (en) 2021-11-29 2021-11-29 Data query method, device, equipment and storage medium based on python

Country Status (1)

Country Link
CN (1) CN114138761A (en)

Similar Documents

Publication Publication Date Title
CN112559535B (en) Multithreading-based asynchronous task processing method, device, equipment and medium
WO2022105135A1 (en) Information verification method and apparatus, and electronic device and storage medium
CN113946690A (en) Potential customer mining method and device, electronic equipment and storage medium
CN112364107A (en) System analysis visualization method and device, electronic equipment and computer readable storage medium
CN112115152A (en) Data increment updating and querying method and device, electronic equipment and storage medium
CN114610747A (en) Data query method, device, equipment and storage medium
CN113806434A (en) Big data processing method, device, equipment and medium
CN112506486A (en) Search system establishing method and device, electronic equipment and readable storage medium
CN114185895A (en) Data import and export method and device, electronic equipment and storage medium
CN115543198A (en) Method and device for lake entering of unstructured data, electronic equipment and storage medium
CN114880368A (en) Data query method and device, electronic equipment and readable storage medium
WO2022179122A1 (en) Big-data-based data storage method and apparatus, and electronic device and storage medium
CN113434542B (en) Data relationship identification method and device, electronic equipment and storage medium
CN114816371B (en) Message processing method, device, equipment and medium
CN114860314B (en) Deployment upgrading method, device, equipment and medium based on database compatibility
CN113849520B (en) Intelligent recognition method and device for abnormal SQL, electronic equipment and storage medium
CN115033605A (en) Data query method and device, electronic equipment and storage medium
CN114626103A (en) Data consistency comparison method, device, equipment and medium
CN114138761A (en) Data query method, device, equipment and storage medium based on python
CN114547011A (en) Data extraction method and device, electronic equipment and storage medium
CN114528593A (en) Data authority control method, device, equipment and storage medium
CN112686759A (en) Account checking monitoring method, device, equipment and medium
CN114860349B (en) Data loading method, device, equipment and medium
CN115174698B (en) Market data decoding method, device, equipment and medium based on table entry index
CN115543214B (en) Data storage method, device, equipment and medium in low-delay scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination