CN117093640B - Data extraction method and device based on pooling technology - Google Patents


Publication number
CN117093640B
Authority
CN
China
Prior art keywords
data
thread
pool
target data
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311347005.7A
Other languages
Chinese (zh)
Other versions
CN117093640A (en)
Inventor
叶大江
彭伟
Current Assignee
Shanghai Clinbrain Information Technology Co Ltd
Original Assignee
Shanghai Clinbrain Information Technology Co Ltd
Priority date
Application filed by Shanghai Clinbrain Information Technology Co Ltd
Priority to CN202311347005.7A
Publication of CN117093640A
Application granted
Publication of CN117093640B
Legal status: Active
Anticipated expiration


Abstract

The application provides a data extraction method and device based on pooling technology, comprising the following steps: acquiring configuration information; determining, according to the configuration information, the total data extraction amount, the number of threads in a thread pool, the number of connections in a connection pool, the primary cache data amount, and the number of CPU cores and amount of running memory of the server; extracting target data from each service database and caching it; after the extracted target data reaches the primary cache data amount, at least one CPU core uses at least one thread in the thread pool to call one or more connections in the connection pool and write the target data into one or more distributed storage nodes of the distributed database; wherein each CPU core is configured to process one thread. The method generates parameter configuration information from the hardware and data conditions of each site and configures the thread pool and connection pool based on that information, so that server resources are used reasonably and the overall efficiency of data extraction and writing is maximized.

Description

Data extraction method and device based on pooling technology
Technical Field
The application relates to the technical field of medical data processing, and in particular to a data extraction technique based on pooling technology.
Background
When extracting data from service databases for distributed storage, the prior art generally uses either the data extraction tool of the distributed database or a self-developed extraction tool. When the data volume is large, a preset number of records must be acquired in batches; a data-writing thread then establishes a connection with each distributed storage node and writes the read data to them. Because writing is slow, the extraction end is forced to pause its read operations and wait for the write to complete; after the write finishes, the thread disconnects from each node, another batch of data is read into memory, and the cycle repeats, so data extraction efficiency is low. Reading must wait while writing is in progress, writing is interrupted while the next batch is read, and connections to the distributed storage nodes must be re-established for every write, which lowers the overall extraction efficiency and increases resource consumption.
The prior art does use thread pools, connection pools and similar techniques, but the thread pool, connection pool and batch extraction size are initialized through manually pre-configured parameters, which cannot promptly meet the requirement of automatically and rapidly extracting data from multiple sites and multiple databases.
Disclosure of Invention
An object of the present application is to provide a data extraction method and apparatus based on pooling technology, aiming to make maximal use of server resources and improve the overall efficiency of data extraction and writing.
To achieve the above object, some embodiments of the present application provide a data extraction method based on pooling technology, which is applied to a server, and the method includes: acquiring configuration information, wherein the configuration information comprises service database information, and/or distributed database information, and/or server information; determining the total data extraction amount, the thread number of a thread pool, the connection number of a connection pool, the primary cache data amount, the CPU core number of a server and the running memory number according to the configuration information; extracting target data from each service database and caching; when the extracted target data reach the primary cache data amount, invoking one or more connections in the connection pool by at least one CPU core by using at least one thread in the thread pool to write the target data into one or more distributed storage nodes of a distributed database; wherein each CPU core is configured to process one thread.
Optionally, the method for determining the total data extraction amount includes: acquiring a target extraction data table according to the service database information in the configuration information; and calculating the size of a single piece of target extraction data and the total data extraction amount according to the data amount of the target extraction data table and the data column information of the table structure.
Optionally, the method for determining the number of threads in the thread pool includes: determining the CPU core number of the server according to the server information in the configuration information; determining the number of threads according to the CPU core number and generating a thread pool; wherein the number of threads is not less than the number of CPU cores and/or the number of threads is a predetermined multiple of the number of CPU cores.
Optionally, the method for determining the connection number of the connection pool includes: determining the CPU core number of the server according to the server information in the configuration information; determining the distributed node number of the distributed database according to the distributed database information in the configuration information; determining the number of connections according to the CPU core number and the distributed node number and generating a connection pool; wherein the number of connections is not less than a product of the number of CPU cores and the number of distributed nodes, and/or the number of connections is a predetermined multiple of the product of the number of CPU cores and the number of distributed nodes.
Optionally, the method for determining the primary cache data amount includes the following steps: determining the running memory amount of the server according to the server information in the configuration information; obtaining a cache coefficient, where the cache coefficient is preset and/or determined based on the running state of the server; calculating the available memory according to the running memory amount and the cache coefficient; and setting the primary cache data amount according to the available memory, the single data size of each target extraction data table and the number of threads.
Optionally, after the target data extracted reaches the primary cache data amount, invoking, by at least one CPU core, one or more connections in the connection pool by using at least one thread in the thread pool to write target data to one or more distributed storage nodes of the distributed database, including: when the extracted target data reaches the primary cache data amount, a CPU core is used for calling one thread in the thread pool, and the target data is pushed to a thread queue; invoking one or more connections in the connection pool, and writing the target data in the thread queue into each distributed storage node of a distributed database respectively; after the writing is completed, the called connection or connections are released to the connection pool.
Optionally, after the extracted target data reaches the primary cache data amount, invoking, by at least one CPU core, one or more connections in the connection pool to write target data to one or more distributed storage nodes of a distributed database using at least one thread in the thread pool, further includes: determining whether the number of CPU cores processing written target data exceeds a preset number; if not exceeded, invoking one or more connections in the connection pool by at least one CPU core using at least one thread in the thread pool to write the target data to one or more distributed storage nodes of a distributed database; and if exceeded, calling an idle thread in the thread pool, pushing the target data to a thread queue, and waiting for the writing threads to finish.
Optionally, the extracting and caching the target data from each service database includes: and determining the number of data extraction processes to be started and target data to be extracted by each data extraction process according to the service database information and/or the distributed database information and/or preset extraction time.
Optionally, the calling one or more connections in the connection pool to write the target data in the thread queue into each distributed storage node of a distributed database includes: calculating a hash value according to the identification information in the cached target data, taking it modulo the number of distributed storage nodes, and determining the distributed storage node corresponding to each item of target data; and calling one or more connections in the connection pool, and writing the target data in the thread queue into the corresponding distributed storage nodes respectively.
According to another aspect of the present application, there is also provided a data extraction device based on pooling technology, including: a data acquisition module for acquiring configuration information, wherein the configuration information comprises service database information, and/or distributed database information, and/or server information; a data calculation module for determining the total data extraction amount, the number of threads in the thread pool, the number of connections in the connection pool, the primary cache data amount, the number of CPU cores of the server and the amount of running memory according to the configuration information; an extraction and caching module for extracting target data from each service database and caching it; and a data processing module for invoking, by at least one CPU core using at least one thread in the thread pool, one or more connections in the connection pool to write target data into one or more distributed storage nodes of the distributed database after the extracted target data reaches the primary cache data amount; wherein each CPU core is configured to correspond to one thread.
According to the technical scheme, the hardware conditions and database structures of each site are automatically acquired, and configuration information (comprising service database information and/or distributed database information and/or server information) is generated from the hardware and data conditions of each site. The thread pool, connection pool and related parameters are then configured according to the configuration information (that is, the total data extraction amount, the number of threads in the thread pool, the number of connections in the connection pool, the primary cache data amount, the number of CPU cores of the server and the amount of running memory are determined); target data is extracted from each service database and cached; and at least one CPU core uses at least one thread in the thread pool to call one or more connections in the connection pool and write the target data into one or more distributed storage nodes of the distributed database. Server resources can thus be used reasonably, and the overall efficiency of data extraction and writing is maximized.
Drawings
Fig. 1 is a flowchart of a data extraction method based on pooling technology according to an embodiment of the present application;
fig. 2 is a diagram of a data extraction architecture according to an embodiment of the present application;
FIG. 3 is a diagram of a data writing architecture according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of an initialization system configuration provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a data extraction device based on pooling technology according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Referring to fig. 1, an embodiment of the present application provides a data extraction method based on a pooling technology, where the method includes:
Step S101: acquiring configuration information, wherein the configuration information comprises service database information, and/or distributed database information, and/or server information;
specifically, the obtained configuration information is the configuration information related to data extraction, including user-defined configuration file information and server information. The user-defined configuration file information may include the service database information to be extracted and the distributed database information to be written; the service database information may cover one or more service databases, and the distributed database information may be the information of each storage node, such as the number of storage nodes and the addresses of the storage nodes. Computer information (server hardware information) of the machine running the data extraction method may also be read, including, for example, the number of CPU cores and the amount of running memory.
The user configuration information may be generated in advance by the user, entered directly by the user, or generated from the user's visual configuration or selection operations after obtaining all database and server information of the data extraction site.
Step S102: determining the total data extraction amount, the thread number of a thread pool, the connection number of a connection pool, the primary cache data amount, the CPU core number of a server and the running memory number according to the configuration information;
Step S103: extracting target data from each service database and caching;
specifically, fig. 2 shows the framework of data extraction, where the service database may be any database such as MySQL, SQL Server, Oracle or PostgreSQL. JDBC (Java DataBase Connectivity) is a Java API for executing SQL statements that provides unified access to multiple relational databases; it consists of a set of classes and interfaces written in the Java language. JDBC provides a benchmark on which higher-level tools and interfaces can be built, enabling database developers to write database applications. Config is the configuration information, from which the service database information, distributed database information and server information can be obtained; the server is the computer device running the data extraction method and may be a local server or a cloud-configured server.
Step S104: when the extracted target data reaches the primary cache data amount, invoking one or more connections in the connection pool by at least one CPU core using at least one thread in the thread pool to write the target data into one or more distributed storage nodes of a distributed database; wherein each CPU core is configured to process one thread. Specifically, fig. 3 shows the framework of data writing, where Config is the configuration information from which the service database information, distributed database information and server information may be obtained, Solvent Client (Pool) represents the connection pool, Thread Pool represents the thread pool, and Retry is a retry mechanism, which can ensure that the data is fully written into the relevant distributed storage database when the network or a program fails.
Specifically, as shown in fig. 4, initializing the system configuration includes: reading the configuration files (comprising the user-defined configuration file and the server hardware information) and performing a comprehensive calculation to obtain the document size (the total amount of data to be extracted), the queue size (the primary cache data amount) and the sizes of the connection pool and the thread pool. The thread pool and connection pool can then be created at those sizes, completing system initialization.
As an alternative embodiment, a method for determining a total amount of data extraction includes:
acquiring a target extraction data table according to the service database information in the configuration information;
and calculating the size of a single piece of target extraction data and the total data extraction amount according to the data amount of the target extraction data table and the data column information of the table structure. Specifically, according to the service database information to be extracted in the user configuration information, the data amount in the target extraction data table and the data column information of the table structure are queried, and the size of a single piece of target extraction data and the total data amount are calculated; single data size = per-column cell size × number of columns of the target extraction data table.
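The sizing arithmetic above can be sketched as follows; the class, method names and sample figures are illustrative, not taken from the patent:

```java
// Illustrative sketch of the sizing rules described above:
// single data size = per-column cell size x number of columns,
// total extraction amount = sum over tables of (rows x single data size).
public class ExtractionSizing {

    /** single data size = per-column cell size x number of columns */
    static long singleRowSize(long cellSizeBytes, int columnCount) {
        return cellSizeBytes * columnCount;
    }

    /** total extraction amount, summed over the target extraction tables */
    static long totalExtractionBytes(long[] rowCounts, long[] rowSizes) {
        long total = 0;
        for (int i = 0; i < rowCounts.length; i++) {
            total += rowCounts[i] * rowSizes[i];
        }
        return total;
    }

    public static void main(String[] args) {
        long rowSize = singleRowSize(64, 10); // 10 columns of ~64 bytes each
        System.out.println(totalExtractionBytes(new long[]{10000}, new long[]{rowSize}));
    }
}
```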
As an alternative embodiment, a method of determining a number of threads of a thread pool includes:
Determining the CPU core number of the server according to the server information in the configuration information;
determining the number of threads according to the CPU core number and generating a thread pool;
wherein the number of threads is not less than the number of CPU cores and/or the number of threads is a predetermined multiple of the number of CPU cores.
Specifically, a thread pool containing a certain number of threads is generated according to the number of CPU cores of the server, where the number of threads in the thread pool = the number of CPU cores × 2. Each CPU core may use one data-writing thread. To guard against thread failure, or against the case where a write takes too long and the pipeline would otherwise stall waiting for it, a certain number of standby threads must be reserved in the thread pool; when the main threads' cache space is occupied, data is cached to the standby threads.
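A minimal sketch of this sizing rule (threads = CPU cores × 2), using Java's standard executor; the class and method names are assumptions for illustration:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: thread pool sized to CPU cores x 2, as described above.
// The factor of 2 reserves standby threads for stalled or failed writes.
public class WriterPool {

    /** number of threads = number of CPU cores x 2 */
    static int threadCount(int cpuCores) {
        return cpuCores * 2;
    }

    /** build a fixed-size writer pool from the detected core count */
    static ExecutorService newWriterPool(int cpuCores) {
        return Executors.newFixedThreadPool(threadCount(cpuCores));
    }
}
```

In practice `cpuCores` would come from the server information in the configuration (e.g. `Runtime.getRuntime().availableProcessors()`).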
As an alternative embodiment, a method of determining a number of connections of a connection pool includes:
determining the CPU core number of the server according to the server information in the configuration information;
determining the distributed node number of the distributed database according to the distributed database information in the configuration information;
determining the number of connections according to the CPU core number and the distributed node number and generating a connection pool; wherein the number of connections is not less than a product of the number of CPU cores and the number of distributed nodes, and/or the number of connections is a predetermined multiple of the product of the number of CPU cores and the number of distributed nodes.
Specifically, a connection pool containing a certain number of connections is generated according to the number of CPU cores of the server and the number of distributed nodes, where the number of connections in the connection pool = the number of CPU cores × the number of distributed nodes × 2. Each CPU core can use one thread to call one or more connections and write data to one or more distributed storage nodes, and multiple CPU cores can use multiple threads and multiple connections to write data to multiple distributed storage nodes simultaneously. At the same time, to guard against connection crashes or long unresponsive failures, a certain number of spare connections must be reserved in the connection pool.
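The connection-count rule (cores × nodes × 2) reduces to a one-line calculation; this sketch is illustrative:

```java
// Sketch of the connection-pool sizing rule described above:
// connections = CPU cores x distributed nodes x 2, where the
// factor of 2 reserves spare connections for failures.
public class ConnectionSizing {

    static int connectionCount(int cpuCores, int nodeCount) {
        // not less than cores x nodes; doubled for spares
        return cpuCores * nodeCount * 2;
    }

    public static void main(String[] args) {
        // e.g. a 4-core server writing to 4 distributed storage nodes
        System.out.println(connectionCount(4, 4));
    }
}
```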
As an alternative embodiment, a method for determining an amount of data buffered at a time includes: determining the running memory number of the server according to the server information in the configuration information; obtaining a cache coefficient; the cache coefficient is preset and/or determined based on a service end's executable running state; calculating available memory according to the running memory number and the cache coefficient; and setting one-time cache data volume according to the available memory, the data size of single data of each target extraction data table and the thread number.
Specifically, the available memory is calculated from the acquired running memory amount and the cache coefficient, and the primary cache data amount is set according to the available memory, the single data size of each target extraction data table, the number of threads and the number of processes. When a single piece of data is larger, the number of records cached is correspondingly reduced, so the method automatically adapts to different database structures at different sites; setting the cache coefficient minimizes the impact of data extraction on normal service operation. Primary cache data amount of one data extraction process = (memory size × cache coefficient) / (single data size × number of processes).
Optionally, a suitable cache coefficient may be selected based on the running state of the server. For example, with 8G of memory where two currently running systems already occupy 2G of running memory, the cache coefficient may be set to 0.7; if only one currently running system occupies 0.5G, the cache coefficient may be set to 0.8 or even higher. The memory space can thus be fully used for batch operation without affecting the normal running of the service systems.
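The cache-amount formula above can be sketched directly; the names and sample numbers here are illustrative:

```java
// Sketch of the formula described above:
// rows per cache pass = (memory x cache coefficient) / (single data size x process count).
public class CacheSizing {

    static long batchRows(long memoryBytes, double cacheCoefficient,
                          long singleRowBytes, int processCount) {
        // available memory = memory x coefficient; divide by per-row cost per process
        return (long) (memoryBytes * cacheCoefficient) / (singleRowBytes * processCount);
    }

    public static void main(String[] args) {
        // e.g. 8G memory, coefficient 0.7, 640-byte rows, 1 extraction process
        System.out.println(batchRows(8L * 1024 * 1024 * 1024, 0.7, 640, 1));
    }
}
```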
As an alternative embodiment, said writing, by at least one CPU core, target data to one or more distributed storage nodes of the distributed database using at least one thread in the thread pool to invoke one or more connections in the connection pool after the target data extracted reaches the one-time cache data amount, comprises: when the extracted target data reaches the primary cache data amount, a CPU core is used for calling one thread in the thread pool, and the target data is pushed to a thread queue; invoking one or more connections in the connection pool, and writing the target data in the thread queue into each distributed storage node of a distributed database respectively; after the writing is completed, the called connection or connections are released to the connection pool.
Specifically, stream processing is used to acquire data from each service database and buffer it. Once the primary cache data amount is reached, one thread of the thread pool is called and the buffered data is pushed to a thread queue; a hash value is calculated from the EMPIID in the buffered data and taken modulo the number of nodes to determine the distributed storage node corresponding to each piece of data; one or more connections in the connection pool are called, and the data in the thread queue is written to the respective distributed storage nodes; the connections are released back to the connection pool after the write is completed.
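The routing rule above — hash the EMPIID, then take it modulo the node count — might look like this; the use of Java's `String.hashCode` is an assumption, since the patent does not name a particular hash function:

```java
// Sketch of the node-routing rule: node index = hash(EMPIID) mod node count.
// EMPIID is the identifier named in the text; the hash choice here is illustrative.
public class NodeRouter {

    static int nodeFor(String empiId, int nodeCount) {
        // floorMod keeps the result in [0, nodeCount) even for negative hash codes
        return Math.floorMod(empiId.hashCode(), nodeCount);
    }
}
```

The same EMPIID always maps to the same node, so records for one patient land on one distributed storage node.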
As an alternative embodiment, after the extracted target data reaches the primary cache data amount, the writing, by at least one CPU core, of the target data to one or more distributed storage nodes of the distributed database using at least one thread in the thread pool to invoke one or more connections in the connection pool further includes: determining whether the number of CPU cores processing written target data exceeds a preset number (the number of CPU cores of the server); if not exceeded, invoking one or more connections in the connection pool by at least one CPU core using at least one thread in the thread pool to write the target data to one or more distributed storage nodes of the distributed database; if exceeded, calling an idle thread in the thread pool, pushing the target data to its thread queue and waiting for the writing threads to finish, until all standby threads in the thread pool are occupied by cached data and no idle thread remains in the thread pool.
As an alternative embodiment, the extracting and caching the target data from each service database includes: and determining the number of data extraction processes to be started and target data to be extracted by each data extraction process according to the service database information and/or the distributed database information and/or preset extraction time.
As an alternative embodiment, the calling one or more connections in the connection pool to write the target data in the thread queue to each distributed storage node of the distributed database includes:
calculating a hash value according to the identification information in the cached target data, taking it modulo the number of distributed storage nodes, and determining the distributed storage node corresponding to each item of target data;
and calling one or more connections in the connection pool, and writing the target data in the thread queue into the corresponding distributed storage nodes respectively.
As a specific embodiment of the present application: only one program running node is provided with a data extraction process, the number of primary caches is reduced from the available memory amount in the prior art to the number of threads in the 1/thread pool of the available memory amount by setting the thread pool, so that the writing time of the single cache data is reduced, the data extraction process is prevented from being blocked for waiting for the cache data to be written in for a long time, and the data extraction efficiency is improved; by setting up the connection pool, the method can simultaneously write data into 4 nodes by 4 threads which are the threads with the same number as the CPU cores, avoids the problem that in the prior art, only one single thread is required to be connected with each distributed storage node for writing after the connection is established, and improves the data writing efficiency in a mode of disconnecting after the writing is completed.
Specifically: according to the scheme, 8000 pieces of data are read and cached once in the prior art, a thread queue is used for establishing connection with each storage node after waiting for reading, the connection is disconnected after writing is completed in sequence, and 8000 pieces of data are read and cached again. The method comprises the steps of converting 1000 data into one-time reading and caching of the data, pushing the data into a first thread queue in a thread pool, enabling a first CPU core to be responsible for processing the thread queue carrying the 1000 data, calling connection with each storage node in the connection pool, sequentially writing, releasing connection after writing is completed, simultaneously enabling a data reading end to continue reading and caching next 1000 data after pushing is completed, continuing pushing the data into a second thread queue in the thread pool, enabling the second CPU core to be responsible for processing the thread queue carrying the 1000 data, calling connection with each storage node in the connection pool, sequentially writing, releasing connection after writing is completed, and pushing and writing the 1000 data of a third batch and a fourth batch. When the fifth to eighth batches of data are read, whether the previous 4 batches of data are written is monitored, one of the thread queues of the fifth to eighth batches which are already cached is called after the writing is completed, the data are written into each storage node through the CPU core which completes the writing task, and the writing of one thread is completed by pushing the same, and the queued threads which are already cached with the data in the thread pool are written. 
In this way, apart from a short start-up period, the method keeps 4 threads writing data in parallel at all times, and compared with the prior-art single-thread writing mode the speed is improved by more than 4 times. It also avoids waiting for data reads after a write completes and avoids re-establishing connections with each storage node before the next write, further improving data extraction efficiency. The method may read one batch per write until every CPU core is occupied by a thread; after multi-threaded parallel writing begins, it switches to reading one batch per queue until every thread in the thread pool is occupied by cached data, further shortening the waiting time. Alternatively, 4 batches of read data may be written in parallel at the same time while newly read cache data queues on the other threads in the thread pool.
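The pipelined pattern described above — a reader producing fixed-size batches while a fixed pool of writer threads drains them in parallel — can be sketched with Java's standard executor. The batch "writes" here are simulated stand-ins, not real node I/O:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the pipelined extraction: the reader keeps producing
// batchSize-row batches while `writers` threads drain them in parallel.
public class PipelinedExtract {

    static int runBatches(int totalRows, int batchSize, int writers) {
        ExecutorService pool = Executors.newFixedThreadPool(writers);
        int batches = (totalRows + batchSize - 1) / batchSize; // ceiling division
        CountDownLatch done = new CountDownLatch(batches);
        for (int b = 0; b < batches; b++) {
            // stand-in for "write this cached batch to the storage nodes"
            pool.submit(done::countDown);
        }
        try {
            done.await(); // wait until every batch has been "written"
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return batches;
    }
}
```

With 8000 rows in 1000-row batches and 4 writers, 8 batches flow through 4 parallel writers, matching the example in the text.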
As an alternative embodiment, the number of processes to be started and the data to be extracted by each process are determined according to at least one of the service database information to be extracted, the distributed database information to be written, and the extraction-time requirement. Specifically, the number of processes and the data allocated to each process may be determined from the number of data tables, the number of available data extraction nodes, the data amount in each table, the hardware information of the servers that can run extraction, and a preset or user-configured extraction time requirement. For example, suppose 3 tables are to be extracted: table A has 10000 pieces of data and one extraction process on a 4-core computer takes 10 seconds; table B has 5000 pieces and takes 5 seconds; table C has 5000 pieces and takes 5 seconds.
If there are no other requirements, a 4-core data extraction node server is used by default: one data extraction process is started and tables A, B and C are extracted in sequence, so the total data extraction time is 10+5+5=20 seconds. When the user sets or presets a data extraction time of less than 10 seconds, two 4-core data extraction nodes are needed, running two data extraction processes simultaneously: one process extracts the data of table A while the other extracts the data of tables B and C, for a total extraction time of 10 seconds. When completion within 5 seconds is required, four 4-core data extraction nodes may be used, running four processes simultaneously: table A is segmented so that two processes extract its data in parallel, one process extracts table B and one extracts table C, for a total extraction time of 5 seconds. Alternatively, one 8-core data extraction node plus two 4-core data extraction nodes may be used, running three processes: the 8-core node runs one process to extract table A, and each 4-core node runs one process to extract table B or table C, again completing in 5 seconds.
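The time arithmetic in the example above can be checked with a small helper (a sketch; the model that total time equals the largest sequential workload of any one process is inferred from the example, not a formula stated in the application):

```python
def total_extraction_time(process_assignments):
    # processes run in parallel; the tables inside one process run
    # sequentially, so total time is the largest per-process sum
    return max(sum(table_times) for table_times in process_assignments)

# one 4-core node, one process extracting tables A, B and C in turn
one_process = total_extraction_time([[10, 5, 5]])      # 20 seconds
# two nodes: one process for table A, one for tables B and C
two_processes = total_extraction_time([[10], [5, 5]])  # 10 seconds
# four processes: table A split in half, B and C on their own
four_processes = total_extraction_time([[5], [5], [5], [5]])  # 5 seconds
```

The helper makes the scheduling trade-off explicit: adding processes only helps once the largest single workload (table A here) is itself split.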
The connection pooling technique in the present application can quickly acquire the relevant connection and enforce a survival time for each connection while in use: when the survival time exceeds the preset duration, the connection is invalidated, a new connection is acquired for retry, and the data is rewritten. This avoids the problem of a connection being blocked for a long time because a write cannot complete, which would degrade data extraction efficiency. The write thread pooling technique writes data through multiple threads, ensuring that the performance of the server is utilized to the greatest extent.
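The survival-time behaviour described above might be sketched as follows (an illustration only; the class names, the TTL default and the replace-on-expiry policy are assumptions, not the application's implementation):

```python
import time

class PooledConnection:
    def __init__(self, ttl_seconds):
        self.created = time.monotonic()
        self.ttl = ttl_seconds

    def expired(self):
        # survival time exceeded the preset duration: connection is invalid
        return time.monotonic() - self.created > self.ttl

class ConnectionPool:
    def __init__(self, size, ttl_seconds=30.0):
        self.ttl = ttl_seconds
        self._free = [PooledConnection(ttl_seconds) for _ in range(size)]

    def acquire(self):
        # hand out a live connection, discarding any whose TTL has lapsed
        while self._free:
            conn = self._free.pop()
            if not conn.expired():
                return conn
        # every pooled connection was stale: replace with a fresh one
        return PooledConnection(self.ttl)

    def release(self, conn):
        if not conn.expired():
            self._free.append(conn)
```

Checking expiry on both acquire and release means a write that blocked past the TTL never returns its connection to the pool, matching the invalidate-and-retry behaviour described above.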
In this application, a retry mechanism may also be added: the retry mechanism can ensure that all of the data is still written to the associated distributed storage database when the network or program encounters an error.
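A bounded retry of this kind could look like the following sketch (the retry count, the backoff policy and the `flaky_write` example writer are illustrative assumptions, not details from the application):

```python
import time

def write_with_retry(write_fn, batch, max_retries=3, backoff_seconds=0.0):
    # retry a failed write so the data is still persisted after a
    # transient network or program error (policy values are illustrative)
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return write_fn(batch)
        except Exception as exc:  # in practice, catch the specific I/O errors
            last_error = exc
            time.sleep(backoff_seconds * attempt)
    raise last_error

# a writer that fails twice with a transient error, then succeeds
calls = {"n": 0}
def flaky_write(batch):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return len(batch)
```

Re-raising the last error after the retry budget is exhausted lets the caller distinguish a permanently failed batch from a transient glitch.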
According to another embodiment of the present application, there is further provided a data extraction device based on pooling technology, as shown in fig. 5, including:
the data acquisition module is used for acquiring configuration information, wherein the configuration information comprises service database information, and/or distributed database information, and/or server information;
the data calculation module is used for determining the total data extraction amount, the thread number of the thread pool, the connection number of the connection pool, the primary cache data amount, the CPU core number of the server and the running memory number according to the configuration information;
The extraction and caching module is used for extracting target data from each business database and caching the target data;
the data processing module is used for invoking, by at least one CPU core, one or more connections in the connection pool using at least one thread in the thread pool to write target data into one or more distributed storage nodes of the distributed database after the extracted target data reaches the primary cache data amount; wherein each CPU core is configured to correspond to one thread.
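The primary-cache-data-amount determination performed by the data calculation module can be illustrated as follows (a sketch based on the calculation described in this application: available memory is the running memory scaled by a cache coefficient, then divided among the threads by the size of a single row; the flooring and the sample figures are assumptions):

```python
def one_cache_data_amount(running_memory_bytes, cache_coefficient,
                          row_size_bytes, thread_count):
    # available memory = running memory x cache coefficient
    available = running_memory_bytes * cache_coefficient
    # rows each thread can cache per batch, floored to a whole row count
    return int(available // (row_size_bytes * thread_count))

# 8 GiB of running memory, coefficient 0.8, 1 KiB per row, 4 threads
rows = one_cache_data_amount(8 * 1024**3, 0.8, 1024, 4)
```

With these sample figures, each writer thread would cache roughly 1.7 million rows before a batch write is triggered.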
For specific limitations of the apparatus, reference may be made to the above limitations of the data extraction method based on pooling technology, which are not repeated here. The various modules/units in the data extraction device based on pooling technology described above may be implemented in whole or in part by software, hardware, or combinations thereof. The above modules/units may be embedded in hardware in, or independent of, a processor in the computer device, or may be stored as software in a memory in the computer device, so that the processor may call and execute the operations corresponding to the above modules.
In one embodiment, the present application provides a computer device comprising: the device comprises a communication interface, a processor, a memory and a bus, wherein the communication interface, the processor and the memory are mutually connected through the bus;
The memory stores computer program instructions that, when executed, cause the processor to perform the steps of the data extraction method based on pooling technology.
The computer equipment provided by the embodiment of the application can be a server, a client or other computer network communication equipment; fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Processor 601, memory 602, interface 604 and bus 605: the processor 601 is connected to the memory 602 and the interface 604, the bus 605 connects the processor 601, the memory 602 and the interface 604 respectively, and the interface 604 is used to receive or transmit data. The processor 601 is a single-core or multi-core central processing unit, or an application-specific integrated circuit, or one or more integrated circuits configured to implement embodiments of the present invention. The memory 602 may be a random access memory (RAM) or a non-volatile memory, such as at least one hard disk memory. The memory 602 is used to store computer-executable instructions. Specifically, the computer-executable instructions may include the program 603.
In this embodiment, when the processor 601 invokes the program 603, the computer device in fig. 6 may perform the operations of the data extraction method based on pooling technology, which are not described herein again.
It should be appreciated that the processor provided by the above embodiments of the present application may be a central processing unit (CPU), but may also be another general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should also be understood that the number of processors in the computer device in the above embodiment in the present application may be one or plural, and may be adjusted according to the actual application scenario, which is merely illustrative and not limiting. The number of the memories in the embodiment of the present application may be one or more, and may be adjusted according to the actual application scenario, which is only illustrative and not limiting.
It should be further noted that, when the computer device includes a processor (or a processing unit) and a memory, the processor in the present application may be integrated with the memory, or the processor and the memory may be connected through an interface, which may be adjusted according to an actual application scenario, and is not limited.
The present application provides a chip system comprising a processor for supporting a computer device (client or server) to implement the functions of the controller involved in the above method, e.g. to process data and/or information involved in the above method. In one possible design, the chip system further includes memory to hold the necessary program instructions and data. The chip system can be composed of chips, and can also comprise chips and other discrete devices.
In another possible design, when the chip system is a chip in a user equipment or an access network or the like, the chip comprises: a processing unit, which may be, for example, a processor, and a communication unit, which may be, for example, an input/output interface, pins or circuitry. The processing unit may execute the computer-executable instructions stored in the storage unit to cause the chip in the client or the management server or the like to perform the steps of the data extraction method based on pooling technology. Alternatively, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip in a client or a management server, such as a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM), or the like.
It should be understood that the methods and/or embodiments of the present application may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. The above-described functions defined in the method of the present application are performed when the computer program is executed by a processing unit.
It should be appreciated that the controllers or processors referred to in the above embodiments of the present application may be central processing units (CPU), but may also be other general purpose processors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should also be understood that the number of processors or controllers in the computer device or the chip system and the like in the above embodiments of this application may be one or more, and may be adjusted according to the actual application scenario, which is merely illustrative and not limiting. The number of memories in the embodiments of the present application may likewise be one or more, adjusted according to the actual application scenario, which is also merely illustrative and not limiting.
It should be noted that, the computer readable medium described in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including an object-oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowchart or block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer readable medium carries one or more computer readable instructions executable by a processor to implement the steps of the methods and/or techniques of the various embodiments of the present application described above. The computer may be a computer device (client or server or other computer network communication device) as described above.
In a typical configuration of the present application, the terminals and the devices of the service network each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device.
In addition, the embodiment of the application also provides a computer program which is stored in the computer equipment, so that the computer equipment executes the method for executing the control code.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, using Application Specific Integrated Circuits (ASIC), a general purpose computer or any other similar hardware device. In some embodiments, the software programs of the present application may be executed by a processor to implement the above steps or functions. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, the terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the embodiments of the present application, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and "comprising," when used in this specification, do not preclude other elements or steps, and the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order. In the description of the present application, unless otherwise indicated, "/" means that the associated object is an "or" relationship, e.g., a/B may represent a or B; the term "and/or" in this application is merely an association relation describing an association object, and means that three kinds of relations may exist, for example, a and/or B may mean: there are three cases, a alone, a and B together, and B alone, wherein a, B may be singular or plural. 
The word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination" or "in response to detection", depending on the context. Similarly, the phrase "if determined" or "if (stated condition or event) is detected" may be interpreted as "when determined" or "in response to determination" or "when (stated condition or event) is detected" or "in response to detection of (stated condition or event)", depending on the context.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present application.

Claims (9)

1. A data extraction method based on pooling technology, characterized in that the method is applied to a server and comprises the following steps:
acquiring configuration information, wherein the configuration information comprises service database information, and/or distributed database information, and/or server information;
determining the total data extraction amount, the thread number of a thread pool, the connection number of a connection pool, the primary cache data amount, the CPU core number of a server and the running memory number according to the configuration information;
extracting target data from each service database and caching;
when the extracted target data reach the primary cache data amount, invoking one or more connections in the connection pool by at least one CPU core by using at least one thread in the thread pool to write the target data into one or more distributed storage nodes of a distributed database; wherein each CPU core is configured to process one thread;
The method for determining the data quantity of one cache comprises the following steps: determining the running memory number of the server according to the server information in the configuration information; obtaining a cache coefficient; the cache coefficient is preset and/or determined based on the running state of the server; calculating available memory according to the running memory number and the cache coefficient; and setting one-time cache data volume according to the available memory, the data size of single data of each target extraction data table and the thread number.
2. The method of claim 1, wherein the method of determining the total amount of data extraction comprises:
acquiring a target extraction data table according to the service database information in the configuration information;
and calculating to obtain the size and the total data extraction amount of the single target extraction data according to the data amount of the target extraction data table and the data column information of the table structure.
3. The method of claim 1, wherein the method of determining the number of threads of the thread pool comprises:
determining the CPU core number of the server according to the server information in the configuration information;
determining the number of threads according to the CPU core number and generating a thread pool;
wherein the number of threads is not less than the number of CPU cores and/or the number of threads is a predetermined multiple of the number of CPU cores.
4. The method of claim 1, wherein the method of determining the number of connections of the connection pool comprises:
determining the CPU core number of the server according to the server information in the configuration information;
determining the distributed node number of the distributed database according to the distributed database information in the configuration information;
determining the number of connections according to the CPU core number and the distributed node number and generating a connection pool;
wherein the number of connections is not less than a product of the number of CPU cores and the number of distributed nodes, and/or the number of connections is a predetermined multiple of the product of the number of CPU cores and the number of distributed nodes.
5. The method of claim 1, wherein said invoking, by at least one CPU core, one or more connections in the connection pool using at least one thread in the thread pool to write target data to one or more distributed storage nodes of the distributed database after the extracted target data reaches the primary cache data amount comprises:
when the extracted target data reaches the primary cache data amount, a CPU core is used for calling one thread in the thread pool, and the target data is pushed to a thread queue;
Invoking one or more connections in the connection pool, and writing the target data in the thread queue into each distributed storage node of a distributed database respectively;
after the writing is completed, the called connection or connections are released to the connection pool.
6. The method of claim 1, wherein said invoking, by at least one CPU core, one or more connections in the connection pool using at least one thread in the thread pool to write target data to one or more distributed storage nodes of a distributed database after the extracted target data reaches the primary cache data amount further comprises:
determining whether the number of CPU cores for processing the written target data exceeds a preset number;
if the target data is not exceeded, invoking one or more connections in the connection pool by at least one CPU core by using at least one thread in the thread pool to write the target data to one or more distributed storage nodes of a distributed database;
and if the target data exceeds the target data, calling an idle thread in the thread pool, pushing the target data to a thread queue, and waiting for the end of writing of the writing thread.
7. The method of claim 1, wherein extracting and caching the target data from each service database comprises:
and determining the number of data extraction processes to be started and target data to be extracted by each data extraction process according to the service database information and/or the distributed database information and/or preset extraction time.
8. The method of claim 5, wherein said invoking one or more connections in said connection pool writes said target data in said thread queue to respective distributed storage nodes of a distributed database, comprising:
calculating a hash value according to the identification information in the cached target data, taking the modulus with respect to the number of distributed storage nodes, and determining the distributed storage node corresponding to each item of target data;
and calling one or more connections in the connection pool, and writing the target data in the thread queue into the corresponding distributed storage nodes respectively.
9. Data extraction device based on pooling technique, characterized by comprising:
the data acquisition module is used for acquiring configuration information, wherein the configuration information comprises service database information, and/or distributed database information, and/or server information;
The data calculation module is used for determining the total data extraction amount, the thread number of the thread pool, the connection number of the connection pool, the primary cache data amount, the CPU core number of the server and the running memory number according to the configuration information;
the extraction and caching module is used for extracting target data from each business database and caching the target data;
the data processing module is used for invoking, by at least one CPU core, one or more connections in the connection pool using at least one thread in the thread pool to write target data into one or more distributed storage nodes of the distributed database after the extracted target data reaches the primary cache data amount; wherein each CPU core is configured to correspond to one thread;
the data calculation module is further used for executing the following steps when determining the data quantity of one cache: determining the running memory number of the server according to the server information in the configuration information; obtaining a cache coefficient; the cache coefficient is preset and/or determined based on the running state of the server; calculating available memory according to the running memory number and the cache coefficient; and setting one-time cache data volume according to the available memory, the data size of single data of each target extraction data table and the thread number.
CN202311347005.7A 2023-10-18 2023-10-18 Data extraction method and device based on pooling technology Active CN117093640B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311347005.7A CN117093640B (en) 2023-10-18 2023-10-18 Data extraction method and device based on pooling technology

Publications (2)

Publication Number Publication Date
CN117093640A CN117093640A (en) 2023-11-21
CN117093640B true CN117093640B (en) 2024-01-23

Family

ID=88781452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311347005.7A Active CN117093640B (en) 2023-10-18 2023-10-18 Data extraction method and device based on pooling technology

Country Status (1)

Country Link
CN (1) CN117093640B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019154394A1 (en) * 2018-02-12 2019-08-15 中兴通讯股份有限公司 Distributed database cluster system, data synchronization method and storage medium
CN112685196A (en) * 2020-12-24 2021-04-20 平安普惠企业管理有限公司 Thread pool management method, device, equipment and medium suitable for distributed technology
CN112905323A (en) * 2021-02-09 2021-06-04 泰康保险集团股份有限公司 Data processing method and device, electronic equipment and storage medium
CN113157410A (en) * 2021-03-30 2021-07-23 北京大米科技有限公司 Thread pool adjusting method and device, storage medium and electronic equipment
CN113590285A (en) * 2021-07-23 2021-11-02 上海万物新生环保科技集团有限公司 Method, system and equipment for dynamically setting thread pool parameters
CN114880386A (en) * 2022-04-06 2022-08-09 北京宇信科技集团股份有限公司 Task scheduling platform and task scheduling method
CN115525631A (en) * 2022-10-31 2022-12-27 华润数字科技有限公司 Database data migration method, device, equipment and storage medium
WO2023065749A1 (en) * 2021-10-20 2023-04-27 北京锐安科技有限公司 Distributed database embedding method and apparatus, and device and storage medium
CN116303494A (en) * 2023-02-22 2023-06-23 国泰君安证券股份有限公司 System and method for carrying out consistency analysis on massive multi-source heterogeneous data of certificate core transaction system based on distributed database

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7788243B2 (en) * 2006-09-08 2010-08-31 Sybase, Inc. System and methods for optimizing data transfer among various resources in a distributed environment
US9720858B2 (en) * 2012-12-19 2017-08-01 Nvidia Corporation Technique for performing memory access operations via texture hardware
US10339152B2 (en) * 2016-08-29 2019-07-02 International Business Machines Corporation Managing software asset environment using cognitive distributed cloud infrastructure

Non-Patent Citations (1)

Title
Application of pooled-resource technology in distributed Web systems; Zheng Qiumei, Ren Pinghong, Wu Yi; Control Engineering of China (05); full text *

Also Published As

Publication number Publication date
CN117093640A (en) 2023-11-21

Similar Documents

Publication Publication Date Title
US11106795B2 (en) Method and apparatus for updating shared data in a multi-core processor environment
KR20210011451A (en) Embedded scheduling of hardware resources for hardware acceleration
US9747210B2 (en) Managing a lock to a resource shared among a plurality of processors
CN103970520A (en) Resource management method and device in MapReduce framework and framework system with device
CN111708738B (en) Method and system for realizing interaction of hadoop file system hdfs and object storage s3 data
CN114580344B (en) Test excitation generation method, verification system and related equipment
WO2023103296A1 (en) Write data cache method and system, device, and storage medium
US10331499B2 (en) Method, apparatus, and chip for implementing mutually-exclusive operation of multiple threads
CN110865888A (en) Resource loading method and device, server and storage medium
CN107016016B (en) Data processing method and device
JP2017538212A (en) Improved function callback mechanism between central processing unit (CPU) and auxiliary processor
US10459771B2 (en) Lightweight thread synchronization using shared memory state
US20160188456A1 (en) Nvram-aware data processing system
CN110851276A (en) Service request processing method, device, server and storage medium
EP2869189A1 (en) Boot up of a multiprocessor computer
CN109347936B (en) Redis proxy client implementation method, system, storage medium and electronic device
CN117093640B (en) Data extraction method and device based on pooling technology
US20100199067A1 (en) Split Vector Loads and Stores with Stride Separated Words
US8707449B2 (en) Acquiring access to a token controlled system resource
US20180139149A1 (en) Multi-cloud resource reservations
CN112068960A (en) CPU resource allocation method, device, storage medium and equipment
US20130152104A1 (en) Handling of synchronous operations realized by means of asynchronous operations
JP7427775B2 (en) Stored procedure execution method, device, database system, and storage medium
CN114281872B (en) Method, device and equipment for generating distributed serial number and readable storage medium
WO2017011021A1 (en) Systems and methods facilitating reduced latency via stashing in systems on chips

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant