CN114925125A

CN114925125A - Data processing method, device and system, electronic equipment and storage medium

Info

Publication number: CN114925125A
Application number: CN202210582055.2A
Authority: CN
Inventors: 罗志权
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2022-05-26
Filing date: 2022-05-26
Publication date: 2022-08-19

Abstract

The application provides a data processing method, apparatus and system, an electronic device and a storage medium, and belongs to the technical field of artificial intelligence. The method comprises the following steps: reading an original data set and index information from a preset database; preprocessing the original data set to obtain a preliminary data set, wherein the preliminary data set comprises M pieces of preliminary subdata, the index information comprises N primary keys, each primary key is used for identifying one piece of preliminary subdata, and N is less than or equal to M; marking each piece of preliminary subdata with a preset unique identifier to obtain target data, wherein each piece of target data comprises a piece of target subdata and a corresponding unique identifier; writing the target data into a target database according to the primary key; writing the unique identifier into a preset message queue according to the input sequence; and sending the unique identifier of the message queue to K consumption ends so that the consumption ends acquire the target data from the target database according to the unique identifier, wherein K is less than or equal to M. The data processing method and apparatus can improve data processing efficiency.

Description

Data processing method, device and system, electronic equipment and storage medium

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a data processing method, apparatus and system, an electronic device, and a storage medium.

Background

At present, during data processing, a production end often writes the data to be processed into a message queue multiple times. As a result, some data is extracted repeatedly and the same data is processed more than once, which reduces the efficiency of data processing.

Disclosure of Invention

The embodiments of the present application mainly aim to provide a data processing method, an apparatus and a system, an electronic device and a storage medium, which aim to improve the efficiency of data processing.

In order to achieve the above object, a first aspect of an embodiment of the present application provides a data processing method, where the method includes:

reading an original data set and index information from a preset database;

preprocessing the original data set to obtain a preliminary data set; the preliminary data set comprises M pieces of preliminary subdata, the index information comprises N primary keys, each primary key is used for identifying one piece of preliminary subdata, and N is smaller than or equal to M;

marking each preliminary subdata by a preset unique identifier to obtain target data, wherein each target data comprises a piece of target subdata and a corresponding unique identifier;

writing the target data into a target database according to the primary key;

writing the unique identification into a preset message queue according to a preset input sequence;

and providing the unique identification of the message queue to K consumption terminals according to the output sequence of the message queue so that each consumption terminal acquires the unique identification and acquires the target data from the target database according to the unique identification, wherein K is less than or equal to M.

In some embodiments, the step of marking each preliminary sub-data by a preset unique identifier to obtain target data includes:

acquiring M unique identifications, wherein the unique identifications are character strings;

sequencing the unique identification to obtain a first identification sequence;

and writing the unique identifier into a preset tag frame corresponding to each preliminary subdata according to the first identifier sequence to obtain the target data.

In some embodiments, the step of writing the target data to a target database according to the primary key comprises:

performing keyword extraction on each primary key to obtain an index keyword for each preliminary subdata;

identifying the position of each target datum according to the index key words to obtain the row characteristics and the column characteristics of each target subdata, wherein the target subdata is derived from the preliminary subdata;

and writing each corresponding target data into the target database according to the row characteristics and the column characteristics.

In some embodiments, the writing the unique identifier into a preset message queue according to a preset input sequence includes:

acquiring the input sequence, wherein the input sequence is determined according to the character length of the unique identifier;

sequencing the unique identification according to the input sequence to obtain a second identification sequence;

and writing the unique identifier into the message queue according to the second identifier sequence.
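The ordering step above can be sketched as follows. This is only an illustration under stated assumptions: the function name `enqueue_by_length`, the `longest_first` flag and the use of an in-memory `collections.deque` as the message queue are hypothetical, and the text only says that the input sequence is determined by character length, not in which direction.

```python
from collections import deque


def enqueue_by_length(identifiers, longest_first=True):
    """Sort identifiers by character length, then append them to a FIFO queue.

    Assumption: the direction (longest first) is illustrative only.
    """
    # sorted() is stable, so identifiers of equal length keep their relative order.
    second_sequence = sorted(identifiers, key=len, reverse=longest_first)
    message_queue = deque()  # hypothetical stand-in for a real message queue
    for uid in second_sequence:
        message_queue.append(uid)
    return message_queue


message_queue = enqueue_by_length(["id-7", "order-0001", "u42"])
print(list(message_queue))
```

Because the queue is first-in first-out, the output sequence seen by the consumption ends is the same as this second identification sequence.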

In some embodiments, the step of preprocessing the raw data set to obtain a preliminary data set includes:

carrying out data cleaning processing on the original data set to obtain a first data set;

and carrying out data deduplication processing on the first data set to obtain the preliminary data set.

In some embodiments, the step of providing the unique identifier of the message queue to K consumers according to the output order of the message queue, so that each consumer acquires the unique identifier and acquires the target data from the target database according to the unique identifier includes:

acquiring a data sending instruction;

and sequentially sending the unique identifier of the message queue to K consumption terminals according to the data sending instruction and the output sequence, so that each consumption terminal performs feature extraction on the obtained unique identifier to obtain a tag field value, and performing screening processing on target data of the target database according to the tag field value to obtain the target subdata corresponding to the unique identifier.
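A minimal sketch of the consumption-end lookup described above is given below. The text does not specify how the tag field value is derived from the unique identifier, so the sketch assumes it is simply the identifier with surrounding whitespace removed; the names `extract_tag_value`, `fetch_target_subdata`, `tag` and `payload`, and the list-of-dicts target database, are all hypothetical.

```python
def extract_tag_value(unique_id):
    # Assumption: the tag field value is the identifier itself, whitespace-trimmed.
    return unique_id.strip()


def fetch_target_subdata(target_db, unique_id):
    """Screen the target database for the single row whose tag matches the identifier."""
    tag = extract_tag_value(unique_id)
    matches = [row["payload"] for row in target_db if row["tag"] == tag]
    if len(matches) != 1:
        # A unique identifier should match exactly one piece of target data.
        raise LookupError(f"expected exactly one match for {tag!r}, found {len(matches)}")
    return matches[0]


# Hypothetical target database: each row holds a tag and the target subdata.
target_db = [
    {"tag": "uid-001", "payload": {"policy": "A", "amount": 100}},
    {"tag": "uid-002", "payload": {"policy": "B", "amount": 250}},
]
result = fetch_target_subdata(target_db, " uid-002 ")
print(result)
```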

To achieve the above object, a second aspect of an embodiment of the present application proposes a data processing apparatus, including:

the first acquisition module is used for reading an original data set and index information from a preset database;

the preprocessing module is used for preprocessing the original data set to obtain a preliminary data set; the preliminary data set comprises M pieces of preliminary subdata, the index information comprises N primary keys, each primary key is used for identifying one piece of preliminary subdata, and N is less than or equal to M;

the marking module is used for marking each preliminary subdata through a preset unique identifier to obtain target data, wherein each target data comprises a piece of target subdata and a corresponding unique identifier;

the data writing module is used for writing the target data into a target database according to the primary key;

the identification writing module is used for writing the unique identification into a preset message queue according to a preset input sequence;

and the sending module is used for providing the unique identifier of the message queue to K consumption ends according to the output sequence of the message queue so that each consumption end acquires the unique identifier and acquires the target data from the target database according to the unique identifier, wherein K is less than or equal to M.

To achieve the above object, a third aspect of the embodiments of the present application provides a data processing system, which includes a production end and a consumption end;

wherein the production side is configured to perform the steps of the data processing method according to any one of claims 1 to 6;

the consumption end is used for acquiring the unique identifier written by the production end from the message queue and acquiring the corresponding target data from the target database according to the unique identifier so as to perform data processing on the target data.

In order to achieve the above object, a fourth aspect of the embodiments of the present application provides an electronic device, which includes a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for implementing connection communication between the processor and the memory, wherein the program, when executed by the processor, implements the method of the first aspect.

To achieve the above object, a fifth aspect of embodiments of the present application proposes a storage medium, which is a computer-readable storage medium for computer-readable storage, and stores one or more programs, which are executable by one or more processors to implement the method of the first aspect.

According to the data processing method, apparatus and system, the electronic device and the storage medium provided by the present application, the production end reads an original data set and index information from a preset database and preprocesses the original data set to obtain a preliminary data set, where the preliminary data set comprises M pieces of preliminary subdata and the index information comprises N primary keys. Each piece of preliminary subdata is marked with a preset unique identifier to obtain target data, where each piece of target data comprises a piece of target subdata and a corresponding unique identifier, so that each piece of target subdata corresponds to exactly one unique identifier. Further, the target data is written into a target database according to the primary key, so that the unique identifier and the target subdata are stored at a fixed position in the target database, which improves the standardization of data storage. Finally, the unique identifiers are written into a preset message queue according to a preset input sequence and provided to K consumption ends according to the output sequence of the message queue; each consumption end obtains a unique identifier and acquires the target data from the target database according to it, where K is less than or equal to M. In this way, the production end only needs to mark the preliminary subdata to obtain the target data, store the target data in the target database, and write only the unique identifiers into the message queue, which improves the data processing efficiency of the production end. When extracting data, the consumption end only needs to extract a unique identifier from the message queue and can then obtain the target data through the correspondence between the unique identifier and the target data. Because each unique identifier is unique, the consumption end extracts and processes each piece of target data only once, which avoids repeated extraction and processing and improves both the efficiency and the consistency of data processing.

Drawings

FIG. 1 is a flowchart of a data processing method provided in an embodiment of the present application;

FIG. 2 is a flowchart of step S102 in FIG. 1;

FIG. 3 is a flowchart of step S103 in FIG. 1;

FIG. 4 is a flowchart of step S104 in FIG. 1;

FIG. 5 is a flowchart of step S105 in FIG. 1;

FIG. 6 is a flowchart of step S106 in FIG. 1;

FIG. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;

FIG. 8 is a block diagram of a data processing system according to an embodiment of the present application;

FIG. 9 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, as well as in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.

First, several terms referred to in the present application are explained:

artificial Intelligence (AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding human intelligence; artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produces a new intelligent machine that can react in a manner similar to human intelligence, and research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems, among others. The artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.

Natural Language Processing (NLP): NLP uses computer to process, understand and use human language (such as chinese, english, etc.), and belongs to a branch of artificial intelligence, which is a cross discipline between computer science and linguistics, also commonly called computational linguistics. Natural language processing includes parsing, semantic analysis, discourse understanding, and the like. Natural language processing is commonly used in the technical fields of machine translation, handwriting and print character recognition, speech recognition and text-to-speech conversion, information intention recognition, information extraction and filtering, text classification and clustering, public opinion analysis and opinion mining, and relates to data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language calculation, and the like related to language processing.

Thread (Thread): the smallest unit of execution that an operating system can schedule. A thread is contained in a process and is the actual unit of operation within the process. A thread is a single sequential flow of control in a process; multiple threads can run concurrently in one process, each executing a different task in parallel. In Unix System V and SunOS, threads are also called lightweight processes, although "lightweight process" more often refers to kernel threads, while user threads are simply called threads. Threads are the basic unit of independent scheduling and dispatching. A thread may be a kernel thread scheduled by the operating system kernel, such as a Win32 thread; a user thread scheduled by the user process itself, such as a POSIX Thread on the Linux platform; or a thread scheduled jointly by the kernel and the user process, such as a Windows 7 thread.

Production end (Producer): also referred to as producer, refers to a thread that shares a fixed size buffer with the consumer, and the producer acts to generate a certain amount of data to put into the buffer, i.e., the producer is used to produce the data.

Consumer end (Consumer): also called consumer, refers to a thread sharing a buffer of a fixed size with the producer, and the consumer is used to fetch data put in by the producer from the buffer, i.e. the consumer is used to consume data.

Message Queue (Message Queue): a container that holds messages while they are in transit. A message queue is a form of inter-process communication, or of communication between different threads of the same process, in which a software queue is used to handle a series of inputs, usually from a user. The message queue provides an asynchronous communication protocol: each record in the queue contains data such as the time of occurrence, the type of input device, and specific input parameters, and the sender and the recipient of a message do not need to interact with the message queue at the same time. A message is held in the queue until the recipient retrieves it.

Index (database terminology): a data structure in the MySQL database, that is, a way of organizing data, also called a key (Key). In a relational database, an index is a separate physical storage structure that sorts the values of one or more columns of a database table; it consists of a collection of the values of one or more columns in the table and a corresponding list of logical pointers to the data pages in the table that physically hold those values. An index is comparable to the table of contents of a book: the required content can be found quickly from the page number listed there. The index provides pointers to the data values stored in a specified column of the table and sorts these pointers according to a specified sort order. The database uses the index to find a particular value and then follows the pointer to the row containing that value. This allows SQL statements on the table to execute faster and specific information in the table to be accessed quickly.

Primary key (Primary Key): one or more fields in a table whose values uniquely identify a record in the table. In a relationship between two tables, the primary key is used to reference a particular record in one table from the other table. The primary key is a unique key that is part of the table definition. The primary key of a table may be composed of multiple columns, and the columns of the primary key may not contain null values. A primary key is a column or combination of columns whose values uniquely identify each row in a table, and it enforces the entity integrity of the table. The primary key is mainly used for association with the foreign keys of other tables and for modifying and deleting records.

Web crawler: also known as a web spider or web robot, and in the FOAF community more often called a web chaser; a program or script that automatically captures information from the World Wide Web according to certain rules. Other, less common names include ant, automatic indexer, emulator, and worm.

Data cleansing (Data cleansing): the last procedure for finding and correcting recognizable errors in the data file includes checking data consistency, processing invalid values and missing values, etc. Unlike questionnaire review, cleaning of data after entry is typically done by computer rather than manually.

Data deduplication: finding and deleting duplicate data in a set of data files so that only one copy of each data unit is kept, thereby eliminating redundant data. Data deduplication includes complete deduplication and incomplete deduplication. Complete deduplication refers to eliminating completely duplicated data, that is, records in a data table whose field values are all exactly the same. Incomplete deduplication refers to the case, during data cleaning, in which duplicate values with all field values equal must be eliminated.

Based on this, embodiments of the present application provide a data processing method, a data processing apparatus, a data processing system, an electronic device, and a storage medium, which aim to improve the efficiency of data processing.

The data processing method, the data processing apparatus, the data processing system, the electronic device, and the storage medium provided in the embodiments of the present application are specifically described in the following embodiments, and first, the data processing method in the embodiments of the present application is described.

The embodiment of the application can acquire and process related data based on an artificial intelligence technology. The artificial intelligence is a theory, a method, a technology and an application system which simulate, extend and expand human intelligence by using a digital computer or a machine controlled by the digital computer, sense the environment, acquire knowledge and obtain the best result by using the knowledge.

The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.

The embodiment of the application provides a data processing method, and relates to the technical field of artificial intelligence. The data processing method provided by the embodiment of the application can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, or the like; the server side can be configured into an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and cloud servers for providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, big data and artificial intelligence platforms and the like; the software may be an application or the like that implements a data processing method, but is not limited to the above form.

The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In a first aspect, fig. 1 is an optional flowchart of a data processing method provided in an embodiment of the present application, and the method in fig. 1 may include, but is not limited to, step S101 to step S106.

Step S101, reading an original data set and index information from a preset database;

Step S102, preprocessing the original data set to obtain a preliminary data set; the preliminary data set comprises M pieces of preliminary subdata, the index information comprises N primary keys, each primary key is used for identifying one piece of preliminary subdata, and N is smaller than or equal to M;

Step S103, marking each piece of preliminary subdata with a preset unique identifier to obtain target data, wherein each piece of target data comprises one piece of target subdata and a corresponding unique identifier;

Step S104, writing the target data into a target database according to the primary key;

Step S105, writing the unique identifier into a preset message queue according to a preset input sequence;

Step S106, providing the unique identifier of the message queue to K consumption ends according to the output sequence of the message queue so that each consumption end obtains the unique identifier and acquires the target data from the target database according to the unique identifier, wherein K is less than or equal to M.

In steps S101 to S106 illustrated in the embodiment of the present application, the production end reads the original data set and the index information from the preset database, preprocesses the original data set to obtain a preliminary data set, and marks each piece of preliminary subdata with a preset unique identifier to obtain target data, so that each piece of target subdata corresponds to exactly one unique identifier. The target data is written into a target database according to the primary key, so that the unique identifier and the target subdata are stored at a fixed position in the target database, which improves the standardization of data storage. Finally, the unique identifiers are written into a preset message queue according to a preset input sequence and sent to the K consumption ends according to the output sequence of the message queue. In this way, the production end only needs to mark the preliminary subdata to obtain the target data, store the target data in the target database, and write only the unique identifiers into the message queue, which improves the data processing efficiency of the production end. When extracting data, the consumption end only needs to extract a unique identifier from the message queue and can then obtain the target data through the correspondence between the unique identifier and the target data. Because each unique identifier is unique, the consumption end extracts and processes each piece of target data only once, which avoids repeated extraction and processing and improves data processing efficiency.
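The overall flow of steps S103 to S106 can be sketched with Python's standard library as follows. This is a simplified illustration rather than the claimed implementation: a `dict` stands in for the target database, `queue.Queue` stands in for the message queue, threads stand in for the K consumption ends, and `uuid.uuid4()` is an assumed way of producing unique identifiers. The function names `produce` and `consume` are hypothetical.

```python
import queue
import threading
import uuid


def produce(preliminary_rows, target_db, message_queue):
    """Mark each row with a unique id, store it, and enqueue only the id."""
    for row in preliminary_rows:
        uid = uuid.uuid4().hex  # assumption: any collision-free string would do
        target_db[uid] = row    # full record goes to the target database
        message_queue.put(uid)  # only the identifier goes to the queue


def consume(target_db, message_queue, processed, lock):
    """Take identifiers off the queue and look up the matching target data."""
    while True:
        try:
            uid = message_queue.get_nowait()
        except queue.Empty:
            return  # queue drained; the producer finished before consumers started
        row = target_db[uid]
        with lock:
            processed.append((uid, row))


target_db = {}
message_queue = queue.Queue()
produce([{"v": i} for i in range(6)], target_db, message_queue)  # M = 6

processed, lock = [], threading.Lock()
workers = [
    threading.Thread(target=consume, args=(target_db, message_queue, processed, lock))
    for _ in range(3)  # K = 3 consumption ends, K <= M
]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(processed), "records processed, each exactly once")
```

Since each identifier is removed from the queue by exactly one consumer, no record is looked up twice, which is the property the paragraph above relies on.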

In step S101 of some embodiments, the production end may obtain the original data set of the preset database by writing a web crawler, setting a data source, and then performing targeted crawling. The original data set may also be obtained in other manners, without limitation. The preset database may be a database used by an enterprise to store business data, and the original data set may include various types of business data.

Further, in order to make data processing more regular, the production end may obtain an original data set of a fixed data volume from the preset database at every preset time interval. Both the preset time and the preset data volume may be set according to specific business requirements, without limitation. For example, if the preset time is 30 seconds, an original data set is obtained from the preset database every 30 seconds, where the data volume of the original data set is 1 million records.

Referring to fig. 2, in some embodiments, step S102 may include, but is not limited to, step S201 to step S202:

Step S201, performing data cleaning processing on the original data set to obtain a first data set;

Step S202, performing data deduplication processing on the first data set to obtain a preliminary data set.

In step S201 of some embodiments, when data cleaning is performed on the original data set, missing values are cleaned first. Specifically, a missing-value ratio is calculated for each data item in the original data set, and the data is cleaned according to the missing ratio and the importance of the data. Unimportant data is deleted, and missing data is filled in based on business knowledge, empirical inference, or the like, to obtain filled data. For data of high importance and high missing rate, business personnel can be consulted or the complete data can be obtained through other channels, so as to obtain the first data set.

In step S202 of some embodiments, the field values of each data item in the first data set are counted, data items with the same field values are placed in the same set, and each set is pruned so that only one data item is retained. This achieves complete deduplication of the first data set: each data item is unique, redundant data is eliminated, and the preliminary data set is obtained. The preliminary data set comprises M pieces of preliminary subdata, the index information comprises N primary keys, each primary key is used for identifying one piece of preliminary subdata, and N is smaller than or equal to M.

Further, when N is equal to M, each preliminary subdata corresponds to a primary key, and each primary key has a positioning function, that is, the primary key can be used to represent the position of one preliminary subdata. When N is less than M, each primary key may correspond to a row of preliminary sub-data and/or a column of preliminary sub-data, and each primary key may represent a row characteristic and/or a column characteristic of a row of preliminary sub-data, where the row characteristic is a row coordinate of the preliminary sub-data and the column characteristic is a column coordinate of the preliminary sub-data.

Through steps S201 to S202, the data in the original data set can be filtered and duplicate and redundant data eliminated, so that the data content of each piece of preliminary subdata is extracted and consumed by the consumption end only once, which effectively ensures the consistency of data processing.
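Steps S201 and S202 can be sketched as below. The sketch makes several assumptions that are not in the text: rows are dictionaries, a row is dropped when more than half of its required fields are missing (`max_missing_ratio=0.5`), and remaining gaps are filled with the placeholder string `"unknown"` in place of business-knowledge-based filling. The function names `clean` and `deduplicate` are hypothetical.

```python
def clean(rows, required_fields, max_missing_ratio=0.5):
    """Drop rows that are mostly empty; fill the remaining gaps with a placeholder."""
    cleaned = []
    for row in rows:
        missing = [f for f in required_fields if row.get(f) is None]
        if len(missing) / len(required_fields) > max_missing_ratio:
            continue  # too incomplete to keep (assumed threshold)
        filled = dict(row)
        for f in missing:
            filled[f] = "unknown"  # stand-in for knowledge-based filling
        cleaned.append(filled)
    return cleaned


def deduplicate(rows, fields):
    """Complete deduplication: keep one row per distinct tuple of field values."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


fields = ["name", "city", "plan"]
raw = [
    {"name": "Li", "city": "Shenzhen", "plan": "A"},
    {"name": "Li", "city": "Shenzhen", "plan": "A"},  # exact duplicate
    {"name": "Wu", "city": None, "plan": "B"},        # one gap, filled
    {"name": None, "city": None, "plan": "C"},        # mostly missing, dropped
]
preliminary = deduplicate(clean(raw, fields), fields)
print(preliminary)
```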

Referring to fig. 3, in some embodiments, step S103 may include, but is not limited to, step S301 to step S303:

Step S301, acquiring M unique identifiers, wherein the unique identifiers are character strings;

Step S302, sorting the unique identifiers to obtain a first identification sequence;

Step S303, writing the unique identifier into a preset tag frame corresponding to each piece of preliminary subdata according to the first identification sequence to obtain target data.

In step S301 of some embodiments, unique identifiers are obtained, where a unique identifier may be a character string set according to different service requirements, and the character string may include, without limitation, Arabic numerals, Greek numerals, letters, punctuation marks, and the like.

In step S302 of some embodiments, attributes such as the character length of each unique identifier or the number of letters it contains may be counted, and the M unique identifiers are sorted by that attribute to obtain the first identifier sequence. For example, the unique identifiers are sorted from longest to shortest by character length to obtain the first identifier sequence.

In step S303 of some embodiments, according to the first identification sequence, writing the unique identifier into a preset tag frame corresponding to each piece of preliminary sub-data in a data filling manner, where the preset tag frame is used to mark or annotate the preliminary sub-data, so as to obtain target data, where each piece of target data includes a piece of target sub-data and a corresponding unique identifier, and data content of the target sub-data is derived from the preliminary sub-data, that is, data content of the target sub-data is substantially the same as the preliminary sub-data.

Through the steps S301 to S303, each target sub-data can correspond to a unique identifier, so that the identifier of each target sub-data has uniqueness.
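Steps S301 to S303 can be sketched as follows. This is a minimal illustration under stated assumptions: the preset label frame is modelled as a `tag` entry in a dictionary, identifiers are paired with the preliminary subdata in list order, and the names `mark_preliminary_subdata`, `tag` and `target_subdata` are hypothetical.

```python
def mark_preliminary_subdata(preliminary_subdata, unique_ids):
    """Attach one unique identifier to each piece of preliminary subdata."""
    if len(unique_ids) != len(preliminary_subdata):
        raise ValueError("exactly one identifier is needed per piece of subdata")
    if len(set(unique_ids)) != len(unique_ids):
        raise ValueError("identifiers must be unique")
    # S302: sort from longest to shortest; ties keep their original order
    # because Python's sort is stable.
    first_identifier_sequence = sorted(unique_ids, key=len, reverse=True)
    # S303: fill each identifier into the label frame ("tag") of one subdata.
    return [
        {"tag": uid, "target_subdata": subdata}
        for uid, subdata in zip(first_identifier_sequence, preliminary_subdata)
    ]


target_data = mark_preliminary_subdata(
    ["record-1", "record-2", "record-3"],
    ["a1", "abc123", "xy9z"],
)
```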

Referring to fig. 4, in some embodiments, step S104 may include, but is not limited to, step S401 to step S403:

step S401, extracting keywords of each primary key to obtain index keywords of each preliminary subdata;

step S402, performing position identification on each target datum according to the index key words to obtain the row characteristics and the column characteristics of each target subdata, wherein the target subdata is derived from the preliminary subdata;

step S403, writing each corresponding target data into the target database according to the row feature and the column feature.

In step S401 of some embodiments, keyword extraction may be performed on the primary keys through the TF-IDF algorithm, with each primary key processed into a number of character nodes. Specifically, the frequency of occurrence of each character in the primary key is calculated to obtain the Term Frequency (TF) of each character, where TF = (number of occurrences of the character w in the primary key) / (total number of characters in the primary key). Further, the Inverse Document Frequency (IDF) of each character is calculated, where IDF = log(total number of primary keys / (number of primary keys containing the character w + 1)). Finally, a comprehensive frequency value of each character is calculated from the term frequency and the inverse document frequency, where comprehensive frequency value = TF × IDF. The character with the maximum comprehensive frequency value in each primary key is selected as the index keyword of the corresponding preliminary subdata.
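The character-level TF-IDF selection of step S401 can be sketched in Python as shown below. The sketch assumes non-empty primary keys and a natural logarithm (the application does not state the base), and it resolves ties by taking the first character encountered; the function name `index_keywords` is hypothetical.

```python
import math
from collections import Counter


def index_keywords(primary_keys):
    """Return, for each primary key, the character with the highest TF x IDF."""
    total = len(primary_keys)
    # Number of primary keys that contain each character.
    keys_containing = Counter()
    for key in primary_keys:
        keys_containing.update(set(key))

    keywords = []
    for key in primary_keys:
        counts = Counter(key)
        scores = {
            # TF = occurrences / key length; IDF = log(total / (containing + 1)).
            ch: (n / len(key)) * math.log(total / (keys_containing[ch] + 1))
            for ch, n in counts.items()
        }
        keywords.append(max(scores, key=scores.get))
    return keywords


print(index_keywords(["aab", "abc", "abd"]))
```

Because of the "+ 1" in the denominator, a character that appears in every primary key receives a negative IDF, so rarer characters are preferred as index keywords.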

In step S402 of some embodiments, since the target subdata is derived from the preliminary subdata, the storage location of the target subdata can be determined from the index information of the preliminary subdata. Specifically, the "row" and "column" fields in the index keyword, together with the number closest to each field, are obtained to yield the row feature and the column feature of each piece of target subdata, where the row feature and the column feature are the row coordinate value and the column coordinate value of the target subdata.

In step S403 of some embodiments, each corresponding target data is written into a corresponding row and a corresponding column in the target database according to the row feature and the column feature.

The unique identifier in the target data and the target sub-data can be stored in a fixed position of the target database through the steps S401 to S403, and the storage normalization of the data is improved.
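The position identification and writing of steps S402 and S403 can be sketched as follows. The format of the index keyword (for example `row3_column7`), the in-memory dictionary standing in for the target database, and the names `locate` and `write_target_data` are all assumptions made for illustration.

```python
import re

# The number closest to (immediately following) each field name.
ROW_PATTERN = re.compile(r"row\D*?(\d+)")
COLUMN_PATTERN = re.compile(r"column\D*?(\d+)")


def locate(index_keyword):
    """S402: derive the (row, column) coordinates from an index keyword."""
    row = ROW_PATTERN.search(index_keyword)
    column = COLUMN_PATTERN.search(index_keyword)
    if row is None or column is None:
        raise ValueError(f"no row/column information in {index_keyword!r}")
    return int(row.group(1)), int(column.group(1))


def write_target_data(target_database, index_keyword, target_data):
    """S403: store the target data at its fixed (row, column) position."""
    target_database[locate(index_keyword)] = target_data


target_database = {}
write_target_data(
    target_database,
    "row3_column7",
    {"tag": "abc123", "target_subdata": "record-1"},
)
```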

It should be understood that, compared with the preliminary data of the preset database, the target data in the target database includes target subdata having substantially the same data content as the preliminary subdata, and each piece of target subdata has a corresponding unique identifier. In an actual application scenario, the target database may serve as an updated version of the preset database.

Referring to fig. 5, in some embodiments, step S105 may include, but is not limited to, step S501 to step S503:

step S501, an input sequence is obtained, wherein the input sequence is determined according to the character length of the unique identifier;

step S502, the unique identification is sequenced according to the input sequence to obtain a second identification sequence;

and step S503, writing the unique identifier into the message queue according to the second identifier sequence.

In step S501 and step S502 of some embodiments, the unique identifier is written into a preset message queue, where the message queue includes M unique identifiers. Specifically, when the unique identifier is written into the message queue, the unique identifier may be written into the message queue in a random arrangement manner, or may be written into the message queue after the unique identifier is sequentially arranged according to the character length, the identifier type, and the like of the unique identifier. The unique identifier may also be written into the preset message queue according to other sequence, which is not limited to this. For example, an input sequence is constructed according to the character length, the identification type and the like of the unique identification, and if the unique identification is determined to be input into the message queue according to the character length of the unique identification, the unique identification is sequenced from long to short according to the character length to obtain a second identification sequence.

In step S503 of some embodiments, the unique identifiers are written into the preset message queue one after another according to the second identifier sequence. Since only one unique identifier can be written into the message queue at a time, each unique identifier occupies a distinct position in the queue. For example, the unique identifiers A, B, C, D may be written in the order B, A, D, C.

Further, in order to improve the normalization of data processing, the unique identifier may be sequentially written into the preset message queue at intervals of preset time, and the preset time may be set according to specific service requirements without limitation. For example, if the preset time is 1 minute, the unique identifier is written into the preset message queue according to the input sequence every 1 minute.
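Steps S501 to S503, including the optional write interval, can be sketched as follows. The sketch uses Python's standard `queue.Queue` as a stand-in for the message queue, which is an assumption; the application does not name a specific message queue product. The function name `write_identifiers` and the parameter `interval_seconds` are hypothetical.

```python
import queue
import time


def write_identifiers(message_queue, unique_ids, interval_seconds=0.0):
    """Write identifiers to the queue, longest first, one at a time."""
    # S501/S502: the input sequence is determined by character length.
    second_identifier_sequence = sorted(unique_ids, key=len, reverse=True)
    # S503: only one identifier is written at a time, so each identifier
    # ends up at its own position in the queue.
    for uid in second_identifier_sequence:
        message_queue.put(uid)
        if interval_seconds > 0:
            time.sleep(interval_seconds)  # preset time between writes
    return second_identifier_sequence


message_queue = queue.Queue()
written = write_identifiers(message_queue, ["a1", "abc123", "xy9z"])
```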

Referring to fig. 6, in some embodiments, step S106 may include, but is not limited to, step S601 to step S602:

step S601, acquiring a data sending instruction;

step S602, the unique identifiers of the message queue are sequentially sent to K consumption terminals according to the data sending instruction and the output sequence, so that each consumption terminal performs feature extraction on the obtained unique identifiers to obtain tag field values, and target data of a target database are screened according to the tag field values to obtain target subdata corresponding to the unique identifiers.

In step S601 of some embodiments, the data sending instruction is mainly issued by the control center, where the data sending instruction may be triggered by a service person in the control center according to current service requirements, or may be sent to the production end at regular time according to preset sending time, without limitation.

In step S602 of some embodiments, the production end sends the unique identifiers to the K consumption ends through the message queue according to the data sending instruction, so that each consumption end can acquire a unique identifier from the message queue and acquire the target data from the target database according to that unique identifier, where K is less than or equal to M. In a specific embodiment, if the total amount of target data is within 1,000,000, 2 consumption ends are enabled by default to acquire the unique identifiers corresponding to the target data from the message queue and to acquire the corresponding target data from the target database for data processing. The data increment of the target data is measured in units of 20 thousands: each time the target data grows by more than one such unit, 1 additional consumption end is dynamically started for data processing. This realizes a mechanism for dynamically determining the number of consumption ends according to the total amount of target data (that is, the total number of unique identifiers in the message queue): when the total amount of data is large, additional consumption ends are configured automatically, which improves data processing efficiency; when the total amount of data is small, the number of consumption ends is reduced according to actual business requirements, which saves resources.
In addition, due to the consumption characteristics of the Message Queue (MQ), one unique identifier can be guaranteed to be processed by only one consumption end, so that one target data corresponding to one unique identifier can be obtained by only one consumption end, data processing is performed only once, data are prevented from being repeatedly processed, and data processing can be enabled to have better consistency.
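The dynamic determination of the number of consumption ends can be sketched as below. The baseline of 2 consumption ends up to 1,000,000 pieces of target data follows the embodiment above; the size of one increment is left as a required parameter, and counting only complete increments beyond the baseline is an interpretation of the text rather than something the application states. The function name `consumer_count` is hypothetical.

```python
def consumer_count(total, increment, base_threshold=1_000_000, base_consumers=2):
    """Number of consumption ends K for a given total amount of target data."""
    if total <= 0:
        return 0
    # One extra consumption end per complete increment beyond the baseline.
    extra = max(0, total - base_threshold) // increment
    # K must not exceed M, the number of identifiers in the queue.
    return min(base_consumers + extra, total)


# The increment value below is illustrative only.
print(consumer_count(500_000, increment=200_000))    # -> 2
print(consumer_count(1_450_000, increment=200_000))  # -> 4
```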

Further, the consumption end can obtain the unique identifiers sent by the production end from the message queue, where the message queue comprises M unique identifiers and each unique identifier corresponds to one piece of target data stored in the target database. The number of consumption ends may be set according to the amount of target data that actually needs to be processed; that is, the number of consumption ends may be K, with K less than or equal to M. To obtain the target data from the target database according to a unique identifier, the consumption end performs feature extraction on the unique identifier to obtain a tag field value, evaluates the tag field value with a preset function, and screens the target database for the target data that matches the calculation result, thereby obtaining the target data corresponding to the unique identifier.

Specifically, the consumption end may extract a field value of the unique identifier through a preset script or a preset function to obtain the tag field value, where the preset function may be a commonly used numerical function, for example, a sum function.
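The consumption side can be sketched with threads that share one queue. This is a simplified illustration: `queue.Queue` stands in for the message queue, a dictionary keyed by tag field value stands in for the target database, and `extract_tag_value` is a hypothetical placeholder for the feature extraction and preset function, which the application does not specify in detail.

```python
import queue
import threading


def extract_tag_value(unique_id):
    """Placeholder feature extraction: here the tag value is the trimmed id."""
    return unique_id.strip()


def run_consumers(message_queue, target_database, k):
    """Start k consumption ends; each identifier is taken by exactly one."""
    processed = []
    lock = threading.Lock()

    def consume(consumer_no):
        while True:
            try:
                # A queue hands each item to a single caller only.
                unique_id = message_queue.get_nowait()
            except queue.Empty:
                return
            tag_value = extract_tag_value(unique_id)
            target_data = target_database[tag_value]
            with lock:
                processed.append((consumer_no, tag_value, target_data))

    threads = [threading.Thread(target=consume, args=(i,)) for i in range(k)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


ids = ["id-1", "id-2", "id-3", "id-4", "id-5"]
mq = queue.Queue()
for uid in ids:
    mq.put(uid)
database = {uid: {"tag": uid, "target_subdata": uid.upper()} for uid in ids}
results = run_consumers(mq, database, k=2)
```

Which consumption end handles which identifier depends on thread scheduling, but every identifier appears in `results` exactly once.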

According to the data processing method, the production end reads the original data set and the index information from the preset database and preprocesses the original data set to obtain the preliminary data set, where the preliminary data set comprises M pieces of preliminary subdata and the index information comprises N primary keys. Each piece of preliminary subdata is marked with a preset unique identifier to obtain target data, where each piece of target data comprises one piece of target subdata and one unique identifier, so that every piece of target subdata corresponds to exactly one unique identifier. Further, the target data is written into the target database according to the primary key, so that the unique identifier and the target subdata in the target data are stored at a fixed position in the target database, which improves the normalization of data storage. Finally, the unique identifiers are written into a preset message queue according to a preset input sequence, where the message queue comprises M unique identifiers, and the unique identifiers of the message queue are sent to K consumption ends according to the output sequence of the message queue; each consumption end obtains a unique identifier and obtains the target data from the target database according to it, where K is less than or equal to M. In this way, the production end only needs to mark the preliminary subdata of the preset database to obtain the target data, store the target data in the target database, and write the unique identifiers of the target data into the message queue, which improves the data processing efficiency of the production end. When extracting data, the consumption end only needs to extract a unique identifier from the message queue and then obtain the target data according to the correspondence between the unique identifier and the target data. Because the unique identifier of the target data is unique, the consumption end extracts and processes each piece of target data only once, which solves the problem of target data being repeatedly extracted and processed and improves both the efficiency and the consistency of data processing.

In a second aspect, referring to fig. 7, an embodiment of the present application further provides a data processing apparatus, which is applied to a production end and can implement the data processing method in the first aspect, where the apparatus includes:

a first obtaining module 701, configured to read an original data set and index information from a preset database;

a preprocessing module 702, configured to preprocess the original data set to obtain a preliminary data set; the preliminary data set comprises M pieces of preliminary subdata, the index information comprises N primary keys, each primary key is used for identifying one piece of preliminary subdata, and N is less than or equal to M;

a marking module 703, configured to mark each preliminary subdata by using a preset unique identifier to obtain target data, where each target data includes a piece of target subdata and a corresponding unique identifier;

a data writing module 704, configured to write the target data into the target database according to the primary key;

an identifier writing module 705, configured to write the unique identifier into a preset message queue according to a preset input sequence;

and a sending module 706, configured to provide the unique identifier of the message queue to K consumers according to the output sequence of the message queue, so that each consumer obtains the unique identifier, and obtains the target data from the target database according to the unique identifier, where K is less than or equal to M.

In some embodiments, the pre-processing module 702 includes:

the cleaning unit is used for cleaning the data of the original data set to obtain a first data set;

and the duplication removing unit is used for carrying out data duplication removing processing on the first data set to obtain a preliminary data set.

In some embodiments, the marking module 703 includes:

the acquisition unit is used for acquiring M unique identifiers, wherein the unique identifiers are character strings;

the first sequencing unit is used for sequencing the unique identifier to obtain a first identifier sequence;

and the first writing unit is used for writing the unique identifier into the preset label frame corresponding to each preliminary subdata according to the first identifier sequence to obtain the target data.

In some embodiments, the data writing module 704 includes:

the keyword extraction unit is used for extracting keywords from each primary key to obtain an index keyword of each preliminary subdata;

the position identification unit is used for carrying out position identification on each target data according to the index key words to obtain the row characteristics and the column characteristics of each target subdata;

and the second writing unit is used for writing each corresponding target data into the target database according to the row characteristics and the column characteristics.

In some embodiments, the identifier writing module 705 includes:

the sequence acquisition unit is used for acquiring an input sequence, wherein the input sequence is determined according to the character length of the unique identifier;

the second sorting unit is used for sorting the unique identifier according to the input sequence to obtain a second identifier sequence;

and the third writing unit is used for writing the unique identifier into the message queue according to the second identifier sequence.

In some embodiments, the sending module 706 includes:

the instruction acquisition unit is used for acquiring a data sending instruction;

and the sending unit is used for sequentially sending the unique identifier of the message queue to the K consumption terminals according to the data sending instruction and the output sequence so that each consumption terminal can extract the characteristics of the obtained unique identifier to obtain the value of the label field, and screening the target data of the target database according to the value of the label field to obtain the target subdata corresponding to the unique identifier.

The specific implementation of the data processing apparatus is substantially the same as the specific implementation of the data processing method of the first aspect, and is not described herein again.

In a third aspect, referring to fig. 8, an embodiment of the present application further provides a data processing system, where the data processing system 800 includes a production end 801 and a consumption end 802. The production end 801 is configured to perform the steps of the data processing method according to the first aspect. The consumption end 802 is configured to obtain the unique identifier written by the production end 801 from the message queue 803, and obtain corresponding target data from the target database 804 according to the unique identifier, so as to perform data processing on the target data.

Specifically, the production end 801 reads an original data set and index information from a preset database 805 according to a pre-obtained data processing instruction, and preprocesses the original data set to obtain a preliminary data set; the preliminary data set comprises M pieces of preliminary subdata, the index information comprises N primary keys, each primary key is used for identifying one piece of preliminary subdata, and N is less than or equal to M. Further, the production end 801 marks each preliminary subdata by a preset unique identifier to obtain target data, wherein each target data includes a piece of target subdata and a corresponding unique identifier, and writes the target data into the target database 804 according to the primary key; meanwhile, the unique identifier is written into a preset message queue 803 according to a preset input sequence.

When the production end 801 acquires the data sending instruction, the production end 801 provides the unique identifier of the message queue to the K consumption ends 802 according to the output sequence of the message queue. At this time, the consumption ends 802 acquire the unique identifier written by the production end 801 from the message queue 803 and acquire corresponding target data from the target database 804 according to the unique identifier, thereby performing data processing on the target data.

In a fourth aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes: a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling a connection communication between the processor and the memory, the program, when executed by the processor, implementing the data processing method of the first aspect described above. The electronic equipment can be any intelligent terminal including a tablet computer, a vehicle-mounted computer and the like.

Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:

the processor 901 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the technical solution provided in the embodiment of the present application;

the memory 902 may be implemented in the form of a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 902 may store an operating system and other application programs, and when the technical solution provided in the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 902, and the processor 901 is used to call and execute the data processing method of the first aspect of the embodiments of the present application;

an input/output interface 903 for inputting and outputting information;

a communication interface 904, configured to implement communication interaction between the device and another device, where communication may be implemented in a wired manner (e.g., USB, network cable, etc.), and communication may also be implemented in a wireless manner (e.g., mobile network, Wi-Fi, Bluetooth, etc.);

a bus 905 that transfers information between various components of the device (e.g., the processor 901, memory 902, input/output interface 903, and communication interface 904);

wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively connected to each other within the device via a bus 905.

In a fifth aspect, an embodiment of the present application further provides a storage medium, which is a computer-readable storage medium and is used for computer-readable storage, where the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the data processing method of the first aspect.

The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

According to the data processing method, the data processing device, the data processing system, the electronic device and the storage medium, the production end reads an original data set and index information from a preset database and preprocesses the original data set to obtain the preliminary data set, where the preliminary data set comprises M pieces of preliminary subdata and the index information comprises N primary keys. Each piece of preliminary subdata is marked with a preset unique identifier to obtain target data, where each piece of target data comprises one piece of target subdata and a corresponding unique identifier, so that every piece of target subdata corresponds to exactly one unique identifier. Further, the target data is written into a target database according to the primary key, so that the unique identifier and the target subdata in the target data are stored at a fixed position in the target database, which improves the normalization of data storage.
Finally, the unique identifiers are written into a preset message queue according to a preset input sequence, where the message queue comprises M unique identifiers, and the unique identifiers of the message queue are sent to K consumption ends according to the output sequence of the message queue; each consumption end obtains a unique identifier and obtains target data from the target database according to it, where K is less than or equal to M. In this way, the production end only needs to mark the preliminary subdata of the preset database to obtain the target data, store the target data in the target database, and write only the unique identifiers of the target data into the message queue, which improves the data processing efficiency of the production end. When extracting data, the consumption end only needs to extract a unique identifier from the message queue and can then obtain the target data according to the correspondence between the unique identifier and the target data. Because the unique identifier of the target data is unique, the consumption end extracts and processes each piece of target data only once, which solves the problem of target data being repeatedly extracted and processed and improves both the efficiency and the consistency of data processing.

The embodiments described in the embodiments of the present application are for more clearly illustrating the technical solutions of the embodiments of the present application, and do not constitute a limitation to the technical solutions provided in the embodiments of the present application, and it is obvious to those skilled in the art that the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems with the evolution of technology and the emergence of new application scenarios.

It will be appreciated by those skilled in the art that the solutions shown in fig. 1-6 are not intended to limit the embodiments of the present application and may include more or fewer steps than those shown, or some of the steps may be combined, or different steps may be included.

The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.

The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" is used to describe the association relationship of the associated object, indicating that there may be three relationships, for example, "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, in essence or in the part that contributes over the prior art, or all or part of the technical solutions, may be embodied in the form of a software product stored in a storage medium, which includes multiple instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing programs, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and the scope of the claims of the embodiments of the present application is not limited thereto. Any modifications, equivalents and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present application are intended to be within the scope of the claims of the embodiments of the present application.

Claims

1. A data processing method, the method comprising:

reading an original data set and index information from a preset database;

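preprocessing the original data set to obtain a preliminary data set, wherein the preliminary data set comprises M preliminary subdata, the index information comprises N primary keys, each primary key is used for identifying one preliminary subdata, and N is less than or equal to M;

marking each preliminary subdata with a preset unique identifier to obtain target data, wherein each target data comprises one target subdata and a corresponding unique identifier;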

writing the target data into a target database according to the primary key;

writing the unique identifier into a preset message queue according to a preset input sequence;

and providing the unique identifier of the message queue to K consumption ends according to the output sequence of the message queue so that each consumption end acquires the unique identifier and acquires the target data from the target database according to the unique identifier, wherein K is less than or equal to M.
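
The steps of claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the application: the in-memory dictionaries standing in for the target database, the `uid_index` lookup table, the use of `uuid.uuid4` as the unique identifier, the whitespace-stripping preprocessing rule, and all function names are hypothetical. For simplicity the sketch also assumes every record arrives paired with its primary key.

```python
import queue
import threading
import uuid


def preprocess(rows):
    """Hypothetical preprocessing: trim payloads and drop empty records."""
    return [(pk, payload.strip()) for pk, payload in rows if payload.strip()]


def produce(rows, target_db, uid_index, id_queue):
    """Tag each preliminary record, store it by primary key, enqueue only its ID."""
    for pk, payload in preprocess(rows):
        uid = uuid.uuid4().hex                         # assumed form of the unique identifier
        target_db[pk] = {"uid": uid, "data": payload}  # write target data by primary key
        uid_index[uid] = pk                            # assumed index: uid -> primary key
        id_queue.put(uid)                              # the queue carries IDs, not payloads


def consume(name, target_db, uid_index, id_queue, handled, lock):
    """Take IDs in queue order and fetch the matching record from the target store."""
    while True:
        try:
            uid = id_queue.get_nowait()
        except queue.Empty:
            return                                     # queue drained, consumer stops
        record = target_db[uid_index[uid]]
        with lock:
            handled.append((name, uid, record["data"]))


def run_pipeline(rows, k):
    """Run one producer pass, then let k consumers drain the ID queue."""
    target_db, uid_index, handled = {}, {}, []
    id_queue = queue.Queue()                           # FIFO: output order follows input order
    produce(rows, target_db, uid_index, id_queue)
    lock = threading.Lock()
    workers = [
        threading.Thread(
            target=consume,
            args=(f"consumer-{i}", target_db, uid_index, id_queue, handled, lock),
        )
        for i in range(k)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return target_db, handled
```

Because each identifier is written to the queue once and removed by exactly one consumer, each record in this sketch is fetched and handled once, even though several consumers read from the same queue.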

2. The data processing method of claim 1, wherein the step of marking each preliminary subdata with a preset unique identifier to obtain the target data comprises:

sorting the unique identifiers to obtain a first identifier sequence;

3. The data processing method of claim 1, wherein the step of writing the target data to a target database according to the primary key comprises:

4. The data processing method according to claim 1, wherein the step of writing the unique identifier into a preset message queue according to a preset input sequence comprises:

5. The data processing method of claim 1, wherein the step of preprocessing the original data set to obtain a preliminary data set comprises:

6. The data processing method according to any one of claims 1 to 5, wherein the step of providing the unique identifier of the message queue to K consumption ends according to the output sequence of the message queue, so that each consumption end acquires the unique identifier and acquires the target data from the target database according to the unique identifier, comprises:

acquiring a data sending instruction;

7. A data processing apparatus, characterized in that the apparatus comprises:

a data writing module, configured to write the target data into a target database according to the primary key;

and a sending module, configured to provide the unique identifier of the message queue to K consumption ends according to the output sequence of the message queue, so that each consumption end acquires the unique identifier and acquires the target data from the target database according to the unique identifier, wherein K is less than or equal to M.

8. A data processing system, characterized in that the data processing system comprises a production end and a consumption end;

9. An electronic device, characterized in that it comprises a memory, a processor, a program stored on said memory and executable on said processor, and a data bus for connecting said processor and said memory and enabling communication between them, said program, when executed by said processor, implementing the steps of the data processing method according to any one of claims 1 to 6.

10. A storage medium, being a computer-readable storage medium, characterized in that the storage medium stores one or more programs, which are executable by one or more processors to implement the steps of the data processing method according to any one of claims 1 to 6.