CN113505134B - Multithreading data processing method, multithreading base database data storage method and device - Google Patents


Info

Publication number
CN113505134B
Authority
CN
China
Prior art keywords: data, current, processed, lock, thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110557810.7A
Other languages
Chinese (zh)
Other versions
CN113505134A (en)
Inventor
王金高
黄安武
冯曦
甘霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Kuangshi Jinzhi Technology Co ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Wuhan Kuangshi Jinzhi Technology Co ltd
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Kuangshi Jinzhi Technology Co ltd and Beijing Megvii Technology Co Ltd
Priority to CN202110557810.7A
Publication of CN113505134A
Application granted
Publication of CN113505134B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval of structured data, e.g. relational data
    • G06F 16/23 Updating
    • G06F 16/2308 Concurrency control
    • G06F 16/2336 Pessimistic concurrency control approaches, e.g. locking or multiple versions without time stamps
    • G06F 16/2343 Locking methods, e.g. distributed locking or locking implementation details
    • G06F 16/25 Integrating or interfacing systems involving database management systems
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources to service a request
    • G06F 9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F 9/526 Mutual exclusion algorithms

Abstract

The present disclosure relates to a multithreaded data processing method, a multithreaded data processing apparatus, a multithreaded base database warehousing method, a multithreaded base database warehousing apparatus, an electronic device, and a computer-readable storage medium. The multithreaded data processing method includes the following steps: acquiring current data to be processed through a current thread and determining a current data set to which the current data to be processed belongs; in response to a current data lock corresponding to the current data set existing and not being activated, activating the current data lock; performing data processing on the current data to be processed through the current thread; deleting the current data lock after the data processing of the current data to be processed is finished; and, in response to the current data lock not existing, performing data processing on the current data to be processed through the current thread or ending the current thread. Through the division of data and the dynamic operation of data locks, the processing capacity of the system can be effectively improved while data security is ensured.

Description

Multithreading data processing method, multithreading base database data storage method and device
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a multithreaded data processing method, a multithreaded data processing apparatus, a multithreaded base database warehousing method, a multithreaded base database warehousing apparatus, an electronic device, and a computer-readable storage medium.
Background
With the advent of the data era, the demand for processing massive data keeps increasing, and such data is now commonly processed with multiple threads to improve efficiency. However, threads are not independent: threads in the same process share data, and when several threads access the same data resource a race condition can occur, i.e., the data is occupied by several threads almost simultaneously, corrupting the data. In other words, the processing is not thread-safe.
Thread insecurity is particularly prominent during data transmission. When the data to be transmitted contains duplicates but the transmitted result must not, a thread may first check whether one copy of the duplicate data has already been transmitted and skip the transmission if it has. However, to improve transmission efficiency, multiple threads usually read and write the data in parallel. If a first thread is about to transmit one copy of the duplicate data while a second thread is still transmitting another copy, the first thread's check finds no transmitted copy, yet after both transmissions finish the target contains two transmitted copies of the same data. In this situation the duplicate data makes the processing thread-unsafe.
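As a non-limiting illustration (not part of the patent text), the following Java sketch reproduces the check-then-act race described above; the class and method names (RacyTransmitter, transmitIfAbsent, sendToTarget) are assumptions made purely for illustration.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of the race: two threads each hold one copy of duplicate data,
    // both pass the "has it been transmitted?" check before either finishes, and the
    // target store ends up with two transmitted copies of the same object.
    class RacyTransmitter {
        private final Set<String> transmittedObjectIds = ConcurrentHashMap.newKeySet();

        void transmitIfAbsent(String objectId, byte[] payload) {
            if (!transmittedObjectIds.contains(objectId)) { // check: "not transmitted yet"
                sendToTarget(objectId, payload);            // another thread can pass the same
                transmittedObjectIds.add(objectId);         // check in this window -> duplicate
            }
        }

        private void sendToTarget(String objectId, byte[] payload) {
            // placeholder for the actual transmission/storage step
        }
    }

The multithreaded data processing method described below closes this window by serializing, per data set, the threads that handle potentially duplicate data.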
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a multithreaded data processing method, a multithreaded data processing apparatus, a multithreaded base database warehousing method, a multithreaded base database warehousing apparatus, an electronic device, and a computer-readable storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a multithreaded data processing method, the method comprising: acquiring current data to be processed through a current thread and determining a current data set to which the current data to be processed belongs, where the pieces of data to be processed that would cause thread insecurity belong to the same data set; in response to a current data lock corresponding to the current data set existing and not being activated, activating the current data lock so that only the current thread is allowed to perform data processing on the data to be processed in the current data set, the data locks corresponding to the data sets one to one; performing data processing on the current data to be processed through the current thread; deleting the current data lock after the data processing of the current data to be processed is finished; and, in response to the current data lock not existing, performing data processing on the current data to be processed through the current thread or ending the current thread.
In one embodiment, activating the current data lock in response to a current data lock corresponding to the current data set existing and not being activated includes: activating the current data lock in response to the current data lock corresponding to the current data set existing, the current data lock not being activated, and the number of pieces of data to be processed in the current data set being greater than 1.
In one embodiment, performing data processing on the current data to be processed includes: determining whether the transmitted and stored data contains duplicate data of the current data to be processed, the duplicate data being data that belongs to the same data set as the current data to be processed; and, if no duplicate data exists, transmitting and storing the current data to be processed.
In one embodiment, performing data processing on the current data to be processed further includes: if duplicate data exists, comparing the update time and/or the image quality of the current data to be processed with those of the duplicate data, and, if the update time of the current data to be processed is later than that of the duplicate data and/or its image quality is better than that of the duplicate data, transmitting and storing the current data to be processed, where storing the current data to be processed includes replacing the duplicate data with the current data to be processed.
In one embodiment, before the current data to be processed is acquired by the current thread, the method further includes: reading the data to be processed and acquiring structured attribute information of the data to be processed; determining, according to the structured attribute information, the data set to which each piece of data to be processed belongs, thereby obtaining a plurality of data sets; and generating, for the plurality of data sets, data locks in one-to-one correspondence with the data sets, and storing the correspondence between the data sets and the data locks.
In one embodiment, determining the data set to which the data to be processed belongs according to the structured attribute information includes: determining grouping index information of the data to be processed according to index attribute information in the structured attribute information, the index attribute information being at least one item of the structured attribute information; and classifying data to be processed with the same grouping index information into one data set. Storing the correspondence between the data sets and the data locks includes: storing the correspondence between the grouping index information corresponding to each data set and its data lock.
In one embodiment, acquiring the current data to be processed through the current thread and determining the current data set to which the current data to be processed belongs includes: acquiring current index attribute information of the current data to be processed, and obtaining current grouping index information of the current data to be processed according to the current index attribute information. The method further includes: detecting, according to the current grouping index information, whether a data lock corresponding to the current grouping index information exists. The current index attribute information is at least one item of the structured attribute information of the current data to be processed, and pieces of data to be processed belonging to the same data set have the same grouping index information.
In one embodiment, deleting the current data lock includes: releasing the current data lock and deleting the correspondence between the current data lock and the current data set.
According to a second aspect of the embodiments of the present disclosure, there is provided a multithreaded data processing apparatus, the apparatus comprising: a data acquisition unit, configured to acquire current data to be processed through a current thread and determine a current data set to which the current data to be processed belongs, where the pieces of data to be processed that would cause thread insecurity belong to the same data set; and a data operation unit, configured to, in response to a current data lock corresponding to the current data set existing and not being activated, activate the current data lock so that only the current thread is allowed to perform data processing on the data to be processed in the current data set, the data locks corresponding to the data sets one to one, perform data processing on the current data to be processed through the current thread, and delete the current data lock after the data processing of the current data to be processed is finished. The data operation unit is further configured to, in response to the current data lock not existing, perform data processing on the current data to be processed through the current thread or end the current thread.
In one embodiment, the data operation unit is further configured to: activate the current data lock in response to the current data lock corresponding to the current data set existing, the current data lock not being activated, and the number of pieces of data to be processed in the current data set being greater than 1.
In one embodiment, performing data processing on the current data to be processed includes: determining whether the transmitted and stored data contains duplicate data of the current data to be processed, the duplicate data being data that belongs to the same data set as the current data to be processed; and, if no duplicate data exists, transmitting and storing the current data to be processed.
In one embodiment, performing data processing on the current data to be processed further includes: if duplicate data exists, comparing the update time and/or the image quality of the current data to be processed with those of the duplicate data, and, if the update time of the current data to be processed is later than that of the duplicate data and/or its image quality is better than that of the duplicate data, transmitting and storing the current data to be processed, where storing the current data to be processed includes replacing the duplicate data with the current data to be processed.
In one embodiment, the apparatus further includes: a batch reading unit, configured to read the data to be processed and acquire structured attribute information of the data to be processed; a data dividing unit, configured to determine, according to the structured attribute information, the data set to which each piece of data to be processed belongs, thereby obtaining a plurality of data sets; and a data lock generating unit, configured to generate, for the plurality of data sets, data locks in one-to-one correspondence with the data sets and store the correspondence between the data sets and the data locks.
In one embodiment, the data dividing unit is further configured to: determine grouping index information of the data to be processed according to index attribute information in the structured attribute information, the index attribute information being at least one item of the structured attribute information; and classify data to be processed with the same grouping index information into one data set. Storing the correspondence between the data sets and the data locks includes: storing the correspondence between the grouping index information corresponding to each data set and its data lock.
In one embodiment, acquiring the current data to be processed through the current thread and determining the current data set to which the current data to be processed belongs includes: acquiring current index attribute information of the current data to be processed, and obtaining current grouping index information of the current data to be processed according to the current index attribute information. The apparatus is further configured to detect, according to the current grouping index information, whether a data lock corresponding to the current grouping index information exists. The current index attribute information is at least one item of the structured attribute information of the current data to be processed, and pieces of data to be processed belonging to the same data set have the same grouping index information.
In one embodiment, deleting the current data lock includes: releasing the current data lock and deleting the correspondence between the current data lock and the current data set.
According to a third aspect of the embodiments of the present disclosure, there is provided a multithreaded base database warehousing method, the method comprising: acquiring current data to be processed through a current thread and determining a current data set to which the current data to be processed belongs, where the data to be processed that would cause thread insecurity belongs to the same data set; in response to a current data lock corresponding to the current data set existing and not being activated, activating the current data lock so that only the current thread is allowed to perform data processing on the data to be processed in the current data set, the data locks corresponding to the data sets one to one; performing data processing on the current data to be processed through the current thread; deleting the current data lock after the data processing of the current data to be processed is finished; and, in response to the current data lock not existing, performing data processing on the current data to be processed through the current thread or ending the current thread. The data is base database data, and the data processing includes warehousing.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a multithreaded base database warehousing apparatus, comprising: a data acquisition unit, configured to acquire current data to be processed through a current thread and determine a current data set to which the current data to be processed belongs, where the data to be processed that would cause thread insecurity belongs to the same data set; and a data operation unit, configured to, in response to a current data lock corresponding to the current data set existing and not being activated, activate the current data lock so that only the current thread is allowed to perform data processing on the data to be processed in the current data set, the data locks corresponding to the data sets one to one, perform data processing on the current data to be processed through the current thread, and delete the current data lock after the data processing of the current data to be processed is finished. The data operation unit is further configured to, in response to the current data lock not existing, perform data processing on the current data to be processed through the current thread or end the current thread. The data is base database data, and the data processing includes warehousing.
According to a fifth aspect of the embodiments of the present disclosure, there is provided an electronic device, comprising: a memory configured to store instructions; and a processor configured to invoke the instructions stored in the memory to execute the multithreaded data processing method of the first aspect or the multithreaded base database warehousing method of the third aspect.
According to a sixth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by a processor, perform the multithreaded data processing method of the first aspect or the multithreaded base database warehousing method of the third aspect.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects. Data that would be thread-unsafe due to duplication is placed in the same data set; when one thread performs data processing on a piece of data in the data set, a data lock is added to the data set so that other threads cannot process other data in that set, and the data lock of the data set is deleted after that thread's processing is completed. On one hand, data in different data sets is allowed to be processed by multiple threads in parallel; on the other hand, the remaining data in a data set is allowed to be processed by multiple threads in parallel once that thread's processing is completed. The processing capacity of the system is thus effectively improved while thread safety is ensured.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow diagram illustrating a method of multithreaded data processing in accordance with an exemplary embodiment;
FIG. 2 illustrates an example of multi-threaded execution of portrait data, according to an example embodiment;
FIG. 3 is a flow diagram illustrating another method of multithreaded data processing in accordance with an illustrative embodiment;
FIG. 4 is a flow diagram illustrating data partitioning in accordance with an exemplary embodiment;
FIG. 5 is a schematic block diagram illustrating a multithreaded data processing apparatus in accordance with an exemplary embodiment;
FIG. 6 is a schematic block diagram illustrating another multithreaded data processing apparatus in accordance with an exemplary embodiment;
FIG. 7 is a flowchart illustrating a method of multi-threaded base database warehousing in accordance with an illustrative embodiment;
FIG. 8 is a schematic block diagram illustrating a multithreaded base database warehousing apparatus in accordance with an illustrative embodiment;
FIG. 9 is a flowchart illustrating another method of multithreaded data processing in accordance with an illustrative embodiment;
FIG. 10 is a schematic block diagram illustrating an overall flow of a multithreaded data processing apparatus in accordance with an illustrative embodiment;
FIG. 11 is a schematic block diagram illustrating an overall flow of another multithreaded data processing apparatus in accordance with an illustrative embodiment;
FIG. 12 is a schematic block diagram illustrating an apparatus in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The present disclosure provides a multithreaded data processing method 10. Referring to FIG. 1, the multithreaded data processing method 10 includes steps S11-S13, described in detail below.
Step S11: acquiring current data to be processed through a current thread, and determining a current data set to which the current data to be processed belongs; where the pieces of data to be processed that would cause thread insecurity belong to the same data set.
The method of the present disclosure can be applied to various data processing scenarios, such as data synchronization and data transmission, and is particularly suitable for scenarios in which the data to be processed contains duplicate data while the processed result is expected to contain no duplicates, i.e., the data is deduplicated as it is processed.
The data to be processed may be text data, image data, or combined data containing both text and images. For example, the data to be processed may be base database data including a base database image, a base database image ID, and structured attribute information of the base database image. The data to be processed may be massive, for example more than 10,000 pieces, so parallel processing is required to improve data processing efficiency.
The data to be processed is allocated to a plurality of data sets, and a data set may contain one or more pieces of data. Multiple pieces of data to be processed in the same data set are data that may make a thread unsafe; in the present disclosure, such data is, for example, duplicate data. When data processing is performed with multiple threads, the method of the present disclosure can be applied to guarantee thread safety while parallelizing the processing to the greatest extent.
It can be understood that data to be processed does not need to be placed in a folder to be regarded as belonging to a certain data set; it can be assigned to a data set by marking data to be processed that belongs to the same data set with the same label and data belonging to different data sets with different labels.
Take a specific application scenario as an example. Suppose the data to be processed is base database data to be warehoused, each piece of data to be processed corresponds to one object, and several pieces of data to be processed may correspond to the same object. In the base database warehousing scenario (i.e., base database data synchronization), multiple pieces of data to be processed that correspond to the same object are regarded as duplicate data. It is now desired to synchronize/transmit the data to be processed to a target database such that the synchronized/transmitted data contains no duplicates, i.e., only one piece of data is kept for each object. Suppose there are 10,000 pieces of data to be processed, of which 10 pieces correspond to the same object A while the others correspond to different objects, and multiple threads are expected to perform the data synchronization task in parallel to improve efficiency. To ensure that the data synchronized into the target database contains no duplicates, i.e., that the synchronized data corresponds to the objects one to one, a thread can, after acquiring a piece of data to be processed, compare it with the data in the target database: if data of the same object already exists in the target database, the data is not synchronized; otherwise it is synchronized. However, if object A does not yet exist in the target database and a first thread and a second thread respectively acquire the 1st and 2nd of A's 10 pieces of data, neither of which has been synchronized, both threads will conclude during the comparison that A's data does not exist in the target database and both will synchronize their data. As a result, both the 1st and the 2nd pieces of A's data end up in the target database, causing duplication.
For this reason, thread safety and data processing efficiency can be ensured simultaneously as follows: before data processing, the data to be processed is allocated to the data set to which it belongs, and data that may cause thread insecurity is classified into the same data set. In subsequent data processing, different data sets can then be processed by multiple threads in parallel; the remaining data in a data set can be processed by multiple threads in parallel only after one piece of data in that set has been synchronized; and while no piece of data in the set has been synchronized, only the current thread that acquired the current data to be processed is allowed to process it.
In one example, each piece of data to be processed can only be acquired by one thread. In one example, every piece of data to be processed is placed in a task list, with one piece of data corresponding to one task; multiple threads traverse and consume the task list, each thread acquiring one task at a time, and a task acquired by a thread is considered consumed and cannot be acquired again by other threads.
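As a hedged illustration of the task-list consumption just described (not code from the patent), the sketch below assumes each piece of data to be processed is represented by an identifier placed in a shared queue, so that each task is handed to exactly one thread:

    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.function.Consumer;

    // Hypothetical sketch: every piece of data to be processed is one task in a shared list;
    // poll() hands each task to exactly one thread, so a consumed task cannot be acquired
    // again by another thread. Per-set data locks are applied inside the processor.
    class SharedTaskList {
        private final BlockingQueue<String> tasks;

        SharedTaskList(List<String> pendingDataIds) {
            this.tasks = new LinkedBlockingQueue<>(pendingDataIds);
        }

        void runWorker(Consumer<String> processor) {
            String task;
            while ((task = tasks.poll()) != null) { // each task is taken at most once
                processor.accept(task);
            }
        }
    }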
It should be noted that, when a thread acquires data to be processed, the acquisition is not affected by whether a data lock exists for the current data set to which the data belongs or whether that data lock is in an activated state.
Step S12: in response to the presence of the current data lock corresponding to the current data set, the following steps S121-S123 are performed. I.e., in response to the activation condition being satisfied, the following steps S121-S123 are performed.
The current thread may determine whether the activation condition is satisfied after determining a current data set to which the current data to be processed belongs.
In one example, the activation condition is that a current data lock corresponding to the current data set exists and the current data lock is not activated. In another example, besides the current data lock existing and not being activated, other conditions need to be satisfied to activate the current data lock, for example that the current data set contains more than one piece of data to be processed. It can be understood that if the number of pieces of data to be processed in the current data set is 1, the data in the current data set cannot cause thread insecurity and the current data lock does not need to be activated.
The data sets and the data locks correspond one to one. In principle, the data lock of a data set can be in one of three states: first, no data lock corresponding to the data set exists; second, a data lock corresponding to the data set exists but is not activated; third, a data lock corresponding to the data set exists and has been activated. After a data lock is set for a data set, the data lock is in either the activated or the non-activated state. The first state indicates that duplicate data of the data to be processed has already been transmitted and stored, i.e., data belonging to the same object as the data to be processed has been processed; the second indicates that no data belonging to the same object as the data to be processed has been processed yet; the third indicates that data belonging to the same object as the data to be processed is being processed and has not yet finished. Different handling can be adopted for the different lock states, so that thread safety is ensured while parallelism is maximized. For example, when the data lock is in the first state, which indicates that a piece of data in the data set has already been synchronized, the data lock need not be activated, so multiple pieces of data to be processed in the data set are allowed to be processed by multiple threads in parallel. When the data lock is in the second state, no data in the data set has been synchronized, and activating the current data lock ensures that only the current thread that acquired the current data to be processed is allowed to process it.
In one example, after a data lock is set for a data set, the correspondence between the data lock and the data set can be stored in memory, for example with the identifier of the data set as the key and the data lock corresponding to the data set as the value. If the data lock corresponding to a data set is deleted, the identifier of that data set is deleted from memory. When judging whether the condition for activating the data lock is satisfied, if the identifier of the current data set cannot be found in memory, it is determined that no current data lock corresponding to the current data set exists.
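A minimal Java sketch of the in-memory correspondence described above (an assumption for illustration, not the patent's reference implementation): the identifier of the data set, i.e. the grouping key, is the map key and the data lock is the value; removing the entry corresponds to deleting the data lock of the data set.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical lock registry: key = data set identifier (grouping key), value = data lock.
    // find() returning null corresponds to "no current data lock exists for the data set".
    class LockRegistry {
        private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

        void register(String dataSetKey) {
            locks.put(dataSetKey, new ReentrantLock());
        }

        ReentrantLock find(String dataSetKey) {
            return locks.get(dataSetKey);
        }

        void delete(String dataSetKey) {
            locks.remove(dataSetKey); // deleting the correspondence removes the lock
        }
    }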
Step S121: activating the current data lock, so that only the current thread is allowed to perform data processing on the data to be processed in the current data set, where the data locks correspond to the data sets one to one.
After the current data lock is activated, only the current thread is allowed to process the data to be processed in the current data set, so other threads are in a blocked, waiting state with respect to the data in the current data set. In this way, other threads may acquire data to be processed in the current data set but cannot process it.
As shown in FIG. 2, FIG. 2 is an example of multi-threaded processing of portrait data according to an exemplary embodiment. One or more pieces of data to be processed belonging to the same object are contained in the same data set, and data to be processed in the same data set shares the same lock space. For example, the 10 pieces of data to be processed of object A all belong to the data set numbered 4221. Multiple pieces of data to be processed in the same data set may be acquired by multiple threads, but when the data lock corresponding to the data set exists and is activated, only one thread is allowed to perform data processing on them: when two threads both acquire data to be processed from data set 4221, the data lock allows only the earlier thread to proceed, thereby achieving thread safety.
Step S122: performing data processing on the current data to be processed through the current thread.
In the embodiments of the present invention, data processing includes data duplication checking (that is, determining whether the transmitted and stored data contains duplicate data of the current data to be processed), and further includes one or more of data synchronization, data transmission, data statistics, and other data manipulation.
When data is processed, duplication is checked first, and different handling is applied according to the result. On one hand, data to be processed is usually acquired in batches; for example, a first batch contains 500 pieces of data to be processed, and after it is processed a second batch contains 700 pieces. On the other hand, the pieces of data to be processed within the same batch are processed in a certain order. As a result, it may happen that a first piece of data corresponding to an object has already been processed (transmitted and stored) while a second piece corresponding to the same object has not. To ensure that the data in the target database is not duplicated, it is necessary to determine, when processing a piece of data, whether the transmitted and stored data already contains a duplicate of the current data to be processed. If it does, data of the same object has already been processed; in this case the current data to be processed may simply not be transmitted, or it may be compared with the duplicate and the better of the two retained. If the transmitted and stored data contains no duplicate of the current data to be processed, no data of the same object has been processed yet, and the data can be transmitted and stored directly.
In one example, the data processing includes: determining whether the transmitted and stored data contains duplicate data of the current data to be processed, the duplicate data belonging to the same data set as the current data to be processed; if no duplicate data exists, transmitting and storing the current data to be processed; and, if duplicate data exists, not transmitting the current data to be processed.
In another example, the data processing includes: determining whether the transmitted and stored data contains duplicate data of the current data to be processed, the duplicate data being data that belongs to the same data set as the current data to be processed; if no duplicate data exists, transmitting and storing the current data to be processed; and, if duplicate data exists, keeping only one of the current data to be processed and the duplicate data, for example the one with better quality or the newer update time. Specifically, when choosing which of the two to keep, the update time and/or the image quality of the current data to be processed and of the duplicate data can be compared: if the update time of the current data to be processed is later than that of the duplicate data and/or its image quality is better, the current data to be processed is transmitted and stored, where storing it includes replacing the duplicate data with it; otherwise the current data to be processed is not transmitted.
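The following Java sketch illustrates, under simplifying assumptions (an in-memory target store keyed by object, illustrative record fields), the duplicate handling in this example: the stored duplicate is replaced only when the current record is newer and/or of better image quality. It is an illustration, not the patent's implementation.

    import java.time.Instant;
    import java.util.Map;

    // Illustrative record type; objectId, updateTime, imageQuality are assumed fields.
    record PortraitRecord(String objectId, Instant updateTime, double imageQuality) {}

    // Hypothetical duplicate handling: replace the stored duplicate only if the current
    // record is newer and/or of better image quality; otherwise do not transmit it.
    class Deduplicator {
        private final Map<String, PortraitRecord> targetStore; // already transmitted and stored data

        Deduplicator(Map<String, PortraitRecord> targetStore) {
            this.targetStore = targetStore;
        }

        void processCurrent(PortraitRecord current) {
            PortraitRecord duplicate = targetStore.get(current.objectId());
            if (duplicate == null
                    || current.updateTime().isAfter(duplicate.updateTime())
                    || current.imageQuality() > duplicate.imageQuality()) {
                targetStore.put(current.objectId(), current); // transmit/store, replacing the duplicate
            }
            // otherwise the current data to be processed is not transmitted
        }
    }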
Step S123: deleting the current data lock after the data processing of the current data to be processed is finished.
When the current thread has finished data processing on the current data to be processed, either the data has been synchronized to the target database or a duplicate of it already existed in the target database before synchronization. At this point the data lock of the current data set is deleted, so that when a subsequent thread acquires another piece of data to be processed in the current data set and checks whether a data lock corresponding to the set exists, it finds that no such lock exists, i.e., the lock activation condition is not satisfied. In this way, multiple threads are allowed to process the remaining data in the current data set in parallel. Moreover, since the data has already been synchronized to the target database (or a duplicate already existed there), when other threads processing in parallel compare their data with the data in the target database, they will find that a duplicate already exists and take corresponding measures (for example, skipping the synchronization, or keeping only one of the data to be processed and the duplicate), thereby avoiding two duplicate records existing in the target database at the same time.
Step S13: in response to the absence of the current data lock, performing data processing on the current data to be processed through the current thread, or ending the current thread.
This corresponds to the first state of the data lock and indicates that a piece of data in the data set has already been synchronized, so the data lock need not be activated and multiple pieces of data to be processed in the data set are allowed to be processed by multiple threads in parallel. On one hand, since one piece of data in the data set has already been synchronized, the thread can simply be ended; on the other hand, data processing can still be carried out, and if the transmitted and stored data is found to contain a duplicate of the current data to be processed, corresponding measures can be taken to avoid duplication.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects. Duplicate data, i.e., data that causes thread insecurity, is placed in the same data set; whether the data lock is activated is decided according to whether the lock activation condition is satisfied; and data processing is carried out in different ways according to the different lock states. In subsequent data processing, different data sets can therefore be processed by multiple threads in parallel, the remaining data in a data set can be processed by multiple threads in parallel after one piece of data in that set has been synchronized, and while no data in the set has been synchronized only the current thread that acquired the current data to be processed is allowed to process it, thereby ensuring thread safety while maximizing parallel processing.
In one embodiment, before acquiring the data to be processed in step S11, as shown in fig. 3, the multi-thread data processing method 10 may further include steps S14-S16, wherein:
step S14: and reading the data to be processed, and acquiring the structured attribute information of the data to be processed.
The data to be processed may be read in batches. The data to be processed may be structured, semi-structured, or unstructured, as long as it has corresponding structured attribute information. For example, portrait data in a portrait database corresponds to structured attribute information including the name, certificate, personnel number, gender, and structured information of the portrait. When the data to be processed is semi-structured or unstructured, the corresponding structured attribute information can be extracted with data extraction techniques; for example, in unstructured web page text, the URL of the page can be extracted with a regular expression and used as structured attribute information. By processing the data in this way, the structured attribute information corresponding to the data to be processed is obtained, and the data object can then be operated on through the structured attribute information identifier.
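As a small hedged illustration of the extraction step just mentioned (the pattern and class name are assumptions, not taken from the patent), a regular expression can pull a URL out of unstructured web page text for use as structured attribute information:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical sketch: extract the first URL found in unstructured page text.
    // The pattern is deliberately simplified and only illustrative.
    class UrlExtractor {
        private static final Pattern URL = Pattern.compile("https?://[\\w.-]+(?:/[\\w./?%&=-]*)?");

        static String extractUrl(String pageText) {
            Matcher m = URL.matcher(pageText);
            return m.find() ? m.group() : null; // null when no URL-like attribute is present
        }
    }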
Step S15: determining a data set to which the data to be processed belongs according to the structured attribute information, thereby obtaining a plurality of data sets;
the structured attribute information, such as the personnel ID, can identify different personnel data, and the data to be processed can be divided according to the structured attribute information, or the data to be processed can be divided according to the result of certain operation performed on the structured attribute information, so that the repeated data, that is, the data which may cause thread insecurity, can be ensured to be in the same data set.
In an embodiment, as shown in fig. 4, step S15 may further include:
step S151, determining grouping index information of the data to be processed according to the index attribute information in the structured attribute information; the index attribute information is at least one of structured attribute information.
The index attribute information can be set according to the user's needs; the user can define index attribute information over the structured attribute information of the read data. For example, the structured data of a portrait includes attributes A, B, C, D, and so on, and a single piece or several pieces of structured attribute information can be used as the index attribute information. As shown in the table below, the combination of two attributes, the certificate number and the personnel number, can be customized as the index attribute information.
Dimension            Uniqueness
Name
Document
Personnel number
Then, the grouping index information of the data to be processed is determined according to the index attribute information in the structured attribute information. For example, a hash operation is performed on the index attribute information, and the result of the hash operation is used as the grouping index information.
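A minimal sketch of this hashing step, assuming (for illustration only) that the certificate number and personnel number have been chosen as the index attribute information: data with equal index attributes always yields the same grouping key and therefore falls into the same data set.

    import java.util.Objects;

    // Hypothetical grouping-key derivation: hash the chosen index attributes and use the
    // result as the grouping index information (the "key") of the data to be processed.
    class GroupKey {
        static String of(String certificateNumber, String personnelNumber) {
            return Integer.toHexString(Objects.hash(certificateNumber, personnelNumber));
        }
    }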
It should be noted that the choice of index attribute information and the way grouping index information is determined from it ensure that data to be processed with the same grouping index information is classified into one data set and data with different grouping index information into different data sets, and that data classified into the same data set is data that may cause thread insecurity, for example duplicate data corresponding to the same object.
Step S152, grouping the data to be processed with the same grouping index information into a data set.
For example, for a batch of portrait data, the certificate number of the portrait can be chosen as the index attribute information. When the data is read in batches, a hash algorithm is applied to the certificate number of each piece of data to generate a key; the key is the grouping index information of that data, and data to be processed with the same key is classified into one data set. The hash function guarantees that data with the same key falls into the same data set. The correspondence between the grouping index information of a data set and its data lock is stored, which facilitates the subsequent lookup of the data lock; the correspondence can be stored in memory or in a configuration file. When a thread acquires a piece of data to be processed, the same hash function can be used to quickly locate the corresponding data set and determine the data set to which the data belongs, effectively improving data processing efficiency.
Step S16: and generating data locks corresponding to the data sets one by one for the data sets, and storing the corresponding relation between the data sets and the data locks.
In one example, storing the correspondence between the data sets and the data locks includes: storing the correspondence between the grouping index information corresponding to each data set and its data lock.
In one embodiment, a corresponding data lock can be applied for in memory for each data set, dynamically generating an independent lock space. Whether a lock exists and/or acquiring the lock is determined by the grouping index information key: the data locks correspond one to one to the keys, and the keys correspond one to one to the data sets, so each lock is bound to exactly one data set and forms independent lock-space data. Through the hash function, the data can be grouped quickly and data of the same dimension is guaranteed to be in the same group, so data sets that might cause thread insecurity are identified quickly and a thread mutex lock is established for each of them, ensuring that only one thread can operate on the data of a data set at a time. Storing the correspondence between data sets and data locks in memory as key-value pairs effectively improves lookup efficiency and speeds up data processing.
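A hedged Java sketch of this batch preparation step (illustrative names and record fields, not the patent's implementation): group the pending records by grouping key, then create exactly one data lock per key and keep the key-to-lock correspondence in memory as the independent lock space.

    import java.util.List;
    import java.util.Map;
    import java.util.Objects;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantLock;
    import java.util.stream.Collectors;

    // Hypothetical lock-space construction: one data set per grouping key, one lock per data set.
    class LockSpaceBuilder {
        record Pending(String certificateNumber, String personnelNumber) {}

        static Map<String, ReentrantLock> build(List<Pending> pending) {
            Map<String, List<Pending>> dataSets = pending.stream()
                    .collect(Collectors.groupingBy(p ->
                            Integer.toHexString(Objects.hash(p.certificateNumber(), p.personnelNumber()))));
            Map<String, ReentrantLock> lockSpace = new ConcurrentHashMap<>();
            dataSets.keySet().forEach(key -> lockSpace.put(key, new ReentrantLock()));
            return lockSpace; // key -> data lock, one entry per data set
        }
    }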
Further, in one embodiment, acquiring the current data to be processed through the current thread and determining the current data set to which the current data to be processed belongs includes: acquiring current index attribute information of the current data to be processed, and obtaining current grouping index information of the current data to be processed according to the current index attribute information. The method further includes: detecting, according to the current grouping index information, whether a data lock corresponding to the current grouping index information exists. The current index attribute information is at least one item of the structured attribute information of the current data to be processed, and pieces of data to be processed belonging to the same data set have the same grouping index information.
Each piece of data to be processed has corresponding index attribute information. After the current index attribute information of the current data to be processed is obtained, the current grouping index information can be derived from it, i.e., the current data set to which the current data to be processed belongs is determined. Specifically, the same hash function used to classify the data into data sets can be applied to the current index attribute information to compute the corresponding grouping index information, and a lookup can then be performed for a data lock whose key is that grouping index information. In this way, the thread that acquired the data to be processed can quickly determine the current data set to which the data belongs and whether a data lock corresponding to the current grouping index information exists, effectively improving overall data processing efficiency.
In one example, the index attribute information is the personnel number and the personnel name. Thread 1 acquires a piece of data to be processed and determines, from the personnel number and name of that data, the current grouping index information key1 (for example, the data corresponding to key1 belongs to object A). Thread 1 detects that a data lock corresponding to key1 exists in memory and is in the non-activated state, so it activates the lock, then checks whether data corresponding to object A already exists in the base database and, if not, synchronizes the data to be processed into the base database. After the data has been processed, the data lock is set back to the non-activated state (the lock is released) and the correspondence between key1 and the data lock is deleted. Meanwhile, thread 2 acquires another piece of data from the same data set (i.e., its grouping index information is also key1) and detects that the data lock corresponding to the data set exists in memory and is activated, so thread 2 blocks and waits until the lock returns to the non-activated state, and then either processes the data or ends; for example, thread 2 finds that the data of object A already exists in the base database and ends. After the correspondence between key1 and the data lock has been deleted, thread 3 acquires yet another piece of data from the data set (its grouping index information is also key1), detects that no current data lock corresponding to the current data set exists, and directly processes the data or ends the thread; for example, thread 3 finds that the data of object A already exists in the base database and ends. Thus, through the thread mutex lock, data safety is ensured and operations on duplicate data are avoided.
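The following Java sketch puts the three lock states of this example together under simplifying assumptions (a ReentrantLock per grouping key held in a concurrent map; the greater-than-one-element activation condition and the race between lookup and removal are ignored for brevity). It is an illustration of the described behaviour, not the patent's reference implementation.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical per-thread handling of the three lock states described above:
    //  - no lock in the map:        a duplicate was already stored; process directly or end the thread
    //  - lock present, not held:    activate it, process, then release and delete the key -> lock entry
    //  - lock present and held:     block until the holder finishes, then proceed
    class PerSetProcessor {
        private final ConcurrentMap<String, ReentrantLock> lockSpace = new ConcurrentHashMap<>();

        void handle(String groupingKey, Runnable processData) {
            ReentrantLock lock = lockSpace.get(groupingKey);
            if (lock == null) {                 // state 1: no current data lock exists
                processData.run();              // processing still checks for duplicates in the base database
                return;
            }
            lock.lock();                        // state 2: activate; state 3: block until released
            try {
                processData.run();
            } finally {
                lock.unlock();                  // back to the non-activated state
                lockSpace.remove(groupingKey);  // delete the key -> lock correspondence
            }
        }
    }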
In one embodiment, deleting the current data lock in step S123 includes: releasing the current data lock and deleting the correspondence between the current data lock and the current data set.
After the current data lock is released, it is in the non-activated state. After the correspondence between the current data lock and the current data set is deleted, no current data lock corresponding to the current data set exists any more.
In one example, a timed task can also be configured: after the data to be processed has been processed, locks are destroyed periodically according to the configured survival time of the data locks, so as to clean up the memory. Deleting the corresponding data lock, on one hand, releases the lock resource in memory and improves the utilization of system resources; on the other hand, it avoids the thread deadlock that could result from created thread locks never being deleted, effectively improving the processing capacity of the system while ensuring thread safety.
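A hedged sketch of such a timed cleanup task (the class, field names, and survival-time handling are assumptions for illustration): locks whose configured survival time has expired and that are not currently held are destroyed periodically to free memory.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical timed task: periodically destroy expired, idle data locks to clean up memory.
    class LockJanitor {
        private final Map<String, ReentrantLock> lockSpace = new ConcurrentHashMap<>();
        private final Map<String, Long> createdAtMillis = new ConcurrentHashMap<>();
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        void start(long timeToLiveMillis, long periodMillis) {
            scheduler.scheduleAtFixedRate(() -> {
                long now = System.currentTimeMillis();
                createdAtMillis.forEach((key, created) -> {
                    ReentrantLock lock = lockSpace.get(key);
                    if (now - created > timeToLiveMillis && (lock == null || !lock.isLocked())) {
                        lockSpace.remove(key);       // destroy the expired, idle lock
                        createdAtMillis.remove(key);
                    }
                });
            }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        }
    }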
Based on the same inventive concept, FIG. 5 shows a multithreaded data processing apparatus 100. The multithreaded data processing apparatus 100 may include: a data acquisition unit 110, configured to acquire current data to be processed through a current thread and determine a current data set to which the current data to be processed belongs, where the data to be processed that would cause thread insecurity belongs to the same data set; and a data operation unit 120, configured to, in response to a current data lock corresponding to the current data set existing and not being activated, activate the current data lock so that only the current thread is allowed to perform data processing on the data to be processed in the current data set, the data locks corresponding to the data sets one to one, perform data processing on the current data to be processed through the current thread, and delete the current data lock after the data processing of the current data to be processed is finished. The data operation unit 120 is further configured to, in response to the current data lock not existing, perform data processing on the current data to be processed through the current thread or end the current thread.
In one embodiment, the data operation unit 120 is further configured to: activate the current data lock in response to the current data lock corresponding to the current data set existing, the current data lock not being activated, and the number of pieces of data to be processed in the current data set being greater than 1.
In one embodiment, performing data processing on the current data to be processed includes: determining whether the transmitted and stored data contains duplicate data of the current data to be processed, the duplicate data being data that belongs to the same data set as the current data to be processed; and, if no duplicate data exists, transmitting and storing the current data to be processed.
In one embodiment, performing data processing on the current data to be processed further includes: if duplicate data exists, comparing the update time and/or the image quality of the current data to be processed with those of the duplicate data, and, if the update time of the current data to be processed is later than that of the duplicate data and/or its image quality is better than that of the duplicate data, transmitting and storing the current data to be processed, where storing the current data to be processed includes replacing the duplicate data with the current data to be processed.
In one embodiment, as shown in FIG. 6, the apparatus 100 further includes: a batch reading unit 130, configured to read the data to be processed and obtain structured attribute information of the data to be processed; a data dividing unit 140, configured to determine, according to the structured attribute information, the data set to which each piece of data to be processed belongs, thereby obtaining a plurality of data sets; and a data lock generating unit 150, configured to generate data locks in one-to-one correspondence with the data sets and store the correspondence between the data sets and the data locks.
In an embodiment, the data dividing unit 140 is further configured to: determine grouping index information of the data to be processed according to index attribute information in the structured attribute information, where the index attribute information is at least one item of the structured attribute information; and classify data to be processed having the same grouping index information into one data set. Storing the correspondence between the data sets and the data locks includes: storing the correspondence between the grouping index information corresponding to each data set and the data lock.
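A minimal sketch of this division step in Java, assuming the index attribute is a person ID taken from the structured attributes and the grouping key is derived from its hash; the PendingData fields and class names are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of dividing a batch into data sets by grouping index information:
// records with the same grouping key end up in the same data set.
final class DataPartitioner {
    static final class PendingData {
        String personId;   // index attribute taken from the structured attributes
        byte[] image;      // payload, e.g. a base database image
    }

    static Map<String, List<PendingData>> partition(List<PendingData> batch) {
        return batch.stream().collect(Collectors.groupingBy(
                d -> Integer.toHexString(d.personId.hashCode())));   // grouping key
    }
}
```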
In an embodiment, obtaining the current data to be processed through the current thread and determining the current data set to which the current data to be processed belongs includes: obtaining current index attribute information of the current data to be processed; and obtaining current grouping index information of the current data to be processed according to the current index attribute information.
The apparatus 100 is further configured to detect, according to the current grouping index information, whether a data lock corresponding to the current grouping index information exists. The current index attribute information is at least one item of the structured attribute information of the current data to be processed, and data to be processed belonging to the same data set have the same grouping index information.
In one embodiment, deleting the current data lock includes: releasing the current data lock and deleting the correspondence between the current data lock and the current data set.
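A minimal sketch of such a key-to-lock registry, assuming locks are stored in a ConcurrentHashMap; deleting an entry removes the correspondence between the data lock and the data set, while releasing the lock itself is done by the thread that holds it. The LockRegistry name and its methods are assumptions for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a key-to-lock registry: the data lock is looked up by the current
// grouping key, and deleting it removes the key-lock correspondence.
final class LockRegistry {
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    void register(String groupKey) {            // create a lock for a data set
        locks.putIfAbsent(groupKey, new ReentrantLock());
    }

    ReentrantLock find(String groupKey) {       // null when no data lock exists
        return locks.get(groupKey);
    }

    void delete(String groupKey) {              // drop the key-lock correspondence
        locks.remove(groupKey);
    }
}
```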
According to a third aspect of the embodiments of the present disclosure, as shown in fig. 7, there is provided a multithreaded base database warehousing method 20. The method 20 comprises steps S21-S23, as follows.
S21, obtaining current data to be processed through a current thread, and determining a current data set to which the current data to be processed belongs, where data to be processed that would cause thread insecurity belong to the same data set. When the current thread reads a data object to be warehoused, the data set corresponding to that data object needs to be determined first, so that the current thread can judge whether it is about to process a data set containing thread-unsafe data, and thus whether the operation is thread-safe.
In response to a current data lock corresponding to the current data set existing and the current data lock not being activated, step S22 is executed: the current data lock is activated, so that only the current thread is allowed to perform data processing on the data to be processed in the current data set, where the data locks correspond to the data sets one to one; data processing is performed on the current data to be processed through the current thread; and after the data processing of the current data to be processed is finished, the current data lock is deleted. By means of lock activation, only a single thread is allowed to operate on a thread-unsafe data set, which achieves both efficient processing and thread safety. The data processing may further include comparing the update time or the image quality of the current data to be processed with that of the repeated data: if the update time of the current data to be processed is later than that of the repeated data, or its image quality is better, the current data to be processed is warehoused and replaces the repeated data in the base database. In this way the warehousing task can be completed more flexibly, and the data in the base database are guaranteed to be the most recent and of the best image quality.
In response to the absence of the current data lock, step S23 is executed: data processing is performed on the current data to be processed through the current thread, or the current thread ends. The data here are base database data, and the data processing includes warehousing into the base database. When the multithreaded data processing method is applied to base database warehousing, the data to be processed may be base database data containing a base database image, a base database image ID and structured attribute information of the base database image. In this case, deleting the corresponding data lock releases the lock resource and improves the utilization of system resources on one hand, and on the other hand avoids the deadlock risk that arises when a created lock is never deleted, effectively improving the processing capacity of the system while ensuring thread safety.
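A minimal sketch of what such a base database record might look like; the field names are assumptions chosen for illustration rather than the disclosure's data model.

```java
import java.util.Map;

// Sketch of a base database record: an image, its ID and its structured attributes.
final class BaseDbRecord {
    String imageId;                            // base database image ID
    byte[] image;                              // base database image
    Map<String, String> structuredAttributes;  // structured attribute information,
                                               // e.g. person ID, capture time
}
```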
Based on the same inventive concept, fig. 8 shows a multithreaded base database warehousing apparatus 200. The apparatus 200 includes: a data obtaining unit 210, configured to obtain current data to be processed through a current thread and determine a current data set to which the current data to be processed belongs, where data to be processed that would cause thread insecurity belong to the same data set; and a data operation unit 220, configured to, in response to a current data lock corresponding to the current data set existing and the current data lock not being activated, activate the current data lock so that only the current thread is allowed to perform data processing on the data to be processed in the current data set, where the data locks correspond to the data sets one to one; perform data processing on the current data to be processed through the current thread; and delete the current data lock after the data processing of the current data to be processed is finished. The data operation unit 220 is further configured to, in response to the absence of the current data lock, perform data processing on the current data to be processed through the current thread or end the current thread. The data are base database data, and the data processing includes warehousing into the base database.
With regard to the apparatuses in the above embodiments, the specific manner in which each unit performs its operations has been described in detail in the embodiments of the corresponding methods and will not be elaborated here.
Based on the same inventive concept, fig. 9 shows another multithreaded data processing device. The device comprises a data source, a data dimension setting module, an execution module, a dynamic activation module and a data processing module. Data are read from the data source in batches and preprocessed; a dimension is then created and a user-defined dimension is configured in the data dimension setting module, for example setting a personnel number as the data dimension. The execution mode of the program is set in the execution module, including the number of threads and the associated data dimension; whether the lock is destroyed, and the destruction time of the lock, are set in the activation module; and the data service processing mode, such as portrait warehousing or statistics, is set in the data processing module. A configuration sketch along these lines is given below.
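Every field name and default value in this sketch is an assumption for illustration, not an API of the disclosure; it only shows the kind of settings the modules above would expose.

```java
// Illustrative configuration for the modules described above.
final class PipelineConfig {
    String dimensionField = "personId";     // user-defined data dimension
    int threadCount = 8;                    // number of worker threads in the execution module
    boolean destroyLocks = true;            // whether the activation module destroys locks
    long lockSurvivalMillis = 60_000L;      // configured survival time before destruction
    String processingMode = "portraitWarehousing";  // e.g. warehousing or statistics
}
```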
Specifically, the overall operation flow of one embodiment of the multithreaded data processing apparatus of the present disclosure is shown in fig. 10. A task starts by reading data from a person database, for example reading 1000 objects containing images into memory. First, the data dimension setting module associated with the execution module is started to obtain the user-defined dimension. If no user-defined dimension exists, all data from the data source are thread-safe, and service processing is performed directly with multiple threads. If a user-defined dimension exists, the data may contain thread-unsafe records; the data are then divided according to the user-defined dimension: the corresponding attribute value of each record is read, a hash operation is performed to obtain a key, and the data are grouped by key so that records with the same key are stored in one data packet, ensuring that data belonging to the same person fall into the same data set. A corresponding lock is created in memory for each obtained key, and the key-lock pairs are stored in memory. Threads are then created according to the user-defined thread count. Each thread reads a piece of data to be processed, hashes it to obtain the corresponding key value, and looks up whether a lock exists in the memory space according to the key-lock correspondence. When the key exists, the number of records in the data set corresponding to the key is judged: if it is greater than 1, the lock is generated; if it is less than or equal to 1, service processing is performed directly. After the lock is generated, it is activated and competing threads are blocked, so that only a single thread may operate on the data set under that key at a time. If no data belonging to the same data set is detected in the target database, service processing is performed, and after it is finished the lock is released and the key-lock correspondence is deleted; if data belonging to the same data set is detected in the target database, the thread ends, and the lock is released if the current thread holds it in the activated state. The other threads then compete for the lock and continue data processing until all data have been processed. An end-to-end sketch of this flow is given below.
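A minimal end-to-end sketch of this flow in Java, under the assumptions used in the earlier sketches: each record carries a person ID as the user-defined dimension, the grouping key is its hash, a lock is created only for groups holding more than one record, and a worker skips a record whose data set is already present in the target database. All class and method names are illustrative, not the disclosure's API.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch: group a batch by hashed dimension value, create locks only for
// conflicting groups, and let a fixed pool of worker threads compete for them.
final class WarehousingPipeline {
    static final class Record { String personId; byte[] image; }

    static void run(List<Record> batch,
                    int threadCount,
                    Predicate<String> existsInTargetDb,   // duplicate check by group key
                    Consumer<Record> warehouse)           // the actual base-database insert
            throws InterruptedException {
        // 1. Group the batch so that records of the same person share one key.
        Map<String, List<Record>> groups = batch.stream().collect(
                Collectors.groupingBy(r -> Integer.toHexString(r.personId.hashCode())));

        // 2. Create a lock only for keys whose group can actually conflict (size > 1).
        ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();
        groups.forEach((key, records) -> {
            if (records.size() > 1) locks.put(key, new ReentrantLock());
        });

        // 3. Worker threads process the records and compete for the locks.
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        for (Record r : batch) {
            pool.submit(() -> {
                String key = Integer.toHexString(r.personId.hashCode());
                ReentrantLock lock = locks.get(key);
                if (lock == null) {                        // thread-safe data: store directly
                    if (!existsInTargetDb.test(key)) warehouse.accept(r);
                    return;
                }
                lock.lock();                               // activate: one thread per data set
                try {
                    if (!existsInTargetDb.test(key)) warehouse.accept(r);
                    // else a record of this data set is already stored; this thread just ends
                } finally {
                    lock.unlock();                         // let competing threads proceed
                    locks.remove(key, lock);               // drop the key-lock correspondence
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

In this sketch the key-lock correspondence is dropped as soon as a record of the group has been processed; the variant described next instead keeps the locks and relies on timed destruction according to their survival time.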
Specifically, the overall operation flow of another embodiment of the multithreaded data processing apparatus of the present disclosure is shown in fig. 11. In this embodiment, the step of deleting the correspondence between the data lock and the data set may be omitted. A task starts by reading data from a person database, for example reading 1000 objects containing images into memory. First, the data dimension setting module associated with the execution module is started to obtain the user-defined dimension; if no user-defined dimension exists, all data from the data source are thread-safe and service processing is performed directly with multiple threads. If a user-defined dimension exists, the data may contain thread-unsafe records; the data are then divided according to the user-defined dimension: the corresponding attribute value of each record is read, a hash operation is performed to obtain a key, and the data are grouped by key so that records with the same key are stored in one data packet, ensuring that data belonging to the same person fall into the same data set. Meanwhile, the lock is looked up in the in-memory map according to the obtained key, and when the key exists, the data service processing module is activated. When the key does not exist, the number of records in the data set corresponding to the key is judged: if it is greater than 1, a lock is generated; if it is less than or equal to 1, service processing is performed directly. After the lock is generated, the program is activated and blocked, and the data under the key only allow a single thread to perform service processing. The lock is released after the service processing is finished, and the other threads compete for the lock and continue data processing until all data have been processed. Finally, locks are destroyed on a schedule according to their configured survival time, so as to clean up the memory.
As shown in fig. 12, one embodiment of the present disclosure provides an electronic device 400. The electronic device 400 includes a memory 401, a processor 402, and an Input/Output (I/O) interface 403. The memory 401 is used for storing instructions, and the processor 402 is used for calling the instructions stored in the memory 401 to execute the multithreaded data processing method of the embodiments of the present disclosure. The processor 402 is connected to the memory 401 and the I/O interface 403, for example, through a bus system and/or another connection mechanism (not shown). The memory 401 may be used to store programs and data, including the program of the multithreaded data processing method of the embodiments of the present disclosure; the processor 402 executes various functional applications and data processing of the electronic device 400 by running the programs stored in the memory 401.
The processor 402 in the embodiments of the present disclosure may be implemented in at least one hardware form such as a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA) or a Programmable Logic Array (PLA), and may be a Central Processing Unit (CPU) or another processing unit with data processing capability and/or instruction execution capability, or a combination thereof.
Memory 401 in the disclosed embodiments may comprise one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile Memory may include, for example, a Random Access Memory (RAM), a cache Memory (cache), and/or the like. The non-volatile Memory may include, for example, a Read-Only Memory (ROM), a Flash Memory (Flash Memory), a Hard Disk Drive (HDD), a Solid-State Drive (SSD), or the like.
In the embodiments of the present disclosure, the I/O interface 403 may be used to receive input instructions (e.g., numeric or character information, or key signal inputs related to user settings and function control of the electronic device 400), and may also output various information (e.g., images or sounds) to the outside. The I/O interface 403 may comprise one or more of a physical keyboard, function keys (e.g., volume control keys, a power switch), a mouse, a joystick, a trackball, a microphone, a speaker, a touch panel, and the like.
It is to be understood that although operations are depicted in the drawings in a particular order, this is not to be understood as requiring that such operations be performed in the particular order shown or in serial order, or that all illustrated operations be performed, to achieve desirable results. In certain environments, multitasking and parallel processing may be advantageous.
The methods and apparatus related to embodiments of the present disclosure can be implemented with standard programming techniques, using rule-based logic or other logic to accomplish the various method steps. It should also be noted that the words "means" and "module," as used herein and in the claims, are intended to encompass implementations using one or more lines of software code, and/or hardware implementations, and/or equipment for receiving inputs.
Any of the steps, operations, or procedures described herein may be performed or implemented using one or more hardware or software modules, alone or in combination with other devices. In one embodiment, the software modules are implemented using a computer program product comprising a computer readable medium containing computer program code, which is executable by a computer processor for performing any or all of the described steps, operations, or procedures.
The foregoing description of implementations of the present disclosure has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the disclosure. The embodiments were chosen and described in order to explain the principles of the disclosure and its practical application to enable one skilled in the art to utilize the disclosure in various embodiments and with various modifications as are suited to the particular use contemplated.

Claims (12)

1. A method of multithreaded data processing, the method comprising:
reading data to be processed, and acquiring structured attribute information of the data to be processed;
determining a data set to which the data to be processed belongs according to the structured attribute information, thereby obtaining a plurality of data sets;
generating data locks corresponding to the data sets one by one for the data sets, and storing the corresponding relation between the data sets and the data locks;
acquiring current data to be processed through a current thread, and determining a current data set to which the current data to be processed belongs; the data to be processed which causes thread insecurity belong to the same data set;
in response to there being a current data lock corresponding to the current data set and the current data lock is not activated,
activating the current data lock to only allow the current thread to perform data processing on the data to be processed in the current data set, wherein the data lock corresponds to the data set one by one;
performing data processing on the current data to be processed through the current thread;
after the data processing of the current data to be processed is finished, deleting the current data lock;
and responding to the absence of the current data lock, and performing data processing on the current data to be processed through the current thread or finishing the current thread.
2. The method of claim 1, wherein activating the current data lock in response to the presence of a current data lock corresponding to the current data set and the current data lock being inactive comprises:
and in response to the existence of a current data lock corresponding to the current data set, the current data lock is not activated and the number of data to be processed in the current data set is greater than 1, activating the current data lock.
3. The method of claim 1, wherein said performing data processing on said current data to be processed comprises:
judging whether repeated data of the current data to be processed exists in the transmitted and stored data, wherein the repeated data is data which belongs to the same data set with the current data to be processed;
and if the repeated data does not exist, transmitting and storing the current data to be processed.
4. A method for multi-threaded data processing as claimed in claim 3, wherein said data processing of said currently pending data further comprises:
if the repeated data exists, comparing the updating time and/or the image quality of the current data to be processed with the image quality of the repeated data, and if the updating time of the current data to be processed is later than the repeated data and/or the image quality of the current data to be processed is better than the image quality of the repeated data, transmitting and storing the current data to be processed, wherein the storing of the current data to be processed comprises replacing the repeated data with the current data to be processed.
5. The multithreading data processing method of claim 1, wherein the determining, according to the structured attribute information, a data set to which the data to be processed belongs comprises:
determining grouping index information of the data to be processed according to index attribute information in the structured attribute information; wherein the index attribute information is at least one of the structured attribute information;
classifying the data to be processed with the same grouping index information into a data set;
the correspondence between the stored data set and the data lock comprises: and storing the corresponding relation between the grouping index information corresponding to the data set and the data lock.
6. The method of any one of claims 1-5, wherein obtaining current pending data by a current thread, and determining a current data set to which the current pending data belongs, comprises:
acquiring current index attribute information of the current data to be processed;
obtaining the current grouping index information of the current data to be processed according to the current index attribute information;
the method further comprises the following steps:
detecting whether a data lock corresponding to the current grouping index information exists or not according to the current grouping index information;
the current index attribute information is at least one of the structured attribute information of the current data to be processed, and the data to be processed belonging to the same data set have the same grouping index information.
7. The method of any of claims 1-5, wherein deleting the current data lock comprises:
and releasing the current data lock and deleting the corresponding relation between the current data lock and the current data set.
8. A multi-threaded data processing apparatus, the apparatus comprising:
the batch reading unit is used for reading the data to be processed and acquiring the structured attribute information of the data to be processed;
the data dividing unit is used for determining a data set to which the data to be processed belongs according to the structured attribute information, so that a plurality of data sets are obtained;
the data lock generating unit is used for generating data locks corresponding to the data sets one by one for the data sets and storing the corresponding relation between the data sets and the data locks;
the data acquisition unit is used for acquiring current data to be processed through a current thread and determining a current data set to which the current data to be processed belongs; the data to be processed, which cause thread insecurity, belong to the same data set;
a data operation unit for responding to the existence of a current data lock corresponding to the current data set and the current data lock is not activated,
activating the current data lock to only allow the current thread to perform data processing on the data to be processed in the current data set, wherein the data lock corresponds to the data set one by one;
performing data processing on the current data to be processed through the current thread;
deleting the current data lock after the data processing of the current data to be processed is finished;
and the data operation unit is also used for responding to the absence of the current data lock and performing data processing on the current data to be processed through the current thread or finishing the current thread.
9. A multithreading base database warehousing method is characterized by comprising the following steps:
reading data to be processed, and acquiring structured attribute information of the data to be processed;
determining a data set to which the data to be processed belongs according to the structured attribute information, thereby obtaining a plurality of data sets;
generating data locks corresponding to the data sets one by one for the data sets, and storing the corresponding relation between the data sets and the data locks;
acquiring current data to be processed through a current thread, and determining a current data set to which the current data to be processed belongs; the data to be processed which causes thread insecurity belong to the same data set;
in response to there being a current data lock corresponding to the current data set and the current data lock is not activated,
activating the current data lock to only allow the current thread to perform data processing on the data to be processed in the current data set, wherein the data lock corresponds to the data set one by one;
performing data processing on the current data to be processed through the current thread;
after the data processing of the current data to be processed is finished, deleting the current data lock;
responding to the absence of the current data lock, and performing data processing on the current data to be processed through the current thread or finishing the current thread;
and the data is base database data, and the data processing comprises warehousing.
10. A multi-threaded base data warehousing apparatus, the apparatus comprising:
the batch reading unit is used for reading the data to be processed and acquiring the structured attribute information of the data to be processed;
the data dividing unit is used for determining a data set to which the data to be processed belongs according to the structured attribute information, so that a plurality of data sets are obtained;
the data lock generating unit is used for generating data locks corresponding to the data sets one by one for the data sets and storing the corresponding relation between the data sets and the data locks;
the data acquisition unit is used for acquiring current data to be processed through a current thread and determining a current data set to which the current data to be processed belongs; the data to be processed, which cause thread insecurity, belong to the same data set;
a data operation unit for responding to the existence of a current data lock corresponding to the current data set and the current data lock is not activated,
activating the current data lock to only allow the current thread to perform data processing on the data to be processed in the current data set, wherein the data lock corresponds to the data set one by one;
performing data processing on the current data to be processed through the current thread;
deleting the current data lock after the data processing of the current data to be processed is finished;
the data operation unit is further configured to perform data processing on the current data to be processed through the current thread or end the current thread in response to the absence of the current data lock;
and the data is base database data, and the data processing comprises warehousing.
11. An electronic device, comprising:
a memory to store instructions; and
a processor for invoking the memory-stored instructions to perform the multi-threaded data processing method of any of claims 1 to 7 or the multi-threaded base data-binning method of claim 9.
12. A computer-readable storage medium storing instructions which, when executed by a processor, perform the method of multithreaded data processing as set forth in any one of claims 1-7 or the method of multithreaded library data warehousing as set forth in claim 9.
CN202110557810.7A 2021-05-21 2021-05-21 Multithreading data processing method, multithreading base database data storage method and device Active CN113505134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110557810.7A CN113505134B (en) 2021-05-21 2021-05-21 Multithreading data processing method, multithreading base database data storage method and device

Publications (2)

Publication Number Publication Date
CN113505134A CN113505134A (en) 2021-10-15
CN113505134B true CN113505134B (en) 2023-02-24

Family

ID=78008505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110557810.7A Active CN113505134B (en) 2021-05-21 2021-05-21 Multithreading data processing method, multithreading base database data storage method and device

Country Status (1)

Country Link
CN (1) CN113505134B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105159921A (en) * 2015-07-29 2015-12-16 北京奇虎科技有限公司 Method and apparatus for de-duplicating point-of-interest (POI) data in map
CN106445703A (en) * 2016-09-22 2017-02-22 济南浪潮高新科技投资发展有限公司 Method for solving concurrent dirty read prevention in data transmission
CN107085615A (en) * 2017-05-26 2017-08-22 北京奇虎科技有限公司 Duplicated text removal system, method, server and computer-readable storage medium
CN111061740A (en) * 2019-12-17 2020-04-24 北京软通智慧城市科技有限公司 Data synchronization method, equipment and storage medium
CN111708618A (en) * 2020-06-12 2020-09-25 北京思特奇信息技术股份有限公司 Processing method and device based on Java multithreading
CN112256685A (en) * 2020-10-30 2021-01-22 深圳物讯科技有限公司 Spreadsheet-based segmentation de-duplication import method and related product

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9753647B2 (en) * 2015-09-09 2017-09-05 International Business Machines Corporation Deduplicating chunk digests received for chunks in objects provided by clients to store

Also Published As

Publication number Publication date
CN113505134A (en) 2021-10-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant