CN104793997A

CN104793997A - Data processing device and method

Info

Publication number: CN104793997A
Application number: CN201410023109.7A
Authority: CN
Inventors: 吴万里
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2014-01-17
Filing date: 2014-01-17
Publication date: 2015-07-22
Anticipated expiration: 2034-01-17
Also published as: CN104793997B

Abstract

The invention provides a data processing device and method for simplifying the processing of a large number of data files by a thread, increasing the speed at which the thread processes files, reducing the consumption of computer resources, and increasing the computation speed of a hash algorithm. The data processing method comprises: for any row of data in a file to be processed, determining the feature value of the row of data according to preset key information of the row of data, and locating, according to the feature value, an element in a preset array structure for storing the identifier of the row of data; and judging whether the located element is occupied; if so, determining the row of data and the row of data corresponding to the identifier occupying the element as rows of data meeting a preset condition; if not, storing the identifier of the row of data in the element.

Description

Data processing device and method

Technical Field

The present invention relates to the field of communications, and in particular, to a data processing apparatus and method.

Background

When a large amount of data is processed, it is usually necessary to deduplicate it; that is, two or more associated data segments are found among the data according to the key information of each data segment, and the associated data segments are then processed accordingly.

Take a data audit business process as an example. The process mainly involves a Customer Relationship Management (CRM) system, a Convergent Billing System (CBS), and an audit system. The CRM and the CBS provide a large number of data files. The audit system audits the data files: it searches all the files for two or more associated data segments according to the key information of each data segment, analyzes which information is the same and which differs among the associated data segments, and compiles the associated data segments into report files. Finally, the audit system feeds the report files back to the CRM and the CBS, which correct the data.

Because the volume of data to be deduplicated is large, a hash algorithm is usually adopted to save the software and hardware resources of the computer. A hash algorithm converts an input of any length into a fixed-length output within a specified numerical range. The conversion is a compression mapping: the space occupied by the output is usually far smaller than that occupied by the input, different inputs can yield the same output, and the input cannot be uniquely determined from the output. In short, a hash algorithm is a function that compresses a message of arbitrary length into a message digest of fixed length. When a large amount of data is deduplicated, compressing the data segments into fixed-length outputs through a hash algorithm simplifies the processing.
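As a minimal illustration of the compression mapping described above, the following Python sketch maps inputs of arbitrary length into a fixed numerical range. The use of `zlib.crc32` and the 16-bit output width are assumptions made for the example; the text does not name a specific hash function or range.

```python
import zlib

BITS = 16  # assumed output width; the source does not fix one


def fixed_length_hash(data: bytes, bits: int = BITS) -> int:
    """Map input of any length to an integer in [0, 2**bits)."""
    return zlib.crc32(data) & ((1 << bits) - 1)


short = fixed_length_hash(b"TOM")
long_ = fixed_length_hash(b"x" * 10_000)

# With more distinct inputs than possible outputs, at least two inputs
# must share an output, so the input cannot be recovered from the output.
outputs = {fixed_length_hash(str(i).encode()) for i in range((1 << BITS) + 1)}
collision_exists = len(outputs) < (1 << BITS) + 1
```

Both the 3-byte and the 10,000-byte input land in the same fixed range, which is the property the deduplication scheme relies on.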

At present, simplifying the processing of a large amount of data with the prior art still consumes a large amount of the computer's software and hardware resources, such as memory and Central Processing Unit (CPU) resources.

Disclosure of Invention

The embodiment of the invention provides a data processing device and a data processing method, which are used for simplifying the processing process of a thread on a large number of data files, improving the processing speed of the thread on the files, reducing the consumption of computer resources and improving the calculation speed of a Hash algorithm.

In a first aspect, a data processing apparatus is provided, including:

a determining unit, configured to determine, for any line of data in a file to be processed, a characteristic value of the line of data according to preset key information of the line of data, and to locate, according to the characteristic value, an element in a preset array structure for storing the identifier of the line of data;

a processing unit, configured to judge whether the element located by the determining unit is occupied; if so, to determine the line of data and the line of data corresponding to the identifier occupying the element as lines of data meeting a preset condition; otherwise, to store the identifier of the line of data in the element.

With reference to the first aspect, in a first possible implementation manner, the array structure is a two-dimensional array structure including row elements and column elements, and the number of the elements in the array structure is not less than the number of rows of data in all files that need to be processed.

With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the number of rows of data in all files to be processed is $2^{m+n}$, where m and n are natural numbers;

the array structure includes $2^m + 1$ rows of elements and $2^n + 1$ columns of elements.

With reference to the first or second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, the determining unit is specifically configured to:

acquiring, for any line of data in a file to be processed, preset key information x of the line of data, and calculating x with a hash function to obtain the characteristic value h(x) of the line of data;

locating, according to the formula $r(x) = h(x) \,\&\, (2^{m+n}-1) \gg n$, the row address r(x) of the element used for storing the identifier of the line of data in the preset array structure;

locating, according to the formula $c(x) = h(x) \,\&\, (2^{m+n}-1) \,\&\, (2^{n}-1)$, the column address c(x) of the element used for storing the identifier of the line of data in the preset array structure;

wherein $2^{m+n}$ is the number of rows of data in all files to be processed, and m and n are natural numbers.
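The row and column formulas can be sketched in Python as follows. This is only an illustration under a stated assumption: the row formula is read as "mask first, then shift", so that the row address is taken from the upper m bits of the masked value and the column address from its lower n bits. The function name `locate` is hypothetical.

```python
from typing import Tuple


def locate(h: int, m: int, n: int) -> Tuple[int, int]:
    """Return (row address, column address) for feature value h.

    Assumption: r(x) is evaluated as (h & (2^(m+n) - 1)) >> n.
    """
    low = h & ((1 << (m + n)) - 1)   # keep the low m+n bits of h
    row = low >> n                   # upper m of those bits
    col = low & ((1 << n) - 1)       # lower n of those bits
    return row, col


# With m = 1 and n = 3, a value whose low four bits are 0110
# is placed in row 0, column 6.
example = locate(0b0110, 1, 3)
```

Under this reading, the row address always lies in [0, 2^m) and the column address in [0, 2^n), so both fit inside an array of $2^m + 1$ rows and $2^n + 1$ columns.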

With reference to the first aspect, in a fourth possible implementation manner, the array structure is a one-dimensional array structure including 1 row of elements and $2^n + 1$ columns of elements, where n is a natural number.

With reference to the first aspect, in a fifth possible implementation manner, the identifier of the line data includes a number of a file where the line data is located and a line number of the line data in the file.

With reference to the first aspect, in a sixth possible implementation manner, the apparatus further includes:

the message generating unit is used for adding the related information of the line data meeting the preset conditions, which is determined by the processing unit, into the message; the related information comprises key information and identification of the line data meeting the preset condition.

With reference to the first possible implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, the determining unit and the processing unit respectively employ multiple threads to process line data in multiple files in parallel, where one thread processes line data in one file.

With reference to the seventh possible implementation manner of the first aspect, in an eighth possible implementation manner of the first aspect, when it is determined that one thread accesses one element in the array structure, the determining unit and the processing unit respectively prohibit other threads from accessing the row element where the element is located.

In a second aspect, a data processing method is provided, comprising:

determining, for any line of data in a file to be processed, a characteristic value of the line of data according to preset key information of the line of data, and locating, according to the characteristic value, an element in a preset array structure for storing the identifier of the line of data;

judging whether the located element is occupied; if so, determining the line of data and the line of data corresponding to the identifier occupying the element as lines of data meeting a preset condition; otherwise, storing the identifier of the line of data in the element.

With reference to the second aspect, in a first possible implementation manner, the array structure is a two-dimensional array structure including row elements and column elements, and the number of the elements in the array structure is not less than the number of rows of data in all files that need to be processed.

With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the number of rows of data in all files to be processed is $2^{m+n}$, where m and n are natural numbers;

the array structure includes $2^m + 1$ rows of elements and $2^n + 1$ columns of elements.

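With reference to the first or second possible implementation manner of the second aspect, in a third possible implementation manner of the second aspect, determining a characteristic value of any line of data in a file to be processed according to preset key information of the line of data, and locating, according to the characteristic value, an element used for storing an identifier of the line of data in a preset array structure, includes:

acquiring, for any line of data in a file to be processed, preset key information x of the line of data, and calculating x with a hash function to obtain the characteristic value h(x) of the line of data;

locating, according to the formula $r(x) = h(x) \,\&\, (2^{m+n}-1) \gg n$, the row address r(x) of the element used for storing the identifier of the line of data in the preset array structure;

locating, according to the formula $c(x) = h(x) \,\&\, (2^{m+n}-1) \,\&\, (2^{n}-1)$, the column address c(x) of the element used for storing the identifier of the line of data in the preset array structure;

wherein $2^{m+n}$ is the number of rows of data in all files to be processed, and m and n are natural numbers.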

With reference to the second aspect, in a fourth possible implementation manner, the array structure is a one-dimensional array structure including 1 row of elements and $2^n + 1$ columns of elements, where n is a natural number.

With reference to the second aspect, in a fifth possible implementation manner, the identifier of the line data includes a number of a file where the line data is located and a line number of the line data in the file.

With reference to the second aspect, in a sixth possible implementation manner, the method further includes:

adding related information of the line data meeting the preset condition into the message; the related information comprises key information and identification of the line data meeting the preset condition.

With reference to the first possible implementation manner of the second aspect, in a seventh possible implementation manner of the second aspect, multiple threads are used to process line data in multiple files in parallel, where one thread processes line data in one file.

With reference to the seventh possible implementation manner of the second aspect, in an eighth possible implementation manner of the second aspect, when one thread accesses one element in the array structure, other threads are prohibited from accessing the row element where the element is located.

In the data processing method provided by the embodiment of the invention, the thread determines the characteristic value of the line data according to the key information of the preset line data, positions the element used for storing the identification of the line data in the preset array structure according to the characteristic value, and judges whether the positioned element is occupied or not so as to determine the line data meeting the preset condition, thereby simplifying the processing process of the thread on a large number of data files, improving the processing speed of the thread on the files, and reducing the consumption of computer resources.

Drawings

Fig. 1 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;

Fig. 3 is a schematic flow chart of a data processing method according to an embodiment of the present invention;

Fig. 4 is a schematic diagram illustrating a processing flow of data in a file by using multiple threads in an auditing system according to an embodiment of the present invention;

Fig. 5 is a diagram illustrating a preset array structure according to an embodiment of the present invention;

Fig. 6 is a diagram illustrating a thread processing file according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of line data in a file processed by a thread according to an embodiment of the present invention.

Detailed Description

The invention provides a data processing device and a data processing method, which are used for simplifying the processing process of a thread on a large number of data files through a Hash algorithm, improving the processing speed of the thread on the files, reducing the consumption of computer resources and improving the calculation speed of the Hash algorithm.

As shown in fig. 1, an embodiment of the present invention provides a data processing apparatus, including:

the determining unit 11 is configured to determine a feature value of the line of data according to preset key information of the line of data in the file to be processed, and locate an element, which is used for storing an identifier of the line of data, in a preset array structure according to the feature value;

the processing unit 12 is configured to determine whether the element located by the determining unit 11 is occupied, and if so, determine the line of data and the line of data corresponding to the identifier that occupies the element as the line of data that meets the preset condition; otherwise, the identification of the line of data is stored in the element.

Preferably, the array structure is a two-dimensional array structure including row elements and column elements, and the number of the elements in the array structure is not less than the number of rows of data in all files to be processed.

Preferably, the number of rows of data in all files to be processed is $2^{m+n}$, where m and n are natural numbers;

the array structure includes $2^m + 1$ rows of elements and $2^n + 1$ columns of elements.

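Preferably, the determining unit 11 is specifically configured to:

acquire, for any line of data in a file to be processed, preset key information x of the line of data, and calculate x with a hash function to obtain the characteristic value h(x) of the line of data;

locate, according to the formula $r(x) = h(x) \,\&\, (2^{m+n}-1) \gg n$, the row address r(x) of the element used for storing the identifier of the line of data in the preset array structure;

locate, according to the formula $c(x) = h(x) \,\&\, (2^{m+n}-1) \,\&\, (2^{n}-1)$, the column address c(x) of the element used for storing the identifier of the line of data in the preset array structure;

wherein $2^{m+n}$ is the number of rows of data in all files to be processed, and m and n are natural numbers.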

Preferably, the array structure is a one-dimensional array structure including 1 row of elements and $2^n + 1$ columns of elements, where n is a natural number.

Preferably, the identification of the line data includes the number of the file where the line data is located and the line number of the line data in the file.

Preferably, the apparatus further comprises:

the message generating unit 13 is configured to add the relevant information of the line data meeting the preset condition, which is determined by the processing unit 12, to the message; the related information comprises key information and identification of the line data meeting the preset condition.

Preferably, the determining unit 11 and the processing unit 12 respectively employ a plurality of threads to process the line data in a plurality of files in parallel, wherein one thread processes the line data in one file.

Preferably, when determining that a thread accesses an element in the array structure, the determining unit 11 and the processing unit 12 prohibit other threads from accessing the row element where the element is located.

Specifically, the determining unit 11, the processing unit 12 and the message generating unit 13 may be implemented by an entity such as a processor, and the present invention is not limited to the entity implementing these modules.

As shown in fig. 2, an embodiment of the present invention provides a data processing apparatus, including:

the processor 21 is configured to determine a feature value of any line of data in a file to be processed according to preset key information of the line of data, locate an element, which is used for storing an identifier of the line of data, in a preset array structure according to the feature value, and determine whether the located element is occupied, and if so, determine the line of data and the line of data corresponding to the identifier of the occupied element as the line of data meeting a preset condition; otherwise, the identification of the line of data is stored in the element.

the memory 22 is configured to store the preset key information of each line of data, and to store the preset array structure and its related information.

The array structure includes $2^m + 1$ rows of elements and $2^n + 1$ columns of elements.

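Preferably, when determining, for any line of data in the file to be processed, the characteristic value of the line of data according to the preset key information of the line of data, and locating, according to the characteristic value, the element used for storing the identifier of the line of data in the preset array structure, the processor 21 is specifically configured to:

acquire preset key information x of the line of data, and calculate x with a hash function to obtain the characteristic value h(x) of the line of data;

locate the row address of the element according to the formula $r(x) = h(x) \,\&\, (2^{m+n}-1) \gg n$, and the column address of the element according to the formula $c(x) = h(x) \,\&\, (2^{m+n}-1) \,\&\, (2^{n}-1)$, where $2^{m+n}$ is the number of rows of data in all files to be processed and m and n are natural numbers.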

Preferably, the processor 21 is further configured to add information related to the determined line data meeting the preset condition to the message; the related information comprises key information and identification of the line data meeting the preset condition.

Preferably, the processor 21 processes line data in multiple files in parallel using multiple threads, wherein one thread processes line data in one file.

Preferably, when it is determined that a thread accesses an element in the array structure, processor 21 prohibits other threads from accessing the row element in which the element is located.

As shown in fig. 3, an embodiment of the present invention provides a data processing method, where the method includes:

S31, for any line of data in the file to be processed, determining a characteristic value of the line of data according to preset key information of the line of data, and locating, according to the characteristic value, an element in a preset array structure for storing the identifier of the line of data;

S32, judging whether the located element is occupied; if so, determining the line of data and the line of data corresponding to the identifier occupying the element as lines of data meeting the preset condition; otherwise, storing the identifier of the line of data in the element.

The line data identification comprises the serial number of the file where the line data are located and the line number of the line data in the file.
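Steps S31 and S32 can be sketched for the single-thread, one-dimensional case as follows. This is a minimal sketch, not the implementation of the embodiment: `zlib.crc32` stands in for the unspecified hash function, and the names `process_line`, `slots`, and the `"number|name"` key format are hypothetical.

```python
import zlib
from typing import List, Optional, Tuple

LineId = Tuple[int, int]  # (file number, line number within that file)


def process_line(slots: List[Optional[LineId]], n: int, key: str,
                 line_id: LineId) -> Optional[Tuple[LineId, LineId]]:
    """S31: hash the key and locate an element; S32: match or store."""
    h = zlib.crc32(key.encode("utf-8"))   # assumed hash function
    index = h & ((1 << n) - 1)            # column address in [0, 2^n)
    occupant = slots[index]
    if occupant is not None:
        # Occupied: both lines are treated as meeting the preset condition.
        return occupant, line_id
    slots[index] = line_id                # Free: store this line's identifier.
    return None


n = 3
slots: List[Optional[LineId]] = [None] * ((1 << n) + 1)  # 1 row, 2^n + 1 columns
first = process_line(slots, n, "15895868086|TOM", (1, 1))
second = process_line(slots, n, "15895868086|TOM", (2, 1))
```

The first call stores the identifier (1, 1); the second call finds the element occupied and returns the pair of identifiers. As in the method described, occupancy alone decides the outcome, so the size of the array relative to the number of lines determines how often unrelated keys share an element.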

Preferably, before step S31, the method further includes: presetting an array structure;

Specifically, the preset array structure is a two-dimensional array structure including row elements and column elements, the elements are used for storing the identifiers of line data, and the number of elements in the array structure is not less than the number of rows of data in all files to be processed. For example, when the number of rows of data in all files to be processed is $2^{m+n}$, the array structure includes $2^m + 1$ rows of elements and $2^n + 1$ columns of elements, where m and n are natural numbers. Alternatively, the preset array structure is a one-dimensional array structure including 1 row of elements and $2^n + 1$ columns of elements, where n is a natural number.
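Presetting the array structure might look like the following sketch, which allocates either form with every element initially unoccupied. The helper names `make_two_dimensional` and `make_one_dimensional`, and the use of `None` to mark an unoccupied element, are assumptions for illustration.

```python
from typing import List, Optional, Tuple

LineId = Tuple[int, int]  # (file number, line number)


def make_two_dimensional(m: int, n: int) -> List[List[Optional[LineId]]]:
    """Array with 2^m + 1 rows and 2^n + 1 columns, all elements empty."""
    rows, cols = (1 << m) + 1, (1 << n) + 1
    return [[None] * cols for _ in range(rows)]


def make_one_dimensional(n: int) -> List[Optional[LineId]]:
    """Array with 1 row and 2^n + 1 columns, all elements empty."""
    return [None] * ((1 << n) + 1)


grid = make_two_dimensional(1, 3)          # 3 rows x 9 columns
capacity = len(grid) * len(grid[0])        # 27 elements
line_count = 1 << (1 + 3)                  # 2^(m+n) = 16 lines
```

For m = 1 and n = 3 the array holds 27 elements, which satisfies the requirement that the number of elements is not less than the $2^{m+n} = 16$ lines to be processed.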

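Preferably, step S31 includes:

acquiring, for any line of data in the file to be processed, preset key information x of the line of data, and calculating x with a hash function to obtain the characteristic value h(x) of the line of data;

locating the row address of the element according to the formula $r(x) = h(x) \,\&\, (2^{m+n}-1) \gg n$, and the column address of the element according to the formula $c(x) = h(x) \,\&\, (2^{m+n}-1) \,\&\, (2^{n}-1)$, where $2^{m+n}$ is the number of rows of data in all files to be processed and m and n are natural numbers.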

Preferably, after step S32, the method further includes:

adding the related information of the line data satisfying the preset condition determined in the step S32 to the message; the related information comprises key information and identification of the line data meeting the preset condition.

In the invention, when a large number of files need to be processed, a plurality of threads can be adopted to process the row data in the files in parallel, wherein one thread processes the row data in one file, and at the moment, the preset array structure is a two-dimensional array structure; when one thread accesses one element in the array structure, other threads are prohibited from accessing the row element where the element is located, so that access conflict of multiple threads to the same row element in the array structure can be avoided, and the processing speed of the threads to row data in the file is improved; when the amount of the files needing to be processed is small, one thread can be adopted to process the row data in the files, and the preset array structure is a one-dimensional array structure.

The following describes a data processing method provided by the embodiment of the present invention in detail by taking a process of processing data by the auditing system as an example.

As shown in fig. 4, the process flow of the audit system for processing the row data in the file by using multiple threads is as follows:

S41, presetting an array structure;

specifically, the preset array structure is a two-dimensional array structure including row elements and column elements, the elements are used for storing the identifiers of line data, and the number of elements in the array structure is not less than the number of rows of data in all files to be processed. For example, as shown in fig. 5, when the number of rows of data in all files to be processed is $2^{m+n}$, the array structure includes $2^m + 1$ rows of elements and $2^n + 1$ columns of elements, where m and n are natural numbers. Alternatively, the preset array structure is a one-dimensional array structure including 1 row of elements and $2^n + 1$ columns of elements, where n is a natural number.

Preferably, when a large number of files need to be processed, a plurality of threads can be adopted to process the row data in the files in parallel, wherein one thread processes the row data in one file, and at the moment, the preset array structure is a two-dimensional array structure; when the amount of the files needing to be processed is small, one thread can be adopted to process the row data in the files, and the preset array structure is a one-dimensional array structure.

The number of rows of data in all files to be processed may be determined as follows: the business system automatically analyzes the number of files to be processed and the size of each file; it determines the size in bytes of the largest of those files and the minimum number of bytes in any row of data in the files; and it estimates the maximum number of rows of data in all files to be processed from the ratio (number of bytes of the largest file) / (minimum number of bytes of a row of data) together with the number of files to be processed.
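The estimate can be sketched as follows. Two points are assumptions rather than statements of the text: that the per-file bound is multiplied by the number of files, and that the exponent sum m + n is then chosen as the smallest value with $2^{m+n}$ not less than the estimate. The function names are hypothetical.

```python
from typing import Sequence


def estimate_total_rows(file_sizes_bytes: Sequence[int], min_row_bytes: int) -> int:
    """Upper bound on total rows: rows in the largest file times file count.

    Assumption: (largest file size / smallest row size) is multiplied
    by the number of files.
    """
    if not file_sizes_bytes or min_row_bytes <= 0:
        raise ValueError("need at least one file and a positive row size")
    per_file = -(-max(file_sizes_bytes) // min_row_bytes)  # ceiling division
    return per_file * len(file_sizes_bytes)


def exponent_sum(total_rows: int) -> int:
    """Smallest m + n such that 2^(m+n) >= total_rows (assumed sizing rule)."""
    return max(total_rows - 1, 0).bit_length()


rows = estimate_total_rows([1000, 4000, 2500], min_row_bytes=40)
bits = exponent_sum(rows)
```

Because the largest file and the shortest row are used, the result overestimates the true row count, which keeps the array at least as large as the data requires.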

S42, for any line of data in the file to be processed, determining a characteristic value of the line of data according to preset key information of the line of data, and locating, according to the characteristic value, an element in the preset array structure for storing the identifier of the line of data; wherein the identifier of the line data includes the number of the file where the line data is located and the line number of the line data in the file. For example, as shown in fig. 6, the files currently being processed are numbered from 1 to n, and each line of data in each file has its own line number.

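Specifically, the method comprises the following steps:

acquiring, for any line of data in the file to be processed, preset key information x of the line of data, and calculating x with a hash function to obtain the characteristic value h(x) of the line of data;

locating the row address of the element according to the formula $r(x) = h(x) \,\&\, (2^{m+n}-1) \gg n$, and the column address of the element according to the formula $c(x) = h(x) \,\&\, (2^{m+n}-1) \,\&\, (2^{n}-1)$, where $2^{m+n}$ is the number of rows of data in all files to be processed and m and n are natural numbers.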

The preset key information of the line data can be set according to the needs of a user or system, or agreed in advance. For example, the line data currently processed by the thread is the first line of data in the file shown in fig. 7: 15895868086ABC TOM NanJing. Ln = 1 in fig. 7 indicates that the currently processed line is the first line of data in the file. The preset key information x includes 15895868086 (user number) and TOM (user name), and the characteristic value of the line data calculated by the hash function is h(x) = 726346. In this case, see fig. 4, m = 1 and n = 3, so that

$r(x) = h(15895868086, \mathrm{TOM}) \,\&\, (2^{1+3}-1) \gg 3 = 0$, i.e. row 0;

$c(x) = h(15895868086, \mathrm{TOM}) \,\&\, (2^{1+3}-1) \,\&\, (2^{3}-1) = 6$, i.e. column 6.

This locates the line of data at the element in row 0, column 6 of the preset array structure shown in fig. 5, i.e. the element numbered 6 in fig. 5.
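The placement for m = 1 and n = 3 can be traced with a short sketch. Everything specific here is an assumption made for illustration: the stand-in feature value is hypothetical and chosen only so that its low four bits are 0110, the mask is applied before the shift, and elements are assumed to be numbered row by row across the $2^n + 1 = 9$ columns.

```python
M, N = 1, 3
COLS = (1 << N) + 1            # 2^n + 1 = 9 columns


def locate(h: int):
    """Row and column address, reading r(x) as (h & mask) >> n (assumption)."""
    low = h & ((1 << (M + N)) - 1)
    return low >> N, low & ((1 << N) - 1)


def element_number(row: int, col: int) -> int:
    """Assumed row-major numbering of elements."""
    return row * COLS + col


h = 0b0110                      # hypothetical stand-in, not a real hash output
row, col = locate(h)            # row 0, column 6
number = element_number(row, col)
```

Under the row-major assumption, row 0, column 6 is element 6, and row 1, column 7 would be element 1 × 9 + 7 = 16.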

S43, judging whether the element positioned in the step S42 is occupied or not;

if so, go to step S44; otherwise, go to step S45;

S44, determining the line of data and the line of data corresponding to the identifier occupying the element as lines of data meeting the preset condition, and adding the related information of these lines of data to the message; the related information comprises the key information and identifiers of the lines of data meeting the preset condition;

if the element located in step S42 is occupied, the characteristic value of the line of data corresponding to the identifier occupying the element is the same as the characteristic value of the line of data currently being processed, that is, their key information is the same. The key information and identifiers of the two lines of data therefore need to be added to a message, which may be an audit report; alternatively, an alarm may be issued for the two lines of data.

S45, storing the identifier of the line of data in the element;

S46, judging whether the line data in all the files has been processed;

if so, go to step S47; otherwise, go to step S48;

S47, sending the message;

S48, searching for unprocessed line data in the files.
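The flow S41 to S48 can be sketched end to end for a single thread as follows. This is a sketch under stated assumptions: `zlib.crc32` stands in for the unspecified hash function, each line is assumed to be whitespace-separated with the user number in the first field and the user name in the third, and the function name `audit` and the message layout are hypothetical.

```python
import zlib
from typing import Dict, List


def audit(files: Dict[int, List[str]], m: int, n: int) -> List[dict]:
    rows, cols = (1 << m) + 1, (1 << n) + 1
    grid = [[None] * cols for _ in range(rows)]        # S41: preset the array
    message: List[dict] = []
    for file_no, lines in files.items():               # S46/S48: next unprocessed line
        for line_no, line in enumerate(lines, start=1):
            fields = line.split()
            key = (fields[0], fields[2])               # assumed key columns
            h = zlib.crc32("|".join(key).encode("utf-8"))
            low = h & ((1 << (m + n)) - 1)
            r, c = low >> n, low & ((1 << n) - 1)      # S42: locate the element
            occupant = grid[r][c]                      # S43: is it occupied?
            if occupant is not None:                   # S44: record both lines
                message.append({"key": key,
                                "ids": [occupant, (file_no, line_no)]})
            else:
                grid[r][c] = (file_no, line_no)        # S45: store the identifier
    return message                                     # S47: message ready to send


sample = {
    1: ["15895868086 ABC TOM NanJing"],
    2: ["15895868086 XYZ TOM SuZhou"],
}
report = audit(sample, m=1, n=3)
```

The two sample lines share the same key fields, so the second one finds its element occupied and the report contains a single entry naming lines (1, 1) and (2, 1).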

If multiple threads are used in the above process to handle the line data in multiple files in parallel, each thread can be assigned a fixed file and preferentially processes the line data in its assigned file; in this case the preset array structure is a two-dimensional array structure. When one thread accesses an element in the array structure, other threads are prohibited from accessing the row of elements in which that element is located, which avoids access conflicts among multiple threads on the same row of the array structure and improves the speed at which the threads process the line data in the files. For example, two threads process two files in parallel, where the first thread processes the first line of data in file 1 and the second thread processes the first line of data in file 2; the specific processing result is shown in table 1:

TABLE 1

At this time, the first thread locates the first line of data in file 1 at the element in row 0, column 6 of the preset array structure shown in fig. 5, i.e. the element numbered 6 in fig. 5; the second thread locates the first line of data in file 2 at the element in row 1, column 7 of the preset array structure shown in fig. 5, i.e. the element numbered 16 in fig. 5. While the first thread accesses the element in row 0, column 6 of the array structure, it locks all the elements in row 0 and prohibits other threads from accessing them; after the first thread finishes processing the first line of data in file 1, all the elements in row 0 can again be accessed by other threads. Similarly, while the second thread accesses the element in row 1, column 7 of the array structure, it locks all the elements in row 1.

If a third thread, when processing the first line of data in file 3, locates that line of data at the element in row 0, column 6 of the preset array structure shown in fig. 5, that element is already occupied by the identifier of the first line of data in file 1 processed by the first thread. This indicates that the characteristic value of the first line of data in file 3 is the same as the characteristic value of the first line of data in file 1, that is, their key information is the same. The key information and identifiers of the two lines of data therefore need to be added to a message, which may be an audit report; alternatively, an alarm may be issued for the two lines of data.
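The three-thread scenario above can be sketched with one lock per row of the array. The text does not specify a locking mechanism, so the use of `threading.Lock`, the names `handle`, `row_locks`, and `matches`, and the direct passing of row and column addresses (instead of hashing real key information) are all assumptions for illustration.

```python
import threading

M, N = 1, 3
ROWS, COLS = (1 << M) + 1, (1 << N) + 1

grid = [[None] * COLS for _ in range(ROWS)]
row_locks = [threading.Lock() for _ in range(ROWS)]   # one lock per row
matches = []
matches_lock = threading.Lock()


def handle(line_id, row, col):
    """Check or fill one element while holding the lock of its whole row."""
    with row_locks[row]:          # other threads cannot access this row now
        occupant = grid[row][col]
        if occupant is None:
            grid[row][col] = line_id
            return
    with matches_lock:            # row lock already released here
        matches.append((occupant, line_id))


# Threads 1 and 2 work on different rows and can proceed in parallel.
t1 = threading.Thread(target=handle, args=((1, 1), 0, 6))
t2 = threading.Thread(target=handle, args=((2, 1), 1, 7))
t1.start(); t2.start(); t1.join(); t2.join()

# Thread 3 arrives later at row 0, column 6 and finds it occupied.
t3 = threading.Thread(target=handle, args=((3, 1), 0, 6))
t3.start(); t3.join()
```

Locking a whole row rather than the whole array lets threads that land in different rows run without waiting on each other, which is the benefit the description attributes to the two-dimensional structure.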

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A data processing apparatus, characterized in that the apparatus comprises:

2. The apparatus of claim 1, wherein the array structure is a two-dimensional array structure comprising row elements and column elements, and the number of elements in the array structure is not less than the number of rows of data in all files that need to be processed.

3. The apparatus of claim 2, wherein the number of rows of data in all files to be processed is $2^{m+n}$, where m and n are natural numbers;

the array structure comprises $2^m + 1$ rows of elements and $2^n + 1$ columns of elements.

4. The apparatus according to claim 2 or 3, wherein the determining unit is specifically configured to:

5. The apparatus of claim 1, wherein the array structure is a one-dimensional array structure comprising 1 row of elements and $2^n + 1$ columns of elements, where n is a natural number.

6. The apparatus of claim 1, wherein the identification of the line of data comprises a number of a file in which the line of data is located and a line number of the line of data in the file.

7. The apparatus of claim 1, further comprising:

the message generating unit is used for adding the related information of the line data meeting the preset condition, which is determined by the processing unit, into the message; the related information comprises key information and identification of the line data meeting the preset conditions.

8. The apparatus of claim 2, wherein the determining unit and the processing unit respectively process line data in a plurality of files in parallel using a plurality of threads, wherein one thread processes line data in one file.

9. The apparatus of claim 8, wherein the determining unit and the processing unit, when determining that a thread accesses an element in the array structure, prohibit other threads from accessing the row element in which the element is located.

10. A method of data processing, the method comprising:

11. The method of claim 10, wherein the array structure is a two-dimensional array structure comprising row elements and column elements, and the number of elements in the array structure is not less than the number of rows of data in all files that need to be processed.

12. The method of claim 11, wherein the number of rows of data in all files that need to be processed is $2^{m+n}$, where m and n are natural numbers;

13. The method as claimed in claim 11 or 12, wherein determining a characteristic value of any row of data in the file to be processed according to the preset key information of the row of data, and positioning the element for storing the identifier of the row of data in the preset array structure according to the characteristic value comprises:

14. The method of claim 10, wherein the array structure is a one-dimensional array structure comprising 1 row of elements and $2^n + 1$ columns of elements, where n is a natural number.

15. The method of claim 10, wherein the identification of the line of data includes a number of a file in which the line of data is located and a line number of the line of data in the file.

16. The method of claim 10, further comprising:

adding related information of the line data meeting the preset condition into the message; the related information comprises key information and identification of the line data meeting the preset conditions.

17. The method of claim 11, wherein the line data in the plurality of files is processed in parallel using a plurality of threads, wherein a thread processes line data in a file.

18. The method of claim 17 wherein when a thread accesses an element in the array structure, other threads are prohibited from accessing the row element in which the element is located.