CN104793997A - Data processing device and method - Google Patents
- Publication number
- CN104793997A (application CN201410023109.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- line
- row
- array structure
- elements
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a data processing device and method. The data processing device and method are used for simplifying the process that a thread processes a large amount of data files, increasing the file processing speed for the thread, reducing the consumption of computer resources and increasing the computation speed of a Hash algorithm. The data processing method comprises the steps that for any one row of data of a file to be processed, the feature value of the row of data is determined according to preset key information of the row of data, and according to the feature value, an element used for storing the identification of the row of data is located in a preset array structure; whether the located element is occupied or not is judged, if yes, the row of data and the row of data corresponding to the identification occupying the element are determined as rows of data meeting preset conditions, and if not, the identification of the row of data is stored in the element.
Description
Technical Field
The present invention relates to the field of communications, and in particular, to a data processing apparatus and method.
Background
When a large amount of data is processed, it usually needs to be deduplicated; that is, two or more associated data segments are searched for according to the key information of each data segment, and the associated data segments are then processed accordingly.
Taking a data audit business process as an example, the process mainly involves a Customer Relationship Management (CRM) system, a Convergent Billing System (CBS) and an audit system. The CRM and the CBS provide a large number of data files. The audit system audits the data files: it searches all the files for two or more related data segments according to the key information of each data segment, analyzes which information is the same and which differs among the related data segments, and compiles the related data segments into report files. Finally, the audit system feeds the report files back to the CRM and the CBS, which correct the data.
Because the volume of data to be deduplicated is large, a hash algorithm is usually adopted to save the software and hardware resources of the computer. A hash algorithm converts an input of any length into a fixed-length output within a specified numerical range. The conversion is a compression mapping: the space occupied by the output is usually far smaller than that occupied by the input, different inputs may produce the same output, and the input cannot be uniquely determined from the output. In brief, a hash algorithm is a function that compresses a message of arbitrary length into a fixed-length message digest. When a large amount of data is deduplicated, compressing the data segments into fixed-length outputs with a hash algorithm simplifies the processing.
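The two properties described above (a fixed output range, and different inputs possibly sharing one output) can be illustrated with a short Python sketch. The use of CRC-32 and the 4-bit output range are assumptions made purely for illustration; the text does not name a specific hash function.

```python
import zlib

def h(data: bytes, bits: int = 4) -> int:
    """Map an input of any length to an integer in [0, 2**bits)."""
    # CRC-32 is an illustrative stand-in for an unspecified hash function.
    return zlib.crc32(data) & ((1 << bits) - 1)

# 17 distinct inputs of different lengths, but only 16 possible outputs,
# so at least two inputs must share an output (pigeonhole principle).
inputs = [b"record-" + str(i).encode() * (i + 1) for i in range(17)]
outputs = [h(x) for x in inputs]

assert all(0 <= y < 16 for y in outputs)   # every output is in the fixed range
assert len(set(outputs)) < len(inputs)     # some outputs coincide
```

Because some outputs coincide, the output alone cannot identify which input produced it.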
At present, simplifying the processing of a large amount of data with the prior art still consumes a large amount of the computer's software and hardware resources, such as memory and Central Processing Unit (CPU) resources.
Disclosure of Invention
The embodiment of the invention provides a data processing device and a data processing method, which are used for simplifying the processing process of a thread on a large number of data files, improving the processing speed of the thread on the files, reducing the consumption of computer resources and improving the calculation speed of a Hash algorithm.
In a first aspect, a data processing apparatus is provided, including:
a determining unit, configured to determine, for any line of data in a file to be processed, a characteristic value of the line of data according to preset key information of the line of data, and to locate, according to the characteristic value, an element used for storing an identifier of the line of data in a preset array structure;
a processing unit, configured to judge whether the element located by the determining unit is occupied; if so, to determine the line of data and the line of data corresponding to the identifier occupying the element as lines of data meeting a preset condition; otherwise, to store the identifier of the line of data in the element.
With reference to the first aspect, in a first possible implementation manner, the array structure is a two-dimensional array structure including row elements and column elements, and the number of the elements in the array structure is not less than the number of rows of data in all files that need to be processed.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the number of rows of data in all files to be processed is $2^{m+n}$, where m and n are natural numbers;
the array structure includes $2^{m}+1$ rows of elements and $2^{n}+1$ columns of elements.
With reference to the first or second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, the determining unit is specifically configured to:
acquiring preset key information x of the line of data in any line of data in a file to be processed, and calculating the x by utilizing a hash function to obtain a characteristic value h (x) of the line of data;
according to the formula $r(x) = h(x) \,\&\, (2^{m+n}-1) \gg n$, locating the row address r(x) of the element used for storing the identification of the row of data in a preset array structure;
according to the formula $c(x) = h(x) \,\&\, (2^{m+n}-1) \,\&\, (2^{n}-1)$, locating the column address c(x) of the element used for storing the identification of the row of data in the preset array structure;
where $2^{m+n}$ is the number of lines of data in all files to be processed, and m and n are natural numbers.
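As a minimal sketch, the two address formulas can be written in Python as follows. The function name `locate` is hypothetical, and the sketch assumes that the mask $2^{m+n}-1$ is applied before the right shift by n; since Python (like C) gives `>>` higher precedence than `&`, this reading needs explicit parentheses.

```python
from typing import Tuple

def locate(h_x: int, m: int, n: int) -> Tuple[int, int]:
    """Return (row address, column address) for the feature value h_x.

    Assumed reading: r = (h & (2**(m+n) - 1)) >> n, mask first, then shift.
    """
    low_bits = h_x & ((1 << (m + n)) - 1)   # keep the low m+n bits of h(x)
    row = low_bits >> n                     # upper m of those bits
    col = low_bits & ((1 << n) - 1)         # lower n of those bits
    return row, col

# With m = 1 and n = 3, only the low 4 bits of the feature value matter.
print(locate(0b0110, 1, 3))    # (0, 6)
print(locate(0b1111, 1, 3))    # (1, 7)
print(locate(0b10110, 1, 3))   # (0, 6): bit 4 is masked off
```

Under this reading the row address lies in 0 to $2^{m}-1$ and the column address in 0 to $2^{n}-1$, so both always fall inside an array with $2^{m}+1$ rows and $2^{n}+1$ columns.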
With reference to the first aspect, in a fourth possible implementation manner, the array structure is a one-dimensional array structure including 1 row of elements and $2^{n}+1$ columns of elements, where n is a natural number.
With reference to the first aspect, in a fifth possible implementation manner, the identifier of the line data includes a number of a file where the line data is located and a line number of the line data in the file.
With reference to the first aspect, in a sixth possible implementation manner, the apparatus further includes:
the message generating unit is used for adding the related information of the line data meeting the preset conditions, which is determined by the processing unit, into the message; the related information comprises key information and identification of the line data meeting the preset condition.
With reference to the first possible implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, the determining unit and the processing unit respectively employ multiple threads to process line data in multiple files in parallel, where one thread processes line data in one file.
With reference to the seventh possible implementation manner of the first aspect, in an eighth possible implementation manner of the first aspect, when it is determined that one thread accesses one element in the array structure, the determining unit and the processing unit respectively prohibit other threads from accessing the row element where the element is located.
In a second aspect, a data processing method is provided, including:
determining, for any line of data in a file to be processed, a characteristic value of the line of data according to preset key information of the line of data, and locating, according to the characteristic value, an element used for storing an identifier of the line of data in a preset array structure;
judging whether the positioned element is occupied, if so, determining the line data and the line data corresponding to the identifier occupying the element as the line data meeting the preset condition; otherwise, the identification of the line of data is stored in the element.
With reference to the second aspect, in a first possible implementation manner, the array structure is a two-dimensional array structure including row elements and column elements, and the number of the elements in the array structure is not less than the number of rows of data in all files that need to be processed.
With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the number of rows of data in all files to be processed is $2^{m+n}$, where m and n are natural numbers;
the array structure includes $2^{m}+1$ rows of elements and $2^{n}+1$ columns of elements.
With reference to the first or second possible implementation manner of the second aspect, in a third possible implementation manner of the second aspect, determining a feature value of any line of data in a file to be processed according to preset key information of the line of data, and locating, according to the feature value, an element used for storing an identifier of the line of data in a preset array structure, includes:
acquiring preset key information x of the line of data in any line of data in a file to be processed, and calculating the x by utilizing a hash function to obtain a characteristic value h (x) of the line of data;
according to the formula $r(x) = h(x) \,\&\, (2^{m+n}-1) \gg n$, locating the row address r(x) of the element used for storing the identification of the row of data in a preset array structure;
according to the formula $c(x) = h(x) \,\&\, (2^{m+n}-1) \,\&\, (2^{n}-1)$, locating the column address c(x) of the element used for storing the identification of the row of data in the preset array structure;
where $2^{m+n}$ is the number of lines of data in all files to be processed, and m and n are natural numbers.
With reference to the second aspect, in a fourth possible implementation manner, the array structure is a one-dimensional array structure including 1 row of elements and $2^{n}+1$ columns of elements, where n is a natural number.
With reference to the second aspect, in a fifth possible implementation manner, the identifier of the line data includes a number of a file where the line data is located and a line number of the line data in the file.
With reference to the second aspect, in a sixth possible implementation manner, the method further includes:
adding related information of the line data meeting the preset condition into the message; the related information comprises key information and identification of the line data meeting the preset condition.
With reference to the first possible implementation manner of the second aspect, in a seventh possible implementation manner of the second aspect, multiple threads are used to process line data in multiple files in parallel, where one thread processes line data in one file.
With reference to the seventh possible implementation manner of the second aspect, in an eighth possible implementation manner of the second aspect, when one thread accesses one element in the array structure, other threads are prohibited from accessing the row element where the element is located.
In the data processing method provided by the embodiment of the invention, the thread determines the characteristic value of the line data according to the key information of the preset line data, positions the element used for storing the identification of the line data in the preset array structure according to the characteristic value, and judges whether the positioned element is occupied or not so as to determine the line data meeting the preset condition, thereby simplifying the processing process of the thread on a large number of data files, improving the processing speed of the thread on the files, and reducing the consumption of computer resources.
Drawings
FIG. 1 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a data processing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating the flow in which an audit system processes data in files by using multiple threads according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a preset array structure according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating files processed by threads according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of line data in a file processed by a thread according to an embodiment of the present invention.
Detailed Description
The invention provides a data processing device and a data processing method, which are used for simplifying the processing process of a thread on a large number of data files through a Hash algorithm, improving the processing speed of the thread on the files, reducing the consumption of computer resources and improving the calculation speed of the Hash algorithm.
As shown in fig. 1, an embodiment of the present invention provides a data processing apparatus, including:
the determining unit 11 is configured to determine a feature value of the line of data according to preset key information of the line of data in the file to be processed, and locate an element, which is used for storing an identifier of the line of data, in a preset array structure according to the feature value;
the processing unit 12 is configured to determine whether the element located by the determining unit 11 is occupied, and if so, determine the line of data and the line of data corresponding to the identifier that occupies the element as the line of data that meets the preset condition; otherwise, the identification of the line of data is stored in the element.
Preferably, the array structure is a two-dimensional array structure including row elements and column elements, and the number of the elements in the array structure is not less than the number of rows of data in all files to be processed.
Preferably, the number of rows of data in all files to be processed is $2^{m+n}$, where m and n are natural numbers;
the array structure includes $2^{m}+1$ rows of elements and $2^{n}+1$ columns of elements.
Preferably, the determining unit 11 is specifically configured to:
acquiring preset key information x of the line of data in any line of data in a file to be processed, and calculating the x by utilizing a hash function to obtain a characteristic value h (x) of the line of data;
according to the formula $r(x) = h(x) \,\&\, (2^{m+n}-1) \gg n$, locating the row address r(x) of the element used for storing the identification of the row of data in a preset array structure;
according to the formula $c(x) = h(x) \,\&\, (2^{m+n}-1) \,\&\, (2^{n}-1)$, locating the column address c(x) of the element used for storing the identification of the row of data in the preset array structure;
where $2^{m+n}$ is the number of lines of data in all files to be processed, and m and n are natural numbers.
Preferably, the array structure is a one-dimensional array structure composed of 1 row of elements and $2^{n}+1$ columns of elements, where n is a natural number.
Preferably, the identification of the line data includes the number of the file where the line data is located and the line number of the line data in the file.
Preferably, the apparatus further comprises:
the message generating unit 13 is configured to add the relevant information of the line data meeting the preset condition, which is determined by the processing unit 12, to the message; the related information comprises key information and identification of the line data meeting the preset condition.
Preferably, the determining unit 11 and the processing unit 12 respectively employ a plurality of threads to process the line data in a plurality of files in parallel, wherein one thread processes the line data in one file.
Preferably, when determining that a thread accesses an element in the array structure, the determining unit 11 and the processing unit 12 prohibit other threads from accessing the row element where the element is located.
Specifically, the determining unit 11, the processing unit 12 and the message generating unit 13 may be implemented by an entity such as a processor, and the present invention is not limited to the entity implementing these modules.
As shown in fig. 2, an embodiment of the present invention provides a data processing apparatus, including:
the processor 21 is configured to determine a feature value of any line of data in a file to be processed according to preset key information of the line of data, locate an element, which is used for storing an identifier of the line of data, in a preset array structure according to the feature value, and determine whether the located element is occupied, and if so, determine the line of data and the line of data corresponding to the identifier of the occupied element as the line of data meeting a preset condition; otherwise, the identification of the line of data is stored in the element.
And the memory 22 is used for storing preset key information of each row of data and storing a preset array structure and relevant information thereof.
Preferably, the array structure is a two-dimensional array structure including row elements and column elements, and the number of the elements in the array structure is not less than the number of rows of data in all files to be processed.
Preferably, the number of rows of data in all files to be processed is $2^{m+n}$, where m and n are natural numbers;
the array structure includes $2^{m}+1$ rows of elements and $2^{n}+1$ columns of elements.
Preferably, when determining, for any line of data in the file to be processed, the feature value of the line of data according to the preset key information of the line of data, and locating, according to the feature value, the element used for storing the identifier of the line of data in the preset array structure, the processor 21 is specifically configured for:
acquiring preset key information x of the line of data in any line of data in a file to be processed, and calculating the x by utilizing a hash function to obtain a characteristic value h (x) of the line of data;
according to the formula $r(x) = h(x) \,\&\, (2^{m+n}-1) \gg n$, locating the row address r(x) of the element used for storing the identification of the row of data in a preset array structure;
according to the formula $c(x) = h(x) \,\&\, (2^{m+n}-1) \,\&\, (2^{n}-1)$, locating the column address c(x) of the element used for storing the identification of the row of data in the preset array structure;
where $2^{m+n}$ is the number of lines of data in all files to be processed, and m and n are natural numbers.
Preferably, the array structure is a one-dimensional array structure composed of 1 row of elements and $2^{n}+1$ columns of elements, where n is a natural number.
Preferably, the identification of the line data includes the number of the file where the line data is located and the line number of the line data in the file.
Preferably, the processor 21 is further configured to add information related to the determined line data meeting the preset condition to the message; the related information comprises key information and identification of the line data meeting the preset condition.
Preferably, the processor 21 processes line data in multiple files in parallel using multiple threads, wherein one thread processes line data in one file.
Preferably, when it is determined that a thread accesses an element in the array structure, processor 21 prohibits other threads from accessing the row element in which the element is located.
As shown in fig. 3, an embodiment of the present invention provides a data processing method, where the method includes:
S31, determining, for any line of data in the file to be processed, a characteristic value of the line of data according to preset key information of the line of data, and locating, according to the characteristic value, an element used for storing the identifier of the line of data in a preset array structure;
S32, judging whether the located element is occupied; if so, determining the line of data and the line of data corresponding to the identifier occupying the element as lines of data meeting the preset condition; otherwise, storing the identifier of the line of data in the element.
The line data identification comprises the serial number of the file where the line data are located and the line number of the line data in the file.
Preferably, before step S31, the method further includes: presetting an array structure;
specifically, the preset array structure is a two-dimensional array structure including row elements and column elements, the elements are used for storing the identifications of row data, and the number of elements in the array structure is not less than the number of rows of data in all files to be processed. For example, when the number of rows of data in all files to be processed is $2^{m+n}$, the array structure includes $2^{m}+1$ rows of elements and $2^{n}+1$ columns of elements, where m and n are natural numbers. Alternatively, the preset array structure is a one-dimensional array structure including 1 row of elements and $2^{n}+1$ columns of elements, where n is a natural number.
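A minimal Python sketch of presetting such an array structure is given below. The helper name `make_array` and the use of `None` to mark an unoccupied element are assumptions for illustration only.

```python
def make_array(m: int, n: int):
    """Create an array with 2**m + 1 rows and 2**n + 1 columns of empty elements."""
    rows, cols = (1 << m) + 1, (1 << n) + 1
    # None marks an unoccupied element (an assumed convention).
    return [[None] * cols for _ in range(rows)]

m, n = 1, 3
array_2d = make_array(m, n)
total_lines = 1 << (m + n)                    # 2**(m+n) = 16 rows of data
capacity = len(array_2d) * len(array_2d[0])   # number of elements

print(len(array_2d), len(array_2d[0]), capacity)   # 3 9 27
assert capacity >= total_lines   # elements are not fewer than rows of data

# One-dimensional variant: a single row of 2**n + 1 elements.
array_1d = [None] * ((1 << n) + 1)
```

Since $(2^{m}+1)(2^{n}+1) > 2^{m+n}$ for all natural numbers m and n, a two-dimensional array sized this way always has at least as many elements as there are rows of data.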
Preferably, the method in step S31 includes:
acquiring preset key information x of the line of data in any line of data in a file to be processed, and calculating the x by utilizing a hash function to obtain a characteristic value h (x) of the line of data;
according to the formula $r(x) = h(x) \,\&\, (2^{m+n}-1) \gg n$, locating the row address r(x) of the element used for storing the identification of the row of data in a preset array structure;
according to the formula $c(x) = h(x) \,\&\, (2^{m+n}-1) \,\&\, (2^{n}-1)$, locating the column address c(x) of the element used for storing the identification of the row of data in the preset array structure;
where $2^{m+n}$ is the number of lines of data in all files to be processed, and m and n are natural numbers.
Preferably, after step S32, the method further includes:
adding the related information of the line data satisfying the preset condition determined in the step S32 to the message; the related information comprises key information and identification of the line data meeting the preset condition.
In the invention, when a large number of files need to be processed, a plurality of threads can be adopted to process the row data in the files in parallel, wherein one thread processes the row data in one file, and at the moment, the preset array structure is a two-dimensional array structure; when one thread accesses one element in the array structure, other threads are prohibited from accessing the row element where the element is located, so that access conflict of multiple threads to the same row element in the array structure can be avoided, and the processing speed of the threads to row data in the file is improved; when the amount of the files needing to be processed is small, one thread can be adopted to process the row data in the files, and the preset array structure is a one-dimensional array structure.
The following describes a data processing method provided by the embodiment of the present invention in detail by taking a process of processing data by the auditing system as an example.
As shown in fig. 4, the process flow of the audit system for processing the row data in the file by using multiple threads is as follows:
S41, presetting an array structure;
specifically, the preset array structure is a two-dimensional array structure including row elements and column elements, the elements are used for storing the identifications of row data, and the number of elements in the array structure is not less than the number of rows of data in all files to be processed. For example, as shown in FIG. 5, when the number of rows of data in all files to be processed is $2^{m+n}$, the array structure includes $2^{m}+1$ rows of elements and $2^{n}+1$ columns of elements, where m and n are natural numbers. Alternatively, the preset array structure is a one-dimensional array structure including 1 row of elements and $2^{n}+1$ columns of elements, where n is a natural number.
Preferably, when a large number of files need to be processed, a plurality of threads can be adopted to process the row data in the files in parallel, wherein one thread processes the row data in one file, and at the moment, the preset array structure is a two-dimensional array structure; when the amount of the files needing to be processed is small, one thread can be adopted to process the row data in the files, and the preset array structure is a one-dimensional array structure.
The number of rows of data in all files to be processed may be determined as follows: the business system automatically analyzes the number of files to be processed and the size of each file; determines the size in bytes of the largest of these files and the minimum number of bytes in a row of data in the files; and estimates the maximum number of rows of data in all files to be processed as (number of bytes in the largest file / minimum number of bytes in a row of data) × number of files to be processed.
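The estimate can be sketched in Python as follows, assuming it is the quotient of the largest file size and the shortest row length, multiplied by the number of files. The function names and the file sizes are hypothetical, and rounding the estimate up to a power of two (to obtain the exponent m + n) is an additional assumption not stated in the text.

```python
import math

def estimate_max_lines(file_sizes, min_line_bytes):
    """Upper bound: (largest file / shortest row) * number of files."""
    per_file = math.ceil(max(file_sizes) / min_line_bytes)
    return per_file * len(file_sizes)

def exponent_for(lines):
    """Smallest k with 2**k >= lines; k would then be split into m + n."""
    return max(lines - 1, 0).bit_length()

sizes = [4096, 10240, 8192]   # hypothetical file sizes in bytes
bound = estimate_max_lines(sizes, min_line_bytes=32)
print(bound)                  # 960
print(exponent_for(bound))    # 10, since 2**10 = 1024 >= 960
```

The bound deliberately overestimates: it treats every file as if it were as large as the largest one and every row as if it were as short as the shortest one.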
S42, determining a characteristic value of any line of data in the file to be processed according to preset key information of the line of data, and positioning an element used for storing a line of data identifier in a preset array structure according to the characteristic value; wherein the identification of the line data includes the number of the file where the line data is located and the line number of the line data in the file, for example, as shown in fig. 6, the number of the currently processed file is from 1 to n, and each line data in each file has its own line number.
Specifically, the method comprises the following steps:
acquiring preset key information x of the line of data in any line of data in a file to be processed, and calculating the x by utilizing a hash function to obtain a characteristic value h (x) of the line of data;
according to the formula $r(x) = h(x) \,\&\, (2^{m+n}-1) \gg n$, locating the row address r(x) of the element used for storing the identification of the row of data in a preset array structure;
according to the formula $c(x) = h(x) \,\&\, (2^{m+n}-1) \,\&\, (2^{n}-1)$, locating the column address c(x) of the element used for storing the identification of the row of data in the preset array structure;
where $2^{m+n}$ is the number of lines of data in all files to be processed, and m and n are natural numbers.
The preset key information of the row data can be set according to the needs of a user or a system or agreed in advance; for example, the line data currently processed by the thread is the first line data in the file shown in fig. 7, and the line data is: 15895868086ABC TOM NanJing, Ln =1 in FIG. 7 represents that the currently processed line data is the first line data in the file, the preset key information x includes 15895868086 (user number) and TOM (user name), the characteristic value h (x) =726346 of the line data is calculated according to the hash function, at this time, see FIG. 4, m =1, n =3,
$r(x) = h(15895868086, \mathrm{TOM}) \,\&\, (2^{1+3}-1) \gg 3 = 0$, i.e., the row 0 element;
$c(x) = h(15895868086, \mathrm{TOM}) \,\&\, (2^{1+3}-1) \,\&\, (2^{3}-1) = 6$, i.e., the column 6 element;
in this way, the characteristic value of the row data is located to the element in row 0 and column 6 of the preset array structure shown in FIG. 5, i.e., the element numbered 6 in FIG. 5.
S43, judging whether the element positioned in the step S42 is occupied or not;
if so, go to step S44; otherwise, go to step S45;
S44, determining the line data and the line data corresponding to the identifier occupying the element as the line data meeting the preset condition, and adding the related information of the line data meeting the preset condition into the message; the related information comprises the key information and identification of the line data meeting the preset condition;
if the element located in step S42 is occupied, it indicates that the eigenvalue of the line data corresponding to the identifier that occupies the element is the same as the eigenvalue of the line data currently processed, that is, the key information is the same, the key information and the identifier of the two line data need to be added to a message, which may be an audit report, or an alarm may be sent for the two line data.
S45, storing the identification of the line of data into the element;
S46, judging whether the line data processing in all the files is finished or not;
if so, go to step S47; otherwise, go to step S48;
S47, sending the message;
S48, searching for unprocessed line data in the files.
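Steps S41 to S48 can be sketched for the single-threaded case as follows. This is an illustration under stated assumptions rather than the patented implementation: the names `audit`, `key_of` and `hash_fn` are hypothetical, the sample lines are invented, and `int` applied to the user number stands in for the unspecified hash function so that the addresses can be checked by hand. The mask is assumed to be applied before the shift.

```python
from typing import Callable, List, Tuple

def audit(files: List[List[str]], m: int, n: int,
          key_of: Callable[[str], str],
          hash_fn: Callable[[str], int]) -> List[Tuple]:
    """Single-threaded pass over steps S41 to S48."""
    rows, cols = (1 << m) + 1, (1 << n) + 1
    array = [[None] * cols for _ in range(rows)]             # S41
    message = []
    for file_no, lines in enumerate(files, start=1):         # S46/S48 loop
        for line_no, line in enumerate(lines, start=1):
            key = key_of(line)
            low = hash_fn(key) & ((1 << (m + n)) - 1)        # S42
            r, c = low >> n, low & ((1 << n) - 1)
            ident = (file_no, line_no)       # file number and line number
            if array[r][c] is not None:                      # S43
                message.append((key, array[r][c], ident))    # S44
            else:
                array[r][c] = ident                          # S45
    return message                                           # S47

files = [
    ["15895868086 TOM NanJing", "13900000001 ANN Beijing"],  # file 1
    ["15895868086 TOM Shanghai"],                            # file 2
]
report = audit(files, m=1, n=3, key_of=lambda s: s.split()[0], hash_fn=int)
print(report)   # [('15895868086', (1, 1), (2, 1))]
```

As in the described method, the sketch treats two lines that land on the same element as matching, so lines with different key information whose feature values share the same low m + n bits would also be reported.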
If multiple threads are used in the above process to handle the row data in multiple files in parallel, each thread can be allocated a fixed file and preferentially processes the row data in the file allocated to it; in this case, the preset array structure is a two-dimensional array structure. When one thread accesses an element in the array structure, other threads are prohibited from accessing the row of elements in which the element is located, which avoids access conflicts among multiple threads on the same row of elements in the array structure and improves the speed at which the threads process row data in the files. For example, two threads process two files in parallel: the first thread processes the first row of data in file 1, the second thread processes the first row of data in file 2, and the specific processing result is shown in Table 1:
TABLE 1
At this time, the first thread uses the feature value of the first row of data in file 1 to locate the element in row 0, column 6 of the preset array structure shown in Fig. 5, i.e. the element labeled 6 in Fig. 5; the second thread uses the feature value of the first row of data in file 2 to locate the element in row 1, column 7 of the preset array structure shown in Fig. 5, i.e. the element labeled 16 in Fig. 5. While the first thread is accessing the element in row 0, column 6 of the array structure, it locks all elements of row 0 and prohibits other threads from accessing them; after the first thread finishes processing that row of data in file 1, all elements of row 0 can again be accessed by other threads. Similarly, while the second thread is accessing the element in row 1, column 7 of the array structure, it locks all elements of row 1.
If a third thread, when processing the first row of data in file 3, uses the feature value of that row to locate the element in row 0, column 6 of the preset array structure shown in Fig. 5, that element is already occupied by the identifier of the first row of data in file 1 processed by the first thread. This indicates that the feature value of the first row of data in file 3 is the same as that of the first row of data in file 1, that is, their key information is the same. The key information and the identifiers of the two lines of data are then added to a message, which may be an audit report; alternatively, an alarm may be issued for the two lines of data.
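The row-level locking described above can be illustrated with one lock per row of the two-dimensional array. The following Python sketch rests on several assumptions that are not taken from the patent: the names `locate` and `process_files`, the exponents `M` and `N`, the use of `zlib.crc32` as the hash function, and the separate lock protecting the report list are all hypothetical choices made for illustration.

```python
import threading
import zlib

M, N = 2, 3                                  # hypothetical exponents m and n
ROWS, COLS = (1 << M) + 1, (1 << N) + 1      # 2^m + 1 rows, 2^n + 1 columns


def locate(key):
    """Map key information to a (row, column) address in the array."""
    h = zlib.crc32(key.encode("utf-8")) & ((1 << (M + N)) - 1)
    return h >> N, h & ((1 << N) - 1)


def process_files(files):
    """Process each file in its own thread; return the collected reports."""
    table = [[None] * COLS for _ in range(ROWS)]
    row_locks = [threading.Lock() for _ in range(ROWS)]   # one lock per row
    reports, reports_lock = [], threading.Lock()

    def worker(file_no, keys):
        for line_no, key in enumerate(keys, start=1):
            row, col = locate(key)
            # While this thread holds the row lock, no other thread can
            # read or write any element of the same row.
            with row_locks[row]:
                occupant = table[row][col]
                if occupant is None:
                    table[row][col] = (file_no, line_no)
            if occupant is not None:
                # Element already occupied: report both line identifiers.
                with reports_lock:
                    reports.append((key, occupant, (file_no, line_no)))

    threads = [threading.Thread(target=worker, args=item)
               for item in files.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return reports
```

Threads whose lines map to different rows never contend for the same lock, which is the benefit the description attributes to the two-dimensional layout. Threads that hit the same row are serialized only for the short read-and-store step.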
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (18)
1. A data processing apparatus, characterized in that the apparatus comprises:
the determining unit is used for determining a characteristic value of the line of data according to preset key information of the line of data in a file to be processed, and positioning an element used for storing a line of data identifier in a preset array structure according to the characteristic value;
the processing unit is used for judging whether the element positioned by the determining unit is occupied or not, and if so, determining the line data and the line data corresponding to the identifier of the occupied element as the line data meeting the preset condition; otherwise, the identification of the line of data is stored in the element.
2. The apparatus of claim 1, wherein the array structure is a two-dimensional array structure comprising row elements and column elements, and the number of elements in the array structure is not less than the number of rows of data in all files that need to be processed.
3. The apparatus of claim 2, wherein the number of rows of data in all files to be processed is $2^{m+n}$, where m and n are natural numbers;
the array structure comprises $2^{m}+1$ rows of elements and $2^{n}+1$ columns of elements.
4. The apparatus according to claim 2 or 3, wherein the determining unit is specifically configured to:
acquiring preset key information x of the line of data in any line of data in a file to be processed, and calculating the x by utilizing a hash function to obtain a characteristic value h (x) of the line of data;
according to the formula $r(x) = h(x) \,\&\, (2^{m+n}-1) \gg n$, locating the row address r(x) of the element used for storing the identification of the row of data in a preset array structure;
according to the formula $c(x) = h(x) \,\&\, (2^{m+n}-1) \,\&\, (2^{n}-1)$, locating the column address c(x) of the element used for storing the identification of the row of data in the preset array structure;
wherein $2^{m+n}$ is the number of lines of data in all files to be processed, and m and n are natural numbers.
5. The apparatus of claim 1, wherein the array structure is a one-dimensional array structure comprising 1 row of elements and $2^{n}+1$ columns of elements, wherein n is a natural number.
6. The apparatus of claim 1, wherein the identification of the line of data comprises a number of a file in which the line of data is located and a line number of the line of data in the file.
7. The apparatus of claim 1, further comprising:
the message generating unit is used for adding the related information of the line data meeting the preset condition, which is determined by the processing unit, into the message; the related information comprises key information and identification of the line data meeting the preset conditions.
8. The apparatus of claim 2, wherein the determining unit and the processing unit respectively process line data in a plurality of files in parallel using a plurality of threads, wherein one thread processes line data in one file.
9. The apparatus of claim 8, wherein the determining unit and the processing unit, when determining that a thread accesses an element in the array structure, prohibit other threads from accessing the row element in which the element is located.
10. A method of data processing, the method comprising:
determining a characteristic value of the line of data according to preset key information of the line of data in a file to be processed, and positioning an element for storing the identification of the line of data in a preset array structure according to the characteristic value;
judging whether the positioned element is occupied, if so, determining the line data and the line data corresponding to the identifier occupying the element as the line data meeting the preset condition; otherwise, the identification of the line of data is stored in the element.
11. The method of claim 10, wherein the array structure is a two-dimensional array structure comprising row elements and column elements, and the number of elements in the array structure is not less than the number of rows of data in all files that need to be processed.
12. The method of claim 11, wherein the number of rows of data in all files that need to be processed is $2^{m+n}$, where m and n are natural numbers;
the array structure comprises $2^{m}+1$ rows of elements and $2^{n}+1$ columns of elements.
13. The method as claimed in claim 11 or 12, wherein determining a characteristic value of any row of data in the file to be processed according to the preset key information of the row of data, and positioning the element for storing the identifier of the row of data in the preset array structure according to the characteristic value comprises:
acquiring preset key information x of the line of data in any line of data in a file to be processed, and calculating the x by utilizing a hash function to obtain a characteristic value h (x) of the line of data;
according to the formula $r(x) = h(x) \,\&\, (2^{m+n}-1) \gg n$, locating the row address r(x) of the element used for storing the identification of the row of data in a preset array structure;
according to the formula $c(x) = h(x) \,\&\, (2^{m+n}-1) \,\&\, (2^{n}-1)$, locating the column address c(x) of the element used for storing the identification of the row of data in the preset array structure;
wherein $2^{m+n}$ is the number of lines of data in all files to be processed, and m and n are natural numbers.
14. The method of claim 10, wherein the array structure is a one-dimensional array structure comprising 1 row of elements and $2^{n}+1$ columns of elements, wherein n is a natural number.
15. The method of claim 10, wherein the identification of the line of data includes a number of a file in which the line of data is located and a line number of the line of data in the file.
16. The method of claim 10, further comprising:
adding related information of the line data meeting the preset condition into the message; the related information comprises key information and identification of the line data meeting the preset conditions.
17. The method of claim 11, wherein the line data in the plurality of files is processed in parallel using a plurality of threads, wherein a thread processes line data in a file.
18. The method of claim 17 wherein when a thread accesses an element in the array structure, other threads are prohibited from accessing the row element in which the element is located.
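As an illustration of the row and column address formulas recited in claims 4 and 13, the following Python sketch computes both addresses from a hash value. It is a sketch under an explicit assumption: the mask $(2^{m+n}-1)$ is taken to be applied before the right shift by n, so that the high m bits of the masked value select the row and the low n bits select the column. The function name `element_address` and the sample values of h, m and n are hypothetical.

```python
def element_address(h, m, n):
    """Return (row, column) for hash value h in an array sized by m and n.

    Assumed grouping: r(x) = (h & (2^(m+n) - 1)) >> n
                      c(x) =  h & (2^(m+n) - 1) & (2^n - 1)
    """
    masked = h & ((1 << (m + n)) - 1)   # keep the low m + n bits of h
    row = masked >> n                   # high m bits give the row address
    col = masked & ((1 << n) - 1)       # low n bits give the column address
    return row, col


# Hypothetical values: m = 2, n = 3, so the masked hash has 5 bits.
print(element_address(6, 2, 3))    # (0, 6)
print(element_address(47, 2, 3))   # (1, 7)
```

With these assumed values, a hash of 6 maps to row 0, column 6, and a hash of 47 (binary 101111, masked to 01111) maps to row 1, column 7.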
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410023109.7A CN104793997B (en) | 2014-01-17 | 2014-01-17 | A kind of data processing equipment and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104793997A (en) | 2015-07-22 |
CN104793997B (en) | 2018-06-26 |
Family
ID=53558810
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410023109.7A Active CN104793997B (en) | 2014-01-17 | 2014-01-17 | A kind of data processing equipment and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104793997B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108460453A (en) * | 2017-02-21 | 2018-08-28 | 阿里巴巴集团控股有限公司 | It is a kind of to be used for data processing method, the apparatus and system that CTC is trained |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101159795A (en) * | 2007-10-25 | 2008-04-09 | 中兴通讯股份有限公司 | Calling list rearrangement method and device |
CN102591855A (en) * | 2012-01-13 | 2012-07-18 | 广州从兴电子开发有限公司 | Data identification method and data identification system |
US20120323717A1 (en) * | 2011-06-16 | 2012-12-20 | OneID, Inc. | Method and system for determining authentication levels in transactions |
Also Published As
Publication number | Publication date |
---|---|
CN104793997B (en) | 2018-06-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106649346B (en) | Data repeatability checking method and device | |
KR101994021B1 (en) | File manipulation method and apparatus | |
CN106407207B (en) | Real-time newly-added data updating method and device | |
CN109614238B (en) | Target object identification method, device and system and readable storage medium | |
CN107515878B (en) | Data index management method and device | |
CN110795499B (en) | Cluster data synchronization method, device, equipment and storage medium based on big data | |
CN111813805A (en) | Data processing method and device | |
CN105843819B (en) | Data export method and device | |
US20140059000A1 (en) | Computer system and parallel distributed processing method | |
CN112487083B (en) | Data verification method and device | |
CN105100050A (en) | User permission management method and system | |
CN109714249B (en) | Method and related device for pushing applet messages | |
CN104572785B (en) | A kind of distributed method and apparatus for creating index | |
CN108319608A (en) | The method, apparatus and system of access log storage inquiry | |
CN109388614A (en) | A kind of method, system and the equipment of catalogue file number quota | |
CN104484392A (en) | Method and device for generating database query statement | |
CN110851419A (en) | Data migration method and device | |
CN106933907B (en) | Processing method and device for data table expansion indexes | |
CN104778252A (en) | Index storage method and index storage device | |
CN109213972B (en) | Method, device, equipment and computer storage medium for determining document similarity | |
CN113127327B (en) | Test method and device for performance test | |
CN108897858A (en) | The appraisal procedure and device, electronic equipment of distributed type assemblies index fragment | |
CN104793997A (en) | Data processing device and method | |
CN110019357B (en) | Database query script generation method and device | |
CN106446080B (en) | Data query method, query service equipment, client equipment and data system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||