CN104572810A - Method for carrying out operation processing on massive files by using bitmap - Google Patents
- Publication number: CN104572810A
- Application number: CN201410652811.XA
- Authority: CN (China)
- Prior art keywords: bitmap, data, carrying, carry out, calculation process
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/06—Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/22—Arrangements for sorting or merging computer data on continuous record carriers, e.g. tape, drum, disc
- G06F7/24—Sorting, i.e. extracting data from one or more carriers, rearranging the data in numerical or other ordered sequence, and rerecording the sorted data on the original carrier or on a different carrier or set of carriers sorting methods in general
Abstract
The invention discloses a method for performing set operations on massive files using a bitmap. The method comprises the following steps: inserting a datum into the bitmap; determining whether a given number exists in the bitmap; printing all data recorded in the bitmap; removing duplicates; and taking a union, an intersection, or a difference. The method provides intersection, union, deduplication, and difference operations on massive data sets and greatly increases data processing speed.
Description
Technical field
The present invention relates to a method for performing operations on massive files using a bitmap.
Background technology
Many software systems share application scenarios similar to the following:
Taking the intersection of two batches of numbers. For example, QQ has 30 million members and 20 million Yellow Diamond users, and a list of users who are both members and Yellow Diamond users must be extracted.
Deduplicating a batch of numbers. For example, an e-commerce website ran a large 10.1 (October 1) sales promotion during which 70 million visits logged in with a QQ number to browse and buy. After the purchase records are exported, the list of QQ numbers must be extracted, with each QQ number appearing only once.
Taking the union of two batches of data. For example, game A has 3 million players and game B has 4.5 million players. A new game of the same type is about to launch, and advertisements are to be delivered to the players of games A and B, but each user may receive the advertisement only once. The player lists of games A and B must therefore be merged into a union and deduplicated.
Extracting the elements that are in one data set but not in another. For example, given 30 million QQ member users and 20 million QQ Yellow Diamond users, a list of users who are QQ members but not Yellow Diamond users must be extracted.
These applications can be abstracted as follows: data set A has M elements and data set B has N elements, and a method is needed for taking the intersection, union, and difference of A and B and for deduplicating a given data set.
In typical software systems, intersection, deduplication, and similar operations on data sets are usually done with a database, or with the tools provided by the operating system: sort for sorting, uniq for deduplication, and comm for extracting an intersection or union.
When the data volume is small, for example when data sets A and B each contain fewer than 100,000 elements, these methods solve the problem. As the data sets grow, however, their running time increases sharply.
Summary of the invention
The object of the present invention is to provide a method for performing operations on massive files using a bitmap.
To achieve this object, the technical solution provided by the invention is a method for performing operations on massive files using a bitmap, comprising the following steps: inserting data into the bitmap; determining whether a given number is in the bitmap; printing all data recorded in the bitmap; deduplication; and taking a union, intersection, or difference.
512 MB of space is needed to store the bitmap data, and all of it must be initialized to zero.
To insert a datum into the bitmap, the corresponding bit is set to 1.
To determine whether a number already exists in the data set, the corresponding bit is checked for the value 1.
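The three basic steps above (zero-initialized storage, insertion, and membership test) can be sketched in C++ as follows. This is a minimal illustration, not the patent's own code: the function names `make_bitmap`, `bitmap_insert`, and `bitmap_contains` are hypothetical, and the bitmap size is a parameter so that the sketch can be tried on a small range instead of the full 512 MB.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper names; the patent text only describes the steps.
// Allocate one bit per representable value, with every bit cleared to 0.
std::vector<std::uint8_t> make_bitmap(std::uint64_t value_count) {
    return std::vector<std::uint8_t>((value_count + 7) / 8, 0);
}

// Insert: set the bit that corresponds to `value` to 1.
// Precondition (not checked): value < value_count used for make_bitmap.
void bitmap_insert(std::vector<std::uint8_t>& bitmap, std::uint32_t value) {
    bitmap[value >> 3] |= static_cast<std::uint8_t>(1u << (value & 7));
}

// Membership test: report whether the bit that corresponds to `value` is 1.
bool bitmap_contains(const std::vector<std::uint8_t>& bitmap, std::uint32_t value) {
    return ((bitmap[value >> 3] >> (value & 7)) & 1u) != 0;
}
```

Calling `make_bitmap(1ULL << 32)` would cover every 32-bit unsigned value and allocate the 512 MB mentioned above; inserting the same value twice leaves the bitmap unchanged, which is what makes deduplication fall out of the structure.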
Compared with the prior art, the method of the present invention inserts data into a bitmap, determines whether a given number is in the bitmap, prints all data recorded in the bitmap, performs deduplication, and takes a union, intersection, or difference. The invention is based on the bit operations of the C/C++ language and provides intersection, union, deduplication, and difference operations on massive data sets, greatly increasing data processing speed.
The present invention will become clearer from the following description taken in conjunction with the accompanying drawings, which illustrate embodiments of the invention.
Brief description of the drawings
Fig. 1 is a schematic diagram of a first embodiment of the method of the present invention for performing operations on massive files using a bitmap.
Fig. 2 is a schematic diagram of a second embodiment of the method of the present invention for performing operations on massive files using a bitmap.
Fig. 3 is a schematic diagram of a third embodiment of the method of the present invention for performing operations on massive files using a bitmap.
Fig. 4 is a schematic diagram of a fourth embodiment of the method of the present invention for performing operations on massive files using a bitmap.
Fig. 5 is a schematic diagram showing that a byte in a computer system consists of 8 bits.
Detailed description of the embodiments
Embodiments of the present invention are now described with reference to the accompanying drawings, in which like reference numerals denote like elements.
The core of the present invention is to use a bitmap to record whether a given datum has appeared in a data set. A bitmap lookup takes constant time, which greatly improves processing efficiency.
As shown in Fig. 5, in a computer system a byte consists of 8 bits, and each bit can be in one of two states, 0 or 1. A byte can therefore record whether each of at most 8 numbers exists; for example, bits 0 to 7 indicate whether the numbers 0 to 7 are present in the data set. If a number is present, the corresponding bit is set to 1.
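As a worked example of the single-byte case just described, the sketch below records the data set {1, 3, 6} in one byte. The function names are hypothetical, and the convention that bit i stands for the number i (counting from the least significant bit) is an assumption consistent with the "bits 0 to 7" wording.

```cpp
#include <cassert>
#include <cstdint>

// Mark a number in the range 0..7 as present inside one byte.
// Assumed convention: bit i (least significant bit = bit 0) stands for number i.
std::uint8_t mark_present(std::uint8_t byte, unsigned number) {
    return static_cast<std::uint8_t>(byte | (1u << number));
}

// Report whether a number in the range 0..7 is recorded in the byte.
bool is_present(std::uint8_t byte, unsigned number) {
    return ((byte >> number) & 1u) != 0;
}

// Build the byte that represents the data set {1, 3, 6}.
std::uint8_t example_byte() {
    std::uint8_t byte = 0;         // all eight numbers absent
    byte = mark_present(byte, 1);
    byte = mark_present(byte, 3);
    byte = mark_present(byte, 6);
    return byte;                   // bits 1, 3 and 6 set: binary 01001010
}
```

The resulting byte has the decimal value 2 + 8 + 64 = 74, and testing any of the other five bits reports the number as absent.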
The commonly used 32-bit unsigned integer ranges from 0 to 4294967295. If each value were recorded with one byte, 4 GB of memory would be needed, whereas on a 32-bit operating system the memory available to an application is generally under 2 GB. Recording each value with one bit needs only 4 GB / 8 = 512 MB, which improves space efficiency.
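The space arithmetic in this paragraph can be checked at compile time. The sketch below is only an illustration: the constant and function names are hypothetical, and the sizes are expressed in binary units (1 MiB = 1024 × 1024 bytes), which is what the "4G" and "512M" figures work out to.

```cpp
#include <cassert>
#include <cstdint>

// Number of distinct 32-bit unsigned values: 0 .. 4294967295.
constexpr std::uint64_t kValueCount = 1ULL << 32;

// Storage needed with one byte per value versus one bit per value.
constexpr std::uint64_t kBytesWithByteFlags = kValueCount;      // 4 GiB
constexpr std::uint64_t kBytesWithBitFlags  = kValueCount / 8;  // 512 MiB

constexpr std::uint64_t kMiB = 1024ULL * 1024ULL;

static_assert(kBytesWithByteFlags == 4096 * kMiB, "one byte per value needs 4 GiB");
static_assert(kBytesWithBitFlags == 512 * kMiB, "one bit per value needs 512 MiB");

// Locate the byte, and the bit inside that byte, for a given value.
constexpr std::uint64_t byte_index(std::uint32_t value) { return value / 8; }
constexpr unsigned bit_offset(std::uint32_t value) { return value % 8; }
```

With this layout the largest value, 4294967295, lands in the last byte of the 512 MiB buffer at bit offset 7.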
Referring to Fig. 1, in the embodiment shown, the intersection of two number sets is computed as follows: first read number set A; call insert for each datum in A to insert it into the bitmap; read number set B; determine whether each datum in B is in the bitmap; if so, it belongs to the intersection; if not, it does not.
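The Fig. 1 flow can be sketched as below. The `Bitmap` type and the function `bitmap_intersection` are hypothetical names introduced for illustration, the sets are passed as in-memory vectors rather than read from files, and `limit` is an assumed parameter that bounds the values so the sketch can run on a small range.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal bitmap over the values 0 .. limit-1 (hypothetical helper type).
// Values are not bounds-checked; callers must keep them below `limit`.
struct Bitmap {
    std::vector<std::uint8_t> bits;
    explicit Bitmap(std::uint64_t limit) : bits((limit + 7) / 8, 0) {}
    void insert(std::uint32_t v) { bits[v >> 3] |= static_cast<std::uint8_t>(1u << (v & 7)); }
    bool contains(std::uint32_t v) const { return ((bits[v >> 3] >> (v & 7)) & 1u) != 0; }
};

// Load A into the bitmap, then keep each element of B whose bit is set.
std::vector<std::uint32_t> bitmap_intersection(const std::vector<std::uint32_t>& a,
                                               const std::vector<std::uint32_t>& b,
                                               std::uint64_t limit) {
    Bitmap bitmap(limit);
    for (std::uint32_t v : a) bitmap.insert(v);
    std::vector<std::uint32_t> result;
    for (std::uint32_t v : b) {
        if (bitmap.contains(v)) result.push_back(v);  // present in both A and B
    }
    return result;
}
```

The result follows the order of B, and a value repeated in B is reported once per occurrence, so B should be deduplicated first if a strict set is required.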
Referring to Fig. 2, in the embodiment shown, the difference of two number sets is computed as follows: first read number set A; call insert for each datum in A to insert it into the bitmap; read number set B; determine whether each datum in B is in the bitmap; if so, it does not belong to the difference; if not, it belongs to the difference. It can thus be seen that computing the difference of two data sets follows essentially the same flow as computing their intersection.
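A sketch of the Fig. 2 flow follows. As the flow is described, the bitmap is built from A and the elements of B are tested against it, so the result is the part of B that is absent from A; swapping the arguments gives the opposite difference. The names `Bitmap` and `bitmap_difference` and the `limit` parameter are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal bitmap over the values 0 .. limit-1 (hypothetical helper type).
// Values are not bounds-checked; callers must keep them below `limit`.
struct Bitmap {
    std::vector<std::uint8_t> bits;
    explicit Bitmap(std::uint64_t limit) : bits((limit + 7) / 8, 0) {}
    void insert(std::uint32_t v) { bits[v >> 3] |= static_cast<std::uint8_t>(1u << (v & 7)); }
    bool contains(std::uint32_t v) const { return ((bits[v >> 3] >> (v & 7)) & 1u) != 0; }
};

// Load A into the bitmap, then keep each element of B whose bit is NOT set.
std::vector<std::uint32_t> bitmap_difference(const std::vector<std::uint32_t>& a,
                                             const std::vector<std::uint32_t>& b,
                                             std::uint64_t limit) {
    Bitmap bitmap(limit);
    for (std::uint32_t v : a) bitmap.insert(v);
    std::vector<std::uint32_t> result;
    for (std::uint32_t v : b) {
        if (!bitmap.contains(v)) result.push_back(v);  // in B but not in A
    }
    return result;
}
```

The only change from the intersection flow is the negated membership test, which matches the observation that the two flows are essentially the same.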
Referring to Fig. 3, in the embodiment shown, the union of two data sets is computed as follows: first read number set A; call insert for each datum in A to insert it into the bitmap; read number set B; call insert for each datum in B to insert it into the bitmap; print the data contained in the whole bitmap.
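The Fig. 3 flow can be sketched as follows. Instead of printing, the hypothetical function `bitmap_union` returns the values whose bits are set, which keeps the sketch easy to check; `Bitmap` and `limit` are likewise illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal bitmap over the values 0 .. limit-1 (hypothetical helper type).
// Values are not bounds-checked; callers must keep them below `limit`.
struct Bitmap {
    std::vector<std::uint8_t> bits;
    explicit Bitmap(std::uint64_t limit) : bits((limit + 7) / 8, 0) {}
    void insert(std::uint32_t v) { bits[v >> 3] |= static_cast<std::uint8_t>(1u << (v & 7)); }
    bool contains(std::uint32_t v) const { return ((bits[v >> 3] >> (v & 7)) & 1u) != 0; }
};

// Insert every element of A and of B, then scan the whole bitmap.
std::vector<std::uint32_t> bitmap_union(const std::vector<std::uint32_t>& a,
                                        const std::vector<std::uint32_t>& b,
                                        std::uint64_t limit) {
    Bitmap bitmap(limit);
    for (std::uint32_t v : a) bitmap.insert(v);
    for (std::uint32_t v : b) bitmap.insert(v);
    std::vector<std::uint32_t> result;
    for (std::uint64_t v = 0; v < limit; ++v) {  // "print the whole bitmap"
        if (bitmap.contains(static_cast<std::uint32_t>(v))) {
            result.push_back(static_cast<std::uint32_t>(v));
        }
    }
    return result;
}
```

Because the final scan walks the bitmap from the lowest value upward, the output is both deduplicated and sorted in ascending order, at the cost of visiting every bit in the range.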
Referring to Fig. 4, in the embodiment shown, a number set is deduplicated as follows: first read number set A; call insert for each datum in A to insert it into the bitmap; print the data contained in the whole bitmap.
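A sketch of the Fig. 4 flow is given below, again with hypothetical names (`Bitmap`, `bitmap_dedup`) and an assumed `limit` parameter bounding the values. The function returns the surviving values rather than printing them.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal bitmap over the values 0 .. limit-1 (hypothetical helper type).
// Values are not bounds-checked; callers must keep them below `limit`.
struct Bitmap {
    std::vector<std::uint8_t> bits;
    explicit Bitmap(std::uint64_t limit) : bits((limit + 7) / 8, 0) {}
    void insert(std::uint32_t v) { bits[v >> 3] |= static_cast<std::uint8_t>(1u << (v & 7)); }
    bool contains(std::uint32_t v) const { return ((bits[v >> 3] >> (v & 7)) & 1u) != 0; }
};

// Insert every element of A; repeated values set the same bit again,
// so scanning the bitmap afterwards yields each value exactly once.
std::vector<std::uint32_t> bitmap_dedup(const std::vector<std::uint32_t>& a,
                                        std::uint64_t limit) {
    Bitmap bitmap(limit);
    for (std::uint32_t v : a) bitmap.insert(v);
    std::vector<std::uint32_t> result;
    for (std::uint64_t v = 0; v < limit; ++v) {
        if (bitmap.contains(static_cast<std::uint32_t>(v))) {
            result.push_back(static_cast<std::uint32_t>(v));
        }
    }
    return result;
}
```

Note that the original input order is lost: the output comes back in ascending numeric order, which is usually acceptable for lists of account numbers.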
Test comparison result is as follows:
Remarks: the test environment is an i5 CPU at 1.8 GHz with 2 GB of memory, running Red Hat Enterprise Linux 6.0, 32-bit x86.
As the test comparison data show, when the data scale is small the present invention has no advantage and may even be less efficient. When the data scale is about 100,000 records, efficiency improves by more than 60%; when the data scale reaches one million, ten million, or even hundreds of millions of records, the present invention improves time efficiency by ten times to several tens of times.
An example of the present invention in practical use is given below:
For example, light breath paddy has 1 million registered members and 2 million WeChat followers, and the following must be extracted:
1. Information on users who are both website members and WeChat followers
2. Information on users who are website members but not WeChat followers
3. Information on users who are WeChat followers but not website members
The method above can be used for this, after which follow-up marketing work can be carried out, for example sending WeChat messages to users who are "WeChat followers but not website members" to encourage them to become members.
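The three extractions in this example can be sketched together. The sketch assumes that each user is identified by a numeric ID that fits in 32 bits and that the two input lists are already deduplicated; neither assumption is stated in the source. `Bitmap`, `Segments`, `segment_users`, and `limit` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal bitmap over the values 0 .. limit-1 (hypothetical helper type).
// Values are not bounds-checked; callers must keep them below `limit`.
struct Bitmap {
    std::vector<std::uint8_t> bits;
    explicit Bitmap(std::uint64_t limit) : bits((limit + 7) / 8, 0) {}
    void insert(std::uint32_t v) { bits[v >> 3] |= static_cast<std::uint8_t>(1u << (v & 7)); }
    bool contains(std::uint32_t v) const { return ((bits[v >> 3] >> (v & 7)) & 1u) != 0; }
};

// The three user groups listed in the example.
struct Segments {
    std::vector<std::uint32_t> both;            // 1. members and followers
    std::vector<std::uint32_t> members_only;    // 2. members, not followers
    std::vector<std::uint32_t> followers_only;  // 3. followers, not members
};

Segments segment_users(const std::vector<std::uint32_t>& members,
                       const std::vector<std::uint32_t>& followers,
                       std::uint64_t limit) {
    Bitmap member_bits(limit);
    Bitmap follower_bits(limit);
    for (std::uint32_t id : members) member_bits.insert(id);
    for (std::uint32_t id : followers) follower_bits.insert(id);

    Segments out;
    for (std::uint32_t id : members) {
        if (follower_bits.contains(id)) out.both.push_back(id);
        else out.members_only.push_back(id);
    }
    for (std::uint32_t id : followers) {
        if (!member_bits.contains(id)) out.followers_only.push_back(id);
    }
    return out;
}
```

Using one bitmap per list doubles the memory compared with the single-bitmap flows, but it lets all three groups be produced with one pass over each list.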
The above discloses only preferred embodiments of the present invention and of course cannot be used to limit the scope of its rights; equivalent variations made in accordance with the claims of the present application therefore still fall within the scope covered by the invention.
Claims (4)
1. A method for performing operations on massive files using a bitmap, characterized by comprising the following steps: inserting data into the bitmap; determining whether a given number is in the bitmap; printing all data recorded in the bitmap; deduplication; and taking a union, intersection, or difference.
2. The method for performing operations on massive files using a bitmap according to claim 1, characterized in that 512 MB of space is needed to store the bitmap data, and all of it must be initialized to zero.
3. The method for performing operations on massive files using a bitmap according to claim 1, characterized in that, to insert a datum into the bitmap, the corresponding bit is set to 1.
4. The method for performing operations on massive files using a bitmap according to claim 1, characterized in that, to determine whether a number already exists in the data set, the corresponding bit is checked for the value 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410652811.XA CN104572810A (en) | 2014-11-17 | 2014-11-17 | Method for carrying out operation processing on massive files by using bitmap |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104572810A (en) | 2015-04-29 |
Family
ID=53088872
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410652811.XA Pending CN104572810A (en) | 2014-11-17 | 2014-11-17 | Method for carrying out operation processing on massive files by using bitmap |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104572810A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6978384B1 (en) * | 2000-09-19 | 2005-12-20 | Verizon Corp. Services Group, Inc. | Method and apparatus for sequence number checking |
CN101676906A (en) * | 2008-09-18 | 2010-03-24 | 中兴通讯股份有限公司 | Method of managing memory database space by using bitmap |
CN101957998A (en) * | 2010-10-09 | 2011-01-26 | 深圳市布易科技有限公司 | Method and device for changing picture expressed by bitmap into picture expressed by vector shadow line |
Non-Patent Citations (2)
Title |
---|
杨康的BLOG: "位映射对大数据排重与排序", 《HTTP://YACARE.ITEYE.COM/BLOG/1969931》 * |
规速: "海量数据处理算法—Bit-Map", 《HTTPS://BLOG.CSDN.NET/HGUISU/ARTICLE/DETAILS/7880288》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106126334A (en) * | 2015-05-04 | 2016-11-16 | 斯特拉托斯卡莱有限公司 | The workload migration of probability data de-duplication perception |
CN110413887A (en) * | 2019-07-23 | 2019-11-05 | 杭州云梯科技有限公司 | A kind of information push frequency limit method and system |
CN110413887B (en) * | 2019-07-23 | 2022-03-25 | 杭州云梯科技有限公司 | Information pushing frequency limiting method and system |
CN111510464A (en) * | 2020-06-24 | 2020-08-07 | 同盾控股有限公司 | Epidemic situation information sharing method and system for protecting user privacy |
CN111510464B (en) * | 2020-06-24 | 2020-10-02 | 同盾控股有限公司 | Epidemic situation information sharing method and system for protecting user privacy |
CN111816302A (en) * | 2020-07-08 | 2020-10-23 | 中润普达(十堰)大数据中心有限公司 | Disease symptom cognitive system based on abnormal wrinkles of human body |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20150429 |