CN104572810A - Method for carrying out operation processing on massive files by using bitmap - Google Patents

Info

Publication number: CN104572810A
Application number: CN201410652811.XA
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 国睿
Original Assignee: Guang Xigu Development In Science And Technology Co Ltd Of Shenzhen
Current Assignee: Guang Xigu Development In Science And Technology Co Ltd Of Shenzhen (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Priority date: 2014-11-17 (the priority date is an assumption and is not a legal conclusion)
Filing date: 2014-11-17
Publication date: 2015-04-29
Prior art keywords: bitmap; data; carrying; carry out; calculation process

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/06 Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/22 Arrangements for sorting or merging computer data on continuous record carriers, e.g. tape, drum, disc
    • G06F7/24 Sorting, i.e. extracting data from one or more carriers, rearranging the data in numerical or other ordered sequence, and rerecording the sorted data on the original carrier or on a different carrier or set of carriers sorting methods in general

Abstract

The invention discloses a method for performing set operations on massive files using a bitmap. The method comprises the following steps: inserting a datum into the bitmap; testing whether a given number exists in the bitmap; printing all data recorded in the bitmap; removing duplicates; and taking a union, intersection, or difference. The method supports intersection, union, deduplication, and difference operations on massive data sets and greatly increases data processing speed.

Description

Method for performing set operations on massive files using a bitmap
Technical field
The present invention relates to a method for performing set operations on massive files using a bitmap.
Background
Many software systems share application scenarios like the following:
Taking the intersection of two batches of numbers. For example, QQ has 30 million members and 20 million Yellow Diamond users, and the list of users who are both members and Yellow Diamond users must be extracted.
Deduplicating a batch of numbers. For example, an e-commerce website runs an October 1 sales promotion during which 70 million visits browse and purchase after logging in with a QQ number. After the purchase records are exported, the list of QQ numbers is extracted, with each QQ number appearing only once.
Taking the union of two batches of data. For example, game A has 3 million players and game B has 4.5 million. A new game of the same type is about to launch, and advertisements are to be delivered to the players of both games, but each user may receive the advertisement only once. The player lists of games A and B must therefore be merged into a union and deduplicated.
Extracting the part of one data set that is not in another. For example, given 30 million QQ member users and 20 million QQ Yellow Diamond users, extract the list of users who are QQ members but not Yellow Diamond users.
These applications can be abstracted as follows: data set A has M elements and data set B has N elements, and a method is needed to take the intersection, union, and difference of A and B and to deduplicate a given data set.
In typical software systems, such functions are usually implemented with a database, or with the utilities provided by the operating system: sort for sorting, uniq for deduplication, and comm for extracting the intersection or union.
When the data volume is small, for example when data sets A and B each contain fewer than 100,000 entries, these approaches solve the problem. As the data sets grow, however, their running time increases sharply.
Summary of the invention
The object of the present invention is to provide a method for performing set operations on massive files using a bitmap.
To achieve this object, the technical solution provided by the invention is a method for performing set operations on massive files using a bitmap, comprising the following steps: inserting data into the bitmap; testing whether a given number is in the bitmap; printing all data recorded in the bitmap; deduplication; and taking a union, intersection, or difference.
The bitmap requires 512 MB of storage, and all of it must be initialized to zero.
To insert a datum into the bitmap, the corresponding bit is set to 1.
To test whether a number already exists in the data set, the corresponding bit is checked for the value 1.
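The three primitives above (zero-initialized storage, insert, and membership test) can be sketched in C as follows. The patent gives no source code, so the names bitmap_t, bitmap_init, bitmap_insert, bitmap_contains, and bitmap_free are hypothetical, and the nbits parameter is an assumption added so that the sketch can also be tried on a small range instead of the full 512 MB.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type: one bit per value in the range 0 .. nbits-1. */
typedef struct {
    uint8_t *bits;   /* bit array, 8 values per byte */
    uint64_t nbits;  /* number of values covered */
} bitmap_t;

/* Allocate the bitmap with every bit cleared (calloc zero-fills).
   For the full 32-bit range pass nbits = 1ULL << 32, which is 512 MB.
   Returns 1 on success, 0 if the allocation fails. */
int bitmap_init(bitmap_t *bm, uint64_t nbits) {
    bm->nbits = nbits;
    bm->bits = calloc((size_t)((nbits + 7) / 8), 1);
    return bm->bits != NULL;
}

/* Insert: set the bit that corresponds to v. Out-of-range values are ignored. */
void bitmap_insert(bitmap_t *bm, uint32_t v) {
    if (v < bm->nbits)
        bm->bits[v >> 3] |= (uint8_t)(1u << (v & 7));
}

/* Membership test: 1 if the bit that corresponds to v is set, else 0. */
int bitmap_contains(const bitmap_t *bm, uint32_t v) {
    return v < bm->nbits && ((bm->bits[v >> 3] >> (v & 7)) & 1u);
}

/* Release the storage. */
void bitmap_free(bitmap_t *bm) {
    free(bm->bits);
    bm->bits = NULL;
}
```

Here v >> 3 selects the byte and v & 7 selects the bit within it, so both operations take constant time regardless of how many values have been inserted.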
Compared with the prior art, the method of the present invention inserts data into a bitmap, tests whether a given number is in the bitmap, prints all data recorded in the bitmap, deduplicates, and takes unions, intersections, or differences. Built on the bit operations of the C/C++ language, the invention provides intersection, union, deduplication, and difference operations on massive data sets and greatly increases data processing speed.
The present invention will become clearer from the following description taken in conjunction with the accompanying drawings, which illustrate embodiments of the invention.
Brief description of the drawings
Fig. 1 is a schematic diagram of a first embodiment of the method of the present invention for performing set operations on massive files using a bitmap.
Fig. 2 is a schematic diagram of a second embodiment of the method of the present invention for performing set operations on massive files using a bitmap.
Fig. 3 is a schematic diagram of a third embodiment of the method of the present invention for performing set operations on massive files using a bitmap.
Fig. 4 is a schematic diagram of a fourth embodiment of the method of the present invention for performing set operations on massive files using a bitmap.
Fig. 5 is a schematic diagram showing that a byte in a computer system consists of 8 bits.
Detailed description of the embodiments
Embodiments of the present invention are now described with reference to the accompanying drawings, in which like reference numerals denote like elements.
The core of the invention is to use a bitmap to record whether a given datum has appeared in a data set. A bitmap lookup takes constant time, which greatly improves processing efficiency.
As shown in Fig. 5, a byte in a computer system consists of 8 bits, each of which is either 0 or 1. One byte can therefore record the presence of at most 8 numbers: for example, bits 0 through 7 indicate whether each of the numbers 0 through 7 is in the data set. If a number is present, the corresponding bit is set to 1.
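The single-byte case can be illustrated with a short C sketch. The function names byte_mark and byte_has are hypothetical and are used here only for illustration.

```c
#include <stdint.h>

/* Record that the number n (0..7) is present by setting bit n of the byte. */
uint8_t byte_mark(uint8_t b, unsigned n) {
    return (uint8_t)(b | (1u << (n & 7u)));
}

/* Return 1 if the number n (0..7) has been recorded in the byte, else 0. */
int byte_has(uint8_t b, unsigned n) {
    return (int)((b >> (n & 7u)) & 1u);
}
```

Starting from an all-zero byte and marking the numbers 1, 3, and 6 sets bits 1, 3, and 6, which gives the byte value 0x4A (binary 01001010).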
The commonly used 32-bit unsigned integer ranges from 0 to 4294967295. Representing each possible value with one byte would require 4 GB of memory, whereas on a 32-bit operating system an application can generally use no more than 2 GB. Representing each value with one bit requires only 4 GB / 8 = 512 MB, which improves space efficiency.
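The space arithmetic can be checked with two small helper functions. They are illustrative only; the names bytes_one_per_value and bytes_one_bit_per_value are assumptions, not part of the patent.

```c
#include <stdint.h>

/* Storage needed when every possible value gets a whole byte. */
uint64_t bytes_one_per_value(uint64_t nvalues) {
    return nvalues;
}

/* Storage needed when every possible value gets a single bit,
   rounded up to a whole number of bytes. */
uint64_t bytes_one_bit_per_value(uint64_t nvalues) {
    return (nvalues + 7) / 8;
}
```

For the 2^32 possible 32-bit values, the first function yields 4294967296 bytes (4 GB) and the second yields 536870912 bytes (512 MB).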
Referring to Fig. 1, in this embodiment the intersection of two number sets is computed as follows: read number set A; call insert on each datum in A to insert it into the bitmap; read number set B; test whether each datum in B is in the bitmap; if so, it belongs to the intersection; if not, it does not.
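A minimal C sketch of the Fig. 1 flow follows, assuming both sets are already in memory as arrays. The function name bitmap_intersect and the universe parameter (the number of distinct values the bitmap covers) are hypothetical; the patent itself uses the full 32-bit range.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Copy into out every element of b that also occurs in a.
   out must have room for nb elements. Values >= universe are ignored.
   Returns the number of elements written, or -1 if allocation fails. */
int64_t bitmap_intersect(const uint32_t *a, size_t na,
                         const uint32_t *b, size_t nb,
                         uint32_t *out, uint64_t universe) {
    uint8_t *bits = calloc((size_t)((universe + 7) / 8), 1);
    if (bits == NULL)
        return -1;

    /* Step 1: insert every element of A into the bitmap. */
    for (size_t i = 0; i < na; i++)
        if (a[i] < universe)
            bits[a[i] >> 3] |= (uint8_t)(1u << (a[i] & 7));

    /* Step 2: an element of B is in the intersection if its bit is set. */
    int64_t n = 0;
    for (size_t i = 0; i < nb; i++)
        if (b[i] < universe && ((bits[b[i] >> 3] >> (b[i] & 7)) & 1u))
            out[n++] = b[i];

    free(bits);
    return n;
}
```

As in the described flow, the result follows the order of B, and a value repeated in B is reported once per occurrence.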
Referring to Fig. 2, in this embodiment the difference of two number sets is computed as follows: read number set A; call insert on each datum in A to insert it into the bitmap; read number set B; test whether each datum in B is in the bitmap; if so, it does not belong to the difference; if not, it does. The result is therefore the set of elements of B that are not in A. The flows for computing the difference and the intersection of two data sets are thus essentially the same.
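The Fig. 2 flow differs from the intersection only in the final test, as the following C sketch suggests. The function name bitmap_difference and the universe parameter are hypothetical, and the sketch assumes both sets are arrays in memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Copy into out every element of b that does NOT occur in a (B minus A).
   out must have room for nb elements. Values >= universe are ignored.
   Returns the number of elements written, or -1 if allocation fails. */
int64_t bitmap_difference(const uint32_t *a, size_t na,
                          const uint32_t *b, size_t nb,
                          uint32_t *out, uint64_t universe) {
    uint8_t *bits = calloc((size_t)((universe + 7) / 8), 1);
    if (bits == NULL)
        return -1;

    /* Step 1: insert every element of A into the bitmap. */
    for (size_t i = 0; i < na; i++)
        if (a[i] < universe)
            bits[a[i] >> 3] |= (uint8_t)(1u << (a[i] & 7));

    /* Step 2: an element of B is in the difference if its bit is clear. */
    int64_t n = 0;
    for (size_t i = 0; i < nb; i++) {
        if (b[i] >= universe)
            continue;
        if (((bits[b[i] >> 3] >> (b[i] & 7)) & 1u) == 0)
            out[n++] = b[i];
    }

    free(bits);
    return n;
}
```

To obtain the elements of A that are not in B instead, the two arguments can simply be swapped.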
Referring to Fig. 3, in this embodiment the union of two number sets is computed as follows: read number set A; call insert on each datum in A to insert it into the bitmap; read number set B; call insert on each datum in B to insert it into the bitmap; print all the data contained in the bitmap.
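The Fig. 3 flow might look like the following C sketch, which writes the result to an output array rather than printing it. The function name bitmap_union and the universe parameter are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Write the union of a and b into out, each value once, in ascending order.
   out must have room for na + nb elements. Values >= universe are ignored.
   Returns the number of elements written, or -1 if allocation fails. */
int64_t bitmap_union(const uint32_t *a, size_t na,
                     const uint32_t *b, size_t nb,
                     uint32_t *out, uint64_t universe) {
    uint8_t *bits = calloc((size_t)((universe + 7) / 8), 1);
    if (bits == NULL)
        return -1;

    /* Insert A, then B; inserting a value twice leaves its bit at 1. */
    for (size_t i = 0; i < na; i++)
        if (a[i] < universe)
            bits[a[i] >> 3] |= (uint8_t)(1u << (a[i] & 7));
    for (size_t i = 0; i < nb; i++)
        if (b[i] < universe)
            bits[b[i] >> 3] |= (uint8_t)(1u << (b[i] & 7));

    /* Emit every value whose bit is set by scanning the whole bitmap. */
    int64_t n = 0;
    for (uint64_t v = 0; v < universe; v++)
        if ((bits[v >> 3] >> (v & 7)) & 1u)
            out[n++] = (uint32_t)v;

    free(bits);
    return n;
}
```

Because the final scan walks the bitmap from the lowest bit upward, the union comes out sorted and free of duplicates as a side effect.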
Referring to Fig. 4, in this embodiment a number set is deduplicated as follows: read number set A; call insert on each datum in A to insert it into the bitmap; print all the data contained in the bitmap.
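A C sketch of the Fig. 4 flow is given below. The function name bitmap_dedup and the universe parameter are hypothetical, and skipping all-zero bytes during the final scan is an optimization added here, not something the patent describes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Write each distinct value of a into out exactly once, in ascending order.
   out must have room for na elements. Values >= universe are ignored.
   Returns the number of elements written, or -1 if allocation fails. */
int64_t bitmap_dedup(const uint32_t *a, size_t na,
                     uint32_t *out, uint64_t universe) {
    size_t nbytes = (size_t)((universe + 7) / 8);
    uint8_t *bits = calloc(nbytes, 1);
    if (bits == NULL)
        return -1;

    /* Insert every element; repeated values hit an already-set bit. */
    for (size_t i = 0; i < na; i++)
        if (a[i] < universe)
            bits[a[i] >> 3] |= (uint8_t)(1u << (a[i] & 7));

    /* Scan byte by byte, skipping bytes with no bits set. */
    int64_t n = 0;
    for (size_t byte = 0; byte < nbytes; byte++) {
        if (bits[byte] == 0)
            continue;
        for (unsigned k = 0; k < 8; k++)
            if ((bits[byte] >> k) & 1u)
                out[n++] = (uint32_t)(byte * 8 + k);
    }

    free(bits);
    return n;
}
```

The output is sorted rather than in first-occurrence order, which matches the "insert everything, then print the bitmap" flow.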
The test comparison results are as follows:
Note: the test environment was an i5 CPU at 1.8 GHz with 2 GB of memory, running Red Hat Enterprise Linux 6.0, 32-bit x86.
The test comparison data show that with small data sets the invention offers no advantage and may even be less efficient. At around 100,000 entries, efficiency improves by more than 60%; at millions, tens of millions, or even hundreds of millions of entries, the invention improves time efficiency by ten times to several tens of times.
An example of the invention in practical use follows.
Suppose Guangxigu has 1 million registered members and 2 million WeChat followers, and needs to extract:
1. the users who are both website members and WeChat followers;
2. the users who are website members but not WeChat followers;
3. the users who are WeChat followers but not website members.
The method above can be used for this, enabling follow-up marketing work such as sending a WeChat message to the users who are WeChat followers but not website members to encourage them to become members.
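The three extractions can be obtained in a single pass if each list gets its own bitmap and the two bitmaps are combined byte by byte. The C sketch below is one possible arrangement, not the patent's stated implementation; it assumes users are identified by 32-bit numeric IDs, and the names split_counts, split_users, and universe are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sizes of the three extracted lists. */
typedef struct {
    size_t both, members_only, followers_only;
} split_counts;

/* Set the bit for id v in a byte array. */
static void set_bit(uint8_t *bits, uint32_t v) {
    bits[v >> 3] |= (uint8_t)(1u << (v & 7));
}

/* Split ids into three ascending lists: in both sets, members only,
   followers only. both and m_only need room for nm ids, f_only for nf ids.
   Ids >= universe are ignored. Returns 1 on success, 0 on allocation failure. */
int split_users(const uint32_t *members, size_t nm,
                const uint32_t *followers, size_t nf,
                uint64_t universe,
                uint32_t *both, uint32_t *m_only, uint32_t *f_only,
                split_counts *cnt) {
    size_t nbytes = (size_t)((universe + 7) / 8);
    uint8_t *m = calloc(nbytes, 1);
    uint8_t *f = calloc(nbytes, 1);
    if (m == NULL || f == NULL) {
        free(m);
        free(f);
        return 0;
    }
    for (size_t i = 0; i < nm; i++)
        if (members[i] < universe) set_bit(m, members[i]);
    for (size_t i = 0; i < nf; i++)
        if (followers[i] < universe) set_bit(f, followers[i]);

    cnt->both = cnt->members_only = cnt->followers_only = 0;
    for (size_t byte = 0; byte < nbytes; byte++) {
        uint8_t in_both = m[byte] & f[byte];            /* intersection */
        uint8_t only_m = m[byte] & (uint8_t)~f[byte];   /* members minus followers */
        uint8_t only_f = f[byte] & (uint8_t)~m[byte];   /* followers minus members */
        for (unsigned k = 0; k < 8; k++) {
            uint32_t id = (uint32_t)(byte * 8 + k);
            if ((in_both >> k) & 1u) both[cnt->both++] = id;
            if ((only_m >> k) & 1u) m_only[cnt->members_only++] = id;
            if ((only_f >> k) & 1u) f_only[cnt->followers_only++] = id;
        }
    }
    free(m);
    free(f);
    return 1;
}
```

With two full-range bitmaps this arrangement needs 1 GB rather than 512 MB, so on a memory-constrained machine running the single-bitmap intersection and difference flows one after another may be preferable.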
The above discloses only preferred embodiments of the present invention and of course does not limit the scope of its claims; equivalent variations made in accordance with the claims of this application therefore remain within the scope of the invention.

Claims (4)

1. A method for performing set operations on massive files using a bitmap, characterized by comprising the following steps: inserting data into the bitmap; testing whether a given number is in the bitmap; printing all data recorded in the bitmap; deduplication; and taking a union, intersection, or difference.
2. The method for performing set operations on massive files using a bitmap as claimed in claim 1, characterized in that 512 MB of space is required to store the bitmap data, and all of the data must be initialized to zero.
3. The method for performing set operations on massive files using a bitmap as claimed in claim 1, characterized in that inserting a datum into the bitmap requires setting the corresponding bit to 1.
4. The method for performing set operations on massive files using a bitmap as claimed in claim 1, characterized in that determining whether a number already exists in the data set requires testing whether the corresponding bit is 1.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410652811.XA CN104572810A (en) 2014-11-17 2014-11-17 Method for carrying out operation processing on massive files by using bitmap

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410652811.XA CN104572810A (en) 2014-11-17 2014-11-17 Method for carrying out operation processing on massive files by using bitmap

Publications (1)

Publication Number Publication Date
CN104572810A (en) 2015-04-29

Family

ID=53088872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410652811.XA Pending CN104572810A (en) 2014-11-17 2014-11-17 Method for carrying out operation processing on massive files by using bitmap

Country Status (1)

Country Link
CN (1) CN104572810A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126334A (en) * 2015-05-04 2016-11-16 斯特拉托斯卡莱有限公司 The workload migration of probability data de-duplication perception
CN110413887A (en) * 2019-07-23 2019-11-05 杭州云梯科技有限公司 A kind of information push frequency limit method and system
CN111510464A (en) * 2020-06-24 2020-08-07 同盾控股有限公司 Epidemic situation information sharing method and system for protecting user privacy
CN111816302A (en) * 2020-07-08 2020-10-23 中润普达(十堰)大数据中心有限公司 Disease symptom cognitive system based on abnormal wrinkles of human body

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6978384B1 (en) * 2000-09-19 2005-12-20 Verizon Corp. Services Group, Inc. Method and apparatus for sequence number checking
CN101676906A (en) * 2008-09-18 2010-03-24 中兴通讯股份有限公司 Method of managing memory database space by using bitmap
CN101957998A (en) * 2010-10-09 2011-01-26 深圳市布易科技有限公司 Method and device for changing picture expressed by bitmap into picture expressed by vector shadow line

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
杨康的BLOG: "位映射对大数据排重与排序", 《HTTP://YACARE.ITEYE.COM/BLOG/1969931》 *
规速: "海量数据处理算法—Bit-Map", 《HTTPS://BLOG.CSDN.NET/HGUISU/ARTICLE/DETAILS/7880288》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126334A (en) * 2015-05-04 2016-11-16 斯特拉托斯卡莱有限公司 The workload migration of probability data de-duplication perception
CN110413887A (en) * 2019-07-23 2019-11-05 杭州云梯科技有限公司 A kind of information push frequency limit method and system
CN110413887B (en) * 2019-07-23 2022-03-25 杭州云梯科技有限公司 Information pushing frequency limiting method and system
CN111510464A (en) * 2020-06-24 2020-08-07 同盾控股有限公司 Epidemic situation information sharing method and system for protecting user privacy
CN111510464B (en) * 2020-06-24 2020-10-02 同盾控股有限公司 Epidemic situation information sharing method and system for protecting user privacy
CN111816302A (en) * 2020-07-08 2020-10-23 中润普达(十堰)大数据中心有限公司 Disease symptom cognitive system based on abnormal wrinkles of human body

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150429