CN105930396B - A kind of repetition removing method and system based on database - Google Patents

A kind of repetition removing method and system based on database

Info

Publication number
CN105930396B
Authority
CN
China
Prior art keywords
index information
index
hbase
ticket
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610236006.8A
Other languages
Chinese (zh)
Other versions
CN105930396A (en)
Inventor
高洪磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Si Tech Information Technology Co Ltd
Original Assignee
Beijing Si Tech Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Si Tech Information Technology Co Ltd
Priority to CN201610236006.8A
Publication of CN105930396A
Application granted
Publication of CN105930396B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a database-based repetition removing (deduplication) method and system. The method comprises the following steps. Step 1: read all tickets in the CDR file, obtain the index information of each ticket, and write all the index information to HBase. Step 2: based on the index information in HBase, determine whether each ticket in the CDR file is a duplicate ticket or a normal ticket; write duplicate-ticket records to the duplicate-ticket file, and write normal tickets to the normal deduplication output. The invention stores the historical index in HBase, which is secure, stable, scalable, and inexpensive to deploy, and uses a customized deduplication index algorithm so that multiple deduplication capabilities can process concurrently. In addition, the algorithm splits deduplication into two independent modules, writing the index into the HBase database and confirming duplicate tickets, which in practice can be deployed as separate capabilities to further improve deduplication efficiency.

Description

A kind of repetition removing method and system based on database
Technical field
The present invention relates to a database-based repetition removing (deduplication) method and system, and belongs to the telecommunications field.
Background art
Deduplication is a component of a telecom billing system and is responsible for removing duplicate tickets during billing.
Certain fields are extracted from a ticket and combined into an index string that uniquely represents that ticket; a deduplication capability node then compares the index string with the historical index to confirm whether the current ticket is a duplicate ticket. Depending on how the index is stored, deduplication can be classified as file-based, memory-based, mixed file-and-memory, and so on. Because of the way the historical index is stored, all capabilities that use the same index must be deployed on one physical host. Because host resources are limited, massive data volumes cannot be processed in one place in practice; tickets must be distributed to capabilities on different hosts according to number segment or other information. The result of this rigid distribution is that the capability nodes serving number segments with heavy traffic run at full load for long periods, while the capability nodes serving number segments with few tickets have low resource utilization. Moreover, when multiple deduplication capabilities deployed on the same machine operate on shared index data, the index information must be locked to keep the capability nodes from interfering with one another, which further reduces processing speed.
Summary of the invention
HBase is a highly reliable, high-performance, column-oriented, scalable open-source database; with HBase, a large-scale structured storage cluster can be built on inexpensive PC servers.
The technical problem to be solved by the invention is to provide a database-based deduplication method and system that uses HBase for unified index access, decouples the deduplication capability nodes from the historical index data they process, and, by changing the deduplication index algorithm, solves the problem of different deduplication capabilities interfering with one another.
The technical solution of the present invention for solving the above technical problem is as follows: a database-based deduplication method, comprising the following steps:
Step 1: read all tickets in the CDR file, obtain the index information of each ticket, and write all the index information to HBase;
Step 2: based on the index information in HBase, determine whether each ticket in the CDR file is a duplicate ticket or a normal ticket; write duplicate-ticket records to the duplicate-ticket file, and write normal tickets to the normal deduplication output.
The beneficial effects of the present invention are as follows: the invention stores the historical index in HBase, which is secure, stable, scalable, and inexpensive to deploy, and uses a customized deduplication index algorithm so that multiple deduplication capabilities can process concurrently. In addition, the algorithm splits deduplication into two independent modules, writing the index into the HBase database and confirming duplicate tickets, which in practice can be deployed as separate capabilities to further improve deduplication efficiency.
Based on the above technical solution, the present invention can also be improved as follows.
Further, the index information consists of a traditional index string and the MD5 value of the file name and line number.
The benefit of this further scheme is as follows. The traditional index string is the unique deduplication index in the usual sense and can uniquely identify a ticket. The new index organization guarantees that different processes handling duplicate tickets never generate the same new index string, which avoids entries overwriting one another in HBase when their traditional index strings are identical.
Further, the MD5 value of the file name and line number is computed from the current CDR file name and the position of the ticket in the CDR file.
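The index composition described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `build_index`, the `name:line` input format for the hash, the underscore separator (borrowed from the specific example later in the description), and the choice of the middle 16 hexadecimal characters of the MD5 digest are all assumptions.

```python
import hashlib


def build_index(traditional: str, file_name: str, line_no: int) -> str:
    """Return a hypothetical new index: traditional string + '_' + MD5 suffix."""
    # Hash the ticket's location (file name and line number); the input
    # format "name:line" is an assumption made for this sketch.
    digest = hashlib.md5(f"{file_name}:{line_no}".encode("utf-8")).hexdigest()
    # Keep 16 hex characters (assumed: the middle half of the 32-char digest).
    return f"{traditional}_{digest[8:24].upper()}"


# Two tickets with the same traditional index string but different
# positions get distinct new indexes, so neither overwrites the other.
a = build_index("ABCDEFGHIJK", "cdr_001.txt", 1)
b = build_index("ABCDEFGHIJK", "cdr_001.txt", 2)
```

Because the suffix depends only on the file name and line number, re-indexing the same ticket always yields the same key, while a repeated ticket elsewhere in the file yields a different one.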
Further, step 1 specifically comprises the following steps:
Step 1.1: read a ticket from the CDR file as the current ticket and obtain the index information of the current ticket;
Step 1.2: write the index information of the current ticket to HBase, and store the time at which the index information was written to HBase in a database;
Step 1.3: determine whether the CDR file still contains unread tickets; if so, return to step 1.1; otherwise, go to step 2.
The benefit of this further scheme is that the time at which the index information is written to HBase is stored in the database, with the time accurate to the microsecond.
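The write loop of step 1 might look like the sketch below. It does not use any real HBase client: plain dictionaries stand in for the HBase table and for the database that holds write times. The ticket layout (traditional index string as the first comma-separated field), the simplified row key, and the name `write_indexes` are assumptions made for illustration.

```python
import time


def write_indexes(cdr_lines, file_name, store, write_times):
    """Step 1 sketch: index every ticket and record when it was written."""
    written = 0
    for line_no, ticket in enumerate(cdr_lines, start=1):
        # Assumed layout: the traditional index string is the first
        # comma-separated field of the ticket.
        traditional = ticket.split(",")[0]
        # Simplified row key; the patent uses an MD5-based suffix instead.
        rowkey = f"{traditional}_{file_name}:{line_no}"
        store[rowkey] = (file_name, line_no)          # stand-in for an HBase write
        write_times[rowkey] = time.time_ns() // 1000  # microseconds since epoch
        written += 1
    return written


store, write_times = {}, {}
lines = ["A1,2016-04-15,60", "B2,2016-04-15,30", "A1,2016-04-15,60"]
count = write_indexes(lines, "cdr_001.txt", store, write_times)
```

Note that the repeated ticket `A1` produces two separate entries rather than one overwritten entry, which is what lets the later confirmation step see both.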
Further, step 2 specifically comprises the following steps:
Step 2.1: select a traditional index string in HBase as the current string, and read all index information in HBase that contains the current string;
Step 2.2: determine whether that index information consists of a single entry; if so, the ticket corresponding to the entry is a normal ticket, go to step 2.4; otherwise, go to step 2.3;
Step 2.3: based on the index information and the time at which it was written to HBase, determine whether the ticket corresponding to each entry is a duplicate ticket; if so, write the duplicate-ticket record to the duplicate-ticket file and go to step 2.4; otherwise, go to step 2.4;
Step 2.4: determine whether HBase still contains traditional index strings that have not been selected; if so, return to step 2.1; otherwise, write the normal tickets to the normal deduplication output and end.
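The grouping in steps 2.1 and 2.2 can be sketched as below. The key layout `<traditional>_<suffix>` and the function name `split_by_traditional` are assumptions; a list of strings stands in for the row keys read from HBase, and the tie-breaking of step 2.3 is deliberately left out.

```python
from collections import defaultdict


def split_by_traditional(rowkeys):
    """Step 2.1/2.2 sketch: group row keys by traditional index string."""
    groups = defaultdict(list)
    for key in rowkeys:
        # Assumed key layout: "<traditional>_<suffix>"; split on the last "_".
        traditional, _, _suffix = key.rpartition("_")
        groups[traditional].append(key)
    # A single entry means a normal ticket; several entries need step 2.3.
    normal = [keys[0] for keys in groups.values() if len(keys) == 1]
    contested = {t: keys for t, keys in groups.items() if len(keys) > 1}
    return normal, contested


normal, contested = split_by_traditional(["A1_0001", "B2_0002", "A1_0003"])
```

Here `B2_0002` is immediately classified as normal, while the two `A1` entries are handed on for the file-name and write-time comparison.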
Further, step 2.3 specifically comprises the following steps:
Step 2.3.1: using the file names in the index information, determine whether the tickets corresponding to the multiple index entries have the same file name; if so, go to step 2.3.2; otherwise, go to step 2.3.3;
Step 2.3.2: query by the line numbers in the index information; the ticket corresponding to the entry with the smallest line number is confirmed as the normal ticket and the others are duplicate tickets; write the normal ticket to the normal deduplication output and write the duplicate-ticket records to the duplicate-ticket file;
Step 2.3.3: query by the time at which the index information was written to HBase; the ticket corresponding to the entry with the earliest write time is confirmed as the normal ticket and the others are duplicate tickets; write the normal ticket to the normal deduplication output and write the duplicate-ticket records to the duplicate-ticket file.
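The decision rule of steps 2.3.1 to 2.3.3 can be sketched as a small function. The tuple layout `(file_name, line_no, write_time_us)` and the name `resolve` are hypothetical. The text does not say what to do when only some of the entries share a file name; this sketch treats any such mix as the "different file names" case, which is an assumption.

```python
def resolve(entries):
    """Step 2.3 sketch: pick the normal ticket among entries sharing an index.

    Each entry is a hypothetical (file_name, line_no, write_time_us) tuple.
    Returns (normal_entry, duplicate_entries).
    """
    same_file = len({file_name for file_name, _, _ in entries}) == 1
    if same_file:
        # Step 2.3.2: same file, so the smallest line number is the normal ticket.
        normal = min(entries, key=lambda e: e[1])
    else:
        # Step 2.3.3: different files, so the earliest HBase write time wins.
        normal = min(entries, key=lambda e: e[2])
    duplicates = [e for e in entries if e is not normal]
    return normal, duplicates


same = [("f1", 7, 300), ("f1", 2, 400)]
mixed = [("f1", 7, 300), ("f2", 2, 100)]
```

With `same`, line 2 is kept even though it was written later; with `mixed`, the entry from `f2` is kept because it reached the store first.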
The technical solution of the present invention for solving the above technical problem is as follows: a database-based deduplication system, comprising a writing module and a deduplication module;
The writing module is configured to read all tickets in the CDR file, obtain the index information of each ticket, and write all the index information to HBase;
The deduplication module is configured to determine, based on the index information in HBase, whether each ticket in the CDR file is a duplicate ticket or a normal ticket, to write duplicate-ticket records to the duplicate-ticket file, and to write normal tickets to the normal deduplication output.
The beneficial effects of the present invention are as follows: the invention stores the historical index in HBase, which is secure, stable, scalable, and inexpensive to deploy, and uses a customized deduplication index algorithm so that multiple deduplication capabilities can process concurrently. In addition, the algorithm splits deduplication into two independent modules, writing the index into the HBase database and confirming duplicate tickets, which in practice can be deployed as separate capabilities to further improve deduplication efficiency.
Based on the above technical solution, the present invention can also be improved as follows.
Further, the index information consists of a traditional index string and the MD5 value of the file name and line number.
The benefit of this further scheme is as follows. The traditional index string is the unique deduplication index in the usual sense and can uniquely identify a ticket. The new index organization guarantees that different processes handling duplicate tickets never generate the same new index string, which avoids entries overwriting one another in HBase when their traditional index strings are identical.
Further, the MD5 value of the file name and line number is computed from the current CDR file name and the position of the ticket in the CDR file.
Further, the writing module comprises a reading module, an index module, and a judgment module;
The reading module is configured to read a ticket from the CDR file as the current ticket and obtain the index information of the current ticket;
The index module is configured to write the index information of the current ticket to HBase and to store the time at which the index information was written to HBase in a database;
The judgment module is configured to determine whether the CDR file still contains unread tickets; if so, it triggers the reading module; otherwise, it triggers the deduplication module.
Brief description of the drawings
Fig. 1 is a flow chart of a database-based deduplication method according to Embodiment 1 of the present invention;
Fig. 2 is a structural diagram of a database-based deduplication system according to Embodiment 1 of the present invention;
Fig. 3 is a flow chart of a database-based deduplication method according to a specific example of the present invention.
In the drawings, the components denoted by the reference numerals are as follows:
1, writing module; 2, deduplication module.
Detailed description of the embodiments
The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given serve only to explain the invention and are not intended to limit its scope.
As shown in Fig. 1, the database-based deduplication method according to Embodiment 1 of the present invention comprises the following steps:
Step 1: read all tickets in the CDR file, obtain the index information of each ticket, and write all the index information to HBase;
Step 2: based on the index information in HBase, determine whether each ticket in the CDR file is a duplicate ticket or a normal ticket; write duplicate-ticket records to the duplicate-ticket file, and write normal tickets to the normal deduplication output.
In the database-based deduplication method according to Embodiment 2 of the present invention, on the basis of Embodiment 1, the index information consists of a traditional index string and the MD5 value of the file name and line number.
In the database-based deduplication method according to Embodiment 3 of the present invention, on the basis of Embodiment 2, the MD5 value of the file name and line number is computed from the current CDR file name and the position of the ticket in the CDR file.
In the database-based deduplication method according to Embodiment 4 of the present invention, on the basis of any one of Embodiments 1 to 3, step 1 specifically comprises the following steps:
Step 1.1: read a ticket from the CDR file as the current ticket and obtain the index information of the current ticket;
Step 1.2: write the index information of the current ticket to HBase, and store the time at which the index information was written to HBase in a database;
Step 1.3: determine whether the CDR file still contains unread tickets; if so, return to step 1.1; otherwise, go to step 2.
In the database-based deduplication method according to Embodiment 5 of the present invention, on the basis of any one of Embodiments 1 to 4, step 2 specifically comprises the following steps:
Step 2.1: select a traditional index string in HBase as the current string, and read all index information in HBase that contains the current string;
Step 2.2: determine whether that index information consists of a single entry; if so, the ticket corresponding to the entry is a normal ticket, go to step 2.4; otherwise, go to step 2.3;
Step 2.3: based on the index information and the time at which it was written to HBase, determine whether the ticket corresponding to each entry is a duplicate ticket; if so, write the duplicate-ticket record to the duplicate-ticket file and go to step 2.4; otherwise, go to step 2.4;
Step 2.4: determine whether HBase still contains traditional index strings that have not been selected; if so, return to step 2.1; otherwise, write the normal tickets to the normal deduplication output and end.
In the database-based deduplication method according to Embodiment 6 of the present invention, on the basis of Embodiment 5, step 2.3 specifically comprises the following steps:
Step 2.3.1: using the file names in the index information, determine whether the tickets corresponding to the multiple index entries have the same file name; if so, go to step 2.3.2; otherwise, go to step 2.3.3;
Step 2.3.2: query by the line numbers in the index information; the ticket corresponding to the entry with the smallest line number is confirmed as the normal ticket and the others are duplicate tickets; write the normal ticket to the normal deduplication output and write the duplicate-ticket records to the duplicate-ticket file;
Step 2.3.3: query by the time at which the index information was written to HBase; the ticket corresponding to the entry with the earliest write time is confirmed as the normal ticket and the others are duplicate tickets; write the normal ticket to the normal deduplication output and write the duplicate-ticket records to the duplicate-ticket file.
As shown in Fig. 2, the database-based deduplication system according to Embodiment 1 of the present invention comprises a writing module 1 and a deduplication module 2;
The writing module 1 is configured to read all tickets in the CDR file, obtain the index information of each ticket, and write all the index information to HBase;
The deduplication module 2 is configured to determine, based on the index information in HBase, whether each ticket in the CDR file is a duplicate ticket or a normal ticket, to write duplicate-ticket records to the duplicate-ticket file, and to write normal tickets to the normal deduplication output.
In the database-based deduplication system according to Embodiment 2 of the present invention, on the basis of Embodiment 1, the index information consists of a traditional index string and the MD5 value of the file name and line number.
In the database-based deduplication system according to Embodiment 3 of the present invention, on the basis of Embodiment 2, the MD5 value of the file name and line number is computed from the current CDR file name and the position of the ticket in the CDR file.
In the database-based deduplication system according to Embodiment 4 of the present invention, on the basis of any one of Embodiments 1 to 3, the writing module comprises a reading module, an index module, and a judgment module;
The reading module is configured to read a ticket from the CDR file as the current ticket and obtain the index information of the current ticket;
The index module is configured to write the index information of the current ticket to HBase and to store the time at which the index information was written to HBase in a database;
The judgment module is configured to determine whether the CDR file still contains unread tickets; if so, it triggers the reading module; otherwise, it triggers the deduplication module.
As shown in Fig. 3, a specific example of the method of the invention comprises the following steps:
1. Introduce a new index organization: the new index is the traditional index string, followed by an underscore, followed by the MD5 value of the string formed from the file name and line number. The traditional index string is the unique deduplication index in the usual sense and can uniquely identify a ticket. The MD5 value of the file name and line number can be computed from the current CDR file name and the position of the ticket in the file. The new index organization guarantees that different processes handling duplicate tickets never generate the same new index string, which avoids entries overwriting one another in HBase when their traditional index strings are identical.
2. Loop over the entire file, writing each new index to HBase as the rowkey with the file name and line number as the value, and record in the in-memory database the time at which all indexes of the CDR file finished being written to HBase, with the time accurate to the microsecond.
3. Loop over the entire file and obtain the traditional index string of each ticket (suppose the traditional index string of a ticket is ABCDEFGHIJK, and suppose step 1 computes a 16-character MD5 value of the file name and line number). Using the maximum and minimum MD5 values, assemble two new strings: MAX (ABCDEFGHIJK_FFFFFFFFFFFFFFFF) and MIN (ABCDEFGHIJK_0000000000000000).
4. Using the new index strings MAX and MIN obtained in step 3, look up the total number of records within this range. If there is only one, the current ticket is a normal ticket; if there are several, go to step 5.
5. For the multiple records found, check whether their file names are the same as the current CDR file name. We stipulate that if they are the same, the record with the smaller line number is the normal ticket and the records with larger line numbers are duplicate tickets. If the file names differ, query the in-memory database for the time at which each entered HBase; the one that finished entering HBase first is the normal ticket and the others are duplicate tickets.
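The MIN/MAX range lookup of steps 3 and 4 relies on row keys being stored in sorted order. The sketch below imitates that with a sorted Python list and the standard `bisect` module instead of a real HBase scan. The function name `count_in_range` and the sample keys are hypothetical, and the bounds assume the MD5 suffix is written in uppercase hexadecimal, so that sixteen `F` characters are the largest possible suffix.

```python
from bisect import bisect_left, bisect_right


def count_in_range(sorted_rowkeys, traditional, md5_len=16):
    """Count row keys between MIN and MAX for one traditional index string."""
    low = f"{traditional}_{'0' * md5_len}"   # MIN bound
    high = f"{traditional}_{'F' * md5_len}"  # MAX bound (uppercase hex assumed)
    # HBase keeps rows sorted by key; a sorted Python list stands in here.
    start = bisect_left(sorted_rowkeys, low)
    stop = bisect_right(sorted_rowkeys, high)
    return stop - start


keys = sorted([
    "ABCDEFGHIJK_0A1B2C3D4E5F6071",
    "ABCDEFGHIJK_9F8E7D6C5B4A3921",
    "ABCDEFGHIJL_1111111111111111",
])
```

A count of one means the ticket is normal; a count above one sends the matching entries to the file-name and write-time comparison of step 5.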
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (4)

1. A database-based deduplication method, characterized by comprising the following steps:
Step 1: read all tickets in the CDR file, obtain the index information of each ticket, and write all the index information to HBase;
Step 2: based on the index information in HBase, determine whether each ticket in the CDR file is a duplicate ticket or a normal ticket; write duplicate-ticket records to the duplicate-ticket file, and write normal tickets to the normal deduplication output;
Step 1 specifically comprises the following steps:
Step 1.1: read a ticket from the CDR file as the current ticket and obtain the index information of the current ticket;
Step 1.2: write the index information of the current ticket to HBase, and store the time at which the index information was written to HBase in a database;
Step 1.3: determine whether the CDR file still contains unread tickets; if so, return to step 1.1; otherwise, go to step 2;
Step 2 specifically comprises the following steps:
Step 2.1: select a traditional index string in HBase as the current string, and read all index information in HBase that contains the current string;
Step 2.2: determine whether that index information consists of a single entry; if so, the ticket corresponding to the entry is a normal ticket, go to step 2.4; otherwise, go to step 2.3;
Step 2.3: based on the index information and the time at which it was written to HBase, determine whether the ticket corresponding to each entry is a duplicate ticket; if so, write the duplicate-ticket record to the duplicate-ticket file and go to step 2.4; otherwise, go to step 2.4;
Step 2.4: determine whether HBase still contains traditional index strings that have not been selected; if so, return to step 2.1; otherwise, write the normal tickets to the normal deduplication output and end.
2. The database-based deduplication method according to claim 1, characterized in that the index information consists of a traditional index string and the MD5 value of the file name and line number.
3. The database-based deduplication method according to claim 2, characterized in that the MD5 value of the file name and line number is computed from the current CDR file name and the position of the ticket in the CDR file.
4. The database-based deduplication method according to claim 1, characterized in that step 2.3 specifically comprises the following steps:
Step 2.3.1: using the file names in the index information, determine whether the tickets corresponding to the multiple index entries have the same file name; if so, go to step 2.3.2; otherwise, go to step 2.3.3;
Step 2.3.2: query by the line numbers in the index information; the ticket corresponding to the entry with the smallest line number is confirmed as the normal ticket and the others are duplicate tickets; write the normal ticket to the normal deduplication output and write the duplicate-ticket records to the duplicate-ticket file;
Step 2.3.3: query by the time at which the index information was written to HBase; the ticket corresponding to the entry with the earliest write time is confirmed as the normal ticket and the others are duplicate tickets; write the normal ticket to the normal deduplication output and write the duplicate-ticket records to the duplicate-ticket file.
CN201610236006.8A 2016-04-15 2016-04-15 A kind of repetition removing method and system based on database Active CN105930396B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610236006.8A CN105930396B (en) 2016-04-15 2016-04-15 A kind of repetition removing method and system based on database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610236006.8A CN105930396B (en) 2016-04-15 2016-04-15 A kind of repetition removing method and system based on database

Publications (2)

Publication Number Publication Date
CN105930396A CN105930396A (en) 2016-09-07
CN105930396B true CN105930396B (en) 2019-04-09

Family

ID=56839153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610236006.8A Active CN105930396B (en) 2016-04-15 2016-04-15 A kind of repetition removing method and system based on database

Country Status (1)

Country Link
CN (1) CN105930396B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599326B (en) * 2017-01-23 2020-02-04 北京思特奇信息技术股份有限公司 Recorded data duplication eliminating processing method and system under cloud architecture
CN112069510B (en) * 2020-07-24 2024-01-30 北京思特奇信息技术股份有限公司 Data encryption and duplication elimination method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1809108A (en) * 2006-02-20 2006-07-26 南京联创科技股份有限公司 Filter based call ticket memory repetition elimination method
CN101350869A (en) * 2007-07-19 2009-01-21 中国电信股份有限公司 Method and apparatus for removing repeat of telecom charging based on index and hash
CN101442731A (en) * 2008-12-12 2009-05-27 中国移动通信集团安徽有限公司 Method and apparatus for removing call ticket repeat
CN102156744A (en) * 2011-04-18 2011-08-17 北京神州数码思特奇信息技术股份有限公司 Method for eliminating repetition of memory dialog list

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"大数据环境下的话单排重";张超;《技术与实践》;20131231;全文
"实时计费重单剔除技术研究";吴杰 等;《计算机应用与软件》;20041031;全文
"重复话单剔除技术的探讨";杨志雄;《电信科学》;20041231;全文

Also Published As

Publication number Publication date
CN105930396A (en) 2016-09-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant