CN105930396B - A kind of repetition removing method and system based on database - Google Patents
- Publication number
- CN105930396B (application number CN201610236006.8A)
- Authority
- CN
- China
- Prior art keywords
- index information
- index
- hbase
- ticket
- weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
- G06F16/1748—De-duplication implemented within the file system, e.g. based on file segments
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a database-based deduplication method and system. The method comprises the following steps. Step 1: read all call tickets in a CDR file, obtain the index information of each ticket, and write all index information into HBase. Step 2: determine, from the index information in HBase, whether each ticket in the CDR file is a duplicate ticket or a normal ticket; write duplicate records to a duplicate-ticket file and write normal tickets to the normal deduplication output. The invention stores historical indexes in HBase, which is secure, stable, scalable, and inexpensive to deploy, and uses a customized deduplication index algorithm so that multiple deduplication capabilities can process concurrently. The algorithm also splits deduplication into two independent modules, one that writes indexes into the HBase database and one that confirms duplicate tickets; in practice these can be deployed as separate capabilities to further improve deduplication efficiency.
Description
Technical field
The invention relates to a database-based deduplication method and system and belongs to the field of telecommunications.
Background art
Deduplication is a component of a telecom billing system and is responsible for removing duplicate call tickets during billing. Certain fields are extracted from each ticket and combined into an index string that uniquely represents the ticket, and a deduplication capability node decides whether the current ticket is a duplicate by comparing this string against the historical indexes. Depending on how the indexes are stored, deduplication is classified as file-based, memory-based, mixed file-and-memory, and so on. Because of the way historical indexes are stored, all capabilities that use the same index must be deployed on one physical host. Since host resources are limited, large volumes of data cannot be processed uniformly in practice, and tickets have to be distributed to capabilities on different hosts according to number segment or other information. The result of this hard partitioning is that capability nodes serving high-traffic number segments run at full load for long periods, while nodes serving number segments with few tickets are under-utilized. In addition, deduplication capabilities deployed on the same machine must lock the index information when they operate on shared index data so that the nodes do not interfere with each other, which further reduces processing speed.
Summary of the invention
HBase is a highly reliable, high-performance, column-oriented, scalable open-source database; with HBase, a large-scale structured storage cluster can be built on inexpensive PC servers.
The technical problem to be solved by the invention is to provide a database-based deduplication method and system that uses HBase for unified index access, decouples the deduplication capability nodes from the historical index data they process, and, by changing the deduplication index algorithm, resolves the problem of different deduplication capabilities interfering with each other.
The technical solution of the invention to the above technical problem is a database-based deduplication method comprising the following steps:
Step 1: read all call tickets in a CDR file, obtain the index information of each ticket, and write all index information into HBase;
Step 2: determine, from the index information in HBase, whether each ticket in the CDR file is a duplicate ticket or a normal ticket; write duplicate records to the duplicate-ticket file and write normal tickets to the normal deduplication output.
The beneficial effects of the invention are as follows. The invention stores historical indexes in HBase, which is secure, stable, scalable, and inexpensive to deploy, and uses a customized deduplication index algorithm so that multiple deduplication capabilities can process concurrently. The algorithm also splits deduplication into two independent modules, one that writes indexes into the HBase database and one that confirms duplicate tickets; in practice these can be deployed as separate capabilities to further improve deduplication efficiency.
On the basis of the above technical solution, the invention can also be improved as follows.
Further, the index information consists of the traditional index string and an MD5 value computed from the file name and line number.
The benefit of this further scheme is as follows. The traditional index string is the unique deduplication index in the usual sense and uniquely identifies a ticket. The new index organization guarantees that different processes handling duplicate tickets never generate the same new index string, which avoids entries with identical traditional index strings overwriting each other in HBase.
Further, the MD5 value of the file name and line number is computed from the current CDR file name and the position of the ticket in the CDR file.
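The index organization described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `file_name:line_number` input layout, and the choice of the middle 16 characters of the hex digest as the 16-character MD5 form are all assumptions made for the example.

```python
import hashlib


def file_position_digest(file_name: str, line_number: int) -> str:
    """Return a 16-character uppercase digest of the file name and line number."""
    raw = f"{file_name}:{line_number}".encode("utf-8")
    # Assumption: the 16-character form is the middle of the 32-character hex digest.
    return hashlib.md5(raw).hexdigest()[8:24].upper()


def build_index_key(traditional_index: str, file_name: str, line_number: int) -> str:
    """Join the traditional index string and the position digest with an underscore."""
    return f"{traditional_index}_{file_position_digest(file_name, line_number)}"


# Two tickets with the same traditional index string but different source files
# receive different keys, so neither overwrites the other in the index store.
key_a = build_index_key("ABCDEFGHIJK", "cdr_0001.dat", 1)
key_b = build_index_key("ABCDEFGHIJK", "cdr_0002.dat", 1)
```

Uppercasing the digest keeps every key between the `0000000000000000` and `FFFFFFFFFFFFFFFF` suffixes under byte-wise ordering, which matters if the keys are later retrieved by range.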
Further, step 1 specifically comprises the following steps:
Step 1.1: read one ticket from the CDR file as the current ticket and obtain the index information of the current ticket;
Step 1.2: write the index information of the current ticket into HBase, and store in a database the time at which the index information was written into HBase;
Step 1.3: determine whether the CDR file still contains unread tickets; if so, return to step 1.1; otherwise, go to step 2.
The benefit of this further scheme is that the time at which the index information was written into HBase is stored in a database, with the time accurate to the microsecond.
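The write phase in steps 1.1 to 1.3 might look like the sketch below. It uses plain dictionaries as stand-ins for HBase and for the database that holds write times; `extract_index`, the comma-separated ticket layout, and the key format are hypothetical. The write time is recorded once per file after all of its indexes are written, which is one reading of the text.

```python
import hashlib
from datetime import datetime, timezone


def build_index_key(traditional_index, file_name, line_number):
    """Traditional index string plus a 16-character digest of file name and line number."""
    digest = hashlib.md5(f"{file_name}:{line_number}".encode("utf-8")).hexdigest()
    return f"{traditional_index}_{digest[8:24].upper()}"


def extract_index(ticket):
    """Hypothetical traditional index: caller, callee, and start time concatenated."""
    caller, callee, start_time, _duration = ticket.split(",")
    return f"{caller}{callee}{start_time}"


def write_indexes(file_name, tickets, index_store, write_times):
    """Steps 1.1 to 1.3: index every ticket, then record when the file finished."""
    for line_number, ticket in enumerate(tickets, start=1):
        row_key = build_index_key(extract_index(ticket), file_name, line_number)
        index_store[row_key] = (file_name, line_number)
    # datetime carries microsecond resolution, matching the stated precision.
    write_times[file_name] = datetime.now(timezone.utc)


index_store, write_times = {}, {}
tickets = [
    "13800000001,13900000002,20160415100000,60",
    "13800000001,13900000002,20160415100000,60",  # same call reported twice
]
write_indexes("cdr_0001.dat", tickets, index_store, write_times)
```

Both tickets are stored even though they share a traditional index string, because the line number makes their keys distinct.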
Further, step 2 specifically comprises the following steps:
Step 2.1: select one traditional index string in HBase as the current string, and read from HBase all index information that contains the current string;
Step 2.2: determine whether that index information consists of a single entry; if so, the corresponding ticket is a normal ticket, and step 2.4 is executed; otherwise, step 2.3 is executed;
Step 2.3: determine, from the index information and the time at which it was written into HBase, whether the ticket corresponding to each index entry is a duplicate; if so, write the duplicate record to the duplicate-ticket file and execute step 2.4; otherwise, execute step 2.4;
Step 2.4: determine whether HBase still contains traditional index strings that have not been selected; if so, return to step 2.1; otherwise, write the normal tickets to the normal deduplication output and finish.
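The control flow of steps 2.1 to 2.4 can be illustrated roughly as below. The dictionary stands in for HBase, the row keys are assumed to have the form `traditional_digest`, and `pick_normal` is a hypothetical callback representing the rule of step 2.3; the usage line passes the built-in `min` purely as a placeholder for that rule.

```python
from collections import defaultdict


def confirm_duplicates(index_store, pick_normal):
    """Group entries by traditional index string and split them into normal and duplicate."""
    groups = defaultdict(list)
    for row_key, entry in index_store.items():
        # The digest contains no underscore, so split once from the right.
        traditional_index = row_key.rsplit("_", 1)[0]
        groups[traditional_index].append(entry)

    normal, duplicates = [], []
    for entries in groups.values():
        # Step 2.2: a single entry is a normal ticket; otherwise defer to step 2.3.
        keeper = entries[0] if len(entries) == 1 else pick_normal(entries)
        normal.append(keeper)
        duplicates.extend(entry for entry in entries if entry != keeper)
    return normal, duplicates


index_store = {
    "K1_AAAAAAAAAAAAAAAA": ("a.cdr", 1),
    "K1_BBBBBBBBBBBBBBBB": ("a.cdr", 5),
    "K2_CCCCCCCCCCCCCCCC": ("a.cdr", 2),
}
normal, duplicates = confirm_duplicates(index_store, pick_normal=min)
```

Separating the grouping loop from the tie-breaking rule mirrors the way the text keeps step 2.3 as its own sub-procedure.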
Further, step 2.3 specifically comprises the following steps:
Step 2.3.1: using the file names in the index information, determine whether the tickets corresponding to the multiple index entries come from the same file; if so, execute step 2.3.2; otherwise, execute step 2.3.3;
Step 2.3.2: query by the line numbers in the index information; the ticket whose index entry has the smallest line number is confirmed as the normal ticket and the others as duplicates; write the normal ticket to the normal deduplication output and write the duplicate records to the duplicate-ticket file;
Step 2.3.3: query by the time at which the index information was written into HBase; the ticket whose index entry was written earliest is confirmed as the normal ticket and the others as duplicates; write the normal ticket to the normal deduplication output and write the duplicate records to the duplicate-ticket file.
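A minimal sketch of the rule in steps 2.3.1 to 2.3.3 follows. The entry layout `(file_name, line_number)` and the `write_times` mapping from file name to write-completion time are assumptions; falling back to the line number when two entries share a write time is also an assumption, since the text does not cover that case.

```python
from datetime import datetime


def pick_normal(entries, write_times):
    """Choose which of several entries with the same traditional index is the normal ticket."""
    file_names = {file_name for file_name, _ in entries}
    if len(file_names) == 1:
        # Step 2.3.2: same file, so the smallest line number wins.
        return min(entries, key=lambda entry: entry[1])
    # Step 2.3.3: different files, so the earliest write into the index store wins.
    return min(entries, key=lambda entry: (write_times[entry[0]], entry[1]))


write_times = {
    "a.cdr": datetime(2016, 4, 15, 10, 0, 0, 120),  # finished at +120 microseconds
    "b.cdr": datetime(2016, 4, 15, 10, 0, 0, 45),   # finished at +45 microseconds
}
same_file = pick_normal([("a.cdr", 7), ("a.cdr", 3)], write_times)
cross_file = pick_normal([("a.cdr", 3), ("b.cdr", 9)], write_times)
```

In the cross-file case the two files finish within the same second, which shows why the write time needs microsecond precision to order them.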
The technical solution of the invention to the above technical problem also includes a database-based deduplication system comprising a write module and a deduplication module.
The write module reads all call tickets in a CDR file, obtains the index information of each ticket, and writes all index information into HBase.
The deduplication module determines, from the index information in HBase, whether each ticket in the CDR file is a duplicate ticket or a normal ticket, writes duplicate records to the duplicate-ticket file, and writes normal tickets to the normal deduplication output.
The beneficial effects of the invention are as follows. The invention stores historical indexes in HBase, which is secure, stable, scalable, and inexpensive to deploy, and uses a customized deduplication index algorithm so that multiple deduplication capabilities can process concurrently. The algorithm also splits deduplication into two independent modules, one that writes indexes into the HBase database and one that confirms duplicate tickets; in practice these can be deployed as separate capabilities to further improve deduplication efficiency.
On the basis of the above technical solution, the invention can also be improved as follows.
Further, the index information consists of the traditional index string and an MD5 value computed from the file name and line number.
The benefit of this further scheme is as follows. The traditional index string is the unique deduplication index in the usual sense and uniquely identifies a ticket. The new index organization guarantees that different processes handling duplicate tickets never generate the same new index string, which avoids entries with identical traditional index strings overwriting each other in HBase.
Further, the MD5 value of the file name and line number is computed from the current CDR file name and the position of the ticket in the CDR file.
Further, the write module comprises a read module, an index module, and a judgment module.
The read module reads one ticket from the CDR file as the current ticket and obtains the index information of the current ticket.
The index module writes the index information of the current ticket into HBase and stores in a database the time at which the index information was written into HBase.
The judgment module determines whether the CDR file still contains unread tickets; if so, it triggers the read module; otherwise, it triggers the deduplication module.
Brief description of the drawings
Fig. 1 is a flowchart of the database-based deduplication method of Embodiment 1 of the invention;
Fig. 2 is a structural diagram of the database-based deduplication system of Embodiment 1 of the invention;
Fig. 3 is a flowchart of the database-based deduplication method in a specific example of the invention.
In the drawings, the reference numerals denote the following parts:
1, write module; 2, deduplication module.
Detailed description
The principles and features of the invention are described below with reference to the drawings. The examples given serve only to explain the invention and are not intended to limit its scope.
As shown in Fig. 1, the database-based deduplication method of Embodiment 1 of the invention comprises the following steps:
Step 1: read all call tickets in a CDR file, obtain the index information of each ticket, and write all index information into HBase;
Step 2: determine, from the index information in HBase, whether each ticket in the CDR file is a duplicate ticket or a normal ticket; write duplicate records to the duplicate-ticket file and write normal tickets to the normal deduplication output.
In the database-based deduplication method of Embodiment 2 of the invention, on the basis of Embodiment 1, the index information consists of the traditional index string and an MD5 value computed from the file name and line number.
In the database-based deduplication method of Embodiment 3 of the invention, on the basis of Embodiment 2, the MD5 value of the file name and line number is computed from the current CDR file name and the position of the ticket in the CDR file.
In the database-based deduplication method of Embodiment 4 of the invention, on the basis of any one of Embodiments 1 to 3, step 1 specifically comprises the following steps:
Step 1.1: read one ticket from the CDR file as the current ticket and obtain the index information of the current ticket;
Step 1.2: write the index information of the current ticket into HBase, and store in a database the time at which the index information was written into HBase;
Step 1.3: determine whether the CDR file still contains unread tickets; if so, return to step 1.1; otherwise, go to step 2.
In the database-based deduplication method of Embodiment 5 of the invention, on the basis of any one of Embodiments 1 to 4, step 2 specifically comprises the following steps:
Step 2.1: select one traditional index string in HBase as the current string, and read from HBase all index information that contains the current string;
Step 2.2: determine whether that index information consists of a single entry; if so, the corresponding ticket is a normal ticket, and step 2.4 is executed; otherwise, step 2.3 is executed;
Step 2.3: determine, from the index information and the time at which it was written into HBase, whether the ticket corresponding to each index entry is a duplicate; if so, write the duplicate record to the duplicate-ticket file and execute step 2.4; otherwise, execute step 2.4;
Step 2.4: determine whether HBase still contains traditional index strings that have not been selected; if so, return to step 2.1; otherwise, write the normal tickets to the normal deduplication output and finish.
In the database-based deduplication method of Embodiment 6 of the invention, on the basis of Embodiment 5, step 2.3 specifically comprises the following steps:
Step 2.3.1: using the file names in the index information, determine whether the tickets corresponding to the multiple index entries come from the same file; if so, execute step 2.3.2; otherwise, execute step 2.3.3;
Step 2.3.2: query by the line numbers in the index information; the ticket whose index entry has the smallest line number is confirmed as the normal ticket and the others as duplicates; write the normal ticket to the normal deduplication output and write the duplicate records to the duplicate-ticket file;
Step 2.3.3: query by the time at which the index information was written into HBase; the ticket whose index entry was written earliest is confirmed as the normal ticket and the others as duplicates; write the normal ticket to the normal deduplication output and write the duplicate records to the duplicate-ticket file.
As shown in Fig. 2, the database-based deduplication system of Embodiment 1 of the invention comprises a write module 1 and a deduplication module 2.
The write module 1 reads all call tickets in a CDR file, obtains the index information of each ticket, and writes all index information into HBase.
The deduplication module 2 determines, from the index information in HBase, whether each ticket in the CDR file is a duplicate ticket or a normal ticket, writes duplicate records to the duplicate-ticket file, and writes normal tickets to the normal deduplication output.
In the database-based deduplication system of Embodiment 2 of the invention, on the basis of Embodiment 1, the index information consists of the traditional index string and an MD5 value computed from the file name and line number.
In the database-based deduplication system of Embodiment 3 of the invention, on the basis of Embodiment 2, the MD5 value of the file name and line number is computed from the current CDR file name and the position of the ticket in the CDR file.
In the database-based deduplication system of Embodiment 4 of the invention, on the basis of any one of Embodiments 1 to 3, the write module comprises a read module, an index module, and a judgment module.
The read module reads one ticket from the CDR file as the current ticket and obtains the index information of the current ticket.
The index module writes the index information of the current ticket into HBase and stores in a database the time at which the index information was written into HBase.
The judgment module determines whether the CDR file still contains unread tickets; if so, it triggers the read module; otherwise, it triggers the deduplication module.
As shown in Fig. 3, a specific example of the method of the invention comprises the following steps:
1. Introduce a new index organization: the traditional index string, an underscore, and the MD5 value of the string formed from the file name and line number. The traditional index string is the unique deduplication index in the usual sense and uniquely identifies a ticket. The MD5 value of the file name and line number is computed from the current CDR file name and the position of the ticket in the file. The new index organization guarantees that different processes handling duplicate tickets never generate the same new index string, which avoids entries with identical traditional index strings overwriting each other in HBase.
2. Loop over the entire file, writing each new index into HBase as the rowkey with the file name and line number as the value, and record in the in-memory database the time at which all indexes of the CDR file finished being written to HBase, accurate to the microsecond.
3. Loop over the entire file and obtain the traditional index string of each ticket. Suppose the traditional index string of a ticket is ABCDEFGHIJK and that step 1 computes a 16-character MD5 value from the file name and line number. Using the maximum and minimum possible MD5 values, assemble two new strings, MAX (ABCDEFGHIJK_FFFFFFFFFFFFFFFF) and MIN (ABCDEFGHIJK_0000000000000000).
4. Using the new index strings MAX and MIN from step 3, look up the total number of records within this range. If there is only one, the current ticket is a normal ticket; if there are several, go to step 5.
5. For the multiple records found, check whether their file names are the same as the current CDR file name. We stipulate that if the file names are the same, the record with the smaller line number is the normal ticket and the records with larger line numbers are duplicates. If the file names differ, query the in-memory database for the time at which each entered HBase: the one whose write to HBase completed first is the normal ticket and the others are duplicates.
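The MIN/MAX lookup in steps 3 and 4 of the example relies on row keys being kept in sorted order, so every key that shares a traditional index string falls between the two bounds. The sketch below imitates that with a sorted Python list and `bisect`; it is not HBase client code, the function names are hypothetical, and it assumes uppercase hexadecimal digests so that `F` is the largest possible digest character.

```python
from bisect import bisect_left, bisect_right

DIGEST_LEN = 16  # length of the MD5 suffix used in the example


def range_bounds(traditional_index):
    """Build the MIN and MAX keys for one traditional index string."""
    low = f"{traditional_index}_{'0' * DIGEST_LEN}"
    high = f"{traditional_index}_{'F' * DIGEST_LEN}"
    return low, high


def count_in_range(sorted_keys, traditional_index):
    """Count stored keys between MIN and MAX, both bounds inclusive."""
    low, high = range_bounds(traditional_index)
    return bisect_right(sorted_keys, high) - bisect_left(sorted_keys, low)


row_keys = sorted([
    "ABCDEFGHIJK_0A1B2C3D4E5F6071",
    "ABCDEFGHIJK_9F8E7D6C5B4A3921",
    "ABCDEFGHIJL_0000000000000001",
    "ABCDEFGHIJ_1111111111111111",
])
matches = count_in_range(row_keys, "ABCDEFGHIJK")   # more than one: go to step 5
single = count_in_range(row_keys, "ABCDEFGHIJL")    # exactly one: normal ticket
missing = count_in_range(row_keys, "ZZZ")
```

Keys for the neighbouring index strings `ABCDEFGHIJL` and `ABCDEFGHIJ` sort outside the bounds, so they are not counted against `ABCDEFGHIJK`. With a real store, whether the upper bound of a scan is inclusive would need to be checked against the client's scan semantics.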
The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (4)
1. A database-based deduplication method, characterized in that it comprises the following steps:
Step 1: read all call tickets in a CDR file, obtain the index information of each ticket, and write all index information into HBase;
Step 2: determine, from the index information in HBase, whether each ticket in the CDR file is a duplicate ticket or a normal ticket; write duplicate records to the duplicate-ticket file and write normal tickets to the normal deduplication output;
Step 1 specifically comprises the following steps:
Step 1.1: read one ticket from the CDR file as the current ticket and obtain the index information of the current ticket;
Step 1.2: write the index information of the current ticket into HBase, and store in a database the time at which the index information was written into HBase;
Step 1.3: determine whether the CDR file still contains unread tickets; if so, return to step 1.1; otherwise, go to step 2;
Step 2 specifically comprises the following steps:
Step 2.1: select one traditional index string in HBase as the current string, and read from HBase all index information that contains the current string;
Step 2.2: determine whether that index information consists of a single entry; if so, the corresponding ticket is a normal ticket, and step 2.4 is executed; otherwise, step 2.3 is executed;
Step 2.3: determine, from the index information and the time at which it was written into HBase, whether the ticket corresponding to each index entry is a duplicate; if so, write the duplicate record to the duplicate-ticket file and execute step 2.4; otherwise, execute step 2.4;
Step 2.4: determine whether HBase still contains traditional index strings that have not been selected; if so, return to step 2.1; otherwise, write the normal tickets to the normal deduplication output and finish.
2. The database-based deduplication method according to claim 1, characterized in that the index information consists of the traditional index string and an MD5 value computed from the file name and line number.
3. The database-based deduplication method according to claim 2, characterized in that the MD5 value of the file name and line number is computed from the current CDR file name and the position of the ticket in the CDR file.
4. The database-based deduplication method according to claim 1, characterized in that step 2.3 specifically comprises the following steps:
Step 2.3.1: using the file names in the index information, determine whether the tickets corresponding to the multiple index entries come from the same file; if so, execute step 2.3.2; otherwise, execute step 2.3.3;
Step 2.3.2: query by the line numbers in the index information; the ticket whose index entry has the smallest line number is confirmed as the normal ticket and the others as duplicates; write the normal ticket to the normal deduplication output and write the duplicate records to the duplicate-ticket file;
Step 2.3.3: query by the time at which the index information was written into HBase; the ticket whose index entry was written earliest is confirmed as the normal ticket and the others as duplicates; write the normal ticket to the normal deduplication output and write the duplicate records to the duplicate-ticket file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610236006.8A CN105930396B (en) | 2016-04-15 | 2016-04-15 | A kind of repetition removing method and system based on database |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610236006.8A CN105930396B (en) | 2016-04-15 | 2016-04-15 | A kind of repetition removing method and system based on database |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105930396A CN105930396A (en) | 2016-09-07 |
CN105930396B true CN105930396B (en) | 2019-04-09 |
Family
ID=56839153
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610236006.8A Active CN105930396B (en) | 2016-04-15 | 2016-04-15 | A kind of repetition removing method and system based on database |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105930396B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106599326B (en) * | 2017-01-23 | 2020-02-04 | 北京思特奇信息技术股份有限公司 | Recorded data duplication eliminating processing method and system under cloud architecture |
CN112069510B (en) * | 2020-07-24 | 2024-01-30 | 北京思特奇信息技术股份有限公司 | Data encryption and duplication elimination method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1809108A (en) * | 2006-02-20 | 2006-07-26 | 南京联创科技股份有限公司 | Filter based call ticket memory repetition elimination method |
CN101350869A (en) * | 2007-07-19 | 2009-01-21 | 中国电信股份有限公司 | Method and apparatus for removing repeat of telecom charging based on index and hash |
CN101442731A (en) * | 2008-12-12 | 2009-05-27 | 中国移动通信集团安徽有限公司 | Method and apparatus for removing call ticket repeat |
CN102156744A (en) * | 2011-04-18 | 2011-08-17 | 北京神州数码思特奇信息技术股份有限公司 | Method for eliminating repetition of memory dialog list |
Non-Patent Citations (3)
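Title |
---|
"大数据环境下的话单排重" (Call ticket deduplication in a big data environment); 张超 (Zhang Chao); 《技术与实践》; 20131231; full text |
"实时计费重单剔除技术研究" (Research on duplicate ticket removal for real-time billing); 吴杰 等 (Wu Jie et al.); 《计算机应用与软件》; 20041031; full text |
"重复话单剔除技术的探讨" (A discussion of duplicate call ticket removal techniques); 杨志雄 (Yang Zhixiong); 《电信科学》; 20041231; full text |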
Also Published As
Publication number | Publication date |
---|---|
CN105930396A (en) | 2016-09-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |