CN105930396B

CN105930396B - Database-based deduplication method and system

Info

Publication number: CN105930396B
Application number: CN201610236006.8A
Authority: CN
Inventors: 高洪磊
Original assignee: Beijing Si Tech Information Technology Co Ltd
Current assignee: Beijing Si Tech Information Technology Co Ltd
Priority date: 2016-04-15
Filing date: 2016-04-15
Publication date: 2019-04-09
Anticipated expiration: 2036-04-15
Also published as: CN105930396A

Abstract

The present invention relates to a database-based deduplication method and system. The method comprises the following steps. Step 1: read all call detail records (CDRs) in a CDR file, obtain the index information of each record, and write all index information to HBase. Step 2: based on the index information in HBase, determine whether each record in the CDR file is a duplicate record or a normal record; write duplicate records to a duplicate-record file and write normal records to the normal deduplication output. The invention stores historical indexes in HBase, which is secure, stable, scalable and inexpensive to deploy, and uses a customized deduplication index algorithm so that multiple deduplication capability nodes can process records concurrently. In addition, the algorithm splits deduplication into two independent modules, one that writes indexes to the HBase database and one that confirms duplicate records; in practice these can be deployed as separate capabilities to further improve deduplication efficiency.

Description

Database-based deduplication method and system

Technical field

The present invention relates to a database-based deduplication method and system, and belongs to the field of telecommunications.

Background art

Deduplication is a component of a telecommunication billing system and is responsible for removing duplicate records during charging.

Certain fields are extracted from a call detail record (CDR) and combined into an index string that uniquely represents that record. A deduplication capability node determines whether the current record is a duplicate by comparing its index string with the historical indexes. Depending on how the indexes are stored, deduplication is generally divided into file-based deduplication, memory-based deduplication, mixed file-and-memory deduplication, and so on. Because of the way historical indexes are stored, capability nodes that share the same indexes must be deployed on one physical host. Since host resources are limited, large volumes of data cannot be processed uniformly in practice; instead, records have to be distributed by number segment or other information to capability nodes on different hosts. The result of this hard partitioning is that capability nodes serving number segments with heavy traffic run at full load for long periods, while capability nodes serving number segments with few records have low resource utilization. Moreover, when deduplication capability nodes deployed on the same machine operate on shared index data, the index information must be locked to keep the nodes from interfering with one another, which further reduces processing speed.

Summary of the invention

HBase is a highly reliable, high-performance, column-oriented, scalable open-source database. With HBase, a large-scale structured storage cluster can be built on inexpensive PC servers.

The technical problem to be solved by the present invention is to provide a database-based deduplication method and system that uses HBase for unified index access, separates the deduplication capability nodes from the historical index data they process, and changes the deduplication index algorithm so that different deduplication capability nodes no longer interfere with one another.

The technical solution of the present invention to the above technical problem is as follows: a database-based deduplication method, comprising the following steps:

Step 1: read all records in a CDR file, obtain the index information of each record, and write all index information to HBase;

Step 2: based on the index information in HBase, determine whether each record in the CDR file is a duplicate record or a normal record; write duplicate records to the duplicate-record file, and write normal records to the normal deduplication output.

The beneficial effects of the present invention are as follows: historical indexes are stored in HBase, which is secure, stable, scalable and inexpensive to deploy, and a customized deduplication index algorithm allows multiple deduplication capability nodes to process records concurrently. In addition, the algorithm splits deduplication into two independent modules, one that writes indexes to the HBase database and one that confirms duplicate records; in practice these can be deployed as separate capabilities to further improve deduplication efficiency.

On the basis of the above technical solution, the present invention can be further improved as follows.

Further, the index information is composed of the traditional index string and an MD5 value of the file name and line number.

The beneficial effect of the above further solution is as follows: the traditional index string is the unique deduplication index in the usual sense and uniquely identifies a record. The new index organization guarantees that different processes handling duplicate records never generate the same new index string, which avoids the problem of entries overwriting one another in HBase when their traditional index strings are identical.

Further, the MD5 value of the file name and line number is computed from the current CDR file name and the position of the record within the CDR file.
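As an illustration only, the index organization described above can be sketched in Python as follows. The function names, the `:` separator between file name and line number, the truncation of the MD5 digest to 16 upper-case hexadecimal characters, and the sample file name are assumptions made for this sketch; the text specifies only that an MD5 value is computed from the file name and the position of the record. The underscore joining the two parts follows the specific example given later in the description.

```python
import hashlib


def position_digest(file_name: str, line_number: int, length: int = 16) -> str:
    """MD5 over file name and line number, as upper-case hex cut to `length` chars.

    The input layout and the truncation are assumptions of this sketch.
    """
    raw = f"{file_name}:{line_number}".encode("utf-8")
    return hashlib.md5(raw).hexdigest().upper()[:length]


def build_index(traditional_index: str, file_name: str, line_number: int) -> str:
    """Join the traditional index string and the position digest with an underscore."""
    return f"{traditional_index}_{position_digest(file_name, line_number)}"


# Two records with the same traditional index string but different positions
# receive different new index strings, so neither overwrites the other.
first = build_index("ABCDEFGHIJK", "cdr_001.dat", 1)
second = build_index("ABCDEFGHIJK", "cdr_001.dat", 2)
```

Because the position digest differs for every (file name, line number) pair, records that share a traditional index string still map to distinct keys, which is the property the further solution relies on.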

Further, step 1 comprises the following steps:

Step 1.1: read one record from the CDR file as the current record, and obtain the index information of the current record;

Step 1.2: write the index information of the current record to HBase, and store the time at which the index information was written to HBase in the database;

Step 1.3: determine whether the CDR file still contains unread records; if so, execute step 1.1; otherwise, execute step 2.
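Steps 1.1 to 1.3 amount to a single pass over the file. The following minimal Python sketch shows one way that loop could look. It is not the patented implementation: plain dictionaries stand in for HBase and for the database that holds write times, and the names `new_index`, `write_indexes`, `index_table` and `write_times`, as well as the sample data, are hypothetical.

```python
import hashlib
import time


def new_index(traditional_index, file_name, line_number):
    """Traditional index string + '_' + truncated MD5 of file name and line number.

    The ':' separator and the 16-character truncation are assumptions.
    """
    digest = hashlib.md5(f"{file_name}:{line_number}".encode("utf-8")).hexdigest()
    return f"{traditional_index}_{digest[:16].upper()}"


def write_indexes(file_name, traditional_indexes, index_table, write_times):
    """Write one index entry per record and note when it was written.

    `index_table` stands in for HBase, `write_times` for the time database.
    """
    for line_number, traditional_index in enumerate(traditional_indexes, start=1):
        row_key = new_index(traditional_index, file_name, line_number)   # step 1.1
        index_table[row_key] = (file_name, line_number)                  # step 1.2
        write_times[row_key] = time.time_ns() // 1_000                   # microseconds
    # The loop ending corresponds to step 1.3 finding no unread records.


index_table, write_times = {}, {}
write_indexes("cdr_001.dat", ["ABCDEFGHIJK", "ABCDEFGHIJK", "ZZZ"],
              index_table, write_times)
```

In this sketch the repeated traditional index string `ABCDEFGHIJK` produces two separate entries rather than one overwritten entry, leaving the duplicate decision to step 2.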

The beneficial effect of the above further solution is that the time at which the index information is written to HBase is stored in the database, with the time accurate to the microsecond.

Further, step 2 comprises the following steps:

Step 2.1: select one traditional index string in HBase as the current string, and read all index information in HBase that contains the current string;

Step 2.2: determine whether the index information read consists of a single entry; if so, determine that the record corresponding to this index information is a normal record and execute step 2.4; otherwise, execute step 2.3;

Step 2.3: based on the index information and the time at which the index information was written to HBase, determine whether the record corresponding to each index information entry is a duplicate record; if so, write the duplicate record to the duplicate-record file and execute step 2.4; otherwise, execute step 2.4;

Step 2.4: determine whether HBase still contains traditional index strings that have not been selected; if so, execute step 2.1; otherwise, write the normal records to the normal deduplication output and end.
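The control flow of steps 2.1 to 2.4 can be sketched as a grouping pass. The Python below is an assumption-laden illustration: a dictionary stands in for HBase, row keys are assumed to have the form `<traditional index>_<digest>`, and the rule for choosing the normal record within a group is passed in as a hypothetical `pick_normal` callable rather than fixed here.

```python
from collections import defaultdict


def classify(index_table, pick_normal):
    """Split index entries into normal records and duplicate records.

    index_table maps '<traditional index>_<digest>' to a record payload.
    pick_normal chooses the one payload to keep from a group of several.
    """
    groups = defaultdict(list)
    for row_key, payload in index_table.items():
        traditional, _, _ = row_key.rpartition("_")      # step 2.1: group by string
        groups[traditional].append(payload)

    normal, duplicates = [], []
    for payloads in groups.values():
        if len(payloads) == 1:                           # step 2.2: single entry
            normal.append(payloads[0])
            continue
        keep = pick_normal(payloads)                     # step 2.3: resolve group
        normal.append(keep)
        duplicates.extend(p for p in payloads if p is not keep)
    return normal, duplicates                            # step 2.4: all strings done


# Hypothetical sample: payloads are (file name, line number) tuples.
table = {
    "A_01": ("f1.dat", 1),
    "A_02": ("f1.dat", 5),
    "B_01": ("f1.dat", 2),
}
normal, duplicates = classify(table, lambda ps: min(ps, key=lambda p: p[1]))
```

Here `normal` would be handed to the normal deduplication output and `duplicates` to the duplicate-record file.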

Further, step 2.3 comprises the following steps:

Step 2.3.1: based on the file names in the index information, determine whether the records corresponding to the multiple index information entries have the same file name; if so, execute step 2.3.2; otherwise, execute step 2.3.3;

Step 2.3.2: query by the line numbers in the index information; the record corresponding to the index information with the smallest line number is confirmed as the normal record and the others as duplicate records; write the normal record to the normal deduplication output and the duplicate records to the duplicate-record file;

Step 2.3.3: query by the time at which the index information was written to HBase; the record corresponding to the index information with the earliest write time is confirmed as the normal record and the others as duplicate records; write the normal record to the normal deduplication output and the duplicate records to the duplicate-record file.
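The decision rule of steps 2.3.1 to 2.3.3 is small enough to state as code. The sketch below is illustrative only; the `IndexEntry` type, its field names and the sample values are hypothetical, and the behavior when two entries from different files carry the same write time is not specified in the text (here the first such entry wins).

```python
from typing import List, NamedTuple, Tuple


class IndexEntry(NamedTuple):
    file_name: str
    line_number: int
    write_time_us: int   # time the entry was written to HBase, in microseconds


def split_group(entries: List[IndexEntry]) -> Tuple[IndexEntry, List[IndexEntry]]:
    """Return (normal record, duplicate records) for one traditional index string."""
    if len({e.file_name for e in entries}) == 1:
        # Step 2.3.2: same file, so the smallest line number is the normal record.
        normal = min(entries, key=lambda e: e.line_number)
    else:
        # Step 2.3.3: different files, so the earliest write time is the normal record.
        normal = min(entries, key=lambda e: e.write_time_us)
    duplicates = [e for e in entries if e is not normal]
    return normal, duplicates


same_file = [IndexEntry("a.dat", 7, 300), IndexEntry("a.dat", 3, 400)]
cross_file = [IndexEntry("a.dat", 9, 500), IndexEntry("b.dat", 1, 200)]
normal_same, dups_same = split_group(same_file)
normal_cross, dups_cross = split_group(cross_file)
```

Within one file the line number gives a deterministic order; across files only the write time is comparable, which is why the two cases use different keys.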

The technical solution of the present invention to the above technical problem further provides a database-based deduplication system, comprising a writing module and a deduplication module;

The writing module is configured to read all records in a CDR file, obtain the index information of each record, and write all index information to HBase;

The deduplication module is configured to determine, based on the index information in HBase, whether each record in the CDR file is a duplicate record or a normal record, to write duplicate records to the duplicate-record file, and to write normal records to the normal deduplication output.

Further, the writing module comprises a read module, an index module and a judgment module;

The read module is configured to read one record from the CDR file as the current record and obtain the index information of the current record;

The index module is configured to write the index information of the current record to HBase and to store the time at which the index information was written to HBase in the database;

The judgment module is configured to determine whether the CDR file still contains unread records; if so, it triggers the read module; otherwise, it triggers the deduplication module.

Brief description of the drawings

Fig. 1 is a flow chart of the database-based deduplication method according to Embodiment 1 of the present invention;

Fig. 2 is a structural diagram of the database-based deduplication system according to Embodiment 1 of the present invention;

Fig. 3 is a flow chart of the database-based deduplication method according to a specific example of the present invention.

In the drawings, the reference numerals denote the following parts:

1, writing module; 2, deduplication module.

Detailed description of the embodiments

The principles and features of the present invention are described below with reference to the drawings. The examples given serve only to explain the invention and are not intended to limit its scope.

As shown in Fig. 1, the database-based deduplication method according to Embodiment 1 of the present invention comprises the following steps:

In the database-based deduplication method according to Embodiment 2 of the present invention, on the basis of Embodiment 1, the index information is composed of the traditional index string and an MD5 value of the file name and line number.

In the database-based deduplication method according to Embodiment 3 of the present invention, on the basis of Embodiment 2, the MD5 value of the file name and line number is computed from the current CDR file name and the position of the record within the CDR file.

In the database-based deduplication method according to Embodiment 4 of the present invention, on the basis of any one of Embodiments 1 to 3, step 1 comprises the following steps:

In the database-based deduplication method according to Embodiment 5 of the present invention, on the basis of any one of Embodiments 1 to 4, step 2 comprises the following steps:

In the database-based deduplication method according to Embodiment 6 of the present invention, on the basis of Embodiment 5, step 2.3 comprises the following steps:

As shown in Fig. 2, the database-based deduplication system according to Embodiment 1 of the present invention comprises a writing module 1 and a deduplication module 2;

The writing module 1 is configured to read all records in a CDR file, obtain the index information of each record, and write all index information to HBase;

The deduplication module 2 is configured to determine, based on the index information in HBase, whether each record in the CDR file is a duplicate record or a normal record, to write duplicate records to the duplicate-record file, and to write normal records to the normal deduplication output.

In the database-based deduplication system according to Embodiment 2 of the present invention, on the basis of Embodiment 1, the index information is composed of the traditional index string and an MD5 value of the file name and line number.

In the database-based deduplication system according to Embodiment 3 of the present invention, on the basis of Embodiment 2, the MD5 value of the file name and line number is computed from the current CDR file name and the position of the record within the CDR file.

In the database-based deduplication system according to Embodiment 4 of the present invention, on the basis of any one of Embodiments 1 to 3, the writing module comprises a read module, an index module and a judgment module;

As shown in Fig. 3, a specific example of the method of the invention comprises the following steps:

1. Introduce a new index organization: the traditional index string, an underscore, and the MD5 value of the file name and line number are concatenated into one string. The traditional index string is the unique deduplication index in the usual sense and uniquely identifies a record. The MD5 value of the file name and line number can be computed from the current CDR file name and the position of the record in the file. The new index organization guarantees that different processes handling duplicate records never generate the same new index string, which avoids the problem of entries overwriting one another in HBase when their traditional index strings are identical.

2. Loop over the entire file, writing each new index to HBase as the rowkey, with the file name and line number as the value. Record in the in-memory database the time at which all indexes of the CDR file finished being written to HBase, with the time accurate to the microsecond.

3. Loop over the entire file and obtain the traditional index string of each record. Suppose the traditional index string of a record is ABCDEFGHIJK and the value computed in step 1 from the file name and line number is a 16-character MD5 value. Using the maximum and minimum MD5 values, assemble two new strings, MAX (ABCDEFGHIJK_FFFFFFFFFFFFFFFF) and MIN (ABCDEFGHIJK_0000000000000000).

4. Using the new index strings MAX and MIN obtained in step 3, look up the total number of entries within this range. If there is only one entry, the current record is a normal record; if there are several, go to step 5.
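The MIN/MAX lookup in steps 3 and 4 is a range query over lexicographically sorted row keys. The following Python sketch imitates it with a sorted list and the standard `bisect` module in place of an HBase scan. The function name `scan_range` and the sample keys are hypothetical, and the sketch assumes that digests are stored as upper-case hexadecimal so that every key falls between the all-`0` and all-`F` bounds.

```python
from bisect import bisect_left, bisect_right


def scan_range(sorted_keys, traditional_index, digest_len=16):
    """Return all row keys between MIN and MAX (both inclusive) for one index string."""
    low = f"{traditional_index}_{'0' * digest_len}"    # MIN
    high = f"{traditional_index}_{'F' * digest_len}"   # MAX
    start = bisect_left(sorted_keys, low)
    stop = bisect_right(sorted_keys, high)             # include a key equal to MAX
    return sorted_keys[start:stop]


keys = sorted([
    "ABCDEFGHIJJ_0123456789ABCDEF",
    "ABCDEFGHIJK_0000000000000001",
    "ABCDEFGHIJK_FFFFFFFFFFFFFFFF",
    "ABCDEFGHIJL_0000000000000000",
])
matches = scan_range(keys, "ABCDEFGHIJK")
is_normal = len(matches) == 1   # step 4: a single entry means a normal record
```

With a real HBase scan the stop row is exclusive by default, so an implementation would need to make the upper bound inclusive or choose a stop row just past MAX; the sketch sidesteps this by using `bisect_right`.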

5. For the multiple entries found, check whether their file names are the same as the current CDR file name. If they are the same, the rule is that the entry with the smaller line number is the normal record and entries with larger line numbers are duplicate records. If the file names differ, query the in-memory database for the time at which each entry was written to HBase: the entry whose write to HBase completed first is the normal record, and the others are duplicate records.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A database-based deduplication method, characterized in that it comprises the following steps:

Step 1: read all records in a CDR file, obtain the index information of each record, and write all index information to HBase;

Step 2: based on the index information in HBase, determine whether each record in the CDR file is a duplicate record or a normal record; write duplicate records to the duplicate-record file, and write normal records to the normal deduplication output;

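Step 1 comprises the following steps:

Step 1.1: read one record from the CDR file as the current record, and obtain the index information of the current record;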

Step 1.2: write the index information of the current record to HBase, and store the time at which the index information was written to HBase in the database;

Step 1.3: determine whether the CDR file still contains unread records; if so, execute step 1.1; otherwise, execute step 2;

Step 2 comprises the following steps:

Step 2.1: select one traditional index string in HBase as the current string, and read all index information in HBase that contains the current string;

Step 2.2: determine whether the index information read consists of a single entry; if so, determine that the record corresponding to this index information is a normal record and execute step 2.4; otherwise, execute step 2.3;

Step 2.3: based on the index information and the time at which the index information was written to HBase, determine whether the record corresponding to each index information entry is a duplicate record; if so, write the duplicate record to the duplicate-record file and execute step 2.4; otherwise, execute step 2.4;

Step 2.4: determine whether HBase still contains traditional index strings that have not been selected; if so, execute step 2.1; otherwise, write the normal records to the normal deduplication output and end.

2. The database-based deduplication method according to claim 1, characterized in that the index information is composed of the traditional index string and an MD5 value of the file name and line number.

3. The database-based deduplication method according to claim 2, characterized in that the MD5 value of the file name and line number is computed from the current CDR file name and the position of the record within the CDR file.

4. The database-based deduplication method according to claim 1, characterized in that step 2.3 comprises the following steps:

Step 2.3.1: based on the file names in the index information, determine whether the records corresponding to the multiple index information entries have the same file name; if so, execute step 2.3.2; otherwise, execute step 2.3.3;

Step 2.3.2: query by the line numbers in the index information; the record corresponding to the index information with the smallest line number is confirmed as the normal record and the others as duplicate records; write the normal record to the normal deduplication output and the duplicate records to the duplicate-record file;

Step 2.3.3: query by the time at which the index information was written to HBase; the record corresponding to the index information with the earliest write time is confirmed as the normal record and the others as duplicate records; write the normal record to the normal deduplication output and the duplicate records to the duplicate-record file.