CN110008236A

CN110008236A - Distributed data auto-increment encoding method, system, device and medium

Info

Publication number: CN110008236A
Application number: CN201910301360.8A
Authority: CN
Inventors: 周孝文
Original assignee: Chongqing Tianpeng Network Co Ltd
Current assignee: Chongqing Tianpeng Network Co Ltd
Priority date: 2019-04-15
Filing date: 2019-04-15
Publication date: 2019-07-12
Anticipated expiration: 2039-04-15
Also published as: CN110008236B

Abstract

The invention discloses a distributed data auto-increment encoding method, system, electronic device and medium. The method comprises: obtaining external source data and generating a first set; obtaining processed data and generating a second set; computing the union of the first set and the second set to obtain a total data set; deduplicating the total data set to obtain a deduplicated data set; and obtaining the newly added data in the deduplicated data set and encoding the newly added data, thereby completing auto-increment encoding of the data. Without writing a UDF function, the invention implements auto-increment encoding of Hive data in SQL, so that a known name field can be encoded, which effectively reduces development cost and improves development efficiency.

Description

Distributed data auto-increment encoding method, system, device and medium

Technical field

The present invention relates to the technical field of big data, and in particular to a distributed data auto-increment encoding method, system, device and medium.

Background art

With the growth of computer storage capacity and the development of sophisticated algorithms, the volume of network data has grown exponentially in recent years, and applications that demand massive amounts of data, such as scientific data processing and business intelligence analysis, have become increasingly common. Mainstream big data processing technologies process data in a distributed manner through Hive, and auto-increment encoding of data is an important step in that processing. In the prior art, auto-increment encoding is implemented by writing UDF functions or similar means, which depends on Java development, incurs high development cost and is inefficient. As the demand for big data processing grows, the drawbacks of traditional auto-increment encoding are becoming increasingly prominent, so there is an urgent need for an auto-increment encoding technique that lowers development difficulty and improves development efficiency.

Summary of the invention

In view of the above problems, the present invention provides a distributed data auto-increment encoding method, system, device and medium that implement auto-increment of data in Hive in SQL without writing a UDF function, so that a known name field can be encoded.

Specifically, the present invention provides the following.

A distributed data auto-increment encoding method, comprising:

obtaining external source data and generating a first set;

obtaining processed data and generating a second set;

computing the union of the first set and the second set to obtain a total data set;

deduplicating the total data set to obtain a deduplicated data set;

obtaining the newly added data in the deduplicated data set and encoding the newly added data, thereby completing auto-increment encoding of the data.

Further, obtaining external source data and generating a first set specifically includes:

obtaining external source data and constructing a code field and a first field to obtain the first set;

and obtaining processed data and generating a second set specifically includes:

obtaining the most recently processed data and creating a second field to obtain the second set, where the default value of the second field is the value of the code field.

Further, deduplicating the total data set to obtain a deduplicated data set specifically includes:

grouping the data in the total data set by the field to be encoded;

sorting the data in the total data set by the code field;

taking the corresponding records from the sorted data according to a preset rule to form the deduplicated data set.

Further, obtaining the newly added data in the deduplicated data set and encoding the newly added data, thereby completing auto-increment encoding of the data, specifically includes:

sorting the data in the deduplicated data set by the second field, finding the data that satisfy a preset condition in the sorted data to obtain the newly added data, and encoding the newly added data according to the code field to obtain the value of the new code field, thereby completing auto-increment encoding of the data. This process auto-encodes the newly added data without affecting data that have already been encoded; the numbers of already-encoded content remain unchanged.

The above method implements auto-increment encoding of Hive data in SQL, without writing a UDF function.

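A distributed data auto-increment encoding system, comprising:

an external source data processing module, configured to obtain external source data and generate a first set;

a processed data processing module, configured to obtain processed data and generate a second set;

a data merging module, configured to compute the union of the first set and the second set to obtain a total data set;

a data deduplication module, configured to deduplicate the total data set to obtain a deduplicated data set;

a data auto-increment encoding module, configured to obtain the newly added data in the deduplicated data set and encode the newly added data, thereby completing auto-increment encoding of the data.

Further, the external source data processing module is specifically configured to:

obtain external source data and construct a code field and a first field to obtain the first set.

The processed data processing module is specifically configured to:

obtain the most recently processed data and create a second field to obtain the second set, where the default value of the second field is the value of the code field.

Further, the data deduplication module is specifically configured to:

group the data in the total data set by the field to be encoded;

sort the data in the total data set by the code field;

take the corresponding records from the sorted data according to a preset rule to form the deduplicated data set.

Further, the data auto-increment encoding module is specifically configured to:

sort the data in the deduplicated data set by the second field, find the data that satisfy a preset condition in the sorted data to obtain the newly added data, and encode the newly added data according to the code field to obtain the value of the new code field, thereby completing auto-increment encoding of the data.

The above system implements auto-increment encoding of Hive data in SQL, without writing a UDF function.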

An electronic device, comprising: a housing, a processor, a memory, a circuit board and a power supply circuit, wherein the circuit board is arranged inside the space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power supply circuit supplies power to each circuit or component of the electronic device; the memory stores executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to execute the aforementioned distributed data auto-increment encoding method.

A computer-readable storage medium storing one or more programs, where the one or more programs can be executed by one or more processors to implement the aforementioned distributed data auto-increment encoding method.

The beneficial effects of the present invention are as follows:

The present invention implements auto-increment encoding of Hive data in SQL without writing a UDF function, so that a known name field can be encoded, which effectively reduces development cost and improves development efficiency. Unlike the traditional approach of simply applying a sequence to encode all the data in one pass, the present invention auto-encodes newly added records while the numbers of already-encoded content remain unchanged.

Brief description of the drawings

In order to illustrate the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. In all the drawings, similar elements or parts are generally identified by similar reference numerals. In the drawings, the elements or parts are not necessarily drawn to scale.

Fig. 1 is a flow chart of a distributed data auto-increment encoding method according to an embodiment of the present invention;

Fig. 2 is a structural diagram of a distributed data auto-increment encoding system according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the technical solutions of the present invention are described in detail below with reference to the drawings. The following embodiments are only intended to illustrate the technical solutions of the present invention more clearly; they serve as examples only and cannot be used to limit the scope of protection of the present invention.

It should be noted that, unless otherwise indicated, the technical or scientific terms used in this application have the ordinary meaning understood by those of ordinary skill in the art to which the present invention belongs.

Fig. 1 shows an embodiment of the distributed data auto-increment encoding method of the present invention, comprising:

S11: obtaining external source data and generating a first set;

S12: obtaining processed data and generating a second set;

S13: computing the union of the first set and the second set to obtain a total data set;

S14: deduplicating the total data set to obtain a deduplicated data set;

S15: obtaining the newly added data in the deduplicated data set and encoding the newly added data, thereby completing auto-increment encoding of the data.

Preferably, obtaining external source data and generating a first set specifically includes:

obtaining external source data and constructing a code field and a first field to obtain the first set; for example, a code field (c_id) is constructed with a default value of 0, and a further field (c_id2) is constructed with a default value of null;

and obtaining processed data and generating a second set specifically includes:

obtaining the most recently processed data and creating a second field to obtain the second set, where the default value of the second field is the value of the code field; for example, the data from the previous processing run are obtained and a field (c_id2) is constructed whose default value is the value of the code field (c_id).
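The construction of the two sets can be sketched as follows. This is a minimal illustration in plain Python rather than Hive SQL; the function names and the sample names "apple", "banana" and "cherry" are hypothetical, and only the field names c_n, c_id and c_id2 and their default values come from the description.

```python
def build_first_set(source_names):
    """External source rows: c_id defaults to 0 and c_id2 defaults to null (None)."""
    return [{"c_n": name, "c_id": 0, "c_id2": None} for name in source_names]


def build_second_set(previous_codes):
    """Rows from the previous run: c_id2 takes the value of the existing c_id."""
    return [
        {"c_n": name, "c_id": code, "c_id2": code}
        for name, code in previous_codes.items()
    ]


# Hypothetical sample data: "apple" was encoded in an earlier run and
# appears again in the new external source data.
first_set = build_first_set(["apple", "cherry"])
second_set = build_second_set({"apple": 1, "banana": 2})

# Union of both sets; duplicates are kept here and removed in the next step.
total_set = first_set + second_set
```

Because external rows carry c_id = 0 and a null c_id2, they can be told apart from previously encoded rows in the later deduplication and encoding steps.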

Preferably, deduplicating the total data set to obtain a deduplicated data set specifically includes:

grouping the data in the total data set by the field to be encoded;

sorting the data in the total data set by the code field;

taking the corresponding records from the sorted data according to a preset rule to form the deduplicated data set.

For example, the data in the total data set are grouped by the field to be encoded (c_n) and then sorted in descending order by the code field (c_id), and the record with sequence number 1 after sorting is taken. The main purpose of this step is deduplication to keep records unique: if the same field to be encoded (c_n) appears in both the first set and the second set, the record from the second set is taken, which ensures that an identical record receives the same code every time.
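The deduplication rule above (group by c_n, sort by c_id in descending order, keep the record with sequence number 1) can be sketched in plain Python as below. In Hive SQL this kind of step is commonly written with a row_number() window function; the dictionary-based loop here is only an equivalent stand-in, and the function name and sample rows are hypothetical.

```python
def deduplicate(total_set):
    """Keep, for each c_n, the row with the highest c_id.

    Rows from the external source have c_id = 0, so a previously encoded
    row (c_id >= 1) always wins over a new row with the same c_n.
    """
    best = {}
    for row in total_set:
        kept = best.get(row["c_n"])
        if kept is None or row["c_id"] > kept["c_id"]:
            best[row["c_n"]] = row
    return list(best.values())


# Hypothetical total set: "apple" appears in both the first and second set.
total_set = [
    {"c_n": "apple", "c_id": 0, "c_id2": None},   # from the first set
    {"c_n": "cherry", "c_id": 0, "c_id2": None},  # from the first set
    {"c_n": "apple", "c_id": 1, "c_id2": 1},      # from the second set
    {"c_n": "banana", "c_id": 2, "c_id2": 2},     # from the second set
]
deduped = deduplicate(total_set)
```

After this step "apple" keeps its existing code 1, while "cherry" remains with a null c_id2 and is therefore recognisable as newly added.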

Preferably, obtaining the newly added data in the deduplicated data set and encoding the newly added data, thereby completing auto-increment encoding of the data, specifically includes:

sorting the data in the deduplicated data set by the second field, finding the data that satisfy a preset condition in the sorted data to obtain the newly added data, and encoding the newly added data according to the code field to obtain the value of the new code field, thereby completing auto-increment encoding of the data. For example, the data in the deduplicated data set are sorted in descending order by the constructed field (c_id2) to obtain a sequence number rn2; after the descending sort, null values come last. It is then determined whether the constructed field (c_id2) is null: if it is not null, the value of the field (c_id2) is taken; if it is null, the sequence number rn2 is taken as the code. This yields the value of the new code field (c_id). This process auto-encodes the newly added data without affecting data that have already been encoded; the numbers of already-encoded content remain unchanged.
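The encoding rule can be sketched in plain Python as follows. The function name and sample rows are hypothetical. The description does not say how rows with a null c_id2 are ordered among themselves, so the tie-break by c_n is an assumption of this sketch. The sketch also assumes that existing codes form the contiguous range 1..n, which is what repeated runs of this same procedure produce; otherwise a sequence number could collide with an existing code.

```python
def assign_codes(deduped):
    """Return rows with their final c_id.

    Rows are ordered by c_id2 descending with nulls last; rn2 is the
    1-based position in that order. Existing rows keep c_id2 as their
    code, and rows with a null c_id2 take rn2 as their new code.
    """
    ordered = sorted(
        deduped,
        key=lambda r: (
            r["c_id2"] is None,    # nulls sort after all non-null values
            -(r["c_id2"] or 0),    # descending by existing code
            r["c_n"],              # assumed tie-break among new rows
        ),
    )
    result = []
    for rn2, row in enumerate(ordered, start=1):
        code = row["c_id2"] if row["c_id2"] is not None else rn2
        result.append({"c_n": row["c_n"], "c_id": code})
    return result


# Hypothetical deduplicated set: two existing rows and two new rows.
deduped = [
    {"c_n": "apple", "c_id": 1, "c_id2": 1},
    {"c_n": "cherry", "c_id": 0, "c_id2": None},
    {"c_n": "banana", "c_id": 2, "c_id2": 2},
    {"c_n": "date", "c_id": 0, "c_id2": None},
]
encoded = {r["c_n"]: r["c_id"] for r in assign_codes(deduped)}
```

With two existing rows occupying sequence numbers 1 and 2, the new rows land at positions 3 and 4 and receive those numbers as codes, while "apple" and "banana" keep codes 1 and 2.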

Fig. 2 shows an embodiment of the distributed data auto-increment encoding system of the present invention, comprising:

an external source data processing module 21, configured to obtain external source data and generate a first set;

a processed data processing module 22, configured to obtain processed data and generate a second set;

a data merging module 23, configured to compute the union of the first set and the second set to obtain a total data set;

a data deduplication module 24, configured to deduplicate the total data set to obtain a deduplicated data set;

a data auto-increment encoding module 25, configured to obtain the newly added data in the deduplicated data set and encode the newly added data, thereby completing auto-increment encoding of the data.
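One way to picture how modules 21 to 25 cooperate is as stages of a single pipeline whose output feeds the next run as the processed data. The sketch below is plain Python, not the Hive SQL of the invention; all function names and sample names are hypothetical, and the tie-break by c_n among new rows is an assumption.

```python
def external_source_module(names):        # module 21
    return [{"c_n": n, "c_id": 0, "c_id2": None} for n in names]


def processed_data_module(codes):         # module 22
    return [{"c_n": n, "c_id": c, "c_id2": c} for n, c in codes.items()]


def merge_module(first_set, second_set):  # module 23
    return first_set + second_set


def dedup_module(total_set):              # module 24
    best = {}
    for row in total_set:
        kept = best.get(row["c_n"])
        if kept is None or row["c_id"] > kept["c_id"]:
            best[row["c_n"]] = row
    return list(best.values())


def encode_module(deduped):               # module 25
    ordered = sorted(
        deduped,
        key=lambda r: (r["c_id2"] is None, -(r["c_id2"] or 0), r["c_n"]),
    )
    return {
        r["c_n"]: (r["c_id2"] if r["c_id2"] is not None else rn2)
        for rn2, r in enumerate(ordered, start=1)
    }


def run_pipeline(source_names, previous_codes):
    total = merge_module(
        external_source_module(source_names),
        processed_data_module(previous_codes),
    )
    return encode_module(dedup_module(total))


# First run: nothing has been encoded yet.
first_run = run_pipeline(["apple", "banana"], {})
# Second run: the result of the first run is the processed data.
second_run = run_pipeline(["banana", "cherry", "apple"], first_run)
```

In this example the second run assigns a new code only to "cherry"; "apple" and "banana" keep the codes they received in the first run.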

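Preferably, the external source data processing module 21 is specifically configured to:

obtain external source data and construct a code field and a first field to obtain the first set; for example, a code field (c_id) is constructed with a default value of 0, and a further field (c_id2) is constructed with a default value of null.

The processed data processing module 22 is specifically configured to:

obtain the most recently processed data and create a second field to obtain the second set, where the default value of the second field is the value of the code field.

Preferably, the data deduplication module 24 is specifically configured to:

group the data in the total data set by the field to be encoded;

sort the data in the total data set by the code field;

take the corresponding records from the sorted data according to a preset rule to form the deduplicated data set.

Preferably, the data auto-increment encoding module 25 is specifically configured to:

sort the data in the deduplicated data set by the second field, find the data that satisfy a preset condition in the sorted data to obtain the newly added data, and encode the newly added data according to the code field to obtain the value of the new code field, thereby completing auto-increment encoding of the data.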

An embodiment of the present invention also provides an electronic device which, as shown in Fig. 3, can implement the process of the embodiment shown in Fig. 1. The electronic device may include: a housing 31, a processor 32, a memory 33, a circuit board 34 and a power supply circuit 35, wherein the circuit board 34 is arranged inside the space enclosed by the housing 31, and the processor 32 and the memory 33 are arranged on the circuit board 34; the power supply circuit 35 supplies power to each circuit or component of the electronic device; the memory 33 stores executable program code; and the processor 32 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 33, so as to execute the aforementioned distributed data auto-increment encoding method.

For the specific execution of the above steps by the processor 32, and for the steps the processor 32 further executes by running the executable program code, refer to the description of the embodiment shown in Fig. 1 of the present invention; details are not repeated here.

The electronic device exists in a variety of forms, including but not limited to:

(1) Server: a device that provides computing services. A server consists of a processor, hard disk, memory, system bus and so on. Its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing capacity, stability, reliability, security, scalability and manageability;

(2) Other electronic devices with data interaction functions.

An embodiment of the present invention also provides a computer-readable storage medium storing one or more programs, where the one or more programs can be executed by one or more processors to implement the aforementioned distributed data auto-increment encoding method.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and they shall all be covered by the claims and the description of the present invention.

Claims

1. A distributed data auto-increment encoding method, characterized by comprising:

obtaining external source data and generating a first set;

obtaining processed data and generating a second set;

computing the union of the first set and the second set to obtain a total data set;

deduplicating the total data set to obtain a deduplicated data set;

obtaining the newly added data in the deduplicated data set and encoding the newly added data, thereby completing auto-increment encoding of the data.

2. The method according to claim 1, characterized in that obtaining external source data and generating a first set specifically includes:

obtaining external source data and constructing a code field and a first field to obtain the first set;

and obtaining processed data and generating a second set specifically includes:

obtaining the most recently processed data and creating a second field to obtain the second set, where the default value of the second field is the value of the code field.

3. The method according to claim 2, characterized in that deduplicating the total data set to obtain a deduplicated data set specifically includes:

grouping the data in the total data set by the field to be encoded;

sorting the data in the total data set by the code field;

taking the corresponding records from the sorted data according to a preset rule to form the deduplicated data set.

4. The method according to claim 3, characterized in that obtaining the newly added data in the deduplicated data set and encoding the newly added data, thereby completing auto-increment encoding of the data, specifically includes:

sorting the data in the deduplicated data set by the second field, finding the data that satisfy a preset condition in the sorted data to obtain the newly added data, and encoding the newly added data according to the code field to obtain the value of the new code field, thereby completing auto-increment encoding of the data.

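5. A distributed data auto-increment encoding system, characterized by comprising:

an external source data processing module, configured to obtain external source data and generate a first set;

a processed data processing module, configured to obtain processed data and generate a second set;

a data merging module, configured to compute the union of the first set and the second set to obtain a total data set;

a data deduplication module, configured to deduplicate the total data set to obtain a deduplicated data set;

a data auto-increment encoding module, configured to obtain the newly added data in the deduplicated data set and encode the newly added data, thereby completing auto-increment encoding of the data.

6. The system according to claim 5, characterized in that the external source data processing module is specifically configured to:

obtain external source data and construct a code field and a first field to obtain the first set;

and the processed data processing module is specifically configured to:

obtain the most recently processed data and create a second field to obtain the second set, where the default value of the second field is the value of the code field.

7. The system according to claim 6, characterized in that the data deduplication module is specifically configured to:

group the data in the total data set by the field to be encoded;

sort the data in the total data set by the code field;

take the corresponding records from the sorted data according to a preset rule to form the deduplicated data set.

8. The system according to claim 7, characterized in that the data auto-increment encoding module is specifically configured to:

sort the data in the deduplicated data set by the second field, find the data that satisfy a preset condition in the sorted data to obtain the newly added data, and encode the newly added data according to the code field to obtain the value of the new code field, thereby completing auto-increment encoding of the data.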

9. An electronic device, characterized in that the electronic device comprises: a housing, a processor, a memory, a circuit board and a power supply circuit, wherein the circuit board is arranged inside the space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power supply circuit supplies power to each circuit or component of the electronic device; the memory stores executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to execute the method according to any one of claims 1 to 4.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the method according to any one of claims 1 to 4.