CN109241045A - Method and apparatus for preprocessing data

Info

Publication number: CN109241045A
Application number: CN201810995986.9A
Authority: CN
Inventors: 胡飞
Original assignee: Pu Xin Heng Ye Technology Development (beijing) Co Ltd; Pleasant Sunny Technology Development (beijing) Co Ltd
Current assignee: Pu Xin Heng Ye Technology Development (beijing) Co Ltd; Pleasant Sunny Technology Development (beijing) Co Ltd
Priority date: 2018-08-29
Filing date: 2018-08-29
Publication date: 2019-01-18

Abstract

Embodiments of the present invention provide a method for preprocessing data. The method comprises: extracting a characteristic variable of a source data set; determining a characteristic interval of the source data set according to the characteristic variable; dividing the characteristic interval into a plurality of characteristic sub-intervals; and processing the data belonging to the plurality of characteristic sub-intervals. The invention improves the efficiency of data preprocessing and enhances the scalability of the data preprocessing module without incurring additional operation and maintenance (O&M) cost. Embodiments of the present invention further provide an apparatus for preprocessing data, a device, and a computer-readable storage medium.

Description

Method and apparatus for preprocessing data

Technical field

Embodiments of the present invention relate to the technical field of data mining, and more specifically to a method for preprocessing data, an apparatus for preprocessing data, a device, and a computer-readable storage medium.

Background

This section is intended to provide background or context for the embodiments of the invention set forth in the claims. The description herein is not admitted to be prior art merely because it is included in this section.

Data preprocessing refers to processing performed on data before the main processing. Real-world data is generally incomplete, inconsistent, and dirty; it either cannot be mined directly or yields unsatisfactory mining results. Data preprocessing techniques arose to improve the quality of data mining. There are many preprocessing methods, including data cleaning, data integration, data transformation, and data reduction. Data preprocessing may also comprise necessary steps such as auditing, screening, and sorting performed before the collected data is classified or grouped. Applied before data mining, these techniques greatly improve the quality of the mined patterns and reduce the time required for the actual mining.

The prior art offers the following two approaches to data preprocessing: one loads and processes records one by one directly from the data source; the other processes data with big data technologies such as Hadoop. The one-by-one approach is inefficient, its efficiency declines ever more noticeably as the data volume grows, and it is difficult to scale. Processing data with big data technologies such as Hadoop is efficient, but the development and O&M costs of Hadoop-based distributed programs rise accordingly.

Summary of the invention

The present invention aims to balance data processing efficiency, scalability, and development and O&M cost, improving the efficiency of data preprocessing and enhancing the scalability of the data preprocessing module without incurring additional O&M cost.

To achieve the above object, embodiments of the present invention provide a method for preprocessing data, an apparatus for preprocessing data, a device, and a computer-readable storage medium.

In a first aspect of embodiments of the present invention, a method for preprocessing data is provided, comprising: extracting a characteristic variable of a source data set; determining a characteristic interval of the source data set according to the characteristic variable; dividing the characteristic interval into a plurality of characteristic sub-intervals; and processing the data belonging to the plurality of characteristic sub-intervals.

In one embodiment of the invention, the characteristic variable is a feature shared by the data in the source data set.

In another embodiment of the invention, the characteristic sub-intervals are of equal length.

In yet another embodiment of the invention, the data belonging to the plurality of characteristic sub-intervals are processed simultaneously.

In yet another embodiment of the invention, the method for preprocessing data further comprises: storing the processed data in a target location.

In yet another embodiment of the invention, the method for preprocessing data further comprises: verifying the consistency between the source data set and the data at the target location.

In a second aspect of embodiments of the present invention, an apparatus for preprocessing data is provided, comprising: a characteristic variable module for extracting a characteristic variable of a source data set; a characteristic interval module for determining a characteristic interval of the source data set according to the characteristic variable; a characteristic sub-interval module for dividing the characteristic interval into a plurality of characteristic sub-intervals; and a data processing module for processing the data belonging to the plurality of characteristic sub-intervals.

In yet another embodiment of the invention, the apparatus for preprocessing data further comprises: a storage module for storing the processed data in a target location.

In yet another embodiment of the invention, the apparatus for preprocessing data further comprises: a verification module for verifying the consistency between the source data set and the data at the target location.

In a third aspect of embodiments of the present invention, a device is provided, comprising: a memory for storing a computer program; and a processor for executing the computer program stored in the memory, wherein the computer program, when executed, implements any one of the methods described above.

In a fourth aspect of embodiments of the present invention, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements any one of the methods described above.

The method for preprocessing data, the apparatus for preprocessing data, the device, and the computer-readable storage medium according to embodiments of the present invention improve the efficiency of data preprocessing without incurring additional O&M cost; when the volume of the source data set grows, the system can be scaled out quickly simply by adding preprocessing modules.

The technical solution provided by the invention does not process the data of the source data set directly. Instead, it proposes a characteristic variable model, splits the source data set into several mutually disjoint sub-data sets, and decouples the source data set from the preprocessing modules by means of a distributed cache. This facilitates dynamic scaling of the preprocessing modules, greatly improves the efficiency of data preprocessing, and reduces the O&M cost of the overall system.

Brief description of the drawings

The above and other objects, features, and advantages of exemplary embodiments of the invention will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the invention are shown by way of example and not limitation, in which:

Fig. 1 schematically shows a flowchart of a method for preprocessing data according to an embodiment of the present invention;

Fig. 2 schematically shows a structural diagram of an apparatus for preprocessing data according to an embodiment of the present invention;

Fig. 3 schematically shows a structural diagram of a device according to an embodiment of the present invention;

Fig. 4 schematically shows a diagram of a computer-readable storage medium according to an embodiment of the present invention.

In the drawings, identical or corresponding reference numerals denote identical or corresponding parts.

Detailed description

The principle and spirit of the invention are described below with reference to several exemplary embodiments. It should be understood that these embodiments are provided only to enable those skilled in the art to better understand and thereby implement the invention, and not to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the disclosure to those skilled in the art.

Those skilled in the art will appreciate that embodiments of the present invention may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may take the form of entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of hardware and software.

According to embodiments of the present invention, a method for preprocessing data, an apparatus for preprocessing data, a device, and a computer-readable storage medium are proposed.

The principle and spirit of the invention are explained in detail below with reference to several representative embodiments.

The inventor has recognized that both prior-art approaches to data preprocessing have significant drawbacks: operating on the data directly either yields low processing efficiency and poor scalability, or entails high development and O&M cost.

The data preprocessing solution provided by the invention does not process the data of the source data set directly. Instead, it proposes a characteristic variable model, splits the source data set into several mutually disjoint sub-data sets, and decouples the source data set from the preprocessing modules by means of a distributed cache. Multiple preprocessing modules then fetch the sub-data sets from the cache and preprocess them in parallel using multiple threads, which facilitates dynamic scaling of the preprocessing modules and speeds up data preprocessing.

Having introduced the basic principle of the invention, various non-limiting embodiments of the invention are described in detail below.

According to embodiments of the present invention, the application scenarios of the invention fall within the broad field of data mining; more specifically, the application scenario is data preprocessing.

Exemplary method

A method for preprocessing data according to an exemplary embodiment of the invention is described below with reference to Fig. 1. It should be noted that the above application scenario is shown only to facilitate understanding of the spirit and principle of the invention, and embodiments of the invention are not limited in this respect. Rather, embodiments of the invention may be applied to any applicable scenario.

Fig. 1 schematically shows a flowchart of a method for preprocessing data according to an embodiment of the invention. The method is typically implemented by a device such as a computer or an intelligent terminal. Specifically, the method for preprocessing data may include:

S110: extract a characteristic variable of the source data set.

In data processing, a source data set to be processed can be examined along multiple dimensions. For example, suppose the source data set is the national train ticket information for July 21, 2018, comprising 10,000 train ticket records. Such data can be analyzed along dimensions such as departure station, terminal station, departure time, arrival time, total journey duration, fare, and seat class.

In the present invention, any of the above dimensions can serve as the characteristic variable of the source data set. In general, a characteristic variable is a feature shared by every record in the source data set. In special cases, however, it need not be common to every record, for example when some record lacks information for that dimension.

As an example, take the departure station as the characteristic variable of the source data set (the national train ticket information for July 21, 2018) and extract the departure station from every train ticket record. If a record lacks departure station information, a special marker can be substituted for the missing information so that no record is omitted from processing.
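As an illustration only, step S110 can be sketched in Python as follows. The record layout, the field name `departure_station`, and the sentinel `MISSING` are assumptions introduced for this example; the disclosure itself does not prescribe any of them.

```python
# Hypothetical sentinel substituted when a record lacks the chosen dimension.
MISSING = "__MISSING__"

def extract_characteristic(records, field):
    """Return the value of `field` for every record, in order.

    Absent or empty values are replaced by MISSING so that no record
    is dropped from later processing.
    """
    return [rec.get(field) or MISSING for rec in records]

# Three made-up ticket records; the second has no departure station.
tickets = [
    {"id": 1, "departure_station": "Beijing", "terminal_station": "Shanghai"},
    {"id": 2, "terminal_station": "Guangzhou"},
    {"id": 3, "departure_station": "Chengdu", "terminal_station": "Xi'an"},
]

stations = extract_characteristic(tickets, "departure_station")
```

The output list has exactly one entry per input record, which is what allows every record to receive an assigned value in the next step.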

S120: determine the characteristic interval of the source data set according to the characteristic variable.

Different types of data have different properties. For example, departure station, departure time, and seat class differ: the departure time can be expressed directly in numeric form, whereas the departure station and the seat class cannot be converted directly into numbers. To facilitate processing, data that cannot be converted directly into numbers can be quantified by assigning values to it, for example by numbering all distinct train departure stations in the country consecutively.

Continuing from the previous step, once the characteristic variable (the departure station) has been extracted from the source data set (the national train ticket information for July 21, 2018), values can be assigned to the characteristic variable so that it can be expressed numerically. The assignment may be a random numbering, may follow the original order of the records in the source data set, or may be computed from the properties of the data. However simple or complex the assignment scheme, the invention places no restriction on it.

Note that some records in the source data set may share the same characteristic variable value; for example, there may be many tickets for the same departure station. In that case, to facilitate processing, records with the same characteristic variable value should be given different assigned values. As an example of a simple assignment, each record's departure station in the source data set can be assigned a value individually, forming the characteristic interval [1, 10000]. Each value in the interval [1, 10000] then represents exactly one record in the source data set.
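A minimal sketch of one such simple assignment, numbering the records in their original order, might look like this. The function name `assign_values` and the choice to return the interval as a `(low, high)` tuple are assumptions made for illustration; the text permits any other assignment scheme.

```python
def assign_values(records):
    """Give every record a distinct integer, starting at 1, in original order.

    Returns the mapping value -> record and the closed characteristic
    interval (low, high) covered by the assigned values. Assumes that
    `records` is non-empty.
    """
    numbered = {value: rec for value, rec in enumerate(records, start=1)}
    interval = (1, len(records))
    return numbered, interval

# 10,000 made-up ticket records; many share the same departure station,
# yet each one still receives its own value.
tickets = [{"id": n, "departure_station": "Beijing"} for n in range(10000)]
numbered, interval = assign_values(tickets)
```

Because the values are distinct and consecutive, each integer in the interval identifies exactly one record, as in the [1, 10000] example above.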

S130: divide the characteristic interval into a plurality of characteristic sub-intervals.

To enable multithreaded, distributed processing, the characteristic interval can be divided into a plurality of characteristic sub-intervals. The division may or may not be equal, but in either case the sub-intervals must not intersect or overlap.

To balance the processing speed of the threads, the characteristic sub-intervals are preferably of equal, or nearly equal, length.
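The division in S130 can be illustrated with the following sketch, which splits a closed integer interval into disjoint sub-intervals whose lengths differ by at most one. The function name `split_interval` and its parameters are hypothetical and chosen only for this example.

```python
def split_interval(low, high, parts):
    """Split the closed integer interval [low, high] into `parts` disjoint,
    contiguous closed sub-intervals of equal or nearly equal length."""
    total = high - low + 1
    if parts < 1 or parts > total:
        raise ValueError("parts must be between 1 and the interval length")
    base, extra = divmod(total, parts)
    sub_intervals = []
    start = low
    for k in range(parts):
        # The first `extra` sub-intervals absorb one leftover value each.
        size = base + (1 if k < extra else 0)
        sub_intervals.append((start, start + size - 1))
        start += size
    return sub_intervals

quarters = split_interval(1, 10000, 4)
# quarters == [(1, 2500), (2501, 5000), (5001, 7500), (7501, 10000)]
thirds = split_interval(1, 10, 3)
# thirds == [(1, 4), (5, 7), (8, 10)]
```

Since each sub-interval begins immediately after the previous one ends, the sub-intervals cover the whole characteristic interval and never overlap.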

After this division, the data of each characteristic sub-interval can be held temporarily in a cache or another storage location. Preferably, the data of the characteristic sub-intervals are cached in Redis.
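The caching step might be sketched as below. To keep the example runnable without a server, it uses a small in-memory class, `DictCache`, as a stand-in for Redis; that class, the key format `subset:<low>-<high>`, and the JSON serialization are all assumptions of this sketch rather than details from the disclosure.

```python
import json

class DictCache:
    """In-memory stand-in for a distributed cache such as Redis."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

def cache_subsets(cache, numbered, sub_intervals):
    """Store the records of each sub-interval under its own cache key."""
    keys = []
    for low, high in sub_intervals:
        key = f"subset:{low}-{high}"  # hypothetical key format
        batch = [numbered[value] for value in range(low, high + 1)]
        cache.set(key, json.dumps(batch))
        keys.append(key)
    return keys

# Six made-up records keyed by their assigned values, in two sub-intervals.
numbered = {value: {"id": value} for value in range(1, 7)}
cache = DictCache()
keys = cache_subsets(cache, numbered, [(1, 3), (4, 6)])
second_batch = json.loads(cache.get("subset:4-6"))
```

Each sub-data set sits under its own key, so a preprocessing worker only needs a key to fetch its share of the work; this is the decoupling between the source data set and the preprocessing modules described above. With a real Redis server, a client's `set` and `get` operations could play the role of `DictCache`.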

S140: process the data belonging to the plurality of characteristic sub-intervals.

In this step, the data of each characteristic sub-interval are processed according to predetermined rules. Specifically, the data of each characteristic sub-interval are first fetched from the storage location (e.g., the cache) and then processed according to the predetermined processing rules. Preferably, the data of multiple characteristic sub-intervals are fetched from the cache and then processed in parallel (simultaneously).
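As a hedged sketch of the parallel processing in S140, the following uses a thread pool from the Python standard library. The processing rule in `preprocess` (trimming and upper-casing the station name) is a made-up placeholder for whatever predetermined rule applies, and the sub-data sets are passed in directly rather than fetched from a cache.

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(record):
    """Hypothetical processing rule: normalize the departure station name."""
    cleaned = record["departure_station"].strip().upper()
    return {**record, "departure_station": cleaned}

def process_subset(batch):
    """Apply the rule to every record of one sub-data set."""
    return [preprocess(record) for record in batch]

def process_all(subsets, workers=4):
    """Process the sub-data sets concurrently, one task per sub-data set.

    Executor.map returns results in the order of the inputs, so the
    output lines up with `subsets`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_subset, subsets))

subsets = [
    [{"id": 1, "departure_station": " beijing "}],
    [{"id": 2, "departure_station": "chengdu"}],
]
results = process_all(subsets)
```

Because the sub-data sets are disjoint, the workers share no records and need no coordination with each other; adding workers is then a matter of raising `workers` or, in a distributed setting, starting more preprocessing modules.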

More preferably, the method for preprocessing data of the invention may further include:

S150: store the processed data in a target location.

Continuing from the previous step, the processed data of each characteristic sub-interval are stored in the target location for subsequent use. The target location may be any location or medium capable of storing data, such as a cache, a local disk, or a cloud server.

More preferably, the method for preprocessing data of the invention may further include:

S160: verify the consistency between the source data set and the data at the target location.

To ensure that all data in the source data set have been preprocessed, the markers recorded while executing the preceding steps can be used to check whether the data of the source data set are consistent with the data at the target location. If the two are consistent, data preprocessing is deemed complete; otherwise it is not. If data preprocessing is incomplete, the relevant steps of the method can be re-executed starting from S110.
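One possible reading of the check in S160 is sketched below, under the assumption that the markers are the integer values assigned in S120 and that the target location can report which markers it holds. The disclosure does not specify the form of the markers, so the function names and this interpretation are illustrative only.

```python
def missing_markers(source_markers, target_markers):
    """Return the markers present in the source but absent from the target."""
    return sorted(set(source_markers) - set(target_markers))

def is_consistent(source_markers, target_markers):
    """Preprocessing is deemed complete when both sides hold the same markers."""
    return set(source_markers) == set(target_markers)

source = range(1, 11)                  # markers assigned to ten source records
target = [1, 2, 3, 4, 5, 6, 7, 9, 10]  # marker 8 never reached the target

complete = is_consistent(source, target)
gaps = missing_markers(source, target)
```

Here the check fails and reports marker 8 as missing, which would be the trigger for re-executing the method from S110 as described above.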

Exemplary apparatus

Having described the method of the exemplary embodiment, an apparatus for preprocessing data according to an exemplary embodiment of the invention is described next with reference to Fig. 2.

Fig. 2 schematically shows a structural diagram of an apparatus for preprocessing data according to an embodiment of the invention. The apparatus is typically a stand-alone unit, although embodiments of the invention do not exclude the apparatus, or a part of it, being arranged in a server or in other equipment; the invention places no restriction on this. The apparatus may include a characteristic variable module 210, a characteristic interval module 220, a characteristic sub-interval module 230, and a data processing module 240. Specifically:

Characteristic variable module 210, for extracting a characteristic variable of the source data set.

Characteristic interval module 220, for determining the characteristic interval of the source data set according to the characteristic variable.

Continuing from the preceding module, once the characteristic variable (the departure station) has been extracted from the source data set (the national train ticket information for July 21, 2018), values can be assigned to the characteristic variable so that it can be expressed numerically. The assignment may be a random numbering, may follow the original order of the records in the source data set, or may be computed from the properties of the data. However simple or complex the assignment scheme, the invention places no restriction on it.

Characteristic sub-interval module 230, for dividing the characteristic interval into a plurality of characteristic sub-intervals.

Data processing module 240, for processing the data belonging to the plurality of characteristic sub-intervals.

In this module, the data of each characteristic sub-interval are processed according to predetermined rules. Specifically, the data of each characteristic sub-interval are first fetched from the storage location (e.g., the cache) and then processed according to the predetermined processing rules. Preferably, the data of multiple characteristic sub-intervals are fetched from the cache and then processed in parallel (simultaneously).

Compared with the prior art, the invention does not process the source data set directly. Instead, it forms a characteristic interval from the characteristic variable, splits the source data set into disjoint sub-data sets, and decouples the source data set from the preprocessing operations by means of a distributed cache. Parallel processing of the data begins only at the data processing module 240, which greatly improves the efficiency of data preprocessing, facilitates dynamic scaling of the preprocessing modules, and reduces the O&M cost of the overall system.

More preferably, the apparatus for preprocessing data of the invention may further include:

Storage module 250, for storing the processed data in a target location.

Continuing from the preceding module, the processed data of each characteristic sub-interval are stored in the target location for subsequent use. The target location may be any location or medium capable of storing data, such as a cache, a local disk, or a cloud server.

More preferably, the apparatus for preprocessing data of the invention may further include:

Verification module 260, for verifying the consistency between the source data set and the data at the target location.

To ensure that all data in the source data set have been preprocessed, the markers recorded while executing the preceding modules can be used to check whether the data of the source data set are consistent with the data at the target location. If the two are consistent, data preprocessing is deemed complete; otherwise it is not. If data preprocessing is incomplete, the above modules can be re-executed starting from the characteristic variable module 210.

Exemplary device

Having described the method and apparatus of the exemplary embodiments, a device according to an exemplary embodiment of the invention is described next with reference to Fig. 3.

Fig. 3 shows a block diagram of an exemplary computer system/server 30 suitable for implementing embodiments of the invention. The computer system/server 30 shown in Fig. 3 is only an example and should not impose any limitation on the functionality or scope of use of embodiments of the invention.

As shown in Fig. 3, the computer system/server 30 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors or processing units 301, a system memory 302, and a bus 303 connecting the different system components (including the system memory 302 and the processing unit 301).

The computer system/server 30 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer system/server 30, including volatile and non-volatile media and removable and non-removable media.

The system memory 302 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 3021 and/or cache memory 3022. The computer system/server 30 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, ROM 3023 may be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 3, commonly called a "hard disk drive"). Although not shown in Fig. 3, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (e.g., a CD-ROM, DVD-ROM, or other optical medium) may be provided. In these cases, each drive may be connected to the bus 303 through one or more data media interfaces. The system memory 302 may include at least one program product having a set of (e.g., at least one) program modules configured to perform the functions of the embodiments of the invention.

A program/utility 3025 having a set of (at least one) program modules 3024 may be stored, for example, in the system memory 302. Such program modules 3024 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a networking environment. The program modules 3024 generally carry out the functions and/or methods of the embodiments described herein.

The computer system/server 30 may also communicate with one or more external devices 304 (e.g., a keyboard, a pointing device, a display). Such communication may take place through an input/output (I/O) interface 305. The computer system/server 30 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 306. As shown in Fig. 3, the network adapter 306 communicates with the other modules of the computer system/server 30 (such as the processing unit 301) via the bus 303. It should be understood that, although not shown in Fig. 3, other hardware and/or software modules may be used in conjunction with the computer system/server 30.

The processing unit 301 runs the computer program stored in the system memory 302 to execute various functional applications and data processing, for example executing instructions that implement the steps of the above method embodiments. Specifically, the processing unit 301 may execute the computer program stored in the system memory 302, and when the computer program is executed, the following instructions are run: extracting a characteristic variable of the source data set; determining the characteristic interval of the source data set according to the characteristic variable; dividing the characteristic interval into a plurality of characteristic sub-intervals; and processing the data belonging to the plurality of characteristic sub-intervals.

Exemplary medium

Having described the method, apparatus, and device of the exemplary embodiments, a computer-readable storage medium according to an exemplary embodiment of the invention is described next with reference to Fig. 4.

The computer-readable storage medium of Fig. 4 is an optical disc 40 on which a computer program (i.e., a program product) is stored. When executed by a processor, the program implements the steps recited in the above method embodiments, for example: extracting a characteristic variable of the source data set; determining the characteristic interval of the source data set according to the characteristic variable; dividing the characteristic interval into a plurality of characteristic sub-intervals; and processing the data belonging to the plurality of characteristic sub-intervals.

It should be noted that although several modules of the apparatus for preprocessing data are mentioned in the detailed description above, this division is merely exemplary and not mandatory. In fact, according to embodiments of the invention, the features and functions of two or more of the modules described above may be embodied in a single module. Conversely, the features and functions of one module described above may be further divided and embodied by multiple modules.

In addition, although the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that order, or that all of the operations shown must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.

Although the spirit and principle of the invention have been described with reference to several specific embodiments, it should be understood that the invention is not limited to the specific embodiments disclosed, nor does the division into aspects imply that features in those aspects cannot be combined to advantage; the division is merely for convenience of presentation. The invention is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A method for preprocessing data, characterized by comprising:

extracting a characteristic variable of a source data set;

determining a characteristic interval of the source data set according to the characteristic variable;

dividing the characteristic interval into a plurality of characteristic sub-intervals; and

processing the data belonging to the plurality of characteristic sub-intervals.

2. The method of claim 1, characterized in that the characteristic variable is a feature shared by the data in the source data set.

3. The method of claim 1, characterized in that the characteristic sub-intervals are of equal length.

4. The method of any one of claims 1-3, characterized in that the data belonging to the plurality of characteristic sub-intervals are processed simultaneously.

5. The method of claim 4, characterized by further comprising: storing the processed data in a target location.

6. The method of claim 5, characterized by further comprising: verifying the consistency between the source data set and the data at the target location.

7. An apparatus for preprocessing data, characterized by comprising:

a characteristic variable module for extracting a characteristic variable of a source data set;

a characteristic interval module for determining a characteristic interval of the source data set according to the characteristic variable;

a characteristic sub-interval module for dividing the characteristic interval into a plurality of characteristic sub-intervals; and

a data processing module for processing the data belonging to the plurality of characteristic sub-intervals.

8. The apparatus of claim 7, characterized in that the characteristic variable is a feature shared by the data in the source data set.

9. The apparatus of claim 7, characterized in that the characteristic sub-intervals are of equal length.

10. The apparatus of any one of claims 7-9, characterized in that the data belonging to the plurality of characteristic sub-intervals are processed simultaneously.

11. The apparatus of claim 10, characterized by further comprising:

a storage module for storing the processed data in a target location.

12. The apparatus of claim 11, characterized by further comprising:

a verification module for verifying the consistency between the source data set and the data at the target location.

13. A device, comprising:

a memory for storing a computer program; and

a processor for executing the computer program stored in the memory, wherein the computer program, when executed, implements the method of any one of claims 1-6.

14. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-6.