CN108984547A - Data processing method and apparatus - Google Patents
- Publication number
- CN108984547A (application CN201710398741.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- spark
- data processing
- data source
- dataframe
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data processing method and a data processing apparatus, and relates to the field of computer technology. One specific embodiment of the data processing method includes: uniformly converting multiple data sources, including text data, relational database data, and distributed cluster data, into Spark Dataframes and registering them as Spark temporary tables; and executing queries against the Spark temporary tables across the data sources according to input provided by the user in SQL. While retaining the original business logic, this embodiment transfers the memory and computation pressure of the original traditional database to a distributed cluster, converting it into bandwidth pressure, and supports mixed queries over multiple kinds of data sources.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a data processing method, data processing apparatus, electronic device, and storage medium that are based on a distributed environment and in-memory computing and are intended for large-data-volume systems built on traditional databases.
Background
With the development of computer technology, data storage has become increasingly diverse and the volume of data involved keeps growing rapidly, which poses many challenges for data analysis and data mining.
To process such data, traditional relational databases such as MySQL are used, handling the data through operations such as join, groupby, and orderby. The drawback of a traditional relational database such as MySQL, however, is its limited data processing capacity: as the data volume grows, operations such as join, groupby, and orderby become extremely slow, and may even exhaust machine resources so that they cannot run at all.
To solve the problems of storing and computing on big data, the existing feasible and widely used approach is distributed computing and distributed storage. These existing solutions transfer the data of the traditional database into a distributed system and then use the distributed system's computing framework to solve the problem of slow execution.
However, these existing solutions have the following drawbacks:
1) The cost of transferring the data is too high. This shows mainly in guaranteeing data accuracy: handling conversions between different data types, newline characters, and the like consumes a great deal of manpower;
2) Data processing is delayed. The data volume of a traditional database is so large that transferring it takes a long time, which cannot meet the needs of real-time business;
3) The data of the traditional database has to be duplicated in the distributed system, which wastes a great deal of storage space; and
4) Processing across data sources is difficult. For example, one often wants to run mixed queries over a distributed database and a traditional database, or over structured and semi-structured data. The common approach at present is to import the data of one data source into another, for example using technologies such as sqoop, JDBC (Java Database Connectivity), or Hive external tables. Such processing takes a long time, however, and sqoop has problems handling special characters.
Therefore, in the prior art, a business that started out on a traditional database runs into the following problems once its data volume grows to a certain level: data processing is too slow, and the lack of uniformity among data source types makes the data processing flow complicated and slow.
Summary of the invention
In view of this, embodiments of the present invention provide a data processing method, data processing apparatus, electronic device, and storage medium that are based on a distributed environment and in-memory computing and are intended for large-data-volume systems built on traditional databases. While retaining the original business logic, they transfer the memory and computation pressure of the original traditional database to a distributed cluster, converting it into bandwidth pressure, and support mixed queries over multiple kinds of data sources.
To achieve the above object, according to one embodiment of the present invention, a data processing method is provided.
The data processing method according to an embodiment of the present invention includes: converting data from one or more data sources into a Spark Dataframe, the data sources including one or more of a text data source, a relational database, and a distributed cluster; registering the converted Spark Dataframe as a Spark temporary table; and executing a query against the Spark temporary table across the data sources according to input provided by the user in SQL.
Spark is a fast, general-purpose computing engine designed for large-scale data processing. A Spark Dataframe is a distributed collection of data organized into named columns. Conceptually, a Spark Dataframe is equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
Optionally, when the data source is a relational database, the data processing method further includes: periodically importing the converted Spark Dataframe into the distributed cluster.
Optionally, when the data source is a text data source, the data processing method further includes: defining a regular expression and a temporary table schema for each data source; converting the text data into a Spark RDD by means of the regular expression; and converting the Spark RDD into the Spark Dataframe in combination with the temporary table schema.
Optionally, the data processing method further includes: saving the query result, in the form of an intermediate table, to a relational database or a distributed cluster.
To achieve the above object, according to another embodiment of the present invention, a data processing apparatus is provided.
The data processing apparatus according to an embodiment of the present invention includes: a data conversion module for converting data from one or more data sources into a Spark Dataframe, the data sources including one or more of a text data source, a relational database, and a distributed cluster; a table registration module for registering the converted Spark Dataframe as a Spark temporary table; and a query module for executing a query against the Spark temporary table across the data sources according to input provided by the user in SQL.
Optionally, the data processing apparatus further includes a loading module for periodically importing the converted Spark Dataframe into the distributed cluster when the data source is a relational database.
Optionally, the data processing apparatus further includes a structuring module for performing the following operations when the data source is a text data source: defining a regular expression and a temporary table schema for each data source; converting the text data into a Spark RDD by means of the regular expression; and converting the Spark RDD into the Spark Dataframe in combination with the temporary table schema.
Optionally, the data processing apparatus further includes a table restoration module for saving the query result, in the form of an intermediate table, to a relational database or a distributed cluster.
To achieve the above object, according to yet another embodiment of the present invention, an electronic device is provided.
An electronic device according to an embodiment of the present invention includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the data processing method provided by the embodiments of the present invention.
To achieve the above object, according to still another embodiment of the present invention, a non-transitory computer-readable storage medium is provided.
A non-transitory computer-readable storage medium according to an embodiment of the present invention stores computer instructions, and the computer instructions are used to cause a computer to perform the data processing method provided by the embodiments of the present invention.
An embodiment of the above invention has the following advantages or beneficial effects:
1) For a business that started out on a traditional relational database such as MySQL, the original business logic can be retained;
2) Because the data of the traditional relational database (such as MySQL) is converted into a Spark Dataframe and registered as a Spark temporary table, and Spark runs in spark-on-yarn mode, the computation pressure is transferred from the traditional relational database to the distributed cluster, in contrast to operating on the traditional relational database directly. This overcomes the technical problem of the limited data processing capacity of the traditional relational database and thereby improves computation speed;
3) Because the Spark Dataframe built from the traditional relational database data is periodically imported into the distributed cluster Hadoop and stored in HDFS (Hadoop Distributed File System), the technical problems of high cost, delay, and wasted storage space caused by data transfer are overcome, and the memory and computation pressure of the original traditional database is converted into low-cost bandwidth pressure; and
4) Because all data sources are uniformly converted into Spark Dataframes and registered as Spark temporary tables, the technical problem of a complicated and slow data processing flow caused by non-uniform data source types is overcome, and mixed queries over multiple kinds of data sources are supported.
Further effects of the above non-conventional optional implementations are explained below in conjunction with specific embodiments.
Brief description of the drawings
The accompanying drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a system framework diagram of a data processing system that implements the data processing method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of one example of the data processing method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of another example of the data processing method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of yet another example of the data processing method according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of accessing data through the data processing method according to an embodiment of the present invention;
Fig. 6 is an exemplary system architecture diagram to which embodiments of the present invention can be applied; and
Fig. 7 is a schematic structural diagram of a computer system suitable for implementing a terminal device or server of an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. They include various details of the embodiments to aid understanding, and these details should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Fig. 1 is a system framework diagram of a data processing system that implements the data processing method according to an embodiment of the present invention.
The data processing system according to an embodiment of the present invention includes a Spark query system 101. The Spark query system 101 is the core of the data processing system and has two functions:
1. It supports access to one or more data sources 103, 104, and 105 of different types. The overall idea is to convert all data sources uniformly into Spark Dataframes and register them as Spark temporary tables. From the user's point of view, everything is a table in the Spark query system: the user can process these tables directly in SQL and query all data sources without caring how each table is stored. The results can be displayed in Excel form, as indicated by reference numeral 107.
2. It supports restoring results as intermediate tables to multiple kinds of data sources. This can be done by directly calling the API of the respective data source, for example writing to MySQL through the JDBC API or writing to Redis through jedis, as indicated by reference numeral 108.
The data processing method according to an embodiment of the present invention can implement a Spark-based data warehouse that supports multiple types of data sources, including text data (txt files and the like) 103, traditional relational database data (such as MySQL) 104, and distributed cluster data (Parquet, hbase) 105, and that supports mixed queries over all data sources. The user does not need to know the location or storage form of any specific data source and can operate on all data sources with SQL statements. How the data processing method supports each type of data source, and the corresponding processing, is described below with reference to Fig. 2 to Fig. 4.
Fig. 2 is a schematic diagram of one example of the data processing method according to an embodiment of the present invention, showing the processing of a text file such as a Log file.
For text files, such as txt, Excel, or Log data, the system can use the Txt (RegEx) module to define a regular expression and a temporary table schema for each input source and thereby convert semi-structured data into a structured table. During processing, the data processing method extracts the useful information with the regular expression and converts it into a Spark RDD; then, in combination with the input temporary table schema, it converts RDD+schema into a Spark Dataframe, registers the Dataframe as a Spark temporary table, and runs mixed queries together with the other data sources.
For example, in S101, a data analyst may need to know the daily execution status of the application on each server (service). The corresponding Log file records are as follows:
Service1:[2016-07-23] INFO:core service DealService | succ
Service1:[2016-07-23] INFO:core service DealService | fail
Service2:[2016-07-23] INFO:core service DealService | succ
Service3:[2016-07-23] INFO:core service DealService | succ
Service1:[2016-07-24] INFO:core service DealService | pass
Service3:[2016-07-24] INFO:core service DealService succ
Service2:[2016-07-24] INFO:core service DealService | succ
……
In S102, a regular expression can be defined, for example the input RegEx: Service ([^]+): ([^]+) INFO: core service DealService | ([^]+). After extraction with the regular expression, the Log data is converted into a Spark RDD, as shown below:
1 2016-07-23 succ
1 2016-07-23 fail
2 2016-07-23 succ
3 2016-07-23 succ
1 2016-07-24 pass
3 2016-07-24 succ
2 2016-07-24 succ
……
In S103, in combination with the input temporary table schema, for example the input Schema: Service.day.status, RDD+schema is converted into a Spark Dataframe.
In S104, the Spark Dataframe is registered as the Spark temporary table TABLE1. The resulting TABLE1 is a registered temporary table, and Spark SQL can then be used to run statistical SQL queries against it.
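The extraction in S102 and S103 can be sketched in plain Python as follows. This is a minimal illustration, not the implementation from the filing: the pattern `LOG_PATTERN`, the `Row` type, and the `parse_line` helper are assumptions chosen to fit the sample Log lines above, and the column names follow the Schema Service.day.status.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern: the character classes and escaping are assumptions
# chosen to fit the sample Log lines, not the pattern from the original filing.
LOG_PATTERN = re.compile(
    r"Service(\d+):\[(\d{4}-\d{2}-\d{2})\]\s+INFO:core service DealService\s*\|\s*(\w+)"
)


class Row(NamedTuple):
    """One structured record, following the schema service.day.status."""
    service: str
    day: str
    status: str


def parse_line(line: str) -> Optional[Row]:
    """Return a (service, day, status) row, or None if the line does not match."""
    m = LOG_PATTERN.search(line)
    return Row(*m.groups()) if m else None


LOG_LINES = [
    "Service1:[2016-07-23] INFO:core service DealService | succ",
    "Service1:[2016-07-23] INFO:core service DealService | fail",
    "Service2:[2016-07-23] INFO:core service DealService | succ",
    "not a service line",
]

# Lines that do not match are dropped before the structured rows are built.
rows = [r for r in map(parse_line, LOG_LINES) if r is not None]
for r in rows:
    print(r.service, r.day, r.status)
```

In PySpark, a function like `parse_line` could plausibly be mapped over the lines of a text file, and the resulting rows passed to `spark.createDataFrame` with the column names before calling `createOrReplaceTempView("TABLE1")`; that wiring is left out here so the sketch runs without a cluster.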
Fig. 3 is a schematic diagram of another example of the data processing method according to an embodiment of the present invention, showing the processing of traditional relational database data such as MySQL data.
Spark relies on JDBC to access traditional relational databases. For MySQL data, for example, Spark supports converting a MySQL table directly into a Spark Dataframe, which is then registered as a temporary table for mixed queries. Compared with operating on MySQL directly, the computation pressure is transferred from MySQL to the distributed cluster, because Spark runs in spark-on-yarn mode. For larger data volumes, the data can be loaded onto HDFS and stored in the columnar Parquet format, using the storage advantages of the distributed cluster to improve computation speed.
For example, a MySQL table is converted into the Spark temporary table TABLE2 as follows: in S201, a data analyst may need to query a MySQL table; in S202, the MySQL table is converted into a Spark Dataframe; in S203, the Spark Dataframe is registered as the Spark temporary table TABLE2.
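Steps S202 and S203 might look roughly like the following with PySpark's JDBC data source. This is a sketch under assumptions: the helper names `mysql_jdbc_options` and `register_mysql_table`, the host, database, table, and credentials are all hypothetical placeholders, and the `spark` argument is assumed to be an existing SparkSession with the MySQL JDBC driver on its classpath.

```python
from typing import Dict


def mysql_jdbc_options(host: str, database: str, table: str,
                       user: str, password: str, port: int = 3306) -> Dict[str, str]:
    """Build the option map for Spark's JDBC data source (all values are placeholders)."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def register_mysql_table(spark, options: Dict[str, str], view_name: str):
    """S202 + S203: load the MySQL table as a DataFrame and register a temporary view.

    `spark` is an existing pyspark.sql.SparkSession; nothing is imported from
    pyspark here, so this module loads even where Spark is not installed.
    """
    df = spark.read.format("jdbc").options(**options).load()
    df.createOrReplaceTempView(view_name)
    return df


# Hypothetical connection details, used only to show the resulting option map.
opts = mysql_jdbc_options("db.example.internal", "orders", "deal", "analyst", "secret")
print(opts["url"])
```

With a live session, `register_mysql_table(spark, opts, "TABLE2")` would make the table queryable through `spark.sql(...)` under the name TABLE2.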
Fig. 4 is a schematic diagram of yet another example of the data processing method according to an embodiment of the present invention, showing the processing of distributed cluster data such as a Parquet table.
The data processing method according to an embodiment of the present invention supports querying Parquet and hbase tables in HDFS, and Spark natively supports querying Hive tables. Therefore, after each data source is located by the SQL resolver layer 106, the data is uniformly converted into Spark Dataframes, so that mixed queries can be run over databases on the distributed cluster such as hbase, Hive, and Parquet.
For example, a Parquet table is converted into the Spark temporary table TABLE3 as follows: in S301, a data analyst may need to query a Parquet table on HDFS; in S302, the Parquet table is converted into a Spark Dataframe based on the table location; in S303, the Spark Dataframe is registered as the Spark temporary table TABLE3.
Fig. 5 is a schematic diagram of accessing data through the data processing method according to an embodiment of the present invention.
As shown in Fig. 5, the data processing method according to an embodiment of the present invention converts multiple data sources, including text data, traditional relational database data, and distributed cluster data, into Spark Dataframes and registers them as temporary tables. It then performs mixed access to the temporary tables through SQL commands. From the user's point of view, there is no need to know that TABLE1 is a Log file, TABLE2 is a MySQL table, TABLE3 is a Parquet table, or TABLE4 is a Hive table.
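The user's view of a mixed query can be illustrated with a small runnable stand-in. Note the substitution: Python's built-in sqlite3 module plays the role of the Spark SQL engine here, purely so the example runs anywhere. The rows of TABLE1 follow the Log example above, while the columns and contents of TABLE2 are hypothetical.

```python
import sqlite3

# sqlite3 is only a stand-in for the Spark SQL engine; in Spark each table
# below would be a DataFrame registered as a temporary table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TABLE1 (service TEXT, day TEXT, status TEXT)")  # originally a Log file
conn.execute("CREATE TABLE TABLE2 (service TEXT, owner TEXT)")  # originally MySQL (hypothetical columns)
conn.executemany("INSERT INTO TABLE1 VALUES (?, ?, ?)", [
    ("1", "2016-07-23", "succ"),
    ("1", "2016-07-23", "fail"),
    ("2", "2016-07-23", "succ"),
    ("3", "2016-07-23", "succ"),
])
conn.executemany("INSERT INTO TABLE2 VALUES (?, ?)", [
    ("1", "team-a"),
    ("2", "team-b"),
    ("3", "team-b"),
])

# One SQL statement joins both tables; nothing in it reveals where the rows came from.
query = """
    SELECT t2.owner, t1.status, COUNT(*) AS n
    FROM TABLE1 t1 JOIN TABLE2 t2 ON t1.service = t2.service
    GROUP BY t2.owner, t1.status
    ORDER BY t2.owner, t1.status
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The point of the sketch is the shape of the query: once every source is exposed under a table name, the join reads the same whether the rows began life in a Log file, MySQL, Parquet, or Hive.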
The operating flow of a data processing system that implements the data processing method according to an embodiment of the present invention is described below:
1. At a fixed time each day, generally when business volume is low, such as in the early morning, the previous day's data is converted into a Spark Dataframe by the Spark loading system 102, and the Dataframe is imported directly into the distributed cluster Hadoop and stored on HDFS, in preparation for later distributed computing.
2. When a user such as a data analyst processes this data and enters a SQL statement, the system's SQL resolver layer 106 locates the data sources.
3. The Spark query system 101 executes the query with the Spark SQL engine.
4. If the query result needs to be saved to simplify subsequent operations, the data is stored in an intermediate table in Parquet, hbase, MySQL, or the like, and the next query can operate on that table directly. The data can also be stored in Redis and displayed directly.
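Step 2 of the flow above, in which the SQL resolver layer locates the data sources, could be approximated as a lookup from the table names in a statement to a catalog. The sketch below is an assumption about how such a layer might work, not a description of the filing's resolver: `CATALOG`, its entries, and `locate_sources` are all hypothetical, and the table matcher is deliberately naive.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical catalog: table name -> (source kind, location). Every entry is illustrative.
CATALOG: Dict[str, Tuple[str, str]] = {
    "TABLE1": ("text", "/logs/service.log"),
    "TABLE2": ("jdbc", "jdbc:mysql://db.example.internal:3306/orders"),
    "TABLE3": ("parquet", "hdfs:///warehouse/deal_parquet"),
}

# Naive matcher: picks up the identifier following FROM or JOIN. A real resolver
# would use a SQL parser; this does not handle subqueries, quoting, or comma joins.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def locate_sources(sql: str) -> List[Tuple[str, str, str]]:
    """Return (table, kind, location) for each distinct catalog table referenced in the SQL."""
    seen: List[str] = []
    for name in TABLE_REF.findall(sql):
        if name in CATALOG and name not in seen:
            seen.append(name)
    return [(name, *CATALOG[name]) for name in seen]


sources = locate_sources(
    "SELECT a.day, COUNT(*) FROM TABLE1 a JOIN TABLE3 b ON a.service = b.service GROUP BY a.day"
)
for table, kind, location in sources:
    print(table, kind, location)
```

Each located source would then be loaded with the reader that matches its kind and registered under its table name before the statement is handed to the SQL engine.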
Fig. 6 is an exemplary system architecture diagram to which embodiments of the present invention can be applied.
As shown in Fig. 6, the data processing apparatus that performs the data processing method according to an embodiment of the present invention includes: a data conversion module 601, a table registration module 602, a query module 603, a loading module 604, a structuring module 605, and a table restoration module 606.
The data conversion module 601 is used to convert data from one or more data sources into a Spark Dataframe. The data sources include one or more of a text data source, a relational database, and a distributed cluster.
The table registration module 602 is used to register the converted Spark Dataframe as a Spark temporary table.
The query module 603 is used to execute a mixed query against the Spark temporary table across the data sources according to input provided by the user in SQL.
The loading module 604 is used, for a relational database, to periodically import the converted Spark Dataframe into the distributed cluster.
The structuring module 605 is used, for a text data source, to define a regular expression and a temporary table schema for each data source, convert the text data into a Spark RDD by means of the regular expression, and convert the Spark RDD into the Spark Dataframe in combination with the temporary table schema.
The table restoration module 606 is used to save the query result, in the form of an intermediate table, to a relational database or a distributed cluster, for example to Parquet, hbase, MySQL, or Redis.
Reference is now made to Fig. 7, which shows a schematic structural diagram of a computer system 700 suitable for implementing the data processing method of the embodiments of the present application. The computer system shown in Fig. 7 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present application.
As shown in Fig. 7, the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. The RAM 703 also stores the various programs and data required for the operation of the system 700. The CPU 701, ROM 702, and RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a loudspeaker and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read from it can be installed into the storage section 708 as needed.
In particular, according to the disclosed embodiments of the present invention, the processes described above with reference to Fig. 1 to Fig. 5 may be implemented as computer software programs. For example, the disclosed embodiments include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in Fig. 1 to Fig. 5. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 709 and/or installed from the removable medium 711. When the computer program is executed by the central processing unit (CPU) 701, the functions defined above in the system of the present application are performed.
It should be noted that the computer-readable medium described in the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in connection with an instruction execution system, apparatus, or device. Also in the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and that medium can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and any combination of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present application may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, it may be described as: a processor including a sending module, an obtaining module, a determining module, and a first processing module. The names of these modules do not, in certain cases, limit the modules themselves; for example, the sending module may also be described as "a module that sends a picture acquisition request to the connected server".
In another aspect, the present application also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist on its own without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to: convert data from one or more data sources into a Spark Dataframe, the data sources including one or more of a text data source, a relational database, and a distributed cluster; register the converted Spark Dataframe as a Spark temporary table; and execute a query against the Spark temporary table across the data sources according to input provided by the user in SQL.
The data processing method, data processing apparatus, electronic device, and storage medium according to embodiments of the present invention can bring the following beneficial effects:
For a traditional relational database such as MySQL, the computation pressure is transferred to the Hadoop (spark-on-yarn) cluster. Depending on whether the data analysis has a real-time requirement, processing is split into two paths:
1) Real-time data: using the Spark query system in the figure, the MySQL table is converted into a Dataframe and the SQL operations are executed on the Dataframe;
2) Non-real-time data: this is the majority of demand, which generally analyzes data up to the previous day. In this case, the Spark loading system converts the data into a Spark Dataframe at a scheduled time each day and imports the Dataframe into HDFS, stored in the Parquet table (columnar) format.
In this case, the only pressure on MySQL is exporting the table, which is equivalent to executing select * from table limit $spark_partition_num, where spark_partition_num is the partition size that Spark uses when importing MySQL data, split according to an expression condition and settable by the user. This part is a full import of the data; the traditional database bears no computation pressure and the system pressure is concentrated on bandwidth, for which a company's internal 10-gigabit network is sufficient. When the offline data is imported into HDFS it is stored in Parquet format; Parquet is a columnar format, and computation on it is fast.
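To make the idea of a partitioned export concrete, the sketch below splits a numeric key range into contiguous predicates, so that each partition issues a plain range scan against MySQL and no join or aggregation runs there. It is a simplified illustration under assumptions: the function `partition_predicates` and the column name `id` are hypothetical, and Spark's own JDBC partitioning computes its strides and open-ended first and last ranges differently.

```python
from typing import List


def partition_predicates(column: str, lower: int, upper: int, num_partitions: int) -> List[str]:
    """Split [lower, upper) on an integer column into at most num_partitions WHERE predicates.

    Simplified illustration only; this is not Spark's internal stride algorithm.
    """
    if num_partitions < 1 or upper <= lower:
        raise ValueError("need num_partitions >= 1 and upper > lower")
    step = -(-(upper - lower) // num_partitions)  # ceiling division
    predicates = []
    start = lower
    while start < upper:
        end = min(start + step, upper)
        predicates.append(f"{column} >= {start} AND {column} < {end}")
        start = end
    return predicates


preds = partition_predicates("id", 0, 10, 4)
for p in preds:
    print(p)
```

PySpark's `DataFrameReader.jdbc` accepts a `predicates` list of this kind, one entry per partition, which is one way a user-chosen partition size could be expressed.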
For cross-database and cross-data-source operations, the Spark query system converts each kind of data source into a Dataframe and runs cross-table SQL operations with Spark SQL.
The Txt (RegEx) module converts semi-structured data into structured tables that can be operated on with SQL.
For data analysts, the data operation entry point is unified: there is no need to care about the data source, and SQL is the only mode of operation.
The specific embodiments described above do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations and substitutions may occur. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. A data processing method, characterized by comprising:
converting data from one or more data sources into a Spark Dataframe, the data sources including one or more of a text data source, a relational database and a distributed cluster;
registering the converted Spark Dataframe as a Spark temporary table; and
executing, according to input made by a user in sql form, a query on the Spark temporary table across the data sources.
2. The data processing method according to claim 1,
wherein, when the data source is a relational database, the data processing method further comprises:
periodically importing the converted Spark Dataframe into the distributed cluster.
3. The data processing method according to claim 1,
wherein, when the data source is a text data source, the data processing method further comprises:
defining a regular expression and a temporary table schema for each data source;
converting the text data into a Spark RDD by means of the regular expression; and
converting the Spark RDD into the Spark Dataframe in combination with the temporary table schema.
4. The data processing method according to any one of claims 1 to 3, further comprising:
saving the query result, in the form of an intermediate table, to a relational database or a distributed cluster.
5. A data processing apparatus, characterized by comprising:
a data conversion module for converting data from one or more data sources into a Spark Dataframe, the data sources including one or more of a text data source, a relational database and a distributed cluster;
a table registration module for registering the converted Spark Dataframe as a Spark temporary table; and
a query module for executing, according to input made by a user in sql form, a query on the Spark temporary table across the data sources.
6. The data processing apparatus according to claim 5, wherein the data processing apparatus further comprises a loading module for periodically importing the converted Spark Dataframe into the distributed cluster when the data source is a relational database.
7. The data processing apparatus according to claim 5, wherein the data processing apparatus further comprises a structuring module for performing the following operations when the data source is a text data source:
defining a regular expression and a temporary table schema for each data source;
converting the text data into a Spark RDD by means of the regular expression; and
converting the Spark RDD into the Spark Dataframe in combination with the temporary table schema.
8. The data processing apparatus according to any one of claims 5 to 7, wherein
the data processing apparatus further comprises a table restoration module for saving the query result, in the form of an intermediate table, to a relational database or a distributed cluster.
9. An electronic device, characterized by comprising:
one or more processors; and
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 4.
10. A computer-readable medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710398741.3A CN108984547A (en) | 2017-05-31 | 2017-05-31 | The method and apparatus of data processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710398741.3A CN108984547A (en) | 2017-05-31 | 2017-05-31 | The method and apparatus of data processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108984547A (en) | 2018-12-11 |
Family
ID=64502015
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710398741.3A Pending CN108984547A (en) | 2017-05-31 | 2017-05-31 | The method and apparatus of data processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108984547A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102354317A (en) * | 2011-09-22 | 2012-02-15 | 用友软件股份有限公司 | Data generation device and method |
CN104123374A (en) * | 2014-07-28 | 2014-10-29 | 北京京东尚科信息技术有限公司 | Method and device for aggregate query in distributed databases |
CN105426481A (en) * | 2015-11-19 | 2016-03-23 | 北京京东尚科信息技术有限公司 | Data processing method and device |
CN105760477A (en) * | 2016-02-15 | 2016-07-13 | 中国建设银行股份有限公司 | Data query method and system for multiple data sources and associated equipment therefore |
WO2016118783A1 (en) * | 2015-01-23 | 2016-07-28 | Attivio, Inc. | Querying across a composite join of multiple database tables using a search engine index |
CN106354876A (en) * | 2016-09-22 | 2017-01-25 | 珠海格力电器股份有限公司 | Data processing system and method |
CN106570193A (en) * | 2016-11-17 | 2017-04-19 | 深圳市康拓普信息技术有限公司 | Time series big data loading method |
CN106570022A (en) * | 2015-10-10 | 2017-04-19 | 阿里巴巴集团控股有限公司 | Cross-data-source query method, apparatus and system |
CN106649630A (en) * | 2016-12-07 | 2017-05-10 | 乐视控股(北京)有限公司 | Data query method and device |
Non-Patent Citations (1)
Title |
---|
Yu Jun et al.: "Spark Core Technology and Advanced Applications" (《Spark核心技术与高级应用》), 31 January 2016 *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109918429A (en) * | 2019-01-21 | 2019-06-21 | 武汉烽火众智智慧之星科技有限公司 | Spark data processing method and system based on Redis |
CN110209646A (en) * | 2019-05-14 | 2019-09-06 | 汇通达网络股份有限公司 | A kind of data platform system calculated based on real-time streaming |
CN112148762A (en) * | 2019-06-28 | 2020-12-29 | 西安京迅递供应链科技有限公司 | Statistical method and device for real-time data stream |
CN112732704A (en) * | 2019-10-14 | 2021-04-30 | 中移(苏州)软件技术有限公司 | Data processing method, device and storage medium |
CN112732704B (en) * | 2019-10-14 | 2022-12-13 | 中移(苏州)软件技术有限公司 | Data processing method, device and storage medium |
CN111368097A (en) * | 2020-03-30 | 2020-07-03 | 中国建设银行股份有限公司 | Knowledge graph extraction method and device |
CN111368097B (en) * | 2020-03-30 | 2024-07-30 | 中国建设银行股份有限公司 | Knowledge graph extraction method and device |
CN112579673A (en) * | 2020-12-25 | 2021-03-30 | 中国建设银行股份有限公司 | Multi-source data processing method and device |
CN114625757A (en) * | 2022-03-29 | 2022-06-14 | 医渡云(北京)技术有限公司 | Task execution method and device based on domain specific language, medium and equipment |
CN115905392A (en) * | 2022-12-23 | 2023-04-04 | 中电金信软件有限公司 | Stream and batch integrated data processing method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108984547A (en) | The method and apparatus of data processing | |
US11630830B2 (en) | Background format optimization for enhanced queries in a distributed computing cluster | |
US9990399B2 (en) | Low latency query engine for apache hadoop | |
CN109189835A | | Method and apparatus for generating a wide data table in real time | |
CN106649630A | | Data query method and device | |
CN110019267A | | Metadata update method, apparatus, system, electronic device and storage medium | |
CN103646073A | | Condition query optimizing method based on HBase table | |
CN108804447A | | Method and system for responding to data requests using a cache | |
CN107480202B (en) | Data processing method and device for multiple parallel processing frameworks | |
EP2778968B1 (en) | Mobile telecommunication device remote access to cloud-based or virtualized database systems | |
CN113704291A (en) | Data query method and device, storage medium and electronic equipment | |
CN103810272A (en) | Data processing method and system | |
CN109918425A | | Method and system for importing data into a non-relational database | |
CN109783562A | | Service processing method and device | |
CN110895591B | | Method and device for locating a self-pickup point | |
Nabi | Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark | |
Kim et al. | A study on utilization of spatial information in heterogeneous system based on apache nifi | |
US9195682B2 (en) | Integrated analytics on multiple systems | |
CN110147507A | | Method, apparatus and server for obtaining a short-link address | |
CN106445645B (en) | Method and apparatus for executing distributed computing task | |
CN109960212A (en) | Task sending method and device | |
CN113760969A (en) | Data query method and device based on elastic search | |
CN110109919A (en) | The method and apparatus for determining logical message | |
CN113760240A (en) | Method and device for generating data model | |
CN112632170B (en) | SQL-based data processing method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20181211 |